Running a horizontal merge
To specify that you want to run a horizontal merge (see
How a horizontal merge works, do the following:
▪Add a JoinKey parameter to every InputDataSource section in your data management (DMS) script. The JoinKey parameter defines the variable that will be used to join the data source to the other input data sources. Typically, Respondent.Serial is used, but you can specify other variables or specify different variables for different input data sources. However, the variable that is specified must always contain a unique value for each case in the data source.
When the SelectQuery in the input data source is specified as select ... from HData, the horizontal merge is conducted by the HDATA view. LevelID is used as the join key for HDATA Merge. The join key cannot be changed when merging loops, with the same name, into different input data sources.
▪Add a JoinType parameter to the InputDataSource section for the master data source. The JoinType parameter also affects data in the HDATA Merge output loop.
The valid values for JoinType are:
Value of JoinType
|
Description
|
Full
|
All cases on all input data sources are written to the output data. Cases with identical JoinKey values are combined before writing. This is the default behavior if JoinType is not specified.
|
Inner
|
Only cases with identical JoinKey values on all input data sources are written to the output data. The cases will be combined before writing.
|
Left
|
Only cases with JoinKey values that exist in the master data source are written to the output data. Cases with identical JoinKey values are combined before writing.
|
Typically, you use a full join for most of your horizontal merges, because it is the only type of join that ensures that all cases in all input data sources are written to the output data.
Improving the performance of a horizontal merge
By default, a horizontal merge sorts each input data source by the variable defined in its JoinKey parameter before joining the data sources. If your data sources that are already sorted in the correct order, you can improve the performance of the merge by specifying that a data source does not need to be sorted. To do this, add a JoinKeySorted parameter to every InputDataSource section that does not need to be sorted, and set its value to True.
In this example, the inclusion of JoinKeySorted = True implies that the data source is already sorted in SerialNumber order and specifies that it must not be sorted:
JoinKey = "SerialNumber"
JoinKeySorted = True
Example
The following example contains two InputDataSource sections that define a UNICOM Intelligence Data File and a IBM SPSS Statistics SAV file that will be horizontally merged.
The two data sources are joined using the Respondent.Serial system variable on the UNICOM Intelligence Data File and the SerialNo variable on the IBM SPSS Statistics SAV file. A full join is carried out, meaning that all of the cases on both data sources are written to the output data. The JoinKeySorted parameter is included for the second data source to define that the IBM SPSS Statistics SAV file is already sorted in SerialNo order and must not be sorted.
InputDatasource(myMasterDataSource)
ConnectionString = " _
Provider = mrOleDB.Provider.2; _
Data Source = mrDataFileDsc; _
Location = [INSTALL_FOLDER]\IBM\SPSS\DataCollection\7\DDL\Data\Data Collection File\MergeMaster.ddf; _
Initial Catalog = [INSTALL_FOLDER]\IBM\SPSS\DataCollection\7\DDL\Data\Data Collection File\MergeMaster.mdd;"
JoinKey = "Respondent.Serial"
JoinType = "Full"
End InputDatasource
InputDatasource(mySecondInputDataSource)
ConnectionString = " _
Provider = mrOleDB.Provider.2; _
Data Source = mrSavDsc; _
Location = [INSTALL_FOLDER]\IBM\SPSS\DataCollection\7\DDL\Data\Sav\MergeSecond.sav; _
Initial Catalog = [INSTALL_FOLDER]\IBM\SPSS\DataCollection\7\DDL\Data\Sav\MergeSecond.mdd;"
JoinKey = "SerialNo"
JoinKeySorted = True
End InputDatasource
Because a full join will run by default, JoinType = "Full" can be omitted.
See