UNICOM Intelligence Data File (DDF) CDSC
Introduction
The UNICOM Intelligence Data File (DDF) CDSC reads and writes case data to a file-based database that is stored in a single file with a .ddf extension.
The Data File CDSC has the following features:

Feature                        UNICOM Intelligence Data File
Storage type                   Single file
Read                           Yes
Write/update                   Yes
Unbounded loops (levels)       Yes
Native WHERE clause support    Yes
Native expression support      For top level variables in HDATA, WHERE clauses can include the variable types and operators listed in the following table.
Compressed format              Yes
Multiple user read             Yes
Multiple user write            Yes
WHERE clause variable types and operators

Type          =     <>    >      <     >=    <=    Is Null   Is Not Null
Long          Yes   Yes   Yes    Yes   Yes   Yes   Yes       Yes
Text          Yes   Yes   No     No    No    No    Yes       Yes
Double        Yes   Yes   Yes    Yes   Yes   Yes   Yes       Yes
Boolean       Yes   Yes   No     No    No    No    Yes       Yes
Date          Yes   Yes   Yes    Yes   Yes   Yes   Yes       Yes
Categorical   No    No    Yes ¹  No    No    No    Yes       Yes

¹ Categorical > support is implemented only for comparison with the empty set {}.
Examples:
Respondent.Serial = 1
Name = "abc"
Prompted > {}
BrandsAwareness Is Not Null
Supported subexpressions can be combined with the logical operators And and Or. For example:
Name = "abc" Or Prompted > {}
Data storage
The DDF CDSC is implemented with a SQLite database. Data is stored hierarchically, which helps to reduce the number of columns in any one table. Although SQLite can support many columns, the use of numerous columns is not recommended: analysis shows that performance degrades as the number of columns increases. For example, the following table shows SQLite read times for tables with varying numbers of rows and columns, but the same amount of data:

Rows     Columns   Seconds
200000   250       6
100000   500       16
50000    1000      15
40000    1250      19
31250    1600      21
25000    2000      21
12500    4000      33
10000    5000      38
6250     8000      57
5000     10000     69
The DDF database schema does not use SQL joins when reading data from the SQLite database. To demonstrate how data is read by the DDF DSC, the schema and queries that are used for the Household example are described below.
The metadata columns for the top level are created in a table named L1. Each level (a loop or grid that is defined in the metadata) is stored in a separate table. For example, the Person level is stored in the L2 table, and the Vehicle level is stored in L5. The levels that are available in a DDF database are listed as records in the Levels table.
Note For more information about the DDF DSC schema, see UNICOM Intelligence Data File schema.
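Because the storage format is a standard SQLite database, the tables can also be inspected directly with the sqlite3 command line tool (introduced at the end of this section). A minimal sketch, assuming the Household example is stored in a file named Household.ddf:

sqlite3 Household.ddf ".tables"
sqlite3 Household.ddf "SELECT * FROM Levels;"

The first command lists the level tables (L1, L2, and so on); the second shows the records in the Levels table that describe the available levels.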
The following queries are run against the SQLite database for the Household data set to process the SELECT * FROM HDATA query in the Data Model:
SQLite database queries

Table   SQL
L1      select [:P0], [LevelId:L], [address:X], [numpersons:L], [numtrips_h:L], [numvehicle:L],
        [floorarea:D], [household:L], [numrooms:L], [pets:L], [region:S], [tenure:S], [housetype:S],
        [ageofbuilding:S] from L1 order by [:P0]
L2      select [:P1], [:P0], [LevelId:L], [age:L], [gender:S], [numtrips:L], [newspapers:L],
        [name:X], [person:L], [weight:D], [occupation:S], [languages:S] from L2 order by [:P1], [:P0]
L3      select [:P2], [:P1], [:P0], [LevelId:L], [country:S], [daysaway:L], [satisfaction:D],
        [transportmodes:L], [purpose:S], [trip:L] from L3 order by [:P2], [:P1], [:P0]
L4      select [:P2], [:P1], [:P0], [LevelId:S], [Column:S] from L4 order by [:P2], [:P1], [:P0]
L5      select [:P1], [:P0], [LevelId:L], [color:X], [mileage:L], [maintenance:D], [yearsowned:D],
        [vehicle:L], [vehicletype:S], [vehicleage:S], [daysused:S] from L5 order by [:P1], [:P0]
L6      select [:P2], [:P1], [:P0], [LevelId:S], [Column:S] from L6 order by [:P2], [:P1], [:P0]
Note The order by clause for each query corresponds to an index on the selected table.
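These indexes can be confirmed with the sqlite3 command line tool (a sketch; it assumes the Household example is stored in a file named Household.ddf and queries the standard sqlite_master catalog):

sqlite3 Household.ddf "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index';"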
For each record that is read at a parent level, the corresponding child level is automatically advanced to the next parent level ID. For example, when moving from household 2 to 3, the result set in L2 is advanced as follows:
L1 - Top

  :P0   address:X                                 ...
  1     15B Park Avenue, Harrogate, HG1 4TY       ...
↓ 2     46 Freedom Lane, Brighton, BN2 3YT
→ 3     The Meadows, Clifton Heights, BS34 3EG
  4     23 Steep Hill, Norfold, CB64 5TY

L2 - Person

  :P1   :P0   age:L   ..
  1     1     25      ..
↓ 2     1     45      ..
↓ 2     2     43      ..
↓ 2     3     15      ..
↓ 2     4     12      ..
→ 3     1     72      ..
  3     2     81      ..
  4     1     32      ..
  4     2     9       ..
  4     3     2       ..
Similarly, when advancing to the next record at the person level (L2), the corresponding child levels are also advanced. For example:
L2 - Person

  :P1   :P0   age:L   ..
  1     1     25      ..
  2   ↓ 1     45      ..
  2   → 2     43      ..
  2     3     15      ..
  2     4     12      ..
  3     1     72      ..
  3     2     81      ..
  4     1     32      ..
  4     2     9       ..
  4     3     2       ..
L4 - TvDays

  :P2   :P1   :P0   Column:S
  1     1     1     54;
  1     1     2     55;
  1     1     3     54;
  1     1     4     57;
  1     1     5     54;
  2     1   ↓ 1     56;
  2     1   ↓ 2     55;
  2     1   ↓ 3     54;
  2     1   ↓ 4     54;
  2     1   ↓ 5     54;
  2     2   → 1     58;
  2     2     2     58;
  2     2     3     58;
While there is some cost in keeping the query result sets in step, it is cheaper than having wide tables (as occurs when the metadata contains many grids or loops).
The DDF DSC does not expose a VDATA view. The Table Services DSC (HVDataDsc) is used to process a SELECT * FROM VDATA query. The DDF DSC constructs each VDATA row by selecting against HDATA and navigating each of the child levels for the current top level record in HDATA. Only a single VDATA row is kept in memory at a time.
Regarding metadata, the DDF file contains enough information for the DSC to be used without an MDM document (at least for access through HDATA). The Levels table stores the levels structure, and the suffix on each column name denotes the variable type.
Column name suffixes

MDM variable type   UNICOM Intelligence Data File schema type   Column name suffix
Long                Integer                                     :L
Double              Real                                        :D
Text                Text                                        :X
Date                Real                                        :T
Boolean             Integer                                     :B
Categorical         Integer or Text                             :Cn or :S
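The declared SQLite type of each column can be checked against this mapping with the standard table_info pragma (a sketch; assumes the skidemo data set):

sqlite3 skidemo.ddf "PRAGMA table_info(L1);"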
The following table lists the overall storage file size (in MB) for the DDF CDSC:

DDF CDSC storage file size (MB)

             UNICOM Intelligence   UNICOM Intelligence
skidemo      database              file                  Quantum   Quanvert   SPSS
Total size   19                    2.18                  0.9       3.25       1.7
Backup       18.1                  0.65                  *         2          *
The DDF CDSC storage size depends on the custom connection properties Encoding, MaxMultiCol, and PageSize. The connection properties are described in more detail in Issues that are known to impact performance.
The DDF backup file is a compressed file with the *.dzf extension. The compressed file contains the DDF file and a data version file (DataVersionInfo). The compressed file format is intended to make DDF data more portable. It is possible to export directly to the compressed format by specifying the *.dzf extension; compressed files can also be opened directly by the DDF CDSC.
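For example, a transfer can write directly to the compressed format by using a connection string such as the following (a hedged sketch: mrDataFileDsc is the Data File CDSC name, and the file paths are hypothetical):

Provider=mrOleDB.Provider.2; Data Source=mrDataFileDsc;
Location=C:\Data\skidemo.dzf; Initial Catalog=C:\Data\skidemo.mdd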
Read performance
The following table provides the DDF CDSC read performance (measured in seconds):

DDF CDSC read performance (seconds)

                UNICOM Intelligence   UNICOM Intelligence
skidemo         database              file                  Quantum   Quanvert   SPSS
1 variable      0.062                 0.062                 0.109     0.031      0.062
5 variables     0.11                  0.063                 0.204     0.078      0.078
All variables   1.078                 0.562                 0.843     0.672      0.61
The DDF CDSC has good single variable read performance, providing acceptable response times for ad hoc tabulation scenarios. Among the profiled DSCs, the DDF CDSC generally exhibits the fastest read performance for the entire data set.
Read performance for a single row, when a WHERE clause that can be evaluated natively in the DDF CDSC is used, is typically an order of magnitude faster than evaluating the WHERE clause in the data model. The following table shows read performance when querying a single row by serial number:
DDF CDSC single row read performance (seconds)

                UNICOM Intelligence   UNICOM Intelligence
skidemo         database              file                  Quantum   Quanvert   SPSS
All variables   0.01                  0.008                 0.109     0.359      0.09
The performance of a natively evaluated WHERE clause improves dramatically when the columns that are referenced in the WHERE clause are indexed.
The benefit of evaluating WHERE clauses natively diminishes as more rows are returned. Native WHERE clause support provides the best performance in scenarios where a small subset of the data set is returned or updated.
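For example, the single row query that was profiled above can be written as follows; because Respondent.Serial is indexed by default, the WHERE clause is both natively evaluated and index-assisted:

SELECT * FROM HDATA WHERE Respondent.Serial = 1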
Write performance
The following table provides DDF CDSC write performance (measured in records per second):

DDF CDSC write performance (records per second)

                     UNICOM Intelligence   UNICOM Intelligence
skidemo              database              file                  Quantum   Quanvert   SPSS
Records per second   80                    1897                  1523      *          1971
The following table provides update performance (measured in seconds) for updating every value in a single weight column:

DDF CDSC update performance (seconds)

             UNICOM Intelligence   UNICOM Intelligence
skidemo      database              file                  Quantum   Quanvert   SPSS
1 variable   0.719                 0.219                 0.282     *          0.578
The DDF CDSC has acceptable write performance and exceptional update performance, making it well suited for data management operations such as cleaning, coding, and weighting.
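A cleaning-style update of this kind can be expressed as a single query (a sketch; it assumes a top level numeric variable named Weight, and the Is Null comparison is among those the CDSC can evaluate natively):

UPDATE HDATA SET Weight = 1.0 WHERE Weight Is Null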
The DDF CDSC is also well-suited for single user UNICOM Intelligence applications, such as UNICOM Intelligence Data Entry or CAPI.
Usage recommendations
The recommended DDF CDSC usage is as follows:
Single-user UNICOM Intelligence. Because the DDF DSC has acceptable write and update performance, it is a good choice for disconnected UNICOM Intelligence applications, such as CAPI or Data Entry. Other DSCs should only be used for disconnected Data Entry in scenarios where data needs to be collected directly to another data format (such as an SPSS SAV file).
Data export to other formats. The DDF CDSC has acceptable read performance for exporting to other DSCs.
Data management. Copying data to the DDF CDSC is fast, and the CDSC provides acceptable update performance when weight or coding variables are updated. The CDSC also includes native WHERE clause support for variables at the top level, which negates any advantages of using the RDB CDSC for data management tasks.
Offline or web-based tabulation/analysis. Because of its acceptable read performance for both individual variables and the entire data set, the DDF CDSC is suited both to ad hoc tabulation and to the population of large table runs. The DDF CDSC provides multiple user read support and can be used for both offline and web-based analysis.
Levels data. The DDF CDSC supports unbounded loops and can be used as an alternative data format for Quanvert levels projects.
Portable data format. The compressed DDF format is an excellent choice for distributing data between UNICOM Intelligence applications. The major disadvantage of the format is that it is only recognized by UNICOM Intelligence applications.
The DDF CDSC is not suited in the following circumstances:
Multiple user data collection. SQLite, upon which the DDF CDSC is based, uses file locks to control update access to the database. File locking might not always work correctly on all network file systems. The DDF CDSC was not extensively tested for multiple user data collection. You must exercise caution when a database might be updated by multiple users.
Issues that are known to impact performance
The following issues are known to impact DDF CDSC performance:
Natively evaluated WHERE clauses. Using WHERE clauses that can be evaluated natively by the CDSC significantly improves performance. Performance improves further when the referenced column is indexed. An index is automatically added for the Respondent.Serial variable, and indexes can be added to other variables by using the Index custom property (for example, Q2 "Q2" [Index = true] ID;).
Using QvToDF to transfer Quanvert data sets to the DDF CDSC. The QvToDF utility is a simple command-line tool that transfers Quanvert data sets to the DDF CDSC, and it is the easiest way to transfer Quanvert levels projects. It is recommended that the mergex setting be used when transferring Quanvert data sets to DDF; the setting provides approximately 20% better tabulation performance. For example:
qvtodf /qlocation=file.pkd;mergex=1 /dlocation=files.ddf
Custom connection properties. The following custom connection properties (MR Init Custom) can impact performance and file size; a combined connection string sketch follows the property descriptions:
TransactionMode
The property controls whether each insert or update operation is committed immediately, or when the connection is closed. A value of 1 specifies that the DDF CDSC use a single transaction for the connection, which is only committed when the connection is closed. A value of 0 specifies that no transactions are explicitly used, in which case SQLite adds a transaction for each statement. The default value is 1, which results in improved performance when the DDF CDSC is used for data management tasks. A value of 0 must be used for UNICOM Intelligence applications where data needs to be immediately persisted to the database.
Encoding
The property controls the text value encoding and can be set to UTF-8 or UTF-16. The default value is UTF-8, but some databases might benefit from setting the value to UTF-16. The file size is roughly the same with either encoding when the column names and data rows contain wide characters. However, when there are no wide characters, the UTF-8 encoding can be between 65% and 95% of the size of the corresponding UTF-16 encoding (depending on the database). Initial tests indicate that reading a UTF-8 file with no wide characters is slightly faster than reading a file that uses UTF-16 encoding.
PageSize
The property is used to set the SQLite page size and can be used only when a DDF file is created. The property can be set to 512, 1024, 2048, 4096, 8192, 16384, or 32768. The DDF CDSC uses the default value of 32768 because that value provides optimal read times. Different page sizes have no obvious effect on write performance. For large databases, a larger page size tends to generate a slightly smaller data file. However, for small databases (<2000 records), a larger page size can create a data file up to 10 times larger.
MaxMultiCol
Categorical variables can be stored either as strings or as multiple integer columns in the SQLite database. The MaxMultiCol custom property defines the maximum number of categories that can be stored using the multiple-column integer representation. When the maximum number of categories for a variable exceeds the MaxMultiCol value, the variable is stored using the string representation. The default value is 1, which means that single-response categoricals are stored in an integer column and multiple-response categoricals are stored in a text column. Storing single-coded categoricals as integer columns rather than text generally makes reading slightly faster and decreases the database size.
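The custom properties are passed in the MR Init Custom part of the connection string. The following is a hedged sketch that combines the four properties (the file paths and property values are illustrative only, and the exact quoting of the property list is an assumption):

Provider=mrOleDB.Provider.2; Data Source=mrDataFileDsc;
Location=C:\Data\survey.ddf; Initial Catalog=C:\Data\survey.mdd;
MR Init Custom="TransactionMode=0;Encoding=UTF-16;PageSize=4096;MaxMultiCol=2"

The page size of an existing DDF file can be confirmed with the sqlite3 command line tool; PRAGMA page_size reports the page size in bytes:

sqlite3 survey.ddf "PRAGMA page_size;"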
Using VACUUM to rebuild the database. Free pages are not automatically removed after data is deleted from a SQLite database. The sqlite3 command line tool can be used to run the VACUUM command to rebuild the DDF database. The VACUUM command eliminates free pages, aligns table data to be contiguous, and otherwise cleans up the database file structure. For example:
sqlite3 skidemo.ddf "VACUUM;"
Note The sqlite3 command line tool is available for download at:
http://www.sqlite.org/download.html
See
Data Source Components (DSCs)