UNICOM Intelligence Data File (DDF) CDSC
Introduction
The UNICOM Intelligence Data File (DDF) CDSC reads and writes case data to a file-based database that is stored in a single file with a .ddf extension.
The Data File CDSC has the following features:
| Feature | UNICOM Intelligence Data File |
| --- | --- |
| Storage type | Single file |
| Read | Yes |
| Write/update | Yes |
| Unbounded loops (levels) | Yes |
| Native WHERE clause support | Yes |
| Native expression support | For top-level variables in HDATA, WHERE clauses can include the variable types and operators that are listed in the following table. |
| Compressed format | Yes |
| Multiple user read | Yes |
| Multiple user write | Yes |
WHERE clause variable types and operators

| Type | = | <> | > | < | >= | <= | Is NULL | Is Not NULL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Long | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Text | Yes | Yes | No | No | No | No | Yes | Yes |
| Double | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Boolean | Yes | Yes | No | No | No | No | Yes | Yes |
| Date | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Categorical | No | No | Yes¹ | No | No | No | Yes | Yes |

¹ For Categorical variables, the > operator is only implemented for comparison with the empty category set ({}), as in Prompted > {}.
Examples:
Respondent.Serial = 1
Name = "abc"
Prompted > {}
BrandsAwareness Is Not NULL
Supported subexpressions can be combined with the logical operators And and Or. For example:
Name = "abc" Or Prompted > {}
Data storage
The DDF CDSC is implemented with a SQLite database. Data is stored hierarchically, which helps to reduce the number of columns in any one table. Although SQLite can support many columns, using numerous columns is not recommended: analysis shows that performance is reduced as the number of columns increases. For example, the following table shows SQLite read times for tables with varying numbers of rows and columns, but the same amount of data:
| Rows | Columns | Seconds |
| --- | --- | --- |
| 200000 | 250 | 6 |
| 100000 | 500 | 16 |
| 50000 | 1000 | 15 |
| 40000 | 1250 | 19 |
| 31250 | 1600 | 21 |
| 25000 | 2000 | 21 |
| 12500 | 4000 | 33 |
| 10000 | 5000 | 38 |
| 6250 | 8000 | 57 |
| 5000 | 10000 | 69 |
The DDF database schema does not use SQL joins when reading data from the SQLite database. To demonstrate how data is read by the DDF DSC, the schema and queries that are used for the Household example are detailed below.

[Figure: DDF file schema for the Household example, showing the per-level tables (L1 through L6) and the Levels table.]
The metadata columns for the top level are created in a table named L1. Each level (a loop or grid that is defined in the metadata) is stored in a separate table. For example, the Person level is stored in the L2 table, and the Vehicle level is stored in L5. The levels that are available in a DDF database are recorded in the Levels table.
Note For more information about the DDF DSC schema, see UNICOM Intelligence Data File schema.
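Because a DDF file is an ordinary SQLite database, the level records can be inspected directly with the sqlite3 command line tool (available from the link at the end of this topic). For example, assuming a hypothetical household.ddf file:

sqlite3 household.ddf "SELECT * FROM Levels;"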
The following queries are run against the SQLite database for the Household data set to process the SELECT * FROM HDATA query in the Data Model:
SQLite database queries

| Table | SQL |
| --- | --- |
| L1 | select [:P0], [LevelId:L], [address:X], [numpersons:L], [numtrips_h:L], [numvehicle:L], [floorarea:D], [household:L], [numrooms:L], [pets:L], [region:S], [tenure:S], [housetype:S], [ageofbuilding:S] from L1 order by [:P0] |
| L2 | select [:P1], [:P0], [LevelId:L], [age:L], [gender:S], [numtrips:L], [newspapers:L], [name:X], [person:L], [weight:D], [occupation:S], [languages:S] from L2 order by [:P1], [:P0] |
| L3 | select [:P2], [:P1], [:P0], [LevelId:L], [country:S], [daysaway:L], [satisfaction:D], [transportmodes:L], [purpose:S], [trip:L] from L3 order by [:P2], [:P1], [:P0] |
| L4 | select [:P2], [:P1], [:P0], [LevelId:S], [Column:S] from L4 order by [:P2], [:P1], [:P0] |
| L5 | select [:P1], [:P0], [LevelId:L], [color:X], [mileage:L], [maintenance:D], [yearsowned:D], [vehicle:L], [vehicletype:S], [vehicleage:S], [daysused:S] from L5 order by [:P1], [:P0] |
| L6 | select [:P2], [:P1], [:P0], [LevelId:S], [Column:S] from L6 order by [:P2], [:P1], [:P0] |
The order by clause in each query corresponds to an index on the selected table.
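To confirm the index for a particular level table, the table definition can be dumped with the sqlite3 .schema command (a sketch; the actual output depends on the metadata):

sqlite3 household.ddf ".schema L2"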
For each record that is read at a parent level, the corresponding child level is automatically advanced to the next parent level ID. In the following tables, ↓ marks the rows that the cursor advances past and → marks the new cursor position. For example, when moving from household 2 to 3, the result set in L2 is advanced as follows:

L1 - Top

| :P0 | address:X | ... |
| --- | --- | --- |
| 1 | 15B Park Avenue, Harrogate, HG1 4TY | ... |
| ↓2 | 46 Freedom Lane, Brighton, BN2 3YT | |
| →3 | The Meadows, Clifton Heights, BS34 3EG | |
| 4 | 23 Steep Hill, Norfold, CB64 5TY | |

L2 - Person

| :P1 | :P0 | age:L | ... |
| --- | --- | --- | --- |
| 1 | 1 | 25 | .. |
| ↓2 | 1 | 45 | .. |
| ↓2 | 2 | 43 | .. |
| ↓2 | 3 | 15 | .. |
| ↓2 | 4 | 12 | .. |
| →3 | 1 | 72 | .. |
| 3 | 2 | 81 | .. |
| 4 | 1 | 32 | .. |
| 4 | 2 | 9 | .. |
| 4 | 3 | 2 | .. |
Similarly, when advancing to the next record at the person level (L2), the corresponding child levels are also advanced. For example:
L2 - Person

| :P1 | :P0 | age:L | ... |
| --- | --- | --- | --- |
| 1 | 1 | 25 | .. |
| 2 | ↓1 | 45 | .. |
| 2 | →2 | 43 | .. |
| 2 | 3 | 15 | .. |
| 2 | 4 | 12 | .. |
| 3 | 1 | 72 | .. |
| 3 | 2 | 81 | .. |
| 4 | 1 | 32 | .. |
| 4 | 2 | 9 | .. |
| 4 | 3 | 2 | .. |

L4 - TvDays

| :P2 | :P1 | :P0 | Column:S |
| --- | --- | --- | --- |
| 1 | 1 | 1 | 54; |
| 1 | 1 | 2 | 55; |
| 1 | 1 | 3 | 54; |
| 1 | 1 | 4 | 57; |
| 1 | 1 | 5 | 54; |
| 2 | 1 | ↓1 | 56; |
| 2 | 1 | ↓2 | 55; |
| 2 | 1 | ↓3 | 54; |
| 2 | 1 | ↓4 | 54; |
| 2 | 1 | ↓5 | 54; |
| 2 | 2 | →1 | 58; |
| 2 | 2 | 2 | 58; |
| 2 | 2 | 3 | 58; |
While there is some cost in keeping the query result sets in step, it is cheaper than having wide tables (as occurs when the metadata contains many grids or loops).
The DDF DSC does not expose a VDATA view. The Table Services DSC (HVDataDsc) is used to process a SELECT * FROM VDATA query. The DDF DSC constructs each VDATA row by selecting against HDATA and navigating each of the child levels for the current top level record in HDATA. Only a single VDATA row is kept in memory at a time.
The DDF file contains enough metadata for the DSC to be used without an MDM document (at least for access through HDATA). The Levels table stores the level structure, and the suffix on each column name denotes the variable type.
Column name suffixes

| MDM variable type | UNICOM Intelligence Data File schema type | Column name suffix |
| --- | --- | --- |
| Long | Integer | :L |
| Double | Real | :D |
| Text | Text | :X |
| Date | Real | :T |
| Boolean | Integer | :B |
| Categorical | Integer or Text | :Cn or :S |
The following table lists the overall storage file size (in MB) for the skidemo data set in each format:

DDF CDSC storage file size (MB)

| skidemo | UNICOM Intelligence database | UNICOM Intelligence file | Quantum | Quanvert | SPSS |
| --- | --- | --- | --- | --- | --- |
| Total size | 19 | 2.18 | 0.9 | 3.25 | 1.7 |
| Backup | 18.1 | 0.65 | * | 2 | * |
The DDF CDSC storage size depends on the custom connection properties Encoding, MaxMultiCol, and PageSize. The connection properties are described in more detail in Issues that are known to impact performance.
The DDF backup file is a compressed file with the *.dzf extension. The compressed file contains the DDF file and a data version file (DataVersionInfo). The compressed file format is intended to make DDF data more portable. It is possible to export directly to the compressed format by specifying the *.dzf extension; compressed files can also be opened directly by the DDF CDSC.
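As an illustrative sketch (the file paths are hypothetical), a DDF or compressed DZF file can be opened through the UNICOM Intelligence OLE DB provider with a connection string along these lines:

Provider=mrOleDB.Provider.2; Data Source=mrDataFileDsc; Location=C:\Data\skidemo.dzf; Initial Catalog=C:\Data\skidemo.mdd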
Read performance
The following table provides the DDF CDSC read performance (measured in seconds):
DDF CDSC read performance (seconds)

| skidemo | UNICOM Intelligence database | UNICOM Intelligence file | Quantum | Quanvert | SPSS |
| --- | --- | --- | --- | --- | --- |
| 1 variable | 0.062 | 0.062 | 0.109 | 0.031 | 0.062 |
| 5 variables | 0.11 | 0.063 | 0.204 | 0.078 | 0.078 |
| All variables | 1.078 | 0.562 | 0.843 | 0.672 | 0.61 |
The DDF CDSC has acceptable single-variable read performance, providing response times that are suitable for ad hoc tabulation scenarios.
Among the profiled DSCs, the DDF CDSC generally exhibits the fastest performance for reading the entire data set.
Read performance for a single row, when a WHERE clause that can be evaluated natively in the DDF CDSC is used, is typically an order of magnitude faster than evaluating the WHERE clause in the data model. The following table shows read performance when querying a single row by serial number:
| skidemo | UNICOM Intelligence database | UNICOM Intelligence file | Quantum | Quanvert | SPSS |
| --- | --- | --- | --- | --- | --- |
| All variables | 0.01 | 0.008 | 0.109 | 0.359 | 0.09 |
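For example, a single-row query by serial number, such as the following sketch (the serial value is illustrative), can be evaluated natively because an index is automatically added for the Respondent.Serial variable:

SELECT * FROM HDATA WHERE Respondent.Serial = 1234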
The performance benefit of a WHERE clause that is evaluated natively by the DDF DSC increases dramatically when the columns that are referenced in the WHERE clause are indexed.
The benefit of evaluating WHERE clauses natively diminishes as more rows are returned. Native WHERE clause support provides the greatest gains in scenarios where a small subset of the data set is returned or updated.
Write performance
The following table provides DDF CDSC write performance (measured in records per second):
| skidemo | UNICOM Intelligence database | UNICOM Intelligence file | Quantum | Quanvert | SPSS |
| --- | --- | --- | --- | --- | --- |
| Records per second | 80 | 1897 | 1523 | * | 1971 |
The following table provides update performance (measured in seconds) for every value in a single weight column:
| skidemo | UNICOM Intelligence database | UNICOM Intelligence file | Quantum | Quanvert | SPSS |
| --- | --- | --- | --- | --- | --- |
| 1 variable | 0.719 | 0.219 | 0.282 | * | 0.578 |
The DDF CDSC has acceptable write performance and exceptional update performance, making it well suited for data management operations such as cleaning, coding, and weighting.
The DDF CDSC is also well suited for single-user UNICOM Intelligence applications, such as UNICOM Intelligence Data Entry or CAPI.
Usage recommendations
The recommended DDF CDSC usage is as follows:
Single-user UNICOM Intelligence applications. Because the DDF DSC has acceptable write and update performance, it is a good choice for disconnected UNICOM Intelligence applications, such as CAPI or Data Entry. Other DSCs should be used for disconnected Data Entry only in scenarios where data needs to be collected directly to another data format (such as an SPSS SAV file).
Data export to other formats. The DDF CDSC has acceptable read performance for exporting to other DSCs.
Data management. Copying data to the DDF CDSC is fast, and the CDSC provides acceptable update performance when weight or coding variables are updated. The CDSC also includes native WHERE clause support for variables at the top level, which negates any advantage of using the RDB CDSC for data management tasks (see the update sketch after this list).
Offline or web-based tabulation/analysis. Because of its acceptable read performance for both individual variables and the entire data set, the DDF CDSC is suited both to ad hoc tabulation and to the population of large table runs. The DDF CDSC provides multiple user read support, and can be used for both offline and web-based analysis.
Levels data. The DDF CDSC supports unbounded loops and can be used as an alternative data format for Quanvert levels projects.
Portable data format. The compressed DDF format is an excellent choice for distributing data between UNICOM Intelligence applications. The major disadvantage of the format is that it is only recognized by UNICOM Intelligence applications.
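As an illustrative sketch of a data management update (the Weight variable and values are hypothetical), a natively evaluated WHERE clause confines the update to a small subset of rows:

UPDATE VDATA SET Weight = 1.25 WHERE Respondent.Serial = 1001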
The DDF CDSC is not suited to the following circumstances:
Multiple user data collection. SQLite, upon which the DDF CDSC is based, uses file locks to control update access to the database. File locking might not always work correctly on all network file systems. The DDF CDSC was not extensively tested for multiple user data collection. You must exercise caution when a database might be updated by multiple users.
Issues that are known to impact performance
The following issues are known to impact DDF CDSC performance:
Natively evaluated WHERE clauses. Using WHERE clauses that can be evaluated natively by the CDSC significantly improves performance. The performance gain from a natively evaluated WHERE clause increases when the referenced column is indexed. An index is automatically added for the Respondent.Serial variable, and indexes can be added to other variables by using the Index custom property (for example, Q2 "Q2" [Index = true] ID;).
Using QvToDF to transfer Quanvert data sets to the DDF CDSC. The QvToDF utility is a simple command-line tool that can be used to transfer Quanvert data sets to the DDF CDSC. Using the utility is the easiest way to transfer Quanvert levels projects to the DDF CDSC. It is recommended that the mergex setting be used when transferring Quanvert data sets to DDF, because the setting provides approximately 20% better tabulation performance. For example:
qvtodf /qlocation=file.pkd;mergex=1 /dlocation=files.ddf
Custom connection properties. The following custom connection properties (MR Init Custom) can impact performance and file size (see the connection string sketch after this list):
TransactionMode
The property controls whether each insert or update operation is committed immediately, or when the connection is closed. A value of 1 specifies that the DDF CDSC use a single transaction for the connection, which is only committed when the connection is closed. A value of 0 specifies that no transactions are explicitly used, in which case SQLite adds a transaction for each statement. The default value is 1, which results in improved performance when the DDF CDSC is used for data management tasks. A value of 0 must be used for UNICOM Intelligence applications where data needs to be immediately persisted to the database.
Encoding
The property controls the text value encoding and can be set to UTF-8 or UTF-16. The default value is UTF-8, but some databases might benefit when the value is set to UTF-16. The file size is roughly the same with either encoding when there are wide characters in the column names and data rows. However, when there are no wide characters, the UTF-8 encoding can be between 65% and 95% of the size of the corresponding UTF-16 encoding (depending on the database). Initial tests indicate that reading a UTF-8 file with no wide characters is slightly faster than reading a file that uses UTF-16 encoding.
PageSize
The property is used to set the SQLite page size, and can be applied only when a DDF file is created. The property can be set to 512, 1024, 2048, 4096, 8192, 16384, or 32768. The DDF CDSC uses a default value of 32768, which provides optimal read times. Different page sizes have no obvious effect on write performance. For large databases, a larger page size tends to generate a slightly smaller data file. However, for small databases (fewer than 2000 records), a larger page size can create a data file up to 10 times larger.
MaxMultiCol
Categorical variables can be stored either as strings or as multiple integer columns in the SQLite database. The MaxMultiCol custom property defines the maximum number of categories that can be stored by using the multiple-column integer representation. When the maximum number of categories for a variable exceeds the MaxMultiCol value, the variable is stored by using the string representation. The default value is 1, which means that single-response categoricals are stored in an integer column, and multiple-response categoricals are stored in a text column. Storing single-coded categoricals as integer columns in the database, rather than as text, generally makes reading slightly faster and decreases the database size.
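As an illustrative sketch (the file paths are hypothetical, and the exact property syntax might vary by version), the custom properties are passed in the MR Init Custom connection property:

Provider=mrOleDB.Provider.2; Data Source=mrDataFileDsc; Location=C:\Data\skidemo.ddf; Initial Catalog=C:\Data\skidemo.mdd; MR Init Custom="TransactionMode=0;PageSize=32768;MaxMultiCol=1"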
Using VACUUM to rebuild the database. Free pages are not automatically removed from a SQLite database after data is deleted. The sqlite3 command line tool can be used to run the VACUUM command to rebuild the DDF database. The VACUUM command eliminates free pages, aligns table data to be contiguous, and otherwise cleans up the database file structure. For example:
sqlite3 skidemo.ddf "VACUUM;"
Note The sqlite3 command line tool is available for download at:
http://www.sqlite.org/download.html
See
Data Source Components (DSCs)