I want to use the Data Model to access data, but I am used to SPSS-style data. Can you give me an overview?
For example, when I look at SPSS Statistics .sav file data using the Data Model and a standard tool, the data is not presented as I would expect: numeric categorical variables are shown as strings and when several categorical variables are part of a multiple response set they are "collapsed" into one variable.
Most tools do not use multiple response variables, and to a certain extent this is true for SPSS Statistics. (This is why SPSS Statistics has multiple response sets instead of multiple response variables.) The Data Model is that it enables existing tools (like SPSS Statistics and Excel) to read
Market research data, which typically includes categorical questions, some of which are multiple response. The Data Model has a “native” variable type of Categorical. It returns raw categorical queries as a string for various reasons, the main being that it's not possible to return a multiple response variable as a single numeric value.
For example, the query "SELECT Newspapers FROM VDATA" (where Newspapers is a multiple response categorical variable) might return "{Tribune, NYT, USAToday}". This is not useful to SPSS Statistics. However,UNICOM Intelligence has extended SQL expressions so that when applied to categorical variables they behave as perhaps those who originally defined SQL might have wanted them to behave if SQL had included the concept of categorical variables. In addition, the functions in the UNICOM Intelligence Function Library, which you can call from SQL queries, enable you to extract important information from categorical variables and other data types.
To understand these features, you can run some example queries, by using
DM Query (which is in the UNICOM Intelligence Developer Documentation Library). These example queries use the Data Collection Data (.
ddf) File version of the Museum sample data, which also comes with the DDL. You could try similar queries (changing variable and category names as appropriate) using a
.sav file or a Quanvert database or any other data format for which you have a DSC. To set up DM Query to run the queries using the Museum sample data, see
How to run the example queries in DM Query using the museum sample.
In DM Query, open the Museum sample data. Here are the details of the museums variable:
|
Name
|
Data type/value
|
Label (analysis context)
|
Variable
|
museums
|
Categorical (multiple response)
|
Museums and galleries visited or plans to visit
|
Category 1
|
National_Museum_of_Science
|
25
|
National Museum of Science
|
Category 2
|
Museum_of_Design
|
26
|
Museum of Design
|
Category 3
|
Institute_of_Textiles_and_Fashion
|
27
|
Institute of Textiles and Fashion
|
Category 4
|
Archeological_Museum
|
28
|
Archeological Museum
|
Category 5
|
National_Art_Gallery
|
29
|
National Art Gallery
|
Category 6
|
Northern_Gallery
|
30
|
Northern Gallery
|
Category 7
|
Other
|
18
|
Other
|
Category 8
|
Not_answered
|
11
|
Not answered
|
In addition, we'll use the serial numeric variable so that we can reference individual cases for discussion purposes.
To execute an SQL query, enter it into the text box in DM Query. Start with the following query:
SELECT serial, museums
FROM VDATA
Although you enter the queries on one line in DM Query, they are presented here on more than one line to make them easier to read.
The query returns two columns. Here are the results for the first five cases:
serial museums
1 {National_Museum_of_Science}
2 {National_Museum_of_Science}
3 {Museum_of_Design,National_Art_Gallery}
4 {National_Museum_of_Science}
5 {National_Museum_of_Science,Archeological_Museum,National_Art_Gallery}
serial=3 shows that this really is a multiple response categorical variable: the respondent answered that they had either visited or planned to visit both the Museum of Design and the National Art Gallery.
However, having this returned as the text string {Museum_of_Design,National_Art_Gallery} might nit be useful. Try the following query:
SELECT serial, museums,
museums>={national_museum_of_science} AS MuseumOfScience
FROM VDATA
The query returns three columns. Here are the results for the first five cases:
serial museums MuseumOfScience
1 {National_Museum_of_Science} True
2 {National_Museum_of_Science} True
3 {Museum_of_Design,National_Art_Gallery} False
4 {National_Museum_of_Science} True
5 {National_Museum_of_Science,Archeological_Museum,National_A... True
MuseumOfScience is set to "True" any time the respondent selected the National Museum of Science category, regardless of whether they also selected any other category.
You can return the same information for all of the categories in one query:
SELECT serial, museums,
museums>={national_museum_of_science} AS MuseumOfScience,
museums>={museum_of_design} AS MuseumOfDesign
FROM VDATA
The query returns four columns. Here are the results for the first five cases:
serial museums MuseumOfScience MuseumOfDesign
1 {National_Museum_of_Science} True False
2 {National_Museum_of_Science} True False
3 {Museum_of_Design,National_Art_Gallery} False True
4 {National_Museum_of_Science} True False
5 {National_Museum_of_Science,Archeological_Museum,National_A... True False
The Data Model can return categorical variables as more than just strings. This query demonstrates how the Data Model uses standard comparison operators (in this case ">=") to act in a special way on variables whose data type is Categorical. For information about the operators (logical, comparison, and so on) and details of how they operate on categorical variables in
Expression evaluation. For example, here is an excerpt from
Comparison operators on how ">=" works with categorical variables:
"The >= operator returns True if the category list on the left side is a proper superset of, or is equal to, the category list on the right side."
However, you might not want the query to return "True" or "False". You might want it to return "1" or "0". Here is another query:
SELECT serial, museums,
clong(museums>={national_museum_of_science}) AS MuseumOfScience
FROM VDATA
The query returns three columns. Here are the results for the first five cases:
serial museums MuseumOfScience
1 {National_Museum_of_Science} 1
2 {National_Museum_of_Science} 1
3 {Museum_of_Design,National_Art_Gallery} 0
4 {National_Museum_of_Science} 1
5 {National_Museum_of_Science,Archeological_Museum,National_A... 1
This query demonstrates the use of the extensible UNICOM Intelligence Function Library that comes with the Data Model. The function library enables you to add your own functions, which can be called directly from within these SQL queries.
This example uses a standard function supplied with the Data Model, the
CLong type conversion function.
CLong converts Boolean True to 1, and Boolean False to 0.
This MuseumOfScience variable is what SPSS Statistics requires if you wanted to expand the multiple response categorical variable museums into several SPSS Statistics dichotomous variables. You can explode all of the categorical variables in a single query.
SELECT serial, museums,
CLong(museums>={national_museum_of_science}) AS MuseumOfScience,
CLong(museums>={museum_of_design}) AS MuseumOfDesign
FROM VDATA
The query returns four columns. Here are the results for the first five cases:
serial museums MuseumOfScience MuseumOfDesign
1 {National_Museum_of_Science} 1 0
2 {National_Museum_of_Science} 1 0
3 {Museum_of_Design,National_Art_Gallery} 0 1
4 {National_Museum_of_Science} 1 0
5 {National_Museum_of_Science,Archeological_Museum,National_A... 1 0
If you want to represent the data as a multiple category set and not a multiple dichotomy set, use the following query:
SELECT serial, museums,
museums.GetAnswer(0) AS Answered1st,
museums.GetAnswer(1) AS Answered2nd,
museums.GetAnswer(2) AS Answered3rd
FROM VDATA
The query returns five columns. Here are the results for the first five cases:
serial museums Answered1st Answered2nd Answered3rd
1 {National_Museum_of_Science} {National_Museum_of_Science} {} {}
2 {National_Museum_of_Science} {National_Museum_of_Science} {} {}
3 {Museum_of_Design,National_Art_Gallery} {Museum_of_Design} {National_Art_Gallery} {}
4 {National_Museum_of_Science} {National_Museum_of_Science} {} {}
5 {National_Museum_of_Science,Archeological_Museum,National_A... {National_Museum_of_Science} {Archeological_Museum} {National_Art_Gallery}
This query uses the
GetAnswer function. You can use the function as a traditional function, such as
GetAnswer(museums, 0), or you can use the more object-oriented dotted notation, such as
museums.GetAnswer(0) (you might also want to look at the
Mid function for other query ideas).
GetAnswer(0) returns the first category in a category list. This really is order-dependent: not the order in which the categories were asked, but the order in which the categories were answered. This information can be important when analyzing, for example, unprompted awareness questions.
There are many more interesting queries. You can do everything from finding out how many answers a given respondent gave (using the
Len function) all the way to performing crosstabulations, including crosstabulations on multiple response categorical variables, all using SQL queries. For further examples, see
Basic SQL queries and
Advanced SQL queries.
To display all of the variable names, instantiate an MDM object, and use code similar to the following (written in Visual Basic):
For Each VariableInstance in MDMDocument.VariableInstances
Debug.Print VariableInstance.FullName
Next
To print each category for a given variable:
For Each Element In Variable.Categories
Debug.Print Element.Name
Next
To find out a variable's data type, look at the VariableInstance.DataType property, and so on.
To find out more about the metadata, use
MDM Explorer to open the same metadata file (
museum.mdd). You can then examine the
museums variable from a metadata perspective. By default,
museum.mdd is in the [INSTALL_FOLDER]\IBM\SPSS\DataCollection\7\DDL\Data\Data Collection File\ folder.
In MDM Explorer, you can open a .sav file, Quanvert database, or any data format for which a read-enabled Metadata Data Source Connector (MDSC) exists. Your application does not have to change because the Data Model represents the metadata information in all of these file storage formats in exactly the same way, using the same object model.
You use the Open mDSC command on the File menu to open an .sav file or Quanvert database in MDM Explorer.
See also