Advanced SQL queries
The SELECT statement has a number of extensions to support crosstabulations of categorical variables. These extensions are provided because when you use standard SQL on categorical variables:
▪Categories are excluded from the results if there is no case data for them.
▪On multiple response variables, a separate group is returned for each combination of categories selected by individual respondents.
Syntax
SELECT
GROUPBY.COL[index] [[ AS ] column_alias ]
[, BASE(column_name) [ [ AS ] column_alias ]
[, expression [ [ AS ] column_alias ]... ] ]
FROM
table
[GROUP BY
expression ON expression]
[ HAVING
expression ]
[WITH
([BASESUMMARYROW|EvaluateEmptyIterations])]
[ ORDER BY
{ expression } [ ASC | DESC ]
[,{ expression } [ ASC | DESC ]... ] ]
SQL extensions
GROUPBY.COL[index]
Query the content of the GROUP BY key. The index must match the position in the GROUP BY list, starting from 0.
BASE
Create a base column to show the total number of cases in the row. Generally, the base includes every case for which the value is not NULL. However, if the IsInBase attribute has been set to True, the case is included in the base regardless of the value. Conversely, if the IsInBase attribute has been set to False, the case is excluded from the base regardless of the value.
For more information about the IsInBase attribute, see the Evaluate Object Reference in the Reference section of the UNICOM Intelligence Developer Documentation Library.
ON
Specify a categorical expression on which to base the groups. The SQL Aggregator creates a separate group for each category in the categorical expression, provided the category is contained in the GROUP BY categorical expression.
WITH
When used with BaseSummaryRow, specifies the BaseSummaryRow option, which creates a base row. In nested crosstabulations, it creates a base row for each nested group. When used with EvaluateEmptyIterations, specifies that the categorical grid, which is being upleveld, should return all iterations including empty iterations. If the table selected is not an upleveled categorical gird, this option has no effect.
Examples
These examples use the Museum example Quanvert database, which is based on a survey of visitors to a museum. You can run these queries using the DM Query sample tool, which is supplied with the UNICOM Intelligence Developer Documentation Library. The examples were run using the connection option to return category names rather than category values. You select this option on the Advanced tab in the Data Link Properties dialog. For more information, see
Running the example queries in DM Query.
1. Simple crosstabulation of two single response variables
This is a standard SQL query that returns a simple crosstabulation of two single response variables. There WHERE clause filters out respondents for whom the variable is NULL because they were not asked the question.
SELECT desc_leave AS Description,
SUM(gender = {male}) AS Male,
SUM(gender = {female}) AS Female
FROM vdata
WHERE desc_leave IS NOT NULL
GROUP BY desc_leave
Here is the result set:
Description Male Female
------------------------------------ ------ --------
{not answered} 154 120
{place_to_learn} 12 8
{mention_of_things_on_view} 3 0
{something_for_everyone} 0 1
The query has omitted all of the categories for which there is no case data, as you can see if you run the equivalent query using the UNICOM Intelligence Data Model SQL extensions:
SELECT groupby.col[0] AS Description,
SUM(gender = {male}) AS Male,
SUM(gender = {female}) AS Female
FROM vdata
WHERE desc_leave IS NOT NULL
GROUP BY desc_leave ON desc_leave.DefinedCategories()
Look at the results: this query returns all of the categories regardless of whether there is any case data for them or not:
Description Male Female
--------------------------- ------ --------
{variation_of_museum_name} 0 0
{something_for_everyone} 0 1
{mention_of_things_on_view} 3 0
{comment_on_layoutdesign} 0 0
{relate_specimens_in_theoretical_way} 0 0
{comments_about_building} 0 0
{requires_lot_of_time} 0 0
{place_to_bring_childrenfamily} 0 0
{general_appreciative_comment} 0 0
{place_to_learn} 12 8
{other} 0 0
{not answered} 154 120
It is usually preferable to create crosstabulations using the UNICOM Intelligence Data Model SQL extensions rather than standard SQL because:
▪The SQL extensions return a row for each category, even if there is no case data for the category, whereas standard SQL omits categories for which there is no case data.
▪The SQL Aggregator does not need to sort the output rows when you define the row order using GROUP ON and this makes the query faster.
2. Simple crosstabulation filtered using GROUP ON
This example uses the GROUP ON clause to filter the groups and change the order of the rows returned by the query.
SELECT groupby.col[0] AS Description,
SUM(gender = {male}) AS Male,
SUM(gender = {female}) AS Female
FROM vdata
WHERE desc_leave IS NOT NULL
GROUP BY desc_leave ON {place_to_learn, mention_of_things_on_view, something_for_everyone
Here is the result set:
Description Male Female
--------------------------- ------ --------
{place_to_learn} 12 8
{mention_of_things_on_view} 3 0
{something_for_everyone} 0 1
3. Simple tabulation using BASE keyword
This example uses the BASE keyword to create a base column. When you use Base as a column alias, you must enclose it in single quotation marks because it is a reserved word.
SELECT groupby.col[0] AS Age,
BASE(gender) AS 'Base',
SUM(gender = {male}) AS Male,
SUM(gender = {female}) AS Female
FROM vdata
GROUP BY age ON age.DefinedCategories()
Here is the result set:
Age Base Male Female
--------------- ------ ------ -------
{e1116_years} 38 23 15
{e1720_years} 82 50 32
{e2124_years} 95 51 44
{e2534_years} 192 108 84
{e3544_years} 91 49 42
{e4554_years} 55 32 23
{e5564_years} 33 16 17
{e65_years} 16 10 6
4. Crosstabulation with a multiple response variable on the side axis
SELECT museums AS Museums,
SUM(gender = {male}) AS Male,
SUM(gender = {female}) AS Female
FROM vdata
WHERE museums IS NOT NULL
GROUP BY MUSEUMS
Standard SQL does not generally produce sensible crosstabulation results when the rows are formed from the categories of a multiple response variable, because a separate row is produced for each combination of responses. Here are the first five rows in the result set:
Museums Male Female
-------------------------------------------------------------------- ------- ----------
{national_museum_of_science} 146 82
{museum_of_design,national_art_gallery} 0 1
{national_museum_of_science,archeological_museum,national_art_ga...} 0 1
{archeological_museum} 7 1
{museum_of_design,archeological_museum,national_art_gallery,nort...} 1 1
Generally you want to see one row for each category. You can achieve this using the UNICOM Intelligence Data Model SQL extensions. For example:
SELECT groupby.col[0] AS Museums,
SUM(gender = {male}) as Male,
SUM(gender = {female}) AS Female
FROM vdata
WHERE museums IS NOT NULL
GROUP BY museums ON museums.DefinedCategories()
Here is the result set:
Museums Male Female
------------------------------------- ------- ----------
{national_museum_of_science} 209 124
{museum_of_design} 49 43
{institute_of_textiles_and_fashion} 31 16
{archeological_museum} 18 8
{national_art_gallery} 7 12
{northern_gallery} 11 10
{other} 16 21
{not_answered} 0 0
5. Crosstabulation using BaseSummaryRow
This example uses the BaseSummaryRow keyword to create a base row that is based on the total number of respondents and not on the total number of responses.
SELECT groupby.col[0] AS Museums,
SUM(gender = {male}) AS Male,
SUM(gender = {female}) AS Female
FROM vdata
WHERE museums IS NOT NULL
GROUP BY museums ON museums.DefinedCategories()
WITH (BaseSummaryRow)
Here is the result set:
Museums Male Female
----------------------------------- ------ ----------
255 171
{national_museum_of_science} 209 124
{museum_of_design} 49 43
{institute_of_textiles_and_fashion} 31 16
{archeological_museum} 18 8
{national_art_gallery} 7 12
{northern_gallery} 11 10
{other} 16 21
{not_answered} 0 0
The first row, with the blank text contains the base totals.
The BaseSummaryRow keyword can only be used with queries containing the GROUP ON clause.
6. Crosstabulation with a multiple response variable on the top axis
This example shows how to create a crosstabulation using the categories of a multiple response variable as the columns.
SELECT groupby.col[0] AS Gender,
BASE(museums) AS 'Base',
SUM(museums >= {national_museum_of_science}) AS Science,
SUM(museums >= {museum_of_design}) AS Design,
SUM(museums >= {institute_of_textiles_and_fashion}) AS Textiles,
SUM(museums >= {archeological_museum}) AS Archeology,
SUM(museums >= {national_art_gallery}) AS National,
SUM(museums >= {northern_gallery}) AS Northern,
SUM(museums >= {other}) AS Other,
SUM(museums >= {not_answered}) AS NA
FROM vdata
GROUP BY gender ON gender.definedcategories()
Here is the result set:
Gender Base Science Design Textiles Archeology National Northern Other NA
------ ---- ------- ------ -------- ---------- -------- -------- ----- --
{male} 255 209 49 31 18 7 11 16 0
{female} 171 124 43 16 8 12 10 21 0
7. Nested crosstabulations
This is an example of using the UNICOM Intelligence Data Model SQL extensions to create a nested crosstabulation.
SELECT groupby.col[0] AS Museums,
groupby.col[1] AS Gender,
SUM(age = {e1116_years}) AS '11-16',
SUM(age = {e1720_years}) AS '17-20',
SUM(age = {e2124_years}) AS '21-24',
SUM(age = {e2534_years}) AS '25-34',
SUM(age = {e3544_years}) AS '35-44',
SUM(age = {e4554_years}) AS '45-55',
SUM(age = {e5564_years}) AS '55-64',
SUM(age = {e65_years}) AS '65+'
FROM vdata
WHERE museums IS NOT NULL
GROUP BY museums ON museums.DefinedCategories(),
gender ON gender.DefinedCategories()
Here is the result set:
Museums Gender 11-16 16-20 21-24 25-34 35-44 45-54 55-64 65+
---------------------------------- -------- ----- ----- ----- ----- ----- ----- ----- ---
{national_museum_of_science} {male} 17 35 31 67 27 20 9 3
{national_museum_of_science} {female} 10 12 19 40 24 10 7 2
{museum_of_design} {male} 1 10 6 11 8 8 4 1
{museum_of_design} {female} 5 4 8 10 7 4 4 1
{institute_of_textiles_and_fashion} {male} 4 6 4 8 4 2 2 1
{institute_of_textiles_and_fashion} {female} 0 1 6 4 1 1 3 0
{archeological_museum} {male} 3 5 3 2 2 2 1 0
{archeological_museum} {female} 0 0 1 5 2 0 0 0
{national_art_gallery} {male} 1 2 2 1 1 0 0 0
{national_art_gallery} {female} 0 3 3 3 1 2 0 0
{northern_gallery} {male} 1 1 6 1 1 0 1 0
{northern_gallery} {female} 1 1 1 2 1 3 1 0
{other} {male} 1 4 4 3 4 0 0 0
{other} {female} 0 5 4 6 2 1 2 1
{not_answered} {male} 0 0 0 0 0 0 0 0
{not_answered} {female} 0 0 0 0 0 0 0
8. Nested crosstabulations with base totals
This is an example of a nested crosstabulation containing base totals created using the BaseSummaryRow keyword. In a nested crosstabulation the base totals are produced for each subgroup.
SELECT groupby.col[0] AS Museums,
groupby.col[1] AS Gender,
SUM(age = {e1116_years}) AS '11-16',
SUM(age = {e1720_years}) AS '17-20',
SUM(age = {e2124_years}) AS '21-24',
SUM(age = {e2534_years}) AS '25-34',
SUM(age = {e3544_years}) AS '35-44',
SUM(age = {e4554_years}) AS '45-55',
SUM(age = {e5564_years}) AS '55-64',
SUM(age = {e65_years}) AS '65+'
FROM vdata
WHERE museums IS NOT NULL
GROUP BY museums ON museums.DefinedCategories(),
gender ON gender.DefinedCategories()
WITH (BaseSummaryRow)
Here is the result set:
Museums Gender 11-16 16-20 21-24 25-34 35-44 45-54 55-64 65+
---------------------------------- -------- ----- ----- ----- ----- ----- ----- ----- ---
32 60 72 134 59 38 23 8
{male} 21 43 40 79 33 24 11 4
{female} 11 17 32 55 26 14 12 4
{national_museum_of_science} 27 47 50 107 51 30 16 5
{national_museum_of_science} {female} 10 12 19 40 24 10 7 2
{national_museum_of_science} {male} 17 35 31 67 27 20 9 3
{museum_of_design} 6 14 14 21 15 12 8 2
{museum_of_design} {male} 1 10 6 11 8 8 4 1
{museum_of_design} {female} 5 4 8 10 7 4 4 1
{institute_of_textiles_and_fashion} 4 7 10 12 5 3 5 1
{institute_of_textiles_and_fashion} {male} 4 6 4 8 4 2 2 1
{institute_of_textiles_and_fashion} {female} 0 1 6 4 1 1 3 0
{archeological_museum} 3 5 4 7 4 2 1 0
{archeological_museum} {male} 3 5 3 2 2 2 1 0
{archeological_museum} {female} 0 0 1 5 2 0 0 0
{national_art_gallery} 1 5 5 4 2 2 0 0
{national_art_gallery} {male} 1 2 2 1 1 0 0 0
{national_art_gallery} {female} 0 3 3 3 1 2 0 0
{northern_gallery} 2 2 7 3 2 3 2 0
{northern_gallery} {male} 1 1 6 1 1 0 1 0
{northern_gallery} {female} 1 1 1 2 1 3 1 0
{other} 1 9 8 9 6 1 2 1
{other} {male} 1 4 4 3 4 0 0 0
{other} {female} 0 5 4 6 2 1 2 1
{not_answered} 0 0 0 0 0 0 0 0
{not_answered} {male} 0 0 0 0 0 0 0 0
{not_answered} {female} 0 0 0 0 0 0 0 0
9. Crosstabulation showing row percentages
This is an example of a crosstabulation that shows the number of responses as a percentage of the total for the row.
SELECT groupby.col[0] AS Age,
SUM(gender = {male}) / COUNT(gender) * 100 As Male,
SUM(gender = {female}) / COUNT(gender) * 100 As Female
FROM vdata
GROUP BY age ON age.definedcategories()
WITH (BaseSummaryRow)
Here is the result set:
Age Male Female
--------------- ------------------ ------------------
56.312292358803987 43.687707641196013
{e1116_years} 60.526315789473685 39.473684210526315
{e1720_years} 60.975609756097562 39.024390243902438
{e2124_years} 53.684210526315788 46.315789473684212
{e2534_years} 56.25 43.75
{e3544_years} 53.846153846153847 46.153846153846153
{e4554_years} 58.18181818181818 41.818181818181813
{e5564_years} 48.484848484848484 51.515151515151516
{e65_years} 62.5 37.5
The UNICOM Intelligence Data Model does not support column percentages directly. However, you can create column percentages by post-processing the recordset. Similarly, if you want to define the number of decimal places, you need to post-process the results. If you have the InterviewReporter Accelerator Kit, you can see an example of both of these types of post-processing in the InterviewReporter source code.
10. Summary statistics of numeric variable
This example uses the standard SQL aggregate functions of AVG, MIN, MAX, and SUM to calculate summary statistics of the visits numeric variable. The WHERE clause is used to filter out respondents for whom the variable holds a negative value, which in this Quanvert database indicates that they were not asked the question on which the visits variable is based.
SELECT groupby.col[0] AS Gender,
AVG(visits) AS Average,
MIN(visits) AS Minimum,
MAX(visits) AS Maximum,
SUM(visits) AS 'Sum'
FROM vdata
WHERE visits > 0
GROUP BY gender ON gender.DefinedCategories()
WITH (BaseSummaryRow)
Here is the result set:
Gender Average Minimum Maximum Sum
------ ------------------ --------- --------- ------
6.9868852459016395 1 99 2131
{male} 7.3954802259887007 1 50 1309
{female} 6.421875 1 99 822
11. Use HAVING clause to restrict row groups
This example uses standard SQL syntax and the HAVING clause to select only row groups for which the average number of visits is greater than 5.
SELECT age AS Age,
AVG(visits) AS 'Average number of visits'
FROM vdata
WHERE visits > 0
GROUP BY age
HAVING AVG(visits) > 5
ORDER BY age
Here is the result set:
Age Average number of visits
--------- ------------------------
{e1116_years} 5.9666666666666668
{e2124_years} 5.6511627906976747
{e2534_years} 5.3043478260869561
{e3544-years} 8.3409090909090917
{e4554_years} 6.9375
{e5564_years} 20.9375
{e65_years} 18.125
The 17-20 years age group is missing from the result set. This is because the average number of visits for respondents in this age group is fewer than 5.
12. Grid table
In market research surveys, respondents are frequently asked to choose a rating on a predefined scale for a number of products in a list. This type of data is usually best presented in a grid table that shows the rating categories on one axis and the product list on the other. This example shows how you can create a grid table using a single query.
In the Museum survey, the rating question asked respondents to give a rating to each of the galleries. The responses for each gallery are stored in a separate variable. These variables have names in the form rating[{gallery}].Column, where gallery is the name of the gallery.
SELECT groupby.col[0] As Rating,
SUM(rating[{Dinosaurs}].Column >= groupby.col[0] AND rating[{Dinosaurs}].Column IS NOT NULL) As Dinosaurs,
SUM(rating[{Conservation}].Column >= groupby.col[0] AND rating[{Conservation}].Column IS NOT NULL) As Conservation,
SUM(rating[{Insects}].Column >= groupby.col[0] AND rating[{Insects}].Column IS NOT NULL) As Insects,
SUM(rating[{Whales}].Column >= groupby.col[0] AND rating[{Conservation}].Column IS NOT NULL) As Whales,
SUM(rating[{Botany}].Column >= groupby.col[0] AND rating[{Botany}].Column IS NOT NULL) As Botany
FROM vdata
GROUP BY rating[{Dinosaurs}].Column.DefinedCategories()
ON rating[{Dinosaurs}].Column.DefinedCategories() with (BaseSummaryRow)
This example shows an SQL query that creates a grid table from some of the rating variables. Here is the result set:
Rating Dinosaurs Conservation Insects Whales Botany
------ --------- ------------ ------- ------ ------
298 298 298 298 298
{not_at_all_interested_1} 10 3 13 2 11
{not_particularly_interested_2} 1 2 2 0 3
{no_opinion_3} 27 7 18 14 13
{slightly_interested_4} 113 30 27 28 14
{very_interested_5} 107 14 19 38 14
13. Weighted crosstabulation
This example is similar to the crosstabulation shown in
5. Crosstabulation using BaseSummaryRow, except that it is weighted using the
genbalance weighting variable. The
genbalance variable is a numeric (Double) variable that has already been set up to weight the data to reflect an equal balance between the genders.
SELECT groupby.col[0] AS Museums,
SUM((gender = {male}) * genbalance) AS Male,
SUM((gender = {female}) * genbalance) AS Female
FROM vdata
WHERE museums IS NOT NULL
GROUP BY museums ON museums.DefinedCategories()
WITH (BaseSummaryRow)
Here is the result set:
Museums Male Female
----------------------------------- ------------------ ------------------
226.41594260931012 195.70721983909615
{national_museum_of_science} 185.57228237390515 141.91634654998788
{museum_of_design} 43.507377207279198 49.212926626205522
{institute_of_textiles_and_fashion} 27.525075376033779 18.311786651611364
{archeological_museum} 15.982301831245421 9.15589332580568
{national_art_gallery} 6.2153396010398847 13.733839988708519
{northern_gallery} 9.7669622302055341 11.4448666572571
{other} 14.206490516662596 24.034219980239921
{not_answered} 0 0
Note that if you want to define the number of decimal places, you need to post-process the results.
14. Weighted grid table
This example shows the grid table from Example 12 weighted using the genbalance weighting variable.
SELECT groupby.col[0] As Rating,
SUM((rating[{Dinosaurs}].Column >= groupby.col[0] AND rating[{Dinosaurs}].Column IS NOT NULL) * genbalance) As Dinosaurs,
SUM((rating[{Conservation}].Column >= groupby.col[0] AND rating[{Conservation}].Column IS NOT NULL) * genbalance) As Conservation,
SUM((rating[{Insects}].Column >= groupby.col[0] AND rating[{Insects}].Column IS NOT NULL) * genbalance) As Insects,
SUM((rating[{Whales}].Column >= groupby.col[0] AND rating[{Conservation}].Column IS NOT NULL) * genbalance) As Whales,
SUM((rating[{Botany}].Column >= groupby.col[0] AND rating[{Botany}].Column IS NOT NULL) * genbalance) As Botany
FROM vdata
GROUP BY rating[{Dinosaurs}].Column.DefinedCategories()
ON rating[{Dinosaurs}].Column.DefinedCategories() with (BaseSummaryRow)
Here is the result set:
Rating Dinosaurs Conservation Insects Whales Botany
------------------------------- ------------------ ------------------ ------------------ ------------------ ------------------
297.69483596086508 297.69483596086508 297.69483596086508 297.69483596086508 297.69483596086508
{not_at_all_interested_1} 10.161961615085611 3.4334599971771302 13.852002620697039 2.0323923230171221 11.30644828081132
{not_particularly_interested_2} 1.14448666572571 1.775811314582824 1.775811314582824 0 2.920297980308534
{no_opinion_3} 26.795843839645411 6.471920609474183 18.5481119155884 14.483327269554154 13.082259595394145
{slightly_interested_4} 111.36632263660434 30.229303836822545 27.309005856514009 27.427168488502527 14.226746261119855
{very_interested_5} 107.06521272659306 13.200422227382665 18.15311253070832 39.128616154193921 13.970165252685558
15. Crosstabulation showing number of completes versus timeouts per day
This example is designed for data collected using UNICOM Intelligence Interviewer, which sets the
DataCollection.Status DataCollection.FinishTime system variables. These system variables are not present in a Quanvert database, and so the following query does not run on the Museum sample Quanvert database. However, you could run it on the
The Short Drinks sample.
SELECT groupby.col[0] AS FinishTime,
SUM(DataCollection.Status >= {completed}) AS Completed,
SUM(DataCollection.Status >= {timedout}) AS Timedout
FROM VDATA
WHERE DataCollection.FinishTime > 0
GROUP BY Left(CText(DataCollection.FinishTime),Find(CText(DataCollection.FinishTime),' ') )
ORDER BY DataCollection.FinishTime
Here is the result set when run against the Short Drinks sample database. There is only one row because the Short Drinks sample was collected on one day:
FinishTime Completed Timedout
---------- --------- --------
15/10/2002 112 9
See also