5. Running your first cleaning script
Data cleaning is the process by which you correct errors and anomalies in the case data. Typically you clean the data using mrScriptBasic code in the OnNextCase Event section. mrScriptBasic is based on Visual Basic Scripting Edition (VBScript), which is in turn based on Visual Basic, and if you are familiar with either of these languages, you will find mrScriptBasic easy to pick up. This section does not go into detail about mrScriptBasic, but rather walks you through a simple cleaning script and shows you how to examine the results. The script is designed as a taster rather than as a “real life” example.
You will look at and run the MyFirstCleaningScript.dms file, which is similar to the Cleaning.dms file described in Example 1: More than one response to a single response question in the Data Cleaning section. However, it has been modified to deliberately introduce some errors into a copy of the Museum XML sample data, which are then “cleaned” in the OnNextCase Event section. Without these modifications, the cleaning script would not change the data because the Museum sample data is generally clean.
1 Open the MyFirstCleaningScript.dms file in UNICOM Intelligence Professional. By default, the file is in the [INSTALL_FOLDER]\IBM\SPSS\DataCollection\7\DDL\Scripts\Data Management\DMS folder.
The InputDataSource section contains the following update query:
UpdateQuery = "UPDATE vdata SET interest = {Birds, Fossils}, _
expect = expect + {12, 13, 14}, _
when_decid = when_decid + {129, 130, 131, 132, 133}, _
age = age + {1, 2, 3} _
WHERE Respondent.Serial < 11"
This updates the first 10 case data records in the input data source with additional responses to four questions (interest, expect, when_decid, and age). This makes the answers to these questions incorrect because they are all single response questions and so should have only one response each. More than one response to a single response question is typical of the type of error that data cleaning attempts to correct.
Now look at the OnNextCase Event section in MyFirstCleaningScript.dms to see how these four questions are cleaned.
When an error is encountered in the code when it is validated or run, UNICOM Intelligence Professional displays a message that includes the line number. For example, the message for an error that occurs in the OnNextCase Event section on line 41 would look something like this:
Event(OnNextCase,"Clean the data")mrScriptEngine parse error:
Parser Error(41): ...
Using the option to display the line numbers makes it easy to locate the line on which the error occurred. For more information, see Debugging mrScriptBasic code.
MyFirstCleaningScript.dms shows several different ways of cleaning single response data that contains more than one response. However, the AnswerCount function is always used to test whether more than one response has been selected for the question. The AnswerCount function is part of the UNICOM Intelligence Function Library, all of whose functions are automatically available to mrScriptBasic.
Lines 43-45 test whether the interest question has more than one response and, if so, replace the responses with the Not answered response.
Lines 47-49 replace any multiple responses to the expect question with a predefined default.
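The two replacements described above might look something like the following fragment of an OnNextCase Event section. This is a sketch rather than the shipped code: the category names Not_answered and general_knowledge_and_education are taken from the query results described later in this section, and the exact lines in MyFirstCleaningScript.dms may differ.

```mrscriptbasic
Event(OnNextCase, "Clean the data")
    ' A single response question should have exactly one response
    If interest.AnswerCount() > 1 Then
        ' Replace all of the responses with the Not answered category
        interest = {Not_answered}
    End If

    If expect.AnswerCount() > 1 Then
        ' Replace all of the responses with a predefined default
        expect = {general_knowledge_and_education}
    End If
End Event
```

Both tests follow the same pattern: AnswerCount detects the problem, and an assignment of a categorical value overwrites the existing responses.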
Lines 51-53 test whether there are multiple responses to the when_decid question and, if so, use the Ran function to select one of the responses at random and remove the rest.
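A minimal sketch of the random selection follows. It assumes that the Ran function is called with a count of 1, so that it returns a single response chosen at random from the existing ones; the shipped script may call the function differently.

```mrscriptbasic
Event(OnNextCase, "Clean the data")
    If when_decid.AnswerCount() > 1 Then
        ' Keep one of the existing responses, chosen at random,
        ' and discard the others
        when_decid = when_decid.Ran(1)
    End If
End Event
```

Because the selection is random, the response that is kept for a given case can differ from one run to the next.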
Lines 55-59 handle multiple responses to the age question by setting the DataCleaning.Status system variable (see System variables) to Needs review and adding a message to both a text string and the DataCleaning.Note system variable. Line 41 has already set up the text string to contain the respondent's serial number (Respondent.Serial system variable) and line 68 writes the text to a report file.
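The flagging approach could be sketched as follows. The variable name strDetails is hypothetical, the NeedsReview category name is inferred from the needsreview value shown in the query results, and the message wording follows the Age needs checking text that appears in the report file. Writing the string to the report file is left as a comment, because the object used for that is set up elsewhere in the script and is not shown in this section.

```mrscriptbasic
Event(OnNextCase, "Clean the data")
    ' Start the report line with the respondent's serial number
    Dim strDetails    ' hypothetical variable name
    strDetails = CText(Respondent.Serial)

    If age.AnswerCount() > 1 Then
        ' Leave the responses unchanged, but flag the case for review
        DataCleaning.Status = {NeedsReview}
        DataCleaning.Note = DataCleaning.Note + " Age needs checking."
        strDetails = strDetails + " Age needs checking."
    End If

    ' At this point the script writes strDetails to the report file
End Event
```

Unlike the other techniques, this one does not alter the responses; it only records that a person needs to look at the case.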
Lines 63-65 illustrate another way of handling a single response question that has more than one response: deleting the record from the output data source. This is done for the gender question. To illustrate it, lines 61-62 of the OnNextCase Event section give the gender question multiple responses for one case data record (the one for which Respondent.Serial has a value of 11).
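A sketch of the record deletion follows. It assumes that the DropCurrentCase method of the dmgrJob object is used to exclude the current case from the output, and that the gender question has categories named male and female; check both assumptions against the shipped script.

```mrscriptbasic
Event(OnNextCase, "Clean the data")
    ' Deliberately make gender multiple response for one case
    ' (for demonstration only)
    If Respondent.Serial = 11 Then
        gender = {male, female}
    End If

    If gender.AnswerCount() > 1 Then
        ' Do not transfer this case to the output data source
        dmgrJob.DropCurrentCase()
    End If
End Event
```

Dropping a case affects only the output data source; the record remains in the input data source.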
2 Run the DMS file.
4 Enter the following query into the box in DM Query:
SELECT Respondent.Serial, interest, expect, when_decid,
age, gender, DataCleaning.Note, DataCleaning.Status
FROM vdata
Here are the results for the first 15 case data records:
Study the first ten rows. These represent the case data records for which the update query inserted additional responses for the interest, expect, when_decid, and age questions. As expected, the multiple responses in the interest and expect columns have been replaced with Not_answered and general_knowledge_and_education respectively. The multiple responses in the when_decid column have been replaced by one response selected randomly. The responses in the age column are unchanged, but the text has been written to the DataCleaning.Note variable and the DataCleaning.Status variable is set to needsreview. Also, as expected, there is no Respondent.Serial with a value of 11, because this is the record for which the gender variable was given two responses and so the case was deleted from the output.
If you open the MyFirstCleaningScript.txt report file, you will see that the Age needs checking text has been written for the first ten respondents. You can open the text file in UNICOM Intelligence Professional or in a text editor, such as Notepad.
5 To find out more about data cleaning, see Data cleaning, which includes a general introduction to data cleaning, an overview of using a DMS file to clean data, and examples of how to handle many common data cleaning requirements.