5. Running your first cleaning script
Data cleaning is the process by which you correct errors and anomalies in the case data. Typically you clean the data using mrScriptBasic code in the OnNextCase Event section. mrScriptBasic is based on Visual Basic Scripting Edition (VBScript), which is in turn based on Visual Basic, and if you are familiar with either of these languages, you will find mrScriptBasic easy to pick up. This section does not go into detail about mrScriptBasic, but rather walks you through a simple cleaning script and shows you how to examine the results. The script is designed as a taster rather than as a “real life” example.
You will look at and run the MyFirstCleaningScript.dms file, which is similar to the Cleaning.dms file described in Example 1: More than one response to a single response question in the Data Cleaning section. However, it has been modified to deliberately introduce some errors into a copy of the Museum XML sample data, which are then “cleaned” in the OnNextCase Event section. Without these modifications, the cleaning script would not change the data because the Museum sample data is generally clean.
1 Open the MyFirstCleaningScript.dms file in UNICOM Intelligence Professional. By default, the file is in the [INSTALL_FOLDER]\IBM\SPSS\DataCollection\7\DDL\Scripts\Data Management\DMS folder.
The InputDataSource section contains the following update query:
UpdateQuery = "UPDATE vdata SET interest = {Birds, Fossils}, _
  expect = expect + {12, 13, 14}, _
  when_decid = when_decid + {129, 130, 131, 132, 133}, _
  age = age + {1, 2, 3} _
  WHERE Respondent.Serial < 11"
This updates the first 10 case data records in the input data source with additional responses to four questions (interest, expect, when_decid, and age). This makes the answers to these questions incorrect because they are all single response questions and so should have only one response each. More than one response to a single response question is typical of the type of error that data cleaning attempts to correct.
Now look at the OnNextCase Event section in MyFirstCleaningScript.dms to see how these four questions are cleaned.
To display the line numbers, see UNICOM Intelligence Professional options.
If UNICOM Intelligence Professional encounters an error in the code while validating or running it, it displays a message that includes the line number. For example, the message for an error that occurs in the OnNextCase Event section on line 41 would look something like this:
Event(OnNextCase,"Clean the data")mrScriptEngine parse error:
Parser Error(41): ...
Using the option to display the line numbers makes it easy to locate the line on which the error occurred. For more information, see Debugging.
MyFirstCleaningScript.dms shows several different ways of cleaning single response data that contains more than one response. However, the AnswerCount function is always used to test whether more than one response has been selected for the question. The AnswerCount function is part of the UNICOM Intelligence Function Library, all of whose functions are automatically available to mrScriptBasic.
Lines 43-45 test whether the interest question has more than one response and, if so, replace the responses with the Not answered response.
Lines 47-49 replace any multiple responses to the expect question with a predefined default.
Lines 51-53 test whether there are multiple responses to the when_decid question and, if so, use the Ran function to select one of the responses at random and remove the rest.
Lines 55-59 handle multiple responses to the age question by setting the DataCleaning.Status system variable (see System variables) to Needs review and adding a message to both a text string and the DataCleaning.Note system variable. Line 41 has already set up the text string to contain the respondent's serial number (Respondent.Serial system variable), and line 68 writes the text to a report file.
Lines 63-65 illustrate another way of handling a single response question that has more than one response: deleting the record from the output data source. This is done for the gender question. To set up the example, lines 61-62 of the OnNextCase Event section give this question multiple responses for one case data record (the one for which Respondent.Serial has a value of 11).
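The cleaning steps described above can be sketched roughly as follows. This is an approximation written for illustration, not a copy of the shipped file, so its layout and line numbers will not match MyFirstCleaningScript.dms. The names strDetails and mytextfile, the event descriptions, the use of the Windows FileSystemObject to create the report file, and the Male and Female category names for gender are assumptions. The other category names and the report text are taken from the results described later in this topic.

```mrscriptbasic
' Approximate sketch only. The shipped MyFirstCleaningScript.dms may differ.

Event(OnJobStart, "Set up the report file")
    ' Assumption: the report file is created with the FileSystemObject
    ' and shared with the other events through the dmgrGlobal collection.
    Dim fso, txtfile
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set txtfile = fso.CreateTextFile("MyFirstCleaningScript.txt", True, True)
    dmgrGlobal.Add("mytextfile")
    Set dmgrGlobal.mytextfile = txtfile
End Event

Event(OnNextCase, "Clean the data")
    Dim strDetails

    ' Start the report line with the respondent's serial number
    strDetails = CText(Respondent.Serial)

    ' interest: replace multiple responses with Not answered
    If interest.AnswerCount() > 1 Then
        interest = {Not_answered}
    End If

    ' expect: replace multiple responses with a predefined default
    If expect.AnswerCount() > 1 Then
        expect = {general_knowledge_and_education}
    End If

    ' when_decid: keep one of the responses, chosen at random
    If when_decid.AnswerCount() > 1 Then
        when_decid = when_decid.Ran(1)
    End If

    ' age: leave the responses alone, but flag the case for review
    If age.AnswerCount() > 1 Then
        DataCleaning.Status = {NeedsReview}
        strDetails = strDetails + " Age needs checking. "
        DataCleaning.Note = DataCleaning.Note + " Age needs checking. "
    End If

    ' gender: deliberately make one case multiple response ...
    If Respondent.Serial = 11 Then
        gender = {Male, Female}   ' assumed category names
    End If

    ' ... then drop any case with more than one gender response
    If gender.AnswerCount() > 1 Then
        dmgrJob.DropCurrentCase()
    End If

    ' Write this case's line to the report file
    dmgrGlobal.mytextfile.WriteLine(strDetails)
End Event

Event(OnJobEnd, "Close the report file")
    dmgrGlobal.mytextfile.Close()
End Event
```

Each of the four strategies starts with the same AnswerCount test and differs only in what it does when the test succeeds, which makes it straightforward to swap one strategy for another on a given question.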
2 Run the DMS file.
3 Use DM Query to examine the clean data. To set up DM Query to do this, see How to run the example queries in DM Query using the museum sample. However, remember to select the output data source files (MyFirstCleaningScript.mdd and MyFirstCleaningScript.xml) rather than the installed Museum sample files.
4 Enter the following query into the text box in DM Query:
SELECT Respondent.Serial, interest, expect, when_decid,
  age, gender, DataCleaning.Note, DataCleaning.Status
FROM vdata
Here are the results for the first 15 case data records:
Study the first ten rows. These represent the case data records for which the update query inserted additional responses for the interest, expect, when_decid, and age questions. As expected, the multiple responses in the interest and expect columns have been replaced with Not_answered and general_knowledge_and_education respectively. The multiple responses in the when_decid column have been replaced by one response selected randomly. The responses in the age column are unchanged, but we can see that the text has been written to the DataCleaning.Note variable and the DataCleaning.Status variable is set to needsreview. Also, as expected, there is no Respondent.Serial with a value of 11, because this is the record for which the gender variable was given two responses and so the case was deleted from the output.
If you open the MyFirstCleaningScript.txt report file, you will see that the Age needs checking text has been written for the first ten respondents. You can open the text file in UNICOM Intelligence Professional or in a text editor, such as Notepad.
5 To find out more about data cleaning, see Data cleaning, which includes a general introduction to data cleaning, an overview of using a DMS file to clean data, and examples of how to handle many common data cleaning requirements.
Next
The next section gives ideas about how to go about learning mrScriptBasic. See 6. Mastering mrScriptBasic.
See also
Getting started with Data Management scripting