Data cleaning
Data cleaning is the process by which errors and anomalies are removed from the case data so that it can be analyzed efficiently. The following factors affect the types of errors and anomalies that are likely to occur in the case data:
Interviewing medium
In a paper questionnaire there is nothing to stop a respondent selecting more than one response to a single response question or entering their age as 500 years, for example, regardless of the quality of the instructions. You are therefore likely to find more errors and anomalies in paper survey data than in electronic survey data, because you can design an electronic questionnaire so that it will not accept more than one response to a single response question or an age that is greater than a specified maximum. However, even data collected using the best-designed electronic questionnaire may contain some errors and anomalies due to respondent confusion, fatigue, or perversity. Indeed, sometimes you may want to design the questionnaire so that contradictory information can be entered, to alert you to respondents who may be less than candid.
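To make the contrast concrete, the following is a minimal sketch, in generic Python rather than the product's questionnaire scripting, of the kind of entry-time checks an electronic questionnaire can apply and a paper one cannot. The maximum age of 120 and the function names are assumptions made for the illustration.

# Minimal sketch (generic Python, hypothetical names and limits) of
# entry-time validation that an electronic questionnaire can enforce.

MAX_AGE = 120  # assumed maximum; a paper questionnaire cannot enforce this

def accept_age(value):
    """Accept an age only if it falls within the allowed range."""
    return 0 <= value <= MAX_AGE

def accept_single_response(selected):
    """Accept exactly one selection for a single response question."""
    return len(selected) == 1

print(accept_age(500))                                # False: re-prompt
print(accept_single_response(["Very good", "Fair"]))  # False: re-prompt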
Interviewer or self-completion
Using an interviewer to conduct a survey and record the responses may reduce the number of errors and anomalies, but, because of human error, it is unlikely to eliminate all of the errors that the interviewing medium allows.
Mode of data entry
Errors can be introduced when data is entered into the system. For example, operators may make typographical and other mistakes when using a manual data entry system. Other errors are likely to be introduced when paper questionnaires are scanned; for example, when the paper has been torn, marked, or crumpled, or when the scanning software misinterprets hand-written responses. Sometimes data is cleaned as part of the data entry process. For example, during manual data entry or the verification of scanned data, operators look for and correct invalid responses according to predefined rules. This has the advantage that each response can easily be viewed in the context of the whole questionnaire and the respondent's other responses. However, it has the disadvantage that variations in the ways operators interpret and apply the rules invariably lead to inconsistencies. Moreover, it means that there is no electronic record of the original responses to refer to when a suspicion arises that errors or other distortions have been introduced during cleaning.
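One way to guard against that risk is to clean a copy of the case data and leave the original responses untouched. The following minimal sketch illustrates the idea in generic Python, with hypothetical question names and cleaning rules; it is not UNICOM Intelligence code.

import copy

def clean_case(original, rules):
    """Apply cleaning rules to a copy of the case, keeping the raw responses."""
    cleaned = copy.deepcopy(original)   # never overwrite the original responses
    log = []                            # audit trail of what was changed
    for question, rule in rules.items():
        cleaned[question], note = rule(cleaned.get(question))
        if note:
            log.append(question + ": " + note)
    return original, cleaned, log

# Example: a hypothetical rule that trims stray whitespace from a text response.
def trim(value):
    stripped = value.strip()
    return (stripped, "trimmed whitespace") if stripped != value else (value, None)

raw, clean, log = clean_case({"postcode": " AB12 3CD "}, {"postcode": trim})
print(raw, clean, log)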
Questionnaire design
Sometimes errors and anomalies are introduced due to errors in the design of the questions and the routing logic. Generally most of these errors should be discovered and corrected during testing before the first respondent is interviewed. However, in a complex survey, some potential problems may be inadvertently overlooked.
Aim of cleaning data
The aim when cleaning data is to interpret what the respondent was trying to express, without distorting the data. There are different approaches and conventions for cleaning data, each of which has its merits. For example, suppose a respondent selected both Very good and Fair in response to a single response question that has the following list of possible responses (categories):
▪Excellent
▪Very good
▪Good
▪Fair
▪Poor
▪Don't know
Here are some of the possible solutions:
▪Delete both responses and replace them with Don't know.
▪Delete both responses and replace them with Good, which is halfway between the two chosen responses.
▪Select one of the responses randomly and delete the other one.
▪Use an alternating algorithm: the first time this scenario is encountered, select the response that is higher in the scale and delete the lower one; the next time, do the reverse.
▪Mark the case as requiring review, and record the question name in the data cleaning note system variable.
▪List the problem in a report and record the action taken.
▪List the problem in a report and take no other action.
▪Delete the case.
UNICOM Intelligence Professional can handle all of these approaches and has the advantage that it can implement the chosen solution consistently.
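As a rough illustration of applying such rules consistently, here is a minimal sketch in generic Python; it is not UNICOM Intelligence scripting, and the scale, function, and rule names are assumptions made for the example. It implements two of the approaches listed above: replacing the contradictory responses with Don't know, and alternating between the higher and lower response.

SCALE = ["Excellent", "Very good", "Good", "Fair", "Poor", "Don't know"]

_pick_higher = True   # toggle used by the alternating rule

def clean_single_response(responses, rule="dont_know"):
    """Resolve more than one selection at a single response question."""
    global _pick_higher
    if len(responses) <= 1:
        return list(responses), None                  # nothing to clean
    if rule == "dont_know":
        return ["Don't know"], "contradictory responses replaced with Don't know"
    if rule == "alternate":
        ordered = sorted(responses, key=SCALE.index)  # higher in the scale first
        chosen = ordered[0] if _pick_higher else ordered[-1]
        _pick_higher = not _pick_higher               # do the reverse next time
        return [chosen], "kept " + chosen + ", deleted the other response(s)"
    raise ValueError("unknown rule: " + rule)

print(clean_single_response(["Very good", "Fair"], rule="alternate"))

Each cleaning action also returns a note describing what was changed, which could be written to a report or recorded against the case, so that there is always a record of the original problem and the action taken.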
Other common errors
Other common errors that may need correcting are the following; a check for each is sketched after the list:
▪Text and numeric responses that are outside the specified ranges. For example, a 20-character response in an 8-character postal code field, or an age in years that is greater than the specified maximum.
▪A single-choice special category (such as Don't know or Refused to answer) that has been selected in combination with a regular category in response to a multiple response question.
▪Responses that violate the routing logic. For example, in the Museum survey there is a question that asks Do you have any qualifications in biology? Only respondents who answer Yes to this question should answer the next question, What are your qualifications in biology? If a response to the second question is present for a respondent who did not answer Yes, the response to at least one of the two questions must be incorrect.
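The following minimal sketch, in generic Python rather than UNICOM Intelligence validation code, flags each of these three error types. The question names (postcode, age, news_sources, biology, biology_quals), the 8-character postal code limit, and the maximum age are assumptions made for the example; the biology questions follow the Museum survey example above.

EXCLUSIVE = {"Don't know", "Refused to answer"}   # single-choice special categories

def find_errors(case):
    """Return a list of problems found in one case (a dictionary of responses)."""
    errors = []

    # Text and numeric responses outside the specified ranges.
    if len(case.get("postcode", "")) > 8:
        errors.append("postcode is longer than 8 characters")
    if not 0 <= case.get("age", 0) <= 120:
        errors.append("age is outside the specified range")

    # A single-choice special category combined with a regular category
    # at a multiple response question.
    chosen = set(case.get("news_sources", []))
    if chosen & EXCLUSIVE and len(chosen) > 1:
        errors.append("exclusive category selected with other responses")

    # Responses that violate the routing logic.
    if case.get("biology") != "Yes" and case.get("biology_quals"):
        errors.append("biology qualifications given without a Yes at the filter question")

    return errors

print(find_errors({"postcode": "AB12 3CD", "age": 500,
                   "biology": "No", "biology_quals": ["Degree"]}))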