Data cleaning
Data cleaning is the process by which errors and anomalies are removed from the case data so that it can be analyzed efficiently. The following factors affect the types of errors and anomalies that are likely to occur in the case data:
Interviewing medium
In a paper questionnaire there is nothing to stop a respondent selecting more than one response to a single response question or entering their age as 500 years, for example, regardless of the quality of the instructions. You are therefore likely to find more errors and anomalies in paper survey data than in electronic survey data, because you can design an electronic questionnaire so that it will not accept more than one response to a single response question or an age that is greater than a specified maximum. However, even data collected using the best-designed electronic questionnaire may contain some errors and anomalies due to respondent confusion, fatigue, or perversity. Indeed, sometimes you may want to design the questionnaire so that contradictory information can be entered, to alert you to respondents who may be less than candid.
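To make the contrast concrete, the following is a minimal sketch, in generic Python rather than the product's questionnaire scripting, of the kind of entry-time checks an electronic questionnaire can apply and a paper one cannot. The maximum age of 120 and the function names are assumptions made for the illustration.

# Minimal sketch (generic Python, hypothetical names and limits) of
# entry-time validation that an electronic questionnaire can enforce.

MAX_AGE = 120  # assumed maximum; a paper questionnaire cannot enforce this

def accept_age(value):
    """Accept an age only if it falls within the allowed range."""
    return 0 <= value <= MAX_AGE

def accept_single_response(selected):
    """Accept exactly one selection for a single response question."""
    return len(selected) == 1

print(accept_age(500))                                # False: re-prompt
print(accept_single_response(["Very good", "Fair"]))  # False: re-prompt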
Interviewer or self-completion
Using an interviewer to conduct a survey and record the responses may reduce the number of errors and anomalies, but, because of human error, it is unlikely to eliminate all of the errors that the interviewing medium allows.
Mode of data entry
Errors can be introduced when data is entered into the system. For example, operators may make typographical and other mistakes when using a manual data entry system. Other errors are likely to be introduced when paper questionnaires are scanned; for example, when the paper has been torn, marked, or crumpled, or when the scanning software misinterprets hand-written responses. Sometimes data is cleaned as part of the data entry process. For example, during manual data entry or the verification of scanned data, operators look for and correct invalid responses according to predefined rules. This has the advantage that each response can easily be viewed in the context of the whole questionnaire and the respondent's other responses. However, it has the disadvantage that variations in the ways operators interpret and apply the rules invariably lead to inconsistencies. Moreover, it means that there is no electronic record of the original responses to refer to when a suspicion arises that errors or other distortions have been introduced during cleaning.
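One way to guard against that risk is to clean a copy of the case data and leave the original responses untouched. The following minimal sketch illustrates the idea in generic Python, with hypothetical question names and cleaning rules; it is not UNICOM Intelligence code.

import copy

def clean_case(original, rules):
    """Apply cleaning rules to a copy of the case, keeping the raw responses."""
    cleaned = copy.deepcopy(original)   # never overwrite the original responses
    log = []                            # audit trail of what was changed
    for question, rule in rules.items():
        cleaned[question], note = rule(cleaned.get(question))
        if note:
            log.append(question + ": " + note)
    return original, cleaned, log

# Example: a hypothetical rule that trims stray whitespace from a text response.
def trim(value):
    stripped = value.strip()
    return (stripped, "trimmed whitespace") if stripped != value else (value, None)

raw, clean, log = clean_case({"postcode": " AB12 3CD "}, {"postcode": trim})
print(raw, clean, log)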
Questionnaire design
Sometimes errors and anomalies are introduced due to errors in the design of the questions and the routing logic. Generally most of these errors should be discovered and corrected during testing before the first respondent is interviewed. However, in a complex survey, some potential problems may be inadvertently overlooked.
Aim of cleaning data
The aim when cleaning data is to interpret what the respondent was trying to express, without distorting the data. There are different approaches and conventions for cleaning data, each of which has its merits. For example, suppose a respondent selected both Very good and Fair in response to a single response question that has the following list of possible responses (categories):
▪Excellent
▪Very good
▪Good
▪Fair
▪Poor
▪Don't know
Here are some of the possible solutions:
▪Delete both responses and replace them with Don't know.
▪Delete both responses and replace them with Good, which is halfway between the two chosen responses.
▪Select one of the responses randomly and delete the other one.
▪Use an alternating algorithm: the first time this scenario is encountered, select the response that is higher in the scale and delete the lower one; the next time, do the reverse.
▪Mark the case as requiring review, and record the question name in the data cleaning note system variable.
▪List the problem in a report and record the action taken.
▪List the problem in a report and take no other action.
▪Delete the case.
UNICOM Intelligence Professional can handle all of these approaches and has the advantage that it can implement the chosen solution consistently.
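As a rough illustration of applying such rules consistently, here is a minimal sketch in generic Python; it is not UNICOM Intelligence scripting, and the scale, function, and rule names are assumptions made for the example. It implements two of the approaches listed above: replacing the contradictory responses with Don't know, and alternating between the higher and lower response.

SCALE = ["Excellent", "Very good", "Good", "Fair", "Poor", "Don't know"]

_pick_higher = True   # toggle used by the alternating rule

def clean_single_response(responses, rule="dont_know"):
    """Resolve more than one selection at a single response question."""
    global _pick_higher
    if len(responses) <= 1:
        return list(responses), None                  # nothing to clean
    if rule == "dont_know":
        return ["Don't know"], "contradictory responses replaced with Don't know"
    if rule == "alternate":
        ordered = sorted(responses, key=SCALE.index)  # higher in the scale first
        chosen = ordered[0] if _pick_higher else ordered[-1]
        _pick_higher = not _pick_higher               # do the reverse next time
        return [chosen], "kept " + chosen + ", deleted the other response(s)"
    raise ValueError("unknown rule: " + rule)

print(clean_single_response(["Very good", "Fair"], rule="alternate"))

Each cleaning action also returns a note describing what was changed, which could be written to a report or recorded against the case, so that there is always a record of the original problem and the action taken.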
Other common errors
Other common errors that may need correcting are the following; a check for each is sketched after the list:
▪Text and numeric responses that are outside the specified ranges. For example, a 20-character response in an 8-character postal code field, or an age in years that is greater than the specified maximum.
▪A single-choice special category (such as Don't know or Refused to answer) that has been selected in combination with a regular category in response to a multiple response question.
▪Responses that violate the routing logic. For example, in the Museum survey there is a question that asks Do you have any qualifications in biology? Only respondents who answer Yes to this question should answer the next question, What are your qualifications in biology? If a response to the second question is present for a respondent who did not answer Yes, the response to at least one of the two questions must be incorrect.
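The following minimal sketch, in generic Python rather than UNICOM Intelligence validation code, flags each of these three error types. The question names (postcode, age, news_sources, biology, biology_quals), the 8-character postal code limit, and the maximum age are assumptions made for the example; the biology questions follow the Museum survey example above.

EXCLUSIVE = {"Don't know", "Refused to answer"}   # single-choice special categories

def find_errors(case):
    """Return a list of problems found in one case (a dictionary of responses)."""
    errors = []

    # Text and numeric responses outside the specified ranges.
    if len(case.get("postcode", "")) > 8:
        errors.append("postcode is longer than 8 characters")
    if not 0 <= case.get("age", 0) <= 120:
        errors.append("age is outside the specified range")

    # A single-choice special category combined with a regular category
    # at a multiple response question.
    chosen = set(case.get("news_sources", []))
    if chosen & EXCLUSIVE and len(chosen) > 1:
        errors.append("exclusive category selected with other responses")

    # Responses that violate the routing logic.
    if case.get("biology") != "Yes" and case.get("biology_quals"):
        errors.append("biology qualifications given without a Yes at the filter question")

    return errors

print(find_errors({"postcode": "AB12 3CD", "age": 500,
                   "biology": "No", "biology_quals": ["Degree"]}))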