Appendix C

Processing of Data

Geir Øvensen

Cleaning a data set involves identifying individual entries, or combinations of entries, that seem dubious, and deciding what to do with them. In many projects based on information from household interviews, this process begins only after field work has been completed. By using SPSS-DE in the local field offices in Gaza and Ramallah, the FAFO living conditions survey managed to integrate data collection and data cleaning. This approach had two advantages. First, data quality was improved because errors could be corrected while the teams were still in the field. Second, integrating data cleaning into the field work procedures shortened the time between the end of field work and completion of the report.

SPSS-DE offered two main possibilities for identifying questionable entries. The simplest method was to check if values entered for individual variables were within the legal ranges. A more elaborate cleaning procedure was to check if combinations of variable values were consistent.

Valid (Individual) Entries
For each individual variable, valid entry specifications (or "ranges", for short) were defined. Most variables had standard answer alternatives, which defined the acceptable ranges. For variables without standard responses, the ranges had to be determined on the basis of experience from the FAFO Gaza Pilot Survey, which provided guidelines for determining, for example, the maximum credible number of persons living in a household (see note 1).

During data punching, an audio-visual warning would appear if the puncher entered "illegal" values outside the specified ranges. Following a "beep", the screen message "Value out of range" would tell the puncher that a mistake had been made. (The programme did not, however, technically force the puncher to correct errors.) Violations of the legal ranges were checked both automatically, whenever data were entered or changed, and on specific instructions from the office staff (see the reference to "cleaning passes" below).
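The range check described above can be illustrated with a minimal Python sketch. This is not SPSS-DE syntax; the variable names, codings and bounds are hypothetical and chosen only for illustration. As in the survey's programme, an out-of-range value triggers a warning but is still stored.

```python
# Hypothetical variables and ranges; the survey's actual specifications
# are not reproduced here.
VALID_RANGES = {
    "household_size": (1, 30),  # assumed upper bound, for illustration only
    "sex": (1, 2),              # assumed coding
}


def check_range(variable, value, ranges=VALID_RANGES):
    """Return a warning message if value is outside the legal range, else None."""
    low, high = ranges[variable]
    if low <= value <= high:
        return None
    return "Value out of range"


def enter_value(record, variable, value):
    """Store the value and return any warning.

    The value is stored even when it is out of range, mirroring a warning
    that does not force the puncher to correct the entry.
    """
    warning = check_range(variable, value)
    record[variable] = value
    return warning
```

Returning the warning rather than raising an error keeps the decision about the entry with the person punching the data, which matches the non-forcing behaviour described in the text.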

Cleaning Rules
As indicated above, cleaning data also implied checking if combinations of values for several variables were credible and consistent. For this purpose FAFO developed a data entry programme containing more than 500 logical rules about acceptable relations between two or more variable values in each interview.

Some rules were logical in the strict sense, i.e. always to be observed (like "a son must be younger than his father"). Other rules were of a kind that would hold true in 95% of the cases, based on evaluation of behavioural patterns in Palestinian society (e.g. a husband who encourages his wife to appear in public without a head scarf is also likely to accept that women are allowed to vote).
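The distinction between strict rules and rules that usually hold can be sketched in Python as follows. This is an illustration only, not the survey's data entry programme; the variable names and the "error"/"query" labels are assumptions introduced here. A rule is skipped while any of the variables it needs is still missing.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    needs: Tuple[str, ...]                   # variables the rule compares
    holds: Callable[[Dict[str, Any]], bool]  # True when the rule is satisfied
    strict: bool                             # True: must always hold


# Hypothetical encodings of the two examples given in the text.
RULES = [
    Rule("son younger than father", ("son_age", "father_age"),
         lambda r: r["son_age"] < r["father_age"], strict=True),
    Rule("headscarf view implies voting view",
         ("accepts_no_headscarf", "accepts_women_voting"),
         lambda r: (not r["accepts_no_headscarf"]) or r["accepts_women_voting"],
         strict=False),
]


def violated_rules(interview: Dict[str, Any],
                   rules: List[Rule] = RULES) -> List[Tuple[str, str]]:
    """Return (rule name, severity) for each rule the interview violates."""
    findings = []
    for rule in rules:
        # A rule cannot be checked until all its variables have been entered.
        if any(var not in interview for var in rule.needs):
            continue
        if not rule.holds(interview):
            findings.append((rule.name, "error" if rule.strict else "query"))
    return findings
```

Marking a violation of a non-strict rule as a "query" rather than an "error" reflects that such a combination is unusual but may still be a true answer.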

In contrast to the legal ranges of variable values, cleaning rules could not be checked continuously. (Rules involving two or more variables in different parts of the questionnaire could not be checked until values for all the variables involved had been entered.) Instead, cleaning specifications were checked by the office staff through so-called "cleaning passes". In a cleaning pass, all entries in a file were checked against all ranges and rules concerning the variables in that file. The field office staff could present the results in two ways: by listing the ranges and rules that had been violated in each case, or by listing the cases that had violated each range or rule. By using the consistency checks between entries offered by the data entry programme, computerized data quality checks equivalent to hours of manual checking could be performed in a few seconds.
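A cleaning pass of this kind, with its two ways of presenting results, can be sketched in Python as follows. The checks, case numbers and variable names are hypothetical, and the sketch does not reproduce SPSS-DE's actual behaviour or output.

```python
from collections import defaultdict

# Hypothetical checks: name -> function returning True when the case passes.
CHECKS = {
    "household_size in 1..30": lambda c: 1 <= c["household_size"] <= 30,
    "son younger than father": lambda c: c["son_age"] < c["father_age"],
}


def cleaning_pass(cases, checks=CHECKS):
    """Run every check on every case and index the violations both ways.

    Returns (by_case, by_check):
      by_case  maps a case id to the checks it violated;
      by_check maps a check name to the case ids that violated it.
    """
    by_case = defaultdict(list)
    by_check = defaultdict(list)
    for case_id, case in cases.items():
        for name, passes in checks.items():
            if not passes(case):
                by_case[case_id].append(name)
                by_check[name].append(case_id)
    return dict(by_case), dict(by_check)
```

The by-case view suits correcting one questionnaire at a time, while the by-check view makes it easy to spot a rule that is violated so often that the rule itself may need to be reformulated.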

Correction of Wrong or Questionable Entries
While SPSS-DE offered comprehensive and effective procedures for identifying entries that transgressed legal ranges or logical rules, it did not provide detailed instructions about what to do with improbable or problematic entries. The role of the computer programme was thus confined to identifying problematic entries, leaving the office staff time and energy to concentrate on finding a fair and honest solution.

Most violations of the legal ranges turned out to be "straightforward" writing or punching errors, which could easily be explained and corrected. Violations of logical rules, however, usually posed greater challenges. For each violation of a logical rule, a judgement had to be made about its cause and substance. Some rules that were violated in a substantial number of the initial interviews had to be reformulated. Most violations of rules, however, were caused by various types of non-sampling errors.

Notes

  1. The FAFO Gaza Strip Pilot Survey interviewed 300 households in August 1991, employing a questionnaire 90% similar to the one used in the present survey.