6 Sampling and Non-sampling Errors
The statistical quality or reliability of a survey may obviously be influenced
by the errors that for various reasons affect the observations. Error components
are commonly divided into two major categories: sampling and non-sampling
errors. In the sampling literature the terms "variable errors" and
"bias" are also frequently used, though with precise meanings
slightly different from the former concepts. The total error of
a survey statistic is labeled the mean square error, being the sum of variable
errors and all biases. In this section we first give a fairly general and
brief description of the most common error components related to household
sample surveys, and discuss their presence in and impact on this particular
survey. We then go into more detail on those components which
can be assessed numerically.
Error Components and their Presence in the Survey
(1) Sampling errors are related to the sample design itself and the estimators
used, and may be seen as a consequence of surveying only a random sample
of the population rather than the complete population. Within the family of
probability sample designs - that is, designs enabling the establishment of
inclusion probabilities (random samples) - sampling errors can be estimated.
The most common measure of the sampling error is the variance of an estimate,
or derivatives thereof. The derivative most used is the standard error, which
is simply the square root of the variance.
The variance or the standard error does not tell us exactly how great the
error is in each particular case. It should rather be interpreted as a measure
of uncertainty, i.e. of how much the estimate would be likely to vary if repeatedly
selected samples (with the same design and of the same size) were surveyed.
The variance is discussed in more detail in the subsection "Sampling Error -
Variance of an Estimate" below.
(2) Non-sampling errors form a "basket" comprising all errors which
are not sampling errors. Errors of this type may induce systematic bias
in the estimates, as opposed to the random errors caused by sampling.
The category may be further divided into subgroups according to the various
origins of the error components:
- Imperfections in the sampling frame, i.e. when the population frame
from which the sample is selected does not comprise the complete population
under study, or includes foreign elements. Exclusion of certain groups of
the population from the sampling frame is one example. As described in the
Gaza section, it was decided to exclude "outside localities" from
being surveyed for cost reasons. It was maintained that the exclusion would
have negligible effects on survey results.
- Errors imposed by deviations in implementation from the theoretical sample
design and field work procedures. Examples: non-response, "wrong"
households selected or visited, "wrong" persons interviewed, etc.
Except for non-response, which will be further discussed subsequently, there
were some cases in the present survey in which the standard instructions
for "enumeration walks" had to be modified in order to make sampling
feasible. Any departure from the standard rules was considered specifically
within the context of inclusion probabilities. None of the practical solutions
adopted implies substantial alterations of the theoretical probabilities described
in the previous sections.
- The field work procedures themselves may imply unforeseen systematic
biases in the sample selection. In the present survey one procedure has
been given particular consideration as a potential source of error: the
practical modification of choosing road crossing corners - instead of any
randomly selected spot - as starting points for the enumeration walks. This
choice might impose systematic biases as to the kind of households being
sampled. However, numerous inspection trials in the field indicated that such
bias was highly unlikely to occur. According to the field work instructions,
the starting points themselves were never to be included in the sample.
Such inclusion would have implied a systematic over-representation of road
corner households, and thus may have caused biases for certain variables.
(Instead, road corner households may now be slightly under-represented, in
so far as they, as starting points, are excluded from the sample. Any
bias induced by this under-representation is, however, negligible compared
to the potential bias accompanying the former alternative.)
- Improper wording of questions, misquotations by the interviewer, misinterpretations
and other factors that may cause failure to obtain the intended response.
"Fake responses" (questions answered by the interviewer himself/herself)
may also be included in this group of possible errors. Irregularities of
this kind are generally difficult to detect. The best ways of preventing
them are to have well-trained data collectors, to apply various verification
measures, and to introduce internal control mechanisms by letting data
collectors work in pairs - possibly supplemented by the presence of the
supervisor. A substantial part of the training of supervisors and data collectors
was devoted to such measures. Verification interviews were carried out by
the supervisors among a 10% randomly selected subsample. No fake interviews
were detected. However, a few additional re-interviews were carried out,
on suspicion of misunderstandings and incorrect responses.
- Data processing errors include errors arising incidentally during
the stages of response recording, data entry and programming. In this survey
the data entry programme included consistency controls wherever possible,
aimed at correcting any logical contradictions in the data. Furthermore,
verification punching was applied in order to correct mis-entries not
detected by the consistency controls, implying that each and every questionnaire
has been punched twice.
Sampling Error - Variance of an Estimate
Generally, the prime objective of sample design is to keep the sampling error
at the lowest level possible (within a given budget). There is thus a unique
theoretical correspondence between the sampling strategy and the sampling
error, which can be expressed mathematically by the variance of the estimator
applied. Unfortunately, design complexity quickly renders variance expressions
mathematically uncomfortable and sometimes practically "impossible"
to handle. Therefore, approximations are frequently applied in order to
achieve interpretable expressions of the theoretical variance itself, and
even more so to estimate it.
In real life, practical shortcomings frequently challenge mathematical comfort.
The absence of sampling frames or other prior information forces one to use
mathematically complex strategies in order to find feasible solutions. The
design of the present survey - stratified, 4-5 stage sampling with varying
inclusion probabilities - is probably among the extremes in this respect,
implying that the variance of the estimator (5.2) is of the utmost
complexity, as will be seen subsequently.
The (approximate) variance of the estimator (5.2) is in its simplest form:

\[
\operatorname{Var}(\hat{p}) \approx \frac{1}{N^{2}}\left[\operatorname{Var}(\hat{X}) - 2p\operatorname{Cov}(\hat{X},\hat{N}) + p^{2}\operatorname{Var}(\hat{N})\right] \qquad (6.1)
\]

The variances and covariances on the right hand side of (6.1) may be expressed
in terms of the stratum variances and covariances:

\[
\operatorname{Var}(\hat{X}) = \sum_{s}\operatorname{Var}(\hat{X}_{s}), \qquad
\operatorname{Cov}(\hat{X},\hat{N}) = \sum_{s}\operatorname{Cov}(\hat{X}_{s},\hat{N}_{s})
\]

Proceeding one step further, the stratum variance may be expressed as follows:

\[
\operatorname{Var}(\hat{X}_{s}) = \sum_{k}\sum_{l}\frac{p_{s}(k,l) - p_{s}(k)\,p_{s}(l)}{p_{s}(k)\,p_{s}(l)}\,X_{sk}X_{sl} + \sum_{k}\frac{\operatorname{Var}(\hat{X}_{sk})}{p_{s}(k)}
\]

where we have introduced the notation p_s(k) = P_1(s,k), with the convention
p_s(k,k) = p_s(k). Here p_s(k,l) is the joint probability of inclusion for
PSU (s,k) and PSU (s,l), X_sk is the PSU total, and Var(X̂_sk) is the variance
of the unbiased estimate X̂_sk of X_sk arising from the later sampling stages.
The variance of N̂_s is obtained similarly by substituting X with N in the
above formula. The stratum covariance formula is somewhat more complicated
and is not expressed here.
The PSU (s,k) variance components in the stratum formula above have a structure
similar to the stratum one, as is realized by regarding the PSUs as separate
"strata" and the cells as "PSUs". Again, another variance
component emerges for each of the cells, with a structure similar
to the preceding one. In order to arrive at the "ultimate" variance
expression, yet another two or three similar stages have to be passed. It
should be realized that the final variance formula is extremely complicated,
even if simplifying modifications and approximations may reduce the complexities
stemming from the 2nd-5th sampling stages.
It should also be understood that attempts to estimate this variance properly
and exhaustively (unbiased or close to unbiased) would be beyond any realistic
effort. Furthermore, for such estimation to be accomplished, certain preconditions
have to be met, some of which cannot be satisfied here (for
instance, that at least two PSUs be selected from each stratum comprising
more than one PSU). We thus have to apply a simpler method for appraising
the uncertainty of our estimates.
Any sampling strategy (sample selection approach and estimator) may be characterized
by its performance relative to a simple random sampling (SRS) design applying
the sample average as the estimator for proportions. The design factor of
a strategy is thus defined as the ratio between the variances of the
two estimators. If the design factor is, for instance, less than 1, the
strategy under consideration is better than SRS. Usually, multi-stage
strategies are inferior to SRS, implying a design factor greater
than 1.
The design factor is usually determined empirically. Although there is no
overwhelming evidence in its favour, a factor of 1.5 is frequently used
for stratified, multi-stage designs. (The design factor may vary among survey
variables.) The rough approximate variance estimator is thus:

\[
\widehat{\operatorname{Var}}(\hat{p}) = 1.5\cdot\frac{p(1-p)}{n_T}
\]

where p is the estimate produced by (5.2) and n_T is the number of observations
underlying the estimate (the "100%"). Although this formula oversimplifies
the variance, it still captures some of the basic features of the real
variance: the variance decreases with increasing sample size, and tends
to be larger for proportions around 50% than at the tails (0% or 100%).
The square root of this variance, s = \sqrt{1.5\,p(1-p)/n_T}, is called the
standard error, and is tabulated in table A.12 for various values of p and n.
A short computational illustration follows the table.
Table A.12 Standard error estimates for proportions (s and p are specified as percentages).
Number of obs. (n) | Estimated proportion (p %) |
 | 5/95 | 10/90 | 20/80 | 30/70 | 40/60 | 50 |
10 | 8.4 | 11.6 | 15.5 | 17.7 | 19.0 | 19.4 |
20 | 6.0 | 8.2 | 11.0 | 12.5 | 13.4 | 13.7 |
50 | 3.8 | 5.2 | 6.9 | 7.9 | 8.5 | 8.7 |
75 | 3.1 | 4.2 | 5.7 | 6.5 | 6.9 | 7.1 |
100 | 2.7 | 3.7 | 4.9 | 5.6 | 6.0 | 6.1 |
150 | 2.2 | 3.0 | 4.0 | 4.6 | 4.9 | 5.0 |
200 | 1.9 | 2.6 | 3.5 | 4.0 | 4.2 | 4.3 |
250 | 1.7 | 2.3 | 3.1 | 3.5 | 3.8 | 3.9 |
300 | 1.5 | 2.1 | 2.8 | 3.2 | 3.5 | 3.5 |
350 | 1.4 | 2.0 | 2.6 | 3.0 | 3.2 | 3.3 |
400 | 1.3 | 1.8 | 2.5 | 2.8 | 3.0 | 3.1 |
500 | 1.2 | 1.6 | 2.2 | 2.5 | 2.7 | 2.7 |
700 | 1.0 | 1.4 | 1.9 | 2.1 | 2.3 | 2.3 |
1000 | 0.8 | 1.2 | 1.5 | 1.8 | 1.9 | 1.9 |
1500 | 0.7 | 0.9 | 1.3 | 1.4 | 1.5 | 1.6 |
2000 | 0.6 | 0.8 | 1.1 | 1.3 | 1.3 | 1.4 |
2500 | 0.5 | 0.7 | 1.0 | 1.2 | 1.2 | 1.2 |
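The entries of table A.12 can be reproduced directly from this approximation.
A minimal Python sketch (the function name is ours; the design factor of 1.5
is the one adopted above):

```python
from math import sqrt

DESIGN_FACTOR = 1.5  # rough factor adopted for this stratified, multi-stage design

def standard_error(p_percent, n):
    """Approximate standard error (in percent) of an estimated
    proportion p (in percent) based on n observations."""
    p = p_percent / 100.0
    return 100.0 * sqrt(DESIGN_FACTOR * p * (1.0 - p) / n)

# Reproduce one row of table A.12 (n = 100)
for p in (5, 10, 20, 30, 40, 50):
    print(p, round(standard_error(p, 100), 1))
# -> 2.7, 3.7, 4.9, 5.6, 6.0, 6.1, matching the n = 100 row
```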
Confidence Intervals
The sample which has been surveyed is one specific outcome of an "infinite"
number of random selections which might have been done within the sample
design. Other sample selections would most certainly have yielded survey
results slightly different from the present ones. The survey estimates should
thus not be interpreted as accurately as the figures themselves indicate.
A confidence interval is a formal measure for assessing the variability
of survey estimates from such hypothetically repeated sample selections.
The confidence interval is usually derived from the survey estimate itself
and its standard error:
Confidence interval: [p - c·s, p + c·s]

where c is a constant determined by the choice of a confidence coefficient,
which fixes the probability that the interval includes the true, but unknown,
population proportion for which p is an estimate. For instance, c=1 corresponds
to a confidence probability of approximately 68%, i.e. one would expect about
68 out of 100 intervals to include the true proportion if repeated surveys
were carried out. In most situations, however, a chance of roughly one out of
three of arriving at a wrong conclusion is not considered satisfactory. Usually,
confidence coefficients of 90% or 95% are preferred, 95% corresponding to
approximately c=2. Although the assessment of the location of the true population
proportion thus becomes less uncertain, the assessment itself becomes less
precise as the length of the interval increases.
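A small Python sketch of the interval computation, reusing the standard error
approximation from the previous subsection (function names are ours):

```python
from math import sqrt

def standard_error(p_percent, n, deff=1.5):
    """Approximate standard error (percent), as in the sketch above."""
    p = p_percent / 100.0
    return 100.0 * sqrt(deff * p * (1.0 - p) / n)

def confidence_interval(p_percent, n, c=2.0):
    """Approximate interval [p - c*s, p + c*s], all in percent."""
    s = standard_error(p_percent, n)
    return (p_percent - c * s, p_percent + c * s)

# Example: p = 40%, n = 500 gives s = 2.7 (cf. table A.12) and an
# approximate 95% interval of roughly [34.6%, 45.4%]
print(confidence_interval(40, 500))
```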
Comparisons between groups
Comparing the occurrence of an attribute between different sub-groups of
the population is probably the most frequently used method for making inference
from survey data. For illustration of the problems involved in such comparisons,
let us consider two separate sub-groups for which the estimated proportions
sharing the attribute are p̂1 and p̂2, respectively, while the unknown true
proportions are denoted p1 and p2. The corresponding standard error estimates
are s1 and s2. The problem of inference is thus to evaluate the significance of
the difference between the two sub-group estimates: Can the observed difference
be caused by sampling error alone, or is it so great that there must be
more substantive reasons for it?
We will assume that p̂1 is the larger of the two proportions observed.
Our problem of judgement will thus be equivalent to testing the following
hypothesis:
Hypothesis: p1 = p2
Alternative: p1 > p2
In case the test rejects the hypothesis we will accept the alternative as
a "significant" statement, and thus conclude that the observed
difference between the two estimates is too great to be caused by randomness
alone. However, as is the true nature of statistical inference, one can
(almost) never draw absolutely certain conclusions. The uncertainty of the
test is indicated by the choice of a "significance level", which
is the probability of making a wrong decision by rejecting a true hypothesis.
This probability should obviously be as small as possible. Usually it is
set at 2.5% or 5% - depending on the risk or loss involved in drawing wrong
conclusions.
The test implies that the hypothesis is rejected if

\[
\frac{\hat{p}_1 - \hat{p}_2}{\sqrt{s_1^{2} + s_2^{2}}} > c
\]

where the constant c depends on the choice of significance level:
Significance level c-value
------------------ -------
2.5% 2.0
5.0% 1.6
10.0% 1.3
As is seen, the test criterion comprises the two standard error estimates
and thus implies some calculation. It is also seen that smaller significance
levels require larger observed differences between sub-groups
in order to arrive at significant conclusions. One should be aware that
the non-rejection of a hypothesis leaves one with no conclusion at all,
rather than implying acceptance of the hypothesis itself.
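The criterion is easily computed. A minimal Python sketch, again based on the
design-factor standard error approximation; the example figures are invented:

```python
from math import sqrt

def standard_error(p_percent, n, deff=1.5):
    """Approximate standard error (percent), as in the sketches above."""
    p = p_percent / 100.0
    return 100.0 * sqrt(deff * p * (1.0 - p) / n)

def significant_difference(p1, n1, p2, n2, c=2.0):
    """One-sided test of the hypothesis p1 = p2 against p1 > p2
    (proportions in percent): reject if the criterion exceeds c."""
    s1, s2 = standard_error(p1, n1), standard_error(p2, n2)
    return (p1 - p2) / sqrt(s1 ** 2 + s2 ** 2) > c

# Invented example: 30% of n=300 observations versus 20% of n=400.
# Criterion: 10 / sqrt(3.2**2 + 2.4**2) ~= 2.5 > 2.0, so the
# difference is significant at the 2.5% level.
print(significant_difference(30, 300, 20, 400, c=2.0))
```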
Non-response
Non-response occurs when one fails to obtain an interview with a properly
pre-selected individual (unit non-response). The most frequent reasons for
this kind of non-response are refusals and absence ("not-at-homes").
Item non-response occurs when a single question is left unanswered.
Non-response is generally the most important single source of bias in surveys.
The variables most exposed to non-response bias are those related to the very
phenomenon of being a (frequent) "not-at-homer" or not (for example, cinema
attendance). In Western societies non-response rates of 15-30% are normal.
Various measures were undertaken to keep non-response at the lowest
level possible. Above all, confidence-building was a concern: contacts
were made with local community representatives in order to
enlist their support and approval. Furthermore, many hours were spent
explaining the scope of the survey to respondents and anyone else wanting
to know, giving assurances that the survey makers would neither impose taxes
on people nor demolish their homes, nor - equally important for the reliability
of the survey - bring direct material aid.
Furthermore, up to four call-backs were made if selected respondents were
not at home. Usually the data collectors were able to get an appointment
for a subsequent visit at the first attempt, so that only one revisit was
required in most cases. Unit non-response thus comprises refusals and those
not at home at all after four call-backs.
Table A.13 shows the net number of respondents and non-respondents in each
of the three parts of the survey. The initial sizes of the various samples
can be deduced from the table by adding responses and non-responses: for the
household and RSI samples the total size was 2,518 units, while the female
sample size was 1,247. It is seen from the bottom line that the non-response
rates are remarkably small compared to the "normal" magnitudes
of 10-20% in similar surveys. Consequently, there are fairly good
grounds for maintaining that the effects of non-response in this survey
are insignificant.
Table A.13 Number of (net) respondents and non-respondents in the three parts of the survey
| Households | RSIs | Women |
Region | Resp. | Non-resp. | Resp. | Non-resp. | Resp. | Non-resp. |
Gaza | 970 | 8 | 968 | 10 | 482 | 4 |
West Bank | 1,023 | 16 | 1,004 | 35 | 502 | 14 |
Arab Jerusalem | 486 | 15 | 478 | 23 | 240 | 5 |
Total | 2,479 | 39 | 2,450 | 68 | 1,224 | 23 |
Non-response rate | 1.5% | 2.7% | 1.8% |
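The bottom-line rates follow directly from the table: each rate is the number
of non-respondents divided by the initial sample size (responses plus
non-responses). A minimal sketch of the arithmetic, with the totals transcribed
from table A.13:

```python
# Non-response rates from table A.13: rate = non-resp / (resp + non-resp)
samples = {"Households": (2479, 39), "RSIs": (2450, 68), "Women": (1224, 23)}

for name, (resp, nonresp) in samples.items():
    rate = 100.0 * nonresp / (resp + nonresp)
    print(f"{name}: {rate:.1f}%")   # -> 1.5%, 2.7%, 1.8%
```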