Social science research relies heavily on opinion polls to collect individual-level information for hypothesis testing and theory building. To evaluate the quality of survey data, researchers have developed various measures of data validity and reliability (de Vaus, 2001; Maxim, 1999; Judd et al., 1991). There are three conventional measures of validity: content, criterion-related, and construct validity; for reliability, the commonly used measures are test-retest, alternative form, split-halves, and internal consistency. These measures, however, are not always feasible in practice because of the limited time and budgets of polls. Moreover, they are mainly used to test individual survey items rather than to evaluate the overall quality of a data set. A cost-effective method for evaluating the quality of an entire data set is still absent. This paper attempts to fill these research gaps.
Liu & Chen (2004) attempted to use the "interviewer assessment" (also known as "interviewer perception") as a possible complement to the conventional measures of data quality. Their idea is noteworthy, but their analysis is preliminary (based on only three surveys from the 2002 and 2003 rounds of Taiwan's Election and Democratization Study, TEDS), and consequently their findings are inconclusive. In this paper, we use 27 surveys to re-examine Liu and Chen's idea and offer more empirical evidence for the usefulness of the interviewer assessment as a measure of data quality. In short, this paper examines interviewers' assessments of their completed interviews, which may be a useful indicator of overall data quality.
In the following sections, we first discuss why data quality is an important issue for opinion polls. We then discuss, from both theoretical and practical perspectives, why interviewer assessments can be a useful supplementary indicator of data quality. An empirical examination of our argument is carried out based on the TEDS surveys from 2002 to 2017. The paper concludes with remarks on our findings and suggestions for future studies.
Why is Data Quality an Important Issue for Opinion Polls?
Despite state-of-the-art methodology and technology, scientific opinion polls seldom claim to be error-free; on the contrary, an opinion poll is scientific mainly because it acknowledges errors and endeavors to control and correct them (Lavrakas, 2013). The evaluation of data quality is therefore a fundamental basis of modern survey methodology and research.
Survey data are potentially subject to errors arising from both the nature of public opinion and the design of opinion polls. Public opinion is unstable by nature and hence difficult to measure without error. Over fifty years ago, Converse (1964) noticed that survey respondents did not answer related questions in an interview consistently, and that their answers changed, apparently at random, from interview to interview. Converse interpreted these findings as evidence that the mass public has no genuine attitudes toward most societal issues. Opinion polls are inevitably subject to errors, because respondents do not admit to their non-attitudes but tend to make up "doorstep opinions" at random at the moment of the interview.
Achen (1975), in contrast, emphasizes the imperfect design of opinion polls as a source of errors in survey data. He argues that the mass public does have genuine attitudes toward societal issues, though these attitudes tend to be vague. Consequently, public opinion is not fixed at a point but is a distribution of points around some central position. Better survey designs, particularly questionnaire designs that take the vagueness of attitudes into account, are therefore essential to capturing public opinion and reducing errors in opinion polls. Overall, Achen and Converse, though debating the existence of genuine attitudes, both implicitly concur that public opinion is difficult to measure accurately, hence the importance of evaluating survey data quality.
Furthermore, studies on the formation of public opinion also provide theoretical accounts of why public opinion is unstable and difficult to measure. Sniderman et al. (1991) argue that an individual takes into account "evaluatively distinct dimensions of judgments" in interpreting events or making decisions. As the number of distinct dimensions involved increases, so does the number of considerations required, which complicates judgment and results in opinion instability.
Zaller (1992) also maintains that an individual possesses numerous inconsistent considerations relating to a particular issue. However, he argues that, rather than taking all considerations into account, the individual forms his or her survey response to that issue based on only a few of the considerations that are at the top of his or her mind at the moment of the interview. Given that the considerations are inconsistent and their relative salience varies with time, public opinion (more specifically, the survey-measured opinion) is unstable by nature.
Similarly, Alvarez & Brehm (2002) argue that public opinion is structured by a set of diverse predispositions. If an individual has consistent opinions related to a particular issue, his or her opinion toward that issue will be stable. In contrast, if the individual holds multiple predispositions, his or her opinion will become ambivalent, equivocal, and uncertain.
Taken together, although these classic works do not entirely agree about the mechanism of opinion instability or the extent to which public opinion is unstable, there is a consensus that public opinion is indeed unstable, which implies that measuring public opinion without error is difficult. It is therefore important to evaluate the quality of survey data.
Assessment of Overall Data Quality
Public opinion is variable and dynamic. Conventional measures of data quality, especially those based on response consistency (e.g., test-retest reliability), are thus not always adequate to provide a clear evaluation of data quality (Johnson et al., 2001). Moreover, those measures are mainly designed to evaluate individual survey items rather than the entire data set. Certainly, if such an evaluation were carried out for a substantial proportion of items in a questionnaire, the aggregated results might serve as an indicator of the overall quality of a data set. The problem is that most opinion polls can afford to evaluate only a limited number of items, and those items are often chosen subjectively. The evaluation results are thus not always representative of the quality of the entire data set. In some cases, there is no evaluation of any individual item to use as an indicator of overall data quality. For example, opinion polls that evaluate the test-retest reliability of individual items, such as TEDS, are now under greater pressure to abandon such evaluation, as survey interviews become increasingly costly. Taken together, these considerations stress the need for a more cost-effective method for evaluating the overall quality of survey data.
It is important to clarify that we are not arguing against using conventional measures of reliability and validity as indicators of overall data quality. Instead, we are arguing for making use of supplementary information in evaluating survey data. One potential source of such information is interviewers' personal assessments of their completed interviews. If interviewers' assessments are highly correlated with the traditional indicators of reliability and validity, interviewers' evaluations of respondents would be a cheap and efficient way to provide information about overall data quality.
Interviewer Assessment as a Measure of Data Quality
Table 1 summarizes all TEDS face-to-face surveys to date. Since 2002, TEDS has required every interviewer to complete a short questionnaire right after each completed interview. The questionnaire consists of two parts (except in TEDS 2017). The first part is designed to record special events that occurred during the interview (e.g., the respondent's comments about the survey), and the second part consists of several Likert items assessing the interview. Three items are of particular interest to our analysis: (1) how cooperative the respondent was, (2) how well the respondent understood the questions, and (3) how trustworthy the respondent's answers are.
In the following pages, we explain the rationale for the use of the interviewer assessment as a measure of data quality, and then we empirically examine whether the TEDS interviewer assessment is, as the rationale suggests, informative to the evaluation of data quality.
The literature on survey non-response and measurement error provides some support for the use of the interviewer assessment as an indicator of data quality. It has been established that survey participation and response accuracy are connected to some extent (Olson, 2006; Peytchev et al., 2010; Tourangeau et al., 2010). People with a low willingness to participate in surveys tend to decline the interview when contacted; if they do participate, these so-called "reluctant respondents" tend to exhibit poor interview behavior and, most crucially, tend to give poor responses, compromising data quality as a consequence. From this theoretical perspective, we argue that interviewers' assessments of respondents' interview behavior (e.g., cooperation, comprehension, and trustworthiness) should be informative to the evaluation of data quality.
In addition to this theoretical consideration, the interviewer assessment in TEDS has three features of practical value for evaluating survey data. First, the interviewer assessment aims to provide an overall evaluation of the interview rather than of individual survey items. Second, whereas the conventional measures focus on the preparatory work for interviews (e.g., questionnaire design) or the end result of interviews (i.e., survey responses), interviewers' assessments take the real context of interviews into account, through interviewers' observations of and interactions with respondents. Third, compared to some commonly used measures that require repeated interviews or measurements, the interviewer assessment is a more affordable, convenient, and hence practical approach to evaluating data quality. These three features make the interviewer assessment a valuable complement to the conventional measures of reliability and validity.
Interviewer assessment results. We begin the analysis with a summary of the TEDS interviewer assessment results, as shown in Figure 1 (see the figure legend for more details about variable coding and meanings). It should be noted that TEDS includes both national and local surveys. On the one hand, the fluctuation of interviewer assessments over the years can be shown in one concise graph. On the other hand, it may not be appropriate to compare interviewer assessments of different surveys without considering the survey context. There are two findings worth mentioning.
The first is that, of the three issues noted by interviewers (comprehension of survey questions, cooperation of respondents, and trustworthiness of respondents), comprehension seems to be the most urgent issue for TEDS to address. The proportion of respondents flagged as not understanding the questions is three times as large as the proportion flagged as uncooperative. Although untrustworthy respondents appeared to be as significant an issue as comprehension in early TEDS surveys, the situation has changed since 2013. The proportion of respondents whose answers are considered untrustworthy has decreased and has remained at comparatively low levels, whereas the proportion of respondents with comprehension difficulties has remained relatively high and variable.
The second noteworthy finding concerns the two surveys for Yun-lin County: TEDS 2005M-YL and 2009M-YL. These two surveys interviewed different respondents, with different interviewers, using different questionnaires, for different magistrate elections in different years and contexts. Despite these differences, the two surveys have one thing in common: they have the worst interviewer assessments. The assessments from TEDS 2009M-YL are slightly better than those from TEDS 2005M-YL, but worse than those of the other 23 surveys. Certainly, this may be just a coincidence, but if TEDS plans to conduct another survey for a Yun-lin County magistrate election, the TEDS team should give more thought to this problem.
Interviewer assessment and reliability. We now examine whether the TEDS interviewer assessment is informative to the evaluation of data quality. As stressed repeatedly, we consider the interviewer assessment a supplement to, rather than a replacement for, conventional measures of reliability and validity. Accordingly, we take the conventional measures as benchmarks for examining the interviewer assessment. A strong correlation between the interviewer assessment results and the conventional measures is then regarded as evidence that the interviewer assessment is a good indicator of data quality. We focus on reliability here and move on to validity in the next section.
As can be seen from Table 1, most TEDS surveys (except for 2013 and 2017) have a so-called “retest-interview” component. That component involves selecting a random subset of respondents from the sample of the main interview and then asking them to answer some questions that are chosen from the main interview questionnaire. The consistency between those respondents’ answers in the main and retest interviews gives an indication of reliability. We use this test-retest reliability as the benchmark to examine the interviewer assessment.
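The test-retest reliability described above can be sketched as follows. This is a minimal illustration, not the actual TEDS procedure: the item names, answer codes, and toy data are invented, and we assume reliability is operationalized as simple percent agreement between the main and retest answers.

```python
# Hypothetical sketch of test-retest reliability as percent agreement:
# for each retested item, count the share of retested respondents who
# gave the same answer in the main and retest interviews, then average
# over items to get an overall figure for the survey.

def item_reliability(main_answers, retest_answers):
    """Percent agreement between main- and retest-interview answers."""
    assert len(main_answers) == len(retest_answers)
    matches = sum(1 for m, r in zip(main_answers, retest_answers) if m == r)
    return matches / len(main_answers)

def overall_reliability(main, retest, items):
    """Average percent agreement over the retested items."""
    return sum(item_reliability(main[i], retest[i]) for i in items) / len(items)

# Toy data: three retested respondents, two items (names are illustrative).
main = {"voted": ["yes", "yes", "no"], "party_id": ["A", "B", "none"]}
retest = {"voted": ["yes", "yes", "no"], "party_id": ["A", "C", "none"]}

print(round(overall_reliability(main, retest, ["voted", "party_id"]), 3))  # 0.833
```

In practice, TEDS may weight items or score partial agreement differently; the averaging here is only one plausible aggregation.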
Figure 2 shows that the interviewer assessment appears to be a sensitive indicator of data reliability. The correlation between the overall test-retest reliability and each of the three interviewer assessments (i.e., the percentages of respondents classified as uncooperative, unable to understand, and untrustworthy, respectively) is rather strong and in the expected direction.
The higher the proportion of respondents flagged as uncooperative, unable to understand, and untrustworthy in a survey, the poorer the data quality of that survey in terms of the overall test-retest reliability.
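The survey-level correlation behind this pattern can be illustrated with a plain Pearson coefficient. The numbers below are invented for illustration only; they merely mimic the direction of the relationship, with each point standing for one survey.

```python
# A minimal sketch of the correlation in Figure 2: each observation is one
# survey, pairing the share of respondents flagged by interviewers with that
# survey's overall test-retest reliability.

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy data for five surveys: a higher flagged share goes with lower reliability,
# so the coefficient comes out strongly negative.
pct_flagged = [0.02, 0.04, 0.05, 0.08, 0.10]
reliability = [0.91, 0.88, 0.86, 0.80, 0.78]
print(round(pearson_r(pct_flagged, reliability), 3))
```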
However, when analyzing factual and attitudinal questions separately, we find that the interviewer assessment serves as a good indicator of the reliability of attitudinal questions but not of factual questions. Factual questions, such as whether respondents voted, are usually easier to answer and less sensitive than attitudinal questions, such as respondents' comments on election results. Respondents are more willing to answer factual questions even when they are uncooperative, and the answers they provide are more likely to be correct and stable even when interviewers generally consider them unable to understand or untrustworthy. Test-retest reliability of factual variables may therefore be higher than that of attitudinal variables, which is one possible explanation for the weaker correlation between the reliability of factual variables and the interviewer assessment.
Interviewer assessment and validity. TEDS usually conducts validity evaluation based on measures such as content validity or face validity. Specifically, TEDS conducts pre-test interviews and then uses the results to evaluate whether the questionnaire is well constructed and whether it appears to measure what it is supposed to measure. The procedures and data from this evaluation are not (and, by their very nature, cannot be) standardized as much as the test-retest reliability evaluation. Therefore, we are unable to use that validity evaluation as a benchmark to examine the interviewer assessment.
Three alternative benchmarks are employed. The first two concern the validity of the turnout measure. For TEDS, a study centering on political and electoral subjects, voter turnout is undoubtedly one of the most crucial measures. It is available in almost every TEDS survey. Most importantly, it is one of the very few measures with a known true population parameter (i.e., the official turnout rate). By assessing the extent of turnout error (i.e., the difference between the official turnout rate and the TEDS estimate), we obtain the conventional criterion-related validity of the turnout measure for every TEDS survey, and we use it as the first benchmark to examine the interviewer assessment.
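The turnout-error benchmark can be sketched as below. All figures and response codes are invented for illustration; they are not actual TEDS or official election results, and we assume the survey estimate is computed over substantive answers only.

```python
# Hypothetical sketch of criterion-related validity for the turnout measure:
# the benchmark is the gap between a survey's estimated turnout rate and the
# official turnout rate for that election.

def survey_turnout_pct(responses):
    """Estimated turnout among respondents with a substantive answer."""
    substantive = [r for r in responses if r in ("voted", "did not vote")]
    return 100 * sum(1 for r in substantive if r == "voted") / len(substantive)

def turnout_error(estimated_pct, official_pct):
    """Absolute turnout error in percentage points."""
    return abs(estimated_pct - official_pct)

# Toy sample: 8 of 10 substantive answers report voting (80%), checked
# against a hypothetical official turnout of 74.9%.
answers = ["voted"] * 8 + ["did not vote"] * 2 + ["refuse"]
print(round(turnout_error(survey_turnout_pct(answers), 74.9), 1))  # 5.1
```

The positive gap in the toy data mirrors the well-known tendency of surveys to overreport turnout.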
The three charts in the upper panel of Figure 3 suggest that the interviewer assessment is not a good indicator of validity in terms of the accuracy of turnout rates. The correlation between each interviewer assessment and the degree of error is weak and runs opposite to the expected direction.
The higher the proportion of respondents flagged as uncooperative, unable to understand, and untrustworthy in a survey, the smaller the error of the turnout rate.
Second, we use the item nonresponse rate as another benchmark. Turnout is a factual question, and TEDS usually carries out interviews promptly after every target election. Not surprisingly, nonresponse to the turnout item (including "forget" answers and refusals to answer) is very rare in all TEDS surveys (0.8%-1.5%). Nonetheless, the nonresponse rate of the turnout item still varies across TEDS surveys. We strongly suspect that non-respondents to the turnout item are the reluctant respondents, who are, as the literature describes, a major source of invalid responses. We therefore take a relatively high nonresponse rate on the turnout item as a sign of low validity of that item. This is essentially an application of conventional face validity, and we use it as the second benchmark to examine the interviewer assessment. As shown in the three charts in the lower panel of Figure 3, the interviewer assessment appears to be a fairly reasonable indicator of validity in terms of the response rate of the turnout item. Surveys with more non-respondents to the turnout question tend to receive poorer assessments from interviewers as well.
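The item nonresponse rate used as the second benchmark amounts to a simple proportion. The response codes below are assumptions made for illustration, not actual TEDS codings.

```python
# Hedged sketch of the item nonresponse rate for the turnout question,
# counting both "forget" answers and refusals as nonresponse.

NONRESPONSE_CODES = {"forget", "refuse"}

def item_nonresponse_rate(responses):
    """Share of respondents giving no substantive answer to the item."""
    missing = sum(1 for r in responses if r in NONRESPONSE_CODES)
    return missing / len(responses)

# Toy example: 8 substantive answers, 1 refusal, 1 "forget" -> 2/10.
answers = ["voted"] * 6 + ["did not vote"] * 2 + ["refuse", "forget"]
print(item_nonresponse_rate(answers))  # 0.2
```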
Finally, by detecting a nonsensical response pattern, we construct another face-validity measure as the third benchmark. TEDS (except 2004LA) includes the following questions to measure respondents' party identification:
Q1. Do you usually think of yourself as close to any particular party?
Q2. Do you feel yourself a little closer to one of the political parties than the others?
Q3. Which party do you feel closest to?
Q4. Do you feel very close to this party, somewhat close, or not very close?
In early TEDS surveys, respondents who did not answer "yes" to Question 1 were not required to answer Question 4. This set-up has changed since TEDS 2008L. Now, respondents who do not think of themselves as close to any particular party (Question 1) still proceed to answer Question 4, as long as they feel a little closer to one political party than the others (Question 2).
We found that, among all respondents, 0.2% to 1.0% did not say "yes" to Question 1 yet said "very close" in Question 4. This response pattern is obviously nonsensical and invalid. We suspect that those who gave such a nonsensical answer are also reluctant respondents. Therefore, the higher the proportion of such respondents in a survey, the lower the face validity of the survey. We use this as the third benchmark to examine the interviewer assessment. Figure 4, however, shows that the interviewer assessment not only fails to capture this face validity but also gives opposite results. Surveys with more respondents who gave nonsensical answers to the party identification questions receive better assessments from interviewers.
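The nonsensical-pattern check described above can be sketched as a flagging rule: a respondent who does not say "yes" to Question 1 but says "very close" in Question 4 gives a logically inconsistent pair. Field names and answer codes are assumptions for illustration, not actual TEDS variable names.

```python
# Illustrative sketch of the third face-validity benchmark: the share of all
# respondents exhibiting the contradictory Q1/Q4 response pattern.

def nonsensical_rate(respondents):
    """Share of respondents with the contradictory Q1/Q4 pattern."""
    flagged = sum(1 for r in respondents
                  if r.get("q1") != "yes" and r.get("q4") == "very close")
    return flagged / len(respondents)

sample = [
    {"q1": "yes", "q4": "very close"},      # consistent
    {"q1": "no",  "q4": "somewhat close"},  # consistent under the post-2008L skip
    {"q1": "no",  "q4": "very close"},      # nonsensical: flagged
    {"q1": "yes", "q4": "not very close"},  # consistent
]
print(nonsensical_rate(sample))  # 0.25
```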
This paper examines whether interviewers' assessments of their completed interviews serve as a useful indicator of overall data quality. To answer this research question, we compare the interviewer assessment with some commonly used measures of data quality based on TEDS. We found that the interviewer assessment is a fair indicator of the overall reliability of attitudinal questions in TEDS surveys. However, the interviewer assessment is uninformative about the reliability of factual questions. Regarding the evaluation of data validity, the interviewer assessment fails to give a correct indication of survey error and of nonsensical responses to important items in TEDS surveys, though it is sensitive to the non-response problem. Taken together, our findings suggest that the interviewer assessment provides some useful information about data quality, but that this information is more appropriately used to supplement the evaluation of data reliability than of validity.
These research findings have a substantive implication. Survey interviews are becoming increasingly difficult and costly to conduct. Given a limited project budget and fieldwork time, opinion polls often have no choice but to abandon the retest interview and hence the test-retest reliability evaluation. (We suspect that this is one of the reasons why TEDS 2013 and 2017 did not conduct the retest interview.) According to our findings, using interviewers' assessments as a cost-effective alternative (or supplement) to the test-retest reliability evaluation may be a solution to this difficult situation.
Despite these findings and implications, this study is still preliminary, and several issues need further investigation. Why does the interviewer assessment fail to serve as a good indicator of the reliability of factual questions and of data validity? Can other kinds of interviewer assessments provide information for data evaluation (e.g., interviewers' assessments of respondents' knowledge of and interest in the survey questions)? How can we evaluate and improve the quality of the interviewer assessment itself? In future work, we will attempt to address these issues.
Chi-lin Tsai is a postdoctoral research fellow in the Department of Political Science at National Cheng-chi University, Taiwan. He received his PhD in political science from the University of Essex in the UK in 2017. His research interests lie in the intersection of survey methodology, electoral studies and comparative politics.
He can be reached at NO.64, Sec.2, ZhiNan Rd., Wenshan District, Taipei City 11605, Taiwan or by e-mail at email@example.com.
Tsung-Wei Liu is an associate professor at the Department and Graduate Institute of Political Science, National Chung-Cheng University, Taiwan. His research interests include political science methodology, political parties, and electoral studies.
He can be reached at No.168, Sec. 1, University Rd., Minhsiung, Chiayi 62102, Taiwan or by e-mail at firstname.lastname@example.org.
Yi-ju Chen is an assistant professor at the Department of Medical Sociology and Social Work, Chung-Shan Medical University, and a research consultant of the Social Service Section, Chung Shan Medical University Hospital, Taiwan. Her research interests include social welfare policy, social work with the elderly, and research methods in social science.
She can be reached at No.110, Sec.1, Jianguo N.Rd., Taichung City 40201, Taiwan or by e-mail at email@example.com.
All correspondence concerning this article should be addressed to Yi-ju Chen at No.110, Sec.1, Jianguo N.Rd., Taichung City 40201, Taiwan or by e-mail at firstname.lastname@example.org.
Date of Submission: 2018-12-10
Date of the Review Results: 2019-01-26
Date of the Decision: 2019-02-15