Response Options

Warning

This chapter is still a work in progress.

There are several decisions to make that involve response options. How many response options should you use? Should you use an even or odd number of response options? Should you label them? This section contains a summary of best practices that one can use to address these questions.

Number of Response Options

The question of how many response options to use centers on two competing concerns. The first is precision: more options allow a more fine-grained assessment of the characteristic being evaluated (e.g., an attitude). The second is how the number of options affects the reliability and validity of the measurement: with more options, it becomes harder for people to distinguish between them (e.g., is “Strongly agree” reliably different from “Very strongly agree”?).

Table 1 shows an overview of various studies in which the topic of response options was addressed. The studies vary in many ways, so the final conclusion should be a holistic interpretation of the results, rather than a simple tallying of the results. Note also that only empirical studies are included and not simulation studies. Simulation studies seem limited because they cannot address the plausible psychological limitation of people being unable to distinguish between many options.

Table 1: Overview of empirical studies on the topic of response options.
Source	Comparisons	Topic	Outcome	Conclusion
Donnellan & Rakhshani (2020)	2- to 7-, and 11-point Likert	Self-esteem	Reliability; distribution; validity; quality	5-point Likert or higher
Simms et al. (2019)	2- to 11-point Likert + VAS	Personality	Reliability; validity	6-point Likert
Sung & Wu (2018)	5-point Likert and VAS-RRP	Career interest	Reliability	VAS-RRP
Cox et al. (2017)	2- and 4-point Likert	Personality	Reliability; validity; duration	Mixed
Lewis (2017)	7-, 11-point Likert and VAS	Software usability	Reliability; distribution; validity	No difference
Kuhlmann et al. (2017)	5-point Likert and VAS	Personality	Reliability; distribution; validity	No difference
Hilbert (2016)	2- and 5-point Likert and VAS	Personality	Reliability; validity; quality	It depends
Capik & Gozum (2015)	2- and 5-point Likert	Health	Reliability; validity	No difference
Eutsler & Lang (2015)	5-, 7-, 9-, and 11-point Likert	Judgment	Distribution; power	7-point Likert
Finn et al. (2015)	2- and 4-point Likert			4-point Likert
Revilla et al. (2014)	5-, 7-, 11-point Likert			5-point Likert
Cox et al. (2012)	2- and 4-point Likert			4-point Likert
Janhunen (2012)	7-point Likert and 30-point VAR			VAR
Dawes (2008)	5-, 7-, and 10-point Likert			No difference
Weng (2004)	3- to 9-point Likert			5-point or higher
Preston & Colman (2000)	2- to 11-point Likert and VAS			7-, 9-, or 10-point Likert
Alwin (1997)	7- and 11-point Likert			11-point Likert
Jaeschke et al. (1990)	7-point Likert and VAS			No difference (slightly favor 7-point Likert)
Flamer (1983)	2- and 9-point Likert			9-point Likert
Matell & Jacoby (1971)	2-point to 18-point Likert			No difference
Bendig (1954)	2-, 3-, 5-, 7-, and 9-point Likert			No difference (maybe 3-point or higher)
Rhemtulla et al. (2012)	2- to 7-point Likert			5-point Likert maybe good, 6- or 7-point best
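Many of the studies in Table 1 compare reliability across response formats. One common reliability index is coefficient alpha (the index examined by Weng, 2004). The following is a minimal sketch of how alpha could be computed for one response format; the function name and the example data are hypothetical and are not taken from any of the studies.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [
        variance([person[i] for person in responses]) for i in range(n_items)
    ]
    # Sample variance of the summed scale score across respondents.
    total_variance = variance([sum(person) for person in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: four respondents answering three 5-point items.
example = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
]
alpha = cronbach_alpha(example)
```

Computing alpha separately for each response format (e.g., a 5-point and a 7-point version of the same items) gives one of the comparisons reported in these studies, although most of them also consider validity and other outcomes.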

There are also several review papers on the topic. Krosnick & Presser (2010) suggest that 7-point Likert scales are probably optimal. Lietz (2010) concludes that a desirable Likert scale consists of 5 to 8 response options. Similarly, Cox III (1980) recommends using between 5 and 9 response options. Symonds (1924) claimed that the optimal number is 7. Gehlbach & Brinkworth (2011) recommend using 5 points for unipolar items and 7 points for bipolar items.

There are also statistical arguments for why a particular number of response options is preferred. With more response options, the assumption of normality is more likely to be tenable. Some of the papers included in Table 1 (e.g., Rhemtulla et al. (2012)) are about this concern.
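The statistical side of this argument can be illustrated with a small simulation. The sketch below rests on several assumptions that are not in the studies above: a standard normal latent trait, equal-width categories between -2.5 and 2.5, and respondents who map their standing onto the categories without error. It therefore says nothing about the psychological limits discussed earlier; it only shows how closely a k-point score can track an underlying continuous variable.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def discretize(value, n_options, low=-2.5, high=2.5):
    """Map a continuous value onto 1..n_options using equal-width categories.

    The range [low, high] is an assumption; values outside it are assigned
    to the lowest or highest category.
    """
    width = (high - low) / n_options
    category = int((value - low) // width) + 1
    return min(max(category, 1), n_options)


rng = random.Random(1)  # fixed seed so the run is repeatable
latent = [rng.gauss(0, 1) for _ in range(20000)]

# Correlation between the latent trait and its k-point version.
correlations = {}
for n_options in (2, 3, 5, 7, 11):
    observed = [discretize(z, n_options) for z in latent]
    correlations[n_options] = pearson(latent, observed)
    print(n_options, round(correlations[n_options], 3))
```

With two options the correlation should be close to the theoretical value for a median split of a normal variable, sqrt(2/pi), or about 0.80. It rises as options are added.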

Besides psychometric properties, it may also be worth taking respondent preference into account. This involves the ease of use of the scale and whether the response options allow sufficient variation for respondents to express their view. Preston & Colman (2000) report that respondents rated scales with 5, 7, and 10 points as easy to use (compared to scales with fewer options and a VAS) and that they preferred scales with more response options (7 or more) because these allowed them to express themselves. Other studies also show that respondents favor more options (Cox et al., 2017).

Note that if time is of the essence, fewer response options are preferred.

Another relevant factor is whether the scale is bipolar or unipolar. Bipolar scales are symmetrical, which means the number of options naturally increases because both sides of the spectrum need matching options. Unipolar items cover only one side, usually ranging from the absence of something to its presence (to a certain degree). Since it is harder to label a larger number of options for a unipolar scale, the number of options is likely to be smaller.

Conclusion: It appears that very few response options (2 or 3) should definitely be avoided. More response options therefore seem better, but the benefits appear to level off quickly. Given other concerns, such as ease of use and interpretability, a 7-point Likert scale seems to be preferred for bipolar scales and a 5-point Likert scale for unipolar scales.

Odd vs. Even Response Options

The middle option of a scale can have an ambiguous meaning. Participants may use it to indicate a moderate standing on the issue (Rugg and Cantril, 1944), a lack of an opinion (Nadler, Weston, and Voyles, 2014), ambivalence (Klopfer and Madden, 1980; Schaeffer and Presser, 2003; Nadler, Weston, and Voyles, 2014), indifference (Schaeffer and Presser, 2003; Nadler, Weston, and Voyles, 2014), uncertainty (Baka, Figgou, and Triga, 2012; Nadler, Weston, and Voyles, 2014), confusion, or to signal context dependence (e.g., “it depends” or disputing the question, see Baka, Figgou, and Triga, 2012).

The middle option may also be used for certain response styles, such as socially desirable responding (Sturgis, Roberts, and Smith, 2012) or satisficing (Krosnick, 1991), although there is not much research showing that it actually leads to satisficing (Wang & Krosnick, 2020).

If a middle alternative is explicitly offered, the proportion endorsing it increases dramatically (e.g., Ayidiya & McClendon, 1990; Bishop, 1987; Bishop, Hippler, Schwarz, & Strack, 1988; Kalton, Collins, & Brook, 1978; Kalton, Roberts, & Holt, 1980; Rugg & Cantril, 1944).

Some studies show that not including a middle option decreases validity and increases measurement error (O’Muircheartaigh, Krosnick, and Helic, 1999; Kahn, and Dhar, 2002).

For a recent study on this question, see Wang & Krosnick (2020).

An alternative approach to this issue is to use branching. Respondents could first be asked whether they fall at the midpoint or on one side, followed by a question about their extremity on a side. This approach was found to be more reliable and valid than using a 7-point scale (Krosnick and Berent, 1993; Malhotra, Krosnick, and Thomas, 2009).
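The two branching answers can be combined into a single score. The sketch below assumes three extremity levels per side and a 1-to-7 coding, so that the result lines up with a 7-point scale; the function name, the answer labels, and the coding are illustrative assumptions and are not taken from the cited studies.

```python
def branched_to_seven_point(direction, extremity=None):
    """Combine two branching answers into a score from 1 to 7.

    direction: answer to the first question, one of "negative",
        "midpoint", or "positive" (hypothetical labels).
    extremity: answer to the follow-up question, 1 (slight), 2 (moderate),
        or 3 (strong); not asked, and ignored here, for the midpoint.
    """
    if direction == "midpoint":
        return 4
    if direction not in ("negative", "positive"):
        raise ValueError(f"unknown direction: {direction!r}")
    if extremity not in (1, 2, 3):
        raise ValueError(f"extremity must be 1, 2, or 3, got {extremity!r}")
    # Move away from the midpoint (4) by the stated extremity.
    return 4 + extremity if direction == "positive" else 4 - extremity


# Example: a respondent who leans positive, and only slightly so.
score = branched_to_seven_point("positive", 1)
```

Each respondent answers at most two short questions, and the midpoint is only recorded when it is chosen explicitly in the first step.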

Conclusion: If respondents may plausibly hold a moderate view, it seems crucial that the scale can capture it. The limitations of a middle option can then be addressed in other ways (e.g., by asking clear questions).

Response Option Labeling

Several studies show that all response options should be labeled, rather than only the end points (Krosnick & Berent, 1993; Weng, 2004).

For an example of bipolar labels for 2- to 11-point Likert scales, see Table 2.

Table 2: Likert response labels from Simms et al. (2019)
Label	2-point	3-point	4-point	5-point	6-point	7-point	8-point	9-point	10-point	11-point
Very strongly disagree							x	x	x	x
Strongly disagree			x	x	x	x	x	x	x	x
Disagree	x	x	x	x	x	x	x	x	x	x
Mostly disagree									x	x
Slightly disagree					x	x	x	x	x	x
Neither agree nor disagree		x		x		x		x		x
Slightly agree					x	x	x	x	x	x
Mostly agree									x	x
Agree	x	x	x	x	x	x	x	x	x	x
Strongly agree			x	x	x	x	x	x	x	x
Very strongly agree							x	x	x	x
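The pattern in the table above can also be written as a small lookup. This is a sketch that encodes the table as shown here: each disagree-side label is paired with the shortest scale on which it appears, the agree side mirrors it, and odd-length scales add the midpoint. The names DISAGREE_SIDE, MIDPOINT, and likert_labels are hypothetical.

```python
# Disagree-side labels in the table's row order, each paired with the
# smallest number of scale points at which it is marked with an x.
DISAGREE_SIDE = [
    ("Very strongly disagree", 8),
    ("Strongly disagree", 4),
    ("Disagree", 2),
    ("Mostly disagree", 10),
    ("Slightly disagree", 6),
]
MIDPOINT = "Neither agree nor disagree"


def likert_labels(n_points):
    """Return the labels for an n-point scale (2 to 11), in table row order."""
    if not 2 <= n_points <= 11:
        raise ValueError("the table only covers 2- to 11-point scales")
    disagree = [label for label, first in DISAGREE_SIDE if n_points >= first]
    # The agree side mirrors the disagree side in reverse order.
    agree = [
        label.replace("disagree", "agree").replace("Disagree", "Agree")
        for label in reversed(disagree)
    ]
    # Only odd-length scales include the midpoint.
    middle = [MIDPOINT] if n_points % 2 == 1 else []
    return disagree + middle + agree


for label in likert_labels(7):
    print(label)
```

For a 7-point scale this prints the seven rows marked in the 7-point column, from “Strongly disagree” to “Strongly agree”.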

It is also recommended to avoid agree-disagree response labels because asking respondents to rate their level of agreement is a cognitively demanding task that increases respondent error and reduces responding effort (Gehlbach & Brinkworth, 2011).

Possible labels, from CampusLabs:

Agreement: Strongly agree, Moderately agree, Neither agree nor disagree, Moderately disagree, Strongly disagree (another version removes the “moderately” qualifier and/or uses “neutral”)

Comparison: Much X, Slightly X, About the same, Slightly (opposite of X), Much (opposite of X)

Ease: Very easy, Moderately easy, Neither easy nor difficult, Moderately difficult, Very difficult

Expectations: Exceeds expectations, Fully meets expectations, Does not fully meet expectations, Does not meet expectations at all

Extent (5 pt): A great deal (Completely, if appropriate), Considerably, Moderately, Slightly, Not at all

Extent (4 pt): Significantly, Moderately, Slightly, Not at all

Frequency (no set time): Always, Often, Occasionally, Rarely, Never

Frequency (general): Daily, Weekly, Monthly, Once a semester, Once a year, Never

Frequency (based on time frame): More than 5 times, 4 - 5 times, 2 - 3 times, 1 time, Less than 1 time, Never

Frequency (extended): More than once a week, Once a week, Once a month, Once a semester, Once a year, Less than once a year, Never

Helpfulness: Extremely helpful, Very helpful, Moderately helpful, Slightly helpful, Not at all helpful

Importance: Extremely important, Very important, Moderately important, Slightly important, Not at all important

Interest: Extremely interested, Very interested, Moderately interested, Slightly interested, Not at all interested

Likelihood: Very likely, Moderately likely, Neither likely nor unlikely, Moderately unlikely, Very unlikely

Numeric Scales: Less than #, About the same, More than #

Probability: Definitely would, Probably would, Probably wouldn’t, Definitely wouldn’t

Proficiency: Beginner, Developing, Competent, Advanced, Expert (typical for Rubrics)

Quality: Excellent, Good, Average, Below average, Poor

Satisfaction: Very satisfied, Moderately satisfied, Neither satisfied nor dissatisfied, Moderately dissatisfied, Very dissatisfied (another version removes the “moderately” qualifier and/or uses “neutral”)

Taken from https://baselinesupport.campuslabs.com/hc/en-us/articles/204305485-Recommended-Scales

References

Alwin, D. F. (1997). Feeling thermometers versus 7-point scales: Which are better? Sociological Methods & Research, 25(3), 318–340. https://doi.org/10.1177/0049124197025003003

Bendig, A. W. (1954). Reliability and the number of rating-scale categories. Journal of Applied Psychology, 38(1), 38–40. https://doi.org/10.1037/h0055647

Capik, C., & Gozum, S. (2015). Psychometric features of an assessment instrument with Likert and dichotomous response formats. Public Health Nursing, 32(1), 81–86. https://doi.org/10.1111/phn.12156

Cox, A., Courrégé, S. C., Feder, A. H., & Weed, N. C. (2017). Effects of augmenting response options of the MMPI-2-RF: An extension of previous findings. Cogent Psychology, 4(1), 1323988. https://doi.org/10.1080/23311908.2017.1323988

Cox, A., Pant, H., Gilson, A. N., Rodriguez, J. L., Young, K. R., Kwon, S., & Weed, N. C. (2012). Effects of augmenting response options on MMPI-2 RC scale psychometrics. Journal of Personality Assessment, 94(6), 613–619. https://doi.org/10.1080/00223891.2012.700464

Cox III, E. P. (1980). The optimal number of response alternatives for a scale: A review. Journal of Marketing Research, 17(4), 407. https://doi.org/10.2307/3150495

Dawes, J. (2008). Do data characteristics change according to the number of scale points used? An experiment using 5-point, 7-point and 10-point scales. International Journal of Market Research, 50(1), 61–104. https://doi.org/10.1177/147078530805000106

Donnellan, B., & Rakhshani, A. (2020). How does the number of response options impact the psychometric properties of the Rosenberg Self-Esteem Scale? https://doi.org/10.31234/osf.io/fnywz

Eutsler, J., & Lang, B. (2015). Rating scales in accounting research: The impact of scale points and labels. Behavioral Research in Accounting, 27(2), 35–51. https://doi.org/10.2308/bria-51219

Finn, J. A., Ben-Porath, Y. S., & Tellegen, A. (2015). Dichotomous versus polytomous response options in psychopathology assessment: Method or meaningful variance? Psychological Assessment, 27(1), 184–193. https://doi.org/10.1037/pas0000044

Flamer, S. (1983). Assessment of the multitrait-multimethod matrix validity of Likert scales via confirmatory factor analysis. Multivariate Behavioral Research, 18(3), 275–306. https://doi.org/10.1207/s15327906mbr1803_3

Gehlbach, H., & Brinkworth, M. E. (2011). Measure twice, cut down error: A process for enhancing the validity of survey scales. Review of General Psychology, 15(4), 380–387. https://doi.org/10.1037/a0025704

Hilbert, S. (2016). The influence of the response format in a personality questionnaire: An analysis of a dichotomous, a Likert-type, and a visual analogue scale. TPM - Testing, Psychometrics, Methodology in Applied Psychology, 1, 3–24. https://doi.org/10.4473/TPM23.1.1

Jaeschke, R., Singer, J., & Guyatt, G. H. (1990). A comparison of seven-point and visual analogue scales: Data from a randomized trial. Controlled Clinical Trials, 11(1), 43–51. https://doi.org/10.1016/0197-2456(90)90031-V

Janhunen, K. (2012). A comparison of Likert-type rating and visually-aided rating in a simple moral judgment experiment. Quality & Quantity, 46(5), 1471–1477. https://doi.org/10.1007/s11135-011-9461-x

Krosnick, J. A., & Berent, M. K. (1993). Comparisons of party identification and policy preferences: The impact of survey question format. American Journal of Political Science, 37(3), 941. https://doi.org/10.2307/2111580

Krosnick, J. A., & Presser, S. (2010). Question and questionnaire design (Second edition). Emerald.

Kuhlmann, T., Dantlgraber, M., & Reips, U.-D. (2017). Investigating measurement equivalence of visual analogue scales and Likert-type scales in Internet-based personality questionnaires. Behavior Research Methods, 49(6), 2173–2181. https://doi.org/10.3758/s13428-016-0850-x

Lewis, J. R. (2017). User experience rating scales with 7, 11, or 101 points: Does it matter? Journal of Usability Studies, 12(2), 19.

Lietz, P. (2010). Research into questionnaire design: A summary of the literature. International Journal of Market Research, 52(2), 249–272. https://doi.org/10.2501/S147078530920120X

Matell, M. S., & Jacoby, J. (1971). Is there an optimal number of alternatives for Likert scale items? Study I: Reliability and validity. Educational and Psychological Measurement, 31(3), 657–674. https://doi.org/10.1177/001316447103100307

Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104(1), 1–15. https://doi.org/10.1016/S0001-6918(99)00050-5

Revilla, M. A., Saris, W. E., & Krosnick, J. A. (2014). Choosing the number of categories in agree-disagree scales. Sociological Methods & Research, 43(1), 73–97. https://doi.org/10.1177/0049124113509605

Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17(3), 354–373. https://doi.org/10.1037/a0029315

Simms, L. J., Zelazny, K., Williams, T. F., & Bernstein, L. (2019). Does the number of response options matter? Psychometric perspectives using personality questionnaire data. Psychological Assessment, 31(4), 557–566. https://doi.org/10.1037/pas0000648

Sung, Y.-T., & Wu, J.-S. (2018). The Visual Analogue Scale for Rating, Ranking and Paired-Comparison (VAS-RRP): A new technique for psychological measurement. Behavior Research Methods, 50(4), 1694–1715. https://doi.org/10.3758/s13428-018-1041-8

Symonds, P. M. (1924). On the loss of reliability in ratings due to coarseness of the scale. Journal of Experimental Psychology, 7(6), 456–461. https://doi.org/10.1037/h0074469

Wang, R., & Krosnick, J. A. (2020). Middle alternatives and measurement validity: A recommendation for survey researchers. International Journal of Social Research Methodology, 23(2), 169–184. https://doi.org/10.1080/13645579.2019.1645384

Weng, L.-J. (2004). Impact of the number of response categories and anchor labels on coefficient alpha and test-retest reliability. Educational and Psychological Measurement, 64(6), 956–972. https://doi.org/10.1177/0013164404268674