This is an important step in the development of a scale, as serious problems with the item pool will reverberate through all subsequent data analyses and scale construction efforts.
Items should be written so that they are (i) relevant to the constructs to be measured, and (ii) representative of all potentially important aspects of the target construct. Having formal construct definitions is particularly important here, as such definitions should guide the item-writing process.
Besides including items to cover all the different facets of a particular construct, it’s also important that the item pool includes items reflecting all levels of the construct.
Item writing guidelines (Simms, 2008):
Write items using simple and straightforward language that is appropriate for the reading level of the measure’s target population.
Avoid writing complex or convoluted items that are difficult to read and understand (e.g., double-barreled items such as ‘My outgoing nature would make me a good salesperson’, which confound different characteristics – in this case, being outgoing and being a good salesperson – that may not covary in some individuals).
Avoid using slang and colloquial expressions that may quickly become obsolete. Be careful that phrasing does not affect responses in unexpected ways (e.g., including ‘worry’ in an item nearly guarantees that the item will have a neuroticism component).
To the extent possible, write a mix of positively and negatively worded items to guard against response sets.
Phrase items generally enough that most or all targeted respondents can provide a reasonably appropriate response (e.g., write ‘I get tired after I exercise’ rather than ‘I get tired after playing soccer’).
To increase the likelihood of truthful responding, phrase items asking about sensitive issues using straightforward, matter-of-fact, and nonpejorative language.
After the initial item pool is complete, it makes sense to pilot test the items before running a large-scale exploratory study.
Factor loadings can be improved by using multiple response (Likert-type) items, as they generally result in higher loadings than two-choice items (Comrey & Montag, 1982; Oswald & Velicer, 1980; Velicer, DiClemente, & Corriveau, 1984; Velicer, Govia, Cherico, & Corriveau, 1985; Velicer & Stevenson, 1978). Likewise, the quality of item writing can affect the size of the loadings: expressing an item in simple language, restricting it to a single idea, and using content that is appropriate to a majority of respondents are all ways of improving items.
Reverse-scored items
Reverse-scored or reverse-worded items can be included to determine whether participants are paying attention rather than simply selecting the same response on each item. However, there is some evidence that reverse-scored items reduce the reliability of the scale or produce an unexpected factor structure (Swain et al., 2008).
Another important consideration is that reverse-worded items can affect model fit. Factor analyses of scales containing some reverse-worded items frequently indicate the presence of method covariance obscuring or confounding substantive covariance (e.g., Brown, 2003; Roszkowski & Soven, 2010).
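In practice, reverse-worded items must be recoded before scale scores are computed so that all items point in the same direction. A minimal sketch of this step (the item names and the 1–5 response scale are hypothetical, purely for illustration):

```python
def reverse_code(score, low=1, high=5):
    """Mirror a response on its own scale: with a 1-5 scale, 1 -> 5, 2 -> 4, etc."""
    return low + high - score

# Hypothetical responses from one participant; item2_r is reverse-worded.
responses = {"item1": 4, "item2_r": 2, "item3": 5}
reverse_worded = {"item2_r"}

# Recode only the reverse-worded items, then sum for the scale score.
scored = {item: reverse_code(value) if item in reverse_worded else value
          for item, value in responses.items()}
total = sum(scored.values())

print(scored)  # item2_r is recoded from 2 to 4
print(total)   # 4 + 4 + 5 = 13
```

The same `low + high - score` mapping works for any response scale (e.g., `low=0, high=4` for a 0–4 scale).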
Number of items
There are no hard-and-fast rules guiding this decision, but keeping a measure short is an effective means of minimizing response biases caused by boredom or fatigue (Schmitt & Stults, 1985; Schriesheim & Eisenbach, 1990). Additional items also demand more time in both the development and administration of a measure (Carmines & Zeller, 1979). Harvey, Billings, and Nilan (1985) suggest that at least four items per scale are needed to test the homogeneity of items within each latent construct. Adequate internal consistency reliabilities can be obtained with as few as three items (Cook et al., 1981), and adding items indefinitely makes progressively less impact on scale reliability (Carmines & Zeller, 1979). It is difficult to improve on the internal consistency reliabilities of five appropriate items by adding items to a scale (Hinkin, 1985; Hinkin & Schriesheim, 1989; Schriesheim & Hinkin, 1990).

Cortina (1993) found that scales with many items may have high internal consistency reliabilities even if item intercorrelations are low, an argument in favor of shorter scales with high internal consistency. It is also important to assure that the domain has been adequately sampled, as inadequate sampling is a primary source of measurement error (Churchill, 1979). As Thurstone (1947) points out, scales should possess simple structure, or parsimony. Not only should any one measure have the simplest possible factor constitution, but any scale should require the contribution of a minimum number of items that adequately tap the domain of interest.

These findings would suggest that the eventual goal will be the retention of four to six items for most constructs, but the final determination must be made only with accumulated evidence in support of the construct validity of the measure. It should be anticipated that approximately one half of the created items will be retained for use in the final scales, so at least twice as many items as will be needed in the final scales should be generated to be administered in a survey questionnaire.
https://twitter.com/dingding_peng/status/1481683536499331079
Simms, L. J. (2008). Classical and modern methods of psychological scale construction: Scale construction. Social and Personality Psychology Compass, 2(1), 414–433. https://doi.org/10.1111/j.1751-9004.2007.00044.x