Reliability and validity

Reliability

A necessary but insufficient condition for accurate measurement is that an instrument produce consistent readings when measuring the same phenomenon.

Self-report scales must likewise be consistent, a property psychometricians call reliability.

Psychometricians use various tests that examine consistency across individual items (internal reliability), random subsets of items (split-half reliability), scale versions (alternate-form reliability), perspectives (inter-rater reliability), time (test–retest reliability), and so forth (Anastasi & Urbina, 1997; Kazdin, 1998).

Cronbach’s alpha

The most commonly examined type of reliability is internal reliability, most often indexed with Cronbach's coefficient alpha (α).

Most scale builders also use the same arbitrary threshold of .70 to indicate scale reliability, though other, more nuanced standards have been suggested. For example, DeVellis (2003, pp. 95–96) describes standards for what is unacceptable (α < .60), undesirable (.60 ≤ α < .65), minimally acceptable (.65 ≤ α < .70), respectable (.70 ≤ α < .80), very good (.80 ≤ α < .90), and unnecessarily high such that one should consider shortening one's scale (α > .90).
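
These verbal standards amount to a small lookup. The function below is only an illustrative sketch of DeVellis' bands, written in Python (a language chosen here for illustration, not used in the original text); how exact boundary values such as .70 are assigned is an assumption, since the standards describe ranges rather than edge cases.

```python
def devellis_band(alpha: float) -> str:
    """Map a coefficient alpha to DeVellis' (2003) descriptive standard.

    Half-open intervals are an assumption made for illustration; DeVellis
    describes ranges, not how to treat exact boundary values.
    """
    if alpha < 0.60:
        return "unacceptable"
    if alpha < 0.65:
        return "undesirable"
    if alpha < 0.70:
        return "minimally acceptable"
    if alpha < 0.80:
        return "respectable"
    if alpha < 0.90:
        return "very good"
    return "unnecessarily high; consider shortening the scale"
```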

For a critical discussion of coefficient alpha and its alternatives, see McNeish (2018).

Standardized α is a simple enough calculation: α = kr/(1 + r[k − 1]). Both the denominator and the numerator are functions of the number of items in a scale (k) and the mean interitem correlation (r).
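
As a minimal sketch of this formula (again in Python, purely for illustration), assuming one already has k and the mean interitem correlation in hand:

```python
def standardized_alpha(k: int, mean_r: float) -> float:
    """Standardized Cronbach's alpha from the number of items (k) and the
    mean interitem correlation (mean_r): alpha = k*r / (1 + r*(k - 1))."""
    return (k * mean_r) / (1 + mean_r * (k - 1))
```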

Because α is determined exclusively by the number of items in a scale and the degree to which they covary, DeVellis (2003) accurately describes it as a proportion of covariance among items. John and Soto (2007) illustrate α's dependence on these two attributes by noting α meets DeVellis' (2003) standard of very good at .87 for both a six-item scale with a mean interitem correlation of .52 and a nine-item scale with a mean interitem correlation of .42.
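
Reusing the standardized_alpha sketch above, John and Soto's (2007) illustration can be checked directly:

```python
print(round(standardized_alpha(6, 0.52), 2))  # 0.87: six items, mean r = .52
print(round(standardized_alpha(9, 0.42), 2))  # 0.87: nine items, mean r = .42
```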

This dependence on the number of items in a scale and the degree to which they covary also means that α does not indicate validity.

The reader might imagine a fictional nine-item scale involving a five-item set concerning gender and an orthogonal four-item set concerning hair color. If interitem correlations averaged .95 within sets and .00 between sets, the average correlation across all nine items would be .42, α would be very good at .87, yet the scale would measure nothing.
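
The arithmetic behind this fictional scale can be made explicit with a short sketch; the item labels and the pairwise-correlation helper below are hypothetical, and the correlations are simply those stipulated above.

```python
from itertools import combinations

# Five hypothetical "gender" items and four hypothetical "hair color" items.
items = ["g1", "g2", "g3", "g4", "g5", "h1", "h2", "h3", "h4"]

def interitem_r(a: str, b: str) -> float:
    """Stipulated correlations: .95 within a set, .00 between sets."""
    return 0.95 if a[0] == b[0] else 0.0

pairs = list(combinations(items, 2))                    # 36 item pairs
mean_r = sum(interitem_r(a, b) for a, b in pairs) / len(pairs)
k = len(items)
alpha = (k * mean_r) / (1 + mean_r * (k - 1))
print(round(mean_r, 2), round(alpha, 2))                # 0.42 0.87
```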

Validity

Whereas reliability concerns consistency, validity concerns the degree to which the scale builder is measuring what she claims to be measuring. A bathroom scale can be consistent, for example, and consistently wrong.

Types of validity are even more numerous, and less clearly differentiated, than types of reliability. They include construct, content, concurrent, predictive, criterion, face, factorial, convergent, and discriminant validity (e.g., John & Soto, 2007; Kazdin, 1998).

Content validity concerns the degree to which items denote the right construct, the entire construct, and nothing else.

Predictive validity, sometimes called criterion-related validity, concerns the degree to which scale scores occupy the right spot in the nomological net or, as DeVellis (2003, p. 50) puts it, have the right empirical associations.