SECTION 4: Evaluating Existing Measures

The Importance of Psychometric Properties

Measurement is an extremely important aspect of science, research, and evaluation. To understand the relationship between factors or a factor’s impact on an outcome, one must be able to accurately and reliably measure the factor as well as the outcome. In the physical sciences (including physics, astronomy, chemistry, and earth sciences), the ability to accurately measure important factors is typically dictated by the rules of physics, physiology, or biology. Consider the measurement of blood pressure: it can be assessed reliably, with some degree of confidence, because a sphygmomanometer is assessing a biological event that is extremely predictable. Although blood pressure varies somewhat, both within and between people, the event that is being measured (blood flow through veins and arteries) is the same process for everyone.

However, in the social and behavioral sciences (fields that include psychology, sociology, anthropology, political science, education, and economics), such predictable rules of nature do not apply. Social and behavioral scientists are faced with the task of assessing abstract and amorphous concepts, and of assessing environments that may not be static or experienced in the same way by all.

At the person-centered level, behavioral scientists attempt to measure people’s perceptions, attitudes, beliefs, and values. Consider how to measure one’s perceptions of the food environment: those perceptions may or may not be grounded in reality and may be highly fluid, even within individuals, based on their most recent, or most significant, experiences with the environment. How do investigators or practitioners assess something that may be highly individual, possibly very changeable, and impossible to objectively quantify? Attempts to assess the influence of the social environment encounter equally complex questions: Who makes up one’s social environment and what aspects of it are important to measure? Does it include all social norms that individuals are exposed to through the media and larger culture or is it more important to focus on a more proximal sphere of influence? How stable is one’s social environment and how does it change as youth get older?

It might seem that the physical environment should be easier to assess because its elements are more tangible and concrete than are perceptions of the environment or evaluations of social influence. One would expect that the presence of a grocery store in a neighborhood, the amount of shelf space available for fruits and vegetables, or the absence or presence of promotional materials should be evident to all and fairly straightforward to assess. But even here, challenges abound. The physical environment is not static: grocery stores open and close, a store owner changes how shelf space is allocated, and what one data collector records as promotional material may be just a price sign to another data collector.

Understanding the qualities and robustness of measures is extremely important when choosing measurement tools. The “psychometric properties” of a measure are considered indicators of overall measure quality and generally fall into two categories: reliability and validity.

Reliability

In general terms, reliability is the extent to which a measure is consistent or stable over time. Reliability helps to assess the quality of questions and instructions in a measurement tool as well as the stability of the abstract concepts that the measurement tool is trying to assess.²³ Table 2 provides definitions of three types of reliability that are typically assessed, how each is measured, how each is applied to environmental measures, and examples from the food environment field. Briefly:

Inter-rater reliability evaluates the degree to which two or more data collectors assess the same environment in the same way. Inter-rater reliability tests both the clarity of the instrument and the consistency and quality of data collector training. An instrument’s inter-rater reliability is tested during pretesting but may also be checked during data collection as a quality assurance measure. Poor inter-rater reliability during the data collection phase indicates that staff need retraining, or it may indicate that the environment being assessed has changed in significant ways, requiring some adaptation of the tool.
Test-retest reliability assesses consistency between a single respondent’s answers over time and is typically tested in the pilot phase of questionnaire development to evaluate the clarity of questions and directions. Test-retest assessments typically occur within about 2 to 4 weeks of each other and look for inconsistencies in responses that may occur because the questions or instructions are not clear. Note, however, that test-retest does not evaluate the measurement tool’s ability to detect actual change. For questionnaires that attempt to assess perceptions or attitudes, one would expect little actual change over a short period of time; rather, a low correlation on test-retest likely identifies questions that are not clear to the respondent.
Internal consistency is important to assess when a number of questions are developed to try to understand the same attitude. As an example, a researcher may want to assess the extent to which perceived barriers are influencing an individual’s food choices in a community center. The researcher creates a 10-item scale made up of questions related to barriers to choosing healthy foods in that community center. One would expect those 10 items to be correlated, or “internally consistent.” To test the internal consistency of those items, the researcher would pilot the questionnaire and then use the pilot data to assess the level of consistency using Cronbach’s alpha. Items that are not correlated with the others might be eliminated from the scale to improve the consistency of the remaining items, or new items may need to be created and tested.
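
As an illustration of how the three reliability statistics described above might be computed from pilot data, the following is a minimal sketch in Python using only the standard library. It assumes Cohen’s kappa for inter-rater reliability on categorical codes and a Pearson correlation for test-retest reliability; the text above names only Cronbach’s alpha, so those two choices are assumptions, and other statistics (for example, intraclass correlations) are also used in practice. All function names and data values are hypothetical.

```python
from collections import Counter
from statistics import mean, variance


def cohens_kappa(rater_a, rater_b):
    """Inter-rater reliability: chance-corrected agreement between two
    data collectors who coded the same items into categories."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's category frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def pearson_r(first, second):
    """Test-retest reliability: correlation between the same respondents'
    scores at two administrations."""
    m1, m2 = mean(first), mean(second)
    cov = sum((x - m1) * (y - m2) for x, y in zip(first, second))
    ss1 = sum((x - m1) ** 2 for x in first)
    ss2 = sum((y - m2) ** 2 for y in second)
    return cov / (ss1 * ss2) ** 0.5


def cronbach_alpha(rows):
    """Internal consistency: rows holds one list of item scores per
    respondent; every respondent answers the same k items."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical pilot data, for illustration only.
collector_1 = ["promo", "price", "promo", "promo", "price", "price"]
collector_2 = ["promo", "price", "promo", "price", "price", "price"]
week_0_scores = [1, 2, 3, 4, 5]
week_3_scores = [2, 2, 3, 5, 5]
barrier_items = [[1, 2, 1], [2, 3, 3], [3, 3, 4], [4, 5, 4]]

print("kappa:", round(cohens_kappa(collector_1, collector_2), 2))
print("test-retest r:", round(pearson_r(week_0_scores, week_3_scores), 2))
print("alpha:", round(cronbach_alpha(barrier_items), 2))
```

Values closer to 1 indicate greater consistency in all three cases. What counts as acceptable depends on the purpose of the measure, so no cutoff is built into the sketch.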

Table 2: Reliability Definitions, Measures, Applications to, and Examples from Food Environment Measurement

Validity

Validity is another important psychometric property. Validity refers to the ability of a measure to assess what it intends to assess. We often think of four kinds of validity: face validity, criterion validity, content validity, and construct validity. Table 3 provides definitions of these types of validity, how each type is evaluated, its application to the food environment field, and examples from that field. Briefly:

Face validity is the weakest of the validity measures and involves having others, besides the developers of the instrument, review the instrument to provide feedback on whether they believe that the instrument is asking the “right questions” or whether the questions are asked in a way that would be meaningful and relevant to the target population. Based on feedback from those providing an assessment of face validity, the tool developer might modify the questions.
Criterion validity involves comparing the developed instrument with some “gold standard” that may not be practical to use because of cost or logistical considerations. If criterion validity can be established with the new instrument, the researchers can anticipate that it will respond in the same way as the criterion and be useful as a less expensive or burdensome proxy.
Content validity refers to an assessment of the degree to which the measurement tool captures the important elements of the factor. Content validity may be assessed by an external review that provides feedback on the comprehensiveness of the questions included to capture the important elements of the factor of interest. In addition, content validity could be assessed using factor analysis. Using the perceived barriers in the community environment as an example, the questionnaire writer might include a list of potential barriers that are specific to cost, taste preference, access, and social norms. If those four content areas are adequately captured through the measurement tool, a factor analysis should reveal four separate factors in the data.
Construct validity is a measure of the association between the factor of interest and an outcome of interest. A measure might be highly reliable and have strong face, criterion, and content validity, but if it is not associated with an important health-related outcome, its utility is questionable. Section 9 discusses the importance and challenges of construct validity in this field in more detail.
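
To make the idea of criterion validity more concrete, the sketch below compares a shorter instrument against a “gold standard” on a yes/no item. It assumes agreement is summarized with sensitivity, specificity, and overall percent agreement, which is one common approach and not a method prescribed by this guide. The function name, the item (whether a store stocks fresh produce), and the data are hypothetical.

```python
def criterion_agreement(new_tool, gold_standard):
    """Compare yes/no results from a new instrument with a gold standard.

    Both arguments are equal-length lists of booleans, one entry per
    assessed site. Assumes the gold standard contains at least one True
    and at least one False; otherwise a division by zero occurs.
    """
    pairs = list(zip(new_tool, gold_standard))
    true_pos = sum(n and g for n, g in pairs)
    true_neg = sum((not n) and (not g) for n, g in pairs)
    false_pos = sum(n and (not g) for n, g in pairs)
    false_neg = sum((not n) and g for n, g in pairs)
    return {
        # Share of gold-standard "yes" sites the new tool also marks "yes".
        "sensitivity": true_pos / (true_pos + false_neg),
        # Share of gold-standard "no" sites the new tool also marks "no".
        "specificity": true_neg / (true_neg + false_pos),
        # Share of all sites where the two instruments agree.
        "agreement": (true_pos + true_neg) / len(pairs),
    }


# Hypothetical data: does each of six stores stock fresh produce?
short_checklist = [True, True, False, False, True, False]
full_audit = [True, True, False, True, True, False]

for name, value in criterion_agreement(short_checklist, full_audit).items():
    print(name, round(value, 2))
```

For instruments that yield a score instead of a yes/no result, a correlation between the new instrument and the criterion would be the analogous summary.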

Table 3: Validity Definitions, Measures, Applications to, and Examples from Food Environment Measurement

The Measures Registry allows users to see and compare the reliability and validity results that have been reported for measures that may fit their needs. Psychometric properties are not available for all measures included in the Registry, but when they are, they provide important insight into the quality of the measure.

Measures Registry USER GUIDES