Physical Activity Environment
MEASURES REGISTRY USER GUIDE
SECTION 4
Evaluating Existing Measures
A tool’s measurement properties should always be considered when evaluating the acceptability of a measure. The most important properties to consider, namely the tool’s reliability and validity, as well as other relevant measurement properties, are described in this section. Each measure and publication in the Measures Registry includes a tab that provides detailed information on the measure’s reliability and validity, when available.
Reliability and Validity
Reliability refers to the consistency with which something is measured, whereas validity refers to the accuracy with which the measure captures the construct to be measured. Validity can be further broken down into criterion validity, in which the tool is compared to a gold standard measure of the same construct (i.e., truth), and construct validity, in which the tool is positively correlated with a measure of a theoretically related construct. For example, we expect neighborhood walkability to be positively correlated with physical activity, especially active transportation. The types of reliability and validity that are relevant to built environment assessment differ across the methods of assessment. Acceptable measures would ideally have evidence of both reliability and validity, although in environmental research it can be difficult (e.g., with GIS) or unnecessary (e.g., with direct observation and environmental perceptions) to assess criterion validity because of the lack of gold standard comparison measures. Thus, many environmental measures rely on evidence of reliability (i.e., do raters agree with each other and is the same respondent consistent over time), face validity (i.e., is the measure perceived by experts as consistent with the concept it is intended to measure), and construct validity (e.g., is the measure related as expected to physical activity) to evaluate their quality or utility.
GIS-based Measures
A primary requirement for reliable GIS-based measures is that variable computations be replicable by multiple analysts (see Section 7 for more detail on variable computations).67,68 Few existing measures provide this evidence, but this type of inter-rater reliability can be maximized by using well-defined variables and a detailed scoring protocol. Another key component of reliability of GIS-based measures is temporal match, which refers to the match between the time period when the GIS data were collected and the time period when other variables (e.g., participant physical activity) were collected. Ideally, the GIS and participant variables would be collected within one year of one another. Although GIS variables can be stable over multiple years, some areas can change rapidly, such as those undergoing redevelopment, so local knowledge of the area is useful. A poor temporal match would make GIS variables that change rapidly less useful for explaining physical activity. Criterion validity in GIS-based measures is primarily affected by the completeness and accuracy of the GIS databases used. Unfortunately, it is often impossible to know whether errors exist in public geo-databases, and evidence is lacking on how data incompleteness and inaccuracies affect physical activity research. It is important to investigate the quality of the data source when possible; for example, by directly observing a small sample of GIS variables to determine accuracy. Construct validity is important and relevant to GIS-based measures, with the key consideration being that variables used should have evidence or rationale for associations with physical activity. Evidence of construct validity can be found in reviews of built environment and physical activity research.19,30,59
Observational Measures
Inter-rater reliability is the most commonly assessed measurement property of observational measures and is a critical component in determining a tool’s quality. Inter-rater reliability involves multiple raters completing the audit tool for the same locations independently and comparing their responses for discrepancies. This is typically done for a sample of at least 30 to 40 instances of the environment being captured (e.g., street segments). Key metrics to consider include percent absolute agreement, Kappa for use with yes/no checklist audits and categorical data,69 and intraclass correlation coefficients (ICCs) for use with continuous responses (e.g., 1-5 Likert scales).70 Commonly used thresholds for good agreement are ≥80 percent for percent agreement and ≥0.80 for Kappas and ICCs; Kappa and ICC values between 0.60 and 0.80 are often considered acceptable.69 It is important to note that a measure cannot have acceptable validity if it does not have acceptable inter-rater reliability.71 Criterion validity testing of observational measures is not typically necessary because direct observation is considered a gold standard objective method. Construct validity is commonly assessed for observational audits, and tools or constructs that have shown associations with participants’ physical activity are interpreted to have good construct validity.
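To make these metrics concrete, the following sketch (in Python, using hypothetical yes/no audit data) computes percent absolute agreement and Cohen’s Kappa for two raters who independently audited the same 10 street segments:

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Share of items on which the two raters gave the same rating."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions, summed over categories
    p_chance = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(rater_a) | set(rater_b))
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical yes (1) / no (0) audits of 10 street segments by two raters
rater_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(percent_agreement(rater_a, rater_b))  # 0.8 -> meets the >=80 percent threshold
print(cohens_kappa(rater_a, rater_b))       # ~0.6 -> in the "acceptable" 0.60-0.80 band
```

Note that the two metrics can diverge: here the raters agree 80 percent of the time, but the Kappa is lower because roughly half that agreement would be expected by chance alone.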
Questionnaires
The primary measurement property needed to support questionnaires is test-retest reliability, which involves administering the tool to the same participants at two time points, such as two weeks apart. Similar to inter-rater reliability, percent agreement, Kappa statistics, and ICCs are used to interpret reliability.
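For continuous responses, test-retest reliability is often summarized with an ICC. One common form is the two-way random-effects, absolute-agreement, single-measure ICC (often labeled ICC(2,1)). The sketch below, using hypothetical 1-5 Likert data, computes it from its ANOVA components, treating the two administrations as “raters”:

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `ratings` is a list of rows, one per participant; each row holds that
    participant's score at each administration (e.g., test and retest).
    """
    n = len(ratings)     # participants
    k = len(ratings[0])  # administrations (time points)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between time points
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical 1-5 Likert responses from six participants, two weeks apart
scores = [[4, 4], [2, 3], [5, 5], [1, 2], [3, 3], [4, 5]]
print(round(icc_2_1(scores), 2))  # 0.87 -> above the 0.80 "good" threshold
```

In practice, users would typically rely on a vetted statistical package rather than hand-coding the ANOVA terms, but the decomposition above shows what the coefficient is measuring: participant-to-participant variance relative to disagreement between administrations.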
Criterion validity can be assessed for presence or absence and tally-based report measures by comparing a participant’s responses to the same tool completed by a researcher using an observational audit, but this is not commonly done. Criterion validity is not typically assessed for perceptive evaluation report measures because evaluations are subjective and do not have a gold standard comparison. Similar to observational measures, construct validity should be established by testing associations with physical activity. Users also should note whether the measurement properties were established in a similar population as the user intends to study. When using a questionnaire in a new population or comparing across populations, the measure would ideally have evidence of invariance across subgroups (e.g., the measure performs similarly in men and women).72
Single Items, Scales, and Indices
Measure developers have used a variety of methods to reduce measures with many items to a small number of scales and indices. However, there is a trade-off: shorter scales and single-item indicators are more feasible to administer but tend to have lower reliability and validity. Both indices and scales can be used to reduce a large number of variables to a small number of useful metrics. Although the terms “scale” and “index” are sometimes used interchangeably, they have differences. Scales comprise inter-related items capturing a narrow (usually unobservable) construct, typically a perception or attitude. One advantage of scales is that they often improve reliability over single items. In contrast, items comprising an index do not need to be inter-related and often capture a broad concept. Environmental measures typically assess a wide range of features, but consensus is growing that no single feature is the most important for physical activity. One advantage of indices is that they can be used to investigate additive effects of single items or features, particularly when composed of a sum of dichotomous items. A growing number of studies show that multi-item indices are more strongly related to physical activity outcomes than single items.60,64
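The reliability advantage of multi-item scales can be quantified with an internal-consistency statistic such as Cronbach’s alpha (one common choice; the item data below are hypothetical):

```python
from statistics import pvariance

def cronbachs_alpha(items):
    """Cronbach's alpha for a multi-item scale.

    `items` is a list of item-response lists, one per item, each holding one
    response per participant. Population variances are used throughout.
    """
    k = len(items)      # number of items in the scale
    n = len(items[0])   # number of participants
    # Scale score per participant = sum of that participant's item responses
    totals = [sum(item[i] for item in items) for i in range(n)]
    item_var_sum = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / pvariance(totals))

# Hypothetical 1-5 Likert responses: four inter-related "safety from traffic"
# items, each answered by six participants
items = [
    [4, 2, 5, 1, 3, 4],
    [4, 3, 5, 2, 3, 5],
    [5, 2, 4, 1, 2, 4],
    [4, 2, 5, 2, 3, 5],
]
print(round(cronbachs_alpha(items), 2))  # 0.97 -> highly inter-related items
```

When items track the same underlying perception, as above, alpha is high and the summed scale score is more reliable than any single item; unrelated items pulled into an index would not be expected to (and need not) show this property.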
GIS-based Measures
Walkability indices, which involve a composite score derived from multiple environmental attributes, are commonly used in GIS-based measures. Such indices can be computed by summing dichotomous (e.g., yes = 1, no = 0) variables across a number of environmental attributes. For example, a neighborhood that has connected streets (1), access to shops and restaurants (1), and low residential density (0) would have a score of 2 on an index ranging from 0-3. When creating indices from continuous variables representing different units of measurement, a common procedure is to transform each indicator or attribute score into a standardized z-score based on the sample mean and then take a sum or average across the z-scores to derive the index (e.g., Frank’s walkability index).43 These indices are advantageous because they can reduce a large number of variables to a manageable number of metrics and represent the combination of attributes sometimes viewed as more important than any single attribute. Another reason to use scales and indices is that environmental variables often are correlated with each other. Because components of walkability can be inter-correlated, it is sometimes not possible to include all components in the same analytic model. Running models with one walkability component at a time will underestimate associations with outcomes, but the multi-component index should yield a more accurate estimate of the association of the pattern of the built environment with the outcome. Limitations of walkability indices are that one variable may be unknowingly driving an association, and the metrics are difficult to interpret outside of the area or region used to generate the standardized scores. Thus, single variable attributes with easily interpretable metrics (e.g., number of intersections per square km, used to capture street connectivity) are sometimes more desirable for planning and decision making than are indices.
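Both index constructions described above can be sketched in a few lines of Python (all attribute names and sample values below are hypothetical, and the z-score version is only in the spirit of Frank’s index, not a reimplementation of it):

```python
from statistics import mean, stdev

# Dichotomous index (range 0-3), mirroring the example above: connected
# streets (1), access to shops and restaurants (1), low residential density (0)
attributes = {"connected_streets": 1, "shops_and_restaurants": 1,
              "high_residential_density": 0}  # low density scores 0
dichotomous_index = sum(attributes.values())
print(dichotomous_index)  # 2

# z-score index: standardize each continuous attribute against the sample,
# then sum the z-scores for each neighborhood (four neighborhoods here)
sample = {
    "intersections_per_sq_km": [120, 80, 150, 60],
    "residential_units_per_acre": [10, 6, 14, 4],
    "land_use_mix": [0.7, 0.4, 0.9, 0.3],  # 0-1 entropy-style score
}

def z_scores(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

standardized = {attr: z_scores(vals) for attr, vals in sample.items()}
walkability = [sum(standardized[attr][i] for attr in sample) for i in range(4)]
print([round(w, 2) for w in walkability])
```

Because each attribute is standardized against this particular sample, the resulting scores sum to zero across the sample and are only interpretable relative to it, which is exactly the interpretability limitation noted above.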
Observational Measures
Indices lend themselves well to use in observational assessment, given that many features are rated dichotomously (e.g., presence or absence), but indices are unfortunately underused in this area. The Microscale Audit of Pedestrian Streetscapes (MAPS) is an example of an observational tool that uses indices.64 Indices can be used in observational measures in the same manner as in GIS-based measures and should be considered by users.
Questionnaires
Many questionnaires include multiple scales to group items according to specific attributes. For example, a perceived safety from traffic scale may include items covering perceptions of sidewalks, street crossings, traffic volume, and traffic speed, each rated on a Likert response scale. This type of data reduction simplifies analyses and can improve reliability properties of the measure. A disadvantage of scales is that multiple items are needed to assess each construct, so measures assessing multiple constructs can become lengthy. Some tools, such as the Physical Activity Neighborhood Environment Scale (PANES),73 use one or two items to assess a construct and cover multiple constructs. Such tools can have acceptable measurement properties and should be considered when survey length is a concern, but in general, reliability properties are most favorable when multiple items are used to assess each construct.
Other Measurement Issues
Response scales. Response scales are used in both questionnaires and observational measures. Response scales in environmental assessment tools typically include anywhere from 2 (e.g., yes/no) to 10 or even more response options. Likert-type response scales can include three or more response options, with all or some options including anchor points, such as “none,” “some,” and “many” or “strongly agree” and “strongly disagree.” Anchors need to be balanced so that the intervals between every two sequential numbers are roughly equivalent. Response scales that include an odd number of response options often include a “neutral” or middle category, which can be problematic if responses will later be dichotomized (e.g., agree vs. disagree). An advantage of continuous response scales is that they typically result in greater variability and thus provide more power for detecting associations than dichotomous response scales. However, dichotomous response scales such as yes/no and agree/disagree are more easily interpretable and sometimes sufficient.
Stability vs. sensitivity to change. Measures with good inter-rater or test-retest reliability typically focus on environmental attributes that are stable and change little over short time periods. Although these traits are beneficial for establishing reliable measures, such tools may have limited use for assessing changes over time, such as evaluating interventions. GIS-based measures, in particular, have low sensitivity to change or have not yet been evaluated for sensitivity to change because macro-level community design features can require several years or even decades to change. For example, changes to increase population density or mixed land use within a neighborhood dominated by single-family homes would require significant policy changes and extensive redevelopment.
Although most observational measures and reports are generally not designed specifically to capture constructs that change over time, some may be useful for assessing changes over somewhat longer periods of time (e.g., ≥1 year) or when interventions target the constructs being assessed. For example, a crosswalk audit tool could be useful for capturing changes if an intervention specifically targets crosswalk improvements. If the user’s goal is to assess changes over time, special consideration is needed when identifying appropriate measures. Development of a new measure or new items for an existing measure may be needed to evaluate changes in the specific variables targeted by interventions. In some situations, it may be beneficial to use mixed methods (i.e., both qualitative and quantitative data) to capture environmental changes. For example, a key informant with knowledge of the area and/or environmental changes could be interviewed to provide additional information on the environmental changes and the process by which they were achieved.