arnold on Oct 17, 2008
Cronbach (1971) described validation as the process by which a test developer or test user collects evidence to support the types of inferences that are to be drawn from test scores.
Nunnally and Bernstein (1994) say the term validity denotes the scientific utility of a measuring instrument, broadly statable in terms of how well it measures what it purports to measure. Validity usually is a matter of degree rather than an all-or-none property, and validation is an unending process.
Campbell and Fiske (1959) write that validity is represented in the agreement between two attempts to measure the same trait through maximally different methods.
The literature on validity is almost overwhelming. Simply put, what a researcher is trying to show with validity is how well an instrument measures what it is supposed to measure. The following summary of validity is a simple synthesis of the work of Crocker and Algina (1986), Nunnally and Bernstein (1994) and Campbell and Fiske (1959).
Validity has been given three major meanings:
1) Construct validity: measuring psychological attributes
2) Predictive validity: establishing a statistical relationship with a particular criterion
3) Content validity: sampling from a pool of required content.
Planning a Validation Study:
According to Crocker and Algina (1986), to plan a validation study, the desired inference must be clearly identified. Then an empirical study is designed to gather evidence of the usefulness of the scores for such inferences. The three major types of validation studies will be discussed in no particular order.
1) Content Validity: (also called intrinsic validity, circular validity, relevance and representativeness (Nunnally and Bernstein, 1994)).
This is for situations where the test user desires to draw an inference from the examinee’s test score to a larger domain of items similar to those on the test itself. For example, if a test giver gives a person an addition test, it is unimportant whether the test taker knows only the equations on that particular test. The test giver is interested in the test taker’s knowledge of addition problems like these. It would be unrealistic to give test takers the entire population of possible addition problems, so a test giver must choose a sample and then show that the content used in the sample is a valid representation of the domain. Hence the term content validation.
Content Validity as described by Crocker and Algina (1986).
In order to prove content validity, there are a series of activities that must take place after the measurement instrument has been developed. According to Crocker and Algina (1986) there are a minimum of 4 steps to show content validation.
1) Defining the performance domain of interest (Ability to do addition)
2) Selecting a panel of qualified experts in the content domain (Math teachers?)
3) Providing a structured framework for the process of matching items to the performance domain.
4) Collecting and Summarizing the data from the matching process.
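Step 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the panel data, the objective names (single_digit, carrying) and the helper summarize_matches are hypothetical and do not come from Crocker and Algina. The sketch records which objective each expert matched each item to, then reports the most common match per item and the share of experts who agreed with it.

```python
from collections import Counter

# Hypothetical panel data: each of three experts assigns every item to one
# objective in the performance domain, or to None when the item fits none.
ratings = {
    "item_1": ["single_digit", "single_digit", "single_digit"],
    "item_2": ["carrying", "carrying", "single_digit"],
    "item_3": [None, "carrying", None],
}

def summarize_matches(ratings):
    """Return, per item, the modal objective and the share of experts choosing it."""
    summary = {}
    for item, choices in ratings.items():
        objective, votes = Counter(choices).most_common(1)[0]
        summary[item] = (objective, votes / len(choices))
    return summary

summary = summarize_matches(ratings)
for item, (objective, agreement) in summary.items():
    print(item, objective, round(agreement, 2))

# Number of items that a majority of experts matched to some objective.
matched = sum(
    1 for objective, agreement in summary.values()
    if objective is not None and agreement > 0.5
)
print("share of items matched to the domain:", round(matched / len(summary), 2))
```

An item such as item_3, which most experts could not match to any objective, is a candidate for revision or removal.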
Content Validity as described by Nunnally and Bernstein (1994).
Content Validity: when the validity depends greatly on the adequacy with which a specified domain content is sampled. More simply, it is how well the material was sampled. If most potential users of the test agree that the plan was sound and well carried out, the test has a high degree of content validity.
Two major standards for ensuring content validity:
1) Representative collection of items and
2) Sensible methods for test construction.
One would expect at least a moderate level of internal consistency among test items. Another type of evidence for content validity is obtained from correlating scores on different tests purporting to measure much of the same thing. However, both tests may measure the same wrong thing. Content validity relates to a rather direct issue in scientific generalization – the extent to which one can generalize from a particular collection of items to all possible items in a broader domain of items.
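The internal consistency mentioned above is often summarized with coefficient alpha. The sketch below is a minimal illustration, assuming a small hypothetical table of right/wrong item scores; the function name cronbach_alpha and the data are not taken from Nunnally and Bernstein.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Coefficient alpha for a list of examinee rows, each a list of item scores."""
    k = len(scores[0])  # number of items
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: 5 examinees, 4 items scored 1 (right) or 0 (wrong).
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
print(round(alpha, 3))
```

A high value shows only that the items hang together. As the paragraph above notes, a set of items can be consistent and still measure the same wrong thing.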
2) Predictive Validity: (also called empirical validity, statistical validity, criterion-related validity).
Criterion-related validity as described by Crocker and Algina (1986).
Criterion-related validation is for situations where the test user desires to draw an inference from the examinee’s test score to performance on some real behavioral variable of practical importance (Crocker and Algina, 1986). For example, the GMAT, used for entrance into graduate business programs, measures only certain skills such as problem solving, writing and quantitative ability. Graduate programs are not interested in these skills alone; they want to know whether they can infer from a GMAT score how well a candidate is likely to perform in the program. Before admissions personnel can use GMAT scores to make admission decisions, they need evidence that there is a relationship between GMAT scores and success in a graduate program (the criterion to be referenced). This evidence comes from a criterion-related validation study. According to Crocker and Algina (1986) there are a minimum of 5 steps to show criterion-related validation.
1) Identify a suitable criterion behavior and a method for measuring it. (Grad school G.P.A, Degree?)
2) Identify an appropriate sample of examinees representative of those for whom the test will ultimately be used. (Grad school students).
3) Administer the test and keep a record of each examinee’s score.
4) When the criterion data are available, obtain a measure of performance on the criterion for each examinee.
5) Determine the strength of the relationship between test scores and criterion performance.
There are two types of criterion-related validation: predictive and concurrent. Predictive validity refers to the degree to which the test scores predict a criterion measurement made at a later time, which is what the GMAT does for the success in a graduate program criterion. Concurrent validity refers to the relationship between the test scores and a criterion measurement made at the time of the test. For example, a DMV could conceivably give a written test about driving right before doing an actual road test. A high correlation between the two tests could justify the use of only the written test (although this is unlikely in this example).
Some commonly encountered problems with criterion-related validity studies, from Crocker and Algina (1986), are:
a) Identification of a suitable criterion
b) Insufficient sample size
c) Criterion contamination
d) Restriction of range
e) Unreliability of the predictor or criterion scores
There are a couple of important things to mention when it comes to reporting and interpreting the results of a criterion-related validation. If the criterion is continuously distributed, you can use the Pearson product moment correlation coefficient between the test score and the criterion measure as the validity coefficient. (For a categorical predictor and criterion you would calculate a statistic such as the phi coefficient.) The square of the validity coefficient is called the coefficient of determination. The coefficient of determination indicates the proportion of variance in the criterion that is accounted for by the test.
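The quantities named in the paragraph above can be computed directly. The following is a minimal sketch, assuming made-up GMAT scores and graduate GPAs for six examinees and a made-up 2x2 table; the function names pearson_r and phi_coefficient are illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table: a = pass test/succeed, b = pass test/fail,
    c = fail test/succeed, d = fail test/fail."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical data: test scores (predictor) and graduate GPA (criterion).
test_scores = [520, 580, 610, 650, 700, 720]
gpa = [2.9, 3.1, 3.0, 3.4, 3.6, 3.5]

validity_coefficient = pearson_r(test_scores, gpa)
coefficient_of_determination = validity_coefficient ** 2
print("validity coefficient:", round(validity_coefficient, 3))
print("coefficient of determination:", round(coefficient_of_determination, 3))

# Hypothetical dichotomous case: pass/fail on the test vs. succeed/fail on the criterion.
phi = phi_coefficient(20, 10, 10, 20)
print("phi:", round(phi, 3))
```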
Predictive Validity as described by Nunnally and Bernstein (1994).
Predictive Validity: concerns using an instrument to estimate some criterion behavior that is external to the measuring instrument itself. In a statistical sense, predictive validity is determined by, and only by, the degree of correspondence between predictors and criterion. Sound theory and common sense are not standards used for the correspondence between predictors and criterion. They are however useful in selecting predictor instruments.
Problems with predictive validity:
1) Obtaining a good criterion may be more difficult than obtaining a good predictor. The criterion measure (deciding what to measure) is a core problem associated with many predictive validity situations.
2) Range restrictions: something may occur to eliminate or minimize relevant differences on the predictor or criterion. For example, studying cognitive ability in a sample made up only of students yields a fairly homogeneous group, so differences are hard to detect.
3) Part of the problem in selecting a criterion is that any criterion is influenced to a certain degree by random error and is therefore only partially reliable.
4) Many times the measures available to you as a criterion are a composite of two separate attributes.
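The effect of range restriction (problem 2 above) is easy to see in a small simulation. This is a sketch under stated assumptions, not a result from Nunnally and Bernstein: the population is simulated with a true predictor-criterion correlation of about .5, and "selection" keeps roughly the top third of examinees on the predictor. All names and numbers are illustrative.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated population: criterion = 0.5 * predictor + noise, so the
# population correlation is about .5 (an assumption of this sketch).
predictor = [rng.gauss(0, 1) for _ in range(5000)]
criterion = [0.5 * p + rng.gauss(0, sqrt(0.75)) for p in predictor]
r_full = pearson_r(predictor, criterion)

# Restrict the range: keep roughly the top third on the predictor,
# as if only high scorers had been admitted.
cutoff = sorted(predictor)[len(predictor) * 2 // 3]
selected = [(p, c) for p, c in zip(predictor, criterion) if p >= cutoff]
r_restricted = pearson_r([p for p, _ in selected], [c for _, c in selected])

print("full range:", round(r_full, 2))
print("restricted range:", round(r_restricted, 2))
```

The correlation in the selected group comes out noticeably lower than in the full group, even though the underlying relationship is unchanged.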
Nunnally and Bernstein conclude that investigators can rarely have faith in their criterion measures, regardless of the area in which they work. However, to use predictive validity, in contrast to construct validity, is to assume that the criterion is appropriate. The correlation between the predictor test and the criterion variable is called the validity coefficient, and it specifies the degree of validity of that generalization. These correlations are usually fairly low (.3 to .4). People are far too complex to permit a highly accurate estimate of their proficiency in most performance-related situations from any practical collection of test materials.
3) Construct Validity: (also called trait validity and factorial validity).
Nunnally and Bernstein (1994) spend the majority of their discussion on Construct Validation as they find a lot of problems with content and predictive validity.
A construct is literally something that scientists construct and which does not exist as an observable dimension of behavior. A construct reflects a hypothesis that a variety of behaviors will correlate with one another in studies of individual differences.
There are three major aspects of construct validation:
1) Specifying the domain of observables related to the construct.
2) Determining the extent to which observables tend to measure the same thing, several different things, or many different things from empirical research and statistical analyses.
3) Performing subsequent individual differences studies and/or experiments to determine the extent to which supposed measures of the construct are consistent with “best guesses” about the construct.
In the end, it is hoped that this process produces a construct that is:
1) Well defined through a variety of observables
2) Well represented by alternative measures, and
3) Related strongly to other constructs of interest.
Internal consistency is necessary but not sufficient for construct validity. To determine construct validity, a measure must fit a theory about the construct. Campbell and Fiske (1959) published an article on construct validation. They viewed reliability and validity as points along a continuum rather than as sharply distinguished ideas.
They introduced four key points:
1) Validation is typically convergent because it is concerned with demonstrating that two independent methods of inferring an attribute lead to similar ends.
2) In order to justify novel measures of attributes, a measure should have divergent validity in the sense of measuring something different from existing methods.
3) A measure is jointly defined by a method and attribute-related content. Two measures may differ in method, content, or both.
4) At least two attributes, each measured by at least two methods are required to examine discriminant validity.
Construct Validation as described by Crocker and Algina:
Construct validation is for situations where “no criterion or universe of content is accepted as entirely adequate to define the quality to be measured…” (Cronbach and Meehl, 1955), but the test user desires to draw an inference from the test score to performances that can be grouped under the label of a particular psychological construct. The construct must be defined on two levels. First, it must be operationally defined, which is usually done by specifying the procedures used to measure the construct. In addition, measures of the construct must be syntactically defined by the postulation of relationships with measures of other constructs in the theoretical system and with measures of specific real-world criteria. Basically, the critical issue is whether or not subtests or tests which are supposed to measure the same construct are empirically identified as measuring a common factor.
There are four steps generally used:
1) Formulation of one or more hypotheses about how those who differ on the construct are expected to differ on demographic characteristics, performance criteria, or measures of other constructs that are related. The hypothesis should be based on explicitly stated theory that underlies the construct. (For the GMAT example you might hypothesize that the quantitative section of the score
2) Selection of a measurement instrument which consists of items representing behaviors that are specific, concrete manifestations of the construct.
3) Gathering of empirical data which will permit testing of the hypothesized relationships.
4) Determining if the data are consistent with the hypotheses and considering the extent to which the observed findings could be explained by rival theories or alternative explanations.
There are also four procedures used for construct validation.
1) Correlations between test scores and designated criterion variables (sometimes constructs aren’t identical but they are related. For example, IQ test and school achievement. Correlation will help potential test users evaluate the strength of the evidence present for the construct validity.)
2) Differentiation between groups (Contrasting the mean scores of different groups to see if they differ in the hypothesized direction. No differences would raise doubts).
3) Factor analysis: (to identify some reduced number of underlying variables (factors) to see if the constructs, which are empirically identified through factor analysis, correspond to the theoretical constructs which the test developer hypothesized they would when developing the test.)
4) Multitrait-multimethod matrix analysis: With this approach the researcher must think of two or more ways to measure the construct of interest and identify other, distinctly different constructs which can be appropriately measured by the same methods applied to the construct of interest.
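Procedure 2, differentiation between groups, can be illustrated with a simple contrast of two group means. The sketch below is only one possible way to do this: it uses Welch's t statistic, which is a choice made for this example and not a method prescribed by Crocker and Algina, and the two groups of scores are invented.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical scores: the theory behind the construct predicts that
# group A should score higher than group B.
group_a = [14, 16, 15, 17, 18]
group_b = [10, 12, 11, 13, 9]

difference = mean(group_a) - mean(group_b)
t_statistic = welch_t(group_a, group_b)
print("mean difference:", difference)
print("t statistic:", round(t_statistic, 2))
```

A difference in the hypothesized direction supports the construct interpretation. A difference near zero, or in the wrong direction, would raise doubts.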
From Crocker and Algina and Nunnally and Bernstein:
There are four types of correlations in a multitrait-multimethod matrix (Crocker and Algina).
1) Reliability coefficients: measure internal consistency
2) Heterotrait-monomethod correlation: correlation between two measures that have a common method but assess different attributes.
3) Monotrait-heteromethod correlation: correlation between two measures of the same trait using different methods.
4) Heterotrait-heteromethod correlation: correlation between different attributes using different methods.
Three types from Nunnally and Bernstein (1994):
1) Reliability coefficients: These should be high, as they are correlations between measures of the same construct using the same measurement method.
2) Convergent Validity coefficients: These should be high as well, because they are correlations between measures of the same construct using different measurement methods.
3) Discriminant Validity coefficients: These should be lower, because they are either correlations between measures of different constructs using the same measurement method or correlations between measures of different constructs using different measurement methods.
The two sets of authors are using different terms for the same things.
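The correspondence between the two vocabularies can be made concrete by labeling every cell of a small multitrait-multimethod matrix. This is an illustrative sketch: the traits (anxiety, sociability) and methods (self_report, peer_rating) are hypothetical, and the function classify is not taken from either text.

```python
from itertools import combinations_with_replacement

# Hypothetical design: two traits, each measured by two methods.
measures = [
    ("anxiety", "self_report"),
    ("anxiety", "peer_rating"),
    ("sociability", "self_report"),
    ("sociability", "peer_rating"),
]

def classify(measure_1, measure_2):
    """Label the correlation between two (trait, method) measures."""
    same_trait = measure_1[0] == measure_2[0]
    same_method = measure_1[1] == measure_2[1]
    if same_trait and same_method:
        return "reliability"
    if same_trait:
        return "monotrait-heteromethod (convergent validity)"
    if same_method:
        return "heterotrait-monomethod (discriminant validity)"
    return "heterotrait-heteromethod (discriminant validity)"

# Walk the lower triangle of the matrix, diagonal included.
for measure_1, measure_2 in combinations_with_replacement(measures, 2):
    print(measure_1, measure_2, "->", classify(measure_1, measure_2))
```

Reading the output against the lists above, the pattern one hopes to see is high values in the reliability and convergent cells and lower values in both kinds of discriminant cells.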