What is Reliability in terms of classical test theory?


arnold on Sep 28, 2008

Explain reliability in terms of classical test theory:

Nunnally (1967) defined reliability as “the extent to which [measurements] are repeatable and that any random influence which tends to make measurements different from occasion to occasion is a source of measurement error” (p. 206). Many factors can prevent measurements from being repeated perfectly.

Crocker and Algina (1986) define reliability as “The desired consistency (or reproducibility) of test scores” or “the degree to which individuals’ deviation scores, or z-scores, remain relatively consistent over repeated administration of the same test or alternate test.”

Two types of measurement error affect test scores: systematic and random. Systematic measurement errors are those which consistently affect an individual’s score because of some particular characteristic of the person or the test that has nothing to do with the construct being measured (Crocker and Algina, 1986, p. 105). They cause test scores to be inaccurate. Random errors of measurement affect an individual’s score because of purely chance happenings, such as guessing, distractions, and fluctuations in the individual’s state (Crocker and Algina, 1986, p. 106). They reduce both the consistency and the usefulness of the test scores.

The classical true score model states that an observed test score (X) is the sum of a true score (T) and an error score (E):
X = T + E
The true score is the average of the observed scores an examinee would obtain over an infinite number of repeated testings.
The error score is the discrepancy between an examinee’s observed test score and his or her true score.
According to classical true score theory, two tests are defined as parallel when:
1. Each examinee has the same true score on both forms of the test, and
2. The error variances for the two forms are equal.

The reliability coefficient can be defined as the correlation between scores on parallel test forms.
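
A minimal simulation sketch of this definition, using only the Python standard library. The sample size, score mean, and the two standard deviations are illustrative assumptions, not values from the sources cited here. It builds two parallel forms from X = T + E (same true score, equal error variance) and compares their correlation with the ratio of true score variance to observed score variance, which is the standard classical test theory expression for the reliability coefficient.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)          # fixed seed so the sketch is repeatable
n = 20000                       # number of simulated examinees (assumed)
true_sd, error_sd = 10.0, 5.0   # illustrative values, not from the source

# Each examinee has one true score T; each form adds its own random error E.
T = [rng.gauss(50, true_sd) for _ in range(n)]
X1 = [t + rng.gauss(0, error_sd) for t in T]   # form 1: X = T + E
X2 = [t + rng.gauss(0, error_sd) for t in T]   # parallel form 2

theoretical = true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)  # var(T) / var(X)
estimated = pearson(X1, X2)     # correlation between the parallel forms

print("var(T) / var(X):", round(theoretical, 3))
print("correlation between parallel forms:", round(estimated, 3))
```

With these assumed values the variance ratio is 0.8, and the simulated correlation should land close to it. Real test forms are never perfectly parallel, so in practice the correlation is only an estimate of reliability.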

Although alpha is sometimes referred to as “the” estimate of reliability, it is not the only estimate of reliability (Cortina, 1993). Different sources of error call for different estimates, so the estimate you use depends on which source of error concerns you.

Three major sources of error:
1. Content sampling from form to form
2. Change in Examinee over time
3. Content sampling, or flawed items

Procedures requiring two test administrations to the same group:
1. Alternate form method: To reduce the possibility of cheating, similar forms of a test often need to be given over time (e.g., a board exam). The errors of measurement that concern the test user are those due to differences in the content of the test forms. Both forms are given to the same group, and the correlation between the two sets of scores shows how closely the forms agree. This is called the coefficient of equivalence. It is usually between .8 and .9.
2. A. Test-Retest Method: If you are concerned with error factors related to the passing of time, then you want to know how consistently examinees respond to the same form at different times. Administer the test, wait, and then re-administer it. The correlation coefficient from this procedure is called the coefficient of stability.
B. Test-Retest with Alternate Forms: Administer form 1 of the test, wait, and then administer form 2. The correlation coefficient is known as the coefficient of stability and equivalence.

3. Content sampling or flawed items
Procedures Requiring a Single Test Administration:
Sometimes you want to make sure that examinees performed consistently across different items on the same test, for example when different sections of the test measure the same content area. Procedures used to measure this are called internal consistency methods, and they yield an internal consistency coefficient. When examinees perform consistently across subsets of items within a test, the test is said to have item homogeneity, which is what you want.

There are two broad classes of methods for estimating the reliability coefficient from a single administration.
1. Split-Half Methods: The test developer administers the full set of items to one group of examinees and then divides the items into two half-length subtests. If there are 20 items, each examinee gets two scores, one on each set of 10 items. The correlation between the two sets of half-test scores is calculated, and this is called the coefficient of equivalence for the two halves of the test. However, different methods of splitting the test yield different reliability estimates. To overcome this, you can calculate coefficient alpha, which is the average of all the split-half coefficients that would be obtained if the test were divided into all possible halves.
2. Analysis of the Variance-Covariance structure of the item responses: These methods yield an index of the internal consistency of the examinees’ responses to the items within a single test form.

KR-20: Can be used with dichotomously scored items.
Cronbach’s alpha: Can be used to estimate the internal consistency of items that are dichotomously scored or scored over a wider range of values.
Hoyt’s analysis of variance: Based on Analysis of Variance.
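
A minimal sketch of the single-administration estimates, assuming every examinee answers every item and using a small made-up score matrix (not real data). The function names are hypothetical helpers written for this illustration. The odd-even split is only one of many possible splits, and the step-up from the half-test correlation to a full-length estimate uses the Spearman-Brown formula, a standard correction that is not described in the notes above.

```python
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def coefficient_alpha(scores):
    """Coefficient alpha from an examinee-by-item score matrix."""
    k = len(scores[0])
    item_variances = sum(statistics.pvariance(col) for col in zip(*scores))
    total_variance = statistics.pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variances / total_variance)

def kr20(scores):
    """KR-20 for items scored 0/1; p is the proportion answering correctly."""
    n, k = len(scores), len(scores[0])
    pq = sum((sum(col) / n) * (1 - sum(col) / n) for col in zip(*scores))
    total_variance = statistics.pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - pq / total_variance)

# Made-up data: 6 examinees (rows) by 4 dichotomous items (columns).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

# Split-half: one group, items divided into odd and even halves.
odd_half = [row[0] + row[2] for row in scores]
even_half = [row[1] + row[3] for row in scores]
r_half = pearson(odd_half, even_half)     # correlation between the halves
r_full = 2 * r_half / (1 + r_half)        # Spearman-Brown step-up to full length

print("half-test correlation:", round(r_half, 3))
print("stepped-up split-half estimate:", round(r_full, 3))
print("coefficient alpha:", round(coefficient_alpha(scores), 3))
print("KR-20:", round(kr20(scores), 3))
```

On dichotomous data the item variance equals p(1 - p), so KR-20 and coefficient alpha agree here. The split-half estimate would change if the items were divided differently, which is the motivation for alpha given above.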

There are factors that affect reliability coefficients.
1. Group Homogeneity
Reliability is a property of the scores on a test for a particular group of examinees. Potential test users need to determine whether reliability estimates reported in test manuals are based on samples similar in composition and variability to the group for whom the test will be used (e.g., giving a math anxiety test to math majors and then to high school students).
2. Time Limit
When a test has a rigid time limit such that some examinees finish but others do not, an examinee’s working rate will systematically influence his or her performance on all repeated forms of the test. The reliability of a speeded test must be interpreted with caution, because variance in the rates at which examinees work becomes part of the true score variance.
3. Test Length
Increases in test reliability obtained from increasing test length follow the law of diminishing returns. At some point, the small increases in reliability obtained by adding more items will probably not justify the increased costs of item writing and testing time.
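
Factors 1 and 3 can be illustrated numerically. The sketch below is an illustration under stated assumptions, not part of the cited sources: the standard deviations and the starting reliability of .60 are made up, the function names are hypothetical, and the test-length projection uses the Spearman-Brown prophecy formula, which assumes the added items are parallel to the existing ones.

```python
def reliability_from_variances(true_sd, error_sd):
    """Reliability as true score variance over observed score variance."""
    return true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)

def projected_reliability(reliability, length_factor):
    """Spearman-Brown prophecy: reliability after changing test length."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

# Group homogeneity: same error variance, but less spread in true scores.
heterogeneous = reliability_from_variances(true_sd=10.0, error_sd=5.0)
homogeneous = reliability_from_variances(true_sd=4.0, error_sd=5.0)
print("heterogeneous group:", round(heterogeneous, 3))
print("homogeneous group:", round(homogeneous, 3))

# Test length: lengthen a test with reliability .60 by factors of 1 to 5.
base = 0.60  # assumed starting reliability
projections = [projected_reliability(base, k) for k in range(1, 6)]
gains = [later - earlier for earlier, later in zip(projections, projections[1:])]

for k, value in zip(range(1, 6), projections):
    print("length x", k, "->", round(value, 3))
print("gain per added length:", [round(g, 3) for g in gains])
```

Under these assumptions the more homogeneous group shows a much lower reliability (about .39 against .80), and doubling the test raises reliability from .60 to .75 while each further increase in length adds less than the one before.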
