GADYSA & GELBINA

TEST RELIABILITY

In Bahasa Inggris on November 26, 2011 at 12:28 pm

Reliability is one of the most important elements of test quality. It has to do with the consistency, or reproducibility, of an examinee’s performance on the test, and therefore with the accuracy of measurement. This kind of accuracy is reflected in obtaining similar results when a measurement is repeated on different occasions, with different instruments, or by different persons. This characteristic of reliability is sometimes termed consistency. Reliability is a necessary characteristic of any good test. If the test is administered to the same learners on different occasions (with no language practice taking place between these occasions), then to the extent that it produces similar results, it is considered reliable. In short, in order to be reliable, a test must be consistent in its measurements.

For example, if you were to administer a test with high reliability to an examinee on two occasions, you would be very likely to reach the same conclusions about the examinee’s performance both times. A test with poor reliability, on the other hand, might result in very different scores for the examinee across the two test administrations. If a test yields inconsistent scores, it may be unethical to take any substantive actions on the basis of the test.

Hennings states that reliability is a measure of the accuracy, consistency, dependability, or fairness of the scores resulting from the administration of a particular examination. He further introduces some threats to reliability in testing, including fluctuation in the learners, fluctuation in scoring, fluctuation in test administration, test characteristics affecting reliability, and threats to reliability arising from the characteristics of the examinees’ responses.

There are some common methods for computing reliability. They involve statistical calculation (correlations and variances of scores), which requires extra work from teachers or test constructors in collecting the scores and calculating them. These methods are the test-retest method, the parallel forms method, inter-rater reliability, split-half reliability, and Kuder-Richardson 20 and Kuder-Richardson 21. For many criterion-referenced tests, decision consistency is often an appropriate choice. If a learning measure is reliable, it is consistent over time: if no further learning has taken place, a reliable measure will yield the same, or nearly the same, student score on a second administration. The evaluator will want to do whatever is possible to ensure that testing measures are reliable, so that the scores from one test administration can be compared to those of subsequent administrations. Test reliability (consistency) is an essential requirement for test validity.

1.     Test-Retest Reliability

Test-retest reliability is established by computing a correlation coefficient between the scores from the same measuring device administered to the same people on two different occasions. Comparing test results over time allows the test developer to see how stable the test is. The coefficient is calculated by means of the product-moment correlation of the two sets of scores for the same persons. The same test is administered to the same people after an interval of no more than two weeks. To estimate test-retest reliability, you must administer a test form to a single group of examinees on two separate occasions. Typically, the two administrations are only a few days or a few weeks apart; the time should be short enough that the examinees’ skills in the area being assessed have not changed through additional learning. The relationship between the examinees’ scores from the two administrations is estimated, through statistical correlation, to determine how similar the scores are. This type of reliability demonstrates the extent to which a test is able to produce stable, consistent scores across time.
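The product-moment correlation described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `pearson_r` and the two score lists are made up for the example and do not come from any real test.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores of the same five examinees on two occasions
first_administration = [70, 82, 65, 90, 75]
second_administration = [72, 80, 68, 88, 77]

test_retest_reliability = pearson_r(first_administration, second_administration)
print(round(test_retest_reliability, 3))
```

A coefficient close to 1 means the examinees kept roughly the same rank order and spacing across the two occasions. The same function could be applied to scores on two parallel forms.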

2.     Parallel Forms Reliability

This method requires that an equivalent form of a test be administered to the same individuals. Two versions of the same test are administered in order to check that both versions are equivalent.

The two tests are administered to the same sample of persons and the results are correlated using the product-moment correlation. However, the tests must satisfy rigid requirements of equivalence. Many exam programs develop multiple, parallel forms of an exam to help provide test security. These parallel forms are all constructed to match the test blueprint, and they are constructed to be similar in average item difficulty. Parallel forms reliability is estimated by administering both forms of the exam to the same group of examinees. While the time between the two test administrations should be short, it does need to be long enough that examinees’ scores are not affected by fatigue. The examinees’ scores on the two test forms are correlated in order to determine how similarly the two forms function. This reliability estimate is a measure of how consistent examinees’ scores can be expected to be across test forms.

Decision Consistency

In the descriptions of test-retest and parallel forms reliability given above, the consistency or dependability of the test scores was emphasized. For many criterion-referenced tests (CRTs), a more useful way to think about reliability may be in terms of examinees’ classifications. For example, a typical CRT will result in an examinee being classified as either a master or a non-master; the examinee will either pass or fail the test. It is the reliability of this classification decision that is estimated in decision consistency reliability. If an examinee is classified as a master on both test administrations, or as a non-master on both occasions, the test is producing consistent decisions. This approach can be used either with parallel forms or with a single form administered twice in test-retest fashion.
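As a rough sketch of the idea, the proportion of examinees who receive the same pass/fail decision on both administrations can be computed as below. The cut score, the score lists, and the function name are all hypothetical. The sketch also reports Cohen's kappa, a chance-corrected agreement index that is not mentioned in the text above and is included only as an optional extra.

```python
def decision_consistency(scores_1, scores_2, cut_score):
    """Return (proportion of consistent decisions, Cohen's kappa)."""
    if len(scores_1) != len(scores_2) or not scores_1:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(scores_1)
    # Classify each examinee as master (True) or non-master (False)
    pass_1 = [s >= cut_score for s in scores_1]
    pass_2 = [s >= cut_score for s in scores_2]
    # Observed agreement: same classification on both administrations
    p_observed = sum(a == b for a, b in zip(pass_1, pass_2)) / n
    # Agreement expected by chance, from the two pass rates
    rate_1 = sum(pass_1) / n
    rate_2 = sum(pass_2) / n
    p_chance = rate_1 * rate_2 + (1 - rate_1) * (1 - rate_2)
    if p_chance == 1:
        raise ValueError("kappa is undefined when everyone gets the same decision")
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


# Hypothetical scores of six examinees on two administrations, cut score 65
administration_1 = [55, 72, 68, 80, 45, 90]
administration_2 = [58, 70, 62, 85, 50, 88]

agreement, kappa = decision_consistency(administration_1, administration_2, 65)
print(round(agreement, 3), round(kappa, 3))
```

In this made-up data only the third examinee changes classification (pass on the first occasion, fail on the second), so five of the six decisions are consistent.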

Internal Consistency

The internal consistency measure of reliability is frequently used for norm-referenced tests (NRTs). This method has the advantage that it can be carried out using a single form given at a single administration. The internal consistency method estimates how well the set of items on a test correlate with one another; that is, how similar the items on a test form are to one another. Many test analysis software programs produce this reliability estimate automatically. However, two common differences between NRTs and CRTs make this method of reliability estimation less useful for CRTs. First, because CRTs are typically designed to have a much narrower range of item difficulty and of examinee scores, the value of the reliability estimate will tend to be lower. Additionally, CRTs are often designed to measure a broader range of content; this results in a set of items that are not necessarily closely related to each other. This aspect of CRT test design will also produce a lower reliability estimate than would be seen on a typical NRT.
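The text above does not name a particular internal consistency coefficient. One widely used choice is Cronbach's alpha, so the sketch below assumes that coefficient; the function name and the response data are invented for illustration.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; rows are examinees, columns are items."""
    k = len(item_scores[0])  # number of items
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item across examinees
    item_variances = [
        pvariance([row[i] for row in item_scores]) for i in range(k)
    ]
    # Variance of the examinees' total scores
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings of five examinees on four items (scale 1 to 5)
responses = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

When the items rise and fall together across examinees, the total-score variance is large relative to the sum of the item variances and alpha approaches 1. A narrow score range or loosely related items, as described for CRTs, pushes the value down.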

3.     Inter-Rater Reliability

This is used when scores on the test are independent estimates by two or more judges or raters. Reliability is estimated as the correlation of the ratings of one judge with those of another, especially in measuring speaking or writing ability. All of the methods for estimating reliability discussed thus far are intended to be used for objective tests. When a test includes performance tasks, or other items that need to be scored by human raters, then the reliability of those raters must be estimated. This reliability method asks the question, “If multiple raters scored a single examinee’s performance, would the examinee receive the same score?” Inter-rater reliability provides a measure of the dependability or consistency of scores that might be expected across raters.
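With more than two raters, one simple possibility is to correlate every pair of raters and look at the pairwise values and their average. The sketch below assumes that approach; the rater names and ratings are hypothetical, and other indices could equally be used.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical ratings (scale 1 to 5) given by three raters
# to the same five speaking performances
ratings = {
    "rater_a": [4, 3, 5, 2, 4],
    "rater_b": [4, 2, 5, 3, 4],
    "rater_c": [5, 3, 4, 2, 4],
}

# Correlate each pair of raters
pairwise = {
    (first, second): pearson_r(ratings[first], ratings[second])
    for first, second in combinations(ratings, 2)
}
mean_inter_rater = sum(pairwise.values()) / len(pairwise)

for pair, r in pairwise.items():
    print(pair, round(r, 3))
print("mean:", round(mean_inter_rater, 3))
```

A pair with a noticeably lower correlation than the others may point to a rater who interprets the scoring criteria differently.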

4.    Split-Half Reliability

A test is given once and divided into halves that are scored separately; the scores on one half of the test are then compared with the scores on the other half to estimate reliability (Kaplan & Saccuzzo, 2001). Split-half reliability is a useful measure when it is impractical or undesirable to assess reliability with two tests or to have two test administrations (because of limited time or money) (Cohen & Swerdlik, 2001).

How do I use split-half? First, divide the test into halves. The most commonly used way to do this is to assign odd-numbered items to one half of the test and even-numbered items to the other; this is called odd-even reliability. Second, find the correlation of scores between the two halves using the Pearson r formula. Third, adjust the correlation using the Spearman-Brown formula, which raises the estimate to the value expected for the full-length test. The longer the test, the more reliable it tends to be, so the correlation between two half-length tests understates the reliability of the whole test; this is why the Spearman-Brown formula needs to be applied to a test that has been shortened, as is done in split-half reliability (Kaplan & Saccuzzo, 2001).
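The three steps can be sketched as follows. The item responses are made up (1 = correct, 0 = incorrect), the function names are only illustrative, and the Spearman-Brown correction is used in its usual form for doubling test length, 2r / (1 + r).

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def spearman_brown(half_test_r):
    """Step up a half-test correlation to the full test length."""
    return 2 * half_test_r / (1 + half_test_r)


# Hypothetical responses: six examinees (rows) by six items (columns)
item_responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
]

# Step 1: odd-numbered items (1st, 3rd, 5th) vs. even-numbered items
odd_half = [sum(row[0::2]) for row in item_responses]
even_half = [sum(row[1::2]) for row in item_responses]

# Step 2: correlate the two half-test scores
half_test_r = pearson_r(odd_half, even_half)

# Step 3: adjust with the Spearman-Brown formula
split_half_reliability = spearman_brown(half_test_r)

print(round(half_test_r, 3), round(split_half_reliability, 3))
```

For any positive half-test correlation below 1, the adjusted value is higher than the raw correlation, which reflects the point that the full test is longer than either half.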

5.    Kuder-Richardson 20 and Kuder-Richardson 21

Kuder-Richardson 20 permits us to arrive at a single final estimate of reliability from one test administration without having to compute a reliability estimate for every possible split-half combination. Kuder-Richardson 21 is a simpler formula that needs only the mean and variance of the total scores, but it is less accurate, as it slightly underestimates the actual reliability. Both tend to provide distorted estimates when the test contains only a few items.
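A minimal sketch of both formulas for items scored right or wrong is given below. It uses the standard textbook forms, with k the number of items, p the proportion answering an item correctly, and M and the variance taken from the total scores. The data and function names are hypothetical, and the population variance is used throughout as an assumption.

```python
from statistics import pvariance


def kr20(item_responses):
    """KR-20 for dichotomous items; rows are examinees, columns are items."""
    n = len(item_responses)
    k = len(item_responses[0])
    total_variance = pvariance([sum(row) for row in item_responses])
    if k < 2 or total_variance == 0:
        raise ValueError("need at least two items and varying total scores")
    # Sum over items of p * q, where p is the proportion correct and q = 1 - p
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


def kr21(total_scores, k):
    """KR-21; needs only the total scores and the number of items k."""
    mean = sum(total_scores) / len(total_scores)
    variance = pvariance(total_scores)
    if k < 2 or variance == 0:
        raise ValueError("need at least two items and varying total scores")
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))


# Hypothetical responses: six examinees (rows) by six items (columns)
item_responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
total_scores = [sum(row) for row in item_responses]

kr20_value = kr20(item_responses)
kr21_value = kr21(total_scores, k=6)
print(round(kr20_value, 3), round(kr21_value, 3))
```

KR-21 treats all items as if they were equally difficult, which is why it comes out at or below KR-20 on the same data.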

Reliability and Test Design

Sources of error due to the way in which the test is designed, which produce unreliable measures, include:

• Differences in the interpretation of the results of the instrument

• The length of the instrument

These errors can be reduced by designing tests that can be scored objectively and by ensuring that the instrument is long enough, because short tests are usually not very reliable (Hopkins, 1998). Once a test is designed, the evaluator can use one of the methods described above to determine its reliability. Selecting the method may depend on the type and administration of the measure. For example, if the measure is one that requires an instructor assessment, the test designer or evaluator conducts a test for inter-rater reliability to make sure that the measure is not sensitive to the subjectivity of the rater. Ideally, if two different raters measure the same individual, they will arrive at identical, or acceptably close, results. If the measure is a written test, the split-half or inter-item methods may be the most appropriate. The designer or evaluator can use the test-retest method successfully in corporate settings where subject matter experts are brought in at one point to take the test and again two weeks later to take the same test.

Conclusion

Reliability is one of the most important elements of test quality. It has to do with the consistency, or reproducibility, of an examinee’s performance on the test, and therefore with the accuracy of measurement. Reliability is a necessary characteristic of any good test. If the test is administered to the same learners on different occasions (with no language practice taking place between these occasions), then to the extent that it produces similar results, it is considered reliable. In short, in order to be reliable, a test must be consistent in its measurements.

Hennings states that reliability is a measure of the accuracy, consistency, dependability, or fairness of the scores resulting from the administration of a particular examination. He further introduces some threats to reliability in testing, including fluctuation in the learners, fluctuation in scoring, fluctuation in test administration, test characteristics affecting reliability, and threats to reliability arising from the characteristics of the examinees’ responses.

There are some common methods for computing reliability. They involve statistical calculation (correlations and variances of scores), which requires extra work from teachers or test constructors in collecting the scores and calculating them. These methods are the test-retest method, the parallel forms method, inter-rater reliability, split-half reliability, and Kuder-Richardson 20 and Kuder-Richardson 21. If a learning measure is reliable, it is consistent over time: if no further learning has taken place, a reliable measure will yield the same, or nearly the same, student score on a second administration. The evaluator will want to do whatever is possible to ensure that testing measures are reliable, so that the scores from one test administration can be compared to those of subsequent administrations.
