03 Apr
Assessing the reliability of a scale focuses on measuring its stability (Kline, 2005). Stability can mean different things in a measure: test-retest reliability looks at the stability of scores over time, internal consistency homes in on the stability of item scores across the test, and interrater reliability ensures the stability of ratings across the raters of a scale.
Test-retest reliability measures stability by testing the same participants at two or more points in time. For example, participants might complete a personality assessment on two different dates; because personality is relatively stable, retesting at a later date is an effective way to gauge the consistency of the measure. Once both administrations are complete, a zero-order correlation between the two sets of scores indexes the stability across attempts (Kline, 2005). This method is best suited to traits that are relatively consistent over time. Kline (2005) also notes that the interval between administrations should be the longest one possible without undue cost. Because a single test is used, this approach saves money and ensures the construct is measured the same way on both occasions. Drawbacks include participants becoming sensitized to the items, attrition at retest, the expense of administering the test twice, and an inappropriate time interval distorting the reliability index.
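The zero-order correlation between two administrations can be computed directly. Below is a minimal sketch using only the Python standard library; the score lists are hypothetical illustration data, not values from Kline (2005) or any study.

```python
# Sketch of test-retest reliability: the zero-order (Pearson) correlation
# between scores from two administrations of the same test.
from math import sqrt

def pearson_r(x, y):
    """Zero-order Pearson correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

time1 = [12, 15, 9, 20, 17, 11]   # hypothetical scores, first administration
time2 = [13, 14, 10, 19, 18, 12]  # same participants at retest
print(round(pearson_r(time1, time2), 3))
```

A value near 1.0 (here about .98) would indicate that participants kept roughly the same relative standing across the two attempts, which is what "stability over time" means operationally.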
Alternative forms reliability builds on test-retest reliability but administers different forms of the test to participants. For example, a researcher measuring intelligence would use similar but distinct tests for each attempt and then compare the scores. Kline (2005) notes that this form of reliability helps with carryover effects and situational changes, but it has its own complications. It can be difficult to construct two forms comparable enough that their scores can be confidently correlated. Kline (2005) notes that researchers can use item response theory (IRT) to establish item and test equivalency. As with test-retest reliability, this method can be costly to carry out, and attrition remains a concern.
Testing the reliability of raters is important because everyone brings their own biases into a situation, which can affect how they rate or code. Instances involving raters range in seriousness from a professor grading a journal assignment to a professional diagnosing a mental illness. The appeal of this type of reliability is that it does not rest everything on one rater, who might be wrong, miscode something, or be biased in some way, all of which alter the outcomes. Consistency across raters helps build confidence in the findings. There are many options for testing the reliability of raters. Hilton et al. (2024) used intraclass correlations and Pearson correlations to compare police officers and researchers as raters, and the two groups yielded comparable results. The authors noted fair-to-good interrater reliability, which helps inform the findings of the study and gives context for when police officers code the behaviors of people they interact with. Kline (2005) suggests using observer agreement percentages (the percentage of observations agreed on by both judges), Kendall's coefficient (used when judges rank-order stimuli), and Cohen's kappa coefficient (used when judges place stimuli into nominal categories), among others.
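Two of the indices Kline (2005) names, the observer agreement percentage and Cohen's kappa, are easy to sketch for the two-judge nominal-category case. The judge codes below are made-up illustration data; this is a minimal stdlib-only sketch, not the procedure from either cited source.

```python
# Sketch of two interrater indices for two judges coding the same cases
# into nominal categories: raw percent agreement and Cohen's kappa.
from collections import Counter

def percent_agreement(r1, r2):
    """Share of observations both judges coded identically."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(r1)
    p_o = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Expected chance agreement from each judge's marginal category rates
    p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in c1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical nominal codes from two judges rating the same 8 cases
judge1 = ["A", "A", "B", "B", "C", "A", "B", "C"]
judge2 = ["A", "B", "B", "B", "C", "A", "A", "C"]
print(percent_agreement(judge1, judge2), round(cohens_kappa(judge1, judge2), 3))
```

Note how kappa (about .62 here) falls below the raw agreement (.75): it discounts the agreement the two judges would reach by chance alone, which is why it is preferred over simple percent agreement for nominal codes.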
Reliability is a crucial factor: there must be consistency in a measure for there to be confidence in the findings. Researchers need to be clear about which assessments they used to evaluate reliability, and this transparency in turn gives readers confidence in what they are reading and in the study's findings. Discussing the different types of reliability is valuable because each one highlights a distinct aspect of consistency and a distinct way to evaluate it.
References
Hilton, N. Z., Hanson, R. K., Campbell, M. A., & Jung, S. (2024). Police and researcher use of the Ontario Domestic Assault Risk Assessment (ODARA): Interrater agreement and examination of published norms. Journal of Threat Assessment and Management. https://dx.doi.org/10.1037/tam0000239
Kline, T. J. B. (2005). Psychological testing: A practical approach to design and evaluation. Sage Publications.