I have previously shared several posts written by and with Greg Pope, Analytics and Psychometrics Manager for Questionmark. What I really like about Greg is his ability to communicate statistics and psychometrics in a manner that all of us can understand. For example, today's post is about interpreting test scores, but he also applies the same thought process to polling data that we see every night on the news or in the daily newspaper.
I think it is extremely important that we take great care when explaining test scores to examinees and parents, so I hope you will take a few minutes to read the following post by Greg:
I have always been a fan of confidence intervals. Some people are fans of sports teams; for me, it’s confidence intervals! I find them really useful in assessment reporting contexts, all the way from item and test analysis psychometrics to participant reports.
Many of us get exposure to the practical use of confidence intervals via the media, when survey results are quoted. For example: “Of the 1,000 people surveyed, 55% said they will vote for John Doe. The margin of error for the survey was plus or minus 5%, 95 times out of 100.” This is saying that the “observed” percentage of people who say they will vote for Mr. Doe is 55%, and there is a 95% chance that the “true” percentage of people who will vote for John Doe is somewhere between 50% and 60%.
Sample size is a big factor in the margin of error: generally, the larger the sample, the smaller the margin of error, because we get closer to representing the population. (We can’t survey all 307,006,550 or so people in the US, can we!) So if the sample were 10,000 instead of 1,000, we would expect the margin of error to be smaller than plus or minus 5%.
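To make the sample-size effect concrete, here is a minimal Python sketch using the normal-approximation (Wald) interval for a proportion. The exact width depends on the formula and confidence level chosen, so the numbers it prints are illustrative rather than a reproduction of the 5% figure quoted above.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Normal-approximation (Wald) margin of error for a proportion
    at roughly 95% confidence (z = 1.96)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

for n in (1_000, 10_000):
    moe = margin_of_error(0.55, n)
    print(f"n = {n:>6}: 55% +/- {moe:.1%}")
# The margin of error shrinks roughly in proportion to 1/sqrt(n),
# so a tenfold larger sample gives an interval about 3 times narrower.
```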
These concepts are relevant in an assessment context as well. You may remember my previous post on Classical Test Theory and reliability, in which I explained that an observed test score (the score a participant achieves on an assessment) is composed of a true score and error. In other words, the observed score that a participant achieves is not 100% accurate; there is always error in the measurement. What this means practically is that if a participant achieves 50% on an exam, their true score could actually be somewhere between, say, 44% and 56%.
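One common way to express this uncertainty in Classical Test Theory is to build a confidence interval from the standard error of measurement (SEM), computed from the test's score standard deviation and reliability. The sketch below uses hypothetical values (a standard deviation of 10 percentage points and a reliability of 0.91) purely to illustrate the calculation.

```python
import math

def score_confidence_interval(observed, sd, reliability, z=1.96):
    """Classical test theory: SEM = SD * sqrt(1 - reliability);
    the confidence interval is observed score +/- z * SEM."""
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

# Hypothetical test: SD of 10 percentage points, reliability of 0.91
low, high = score_confidence_interval(observed=50, sd=10, reliability=0.91)
print(f"Observed score 50%, 95% CI roughly {low:.0f}% to {high:.0f}%")
```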
This notion that observed scores are not absolute has implications for verifying what participants know and can do. For example, a participant who achieves 50% on a crane certification exam (on which the pass score is 50%) would pass the exam and be able to hop into a crane, moving stuff up, down, and around. However, a score right on the borderline means this person might not, in fact, know enough to pass the exam, and be certified to operate a crane, if he or she were to take it again. His/her supervisor might not feel very confident about letting this person operate that crane!
To deal with the inherent uncertainty around observed scores, some organizations factor this margin of error in when setting the cut score…but that is another fun topic, one I touched on in another post. I believe a best practice is to incorporate a confidence interval into the reporting of scores for participants, in order to recognize that the score is not an “absolute truth” but rather an estimate of what a person knows and can do. A simple example of a participant report I created to demonstrate this shows a diamond that encapsulates the participant’s score; the vertical height of the diamond represents the confidence interval around that score.
In some of my previous posts I talked about how sample size affects the robustness of item-level statistics like p-values and item-total correlation coefficients, and provided graphics showing the confidence interval ranges for those statistics at various sample sizes. I believe confidence intervals are also very useful in this psychometric context of evaluating the performance of items and tests. For example, when we see a p-value of 0.600 for a question, we often incorrectly accept as “truth” that 60% of participants got the question right. In actual fact, this p-value of 0.600 is an observation, and the “true” p-value could be anywhere between, say, 0.500 and 0.700, a big difference when we are carefully choosing questions to shape our assessment!
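Since an item p-value is just the proportion of participants who answered the question correctly, the same proportion-interval idea applies. The sketch below assumes a hypothetical sample of 100 participants; with a sample that small, the interval around an observed p-value of 0.600 is quite wide.

```python
import math

def p_value_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for an item p-value
    (the proportion of participants answering the item correctly)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical: 100 participants attempted the item, 60 answered correctly
low, high = p_value_interval(0.600, 100)
print(f"Observed p-value 0.600, 95% CI roughly {low:.3f} to {high:.3f}")
```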
With the holiday season fast approaching, perhaps Santa has a confidence interval in his sack for you and your organization to apply to your assessment results reporting and analysis!
Related posts:
Standard Setting: An Introduction
Should I include really easy or really hard questions on my assessments?
How the sample of participants being tested affects item analysis information