Working on Evidence for ESSA has been a wonderful educational experience for those of us involved. The problem is that the Every Student Succeeds Act (ESSA) evidence standards allow the evidence supporting a given program to be rated “strong,” “moderate,” or “promising” based on positive effects in a single well-conducted study. While these standards are a major step forward overall, this definition opens the troubling possibility that programs could be validated on the basis of flawed studies, and specifically on the basis of outcomes on measures that are not valid for this purpose. For this reason, we’ve excluded measures that are too closely aligned with the experimental program, such as measures made by the developers or researchers themselves.
In the course of doing the research reviews behind Evidence for ESSA, we’ve realized that there is a special category of measures that is also problematic. These are measures that are not made by researchers or developers, but are easy to teach to. A good example from the old days is verbal analogies on the Scholastic Aptitude Test (e.g., Big : Small :: Wet : _?_). (Answer: dry)
These are no longer used on the SAT, presumably because the format is unusual and can be taught, giving an advantage to students who are coached for the SAT rather than to those who actually know content useful in post-secondary education.
One key example of over-teachable measures is tests given in one minute, such as DIBELS. These are popular because they are inexpensive and brief, and they may be useful as benchmark tests. But as measures for highly consequential purposes, such as anything relating to student or teacher accountability, or program evaluation under the ESSA evidence standards, one-minute tests are not appropriate, even if they correlate well with longer, well-validated measures. The problem is that such measures are overly teachable. In the case of DIBELS, for example, students can be drilled to correctly pronounce as many words as possible in a minute, ignoring punctuation and meaning.
Another good example is elision tests in early reading (how would you say “bird” without the /b/?). Elision tests present an unusual task that children are unlikely to have encountered unless their teacher is specifically preparing them for an elision test.
Using over-teachable measures is a bit like holding teachers or schools or programs accountable for their students’ ability to do headstands. In PE, measures of the time taken to run a 100-yard dash, ability to lift weights, or percentage of successful free throws in basketball would be legitimate, because these are normal parts of a PE program, and they assess general strength, speed, and muscular control. But headstands are relatively unusual as a focus in PE, and are easily taught and practiced. A PE program should not be able to meet the ESSA evidence standards based on students’ ability to do headstands, because this is not a crucial skill and because the control group would be unlikely to spend much time on it.
Over-teachable measures have an odd but interesting property. When they are first introduced, there is nothing wrong with them. They may be reliable, valid, correlated with other measures, and show other solid psychometric properties. But the day the very same measure is used to evaluate students, teachers, schools, or programs, its over-teachability may come into play, and its validity for that particular purpose may no longer be acceptable.
There’s nothing wrong with headstands per se. They have long been a part of gymnastics. But when the ability to do headstands becomes a major component of overall PE evaluation, it turns the whole concept… well, it turns the whole concept of evaluation on its head. The same can be said about other over-teachable measures.
This blog is sponsored by the Laura and John Arnold Foundation.