Faithful readers of this blog, and followers of the Best Evidence Encyclopedia (BEE), will know that I am always cautioning readers of program evaluations to pay no attention to findings from measures that are aligned with the experimental treatment but not the control treatment. For example, when researchers teach a set of vocabulary words to the experimental students (but not the controls), it is not surprising to find strong impacts on a test of those very words. Unfortunately, this happens all too often, but we carefully winnow such measures out of our BEE reviews.
In a recent paper written with my colleague Alan Cheung, we looked at 645 studies accepted across all BEE reviews done so far to find out which methodological factors are associated with excessive, improbable effect sizes. In an earlier blog I wrote about the profound impact of sample size: small studies get (improbably) big effect sizes.
Another important factor, however, was the use of experimenter-made measures. Even after our careful, conservative weeding out of studies with over-aligned measures, we were surprised to find that effect sizes on measures made by experimenters were twice as high as effect sizes on measures made by someone else (usually standardized tests).
It may be going too far to suggest that no one should ever use or accept experimenter-made measures, no matter how fair they appear to be to the experimental and control groups. However, our findings do suggest that we need to be very cautious in accepting experimenter-made measures. Standardized tests are far from perfect, but they are almost always fair to experimental and control groups, as control teachers can be assumed to be trying as hard as experimental teachers to improve outcomes on these measures. This may not be so on experimenter-made tests.
I’m all for do-it-yourself cooking, home repairs, and other projects. But when it comes to do-it-yourself educational measurement, let the reader beware!