How Biased Measures Lead to False Conclusions


One hopeful development in evidence-based reform in education is the improvement in the quality of evaluations of educational programs. Because of policies and funding provided by the Institute of Education Sciences (IES) and Investing in Innovation (i3) in the U.S. and by the Education Endowment Foundation (EEF) in the U.K., most evaluations of educational programs today use far better procedures than was true as recently as five years ago. Experiments are likely to be large, to use random assignment or careful matching, and to be carried out by third-party evaluators, all of which give (or should give) educators and policy makers greater confidence that evaluations are unbiased and that their findings are meaningful.

Despite these positive developments, there remain serious problems in some evaluations. One of these relates to measures that give the experimental group an unfair advantage.

There are several ways in which measures can unfairly favor the experimental group. The most common is where measures are made by the creator of the program and are precisely aligned with the curriculum taught in the experimental group but not the control group. For example, a developer might reason that a new curriculum represents what students should be taught in, say, science or math, so it’s all right to use a measure aligned with the experimental program. However, use of such measures gives a huge advantage to the experimental group. In an article published in the Journal of Research on Educational Effectiveness, Nancy Madden and I looked at effect sizes for such over-aligned measures among studies accepted by the What Works Clearinghouse (WWC). In reading, we found an average effect size of +0.51 for over-aligned measures, compared to an average of +0.06 for measures that were fair to the content taught in experimental and control groups. In math, the difference was +0.45 for over-aligned measures, -0.03 for fair ones. These are huge differences.

A special case of over-aligned measures takes place when content is introduced earlier than usual in students’ progression through school in the experimental group, but not the control group. For example, if students are taught first-grade math skills in kindergarten, they will of course do better on a first grade test (in kindergarten) than will students not taught these skills in kindergarten. But will the students still be better off by the end of first grade, when all have been taught first grade skills? It’s unlikely.

One more special case of over-alignment takes place in relatively brief studies when students are pre-tested, taught a given topic, and then post-tested, say, eight weeks later. The control group, however, might have been taught that topic earlier or later than that eight-week period, or might have spent much less than 8 weeks on it. In a recent review of elementary science programs, we found many examples of this, including situations in which experimental groups were taught a topic such as electricity during an experiment, while the control group was not taught about electricity at all during that period. Not surprisingly, these studies produce very large but meaningless effect sizes.

As evidence becomes more important in educational policy and practice, we researchers need to get our own house in order. Insisting on the use of measures that are not biased in favor of experimental groups is a major necessity in building a body of evidence that educators can rely on.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s