In the 1984 mockumentary This is Spinal Tap, there is a running joke about a hapless band, Spinal Tap, which proudly bills itself as “Britain’s Loudest Band.” A pesky reporter keeps asking the band’s leader, “But how can you prove that you are Britain’s loudest band?” The band leader explains, with declining patience, that while ordinary amplifiers’ sound controls only go up to 10, Spinal Tap’s go up to 11. “But those numbers are arbitrary,” says the reporter. “They don’t mean a thing!” “Don’t you get it?” asks the band leader. “ELEVEN is more than TEN! Anyone can see that!”
In educational research, we have an ongoing debate reminiscent of Spinal Tap. Educational researchers speaking to other researchers invariably express the impact of educational treatments as effect sizes (the difference in adjusted means for the experimental and control groups divided by the unadjusted standard deviation). All else being equal, higher effect sizes are better than lower ones.
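For readers who like to see the arithmetic, the computation behind that parenthetical is simple. Here is a minimal sketch in Python, using made-up numbers rather than data from any study discussed here.

```python
# Effect size as described above: the difference in adjusted means for
# the experimental and control groups, divided by the unadjusted
# standard deviation. All numbers below are made up for illustration.

def effect_size(adj_experimental_mean, adj_control_mean, unadjusted_sd):
    return (adj_experimental_mean - adj_control_mean) / unadjusted_sd

# A 2-point adjusted difference on a test with a standard deviation of 25
print(effect_size(502.0, 500.0, 25.0))  # 0.08
```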
However, educators who are not trained in statistics often despise effect sizes. “What do they mean?” they ask. “Tell us how much difference the treatment makes in student learning!”
Researchers want to be understood, so they try to translate effect sizes into more educator-friendly equivalents. The problem is that the friendlier the units, the more statistically problematic they are. The friendliest of all is “additional months of learning.” Researchers or educators can look at a chart and, for any particular effect size, find the number of “additional months of learning.” The Education Endowment Foundation in England, which funds and reports on rigorous experiments, reports both effect sizes and additional months of learning, and provides tables to help people make the conversion. But here’s the rub. A recent article by Baird and Pane (2019) compared additional months of learning to three other translations of effect sizes. Additional months of learning was rated highest in ease of use, but lowest in four other categories, such as transparency and consistency. For example, a month of learning clearly has a different meaning in kindergarten than it does in tenth grade.
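To see why consistency suffers, consider the arithmetic. “Additional months of learning” divides an effect size by how much a typical student grows in a month, and that denominator changes dramatically with grade level. The sketch below is purely illustrative; the annual-gain figures are placeholders, not the EEF’s actual conversion table.

```python
# Why "a month of learning" is not a stable unit: the same effect size
# yields very different month counts depending on typical annual growth.
# The annual gains below (in SD units) are illustrative placeholders,
# not figures from the EEF or any other published table.

def months_of_learning(effect_size, annual_gain_sd, months_per_year=10):
    return effect_size / (annual_gain_sd / months_per_year)

es = 0.10
print(months_of_learning(es, annual_gain_sd=1.5))  # young children: ~0.7 months
print(months_of_learning(es, annual_gain_sd=0.2))  # high schoolers: ~5 months
```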
The other translations rated higher by Baird and Pane were, at least to me, just as hard to understand as effect sizes. For example, the What Works Clearinghouse presents, along with effect sizes, an “improvement index” that has the virtue of being equally incomprehensible to researchers and educators.
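For the record, my understanding of the improvement index is that it is the expected change in percentile rank for an average control-group student had that student received the treatment, which works out to 100 times the normal cumulative probability of the effect size, minus 50. A quick sketch, in case that helps (or merely confirms the incomprehensibility):

```python
# The WWC "improvement index," as I understand it: the expected change in
# percentile rank for an average control student given the treatment,
# i.e., 100 * Phi(effect size) - 50, where Phi is the normal CDF.
from statistics import NormalDist

def improvement_index(effect_size):
    return 100 * NormalDist().cdf(effect_size) - 50

print(round(improvement_index(0.08), 1))  # about +3 percentile points
print(round(improvement_index(0.25), 1))  # about +10 percentile points
```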
On the one hand, arguing about outcome metrics is as silly as arguing about the relative virtues of Fahrenheit and Celsius. If one can be directly transformed into the other, who cares?
However, additional months of learning is often used to cover up very low effect sizes. I recently ran into an example of this in a series of studies by the Stanford Center for Research on Education Outcomes (CREDO), in which disadvantaged urban African American students in charter schools gained 59 more “days of learning” in math than matched students not in charters, and 44 more days in reading. These numbers were cited in an editorial praising charter schools in the May 29, 2019, Washington Post.
However, these “days of learning” are misleading. The effect sizes for this same comparison were only +0.08 for math and +0.06 for reading. Any researcher will tell you that these are very small effects. They were only made to look big by reporting the gains in days, a transformation that magnifies the apparent differences and obscures how unstable such small effects are. Would it interest you to know that White students in urban charter schools performed 36 days a year worse in math than matched students (ES = -0.05) and 14 days worse in reading (ES = -0.02)? How about Native American students in urban charter schools, whose scores were 70 days worse in math than those of matched students in non-charters (ES = -0.10), and equal in reading?

I wrote about charter school studies in a recent blog. In that blog, I did not argue that charter schools are effective for disadvantaged African Americans but harmful for Whites and Native Americans; that seems unlikely. What I did argue is that the effects of charter schools are so small that the directions of the effects are unstable. The overall effects across all urban charter schools studied were only 40 days (ES = +0.055) in math and 28 days (ES = +0.04) in reading. These effects look big because of the “days of learning” transformation, but they are not.
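For the curious, the arithmetic behind these day counts appears to assume that roughly a quarter of a standard deviation corresponds to a 180-day school year of learning. That is my back-of-the-envelope reconstruction, not CREDO’s published formula, but it reproduces the reported figures closely and shows how tiny effects become impressive-sounding day counts.

```python
# Back-of-the-envelope conversion from effect size to "days of learning,"
# assuming ~0.25 SD corresponds to one 180-day school year. This is my
# reconstruction, not CREDO's published formula.

def days_of_learning(effect_size, sd_per_year=0.25, days_per_year=180):
    return effect_size / sd_per_year * days_per_year

for es in (0.08, 0.06, -0.05, -0.02, -0.10):
    print(f"ES {es:+.2f} -> {days_of_learning(es):+.0f} days")
# ES +0.08 -> +58 days, close to the 59 reported for math; very small
# effect sizes turn into day counts that sound substantial.
```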
In This is Spinal Tap, the argument about whether or not Spinal Tap is Britain’s loudest band is absurd. Any band can turn its amplifiers to the top and blow out everyone’s eardrums, whether the top is marked eleven or ten. In education, however, it does matter a great deal that educators are taking evidence into account in their decisions about educational programs. Using effect sizes, perhaps supplemented by additional months of learning, is one way to help readers understand outcomes of educational experiments. Using “days of learning,” however, is misleading, making very small impacts look important. Why not additional hours or minutes of learning, while we’re at it? Spinal Tap would be proud.
References
Baird, M., & Paine, J. (2019). Translating standardized effects of education programs into more interpretable metrics. Educational Researcher. Advance online publication. doi.org/10.3102/0013189X19848729
CREDO (2015). Overview of the Urban Charter School Study. Stanford, CA: Author.
Denying poor children a chance [Editorial]. (2019, May 29). The Washington Post, p. A16.
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.