One of the most popular exhibits in the Smithsonian Museum of Natural History is the Hope Diamond, one of the largest and most valuable diamonds in the world. It’s always fun to see all the kids flow past it saying how wonderful it would be to own the Hope Diamond, how beautiful it is, and how such a small thing could make you rich and powerful.
The diamonds are at the end of the Hall of Minerals, which is crammed full of exotic minerals from all over the world. These are beautiful, rare, and amazing in themselves, yet most kids rush past them to get to the diamonds. But no one, ever, evaluates the minerals against one another according to their size. No one ever says, “you can have your Hope Diamond, but I’d rather have this giant malachite or feldspar.” Just getting into the Smithsonian, kids go by boulders on the National Mall far larger than anything in the Hall of Minerals, perhaps climbing on them but otherwise ignoring them completely.
Yet in educational research, we often focus on the size of study effects without considering their value. In a recent blog, I presented data from a paper with my colleague Alan Cheung analyzing effect sizes from 611 studies evaluating reading, math, and science programs, K-12, that met the inclusion standards of our Best Evidence Encyclopedia. One major finding was that in randomized evaluations with sample sizes of 250 students (10 classes) or more, the average effect size across 87 studies was only +0.11. Smaller randomized studies had effect sizes averaging +0.22, large matched quasi-experiments +0.17, and small quasi-experiments, +0.32. In this blog, I want to say more about how these findings should make us think differently about effect sizes as we increasingly apply evidence to policy and practice.
Large randomized experiments (RCTs) with significant positive outcomes are the diamonds of educational research: rare, often flawed, but incredibly valuable. The reason they are so valuable is that such studies are the closest indication of what will happen when a given program goes out into the real world of replication. Randomization removes the possibility that self-selection may account for program effects. The larger the sample size, the less likely it is that the experimenter or developer could closely monitor each class and mentor each teacher beyond what would be feasible in real-life scale-up. Most large-scale RCTs use clustering, which usually means that the treatment and randomization take place at the level of the whole school. A cluster randomized experiment at the school level might require recruiting 40 to 50 schools, perhaps serving 20,000 to 25,000 students. Yet such studies might still be too “small” to detect an effect size of, say, 0.15, because it is the number of clusters, not the number of students, that matters most!
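To see why clusters dominate, consider a rough sketch of the standard design-effect power calculation for a two-arm cluster RCT. The specific numbers here are illustrative assumptions, not figures from the studies discussed: an intraclass correlation of 0.05 (roughly what covariate adjustment might leave at the school level) and 500 students per school.

```python
from math import ceil
from statistics import NormalDist

def schools_needed(delta, icc, m, alpha=0.05, power=0.80):
    """Approximate total schools for a two-arm cluster RCT.

    Uses the standard design-effect approximation: the variance of
    the standardized treatment-effect estimate is roughly
    4 * (1 + (m - 1) * icc) / (J * m), where J is the total number
    of schools and m is students per school.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = nd.inv_cdf(power)
    deff = 1 + (m - 1) * icc              # design effect
    return ceil(4 * deff * (z_alpha + z_beta) ** 2 / (m * delta ** 2))

# Illustrative (assumed) parameters: icc=0.05, 500 students/school
print(schools_needed(delta=0.20, icc=0.05, m=500))   # 41 schools
print(schools_needed(delta=0.15, icc=0.05, m=500))   # 73 schools
print(schools_needed(delta=0.15, icc=0.05, m=1000))  # 72 schools
```

Under these assumptions, detecting +0.20 takes about 41 schools, consistent with the 40-to-50-school studies described above, while +0.15 takes about 73. Notice the last line: doubling the students per school saves almost nothing, because once clusters are large, only adding more clusters meaningfully shrinks the standard error.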
The problem is that we have been looking for much larger effect sizes, and all too often not finding them. Traditionally, researchers recruit enough schools or classes to reliably detect an effect size as small as +0.20. This means that many studies report effect sizes that turn out to be larger than average for large RCTs, but are not statistically significant (at p<.05) because they fall short of +0.20. If researchers did recruit samples of schools large enough to detect an effect size of +0.15, this would greatly increase the costs of such studies. Large RCTs are already very expensive, so substantially increasing sample sizes could require resources far beyond what educational research is likely to see any time soon, or could greatly reduce the number of studies that are funded.
These issues have taken on greater importance recently due to the passage of the Every Student Succeeds Act, or ESSA, which encourages use of programs that meet strong, moderate, or promising levels of evidence. The “strong” category requires that a program have at least one randomized experiment that found a significant positive effect. Such programs are rare.
If educational researchers were mineralogists, we’d be pretty good at finding really big diamonds, but the little, million-dollar diamonds, not so much. This makes no sense in diamonds, and no sense in educational research.
So what do we do? I’m glad you asked. Here are several ways we could proceed to increase the number of programs successfully evaluated in RCTs.
1. For cluster randomized experiments at the school level, something has to give. I’d suggest that for such studies, the p value should be increased to .10 or even .20. A p value of .05 is a long-established convention, indicating that there is only one chance in 20 of seeing an effect this large if the program actually made no difference. Yet one chance in 10 (p=.10) may be sufficient in studies likely to have tens of thousands of students.
2. For studies in the past, as well as in the future, replication should be considered the same as large sample size. For example, imagine that two studies of Program X each have 30 schools. Each gets a respectable effect size of +0.20, which would not be significant in either case. Put the two studies together, however, and voila! The combined study of 60 schools would be highly significant, even at p=.05.
3. Government or foundation funders might fund evaluations in stages. The first stage might involve a cluster randomized experiment of, say, 20 schools, which is very unlikely to produce a significant difference. But if the effect size were perhaps 0.20 or more, the funders might fund a second stage of 30 schools. The two samples together, 50 schools, would be enough to detect a small but important effect.
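The pooling logic behind points 2 and 3 can be sketched with a standard inverse-variance (fixed-effect) combination of study results. The standard errors below are hypothetical values chosen only to illustrate the arithmetic of two 30-school studies, each finding +0.20:

```python
from math import sqrt
from statistics import NormalDist

def pool_fixed_effect(effects, ses):
    """Inverse-variance (fixed-effect) pooling of study effect sizes."""
    weights = [1 / se ** 2 for se in ses]
    pooled_es = sum(w * es for w, es in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    z = pooled_es / pooled_se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p value
    return pooled_es, pooled_se, p

# Two hypothetical 30-school studies of "Program X", each with an
# effect size of +0.20 and an assumed standard error of 0.12:
es, se, p = pool_fixed_effect([0.20, 0.20], [0.12, 0.12])
# Either study alone: z = 0.20 / 0.12, about 1.67, p near .10 --
# not significant. Pooled, the standard error shrinks by sqrt(2)
# and the same +0.20 clears p < .05.
```

The effect size does not change at all when the studies are combined; only the precision improves, which is exactly why replication can substitute for a single very large sample.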
One might well ask why we should be interested in programs that produce effect sizes of only 0.15. Aren’t these effects too small to matter?
The answer is that they are not. Over time, I hope we will learn how to routinely produce better outcomes. Already, we know that much larger impacts are found in studies of certain approaches emphasizing professional development (e.g., cooperative learning, metacognitive skills) and certain forms of technology. I hope and expect that over time, more studies will evaluate programs using methods like those that have been proven to work, and fewer will evaluate those that do not, thereby raising the average effects we find. But even as they are, small but reliable effect sizes are making meaningful differences in the lives of children, and will make much more meaningful differences as we learn from efforts at the Institute of Education Sciences (IES) and the Investing in Innovation (i3)/Education Innovation and Research (EIR) programs.
Small effect sizes from large randomized experiments are the Hope Diamonds of our profession. They also are the best hope for evidence-based improvements for all students.