Seeking Jewels, Not Boulders: Learning to Value Small, Well-Justified Effect Sizes

One of the most popular exhibits in the Smithsonian’s National Museum of Natural History is the Hope Diamond, one of the largest and most valuable diamonds in the world. It’s always fun to see all the kids flow past it saying how wonderful it would be to own the Hope Diamond, how beautiful it is, and how such a small thing could make you rich and powerful.

The diamonds are at the end of the Hall of Minerals, which is crammed full of exotic minerals from all over the world. These are beautiful, rare, and amazing in themselves, yet most kids rush past them to get to the diamonds. But no one, ever, judges the minerals by their size. No one ever says, “you can have your Hope Diamond, but I’d rather have this giant malachite or feldspar.” On their way into the Smithsonian, kids pass boulders on the National Mall far larger than anything in the Hall of Minerals, perhaps climbing on them but otherwise ignoring them completely.

Yet in educational research, we often focus on the size of study effects without considering their value. In a recent blog, I presented data from a paper with my colleague Alan Cheung analyzing effect sizes from 611 studies evaluating reading, math, and science programs, K-12, that met the inclusion standards of our Best Evidence Encyclopedia. One major finding was that in randomized evaluations with sample sizes of 250 students (10 classes) or more, the average effect size across 87 studies was only +0.11. Smaller randomized studies had effect sizes averaging +0.22, large matched quasi-experiments +0.17, and small quasi-experiments, +0.32. In this blog, I want to say more about how these findings should make us think differently about effect sizes as we increasingly apply evidence to policy and practice.

Large randomized experiments (RCTs) with significant positive outcomes are the diamonds of educational research: rare, often flawed, but incredibly valuable. The reason they are so valuable is that such studies are the closest indication of what will happen when a given program goes out into the real world of replication. Randomization removes the possibility that self-selection may account for program effects. The larger the sample size, the less likely it is that the experimenter or developer could closely monitor each class and mentor each teacher beyond what would be feasible in real-life scale up. Most large-scale RCTs use clustering, which usually means that the treatment and randomization take place at the level of the whole school. A cluster randomized experiment at the school level might require recruiting 40 to 50 schools, perhaps serving 20,000 to 25,000 students. Yet such studies might nevertheless be too “small” to detect an effect size of, say, 0.15, because it is the number of clusters, not the number of students, that matters most!

The problem is that we have been looking for much larger effect sizes, and all too often not finding them. Traditionally, researchers recruit enough schools or classes to reliably detect an effect size as small as +0.20. This means that many studies report effect sizes that turn out to be larger than average for large RCTs, but are not statistically significant (at p<.05), because they are less than +0.20. If researchers did recruit samples of schools large enough to detect an effect size of +0.15, this would greatly increase the costs of such studies. Large RCTs are already very expensive, so substantially increasing sample sizes could require resources far beyond what educational research is likely to see any time soon, or could greatly reduce the number of studies that are funded.
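To make the arithmetic concrete, here is a rough sketch of the standard power calculation for a school-level cluster randomized trial, written in Python. The intraclass correlation, the pretest adjustment, and the 500-students-per-school figure are illustrative assumptions rather than values from any particular study, but they show why adding schools matters so much more than adding students, and why moving the target from +0.20 to +0.15 gets expensive.

```python
# A back-of-the-envelope sketch of minimum detectable effect size (MDES) for a
# balanced two-arm cluster randomized trial, using a standard approximation.
# The ICC and pretest-R2 values below are illustrative assumptions, not figures
# from the studies discussed above.
from math import sqrt
from statistics import NormalDist

def mdes_cluster_rct(n_schools, students_per_school, icc=0.20,
                     r2_between=0.75, alpha=0.05, power=0.80):
    """Approximate MDES (in standard deviation units) for a school-level
    cluster RCT with half the schools treated and a school-mean pretest
    covariate explaining r2_between of the between-school variance."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)  # about 2.8
    var = 4 * (icc * (1 - r2_between) + (1 - icc) / students_per_school) / n_schools
    return z * sqrt(var)

# Clusters, not students, drive power: doubling students per school barely
# moves the MDES, while adding schools moves it a lot.
for schools in (20, 40, 50, 70, 100):
    print(schools, "schools:", round(mdes_cluster_rct(schools, 500), 2))
```

Under these assumptions, about 40 schools are enough to detect an effect size of +0.20, but it takes roughly 70 schools to detect +0.15. That is the cost escalation described above.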

These issues have taken on greater importance recently due to the passage of the Every Student Succeeds Act, or ESSA, which encourages use of programs that meet strong, moderate, or promising levels of evidence. The “strong” category requires that a program have at least one randomized experiment that found a significant positive effect. Such programs are rare.

If educational researchers were mineralogists, we’d be pretty good at finding really big diamonds, but the little, million-dollar diamonds, not so much. This makes no sense in diamonds, and no sense in educational research.

So what do we do? I’m glad you asked. Here are several ways we could proceed to increase the number of programs successfully evaluated in RCTs.

1. For cluster randomized experiments at the school level, something has to give. I’d suggest that for such studies, the threshold for statistical significance should be relaxed to p<.10 or even p<.20. A threshold of .05 is a long-established convention; it means accepting no more than one chance in 20 of declaring a program effective when it actually made no difference. Yet one chance in 10 (p<.10) may be sufficient in studies likely to have tens of thousands of students.

2. For studies in the past, as well as in the future, replication should be treated as equivalent to a large sample size. For example, imagine that two studies of Program X each have 30 schools. Each gets a respectable effect size of +0.20, which would not be significant in either case. Put the two studies together, however, and voila! The combined study of 60 schools would be significant, even at p=.05 (a sketch of the arithmetic appears after this list).

3. Government or foundation funders might fund evaluations in stages. The first stage might involve a cluster randomized experiment of, say, 20 schools, which is very unlikely to produce a significant difference. But if the effect size were perhaps 0.20 or more, the funders might fund a second stage of 30 schools. The two samples together, 50 schools, would be enough to detect a small but important effect.
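Point 2 is just inverse-variance pooling, the same arithmetic used in meta-analysis, and the same logic supports the staged design in point 3. Here is a minimal sketch, assuming each 30-school study yields an effect size of +0.20 with a standard error of about 0.12; that standard error is a hypothetical figure chosen for illustration, not one taken from real studies.

```python
# A minimal fixed-effect (inverse-variance) pooling sketch. The standard error
# below is a hypothetical value for a single 30-school cluster RCT.
from math import sqrt
from statistics import NormalDist

def pooled_effect(effects, ses):
    """Inverse-variance (fixed-effect) combination of study effect sizes."""
    weights = [1 / se ** 2 for se in ses]
    estimate = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return estimate, sqrt(1 / sum(weights))

def two_sided_p(estimate, se):
    """Two-tailed p value for the hypothesis of zero effect."""
    z = estimate / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

se_single_study = 0.116  # assumed SE of the effect size for one 30-school trial
print("one 30-school study: p =", round(two_sided_p(0.20, se_single_study), 3))  # about .085
estimate, se = pooled_effect([0.20, 0.20], [se_single_study, se_single_study])
print("two studies pooled:  p =", round(two_sided_p(estimate, se), 3))           # about .015
```

Neither study clears p<.05 on its own under these assumptions, but the pooled estimate does, which is why replication deserves the same respect as a single large sample.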

One might well ask: why should we be interested in programs that only produce effect sizes of +0.15? Aren’t these effects too small to matter?

The answer is that they are not. Over time, I hope we will learn how to routinely produce better outcomes. Already, we know that much larger impacts are found in studies of certain approaches emphasizing professional development (e.g., cooperative learning, metacognitive skills) and certain forms of technology. I hope and expect that over time, more studies will evaluate programs using methods like those that have been proven to work, and fewer will evaluate those that do not, thereby raising the average effects we find. But even now, programs with small but reliable effect sizes are making meaningful differences in the lives of children, and they will make far more meaningful differences as we learn from efforts at the Institute of Education Sciences (IES) and the Investing in Innovation (i3)/Education Innovation and Research (EIR) programs.

Small effect sizes from large randomized experiments are the Hope Diamonds of our profession. They also are the best hope for evidence-based improvements for all students.

The Wonderful Reputation of Educational Research

Back in 1993, Carl Kaestle memorably wrote about the “awful reputation of educational research.” At the time, he was right. But that was 23 years ago. In the interim, educational research has made extraordinary advances. It is now admired by researchers in many other fields and by policy makers in many areas of government. As the evidence provisions of the Every Student Succeeds Act (ESSA) indicate, research is starting to make more of a difference in policy and practice. There is still a long, long way to go, but the trend is hugely positive.

In a recent article for the Brookings Institution, Ruth Curran Neild, acting director of the Institute of Education Sciences (IES), argued that educational research is on the right track. The one thing it lacks, she says, is adequate funding. I totally agree. Of course there are improvements that could be made to education policies and practices, but the part of the education field working on using science to improve outcomes for children is very much going in the right direction. Many are frustrated that it is not getting there fast enough, but we need more wind in our sails, not a change of course.

I was listening recently to an NPR broadcast about a new center for research on immunological treatments for cancer. The interviewer asked how their center could possibly make much difference with a grant of only $250 million. The director sheepishly agreed this was a problem, but hoped they could nevertheless make a contribution. If only we in education had conversations like this – ever!

What has radically changed over the past 15 years is that there is now far more support than there once was for randomized evaluations of replicable programs and practices, and as a result we are collectively building a strong set of studies that use the kinds of designs common in medicine and agriculture but not, until recently, in education. My colleagues and I constantly update reviews of research on educational interventions in the main areas of practice at the Best Evidence Encyclopedia website. Where once randomized studies were rare, they are becoming the norm. We recently published a review of research on early childhood programs, in which we located 32 studies of 22 different programs. Twenty-nine of the studies used randomized designs, thanks primarily to funding and leadership from a federal investment called Preschool Curriculum Evaluation Research (PCER). We are working on a review of research on secondary reading programs. Due to the federal Striving Readers program, which invested in evaluations of a wide variety of school interventions, our review is now dominated by randomized studies. Studies of programs for struggling elementary readers are now overwhelmingly randomized. The Investing in Innovation (i3) program requires randomized evaluations in its validation and scale-up grants and encourages them in its development grants, and this is increasing the prevalence of randomized studies across all studies of programs for students from grades pre-K to 12. The National Science Foundation has begun to fund scale-up projects that require random assignment, as have a few private foundations.

Random assignment is the hallmark of rigorous science. From a methodological standpoint, random assignment is crucial because only when students, teachers, or schools are randomly assigned to treatment or control conditions can readers be sure that any differences observed at posttest are truly the result of the treatments, and not of self-selection or other bias. But more than this, use of random assignment establishes a field as serious about its science. Studies that use random assignment are called “gold standard,” because there is no better design in existence. Yes, there are better and worse randomized studies, better and worse measures, and so on. Mixed methods studies can usefully add insight to the numbers. Replication is very important in establishing effectiveness. And there are certainly circumstances in which randomization is impossible or impractical, and a well-done quasi-experiment will do. But all this being said, the use of randomization moves the science of education forward and gives educational leaders reliable information on which to make decisions.

The most telling criticism of randomized experiments is that they are expensive. Yes, they can be. Encouragement and funding from IES and the Laura and John Arnold Foundation are increasing the use of inexpensive experiments in situations in which treatments and (usually) measures are already being paid for by government or other sources, so only funding for the evaluation is needed. But these experiments are only possible in special circumstances. In other cases, someone has to come up with serious funding to support randomized designs.

This brings us back to Ruth Neild’s main point. We know what needs to be done in educational research. We need to develop a wide variety of promising innovations, subject them to rigorous, ultimately randomized experiments, and then disseminate those programs found to be effective. We have systems in place to do all of these things. We just need a lot more funding to do them faster and better.

I don’t know whether policy makers appreciate these increases in the quality of education research, or how much that quality matters for funding decisions. But education now has a case to make that it deserves much greater funding. Educational research is no longer just of interest to the academics who do it. It is producing answers that matter for children, and that should justify funding in line with our field’s new, wonderful reputation.

Money and Evidence

Many years ago, I spent a few days testifying in a funding equity case in Alabama. At the end of my testimony, the main lawyer for the plaintiffs drove me to the airport. “I think we’re going to win this case,” he said, “But will it help my clients?”

The lawyer’s question has haunted me ever since. In Alabama, then and now, there are enormous inequities in education funding between rich and poor districts, driven by differences in property tax receipts. There are corresponding differences in student outcomes. The same is true in most states. To a greater or lesser degree, most states and the federal government provide some funding to reduce inequalities, but in most places it is still the case that poor districts must tax themselves at a higher rate to produce education funding that is significantly lower than that of their wealthier neighbors.

Funding inequities are worse than wrong; they are repugnant. When I travel in other countries and try to describe our system, it usually takes me a while to get people outside the U.S. to even understand what I am saying. “So schools in poor areas get less than those in wealthy ones? Surely that cannot be true.” In fact, it is true in the U.S. In all of our peer countries, national or at least regional funding policies ensure basic equality in school funding, and in most cases I know about, additional funds are then added on top of equalized funding for schools serving many children in poverty. For example, England has long had equal funding, and the Conservative government added “Pupil Premium” funding in which each disadvantaged child brings additional funds to his or her school. Pupil Premium is sort of like Title I in the U.S., if you can imagine Title I adding resources on top of equal funding, which it does in only a few U.S. states.

So let’s accept the idea that funding inequity is a BAD THING. Now consider this: Would eliminating funding inequities eliminate achievement gaps in U.S. schools? This gets back to the lawyer’s question. If we somehow won a national “case” that required equalizing school funding, would the “clients” benefit?

More money for disadvantaged schools would certainly be welcome, and it would certainly create the possibility of major advances. But whether significant additional funding has its maximum impact depends on what schools do with the added dollars. Of course you’d have to increase teachers’ salaries and reduce class sizes to draw highly qualified teachers into disadvantaged schools. But you’d also have to spend a significant portion of new funds to help schools implement proven programs with fidelity and verve.

Again, England offers an interesting model. Twenty years ago, achievement in England was very unequal, despite equal funding. Children of immigrants from Pakistan and Bangladesh, children of African and Afro-Caribbean heritage, and other minority students performed well below White British children. The Labour government implemented a massive effort to change this, starting with the London Challenge and continuing with a Manchester Challenge and a Black Country Challenge in the post-industrial Midlands. Each “challenge” provided substantial professional development to school staffs and organized achievement data to show school leaders that other schools with exactly the same demographic challenges were achieving far better results.

Today, children of Pakistani and Bangladeshi immigrants are scoring at the English mean. Children of African and Afro-Caribbean immigrants are just below the English mean. Policy makers in England are now turning their attention to White working-class boys. But the persistent and substantial gaps we see as so resistant to change in the U.S. are essentially gone in England.

Today, we are getting even smarter about how to turn dollars into enhanced achievement, due to investments by the Institute of Education Sciences (IES) and the Investing in Innovation (i3) program in the U.S. and the Education Endowment Foundation (EEF) in England. In both countries, however, we lack the funding to put into place what we know how to do on a large enough scale to matter, but this need not always be the case.

Funding matters. No one can make chicken soup out of chicken feathers, as we say in Baltimore. But funding in itself will not solve our achievement gap. Funding needs to be spent on specific, high-impact investments to make a big difference.

Accelerating the Pace of Innovation

The biggest problem in evidence-based reform in education is that too few replicable programs with strong evidence of effectiveness are available to educators. The evidence provisions of the Every Student Succeeds Act (ESSA) encourage the use of programs that have strong, moderate, or promising evidence of effectiveness, and they require School Improvement efforts (formerly SIG) to include approaches with evidence that meets these definitions. There are significant numbers of programs that do meet these definitions, but not enough to give educators multiple choices of proven programs for each subject and grade level. The Institute of Education Sciences (IES), the Investing in Innovation (i3) program, the National Science Foundation (NSF), and England’s Education Endowment Foundation (EEF) have all been supporting rigorous evaluations of replicable programs at all levels, and this work (and work funded by others) is progressively enriching the offerings of programs that are both proven to be effective and ready for widespread dissemination. However, progress is slow. The large-scale randomized experiments demanded by these funders are expensive and may take many years to complete. As in any scientific field (such as medicine), most experiments do not show positive outcomes for innovative treatments. At a time when demand is starting to pick up, the supply needs to keep pace.

Given that money is not being thrown at education research by Congress or other funders, how can promising innovations be evaluated, made ready for dissemination, and taken to scale? First, existing funders need to be supported adequately to continue the good work they are doing. Grants for Education Innovation and Research (EIR) will pick up where i3 ends, and IES needs to maintain its leadership in supporting development and evaluation of promising programs in all subjects and grade levels. The National Science Foundation should invest far more in creating, evaluating, and disseminating proven STEM approaches. All of this work, in fact, is in need of increased funding and publicity to build political and public support for the entire enterprise.

However, there are several additional avenues that might be pursued to increase the number of proven, ready-to-disseminate approaches. One promising model is low-cost randomized evaluations of interventions supported by government or other funding. Both IES and the Laura and John Arnold Foundation are offering support for such studies. For example, imagine that a school district is introducing a digital textbook to its schools; however, it can only afford to provide the program to 30 schools each year. If the district finds 60 schools willing to receive the program and randomly assigns half of them to start in a given year, then it is spending no more on digital textbooks than it planned to spend. If state test scores can be obtained and used as pre- and post-tests, then the measurement costs nothing. The only costs of studying the effects of the digital textbooks might be the costs of data analysis, perhaps some questionnaires or observations to find out what schools did with the digital textbooks, and a report. Such a study would be very inexpensive, might produce results within a year or two, and would be evaluating something that is appealing to schools and ready to go.
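The analysis of such a study can be almost as inexpensive as its design. Here is a minimal sketch, assuming a hypothetical file of school means (school_means.csv, with treatment, pretest, and posttest columns) and a made-up student-level standard deviation for converting the result into an effect size; a real evaluation would more likely fit a multilevel model to student-level data.

```python
# A minimal sketch of a school-level analysis for a low-cost cluster RCT.
# File name, column names, and the student-level SD are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf

# One row per school: treatment (1 = received the digital textbook this year),
# mean state test score before the program (pretest) and after one year (posttest).
schools = pd.read_csv("school_means.csv")

# ANCOVA on school means: the coefficient on `treatment` estimates the program
# effect, adjusting for where each school started.
model = smf.ols("posttest ~ treatment + pretest", data=schools).fit()
print(model.summary())

# Express the effect in standard deviation units, dividing by an assumed
# student-level SD of the state test (taken here as a known figure).
student_sd = 35.0
print("Effect size:", round(model.params["treatment"] / student_sd, 2))
```

The coefficient on the treatment indicator is the program effect in test-score points; dividing by the student-level standard deviation puts it on the same effect-size scale discussed earlier.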

Beyond these existing strategies, others might be considered to speed up the pipeline of proven programs. One example might be to build on Small Business Innovation Research (SBIR) grants. At $1 million over two years, these grants, which are limited to for-profit companies, are often too small to develop and evaluate promising approaches (usually technology applications). IES or other funders might proactively look for promising SBIR projects and encourage them to apply for larger funding to complete development and conduct rigorous evaluations. One advantage of SBIR projects is that they are usually created by small, ambitious, undercapitalized companies, which are motivated to take their programs to scale.

Another strategy might be to fund “aggregators” whose job would be to identify promising approaches from any source, help assemble partnerships if necessary, and then help prepare applications for funding. This could help young innovators with great ideas combine their efforts, create more complete and powerful innovations, and subject them to rigorous evaluations. In addition to SBIR-funded projects, promising program elements might be found in projects funded by private foundations or agencies outside of education. They might be components of IES or i3 projects that produced promising but not conclusive outcomes in their evaluations, perhaps due to insufficient sample size. Aggregators might link programs with broad reach but limited technology to brash technology start-ups in need of access to markets. If the goal is finding promising but incomplete efforts and helping them reach effectiveness and scale, every source should be fair game.

Government has made extraordinary progress in promoting the development, rigorous evaluation, and scale-up of proven programs. However, its success has led to a demand for proven programs that it cannot fulfill at the usual pace. Current grant programs at IES and i3/EIR should continue, but in addition we need innovative strategies capable of greatly accelerating the pace of development, evaluation, and scale-up.