In Meta-Analyses, Weak Inclusion Standards Lead to Misleading Conclusions. Here’s Proof.

By Robert Slavin and Amanda Neitzel, Johns Hopkins University

In two recent blogs (here and here), I’ve written about Baltimore’s culinary glories: crabs and oysters. My point was just that in both cases, there is a lot you have to discard to get to what matters. But I was of course just setting the stage for a problem that is deadly serious, at least to anyone concerned with evidence-based reform in education.

Meta-analysis has contributed a great deal to educational research and reform, helping readers find out about the broad state of the evidence on practical approaches to instruction and school and classroom organization. Recent methodological developments in meta-analysis and meta-regression, and promotion of the use of these methods by agencies such as IES and NSF, have expanded awareness and use of modern methods.

Yet across the large numbers of meta-analyses published over the past five years, even up to the present, the quality is highly uneven. That’s putting it nicely. The problem is that most meta-analyses in education are far too unselective with regard to the methodological quality of the studies they include. Actually, I’ve been ranting about this for many years, and along with colleagues, have published several articles on it (e.g., Cheung & Slavin, 2016; Slavin & Madden, 2011; Wolf et al., 2020). But clearly, my colleagues and I are not making enough of a difference.

My colleague, Amanda Neitzel, and I thought of a simple way we could communicate the enormous difference it makes if a meta-analysis accepts studies that contain design elements known to inflate effect sizes. In this blog, we once again use the Kulik & Fletcher (2016) meta-analysis of research on computerized intelligent tutoring, which I critiqued in my blog a few weeks ago (here). As you may recall, the only methodological inclusion standards used by Kulik & Fletcher required that studies use RCTs or QEDs, and that they have a duration of at least 30 minutes (!!!). However, they included enough information to allow us to determine the effect sizes that would have resulted if they had a) weighted for sample size in computing means, which they did not, and b) excluded studies with various features known to inflate effect size estimates. Here is a table summarizing our findings when we additionally excluded studies containing procedures known to inflate mean effect sizes:

If you follow meta-analyses, this table should be shocking. It starts out with 50 studies and a very large effect size, ES=+0.65. Just weighting the mean for study sample sizes reduces this to +0.56. Eliminating small studies (n<60) cuts the number of studies almost in half (to 27) and cuts the effect size to +0.39. But the largest reductions are due to excluding “local” measures, which on inspection are always measures made by developers or researchers themselves. (The alternative was “standardized measures.”) By itself, excluding local measures (and weighting) cuts the number of included studies to 12, and the effect size to +0.10, which is not significantly different from zero (p=.17). Excluding small studies, brief studies, and “local” measures together only slightly changes the results, because both small and brief studies almost always use “local” (i.e., researcher-made) measures. Excluding all three, and weighting for sample size, leaves this review with only nine studies and an effect size of +0.09, which is not significantly different from zero (p=.21).
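
To make the weighting and exclusion steps concrete, here is a minimal Python sketch. The five studies are invented purely for illustration and are not taken from Kulik & Fletcher (2016). The names Study, unweighted_mean, weighted_mean, and selective are hypothetical, and the 12-week duration cutoff is an assumption. Weighting by raw sample size is also a simplification; published meta-analyses often use inverse-variance weights instead.

```python
from dataclasses import dataclass


@dataclass
class Study:
    effect_size: float   # standardized mean difference reported by the study
    n: int               # total sample size
    weeks: float         # duration of the intervention in weeks
    local_measure: bool  # True if the outcome measure was researcher- or developer-made


# Hypothetical data, invented for illustration only.
STUDIES = [
    Study(1.10, 40, 1, True),
    Study(0.80, 50, 2, True),
    Study(0.60, 120, 16, True),
    Study(0.15, 400, 30, False),
    Study(0.05, 900, 36, False),
]


def unweighted_mean(studies):
    """Simple average: every study counts equally, however small."""
    return sum(s.effect_size for s in studies) / len(studies)


def weighted_mean(studies):
    """Average weighted by sample size, so large studies count for more."""
    total_n = sum(s.n for s in studies)
    return sum(s.effect_size * s.n for s in studies) / total_n


def selective(studies, min_n=60, min_weeks=12):
    """Keep only studies that are large, long, and use independent measures.

    The min_weeks default is an assumed cutoff for this sketch.
    """
    return [
        s for s in studies
        if s.n >= min_n and s.weeks >= min_weeks and not s.local_measure
    ]


print(f"All studies, unweighted:   {unweighted_mean(STUDIES):+.2f}")
print(f"All studies, weighted:     {weighted_mean(STUDIES):+.2f}")
print(f"Selective only, weighted:  {weighted_mean(selective(STUDIES)):+.2f}")
```

With these made-up numbers, the mean falls from +0.54 (unweighted) to +0.17 (weighted) to +0.08 (weighted, selective standards only). The pattern, not the particular values, is the point: the small, brief studies with researcher-made measures carry most of the apparent effect.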

The estimates at the bottom of the chart represent what we call “selective standards.” These are the standards we apply in every meta-analysis we write, and in Evidence for ESSA.

It is easy to see why this matters. Selective standards almost always produce much lower estimates of effect sizes than do reviews with much less selective standards, which therefore include studies containing design features that have a strong positive bias on effect sizes. Consider how this affects mean effect sizes in meta-analyses. For example, imagine a study that uses two measures of achievement. One is a measure made by the researcher or developer specifically to be “sensitive” to the program’s outcomes. The other is a test independent of the program, such as GRADE/GMADE or Woodcock, standardized tests but not necessarily state tests. Imagine that the researcher-made measure obtains an effect size of +0.30, while the independent measure has an effect size of +0.10. A less-selective meta-analysis would report a mean effect size of +0.20, a respectable-sounding impact. But a selective meta-analysis would report an effect size of +0.10, a very small impact. Which of these estimates represents an outcome with meaning for practice? Clearly, school leaders should not value the +0.30 or +0.20 estimates, which require use of a test designed to be “sensitive” to the treatment. They should care about the gains on the independent test, which represents what educators are trying to achieve and what they are held accountable for. The information from the researcher-made test may be valuable to the researchers, but it has little or no value to educators or students.
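
The arithmetic in this imagined study can be written out in a few lines of Python. This is only a sketch of the example above: the function name study_effect_size and the (effect size, is independent) pairing are hypothetical conveniences, not part of any published method.

```python
def study_effect_size(measures, independent_only=False):
    """Average one study's effect sizes across its outcome measures.

    measures: list of (effect_size, is_independent) pairs.
    independent_only: if True, drop researcher- or developer-made measures
    before averaging, as a selective review would.
    Returns None if no measure qualifies.
    """
    kept = [es for es, independent in measures
            if independent or not independent_only]
    if not kept:
        return None
    return sum(kept) / len(kept)


# The imagined study: one researcher-made measure, one independent measure.
measures = [(0.30, False), (0.10, True)]

less_selective = study_effect_size(measures)
selective_estimate = study_effect_size(measures, independent_only=True)

print(f"Less-selective review: {less_selective:+.2f}")
print(f"Selective review:      {selective_estimate:+.2f}")
```

The less-selective review reports +0.20 and the selective review reports +0.10, from exactly the same study. Nothing about the program changed; only the inclusion rule did.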

The point of this exercise is to illustrate that in meta-analyses, choices of methodological exclusions may entirely determine the outcomes. Had they chosen other exclusions, the Kulik & Fletcher meta-analysis could have reported any effect size from +0.09 (n.s.) to +0.65 (p<.001).

The importance of these exclusions is not merely academic. Think how you’d explain the chart above to your sister the principal:

            Principal Sis: I’m thinking of using one of those intelligent tutoring programs to improve achievement in our math classes. What do you suggest?

            You:  Well, it all depends. I saw a review of this in the top journal in education research. It says that if you include very small studies, very brief studies, and studies in which the researchers made the measures, you could have an effect size of +0.65! That’s like seven additional months of learning!

            Principal Sis:  I like those numbers! But why would I care about small or brief studies, or measures made by researchers? I have 500 kids, we teach all year, and our kids have to pass tests that we don’t get to make up!

            You (sheepishly):  I guess you’re right, Sis. Well, if you just look at the studies with large numbers of students, which continued for more than 12 weeks, and which used independent measures, the effect size was only +0.09, and that wasn’t even statistically significant.

            Principal Sis:  Oh. In that case, what kinds of programs should we use?

From a practical standpoint, study features such as small samples or researcher-made measures add a lot to effect sizes while adding nothing to the value to students or schools of the programs or practices they want to know about. They just add a lot of bias. It’s like trying to convince someone that corn on the cob is a lot more valuable than corn off the cob, because you get so much more quantity (by weight or volume) for the same money with corn on the cob.

Most published meta-analyses only require that studies have control groups, and some do not even require that much. Few exclude researcher- or developer-made measures, or very small or brief studies. The result is that effect sizes in published meta-analyses are very often implausibly large.

Meta-analyses that include studies lacking control groups or studies with small samples, brief durations, pretest differences, or researcher-made measures report overall effect sizes that cannot be fairly compared to other meta-analyses that excluded such studies. If outcomes do not depend on the power of the particular program but rather on the number of potentially biasing features the reviewers did or did not exclude, then outcomes of meta-analyses are meaningless.

It is important to note that these two examples are not at all atypical. As we have begun to look systematically at published meta-analyses, most of them fail to exclude or control for key methodological factors known to contribute a great deal of bias. Something very serious has to be done to change this. Also, I’d remind readers that there are lots of programs that do meet strict standards and show positive effects based on reality, not on including biasing factors. At, you can see more than 120 reading and math programs that meet selective standards for positive impacts. The problem is that in meta-analyses that include studies containing biasing factors, these truly effective programs are swamped by a blizzard of bias.

In my recent blog (here) I proposed a common set of methodological inclusion criteria that I would think most methodologists would agree to. If these (or a similar consensus list) were consistently used, we could make more valid comparisons both within and between meta-analyses. But as long as inclusion criteria remain highly variable from meta-analysis to meta-analysis, then all we can do is pick out the few that do use selective standards, and ignore the rest. What a terrible waste.


Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: A meta-analytic review. Review of Educational Research, 86(1), 42-78.

Slavin, R. E., & Madden, N. A. (2011). Measures inherent to treatments in program effectiveness reviews. Journal of Research on Educational Effectiveness, 4, 370-380.

Wolf, R., Morrison, J.M., Inns, A., Slavin, R. E., & Risman, K. (2020). Average effect sizes in developer-commissioned and independent evaluations. Journal of Research on Educational Effectiveness. DOI: 10.1080/19345747.2020.1726537

Photo credit: Deeper Learning 4 All, (CC BY-NC 4.0)

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.


A “Called Shot” for Educational Research and Impact

In the 1932 World Series, Babe Ruth stepped up to the plate and pointed to the center field fence. Everyone there understood: He was promising to hit the next pitch over the fence.

And then he did.

That one home run established Babe Ruth as the greatest baseball player ever. Even though several others have long since beaten his record of 60 home runs, no one else ever promised to hit a home run and then did it.

Educational research needs to execute a “called shot” of its own. We need to identify a clear problem, one that must be solved with some urgency, one that every citizen understands and cares about, one that government is willing and able to spend serious money to solve. And then we need to solve it, in a way that is obvious to all. I think the clear need for intensive services for students whose educations have suffered due to Covid-19 school closures provides an opportunity for our own “called shot.”

In my recent Open Letter to President-Elect Biden, I described a plan to provide up to 300,000 well-trained college-graduate tutors to work with up to 12 million students whose learning has been devastated by the Covid-19 school closures, or who are far below grade level for any reason. There are excellent reasons to do this, including making a rapid difference in the reading and mathematics achievement of vulnerable children, providing jobs to hundreds of thousands of college graduates who may otherwise be unemployed, and starting the best of these non-certified tutors on a path to teacher certification. These reasons more than justify the effort. But in today’s blog, I wanted to explain a fourth rationale, one that in the long run may be the most important of all.

A major tutoring enterprise, entirely focusing on high-quality implementation of proven programs, could be the “called shot” evidence-based education needs to establish its value to the American public.

Of course, the response to the Covid-19 pandemic is already supporting a “called shot” in medicine, the rush to produce a vaccine. At this time we do not know what the outcome will be, but throughout the world, people are closely following the progress of dozens of prominent attempts to create a safe and effective vaccine to prevent Covid-19. If this works as hoped, this will provide enormous benefits for entire populations and economies worldwide. But it could also raise the possibility that we can solve many crucial medical problems much faster than we have in the past, without compromising on strict research standards. The funding of many promising alternatives, and rigorous testing of each before they are disseminated, is very similar to what I and my colleagues have proposed for various approaches to tutoring. In both the medical case and the educational case, the size of the problem justifies this intensive, all-in approach. If all goes well with the vaccines, that will be a “called shot” for medicine, but medicine has long since proven its capability to use science to solve big problems. Curing polio, eliminating smallpox, and preventing measles come to mind as examples. In education, we need to earn this confidence, with a “called shot” of our own.

Think of it. Education researchers and leaders who support them would describe a detailed and plausible plan to solve a pressing problem of education. Then we announce that given X amount of money and Y amount of time, we will demonstrate that struggling students can perform substantially better than they would have without tutoring.

We’d know this would work, because part of the process would be identifying a) programs already proven to be effective, b) programs that already exist at some scale that would be successfully evaluated, and c) newly-designed programs that would successfully be evaluated. In each case, programs would have to meet rigorous evaluation standards before qualifying for substantial scale-up. In addition, in order to obtain funding to hire tutors, schools would have to agree to ensure that tutors use the programs with an amount and quality of training, coaching, and support at least as good as what was provided in the successful studies.

Researchers and policy makers who believe in evidence-based reform could confidently predict substantial gains, and then make good on their promises. No intervention in all of education is as effective as tutoring. Tutoring can be expensive, but it does not require a lengthy, uncertain transformation of the entire school. No sensible researcher or reformer would think that tutoring is all schools should do to improve student outcomes, but tutoring should be one element of any comprehensive plan to improve schools, and it happens to respond to the needs of post-Covid education for something that can have a dramatic, relatively quick, and relatively reliable impact.

If all went well in a large-scale tutoring intervention, the entire field of research could gain new respect, a belief among educators and the public that outcomes could be made much better than they are now by systematic applications of research, development, evaluation, and dissemination.

It is important to note that in order to be perceived to work, the tutoring “called shot” need not be proven effective across the board. By my count, there are 18 elementary reading tutoring programs with positive outcomes in randomized evaluations (see below). Let’s say 12 of them are ready for prime time and are put to the test, and 5 of those work very well at scale. That would be a tremendous success, because if we know which five approaches worked, we could make substantial progress on the problem of elementary reading failure. Just as with Covid-19 vaccines, we shouldn’t care how many vaccines failed. All that matters is that one or more of them succeeds, and can then be widely replicated.

I think it is time to do something bold to capture people’s imaginations. Let’s (figuratively) point to the center field fence, and (figuratively) hit the next pitch over it. The conditions today for such an effort are as good as they will ever be, because of universal understanding that the Covid-19 school closures deserve extraordinary investments in proven strategies. Researchers working closely with educators and political leaders can make a huge difference. We just have to make our case and insist on nothing less than whatever it takes. If a “called shot” works for tutoring, perhaps we could use similar approaches to solve other enduring problems of education.

It worked for the Babe. It should work for us, too, with much greater consequences for our children and our society than a mere home run.

*  *  *

Note: A reader of my previous blog asked what specific tutoring programs are proven effective, according to our standards. I’ve listed below reading and math tutoring programs that meet our standards of evidence. I cannot guarantee that all of these programs would be able to go to scale. We are communicating with program providers to try to assess each program’s capacity and interest in going to scale. But these programs are a good place to start in understanding where things stand today.


An Open Letter To President-Elect Biden: A Tutoring Marshall Plan To Heal Our Students

Dear President-Elect Biden:

            Congratulations on your victory in the recent election. Your task is daunting; so much needs to be set right. I am writing to you about what I believe needs to be done in education to heal the damage done to so many children who missed school due to Covid-19 closures.

            I am aware that there are many basic things that must be done to improve schools, which have to continue to make their facilities safe for students and cope with the physical and emotional trauma that so many have experienced. Schools will be opening into a recession, so just providing ordinary services will be a challenge. Funding to enable schools to fulfill their core functions is essential, but it is not sufficient.

            Returning schools to the way they were when they closed last spring will not heal the damage students have sustained to their educational progress. This damage will be greatest to disadvantaged students in high-poverty schools, most of whom were unable to take advantage of the remote learning most schools provided. Some of these students were struggling even before schools closed, but when they re-open, millions of students will be far behind.

            Our research center at Johns Hopkins University studies the evidence on programs of all kinds for students who are at risk, especially in reading (Neitzel et al., 2020) and mathematics (Pellegrini et al., 2020). What we and many other researchers have found is that the most effective strategy for struggling students, especially in elementary schools, is one-to-one or one-to-small group tutoring. Structured tutoring programs can make a large difference in a short time, exactly what is needed to help students quickly catch up with grade level expectations.

A Tutoring Marshall Plan

            My colleagues and I have proposed a massive effort designed to provide proven tutoring services to the millions of students who desperately need it. Our proposal, based on a similar idea by Senator Coons (D-Del), would ultimately provide funding to enable as many as 300,000 tutors to be recruited, trained in proven tutoring models, and coached to ensure their effectiveness. These tutors would be required to have a college degree, but not necessarily a teaching certificate. Research has found that such tutors, using proven tutoring models with excellent professional development, can improve the achievement of students struggling in reading or mathematics as much as can teachers serving as tutors.

            The plan we are proposing is a bit like the Marshall Plan after World War II, which provided substantial funding to Western European nations devastated by the war. The idea was to put these countries on their feet quickly and effectively so that within a brief period of years, they could support themselves. In a similar fashion, a Tutoring Marshall Plan would provide intensive funding to enable Title I schools nationwide to substantially advance the achievement of their students who suffered mightily from Covid-19 school closures and related trauma. Effective tutoring is likely to enable these children to advance to the point where they can profit from ordinary grade-level instruction. We fear that without this assistance, millions of children will never catch up, and will show the negative effects of the school closures throughout their time in school and beyond.

            The Tutoring Marshall Plan will also provide employment to 300,000 college graduates, who will otherwise have difficulty entering the job market in a time of recession. These people are eager to contribute to society and to establish professional careers, but will need a first step on that ladder. Ideally, the best of the tutors will experience the joys of teaching, and might be offered accelerated certification, opening a new source of teacher candidates who will have had an opportunity to build and demonstrate their skills in school settings. Like the CCC and WPA programs in the Great Depression, these tutors will not only be helped to survive the financial crisis, but will perform essential services to the nation while building skills and confidence.

            The Tutoring Marshall Plan needs to start as soon as possible. The need is obvious, both to provide essential jobs to college graduates and to provide proven assistance to struggling students.

            Our proposal, in brief, is to ask the U.S. Congress to fund the following activities:

Spring, 2021

  • Fund existing tutoring programs to build capacity to scale up their programs to serve thousands of struggling students. This would include funds for installing proven tutoring programs in about 2000 schools nationwide.
  • Fund rigorous evaluations of programs that show promise, but have not been evaluated in rigorous, randomized experiments.
  • Fund the development of new programs, especially in areas in which there are few proven models, such as programs for struggling students in secondary schools.

Fall, 2021 to Spring, 2022

  • Provide restricted funds to Title I schools throughout the United States to enable them to hire up to 150,000 tutors to implement proven programs, across all grade levels, 1-9, and in reading and mathematics. This many tutors, mostly using small-group methods, should be able to provide tutoring services to about 6 million students each year. Schools should be asked to agree to select from among proven, effective programs. Schools would implement their chosen programs using tutors who have college degrees and experience with tutoring, teaching, or mentoring children (such as AmeriCorps graduates who were tutors, camp counselors, or Sunday school teachers).
  • As new programs are completed and piloted, third-party evaluators should be funded to evaluate them in randomized experiments, adding to capacity to serve students in grades 1-9. Those programs that produce positive outcomes would then be added to the list of programs available for tutor funding, and their organizations would need to be funded to facilitate preparation for scale-up.
  • Teacher training institutions and school districts should be funded to work together to design accelerated certification programs for outstanding tutors.

Fall, 2022-Spring, 2023

  • Title I schools should be funded to enable them to hire a total of 300,000 tutors. Again, schools will select among proven tutoring programs, which will train, coach, and evaluate tutors across the U.S. We expect these tutors to be able to work with about 12 million struggling students each year.
  • Development, evaluation, and scale-up of proven programs should continue to enrich the number and quality of proven programs adapted to the needs of all kinds of Title I schools.

            The Tutoring Marshall Plan would provide direct benefits to millions of struggling students harmed by Covid-19 school closures, in all parts of the U.S. It would provide meaningful work with a future to college graduates who might otherwise be unemployed. At the same time, it could establish a model of dramatic educational improvement based on rigorous research, contributing to knowledge and use of effective practice. If all goes well, the Tutoring Marshall Plan could demonstrate the power of scaling up proven programs and using research and development to improve the lives of children.


Neitzel, A., Lake, C., Pellegrini, M., & Slavin, R. (2020). A synthesis of quantitative research on programs for struggling readers in elementary schools. Manuscript submitted for publication.

Pellegrini, M., Inns, A., Lake, C., & Slavin, R. (2020). Effective programs in elementary mathematics: A best-evidence synthesis. Manuscript submitted for publication.
