One of the best things about living in Baltimore is eating steamed hard shell crabs every summer. They are cooked in a very spicy seasoning mix, and with Maryland corn and Maryland beer, they define the very peak of existence for Marylanders. (To be precise, the true culture of the crab also extends into Virginia, but does not really exist more than 20 miles inland from the bay.)

As every crab eater knows, a steamed crab comes with a lot of inedible shell and other inner furniture. So you get perhaps an ounce of delicious meat for every pound of whole crab. Here is a bit of crab math. Let’s say you have ten pounds of whole crabs, and I have 20 ounces of delicious crabmeat. Who gets more to eat? Obviously I do, because your ten pounds of crabs will only yield 10 ounces of meat.

All Baltimoreans instinctively understand this from birth. So why is this same principle not understood by so many meta-analysts?

I recently ran across a meta-analysis of research on intelligent tutoring programs by Kulik & Fletcher (2016), published in the *Review of Educational Research (RER).* The meta-analysis reported an overall effect size of +0.66! Considering that the single largest effect size of one-to-one tutoring in mathematics was “only” +0.31 (Torgerson et al., 2013), it is just plain implausible that the average effect size for a computer-assisted instruction intervention is twice as large. Consider that a meta-analysis our group did on elementary mathematics programs found a mean effect size of +0.19 for all digital programs, across 38 rigorous studies (Slavin & Lake, 2008). So how did Kulik & Fletcher come up with +0.66?

The answer is clear. The authors excluded very few studies, other than those lasting less than 30 minutes. The studies they included used methods known to greatly inflate effect sizes, but the authors did not exclude or control for them. To the authors’ credit, they then carefully documented the effects of some key methodological factors. For example, they found that “local” measures (presumably made by researchers) had a mean effect size of +0.73, while standardized measures had an effect size of +0.13, replicating findings of many other reviews (e.g., Cheung & Slavin, 2016). They found that studies with sample sizes less than 80 had an effect size of +0.78, while those with samples of more than 250 had an effect size of +0.30. Brief studies had higher effect sizes than longer ones, as has been found in many other studies. All of this is nice to know, but even knowing it all, Kulik & Fletcher failed to control for any of it, not even to weight by sample size. So, for example, the implausible mean effect size of +0.66 includes a study with a sample size of 33, a duration of 80 minutes, and an effect size of +1.17, on a “local” test. Another had 48 students, a duration of 50 minutes, and an effect size of +0.95. Now, if you believe that 80 minutes on a computer is three times as effective for math achievement as months of one-to-one tutoring by a teacher, then I have a lovely bridge in Baltimore I’d like to sell you.
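
To see why weighting by sample size matters, here is a minimal sketch in Python. The first two studies use the sample sizes and effect sizes quoted above. The third is a hypothetical large study invented purely for illustration, and the function names are my own, not anything from Kulik & Fletcher.

```python
# Each tuple is (sample_size, effect_size).
studies = [
    (33, 1.17),    # from the review: 80-minute study, "local" test
    (48, 0.95),    # from the review: 50-minute study
    (1000, 0.10),  # HYPOTHETICAL large study, invented for illustration
]


def unweighted_mean(studies):
    """Simple average: every study counts equally, however small."""
    return sum(es for _, es in studies) / len(studies)


def sample_size_weighted_mean(studies):
    """Average in which each study counts in proportion to its sample size."""
    total_n = sum(n for n, _ in studies)
    return sum(n * es for n, es in studies) / total_n


print(round(unweighted_mean(studies), 2))            # 0.74
print(round(sample_size_weighted_mean(studies), 2))  # 0.17
```

Under these assumptions, the two tiny studies drag the unweighted mean up to about +0.74, while the sample-size-weighted mean stays near the large study's estimate, at about +0.17.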

I’ve long been aware of these problems with meta-analyses that neither exclude nor control for characteristics of studies known to greatly inflate effect sizes. This was precisely the flaw for which I criticized John Hattie’s equally implausible reviews. But what I did not know until recently was just how widespread the problem is.

I was working on a proposal to do a meta-analysis of research on technology applications in mathematics. A colleague located every meta-analysis published on this topic since 2013. She found 20 of them. After looking at the remarkable outcomes in a few, I computed a median effect size across all 20. It was +0.44. That is, to put it mildly, implausible. Looking further, I discovered that only one of the reviews adjusted for sample size (using inverse variances). Its mean effect size was +0.05. None of the other 19 meta-analyses, all in respectable journals, controlled for methodological features or excluded studies based on them, and they reported effect sizes as high as +1.02 and +1.05.

Meta-analyses are important, because they are far more widely read and cited than individual studies. Yet until meta-analyses start consistently excluding, or at least controlling for, studies with factors known to inflate mean effect sizes, they will have little if any meaning for practice. As things stand now, the overall mean impacts reported by meta-analyses in education depend on how stringent the inclusion standards were, not how effective the interventions truly were.
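
Here is a rough sketch of how inclusion standards alone can move the reported mean. Everything in it is an assumption made for illustration: the `Study` fields, the thresholds, and all of the study records except the first, which uses the figures quoted above for the 33-student study.

```python
from dataclasses import dataclass


@dataclass
class Study:
    n: int                         # sample size
    duration_minutes: int          # length of the intervention
    researcher_made_measure: bool  # True for a "local" test
    effect_size: float


# HYPOTHETICAL inclusion thresholds, chosen only for this example.
MIN_SAMPLE_SIZE = 60
MIN_DURATION_MINUTES = 600

studies = [
    Study(33, 80, True, 1.17),      # figures quoted in the post
    Study(20, 30, True, 0.90),      # HYPOTHETICAL
    Study(400, 3000, False, 0.12),  # HYPOTHETICAL
]


def meets_standards(study):
    """Keep only larger, longer studies that used independent measures."""
    return (
        study.n >= MIN_SAMPLE_SIZE
        and study.duration_minutes >= MIN_DURATION_MINUTES
        and not study.researcher_made_measure
    )


def mean_effect_size(studies):
    return sum(s.effect_size for s in studies) / len(studies)


included = [s for s in studies if meets_standards(s)]

print(round(mean_effect_size(studies), 2))   # mean with no exclusions
print(round(mean_effect_size(included), 2))  # mean after applying standards
```

With these made-up records, the same body of evidence gives a mean of about +0.73 with no exclusions and +0.12 once the standards are applied. The interventions did not change; only the standards did.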

This is a serious problem for evidence-based reform. Our field knows how to solve it, but all too many meta-analysts do not do so. This needs to change. We see meta-analyses claiming huge impacts, and then wonder why these effects do not transfer to practice. In fact, these big effect sizes do not transfer because they are due to methodological artifacts, not to actual impacts teachers are likely to obtain in real schools with real students.

Ten pounds (160 ounces) of crabs only appear to be more than 20 ounces of crabmeat, because the crabs contain a lot you need to discard. The same is true of meta-analyses. Using small samples, brief durations, and researcher-made measures in evaluations inflates effect sizes without adding anything to the actual impact of treatments for students. Our job as meta-analysts is to strip away the bias as best we can, and get to the actual impact. Then we can make comparisons and generalizations that make sense, and advance our understanding of what really works in education.

In our research group, when we deal with thorny issues of meta-analysis, I often ask my colleagues to imagine that they have a sister who is a principal. “What would you say to her,” I ask, “if she asked what really works, all BS aside? Would you suggest a program that was very effective in a 30-minute study? One that has only been evaluated with 20 students? One that has only been shown to be effective if the researcher gets to make the measure? Principals are sharp, and appropriately skeptical. Your sister would never accept such evidence. Especially if she’s experienced with Baltimore crabs.”

**References**

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. *Educational Researcher, 45*(5), 283–292.

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: A meta-analytic review. *Review of Educational Research, 86*(1), 42–78.

Slavin, R., & Lake, C. (2008). Effective programs in elementary mathematics: A best-evidence synthesis. *Review of Educational Research, 78*(3), 427–515.

Torgerson, C. J., Wiggins, A., Torgerson, D., Ainsworth, H., & Hewitt, C. (2013). Every Child Counts: Testing policy effectiveness using a randomised controlled trial, designed, conducted and reported to CONSORT standards. *Research in Mathematics Education, 15*(2), 141–153. doi:10.1080/14794802.2013.797746

Photo credit: Kathleen Tyler Conklin/(CC BY 2.0)

*This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.*

*Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org.*