“But It Worked in the Lab!” How Lab Research Misleads Educators

In researching John Hattie’s meta-meta-analyses, and digging into the original studies, I discovered one underlying factor that more than anything explains why he consistently comes up with greatly inflated effect sizes: Most studies in the meta-analyses that he synthesizes are brief, small, artificial lab studies. And lab studies produce very large effect sizes that have little if any relevance to classroom practice.

This discovery reminds me of one of the oldest science jokes in existence: (One scientist to another): “Your treatment worked very well in practice, but how will it work in the lab?” (Or “…in theory?”)


The point of the joke, of course, is to poke fun at scientists more interested in theory than in practical impacts on real problems. Personally, I have great respect for theory and lab studies. My very first publication as a psychology undergraduate involved an experiment on rats.

Now, however, I work in a rapidly growing field that applies scientific methods to the study and improvement of classroom practice.  In our field, theory also has an important role. But lab studies?  Not so much.

A lab study in education is, in my view, any experiment that tests a treatment so brief, so small, or so artificial that it could never be used all year. Also, an evaluation of any treatment that could never be replicated, such as a technology program in which a graduate student is standing by every four students every day of the experiment, or a tutoring program in which the study author or his or her students provide the tutoring, might be considered a lab study, even if it went on for several months.

Our field exists to try to find practical solutions to practical problems in an applied discipline.  Lab studies have little importance in this process, because they are designed to eliminate all factors other than the variables of interest. A one-hour study in which children are asked to do some task under very constrained circumstances may produce very interesting findings, but cannot recommend practices for real teachers in real classrooms.  Findings of lab studies may suggest practical treatments, but by themselves they never, ever validate practices for classroom use.

Lab studies are almost invariably doomed to success. Their conditions are carefully set up to support a given theory. Because they are small, brief, and highly controlled, they produce huge effect sizes. (Because they are relatively easy and inexpensive to do, it is also very easy to discard them if they do not work out, which contributes to the universally reported tendency of published studies to report much higher effects than unpublished ones.) Lab studies are so common not only because researchers believe in them, but also because they are easy and inexpensive to do, while meaningful field experiments are difficult and expensive.

Need a publication? Randomly assign your college sophomores to two artificial treatments and set up an experiment that cannot fail to show significant differences. Need a dissertation topic? Do the same in your third-grade class, or in your friend’s tenth-grade English class. Working with some undergraduates, we once did three lab studies in a single day. All were published. As with my own sophomore rat study, lab experiments are a good opportunity to learn to do research. But that does not make them relevant to practice, even if they happen to take place in a school building.
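To make the "easy to discard" point concrete, here is a toy simulation. It is only a sketch under invented assumptions: the true effect, the sample size, the publication cutoff, and the function name `simulate_study` are all hypothetical, and it illustrates just one mechanism (small studies plus discarded failures), not every reason lab studies produce large effects.

```python
import random
import statistics

def simulate_study(true_effect, n_per_group, rng):
    """Return the observed standardized effect size for one simulated two-group study."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    # Pooled standard deviation for two groups of equal size.
    pooled_sd = ((statistics.variance(control) + statistics.variance(treated)) / 2) ** 0.5
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

rng = random.Random(0)       # fixed seed so the run is repeatable
TRUE_EFFECT = 0.10           # hypothetical true effect, in standard deviation units
N_PER_GROUP = 10             # hypothetical "small lab study" sample size
PUBLISH_CUTOFF = 0.80        # hypothetical: only studies at or above this get written up

all_effects = [simulate_study(TRUE_EFFECT, N_PER_GROUP, rng) for _ in range(2000)]
published = [d for d in all_effects if d >= PUBLISH_CUTOFF]

mean_all = statistics.mean(all_effects)
mean_published = statistics.mean(published)

print(f"studies run: {len(all_effects)}, studies 'published': {len(published)}")
print(f"mean effect size, all studies:       {mean_all:+.2f}")
print(f"mean effect size, published studies: {mean_published:+.2f}")
```

In this sketch every study tests the same weak treatment, yet the "published" subset averages far above the full set, simply because the small studies that happened to come out badly were thrown away.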

By doing meta-analyses, or meta-meta-analyses, Hattie and others who do similar reviews obscure the fact that many and usually most of the studies they include are very brief, very small, and very artificial, and therefore produce very inflated effect sizes.  They do this by covering over the relevant information with numbers and statistics rather than information on individual studies, and by including such large numbers of studies that no one wants to dig deeper into them.  In Hattie’s case, he claims that Visible Learning meta-meta-analyses contain 52,637 individual studies.  Who wants to read 52,637 individual studies, only to find out that most are lab studies and have no direct bearing on classroom practice?  It is difficult for readers to do anything but assume that the 52,637 studies must have taken place in real classrooms, and achieved real outcomes over meaningful periods of time.  But in fact, the few that did this are overwhelmed by the thousands of lab studies that did not.

Educators have a right to data that are meaningful for the practice of education.  Anyone who recommends practices or programs for educators to use needs to be open about where that evidence comes from, so educators can judge for themselves whether or not one-hour or one-week studies under artificial conditions tell them anything about how they should teach. I think the question answers itself.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Meta-Analysis and Its Discontents

Everyone loves meta-analyses. We did an analysis of the most frequently opened articles on Best Evidence in Brief. Almost all of the most popular were meta-analyses. What’s so great about meta-analyses is that they condense a lot of evidence and synthesize it, so instead of just one study that might be atypical or incorrect, a meta-analysis seems authoritative, because it averages many individual studies to find the true effect of a given treatment or variable.

Meta-analyses can be wonderful summaries of useful information. But today I wanted to discuss how they can be misleading. Very misleading.

The problem is that there are no norms among journal editors or meta-analysts themselves about standards for including studies or, perhaps most importantly, how much or what kind of information needs to be reported about each individual study in a meta-analysis. Some meta-analyses are completely statistical. They report all sorts of statistics and very detailed information on exactly how the search for articles took place, but never say anything about even a single study. This is a problem for many reasons. Readers may have no real understanding of what the studies really say. Even if citations for the included studies are available, only a very motivated reader is going to go find any of them. Most meta-analyses do have a table listing studies, but the information in the table may be idiosyncratic or limited.

One reason all of this matters is that without clear information on each study, readers can be easily misled. I remember encountering this when meta-analysis first became popular in the 1980s. Gene Glass, who coined the very term, proposed some foundational procedures and popularized the methods. Early on, he applied meta-analysis to determine the effects of class size, which by then had been studied several times and found to matter very little except in first grade. Reducing “class size” to one (i.e., one-to-one tutoring) was also known to make a big difference, but few people would include one-to-one tutoring in a review of class size. But Glass and Smith (1978) found a much higher effect, not limited to first grade or tutoring. It was a big deal at the time.

I wanted to understand what happened. I bought and read Glass’ book on class size, but it was nearly impossible to tell. Then, in an obscure appendix, I found a distribution of effect sizes. Most studies had effect sizes near zero, as I expected. But one had a huge effect size, of +1.25! It was hard to tell which particular study accounted for this amazing effect, but I searched by process of elimination and finally found it.

It was a study of tennis.


The outcome measure was the ability to “rally a ball against a wall so many times in 30 seconds.” Not surprisingly, when there were “large class sizes,” most students got very few chances to practice, while in “small class sizes,” they got many.

If you removed the clearly irrelevant tennis study, the average effect size for class sizes (other than tutoring) dropped to near zero, as reported in all other reviews (Slavin, 1989).
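The arithmetic behind this is easy to sketch. In the example below, the only number taken from the account above is the +1.25 outlier; the other effect sizes and all study labels are invented for illustration and are not Glass and Smith's data.

```python
import statistics

# Hypothetical effect sizes, invented for illustration only.
# Only the +1.25 value for the tennis study comes from the text.
effect_sizes = {
    "study_a": 0.02,
    "study_b": -0.05,
    "study_c": 0.08,
    "study_d": 0.00,
    "study_e": -0.03,
    "study_f": 0.05,
    "tennis": 1.25,
}

# Simple unweighted mean across every study, outlier included.
mean_with_tennis = statistics.mean(list(effect_sizes.values()))

# Same mean after dropping the one irrelevant study.
mean_without_tennis = statistics.mean(
    [es for name, es in effect_sizes.items() if name != "tennis"]
)

print(f"mean with tennis study:    {mean_with_tennis:+.3f}")
print(f"mean without tennis study: {mean_without_tennis:+.3f}")
```

With so few studies, a single extreme value is enough to move the average from roughly zero to something that looks like a real effect, which is why a reader needs to be able to see the individual studies.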

The problem went way beyond class size, of course. What was important, to me at least, was that Glass’ presentation of the data made it very difficult to find out what was really going on. He had attractive and compelling graphs and charts showing effects of class size, but they all depended on the one tennis study, and there was no easy way to find out.

Because of this review and several others appearing in the 1980s, I wrote an article criticizing numbers-only meta-analyses and arguing that reviewers should show all of the relevant information about the studies in their meta-analyses, and should even describe each study briefly to help readers understand what was happening. I made up a name for this, “best-evidence synthesis” (Slavin, 1986).

Neither the term nor the concept really took hold, I’m sad to say. You still see meta-analyses all the time that do not tell readers enough for them to know what’s really going on. Yet several developments have made the argument for something like best-evidence synthesis a lot more compelling.

One development is the increasing evidence that methodological features can be strongly correlated with effect sizes (Cheung & Slavin, 2016). The evidence is now overwhelming that effect sizes are greatly inflated when sample sizes are small, when study durations are brief, when measures are made by developers or researchers, or when quasi-experiments rather than randomized experiments are used, for example. Many meta-analyses check for the effects of these and other study characteristics, and may make adjustments if there are significant differences. But this is not sufficient, because in a particular meta-analysis, there may not be enough studies to make any study-level factors significant. For example, if Glass had tested “tennis vs. non-tennis,” there would have been no significant difference, because there was only one tennis study. Yet that one study dominated the means anyway. Eliminating studies using, for example, researcher/developer-made measures or very small sample sizes or very brief durations is one way to remove bias from meta-analyses, and this is what we do in our reviews. But at bare minimum, it is important to have enough information available in tables to enable readers or journal reviewers to look for such biasing factors so they can recompute or at least understand the main effects if they are so inclined.
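As a rough sketch of what such a table makes possible, the example below lists study-level features and recomputes the mean after applying exclusion rules. Everything in it is an assumption for illustration: the `Study` fields, the five studies, and the cutoffs `MIN_SAMPLE_SIZE` and `MIN_DURATION_WEEKS` are hypothetical, not the criteria used in any actual review.

```python
from dataclasses import dataclass
import statistics

@dataclass
class Study:
    name: str
    effect_size: float
    sample_size: int
    duration_weeks: int
    researcher_made_measure: bool

# Hypothetical studies, invented for illustration only.
studies = [
    Study("A", 0.90, 24, 1, True),
    Study("B", 0.65, 40, 2, False),
    Study("C", 0.12, 600, 30, False),
    Study("D", 0.08, 1200, 36, False),
    Study("E", 0.45, 300, 24, True),
]

MIN_SAMPLE_SIZE = 60      # hypothetical cutoff
MIN_DURATION_WEEKS = 12   # hypothetical cutoff

def meets_criteria(study: Study) -> bool:
    """Keep only studies that are large enough, long enough, and independently measured."""
    return (
        study.sample_size >= MIN_SAMPLE_SIZE
        and study.duration_weeks >= MIN_DURATION_WEEKS
        and not study.researcher_made_measure
    )

included = [s for s in studies if meets_criteria(s)]
mean_all = statistics.mean([s.effect_size for s in studies])
mean_included = statistics.mean([s.effect_size for s in included])

# Print the table a reader would need in order to check the review's work.
for s in studies:
    status = "kept" if meets_criteria(s) else "dropped"
    print(f"{s.name}: ES={s.effect_size:+.2f} n={s.sample_size} "
          f"weeks={s.duration_weeks} researcher_measure={s.researcher_made_measure} -> {status}")
print(f"mean of all studies:      {mean_all:+.2f}")
print(f"mean of included studies: {mean_included:+.2f}")
```

The point of the sketch is less the filter itself than the table: once each study's features are listed, any reader can apply different cutoffs and see how much the headline average depends on them.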

The second development that makes it important to require more information on individual studies in meta-analyses is the increased popularity of meta-meta-analyses, where the average effect sizes from whole meta-analyses are averaged. These have even more potential for trouble than the worst statistics-only reviews, because it is extremely unlikely that many readers will follow the citations to each included meta-analysis and then follow those citations to look for individual studies. It would be awfully helpful if readers or reviewers could trust the individual meta-analyses (and therefore their averages), or at least see for themselves.

As evidence takes on greater importance, this would be a good time to discuss reasonable standards for meta-analyses. Otherwise, we’ll be rallying balls uselessly against walls forever.


Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.

Glass, G., & Smith, M. L. (1978). Meta-analysis of research on the relationship of class size and achievement. San Francisco: Far West Laboratory for Educational Research and Development.

Slavin, R. E. (1986). Best-evidence synthesis: An alternative to meta-analytic and traditional reviews. Educational Researcher, 15(9), 5-11.

Slavin, R. E. (1989). Class size and student achievement: Small effects of small classes. Educational Psychologist, 24, 99-110.


On Meta-Analysis: Eight Great Tomatoes

I remember a long-ago advertisement for Contadina tomato paste. It went something like this:

Eight great tomatoes in an itsy bitsy can!

This ad creates an appealing image, or at least a provocative one, that I suppose sold a lot of tomato paste.

In educational research, we do something a lot like “eight great tomatoes.” It’s called meta-analysis, or systematic review.  I am particularly interested in meta-analyses of experimental studies of educational programs.  For example, there are meta-analyses of reading and math and science programs.  I’ve written them myself, as have many others.  In each, some number of relevant studies are identified.  From each study, one or more “effect sizes” are computed to represent the impact of the program on important outcomes, such as scores on achievement tests. These are then averaged to get an overall impact for each program or type of program.  Think of the effect size as boiling down tomatoes to make concentrated paste, to fit into an itsy bitsy can.
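For readers who want to see the boiling-down step, here is a minimal sketch. It assumes the effect size is a standardized mean difference (treatment mean minus control mean, divided by the pooled standard deviation), which is one common definition but not necessarily the one used in any particular review. The scores and the function name are invented, and the studies are averaged without weights for simplicity.

```python
import math
import statistics

def standardized_mean_difference(treatment_scores, control_scores):
    """Effect size: difference in group means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment_scores), len(control_scores)
    var_t = statistics.variance(treatment_scores)
    var_c = statistics.variance(control_scores)
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return (statistics.mean(treatment_scores) - statistics.mean(control_scores)) / pooled_sd

# Hypothetical test scores for two studies of the same program, invented for illustration.
study_scores = {
    "study_1": ([72, 75, 78, 80, 85], [70, 74, 76, 79, 81]),
    "study_2": ([55, 60, 62, 66, 67], [54, 58, 63, 64, 66]),
}

# One effect size per study.
per_study = {
    name: standardized_mean_difference(treated, control)
    for name, (treated, control) in study_scores.items()
}

# Simple unweighted average across studies.
overall = statistics.mean(list(per_study.values()))

for name, es in per_study.items():
    print(f"{name}: effect size {es:+.2f}")
print(f"average effect size: {overall:+.2f}")
```

Everything about each study (who was tested, for how long, on what measure) disappears into a single number at this step, which is exactly why the quality of what goes in matters so much.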

But here is the problem.  The Contadina ad specifies eight great tomatoes. If even one tomato is instead a really lousy one, the contents of the itsy bitsy can will be lousy.  Ultimately, lousy tomato pastes would bankrupt the company.

The same is true of meta-analyses. Some meta-analyses include a broad range of studies: good, mediocre, and bad. They may try to statistically control for various factors, but this does not do the job. Bad studies lead to bad outcomes. Years ago, I critiqued a meta-analysis of “class size.” The studies of class size in ordinary classrooms found small effects. But there was one study that involved teaching tennis. In small classes, the kids got a lot more court time than did kids in large classes. This study, and only this study, found substantial effects of class size, substantially shifting the average. There were not eight great tomatoes; there was at least one lousy tomato, which made the itsy bitsy can worthless.

The point I am making here is that when doing meta-analysis, the studies must be pre-screened for quality, and then carefully scrubbed.  Specifically, there are many factors that greatly (and falsely) inflate effect size.  Examples include use of assessments made by the researchers and ones that assess what was taught in the experimental group but not the control group, use of small samples, and provision of excessive assistance to the teachers.

Some meta-analyses just shovel all the studies into a computer and report an average effect size. More responsible ones shovel the studies into a computer and then test for and control for various factors that might affect outcomes. This is better, but you just can’t control for lousy studies, because they are often lousy in many ways.

Instead, high-quality meta-analyses set specific criteria for inclusion intended to minimize bias.  Studies often use both valid measures and crummy measures (such as those biased toward the experimental group).  Good meta-analyses use the good measures but not the (defined in advance) crummy ones.  Studies that only used crummy measures are excluded.  And so on.
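The measure-level rule described above can be sketched in a few lines. This is an illustration only: the study names, effect sizes, and the "valid" flags are invented, and in a real review the flags would come from criteria defined in advance rather than being assigned by hand.

```python
import statistics

# Hypothetical data, invented for illustration only.
# Each study lists (effect_size, measure_is_valid) for every outcome it reported.
studies = {
    "study_1": [(0.15, True), (0.70, False)],  # one valid measure, one biased measure
    "study_2": [(0.10, True)],                 # valid measure only
    "study_3": [(0.85, False)],                # biased measure only
}

def study_effect(outcomes):
    """Average a study's valid measures; return None if it has no valid measure."""
    valid = [es for es, is_valid in outcomes if is_valid]
    return statistics.mean(valid) if valid else None

# Keep valid measures from each study; drop studies that used only biased measures.
included = {}
for name, outcomes in studies.items():
    effect = study_effect(outcomes)
    if effect is not None:
        included[name] = effect

overall = statistics.mean(list(included.values()))

for name in studies:
    status = f"{included[name]:+.2f}" if name in included else "excluded"
    print(f"{name}: {status}")
print(f"average of included studies: {overall:+.3f}")
```

Note that a study is not thrown out just for reporting a biased measure alongside a valid one; only the biased measure is ignored, and only studies with nothing else to offer are excluded.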

With systematic standards, systematically applied, meta-analyses can be of great value.  Call it the Contadina method.  In order to get great tomato paste, start with great tomatoes. The rest takes care of itself.