“But It Worked in the Lab!” How Lab Research Misleads Educators

In researching John Hattie’s meta-meta-analyses, and digging into the original studies, I discovered one underlying factor that, more than anything, explains why he consistently comes up with greatly inflated effect sizes: Most studies in the meta-analyses that he synthesizes are brief, small, artificial lab studies. And lab studies produce very large effect sizes that have little if any relevance to classroom practice.

This discovery reminds me of one of the oldest science jokes in existence. One scientist to another: “Your treatment worked very well in practice, but how will it work in the lab?” (Or “…in theory?”)


The point of the joke, of course, is to poke fun at scientists more interested in theory than in practical impacts on real problems. Personally, I have great respect for theory and lab studies. My very first publication as a psychology undergraduate involved an experiment on rats.

Now, however, I work in a rapidly growing field that applies scientific methods to the study and improvement of classroom practice.  In our field, theory also has an important role. But lab studies?  Not so much.

A lab study in education is, in my view, any experiment that tests a treatment so brief, so small, or so artificial that it could never be used all year. An evaluation of any treatment that could never be replicated might also be considered a lab study, even if it went on for several months: for example, a technology program in which a graduate student stands ready to help every four students throughout the experiment, or a tutoring program in which the tutoring is provided by the study’s author or his or her students.

Our field exists to try to find practical solutions to practical problems in an applied discipline.  Lab studies have little importance in this process, because they are designed to eliminate all factors other than the variables of interest. A one-hour study in which children are asked to do some task under very constrained circumstances may produce very interesting findings, but cannot recommend practices for real teachers in real classrooms.  Findings of lab studies may suggest practical treatments, but by themselves they never, ever validate practices for classroom use.

Lab studies are almost invariably doomed to success. Their conditions are carefully set up to support a given theory, and because they are small, brief, and highly controlled, they produce huge effect sizes. (Because they are also relatively easy and inexpensive to do, it is easy to discard them when they do not work out, contributing to the universally reported tendency of published studies to show much higher effects than unpublished ones.) Lab studies are so common not only because researchers believe in them, but also because they are quick and cheap, while meaningful field experiments are difficult and expensive. Need a publication? Randomly assign your college sophomores to two artificial treatments and set up an experiment that cannot fail to show significant differences. Need a dissertation topic? Do the same in your third-grade class, or in your friend’s tenth-grade English class. Working with some undergraduates, we once did three lab studies in a single day. All were published. As with my own sophomore rat study, lab experiments are a good opportunity to learn to do research. But that does not make them relevant to practice, even if they happen to take place in a school building.
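How much can a significance filter alone inflate small-study effect sizes? Here is a minimal simulation, not from the original post; the true effect of 0.10, the 20 students per group, and the p < .05 publication rule are all illustrative assumptions:

```python
# Hypothetical illustration: many small "lab studies" of a weak treatment,
# where only significant positive results are published.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

TRUE_EFFECT = 0.10   # assumed true effect, in standard deviation units
N_PER_GROUP = 20     # assumed size of a typical small lab study
N_STUDIES = 5000

published = []
for _ in range(N_STUDIES):
    treatment = rng.normal(TRUE_EFFECT, 1.0, N_PER_GROUP)
    control = rng.normal(0.0, 1.0, N_PER_GROUP)
    _, p = stats.ttest_ind(treatment, control)
    # Cohen's d with a pooled standard deviation
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    d = (treatment.mean() - control.mean()) / pooled_sd
    if p < 0.05 and d > 0:  # only "successful" studies reach print
        published.append(d)

print(f"True effect size:           {TRUE_EFFECT:.2f}")
print(f"Mean published effect size: {np.mean(published):.2f}")
print(f"Share of studies published: {len(published) / N_STUDIES:.1%}")
```

Under these assumptions, the only studies that reach print report effects several times the true value, which is exactly the pattern that makes collections of small, easily discarded studies look so impressive in aggregate.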

By doing meta-analyses, or meta-meta-analyses, Hattie and others who do similar reviews obscure the fact that many, and usually most, of the studies they include are very brief, very small, and very artificial, and therefore produce very inflated effect sizes. They do this by burying the relevant information under numbers and statistics rather than descriptions of individual studies, and by including such large numbers of studies that no one wants to dig deeper into them. In Hattie’s case, he claims that the Visible Learning meta-meta-analyses contain 52,637 individual studies. Who wants to read 52,637 individual studies, only to find out that most are lab studies with no direct bearing on classroom practice? It is difficult for readers to do anything but assume that the 52,637 studies must have taken place in real classrooms and achieved real outcomes over meaningful periods of time. But in fact, the few that did are overwhelmed by the thousands of lab studies that did not.

Educators have a right to data that are meaningful for the practice of education.  Anyone who recommends practices or programs for educators to use needs to be open about where that evidence comes from, so educators can judge for themselves whether or not one-hour or one-week studies under artificial conditions tell them anything about how they should teach. I think the question answers itself.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


What Kinds of Studies Are Likely to Replicate?


In the hard sciences, there is a publication called the Journal of Irreproducible Results.  It really has nothing to do with replication of experiments, but is a humor journal by and for scientists.  The reason I bring it up is that to chemists and biologists and astronomers and physicists, for example, an inability to replicate an experiment is a sure indication that the original experiment was wrong.  To the scientific mind, a Journal of Irreproducible Results is inherently funny, because it is a journal of nonsense.

Replication, the ability to repeat an experiment and get a similar result, is the hallmark of a mature science. Sad to say, replication is rare in educational research, which says a lot about our immaturity as a science. For example, in the What Works Clearinghouse, about half of the programs across all topics are represented by a single evaluation, and when there are two or more, the results are often very different. Relatively recent funding initiatives, especially studies supported by Investing in Innovation (i3) and the Institute of Education Sciences (IES), and targeted initiatives such as Striving Readers (secondary reading) and the Preschool Curriculum Evaluation Research (PCER) program, have added a great deal in this regard. They have funded many large-scale, randomized, very high-quality studies of all sorts of programs, many of which are replications themselves or provide a good basis for later replications. As my colleagues and I have done many reviews of research in every area of education, pre-kindergarten to grade 12 (see www.bestevidence.org), we have gained a good intuition about what kinds of studies are likely to replicate and what kinds are less likely.

First, let me define in more detail what I mean by “replication.” There is no value in replicating biased studies, which may well consistently find the same biased results, as when both the original and replication studies use the same researcher- or developer-made outcome measures slanted toward the content the experimental group experienced but the control group did not (see http://www.tandfonline.com/doi/abs/10.1080/19345747.2011.558986).

Instead, I’d consider a successful replication one that shows positive outcomes both in the original studies and in at least one large-scale, rigorous replication. One obvious way to increase the chances that a program producing a positive outcome in one or more initial studies will succeed in such a rigorous replication evaluation is to use a similar, equally rigorous evaluation design in the first place. I think a lot of treatments that fail to replicate are ones that used weak methods in the original studies. In particular, small studies tend to produce greatly inflated effect sizes (see http://www.bestevidence.org/methods/methods.html), which are unlikely to replicate in larger evaluations.
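To make the same point concretely, here is a second hedged sketch; again, every number (the true effect of 0.15, 25 students per group initially, 500 per group at scale) is my illustrative assumption, not a figure from any actual evaluation:

```python
# Hypothetical illustration: programs whose small initial study was
# significant are re-evaluated in a large, rigorous replication.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

TRUE_EFFECT = 0.15          # assumed true program effect
SMALL_N, LARGE_N = 25, 500  # assumed per-group sample sizes
TRIALS = 2000

initial_d, replication_d = [], []
for _ in range(TRIALS):
    # Small initial study; keep the program only if it "worked" (p < .05, positive)
    t1 = rng.normal(TRUE_EFFECT, 1.0, SMALL_N)
    c1 = rng.normal(0.0, 1.0, SMALL_N)
    _, p1 = stats.ttest_ind(t1, c1)
    d1 = (t1.mean() - c1.mean()) / np.sqrt((t1.var(ddof=1) + c1.var(ddof=1)) / 2)
    if p1 >= 0.05 or d1 <= 0:
        continue
    # Large-scale rigorous replication of the very same true effect
    t2 = rng.normal(TRUE_EFFECT, 1.0, LARGE_N)
    c2 = rng.normal(0.0, 1.0, LARGE_N)
    d2 = (t2.mean() - c2.mean()) / np.sqrt((t2.var(ddof=1) + c2.var(ddof=1)) / 2)
    initial_d.append(d1)
    replication_d.append(d2)

print(f"Mean effect in significant small studies: {np.mean(initial_d):.2f}")
print(f"Mean effect in large replications:        {np.mean(replication_d):.2f}")
```

In this sketch the large replications cluster near the true effect, while the significant small studies that preceded them report effects several times as large, even though the program itself never changed. That gap is much of what a “failure to replicate” amounts to.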

Another factor likely to contribute to replicability is use in the earlier studies of methods or conditions that can be repeated in later studies, or in schools in general. For example, providing teachers with specific manuals, videos demonstrating the methods, and specific student materials all add to the chances that a successful program can be successfully replicated. Avoiding unusual pilot sites (such as schools known to have outstanding principals or staff) may contribute to replication, as these conditions are unlikely to be found in larger-scale studies. Having experimenters or their colleagues or graduate students extensively involved in the early studies diminishes replicability, of course, because those conditions will not exist in replications.

Replications are entirely possible, and I wish there were a lot more of them in our field. Showing that a program is effective in two rigorous evaluations is way more convincing than just one. As evidence becomes more and more important, I hope and expect that replications, perhaps carried out by states or districts, will become more common.

The Journal of Irreproducible Results is fun, but it isn’t science. I’d love to see a Journal of Replications in Education to tell us what really works for kids.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Columbus and Replicability

Happy Columbus Day!

Columbus is revered among researchers because:

  1. He didn’t know where he was going;
  2. He didn’t know where he was when he got there; and
  3. He did it all on government money.

Columbus gets a lot of abuse these days, and for good reason. He was a terrible person. However, people also say that he didn’t actually discover America. Leif Erikson had been here earlier, they say, and of course the Indians were already here.

What Columbus did discover was not America per se, but a replicable and openly published route to America. And that’s what made him justifiably famous. In research, as in discovery, what matters is replicability, the ability to show that you can do something again, and to tell others how they can do the same. Columbus was indisputably the first to do that (Leif Erikson kept his voyage secret).

Replicability is the hallmark of science. In science, if you can’t do it again, it didn’t happen. In fact, there is a popular science humor magazine called the Journal of Irreproducible Results, named for this principle.

As important as replication is in all of science, it is rare in educational research. It’s difficult to get funding to do replications, and if you manage to replicate a finding, journal editors are likely to dismiss it (“What does this add to the literature?” they say). Yet as evidence-based reform in education advances, the need for replication increases. This is a problem because, for example, the majority of programs with at least one study that met What Works Clearinghouse standards had exactly one study that did so.

Soon, results will become available for the first and largest cohort of projects funded by the Investing in Innovation (i3) program. Some of these will show positive effects and some will show outcomes close enough to significance to be worth trying again. I hope there will be opportunities for these programs to replicate and hopefully improve their outcomes, so we can expand our armamentarium of replicable and effective approaches to enduring problems of education.

We really should celebrate Columbus Day on November 3rd, when Columbus returned to the New World. The day he first reached the New World was a significant event, but it wasn’t really important until he showed that he (and anyone else) could do it again.

It’s Proven. It’s Perfect. I’ll Change It.

I recently visited Kraków, Poland. It’s a wonderful city. One of its highlights is a beautiful royal castle, built in the 16th century by an Italian architect. The castle had just one problem. It had no interior hallways. To go from room to room, you had to go outside onto a covered walkway overlooking a courtyard. This is a perfectly good idea in warm Italy, but in Poland it can get to 30 below in the winter!

In evidence-based reform in education, we have a related problem. As proven programs become more important in policy and practice, many educators ask whether programs proven in one place (say, warm Florida) will work in another (say, cold Minnesota). In fact, many critics of evidence-based reform base their criticism on the idea that every school and every context is different, so it is impossible to have programs that can apply across all schools.

Obviously, the best answer to this problem is to test promising programs in many places, until we can say either that they work across a broad range of circumstances or that there are key context-based limiting variables. While the evidence may not yet (or ever) be definitive, it is worthwhile to use common sense about what factors might limit generalizability and which are unlikely to do so. For example, for indoor activities such as teaching, hot and cold climates probably do not matter. Rural versus urban locations might matter a great deal for parent involvement programs or attendance programs or after school programs, where families’ physical proximity to the school and transportation issues are likely to be important. English learners certainly need accommodations to their needs that other children may not. Other ethnic-group or social class differences may impact the applicability of particular programs in particular settings. But especially for classroom instructional approaches, it will most often be the case that kids are kids, schools are schools, and effective is effective. Programs that are effective with one broad set of schools and students are likely to be effective in other similar settings. Programs that work in urban Title I schools mainly teaching native English-speaking students in several locations are likely to be effective in similar settings nationally, and so on.

Yet many educators, even those who believe in evidence, are willing to adopt proven programs, but then immediately want to change them, often in major ways. This is usually a very bad idea. The research field is full of examples of programs that consistently work when implemented as intended, but fail miserably when key elements are altered or completely left out. Unless there are major, clear reasons why changes must be made, it is best to implement programs as they were when they achieved their positive outcomes. Over time, as schools become familiar with a program, school leaders and teachers might discuss revisions with the program developer and implement sensible changes in line with the model’s theory of action and evidence base.

Faithful replication is important for obvious reasons, namely sticking as close as possible to the factors that made the original program effective. However, there is a less obvious reason that replications should be as true as possible to the original, at least in the first year or early years of implementation: when educators complain about a new program “taking away their creativity,” they are often in fact looking for ways to keep doing what they have always done. And if educators do what they have always done, they will get what they have always gotten, as the saying often attributed to Einstein goes.

Innovation within proven programs can be a good thing, once schools have fully embraced and thoroughly understand a given program and can see where it might be improved or adapted to their circumstances. However, innovating too early in replication is likely to turn the best of programs into mush.

It is perfectly fair for school districts, schools and/or teachers to examine the evidence supporting a new approach to judge just how robust that evidence is: has the program proved itself across a reasonable range of school environments not radically unlike their own? But if the answer to that question is yes, then fidelity of implementation should be the guiding principle of adopting the new program.

Kraków’s castle should have had interior halls to suit the cold Polish winters. However, if everyone’s untested ideas about castle design had been thrown into the mix from the outset, the castle might never have stood up in the first place!