“But It Worked in the Lab!” How Lab Research Misleads Educators

In researching John Hattie’s meta-meta-analyses and digging into the original studies, I discovered one underlying factor that, more than anything, explains why he consistently comes up with greatly inflated effect sizes: most studies in the meta-analyses he synthesizes are brief, small, artificial lab studies. And lab studies produce very large effect sizes that have little if any relevance to classroom practice.

This discovery reminds me of one of the oldest science jokes in existence. One scientist to another: “Your treatment worked very well in practice, but how will it work in the lab?” (Or “…in theory?”)

The point of the joke, of course, is to poke fun at scientists more interested in theory than in practical impacts on real problems. Personally, I have great respect for theory and lab studies. My very first publication as a psychology undergraduate involved an experiment on rats.

Now, however, I work in a rapidly growing field that applies scientific methods to the study and improvement of classroom practice.  In our field, theory also has an important role. But lab studies?  Not so much.

A lab study in education is, in my view, any experiment that tests a treatment so brief, so small, or so artificial that it could never be used all year. An evaluation of any treatment that could never be replicated might also be considered a lab study, even if it went on for several months: for example, a technology program in which a graduate student stands by for every four students throughout the experiment, or a tutoring program in which the study author or his or her students provide the tutoring.

Our field exists to try to find practical solutions to practical problems in an applied discipline.  Lab studies have little importance in this process, because they are designed to eliminate all factors other than the variables of interest. A one-hour study in which children are asked to do some task under very constrained circumstances may produce very interesting findings, but cannot recommend practices for real teachers in real classrooms.  Findings of lab studies may suggest practical treatments, but by themselves they never, ever validate practices for classroom use.

Lab studies are almost invariably doomed to success. Their conditions are carefully set up to support a given theory, and because they are small, brief, and highly controlled, they produce huge effect sizes. Because they are also relatively easy and inexpensive to do, it is easy to discard them when they do not work out, which contributes to the universally reported tendency of published studies to show much higher effects than unpublished ones.

Lab studies are so common not only because researchers believe in them, but also because they are easy and inexpensive to do, while meaningful field experiments are difficult and expensive. Need a publication? Randomly assign your college sophomores to two artificial treatments and set up an experiment that cannot fail to show significant differences. Need a dissertation topic? Do the same in your third-grade class, or in your friend’s tenth-grade English class. Working with some undergraduates, we once did three lab studies in a single day; all were published. As with my own sophomore rat study, lab experiments are a good opportunity to learn to do research. But that does not make them relevant to practice, even if they happen to take place in a school building.
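To see why tight control inflates effect sizes, consider the standard effect size formula with some purely illustrative numbers (this sketch is mine, not a calculation from Hattie or any actual study). An effect size is the treatment-control difference in means divided by the pooled standard deviation, so the homogeneous samples and narrow, experimenter-aligned measures typical of lab studies shrink the denominator and enlarge the effect size even when the raw gain is identical:

\[
d = \frac{\bar{X}_T - \bar{X}_C}{SD_{\text{pooled}}}, \qquad
d_{\text{lab}} = \frac{54 - 50}{5} = 0.80, \qquad
d_{\text{field}} = \frac{54 - 50}{15} \approx 0.27.
\]

In this hypothetical, the same four-point gain looks three times as large in the lab scenario simply because the measured variation is one third as great.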

By doing meta-analyses, or meta-meta-analyses, Hattie and others who do similar reviews obscure the fact that many, and usually most, of the studies they include are very brief, very small, and very artificial, and therefore produce very inflated effect sizes. They do this by burying the relevant information under numbers and statistics rather than reporting on individual studies, and by including such large numbers of studies that no one wants to dig deeper into them. In Hattie’s case, he claims that his Visible Learning meta-meta-analyses contain 52,637 individual studies. Who wants to read 52,637 individual studies, only to find out that most are lab studies with no direct bearing on classroom practice? It is difficult for readers to do anything but assume that the 52,637 studies must have taken place in real classrooms and achieved real outcomes over meaningful periods of time. But in fact, the few that did are overwhelmed by the thousands of lab studies that did not.
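A simple weighted average, again with numbers I have invented only for illustration, shows how aggregation hides this composition. Suppose a meta-analysis pools 80% lab studies averaging an effect size of 0.70 with 20% classroom studies averaging 0.15. The headline figure becomes

\[
\bar{d} = 0.80 \times 0.70 + 0.20 \times 0.15 = 0.59,
\]

roughly four times the classroom estimate, even though the classroom estimate is the only one that speaks to practice.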

Educators have a right to data that are meaningful for the practice of education.  Anyone who recommends practices or programs for educators to use needs to be open about where that evidence comes from, so educators can judge for themselves whether or not one-hour or one-week studies under artificial conditions tell them anything about how they should teach. I think the question answers itself.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

What Schools in One Place Can Learn from Schools Elsewhere

In a recent blog, I responded to an article by Lisbeth Schorr and Srik Gopal about their concerns that the findings of randomized experiments will not generalize from one set of schools to another. I got a lot of supportive responses to the blog, but I realized that I left out a key point.

The missing point was this: the idea that effective programs readily generalize from one place to another is not theoretical. It happens all the time. I try to avoid talking about our own programs, but in this case it’s unavoidable. Our Success for All program started almost 30 years ago, working with African American students in Baltimore. We got terrific results with those first schools. But our first dissemination schools beyond Baltimore included a Philadelphia school primarily serving Cambodian immigrants, rural schools in the South, small-town schools in the Midwest, and so on. We had to adapt and refine our approaches for these different circumstances, but we found positive effects across a very wide range of settings and circumstances. Over the years, some of our most successful schools have been ones serving Native American students, such as a school in the Arizona desert and a school in far northern Quebec. Another category of schools where we see outstanding success is those serving Hispanic students, including English language learners, as in the Alhambra district in Phoenix and a charter school near Los Angeles. One of our most successful districts anywhere is in the small city of Steubenville, Ohio. We have established a successful network of SFA schools in England and Wales, where we have extraordinary schools primarily serving Pakistani, African, and disadvantaged White students in a very different policy context from the one we face in the U.S. And yes, we continue to find great results in Baltimore and in cities that resemble our original home, such as Detroit.

The ability to generalize from one set of schools to others is not at all limited to Success for All. Reading Recovery, for example, has had success in every kind of school, in countries throughout the world. Direct Instruction has also been successful in a wide array of types of schools. In fact, I’d argue that it is rare to find programs that have been proven to be effective in rigorous research that then fail to generalize to other schools, even ones that are quite different. Of course, there is great variation in outcomes in any set of schools using any innovative program, but that variation has to do with leadership, local support, resources, and so on, not with a fundamental limitation on generalizability to additional populations.

How is it possible that programs initially designed for one setting and population so often generalize to others? My answer would be that in most fundamental regards, the closer you get to the classroom, the more schools begin to resemble each other. Individual students do not all learn the same way, but every classroom contains a range of students who have a predictable set of needs. Any effective program has to be able to meet those needs, wherever the school happens to be located. For example, every classroom has some number of kids who are confident, curious, and capable, some number who are struggling, some number who are shy and quiet, some number who are troublemakers. Most contain students who are not native speakers of English. Any effective program has to have a workable plan for each of these types of students, even if the proportions of each may vary from classroom to classroom and school to school.

There are reasonable adaptations necessary for different school contexts, of course. There are schools where attendance is a big issue and others where it can be assumed, schools where safety is a major concern and others where it is less so. Schools in rural areas have different needs from those in urban or suburban ones, and obviously schools with many recent immigrants have different needs from those in which all students are native speakers of English. Involving parents effectively looks different in different places, and there are schools in which eyeglasses and other health concerns can be assumed to be taken care of and others where they are major impediments to success. But after the necessary accommodations are made, you come down to a teacher and twenty to thirty children who need to be motivated, to be guided, to have their individual needs met, and to have their time used to greatest effect. You need to have an effective plan to manage diverse needs and to inspire kids to see their own possibilities. You need to fire children’s imaginations and help them use their minds well to write and solve problems and imagine their own futures. These needs exist equally in Peru and Poughkeepsie, in the Arizona desert or the valleys of Wales, in Detroit or Eastern Kentucky, in California or Maine.

Disregarding evidence from randomized experiments because it does not always replicate is a recipe for the status quo, as far as the eye can see. And the status quo is unacceptable. In my experience, when programs fail to replicate it is because they were never all that successful in the first place, or because implementers attempt to replicate a form of the model much less robust than the one that was researched.

Generalization can happen. It happens all the time. It has to be planned for, designed for, not just assumed, but it can and does happen. Rather than using failure to replicate as a stick to beat evidence-based policy, let’s agree that we can learn to replicate, and then use every tool at hand to do so. There are so many vulnerable children who need better educations, and we cannot be distracted by arguments that “nothing replicates” that are contradicted by many examples throughout the world.

Can Findings From One Set Of Schools Apply To Others?

Every person is unique. Yet that does not mean that research showing the effectiveness of medical treatments, for example, does not apply to people beyond the ones who were in a particular study. If nothing generalized from one circumstance to others, then science would be meaningless. I think every educated person understands this.

Yet for some reason, research in education is often criticized for trying to generalize from one set of schools to others. Whenever I speak about evidence-based reform in education, most recently in a talk at my alma mater, Reed College, someone raises this concern. In a recent article by Lisbeth Schorr and Srik Gopal, the authors wonder how anything can generalize “from Peru to Poughkeepsie.”

First, let me state the obvious. Every school is different, and findings from studies done elsewhere cannot be assumed to apply to a specific school or set of schools. However, it would be foolish to ignore the evidence from high-quality research, especially to the degree that a given school considering using a program or practice found effective elsewhere resembles the schools in the study that established that evidence. So Peru to Poughkeepsie might be a stretch, unless it is Peru, Illinois. And should Poughkeepsie ignore evidence from nearby Tarrytown and Nyack? Taking a position that generalization is never appropriate would be just as unjustified as taking a position that it is always justified.

There is an old saying to the effect that the race is not always to the swift nor the battle to the strong, but it’s best to bet that way. When responsible educators choose programs for their schools and districts, they are making a bet on behalf of their children. Why would they not take the evidence into account in making these important choices?

Determining when generalization is more or less likely is not too difficult. First, you’d want to consider the strength of the evidence. For example, a program proven effective in multiple studies done by multiple researchers with many diverse schools, with random assignment to experimental or control groups and measures not made by the experimenters or developers, should give potential adopters a lot of confidence. To the degree that those studies involved schools similar to yours, serving similar communities, that adds a lot. The consistency of the outcomes across different studies would be important to consider.

Schorr and Gopal are not opposed to randomized studies, but they warn about placing too much reliance on them. Yet the advantage of randomized studies is precisely that they rule out bias. How can that be a bad thing? What we need is a lot more randomized studies, and other studies with rigorous designs, done in a lot of places, so that we can build up a large and diverse evidence base for programs that can be replicated. The road to generalizability, in other words, is precisely the one that Schorr and Gopal would have us de-emphasize: if fewer randomized studies are done, we will lack the quality, size, freedom from bias, and diversity of research needed to determine whether a program is truly and broadly effective.

Discussions about when generalization is most likely to take place are healthy and welcome. But they are not academic. America’s schools are not getting better fast enough, and achievement gaps by race and class remain unacceptable. Identifying proven programs and practices and replicating them broadly is the best way I know of to make genuine, lasting progress. The evidence base is only now getting large and good enough to justify policies of evidence-based reform, as the recent ESSA legislation tentatively begins to do. We need to continue to expand that evidence base and to use what we do know while working to learn more. Pretending that no school can learn from what was done in any other does not move us forward, and forward is the direction we need to be moving as fast as we possibly can.

It’s Proven. It’s Perfect. I’ll Change It.

I recently visited Kraków, Poland. It’s a wonderful city. One of its highlights is a beautiful royal castle, built in the 16th century by an Italian architect. The castle had just one problem. It had no interior hallways. To go from room to room, you had to go outside onto a covered walkway overlooking a courtyard. This is a perfectly good idea in warm Italy, but in Poland it can get to 30 below in the winter!

In evidence-based reform in education, we have a related problem. As proven programs become more important in policy and practice, many educators ask whether programs proven in one place (say, warm Florida) will work in another (say, cold Minnesota). In fact, many critics of evidence-based reform base their criticism on the idea that every school and every context is different, so it is impossible to have programs that can apply across all schools.

Obviously, the best answer to this problem is to test promising programs in many places, until we can say either that they work across a broad range of circumstances or that there are key context-based limiting variables. While the evidence may not yet (or ever) be definitive, it is worthwhile to use common sense about which factors might limit generalizability and which are unlikely to do so. For example, for indoor activities such as teaching, hot and cold climates probably do not matter. Rural versus urban locations might matter a great deal for parent involvement programs or attendance programs or after-school programs, where families’ physical proximity to the school and transportation issues are likely to be important. English learners certainly need accommodations that other children may not. Other ethnic-group or social-class differences may affect the applicability of particular programs in particular settings. But especially for classroom instructional approaches, it will most often be the case that kids are kids, schools are schools, and effective is effective. Programs that are effective with one broad set of schools and students are likely to be effective in other similar settings. Programs that work in urban Title I schools mainly teaching native English-speaking students in several locations are likely to be effective in similar settings nationally, and so on.

Yet many educators, even those who believe in evidence, are willing to adopt proven programs, but then immediately want to change them, often in major ways. This is usually a very bad idea. The research field is full of examples of programs that consistently work when implemented as intended, but fail miserably when key elements are altered or completely left out. Unless there are major, clear reasons why changes must be made, it is best to implement programs as they were when they achieved their positive outcomes. Over time, as schools become familiar with a program, school leaders and teachers might discuss revisions with the program developer and implement sensible changes in line with the model’s theory of action and evidence base.

Faithful replication is important for obvious reasons, namely sticking as close as possible to the factors that made the original program effective. However, there is a less obvious reason that replications should be as true as possible to the original, at least in the first year or early years of implementation: when educators complain about a new program “taking away their creativity,” they are often in fact looking for ways to keep doing what they have always done. And if educators do what they have always done, they will get what they have always gotten, as the old saying goes.

Innovation within proven programs can be a good thing, when schools have fully embraced and thoroughly understand a given program and now can see where it can be improved or adapted to their circumstances. However, innovation too early in replication is likely to turn the best of innovations into mush.

It is perfectly fair for school districts, schools and/or teachers to examine the evidence supporting a new approach to judge just how robust that evidence is: has the program proved itself across a reasonable range of school environments not radically unlike their own? But if the answer to that question is yes, then fidelity of implementation should be the guiding principle of adopting the new program.

Kraków’s castle should have had interior halls to adapt to the cold Polish winters. However, if everyone’s untested ideas about palace design were thrown into the mix from the outset, the palace might never have stood up in the first place!