Response to Intervention and Bob’s Law

The problem in education reform isn’t a lack of good ideas. It’s a lack of good ideas implemented with enough clarity, consistency and integrity to actually make a difference in rigorous experiments (and therefore in large-scale application). A recent large-scale evaluation of Response to Intervention (RTI) illustrates this problem once again.

Response to Intervention (RTI) is a strategy for helping students who are struggling to keep up with ordinary classroom teaching. The idea is that following initial instruction (Tier 1), teachers provide mild assistance to students who are having difficulties (Tier 2), such as small-group remediation. Those who continue to struggle might receive more intensive assistance (Tier 3), such as one-to-one tutoring.

RTI has been common in U.S. classrooms for about 20 years, and its use was strongly encouraged by the 2004 reauthorization of the Individuals with Disabilities Education Act (IDEA). So it is distressing that a recent study by MDRC, a respected independent research organization, found no positive effects of Tier 2 services in grades 1-3 reading. In fact, there were slight negative effects for first graders receiving Tier 2 services.

Philosophically, I am a supporter of RTI, but I’m even more of a supporter of rigorous evidence. Yet here’s a very large, well-done (though not randomized) study of RTI that finds no benefits.

I think the findings of the RTI evaluation speak to a broader problem of education policy. Often, national, state or local policies promote or require uses of broadly defined strategies. RTI is a perfect example. Everyone understands the general idea, but there are thousands of ways to implement RTI in practice.

Studies of broad teaching concepts almost always find that they make little if any difference. The reason is that general concepts are implemented differently from class to class and school to school, and end up on average looking a lot like what teachers were doing before, or are doing in schools that do not claim to be implementing the broad concept. That is, the “experimental” classes are not terribly different from the “control” classes.

As a good example of this problem, in the 1970s and ‘80s, Madeline Hunter was extremely popular, and she spoke everywhere suggesting effective classroom strategies. Yet several studies found that when teachers were given training and coaching in Madeline Hunter strategies, it made no difference in achievement. Why? The studies also found that control teachers were already using strategies much like those in the Hunter model. It may well be that Madeline Hunter’s theories were so popular precisely because they were appealing descriptions of what teachers already were doing. Everyone likes to hear that what they’ve always done turns out to be supported by research. So, exactly what made Hunter’s prescriptions popular also made them no more effective than ordinary teaching, because they were ordinary teaching.

In the case of RTI, the MDRC researchers documented some differences between schools using RTI and those that were not, but there was enormous overlap.

So here’s a proposal for what I’ll call Bob’s Law: General teaching strategies that are subject to substantially varying interpretations by individual teachers are likely to be transformed into practices much like ordinary teaching. For this reason, they are unlikely to produce better outcomes than control groups do.

This does not mean that ordinary teaching methods are bad, or that teaching methods informed by general concepts are bad. What it does imply is that if you want to see marked improvements in student achievement across many teachers and schools, you need programs that are well-conceived, well-specified, and well-supported by top-quality professional development and materials. Even then, not all programs work, but success is at least possible if programs bring about systematic and sensible change in teaching methods.

Returning to RTI: I remain hopeful that its strategies can improve student outcomes. However, the approaches to this concept that are likely to work are ones that are specific about all key aspects of the design and that help teachers implement methods markedly better than whatever they were using before.


Random Thoughts

Perhaps the most extraordinary change brought about by the evidence-based reform movement in education is the rapidly expanding number of experimental studies that use random assignment to treatment or control groups. As recently as the 1990s, randomized experiments were rare in education. However, in the 2000s, the new Institute of Education Sciences (IES) began to strongly encourage randomized experiments, and later Investing in Innovation (i3) insisted on randomization for its larger grants. IES established training programs to greatly increase the number of scholars able to design, carry out, and analyze randomized experiments. In England, similar developments took place with the establishment of the Education Endowment Foundation (EEF). As randomized experiments have become the standard of evidence, other agencies, private foundations, and commercial companies have also begun to sponsor randomized experiments.

The importance of randomization is clear from the world of medicine. One of the most important reasons that medicine makes rapid and irreversible progress in new drugs and procedures is that medicine routinely subjects promising treatments to experiments in which subjects are assigned at random to receive either the new treatment or a control treatment representing the current standard of care. Random assignment is essential in experiments because it ensures that, on average, subjects in the experimental and control groups are equal in every way except for the treatment itself.

The main alternative to randomized experiments is the quasi-experiment, in which subjects who receive the new treatment are matched with similar subjects who do not. In education, especially in evaluations of methods intended to increase learning, students may be matched on prior achievement, as well as on demographic factors such as social class, race, and English proficiency. Quasi-experiments can be very good in many cases, but there is always a chance that some unmeasured factor could explain positive-looking effects. For example, even if a quasi-experiment matched students on achievement and demographics, the teachers may not be equal, because those using the new method chose to do so while the control teachers did not. The teachers who chose the method might be better teachers: perhaps more enthusiastic, harder working, or more positively oriented toward innovation. These factors, rather than the treatment itself, could lead to improved outcomes.
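To see why self-selection matters, here is a minimal Python sketch (all numbers hypothetical, not drawn from any study): volunteering teachers are assumed to be more enthusiastic, and even though the true treatment effect is set to zero, a simple matched comparison shows an apparent benefit for the "treatment."

```python
import random
import statistics

random.seed(1)

def class_mean_score(enthusiasm, treatment_effect=0.0):
    """Mean posttest score for one class: baseline 50 plus a
    teacher-enthusiasm bonus and any true effect of the method."""
    return 50 + 5 * enthusiasm + treatment_effect + random.gauss(0, 2)

# Teachers who volunteer for a new method tend to be more enthusiastic.
# The true treatment effect here is ZERO, yet the matched comparison
# credits the method with the enthusiasm difference.
volunteer_classes = [class_mean_score(enthusiasm=1.0) for _ in range(200)]
control_classes = [class_mean_score(enthusiasm=0.0) for _ in range(200)]

bias = statistics.mean(volunteer_classes) - statistics.mean(control_classes)
print(f"Apparent 'effect' with no real treatment: {bias:.1f} points")
```

Random assignment breaks this confound: if teachers were assigned to conditions at random, enthusiasm would be spread evenly across both groups on average.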

While methodologists have long favored randomized over matched experiments, actual experience in education has been mixed, in the sense that in some systematic reviews of treatment studies, effect sizes were pretty much the same in randomized and matched studies. If matched studies introduce bias in favor of the treatment group, shouldn’t this result in inflated effects?

My colleague Alan Cheung and I had an opportunity to test this question on a large scale. In a study of the effects of methodology on effect sizes, we looked at 645 studies that had met the stringent inclusion standards of the Best Evidence Encyclopedia (BEE). Of these, 196 used random assignment and 449 were quasi-experiments.

The result was clear. Matched quasi-experiments did produce inflated effect sizes (ES=+0.23 for quasi-experiments, +0.16 for randomized). This difference is not nearly as large as other factors we looked at, such as sample size (small studies greatly exaggerate outcomes), use of experimenter-made measures, and published vs. unpublished sources (experimenter-made tests and published sources exaggerate impacts). But our findings about matched vs. randomized studies are reason for caution about putting too much faith in quasi-experiments.
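For readers unfamiliar with the metric, an effect size (ES) of the kind compared above is simply the difference between the treatment and control group means, divided by a pooled standard deviation. A minimal sketch, using made-up scores purely for illustration:

```python
import statistics

def effect_size(treatment_scores, control_scores):
    """Standardized mean difference (Cohen's d with pooled SD)."""
    n1, n2 = len(treatment_scores), len(control_scores)
    s1 = statistics.stdev(treatment_scores)
    s2 = statistics.stdev(control_scores)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    mean_diff = statistics.mean(treatment_scores) - statistics.mean(control_scores)
    return mean_diff / pooled_sd

# Hypothetical posttest scores (illustrative only)
treated = [55, 60, 58, 62, 57, 61]
control = [52, 56, 54, 58, 53, 57]
print(f"ES = {effect_size(treated, control):+.2f}")
```

Expressing differences in standard-deviation units is what allows studies using different tests and scales to be compared in a single review.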

One kind of matched study is of particular concern. This is the post hoc, or retrospective study. In such studies, experimenters might start with a group that already received a given treatment and has already taken posttests and then go find a control group that started out at the same level on pretest scores and was similar in demographics. Such studies have an obvious danger in that since outcomes are already known, an unscrupulous investigator can easily choose matched controls already known to have made limited gains.

However, even honest researchers (and most are honest) can fool themselves with post hoc designs. The problem is that in any experiment, some students drop out or fail to complete the treatment, and researchers are likely to exclude such students from the experimental group. But when researchers form a control group after the fact by picking students from a computer file, they may include the very students who, had they been in the experimental program, would have dropped out. Here’s an extreme example. Imagine that in the first months of a matched experiment testing a new high school technology approach, 10% of the students are arrested and sent to school at Juvenile Hall. These students are naturally dropped from the experimental group. However, similar students in the control group are still in the district, so those whose pretest scores match the experimental group’s would remain in the sample. (In randomized experiments, “intent to treat” procedures are usually used, keeping all subjects in the experiment no matter what.) Our Best Evidence Encyclopedia (BEE) now excludes post hoc quasi-experiments, although it continues to accept matched studies in which the experimental and control groups were designated in advance.
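A toy simulation (assumptions and numbers entirely hypothetical) makes the attrition problem concrete: if the lowest-scoring 10% of treated students drop out and are excluded, while the control file retains everyone, a sizable apparent "effect" emerges even though the treatment does nothing at all.

```python
import random
import statistics

random.seed(2)

# No true treatment effect: posttest is just pretest plus noise.
def posttest(pretest):
    return pretest + random.gauss(0, 5)

# 10,000 treated students; the lowest-scoring 10% "drop out" (e.g., leave
# the district) and are excluded from the experimental group after the fact.
treated_pre = sorted(random.gauss(50, 10) for _ in range(10000))[1000:]

# The control file retains everyone, including the would-be dropouts.
control_pre = [random.gauss(50, 10) for _ in range(10000)]

apparent_effect = (statistics.mean(posttest(p) for p in treated_pre)
                   - statistics.mean(posttest(p) for p in control_pre))
print(f"Apparent effect with zero true effect: {apparent_effect:+.1f} points")
```

An intent-to-treat analysis would keep the dropouts in the treated group, and the spurious advantage would disappear.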

It is not yet possible in education to require that every study be randomized, and the dangers of quasi-experiments are much less when controls are designated in advance. However, if things continue as they have been in recent years, there will soon come a day when we no longer have to play with matches, except under situations in which randomization is impossible.

Good News from i3 Projects

In ancient Sparta, when a young man finished his military training, the city elders held a ceremony in which they removed a stone from the city wall. The idea was that the defense of Sparta was in its people, not its walls.

Just a few days ago, I was privileged to witness an event in which five more programs funded by the Investing in Innovation (i3) program presented their methods and findings on Capitol Hill. Our government might have used this as an opportunity to remove some of the walls that impede progress in education. As more and more programs demonstrate that they can reliably improve student achievement in high-poverty schools, can we slowly back away from the idea that evidence has little role to play in educational policy? Not all of the programs supported by i3, the Institute of Education Sciences (IES), and other funders, are achieving immediate positive outcomes in comparison to control groups, though they may do so in the future. But based just on the programs we know to be effective, why can’t we begin to base educational policies on what works while simultaneously continuing to build up our army of proven approaches?

The findings from the first set of programs funded by i3 are starting to become available. As most readers of this blog know, i3 provides grants according to the levels of evidence programs already have. In 2010, scale-up proposals had to have strong, replicated evidence of effectiveness, and could receive up to $50 million over 5 years. Validation grants needed at least one supportive study, and could receive up to $30 million, and development grants needed only a good idea, and could receive up to $5 million.

Not surprisingly, all four of the 2010 scale-up grants found positive effects in rigorous third-party evaluations. These programs – Reading Recovery, Teach for America, KIPP, and our own Success for All – reported on their impacts in a briefing on Capitol Hill in September, 2014. This year, a new crop of validation and development grantees had good news to share, and did so in a Capitol Hill briefing on October 29. Sarah Sparks of Education Week served as moderator. The Annie E. Casey Foundation, represented by Ilene Berman, funded the session.

Five programs were presented. Ruth Schoenbach of WestEd spoke for Reading Apprenticeship Improving Secondary Literacy (RAISE), which provides professional development for middle and high school content teachers (e.g., science, history, English) to help them engage students in productive ways of working with text.

Nancy Brynelson, of the California State University system, described the Expository Reading and Writing Course (ERWC), an approach used in 800 high schools across California to attempt to reduce the number of students who need remedial literacy courses when they go on to post-secondary education. Like RAISE, ERWC provides professional development to high school teachers to help them help students to engage deeply with text, using discussions, extended writing, and development of critical thinking skills.

Building Assets Reducing Risks (BARR) also focuses on high schools, but uses a very different approach. Represented at the briefing by its evaluator, Dr. Maryann Corsello, BARR focuses on social-emotional as well as academic learning. It provides professional development to teachers focused on building relationships among students and staff, effective communication, and risk reviews for low-performing students.

Two of the programs focused on early reading. The Children’s Literacy Initiative (CLI), represented by Joel Zarrow, works in grades pre-K to 3 in high-poverty schools. The program provides one-to-one coaching to teachers, as well as group and leadership coaching, to help teachers with creating and managing the learning environment, using data to guide instruction, and using best practices for early literacy.

SPARK is a tutoring program for grades K-3 that is provided by Boys & Girls Clubs of Greater Milwaukee. Pat Marcus spoke for the program. It provides one-to-one tutoring for struggling readers, primarily using AmeriCorps members. The tutoring strategies are patterned on those of Reading Recovery, an i3 scale-up program used nationally. The SPARK model also includes a family engagement component to increase parents’ skills in supporting their children’s education.

It was wonderful to learn about each of the five programs, but to me, it was particularly exciting to reflect on the larger message: Investing in Innovation is doing what it was intended to do. Every one of the programs at the briefing had been evaluated by third-party evaluators and found to be effective in rigorous experiments. The programs were highly diverse in approaches and intentions, but each adds significantly to our armamentarium of effective models, ready to serve large numbers of students throughout the U.S. Add the proven i3 scale-up programs, the many more i3 programs sure to produce similar impacts, the many already proven programs funded by other sources, and the many sure to be proven effective in the next few years, and you can see the potential for dramatic change in how we collectively improve educational outcomes.

The i3 process has had a huge impact on our understanding of educational change. Programs that have shown positive impacts provide solutions that are ready to go. Those that did not will surely be the majority; in all areas of evidence-based research, most experiments do not show positive effects. However, a great deal is being learned from those studies as well, and the entire field is moving forward at a pace once thought unimaginable. The recent Hill briefing is the second of what I am sure will be many celebrations of proven programs for an ever-expanding set of subjects, grade levels, and situations.

As we gain confidence in each of these approaches, I hope we can learn to rely on our people, our ingenuity, and our science to improve our schools. Like the Spartans of long ago, let’s recognize that these are the essential assets for our security and our progress.