The Fabulous 20%: Programs Proven Effective in Rigorous Research

Photo courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action

Over the past 15 years, governments in the U.S. and U.K. have put quite a lot of money (by education standards) into rigorous research on promising programs in PK-12 instruction. Rigorous research usually means studies in which schools, teachers, or students are assigned at random to experimental or control conditions and then pre- and posttested on valid measures independent of the developers. In the U.S., the Institute of Education Sciences (IES) and Investing in Innovation (i3), now called Education Innovation Research (EIR), have led this strategy, and in the U.K., it’s the Education Endowment Foundation (EEF). Enough research has now been done to enable us to begin to see important patterns in the findings.

One finding that is causing some distress is that the number of studies showing significant positive effects is modest. Across all funding programs, the proportion of studies reporting positive, significant findings averages around 20%. It is important to note that most funded projects evaluate programs that have been newly developed and not previously evaluated. The “early phase” or “development” category of i3/EIR is a good example; it provides small grants intended to fund creation or refinement of new programs, so it is not so surprising that these studies are less likely to find positive outcomes. However, even programs that have been successfully evaluated in the past often do not replicate their positive findings in the large, rigorous evaluations required at the higher levels of i3/EIR and IES, and in all full-scale EEF studies. The problem is that positive outcomes may have been found in smaller studies in which hard-to-replicate levels of training or monitoring by program developers may have been possible, or in which measures made by developers or researchers were used, or where other study features made it easier to find positive outcomes.

The modest percentage of positive findings has caused some observers to question the value of all these rigorous studies. They wonder if this is a worthwhile investment of tax dollars.

One answer to this concern is to point out that while the percentage of all studies finding positive outcomes is modest, so many have been funded that the number of proven programs is growing rapidly. On our Evidence for ESSA website, we have found 111 programs that meet ESSA’s Strong, Moderate, or Promising standards in elementary and secondary reading or math. That’s a lot of proven programs, especially in elementary reading, where there are 62.

The situation is a bit like that in medicine. A very small percentage of rigorous studies of medicines or other treatments show positive effects. Yet so many are done that each year, new proven treatments for all sorts of diseases enter widespread use in medical practice. This dynamic is one explanation for the steady increases in life expectancy taking place throughout the world.

Further, high quality studies that fail to find positive outcomes also contribute to the science and practice of education. Some programs do not meet standards for statistical significance, but nevertheless they show promise overall or with particular subgroups. Programs that do not find clear positive outcomes but closely resemble other programs that do are another category worth further attention. Funders can take this into account in deciding whether to fund another study of programs that “just missed.”

On the other hand, there are programs that show profoundly zero impact, in categories that never or almost never find positive outcomes. I reported recently on benchmark assessments, which had an overall effect size of -0.01 across 10 studies. This might be a good candidate for giving up, unless someone has a markedly different approach unlike those that have failed so often. Another unpromising category is textbooks. Textbooks may be necessary, but the idea that replacing one textbook with another will improve achievement has failed many, many times. This set of negative results can be helpful to schools, enabling them to focus their resources on programs that do work. Giving up on categories of studies that hardly ever work would also significantly reduce the 80% failure rate, and save money better spent on evaluating more promising approaches.

The findings of many studies of replicable programs can also reveal patterns that should help current or future developers create programs that meet modern standards of evidence. There are a few patterns I’ve seen across many programs and studies:

  1. I think developers (and funders) vastly underestimate the amount and quality of professional development needed to bring about significant change in teacher behaviors and student outcomes. Strong professional development requires top-quality initial training, including simulations and/or videos to show teachers how a program works, not just tell them. Effective PD almost always includes coaching visits to classrooms to give teachers feedback and new ideas. If teachers fall back into their usual routines due to insufficient training and follow-up coaching, why would anyone expect their students’ learning to improve in comparison to the outcomes they’ve always gotten? Adequate professional development can be expensive, but this cost is highly worthwhile if it improves outcomes.
  2. In successful programs, professional development focuses on classroom practices, not solely on improving teachers’ knowledge of curriculum or curriculum-specific pedagogy. Teachers standing at the front of the class using the same forms of teaching they’ve always used but doing it with more up-to-date or better-aligned content are not likely to significantly improve student learning. In contrast, professional development focused on tutoring, cooperative learning, and classroom management has a far better track record.
  3. Programs that focus on motivation and relationships between teachers and students and among students are more likely to enhance achievement than programs that focus on cognitive growth alone. Successful teaching focuses on students’ hearts and spirits, not just their minds.
  4. You can’t beat tutoring. Few approaches other than one-to-one or one-to-small group tutoring have consistent powerful impacts. There is much to learn about how to make tutoring maximally effective and cost-effective, but let’s start with the most effective and cost-effective tutoring models we have now and build out from there.
  5. Many, perhaps most, failed program evaluations involve approaches with great potential (or great success) in commercial applications. This is one reason that so many evaluations fail; they assess textbooks or benchmark assessments or ordinary computer assisted instruction approaches. These often involve little professional development or follow-up, and they may not make important changes in what teachers do. Real progress in evidence-based reform will begin when publishers and software developers come to believe that only proven programs will succeed in the marketplace. When that happens, vast non-governmental resources will be devoted to development, evaluation, and dissemination of well-implemented forms of proven programs. Medicine was once dominated by the equivalent of Dr. Good’s Universal Elixir (mostly good-tasting alcohol and sugar). Very cheap, widely marketed, and popular, but utterly useless. However, as government began to demand evidence for medical claims, Dr. Good gave way to Dr. Proven.

Because of long-established policies and practices that have transformed medicine, agriculture, technology, and other fields, we know exactly what has to be done. IES, i3/EIR, and EEF are doing it, and showing great progress. This is not the time to get cold feet over the 80% failure rate. Instead, it is time to celebrate the fabulous 20% – programs that have succeeded in rigorous evaluations. Then we need to increase investments in evaluations of the most promising approaches.



This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Tutoring Works. But Let’s Learn How It Can Work Better and Cheaper

I was once at a meeting of the British Educational Research Association, where I had been invited to participate in a debate about evidence-based reform. We were having what journalists often call “a frank exchange of views” in a room packed to the rafters.

At one point in the proceedings, a woman stood up and, in a furious tone of voice, informed all and sundry that (I’m paraphrasing here) “we don’t need to talk about all this (very bad word). Every child should just get Reading Recovery.” She then stomped out.

I don’t know how widely her view was supported in the room or anywhere else in Britain or elsewhere, but what struck me at the time, and what strikes me even more today, is the degree to which Reading Recovery has long defined, and in many ways limited, discussions about tutoring. Personally, I have nothing against Reading Recovery, and I have always admired the commitment Reading Recovery advocates have had to professional development and to research. I’ve also long known that the evidence for Reading Recovery is very impressive, but you’d be amazed if one-to-one tutoring by well-trained teachers did not produce positive outcomes. On the other hand, Reading Recovery insists on one-to-one instruction by certified teachers with a lot of cost for all that admirable professional development, so it is very expensive. A British study estimated the cost per child at $5,400 (in 2018 dollars). There are roughly one million Year 1 students in the U.K., so if the angry woman had her way, they’d have to come up with the equivalent of $5.4 billion a year. In the U.S., it would be more like $27 billion a year. I’m not one to shy away from very expensive proposals if they also provide extremely effective services and there are no equally effective alternatives. But shouldn’t we be exploring alternatives?
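The arithmetic behind these totals is easy to check. Here is a minimal sketch in Python, using the per-child cost and the Year 1 cohort size quoted above. The U.S. cohort size is not stated in the text; it is back-calculated here from the $27 billion figure, so treat it as an illustrative assumption rather than an official enrollment count.

```python
COST_PER_CHILD = 5_400          # dollars per child, from the British study (2018 dollars)
UK_YEAR1_STUDENTS = 1_000_000   # "roughly one million", as stated in the text


def annual_cost(students: int, cost_per_child: int = COST_PER_CHILD) -> int:
    """Total yearly cost of tutoring every student in one grade cohort."""
    return students * cost_per_child


uk_total = annual_cost(UK_YEAR1_STUDENTS)

# Assumption: the U.S. cohort size is inferred from the $27 billion estimate,
# because the text gives the total but not the number of students.
US_TOTAL_ESTIMATE = 27_000_000_000
implied_us_students = US_TOTAL_ESTIMATE // COST_PER_CHILD

print(f"U.K.: ${uk_total / 1e9:.1f} billion per year")
print(f"U.S. cohort implied by $27 billion: {implied_us_students:,} students")
```

Changing `COST_PER_CHILD` shows how quickly the national bill falls if a cheaper tutoring model turns out to be comparably effective.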

If you’ve been following my blogs on tutoring, you’ll be aware that, at least at the level of research, the Reading Recovery monopoly on tutoring has been broken in many ways. Reading Recovery has always insisted on certified teachers, but many studies have now shown that well-trained teaching assistants can do just as well, in mathematics as well as reading. Reading Recovery has insisted that tutoring should just be for first graders, but numerous studies have now shown positive outcomes of tutoring through seventh grade, in both reading and mathematics. Reading Recovery has argued that its cost was justified by the long-lasting impacts of first-grade tutoring, but its own research has not documented long-lasting outcomes. Reading Recovery is always one-to-one, of course, but now there are numerous one-to-small group programs, including a one-to-three adaptation of Reading Recovery itself, that produce very good effects. Reading Recovery has always just been for reading, but there are now more than a dozen studies showing positive effects of tutoring in math, too.


All of this newer evidence opens up new possibilities for tutoring that were unthinkable when Reading Recovery ruled the tutoring roost alone. If tutoring can be effective using teaching assistants and small groups, then it is becoming a practicable solution to a much broader range of learning problems. It also opens up a need for further research and development specific to the affordances and problems of tutoring. For example, tutoring can be done a lot less expensively than $5,400 per child, but it is still expensive. We created and evaluated a one-to-six, computer-assisted tutoring model that produced effect sizes of around +0.40 for $500 per child. Yet I just got a study from the Education Endowment Foundation (EEF) in England evaluating one-to-three math tutoring by college students and recent graduates. They only provided tutoring one hour per week for 12 weeks, to sixth graders. The effect size was much smaller (ES=+0.19), but the cost was only about $150 per child.
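One rough way to compare these two models is effect size per dollar. The sketch below, in Python, uses only the effect sizes and costs quoted above; the class name, the labels, and the "SD per $100" index are my own shorthand, not a measure reported in either study. It also assumes effects scale in proportion to spending, which real programs may not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TutoringModel:
    """Hypothetical container for one tutoring model's headline numbers."""
    name: str
    effect_size: float      # in standard deviation units
    cost_per_child: float   # in dollars

    def effect_per_100_dollars(self) -> float:
        """Crude cost-effectiveness index: SD units gained per $100 spent."""
        return self.effect_size / self.cost_per_child * 100


models = [
    TutoringModel("one-to-six, computer-assisted", 0.40, 500),
    TutoringModel("one-to-three, college students and graduates", 0.19, 150),
]

for m in models:
    print(f"{m.name}: {m.effect_per_100_dollars():.3f} SD per $100")
```

On this index the cheaper model comes out ahead per dollar even though its total effect is less than half as large, which is exactly the kind of trade-off a school would have to weigh.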

I am not advocating this particular solution, but isn’t it interesting? The EEF also evaluated another means of making tutoring inexpensive, using online tutors from India and Sri Lanka, and another, using cross-age peer tutors, both in math. Both failed miserably, but isn’t that interesting?

I can imagine a broad range of approaches to tutoring, designed to enhance outcomes, minimize costs, or both. Out of that research might come a diversity of approaches that might be used for different purposes. For example, students in deep trouble, headed for special education, surely need something different from what is needed by students with less serious problems. But what exactly is it that is needed in each situation?

In educational research, reliable positive effects of any intervention are rare enough that we’re usually happy to celebrate anything that works. We might say, “Great, tutoring works! But we knew that.”  However, if tutoring is to become a key part of every school’s strategies to prevent or remediate learning problems, then knowing that “tutoring works” is not enough. What kind of tutoring works for what purposes?  Can we use technology to make tutors more effective? How effective could tutoring be if it is given all year or for multiple years? Alternatively, how effective could we make small amounts of tutoring? What is the optimal group size for small group tutoring?

We’ll never satisfy the angry woman who stormed out of my long-ago symposium at BERA. But for those who can have an open mind about the possibilities, building on the most reliable intervention we have for struggling learners and creating and evaluating effective and cost-effective tutoring approaches seems like a worthwhile endeavor.



Rethinking Technology in Education

Antoine de Saint-Exupéry, in his 1931 classic Night Flight, had a wonderful line about early airmail service in Patagonia, South America:

“When you are crossing the Andes and your engine falls out, well, there’s nothing to do but throw in your hand.”


I had reason to think about this quote recently, as I was attending a conference in Santiago, Chile, the presumed destination of the doomed pilot. The conference focused on evidence-based reform in education.

Three of the papers described large scale, randomized evaluations of technology applications in Latin America, funded by the Inter-American Development Bank (IDB). Two of them documented disappointing outcomes of large-scale, traditional uses of technology. One described a totally different application.

One of the studies, reported by Santiago Cueto (Cristia et al., 2017), randomly assigned 318 high-poverty, mostly rural primary schools in Peru to receive sturdy, low-cost, practical computers, or to serve as a control group. Teachers were given great latitude in how to use the computers, but limited professional development in how to use them as pedagogical resources. Worse, the computers had software with limited alignment to the curriculum, and teachers were expected to overcome this limitation. Few did. Outcomes were essentially zero in reading and math.

In another study (Berlinski & Busso, 2017), the IDB funded a very well-designed study in 85 schools in Costa Rica. Schools were randomly assigned to receive one of five approaches. All used the same content on the same schedule to teach geometry to seventh graders. One group used traditional lectures and questions with no technology. The others used active learning, active learning plus interactive whiteboards, active learning plus a computer lab, or active learning plus one computer per student. “Active learning” emphasized discussions, projects, and practical exercises.

On a paper-and-pencil test covering the content studied by all classes, all four of the experimental groups scored significantly worse than the control group. The lowest performance was seen in the computer lab condition, and, worst of all, the one computer per child condition.

The third study, in Chile (Araya, Arias, Bottan, & Cristia, 2018), was funded by the IDB and the International Development Research Center of the Canadian government. It involved a much more innovative and unusual application of technology. Fourth grade classes within 24 schools were randomly assigned to experimental or control conditions. In the experimental group, classes in similar schools were assigned to serve as competitors to each other. Within the math classes, students studied with each other and individually for a bi-monthly “tournament,” in which students in each class were individually given questions to answer on the computers. Students were taught cheers and brought to fever pitch in their preparations. The participating classes were compared to the control classes, which studied the same content using ordinary methods. All classes, experimental and control, were studying the national curriculum on the same schedule, and all used computers, so all that differed was the tournaments and the cooperative studying to prepare for the tournaments.

The outcomes were frankly astonishing. The students in the experimental classes scored much higher on national tests than controls, with an effect size of +0.30.

The differences in the outcomes of these three approaches are clear. What might explain them, and what do they tell us about applications of technology in Latin America and anywhere?

In Peru, the computers were distributed as planned and generally functioned, but teachers received little professional development. In fact, teachers were not given specific strategies for using the computers, but were expected to come up with their own uses for them.

The Costa Rica study did provide computer users with specific approaches to math and gave teachers much associated professional development. Yet the computers may have been seen as replacements for teachers, and the computers may just not have been as effective as teachers. Alternatively, despite extensive PD, all four of the experimental approaches were very new to the teachers and may not have been well implemented.

In contrast, in the Chilean study, tournaments and cooperative study were greatly facilitated by the computers, but the computers were not central to program effectiveness. The theory of action emphasized enhanced motivation to engage in cooperative study of math. The computers were only a tool to achieve this goal. The tournament strategy resembles a method from the 1970s called Teams-Games-Tournaments (TGT) (DeVries & Slavin, 1978). TGT was very effective, but was complicated for teachers to use, which is why it was not widely adopted. In Chile, computers helped solve the problems of complexity.

It is important to note that in the United States, technology solutions are also not producing major gains in student achievement. Reviews of research on elementary reading (ES=+0.05; Inns et al., 2018) and secondary reading (ES=-0.01; Baye et al., in press) have reported near-zero effects of technology-assisted approaches. Outcomes in elementary math are only somewhat better, averaging an effect size of +0.09 (Pellegrini et al., 2018).

The findings of these rigorous studies of technology in the U.S. and Latin America lead to a conclusion that there is nothing magic about technology. Applications of technology can work if the underlying approach is sound. Perhaps it is best to consider which non-technology approaches are proven or likely to increase learning, and only then imagine how technology could make effective methods easier, less expensive, more motivating, or more instructionally effective. As an analogy, great audio technology can make a concert more pleasant or audible, but the whole experience still depends on great composition and great performances. Perhaps technology in education should be thought of in a similar enabling way, rather than as the core of innovation.

Saint-Exupéry’s Patagonian pilots crossing the Andes had no “Plan B” if their engines fell out. We do have many alternative ways to put technology to work or to use other methods, if the computer-assisted instruction strategies that have dominated technology since the 1970s keep showing such small or zero effects. The Chilean study and certain exceptions to the overall pattern of research findings in the U.S. suggest appealing “Plans B.”

The technology “engine” is not quite falling out of the education “airplane.” We need not throw in our hand. Instead, it is clear that we need to re-engineer both, to ask not what is the best way to use technology, but what is the best way to engage, excite, and instruct students, and then ask how technology can contribute.

Photo credit: Distributed by Agence France-Presse (NY Times online) [Public domain], via Wikimedia Commons



Araya, R., Arias, E., Bottan, N., & Cristia, J. (2018, August 23). Conecta Ideas: Matemáticas con motivación social. Paper presented at the conference “Educate with Evidence,” Santiago, Chile.

Baye, A., Lake, C., Inns, A., & Slavin, R. (in press). Effective reading programs for secondary students. Reading Research Quarterly.

Berlinski, S., & Busso, M. (2017). Challenges in educational reform: An experiment on active learning in mathematics. Economics Letters, 156, 172-175.

Cristia, J., Ibarraran, P., Cueto, S., Santiago, A., & Severín, E. (2017). Technology and child development: Evidence from the One Laptop per Child program. American Economic Journal: Applied Economics, 9(3), 295-320.

DeVries, D. L., & Slavin, R. E. (1978). Teams-Games-Tournament: Review of ten classroom experiments. Journal of Research and Development in Education, 12, 28-38.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018, March 3). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018, March 3). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Programs and Practices

One issue I hear about all the time when I speak about evidence-based reform in education relates to the question of programs vs. practices. A program is a specific set of procedures, usually with materials, software, professional development, and other elements, designed to achieve one or more important outcomes, such as improving reading, math, or science achievement. Programs are typically created by non-profit organizations, though they may be disseminated by for-profits. Almost everything in the What Works Clearinghouse (WWC) and Evidence for ESSA is a program.

A practice, on the other hand, is a general principle that a teacher can use. It may not require any particular professional development or materials.  Examples of practices include suggestions to use more feedback, more praise, a faster pace of instruction, more higher-order questions, or more technology.

In general, educators, and especially teachers, love practices, but are not so crazy about programs. Programs have structure, requiring adherence to particular activities and use of particular materials. In contrast, every teacher can use practices as they wish. Educational leaders often say, “We don’t do programs.” What they mean is, “We give our teachers generic professional development and then turn them loose to interpret it.”

One problem with practices is that because they leave the details up to each teacher, teachers are likely to interpret them in a way that conforms to what they are already doing, and then no change happens. As an example of this, I once attended a speech by the late, great Madeline Hunter, extremely popular in the 1970s and ‘80s. She spoke and wrote clearly and excitingly in a very down-to-earth way. The auditorium she spoke to was stuffed to the rafters with teachers, who hung on her every word.

When her speech was over, I was swept out in a throng of happy teachers. They were all saying to each other, “Madeline Hunter supports absolutely everything I’ve ever believed about teaching!”

I love happy teachers, but I was puzzled by their reaction. If all the teachers were already doing the things Madeline Hunter recommended to the best of their ability, then how did her ideas improve their teaching? In actuality, a few studies of Hunter’s principles found no significant effects on student learning, and even more surprising, they found few differences between the teaching behaviors of teachers trained in Hunter’s methods and those who had not been. Essentially, one might argue, Madeline Hunter’s principles were popular precisely because they did not require teachers to change very much, and if teachers do not change their teaching, why would we expect their students’ learning to change?


Another reason that practices rarely change learning is that they are usually small improvements that teachers are expected to assemble to improve their teaching. However, asking teachers to put together many pieces into major improvements is a bit like giving someone the pieces and parts of a lawnmower and asking them to put them together (see picture above). Some mechanically-minded people could do it, but why bother? Why not start with a whole lawnmower?

In the same way, there are gifted teachers who can assemble principles of effective practice into great instruction, but why make it so difficult? Great teachers who could assemble isolated principles into effective teaching strategies are also sure to be able to take a proven program and implement it very well. Why not start with something known to work and then improve it with effective implementation, rather than starting from scratch?

A further problem with practices is that most are impossible to evaluate. By definition, everyone has their own interpretation of every practice. If practices become specific, with specific guides, supports, and materials, they become programs. So a practice is a practice exactly because it is too poorly specified to be a program. And practices that are difficult to clearly specify are also unlikely to improve student outcomes.

There are exceptions, where practices can be evaluated. For example, eliminating ability grouping or reducing class size or assigning (or not assigning) homework are practices that can be evaluated, and can be specified. But these are exceptions.

The squishiness of most practices is the reason that they rarely appear in the WWC or Evidence for ESSA. A proper evaluation contrasts one treatment (an experimental group) to a control group continuing current practices. The treatment group almost has to be a program, because otherwise it is impossible to tell what is being evaluated. For example, how can an experiment evaluate “feedback” if teachers make up their own definitions of “feedback”? How about higher-order questions? How about praise? Rapid pace? Use of these practices can be measured using observation, but differences between the treatment and control groups may be hard to detect because in each case teachers in the control group may also be using the same practices. What teacher does not provide feedback? What teacher does not praise children? What teacher does not use higher-order questions? Some may use these practices more than others, but the differences are likely to be subtle. And subtle differences rarely produce important outcomes.

The distinction between programs and practices has a lot to do with the practices (not programs) promoted by John Hattie. He wants to identify practices that can help teachers know about what works in instruction. That’s a noble goal, but it can rarely be done using real classroom research done over real periods of time. In order to isolate particular practices for study, researchers often do very brief, artificial lab studies that have nothing to do with classroom practices.  For example, some lab studies in Hattie’s own review of feedback contrast teachers giving feedback to teachers giving no feedback. What teacher would do that?

It is worthwhile to use what we know from research, experience, program evaluations, and theory to discuss what practices may be most useful for teachers. But claiming particular effect sizes for such studies is rarely justified. The strongest evidence for practical use in schools will almost always come from experiments evaluating programs. Practices have their place, but focusing on exposing teachers to a lot of practices and expecting them to put them together to improve student outcomes is not likely to work.


What’s the Evidence that Evidence Works?

I recently gave a couple of speeches on evidence-based reform in education in Barcelona.  In preparing for them, one of the organizers asked me an interesting question: “What is your evidence that evidence works?”

At one level, this is a trivial question. If schools select proven programs and practices aligned with their needs and implement them with fidelity and intelligence, with levels of resources similar to those used in the original successful research, then of course they’ll work, right? And if a school district adopts proven programs, encourages and funds them, and monitors their implementation and outcomes, then of course the appropriate use of all these programs is sure to enhance achievement district-wide, right?

Although logic suggests that a policy of encouraging and funding proven programs is sure to increase achievement on a broad scale, I like to be held to a higher standard: Evidence. And, as it happens, I have some evidence on this very topic. This evidence came from a large-scale evaluation of an ambitious, national effort to increase use of proven and promising schoolwide programs in elementary and middle schools, in a research center funded by the Institute of Education Sciences (IES) called the Center for Data-Driven Reform in Education, or CDDRE (see Slavin, Cheung, Holmes, Madden, & Chamberlain, 2013). The name of the program the experimental schools used was Raising the Bar.

How Raising the Bar Raised the Bar

The idea behind Raising the Bar was to help schools analyze their own needs and strengths, and then select whole-school reform models likely to help them meet their achievement goals. CDDRE consultants provided about 30 days of on-site professional development to each district over a 2-year period. The PD focused on review of data, effective use of benchmark assessments, school walk-throughs by district leaders to see the degree to which schools were already using the programs they claimed to be using, and then exposing district and school leaders to information and data on schoolwide programs available to them, from several providers. If districts selected a program to implement, their district and school received PD on ensuring effective implementation and principals and teachers received PD on the programs they chose.


Evaluating Raising the Bar

In the study of Raising the Bar we recruited a total of 397 elementary and 225 middle schools in 59 districts in 7 states (AL, AZ, IN, MS, OH, TN). All schools were Title I schools in rural and mid-sized urban districts. Overall, 30% of students were African-American, 20% were Hispanic, and 47% were White. Across three cohorts, starting in 2005, 2006, or 2007, schools were randomly assigned to either use Raising the Bar, or to continue with what they were doing. The study ended in 2009, so schools could have been in the Raising the Bar group for two, three, or four years.

Did We Raise the Bar?

State test scores were obtained from all schools and transformed to z-scores so they could be combined across states. The analyses focused on grades 5 and 8, as these were the only grades tested in some states at the time. Hierarchical linear modeling, with schools nested within districts, was used for analysis.
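For readers unfamiliar with z-scores, here is a minimal sketch of the general idea of putting scores from differently scaled state tests on a common footing. It is an illustration only, not the study's actual procedure: the state labels, the scores, and the function name to_z_scores are all hypothetical, and the sketch standardizes within each state's own sample, which is an assumption.

```python
from statistics import mean, pstdev


def to_z_scores(scores):
    """Rescale one state's scores to mean 0 and standard deviation 1."""
    m = mean(scores)
    sd = pstdev(scores)  # population SD of this state's sample (an assumption)
    if sd == 0:
        raise ValueError("scores have no variation; z-scores are undefined")
    return [(s - m) / sd for s in scores]


# Hypothetical school-level scores on two state tests with different scales.
state_scores = {
    "State A": [310.0, 325.0, 340.0],
    "State B": [48.0, 52.0, 56.0],
}

# Standardize within each state, then pool the results for a combined analysis.
z_by_state = {state: to_z_scores(vals) for state, vals in state_scores.items()}
pooled = [z for zs in z_by_state.values() for z in zs]
```

Once each state's scores are expressed in standard deviation units, a school that is one standard deviation above its state's mean looks the same in the pooled data no matter which test it took.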

For reading in fifth grade, outcomes were very good. Individual-level effect sizes were significant from Year 3 onward: +0.10 in Year 3 and +0.19 in Year 4. In middle school reading, the effect size reached +0.10 by Year 4.

Effects were also very good in fifth grade math, with significant effects of +0.10 in Year 3 and +0.13 in Year 4. Effect sizes in middle school math were also significant in Year 4 (ES=+0.12).

Note that these effects are for all schools, whether they adopted a program or not. Non-experimental analyses found that by Year 4, elementary schools that had chosen and implemented a reading program (33% of schools by Year 3, 42% by Year 4) scored better than matched controls in reading. Schools that chose any reading program usually chose our Success for All reading program, but some chose other models. Even in schools that did not adopt reading or math programs, scores were always higher, on average (though not always significantly higher), than for schools that did not choose programs.

How Much Did We Raise the Bar?

The CDDRE project was exceptional because of its size and scope. The 622 schools, in 59 districts in 7 states, were collectively equivalent to a medium-sized state. So if anyone asks what evidence-based reform could do to help an entire state, this study provides one estimate. The student-level outcome in elementary reading, an effect size of +0.19, applied to NAEP scores, would be enough to move 43 states to the scores now only attained by the top 10. If applied successfully to schools serving mostly African American and Hispanic students or to students receiving free- or reduced-price lunches regardless of ethnicity, it would reduce the achievement gap between these and White or middle-class students by about 38%. All in four years, at very modest cost.
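To make the gap arithmetic concrete, here is a rough sketch of how an effect size translates into a share of an achievement gap. The gap of 0.50 standard deviations is an assumption chosen for illustration; it is not stated in this post, and actual gaps vary by subject, grade, and group. The function name gap_reduction is hypothetical.

```python
def gap_reduction(effect_size, gap_in_sd):
    """Return the fraction of a gap closed by a gain of effect_size (both in SD units)."""
    if gap_in_sd <= 0:
        raise ValueError("gap_in_sd must be positive")
    return effect_size / gap_in_sd


ASSUMED_GAP_SD = 0.50  # assumption for illustration only; not a figure from the post
share = gap_reduction(0.19, ASSUMED_GAP_SD)
print(f"{share:.0%}")  # prints "38%"
```

Under that assumed gap, a gain of +0.19 would close a bit more than a third of it. A larger or smaller gap would change the percentage proportionally.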

Actually, implementing something like Raising the Bar could be done much more easily and effectively today than it could in 2005-2009. First, there are a lot more proven programs to choose from than there were then. Second, the U.S. Congress, in the Every Student Succeeds Act (ESSA), now has definitions of strong, moderate, and promising levels of evidence, and restricts school improvement grants to schools that choose such programs. The reason only 42% of Raising the Bar schools selected a program is that they had to pay for it, and many could not afford to do so. Today, there are resources to help with this.

The evidence is both logical and clear: Evidence works.


Slavin, R. E., Cheung, A., Holmes, G., Madden, N. A., & Chamberlain, A. (2013). Effects of a data-driven district reform model on state assessment outcomes. American Educational Research Journal, 50(2), 371-396.

Photo by Sebastian Mary/Gio JL [CC BY-SA 2.0], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

First There Must be Love. Then There Must be Technique.

I recently went to Barcelona. This was my third time in this wonderful city, and for the third time I visited La Sagrada Familia, Antoni Gaudi’s breathtaking church. It was begun in the 1880s, and Gaudi worked on it from the time he was 31 until he died in 1926 at 73. It is due to be completed in 2026.

Every time I go, La Sagrada Familia has grown even more astonishing. In the nave, massive columns branching into tree shapes hold up the spectacular roof. The architecture is extremely creative, and wonders lie around every corner.


I visited a new museum under the church. At the entrance, it had a Gaudi quote:

First there must be love.

Then there must be technique.

This quote sums up La Sagrada Familia. Gaudi used complex mathematics to plan his constructions. He was a master of technique. But he knew that it all meant nothing without love.

In writing about educational research, I try to remind my readers of this from time to time. There is much technique to master in creating educational programs, evaluating them, and fairly summarizing their effects. There is even more technique in implementing proven programs in schools and classrooms, and in creating policies to support use of proven programs. But what Gaudi reminds us of is just as essential in our field as it was in his. We must care about technique because we care about children. Caring about technique just for its own sake is of little value. Too many children in our schools are failing to learn adequately. We cannot say, “That’s not my problem, I’m a statistician,” or “that’s not my problem, I’m a policymaker,” or “that’s not my problem, I’m an economist.” If we love children and we know that our research can help them, then it’s a problem for all of us. All of us go into education to solve real problems in real classrooms. That’s the structure we are all building together over many years. Building this structure takes technique, and the skilled efforts of many researchers, developers, statisticians, superintendents, principals, and teachers.

Each of us brings his or her own skills and efforts to this task. None of us will live to see our structure completed, because education keeps growing in techniques and capability. But as Gaudi reminds us, it’s useful to stop from time to time and remember why we do what we do, and for whom.

Photo credit: By Txllxt TxllxT [CC BY-SA 4.0], from Wikimedia Commons

Fads and Evidence in Education

York, England, has a famous racecourse. When I lived there I never saw a horse race, but I did see women in town for the race all dressed up and wearing very strange contraptions in their hair, called fascinators. The picture below shows a couple of examples. They could be twisted pieces of metal or wire or feathers or just about anything as long as they were . . . well, fascinating. The women paraded down Micklegate, York’s main street, showing off their fancy clothes and especially, I’d guess, their fascinators.


The reason I bring up fascinators is to contrast the world of fashion and the world of science. In fashion, change happens constantly, but it is usually change for the sake of change. Fascinators, I’d assume, derived from hats, which women have been wearing to fancy horse races as long as there have been fancy horse races. Hats themselves change all the time. I’m guessing that what’s fascinating about a fascinator is that it maintains the concept of a racing-day hat in the most minimalist way possible, almost mocking the hat tradition while at the same time honoring it. The point is, fascinators get thinner because hats used to be giant, floral contraptions. In art, there was realism and then there were all sorts of non-realism. In music there was Frank Sinatra and then Elvis and then the Beatles and then disco. Eventually there was hip hop. Change happens, but it’s all about taste. People get tired of what once was popular, so something new comes along.

Science-based fields have a totally different pattern of change. In medicine, engineering, agriculture, and other fields, evidence guides changes. These fields are not 100% fad-free, but ultimately, on big issues, evidence wins out. In these fields, there is plenty of high-quality evidence, and there are very serious consequences for making or not making evidence-based policies and practices. If someone develops an artificial heart valve that is 2% more effective than the existing valves, with no more side effects, surgeons will move toward that valve to save lives (and avoid lawsuits).

In education, which model do we follow? Very, very slowly we are beginning to consider evidence. But most often, our model of change is more like the fascinators. New trends in education take the schools by storm, and often a few years later, the opposite policy or practice will become popular. Over long periods, very similar policies and practices keep appearing, disappearing, and reappearing, perhaps under a different name.

It’s not that we don’t have evidence. We do, and more keeps coming every year. Yet our profession, by and large, prefers to rush from one enthusiasm to another, without the slightest interest in evidence.

Here’s an exercise you might enjoy. List the top ten things schools and districts are emphasizing right now. Put your list into a “time capsule” envelope and file it somewhere. Then take it out in five years, and then ten years. Will those same things be the emphasis in schools and districts then? To really punish yourself, write the NAEP reading and math scores overall and by ethnic groups at fourth and eighth grade. Will those scores be a lot better in five or ten years? Will gaps be diminishing? Not if current trends continue and if we continue to give only lip service to evidence.

Change + no evidence = fashion

Change + evidence = systematic improvement

We can make a different choice. But it will take real leadership. Until that leadership appears, we’ll be doing what we’ve always done, and the results will not change.

Isn’t that fascinating?

Photo credit: Both photos by Chris Phutully [CC BY 2.0], via Wikimedia Commons