Scaling Up: Penicillin and Education

In 1928, the Scottish scientist Alexander Fleming discovered penicillin. As the story goes, he discovered it by accident, when he left a petri dish containing bacteria on his desk overnight and the next morning found that it had been contaminated by a mold that had killed the bacteria around it. Fleming isolated the mold and recognized that if it could kill bacteria in a dish, it might be useful in curing many diseases.

Early on it was clear that penicillin had extraordinary possibilities. In World War I, more soldiers and civilians had been killed by bacterial diseases than by bullets. What if these diseases could be cured? Early tests showed very promising effects.

Yet there was a big problem. No one knew how to produce penicillin in quantity. Very small experiments established that penicillin had potential for curing bacterial infections and was not toxic. However, the total world supply at the onset of World War II was about enough for a single adult. The impending need for penicillin was obvious, but it still was not ready for prime time.

American and British scientists eventually began to work together to find a way to scale up production of penicillin. Finally, the Merck Company developed a mass production method and was making billions of units by D-Day.

The key dynamic of the penicillin story has much in common with an essential problem of education reform. The Merck work did not change the structure of penicillin itself, but Merck scientists did a great deal of science and experimentation to find strains that were stable and replicable. In education reform, it is equally the case that developing and initially evaluating a given program may be a very different process from evaluating it at large scale and scaling it up once proven.

In some cases, different organizations may be necessary to do large-scale evaluation and implementation, as was the case with Merck and Fleming; in other cases the same organization may carry through the development, initial evaluation, large-scale evaluation, and dissemination. Whoever is responsible for the various steps, the requirements are similar.

At small scale, innovators are likely to work in schools nearby, where they can visit frequently, see what is going on, hear teachers’ perspectives, and change strategies mid-course in response to what they learn. At small scale, programs might vary a great deal from class to class or school to school. Homemade measures, opinions, observations, and other informal indicators may be all developers need or want. From a penicillin perspective, this is still the Fleming level.

When a program moves to the next level, it may be working in many schools or distant locations, and the approach must change substantially. This is the Merck stage of development in penicillin terms. Developers must have a very clear idea of what the program is, and then provide student materials, software, professional development, and coaching directed toward helping teachers to enact the program effectively. Rather than adapting a great deal to the desires or ideas of every school or teacher, developers can ask principals and teachers to vote on participation, with an understanding that if they decide to participate, they commit to following the program more or less as designed, with reasonable variations in light of unique characteristics of the school (e.g., urban/rural location, presence of English learners, or substantial poverty). Professional development and coaching need to be standardized, with room for appropriate adaptations. Organizations that provide large-scale services need to learn how to manage functions such as finance, human resources, and IT.

As programs grow, they should seek funding for large-scale, randomized evaluations, ideally by third-party evaluators.

In order to get to the Merck level in education reform, we must be ready to build robust, flexible, self-sustaining organizations, capable of ensuring positive impacts of educational programs on a broad scale. Funding from government and private foundations is needed along the way, but the organizations ultimately must be able to operate mostly or entirely on revenues from schools, especially Title I or other funds likely to be available in many or most schools.

Over the years, penicillin has saved millions of lives, due to the pioneering work of Fleming and the pragmatic work of Merck. In the same way, we can greatly enhance the learning of millions of children, combining innovative design and planful, practical scale-up.


Teachers as Professionals in Evidence-Based Reform


In a February 2012 op-ed in Education Week, Don Peurach wrote about a 14-year investigation he carried out as part of a large University of Michigan study of comprehensive school reform. In the overall study, our Success for All program and the America’s Choice program did very well in terms of both implementation and outcomes, while an approach in which teachers largely made up their own instructional approaches did not bring about much change in teachers’ behaviors or student learning. Because both Success for All and America’s Choice have well-specified training, teachers’ manuals, and student materials, the findings support the idea that it is important for school-wide reform models to have a well-structured approach.

Peurach’s focus was on Success for All as an organization. He wanted to know how our network of hundreds of schools in 40 states contributes to the development of the approach and to each other’s success. His key finding was that Success for All is not a top-down approach, but is constantly learning from its teachers and principals and then spreading good practices throughout the network.

In our way of thinking, this is the very essence of professionalism. A teacher who does wonderful, innovative things in one class is perhaps benefiting 25 children each year, but one whose ideas scale up to inform practice in hundreds or thousands of schools is making a real difference. Yet in order for teachers’ ideas to have broad impact, it helps a great deal for the teachers to be part of a national or regional network that speaks a common language and has common standards of practice.

Teachers need not be researchers to contribute to their profession. By participating in networks of like-minded educators – implementing, continuously improving, and communicating about proven, practical approaches intended to improve student outcomes – they play an essential role in the improvement of their profession.

Improvement by Design


I just read a very interesting book, Improvement by Design: The Promise of Better Schools, by David Cohen, Donald Peurach, Joshua Glazer, Karen Gates, and Simona Goldin. From 1996 to 2008, researchers originally at the University of Michigan studied three of the largest comprehensive school reform models of the time: America’s Choice (AC), Accelerated Schools Plus (ASP), and our own Success for All (SFA). A portion of the study, led by Brian Rowan, compared 115 elementary schools using one of these models to a matched control group and to each other. The quantitative study found that Success for All had strong impacts on reading achievement by third grade, America’s Choice had strong impacts on writing, and there were few impacts of Accelerated Schools Plus.

Improvement by Design tells a different story, based on qualitative studies of the three organizations over a very long time period. Despite sharp differences between the models, all of the organizations had to face a common set of challenges: creating viable models and organizations to support them, dealing with rapid scale-up through the 1990s (especially during the time period from 1997 to 2002 when Obey-Porter Comprehensive School Reform funding was made available to schools), and then managing catastrophe when the George W. Bush Administration ended comprehensive school reform.

The book is straightforward history, comparing and contrasting these substantial reform efforts, and does not directly draw policy conclusions. However, there is much in it that does have direct policy consequences. These are my conclusions, not the authors’, but I think they are consistent with the history.

1. Large-scale change that dramatically changes daily teaching is difficult but not impossible in high-poverty schools. All three models have worked in hundreds of schools, as have several other whole-school reform models.

2. Providing general principles and then leaving schools to create the details for themselves is not a successful strategy. This is what Accelerated Schools Plus tried to do, and the Michigan study not only found that ASP failed to change student outcomes, but also that it failed to have much observable impact on teaching, in contrast to AC and SFA.

3. What (2) implies is that if whole-school “improvement by design” is to succeed in the thousands of Title I schools that need it, large, well-managed, and well-capitalized organizations are necessary to provide high-quality and very specific training, coaching, and materials to implement proven models.

4. Federal policies (at least) need to be consistently hospitable to an environment in which schools and districts are choosing among many proven whole-school models. For example, federal requests for proposals might have a few competitive preference points for schools proposing to use whole-school reform models with strong evidence of effectiveness. This would signal an invitation to adopt such models without forcing schools to do so and risking extensive pushback. Further, federal policies promoting use of proven whole-school models should remain in effect for an extended period. Turmoil introduced by changing federal support for whole-school reform was very damaging to earlier efforts.

Improvement by Design provides a tantalizing glimpse of what could be possible in a system that offers a diversity of proven, whole-school options to high-poverty schools. This approach to reform has many obstacles to overcome, of course. But that would be true of any approach radical enough and scalable enough to have a chance of reforming American education.

Success in Evidence-Based Reform: The Importance of Failure

As always, Winston Churchill said it best: “Success consists of going from failure to failure without loss of enthusiasm.” There is a similar Japanese saying: “Success is being knocked down seven times and getting up eight.”

These quotes came to my mind while I was reading a recently released report from the Aspen Institute, “Leveraging Learning: The Evolving Role of Federal Policy in Education Research.” The report is a useful scan of the education research horizon, intended as background for the upcoming reauthorization of the Education Sciences Reform Act (ESRA), the legislation that authorizes the Institute of Education Sciences (IES). However, the report also contains brief chapters by various policy observers (including myself), focusing on how research might better inform and improve practice and outcomes in education. A common point of departure in several of these chapters was that while the randomized experiments (RCTs) emphasized for the past decade by IES and, more recently, by Investing in Innovation (i3) are all well and good, the IES experience is that most randomized evaluations of educational programs find few achievement effects. Several cited testimony by Jon Baron that “of the 90 interventions evaluated in randomized trials by IES, 90% were found to have weak or no positive effects.” In response, the chapter authors proposed various ways in which IES could add to its portfolio more research that does not rely on RCTs.

Within the next year or two, the problem Baron was reporting will take on a great deal of importance. The results of the first cohort of Investing in Innovation grants will start being released. At the same time, additional IES reports will appear, and the Education Endowment Foundation (EEF) in the U.K., much like i3, will also begin to report outcomes. All four of the first cohort of scale-up programs funded by i3 (our Success for All program, Reading Recovery, Teach for America, and KIPP) have had positive first-year findings in i3 or similar evaluations recently, but this is not surprising, as they had to pass a high evidence bar to get scale-up funding in the first place. The much larger number of validation and development projects were not required to have such strong research bases, and many of these are sure to show no effects on achievement. Kevan Collins, Director of the EEF, has always openly said that he’d be delighted if 10% of the studies EEF has funded find positive impacts. Perhaps in the country of Churchill, Collins is better placed to warn his countrymen that success in evidence-based reform is going to require some blood, sweat, toil, and tears.

In the U.S., I’m not sure if policymakers or educators are ready for what is about to happen. If most i3 validation and development projects fail to produce significant positive effects in rigorous, well-conducted evaluations, will opinion leaders celebrate the programs that do show good outcomes and value the knowledge gained from the whole process, including knowledge about what almost worked and what to avoid doing next time? Will they support additional funding for projects that take these learnings into account? Or will they declare the i3 program a failure and move on to the next set of untried policies and practices?

I very much hope that i3 or successor programs will stay the course, insisting on randomized experiments and building on what has been learned. Even if only 10% of validation and development projects report clear, positive achievement outcomes and capacity to go to scale, there will be many reasons to celebrate and stay on track:

1. There are currently 112 i3 validation and development projects (plus 5 scale-ups). If just 10% of these were found to be effective and scalable, that would be 11 new programs. Adding this to the scale-up programs and other programs already positively reviewed in the What Works Clearinghouse, this would be a substantial base of proven programs. In medicine, the great majority of treatments initially evaluated are found not to be effective, yet the medical system of innovation works because the few proven approaches make such a big difference. Failure is fine if it leads to success.

2. Among the programs that do not produce statistically significant positive outcomes on achievement measures, there are sure to be many that show promise but do not quite reach significance. For example, any program whose evaluation shows a student-level positive effect size of, say, +0.15 or more should be worthy of additional investment to refine and improve its procedures and its evaluation to reach a higher standard, rather than being considered a bust.

3. The i3 process is producing a great deal of information about what works and what does not, what gets implemented and what does not, and the match between schools’ needs and programs’ approaches. These learnings should contribute to improvements in new programs, to revisions of existing programs, and to the policies applied by i3, IES, and other funders.

4. As the findings of the i3 and IES evaluations become known, program developers, grant reviewers, and government leaders should get smarter about what kinds of approaches are likely to work and to go to scale. Because of this, one might imagine that even if only 10% of validation and development programs succeed in RCTs today, higher and higher proportions will succeed in such studies in the future.
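Point 2 above turns on statistical power: an effect size of +0.15 is educationally meaningful but hard to detect. A quick power calculation (my illustration, not from the original text; it uses the standard normal approximation for a two-group comparison of means) shows why evaluations of modest size routinely miss significance on true effects of this magnitude:

```python
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate students needed per group to detect a standardized
    effect size d in a two-group comparison of means, using the usual
    normal-approximation formula n = 2 * ((z_alpha + z_power) / d)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, two-sided test
    z_power = z.inv_cdf(power)          # quantile for desired power
    return 2 * ((z_alpha + z_power) / d) ** 2

# Detecting d = +0.15 with 80% power takes roughly 700 students per group,
# far more than many program evaluations enroll.
print(round(n_per_group(0.15)))  # 698
```

So a study with a few hundred students per condition could easily show a true +0.15 effect yet fall short of statistical significance, which is exactly why such results merit refinement and re-evaluation rather than being written off as failures.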

Evidence-based reform, in which promising scalable approaches are ultimately evaluated in RCTs or similarly rigorous evaluations, is the best way to create substantial and lasting improvements in student achievement. Failures of individual evaluations or projects are an expected, even valued part of the process of research-based reform. We need to be prepared for them, and to celebrate the successes and the learnings along the way.

As Churchill also said, “Success is not final, failure is not fatal: it is the courage to continue that counts.”



Can Educational Innovations Go To National Scale?


In conversations about evidence-based reform, I often hear the objection that “we don’t really know how to take proven innovations to scale” or that “in order for schools or districts to adopt innovations, they must have a central role in creating and disseminating them locally.”

These assumptions turn out to be false. There are in fact many instances in which programs not developed by the educators using them have been widely and enthusiastically adopted by schools all over the U.S.

National Diffusion Network (NDN)
First, there was the National Diffusion Network (NDN). From 1979 to 1996, NDN invited program developers of all kinds to have their programs reviewed by a Joint Dissemination Review Panel, which certified each program’s effects, likelihood of going to scale, and practical utility.

The program made “developer-dissemination” grants (at about $25,000 per year) to developers of promising programs. State facilitators were established in each state to promote the use of the appropriate programs. By the end of the NDN funding, thousands of schools were using one of more than 500 programs.

Comprehensive School Reform (CSR)
Beginning in 1991, a coalition of large corporations established New American Schools (NAS) to help fund innovators to create comprehensive whole-school reform models. Out of 700 applications, 11 were initially selected, and 7 of these were maintained after initial testing. These models began to be used in hundreds of schools collectively. NAS helped identify target districts, in which it held “effective methods fairs.” Hundreds of principals, teachers, and school board members came to learn about the models. They could ask representatives of one or more models to present at their schools. They then had a chance to contract with the models they chose. Starting in 1998, the Obey-Porter Act in Congress established incentive funding of at least $50,000 per year for three years for schools to implement comprehensive school reforms of their choice. This caused an outpouring of interest both in the NAS models and in others that were assembled to resemble NAS models. Within a few years, there were more than 2,500 Title I schools receiving CSR funding and another 3,500 schools adopting these models without CSR funding, mostly using existing Title I funds.

Evaluations of the CSR models began in the 1990s and continued into the early 2000s. They found consistent positive effects for some of the programs, especially the Comer School Development Program, America’s Choice, Modern Red Schoolhouse, and our Success for All program. Obey-Porter funding ended in 2003, but many of the school programs continued without Obey-Porter for many years, up to the present.

Investing in Innovation (i3)
The election of Barack Obama in 2008 brought in an administration eager to expand the use of research-proven programs in education and other fields. In a program called Investing in Innovation (i3), $650 million was set aside to fund educational programs in one of three categories: scale-up, validation, or development. To qualify for scale-up grants, programs had to have strong, positive, replicated outcomes in rigorous evaluations. Validation required a single positive study, and development grants only required a strong theory of action. Scale-up grantees received $50 million over five years to evaluate and scale up their reforms, while validation projects received $30 million and development projects $5 million. A total of 47 programs, including 4 scale-up projects (Success for All, Reading Recovery, KIPP, and Teach for America), received funding in the first round. In years after the first, annual i3 funding was reduced to $150 million, and grants in each category were cut in half. After four rounds of funding, 77 development, 35 validation, and 5 scale-up projects have been funded. It is too early to say how these grants will work out, but scale-up and validation projects are working in hundreds of additional schools under i3 funding and are developing capacity to do more. All of the programs will be rigorously evaluated by third-party evaluators.

NDN, CSR, and i3 have established beyond any doubt that:

1. With encouragement and modest funding, thousands of schools will eagerly adopt research-based programs.
2. Organizations willing and able to support school adoptions nationally will come forward and operate effectively if government helps schools with initial funding barriers.
3. Many whole-school reform models have developed strong evidence of effectiveness, but a strong evidence base without government encouragement and incentives does not lead to robust adoptions.
4. The idea that whole-school reforms must be created by the schools that use them has clearly been disproved. Schools are willing and able to adopt proven programs developed elsewhere if they can afford them.

As reforms in federal education programs such as Title I, School Improvement Grants, and Race to the Top go forward, it makes sense to continue to develop, evaluate, and disseminate whole-school reform models. This approach can expand rapidly while maintaining quality at scale and can improve outcomes for millions of disadvantaged children.

Lessons from Innovators: Reading Recovery


The process of moving an educational innovation from a good idea to widespread effective implementation is far from straightforward, and no one has a magic formula for doing it. The William T. Grant and Spencer Foundations, with help from the Forum for Youth Investment, have created a community composed of grantees in the federal Investing in Innovation (i3) program to share ideas and best practices. Our Success for All program participates in this community. In this space, I, in partnership with the Forum for Youth Investment, highlight observations from the experiences of i3 grantees other than our own, in an attempt to share the thinking and experience of colleagues out on the front lines of evidence-based reform.

This blog is based on an interview between the Forum for Youth Investment and Jerry D’Agostino, Professor of Education at the Ohio State University and Director of Reading Recovery’s i3 project. A persistent challenge for programs that have scaled up is how to sustain for the long term. In this interview, D’Agostino shares how this long-standing literacy intervention has dealt with the challenge and how it has reinvented itself over the years in order to stay current.

Stay Fresh
Reading Recovery is a research-based, short-term intervention that involves one-to-one teaching for the lowest-achieving first graders. It began in New Zealand in the 1970s but has been in operation in the United States for 30 years and has spread across the country. Over the years, Reading Recovery has expanded and contracted depending on funding, interest from school districts, and its own capacity. Today there are training centers at 19 universities that equip teachers to deliver the intervention, and the program has a presence in some 8,000 schools across 49 states. With that kind of scale and longevity, it can be easy to become complacent and assume the intervention speaks for itself. D’Agostino says just the opposite is true. “We know that being the old brand that has been around for a long time can be hard,” he notes. “You have to think about how to keep the brand fresh. Superintendents want the newest hot thing. Teachers have to know it will work with their kids in their classrooms. We have spent time focused on how to adjust the model to offer new features and respond to current education trends such as the Common Core. You always have to show teachers and administrators how the intervention addresses the issue of the day. For example, it isn’t enough that the intervention produces strong effect sizes. For teachers, that is a meaningless number. They want to know that the program will help their third graders achieve the literacy level now required in nearly 40 states to be promoted to 4th grade.”

Be Flexible but Maintain Your Core
Reading Recovery has taken seriously the idea of identifying the intervention’s core elements and also responding to the educational system’s current needs. They know that one-to-one instruction and 30-minute daily lessons are non-negotiable, but they also recognize that adaptations are needed. For example, innovations in the lesson framework have resulted in a design for classroom instruction (Literacy Collaborative), small groups (Comprehensive Intervention Model), and training for special education and ESL teachers (Literacy Lessons). “Our innovations have come as direct requests from schools,” says D’Agostino. “For example, a school says they need something for English Language Learners and we develop something new for that one school that then becomes a part of our overall product line. It allows growth for Reading Recovery and flexibility for schools.” Another non-negotiable is keeping training centralized. Although teacher leaders can receive training at one of the 19 partner universities, there are only a few places where trainers of teacher leaders can get certified. That allows Reading Recovery to maintain some quality control and fidelity over teacher leader training. “I’ve always been impressed with the fidelity of Reading Recovery instruction,” said D’Agostino. “I’ve seen Reading Recovery lessons in Zanesville, Ohio and Dublin, Ireland. The framework is the same, but each lesson is different in terms of how the teacher interacts with the student to scaffold literacy learning.”

Combine Historical Expertise with Fresh Perspective
D’Agostino is quick to note that one of Reading Recovery’s strengths and challenges is the longevity of its founders and senior leadership. Many of the original developers of the intervention are still in leadership positions. This allows for a historical perspective and continuity of purpose that are rare in education these days. It can also hinder innovation. That is why the organization also tries to find leadership positions for newer faculty and teachers with recent teaching and administrative experience who can bring fresh ideas and a willingness to push for some of the new adjustments to the model that schools are requesting.

Adapt, Adjust, and Meet Schools Where They Are
D’Agostino emphasizes that Reading Recovery’s current success and long history is no reason to sit back and relax. “We have survived a lot of changes over the years. We’ve grown, we’ve shrunk, we’ve survived major threats to our program from other national initiatives. Right now with our i3 grant, we are in a great position. We are going to reach our goal of training 3,700 teachers and producing good effects. But I don’t know that that will position us well for the future. In fact, I won’t be happy if we just reach our goals.” Sustaining an effective intervention and bringing it to more schools and students around the country means innovating, moving, pushing to the next level…and spreading the word. “Schools don’t necessarily hear about government funded initiatives that achieve high evidence standards according to the What Works Clearinghouse,” muses D’Agostino. “They hear from hundreds of vendors each year citing their effectiveness, so how do we distinguish ourselves? We can’t just assume success in our i3 grant will lead to sustainability. Sustainability is all about results. For example, we know that the outcomes are remarkable – most of the lowest-achieving first graders accelerate with Reading Recovery and reach the average of their cohort – but we also know from our annual evaluation that there’s a great deal of variation across schools and teachers. So right now we want to know, what do effective Reading Recovery teachers do and how is that different from less effective Reading Recovery teachers? Knowing more about that black box of teaching will help the intervention overall. And understanding how to foster local ownership will give the intervention its real staying power.”

Many Programs Meet New Evidence Standards


One of the most common objections to evidence-based reform is that there are too few programs with strong evidence of effectiveness to start encouraging schools to use proven programs. The concern is that it looks bad if a policy of “use what works” leads educators to look for proven programs, only to find that there are very few such programs in a given area, or that there are none at all.

The lack of proven programs is indeed a problem in some areas, such as science and writing, but it is not a problem in others, such as reading and math. There is no reason to hold back on encouraging evidence where it exists.

The U.S. Department of Education has proposed changes to its EDGAR regulations to define “strong” and “moderate” levels of evidence supporting educational programs. These standards use information from the What Works Clearinghouse (WWC), and are very similar to those used in the federal Investing in Innovation (i3) program to designate programs eligible for “scale-up” or “validation” grants, respectively.

As an exercise, my colleagues and I checked to see how many elementary reading programs currently exist that qualify as “strong” or “moderate” according to the new EDGAR standards. This necessitated excluding WWC-approved programs that are not actively disseminated and those that would not meet current WWC standards (2.1 or 3.0), and adding programs not yet reviewed by WWC but that appear likely to meet its standards.

Here’s a breakdown of what we found.

Beginning Reading (K-1)
Total programs: 26
School/classroom programs: 16
Small-group tutoring: 4
1-1 tutoring: 6

Upper Elementary Reading (2-6)
Total programs: 17
School/classroom programs: 12
Small-group tutoring: 4
1-1 tutoring: 1

The total number of unique programs is 35 (many of the programs covered both beginning and upper-elementary reading). Of these, only four met the EDGAR “strong” criterion, but the “moderate” category, which requires a single rigorous study with positive impacts, had 31 programs.

We’ll soon be looking at secondary reading and elementary and secondary math, but the pattern is clear. While few programs will meet the highest EDGAR standard, many will meet the “moderate” standard.

Here’s why this matters. The EDGAR definitions can be referenced in any competitive request for proposals to encourage and/or incentivize the use of proven programs, perhaps offering two competitive preference points for proposals to implement programs meeting the “moderate” standard and three points for proposals to adopt programs meeting the “strong” standard.

Since there are many programs to choose from, educators will not feel constrained by this process. In fact, many may be happy to learn about the many offerings available, and to obtain objective information on their effectiveness. If none of the programs fit their needs, they can choose something unevaluated and forgo the extra points, but even then, they will have considered evidence as a basis for their decisions. And that would be a huge step forward.