Can Findings From One Set of Schools Apply to Others?

Every person is unique. Yet research showing the effectiveness of medical treatments, for example, still applies to people beyond the ones who were in a particular study. If nothing generalized from one circumstance to others, then science would be meaningless. I think every educated person understands this.

Yet for some reason, research in education is often criticized for trying to generalize from one set of schools to others. Whenever I speak about evidence-based reform in education, most recently in a talk at my alma mater, Reed College, someone raises this concern. In a recent article, Lisbeth Schorr and Srik Gopal wonder how anything can generalize “from Peru to Poughkeepsie.”

First, let me state the obvious. Every school is different, and findings from studies done elsewhere cannot be assumed to apply to a specific school or set of schools. However, it would be foolish to ignore the evidence from high-quality research, especially to the degree that a school considering a program or practice found effective elsewhere resembles the schools in the studies that established that evidence. So Peru to Poughkeepsie might be a stretch, unless it is Peru, Illinois. And should Poughkeepsie ignore evidence from nearby Tarrytown and Nyack? Holding that generalization is never justified would be just as indefensible as holding that it is always justified.

There is an old saying to the effect that the race is not always to the swift nor the battle to the strong, but it’s best to bet that way. When responsible educators choose programs for their schools and districts, they are making a bet on behalf of their children. Why would they not take the evidence into account in making these important choices?

Determining when generalization is more or less warranted is not difficult. First, consider the strength of the evidence. A program proven effective in multiple studies, done by multiple researchers with many diverse schools, with random assignment to experimental or control groups and with measures not made by the experimenters or developers, should give potential adopters a lot of confidence. Second, consider similarity: to the degree that those studies involved schools like yours, serving similar communities, that adds a lot. Third, consider the consistency of the outcomes across different studies.

Schorr and Gopal are not opposed to randomized studies, but they warn against placing too much reliance on them. Yet the great advantage of randomized studies is that they rule out bias. How can that be a bad thing? What we need is many more randomized studies, and other studies with rigorous designs, done in many places, so that we can build up a large and diverse evidence base for programs that can be replicated. The road to generalizability is precisely the one that Schorr and Gopal would have us narrow: if fewer randomized studies are done, we will lack the quality, size, freedom from bias, and diversity of research needed to determine whether a program is truly and broadly effective.

Discussions about when generalization is most likely to take place are healthy and welcome. But they are not academic. America’s schools are not getting better fast enough, and achievement gaps by race and class remain unacceptable. Identifying proven programs and practices and replicating them broadly is the best way I know of to make genuine, lasting progress. The evidence base is only now getting large and good enough to justify policies of evidence-based reform, as the recent ESSA legislation tentatively begins to do. We need to continue to expand that evidence base and to use what we do know while working to learn more. Pretending that no school can learn from what was done in any other does not move us forward, and forward is the direction we need to be moving as fast as we possibly can.

Joy Is a Basic Skill in Secondary Reading

I have a policy of not talking about studies I’m engaged in before they are done and available, but I have an observation to make that just won’t wait.

I’m working on a review of research on secondary reading programs with colleagues Ariane Baye (University of Liège in Belgium) and Cynthia Lake (Johns Hopkins University). We have found a large number of very high-quality studies evaluating a broad range of programs. Most are large, randomized experiments.

Mostly, our review is really depressing. The great majority of studies have found no effects on learning. In particular, programs that teach middle and high school students who are struggling in reading in classes of 12 to 20, emphasizing metacognitive strategies, phonics, fluency, and/or training for teachers in what they were already doing, show few impacts. Most of the studies provided a daily extra reading class to help struggling readers build their skills, while the control group got band or art. They should have stayed in band or art.

Yet all is not dismal. Two approaches did have markedly positive effects. One was tutoring students in groups of one to four, not every day but perhaps twice a week. The other was cooperative learning, where students worked in four-member teams to help each other learn and practice reading skills. How could these approaches be so much more effective than the others?

My answer begins with a consideration of the nature of struggling adolescent readers. They are bored out of their brains. They are likely to see school as demeaning, isolating, and unrewarding. All adolescents live for their friends. They crave mastery and respect. Remedial approaches have to be fantastic to overcome the negative aspects of having to be remediated in the first place.

Tutoring can make a big difference, because groups are small enough for students to make meaningful relationships with adults and with other kids, and instruction can be personalized to meet their unique needs, to give them a real shot at mastery.

Cooperative learning, however, had a larger average effect size than tutoring. Even though cooperative learning did not require smaller class sizes and extra daily instructional periods, it was much more effective than remedial instruction. Cooperative learning gives struggling adolescent readers opportunities to work with their peers, to teach each other, to tease each other, to laugh, to be active rather than passive. To them, it means joy. And joy is a basic skill.
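
(A technical aside, since I just compared programs by their effect sizes: an effect size here is the standardized difference between treatment and control means. One common definition, though not necessarily the exact computation our review will use, is

    effect size = (treatment mean − control mean) / pooled standard deviation

so an effect size of +0.20 means the average treatment student scored a fifth of a standard deviation higher than the average control student.)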

Of course, joy is not enough. Kids must be learning joyfully, not just joyful. Yet in our national education system, so focused on testing and accountability, we have to keep remembering who we are teaching and what they need. More of the same, a little slower and a little louder, won’t do it. Adolescents need a reason to believe that things can be better, and that school need not cut them off from their peers. They need opportunities to teach and learn from each other. For so many adolescents, school must be joyful, or it is nothing at all.

Does Research-Based Reform Require Standardized Tests?

Whenever I speak about evidence-based reform, someone always asks this question: “Won’t all of these randomized evaluations just reinforce teaching to the (very bad word) standardized reading and math tests?” My wife Nancy and I were recently out at our alma mater, Reed College. I gave a speech on evidence-based reform, and of course I got this question.

Just to save time in future talks, I’m going to answer this question here once and for all. And the answer is:

No! In fact, evidence-based reform could be America’s escape route from excessive use of accountability and sanctions to guide education policy.

Here’s how this could happen. I’d be the first to admit that today, most studies use standardized tests as their outcome measures. However, this need not be the case, and there are in fact exceptions.

Experiments compare learning gains made by students in a given treatment group to those in a control group. Any assessment can be used as the posttest as long as it meets two key standards:

a) It measures content taught equally in both groups, and
b) It was not made up by the developer or researchers.

What this means is that authentic performance measures, measures aligned with today’s standards, measures that require creativity and non-routine problem solving, or measures of subjects (such as social studies) that are taught in school but not tested for accountability, can all be valid in experiments, as long as experimental and control students had exposure to the content, skills, or performances being assessed.
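
To make the arithmetic of that comparison concrete, here is a minimal sketch in Python. The scores are simulated and every name in it is hypothetical; it illustrates only the core treatment-control contrast, not the design or analysis of any actual study.

    import random
    import statistics

    # Simulated posttest scores, for illustration only. In a real evaluation
    # the posttest would be an actual assessment meeting the two standards
    # above: content taught in both groups, not made by the developers.
    random.seed(1)
    students = list(range(60))
    random.shuffle(students)  # random assignment rules out selection bias
    treatment_ids, control_ids = students[:30], students[30:]

    # Both groups draw from the same distribution here, so the expected
    # effect size is zero (as in most of the studies discussed above).
    scores = {s: random.gauss(100, 15) for s in students}
    t = [scores[s] for s in treatment_ids]
    c = [scores[s] for s in control_ids]

    # Standardized mean difference: the treatment-control gap in means,
    # divided by the pooled standard deviation of the posttest.
    n_t, n_c = len(t), len(c)
    var_t, var_c = statistics.variance(t), statistics.variance(c)
    pooled_sd = (((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2)) ** 0.5
    d = (statistics.mean(t) - statistics.mean(c)) / pooled_sd
    print(f"Effect size: {d:+.2f}")

A real analysis would also adjust for pretest scores and for the clustering of students within schools; the sketch shows only the comparison at the heart of every such experiment.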

As a practical example, imagine a year-long evaluation of an innovative science program for seventh graders. The new program emphasizes the use of open-ended group laboratory projects and explanations of processes observed in the lab. The control group uses a traditional science text, demonstrations, and more traditional lab work.

Experimenters might evaluate the innovative approach using a multifaceted measure (made by someone other than the researchers) covering all of seventh-grade science, including open-ended as well as multiple-choice tests. In addition, students might be asked to set up, carry out, and explain a lab exercise involving content that was not presented in either program.

If this independent measure showed positive effects on all types of measures, great. If it showed advantages on open-ended and lab assessments but no differences on multiple-choice tests, this could be important, too.

The point here is that measures used in program evaluations need not be limited to standardized measures, but can instead help break us free from them. One reason it’s hard to get away from multiple-choice tests is that they are cheap and easy to administer and score, which makes them much easier to use when you have to test a whole state. It would be very difficult to have every student in a state set up a lab experiment, for example. But program evaluations need not be bound by this practical restriction, because they only involve a few dozen schools, at most. Even if the state does not test writing, for example, program evaluations can. Even if the state does not test social studies, program evaluations can.

As evidence-based reform becomes more important, it may become more and more possible to justify the teaching of approaches proven to be effective on valid measures, even if the subject is not currently assessed for accountability and even if the tests are not accountability measures. States may never administer history tests, but they may encourage or incentivize use of proven approaches to teaching history. Perhaps proven approaches to music, art, or physical education could be identified and then supported and encouraged.

Managing schools by accountability alone leaves us trying to enforce minimums using limited assessments. This may be necessary to be sure that all schools are meeting those minimums and making good progress on basic skills, but we need to go beyond standardized tests to consider all of the needs shared by all children. Evidence that goes beyond what states can test well may be just the ticket out of the standardized test trap.

What If You Crossed a Sears Catalogue With Consumer Reports?

When I was in high school, I had a summer job delivering Sears catalogues. What a great job. I got to drive all over the Maryland suburbs of Washington, DC, carrying tons of paper and wrecking the suspension on my mother’s 1960 Chevy station wagon. But more than that, I was instantly popular. A Sears catalogue wasn’t just advertising. It was a portal to people’s dreams.

In evidence-based reform in education, we have to somehow capture the sense of possibilities that the Sears catalogue brought for more mundane purposes. I’d love to imagine principals, superintendents, and teachers leafing through descriptions of programs of all sorts and from all providers that they could use with their kids, firing their imaginations with new ways to improve outcomes in their schools.

For this vision to truly make a difference, though, you’d have to marry the Sears catalogue to another 1960s standard, Consumer Reports. Imagine that each product that was supposed to produce an outcome came with an independent rating of its effectiveness and reliability. Does a washing machine produce cleaner clothes without breaking down? Lawn mower? Power tools? Motorboat? Bicycle?

A mixture of the Sears catalogue and Consumer Reports could have many important benefits. Right at the time readers are getting fired up about a new grill or energized about a new battery, their enthusiasm could be heightened or tempered by practical, independent information, so they could make better choices, comparing outcomes to those of alternative products. Further, knowing that readers would see this information, Sears would have to make sure that its products actually work, so the products themselves would get better over time, continuously enhancing consumers’ faith in the whole catalogue, in a virtuous cycle of innovation and satisfaction.

Switching back to education, we need something for educational programs just like the marriage of the Sears catalogue and Consumer Reports. Of course, today it would be a web site rather than a massive suspension-crushing book. That web site could help educators find out about all the wonderful solutions out there: ways to improve students’ math or science achievement, reduce reading failures, increase social-emotional health, and so on. Just knowing all the possibilities would be great, but imagine if the web site also provided independent, scientifically valid summaries of the evidence on each program or practice in the “catalogue.”

The evidence standards in the Every Student Succeeds Act (ESSA) define strong, moderate, and promising levels of evidence, and encourage their use. But educators should not use evidence-based programs and practices just because the government suggests they do so. They should use proven programs because they are responsible stewards of the futures of their children.

A catalogue of dreams, tied to the practical information and resources to make dreams come true, is what we need in education.

An Exploded View of Comprehensive School Reform

Recently, I had to order a part for an electric lawn mower. I enjoyed looking at the exploded view of it on the manufacturer’s web site. What struck me was that so many of the parts were generic screws, bolts, springs, wheels, and so on. With a bit of ingenuity, I’m sure someone (not me!) could track down generic electric motors, mower blades, and other more specialized parts, and build their very own do-it-yourself lawn mower.

There are just a few problems with this idea.

  1. It would cost a lot more than the original mower.
  2. It would take a lot of time that could be put to better purposes.
  3. It wouldn’t work, and you’d end up with an expensive pile of junk to discard.

Why am I yammering on about exploded views of lawn mowers? Because the idea of assembling lawn mowers from generic parts is a lot like what all too many struggling schools do in the name of whole school reform.

In education, the do-it-yourself equivalent is the notion that if you choose one program for reading and another for behavior and a third for parent involvement and a fourth for tutoring and a fifth for English learners and a sixth for formative assessment and a seventh for coaching, the school is bound to do better. It might, but this piecemeal approach is really hard to do well.

The alternative to assembling all of those generic parts is to adopt a comprehensive school improvement model. These are models with coordinated, well-worked-out, well-supported approaches to increasing student success. Our own Success for All program is one of them, but there are others for elementary and secondary schools. After years of encouraging schools receiving School Improvement Grants (SIG) to assemble their own comprehensive reforms (remember the lawn mower?), the U.S. Department of Education finally offered SIG schools the option of choosing a proven whole-school approach. In addition to Success for All, the Department approved three other comprehensive programs based on their evidence of effectiveness: Positive Action, the Institute for Student Achievement, and New York City’s small high schools of choice approach. These all met the Department’s standards because they had at least one randomized experiment showing positive outcomes on achievement measures, but some had a lot more evidence than that.

Comprehensive approaches resemble the fully assembled lawn mower rather than the DIY exploded view. The parts of the comprehensive models may be like those of the do-it-yourself SIG models, but the difference is that the comprehensive models have a well-thought-out plan for coordinating all of those elements. Also, even if a school used proven elements to build its own model, those elements would not have been proven in combination, and each might compete for the energies, enthusiasm, and resources of the beleaguered school staff.

This is the last year of SIG under the old rules, but it will continue in a different form under ESSA. The ESSA School Improvement provisions require the use of programs that meet strong, moderate, or promising evidence standards. Assembling individual proven elements is not a terrible idea, and is a real improvement on the old SIG because it at least requires evidence for some of the parts. But perhaps broader use of comprehensive programs with strong evidence of effectiveness for the whole school-wide approach, not just the parts, will help finally achieve the bold goals of school improvement for some of the most challenging schools in our country.