Hummingbirds and Horses: On Research Reviews

Once upon a time, there was a very famous restaurant, called The Hummingbird.   It was known the world over for its unique specialty: Hummingbird Stew.  It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird?  You’d have to catch dozens of them just for one bowl of stew.

One day, an experienced restaurateur came to The Hummingbird, and asked to speak to the owner.  When they were alone, the visitor said, “You have quite an operation here!  But I have been in the restaurant business for many years, and I have always wondered how you do it.  No one can make money selling Hummingbird Stew!  Tell me how you make it work, and I promise on my honor to keep your secret to my grave.  Do you…mix just a little bit?”


The Hummingbird’s owner looked around to be sure no one was listening.   “You look honest,” he said. “I will trust you with my secret.  We do mix in a bit of horsemeat.”

“I knew it!” said the visitor.  “So tell me, what is the ratio?”

“One to one.”

“Really!” said the visitor.  “Even that seems amazingly generous!”

“I think you misunderstand,” said the owner.  “I meant one hummingbird to one horse!”

In education, we write a lot of reviews of research.  These are often very widely cited, and can be very influential.  Because of the work my colleagues and I do, we have occasion to read a lot of reviews.  Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically.  Well-done reviews can reveal patterns of findings that can be of great value to both researchers and educators.  They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.

However, all too many reviews are deeply flawed.  Frequently, reviews of research make it impossible to check the validity of the findings of the original studies.  As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew.   Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies.  If the individual studies are of good quality, this is wonderfully useful.  But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles.  Few readers are willing to do this.
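To make the arithmetic behind such syntheses concrete, here is a minimal sketch of the standardized mean difference, the “effect size” that meta-analyses typically average across studies. The function name and the sample scores are hypothetical, not taken from any review discussed here.

```python
from statistics import mean, stdev

def effect_size(treatment, control):
    """Standardized mean difference (Cohen's d): difference in group
    means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    s_t, s_c = stdev(treatment), stdev(control)
    # Pool the two sample variances, weighted by degrees of freedom.
    pooled_sd = (((n_t - 1) * s_t ** 2 + (n_c - 1) * s_c ** 2)
                 / (n_t + n_c - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Hypothetical posttest scores for a treatment and a control group.
print(round(effect_size([2, 4, 6], [1, 3, 5]), 2))  # 0.5
```

Note that a pre-post “effect size” replaces the control-group mean with the same students’ pretest mean, so it credits the program with all of the growth students would have made anyway, which is one reason mixing designs in a single synthesis misleads.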

Recently, I have been looking at a lot of reviews, all of them published, often in top journals.  One published review used only pre-post gains.  Presumably, if the reviewers found a study with a control group, they would have ignored the control group data!  Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control comparisons, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.

I have also recently seen reviews that include studies with and without control groups (i.e., pre-post gains), and those with and without pretests.  Without pretests, experimental and control groups may have started at very different points, and these differences just carry over to the posttests.  A review that accepts this jumble of experimental designs makes no sense.  Treatments evaluated using pre-post designs will almost always look far more effective than those that use experimental-control comparisons.

Many published reviews include results from measures that were made up by program developers.  We have documented that analyses using such measures produce effect sizes two, three, or sometimes four times as large as those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies rather than longer ones, and from published studies rather than, for example, technical reports.

The biggest problem is that in many reviews, the designs of the individual studies are never described sufficiently to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out how many of the studies behind a review’s conclusions are rigorous and how many are not. Many years ago, I looked into a widely cited review of research on achievement effects of class size.  Study details were lacking, so I had to find and read the original studies.  It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis!  The studies that reduced class size within the usual range (e.g., comparing reductions from 24 to 12) had very small achievement impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.
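The class-size example can be sketched numerically. The figures below are invented for illustration only, not taken from the review in question: a handful of within-range class-size studies with small effects, plus two tutoring-like outliers.

```python
# Hypothetical effect sizes, for illustration only.
typical_class_size = [0.05, 0.02, 0.08, 0.04, 0.06]  # within-range reductions
outliers = [1.10, 0.95]                              # e.g., one-to-one tutoring

# Unweighted mean of the within-range studies alone.
honest_mean = sum(typical_class_size) / len(typical_class_size)

# Unweighted mean after mixing in the two outliers.
mixed = typical_class_size + outliers
inflated_mean = sum(mixed) / len(mixed)

print(f"without outliers: {honest_mean:+.2f}")    # +0.05
print(f"with outliers:    {inflated_mean:+.2f}")  # +0.33
```

A simple unweighted mean lets one or two horses dominate the stew, which is why reviews need tables of study characteristics and analyses of whether a few atypical studies are driving the average.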

It would be great if all reviews excluded studies that used procedures known to inflate effect sizes. At bare minimum, reviewers should be routinely required to include tables showing the critical details of each study, and to analyze whether the reported outcomes might be due to studies that used procedures suspected of inflating effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Programs and Practices

One issue I hear about all the time when I speak about evidence-based reform in education relates to the question of programs vs. practices. A program is a specific set of procedures, usually with materials, software, professional development, and other elements, designed to achieve one or more important outcomes, such as improving reading, math, or science achievement. Programs are typically created by non-profit organizations, though they may be disseminated by for-profits. Almost everything in the What Works Clearinghouse (WWC) and Evidence for ESSA is a program.

A practice, on the other hand, is a general principle that a teacher can use. It may not require any particular professional development or materials.  Examples of practices include suggestions to use more feedback, more praise, a faster pace of instruction, more higher-order questions, or more technology.

In general, educators, and especially teachers, love practices, but are not so crazy about programs. Programs have structure, requiring adherence to particular activities and use of particular materials. In contrast, every teacher can use practices as they wish. Educational leaders often say, “We don’t do programs.” What they mean is, “We give our teachers generic professional development and then turn them loose to interpret it as they see fit.”

One problem with practices is that because they leave the details up to each teacher, teachers are likely to interpret them in a way that conforms to what they are already doing, and then no change happens. As an example of this, I once attended a speech by the late, great Madeline Hunter, extremely popular in the 1970s and ‘80s. She spoke and wrote clearly and engagingly in a very down-to-earth way. The auditorium where she spoke was stuffed to the rafters with teachers, who hung on her every word.

When her speech was over, I was swept out in a throng of happy teachers. They were all saying to each other, “Madeline Hunter supports absolutely everything I’ve ever believed about teaching!”

I love happy teachers, but I was puzzled by their reaction. If all the teachers were already doing the things Madeline Hunter recommended to the best of their ability, then how did her ideas improve their teaching? In actuality, a few studies of Hunter’s principles found no significant effects on student learning, and even more surprising, they found few differences between the teaching behaviors of teachers trained in Hunter’s methods and those who had not been. Essentially, one might argue, Madeline Hunter’s principles were popular precisely because they did not require teachers to change very much, and if teachers do not change their teaching, why would we expect their students’ learning to change?

[Photo: a pile of disassembled lawnmower parts]

Another reason that practices rarely change learning is that they are usually small improvements that teachers are expected to assemble to improve their teaching. However, asking teachers to put together many pieces into major improvements is a bit like giving someone the pieces and parts of a lawnmower and asking them to put them together (see picture above). Some mechanically-minded people could do it, but why bother? Why not start with a whole lawnmower?

In the same way, there are gifted teachers who can assemble principles of effective practice into great instruction, but why make it so difficult? Great teachers who could assemble isolated principles into effective teaching strategies are also sure to be able to take a proven program and implement it very well. Why not start with something known to work and then improve it with effective implementation, rather than starting from scratch?

One problem with practices is that most are impossible to evaluate. By definition, everyone has their own interpretation of every practice. If practices become specific, with specific guides, supports, and materials, they become programs. So a practice is a practice exactly because it is too poorly specified to be a program. And practices that are difficult to clearly specify are also unlikely to improve student outcomes.

There are exceptions, where practices can be evaluated. For example, eliminating ability grouping or reducing class size or assigning (or not assigning) homework are practices that can be evaluated, and can be specified. But these are exceptions.

The squishiness of most practices is the reason that they rarely appear in the WWC or Evidence for ESSA. A proper evaluation contrasts one treatment (an experimental group) to a control group continuing current practices. The treatment group almost has to be a program, because otherwise it is impossible to tell what is being evaluated. For example, how can an experiment evaluate “feedback” if teachers make up their own definitions of “feedback”? How about higher-order questions? How about praise? Rapid pace? Use of these practices can be measured using observation, but differences between the treatment and control groups may be hard to detect because in each case teachers in the control group may also be using the same practices. What teacher does not provide feedback? What teacher does not praise children? What teacher does not use higher-order questions? Some may use these practices more than others, but the differences are likely to be subtle. And subtle differences rarely produce important outcomes.

The distinction between programs and practices has a lot to do with the practices (not programs) promoted by John Hattie. He wants to identify practices that can help teachers know what works in instruction. That’s a noble goal, but it can rarely be accomplished using real classroom research done over real periods of time. In order to isolate particular practices for study, researchers often do very brief, artificial lab studies that have nothing to do with classroom practices.  For example, some lab studies in Hattie’s own review of feedback contrast teachers giving feedback with teachers giving no feedback. What teacher would do that?

It is worthwhile to use what we know from research, experience, program evaluations, and theory to discuss what practices may be most useful for teachers. But claiming particular effect sizes for such practices is rarely justified. The strongest evidence for practical use in schools will almost always come from experiments evaluating programs. Practices have their place, but exposing teachers to a lot of practices and expecting them to put them together to improve student outcomes is not likely to work.


New Findings on Tutoring: Four Shockers

One-to-one and one-to-small group tutoring have long existed as remedial approaches for students who are performing far below expectations. Everyone knows that tutoring works, and nothing in this blog contradicts this. Although different approaches have their champions, the general consensus is that tutoring is very effective, and the problem with widespread use is primarily cost (and for tutoring by teachers, availability of sufficient teachers). If resources were unlimited, one-to-one tutoring would be the first thing most educators would recommend, and they would not be wrong. But resources are never unlimited, and the numbers of students performing far below grade level are overwhelming, so cost-effectiveness is a serious concern. Further, tutoring seems so obviously effective that we may not really understand what makes it work.

In recent reviews, my colleagues and I examined what is known about tutoring. Beyond the simple conclusion that “tutoring works,” we found some big surprises, four “shockers.” Prepare to be amazed! Further, I propose an explanation to account for these unexpected findings.

We have recently released three reviews that include thorough, up-to-date reviews of research on tutoring. One is a review of research on programs for struggling readers in elementary schools by Amanda Inns and colleagues (2018). Another is a review on programs for secondary readers by Ariane Baye and her colleagues (2017). Finally, there is a review on elementary math programs by Marta Pellegrini et al. (2018). All three use essentially identical methods, from the Best Evidence Encyclopedia (www.bestevidence.org). In addition to sections on tutoring strategies, all three also include other, non-tutoring methods directed at the same populations and outcomes.

What we found challenges much of what everyone thought they knew about tutoring.

Shocker #1: In all three reviews, tutoring by paraprofessionals (teaching assistants) was at least as effective as tutoring by teachers. This was found for reading and math, and for one-to-one and one-to-small group tutoring.  For struggling elementary readers, para tutors actually had higher effect sizes than teacher tutors. Effect sizes were +0.53 for paras and +0.36 for teachers in one-to-one tutoring. For one-to-small group, effect sizes were +0.27 for paras, +0.09 for teachers.

Shocker #2: Volunteer tutoring was far less effective than tutoring by either paras or teachers. Some programs using volunteer tutors provided them with structured materials and extensive training and supervision. These found positive impacts, but far smaller than those for paraprofessional tutors. Volunteers tutoring one-to-one had an effect size of +0.18, while paras had an effect size of +0.53. Because of the need for recruiting, training, supervision, and management, and also because the more effective tutoring models provide stipends or other pay, volunteers were not much less expensive than paraprofessionals as tutors.

Shocker #3:  Inexpensive substitutes for tutoring have not worked. Everyone knows that one-to-one tutoring works, so there has long been a quest for approaches that simulate what makes tutoring work. Yet so far, no one, as far as I know, has found a way to turn lead into tutoring gold. Although tutoring in math was about as effective as tutoring in reading, a program that used online math tutors communicating over the Internet from India and Sri Lanka to tutor students in England, for example, had no effect. Technology has long been touted as a means of simulating tutoring, yet even when computer-assisted instruction programs have been effective, their effect sizes have been far below those of the least expensive tutoring models, one-to-small group tutoring by paraprofessionals. In fact, in the Inns et al. (2018) review, no digital reading program was found to be effective with struggling readers in elementary schools.

Shocker #4: Certain whole-class and whole-school approaches work as well or better for struggling readers than tutoring, on average. In the Inns et al. (2018) review, the average effect size for one-to-one tutoring approaches was +0.31, and for one-to-small group approaches it was +0.14. Yet the mean for whole-class approaches, such as Ladders to Literacy (ES = +0.48), PALS (ES = +0.65), and Cooperative Integrated Reading and Composition (ES = +0.19) averaged +0.33, similar to one-to-one tutoring by teachers (ES = +0.36). The mean effect size for comprehensive tiered school approaches, such as Success for All (ES = +0.41) and Enhanced Core Reading Instruction (ES = +0.22), was +0.43, higher than any category of tutoring (note that these models include tutoring as part of an integrated response to intervention approach). Whole-class and whole-school approaches work with many more students than do tutoring models, so these impacts are obtained at a much lower cost per pupil.

Why does tutoring work?

Most researchers and others would say that well-structured tutoring models work primarily because they allow tutors to fully individualize instruction to the needs of students. Yet if this were the only explanation, then other individualized approaches, such as computer-assisted instruction, would have outcomes similar to those of tutoring. Why is this not the case? And why do paraprofessionals produce at least equal outcomes to those produced by teachers as tutors? None of this squares with the idea that the impact of tutoring is entirely due to the tutor’s ability to recognize and respond to students’ unique needs. If that were so, other forms of individualization would be a lot more effective, and teachers would presumably be a lot more effective at diagnosing and responding to students’ problems than would less highly trained paraprofessionals. Further, whole-class and whole-school reading approaches, which are not completely individualized, would have much lower effect sizes than tutoring.

My theory to account for the positive effects of tutoring in light of the four “shockers” is this:

  • Tutoring does not work due to individualization alone. It works due to individualization plus nurturing and attention.

This theory begins with the fundamental and obvious assumption that children, perhaps especially low achievers, are highly motivated by nurturing and attention, perhaps far more than by academic success. They are eager to please adults who relate to them personally.  The tutoring setting, whether one-to-one or one-to-very small group, gives students the undivided attention of a valued adult who can give them personal nurturing and attention to a degree that a teacher with 20-30 students cannot. Struggling readers may be particularly eager to please a valued adult, because they crave recognition for success in a skill that has previously eluded them.

Nurturing and attention may explain the otherwise puzzling equality of outcomes obtained by teachers and paraprofessionals as tutors. Both types of tutors, using structured materials, may be equally able to individualize instruction, and there is no reason to believe that paras will be any less nurturing or attentive. The assumption that teachers would be more effective as tutors depends on the belief that tutoring is complicated and requires the extensive education a teacher receives. This may be true for very unusual learners, but for most struggling students, a paraprofessional may be as capable as a teacher in providing individualization, nurturing, and attention. This is not to suggest that paraprofessionals are as capable as teachers in every way. Teachers have to be good at many things: preparing and delivering lessons, managing and motivating classes, and much more. However, in their roles as tutors, teachers and paraprofessionals may be more similar.

Volunteers certainly can be nurturing and attentive, and can be readily trained in structured programs to individualize instruction. The problem, however, is that studies of volunteer programs report difficulties in getting volunteers to attend every day and to avoid dropping out when they get a paying job. This may be less of a problem when volunteers receive a stipend; paid volunteers are much more effective than unpaid ones.

The failure of tutoring substitutes, such as individualized technology, is easy to predict if the importance of nurturing and attention is taken into account. Technology may be fun, and may be individualized, but it usually separates students from the personal attention of caring adults.

Whole-Class and Whole-School Approaches.

Perhaps the biggest shocker of all is the finding that for struggling readers, certain non-technology approaches to instruction for whole classes and schools can be as effective as tutoring. Whole-class and whole-school approaches can serve many more students at much lower cost, of course. These classroom approaches mostly use cooperative learning, phonics-focused teaching, or both, and the whole-school models, especially Success for All, combine these approaches with tutoring for students who need it.

The success of certain whole-class programs, of certain tutoring approaches, and of whole-school approaches that combine proven teaching strategies with tutoring for students who need more, argues for response to intervention (RTI), the policy that has been promoted by the federal government since the 1990s. So what’s new? What’s new is that the approach I’m advocating is not just RTI. It’s RTI done right, where each component of the strategy has strong evidence of effectiveness.

The good news is that we have powerful and cost-effective tools at our disposal that we could be putting to use on a much more systematic scale. Yet we rarely do this, and as a result far too many students continue to struggle with reading, even ending up in special education due to problems schools could have prevented. That is the real shocker. It’s up to our whole profession to use what works, until reading failure becomes a distant memory. There are many problems in education that we don’t know how to solve, but reading failure in elementary school isn’t one of them.

Practical Implications.

Perhaps the most important practical implication of this discussion is a realization that benefits similar to or greater than those of one-to-one tutoring by teachers can be obtained in other ways that can be cost-effectively extended to many more students: using paraprofessional tutors, using one-to-small group tutoring, or using whole-class and whole-school tiered strategies. It is no longer possible to say with a shrug, “Of course tutoring works, but we can’t afford it.” The “four shockers” tell us we can do better, without breaking the bank.

 

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (2017). Effective reading programs for secondary students. Manuscript submitted for publication. Also see Baye, A., Lake, C., Inns, A. & Slavin, R. E. (2017, August). Effective Reading Programs for Secondary Students. Baltimore, MD: Johns Hopkins University, Center for Research and Reform in Education.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.


Photo by Westsara (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons

 

Implementing Proven Programs

There is an old joke that goes like this. A door-to-door salesman is showing a housewife the latest, fanciest, most technologically advanced vacuum cleaner. “Ma’am,” says the salesman, “this machine will do half your work!”

“Great!” says the housewife. “I’ll take two!”

All too often, when school leaders decide to adopt proven programs, they act like the foolish housewife. The program is going to take care of everything, they think. Or if it doesn’t, it’s the program’s fault, not theirs.

I wish I could tell you that you could just pick a program from our Evidence for ESSA site (launching on February 28! Next week!), wind it up, and let it teach all your kids, sort of the way a Roomba is supposed to clean your carpets. But I can’t.

Clearly, any program, no matter how good the evidence behind it is, has to be implemented with the buy-in and participation of all involved, planning, thoughtfulness, coordination, adequate professional development, interim assessment and data-based adjustments, and final assessment of program outcomes. In reality, implementing proven programs is difficult, but so is implementing ordinary unproven programs. All teachers and administrators go home every day dead tired, no matter what programs they use. The advantage of proven programs is that they hold out promise that this time, teachers’ and administrators’ efforts will pay off. Also, almost all effective programs provide extensive, high-quality professional development, and most teachers and administrators are energized and enthusiastic about engaging professional development. Finally, whole-school innovations, done right, engage the whole staff in common activities, exchanging ideas, strategies, successes, challenges, and insights.

So how can schools implement proven programs with the greatest possible chance of success? Here are a few pointers (from 43 years of experience!).

Get Buy-In. No one likes to be forced to do anything and no one puts in their best effort or imagination for an activity they did not choose.

When introducing a proven program to a school staff, have someone from the program provider’s staff come to explain it to the staff, and then get staff members to vote by secret ballot. Require an 80% majority.

This does several things. First, it ensures that the school staff is on board, willing to give the program their best shot. Second, it effectively silences the small minority in every school that opposes everything. After the first year, additional schools that did not select the program in the first round should be given another opportunity, but by then they will have seen how well the program works in neighboring schools.

Plan, Plan, Plan. Did you ever see the Far Side cartoon in which there is a random pile of horses and cowboys and a sheriff says, “You don’t just throw a posse together, dadgummit!” (or something like that). School staffs should work with program providers to carefully plan every step of program introduction. The planning should focus on how the program needs to be adapted to the specific requirements of this particular school or district, and make best use of human, physical, technological, and financial resources.

Professional Development. Perhaps the most common mistake in implementing proven programs is providing too little on-site, up-front training, and too little on-site, ongoing coaching. Professional development is expensive, especially if travel is involved, and users of proven programs often try to minimize costs by doing less professional development, or doing all or most of it electronically, or using “trainer-of-trainer” models (in which someone from the school or district learns the model and then teaches it to colleagues).

Here’s a dark secret. Developers of proven programs almost never use any of these training models in their own research. Quite the contrary, they are likely to have top-quality coaches swarming all over schools, visiting classes and ensuring high-quality implementation any way they can. Yet when it comes time for dissemination, they keep costs down by providing much, much less than what was needed (which is why they provided it in their studies). This is such a common problem that Evidence for ESSA excludes programs that used a lot of professional development in their research but today provide only an online manual, for example. Evidence for ESSA tries to describe dissemination requirements in terms of what was done in the research, not what is currently offered.

Coaching. Coaching means having experts visit teachers’ classes and give them individual or schoolwide feedback on their quality of implementation.

Coaching is essential because it helps teachers know whether they are on track to full implementation, and enables the project to provide individualized, actionable feedback. If you question the need for feedback, consider how you could learn to play tennis or golf, play the French horn, or act in Shakespearean plays, if no one ever saw you do it and gave you useful and targeted feedback and suggestions for improvement. Yet teaching is much, much more difficult.

Sure, coaching is expensive. But poor implementation squanders not only the cost of the program, but also teachers’ enthusiasm and belief that things can be better.

Feedback. Coaches, building facilitators, or local experts should have opportunities to give regular feedback to schools using proven programs, on implementation as well as outcomes. This feedback should be focused on solving problems together, not on blaming or shaming, but it is essential in keeping schools on track toward goals. At the end of each quarter or at least annually, school staffs need an opportunity to consider how they are doing with a proven program and how they are going to make it better.

Proven programs plus thoughtful, thorough implementation are the most powerful tool we have to make a major difference in student achievement across whole schools and districts. They build on the strengths of schools and teachers, and create a lasting sense of efficacy. A team of teachers and administrators that has organized itself around a proven program, implemented it with pride and creativity, and seen enhanced outcomes, is a force to be reckoned with. A force for good.

Time Passes. Will You?

When I was in high school, one of my teachers posted a sign on her classroom wall under the clock:

Time passes. Will you?

Students spend a lot of time watching clocks, yearning for the period to be over. Yet educators and researchers often seem to believe that more time is of course beneficial to kids’ learning. Isn’t that obvious?

In a major review of secondary reading programs I am completing with my colleagues Ariane Baye, Cynthia Lake, and Amanda Inns, it turns out that the kids were right. More time, at least in remedial reading, may not be beneficial at all.

Our review identified 60 studies of extraordinary quality (mostly large-scale randomized experiments) evaluating reading programs for students in grades 6 to 12. In most of the studies, students reading 2 to 5 grade levels below expectations were randomly assigned to receive an extra class period of reading instruction every day all year, in some cases for two or three years. Students randomly assigned to the control group continued in classes such as art, music, or study hall. The strategies used in the remedial classes varied widely, including technology approaches, teaching focused on metacognitive skills (e.g., summarization, clarification, graphic organizers), teaching focused on phonics skills that should have been learned in elementary school, and other remedial approaches, all of which provided substantial additional time for reading instruction. It is also important to note that the extra-time classes were generally smaller than ordinary classes, in the range of 12 to 20 students.

In contrast, other studies provided whole class or whole school methods, many of which also focused on metacognitive skills, but none of which provided additional time.

Analyzing across all studies (setting aside five British tutoring studies), we found no effect of additional time in remedial reading. The effect size for the 22 extra-time studies was +0.08, while for 34 whole-class/whole-school studies it was slightly higher, ES = +0.10. That's an awful lot of additional teaching time for no additional learning benefit.
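For readers unfamiliar with the statistic: effect sizes like these are standardized mean differences, the experimental-control difference in means divided by a pooled standard deviation. A minimal sketch of the calculation, with all test scores and sample sizes hypothetical (not from the review itself):

```python
import math

def effect_size(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference: (treatment mean - control mean) / pooled SD."""
    pooled_sd = math.sqrt(
        ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    )
    return (mean_t - mean_c) / pooled_sd

# Hypothetical reading-test results: an ES of +0.08 means the groups
# differ by less than a tenth of a standard deviation.
print(round(effect_size(mean_t=101.6, mean_c=100.0,
                        sd_t=20, sd_c=20, n_t=500, n_c=500), 2))  # → 0.08
```

Seen this way, the gap between +0.08 and +0.10 is a small fraction of a standard deviation, which is why an entire extra class period per day for no such gain is so striking.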

So what did work? Not surprisingly, one-to-one and small-group tutoring (up to four students per tutor) were very effective. These are remedial and do usually provide additional teaching time, but in a much more intensive and personalized way.

Other approaches that showed particular promise simply made better use of existing class time. A program called The Reading Edge involves students in small mixed-ability teams where they are responsible for the reading success of all team members. A technology approach called Achieve3000 showed substantial gains for low-achieving students. A whole-school model called BARR focuses on social-emotional learning, building relationships between teachers and students, and carefully monitoring students’ progress in reading and math. Another model called ERWC prepares 12th graders to succeed on the tests used to determine whether students have to take remedial English at California State Universities.

What characterized these successful approaches? None were presented as remedial. All were exciting and personalized, and not at all like traditional instruction. All gave students social supports from peers and teachers, and reasons to hope that this time, they were going to be successful.

There is no magic to these approaches, and not every study of them found positive outcomes. But there was clearly no advantage of remedial approaches providing extra time.

In fact, according to the data, students would have done just as well to stay in art or music. And if you’d asked the kids, they’d probably agree.

Time is important, but motivation, caring, and personalization are what counts most in secondary reading, and surely in other subjects as well.

Time passes. Kids will pass, too, if we make such good use of our time with them that they won’t even notice the minutes going by.

Keep Up the Good Work (To Keep Up the Good Outcomes)

I just read an outstanding study that contains a hard but crucially important lesson. The study, by Woodbridge et al. (2014), evaluated a behavior management program for students with behavior problems. The program, First Step to Success, has been successfully evaluated many times. In the Woodbridge et al. study, 200 children in grades 1 to 3 with serious behavior problems were randomly assigned to experimental or control groups. On behavior and achievement measures, students in the experimental group scored much higher, with effect sizes of +0.44 to +0.87. Very impressive.

The researchers came back a year later to see if the outcomes were still there. Despite the substantial impacts seen at posttest, none of three prosocial/adaptive behavior measures, only one of three problem/maladaptive behavior measures, and none of four academic achievement measures showed positive outcomes.

These findings were distressing to the researchers, but they contain a message. In this study, students passed from teachers who had been trained in the First Step method to teachers who had not. The treatment is well-established and inexpensive. Why should it ever be seen as a one-year intervention with a follow-up? Instead, imagine that all teachers in the school learned the program and all continued to implement it for many years. In this circumstance, it would be highly likely that the first-year positive impacts would be sustained and most likely improved over time.

Follow-up assessments are always interesting, and for interventions that are very expensive it may be crucial to demonstrate lasting impacts. But so often in education effective treatments can be maintained for many years, creating more effective school-wide environments and lasting impacts over time. Much as we might like to have one-shot treatments with long-lasting impacts, this does not correspond to the nature of children. The personal, family, or community problems that led children to have problems at a given point in time are likely to lead to problems in the future, too. But the solution is clear. Keep up the good work to keep up the good outcomes!

How Much Difference Does an Education Program Make?

When you use Consumer Reports car repair ratings to choose a reliable car, you are doing something a lot like what evidence-based reform in education is proposing. You look at the evidence and take it into account, but it does not drive you to a particular choice. There are other factors you’d also consider. For example, Consumer Reports might point you to reliable cars you can’t afford, or ones that are too large or too small or too ugly for your purposes and tastes, or ones with dealerships that are too far away. In the same way, there are many factors that school staffs or educational leaders might consider beyond effect size.

An effect size, or statistical significance, is only a starting point for estimating the impact a program or set of programs might have. I’d propose the term “potential impact” to subsume the following factors that a principal or staff might consider beyond effect size or statistical significance in adopting a program to improve education outcomes:

  • Cost-effectiveness
  • Evidence from similar schools
  • Immediate and long-term payoffs
  • Sustainability
  • Breadth of impact
  • Low-hanging fruit
  • Comprehensiveness

Cost-Effectiveness
Economists' favorite criterion of effectiveness is cost-effectiveness. Cost-effectiveness is simple in concept (how much gain did the program cause at what cost?), but in fact there are two big elements of cost-effectiveness that are very difficult to determine:

1. Cost
2. Effectiveness

Cost should be easy, right? A school buys some service or technology and pays something for it. Well, it's almost never so clear. When a school uses a given innovation, there are usually costs beyond the purchase price. For example, imagine that a school purchases digital devices for all students, loaded with all the software they will need. Easy, right? Wrong. Should you count the cost of the time the teachers spend in professional development? The cost of tech support? Insurance? Security costs? The additional electricity required? Space for storage? Additional loaner units to replace lost or broken units? The opportunity costs for whatever else the school might have chosen to do?

Here is an even more difficult example. Imagine a school starts a tutoring program for struggling readers using paraprofessionals as tutors. Easy, right? Wrong. There is the cost of the paraprofessionals' time, of course, but what if the paraprofessionals were already on the school's staff? If so, a tutoring program may be very inexpensive, but if additional people must be hired as tutors, then tutoring is a far more expensive proposition. Also, if paraprofessionals already in the school are no longer doing what they used to do, might this diminish student outcomes?

Then there is the problem of outcomes. As I explained in a recent blog, the meaning of effect sizes depends on the nature of the studies that produced them, so comparisons across programs may not be apples to apples. A principal might look at effect sizes for two programs and decide they look very similar. Yet one effect size might come from large-scale randomized experiments, which tend to produce smaller (and more meaningful) effect sizes, while the other might come from less rigorous studies.

Nevertheless, issues of cost and effectiveness do need to be considered. Somehow.
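Setting those measurement caveats aside, the basic comparison economists have in mind can be framed as achievement gain per dollar spent per student. A toy sketch, in which both program names and all dollar figures and effect sizes are hypothetical:

```python
def cost_effectiveness(effect_size, cost_per_student):
    """Achievement gain (in standard deviation units) per dollar per student."""
    return effect_size / cost_per_student

# Hypothetical comparison: a cheap whole-class program vs. an expensive
# extra-period remedial program with a similar effect size.
programs = {
    "whole-class program":    cost_effectiveness(0.10, 100),    # $100/student
    "extra-period remedial":  cost_effectiveness(0.08, 1500),   # $1,500/student
}
for name, ratio in sorted(programs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ratio:.5f} SD per dollar")
```

The point of the sketch is only that when two effect sizes are similar, large cost differences can dominate the decision, provided (and this is the hard part) both the costs and the effect sizes were estimated in comparable ways.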

Evidence from Similar Schools
Clearly, a school staff would want to know that a given program has been successful in schools like theirs. For example, schools serving many English learners, or schools in rural areas, or schools in inner-city locations, might be particularly interested in data from similar schools. At a minimum, they should want to know that the developers have worked in schools like theirs, even if the evidence only exists from less similar schools.

Immediate and Long-Term Payoffs
Another factor in program impacts is the likelihood that a program will solve a very serious problem that may ultimately have a big effect on individual students and perhaps save a lot of money over time. For example, it may be that a very expensive parent training program may make a big difference for students with serious behavior problems. If this program produces lasting effects (documented in the research), its high cost might be justified, especially if it might reduce the need for even more expensive interventions, such as special education placement, expulsion, or incarceration.

Sustainability
Programs that either produce lasting impacts or can be readily maintained over time are clearly preferable to those that have short-term impacts only. In education, long-term impacts are not typically measured, but sustainability can be determined by the cost, effort, and other elements required to maintain an intervention. Most programs get a lot cheaper after the first year, so sustainability can usually be assumed. This means that even programs with modest effect sizes could bring about major changes over time.

Breadth of Impact
Some educational interventions with modest effect sizes might be justified because they apply across entire schools and for many years. For example, effective coaching for principals might have a small effect overall, but if that effect is seen across thousands of students over a period of years, it might be more than worthwhile. Similarly, training teachers in methods that become part of their permanent repertoire, such as cooperative learning, teaching metacognitive skills, or classroom management, might affect hundreds of students per teacher over time.

Low-Hanging Fruit
Some interventions may have either modest impacts on students in general, or strong outcomes for only a subset of students, but be so inexpensive or easy to adopt and implement that it would be foolish not to do so. One example might be making sure that disadvantaged students who need eyeglasses are assessed and given glasses. Not everyone needs glasses, but for those who do this makes a big difference at low cost. Another example might be implementing a whole-school behavior management approach like Positive Behavioral Interventions and Supports (PBIS), a low-cost, proven approach any school can implement.

Comprehensiveness
Schools have to solve many quite different problems, and they usually do this by pulling various solutions off various shelves. The problem is that this approach can be uncoordinated and inefficient. The different elements may not link up well with each other, may compete for the time and attention of the staff, and may cost a lot more than a unified, comprehensive solution that addresses many objectives in a planful way. A comprehensive approach is likely to have a coherent plan for professional development, materials, software, and assessment across all program elements. It is likely to have a plan for sustaining its effects over time and extending into additional parts of the school or additional schools.

Potential Impact
Potential impact is the sum of all the factors that make a given program or a coordinated set of programs effective in the short and long term, broad in its impact, focused on preventing serious problems, and cost-effective. There is no numerical standard for potential impact, but the concept is just intended to give educators making important choices for their kids a set of things to consider, beyond effect size and statistical significance alone.

Sorry. I wish this were simple. But kids are complex, organizations are complex, and systems are complex. It’s always a good idea for education leaders to start with the evidence but then think through how programs can be used as tools to transform their particular schools.