Implementing Proven Programs

There is an old joke that goes like this. A door-to-door salesman is showing a housewife the latest, fanciest, most technologically advanced vacuum cleaner. “Ma’am,” says the salesman, “this machine will do half your work!”

“Great!” says the housewife. “I’ll take two!”

All too often, when school leaders decide to adopt proven programs, they act like the foolish housewife. The program is going to take care of everything, they think. Or if it doesn’t, it’s the program’s fault, not theirs.

I wish I could tell you that you could just pick a program from our Evidence for ESSA site (launching on February 28! Next week!), wind it up, and let it teach all your kids, sort of the way a Roomba is supposed to clean your carpets. But I can’t.

Clearly, any program, no matter how good the evidence behind it, has to be implemented with the buy-in and participation of everyone involved, with planning, thoughtfulness, coordination, adequate professional development, interim assessment and data-based adjustments, and a final assessment of program outcomes. In reality, implementing proven programs is difficult, but so is implementing ordinary unproven programs. All teachers and administrators go home every day dead tired, no matter what programs they use. The advantage of proven programs is that they hold out promise that this time, teachers’ and administrators’ efforts will pay off. Also, almost all effective programs provide extensive, high-quality professional development, and most teachers and administrators are energized and enthusiastic about engaging in it. Finally, whole-school innovations, done right, engage the whole staff in common activities, exchanging ideas, strategies, successes, challenges, and insights.

So how can schools implement proven programs with the greatest possible chance of success? Here are a few pointers (from 43 years of experience!).

Get Buy-In. No one likes to be forced to do anything, and no one puts in their best effort or imagination for an activity they did not choose.

When introducing a proven program to a school staff, have someone from the program provider’s staff come to explain it to the staff, and then get staff members to vote by secret ballot. Require an 80% majority.

This does several things. First, it ensures that the school staff is on board, willing to give the program their best shot. Second, it effectively silences the small minority in every school that opposes everything. Schools that did not select the program in the first round should be given another opportunity after the first year, by which time they will have seen how well the program works in neighboring schools.

Plan, Plan, Plan. Did you ever see the Far Side cartoon in which there is a random pile of horses and cowboys, and a sheriff says, “You don’t just throw a posse together, dadgummit!” (or something like that)? School staffs should work with program providers to carefully plan every step of program introduction. The planning should focus on how the program needs to be adapted to the specific requirements of the particular school or district, and on how to make the best use of human, physical, technological, and financial resources.

Professional Development. Perhaps the most common mistake in implementing proven programs is providing too little on-site, up-front training and too little on-site, ongoing coaching. Professional development is expensive, especially if travel is involved, and users of proven programs often try to minimize costs by doing less professional development, doing all or most of it electronically, or using “trainer-of-trainers” models (in which someone from the school or district learns the model and then teaches it to colleagues).

Here’s a dark secret. Developers of proven programs almost never use any of these training models in their own research. Quite the contrary, they are likely to have top-quality coaches swarming all over schools, visiting classes and ensuring high-quality implementation any way they can. Yet when it comes time for dissemination, they keep costs down by providing much, much less support than was actually needed (which is why they provided it in their studies in the first place). This is such a common problem that Evidence for ESSA excludes programs that used a lot of professional development in their research but today provide only an online manual, for example. Evidence for ESSA tries to describe dissemination requirements in terms of what was done in the research, not what is currently offered.

Coaching. Coaching means having experts visit teachers’ classes and give them individual or schoolwide feedback on their quality of implementation.

Coaching is essential because it helps teachers know whether they are on track to full implementation, and enables the project to provide individualized, actionable feedback. If you question the need for feedback, consider how you could learn to play tennis or golf, play the French horn, or act in Shakespearean plays, if no one ever saw you do it and gave you useful and targeted feedback and suggestions for improvement. Yet teaching is much, much more difficult.

Sure, coaching is expensive. But poor implementation squanders not only the cost of the program, but also teachers’ enthusiasm and belief that things can be better.

Feedback. Coaches, building facilitators, or local experts should have opportunities to give regular feedback to schools using proven programs, on implementation as well as outcomes. This feedback should be focused on solving problems together, not on blaming or shaming, but it is essential in keeping schools on track toward goals. At the end of each quarter or at least annually, school staffs need an opportunity to consider how they are doing with a proven program and how they are going to make it better.

Proven programs plus thoughtful, thorough implementation are the most powerful tool we have to make a major difference in student achievement across whole schools and districts. They build on the strengths of schools and teachers, and create a lasting sense of efficacy. A team of teachers and administrators that has organized itself around a proven program, implemented it with pride and creativity, and seen enhanced outcomes is a force to be reckoned with. A force for good.


Time Passes. Will You?

When I was in high school, one of my teachers posted a sign on her classroom wall under the clock:

Time passes. Will you?

Students spend a lot of time watching clocks, yearning for the period to be over. Yet educators and researchers often seem to believe that more time is of course beneficial to kids’ learning. Isn’t that obvious?

In a major review of secondary reading programs I am completing with my colleagues Ariane Baye, Cynthia Lake, and Amanda Inns, it turns out that the kids were right. More time, at least in remedial reading, may not be beneficial at all.

Our review identified 60 studies of extraordinary quality (mostly large-scale randomized experiments) evaluating reading programs for students in grades 6 to 12. In most of the studies, students reading 2 to 5 grade levels below expectations were randomly assigned to receive an extra class period of reading instruction every day, all year, in some cases for two or three years. Students randomly assigned to the control group continued in classes such as art, music, or study hall. The strategies used in the remedial classes varied widely, including technology approaches, teaching focused on metacognitive skills (e.g., summarization, clarification, graphic organizers), teaching focused on phonics skills that should have been learned in elementary school, and other remedial approaches, all of which provided substantial additional time for reading instruction. It is also important to note that the extra-time classes were generally smaller than ordinary classes, in the range of 12 to 20 students.

In contrast, other studies provided whole class or whole school methods, many of which also focused on metacognitive skills, but none of which provided additional time.

Analyzing across all studies, and setting aside five British tutoring studies, we found no effect of additional time in remedial reading. The effect size for the 22 extra-time studies was +0.08, while for the 34 whole-class/whole-school studies it was slightly higher, ES = +0.10. That’s an awful lot of additional teaching time for no additional learning benefit.

So what did work? Not surprisingly, one-to-one and small-group tutoring (in groups of up to four students) were very effective. These approaches are remedial and do usually provide additional teaching time, but in a much more intensive and personalized way.

Other approaches that showed particular promise simply made better use of existing class time. A program called The Reading Edge involves students in small mixed-ability teams where they are responsible for the reading success of all team members. A technology approach called Achieve3000 showed substantial gains for low-achieving students. A whole-school model called BARR focuses on social-emotional learning, building relationships between teachers and students, and carefully monitoring students’ progress in reading and math. Another model called ERWC prepares 12th graders to succeed on the tests used to determine whether students have to take remedial English at California State Universities.

What characterized these successful approaches? None were presented as remedial. All were exciting and personalized, and not at all like traditional instruction. All gave students social supports from peers and teachers, and reasons to hope that this time, they were going to be successful.

There is no magic to these approaches, and not every study of them found positive outcomes. But there was clearly no advantage of remedial approaches providing extra time.

In fact, according to the data, students would have done just as well to stay in art or music. And if you’d asked the kids, they’d probably agree.

Time is important, but motivation, caring, and personalization are what counts most in secondary reading, and surely in other subjects as well.

Time passes. Kids will pass, too, if we make such good use of our time with them that they won’t even notice the minutes going by.

Keep Up the Good Work (To Keep Up the Good Outcomes)

I just read an outstanding study that contains a hard but crucially important lesson. The study, by Woodbridge et al. (2014), evaluated a behavior management program for students with behavior problems. The program, First Step to Success, has been successfully evaluated many times. In the Woodbridge et al. study, 200 children in grades 1 to 3 with serious behavior problems were randomly assigned to experimental or control groups. On behavior and achievement measures, students in the experimental group scored much higher, with effect sizes of +0.44 to +0.87. Very impressive.

The researchers came back a year later to see if the outcomes were still there. Despite the substantial impacts seen at posttest, none of the three prosocial/adaptive behavior measures, only one of the three problem/maladaptive behavior measures, and none of the four academic achievement measures still showed positive outcomes.

These findings were distressing to the researchers, but they contain a message. In this study, students passed from teachers who had been trained in the First Step method to teachers who had not. The treatment is well-established and inexpensive. Why should it ever be seen as a one-year intervention with a follow-up? Instead, imagine that all teachers in the school learned the program and all continued to implement it for many years. In this circumstance, it would be highly likely that the first-year positive impacts would be sustained and most likely improved over time.

Follow-up assessments are always interesting, and for interventions that are very expensive it may be crucial to demonstrate lasting impacts. But so often in education effective treatments can be maintained for many years, creating more effective school-wide environments and lasting impacts over time. Much as we might like to have one-shot treatments with long-lasting impacts, this does not correspond to the nature of children. The personal, family, or community problems that led children to have problems at a given point in time are likely to lead to problems in the future, too. But the solution is clear. Keep up the good work to keep up the good outcomes!

How Much Difference Does an Education Program Make?

When you use Consumer Reports car repair ratings to choose a reliable car, you are doing something a lot like what evidence-based reform in education is proposing. You look at the evidence and take it into account, but it does not drive you to a particular choice. There are other factors you’d also consider. For example, Consumer Reports might point you to reliable cars you can’t afford, or ones that are too large or too small or too ugly for your purposes and tastes, or ones with dealerships that are too far away. In the same way, there are many factors that school staffs or educational leaders might consider beyond effect size.

An effect size, or statistical significance, is only a starting point for estimating the impact a program or set of programs might have. I’d propose the term “potential impact” to subsume the following factors that a principal or staff might consider beyond effect size or statistical significance in adopting a program to improve education outcomes:

  • Cost-effectiveness
  • Evidence from similar schools
  • Immediate and long-term payoffs
  • Sustainability
  • Breadth of impact
  • Low-hanging fruit
  • Comprehensiveness

Cost-Effectiveness
Economists’ favorite criterion of effectiveness is cost-effectiveness. Cost-effectiveness is simple in concept (how much gain did the program cause at what cost?), but in fact there are two big elements of cost-effectiveness that are very difficult to determine:

1. Cost
2. Effectiveness

Cost should be easy, right? A school buys some service or technology and pays something for it. Well, it’s almost never so clear. When a school uses a given innovation, there are usually costs beyond the purchase price. For example, imagine that a school purchases digital devices for all students, loaded with all the software they will need. Easy, right? Wrong. Should you count the cost of the time teachers spend in professional development? The cost of tech support? Insurance? Security costs? The additional electricity required? Space for storage? Additional loaner units to replace lost or broken ones? The opportunity costs for whatever else the school might have chosen to do?

Here is an even more difficult example. Imagine a school starts a tutoring program for struggling readers using paraprofessionals as tutors. Easy, right? Wrong. There is the cost of the paraprofessionals’ time, of course, but what if the paraprofessionals were already on the school’s staff? If so, then a tutoring program may be very inexpensive, but if additional people must be hired as tutors, then tutoring is a far more expensive proposition. Also, if paraprofessionals already in the school are no longer doing what they used to do, might this diminish student outcomes?

Then there is the problem of judging effectiveness. As I explained in a recent blog, the meaning of effect sizes depends on the nature of the studies that produced them, so comparing apples to apples may be difficult. A principal might look at effect sizes for two programs and decide they look very similar. Yet one effect size might be from large-scale randomized experiments, which tend to produce smaller (and more meaningful) effect sizes, while the other might be from less rigorous studies.

Nevertheless, issues of cost and effectiveness do need to be considered. Somehow.
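Even so, rough numbers can help structure the conversation. Below is a minimal, purely illustrative sketch (in Python) of one common way to express cost-effectiveness: cost per increment of effect size. The program names, per-student costs, and effect sizes are hypothetical assumptions, and the sketch deliberately ignores all the comparability caveats just described.

```python
# A purely illustrative cost-effectiveness comparison.
# All figures below are hypothetical assumptions, not data from real programs.

def cost_per_tenth_sd(total_cost_per_student, effect_size):
    """Dollars spent per +0.10 standard deviations of estimated gain."""
    return total_cost_per_student / (effect_size / 0.10)

# Hypothetical Program A: modest effect, modest cost (once training,
# tech support, and other hidden costs are folded in).
program_a = cost_per_tenth_sd(total_cost_per_student=150, effect_size=0.10)

# Hypothetical Program B: larger effect, but much more expensive per student.
program_b = cost_per_tenth_sd(total_cost_per_student=600, effect_size=0.25)

print(f"Program A: ${program_a:,.0f} per +0.10 SD")  # $150
print(f"Program B: ${program_b:,.0f} per +0.10 SD")  # $240
```

On this crude metric, the cheaper program looks more cost-effective even though its effect size is smaller; whether that is the right conclusion depends on everything else discussed in this post, starting with whether the two effect sizes are even comparable.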

Evidence from Similar Schools
Clearly, a school staff would want to know that a given program has been successful in schools like theirs. For example, schools serving many English learners, or schools in rural areas, or schools in inner-city locations, might be particularly interested in data from similar schools. At a minimum, they should want to know that the developers have worked in schools like theirs, even if the evidence only exists from less similar schools.

Immediate and Long-Term Payoffs
Another factor in program impacts is the likelihood that a program will solve a very serious problem that may ultimately have a big effect on individual students and perhaps save a lot of money over time. For example, a very expensive parent training program may make a big difference for students with serious behavior problems. If this program produces lasting effects (documented in the research), its high cost might be justified, especially if it might reduce the need for even more expensive interventions, such as special education placement, expulsion, or incarceration.

Sustainability
Programs that either produce lasting impacts or can be readily maintained over time are clearly preferable to those with short-term impacts only. In education, long-term impacts are not typically measured, but sustainability can be judged from the cost, effort, and other elements required to maintain an intervention. Most programs get a lot cheaper after the first year, so sustainability can usually be assumed. This means that even programs with modest effect sizes could bring about major changes over time.

Breadth of Impact
Some educational interventions with modest effect sizes might be justified because they apply across entire schools and for many years. For example, effective coaching for principals might have a small effect overall, but if that effect is seen across thousands of students over a period of years, it might be more than worthwhile. Similarly, training teachers in methods that become part of their permanent repertoire, such as cooperative learning, teaching metacognitive skills, or classroom management, might affect hundreds of students per teacher over time.

Low-Hanging Fruit
Some interventions may have either modest impacts on students in general, or strong outcomes for only a subset of students, but be so inexpensive or easy to adopt and implement that it would be foolish not to do so. One example might be making sure that disadvantaged students who need eyeglasses are assessed and given glasses. Not everyone needs glasses, but for those who do, this makes a big difference at low cost. Another example might be adopting a whole-school behavior management approach like Positive Behavioral Interventions and Supports (PBIS), a low-cost, proven approach any school can implement.

Comprehensiveness
Schools have to solve many quite different problems, and they usually do this by pulling various solutions off of various shelves. The problem is that this approach can be uncoordinated and inefficient. The different elements may not link up well with each other, may compete for the time and attention of the staff, and may cost a lot more than a unified, comprehensive solution that addresses many objectives in a planful way. A comprehensive approach is likely to have a coherent plan for professional development, materials, software, and assessment across all program elements. It is likely to have a plan for sustaining its effects over time and extending into additional parts of the school or additional schools.

Potential Impact
Potential impact is the sum of all the factors that make a given program or a coordinated set of programs effective in the short and long term, broad in its impact, focused on preventing serious problems, and cost-effective. There is no numerical standard for potential impact, but the concept is just intended to give educators making important choices for their kids a set of things to consider, beyond effect size and statistical significance alone.

Sorry. I wish this were simple. But kids are complex, organizations are complex, and systems are complex. It’s always a good idea for education leaders to start with the evidence but then think through how programs can be used as tools to transform their particular schools.

Seeking Jewels, Not Boulders: Learning to Value Small, Well-Justified Effect Sizes

One of the most popular exhibits in the Smithsonian Museum of Natural History is the Hope Diamond, one of the largest and most valuable in the world. It’s always fun to see all the kids flow past it saying how wonderful it would be to own the Hope Diamond, how beautiful it is, and how such a small thing could make you rich and powerful.

The diamonds are at the end of the Hall of Minerals, which is crammed full of exotic minerals from all over the world. These are beautiful, rare, and amazing in themselves, yet most kids rush past them to get to the diamonds. But no one, ever, evaluates the minerals against one another according to their size. No one ever says, “you can have your Hope Diamond, but I’d rather have this giant malachite or feldspar.” Just getting into the Smithsonian, kids go by boulders on the National Mall far larger than anything in the Hall of Minerals, perhaps climbing on them but otherwise ignoring them completely.

Yet in educational research, we often focus on the size of study effects without considering their value. In a recent blog, I presented data from a paper with my colleague Alan Cheung analyzing effect sizes from 611 studies evaluating reading, math, and science programs, K-12, that met the inclusion standards of our Best Evidence Encyclopedia. One major finding was that in randomized evaluations with sample sizes of 250 students (10 classes) or more, the average effect size across 87 studies was only +0.11. Smaller randomized studies had effect sizes averaging +0.22, large matched quasi-experiments +0.17, and small quasi-experiments, +0.32. In this blog, I want to say more about how these findings should make us think differently about effect sizes as we increasingly apply evidence to policy and practice.

Large randomized experiments (RCTs) with significant positive outcomes are the diamonds of educational research: rare, often flawed, but incredibly valuable. The reason they are so valuable is that such studies are the closest indication of what will happen when a given program goes out into the real world of replication. Randomization removes the possibility that self-selection may account for program effects. The larger the sample size, the less likely it is that the experimenter or developer could closely monitor each class and mentor each teacher beyond what would be feasible in real-life scale up. Most large-scale RCTs use clustering, which usually means that the treatment and randomization take place at the level of the whole school. A cluster randomized experiment at the school level might require recruiting 40 to 50 schools, perhaps serving 20,000 to 25,000 students. Yet such studies might nevertheless be too “small” to detect an effect size of, say, 0.15, because it is the number of clusters, not the number of students, that matters most!
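To make this concrete, here is a rough Python sketch of the minimum detectable effect size (MDES) for a two-level cluster randomized trial, using the standard design-effect approximation. The intraclass correlation, enrollment figures, and power assumptions below are illustrative only, not taken from any particular study.

```python
# Rough minimum detectable effect size (MDES) for a cluster randomized trial:
#   MDES ~= M * sqrt( icc/(P(1-P)*J) + (1-icc)/(P(1-P)*J*n) )
# where J = number of schools, n = students per school, icc = intraclass
# correlation (assumed 0.10 here; pretest covariates usually lower it further),
# P = proportion of schools assigned to treatment, and M ~= 2.8 corresponds to
# alpha = .05 (two-tailed) with 80% power. All values are assumptions.
import math

def mdes(n_schools, students_per_school, icc=0.10, p_treat=0.5, multiplier=2.8):
    J, n, P = n_schools, students_per_school, p_treat
    variance = icc / (P * (1 - P) * J) + (1 - icc) / (P * (1 - P) * J * n)
    return multiplier * math.sqrt(variance)

print(round(mdes(50, 500), 2))    # ~0.25: 50 schools, about 25,000 students
print(round(mdes(50, 1000), 2))   # ~0.25: doubling students per school barely helps
print(round(mdes(120, 500), 2))   # ~0.16: adding schools is what brings the MDES down
```

Under these assumptions, a 50-school study with roughly 25,000 students still cannot reliably detect an effect of +0.15; only adding more schools changes that.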

The problem is that we have been looking for much larger effect sizes, and all too often not finding them. Traditionally, researchers recruit enough schools or classes to reliably detect an effect size as small as +0.20. This means that many studies report effect sizes that turn out to be larger than average for large RCTs, but are not statistically significant (at p < .05) because they are less than +0.20. If researchers did recruit samples of schools large enough to detect an effect size of +0.15, this would greatly increase the costs of such studies. Large RCTs are already very expensive, so substantially increasing sample sizes could end up requiring resources far beyond what educational research is likely to see any time soon, or greatly reducing the number of studies that are funded.

These issues have taken on greater importance recently due to the passage of the Every Student Succeeds Act, or ESSA, which encourages use of programs that meet strong, moderate, or promising levels of evidence. The “strong” category requires that a program have at least one randomized experiment that found a significant positive effect. Such programs are rare.

If educational researchers were mineralogists, we’d be pretty good at finding really big diamonds, but the little, million-dollar diamonds, not so much. This makes no sense in diamonds, and no sense in educational research.

So what do we do? I’m glad you asked. Here are several ways we could proceed to increase the number of programs successfully evaluated in RCTs.

1. For cluster randomized experiments at the school level, something has to give. I’d suggest that for such studies, the p value should be increased to .10 or even .20. A p value of .05 is a long-established convention, indicating that there is only one chance in 20 of seeing a difference this large if the program actually made no difference. Yet one chance in 10 (p = .10) may be sufficient in studies likely to have tens of thousands of students.

2. For studies in the past, as well as in the future, replication should be considered the same as large sample size. For example, imagine that two studies of Program X each have 30 schools. Each gets a respectable effect size of +0.20, which would not be significant in either case. Put the two studies together, however, and voila! The combined study of 60 schools would be highly significant, even at p = .05 (a minimal sketch of this pooling appears after this list).

3. Government or foundation funders might fund evaluations in stages. The first stage might involve a cluster randomized experiment of, say, 20 schools, which is very unlikely to produce a significant difference. But if the effect size were perhaps 0.20 or more, the funders might fund a second stage of 30 schools. The two samples together, 50 schools, would be enough to detect a small but important effect.
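To see how pooling works, here is an illustrative fixed-effect (inverse-variance) meta-analysis of the two hypothetical 30-school studies from point 2. The effect sizes are the +0.20 given above; the standard error of 0.12 per study is an assumption chosen to be roughly plausible for a cluster randomized study of that size, not a figure from real data.

```python
# Illustrative fixed-effect (inverse-variance) pooling of two hypothetical
# 30-school studies of "Program X." The 0.12 standard errors are assumptions.
from math import sqrt
from scipy.stats import norm

studies = [(0.20, 0.12), (0.20, 0.12)]  # (effect size, standard error)

def z_and_p(es, se):
    z = es / se
    return z, 2 * (1 - norm.cdf(abs(z)))  # two-tailed p value

# Each study alone: z is about 1.67, p about .10 (not significant at p < .05).
for es, se in studies:
    print("single study:", z_and_p(es, se))

# Pool with inverse-variance weights: w_i = 1 / se_i^2.
weights = [1 / se**2 for _, se in studies]
pooled_es = sum(w * es for (es, _), w in zip(studies, weights)) / sum(weights)
pooled_se = sqrt(1 / sum(weights))

# Combined: z is about 2.36, p about .02 (significant even at p = .05).
print("pooled:", round(pooled_es, 2), z_and_p(pooled_es, pooled_se))
```

Same program, same individual results; the only thing that changed is that the two samples were treated as one.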

One might well ask why we should be interested in programs that only produce effect sizes of +0.15. Aren’t these effects too small to matter?

The answer is that they are not. Over time, I hope we will learn how to routinely produce better outcomes. Already, we know that much larger impacts are found in studies of certain approaches emphasizing professional development (e.g., cooperative learning, metacognitive skills) and certain forms of technology. I hope and expect that over time, more studies will evaluate programs using methods like those that have been proven to work, and fewer will evaluate those that do not, thereby raising the average effects we find. But even as they are, small but reliable effect sizes are making meaningful differences in the lives of children, and will make much more meaningful differences as we learn from efforts at the Institute of Education Sciences (IES) and the Investing in Innovation (i3)/Education Innovation and Research (EIR) programs.

Small effect sizes from large randomized experiments are the Hope Diamonds of our profession. They also are the best hope for evidence-based improvements for all students.

Teachers as Professionals in Evidence-Based Reform


In a February 2012 op-ed in Education Week, Don Peurach wrote about a 14-year investigation he carried out as part of a large University of Michigan study of comprehensive school reform. In the overall study, our Success for All program and the America’s Choice program did very well in terms of both implementation and outcomes, while an approach in which teachers largely made up their own instructional approaches did not bring about much change in teachers’ behaviors or student learning. Because both Success for All and America’s Choice have well-specified training, teachers’ manuals, and student materials, the findings support the idea that it is important for school-wide reform models to have a well-structured approach.

Peurach’s focus was on Success for All as an organization. He wanted to know how our network of hundreds of schools in 40 states contributes to the development of the approach and to each other’s success. His key finding was that Success for All is not a top-down approach, but is constantly learning from its teachers and principals and then spreading good practices throughout the network.

In our way of thinking, this is the very essence of professionalism. A teacher who does wonderful, innovative things in one class is perhaps benefiting 25 children each year, but one whose ideas scale up to inform the practices of hundreds of thousands of schools is making a real difference. Yet in order for teachers’ ideas to have broad impact, it helps a great deal for teachers to be part of a national or regional network that speaks a common language and has common standards of practice.

Teachers need not be researchers to contribute to their profession. By participating in networks of like-minded educators (implementing, continuously improving, and communicating about practical, proven approaches intended to improve student outcomes), they play an essential role in the improvement of their profession.

Why Control Groups are Ethical and Necessary


A big reason that many educators don’t like to participate in experiments is that they don’t want to take a 50-50 chance of being assigned to a control group. That is, in randomized experiments, schools or teachers or students are often assigned at random to receive the innovative treatment (the experimental group) or to keep doing whatever they were doing (the control group). Educators often object, complaining that they do not think it is fair that they might be assigned to the control group.

This objection always comes up, and I’m always surprised by it. What the educators who are so concerned about being in the control group don’t seem to realize is that they are already, for all intents and purposes, in the control group; the experiment gives them the opportunity to receive the new services and move into the experimental group. If the coin comes up heads, they move to the experimental group; if the coin comes up tails, they simply continue doing what they’ve already been comfortably doing, perhaps for many years. Usually, schools can purchase materials and training to adopt an innovative program outside of the experiment; all the experiment typically offers is a 50-50 chance to get the treatment for free, or at a deep discount. So if schools want the treatment, and are willing to pay for it, they can get it.

Ending up in the control group is not so bad, either. Incentives are usually offered to schools in the control group, so a school might receive several thousand dollars to do anything it wants other than the experimental treatment. Further, many studies use a “delayed treatment” design in which the control group gets training and materials to implement the experimental program after the study is over, a year or two after the schools in the experimental group received the program. In this way, having been in the control group in the short term serves to improve the school in the long run.

But isn’t it unethical to deprive schools or children of effective methods? If the methods were so proven and so widely used that not to use them would truly deprive students, then this would be unethical. But every experiment has to be passed by an Institutional Review Board (IRB), usually located in a university. IRB regulations require that the control group receive a treatment that is at least “state of the art,” so that no one gets less than the current standard of best practice. The experiment is designed to test out innovations whose effectiveness has not yet been established and that are not yet standard best practice.

In fields that respect evidence, yesterday’s experimental treatment becomes the standard of practice, and thereby becomes the minimum acceptable offering for control groups. This cycle continues indefinitely, with experimental treatments being progressively compared to harder-to-beat control treatments as evidence-based practice advances. In other words, doing experiments using control groups to improve education based on evidence would put education into a virtuous cycle of innovation, evaluation, and progressive improvement like that which has transformed fields such as medicine, agriculture, and technology to the benefit of all.

Most educators would prefer not to be in the control group, but they should at least be consoled by the knowledge that control groups play an essential role in evidence-based reform. In fact, the importance of knowing whether or not new methods add to student outcomes is so great that one could argue that it is unethical not to agree to participate in experiments in which one might be assigned to the control group. In an education system offering many more opportunities to participate in research, individual schools or educators may be in control groups in some studies and experimental groups in others.

As teachers, principals, and superintendents get used to participating in experiments, they are losing some of their earlier reluctance. In particular, when educators are asked to play an active role in developing and evaluating new approaches and in choosing which experiments to volunteer for, they become more comfortable with the concept. And this comfort will enable education to join medicine, agriculture, technology, and other fields in which a respect for evidence and innovation drives rapid progress in meeting human needs.

If you like evidence-based reform in education, then be sure to tip your hat to the little-appreciated control group, without which most experiments would not be possible.