Summer 2021 Re-Imagined: A Grand Opening to a Successful Year

If you follow my blogs, you’ll note that I have been writing recently about the ineffectiveness of summer school (here, here, and here). Along with colleagues, I wrote a review of research on summer school, which is summarized here. The reason for the ineffectiveness of summer school, I proposed, is that when summer school resembles regular school, it can be boring. Kids are sitting in school while their friends are playing outside. As a result, attendance in summer school programs intended to help struggling students can be very poor, and the motivation of those who do attend may also be poor.

However, there are two major exceptions to the otherwise dismal outcomes of studies of summer school. One is a Los Angeles study by Schacter & Jo (2005), and the other is a study by Zvoch & Stevens (2013), done in a small city in the Northwest.

Both of these studies focused on disadvantaged students in grades 1 or K-1. Both provided small-group tutoring interventions. Schacter & Jo (2005) gave students phonics instruction in groups of 15, followed by small-group tutoring. The Gates-MacGinitie reading effect size was +1.16. Zvoch & Stevens (2013) also provided group phonics instruction followed by tutoring to groups of 3 to 5. The effect size on DIBELS measures was +0.69.

The large effect sizes seen in these two studies contrast sharply with all the other studies of summer classroom programs, which had a mean effect size near zero. What this suggests is that the best instructional use of summer may be to provide one-to-one or small-group tutoring to struggling students.

In summer 2021, the rationale for summertime tutoring is particularly strong. If current trends continue, most teachers will have received Covid vaccines by summer, and increasing numbers of schools will open by the end of the current semester. To close schools for summer vacation when they could be open seems a waste. Also, assuming the American Rescue Plan is passed (as expected), it will make a great deal of money available to serve students who have lost ground due to Covid school closures, so schools will be able to afford to pay for tutoring during the summer.

The problem with summer school is that it cannot be made mandatory, and many students will not want to attend. However, in summer 2021, providing tutoring during the summer for students who do choose to attend (and keep attending regularly) could be of great value, even if most students who need tutoring do not attend. The reason is that there are so many students who will need tutoring in September, 2021, that not all of them can be tutored right away. Providing tutoring in the summer gives some students a full dose of tutoring before school officially opens, so that schools will not be under pressure to tutor more students than they are able to serve in fall, 2021.

How Can Summer Tutoring Work?

Summertime allows schools to provide more hours of tutoring each day than would be possible during the school year. For example, teaching and tutoring were provided 2 hours a day for 7 weeks in the Schacter & Jo (2005) study, and 3½ hours per day for 5 weeks in the Zvoch & Stevens (2013) study. If tutoring were alternated with sports or music or other fun activities, one might imagine providing two or three tutoring sessions each day, for as many as 8 weeks during the summer.
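As a rough illustration of the dosage involved, the short Python sketch below totals the hours in the two studies. The five-day week is an assumption made for illustration (the weekly schedule is not stated above), and the function name is hypothetical.

```python
def total_hours(hours_per_day, weeks, days_per_week=5):
    """Total program hours; days_per_week=5 is an assumed schedule."""
    return hours_per_day * weeks * days_per_week

# Daily hours and program lengths as described in the text above.
print(total_hours(2, 7))     # Schacter & Jo (2005)
print(total_hours(3.5, 5))   # Zvoch & Stevens (2013)
```

Under that assumption, the two programs would have delivered about 70 and 87.5 hours of teaching and tutoring, respectively.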

These sessions might be offered during a half day, so teachers and teaching assistants might teach one morning and one afternoon session each day. In fact, tutors might provide three two-hour sessions, and reach even more students.

The tutoring methods should be ones proven effective in rigorous experiments. While any whole-class teaching should be done by teachers, teaching assistants can be trained to be excellent tutors. They will need extensive training and in-class coaching, but this is worthwhile, especially because most of these tutors will continue working with additional students during the school day starting in the fall.

Tutoring in summer 2021 will provide a pilot opportunity for schools and districts to hit the ground running in September. It will provide time and resources for providers of tutoring to greatly increase their scale of operations. And it may attract students who have been out of school for many months by offering small group, supportive tutoring with caring tutors, to help ease the transition back into school.

Summertime need not be a time for summertime blues. Instead, it can serve as a “grand opening” for a successful re-entry to school for millions of students.

References

Schacter, J., & Jo, B. (2005). Learning when school is not in session: A reading summer day-camp intervention to improve the achievement of exiting first-grade students who are economically disadvantaged. Journal of Research in Reading, 28, 158-169. Doi:10.1111/j.1467-9817.2005.00260.x

Zvoch, K., & Stevens, J. J. (2013). Summer school effects in a randomized field trial. Early Childhood Research Quarterly, 28(1), 24-32. Doi:10.1016/j.ecresq.2012.05.002

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

Cooperative Learning and Achievement

Once upon a time, two teachers went together to an evening workshop on effective teaching strategies. The speaker was dynamic, her ideas were interesting, and everyone in the large audience enjoyed the speech. Afterwards, the two teachers drove back to the town where they lived. The driver talked excitedly with her friend about all the wonderful ideas they’d heard, raised questions about how to put them into practice, and related them to things she’d read, heard, and experienced before.

After an hour’s drive, however, the driver realized that her friend had been asleep for the whole return trip.

Now here’s my question: who learned the most from the speech? Both the driver and her friend were equally excited by the speech and paid equal attention to it. Yet no one would doubt that the driver learned much more, because after the lecture, she talked all about it, thinking her friend was awake.

Every teacher knows how much they learn about any topic by teaching it, or discussing it with others. Imagine how much more the driver and her friend would have learned from the lecture if they had both been participating fully, sharing ideas, perceptions, agreements, disagreements, and new ideas.

So far, this is all obvious, right? Everyone knows that people learn when they are engaged, when they have opportunities to discuss with others, explain to others, ask questions of others, and receive explanations.

Yet in traditionally organized classes, learning does not often happen like this. Teachers teach, students listen, and if genuine discussion takes place at all, it is between the teacher and a small minority of students who always raise their hands and ask good questions. Even in the most exciting and interactive of classes, many students, often a majority, say little or nothing. They may give an answer if called upon, but “giving an answer” is not at all the same as engagement. Even in classes that are organized in groups and encourage group interaction, some students do most of the participating, while others just watch, at best. Evidence from research, especially studies by Noreen Webb (2008), finds that the students who learn the most in group settings are those who give full explanations to others. These are the drivers, returning to my opening story. Those who receive a lot of explanations also learn. Who learns least? Those who neither explain nor receive explanations.

For achievement outcomes, it is not enough to put students into groups and let them talk. Research finds that cooperative learning works best when there are group goals and individual accountability. That is, groups can earn recognition or small privileges (e.g., lining up first for recess) if the average of the team members’ individual scores meets a high standard. The purpose of group goals and individual accountability is to incentivize team members to help and encourage each other to excel, and to avoid having, for example, one student do all the work while the others watch (Chapman, 2001). Students can be silent in groups, as they can be in class, but this is less likely if they are working with others toward a common goal that they can achieve only if all team members succeed.
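To make the idea of a group goal with individual accountability concrete, here is a minimal Python sketch. The scores, the standard of 80, and the function name are all hypothetical, invented for illustration; the only idea taken from the text is that the team is judged on the average of its members' individual scores.

```python
from statistics import mean

def team_earns_recognition(individual_scores, standard):
    """Group goal: the average of individual scores must meet the standard."""
    return mean(individual_scores) >= standard

# Hypothetical quiz scores for two four-member teams.
team_all_contribute = [85, 90, 80, 88]   # everyone helped everyone learn
team_one_does_work = [100, 60, 55, 62]   # one strong student, three bystanders

print(team_earns_recognition(team_all_contribute, standard=80))
print(team_earns_recognition(team_one_does_work, standard=80))
```

Because every member's own score counts toward the average, one high scorer cannot carry the team; the second team falls short even though it contains the highest individual score.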


The effectiveness of cooperative learning for enhancing achievement has been known for a long time (see Rohrbeck et al., 2003; Roseth et al., 2008; Slavin, 1995, 2014). Forms of cooperative learning are frequently seen in elementary and secondary schools, but they are far from standard practice. Forms of cooperative learning that use group goals and individual accountability are even more rare.

There are many examples of programs that incorporate cooperative learning and meet the ESSA Strong or Moderate standards in reading, math, SEL, and attendance. You can see descriptions of the programs by visiting www.evidenceforessa.org and clicking on the cooperative learning filter. As you can see, it is remarkable how many of the programs identified as effective for improving student achievement by the What Works Clearinghouse or Evidence for ESSA make use of well-structured cooperative learning, usually with students working in teams or groups of 4-5 students, mixed in past performance. In fact, in reading and mathematics, only one-to-one or small-group tutoring is more effective than approaches that make extensive use of cooperative learning.

There are many successful approaches to cooperative learning adapted for different subjects, specific objectives, and age levels (see Slavin, 1995). There is no magic to cooperative learning; outcomes depend on use of proven strategies and high-quality implementation. The successful forms of cooperative learning provide at least a good start for educators seeking ways to make school engaging, exciting, social, and effective for learning. Students not only learn from cooperation in small groups, but they love to do so. They are typically eager to work with their classmates. Why shouldn’t we routinely give them this opportunity?

References

Chapman, E. (2001, April). More on moderations in cooperative learning outcomes. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.

Rohrbeck, C. A., Ginsburg-Block, M. D., Fantuzzo, J. W., & Miller, T. R. (2003). Peer-assisted learning interventions with elementary school students: A meta-analytic review. Journal of Educational Psychology, 95(2), 240–257.

Roseth, C., Johnson, D., & Johnson, R. (2008). Promoting early adolescents’ achievement and peer relationships: The effects of cooperative, competitive, and individualistic goal structures. Psychological Bulletin, 134(2), 223–246.

Slavin, R. E. (1995). Cooperative learning: Theory, research, and practice (2nd ed.). Boston, MA: Allyn & Bacon.

Slavin, R. E. (2014). Make cooperative learning powerful: Five essential strategies to make cooperative learning effective. Educational Leadership, 72(2), 22–26.

Webb, N. M. (2008). Learning in small groups. In T. L. Good (Ed.), 21st century learning (Vol. 1, pp. 203–211). Thousand Oaks, CA: Sage.

Photo courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Hummingbirds and Horses: On Research Reviews

Once upon a time, there was a very famous restaurant, called The Hummingbird. It was known the world over for its unique specialty: Hummingbird Stew. It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird? You’d have to catch dozens of them just for one bowl of stew.

One day, an experienced restaurateur came to The Hummingbird, and asked to speak to the owner. When they were alone, the visitor said, “You have quite an operation here! But I have been in the restaurant business for many years, and I have always wondered how you do it. No one can make money selling Hummingbird Stew! Tell me how you make it work, and I promise on my honor to keep your secret to my grave. Do you…mix just a little bit?”


The Hummingbird’s owner looked around to be sure no one was listening. “You look honest,” he said. “I will trust you with my secret. We do mix in a bit of horsemeat.”

“I knew it!” said the visitor. “So tell me, what is the ratio?”

“One to one.”

“Really!” said the visitor. “Even that seems amazingly generous!”

“I think you misunderstand,” said the owner. “I meant one hummingbird to one horse!”

In education, we write a lot of reviews of research. These are often very widely cited, and can be very influential. Because of the work my colleagues and I do, we have occasion to read a lot of reviews. Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically. Well-done reviews can reveal patterns of findings that can be of great value to both researchers and educators. They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.

However, all too many reviews are deeply flawed. Frequently, reviews of research make it impossible to check the validity of the findings of the original studies. As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew. Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies. If the individual studies are of good quality, this is wonderfully useful. But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles. Few readers are willing to do this.

I have been looking at a lot of recent reviews, all of them published, often in top journals. One published review used only pre-post gains. Presumably, if the reviewers found a study with a control group, they would have ignored the control group data! Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control comparisons, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.

I have also recently seen reviews that include studies with and without control groups (i.e., pre-post gains), and those with and without pretests. Without pretests, experimental and control groups may have started at very different points, and these differences just carry over to the posttests. A review that accepts this jumble of experimental designs makes no sense. Treatments evaluated using pre-post designs will almost always look far more effective than those that use experimental-control comparisons.
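A small worked example may help show why pre-post gains look so much bigger. The Python sketch below uses made-up numbers (a test with a standard deviation of 10, and two groups that both start at 50) and a simple standardized mean difference. It is an illustration under those assumptions, not the procedure of any particular review.

```python
def effect_size(mean_a, mean_b, sd):
    """Standardized mean difference: (mean_a - mean_b) / sd."""
    return (mean_a - mean_b) / sd

# Hypothetical scores, invented for illustration only.
SD = 10.0
treat_pre, treat_post = 50.0, 60.0
control_post = 58.0  # controls also started at 50 and gained 8 points anyway

# Pre-post design: credits the program with the entire 10-point gain.
pre_post_es = effect_size(treat_post, treat_pre, SD)

# Experimental-control design: credits only the gain beyond the controls.
exp_control_es = effect_size(treat_post, control_post, SD)

print(f"pre-post ES: {pre_post_es:+.2f}")
print(f"experimental-control ES: {exp_control_es:+.2f}")
```

With these invented numbers the pre-post effect size is +1.00, while the experimental-control effect size is only +0.20, because 8 of the 10 points would have been gained without the program.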

Many published reviews include results from measures that were made up by program developers.  We have documented that analyses using such measures produce outcomes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies rather than longer ones, and from published studies rather than, for example, technical reports.

The biggest problem is that in many reviews, the designs of the individual studies are never described sufficiently to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out how many of the review’s conclusions are based on rigorous research and how many are not. Many years ago, I looked into a widely cited review of research on achievement effects of class size. Study details were lacking, so I had to find and read the original studies. It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis! The studies that reduced class size within the usual range (e.g., comparing reductions from 24 to 12) had very small achievement impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.
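The arithmetic of the horse and the hummingbirds is easy to sketch. The effect sizes below are hypothetical, chosen only to mimic the pattern described (several small effects plus a couple of very large ones), and the average is a simple unweighted mean, whereas real meta-analyses typically weight studies.

```python
from statistics import mean

# Hypothetical effect sizes, invented for illustration only.
class_size_studies = [0.05, 0.08, 0.04, 0.07]  # reductions within the usual range
outlier_studies = [0.90, 1.50]                 # e.g., tutoring, tennis

hummingbirds_only = mean(class_size_studies)
whole_stew = mean(class_size_studies + outlier_studies)

print(f"usual-range studies only: {hummingbirds_only:+.2f}")
print(f"with outliers averaged in: {whole_stew:+.2f}")
```

In this made-up case, two outliers lift the average from +0.06 to +0.44, which is why a table of study-by-study details matters so much to a reader.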

It would be great if all reviews excluded studies that used procedures known to inflate effect sizes, but at bare minimum, reviewers should be routinely required to include tables showing critical details of each study, and then to analyze whether the reported outcomes might be due to studies that used procedures suspected to inflate effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Programs and Practices

One issue I hear about all the time when I speak about evidence-based reform in education relates to the question of programs vs. practices. A program is a specific set of procedures, usually with materials, software, professional development, and other elements, designed to achieve one or more important outcomes, such as improving reading, math, or science achievement. Programs are typically created by non-profit organizations, though they may be disseminated by for-profits. Almost everything in the What Works Clearinghouse (WWC) and Evidence for ESSA is a program.

A practice, on the other hand, is a general principle that a teacher can use. It may not require any particular professional development or materials.  Examples of practices include suggestions to use more feedback, more praise, a faster pace of instruction, more higher-order questions, or more technology.

In general, educators, and especially teachers, love practices, but are not so crazy about programs. Programs have structure, requiring adherence to particular activities and use of particular materials. In contrast, every teacher can use practices as they wish. Educational leaders often say, “We don’t do programs.” What they mean is, “We give our teachers generic professional development and then turn them loose to interpret it.”

One problem with practices is that because they leave the details up to each teacher, teachers are likely to interpret them in a way that conforms to what they are already doing, and then no change happens. As an example of this, I once attended a speech by the late, great Madeline Hunter, who was extremely popular in the 1970s and ’80s. She spoke and wrote clearly and excitingly in a very down-to-earth way. The auditorium she spoke in was stuffed to the rafters with teachers, who hung on her every word.

When her speech was over, I was swept out in a throng of happy teachers. They were all saying to each other, “Madeline Hunter supports absolutely everything I’ve ever believed about teaching!”

I love happy teachers, but I was puzzled by their reaction. If all the teachers were already doing the things Madeline Hunter recommended to the best of their ability, then how did her ideas improve their teaching? In actuality, a few studies of Hunter’s principles found no significant effects on student learning, and even more surprising, they found few differences between the teaching behaviors of teachers trained in Hunter’s methods and those who had not been. Essentially, one might argue, Madeline Hunter’s principles were popular precisely because they did not require teachers to change very much, and if teachers do not change their teaching, why would we expect their students’ learning to change?

[Image: lawnmower parts]

Another reason that practices rarely change learning is that they are usually small improvements that teachers are expected to assemble to improve their teaching. However, asking teachers to put together many pieces into major improvements is a bit like giving someone the pieces and parts of a lawnmower and asking them to put them together (see picture above). Some mechanically-minded people could do it, but why bother? Why not start with a whole lawnmower?

In the same way, there are gifted teachers who can assemble principles of effective practice into great instruction, but why make it so difficult? Great teachers who could assemble isolated principles into effective teaching strategies are also sure to be able to take a proven program and implement it very well. Why not start with something known to work and then improve it with effective implementation, rather than starting from scratch?

Another problem with practices is that most are impossible to evaluate. By definition, everyone has their own interpretation of every practice. If practices become specific, with specific guides, supports, and materials, they become programs. So a practice is a practice exactly because it is too poorly specified to be a program. And practices that are difficult to clearly specify are also unlikely to improve student outcomes.

There are exceptions, where practices can be evaluated. For example, eliminating ability grouping or reducing class size or assigning (or not assigning) homework are practices that can be evaluated, and can be specified. But these are exceptions.

The squishiness of most practices is the reason that they rarely appear in the WWC or Evidence for ESSA. A proper evaluation contrasts one treatment (an experimental group) to a control group continuing current practices. The treatment group almost has to be a program, because otherwise it is impossible to tell what is being evaluated. For example, how can an experiment evaluate “feedback” if teachers make up their own definitions of “feedback”? How about higher-order questions? How about praise? Rapid pace? Use of these practices can be measured using observation, but differences between the treatment and control groups may be hard to detect because in each case teachers in the control group may also be using the same practices. What teacher does not provide feedback? What teacher does not praise children? What teacher does not use higher-order questions? Some may use these practices more than others, but the differences are likely to be subtle. And subtle differences rarely produce important outcomes.

The distinction between programs and practices has a lot to do with the practices (not programs) promoted by John Hattie. He wants to identify practices that can help teachers know about what works in instruction. That’s a noble goal, but it can rarely be done using real classroom research done over real periods of time. In order to isolate particular practices for study, researchers often do very brief, artificial lab studies that have nothing to do with classroom practices. For example, some lab studies in Hattie’s own review of feedback contrast teachers giving feedback with teachers giving no feedback. What teacher would do that?

It is worthwhile to use what we know from research, experience, program evaluations, and theory to discuss what practices may be most useful for teachers. But claiming particular effect sizes for such studies is rarely justified. The strongest evidence for practical use in schools will almost always come from experiments evaluating programs. Practices have their place, but focusing on exposing teachers to a lot of practices and expecting them to put them together to improve student outcomes is not likely to work.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

New Findings on Tutoring: Four Shockers

One-to-one and one-to-small group tutoring have long existed as remedial approaches for students who are performing far below expectations. Everyone knows that tutoring works, and nothing in this blog contradicts this. Although different approaches have their champions, the general consensus is that tutoring is very effective, and the problem with widespread use is primarily cost (and for tutoring by teachers, availability of sufficient teachers). If resources were unlimited, one-to-one tutoring would be the first thing most educators would recommend, and they would not be wrong. But resources are never unlimited, and the numbers of students performing far below grade level are overwhelming, so cost-effectiveness is a serious concern. Further, tutoring seems so obviously effective that we may not really understand what makes it work.

In recent reviews, my colleagues and I examined what is known about tutoring. Beyond the simple conclusion that “tutoring works,” we found some big surprises, four “shockers.” Prepare to be amazed! Further, I propose an explanation to account for these unexpected findings.

We have recently released three reviews that include thorough, up-to-date reviews of research on tutoring. One is a review of research on programs for struggling readers in elementary schools by Amanda Inns and colleagues (2018). Another is a review on programs for secondary readers by Ariane Baye and her colleagues (2017). Finally, there is a review on elementary math programs by Marta Pellegrini et al. (2018). All three use essentially identical methods, from the Best Evidence Encyclopedia (www.bestevidence.org). In addition to sections on tutoring strategies, all three also include other, non-tutoring methods directed at the same populations and outcomes.

What we found challenges much of what everyone thought they knew about tutoring.

Shocker #1: In all three reviews, tutoring by paraprofessionals (teaching assistants) was at least as effective as tutoring by teachers. This was found for reading and math, and for one-to-one and one-to-small group tutoring.  For struggling elementary readers, para tutors actually had higher effect sizes than teacher tutors. Effect sizes were +0.53 for paras and +0.36 for teachers in one-to-one tutoring. For one-to-small group, effect sizes were +0.27 for paras, +0.09 for teachers.

Shocker #2: Volunteer tutoring was far less effective than tutoring by either paras or teachers. Some programs using volunteer tutors provided them with structured materials and extensive training and supervision. These found positive impacts, but far less than those for paraprofessional tutors. Volunteers tutoring one-to-one had an effect size of +0.18; paras had an effect size of +0.53. Because of the need for recruiting, training, supervision, and management, and also because the more effective tutoring models provide stipends or other pay, volunteers were not much less expensive than paraprofessionals as tutors.

Shocker #3:  Inexpensive substitutes for tutoring have not worked. Everyone knows that one-to-one tutoring works, so there has long been a quest for approaches that simulate what makes tutoring work. Yet so far, no one, as far as I know, has found a way to turn lead into tutoring gold. Although tutoring in math was about as effective as tutoring in reading, a program that used online math tutors communicating over the Internet from India and Sri Lanka to tutor students in England, for example, had no effect. Technology has long been touted as a means of simulating tutoring, yet even when computer-assisted instruction programs have been effective, their effect sizes have been far below those of the least expensive tutoring models, one-to-small group tutoring by paraprofessionals. In fact, in the Inns et al. (2018) review, no digital reading program was found to be effective with struggling readers in elementary schools.

Shocker #4: Certain whole-class and whole-school approaches work as well as or better than tutoring for struggling readers, on average. In the Inns et al. (2018) review, the average effect size for one-to-one tutoring approaches was +0.31, and for one-to-small group approaches it was +0.14. Yet whole-class approaches, such as Ladders to Literacy (ES = +0.48), PALS (ES = +0.65), and Cooperative Integrated Reading and Composition (ES = +0.19), averaged +0.33, similar to one-to-one tutoring by teachers (ES = +0.36). The mean effect size for comprehensive tiered school approaches, such as Success for All (ES = +0.41) and Enhanced Core Reading Instruction (ES = +0.22), was +0.43, higher than any category of tutoring (note that these models include tutoring as part of an integrated response to intervention approach). Whole-class and whole-school approaches work with many more students than do tutoring models, so these impacts are obtained at a much lower cost per pupil.

Why does tutoring work?

Most researchers and others would say that well-structured tutoring models work primarily because they allow tutors to fully individualize instruction to the needs of students. Yet if this were the only explanation, then other individualized approaches, such as computer-assisted instruction, would have outcomes similar to those of tutoring. Why is this not the case? And why do paraprofessionals produce at least equal outcomes to those produced by teachers as tutors? None of this squares with the idea that the impact of tutoring is entirely due to the tutor’s ability to recognize and respond to students’ unique needs. If that were so, other forms of individualization would be a lot more effective, and teachers would presumably be a lot more effective at diagnosing and responding to students’ problems than would less highly trained paraprofessionals. Further, whole-class and whole-school reading approaches, which are not completely individualized, would have much lower effect sizes than tutoring.

My theory to account for the positive effects of tutoring in light of the four “shockers” is this:

  • Tutoring does not work due to individualization alone. It works due to individualization plus nurturing and attention.

This theory begins with the fundamental and obvious assumption that children, perhaps especially low achievers, are highly motivated by nurturing and attention, perhaps far more than by academic success. They are eager to please adults who relate to them personally.  The tutoring setting, whether one-to-one or one-to-very small group, gives students the undivided attention of a valued adult who can give them personal nurturing and attention to a degree that a teacher with 20-30 students cannot. Struggling readers may be particularly eager to please a valued adult, because they crave recognition for success in a skill that has previously eluded them.

Nurturing and attention may explain the otherwise puzzling equality of outcomes obtained by teachers and paraprofessionals as tutors. Both types of tutors, using structured materials, may be equally able to individualize instruction, and there is no reason to believe that paras will be any less nurturing or attentive. The assumption that teachers would be more effective as tutors depends on the belief that tutoring is complicated and requires the extensive education a teacher receives. This may be true for very unusual learners, but for most struggling students, a paraprofessional may be as capable as a teacher in providing individualization, nurturing, and attention. This is not to suggest that paraprofessionals are as capable as teachers in every way. Teachers have to be good at many things: preparing and delivering lessons, managing and motivating classes, and much more. However, in their roles as tutors, teachers and paraprofessionals may be more similar.

Volunteers certainly can be nurturing and attentive, and can be readily trained in structured programs to individualize instruction. The problem, however, is that studies of volunteer programs report difficulties in getting volunteers to attend every day and to avoid dropping out when they get a paying job. This may be less of a problem when volunteers receive a stipend; paid volunteers are much more effective than unpaid ones.

The failure of tutoring substitutes, such as individualized technology, is easy to predict if the importance of nurturing and attention is taken into account. Technology may be fun, and may be individualized, but it usually separates students from the personal attention of caring adults.

Whole-Class and Whole-School Approaches.

Perhaps the biggest shocker of all is the finding that for struggling readers, certain non-technology approaches to instruction for whole classes and schools can be as effective as tutoring. Whole-class and whole-school approaches can serve many more students at much lower cost, of course. These classroom approaches mostly use cooperative learning, phonics-focused teaching, or both, and the whole-school models, especially Success for All, combine these approaches with tutoring for students who need it.

The success of certain whole-class programs, of certain tutoring approaches, and of whole-school approaches that combine proven teaching strategies with tutoring for students who need more, argues for response to intervention (RTI), the policy that has been promoted by the federal government since the 1990s. So what’s new? What’s new is that the approach I’m advocating is not just RTI. It’s RTI done right, where each component of the strategy has strong evidence of effectiveness.

The good news is that we have powerful and cost-effective tools at our disposal that we could be putting to use on a much more systematic scale. Yet we rarely do this, and as a result far too many students continue to struggle with reading, even ending up in special education due to problems schools could have prevented. That is the real shocker. It’s up to our whole profession to use what works, until reading failure becomes a distant memory. There are many problems in education that we don’t know how to solve, but reading failure in elementary school isn’t one of them.

Practical Implications.

Perhaps the most important practical implication of this discussion is a realization that benefits similar to or greater than those of one-to-one tutoring by teachers can be obtained in other ways that can be cost-effectively extended to many more students: using paraprofessional tutors, using one-to-small group tutoring, or using whole-class and whole-school tiered strategies. It is no longer possible to say with a shrug, “of course tutoring works, but we can’t afford it.” The “four shockers” tell us we can do better, without breaking the bank.

 

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (2017). Effective reading programs for secondary students. Manuscript submitted for publication. Also see Baye, A., Lake, C., Inns, A. & Slavin, R. E. (2017, August). Effective Reading Programs for Secondary Students. Baltimore, MD: Johns Hopkins University, Center for Research and Reform in Education.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Photo by Westsara (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons

 

Implementing Proven Programs

There is an old joke that goes like this. A door-to-door salesman is showing a housewife the latest, fanciest, most technologically advanced vacuum cleaner. “Ma’am,” says the salesman, “this machine will do half your work!”

“Great!” says the housewife. “I’ll take two!”

All too often, when school leaders decide to adopt proven programs, they act like the foolish housewife. The program is going to take care of everything, they think. Or if it doesn’t, it’s the program’s fault, not theirs.

I wish I could tell you that you could just pick a program from our Evidence for ESSA site (launching on February 28! Next week!), wind it up, and let it teach all your kids, sort of the way a Roomba is supposed to clean your carpets. But I can’t.

Clearly, any program, no matter how good the evidence behind it is, has to be implemented with the buy-in and participation of all involved, planning, thoughtfulness, coordination, adequate professional development, interim assessment and data-based adjustments, and final assessment of program outcomes. In reality, implementing proven programs is difficult, but so is implementing ordinary unproven programs. All teachers and administrators go home every day dead tired, no matter what programs they use. The advantage of proven programs is that they hold out promise that this time, teachers’ and administrators’ efforts will pay off. Also, almost all effective programs provide extensive, high-quality professional development, and most teachers and administrators are energized and enthusiastic about engaging professional development. Finally, whole-school innovations, done right, engage the whole staff in common activities, exchanging ideas, strategies, successes, challenges, and insights.

So how can schools implement proven programs with the greatest possible chance of success? Here are a few pointers (from 43 years of experience!).

Get Buy-In. No one likes to be forced to do anything and no one puts in their best effort or imagination for an activity they did not choose.

When introducing a proven program to a school staff, have someone from the program provider’s staff come to explain it to the staff, and then get staff members to vote by secret ballot. Require an 80% majority.

This does several things. First, it ensures that the school staff is on board, willing to give the program their best shot. Second, it effectively silences the small minority in every school that opposes everything. After the first year, additional schools that did not select the program in the first round should be given another opportunity, but by then they will have seen how well the program works in neighboring schools.

Plan, Plan, Plan. Did you ever see the Far Side cartoon in which there is a random pile of horses and cowboys and a sheriff says, “You don’t just throw a posse together, dadgummit!” (or something like that). School staffs should work with program providers to carefully plan every step of program introduction. The planning should focus on how the program needs to be adapted to the specific requirements of this particular school or district, and make best use of human, physical, technological, and financial resources.

Professional Development. Perhaps the most common mistake in implementing proven programs is providing too little on-site, up-front training, and too little on-site, ongoing coaching. Professional development is expensive, especially if travel is involved, and users of proven programs often try to minimize costs by doing less professional development, or doing all or most of it electronically, or using “trainer-of-trainer” models (in which someone from the school or district learns the model and then teaches it to colleagues).

Here’s a dark secret. Developers of proven programs almost never use any of these training models in their own research. Quite the contrary, they are likely to have top-quality coaches swarming all over schools, visiting classes and ensuring high-quality implementation any way they can. Yet when it comes time for dissemination, they keep costs down by providing much, much less than what was needed (which is why they provided it in their studies). This is such a common problem that Evidence for ESSA excludes programs that used a lot of professional development in their research, but today just send an online manual, for example. Evidence for ESSA tries to describe dissemination requirements in terms of what was done in the research, not what is currently offered.

Coaching. Coaching means having experts visit teachers’ classes and give them individual or schoolwide feedback on their quality of implementation.

Coaching is essential because it helps teachers know whether they are on track to full implementation, and enables the project to provide individualized, actionable feedback. If you question the need for feedback, consider how you could learn to play tennis or golf, play the French horn, or act in Shakespearean plays, if no one ever saw you do it and gave you useful and targeted feedback and suggestions for improvement. Yet teaching is much, much more difficult.

Sure, coaching is expensive. But poor implementation squanders not only the cost of the program, but also teachers’ enthusiasm and belief that things can be better.

Feedback. Coaches, building facilitators, or local experts should have opportunities to give regular feedback to schools using proven programs, on implementation as well as outcomes. This feedback should be focused on solving problems together, not on blaming or shaming, but it is essential in keeping schools on track toward goals. At the end of each quarter or at least annually, school staffs need an opportunity to consider how they are doing with a proven program and how they are going to make it better.

Proven programs plus thoughtful, thorough implementation are the most powerful tool we have to make a major difference in student achievement across whole schools and districts. They build on the strengths of schools and teachers, and create a lasting sense of efficacy. A team of teachers and administrators that has organized itself around a proven program, implemented it with pride and creativity, and seen enhanced outcomes, is a force to be reckoned with. A force for good.

Time Passes. Will You?

When I was in high school, one of my teachers posted a sign on her classroom wall under the clock:

Time passes. Will you?

Students spend a lot of time watching clocks, yearning for the period to be over. Yet educators and researchers often seem to believe that more time is of course beneficial to kids’ learning. Isn’t that obvious?

In a major review of secondary reading programs I am completing with my colleagues Ariane Baye, Cynthia Lake, and Amanda Inns, it turns out that the kids were right. More time, at least in remedial reading, may not be beneficial at all.

Our review identified 60 studies of extraordinary quality, mostly large-scale randomized experiments, evaluating reading programs for students in grades 6 to 12. In most of the studies, students reading 2 to 5 grade levels below expectations were randomly assigned to receive an extra class period of reading instruction every day all year, in some cases for two or three years. Students randomly assigned to the control group continued in classes such as art, music, or study hall. The strategies used in the remedial classes varied widely, including technology approaches, teaching focused on metacognitive skills (e.g., summarization, clarification, graphic organizers), teaching focused on phonics skills that should have been learned in elementary school, and other remedial approaches, all of which provided substantial additional time for reading instruction. It is also important to note that the extra-time classes were generally smaller than ordinary classes, in the range of 12 to 20 students.

In contrast, other studies provided whole class or whole school methods, many of which also focused on metacognitive skills, but none of which provided additional time.

Analyzing across all studies, setting aside five British tutoring studies, there was no effect of additional time in remedial reading. The effect size for the 22 extra-time studies was +0.08, while for the 34 whole class/whole school studies, it was slightly higher, at +0.10. That’s an awful lot of additional teaching time for no additional learning benefit.

So what did work? Not surprisingly, one-to-one and small-group tutoring (up to one to four) were very effective. These are remedial and do usually provide additional teaching time, but in a much more intensive and personalized way.

Other approaches that showed particular promise simply made better use of existing class time. A program called The Reading Edge involves students in small mixed-ability teams where they are responsible for the reading success of all team members. A technology approach called Achieve3000 showed substantial gains for low-achieving students. A whole-school model called BARR focuses on social-emotional learning, building relationships between teachers and students, and carefully monitoring students’ progress in reading and math. Another model called ERWC prepares 12th graders to succeed on the tests used to determine whether students have to take remedial English at California State Universities.

What characterized these successful approaches? None were presented as remedial. All were exciting and personalized, and not at all like traditional instruction. All gave students social supports from peers and teachers, and reasons to hope that this time, they were going to be successful.

There is no magic to these approaches, and not every study of them found positive outcomes. But there was clearly no advantage of remedial approaches providing extra time.

In fact, according to the data, students would have done just as well to stay in art or music. And if you’d asked the kids, they’d probably agree.

Time is important, but motivation, caring, and personalization are what count most in secondary reading, and surely in other subjects as well.

Time passes. Kids will pass, too, if we make such good use of our time with them that they won’t even notice the minutes going by.

Keep Up the Good Work (To Keep Up the Good Outcomes)

I just read an outstanding study that contains a hard but crucially important lesson. The study, by Woodbridge et al. (2014), evaluated a behavior management program for students with behavior problems. The program, First Step to Success, has been successfully evaluated many times. In the Woodbridge et al. study, 200 children in grades 1 to 3 with serious behavior problems were randomly assigned to experimental or control groups. On behavior and achievement measures, students in the experimental group scored much higher, with effect sizes of +0.44 to +0.87. Very impressive.

The researchers came back a year later to see if the outcomes were still there. Despite the substantial impacts seen at posttest, none of three prosocial/adaptive behavior measures, only one of three problem/maladaptive behavior measures, and none of four academic achievement measures showed positive outcomes.

These findings were distressing to the researchers, but they contain a message. In this study, students passed from teachers who had been trained in the First Step method to teachers who had not. The treatment is well-established and inexpensive. Why should it ever be seen as a one-year intervention with a follow-up? Instead, imagine that all teachers in the school learned the program and all continued to implement it for many years. In this circumstance, it would be highly likely that the first-year positive impacts would be sustained and most likely improved over time.

Follow-up assessments are always interesting, and for interventions that are very expensive it may be crucial to demonstrate lasting impacts. But so often in education effective treatments can be maintained for many years, creating more effective school-wide environments and lasting impacts over time. Much as we might like to have one-shot treatments with long-lasting impacts, this does not correspond to the nature of children. The personal, family, or community problems that led children to have problems at a given point in time are likely to lead to problems in the future, too. But the solution is clear. Keep up the good work to keep up the good outcomes!

How Much Difference Does an Education Program Make?

When you use Consumer Reports car repair ratings to choose a reliable car, you are doing something a lot like what evidence-based reform in education is proposing. You look at the evidence and take it into account, but it does not drive you to a particular choice. There are other factors you’d also consider. For example, Consumer Reports might point you to reliable cars you can’t afford, or ones that are too large or too small or too ugly for your purposes and tastes, or ones with dealerships that are too far away. In the same way, there are many factors that school staffs or educational leaders might consider beyond effect size.

An effect size, or statistical significance, is only a starting point for estimating the impact a program or set of programs might have. I’d propose the term “potential impact” to subsume the following factors that a principal or staff might consider beyond effect size or statistical significance in adopting a program to improve education outcomes:

  • Cost-effectiveness
  • Evidence from similar schools
  • Immediate and long-term payoffs
  • Sustainability
  • Breadth of impact
  • Low-hanging fruit
  • Comprehensiveness

Cost-Effectiveness
Economists’ favorite criterion of effectiveness is cost-effectiveness. Cost-effectiveness is simple in concept (how much gain did the program cause at what cost?), but in fact there are two big elements of cost-effectiveness that are very difficult to determine:

1. Cost
2. Effectiveness

Cost should be easy, right? A school buys some service or technology and pays something for it. Well, it’s almost never so clear. When a school uses a given innovation, there are usually costs beyond the purchase price. For example, imagine that a school purchases digital devices for all students, loaded with all the software they will need. Easy, right? Wrong. Should you count in the cost of the time the teachers spend in professional development? The cost of tech support? Insurance? Security costs? The additional electricity required? Space for storage? Additional loaner units to replace lost or broken units? The opportunity costs for whatever else the school might have chosen to do?

Here is an even more difficult example. Imagine a school starts a tutoring program for struggling readers using paraprofessionals as tutors. Easy, right? Wrong. There is the cost for the paraprofessionals’ time, of course, but what if the paraprofessionals were already on the school’s staff? If so, then a tutoring program may be very inexpensive, but if additional people must be hired as tutors, then tutoring is a far more expensive proposition. Also, if paraprofessionals already in the school are no longer doing what they used to do, might this diminish student outcomes?

Then there is the problem with outcomes. As I explained in a recent blog, the meaning of effect sizes depends on the nature of the studies that produced them, so comparing apples to apples may be difficult. A principal might look at effect sizes for two programs and decide they look very similar. Yet one effect size might be from large-scale randomized experiments, which tend to produce smaller (and more meaningful) effect sizes, while the other might be from less rigorous studies.

Nevertheless, issues of cost and effectiveness do need to be considered. Somehow.
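To make the paraprofessional example concrete, here is a minimal sketch of the arithmetic a principal might do. Every number in it is invented for illustration (the line items, the 100 students served, and the effect size of +0.30), and the function names are hypothetical. It is not a recommended costing method, only a way to see how the same program can look cheap or expensive depending on which costs are counted.

```python
def cost_per_student(line_items, n_students):
    """Total of all counted costs, spread over the students served."""
    return sum(line_items.values()) / n_students


def cost_per_tenth_sd(per_student_cost, effect_size):
    """Dollars per student for each 0.10 standard deviations gained."""
    return per_student_cost / (effect_size / 0.10)


# All figures below are hypothetical, chosen only to illustrate the point.
N_STUDENTS = 100
ASSUMED_EFFECT_SIZE = 0.30

# Tutors already on staff: only training and materials are new costs.
existing_staff = {"training": 6_000, "materials": 4_000}
# Tutors hired for the program: salaries now count too.
new_hires = {"training": 6_000, "materials": 4_000, "tutor_salaries": 90_000}

for label, items in (("existing staff", existing_staff), ("new hires", new_hires)):
    per_student = cost_per_student(items, N_STUDENTS)
    per_gain = cost_per_tenth_sd(per_student, ASSUMED_EFFECT_SIZE)
    print(f"{label}: ${per_student:,.0f} per student, "
          f"${per_gain:,.0f} per 0.10 SD gained")
```

The effect size is identical in both scenarios, so the entire difference comes from what is counted as a cost. The sketch also leaves out the opportunity cost of whatever the existing paraprofessionals used to do, which is exactly the kind of item that is hard to put a number on.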

Evidence from Similar Schools
Clearly, a school staff would want to know that a given program has been successful in schools like theirs. For example, schools serving many English learners, or schools in rural areas, or schools in inner-city locations, might be particularly interested in data from similar schools. At a minimum, they should want to know that the developers have worked in schools like theirs, even if the evidence only exists from less similar schools.

Immediate and Long-Term Payoffs
Another factor in program impacts is the likelihood that a program will solve a very serious problem that may ultimately have a big effect on individual students and perhaps save a lot of money over time. For example, a very expensive parent training program may make a big difference for students with serious behavior problems. If this program produces lasting effects (documented in the research), its high cost might be justified, especially if it might reduce the need for even more expensive interventions, such as special education placement, expulsion, or incarceration.

Sustainability
Programs that either produce lasting impacts, or those that can be readily maintained over time, are clearly preferable to those that have short-term impacts only. In education, long-term impacts are not typically measured, but sustainability can be determined by the cost, effort, and other elements required to maintain an intervention. Most programs get a lot cheaper after the first year, so sustainability can usually be assumed. This means that even programs with modest effect sizes could bring about major changes over time.

Breadth of Impact
Some educational interventions with modest effect sizes might be justified because they apply across entire schools and for many years. For example, effective coaching for principals might have a small effect overall, but if that effect is seen across thousands of students over a period of years, it might be more than worthwhile. Similarly, training teachers in methods that become part of their permanent repertoire, such as cooperative learning, teaching metacognitive skills, or classroom management, might affect hundreds of students per teacher over time.

Low-Hanging Fruit
Some interventions may have either modest impacts on students in general, or strong outcomes for only a subset of students, but be so inexpensive or easy to adopt and implement that it would be foolish not to do so. One example might be making sure that disadvantaged students who need eyeglasses are assessed and given glasses. Not everyone needs glasses, but for those who do, this makes a big difference at low cost. Another example might be implementing a whole-school behavior management approach like Positive Behavioral Interventions and Supports (PBIS), a low-cost, proven approach any school can implement.

Comprehensiveness
Schools have to solve many quite different problems, and they usually do this by pulling various solutions off of various shelves. The problem is that this approach can be uncoordinated and inefficient. The different elements may not link up well with each other, may compete for the time and attention of the staff, and may cost a lot more than a unified, comprehensive solution that addresses many objectives in a planful way. A comprehensive approach is likely to have a coherent plan for professional development, materials, software, and assessment across all program elements. It is likely to have a plan for sustaining its effects over time and extending into additional parts of the school or additional schools.

Potential Impact
Potential impact is the sum of all the factors that make a given program or a coordinated set of programs effective in the short and long term, broad in its impact, focused on preventing serious problems, and cost-effective. There is no numerical standard for potential impact, but the concept is just intended to give educators making important choices for their kids a set of things to consider, beyond effect size and statistical significance alone.

Sorry. I wish this were simple. But kids are complex, organizations are complex, and systems are complex. It’s always a good idea for education leaders to start with the evidence but then think through how programs can be used as tools to transform their particular schools.

Seeking Jewels, Not Boulders: Learning to Value Small, Well-Justified Effect Sizes

One of the most popular exhibits in the Smithsonian Museum of Natural History is the Hope Diamond, one of the largest and most valuable in the world. It’s always fun to see all the kids flow past it saying how wonderful it would be to own the Hope Diamond, how beautiful it is, and how such a small thing could make you rich and powerful.

The diamonds are at the end of the Hall of Minerals, which is crammed full of exotic minerals from all over the world. These are beautiful, rare, and amazing in themselves, yet most kids rush past them to get to the diamonds. But no one, ever, evaluates the minerals against one another according to their size. No one ever says, “you can have your Hope Diamond, but I’d rather have this giant malachite or feldspar.” Just getting into the Smithsonian, kids go by boulders on the National Mall far larger than anything in the Hall of Minerals, perhaps climbing on them but otherwise ignoring them completely.

Yet in educational research, we often focus on the size of study effects without considering their value. In a recent blog, I presented data from a paper with my colleague Alan Cheung analyzing effect sizes from 611 studies evaluating reading, math, and science programs, K-12, that met the inclusion standards of our Best Evidence Encyclopedia. One major finding was that in randomized evaluations with sample sizes of 250 students (10 classes) or more, the average effect size across 87 studies was only +0.11. Smaller randomized studies had effect sizes averaging +0.22, large matched quasi-experiments +0.17, and small quasi-experiments, +0.32. In this blog, I want to say more about how these findings should make us think differently about effect sizes as we increasingly apply evidence to policy and practice.

Large randomized experiments (RCTs) with significant positive outcomes are the diamonds of educational research: rare, often flawed, but incredibly valuable. The reason they are so valuable is that such studies are the closest indication of what will happen when a given program goes out into the real world of replication. Randomization removes the possibility that self-selection may account for program effects. The larger the sample size, the less likely it is that the experimenter or developer could closely monitor each class and mentor each teacher beyond what would be feasible in real-life scale up. Most large-scale RCTs use clustering, which usually means that the treatment and randomization take place at the level of the whole school. A cluster randomized experiment at the school level might require recruiting 40 to 50 schools, perhaps serving 20,000 to 25,000 students. Yet such studies might nevertheless be too “small” to detect an effect size of, say, 0.15, because it is the number of clusters, not the number of students, that matters most!
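For readers who want to see why the number of schools matters more than the number of students, here is a rough sketch using a standard normal-approximation formula for the minimum detectable effect size (MDES) in a school-level cluster randomized experiment. The intraclass correlation of 0.20, the school-level pretest R-squared of 0.50, and the 500 students per school are my assumptions for illustration, not figures from any of the studies discussed here, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n_schools, students_per_school, icc=0.20, r2_school=0.50,
         alpha=0.05, power=0.80, share_treated=0.5):
    """Approximate minimum detectable effect size for a school-level cluster RCT.

    Normal approximation; icc and r2_school are assumed values.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Between-school variance (reduced by a school-level pretest covariate)
    # plus student-level variance averaged over each school's students.
    variance = icc * (1 - r2_school) + (1 - icc) / students_per_school
    return multiplier * sqrt(
        variance / (share_treated * (1 - share_treated) * n_schools)
    )


for schools in (40, 50):
    print(schools, "schools x 500 students:", round(mdes(schools, 500), 2))

# Ten times as many students per school barely changes the answer...
print("40 schools x 5000 students:", round(mdes(40, 5000), 2))
# ...but four times as many schools cuts it in half.
print("160 schools x 500 students:", round(mdes(160, 500), 2))
```

Under these assumed values, 40 or 50 schools cannot reliably detect an effect size of 0.15, however many students each school enrolls. Different assumptions about the intraclass correlation or the pretest would shift the exact numbers, but not the pattern.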

The problem is that we have been looking for much larger effect sizes, and all too often not finding them. Traditionally, researchers recruit enough schools or classes to reliably detect an effect size as small as +0.20. This means that many studies report effect sizes that turn out to be larger than average for large RCTs, but are not statistically significant (at p<.05), because they are less than +0.20. If researchers did recruit samples of schools large enough to detect an effect size of +0.15, this would greatly increase the costs of such studies. Large RCTs are already very expensive, so substantially increasing sample sizes could end up requiring resources far beyond what educational research is likely to see any time in the near future or greatly reducing the number of studies that are funded.
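The cost problem can be put in rough numbers. In the usual power formulas, the number of schools required grows with the inverse square of the effect size you want to detect, so moving the target from +0.20 to +0.15 multiplies the number of schools by (0.20/0.15) squared, or about 1.8. The sketch below illustrates this with a normal approximation. The intraclass correlation (0.20), school-level pretest R-squared (0.50), and 500 students per school are assumptions of mine, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def schools_needed(target_es, students_per_school=500, icc=0.20,
                   r2_school=0.50, alpha=0.05, power=0.80):
    """Approximate number of schools needed to detect target_es.

    Normal approximation with half the schools in each condition;
    icc and r2_school are assumed values.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    variance = icc * (1 - r2_school) + (1 - icc) / students_per_school
    # Half of the schools in each condition, so P(1 - P) = 0.25.
    return ceil((multiplier / target_es) ** 2 * variance / 0.25)


for target in (0.20, 0.15):
    print(f"to detect +{target:.2f}: about {schools_needed(target)} schools")

print("ratio:", round(schools_needed(0.15) / schools_needed(0.20), 2))
```

The absolute counts depend heavily on the assumed values, but the ratio does not: whatever a study designed around +0.20 costs, one designed around +0.15 needs roughly three-quarters more schools.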

These issues have taken on greater importance recently due to the passage of the Every Student Succeeds Act, or ESSA, which encourages use of programs that meet strong, moderate, or promising levels of evidence. The “strong” category requires that a program have at least one randomized experiment that found a significant positive effect. Such programs are rare.

If educational researchers were mineralogists, we’d be pretty good at finding really big diamonds, but the little, million-dollar diamonds, not so much. This makes no sense in diamonds, and no sense in educational research.

So what do we do? I’m glad you asked. Here are several ways we could proceed to increase the number of programs successfully evaluated in RCTs.

1. For cluster randomized experiments at the school level, something has to give. I’d suggest that for such studies, the p value required for significance should be increased to .10 or even .20. A p value of .05 is a long-established convention, indicating that there is only one chance in 20 of seeing outcomes this large by luck alone if the program had no real effect. Yet one chance in 10 (p=.10) may be sufficient in studies likely to have tens of thousands of students.

2. For studies in the past, as well as in the future, replication should be considered the same as large sample size. For example, imagine that two studies of Program X each have 30 schools. Each gets a respectable effect size of +0.20, which would not be significant in either case. Put the two studies together, however, and voila! The combined study of 60 schools would be highly significant, even at p=.05.

3. Government or foundation funders might fund evaluations in stages. The first stage might involve a cluster randomized experiment of, say, 20 schools, which is very unlikely to produce a significant difference. But if the effect size were perhaps 0.20 or more, the funders might fund a second stage of 30 schools. The two samples together, 50 schools, would be enough to detect a small but important effect.
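The first two suggestions above can be sketched in a few lines. The code pools two hypothetical 30-school studies of Program X using ordinary inverse-variance (fixed-effect) weighting, which is one common way to combine studies and not necessarily what any particular review would use. The standard errors come from assumed values (intraclass correlation 0.20, school-level pretest R-squared 0.50, 500 students per school), and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(n_schools, students_per_school=500, icc=0.20, r2_school=0.50):
    """Approximate standard error of an effect size in a school-level cluster RCT.

    Assumes half the schools in each condition; icc and r2_school are assumed.
    """
    variance = icc * (1 - r2_school) + (1 - icc) / students_per_school
    return sqrt(variance / (0.25 * n_schools))


def two_tailed_p(effect_size, se):
    """Two-tailed p value from a normal (z) test."""
    return 2 * (1 - NormalDist().cdf(abs(effect_size) / se))


def pool(estimates):
    """Inverse-variance (fixed-effect) pooling of (effect_size, se) pairs."""
    weights = [1 / se ** 2 for _, se in estimates]
    pooled_es = sum(w * es for w, (es, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled_es, pooled_se


# Two hypothetical studies, each with 30 schools and an effect size of +0.20.
study = (0.20, standard_error(30))
single_p = two_tailed_p(*study)

pooled_es, pooled_se = pool([study, study])
pooled_p = two_tailed_p(pooled_es, pooled_se)

print(f"one 30-school study: ES = +0.20, p = {single_p:.3f}")
print(f"two studies pooled:  ES = {pooled_es:+.2f}, p = {pooled_p:.3f}")
```

Under these assumptions, each study alone misses the .05 threshold but would pass at .10, and the two together pass at .05 with the same effect size. The pooled standard error is exactly what a single 60-school study would have, which is the sense in which replication can stand in for sample size.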

One might well ask why we should be interested in programs that only produce effect sizes of 0.15. Aren’t these effects too small to matter?

The answer is that they are not. Over time, I hope we will learn how to routinely produce better outcomes. Already, we know that much larger impacts are found in studies of certain approaches emphasizing professional development (e.g., cooperative learning, metacognitive skills) and certain forms of technology. I hope and expect that over time, more studies will evaluate programs using methods like those that have been proven to work, and fewer will evaluate those that do not, thereby raising the average effects we find. But even as they are, small but reliable effect sizes are making meaningful differences in the lives of children, and will make much more meaningful differences as we learn from efforts at the Institute of Education Sciences (IES) and the Investing in Innovation (i3)/Education Innovation and Research (EIR) programs.

Small effect sizes from large randomized experiments are the Hope Diamonds of our profession. They also are the best hope for evidence-based improvements for all students.