Could Proven Programs Eliminate Gaps in Elementary Reading Achievement?

What if every child in America could read at grade level or better? What if the number of students in special education for learning disabilities, or retained in grade, could be cut in half?

What if students who become behavior problems or give up on learning because of nothing more than reading difficulties could instead succeed in reading and no longer be frustrated by failure?

Today these kinds of outcomes are only pipe dreams. Despite decades of effort and billions of dollars directed toward remedial and special education, reading levels have barely increased. Gaps between middle-class and economically disadvantaged students remain wide, as do gaps between ethnic groups. We’ve done so much, you might think, and nothing has really worked at scale.

Yet today we have many solutions to the problems of struggling readers, solutions so effective that if widely and effectively implemented, they could substantially change not only the reading skills, but the life chances of students who are struggling in reading.


How do I know this is possible? The answer is that the evidence is there for all to see.

This week, my colleagues and I released a review of research on programs for struggling readers. The review, written by Amanda Inns, Cynthia Lake, Marta Pellegrini, and me, uses academic language and rigorous review methods. But you don’t have to be a research expert to understand what we found. In ten minutes, just reading this blog, you will know what needs to be done to have a powerful impact on struggling readers.

Everyone knows that there are substantial gaps in student reading performance according to social class and race. According to the National Assessment of Educational Progress, or NAEP, here are key gaps in terms of effect sizes at fourth grade:

Gap                                                         Effect Size
No free/reduced-price lunch vs. free/reduced-price lunch       0.56
White vs. African American                                     0.52
White vs. Hispanic                                             0.46

These are big differences. In order to eliminate these gaps, we’d have to provide schools serving disadvantaged and minority students with programs or services sufficient to increase their reading scores by about a half standard deviation. Is this really possible?
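To make the half-standard-deviation benchmark concrete, here is a minimal sketch of how a gap is expressed as an effect size (a Cohen’s d-style statistic: the difference between two group means, divided by the standard deviation). The scale scores and standard deviation below are illustrative assumptions, not actual NAEP values.

```python
# Illustrative sketch: expressing an achievement gap as an effect size.
# The scale scores and SD are made-up values, chosen so the gap works
# out to roughly the 0.5 SD discussed above.

def effect_size_gap(mean_a, mean_b, pooled_sd):
    """Difference between two group means, in standard deviation units."""
    return (mean_a - mean_b) / pooled_sd

# Hypothetical fourth-grade reading means on a NAEP-like scale
gap = effect_size_gap(mean_a=232.0, mean_b=214.0, pooled_sd=36.0)
print(round(gap, 2))  # 0.5
```

Read the other way around, closing such a gap means raising the lower group’s average by roughly half a standard deviation, which is exactly the benchmark used below to judge programs.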

Can We Really Eliminate Such Big and Longstanding Gaps?

Yes, we can. And we can do it cost-effectively.

Our review examined thousands of studies of programs intended to improve the reading performance of struggling readers. We found 59 studies of 39 different programs that met very high standards of research quality. 73% of the qualifying studies used random assignment to experimental or control groups, just as the most rigorous medical studies do. We organized the programs into response to intervention (RTI) tiers:

  • Tier 1 means whole-class programs, not just for struggling readers.
  • Tier 2 means targeted services for students who are struggling to read.
  • Tier 3 means intensive services for students who have serious difficulties.

Our categories were as follows:

Multi-Tier (Tier 1 + tutoring for students who need it)

Tier 1:

  • Whole-class programs

Tier 2:

  • Technology programs
  • One-to-small group tutoring

Tier 3:

  • One-to-one tutoring

We are not advocating for RTI itself, because the data on RTI are unclear. But it is just common sense to use proven programs with all students, then proven remedial approaches with struggling readers, then intensive services for students for whom Tier 2 is not sufficient.

Do We Have Proven Programs Able to Overcome the Gaps?

The table below shows average effect sizes for specific reading approaches. Wherever you see effect sizes that approach or exceed +0.50, you are looking at proven solutions to the gaps, or at least programs that could become a component in a schoolwide plan to ensure the success of all struggling readers.

Programs That Work for Struggling Elementary Readers

Program                                               Grades Proven   No. of Studies   Mean Effect Size

Multi-Tier Approaches
    Success for All                                   K-5             3                +0.35
    Enhanced Core Reading Instruction                 1               1                +0.24

Tier 1 – Classroom Approaches
    Cooperative Integrated Reading
      & Composition (CIRC)                            2-6             3                +0.11
    PALS                                              1               1                +0.65

Tier 2 – One-to-Small Group Tutoring
    Read, Write, & Type (T 1-3)                       1               1                +0.42
    Lindamood (T 1-3)                                 1               1                +0.65
    SHIP (T 1-3)                                      K-3             1                +0.39
    Passport to Literacy (TA 1-4/7)                   4               4                +0.15
    Quick Reads (TA 1-2)                              2-3             2                +0.22

Tier 3 – One-to-One Tutoring
    Reading Recovery (T)                              1               3                +0.47
    Targeted Reading Intervention (T)                 K-1             2                +0.50
    Early Steps (T)                                   1               1                +0.86
    Lindamood (T)                                     K-2             1                +0.69
    Reading Rescue (T or TA)                          1               1                +0.40
    Sound Partners (TA)                               K-1             2                +0.43
    SMART (PV)                                        K-1             1                +0.40
    SPARK (PV)                                        K-2             1                +0.51

Key:    T: Certified teacher tutors

TA: Teaching assistant tutors

PV: Paid volunteers (e.g., AmeriCorps members)

1-X: For small group tutoring, the usual group size for tutoring (e.g., 1-2, 1-4)

(For more information on each program, see www.evidenceforessa.org)

The table is a road map to eliminating the achievement gaps that our schools have wrestled with for so long. It lists only programs that succeeded at a high level, relative to others at the same tier levels. See the full report or www.evidenceforessa.org for information on all programs.

It is important to note that there is little evidence of the effectiveness of tutoring in grades 3-5. Almost all of the evidence is from grades K-2. However, studies done in England in secondary schools have found positive effects of three reading tutoring programs in the English equivalent of U.S. grades 6-7. These findings suggest that when well-designed tutoring programs for grades 3-5 are evaluated, they will also show very positive impacts. See our review on secondary reading programs at www.bestevidence.org for information on these English middle school tutoring studies. On the same website, you can also see a review of research on elementary mathematics programs, which reports that most of the successful studies of tutoring in math took place in grades 2-5, another indicator that reading tutoring is also likely to be effective in these grades.

Some of the individual programs have shown effects large enough to overcome gaps all by themselves if they are well implemented (i.e., ES = +0.50 or more). Others have effect sizes lower than +0.50 but if combined with other programs elsewhere on the list, or if used over longer time periods, are likely to eliminate gaps. For example, one-to-one tutoring by certified teachers is very effective, but very expensive. A school might implement a Tier 1 or multi-tier approach to solve all the easy problems inexpensively, then use cost-effective one-to-small group methods for students with moderate reading problems, and only then use one-to-one tutoring with the small number of students with the greatest needs.

Schools, districts, and states should consider the availability, practicality, and cost of these solutions to arrive at a workable solution. They then need to make sure that the programs are implemented well enough and long enough to obtain the outcomes seen in the research, or to improve on them.

But the inescapable conclusion from our review is that the gaps can be closed, using proven models that already exist. That’s big news, news that demands big changes.

Photo credit: Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Moneyball for Education

When I was a kid, growing up in the Maryland suburbs of Washington, DC, everyone I knew rooted for the hapless Washington Senators, one of the worst baseball teams ever. At that time, however, the Baltimore Orioles were one of the best teams in baseball, and every once in a while a classmate would snap. He (always “he”) would decide to become an Orioles fan. This would cause him to be shamed and ostracized for the rest of his life by all true Senators fans.

I’ve now lived in Baltimore for most of my life. I wonder if I came here in part because of my youthful impression of Baltimore as a winning franchise?


Skipping forward in time to now, I recently saw in the New York Times an article about the collapse of the Baltimore Orioles. In 2018, they had the worst record of any team in history. Worse than even the Washington Senators ever were. Why did this happen? According to the NYT, the Orioles are one of the last teams to embrace analytics, which means using evidence to decide which players to recruit or drop, to put on the field or on the bench. Some teams have analytics departments of 15. The Orioles? Zero, although they have just started one.

It’s not as though the benefits of analytics are a secret. A 2003 book by Michael Lewis, Moneyball, explained how the underfunded Oakland A’s used analytics to turn themselves around. A hugely popular 2011 movie told the same story.

In case anyone missed the obvious linkage of analytics in baseball to analytics in education, Results for America (RfA), a group that promotes the use of evidence in government social programs, issued a 2015 book called, you guessed it, Moneyball for Government (Nussle & Orszag, 2015). This Moneyball focused on success stories and ideas from key thinkers and practitioners in government and education. RfA was instrumental in encouraging the U.S. Congress to include in ESSA definitions of strong, moderate, and promising evidence of effectiveness, and to specify a few areas of federal funding that require or incentivize use of proven programs.

The ESSA evidence standards are a giant leap forward in supporting the use of evidence in education. Yet, like the Baltimore Orioles, the once-admired U.S. education system has been less than swept away by the idea that using proven programs and practices could improve outcomes for children. Yes, the situation is better than it was, but things are going very slowly. I’m worried that because of this, the whole evidence movement in education will someday be dismissed: “Evidence? Yeah, we tried that. Didn’t work.”

There are still good reasons for hope. The amount of high-quality evidence continues to grow at an unprecedented pace. The ESSA evidence standards have at least encouraged federal, state, and local leaders to pay some attention to evidence, though moving to action based on this evidence is a big lift.

Perhaps I’m just impatient. It took the Baltimore Orioles a book, a movie, and 16 years to arrive at the conclusion that maybe, just maybe, it was time to use evidence, as winning teams have been doing for a long time. Education is much bigger, and its survival does not depend on its success (as a baseball team’s does). Education will require visionary leadership to embrace the use of evidence. But I am confident that when it does, we will be overwhelmed by visits from educators from Finland, Singapore, China, and other countries that currently clobber us in international comparisons. They’ll want to know how the U.S. education system became the best in the world. Perhaps we’ll have to write a book and a movie to explain it all. I’d suggest we call it . . . “Learnball.”

References

Nussle, J., & Orszag, P. (2015). Moneyball for Government (2nd Ed.). Washington, DC: Disruption Books.

Photo credit: Keith Allison [CC BY-SA 2.0 (https://creativecommons.org/licenses/by-sa/2.0)]


Replication

The holy grail of science is replication. If a finding cannot be repeated, then it did not happen in the first place. There is a reason that the humor journal in the hard sciences is called the Journal of Irreproducible Results. For scientists, results that are irreproducible are inherently laughable, therefore funny. In many hard science experiments, replication is pretty much guaranteed. If you heat an iron bar, it gets longer. If you cross parents with the same recessive gene, one quarter of their progeny will express the recessive trait (think blue eyes).


In educational research, we care about replication just as much as our colleagues in the lab coats across campus. However, when we’re talking about evaluating instructional programs and practices, replication is a lot harder, because students and schools differ. Positive outcomes obtained in one experiment may or may not replicate in a second trial.

Sometimes this is because the first experiment had features known to contribute to bias: small sample sizes, brief study durations, extraordinary amounts of resources or expert time to help the experimental schools or classes, use of measures made by the developers or researchers (or otherwise overaligned with the experimental group but not the control group), or use of matched rather than randomized assignment to conditions. Any of these can make a first experiment look successful. Second or third experiments are more likely to be larger, longer, and more stringent than the first study, and therefore may not replicate its findings. Even when the first study has none of these problems, it may not replicate because of differences in the samples of schools, teachers, or students, or for other, perhaps unknowable, reasons.

A change in the conditions of education may also cause a failure to replicate. Our Success for All whole-school reform model has been found to be effective many times, mostly by third-party evaluators. However, Success for All has always specified a full-time facilitator and at least one tutor for each school. An MDRC i3 evaluation happened to fall in the middle of the recession, and schools, which were struggling to afford classroom teachers, could not afford facilitators or tutors. The results were still positive on some measures, especially for low achievers, but the effect sizes were less than half of what others had found in many studies. Stuff happens.

Replication has taken on more importance recently because the ESSA evidence standards require only a single positive study. To meet the strong, moderate, or promising standards, programs must have at least one “well-designed and well-implemented” study using a randomized (strong), matched (moderate), or correlational (promising) design and finding significantly positive outcomes. Based on the “well-designed and well-implemented” language, our Evidence for ESSA website requires features of experiments similar to those required by the What Works Clearinghouse (WWC). These requirements make it difficult to be approved, but they remove many of the experimental design features that typically cause first studies to greatly overstate program impacts: small samples, brief durations, overinvolved experimenters, and developer-made measures. They put (less rigorous) matched and correlational studies in lower categories. So one study that meets ESSA or Evidence for ESSA requirements is at least likely to be a very good study. But many researchers have expressed discomfort with the idea that a single study could qualify a program for one of the top ESSA categories, especially if (as sometimes happens) there is one study with a positive outcome and many with zero or nonsignificant outcomes.

The pragmatic problem is that if ESSA had required even two studies showing positive outcomes, this would have wiped out a very large proportion of current programs. If research continues to identify effective programs, it should only be a matter of time before ESSA (or its successors) requires more than one study with positive outcomes.

However, in the current circumstance, there is a way researchers and educators might at least estimate the replicability of a given program when it has only a single study with a significant positive outcome. This would involve looking at the findings for entire genres of programs. The logic here is that if a program has only one ESSA-qualifying study, but it closely resembles other programs that also have positive outcomes, that program should be taken a lot more seriously than a program whose positive outcome differs considerably from the outcomes of very similar programs.

As one example, there is much evidence from many studies by many researchers indicating positive effects of one-to-one and one-to-small group tutoring, in reading and mathematics. If a tutoring program has only one study, but this one study has significant positive findings, I’d say thumbs up. I’d say the same about cooperative learning approaches, classroom management strategies using behavioral principles, and many others, where a whole category of programs has had positive outcomes.

In contrast, if a program has a single positive outcome and there are few if any similar approaches that obtained positive outcomes, I’d be much more cautious. An example might be textbooks in mathematics, which rarely make any difference because control groups are also likely to be using textbooks, and textbooks considerably resemble each other. In our recent elementary mathematics review (Pellegrini, Lake, Inns, & Slavin, 2018), only one textbook program available in the U.S. had positive outcomes (out of 16 studies). As another example, there have been several large randomized evaluations of the use of interim assessments. Only one of them found positive outcomes. I’d be very cautious about putting much faith in benchmark assessments based on this single anomalous finding.

Looking for findings from similar studies is facilitated by looking at reviews we make available at www.bestevidence.org. These consist of reviews of research organized by categories of programs. Looking for findings from similar programs won’t help with the ESSA law, which often determines its ratings based on the findings of a single study, regardless of other findings on the same program or similar programs. However, for educators and researchers who really want to find out what works, I think checking similar programs is not quite as good as finding direct replication of positive findings on the same programs, but perhaps, as we like to say, close enough for social science.


Tutoring Works. But Let’s Learn How It Can Work Better and Cheaper

I was once at a meeting of the British Education Research Association, where I had been invited to participate in a debate about evidence-based reform. We were having what journalists often call “a frank exchange of views” in a room packed to the rafters.

At one point in the proceedings, a woman stood up and, in a furious tone of voice, informed all and sundry that (I’m paraphrasing here) “we don’t need to talk about all this (very bad word). Every child should just get Reading Recovery.” She then stomped out.

I don’t know how widely her view was supported in the room or anywhere else in Britain or elsewhere, but what struck me at the time, and what strikes me even more today, is the degree to which Reading Recovery has long defined, and in many ways limited, discussions about tutoring. Personally, I have nothing against Reading Recovery, and I have always admired the commitment Reading Recovery advocates have had to professional development and to research. I’ve also long known that the evidence for Reading Recovery is very impressive, but you’d be amazed if one-to-one tutoring by well-trained teachers did not produce positive outcomes. On the other hand, Reading Recovery insists on one-to-one instruction by certified teachers, with considerable cost for all that admirable professional development, so it is very expensive. A British study estimated the cost per child at $5,400 (in 2018 dollars). There are roughly one million Year 1 students in the U.K., so if the angry woman had her way, they’d have to come up with the equivalent of $5.4 billion a year. In the U.S., it would be more like $27 billion a year. I’m not one to shy away from very expensive proposals if they also provide extremely effective services and there are no equally effective alternatives. But shouldn’t we be exploring alternatives?
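The cohort arithmetic above can be sketched in a couple of lines. The U.K. cohort size comes from the text; the U.S. first-grade cohort of roughly five million is an assumption inferred from the $27 billion figure.

```python
# Back-of-envelope cost of universal Reading Recovery, using the
# per-child estimate quoted above ($5,400 in 2018 dollars).
COST_PER_CHILD = 5_400

uk_year1_cohort = 1_000_000   # from the text
us_grade1_cohort = 5_000_000  # assumed, to match the $27 billion figure

print(COST_PER_CHILD * uk_year1_cohort)   # 5400000000  -> $5.4 billion/year
print(COST_PER_CHILD * us_grade1_cohort)  # 27000000000 -> $27 billion/year
```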

If you’ve been following my blogs on tutoring, you’ll be aware that, at least at the level of research, the Reading Recovery monopoly on tutoring has been broken in many ways. Reading Recovery has always insisted on certified teachers, but many studies have now shown that well-trained teaching assistants can do just as well, in mathematics as well as reading. Reading Recovery has insisted that tutoring should just be for first graders, but numerous studies have now shown positive outcomes of tutoring through seventh grade, in both reading and mathematics. Reading Recovery has argued that its cost was justified by the long-lasting impacts of first-grade tutoring, but their own research has not documented long-lasting outcomes. Reading Recovery is always one-to-one, of course, but now there are numerous one-to-small group programs, including a one-to-three adaptation of Reading Recovery itself, that produce very good effects. Reading Recovery has always just been for reading, but there are now more than a dozen studies showing positive effects of tutoring in math, too.


All of this newer evidence opens up new possibilities for tutoring that were unthinkable when Reading Recovery ruled the tutoring roost alone. If tutoring can be effective using teaching assistants and small groups, then it is becoming a practicable solution to a much broader range of learning problems. It also opens up a need for further research and development specific to the affordances and problems of tutoring. For example, tutoring can be done a lot less expensively than $5,400 per child, but it is still expensive. We created and evaluated a one-to-six, computer-assisted tutoring model that produced effect sizes of around +0.40 for $500 per child. Yet I just got a study from the Education Endowment Foundation (EEF) in England evaluating one-to-three math tutoring by college students and recent graduates. The tutors provided only one hour of tutoring per week for 12 weeks, to sixth graders. The effect size was much smaller (ES = +0.19), but the cost was only about $150 per child.
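One crude way to compare the two models just described is effect size gained per $1,000 spent per child. The figures come from the paragraph above; the metric itself is my simplification, and it ignores duration, who delivers the tutoring, and how effects might change at scale.

```python
# Rough cost-effectiveness comparison of the two tutoring models above.
def es_per_thousand_dollars(effect_size, cost_per_child):
    """Effect size gained per $1,000 spent per child (crude metric)."""
    return effect_size / (cost_per_child / 1000)

one_to_six = es_per_thousand_dollars(0.40, 500)        # computer-assisted, 1-to-6
eef_one_to_three = es_per_thousand_dollars(0.19, 150)  # EEF study, 1-to-3

print(round(one_to_six, 2), round(eef_one_to_three, 2))  # 0.8 1.27
```

By this (admittedly crude) measure, the cheaper EEF model delivered more effect per dollar, even though its overall effect size was much smaller, which is exactly the kind of trade-off worth studying.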

I am not advocating this particular solution, but isn’t it interesting? The EEF also evaluated another means of making tutoring inexpensive, using online tutors from India and Sri Lanka, and another, using cross-age peer tutors, both in math. Both failed miserably, but isn’t that interesting?

I can imagine a broad range of approaches to tutoring, designed to enhance outcomes, minimize costs, or both. Out of that research might come a diversity of approaches that might be used for different purposes. For example, students in deep trouble, headed for special education, surely need something different from what is needed by students with less serious problems. But what exactly is it that is needed in each situation?

In educational research, reliable positive effects of any intervention are rare enough that we’re usually happy to celebrate anything that works. We might say, “Great, tutoring works! But we knew that.”  However, if tutoring is to become a key part of every school’s strategies to prevent or remediate learning problems, then knowing that “tutoring works” is not enough. What kind of tutoring works for what purposes?  Can we use technology to make tutors more effective? How effective could tutoring be if it is given all year or for multiple years? Alternatively, how effective could we make small amounts of tutoring? What is the optimal group size for small group tutoring?

We’ll never satisfy the angry woman who stormed out of my long-ago symposium at BERA. But for those who can have an open mind about the possibilities, building on the most reliable intervention we have for struggling learners and creating and evaluating effective and cost-effective tutoring approaches seems like a worthwhile endeavor.

Photo Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action.


Government Plays an Essential Role in Diffusion of Innovations

Lately I’ve been hearing a lot of concern in reform circles about whether externally derived evidence can truly change school practices and improve outcomes. Surveys of principals, for example, routinely find that principals rarely consult research in making key decisions, including decisions about adopting materials, software, or professional development intended to improve student outcomes. Instead, principals rely on their friends in similar schools serving similar students. In the whole process, research rarely comes up, and if it does, it is often generic research on how children learn rather than high-quality evaluations of specific programs they might adopt.

Principals and other educational leaders have long been used to making decisions without consulting research. It would be difficult to expect otherwise, because of three conditions that have prevailed roughly from the beginning of time to very recently: a) There was little research of practical value on practical programs; b) The research that did exist was of uncertain quality, and school leaders did not have the time or training to determine studies’ validity; c) There were no resources provided to schools to help them adopt proven programs, so doing so required that they spend their own scarce resources.

Under these conditions, it made sense for principals to ask around among their friends before selecting programs or practices. When no one knows anything about a program’s effectiveness, why not ask your friends, who at least (presumably) have your best interests at heart and know your context? Since conditions a, b, and c have defined the context for evidence use nearly up to the present, it is not surprising that school leaders have built a culture of distrust for anyone outside of their own circle when it comes to choosing programs.

However, all three of conditions a, b, and c have changed substantially in recent years, and they are continuing to change in a positive direction at a rapid rate:

a) High-quality research on practical programs for elementary and secondary schools is growing at an extraordinary rate. As shown in Figure 1, the number of rigorous randomized or quasi-experimental studies in elementary and secondary reading and in elementary math has skyrocketed since about 2003, due mostly to investments by the Institute of Education Sciences (IES) and Investing in Innovation (i3). There has been a similar explosion of evidence in England, due to funding from the Education Endowment Foundation (EEF). Clearly, we know a lot more about which programs work and which do not than we once did.

[Figure 1: Growth in the number of rigorous randomized and quasi-experimental studies of reading and mathematics programs since about 2003.]

b) Principals, teachers, and the public can now easily find reliable and accessible information on practical programs on the What Works Clearinghouse (WWC), Evidence for ESSA, and other sites. No one can complain any more that information is inaccessible or incomprehensible.

c) Encouragement and funding are becoming available for schools eager to use proven programs. Most importantly, the federal ESSA law is providing school improvement funding for low-achieving schools that agree to implement programs that meet the top three ESSA evidence standards (strong, moderate, or promising). ESSA also provides preference points for applications for certain sources of federal funding if they promise to use the money to implement proven programs. Some states have extended the same requirement to apply to eligibility for state funding for schools serving students who are disadvantaged or are ethnic or linguistic minorities. Even schools that do not meet any of these demographic criteria are, in many states, being encouraged to use proven programs.


Photo credit: Jorge Gallo [Public domain], from Wikimedia Commons

I think the current situation is like that which must have existed in, say, 1910, with cars and airplanes. Anyone could see that cars and airplanes were the future. But I’m sure many horse-owners pooh-poohed the whole thing. “Sure there are cars,” they’d say, “but who will build all those paved roads? Sure there are airplanes, but who will build airports?” The answer was government, which could see the benefits to the entire economy of systems of roads and airports to meet the needs of cars and airplanes.

Government cannot solve all problems, but it can create conditions that promote the adoption and use of proven innovations. And in education, federal, state, and local governments are moving rapidly to do this. Principals may still prefer to talk to other principals, and that’s fine. But with ever more evidence on ever more programs, and with modest restructuring of funds governments are already awarding, conditions are coming together to utterly transform the role of evidence in educational practice.


A Warm Welcome From Babe Ruth’s Home Town to the Registry of Efficacy and Effectiveness Studies (REES)

Every baseball season, many home runs are hit by various players across the major leagues. But in all of history, there is one home run that stands out for baseball fans. In the 1932 World Series, Babe Ruth (born in Baltimore!) pointed to the center field fence. He then hit the next pitch over that fence, exactly where he said he would.

Just 86 years later, the U.S. Department of Education, in collaboration with the Society for Research on Educational Effectiveness (SREE), launched a new (figurative) center field fence for educational evaluation. It’s called the Registry of Efficacy and Effectiveness Studies (REES). The purpose of REES is to ask evaluators of educational programs to register their research designs, measures, analyses, and other features in advance. This is roughly the equivalent of asking researchers to point to the center field fence, announcing their intention to hit the ball right there. The reason this matters is that all too often, evaluators carry out evaluations that do not produce the desired positive outcomes on some measures or in some analyses. They then report outcomes only on the measures that did show positive outcomes, or they might use different analyses from those initially planned, or report outcomes for only a subset of their full sample. On this last point, I remember a colleague long ago who obtained and re-analyzed data from a large and important national study that collected data in several cities but reported results only for Detroit. In her analyses of data from the other cities, she found that the results the authors claimed appeared only in Detroit, not in any other city.

REES pre-registration will, over time, make it possible for researchers, reviewers, and funders to find out whether evaluators are reporting all of the findings and all of the analyses as they originally planned them. I would assume that within a period of years, review facilities such as the What Works Clearinghouse (WWC) will start requiring pre-registration before accepting studies for their top evidence categories. We will certainly do so for Evidence for ESSA. As pre-registration becomes common (as it surely will, if IES is suggesting or requiring it), review facilities such as the WWC and Evidence for ESSA will have to learn how to use the pre-registration information. Obviously, minor changes in research designs or measures may be allowed, especially small changes made before posttest results are known. For example, if some schools named in a pre-registration are not in the posttest sample, the evaluators might explain that the schools closed (not a problem if this did not upset pretest equivalence), but if they withdrew for other reasons, reviewers would want to know why, and would insist that withdrawn schools be included in any intent-to-treat (ITT) analysis. Other fields, including much of medical research, have been using pre-registration for many years, and I’m sure REES and review facilities in education could learn from their experiences and policies.

What I find most heartening in REES and pre-registration is that it is an indication of how much and how rapidly educational research has matured in a short time. Ten years ago REES could not have been realistically proposed. There was too little high-quality research to justify it, and frankly, few educators or policy makers cared very much about the findings of rigorous research. There is still a long way to go in this regard, but embracing pre-registration is one way we say to our profession and ourselves that the quality of evidence in education can stand up to that in any other field, and that we are willing to hold ourselves accountable for the highest standards.

[Photo: Babe Ruth]

In baseball history, Babe Ruth's "pre-registered" home run in the 1932 World Series is referred to as the "called shot." No one had ever done it before, and no one ever did it again. But in educational evaluation, we will soon be calling our shots all the time. And when we say in advance exactly what we are going to do, and then do it, just as we promised, showing real benefits for children, educational evaluation will take a major step forward in increasing users' confidence in the outcomes.

Photo credit: Babe Ruth, 1920, unattributed photo [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Programs and Practices

One issue I hear about all the time when I speak about evidence-based reform in education is the distinction between programs and practices. A program is a specific set of procedures, usually with materials, software, professional development, and other elements, designed to achieve one or more important outcomes, such as improving reading, math, or science achievement. Programs are typically created by non-profit organizations, though they may be disseminated by for-profits. Almost everything in the What Works Clearinghouse (WWC) and Evidence for ESSA is a program.

A practice, on the other hand, is a general principle that a teacher can use. It may not require any particular professional development or materials.  Examples of practices include suggestions to use more feedback, more praise, a faster pace of instruction, more higher-order questions, or more technology.

In general, educators, and especially teachers, love practices, but are not so crazy about programs. Programs have structure, requiring adherence to particular activities and use of particular materials. In contrast, teachers can use practices however they wish. Educational leaders often say, "We don't do programs." What they mean is, "We give our teachers generic professional development on practices and then turn them loose to interpret those practices as they please."

One problem with practices is that because they leave the details up to each teacher, teachers are likely to interpret them in a way that conforms to what they are already doing, and then no change happens. As an example of this, I once attended a speech by the late, great Madeline Hunter, who was extremely popular in the 1970s and '80s. She spoke and wrote clearly and excitingly, in a very down-to-earth way. The auditorium she spoke in was stuffed to the rafters with teachers, who hung on her every word.

When her speech was over, I was swept out in a throng of happy teachers. They were all saying to each other, “Madeline Hunter supports absolutely everything I’ve ever believed about teaching!”

I love happy teachers, but I was puzzled by their reaction. If all the teachers were already doing the things Madeline Hunter recommended to the best of their ability, then how did her ideas improve their teaching? In actuality, a few studies of Hunter's principles found no significant effects on student learning, and even more surprising, they found few differences between the teaching behaviors of teachers trained in Hunter's methods and those of teachers who had not been trained. Essentially, one might argue, Madeline Hunter's principles were popular precisely because they did not require teachers to change very much, and if teachers do not change their teaching, why would we expect their students' learning to change?

[Image: lawnmower parts]

Another reason that practices rarely change learning is that they are usually small improvements that teachers are expected to assemble into better teaching on their own. Asking teachers to put together many small pieces into a major improvement is a bit like handing someone the parts of a lawnmower and asking them to assemble it (see picture above). Some mechanically minded people could do it, but why bother? Why not start with a whole lawnmower?

In the same way, there are gifted teachers who can assemble principles of effective practice into great instruction, but why make it so difficult? Great teachers who could assemble isolated principles into effective teaching strategies are also sure to be able to take a proven program and implement it very well. Why not start with something known to work and then improve it with effective implementation, rather than starting from scratch?

A further problem with practices is that most are impossible to evaluate. By definition, everyone has their own interpretation of every practice. If a practice becomes specific, with specific guides, supports, and materials, it becomes a program. So a practice is a practice exactly because it is too loosely specified to be a program. And practices that are difficult to specify clearly are also unlikely to improve student outcomes.

There are exceptions, where practices can be both specified and evaluated. For example, eliminating ability grouping, reducing class size, or assigning (or not assigning) homework are practices that can be evaluated. But these are exceptions.

The squishiness of most practices is the reason that they rarely appear in the WWC or Evidence for ESSA. A proper evaluation contrasts a treatment (an experimental group) with a control group continuing current practices. The treatment almost has to be a program, because otherwise it is impossible to tell what is being evaluated. For example, how can an experiment evaluate "feedback" if teachers make up their own definitions of "feedback"? How about higher-order questions? How about praise? Rapid pace? Use of these practices can be measured by observation, but differences between the treatment and control groups may be hard to detect, because teachers in the control group may be using the same practices. What teacher does not provide feedback? What teacher does not praise children? What teacher does not use higher-order questions? Some may use these practices more than others, but the differences are likely to be subtle. And subtle differences rarely produce important outcomes.
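For readers less familiar with the term, an "effect size" in this literature is simply a standardized mean difference between treatment and control posttests. A minimal sketch of the computation, with entirely hypothetical scores:

```python
# Illustrative sketch of a standardized mean difference (Cohen's d),
# the "effect size" reported by reviews such as the WWC.
# All scores below are hypothetical.
import statistics

def effect_size(treatment, control):
    """Difference in means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    mean_diff = statistics.mean(treatment) - statistics.mean(control)
    pooled_var = ((n_t - 1) * statistics.variance(treatment) +
                  (n_c - 1) * statistics.variance(control)) / (n_t + n_c - 2)
    return mean_diff / pooled_var ** 0.5

treatment = [45, 55, 60, 70, 52, 58]  # hypothetical posttest scores
control = [42, 50, 57, 66, 48, 55]

print(f"Effect size: {effect_size(treatment, control):.2f}")  # Effect size: 0.44
```

Dividing by the pooled standard deviation puts effects measured on different tests on a common scale, which is what allows effect sizes from different experiments to be compared at all.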

The distinction between programs and practices has a lot to do with the practices (not programs) promoted by John Hattie. He wants to identify practices that can help teachers know what works in instruction. That's a noble goal, but it can rarely be achieved using real classroom research carried out over realistic periods of time. In order to isolate particular practices for study, researchers often run very brief, artificial lab studies that have little to do with classroom practice. For example, some lab studies in Hattie's own review of feedback contrast teachers giving feedback with teachers giving no feedback at all. What teacher would do that?

It is worthwhile to use what we know from research, experience, program evaluations, and theory to discuss which practices may be most useful for teachers. But claiming particular effect sizes on the basis of such studies is rarely justified. The strongest evidence for practical use in schools will almost always come from experiments evaluating programs. Practices have their place, but exposing teachers to a lot of practices and expecting them to assemble them into improved student outcomes is not likely to work.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.