How Can You Tell When The Findings of a Meta-Analysis Are Likely to Be Valid?

In Baltimore, Faidley’s, founded in 1886, is a much loved seafood market inside Lexington Market. Faidley’s used to be a real old-fashioned market, with sawdust on the floor and an oyster bar in the center. People lined up behind their favorite oyster shucker. In a longstanding tradition, the oyster shuckers picked oysters out of crushed ice and tapped them with their oyster knives. If they sounded full, they opened them. But if they did not, the shuckers discarded them.

I always noticed that the line was longer behind the shucker who was discarding the most oysters. Why? Because everyone knew that the shucker who was pickier was more likely to come up with a dozen fat, delicious oysters, instead of, say, nine great ones and three…not so great.

I bring this up today to tell you how to pick full, fair meta-analyses on educational programs. No, you can’t tap them with an oyster knife, but otherwise, the process is similar. You want meta-analysts who are picky about what goes into their meta-analyses. Your goal is to make sure that a meta-analysis produces results that truly represent what teachers and schools are likely to see in practice when they thoughtfully implement an innovative program. If instead you pick the meta-analysis with the biggest effect sizes, you will always be disappointed.

As a special service to my readers, I’m going to let you in on a few trade secrets about how to quickly evaluate a meta-analysis in education.

One very easy way to evaluate a meta-analysis is to look at the overall effect size, probably shown in the abstract. If the overall mean effect size is more than about +0.40, you probably don’t have to read any further. Unless the treatment is tutoring or some other treatment that you would expect to make a massive difference in student achievement, it is rare to find a single legitimate study with an effect size that large, much less an average that large. A very large effect size is almost a guarantee that a meta-analysis is full of studies with design features that greatly inflate effect sizes, not studies with outstandingly effective treatments.

Next, go to the Methods section, which will have within it a section on inclusion (or selection) criteria. It should list the types of studies that were or were not accepted into the study. Some of the criteria will have to do with the focus of the meta-analysis, specifying, for example, “studies of science programs for students in grades 6 to 12.” But your focus is on the criteria that specify how picky the meta-analysis is. As one example of a picky set of criteria, here are the main ones we use in Evidence for ESSA and in every analysis we write:

  1. Studies had to use random assignment or matching to assign students to experimental or control groups, with schools and students in each specified in advance.
  2. Students assigned to the experimental group had to be compared to very similar students in a control group, which uses business-as-usual. The experimental and control students must be well matched, within a quarter standard deviation at pretest (ES=+0.25), and attrition (loss of subjects) must be no more than 15% higher in one group than the other at the end of the study. Why? It is essential that experimental and control groups start and remain the same in all ways other than the treatment. Controls for initial differences do not work well when the differences are large.
  3. There must be at least 30 experimental and 30 control students. Analyses of combined effect sizes must control for sample sizes. Why? Evidence finds substantial inflation of effect sizes in very small studies.
  4. The treatments must be provided for at least 12 weeks. Why? Evidence finds major inflation of effect sizes in very brief studies, and brief studies do not represent the reality of the classroom.
  5. Outcome measures must be measures independent of the program developers and researchers. Usually, this means using national tests of achievement, though not necessarily standardized tests. Why? Research has found that tests made by researchers can inflate effect sizes by double, or more, and research-made measures do not represent the reality of classroom assessment.

There may be other details, but these are the most important. Note that there is a double focus of these standards. Each is intended both to minimize bias and to maximize similarity to the conditions faced by schools. What principal or teacher who cares about evidence would be interested in adopting a program evaluated in comparison to a very different control group? Or in a study with few subjects, or a very brief duration? Or in a study that used measures made by the developers or researchers? This set is very similar to what the What Works Clearinghouse (WWC) requires, except #5 (the WWC requires exclusion of “overaligned” measures, but not developer-/researcher-made measures).
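
The five criteria above can be sketched as a simple screening function. This is only an illustration, not the actual Evidence for ESSA procedure: the `Study` record and its field names are hypothetical, and reading “no more than 15% higher” as a difference of 15 percentage points in attrition is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Hypothetical record of one study; field names are illustrative only."""
    design: str                # e.g. "randomized", "matched", "pre-post"
    pretest_diff_sd: float     # absolute pretest difference, in SD units
    attrition_exp: float       # proportion of experimental students lost
    attrition_ctl: float       # proportion of control students lost
    n_exp: int                 # experimental students
    n_ctl: int                 # control students
    weeks: float               # duration of the treatment
    independent_measure: bool  # outcome not made by developers/researchers


def failed_criteria(s: Study) -> list[str]:
    """Return the names of the criteria a study fails (empty list = qualifies)."""
    failures = []
    if s.design not in ("randomized", "matched"):
        failures.append("design")
    if s.pretest_diff_sd > 0.25:
        failures.append("pretest equivalence")
    # Assumption: "15% higher" is read as 15 percentage points.
    if abs(s.attrition_exp - s.attrition_ctl) > 0.15:
        failures.append("differential attrition")
    if s.n_exp < 30 or s.n_ctl < 30:
        failures.append("sample size")
    if s.weeks < 12:
        failures.append("duration")
    if not s.independent_measure:
        failures.append("independent measure")
    return failures


# Two made-up studies, for illustration only.
strong = Study("randomized", 0.10, 0.08, 0.12, 150, 140, 30, True)
weak = Study("pre-post", 0.00, 0.00, 0.00, 6, 0, 4, False)

print(failed_criteria(strong))
print(failed_criteria(weak))
```

Returning the list of failed criteria, rather than a single yes/no, makes it easy to see why a study was screened out.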

If these criteria are all there in the “Inclusion Standards,” chances are you are looking at a top-quality meta-analysis. As a rule, it will have average effect sizes lower than those you’ll see in reviews without some or all of these standards, but the effect sizes you see will probably be close to what you will actually get in student achievement gains if your school implements a given program with fidelity and thoughtfulness.

What I find astonishing is how many meta-analyses do not have standards this high. Among experts, these criteria are not controversial, except for the last one, which shouldn’t be. Yet meta-analyses are often written, and accepted by journals, with much lower standards, thereby producing greatly inflated, unrealistic effect sizes.

As one example, there was a meta-analysis of Direct Instruction programs in reading, mathematics, and language, published in the Review of Educational Research (Stockard et al., 2018). I have great respect for Direct Instruction, which has been doing good work for many years. But this meta-analysis was very disturbing.

The inclusion and exclusion criteria in this meta-analysis did not require experimental-control comparisons, did not require well-matched samples, and did not require any minimum sample size or duration. It was not clear how many of the outcome measures were made by program developers or researchers, rather than independent of the program.

With these minimal inclusion standards, and a very long time span (back to 1966), it is not surprising that the review found a great many qualifying studies. 528, to be exact. The review also reported extraordinary effect sizes: +0.51 for reading, +0.55 for math, and +0.54 for language. If these effects were all true and meaningful, it would mean that DI is much more effective than one-to-one tutoring, for example.

But don’t get your hopes up. The article included an online appendix that showed the sample sizes, study designs, and outcomes of every study.

First, the authors identified eight experimental designs (plus single-subject designs, which were treated separately). Only two of these would meet anyone’s modern standards of meta-analysis: randomized and matched. The others included pre-post gains (no control group), comparisons to test norms, and other pre-scientific designs.

Sample sizes were often extremely small. Leaving aside single-case experiments, there were dozens of single-digit sample sizes (e.g., six students), often with very large effect sizes. Further, there was no indication of study duration.
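
To see why tiny studies matter so much, here is a minimal sketch comparing a simple average of effect sizes with an average weighted by sample size. The four studies are invented for illustration and are not taken from the review's appendix. Weighting by sample size is only one way to “control for sample sizes”; it is an assumption here, not a description of any particular review's method.

```python
def unweighted_mean(studies):
    """Simple average: every study counts equally, however small."""
    return sum(es for es, _ in studies) / len(studies)


def n_weighted_mean(studies):
    """Average weighted by sample size: large studies count for more."""
    total_n = sum(n for _, n in studies)
    return sum(es * n for es, n in studies) / total_n


# (effect size, number of students): made-up values for illustration.
studies = [(1.20, 6), (0.90, 8), (0.10, 400), (0.15, 600)]

print(round(unweighted_mean(studies), 2))
print(round(n_weighted_mean(studies), 2))
```

In this made-up case, two studies with fewer than ten students each pull the simple average far above what the two large studies found, while the weighted average stays close to the large studies.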

What is truly astonishing is that RER accepted this study. RER is the top-rated journal in all of education, based on its citation count. Yet this review, and the Kulik & Fletcher (2016) review I cited in a recent blog, clearly did not meet minimal standards for meta-analyses.

My colleagues and I will be working in the coming months to better understand what has gone wrong with meta-analysis in education, and to propose solutions. Of course, our first step will be to spend a lot of time at oyster bars studying how they set such high standards. Oysters and beer will definitely be involved!

Photo credit: Annette White / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)

References

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: A meta-analytic review. Review of Educational Research, 86(1), 42–78.

Stockard, J., Wood, T. W., Coughlin, C., & Rasplica Khoury, C. (2018). The effectiveness of Direct Instruction curricula: A meta-analysis of a half century of research. Review of Educational Research, 88(4), 479–507. https://doi.org/10.3102/0034654317751919

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

After the Pandemic: Can We Welcome Students Back to Better Schools?

I am writing in March, 2020, at what may be the scariest point in the COVID-19 pandemic in the U.S. We are just now beginning to understand the potential catastrophe, and also to begin taking actions most likely to reduce the incidence of the disease.

One of the most important preventive measures is school closure. At this writing, thirty entire states have closed their schools, as have many individual districts, including Los Angeles. It is clear that school closures will go far beyond this, both in the U.S. and elsewhere.

I am not an expert on epidemiology, but I did want to make some observations about how widespread school closure could affect education, and (ever the optimist) how this disaster could provide a basis for major improvements in the long run.

Right now, schools are closing for a few weeks, with an expectation that after spring break, all will be well again, and schools might re-open. From what I read, this is unlikely. The virus will continue to spread until it runs out of vulnerable people. The purpose of school closures is to reduce the rate of transmission. Children themselves tend not to get the disease, for some reason, but they do transmit the disease, mostly at school (and then to adults). Only when there are few new cases to transmit can schools be responsibly re-opened. No one knows for sure, but a recent article in Education Week predicted that schools will probably not re-open this school year (Will, 2020). Kansas is the first state to announce that schools will be closed for the rest of the school year, but others will surely follow.

Will students suffer from school closure? There will be lasting damage if students lose parents, grandparents, and other relatives, of course. Their achievement may take a dip, but a remarkable study reported by Ceci (1991) examined the impact of two or more years of school closures in the Netherlands in World War II, and found an initial loss in IQ scores that quickly rebounded after schools re-opened after the war. From an educational perspective, the long-term impact of closure itself may not be so bad. A colleague, Nancy Karweit (1989), studied achievement in districts with long teacher strikes, and did not find much of a lasting impact.

In fact, there is a way in which wise state and local governments might use an opportunity presented by school closures. If schools closing now stay closed through the end of the school year, that could leave large numbers of teachers and administrators with not much to do (assuming they are not furloughed, which could happen). Imagine that, where feasible, this time were used for school leaders to consider how they could welcome students back to much improved schools, and to provide teachers with (electronic) professional development to implement proven programs. This might involve local, regional, or national conversations focused on what strategies are known to be effective for each of the key objectives of schooling. For example, a national series of conversations could take place on proven strategies for beginning reading, for middle school mathematics, for high school science, and so on. By design, the conversations would be focused not just on opinions, but on rigorous evidence of what works. A focus on improving health and disease prevention would be particularly relevant to the current crisis, along with implementing proven academic solutions.

Particular districts might decide to implement proven programs, and then use school closure to provide time for high-quality professional development on instructional strategies that meet the ESSA evidence standards.

Of course, all of the discussion and professional development would have to be done using electronic communications, for obvious reasons of public health. But might it be possible to make wise use of school closure to improve the outcomes of schooling using professional development in proven strategies? With rapid rollout of existing proven programs and dedicated funding, it certainly seems possible.

States and districts are making a wide variety of decisions about what to do during the time that schools are closed. Many are moving to e-learning, but this may be of little help in areas where many students lack computers or access to the internet at home. In some places, a focus on professional development for next school year may be the best way to make the best of a difficult situation.

There have been many times in the past when disasters have led to lasting improvements in health and education. This could be one of these opportunities, if we seize the moment.

Photo credit: Liam Griesacker

References

Ceci, S. J. (1991). How much does schooling influence general intelligence and its cognitive components? A reassessment of the evidence. Developmental Psychology, 27(5), 703–722. https://doi.org/10.1037/0012-1649.27.5.703

Karweit, N. (1989). Time and learning: A review. In R. E. Slavin (Ed.), School and Classroom Organization. Hillsdale, NJ: Erlbaum.

Will, M. (2020, March 15). School closure for the coronavirus could extend to the end of the school year, some say. Education Week.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Cooperative Learning and Achievement

Once upon a time, two teachers went together to an evening workshop on effective teaching strategies. The speaker was dynamic, her ideas were interesting, and everyone in the large audience enjoyed the speech. Afterwards, the two teachers drove back to the town where they lived. The driver talked excitedly with her friend about all the wonderful ideas they’d heard, raised questions about how to put them into practice, and related them to things she’d read, heard, and experienced before.

After an hour’s drive, however, the driver realized that her friend had been asleep for the whole return trip.

Now here’s my question: who learned the most from the speech? Both the driver and her friend were equally excited by the speech and paid equal attention to it. Yet no one would doubt that the driver learned much more, because after the lecture, she talked all about it, thinking her friend was awake.

Every teacher knows how much they learn about any topic by teaching it, or discussing it with others. Imagine how much more the driver and her friend would have learned from the lecture if they had both been participating fully, sharing ideas, perceptions, agreements, disagreements, and new ideas.

So far, this is all obvious, right? Everyone knows that people learn when they are engaged, when they have opportunities to discuss with others, explain to others, ask questions of others, and receive explanations.

Yet in traditionally organized classes, learning does not often happen like this. Teachers teach, students listen, and if genuine discussion takes place at all, it is between the teacher and a small minority of students who always raise their hands and ask good questions. Even in the most exciting and interactive of classes, many students, often a majority, say little or nothing. They may give an answer if called upon, but “giving an answer” is not at all the same as engagement. Even in classes that are organized in groups and encourage group interaction, some students do most of the participating, while others just watch, at best. Evidence from research, especially studies by Noreen Webb (2008), finds that the students who learn the most in group settings are those who give full explanations to others. These are the drivers, returning to my opening story. Those who receive a lot of explanations also learn. Who learns least? Those who neither explain nor receive explanations.

For achievement outcomes, it is not enough to put students into groups and let them talk. Research finds that cooperative learning works best when there are group goals and individual accountability. That is, groups can earn recognition or small privileges (e.g., lining up first for recess) if the average of each team member’s score meets a high standard. The purpose of group goals and individual accountability is to incentivize team members to help and encourage each other to excel, and to avoid having, for example, one student do all the work while the others watch (Chapman, 2001). Students can be silent in groups, as they can be in class, but this is less likely if they are working with others toward a common goal that they can achieve only if all team members succeed.

The effectiveness of cooperative learning for enhancing achievement has been known for a long time (see Rohrbeck et al., 2003; Roseth et al., 2008; Slavin, 1995, 2014). Forms of cooperative learning are frequently seen in elementary and secondary schools, but they are far from standard practice. Forms of cooperative learning that use group goals and individual accountability are even more rare.

There are many examples of programs that incorporate cooperative learning and meet the ESSA Strong or Moderate standards in reading, math, SEL, and attendance. You can see descriptions of the programs by visiting www.evidenceforessa.org and clicking on the cooperative learning filter. As you can see, it is remarkable how many of the programs identified as effective for improving student achievement by the What Works Clearinghouse or Evidence for ESSA make use of well-structured cooperative learning, usually with students working in teams or groups of 4-5 students, mixed in past performance. In fact, in reading and mathematics, only one-to-one or small-group tutoring is more effective than approaches that make extensive use of cooperative learning.

There are many successful approaches to cooperative learning adapted for different subjects, specific objectives, and age levels (see Slavin, 1995). There is no magic to cooperative learning; outcomes depend on use of proven strategies and high-quality implementation. The successful forms of cooperative learning provide at least a good start for educators seeking ways to make school engaging, exciting, social, and effective for learning. Students not only learn from cooperation in small groups, but they love to do so. They are typically eager to work with their classmates. Why shouldn’t we routinely give them this opportunity?

References

Chapman, E. (2001, April). More on moderations in cooperative learning outcomes. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.

Rohrbeck, C. A., Ginsburg-Block, M. D., Fantuzzo, J. W., & Miller, T. R. (2003). Peer-assisted learning interventions with elementary school students: A meta-analytic review. Journal of Educational Psychology, 95(2), 240–257.

Roseth, C., Johnson, D., & Johnson, R. (2008). Promoting early adolescents’ achievement and peer relationships: The effects of cooperative, competitive, and individualistic goal structures. Psychological Bulletin, 134(2), 223–246.

Slavin, R. E. (1995). Cooperative learning: Theory, research, and practice (2nd ed.). Boston, MA: Allyn & Bacon.

Slavin, R. E. (2014). Make cooperative learning powerful: Five essential strategies to make cooperative learning effective. Educational Leadership, 72(2), 22–26.

Webb, N. M. (2008). Learning in small groups. In T. L. Good (Ed.), 21st century learning (Vol. 1, pp. 203–211). Thousand Oaks, CA: Sage.

Photo courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

New Sections on Social Emotional Learning and Attendance in Evidence for ESSA!

We are proud to announce the launch of two new sections of our Evidence for ESSA website (www.evidenceforessa.org): K-12 social-emotional learning and attendance. Funded by a grant from the Bill and Melinda Gates Foundation, the new sections represent our first foray beyond academic achievement.

The social-emotional learning section represents the greatest departure from our prior work. This is due to the nature of SEL, which combines many quite diverse measures. We identified 17 distinct measures, which we grouped in four overarching categories, as follows:

Academic Competence

  • Academic performance
  • Academic engagement

Problem Behaviors

  • Aggression/misconduct
  • Bullying
  • Disruptive behavior
  • Drug/alcohol abuse
  • Sexual/racial harassment or aggression
  • Early/risky sexual behavior

Social Relationships

  • Empathy
  • Interpersonal relationships
  • Pro-social behavior
  • Social skills
  • School climate

Emotional Well-Being

  • Reduction of anxiety/depression
  • Coping skills/stress management
  • Emotional regulation
  • Self-esteem/self-efficacy

Evidence for ESSA reports overall effect sizes and ratings for each of the four categories, as well as the 17 individual measures (which are themselves composed of many measures used by various qualifying studies). So in contrast to reading and math, where programs are rated based on the average of all qualifying reading or math measures, an SEL program could be rated “strong” in one category, “promising” in another, and “no qualifying evidence” or “qualifying studies found no significant positive effects” on others.

Social-Emotional Learning

The SEL review, led by Sooyeon Byun, Amanda Inns, Cynthia Lake, and Liz Kim at Johns Hopkins University, located 24 SEL programs that both met our inclusion standards and had at least one study that met strong, moderate, or promising standards on at least one of the four categories of outcomes.

There is much more evidence at the elementary and middle school levels than at the high school level. Recognizing that some programs had qualifying outcomes at multiple levels, there were 7 programs with positive evidence for pre-K/K, 10 for 1-2, 13 for 3-6, and 9 for middle school. In contrast, there were only 4 programs with positive effects in senior high schools. Fourteen studies took place in urban locations, 5 in suburbs, and 5 in rural districts.

The outcome variables most often showing positive impacts include social skills (12), school climate (10), academic performance (10), pro-social behavior (8), aggression/misconduct (7), disruptive behavior (7), academic engagement (7), interpersonal relationships (7), anxiety/depression (6), bullying (6), and empathy (5). Fifteen of the programs targeted whole classes or schools, and 9 targeted individual students.

Several programs stood out in terms of the size of the impacts. Take the Lead found effect sizes of +0.88 for social relationships and +0.51 for problem behaviors. Check, Connect, and Expect found effect sizes of +0.51 for emotional well-being, +0.29 for problem behaviors, and +0.28 for academic competence. I Can Problem Solve found effect sizes of +0.57 on school climate. The Incredible Years Classroom and Parent Training Approach reported effect sizes of +0.57 for emotional regulation, +0.35 for pro-social behavior, and +0.21 for aggression/misconduct. The related Dinosaur School classroom management model reported effect sizes of +0.31 for aggression/misconduct. Class-Wide Function-Related Intervention Teams (CW-FIT), an intervention for elementary students with emotional and behavioral disorders, had effect sizes of +0.47 and +0.30 across two studies for academic engagement and +0.38 and +0.21 for disruptive behavior. It also reported effect sizes of +0.37 for interpersonal relationships, +0.28 for social skills, and +0.26 for empathy. Student Success Skills reported effect sizes of +0.30 for problem behaviors, +0.23 for academic competence, and +0.16 for social relationships.

In addition to the 24 highlighted programs, Evidence for ESSA lists 145 programs that were no longer available, had no qualifying studies (e.g., no control group), or had one or more qualifying studies but none that met the ESSA Strong, Moderate, or Promising criteria. These programs can be found by clicking on the “search” bar.

There are many problems inherent to interpreting research on social-emotional skills. One is that some programs may appear more effective than others because they use measures such as self-report, or behavior ratings by the teachers who taught the program. In contrast, studies that used more objective measures, such as independent observations or routinely collected data, may obtain smaller impacts. Also, SEL studies typically measure many outcomes and only a few may have positive impacts.

In the coming months, we will be doing analyses and looking for patterns in the data, and will have more to say about overall generalizations. For now, the new SEL section provides a guide to what we know now about individual programs, but there is much more to learn about this important topic.

Attendance

Our attendance review was led by Chenchen Shi, Cynthia Lake, and Amanda Inns. It located ten attendance programs that met our standards. Only three of these reported on chronic absenteeism, which refers to students missing more than 10% of days. Many more focused on average daily attendance (ADA). Among programs focused on average daily attendance, a Milwaukee elementary school program called SPARK had the largest impact (ES=+0.25). This is not an attendance program per se, but it uses AmeriCorps members to provide tutoring services across the school, as well as involving families. SPARK has been shown to have strong effects on reading, as well as its impressive effects on attendance. Positive Action is another schoolwide approach, in this case focused on SEL. It has been found in two major studies in grades K-8 to improve student reading and math achievement, as well as overall attendance, with a mean effect size of +0.20.
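
The two attendance measures mentioned above answer different questions, which a small sketch can make concrete. The attendance counts below are invented, and the function names are hypothetical; the only thing taken from the text is the definition of chronic absenteeism as missing more than 10% of days.

```python
def average_daily_attendance(days_present, days_enrolled):
    """Share of all enrolled student-days on which students were present."""
    return sum(days_present) / sum(days_enrolled)


def chronic_absence_rate(days_present, days_enrolled, threshold=0.10):
    """Share of students who missed more than `threshold` of their enrolled days."""
    chronic = [
        (enrolled - present) / enrolled > threshold
        for present, enrolled in zip(days_present, days_enrolled)
    ]
    return sum(chronic) / len(chronic)


# Five made-up students, each enrolled for 180 days.
days_enrolled = [180, 180, 180, 180, 180]
days_present = [178, 175, 170, 150, 180]

print(round(average_daily_attendance(days_present, days_enrolled), 3))
print(chronic_absence_rate(days_present, days_enrolled))
```

In this invented example, average daily attendance is close to 95%, yet one student in five is chronically absent. This is why a program's effect on ADA and its effect on chronic absenteeism can differ.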

The one program to report data on both ADA and chronic absenteeism is called Attendance and Truancy Intervention and Universal Procedures, or ATI-UP. It reported an effect size in grades K-6 of +0.19 for ADA and +0.08 for chronic absenteeism. Talent Development High School (TDHS) is a ninth grade intervention program that provides interdisciplinary learning communities and “double dose” English and math classes for students who need them. TDHS reported an effect size of +0.17.

An interesting approach with a modest effect size but very modest cost is now called EveryDay Labs (formerly InClass Today). This program helps schools organize and implement a system to send postcards to parents reminding them of the importance of student attendance. If students start missing school, the postcards include this information as well. The effect size across two studies was a respectable +0.16.

As with SEL, we will be doing further work to draw broader lessons from research on attendance in the coming months. One pattern that seems clear already is that effective attendance improvement models work on building close relationships between at-risk students and concerned adults. None of the effective programs primarily uses punishment to improve attendance, but instead they focus on providing information to parents and students and on making it clear to students that they are welcome in school and missed when they are gone.

Both SEL and attendance are topics of much discussion right now, and we hope these new sections will be useful and timely in helping schools make informed choices about how to improve social-emotional and attendance outcomes for all students.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Do School Districts Really Have Difficulty Meeting ESSA Evidence Standards?

The Center on Education Policy (CEP) recently released a report on how school districts are responding to the Every Student Succeeds Act (ESSA) requirement that schools seeking school improvement grants select programs that meet ESSA’s strong, moderate, or promising standards of evidence. Education Week ran a story on the CEP report.

The report noted that many states, districts, and schools are taking the evidence requirements seriously, and are looking at websites and consulting with researchers to help them identify programs that meet the standards. This is all to the good.

However, the report also notes continuing problems districts and schools are having finding out “what works.” Two particular problems were cited. One was that districts and schools were not equipped to review research to find out what works. The other was that rural districts and schools found few programs proven effective in rural schools.

I find these concerns astounding. The same concerns were expressed when ESSA was first passed, in 2015. But that was almost four years ago. Since 2015, the What Works Clearinghouse has added information to help schools identify programs that meet the top two ESSA evidence categories, strong and moderate. Our own Evidence for ESSA, launched in February, 2017, has up-to-date information on virtually all PK-12 reading and math programs currently in dissemination. Among hundreds of programs examined, 113 meet ESSA standards for strong, moderate, or promising evidence of effectiveness. WWC, Evidence for ESSA, and other sources are available online at no cost. The contents of the entire Evidence for ESSA website were imported into Ohio’s own website on this topic, and dozens of states, perhaps all of them, have informed their districts and schools about these sources.

The idea that districts and schools could not find information on proven programs if they wanted to do so is difficult to believe, especially among schools eligible for school improvement grants. Such schools, and the districts in which they are located, write a lot of grant proposals for federal and state funding. The application forms for school improvement grants always explain the evidence requirements, because that is the law. Someone in every state involved with federal funding knows about the WWC and Evidence for ESSA websites. More than 90,000 unique users have used Evidence for ESSA, and more than 800 more sign on each week.

As to rural schools, it is true that many studies of educational programs have taken place in urban areas. However, 47 of the 113 programs qualified by Evidence for ESSA were validated in at least one rural study, or a study including a large enough rural sample to enable researchers to separately report program impacts for rural students. Also, almost all widely disseminated programs have been used in many rural schools. So rural districts and schools that care about evidence can find programs that have been evaluated in rural locations, or at least that were evaluated in urban or suburban schools but widely disseminated in rural schools.

Also, it is important to note that if a program was successfully evaluated only in urban or suburban schools, the program still meets the ESSA evidence standards. If no studies of a given outcome were done in rural locations, a rural school in need of better outcomes could, in effect, be asked to choose between a program proven to work somewhere, and probably already in use in rural schools, and a program not proven to work anywhere. Every school and district has to make the best choices for its kids, but if I were a rural superintendent or principal, I’d read up on proven programs, and then go visit some nearby rural schools using one of them. Wouldn’t you?

I have no reason to suspect that the CEP survey is incorrect. There are many indications that district and school leaders often do feel that the ESSA evidence rules are too difficult to meet. So what is really going on?

My guess is that there are many district and school leaders who do not want to know about evidence on proven programs. For example, they may have longstanding, positive relationships with representatives of publishers or software developers, or they may be comfortable and happy with the materials and services they are already using, evidence-proven or not. If their products do not have evidence of effectiveness that would pass muster with WWC or Evidence for ESSA, the publishers and software developers may push hard on state and district officials, put forward dubious claims for evidence (such as studies with no control groups), and do their best to get by in a system that increasingly demands evidence that they lack. In my experience, district and state officials often complain about having inadequate staff to review evidence of effectiveness, but their concern may be less about finding out what works than about defending themselves from publishers, software developers, or current district or school users of programs, who maintain that they have been unfairly rated by WWC, Evidence for ESSA, or other reviews. State and district leaders who stand up to this pressure may have to spend a lot of time reviewing evidence or hearing arguments.

On the plus side, at the same time that publishers and software producers may be seeking recognition for their current products, many are also sponsoring evaluations of the products that they feel are most likely to perform well in rigorous evaluations. Some may be creating new programs that resemble programs that have met evidence standards. If the federal ESSA law continues to demand evidence for certain federal funding purposes, or even expands this requirement to additional parts of federal grant-making, then over time the ESSA law will have its desired effect, rewarding the creation and evaluation of programs that do meet standards by making it easier to disseminate such programs. The difficulties the evidence movement is experiencing are likely to diminish over time as more proven programs appear, and as federal, state, district, and school leaders get comfortable with evidence.

Evidence-based reform was always going to be difficult, because of the amount of change it entails and the stakes involved. But sooner or later, it is the right thing to do, and leaders who insist on evidence will see increasing levels of learning among their students, at minimal cost beyond what they already spend on untested or ineffective approaches. Medicine went through a similar transition in 1962, when the U.S. Congress first required that medicines be rigorously evaluated for effectiveness and safety. At first, many leaders in the medical profession resisted the changes, but after a while, they came to insist on them. The key is political leadership willing to support the evidence requirement strongly and permanently, so that educators and vendors alike will see that the best way forward is to embrace evidence and make it work for kids.

Photo courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Send Us Your Evaluations!

In last week’s blog, I wrote about reasons that many educational leaders are wary of the ESSA evidence standards, and the evidence-based reform movement more broadly. Chief among these concerns was a complaint that few educational leaders had the training in education research methods to evaluate the validity of educational evaluations. My response to this was to note that it should not be necessary for educational leaders to read and assess individual evaluations of educational programs, because free, easy-to-interpret review websites, such as the What Works Clearinghouse and Evidence for ESSA, already do such reviews. Our Evidence for ESSA website (www.evidenceforessa.org) lists reading and math programs available for use anywhere in the U.S., and we are constantly on the lookout for any we might have missed. If we have done our job well, you should be able to evaluate the evidence base for any program, in perhaps five minutes.

Other evidence-based fields rely on evidence reviews. Why not education? Your physician may or may not know about medical research, but most rely on websites that summarize the evidence. Farmers may be outstanding in their fields, but they rely on evidence summaries. When you want to know about the safety and reliability of cars you might buy, you consult Consumer Reports. Do you understand exactly how they get their ratings? Neither do I, but I trust their expertise. Why should this not be the same for educational programs?

At Evidence for ESSA, we are aiming to provide information on every program available to you, if you are a school or district leader. At the moment, we cover reading and mathematics, grades pre-k to 12. We want to be sure that if a sales rep or other disseminator offers you a program, you can look it up on Evidence for ESSA and it will be there. If there are no studies of the program that meet our standards, we will say so. If there are qualifying studies that either do or do not have evidence of positive outcomes that meet ESSA evidence standards, we will say so. On our website, there is a white box on the homepage. If you type in the name of any reading or math program, the website should show you what we have been able to find out.

What we do not want to happen is that you type in a program title and find nothing. On our website, “nothing” has no useful meaning. We have worked hard to find every program anyone has heard of, and we have found hundreds. But if you know of any reading or math program that does not appear when you type in its name, please tell us. If you have studies of that program that might meet our inclusion criteria, please send us the studies or citations to them. We know that there are always additional programs entering use, and additional research on existing programs.

Why is this so important to us? The answer is simple: Evidence for ESSA exists because we believe it is essential for the progress of evidence-based reform that educators and policy makers be confident that they can easily find the evidence on any program, not just the most widely used. Our vision is that someday, it will be routine for educators thinking of adopting educational programs to quickly consult Evidence for ESSA (or other reviews) to find out what has been proven to work, and what has not. I heard about a superintendent who, before meeting with any sales rep, asked them to show her the evidence for the effectiveness of their program on Evidence for ESSA or the What Works Clearinghouse. If they had it, “Come on in,” she’d say. If not, “Maybe later.”

Only when most superintendents and other school officials do this will program publishers and other providers know that it is worth their while to have high-quality evaluations done of each of their programs. Further, they will find it worthwhile to invest in the development of programs likely to work in rigorous evaluations, to provide enough quality professional development to give their programs a chance to succeed, and to insist that schools that adopt their proven programs incorporate the methods, materials, and professional development that their own research has told them are needed for success. Insisting on high-quality PD, for example, adds cost to a program, and providers may worry that demanding sufficient PD will price them out of the market. But if all programs are judged on their proven outcomes, they all will require adequate PD, to be sure that the programs will work when evaluated. That is how evidence will transform educational practice and outcomes.

So our attempt to find and fairly evaluate every program in existence is not due to our being nerds or obsessive compulsive neurotics (though these may be true, too). Rather, thorough, rigorous review of the whole body of evidence in every subject and grade level, and for attendance, social emotional learning, and other non-academic outcomes, is part of a plan.

You can help us on this part of our plan. Tell us about anything we have missed, or any mistakes we have made. You will be making an important contribution to the progress of our profession, and to the success of all children.

Send us your evaluations!

Photo credit: George Grantham Bain Collection, Library of Congress [Public domain]


Why Do Some Educators Push Back Against Evidence?

In December, 2015, the U.S. Congress passed the Every Student Succeeds Act, or ESSA. Among many other provisions, ESSA defined levels of evidence supporting educational programs: Strong (at least one randomized experiment with positive outcomes), moderate (at least one quasi-experimental study with positive outcomes), and promising (at least one correlational study with positive outcomes). For various forms of federal funding, schools are required (in school improvement) or encouraged (in seven other funding streams) to use programs falling into one of these top three categories. There is also a fourth category, “demonstrates a rationale,” but this one has few practical consequences.
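For readers who think in code, the way these categories fit together can be sketched in a few lines of Python. This is only an illustration: the names `Study`, `essa_tier`, and the design labels are hypothetical, not anything defined in the law, and the sketch leaves out the judgment about whether a study was well designed and well implemented.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical labels for study designs, mapped to the ESSA category
# that a single positive study of that design can earn.
TIER_BY_DESIGN = {
    "randomized": "strong",
    "quasi-experimental": "moderate",
    "correlational": "promising",
}
TIER_ORDER = ["strong", "moderate", "promising"]  # highest category first


@dataclass
class Study:
    design: str             # expected to be a key of TIER_BY_DESIGN
    positive_outcome: bool  # True if outcomes were significantly positive


def essa_tier(studies: Iterable[Study]) -> str:
    """Return the highest category earned by any single positive study."""
    earned = {
        TIER_BY_DESIGN[s.design]
        for s in studies
        if s.positive_outcome and s.design in TIER_BY_DESIGN
    }
    for tier in TIER_ORDER:
        if tier in earned:
            return tier
    # No positive study in any of the three designs.
    return "no qualifying evidence"
```

The point the sketch makes is that one positive study of the most rigorous design is enough to set the category, no matter how many other studies found nothing.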

Three and a half years later, the ESSA evidence standards are increasing interest in evidence of effectiveness for educational programs, especially among schools applying for school improvement funding and in state departments of education, which are responsible for managing the school improvement grant process. All of this is to the good, in my view.

On the other hand, evidence is not yet transforming educational practice. Even in portions of ESSA that encourage or require use of proven programs among schools seeking federal funding, schools and districts often try to find ways around the evidence requirements rather than truly embracing them. Even when schools do say they used evidence in their proposals, they may have just accepted assurances from publishers or developers stating that their programs meet ESSA standards, even when this is clearly not so.

Why are these children in India pushing back on a car?  And why do many educators in our country push back on evidence?

Educators care a great deal about their children’s achievement, and they work hard to ensure their success. Implementing proven, effective programs does not guarantee success, but it greatly increases the chances. So why has evidence of effectiveness played such a limited role in program selection and implementation, even when ESSA, the national education law, defines evidence and requires use of proven programs under certain circumstances?

The Center on Education Policy Report

Not long ago, the Center on Education Policy (CEP) at George Washington University published a report of telephone interviews of state leaders in seven states. The interviews focused on problems states and districts were having with implementation of the ESSA evidence standards. Six themes emerged:

  1. Educational leaders are not comfortable with educational research methods.
  2. State leaders feel overwhelmed serving large numbers of schools qualifying for school improvement.
  3. Districts have to seriously re-evaluate longstanding relationships with vendors of education products.
  4. State and district staff are confused about the prohibition on using Title I school improvement funds on “Tier 4” programs (ones that demonstrate a rationale, but have not been successfully evaluated in a rigorous study).
  5. Some state officials complained that the U.S. Department of Education had not been sufficiently helpful with implementation of ESSA evidence standards.
  6. State leaders had suggestions to make education research more accessible to educators.

What is the Reality?

I’m sure that the concerns expressed by the state and district leaders in the CEP report are sincerely felt. But most of them raise issues that have already been solved at the federal, state, and/or district levels. If these concerns are as widespread as they appear to be, then we have serious problems of communication.

  1. The first theme in the CEP report is one I hear all the time. I find it astonishing, in light of the reality.

No educator needs to be a research expert to find evidence of effectiveness for educational programs. The federal What Works Clearinghouse (https://ies.ed.gov/ncee/wwc/) and our Evidence for ESSA (www.evidenceforessa.org) provide free information on the outcomes of programs, at least in reading and mathematics, that is easy to understand and interpret. Evidence for ESSA provides information on programs that do meet ESSA standards as well as those that do not. We are constantly scouring the literature for studies of replicable programs, and when asked, we review entire state and district lists of adopted programs and textbooks, at no cost. The What Works Clearinghouse is not as up-to-date and has little information on programs lacking positive findings, but it also provides easily interpreted information on what works in education.

In fact, few educational leaders anywhere are evaluating the effectiveness of individual programs by reading research reports one at a time. The What Works Clearinghouse and Evidence for ESSA employ experts who know how to find and evaluate outcomes of valid research and to describe the findings clearly. Why would every state and district re-do this job for themselves? It would be like having every state do its own version of Consumer Reports, or its own reviews of medical treatments. It just makes no sense. In fact, at least in the case of Evidence for ESSA, we know that more than 80,000 unique readers have used Evidence for ESSA since it launched in 2017. I’m sure even larger numbers have used the What Works Clearinghouse and other reviews. The State of Ohio took our entire Evidence for ESSA website and put it on its own state servers with some other information. Several other states have strongly promoted the site. The bottom line is that educational leaders do not have to be research mavens to know what works, and tens of thousands of them know where to find fair and useful information.

  2. State leaders are overwhelmed. I’m sure this is true, but most state departments of education have long been understaffed. This problem is not unique to ESSA.
  3. Districts have to seriously re-evaluate longstanding relationships with vendors. I suspect that this concern is at the core of the problem on evidence. The fact is that most commercial programs do not have adequate evidence of effectiveness. Either they have no qualifying studies (by far the largest number), or they do have qualifying evidence that is not significantly positive. A vendor with programs that do not meet ESSA standards is not going to be a big fan of evidence, or ESSA. These are often powerful organizations with deep personal relationships with state and district leaders. When state officials adhere to a strict definition of evidence, defined in ESSA, local vendors push back hard. Understaffed state departments are poorly placed to fight with vendors and their friends in district offices, so they may be forced to accept weak or no evidence.
  4. Confusion about Tier 4 evidence. ESSA is clear that to receive certain federal funds schools must use programs with evidence in Tiers 1, 2, or 3, but not 4. The reality is that definitions of Tier 4 are so weak that any program on Earth can meet this standard. What program anywhere does not have a rationale? The problem is that districts, states, and vendors have used confusion about Tier 4 to justify any program they wish. Some states are more sophisticated than others and do not allow this, but the very existence of Tier 4 in ESSA language creates a loophole that any clever sales rep or educator can use, or at least try to get away with.
  5. The U.S. Department of Education is not helpful enough. In reality, USDoE is understaffed and overwhelmed on many fronts. In any case, ESSA puts a lot of emphasis on state autonomy, so the feds feel unwelcome in performing oversight.

The Future of Evidence in Education

Despite the serious problems in implementation of ESSA, I still think it is a giant step forward. Every successful field, such as medicine, agriculture, and technology, has started its own evidence revolution fighting entrenched interests and anxious stakeholders. As late as the 1920s, surgeons refused to wash their hands before operations, despite substantial evidence going back to the 1800s that handwashing was essential. Evidence eventually triumphs, though it often takes many years. Education is just at the beginning of its evidence revolution, and it will take many years to prevail. But I am unaware of any field that embraced evidence, only to retreat in the face of opposition. Evidence eventually prevails because it is focused on improving outcomes for people, and people vote. Sooner or later, evidence will transform the practice of education, as it has in so many other fields.

Photo credit: Roger Price from Hong Kong, Hong Kong [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)]


Measuring Social Emotional Skills in Schools: Return of the MOOSES

Throughout the U.S., there is huge interest in improving students’ social emotional skills and related behaviors. This is indeed important as a means of building tomorrow’s society. However, measuring SEL skills is terribly difficult. Not that measuring reading, math, or science learning is easy, but there are at least accepted measures in those areas. In SEL, almost anything goes, and measures cover an enormous range. Some measures might be fine for theoretical research and some would be all right if they were given independently of the teachers who administered the treatment, but SEL measures are inherently squishy.

A few months ago, I wrote a blog on measurement of social emotional skills. In it, I argued that social emotional skills should be measured in pragmatic school research as objectively as possible, especially to avoid measures that merely reflect having students in experimental groups repeating back attitudes or terminology they learned in the program. I expressed the ideal for social emotional measurement in school experiments as MOOSES: Measurable, Observable, Objective, Social Emotional Skills.

Since that time, our group at Johns Hopkins University has received a generous grant from the Gates Foundation to add research on social emotional skills and attendance to our Evidence for ESSA website. This has enabled our group to dig a lot deeper into measures for social emotional learning. In particular, JHU graduate student Sooyeon Byun created a typology of SEL measures arrayed from least to most MOOSE-like. This is as follows.

  1. Cognitive Skills or Low-Level SEL Skills.

Examples include executive functioning tasks such as pencil tapping, the Stroop test, and other measures of cognitive regulation, as well as recognition of emotions. These skills may be of importance as part of theories of action leading to social emotional skills of importance to schools, but they are not goals of obvious importance to educators in themselves.

  2. Attitudes toward SEL (non-behavioral).

These include agreement with statements such as “bullying is wrong,” and statements about why other students engage in certain behaviors (e.g., “He spilled the milk because he was mean.”).

  3. Intention for SEL behaviors (quasi-behavioral).

Scenario-based measures (e.g., what would you do in this situation?).

  4. SEL behaviors based on self-report (semi-behavioral).

Reports of actual behaviors of self, or observations of others, often with frequencies (e.g., “How often have you seen bullying in this school during this school year?” or “How often do you feel anxious or afraid in class in this school?”).

This category was divided according to who is reporting:

4a. Interested party (e.g., report by teachers or parents who implemented the program and may have reason to want to give a positive report)

4b. Disinterested party (e.g., report by students or by teachers or parents who did not administer the treatment)

  5. MOOSES (Measurable, Observable, Objective Social Emotional Skills)
  • Behaviors observed by independent observers, either researchers, ideally unaware of treatment assignment, or by school officials reporting on behaviors as they always would, not as part of a study (e.g., regular reports of office referrals for various infractions, suspensions, or expulsions).
  • Standardized tests
  • Other school records


Uses for MOOSES

All other things being equal, school researchers and educators should want to know about measures as high as possible on the MOOSES scale. However, all things are never equal, and in practice, some measures lower on the MOOSES scale may be all that exists or ever could exist. For example, it is unlikely that school officials or independent observers could determine students’ anxiety or fear, so self-report (level 4b) may be essential. MOOSES measures (level 5) may be objectively reported by school officials, but limiting attention to such measures may limit SEL measurement to readily observable behaviors, such as aggression, truancy, and other behaviors of importance to school management, and leave out difficult-to-observe behaviors such as bullying.

Still, we expect to find in our ongoing review of the SEL literature that there will be enough research on outcomes measured at level 3 or above to enable us to downplay levels 1 and 2 for school audiences, and in many cases to downplay reports by interested parties in level 4a, where teachers or parents who implement a program then rate the behavior of the children they served.
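To make the filtering rule concrete, here is a minimal sketch in Python of the typology and of which levels would be emphasized for school audiences. The enum member names and the `emphasized` helper are hypothetical labels chosen for illustration, and the sketch treats "downplay" as a simple yes/no, although in practice this is a judgment made case by case.

```python
from enum import Enum


class SELMeasure(Enum):
    """Levels of the typology, listed from least to most MOOSE-like."""
    COGNITIVE_OR_LOW_LEVEL = "1"
    ATTITUDES = "2"
    INTENTIONS = "3"
    REPORT_INTERESTED_PARTY = "4a"
    REPORT_DISINTERESTED_PARTY = "4b"
    MOOSES = "5"


# Levels proposed for downplaying: 1, 2, and reports by interested parties (4a).
DOWNPLAYED = {
    SELMeasure.COGNITIVE_OR_LOW_LEVEL,
    SELMeasure.ATTITUDES,
    SELMeasure.REPORT_INTERESTED_PARTY,
}


def emphasized(measures):
    """Keep measures at level 3 or above, excluding interested-party reports."""
    return [m for m in measures if m not in DOWNPLAYED]
```

Applied to all six levels, `emphasized(list(SELMeasure))` keeps levels 3, 4b, and 5.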

Social emotional learning is important, and we need measures that reflect its importance, minimizing potential bias and staying as close as possible to independent, meaningful measures of behaviors that are of the greatest importance to educators. In our research team, we have very productive arguments about these measurement issues in the course of reviewing individual articles. I placed a cardboard cutout of a “principal” called “Norm” in our conference room. Whenever things get too theoretical, we consult “Norm” for his advice. For example, “Norm” is not too interested in pencil tapping and Stroop tests, but he sure cares a lot about bullying, aggression, and truancy. Of course, as part of our review we will be discussing our issues and initial decisions with real principals and educators, as well as other experts on SEL.

The growing number of studies of SEL in recent years enables reviewers to set higher standards than would have been feasible even just a few years ago. We still have to maintain a balance in which we can be as rigorous as possible but not end up with too few studies to review. We can all aspire to be MOOSES, but that is not practical for some measures. Instead, it is useful to have a model of the ideal and what approaches the ideal, so we can make sense of the studies that exist today, with all due recognition of when we are accepting measures that are nearly MOOSES but not quite the real Bullwinkle.


Replication

The holy grail of science is replication. If a finding cannot be repeated, then it did not happen in the first place. There is a reason that the humor journal in the hard sciences is called the Journal of Irreproducible Results. For scientists, results that are irreproducible are inherently laughable, therefore funny. In many hard science experiments, replication is pretty much guaranteed. If you heat an iron bar, it gets longer. If you cross parents with the same recessive gene, one quarter of their progeny will express the recessive trait (think blue eyes).


In educational research, we care about replication just as much as our colleagues in the lab coats across campus. However, when we’re talking about evaluating instructional programs and practices, replication is a lot harder, because students and schools differ. Positive outcomes obtained in one experiment may or may not replicate in a second trial. Sometimes this is because the first experiment had features known to contribute to bias. Small sample sizes, brief study durations, extraordinary amounts of resources or expert time to help the experimental schools or classes, use of measures made by the developers or researchers or otherwise overaligned with the experimental group (but not the control group), and use of matched rather than randomized assignment to conditions can all contribute to successful-appearing outcomes in a first experiment. Second or third experiments are more likely to be larger, longer, and more stringent than the first study, and therefore may not replicate its findings. Even when the first study has none of these problems, its findings may not replicate because of differences in the samples of schools, teachers, or students, or for other, perhaps unknowable, reasons.

A change in the conditions of education may also cause a failure to replicate. Our Success for All whole-school reform model has been found to be effective many times, mostly by third party evaluators. However, Success for All has always specified a full-time facilitator and at least one tutor for each school. An MDRC i3 evaluation happened to fall in the middle of the recession, and schools, which were struggling to afford classroom teachers, could not afford facilitators or tutors. The results were still positive on some measures, especially for low achievers, but the effect sizes were less than half of what others had found in many studies. Stuff happens.

Replication has taken on more importance recently because the ESSA evidence standards only require a single positive study. To meet the strong, moderate, or promising standards, programs must have at least one “well-designed and well-implemented” study using randomized (strong), matched (moderate), or correlational (promising) designs and finding significantly positive outcomes. Based on the “well-designed and well-implemented” language, our Evidence for ESSA website requires features of experiments similar to those also required by the What Works Clearinghouse (WWC). These requirements make it difficult for a program to be approved, but they remove many of the experimental design features that typically cause first studies to greatly overstate program impacts: small size, brief durations, overinvolved experimenters, and developer-made measures. They put (less rigorous) matched and correlational studies in lower categories. So one study that meets ESSA or Evidence for ESSA requirements is at least likely to be a very good study. But many researchers have expressed discomfort with the idea that a single study could qualify a program for one of the top ESSA categories, especially if (as sometimes happens) there is one study with positive outcomes and many with zero or at least nonsignificant outcomes.

The pragmatic problem is that if ESSA had required even two studies showing positive outcomes, this would wipe out a very large proportion of current programs. If research continues to identify effective programs, it should only be a matter of time before ESSA (or its successors) requires more than one study with positive outcomes.

However, in the current circumstance, there is a way researchers and educators might at least estimate the replicability of given programs when they have only a single study with significant positive outcomes. This would involve looking at the findings for entire genres of programs. The logic here is that if a program has only one ESSA-qualifying study, but it closely resembles other programs that also have positive outcomes, that program should be taken a lot more seriously than a program that obtained a positive outcome that differs considerably from outcomes of very similar programs.

As one example, there is much evidence from many studies by many researchers indicating positive effects of one-to-one and one-to-small group tutoring, in reading and mathematics. If a tutoring program has only one study, but this one study has significant positive findings, I’d say thumbs up. I’d say the same about cooperative learning approaches, classroom management strategies using behavioral principles, and many others, where a whole category of programs has had positive outcomes.

In contrast, if a program has a single positive outcome and there are few if any similar approaches that obtained positive outcomes, I’d be much more cautious. An example might be textbooks in mathematics, which rarely make any difference because control groups are also likely to be using textbooks, and textbooks considerably resemble each other. In our recent elementary mathematics review (Pellegrini, Lake, Inns, & Slavin, 2018), only one textbook program available in the U.S. had positive outcomes (out of 16 studies). As another example, there have been several large randomized evaluations of the use of interim assessments. Only one of them found positive outcomes. I’d be very cautious about putting much faith in benchmark assessments based on this single anomalous finding.

Looking for findings from similar studies is facilitated by looking at reviews we make available at www.bestevidence.org. These consist of reviews of research organized by categories of programs. Looking for findings from similar programs won’t help with the ESSA law, which often determines its ratings based on the findings of a single study, regardless of other findings on the same program or similar programs. However, for educators and researchers who really want to find out what works, I think checking similar programs is not quite as good as finding direct replication of positive findings on the same programs, but perhaps, as we like to say, close enough for social science.


Evidence, Standards, and Chicken Feathers

In 1509, John Damian, an alchemist in the court of James IV of Scotland, proclaimed that he had developed a way for humans to fly. He made himself some wings from chicken feathers and jumped from the battlements of Stirling Castle, the Scottish royal residence at the time. His flight was brief but not fatal. He landed in a pile of manure, and only broke his thigh. Afterward, he explained that the problem was that he used the wrong kind of feathers. If only he had used eagle feathers, he could have flown, he asserted. Fortunately for him, he never tried flying again, with any kind of feathers.

The story of John Damian’s downfall is humorous, and in fact the only record of it is a contemporary poem making fun of it. Yet there are important analogies to educational policy today from this incident in Scottish history. These are as follows:

  1. Damian proclaimed the success of his plan for human flight before he or anyone else had tried it and found it effective.
  2. After his flight ended in the manure pile, he proclaimed (again without evidence) that if only he’d used eagle feathers, he would have succeeded. This makes sense, of course, because eagles are much better flyers than chickens.
  3. He was careful never to actually try flying with eagle feathers.

All of this is more or less what we do all the time in educational policy, with one big exception. In education, based on Damian’s experience, we might have put forward policies stating that from now on human-powered flight must only be done with eagle feathers, not chicken feathers.

What I am referring to in education is our obsession with standards as a basis for selecting textbooks, software, and professional development, and the relative lack of interest in evidence. Whole states and districts spend a lot of time devising standards and then reviewing materials and services to be sure that they align with these standards. In contrast, promoting the idea of checking to see that texts, software, and PD have actually been evaluated and found to be effective in real classrooms with real teachers and students has been a hard slog.

Shouldn’t textbooks and programs that meet modern standards also produce higher student performance on tests closely aligned with those standards? This cannot be assumed. Not long ago, my colleagues and I examined every reading and math program rated “meets expectations” (the highest level) on EdReports, a website that rates programs in terms of their alignment with college- and career-ready standards. A not-so-grand total of two programs had any evidence of effectiveness on any measure not made by the publishers. Most programs rated “meets expectations” had no evidence at all, and a smaller number had been evaluated and found to make no difference.

I am not in any way criticizing EdReports.  They perform a very valuable service in helping schools and districts know which programs meet current standards. It makes no sense for every state and district to do this for themselves, especially in the cases where there are very few or no proven programs. It is useful to at least know about programs aligned with standards.

There is a reason that so few products favorably reviewed on EdReports have any positive outcomes in rigorous research. Most are textbooks, and very few textbooks have evidence of effectiveness. Why? The fact is that, standards or no standards, EdReports or no EdReports, textbooks do not differ very much from each other in aspects that matter for student learning. Textbooks differ (somewhat) in content, but if there is anything we have learned from our many reviews of research on what works in education, it is that what matters is pedagogy, not content. Yet since decisions about textbooks and software depend on standards and content, decision makers almost invariably select textbooks and software that have never been successfully evaluated.

Even crazy John Damian did better than we do. Yes, he claimed success in flying before actually trying it, but at least he did try it. He concluded that his flying plan would have worked if he’d used eagle feathers, but he never imposed this untested standard on anyone.

Untested textbooks and software probably don’t hurt anyone, but millions of students desperately need higher achievement, and focusing resources on untested or ineffective textbooks, software, and PD does not move them forward. The goal of education is to help all students succeed, not to see that they use aligned materials. If a program has been proven to improve learning, isn’t that a lot more important than proving that it aligns with standards? Ideally, we’d want schools and districts to use programs that are both proven effective and aligned with standards, but if no programs meet both criteria, shouldn’t those that are proven effective be preferred? Without evidence, aren’t we just giving students and teachers eagle feathers and asking them to take a leap of faith?

Photo credit: Humorous portrayal of a man who flies with wings attached to his tunic, Unknown author [Public domain], via Wikimedia Commons/Library of Congress

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.