Maximizing the Promise of “Promising” in ESSA

As anyone who reads my blogs is aware, I’m a big fan of the ESSA evidence standards. Yet there are many questions about the specific meaning of the definitions of strength of evidence for given programs. “Strong” is pretty clear: at least one study that used a randomized design and found a significant positive effect. “Moderate” requires at least one study that used a quasi-experimental design and found significant positive effects. There are important technical questions with these, but the basic intention is clear.

Not so with the third category, “promising.” It sounds clear enough: At least one correlational study with significantly positive effects, controlling for pretests or other variables. Yet what does this mean in practice?

The biggest problem is that correlation does not imply causation. Imagine, for example, that a study found a significant correlation between the numbers of iPads in schools and student achievement. Does this imply that more iPads cause more learning? Or could wealthier schools happen to have more iPads (and children in wealthy families have many opportunities to learn that have nothing to do with their schools buying more iPads)? The ESSA definitions do require controlling for other variables, but correlational studies lend themselves to error when they try to control for big differences.

Another problem is that a correlational study may not specify how much of a given resource is needed to show an effect. In the case of the iPad study, did positive effects depend on one iPad per class, or thirty (one per student)? It’s not at all clear.

Despite these problems, the law clearly defines “promising” as requiring correlational studies, and as law-abiding citizens, we must obey. But the “promising” category also allows for some additional types of studies that can fill important gaps that would otherwise lurk in the ESSA evidence standards.

The most important category involves studies in which schools or teachers (not individual students) were randomly assigned to experimental or control groups. Current statistical norms require that such studies use multilevel analyses, such as Hierarchical Linear Modeling (HLM). In essence, these are analyses at the cluster level (school or teacher), not the student level. The What Works Clearinghouse (WWC) requires use of statistics like HLM in clustered designs.

The problem is that it takes a lot of schools or teachers to have enough power to find significant effects. As a result, many otherwise excellent studies fail to find significant differences, and are not counted as meeting any standard in the WWC.

The Technical Working Group (TWG) that set the standards for our Evidence for ESSA website suggested a solution to this problem. Cluster randomized studies that fail to find significant effects are re-analyzed at the student level. If the student-level outcome is significantly positive, the program is rated as “promising” under ESSA. Note that all experiments are also correlational studies (just using a variable with only two possible values, experimental or control), and experiments in education almost always control for pretests and other factors, so our procedure meets the ESSA evidence standards’ definition for “promising.”
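To make the contrast concrete, here is a minimal sketch in Python using simulated data. It is not the actual analysis code behind Evidence for ESSA or the WWC; the number of schools, the intraclass correlation, and the true effect size are assumptions chosen only for illustration. It fits a multilevel model with a random intercept for school (the unit of assignment), and then a student-level model that ignores clustering, the kind of re-analysis that can support a “promising” rating.

```python
# A minimal sketch, not the WWC's or Evidence for ESSA's actual code: it
# contrasts a multilevel (HLM-style) analysis that respects school-level
# assignment with a naive student-level analysis that ignores clustering.
# All data are simulated; n_schools, students_per_school, true_effect, and
# icc are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_schools, students_per_school = 10, 100
true_effect, icc = 0.20, 0.15      # assumed effect size and intraclass correlation

rows = []
for school in range(n_schools):
    treat = int(school < n_schools // 2)          # half the schools use the program
    school_bump = rng.normal(0, np.sqrt(icc))     # between-school variation
    for _ in range(students_per_school):
        pretest = rng.normal(0, 1)
        posttest = (0.5 * pretest + true_effect * treat + school_bump
                    + rng.normal(0, np.sqrt(1 - icc)))
        rows.append((school, treat, pretest, posttest))
df = pd.DataFrame(rows, columns=["school", "treat", "pretest", "posttest"])

# Multilevel analysis: random intercept for school, the unit of random assignment.
hlm = smf.mixedlm("posttest ~ treat + pretest", df, groups=df["school"]).fit()

# Student-level analysis ignoring clustering: the basis for a "promising"
# rating when the cluster-level test falls short of significance.
ols = smf.ols("posttest ~ treat + pretest", df).fit()

print(f"Cluster-aware p-value for treatment: {hlm.pvalues['treat']:.3f}")
print(f"Student-level p-value for treatment: {ols.pvalues['treat']:.3f}")
```

With realistic intraclass correlations and only a handful of schools per condition, the cluster-aware p-value will often miss .05 while the student-level p-value does not, which is exactly the gap the “promising” rating is meant to acknowledge.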

Another situation in which “promising” is used for “just-missed” experiments is in the case of quasi-experiments. Like randomized experiments, these should be analyzed at the cluster level if treatment was at the school or classroom level. So if a quasi-experiment did not find significantly positive outcomes at the cluster level but did find significant positive effects at the student level, we include it as “promising.”

These procedures are important for the ESSA standards, but they are also useful for evaluators who are not able to recruit a large enough sample of schools or teachers for a fully powered randomized or quasi-experimental study. For example, imagine that a researcher evaluating a school-wide math program for tenth graders could only afford to recruit and serve 10 schools. She might deliberately use a design in which the 10 schools are randomly assigned to use the innovative math program (n=5) or to serve as a control group (n=5). A cluster randomized experiment with only 10 clusters is extremely unlikely to find a significant positive effect at the school level, but with perhaps 1,000 students per condition it would be very likely to find a significant effect at the student level, if the program is in fact effective. In this circumstance, the program could be rated, using our standard, as “promising,” an outcome true to the ordinary meaning of the word: not proven, but worth further investigation and investment.
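A rough power calculation makes the point. The sketch below is my own illustration, not part of the ESSA standards or Evidence for ESSA; the intraclass correlation (0.20), the true effect size (+0.25), and the 200 students per school are assumed values.

```python
# Rough, illustrative power comparison: a school-level test with 5 schools per
# condition versus a naive student-level test with 1,000 students per condition.
# The ICC and true effect size are assumptions chosen only for illustration.
import math
from statsmodels.stats.power import TTestIndPower

d = 0.25                 # assumed true effect size in student-level SD units
icc = 0.20               # assumed intraclass correlation (variance between schools)
schools_per_arm = 5
students_per_school = 200

# School-level analysis: each school contributes one mean, and the SD of school
# means (in student SD units) is sqrt(icc + (1 - icc) / students_per_school).
sd_school_means = math.sqrt(icc + (1 - icc) / students_per_school)
power_school = TTestIndPower().power(effect_size=d / sd_school_means,
                                     nobs1=schools_per_arm, alpha=0.05)

# Student-level analysis that ignores clustering.
power_student = TTestIndPower().power(effect_size=d,
                                      nobs1=schools_per_arm * students_per_school,
                                      alpha=0.05)

print(f"Power at the school level  (5 schools per arm):      {power_school:.2f}")
print(f"Power at the student level (1,000 students per arm): {power_student:.2f}")
```

Under these assumptions the school-level test detects a true effect only a small fraction of the time, while the student-level test is all but certain to detect it, which is why a 10-school study can be informative at the student level even when it cannot be definitive at the school level.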

Using the “promising” category in this way may encourage smaller-scale, less well-funded researchers to get into the evidence game, albeit at a lower rung on the ladder. But it is not good policy to set such high standards that few programs will qualify. Defining “promising” as we have in Evidence for ESSA does not promise anyone the Promised Land, but it broadens the range of programs schools may select, knowing that they must give up a degree of certainty in exchange for that broader selection.


The WWC’s 25% Loophole

I am a big fan of the concept of the What Works Clearinghouse (WWC), though I have concerns about various WWC policies and practices. For example, I have written previously about my concerns with the WWC’s acceptance of measures made by researchers and developers, and with its policy against weighting effect sizes by sample size when computing mean effect sizes for programs. However, there is another WWC policy that is a problem in itself and is made more serious by recent Department of Education guidance on the ESSA evidence standards.

The WWC Procedures and Standards Handbook (Version 3.0) sets rather tough standards for programs to be rated as having positive effects in studies meeting standards “without reservations” (essentially, randomized experiments) and “with reservations” (essentially, quasi-experiments, or matched studies). However, the WWC defines a special category of findings for which all caution is thrown to the winds. Such findings are called “substantively important,” and they are treated as though they met WWC standards. Quoting from the handbook: “For the WWC, effect sizes of +0.25 standard deviations or larger are considered to be substantively important…even if they might not reach statistical significance…” The “effect size greater than +0.25” loophole (the >0.25 loophole, for short) is problematic in itself, but it could lead to catastrophe for the ESSA evidence standards that now identify programs meeting the “strong,” “moderate,” and “promising” levels of evidence.

The problem with the >0.25 loophole is that studies that meet the loophole criterion without meeting the usual methodological criteria are usually very, very, very bad studies, often with a strong positive bias. They are typically very small (far too small to reach statistical significance), and they usually rely on measures made by the developers or researchers, or on measures excessively aligned with the content taught to the experimental group but not the control group.
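A bit of arithmetic shows just how weak such evidence can be. In the hypothetical sketch below (my own back-of-the-envelope illustration, not a WWC computation), a study with six students per group produces p-values nowhere near .05 even when the effect size comfortably exceeds the +0.25 threshold.

```python
# Back-of-the-envelope illustration: the p-value implied by a given effect size
# in a tiny two-group study (6 students per group, as in a 12-student experiment).
from math import sqrt
from scipy import stats

n1 = n2 = 6
for d in (0.25, 0.65):                        # the loophole threshold, and a larger effect
    t = d * sqrt(n1 * n2 / (n1 + n2))         # t statistic implied by effect size d
    p = 2 * stats.t.sf(abs(t), df=n1 + n2 - 2)
    print(f"d = {d:+.2f}  ->  t = {t:.2f}, two-tailed p = {p:.2f}")
```

In other words, a tiny study can clear the “substantively important” bar while providing essentially no statistical evidence that the program did anything at all.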

One example of the >0.25 loophole is a Brady (1990) study accepted as “substantively important” by the WWC. In it, 12 students in rural Alaska were randomly assigned to Reciprocal Teaching or to a control group. The literacy treatment was built around specific science content, but the control group never saw this content. Yet one of the outcome measures, focused on this content, was made by Mr. Brady, and two others were scored by him. Mr. Brady also happened to be the teacher of the experimental group. The effect size in this awful study was an extraordinary +0.65, though outcomes in other studies assessed on measures more fair to the control group were much smaller.

Because the WWC does not weight studies by sample size, this tiny, terrible study had the same impact in the WWC summary as studies with hundreds or thousands of students.
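The sketch below illustrates the weighting issue with hypothetical numbers: all of the sample sizes and effect sizes are invented, except that the 12-student, +0.65 result echoes the Brady example. An unweighted mean lets the tiny study pull the average far above what a sample-size-weighted mean would show.

```python
# Hypothetical summary of a program's evidence base: one tiny study with an
# outsized effect alongside several large studies with modest effects.
studies = [          # (total sample size, effect size) -- invented for illustration
    (12,   0.65),    # tiny, developer-involved study (echoes the Brady example)
    (800,  0.10),
    (1200, 0.05),
    (600,  0.12),
]

unweighted = sum(d for _, d in studies) / len(studies)
weighted = sum(n * d for n, d in studies) / sum(n for n, _ in studies)

print(f"Unweighted mean effect size:           {unweighted:+.2f}")
print(f"Sample-size-weighted mean effect size: {weighted:+.2f}")
```

Under these invented numbers the unweighted mean is roughly +0.23 while the weighted mean is below +0.10, which is why letting a 12-student study count as much as studies with hundreds or thousands of students can badly distort a program’s summary rating.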

For the ESSA evidence standards, the >0.25 loophole can lead to serious errors. A single study meeting standards makes a program qualify for one of the top three ESSA categories (strong, moderate, or promising). There can be financial consequences for schools using programs in these categories (for example, use of such programs is required for schools seeking school improvement grants). Yet a single study meeting the standards, including the awful 12-student study of Reciprocal Teaching, qualifies the program for the ESSA category, no matter what is found in all other studies (unless there are qualifying studies with negative impacts). The loophole also works in the negative direction: a small, terrible study could find an effect size less than -0.25, and no amount or quality of positive findings could then make that program meet WWC standards.

The >0.25 loophole is bad enough for research that already exists, but for the future, the problem is even more serious. Program developers or commercial publishers could do many small studies of their programs or could commission studies using developer-made measures. Once a single study exceeds an effect size of +0.25, the program may be considered validated forever.

To add to the problem, recent guidance from the U.S. Department of Education on the ESSA “promising” category specifically mentions that programs can qualify if they report statistically significant or substantively important outcomes. The guidance refers to the WWC standards for the “strong” and “moderate” categories as well, and the WWC standards themselves allow the >0.25 loophole, even though the law itself neither mentions nor implies “substantively important” findings; it consistently requires statistically significant outcomes. In other words, programs that meet WWC standards for “positive” or “potentially positive” effects on the basis of substantively important evidence alone do not actually meet the ESSA standards, which require statistical significance. Yet the recent guidance does not recognize this problem.

The >0.25 loophole began, I’d assume, when the WWC was young and few programs met its standards; at the time, it was jokingly called the “Nothing Works Clearinghouse.” The loophole was probably added to increase the number of included programs. It produced misleading conclusions, but since the WWC did not matter very much to educators, there were few complaints. Today, however, the WWC has much greater importance because of the ESSA evidence standards.

Bad loopholes make bad laws. It is time to close this loophole, and eliminate the category of “substantively important.”

Research and Development Saved Britain. Maybe They Will Save U.S. Education

One of my summer goals is to read the entire six-volume history of the Second World War by Winston Churchill. So far, I’m about halfway through the first volume, The Gathering Storm, about the period leading up to 1939.

The book is more or less a wonderfully written rant about the Allies’ shortsightedness. As Hitler built up his armaments, Britain, France, and their allies maintained a pacifist insistence on reducing theirs. Only in the mid-thirties, when war was inevitable, did Britain start investing in armaments, but even then at a very modest pace.

Churchill was a Member of Parliament but was out of government. However, he threw himself into the one thing he could do to help Britain prepare: research and development. In particular, he worked with top scientists to develop the capacity to track, identify, and shoot down enemy aircraft.

When the 1940 Battle of Britain came and German planes tried to destroy and demoralize Britain in advance of an invasion, the inventions by Churchill’s group were a key factor in defeating them.

Churchill’s story is a good analogue to the situation of education research and development. In the current environment, the best-evaluated, most effective programs are not in wide use in U.S. schools. But the research and development that creates and evaluates these programs is essential. It is already useful in the hundreds of schools that do use proven programs. But imagine what would happen if federal, state, or local governments anywhere decided to use proven programs to combat their most important education problems at scale. Such a decision would be laudable in principle, but where would the proven programs come from? How would they generate convincing evidence of effectiveness? How would they build robust and capable organizations to provide high-quality professional development, materials, and software?

The answer is research and development, of course. Just as Churchill and his scientific colleagues had to create new technologies before Britain was willing to invest in air defenses and air superiority at scale, so American education needs to prepare for the day when government at all levels is ready to invest seriously in proven educational programs.

I once visited a secondary school near London. It’s an ordinary school now, but in 1940 it was a private girls’ school. A German plane, shot down in the Battle of Britain, crash landed near the school. The girls ran out and captured the pilot!

The girls were courageous, as was the British pilot who shot down the German plane. But the advanced systems the British had worked out and tested before the war were also important to saving Britain. In education reform we are building and testing effective programs and organizations to support them. When government decides to improve student learning nationwide, we will be ready, if investments in research and development continue.


Research and Practice: “Tear Down This Wall”

I was recently in Berlin. Today, it’s a lively, entirely normal European capital. But the first time I saw it, it was 1970, and the wall still divided it. Like most tourists, I went through Checkpoint Charlie to the east side. The two sides were utterly different. West Berlin was pleasant, safe, and attractive. East Berlin was a different world. On my recent trip, I met a young researcher who grew up in West Berlin. He recalls his father being taken in for questioning because he accidentally brought a West Berlin newspaper across the border. Western people could visit, but western newspapers could get you arrested.

I remember John F. Kennedy’s “Ich bin ein Berliner” speech, and Ronald Reagan’s “Mr. Gorbachev, tear down this wall.” And one day, for reasons no one seems to understand, the wall was gone. Even today, I find it thrilling and incredible to walk down Unter den Linden under the Brandenburg Gate. Not so long ago, this was impossible, even fatal.

The reason I bring up the Berlin Wall is that I want to use it as an analogy to another wall of less geopolitical consequence, perhaps, but very important to our profession. This is the wall between research and practice.

It is not my intention to disrespect the worlds on either side of the research/practice wall. People on both sides care deeply about children and bring enormous knowledge, skill, and effort to improving educational outcomes. In fact, that’s what is so sad about this wall. People on both sides have so much to teach and learn from each other, but all too often, they don’t.

In recent years, the federal government has in many ways been reinforcing the research/practice divide, at least until the passage of the Every Student Succeeds Act (ESSA) (more on this later). On one hand, government has invested in high-quality educational research and development, especially through Investing in Innovation (i3) and the Institute of Education Sciences (IES). As a result, over on the research side of the wall there is a growing stockpile of rigorously evaluated, ready-to-implement education programs for most subjects and grade levels.

On the practice side of the wall, however, government has implemented national policies that may or may not have a basis in research, but definitely do not focus on use of proven programs. Examples include accountability, teacher evaluation, and Common Core. Even federal School Improvement Grants (SIG) for the lowest-achieving 5% of schools in each state had loads of detailed requirements for schools to follow but said nothing at all about using proven programs or practices, until a proven whole-school reform option was permitted as one of six alternatives at the very end of No Child Left Behind. The huge Race to the Top funding program was similarly explicit about standards, assessments, teacher evaluations, and other issues, but said nothing about use of proven programs.

On the research side of the wall, developers and researchers were being encouraged by the U.S. Department of Education to write their findings clearly and “scale up” their findings to presumably eager potential adopters on the practice side. Yet the very same department was, at the same time, keeping education leaders on the practice side of the wall scrambling to meet federal standards to obtain Race to the Top, School Improvement Grants, and other funding, none of which had anything much to do with the evidence base building up on the research side of the wall. The problem posed by the Berlin Wall was not going to be resolved by sneaking well-written West Berlin newspapers into East Berlin, or East Berlin newspapers into West Berlin. Rather, someone had to tear down the wall.

The Every Student Succeeds Act (ESSA) is one attempt to tear down the research/practice wall. Its definitions of strong, moderate, and promising levels of evidence, and provision of funding incentives for using proven programs (especially in applications for school improvement), could go a long way toward tearing down the research/practice wall, but it’s too soon to tell. So far, these definitions are just words on a page. It will take national, state, and local leadership to truly make evidence central to education policy and practice.

On National Public Radio, I recently heard recorded recollections from people who were in Berlin the day the wall came down. One of them really stuck with me. West Berliners had climbed to the top of the wall and were singing and cheering as gaps were opened. Then, an East German man headed for a gap. The nearby soldiers, unsure what to do, pointed their rifles at him and told him to stop. He put his hands in the air. The West Germans on the wall fell silent, anxiously watching.

A soldier went to find the captain. The captain came out of a guardhouse and walked over to the East German man. He put his arm around the man’s shoulders and personally walked him through the gap in the wall.

That’s leadership. That’s courage. It’s what we need to tear down our wall: leaders at all levels who actively encourage the world of research and the world of practice to become one, and who lead by personal and public example, so that educators understand that the rules have changed, and that communication between research and practice, and use of proven programs and practices, will be encouraged and facilitated.

Our wall can come down. It’s only a question of leadership, and commitment to better outcomes for children.


Thoughtful Needs Assessments + Proven Programs = Better Outcomes

I’ve been writing lately about our Evidence for ESSA web site, due to be launched at the end of February.  It will make information on programs meeting ESSA evidence standards easy to access and use.  I think it will be a wonderful tool.  But today, I want to talk about what educational leaders need to do to make sure that the evidence they get from Evidence for ESSA will actually make a difference for students.  Knowing what works is essential, but before searching for proven programs, it is important to know the problems you are trying to solve.  The first step in any cycle of instructional improvement is conducting a needs assessment.  You can’t fix a problem you don’t acknowledge and understand.  Most implementation models also ask leaders to do a “root cause” analysis in order to understand what causes the problems that need to be solved.

Needs assessments and root cause analyses are necessary, but they are often, as Monty Python used to say, “a privileged glimpse into the perfectly obvious.”  Any school, district, or state leadership team is likely to sit down with the data and conclude, for example, that on average, low achieving students are from less advantaged homes, or that they have learning disabilities, or that they are limited in English proficiency.  They might dig deeper and find that low achievers or dropouts are concentrated among students with poor attendance, behavior problems, poor social-emotional skills, and low aspirations.

Please raise your hand if any of these are surprising, or if you haven’t been working on them your whole professional life.  Seeing no hands raised, I’ll continue.

The problem with needs assessments and root cause analyses is that they usually do not suggest a pragmatic solution that isn’t already in use or hasn’t already been tried.  And some root causes cannot, as a practical matter, be solved by schools.  For example, if students live in substandard housing, or suffer from lead poisoning or chronic diseases, schools can help reduce the educational impact of these problems but cannot solve them.

Further, needs assessments often lead to solutions that are too narrow, and may therefore not be optimal or cost-effective.  For example, a school improvement team might conclude that one-third of kindergartners are unlikely to be reading at grade level by third grade.  This might lead the school to invest in a one-to-one tutoring program.  Yet few schools can afford one-to-one tutoring for as many as one-third of their children.  A more cost-effective approach might be to invest first in professional development in proven core instructional strategies for teachers of grades K-3, then provide proven small-group tutoring for students for whom enhanced classroom reading instruction is not enough, and finally provide one-to-one tutoring for the (hopefully small) number of students still not succeeding despite proven whole-class and small-group instruction.  The school might also check students’ vision and hearing to be sure that problems in these areas are not what is holding some students back.

In this example, note that the needs assessment might not lead directly to the best solution.  A needs assessment might conclude that there is a big problem with early reading, and it might note the nature of the students likely to fail.  But the needs assessment might not lead to improving classroom instruction or checking vision and hearing, because these may not seem directly targeted to the original problem.  Improving classroom instruction or checking vision and hearing for all would end up benefitting students who never had a problem with reading.  But so what?  Some solutions, such as professional development for teachers, are so inexpensive (compared to, say, tutoring or special education) that it may be better to invest in the broader solution and let the benefits apply to all or most students rather than focus narrowly on the students with the problems.

An excellent example of this perspective relates to English learners.  In many schools and districts, students who enter school with poor English skills are at particular risk, perhaps throughout their school careers.  A needs assessment involving such schools would of course point to language proficiency as a key factor in students’ likelihood of success, on average.  Yet if you look at the evidence on what works with English learners to improve their learning of English, reading, and other subjects, most solutions take place in heterogeneous classrooms and involve a lot of cooperative learning, where English learners have a lot of opportunities every day to use English in school contexts.  A narrow interpretation of a needs assessment might try to focus on interventions for English learners alone, but alone is the last place they should be.

Needs assessments are necessary, but they should be carried out in light of the practical, proven solutions that are available. For example, imagine that a school leadership team carries out a needs assessment that arrays documented needs, and considers proven solutions, perhaps categorized as expensive (e.g., tutoring, summer school, after school), moderate (e.g., certain technology approaches, professional development with live, on-site coaching, vision and hearing services), or inexpensive (e.g., professional development without live coaching).  The idea would be to bring together data on the problems and the solutions, leading to a systemic approach to change rather than either picking programs off a shelf or picking out needs and choosing solutions narrowly focused on those needs alone.

Doing needs assessments without proven solutions as part of the process from the outset would be like making a wish list of features you’d like in your next car without knowing anything about cars actually on the market, and without considering Consumer Reports ratings of their reliability.  The result could be ending up with a car that breaks down a lot or one that costs a million dollars, or both.

Having easy access to reliable information on the effectiveness of proven programs should greatly change, but not dominate, the conversation about school improvement.  It should facilitate intelligent, informed conversations among caring leaders, who need to build workable systems to use innovation to enhance outcomes.  Those systems need to consider needs, proven programs, and requirements for effective implementation together, and create schools built around the needs of students, schools, and communities.

The Rapid Advance of Rigorous Research

My colleagues and I have been reviewing a lot of research lately, as you may have noticed in recent blogs on our reviews of research on secondary reading and our work on our web site, Evidence for ESSA, which summarizes research on all of elementary and secondary reading and math according to ESSA evidence standards.  In the course of this work, I’ve noticed some interesting trends, with truly revolutionary implications.

The first is that reports of rigorous research are appearing very, very fast.  In our secondary reading review, there were 64 studies that met our very stringent standards.  Fifty-five of these used random assignment, and even the 9 quasi-experiments all specified assignment to experimental or control conditions in advance.  We eliminated all researcher-made measures.  But the most interesting fact is that of the 64 studies, 19 had publication or report dates of 2015 or 2016.  Fifty-one have appeared since 2011.  This surge of recent publications on rigorous studies was greatly helped by the publication of many studies funded by the federal Striving Readers program, but Striving Readers was not the only factor.  Seven of the studies were from England, funded by the Education Endowment Foundation (EEF).  Others were funded by the Institute of Education Sciences at the U.S. Department of Education (IES), the federal Investing in Innovation (i3) program, and many publishers, who are increasingly realizing that the future of education belongs to those with evidence of effectiveness.  With respect to i3 and EEF, we are only at the front edge of seeing the fruits of these substantial investments, as there are many more studies in the pipeline right now, adding to the continuing build-up in the number and quality of studies started by IES and other funders.  Looking more broadly at all subjects and grade levels, there is an unmistakable conclusion: high-quality research on practical programs in elementary and secondary education is arriving in amounts we never could have imagined just a few years ago.

Another unavoidable conclusion from the flood of rigorous research is that in large-scale randomized experiments, effect sizes are modest.  In a recent review I did with my colleague Alan Cheung, we found that the mean effect size for large, randomized experiments across all of elementary and secondary reading, math, and science is only +0.13, much smaller than effect sizes from smaller studies and from quasi-experiments.  However, unlike small and quasi-experimental studies, rigorous experiments using standardized outcome measures replicate.  These effect sizes may not be enormous, but you can take them to the bank.

In our secondary reading review, we found an extraordinary example of this. The University of Kansas has an array of programs for struggling readers in middle and high schools, collectively called the Strategic Instruction Model, or SIM.  In the Striving Readers grants, several states and districts used methods based on SIM.  In all, we found six large, randomized experiments, and one large quasi-experiment (which matched experimental and control groups).  The effect sizes across the seven studies varied from a low of 0.00 to +0.15, but most clustered closely around the weighted mean of +0.09.  This consistency was remarkable given that the contexts varied considerably.  Some studies were in middle schools, some in high schools, some in both.  Some studies gave students an extra period of reading each day, some did not.  Some studies went for multiple years, some did not.  Settings included inner-city and rural locations, and all parts of the U.S.

One might well argue that the SIM findings are depressing, because the effect sizes were quite modest (though usually statistically significant).  This may be true, but once we can replicate meaningful impacts, we can also start to make solid improvements.  Replication is the hallmark of a mature science, and we are getting there.  If we know how to replicate our findings, then the developers of SIM and many other programs can create better and better programs over time with confidence that once designed and thoughtfully implemented, better programs will reliably produce better outcomes, as measured in large, randomized experiments.  This means a lot.

Of course, large, randomized studies may also be reliable in telling us what does not work, or does not work yet.  When researchers get zero impacts and then seek funding to do the same treatment again, hoping for better luck, they and their funders are sure to be disappointed.  Researchers who find zero impacts may learn a lot, which may help them create something new that will, in fact, move the needle.  But they have to then use those learnings to do something meaningfully different if they expect to see meaningfully different outcomes.

Our reviews are finding that in every subject and grade level, there are programs right now that meet high standards of evidence and produce reliable impacts on student achievement.  Increasing numbers of these proven programs have been replicated with important positive outcomes in multiple high-quality studies.  If all 52,000 Title I schools adopted and implemented the best of these programs, those that reliably produce impacts of more than +0.20, the U.S. would soon rise in international rankings, achievement gaps would be cut in half, and we would have a basis for further gains as research and development build on what works to create approaches that work better.  And better.  And then better still.

There is bipartisan, totally non-political support for the idea that America’s schools should be using evidence to enhance outcomes.  However a school came into being, whoever governs it, whoever attends it, wherever it is located, at the end of the day the school exists to make a difference in the lives of children.  In every school there are teachers, principals, and parents who want and need to ensure that every child succeeds.  Research and development does not solve all problems, but it helps leverage the efforts of all educators and parents so that they can have a maximum positive impact on their children’s learning.  We have to continue to invest in that research and development, especially as we get smarter about what works and what does not, and as we get smarter about research designs that can produce reliable, replicable outcomes.  Ones you can take to the bank.

Brilliant Errors

On a recent visit to Sweden, my wife Nancy and I went to the lovely university city of Uppsala. There, one of the highlights of our trip was a tour of the house and garden of the great 18th century botanist, Carl Linnaeus, who invented the system of naming plants and animals we use today. Whenever we say Homo Sapiens, for example, we are honoring Linnaeus. His system uses two Latin words, first the genus and then the species. This replaced long, descriptive, but non-standardized naming systems that made it difficult to work out the relationships among plants and animals. Linnaeus was the most famous botanist of his time, and he is generally considered the most famous botanist in all of history. He wrote hundreds of books and papers, and he inspired the work of generations of botanists and biologists to follow, right up to today.

But he was dead wrong.

What Linnaeus was primarily trying to do was to create a comprehensive system to organize plants by their characteristics. In this, he developed what he called a “sexual system,” emphasizing the means by which plants reproduce. This was a reasonable guess, but later research showed that his organization system was incorrect.

But the fact that his specific model was wrong does not subtract one mustard seed from the power and importance of Linnaeus’ contribution.

Linnaeus’ lasting contribution was in his systematic approach, carefully analyzing plants to observe similarities and differences. Before Linnaeus, botany involved discovery, description, and categorization of plants, but there was no overarching system of relationships, and no scientifically useful naming system to facilitate seeing relationships.

The life and work of Linnaeus provides an interesting case for educators and educational research.

Being wrong is not shameful, as long as you can learn from your errors. In the history of education, the great majority of research began with a set of assumptions, but research methods did not adequately test these assumptions. There was an old saying that all educational research was “doomed to success.” As a result, we had little ability to tell when theories or methods were truly impactful, and when they were not. For this reason, it was rarely possible to learn from errors, or even from apparent successes.

In recent years, the rise of experimental research, in real schools over real periods of time measured by real assessments, has produced a growing set of proven replicable programs, and this is crucial for improving practice right now. But in the longer run, using methods that also identify failures or incorrect or unrealistic ideas is just as important. In the absence of methods that can disconfirm current beliefs, nothing ever changes.

It is becoming apparent that most large-scale randomized experiments in education fail to produce statistically significant impacts on achievement. We can celebrate and replicate those that do make a significant difference in students’ learning, but we can also learn from those that do not. Often, studies find no difference overall but do find positive effects for particular subgroups, or when particular forms of a program are used, or when implementation meets a high standard. These after-the-fact findings provide clues, not proof, but if researchers use the lessons from a non-significant experiment in a new study and find that under well-specified conditions the treatment is effective for improving learning, then we’ve made a great advance.

It is important to set up experiments so that they can tell us more than “yes/no” but can instead tell us what factors did or did not contribute to positive impacts. This information is crucial whatever the overall impacts may be.

In every field that uses experiments, failures to find positive effects are common. Our task is to plan for this and learn from our own failures as well as successes. Like Linnaeus, we will make progress by learning from “brilliant errors.”

Linnaeus’ methods created the means of disconfirming his own taxonomy system. His taxonomy was indeed overthrown by later work, but his insistence on observation, categorization, and systematization, the very methods that undermined his own system of relationships among plants and animals, was his real contribution. In educational research, we must learn to celebrate high-quality, rigorous research that finds what does not work, and include sufficient qualitative methods to help us learn how and why educational programs do or do not work for children.

May we all have opportunities to fail as brilliantly as Linnaeus did!