Brilliant Errors

On a recent visit to Sweden, my wife Nancy and I went to the lovely university city of Uppsala. There, one of the highlights of our trip was a tour of the house and garden of the great 18th-century botanist, Carl Linnaeus, who invented the system of naming plants and animals we use today. Whenever we say Homo sapiens, for example, we are honoring Linnaeus. His system uses two Latin words, first the genus and then the species. This replaced long, descriptive, but non-standardized naming systems that made it difficult to work out the relationships among plants and animals. Linnaeus was the most famous botanist of his time, and he is generally considered the most famous botanist in all of history. He wrote hundreds of books and papers, and he inspired the work of generations of botanists and biologists to follow, right up to today.

But he was dead wrong.

What Linnaeus was primarily trying to do was to create a comprehensive system to organize plants by their characteristics. In this, he developed what he called a “sexual system,” emphasizing the means by which plants reproduce. This was a reasonable guess, but later research showed that his organization system was incorrect.

But the fact that his specific model was wrong does not subtract one mustard seed from the power and importance of Linnaeus’ contribution.

Linnaeus’ lasting contribution was in his systematic approach, carefully analyzing plants to observe similarities and differences. Before Linnaeus, botany involved discovery, description, and categorization of plants, but there was no overarching system of relationships, and no scientifically useful naming system to facilitate seeing relationships.

The life and work of Linnaeus provide an interesting case for educators and educational research.

Being wrong is not shameful, as long as you can learn from your errors. In the history of education, the great majority of research began with a set of assumptions, but research methods did not adequately test these assumptions. There was an old saying that all educational research was “doomed to success.” As a result, we had little ability to tell when theories or methods were truly impactful, and when they were not. For this reason, it was rarely possible to learn from errors, or even from apparent successes.

In recent years, the rise of experimental research, in real schools over real periods of time measured by real assessments, has produced a growing set of proven replicable programs, and this is crucial for improving practice right now. But in the longer run, using methods that also identify failures or incorrect or unrealistic ideas is just as important. In the absence of methods that can disconfirm current beliefs, nothing ever changes.

It is becoming apparent that most large-scale randomized experiments in education fail to produce statistically significant outcomes on achievement. We can celebrate and replicate those that do make a significant difference in students’ learning, but we can also learn from those that do not. Often, studies find no difference overall but do find positive effects for particular subgroups, or when particular forms of a program are used, or when implementation meets a high standard. These after-the-fact findings provide clues, not proof, but if researchers use the lessons from a non-significant experiment in a new study and find that under well-specified conditions the treatment is effective for improving learning, then we’ve made a great advance.

It is important to set up experiments so that they tell us not just “yes/no” but also what factors did or did not contribute to positive impacts. This information is crucial whatever the overall impacts may be.

In every field that uses experiments, failures to find positive effects are common. Our task is to plan for this and learn from our own failures as well as successes. Like Linnaeus, we will make progress by learning from “brilliant errors.”

Linnaeus’ methods created the means of disconfirming his own taxonomy system. His taxonomy was indeed overthrown by later work, but his insistence on observation, categorization, and systematization, the very methods that undermined his own system of relationships among plants and animals, was his real contribution. In educational research, we must learn to celebrate high-quality rigorous research that finds what does not work, and include sufficient qualitative methods to help us learn how and why educational programs either work or do not work for children.

May we all have opportunities to fail as brilliantly as Linnaeus did!


‘We Don’t Do Programs’

When speaking with educational leaders, I frequently hear them say, “We don’t do programs.” They lament that teachers and principals are often too driven by programs, and when outcomes are not what they would like, they drop their program and bring in another, learning little in the process.

I can sympathize with the sentiment. The sad secret known to quantitative researchers is that most programs don’t make any difference in achievement. In particular, commercial textbooks almost never make a difference. It’s not that textbooks are worthless, but they are usually so similar to each other that few textbooks produce better outcomes than any others. So the experience of school and district leaders is a cycle of adopting new textbooks, implementing them with enthusiasm, and then gradually being disappointed in the outcomes. The schools and districts are then stuck with the old texts until the books wear out, and then the whole cycle starts again.

A very similar cycle of adoption, enthusiasm, frustration, and abandonment also happens with technology, though it may happen faster.

Another part of the negative experience with programs that many school leaders share is an observation that educators often believe that they are on the right track just because they have adopted a new program. This may be particularly true if the school or district has adopted a program that is seen as innovative or “in,” such as the technology of the moment.

What’s left out of the “we don’t do programs” conversation is, of course, evidence. Would people who say “we don’t do programs” also say “we don’t do effective programs”? I certainly hope not. The categorical rejection of all programs makes no sense, so I’m going to assume that, since most educators are sensible people, those who say “we don’t do programs” must simply be unaware that there are in fact both effective and ineffective programs.

Of course, simply adopting a program is not a guarantee of positive outcomes. Programs must be implemented with fidelity, thoughtfulness, and appropriate adaptations to local needs and resources. What evidence of effectiveness provides is not certainty, but rather a valid reason to believe that if teachers and principals put in the time, effort, and resources to implement a program well, outcomes will be positive.

If you follow my blogs, you are aware that there are many proven programs and many programs that lack evidence of effectiveness. I’ll consider my life goal to have been achieved when I start hearing educational leaders saying “we don’t do ineffective programs. We do effective ones.” Or more succinctly, “Show me the evidence!”

The Curse of the Cluster

If you follow my blogs, you’ve probably noticed that I stay away from three topics on which reasoned discourse is impossible: religion, politics, and statistics. However, just this once I’d like to break my own rule and talk about statistics, or rather research design. And I promise not to be too nerdy.

While there is little argument about basic principles of statistics and research design, things do get a bit dicey in the real world. Some of my colleagues resolve any situation that is less than ideal by ignoring studies with the slightest flaw. I think that can be a huge waste of (usually) government money, and can deprive researchers and educators alike of valuable information.

My personal position is that not all flaws are created equal. In particular, some flaws introduce bias and some do not. For example, researcher-made measures, small sample sizes, and matched rather than randomized designs all introduce bias, so they should be avoided or minimized in importance.

On the other hand, accounting for clustering in designs in which students are grouped in classes or schools is now considered essential. That is, if you randomly assign 20 schools to experimental (n=10) or control (n=10) conditions, you might have 5000 students per treatment. Randomly assigning 5000 students one at a time would be a huge study. In fact, 300 students might be enough. However, in a clustered study, 5000 per treatment may be too small. Current statistical principles demand that you use a method called Hierarchical Linear Modeling (HLM) to analyze the data, and unless the effect size is very large, 20 schools will not be sufficient for statistical significance.
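To see why 5000 students per treatment can be "too small" once they sit in only 10 schools, here is a minimal sketch of the standard design-effect calculation. The intraclass correlation (ICC) of 0.20 is an assumed, illustrative value, not a figure from any particular study, and the function names are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent students the clustered sample is 'worth'."""
    total_students = n_clusters * cluster_size
    return total_students / design_effect(cluster_size, icc)


# One treatment arm from the example: 10 schools of 500 students each.
# ASSUMPTION: ICC = 0.20 is chosen only for illustration.
deff = design_effect(cluster_size=500, icc=0.20)
n_eff = effective_sample_size(n_clusters=10, cluster_size=500, icc=0.20)
print(f"design effect: {deff:.1f}")          # about 100.8
print(f"effective students per arm: {n_eff:.1f}")  # about 49.6
```

Under that assumed ICC, 5000 clustered students carry roughly the information of 50 independently assigned ones, which is far fewer than the 300 mentioned above. A smaller ICC would shrink the penalty, so the exact numbers depend heavily on the assumption.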

Yet here’s the rub: failing to account for clustering does not introduce bias. That is, if you (mistakenly) analyzed at the student level in a study in which treatments were implemented at the class or school level, the effect size would be about the same. All that would change would be statistical significance: you would overstate the number of experimental-control differences claimed to be significant (i.e., beyond what you’d expect by chance).
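A small simulation can make this point concrete. The sketch below assumes 10 schools per treatment, 30 students per school, an ICC of 0.20, and a true effect of zero; all of these values are illustrative assumptions. The cluster-level analysis here is a simple t test on school means, used as a stand-in for HLM rather than HLM itself.

```python
import math
import random
import statistics

SCHOOLS_PER_ARM = 10   # matches the 20-school example in the text
T_CRIT_DF18 = 2.101    # two-sided 5% critical value of t with 18 degrees of freedom
Z_CRIT = 1.96          # two-sided 5% critical value of the normal distribution


def simulate(n_reps=400, students_per_school=30, icc=0.2, true_effect=0.0, seed=1):
    """Repeat a school-randomized study and analyze it two ways."""
    rng = random.Random(seed)
    sd_school = math.sqrt(icc)          # between-school SD (total variance = 1)
    sd_student = math.sqrt(1.0 - icc)   # within-school SD

    def draw_arm(shift):
        schools = []
        for _ in range(SCHOOLS_PER_ARM):
            school_effect = rng.gauss(0.0, sd_school)
            schools.append([shift + school_effect + rng.gauss(0.0, sd_student)
                            for _ in range(students_per_school)])
        return schools

    estimates, student_hits, school_hits = [], 0, 0
    for _ in range(n_reps):
        treat, ctrl = draw_arm(true_effect), draw_arm(0.0)

        # Student-level analysis: pretends every student is independent.
        t_scores = [x for school in treat for x in school]
        c_scores = [x for school in ctrl for x in school]
        diff = statistics.fmean(t_scores) - statistics.fmean(c_scores)
        se_student = math.sqrt(statistics.variance(t_scores) / len(t_scores)
                               + statistics.variance(c_scores) / len(c_scores))
        student_hits += abs(diff / se_student) > Z_CRIT

        # School-level analysis: one mean per school, so n = 10 per arm.
        # With equal school sizes the estimated difference is identical.
        t_means = [statistics.fmean(school) for school in treat]
        c_means = [statistics.fmean(school) for school in ctrl]
        se_school = math.sqrt((statistics.variance(t_means)
                               + statistics.variance(c_means)) / SCHOOLS_PER_ARM)
        school_hits += abs(diff / se_school) > T_CRIT_DF18

        estimates.append(diff)

    return {
        "mean_estimate": statistics.fmean(estimates),
        "student_level_rate": student_hits / n_reps,
        "school_level_rate": school_hits / n_reps,
    }


print(simulate())
```

Because the true effect is set to zero, every "significant" result is a false positive. Under these assumptions one would expect the average estimated effect to sit near zero in both analyses (no bias), the school-level rate to stay near the nominal 5%, and the student-level rate to be several times higher.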

All right, let’s accept that clustered data should be analyzed using HLM, which accounts for clustering. But while we are straining at the clustering gnat, what camels are we swallowing?

My personal bugbear is researcher-made measures. Often, the very same researchers who take an unyielding position on clustering happily accept research designs in which the researcher made the test, even if the test is clearly aligned with the content the experimental group (but not the control group) was taught. In some studies, the teachers who provided tutoring, for example, also gave the tests. Strict-on-clustering researchers also often accept studies that were very brief, sometimes a week or less, or often just an hour. They may accept studies in which conditions in the experimental groups were substantially enhanced beyond what could ever be done in real life, as in technology studies in which a graduate student is placed in every class or even every small group every day to “help with the technology.”

All of these research designs are far more likely to produce misleading findings than are studies that only suffer from clustering problems, and worse, these effects introduce bias, while failing to attend to clustering does not.

Why is this of importance to non-statisticians? It matters because in education, students are usually taught in large groups, so except for studies of one-to-one or small-group tutoring, clustering almost always has to be accounted for. As a consequence, randomized experiments typically must involve 40-50 schools (20-25 per treatment) to detect an effect size as small as 0.20. Such experiments are very expensive, and they are difficult to do if you are not an expert already. The clustering requirement, therefore, makes it difficult for researchers early in their careers to get funding and to show success if they do, because managing implementation and collecting data in 50 schools is really, really hard.
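For readers who want to see where school counts like these come from, here is a rough sketch of a commonly used approximation for the minimum detectable effect size (MDES) in a balanced, two-arm, school-randomized design. The ICC, the number of students per school, and the covariate R-squared values are all assumptions chosen for illustration, and the multiplier uses a normal approximation rather than the exact t distribution.

```python
import math
from statistics import NormalDist


def mdes(n_schools, students_per_school, icc,
         r2_school=0.0, r2_student=0.0, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect size, half of schools per arm.

    r2_school and r2_student are the shares of school-level and
    student-level variance explained by covariates such as a pretest.
    """
    normal = NormalDist()
    # Normal approximation to the usual multiplier (about 2.8).
    multiplier = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    p = 0.5  # proportion of schools assigned to treatment
    between = icc * (1 - r2_school) / (p * (1 - p) * n_schools)
    within = ((1 - icc) * (1 - r2_student)
              / (p * (1 - p) * n_schools * students_per_school))
    return multiplier * math.sqrt(between + within)


# ASSUMPTIONS: 50 schools, 100 students each, ICC = 0.20.
print(round(mdes(50, 100, 0.20), 2))                      # about 0.36, no covariates
print(round(mdes(50, 100, 0.20, r2_school=0.75,
                 r2_student=0.50), 2))                    # about 0.18, strong pretest
```

Under these assumed values, 50 schools reach an effect size near 0.20 only when a strong covariate such as a pretest is included in the analysis; without one, the detectable effect is considerably larger. Different ICC or R-squared assumptions would move these figures.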

I do not have a good solution for this problem, and I upset my colleagues when I bring it up. But we have to face it. Treating clustering as an absolute requirement makes educational research too expensive; put another way, it means that we can do too few studies for the dollars we do invest. This requirement also bars entry to the field for those unable to get multi-million dollar grants or to manage large field experiments.

One solution to the cluster problem might be to have research funders fund step-by-step studies. For example, imagine that funding were offered for studies of 10 schools to be analyzed at the cluster level (correct but hopelessly underpowered) and at the student level (Bad! But affordable.). If the outcomes are promising, funders could fund another 10-school study, and researchers could combine the samples, repeating this process until there are enough schools to collectively justify a proper clustered analysis. This would also enable neophyte researchers to learn from experience, it would allow everyone to learn over time what the potential impacts are, and it could save billions of dollars now being spent on monster randomized studies of programs never before having shown promising effects (which then turn out to be ineffective).

A gradual approach to clustering might enable the field of education to focus on the real enemy, which is bias. If we systematically stamp out design elements that add bias, then over time the field will converge upon truth, and will cost-effectively move forward knowledge of what works, in time to benefit today’s children. The curse of the cluster is holding back the whole field. With all due respect to the real problems clustered designs present, let’s find ways to compromise so we can learn from unbiased but modest-sized studies and go step-by-step toward better information for practice.

The Maryland Challenge

As the Olympic Games earlier this summer showed, we Americans love to compare ourselves with other countries. Within the U.S., we like to compare our states with other states. When Ohio State plays the University of Michigan, it’s not just a football game.

In education, we also like to compare, and we usually don’t like what we see. Comparisons can be useful in giving us a point of reference for what is possible, but a point of reference doesn’t help if it is not seen as a peer. For example, U.S. students are in the middle of the pack of developed nations on Program for International Student Assessment (PISA) tests for 15-year-olds, but Americans expect to do a lot better than that. The National Assessment of Educational Progress (NAEP) allows us to compare scores within the U.S., and unless you’re in Massachusetts, which usually scores highest, you probably don’t like those comparisons either. When we don’t like our ranking, we explain it away as best as we can. Countries with higher PISA scores have fewer immigrants, or pay their teachers better, or have cultures that value education more. States that do better are richer, or have other unfair advantages. These explanations may or may not have an element of truth, but the bottom line is that comparisons on such a grand scale are just not that useful. There are far too many factors that are different between nations or states, some of which are changeable and some not, at least in the near term.

If comparisons among unequal places are not so useful, what point of reference would be better?

Kevan Collins, Director of the Education Endowment Foundation in England (England’s equivalent to our Investing in Innovation (i3) program), has an answer to this dilemma, which he explained at a recent conference I attended in Stockholm. His idea is based on a major, very successful initiative of Tony Blair’s government beginning in 2003, called the London Challenge. Secondary schools in the greater London area were put into clusters according to students’ achievement at the end of primary (elementary) school, levels of poverty, numbers of children speaking languages other than English at home, size, and other attributes. Examination of the results being achieved by schools within the same cluster showed remarkable variation in test scores. Even in the poorest clusters there were schools performing above the national average, and in the wealthiest clusters there were schools below the average. Schools low in their own clusters were given substantial resources to improve, with a particular emphasis on leadership. Over time, London went from being one of the lowest-achieving areas of England to scoring among the highest. Later versions of this plan in Manchester and in the Midlands did not work as well, but they did not have much time before the end of the Blair government meant the end of the experiment.

Fast forward to today, and think about states in the U.S. as the unit of reform. Imagine that Maryland, my state, categorized its Title I elementary, middle, and high schools according to percent free lunch, ethnic composition, percent English learners, urban/rural, school size, and so on. Each of Maryland’s Title I schools would be in a cluster of perhaps 50 very similar schools. As in England, there would be huge variation in achievement within clusters.

Just forming clusters to shame schools low in their own cluster would not be enough. The schools need help to greatly improve their outcomes.

This being 2016, we have many more proven programs than were available in the London Challenge. Schools scoring below the median of their cluster might have the opportunity to choose proven programs appropriate to their strengths and needs. The goal would be to assist every school below the median in its own cluster to at least reach the median. School staffs would have to vote by at least 80% in favor to adopt various programs. The school would also commit to use most of its federal Title I funds to match supplemental state or federal funding to pay for the programs. Schools above the median would also be encouraged to adopt proven programs, but might not receive matching funds.

Imagine what could happen. Principals and staffs could no longer argue that it is unfair for their schools to be compared to dissimilar schools. They might visit schools performing at the highest levels in their clusters, and perhaps even form coalitions across district lines to jointly select proven approaches and help each other implement them.

Not all schools would likely participate in the first years, but over time, larger numbers might join in. Because schools would be implementing programs already known to work in schools just like theirs, and would be held accountable within a fair group of peers, schools should see rapid growth toward and beyond their cluster median, and more importantly, the entire clusters should advance toward state goals.

A plan like this could make a substantial difference in performance among all Title I schools statewide. It would focus attention sharply where it is needed, on improved teaching and learning in the schools that need it most. Within a few years, Maryland, or any other state that did the same, might blow past Massachusetts, and a few years after that, we’d all be getting visits from Finnish educators!