“Substantively Important” Isn’t Substantive. It Also Isn’t Important

Since it began in 2002, the What Works Clearinghouse has played an important role in finding, rating, and publicizing findings of evaluations of educational programs. It performs a crucial function for evidence-based reform. For this very reason, it needs to be right. But in several important ways, it uses procedures that are indefensible and have a big impact on its conclusions.

One of these relates to a study rating called “substantively important-positive.” This refers to study outcomes with effect sizes of at least +0.25 that are nevertheless not statistically significant. I’ve written about this before, but the WWC has recently released a database of information on its studies that makes it easy to analyze WWC data on a large scale, and we have learned a lot more about this topic.

Study outcomes rated as “substantively important-positive” can qualify a study as “potentially positive,” the second-highest WWC rating. “Substantively important-negative” findings (non-significant effect sizes less than -0.25) can cause a study to be rated “potentially negative.” That rating can be permanent: under current rules, a single “potentially negative” rating ensures that a program can never receive a rating better than “mixed,” even if other studies found hundreds of significant positive effects.
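To make the rule concrete, here is a minimal sketch in Python of how a single outcome maps onto these categories as described above. This is my own illustration, not the WWC’s actual code; the function name, the handling of the exact ±0.25 boundary, and the example values are mine.

```python
# A sketch of the WWC outcome categories described above (my own
# illustration, not WWC code). Boundary handling at exactly +/-0.25
# is an assumption.

def rate_outcome(effect_size: float, significant: bool) -> str:
    """Classify a single study outcome under the rules described in the post."""
    if significant:
        return "positive" if effect_size > 0 else "negative"
    if effect_size >= 0.25:
        return "substantively important-positive"   # large but non-significant
    if effect_size <= -0.25:
        return "substantively important-negative"   # large but non-significant
    return "indeterminate"                          # small and non-significant

# A large effect that never reached significance still earns the
# "substantively important-positive" label:
print(rate_outcome(effect_size=0.40, significant=False))
```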

People who follow the WWC and know about “substantively important” may assume that, strange as the rule is, it is rare in practice. But that is not true.

My graduate student, Amanda Inns, has just completed an analysis of the WWC’s own database, and if you are a big fan of the WWC, this is going to be a shock. Amanda looked at all WWC-accepted reading and math studies. Among these, she found a total of 339 individual outcomes rated “positive” or “potentially positive.” Of these, 155 (46%) reached the “potentially positive” level only because they had effect sizes over +0.25 but were not statistically significant.

Another 36 outcomes were rated “negative” or “potentially negative.” Twenty-six of these (72%) were categorized as “potentially negative” only because they had effect sizes less than -0.25 and were not significant. I’m sure the patterns would be similar for subjects other than reading and math.
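For readers who want to check the arithmetic, the percentages above, and the combined 48% figure in the next paragraph, follow directly from these counts (a quick sketch in Python):

```python
# Recomputing the percentages from Amanda Inns' counts reported above.
pos_total, pos_nonsig = 339, 155   # positive/potentially positive outcomes
neg_total, neg_nonsig = 36, 26     # negative/potentially negative outcomes

print(f"{pos_nonsig / pos_total:.0%}")   # 46% of positive ratings non-significant
print(f"{neg_nonsig / neg_total:.0%}")   # 72% of negative ratings non-significant
print(f"{(pos_nonsig + neg_nonsig) / (pos_total + neg_total):.0%}")   # 48% overall
```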

Put another way, almost half (48%) of the outcomes rated positive/potentially positive or negative/potentially negative by the WWC were not statistically significant. As one example of what I’m talking about, consider a program called The Expert Mathematician. It had just one study, with only 70 students in 4 classrooms (2 experimental and 2 control). The WWC re-analyzed the data to account for clustering, and the outcomes were nowhere near statistically significant, though they were greater than +0.25. This tiny study, and this study alone, caused The Expert Mathematician to receive the WWC’s “potentially positive” rating and to be ranked seventh among all middle school math programs. Similarly, Waterford Early Learning received a “potentially positive” rating based on a single tiny study with only 70 kindergartners in 6 schools. Effect sizes ranged from -0.71 to +1.11, and though the mean exceeded +0.25, it was far from significant. Yet this study alone put Waterford on the WWC’s list of proven kindergarten programs.
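To see why 70 students in 4 classrooms was “nowhere near” significance, consider a back-of-the-envelope clustering adjustment. This is my own sketch, not the WWC’s actual reanalysis, and the intraclass correlation of 0.20 is an assumed value, typical of achievement data.

```python
import math

# Back-of-the-envelope clustering adjustment for a study with
# 70 students in 4 classrooms. The ICC of 0.20 is an assumption.
n_students, n_clusters, icc = 70, 4, 0.20
m = n_students / n_clusters          # average cluster size (17.5)
deff = 1 + (m - 1) * icc             # design effect, about 4.3
n_eff = n_students / deff            # effective sample size, about 16

# Approximate standard error of a standardized effect size with two
# equal groups: SE ~ sqrt(4 / n_eff).
se = math.sqrt(4 / n_eff)
print(f"effective n = {n_eff:.0f}, SE of effect size = {se:.2f}")
# Even an effect size of +0.25 is only about half a standard error
# from zero, so no plausible result from this design could be significant.
```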

I’m not taking any position on whether these particular programs are in fact effective. All I am saying is that these very small studies with non-significant outcomes say absolutely nothing of value about that question.

I’m sure that some of you nerdier readers who have followed me this far are saying to yourselves, “well, sure, these substantively important studies may not be statistically significant, but they are probably unbiased estimates of the true effect.”

More bad news. They are not. Not even close.

The problem, also revealed in Amanda Inns’ data, is that studies with large effect sizes but no statistical significance tend to have very small samples (otherwise, their effects would have been significant). Across WWC reading and math studies that used individual-level assignment, median sample sizes were 48 for substantively important outcomes, 74 for significant outcomes, and 86 for indeterminate outcomes (non-significant with ES < +0.25). For cluster studies, the medians were 10, 17, and 33 clusters, respectively. In other words, “substantively important” outcomes came from studies roughly half the size of those behind other outcomes.

And small-sample studies greatly overstate effect sizes. Among all factors that bias effect sizes, small sample size is the most important (only the use of researcher- or developer-made measures comes close). So a non-significant positive finding in a small study is not an unbiased point estimate that merely needs a larger sample to reach significance. It is probably biased, in a consistent, positive direction. Studies with sample sizes under 100, for example, report mean effect sizes about three times those of studies with sample sizes over 1,000.
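One way to see the mechanism: the smallest effect a study can distinguish from zero shrinks rapidly as samples grow, so tiny studies can only ever make large effects look noteworthy. Here is a rough sketch, my own illustration using a standard normal approximation for two equal groups (not a formal power analysis):

```python
import math

# Smallest standardized effect size reaching two-tailed p < .05 with
# two equal groups of total size n, using SE ~ sqrt(4/n) and z = 1.96.
for n in (48, 100, 400, 1000):
    es_crit = 1.96 * math.sqrt(4 / n)
    print(f"n = {n:4d}: observed effect must exceed {es_crit:.2f}")
# n=48 requires an effect of about 0.57, while n=1000 requires only
# about 0.12. Small studies whose effects look large are therefore
# often sampling the lucky upper tail of the noise.
```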

But “substantively important” ratings can throw a monkey wrench into current policy. The ESSA evidence standards require statistically significant effects at all of the top three levels (strong, moderate, and promising). Yet many educational leaders are using the What Works Clearinghouse as a guide to which programs will meet ESSA evidence standards. They may logically assume that if the WWC says a program is effective, then the federal government stands behind it, regardless of what the ESSA evidence standards actually say. Yet in fact, based on the data Amanda Inns analyzed for reading and math, 46% of the outcomes rated positive/potentially positive by the WWC (taken to correspond to “strong” or “moderate,” respectively, under the ESSA evidence standards) are non-significant, and therefore do not qualify under ESSA.

The WWC needs to remove “substantively important” from its ratings as soon as possible, to avoid a collision with ESSA evidence standards, and to avoid misleading educators any further. Doing so would help make the WWC’s impact on ESSA substantive. And important.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

CDC Told to Avoid Use of “Evidence-Based”: Is the Earth Flat Again?

In this blog series, I generally try to stay non-partisan, avoiding issues that, though important, do not relate to evidence-based reform in education. However, the current administration has just crossed that line.

In a December 16 article in the Washington Post, Lena Sun and Juliet Eilperin reported that the Trump Administration has prohibited employees of the Centers for Disease Control and Prevention (CDC) from using seven words or phrases in their reports. Two of these are “evidence-based” and “science-based.” Admittedly, this relates to health, not education, but who could imagine that education will not be next?

I’m not sure exactly why “evidence-based” and “science-based” are included in a list of banned terms that otherwise consists of words such as “fetus,” “transgender,” and “diversity,” which have more obvious political overtones. The banning of “evidence-based” and “science-based” is particularly upsetting because evidence, especially in medicine, has up to now been such a non-partisan, good-government concept. Ultimately, Republicans and Democrats and their family members and friends get sick or injured, or want to prevent disease, and perhaps as a result, evidence-based health care has been popular on both sides of the aisle. In education, Republican House Speaker Paul Ryan and Democratic Senator Patty Murray have worked together as forceful advocates for evidence-based reform, as have many others. George W. Bush and Barack Obama both took personal and proactive roles in advancing evidence in education.

You have to go back a long time to find governments banning evidence itself. Perhaps you have to go back to Pope Paul V, whose Cardinal Bellarmine ordered Galileo in 1616 to: “…abandon completely the opinion that the sun stands still at the center of the world and the Earth moves…”

In fear for his life, Galileo agreed, but in 1633 he was accused of breaking his promise. He was threatened with torture, and had to submit again to the Pope’s demand. He was placed under house arrest for the rest of his life.

After his 1633 condemnation, Galileo was said to have muttered, “E pur si muove” (And yet it moves). If he did (historians doubt it), he was expressing defiance, but also a key principle of science: proven principles remain true even if we are no longer allowed to speak of them.

CDC officials were offered a new formulation to use instead of “evidence-based” and “science-based”: “CDC bases its recommendations on science in consideration with community standards and wishes.”

This is, of course, the antithesis of evidence or science. Does the Earth circle the sun in some states or counties, but the other way around in others? Who decides which scientific principles apply in a given location? Does objective science have any role at all, or are every community’s beliefs as valid as every other’s? Adopting the ban would hold back both research and the application of settled findings, harming millions of potential beneficiaries and making the U.S. a laughingstock among advanced nations. Banning the words “evidence-based” and “science-based” will not change scientific reality. But it may slow funding for research and the dissemination of proven treatments, and that would be disastrous, both in medicine and in education. I hope and expect that scientists in both fields will continue to find the truth and make it known whatever the consequences, and that our leaders of both parties will see the folly of this action and reverse it immediately.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

What the Election Might Mean for Evidence-Based Education

Like everyone else in America, I awoke on Wednesday to a new era. Not only was Donald Trump elected president, but the Republicans retained control of both houses of Congress. This election will surely have a powerful impact on the issues the president-elect and other Republicans campaigned on, but education was hardly discussed. The New York Times summarized Mr. Trump’s education positions in an October 31 article: he has spoken in favor of charters and other school choice plans, incentive pay for teachers, and not much else. A Trump administration will probably appoint a conservative Secretary of Education, who will have considerable influence on what happens next.

What might this mean for evidence-based reform in education? Hopefully, the new administration will embrace evidence, as embodied in the Every Student Succeeds Act (ESSA). Why? Because the Congress that passed ESSA less than a year ago is more or less the same Congress that was just elected. Significantly, Senators Rob Portman (R-Ohio), Michael Bennet (D-Colorado), and Patty Murray (D-Washington), some of the major champions of evidence in the Senate, were all just re-elected. Senator Lamar Alexander (R-Tennessee), a key architect of ESSA, is still in office. In the absence of a major push from the new executive branch, the Congress seems likely to continue its bipartisan support for the ESSA law.

Or so I fervently hope.

Evidence has not been a partisan issue, and I hope it will remain bipartisan. Everyone has an interest in seeing that education dollars are spent wisely to benefit children. The evidence movement has advanced far enough to offer real hope that step-by-step progress can take place in education as increasingly effective methods, materials, and software become available as direct outcomes of research and development. Evidence-based reform has strengthened through red and blue administrations. It should continue to grow through the new one.

Or so I fervently hope.

What if Evidence Doesn’t Match Ideology?

Several years ago when the Conservative Party was first coming into office in the U.K., I had an opportunity to meet with a High Government Official. He had been told that I was a supporter of phonics in early reading, and that was what he wanted to talk about. We chatted amicably for some time about our agreement on this topic.

Then the Great Man turned to another topic. What did I think about the evidence on ability grouping?

I explained that the evidence did not favor ability grouping, and was about to explain why when he cut me off with the internationally understood gesture meaning, “I’m a very busy and important person. Get out of my office immediately.” Ever since then, the British government has gotten along just fine without my advice.

What the Great Man was telling me, of course, is the depressing reality of why it is so difficult to change policy or practice with evidence. Most people value research when it supports the ideological position they already hold, and reject it when it does not. The result is that policy and practice remain an ideological struggle, little influenced by the actual findings of research. Advocates of a given position seek evidence to throw at their opponents, or to defend themselves from evidence thrown at them by the “other side.” And we all too often evaluate evidence based on the degree to which it corresponds to our pre-existing beliefs, rather than re-evaluating our beliefs in light of the evidence. I recall that at a meeting of Institute of Education Sciences (IES) grantees, a respected superintendent spoke to the whole assemblage and, entirely without irony or humor, defined good research as research that confirms his beliefs, and bad research as research that contradicts them.

A scientific field only begins to move forward when researchers and users of the research come to accept research findings whether or not they support their previous beliefs. Not that this is easy. Even in the most scientific of fields, it usually takes a great deal of research over an extended time period to replace a widely accepted belief with a contrary set of findings. In the course of unseating the old belief, researchers who dare to go against the current orthodoxy have difficulty finding an audience, funding, promotions, or respect, so it’s a lot easier to go with the flow. Yet true sciences do change their minds based on evidence, even if they must often be dragged kicking and screaming to the altar of knowledge. One classic example I’ve heard of involved the bacterial origin of gastric ulcers. Ulcers were once thought to be caused by stress, until an obscure Australian researcher deliberately gave himself an ulcer by drinking a solution swarming with gastric bacteria. He then cured himself with a drug known to kill those bacteria. Today, the stress theory is gone and the bacteria theory is dominant, but it wasn’t easy.

We education researchers are only just beginning to have enough confidence in our own research to expect policy makers, practitioners, and other researchers to change their beliefs on the basis of evidence. Yet education will not be an evidence-driven field until evidence begins routinely to change beliefs about what works for students and what does not. We need to change thinking not only about individual programs or principles, but about the role of evidence itself. This is one reason it is so important that research in education be of impeccable quality, so that we can have confidence that findings will replicate in future studies and generalize to many practical applications.

A high government official in health would never dismiss research on gastric ulcers because he or she still believed that ulcers are caused by stress. A high government official in agriculture would never dismiss research on the effects of certain farming methods on soil erosion. In the U.S., at least, our Department of Education has begun to value evidence and to encourage schools to adopt proven programs and practices, but there is a long way to go before education joins medicine and agriculture in willingness to recognize and promote findings of rigorous and replicated research. We’re headed in the right direction, but I have to admit that the difficulties getting there are giving me one heck of an ulcer.*

*Just kidding. I’m fine.

What Schools in One Place Can Learn from Schools Elsewhere

In a recent blog post, I responded to an article by Lisbeth Schorr and Srik Gopal about their concern that the findings of randomized experiments will not generalize from one set of schools to another. I got a lot of supportive responses to that post, but I realize that I left out a key point.

The missing point was this: the idea that effective programs readily generalize from one place to another is not theoretical. It happens all the time. I try to avoid talking about our own programs, but in this case, it’s unavoidable. Our Success for All program started almost 30 years ago, working with African American students in Baltimore. We got terrific results with those first schools. But our first dissemination schools beyond Baltimore included a Philadelphia school primarily serving Cambodian immigrants, rural schools in the South, small-town schools in the Midwest, and so on. We had to adapt and refine our approaches for these different circumstances, but we found positive effects across a very wide range of settings. Over the years, some of our most successful schools have been ones serving Native American students, such as a school in the Arizona desert and a school in far northern Quebec. Another category in which we see outstanding success is schools serving Hispanic students, including English language learners, as in the Alhambra district in Phoenix and a charter school near Los Angeles. One of our most successful districts anywhere is in small-city Steubenville, Ohio. We have established a successful network of SFA schools in England and Wales, where we have extraordinary schools primarily serving Pakistani, African, and disadvantaged White students in a very different policy context from the one we face in the U.S. And yes, we continue to find great results in Baltimore and in cities that resemble our original home, such as Detroit.

The ability to generalize from one set of schools to others is not at all limited to Success for All. Reading Recovery, for example, has had success in every kind of school, in countries throughout the world. Direct Instruction has also been successful in a wide array of types of schools. In fact, I’d argue that it is rare to find programs that have been proven to be effective in rigorous research that then fail to generalize to other schools, even ones that are quite different. Of course, there is great variation in outcomes in any set of schools using any innovative program, but that variation has to do with leadership, local support, resources, and so on, not with a fundamental limitation on generalizability to additional populations.

How is it possible that programs initially designed for one setting and population so often generalize to others? My answer would be that in most fundamental regards, the closer you get to the classroom, the more schools begin to resemble each other. Individual students do not all learn the same way, but every classroom contains a range of students who have a predictable set of needs. Any effective program has to be able to meet those needs, wherever the school happens to be located. For example, every classroom has some number of kids who are confident, curious, and capable, some number who are struggling, some number who are shy and quiet, some number who are troublemakers. Most contain students who are not native speakers of English. Any effective program has to have a workable plan for each of these types of students, even if the proportions of each may vary from classroom to classroom and school to school.

There are reasonable adaptations necessary for different school contexts, of course. There are schools where attendance is a big issue and others where it can be assumed, schools where safety is a major concern and others where it is less so. Schools in rural areas have different needs from those in urban or suburban ones, and obviously schools with many recent immigrants have different needs from those in which all students are native speakers of English. Involving parents effectively looks different in different places, and there are schools in which eyeglasses and other health concerns can be assumed to be taken care of and others where they are major impediments to success. But after the necessary accommodations are made, you come down to a teacher and twenty to thirty children who need to be motivated, to be guided, to have their individual needs met, and to have their time used to greatest effect. You need to have an effective plan to manage diverse needs and to inspire kids to see their own possibilities. You need to fire children’s imaginations and help them use their minds well to write and solve problems and imagine their own futures. These needs exist equally in Peru and Poughkeepsie, in the Arizona desert or the valleys of Wales, in Detroit or Eastern Kentucky, in California or Maine.

Disregarding evidence from randomized experiments because it does not always replicate is a recipe for the status quo, as far as the eye can see. And the status quo is unacceptable. In my experience, when programs fail to replicate it is because they were never all that successful in the first place, or because the form of the model being disseminated is much less robust than the one that was researched.

Generalization can happen. It happens all the time. It has to be planned for, designed for, not just assumed, but it can and does happen. Rather than using failure to replicate as a stick to beat evidence-based policy, let’s agree that we can learn to replicate, and then use every tool at hand to do so. There are so many vulnerable children who need better educations, and we cannot be distracted by arguments that “nothing replicates” that are contradicted by many examples throughout the world.

Kumbaya

When I was a kid, my brothers and I used to go to a YMCA camp on the Chesapeake Bay for a month every summer. My mother said that it was cheaper than feeding us, which was my first exposure to cost-effectiveness analysis.

At the camp, we did all the usual camp things. One of those was evening campfires with singing. This was a YMCA camp in the early 1960s, so we sang a lot of folk songs about peace, love, and understanding. I was reminded of this because I now have a granddaughter who loves a Peter, Paul, and Mary disk with just those songs on it, including Kumbaya.

Skip forward a few decades from those long-ago campfires. Today, the very word Kumbaya is used as an insult of sorts. It means that the person being insulted is an unrealistic idealist, who expects that social progress can be made by sitting around the campfire and singing. As a data-minded social scientist who expects evidence from randomized studies for just about everything, I should be firmly in the anti-Kumbaya camp, so to speak. But I’m not.

Let me be clear: I do not think that singing around campfires causes important social change. Yet I’d argue that a lack of Kumbaya is just as much a problem. Kumbaya-fueled idealism is the very core of evidence-based reform, in fact.

Here’s why. The greatest danger to evidence-based reform is the widespread belief that doing well-intentioned things is good enough, even if we don’t know whether they work. An idealist should never accept this. Good intentions are nice, but they do not bring about real Kumbaya. That depends on good outcomes.

Sitting around campfires and singing about peace, love, and understanding should be good preparation for actually caring whether your actions make the difference you intend them to make. Sure, life teaches you that it takes toughness to insist that good intentions become good actions, but you have to start with the good intentions.

So here is another verse to that ageless song:

Someone’s experimenting, Lord
Kumbaya
Randomizing, Lord
Kumbaya
Someone’s analyzing, Lord
Kumbaya
Oh Lord,
Kumbaya

Remembering Al Shanker: Teachers and Professionalism

Back in the day, I knew Al Shanker, the longtime president of the American Federation of Teachers. No one has ever been more of an advocate for teachers’ rights – or for their professionalism. At the same time, no one was more of an advocate for evidence as a basis for teaching. He saw no conflict between evidence-based teaching and professionalism. In fact, he saw them as complementary. He argued that in fields in which professionals possess unique knowledge and skills, backed up by research, those professionals are well respected, well compensated, and play a leading role in the institutions in which they work.

If teachers want to be taken seriously, they must be seen to be using methods, technologies, and materials that not just anyone knows how to use, and that are known to be effective. Think physicians, engineers, and lawyers. Their positions in society depend on their possession of specialized and proven knowledge and skills.

Yet when I speak about evidence-based reform, I often get questions from teachers about whether using evidence-proven programs will take away their professionalism, creativity, or independence. I am sympathetic to this question, because I am aware that teachers have had to put up with quite a lot in recent years. Teaching is increasingly being seen by government and the public as something anyone can do.

But how can the teaching profession turn this around? I think Al Shanker had the right answer. If teachers (and teacher educators) can honestly present themselves to the public as people who can select and use proven programs and practices, ones that not just anyone could use effectively, that would go a long way, I think, to enhancing the public’s perception of the professionalism of the field. It would also be awfully good for students, parents, and the economy, of course.

Al Shanker knew that teachers were going to have to publicly and fervently embrace evidence, both to do their jobs better and to make it clear that being a teacher requires knowledge and skills that the general public can respect. I’m certain that he would be a big fan of the new Every Student Succeeds Act (ESSA) evidence standards, which will help educators, policy makers, and researchers identify and put to use proven programs and practices.

Evidence-based reform is essential for kids, but also for teachers. Al Shanker knew that 30 years ago, and his AFT has been a champion for evidence ever since.