Evidence-Based Does Not Equal Evidence-Proven

As I speak to educational leaders about using evidence to help them improve outcomes for students, there are two words I hear all the time that give me the fantods (as Mark Twain would say):

Evidence-based

I like the first word, “evidence,” just fine, but the second word, “based,” sort of negates the first one. The ESSA evidence standards require programs that are evidence-proven, not just evidence-based, for various purposes.

“Evidence-proven” means that a given program, practice, or policy has been put to the test. Ideally, students, teachers, or schools have been assigned at random to use the experimental program or to remain in a control group. The program is provided to the experimental group for a significant period of time, at least a semester, and then final performance on tests that are fair to both groups is compared, using appropriate statistics.
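For readers who like to see the mechanics, here is a minimal sketch of the two steps just described: assigning units at random, then comparing final scores as a standardized mean difference. The function names and the choice of a pooled-standard-deviation effect size are my own illustrative assumptions; a real evaluation would also handle pretests, clustering, and attrition.

```python
import random
import statistics


def assign_at_random(units, seed=0):
    """Split units into experimental and control groups by shuffling.

    The seed is only there to make the illustration repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def effect_size(experimental_scores, control_scores):
    """Standardized mean difference using a pooled standard deviation."""
    n_e, n_c = len(experimental_scores), len(control_scores)
    var_e = statistics.variance(experimental_scores)
    var_c = statistics.variance(control_scores)
    pooled_sd = (((n_e - 1) * var_e + (n_c - 1) * var_c) / (n_e + n_c - 2)) ** 0.5
    mean_diff = statistics.mean(experimental_scores) - statistics.mean(control_scores)
    return mean_diff / pooled_sd


# Hypothetical school IDs, purely for illustration.
experimental, control = assign_at_random(range(10))
```

The effect size alone is not enough, of course; it still has to be tested for statistical significance with methods appropriate to the design.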

If your doctor gives you medicine, it is evidence-proven. It isn’t just the same color or flavor as something proven, and it isn’t just generally in line with what research suggests might be a good idea. Instead, it has been found to be effective, compared to current standards of care, in rigorous studies.

“Evidence-based,” on the other hand, is one of those wiggle words that educators love to use to indicate that they are up-to-date and know what’s expected, but don’t actually intend to do anything different from what they are doing now.

Evidence-based is today’s equivalent of “based on scientifically-based research” in No Child Left Behind. It sure sounded good, but what educational program or practice can’t be said to be “based on” some scientific principle?

In a recent Brookings article, Mark Dynarski wrote about state ESSA plans and about conversations he’s heard among educators. He says that the plans are loaded with the words “evidence-based,” but with little indication of what specific proven programs they plan to implement, or how they plan to identify, disseminate, implement, and evaluate them.

I hope the ESSA evidence standards give leaders in even a few states the knowledge and the courage to insist on evidence-proven programs, especially in very low-achieving “school improvement” schools that desperately need the very best approaches. I remain optimistic that ESSA can be used to expand evidence-proven practices. But will it in fact have this impact? That remains to be proven.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

“Substantively Important” Isn’t Substantive. It Also Isn’t Important

Since it began in 2002, the What Works Clearinghouse has played an important role in finding, rating, and publicizing findings of evaluations of educational programs. It performs a crucial function for evidence-based reform. For this very reason, it needs to be right. But in several important ways, it uses procedures that are indefensible and have a big impact on its conclusions.

One of these relates to a study rating called “substantively important-positive.” This refers to study outcomes with an effect size of at least +0.25, but that are not statistically significant. I’ve written about this before, but the WWC has recently released a database of information on its studies that makes it easy to analyze WWC data on a large scale, and we have learned a lot more about this topic.

Study outcomes rated as “substantively important-positive” can qualify a study as “potentially positive,” the second-highest WWC rating. “Substantively important-negative” findings (non-significant effect sizes less than -0.25) can cause a study to be rated as “potentially negative.” Under current rules, a single “potentially negative” rating ensures that a program can never receive a rating better than “mixed,” even if other studies found hundreds of significant positive effects.
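To make the rule concrete, here is a rough sketch of how a single outcome would be labeled under the definitions above. It is a simplification of my own, not the WWC’s actual procedure: the function name is hypothetical, the 0.25 cutoff is applied symmetrically, and the real rules involve adjustments (for clustering, multiple comparisons, and so on) that are not modeled here.

```python
def classify_outcome(effect_size, significant, threshold=0.25):
    """Label one outcome using the simplified rule described in the text.

    `significant` is assumed to be decided elsewhere, by whatever test
    is appropriate to the study design.
    """
    if significant:
        return "significant positive" if effect_size > 0 else "significant negative"
    if effect_size >= threshold:
        return "substantively important-positive"
    if effect_size <= -threshold:
        return "substantively important-negative"
    return "indeterminate"


# A non-significant outcome with a large effect size still earns a label.
label = classify_outcome(0.30, significant=False)
```

Notice that the middle two branches never look at sample size or precision at all, which is the heart of the problem.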

People who follow the WWC and know about “substantively important” may assume that it is a strange rule, but one that is relatively rare in practice. That is not true.

My graduate student, Amanda Inns, has just done an analysis of WWC data from their own database, and if you are a big fan of the WWC, this is going to be a shock. Amanda has looked at all WWC-accepted reading and math studies. Among these, she found a total of 339 individual outcomes rated “positive” or “potentially positive.” Of these, 155 (46%) reached the “potentially positive” level only because they had effect sizes over +0.25, but were not statistically significant.

Another 36 outcomes were rated “negative” or “potentially negative.” 26 of these (72%) were categorized as “potentially negative” only because they had effect sizes less than -0.25 and were not significant. I’m sure patterns would be similar for subjects other than reading and math.

Put another way, almost half (48%) of outcomes rated positive/potentially positive or negative/potentially negative by the WWC were not statistically significant.

As one example of what I’m talking about, consider a program called The Expert Mathematician. It had just one study with only 70 students in 4 classrooms (2 experimental and 2 control). The WWC re-analyzed the data to account for clustering, and the outcomes were nowhere near statistically significant, though they were greater than +0.25. This tiny study, and this study alone, caused The Expert Mathematician to receive the WWC “potentially positive” rating and to be ranked seventh among all middle school math programs. Similarly, Waterford Early Learning received a “potentially positive” rating based on a single tiny study with only 70 kindergarteners in 6 schools. The outcomes ranged from -0.71 to +1.11, and though the mean was more than +0.25, the outcome was far from significant. Yet this study alone put Waterford on the WWC list of proven kindergarten programs.

I’m not taking any position on whether these particular programs are in fact effective. All I am saying is that these very small studies with non-significant outcomes say absolutely nothing of value about that question.
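For anyone who wants to check the arithmetic, the percentages quoted above follow directly from the counts in Amanda Inns’ analysis. The snippet below just reproduces them; the variable names are mine, and the counts are the ones reported in this post.

```python
# Counts reported above from the WWC reading and math outcomes.
positive_total = 339      # rated "positive" or "potentially positive"
positive_nonsig = 155     # of those, non-significant with ES over +0.25
negative_total = 36       # rated "negative" or "potentially negative"
negative_nonsig = 26      # of those, non-significant with ES under -0.25


def percent(part, whole):
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


positive_share = percent(positive_nonsig, positive_total)
negative_share = percent(negative_nonsig, negative_total)
overall_share = percent(positive_nonsig + negative_nonsig,
                        positive_total + negative_total)

print(positive_share, negative_share, overall_share)  # prints: 46 72 48
```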

I’m sure that some of you nerdier readers who have followed me this far are saying to yourselves, “well, sure, these substantively important studies may not be statistically significant, but they are probably unbiased estimates of the true effect.”

More bad news. They are not. Not even close.

The problem, also revealed in Amanda Inns’ data, is that studies with large effect sizes but not statistical significance tend to have very small sample sizes (otherwise, they would have been significant). Across WWC reading and math studies that used individual-level assignment, median sample sizes were 48, 74, and 86 for substantively important, significant, and indeterminate (non-significant with ES < +0.25) outcomes, respectively. For cluster studies, they were 10, 17, and 33 clusters, respectively. In other words, “substantively important” outcomes averaged less than half the sample sizes of other outcomes.
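The “otherwise, they would have been significant” point can be sketched with a back-of-the-envelope calculation. Under some simplifying assumptions of my own (two equal-sized groups, individual-level assignment, no covariates, and a normal approximation to the test statistic, so z is roughly ES times the square root of n/2), we can ask how large an observed effect size must be to reach significance at a given sample size, and how many students per group an effect size of +0.25 would need. This is not how the WWC computes significance; it is only meant to show the shape of the relationship.

```python
import math
from statistics import NormalDist


def min_significant_effect(n_per_group, alpha=0.05):
    """Smallest observed effect size that is significant (two-tailed),
    using the approximation z = ES * sqrt(n_per_group / 2)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_crit * math.sqrt(2 / n_per_group)


def min_n_per_group(effect_size, alpha=0.05):
    """Smallest group size at which an observed effect size of this
    magnitude becomes significant under the same approximation."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(2 * (z_crit / effect_size) ** 2)


# A study with 48 students, assumed here to be split 24 and 24.
threshold_at_48 = min_significant_effect(24)
students_needed = min_n_per_group(0.25)
```

Under these assumptions, a 48-student study would need an observed effect size of well over +0.50 to be significant, while an effect size of exactly +0.25 would need more than a hundred students per group. Everything in between is the territory of “substantively important.”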

And small-sample studies greatly overstate effect sizes. Among all factors that bias effect sizes, small sample size is the most important (only use of researcher/developer-made measures comes close). So a non-significant positive finding in a small study is not an unbiased point estimate that just needs a larger sample to show its significance. It is probably biased, in a consistent, positive direction. Studies with sample sizes less than 100 have about three times the mean effect sizes of studies with sample sizes over 1000, for example.
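One mechanism behind this bias is easy to simulate, though it is surely not the only one at work. Suppose a program’s true effect is modest (I assume +0.10 here purely for illustration), and suppose each study’s observed effect size is drawn from a normal distribution around that truth with standard error sqrt(2/n) for n students per group. If we then keep only the outcomes that clear the +0.25 bar, the small studies that survive will, on average, overstate the effect far more than the large ones. All of the numbers and names below are assumptions for the sketch, not findings from the WWC data.

```python
import math
import random
import statistics


def mean_selected_effect(true_effect, n_per_group, threshold=0.25,
                         draws=20000, seed=1):
    """Average observed effect size among simulated outcomes that meet
    the threshold. Returns None if no simulated outcome qualifies."""
    rng = random.Random(seed)
    standard_error = math.sqrt(2 / n_per_group)  # approximate SE of ES
    observed = (rng.gauss(true_effect, standard_error) for _ in range(draws))
    selected = [es for es in observed if es >= threshold]
    return statistics.mean(selected) if selected else None


small_study_mean = mean_selected_effect(0.10, n_per_group=24)
large_study_mean = mean_selected_effect(0.10, n_per_group=500)
```

In this toy setup, the qualifying outcomes from small studies average several times the true effect, while the few large-study outcomes that qualify sit just above the cutoff. Selecting on a large effect size in a small sample is, in other words, a recipe for overstatement.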

But “substantively important” ratings can throw a monkey wrench into current policy. The ESSA evidence standards require statistically significant effects for all of their top three levels (strong, moderate, and promising). Yet many educational leaders are using the What Works Clearinghouse as a guide to which programs will meet ESSA evidence standards. They may logically assume that if the WWC says a program is effective, then the federal government stands behind it, regardless of what the ESSA evidence standards actually say. Yet in fact, based on the data analyzed by Amanda Inns for reading and math, 46% of the outcomes rated as positive/potentially positive by WWC (taken to correspond to “strong” or “moderate,” respectively, under ESSA evidence standards) are non-significant, and therefore do not qualify under ESSA.

The WWC needs to remove “substantively important” from its ratings as soon as possible, to avoid a collision with ESSA evidence standards, and to avoid misleading educators any further. Doing so would help make the WWC’s impact on ESSA substantive. And important.

How Networks of Proven Programs Could Help State-Level Reform

America is a great country, but it presents a serious problem for school reformers. The problem is that it is honkin’ humongous, with strong traditions of state and local autonomy. Reforming even a single state is a huge task, because most of our states are the size of entire small nations. (My small state, Maryland, has about the population of Scotland, for example.) And states, districts, schools, and teachers are all kind of prickly about taking orders from anyone further up the hierarchy.

The Every Student Succeeds Act (ESSA) puts a particular emphasis on state and local control, a relief after the emphasis on mandates from Washington central to No Child Left Behind. ESSA also contains a welcome focus on using evidence-based programs.

ESSA is new, and state, district and school leaders are just now grappling with how to use the ESSA opportunities to move forward on a large scale. How can states hope to bring about major change on a large scale, working one school at a time?

The solution to this problem might be for states, large districts, or coalitions of smaller districts to offer a set of proven, whole school reform models to a number of schools in need of assistance, such as Title I schools. School leaders and their staffs would have opportunities to learn about programs, find some appropriate to their needs, ideally visit schools using the programs now, and match the programs with their own needs, derived from a thorough needs assessment. Ultimately, all school staff might vote, and at least 80% would have to vote in favor. The state or district would set aside federal or state funds to enable schools to afford the program they have chosen.

All schools in the state, district, or consortium that selected a given program could then form a network. The network would have regular meetings among principals, teachers of similar grades, and other job-alike staff members, to provide mutual help, share ideas, and interact cost-effectively with representatives of program providers. Network members would share a common language, and drawing from common experiences could be of genuine help to each other. The network arrangement would also reduce the costs of adopting each program, because it would create local scale to reduce costs of training and coaching.

The benefits of such a plan would be many. First, schools would be implementing programs they selected, and school staffs would be likely to put their hearts and minds into making the program work. Because the programs would all have been proven to be effective in the first place, they would be very likely to be measurably effective in these applications.

There might be schools that would initially opt not to choose anything, and this would be fine. Such schools would have opportunities each year to join colleagues in one of the expanding networks as they see that the programs are working in their own districts or regions.

As the system moved forward, it would become possible to do high-quality evaluations of each of the programs, contributing to knowledge of how each program works in particular districts or areas.

As the number of networked schools increased across a given state, it would begin to see widespread and substantial gains on state assessments. Further, all involved in this process would be learning not only the average effectiveness of each program, but also how to make each one work, and how to use programs to succeed with particular subgroups or solve particular problems. Networks, program leaders, and state, district, and school leaders, would get smarter each year about how to use proven programs to accelerate learning among students.

How could this all work at scale? The answer is that there are nonprofit organizations and companies that are already capable of working with hundreds of schools. At the elementary level, examples include the Children’s Literacy Initiative, Positive Action, and our own Success for All. At the secondary level, examples include BARR, the Talent Development High School, Reading Apprenticeship, and the Institute for Student Achievement. Other programs currently work with specific curricula and could partner with other programs to provide whole-school approaches, or some schools may only want or need to work on narrower problems. The programs are not that expensive at scale (few are more than $100 per student per year), and could be paid for with federal funds such as school improvement, Title I, Title II, and Striving Readers, or with state or local funds.

The proven programs do not ask schools to reinvent the wheel, but rather to put their efforts and resources toward adopting and effectively implementing proven programs and then making necessary adaptations to meet local needs and circumstances. Over time this would build capacity within each state, so that local people could take increasing responsibility for training and coaching, further reducing costs and increasing local “flavor.”

We’ve given mandates 30 years to show their effectiveness. ESSA offers new opportunities to do things differently, allowing states and districts greater freedom to experiment. It also strongly encourages the use of evidence. This would be an ideal time to try a simple idea: use what works.

This blog is sponsored by the Laura and John Arnold Foundation

Where Will the Capacity for School-by-School Reform Come From?

In recent months, I’ve had a number of conversations with state and district leaders about implementing the ESSA evidence standards. To its credit, ESSA diminishes federal micromanaging, and gives more autonomy to states and locals, but now that the states and locals are in charge, how are they going to achieve greater success? One state department leader described his situation in ESSA as being like that of a dog who’s been chasing cars for years, and then finally catches one. Now what?

ESSA encourages states and local districts to help schools adopt and effectively implement proven programs. For school improvement, portions of Title II, and Striving Readers, ESSA requires use of proven programs. Initially, state and district folks were worried about how to identify proven programs, though things are progressing on that front (see, for example, www.evidenceforessa.org). But now I’m hearing a lot more concern about capacity to help all those individual schools do needs assessments, select proven programs aligned with their needs, and implement them with thought, care, and knowledgeable application of implementation science.

I’ve been in several meetings where state and local folks ask federal folks how they are supposed to implement ESSA. “Regional educational labs will help you!” they suggest. With all due respect to my friends in the RELs, this is going to be a heavy lift. There are ten of them, in a country with about 52,000 Title I schoolwide projects. So each REL is responsible for, on average, five states, 1,400 districts, and 5,200 high-poverty schools. For this reason, RELs have long been primarily expected to work with state departments. There are just not enough of them to serve many individual districts, much less schools.
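The per-REL workload above is simple division, sketched below. The REL and school counts come from this post; the totals of 50 states and roughly 14,000 districts are my assumptions, chosen to be consistent with the per-REL averages quoted.

```python
NUM_RELS = 10

national_totals = {
    "states": 50,                    # assumption: the 50 states only
    "districts": 14_000,             # assumption: approximate national count
    "title_i_schoolwide": 52_000,    # figure cited in the text
}

# Average load each regional lab would carry if work were split evenly.
per_rel = {name: total // NUM_RELS for name, total in national_totals.items()}

print(per_rel)
```

Even split perfectly evenly, that is thousands of high-poverty schools per lab.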

State departments of education and districts can help schools select and implement proven programs. For example, they can disseminate information on proven programs, make sure that recommended programs have adequate capacity, and perhaps hold effective methods “fairs” to introduce people in their state to program providers. But states and districts rarely have capacity to implement proven programs themselves. It’s very hard to build state and local capacity to support specific proven programs. For example, when the frequent downturns in state or district funding come, the first departments to be cut back or eliminated often involve professional development. For this reason, few state departments or districts have large, experienced professional development staffs. Further, constant changes in state and local superintendents, boards, and funding levels make it difficult to build up professional development capacity over a period of years.

Because of these problems, schools have often been left to make up their own approaches to school reform. This happened on a wide scale in the NCLB School Improvement Grants (SIG) program, where federal mandates specified very specific structural changes but left the essentials, teaching, curriculum, and professional development, up to the locals. The MDRC evaluation of SIG schools found that they made no better gains than similar, non-SIG schools.

Yet there is substantial underutilized capacity available to help schools across the U.S. to adopt proven programs. This capacity resides in the many organizations (both non-profit and for-profit) that originally created the proven programs, provided the professional development that caused them to meet the “proven” standard, and likely built infrastructure to ensure quality, sustainability, and growth potential.

The organizations that created proven programs have obvious advantages (their programs are known to work), but they also have several less obvious advantages. One is that organizations built to support a specific program have a dedicated focus on that program. They build expertise on every aspect of the program. As they grow, they hire capable coaches, usually ones who have already shown their skills in implementing or leading the program at the building level. Unlike states and districts that often live in constant turmoil, reform organizations or for-profit professional development organizations are likely to have stable leadership over time. In fact, for a high-poverty school engaged with a program provider, that provider and its leadership may be the only partner stable enough to help the school with its core teaching for many years.

State and district leaders play major roles in accountability, management, quality assurance, and personnel, among many other issues. With respect to implementation of proven programs, they have to set up conditions in which schools can make informed choices, monitor the performance of provider organizations, evaluate outcomes, and ensure that schools have the resources and supports they need. But truly reforming hundreds of schools in need of proven programs one at a time is not realistic for most states and districts, at least not without help. It makes a lot more sense to seek capacity in organizations designed to provide targeted professional development services on proven programs, and then coordinate with these providers to ensure benefits for students.

The Sweet Land of Carrots: Promoting Evidence with Incentives

Results for America (RFA) released a report in July analyzing the first 17 Every Student Succeeds Act (ESSA) plans submitted by states. RFA was particularly interested in the degree to which evidence of effectiveness was represented in the plans, and the news is generally good. All states discussed evidence (it’s in the law), but many went much further, proposing to award competitive funding to districts to the degree that they propose to adopt programs proven to be effective according to the ESSA evidence standards. This was particularly true of school improvement grants, where the ESSA law requires evidence, but many state plans extended this principle beyond school improvement into other areas.

As an incurable optimist, I think this all looks very good. If state leaders are clear about what qualifies as “proven” under ESSA, and clear about how proper supports are also needed (e.g., needs assessments, high-quality implementation), then this creates an environment in which evidence will, at long last, play an important role in education policy. This was always the intent of the ESSA evidence standards, which were designed to make it easy for states and districts to identify proven programs so that they could incentivize and assist schools in using such programs.

The focus on encouragement, incentives, and high-quality implementation is a hallmark of the evidence elements of ESSA. To greatly oversimplify, ESSA moves education policy from the frightening land of sticks to the sweet land of carrots. Even though ESSA specifies that schools performing in the lowest 5% of their states must select proven programs, schools still have a wide range of choices that meet ESSA evidence standards. Beyond school improvement, Title II, Striving Readers, and other federal programs already provide funds to schools promising to adopt proven programs, or at least provide competitive preference to applicants promising to implement qualifying programs. Instead of the top-down, over-specific mandates of NCLB, ESSA provides incentives to use proven programs, but leaves it up to schools to pick the ones that are most appropriate to their needs.

There’s an old (and surely apocryphal) story about two approaches to introduce innovations. After the potato was introduced to Europe from the New World, the aristocracy realized that potatoes were great peasant food, rich in calories, easy to grow, and capable of thriving in otherwise non-arable land. The problem was, the peasants didn’t want to have anything to do with potatoes.

Catherine the Great of Russia approached the problem by capturing a few peasants, tying them up, and force-feeding them potatoes. “See?” said her ministers. “They ate potatoes and didn’t die.”

Louis XIV of France had a better idea. His ministers planted a large garden with potatoes, just outside of Paris, and posted a very sleepy guard over it. The wily peasants watched the whole process, and when the guard was asleep, they dug up the potatoes, ate them with great pleasure, and told all their friends how great they were. The word spread like wildfire, and soon peasants all over France were planting and eating potatoes.

The potato story is not precisely carrots and sticks, but it contains the core message. No matter how beneficial an innovation may be, there is always a risk and/or a cost in being the first on your block to adopt it. That risk/cost can be overcome if the innovation is super cool, or if early innovators gain status (as in Louis XIV’s potato strategy). Alternatively, or in addition, providing incentives to prime the pump, to get early adopters out promoting innovations to their friends, is a key part of a strategy to spread proven innovations.

What isn’t part of any effective dissemination plan is sticks. If people feel they must adopt particular policies from above, they are likely to be resentful, and to reason that if the government has to force you to do something, there must be something wrong with it. The moment the government stops monitoring compliance or policies change, the old innovations are dropped like, well, hot potatoes. That was the Catherine the Great strategy. The ESSA rules for school improvement do require that schools use proven programs but this is very different from being told which specific programs they must use, since they have a lot of proven programs to choose from. If schools still can choose which program to implement, then those who do make the choice will put all their energy into high-quality implementation. This is why, in our Success for All program, we require a vote of 80% of school staff in favor of program adoption.

My more cynical friends tell me that once again, I’m being overly optimistic. States, districts, and schools will pretend to adopt proven programs to get their money, they say, but won’t actually implement anything, or will do so leaving out key components, such as adequate professional development. I’m realistic enough to know that this will in fact happen in some places. Enthusiastic and informed federal, state, and district leadership will help avoid this problem, but it cannot be avoided entirely.

However, America is a very big country. If just a few states, for example, wholeheartedly adopted pro-evidence policies and provided technical assistance in selecting, implementing, evaluating, and continuously improving proven programs, they would surely have a substantial impact on their students. And other states would start to notice. Pretty soon, proven programs would be spreading like French fries.

I hope the age of the stick is over, and the age of the sweet carrot has arrived. ESSA has contributed to this possibility, but visionary state and district leaders will have to embrace the idea that helping and incentivizing schools to use proven programs is the best way to rapidly expand their use. And expanding well-implemented proven programs is the way to improve student achievement on a state or national scale. The innovations will be adopted, thoughtfully implemented, and sustained for the right reason – because they work for kids.

Immigrants and Evidence

My grandfather was an immigrant from Argentina, by way of Ellis Island. My three children were all adopted from Chile, so I’d experienced naturalization before. But last week, for the first time, I saw a naturalization ceremony for adults. My oldest son married a wonderful Russian woman, and she just became a U.S. citizen.

The whole experience was quite impressive. Perhaps fifty people from 18 different countries all over the globe were sworn in. The staff couldn’t have been more welcoming. They showed a video, just a slide show, showing pictures of immigrants over time. A new citizen from Mexico volunteered to read the Pledge of Allegiance—so worn by constant usage to most of us, but full of meaning and promise to this group: “…with liberty and justice for all.” Stop and think what those words must mean to immigrants from places in which these concepts do not exist. By my count, in 15 of the 18 countries from which these new citizens came, you could be arrested for criticizing the government.

In history, and up to the present, immigrants come to America for many reasons and in many circumstances, but they know for sure that the streets of America are not made of gold. For most, they are made of hard work, long hours in two or three menial jobs, not to mention cultural disruption, hardship, and all too often, discrimination. Perhaps life is materially better in America, perhaps it’s not. So why do so many come to our shores?

The answer for most: they come for their children, not for themselves. Even for children they don’t have yet. It’s the second or third generation, not the first, that most benefits from immigration. My grandfather from Argentina arrived with little education, no money, and no English. He became a sign painter. But my father, helped by the New York City Public Schools and then the GI Bill, went to college and graduate school, and became a clinical psychologist.

There are two key factors in every immigrant’s story of triumph. One is the determination of loving parents. But the second is the school. The children of immigrants who succeed in school achieve the American Dream, for themselves and for our country. That’s the way things should happen, in a country founded on an ideology of the perfectibility of mankind through the powerful impact of opportunity and education.

For all of us as educators, this is a weighty responsibility. We have to see the promise in every child, immigrant or native born, and then do our part to make that promise a reality.

As researchers, developers, publishers, principals, teachers, and citizens, the responsibility for children’s futures requires that we do whatever it takes to see that all students succeed. Using proven programs is, of course, a part of this. It’s simply not good enough to have a list of excuses to explain why we cannot help far more of our most at-risk students to succeed. Sure, innovation is hard. It takes money, time, effort, and breaking of long-established routines. Many educators would prefer to just use the textbook because it’s easy. Others would prefer to make up their own, untested approaches. But schools were not built for us educators. They were built for the kids, and we owe it to every one of them to use proven strategies with enthusiasm, care, knowledge, and skill. This means developing and validating approaches specifically for the children of immigrants, but also improving instructional practices for all students.

A school full of the children of immigrants is full of wonderful stories yet to be told, versions of the same stories of triumph we tell of our own families. We cannot do any less than we are able to do to see that these stories come to pass. Immigrants do not ask for any guarantees, for themselves or for their children, but they do ask for opportunity. Enhancing the effectiveness of our schools is the best way we have to give them that opportunity and to thereby build the nation we want. And need.

Research and Development Saved Britain. Maybe They Will Save U.S. Education

One of my summer goals is to read the entire six-volume history of the Second World War by Winston Churchill. So far, I’m about halfway through the first volume, The Gathering Storm, about the period leading up to 1939.

The book is more or less a wonderfully written rant about the Allies’ shortsightedness. As Hitler built up his armaments, Britain, France, and their allies maintained a pacifist insistence on reducing theirs. Only in the mid-thirties, when war was inevitable, did Britain start investing in armaments, but even then at a very modest pace.

Churchill was a Member of Parliament but was out of government. However, he threw himself into the one thing he could do to help Britain prepare: research and development. In particular, he worked with top scientists to develop the capacity to track, identify, and shoot down enemy aircraft.

When the 1940 Battle of Britain came and German planes tried to destroy and demoralize Britain in advance of an invasion, the inventions by Churchill’s group were a key factor in defeating them.

Churchill’s story is a good analogue to the situation of education research and development. In the current environment, the best-evaluated, most effective programs are not in wide use in U.S. schools. But the research and development that creates and evaluates these programs is essential. It is useful right away in hundreds of schools that do use proven programs already. But imagine what would happen if federal, state, or local governments anywhere decided to use proven programs to combat their most important education problems at scale. Such a decision would be laudable in principle, but where would the proven programs come from? How would they generate convincing evidence of effectiveness? How would they build robust and capable organizations to provide high-quality professional development, materials, and software?

The answer is research and development, of course. Just as Churchill and his scientific colleagues had to create new technologies before Britain was willing to invest in air defenses and air superiority at scale, so American education needs to prepare for the day when government at all levels is ready to invest seriously in proven educational programs.

I once visited a secondary school near London. It’s an ordinary school now, but in 1940 it was a private girls’ school. A German plane, shot down in the Battle of Britain, crash landed near the school. The girls ran out and captured the pilot!

The girls were courageous, as was the British pilot who shot down the German plane. But the advanced systems the British had worked out and tested before the war were also important to saving Britain. In education reform we are building and testing effective programs and organizations to support them. When government decides to improve student learning nationwide, we will be ready, if investments in research and development continue.
