The Mystery of the Chinese Dragon: Why Isn’t the WWC Up to Date?

As evidence becomes more important in educational practice and policy, it is increasingly critical that it be up-to-date. This sounds obvious. Of course we’d prefer evidence from recent studies, which are more likely to have been done under social and political conditions like those that exist today, using standards like those prevailing today.

However, there are reasons that up-to-date evidence is especially important in today’s policy environment. Up-to-date evidence is critical because it is far more likely than earlier research to meet very high methodological standards. Because of substantial investments by the U.S. Department of Education and others, there has been an outpouring of top-quality, randomized, usually third-party evaluations of programs for all subjects and grade levels, published from 2012 to the present.

The reason this matters in practice is that, to satisfy ESSA evidence standards, many educators are using the What Works Clearinghouse (WWC) to identify proven programs. Yet the WWC is very slow in reviewing studies, so it is missing many of the latest, and therefore highest-quality, studies.

The graph below illustrates the problem. It compares all secondary literacy (grades 6-12) studies reported by the WWC as of fall 2017 (the orange line) with those reported in a review of research on the same topic by Baye et al. (2017; see www.bestevidence.org), shown as the blue line. I think the graph resembles a Chinese dragon, with its jaws wide open and a long tail: the sort of dragon you see in Chinese New Year’s parades.

[Figure: Number of secondary literacy (grades 6-12) studies accepted per publication year, Baye et al. (2017) vs. the WWC.]

What the graph shows is that while the numbers of studies published up to 2009 were about equal for Baye et al. and the WWC, the two lines diverged sharply in 2010 (thus the huge open jaws). Baye et al. reported on 58 studies published from 2010 to 2017. The WWC reported on only 6, and none at all from 2016 or 2017.

The same patterns are apparent throughout the WWC. Across every topic and grade level, the WWC has only 7 accepted studies from 2014, 7 from 2015, zero from 2016, and zero from 2017.

It is likely that every one of the Baye et al. studies would meet WWC standards. Yet the WWC has just not gotten to them.

It’s important to note that the What Works Clearinghouse is plenty busy. Recent studies are often included in Quick Reviews, Single Study Reviews, Grant Competition Reports, and Practice Guides. However, an educator going to the WWC for guidance on what works will go to Find What Works and click on one of the 12 topic areas, which will list programs. They then may filter their search and go to intervention reports.

These intervention reports are not integrated with Quick Reviews, Single Study Reviews, Grant Competition Reports, or Practice Guides, so the user has no easy way to find out about more recent evaluations, if they in fact appear anywhere in any of these reports. Even if users do somehow find additional information on a program in one of these supplemental reports, the information may be incomplete. In many cases, the supplemental report only notes whether a study meets WWC standards, but does not provide any information about what the outcome was.

The slow pace of the WWC reviews is problematic for many reasons. In addition to missing out on the strongest and most recent studies, the WWC does not register changes in the evidence base for programs already in its database. New programs may not appear at all, leaving readers to wonder why.

Any website developer knows that if users go to a website and are unable to find what they expect to find, they are unlikely to come back. The WWC is a website, and it cannot expect many users to check back every few months to see if programs that interest them, which they know to exist, have been added lately.

In the context of the ESSA evidence standards, the slow pace of the WWC is particularly disturbing. Although the WWC has chosen not to align itself with ESSA standards, many educators use the WWC as a guide to which programs are likely to meet ESSA standards. Failing to keep the WWC up to date may convince many users seeking ESSA information that there are few programs meeting either WWC or ESSA standards.

Educators need accurate, up-to-date information to make informed choices for their students. I hope the WWC will move quickly to provide its readers with essential, useful data on today’s evidence supporting today’s programs. It’s going to have to catch up with the Chinese dragon, or be left to watch the parade going by.

…But It Was The Very Best Butter! How Tests Can Be Reliable, Valid, and Worthless

I was recently having a conversation with a very well-informed, statistically savvy, and experienced researcher, who was upset that we do not accept researcher- or developer-made measures for our Evidence for ESSA website (www.evidenceforessa.org). “But what if a test is reliable and valid,” she said. “Why shouldn’t it qualify?”

I inwardly sighed. I get this question a lot. So I thought I’d write a blog on the topic, so at least the people who read it, and perhaps their friends and colleagues, will know the answer.

Before I get into the psychometric stuff, I should say in plain English what is going on here, and why it matters. Evidence for ESSA excludes researcher- and developer-made measures because they enormously overstate effect sizes. Marta Pellegrini, at the University of Florence in Italy, recently analyzed data from every reading and math study accepted for review by the What Works Clearinghouse (WWC). She compared outcomes on tests made up by researchers or developers to those that were independent. The average effect sizes across hundreds of studies were +0.52 for researcher/developer-made measures, and +0.19 for independent measures. Almost three to one. We have also made similar comparisons within the very same studies, and the differences in effect sizes averaged 0.48 in reading and 0.45 in math.
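
To make the “almost three to one” arithmetic explicit, the ratio of the two average effect sizes reported above is

$$\frac{+0.52}{+0.19} \approx 2.7,$$

and the within-study gaps of 0.48 and 0.45 mean that, for the very same programs, switching from an independent test to a researcher- or developer-made test added nearly half a standard deviation to the apparent impact.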

Wow.

How could there be such a huge difference? The answer is that researchers’ and developers’ tests often focus on what they know will be taught in the experimental group but not in the control group. A vocabulary experiment might use a test that contains the specific words emphasized in the program. A science experiment might use a test that emphasizes the specific concepts taught in the experimental units but not in the control group. A program using technology might test students on a computer, which the control group did not experience. Researchers and developers may also give tests that use response formats like those in the experimental materials, but not those used in control classes.

Very often, researchers or developers have a strong opinion about what students should be learning in their subject, and they make a test that represents to them what all students should know, in an ideal world. However, if only the experimental group experienced content aligned with that curricular philosophy, then they have a huge unfair advantage over the control group.

So how can it be that using even the most reliable and valid tests doesn’t solve this problem?

In Alice in Wonderland, the March Hare tries to fix the Mad Hatter’s watch by opening it and putting butter in the works. This does not help at all, and the March Hare protests, “But it was the very best butter!”

The point of the “very best butter” conversation in Alice in Wonderland is that something can be excellent for one purpose (e.g., spreading on bread), but worse than useless for another (e.g., fixing watches).

Returning to assessment, a test made by a researcher or developer might be ideal for determining whether students are making progress in the intended curriculum, but worthless for comparing experimental to control students.

Reliability (the ability of a test to give the same answer each time it is given) has nothing at all to do with the situation. Validity comes into play where the rubber hits the road (or the butter hits the watch).

Validity can mean many things. As reported in test manuals, it usually just means that a test’s scores correlate with scores on other tests intended to measure the same thing (convergent validity), or possibly that it correlates better with measures it should relate to than with ones it shouldn’t, as when a reading test correlates better with other reading tests than with math tests (discriminant validity). However, no test manual ever addresses validity for use as an outcome measure in an experiment. For a test to be valid for that use, it must measure content being pursued equally in experimental and control classes, not content biased toward the experimental curriculum.
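
To see how this plays out, here is a minimal simulation sketch in Python, with numbers that are entirely made up for illustration, not estimates from any real study. It assumes a program that produces a modest gain on a general reading skill and a large gain on its own taught content, a researcher-made test that draws half of its signal from that taught content, and an independent test that measures only the general skill. The researcher-made test comes out highly reliable and strongly correlated with the independent test, yet it substantially overstates the program’s effect relative to the independent test:

```python
# Illustrative simulation only: all effect sizes, weights, and error variances
# below are assumptions chosen for the sketch, not data from any actual study.
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # students per condition (hypothetical)

# Latent skills: a general reading skill, plus knowledge of the specific
# content (e.g., the particular vocabulary words) emphasized by the program.
general = rng.normal(0.0, 1.0, 2 * n)
specific = 0.6 * general + rng.normal(0.0, 0.8, 2 * n)

treated = np.zeros(2 * n, dtype=bool)
treated[:n] = True

# Assumed true impacts: a modest gain in general skill and a large gain on
# the program-specific content, both only for treated students.
general = general + 0.20 * treated
specific = specific + 0.80 * treated

# Researcher-made test: half of its signal comes from program-specific content.
true_r = 0.5 * general + 0.5 * specific
researcher_a = true_r + rng.normal(0.0, 0.3, 2 * n)  # form A
researcher_b = true_r + rng.normal(0.0, 0.3, 2 * n)  # parallel form B

# Independent test: measures the general skill only.
independent = general + rng.normal(0.0, 0.3, 2 * n)

def effect_size(score):
    """Cohen's d: (treatment mean - control mean) / pooled SD."""
    t, c = score[treated], score[~treated]
    pooled_sd = np.sqrt((t.var(ddof=1) + c.var(ddof=1)) / 2)
    return (t.mean() - c.mean()) / pooled_sd

print("Parallel-forms reliability of researcher-made test:",
      round(np.corrcoef(researcher_a, researcher_b)[0, 1], 2))
print("Correlation with independent test (convergent validity):",
      round(np.corrcoef(researcher_a, independent)[0, 1], 2))
print("Effect size on researcher-made test:", round(effect_size(researcher_a), 2))
print("Effect size on independent test:", round(effect_size(independent), 2))
```

The point of the sketch is not the specific numbers, which depend entirely on the assumed parameters, but that nothing in the reliability or convergent-validity checks flags the content overlap that inflates the experimental-control comparison.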

A researcher- or developer-made measure that reports very high reliability and validity in its test manual or research report may be admirable for many purposes. But like “the very best butter” for fixing watches, it is worse than worthless for evaluating experimental programs, no matter how high its reliability and validity.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Title I: A 20% Solution

Here’s an idea that would cost nothing and profoundly shift education funding and the interest of educators and policy makers toward evidence-proven programs. Simply put, the idea is to require that schools receiving Title I funds use 20% of the total on programs that meet at least a moderate standard of evidence. Two thin dimes on the dollar could make a huge difference in all of education.

In terms of federal education policy, Title I is the big kahuna. At $15 billion per year, it is the largest federal investment in elementary and secondary education, and it has been very politically popular on both sides of the aisle since the Johnson administration in 1965, when the Elementary and Secondary Education Act (ESEA) was first passed. Title I has been so popular because it goes to every congressional district, and provides much-needed funding by formula to help schools serving children living in poverty. Since the reauthorization of ESEA as the Every Student Succeeds Act in 2015, Title I has remained the largest expenditure.
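
As a rough back-of-envelope (treating the $15 billion figure above as the pool to which the 20% idea would apply), the proposal would direct on the order of

$$0.20 \times \$15\ \text{billion} = \$3\ \text{billion per year}$$

toward programs with at least moderate evidence of effectiveness, without appropriating a single new dollar.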

In ESSA and other federal legislation, there are two kinds of funding. One is formula funding, like Title I, where money usually goes to states and is then distributed to districts and schools. The formula may adjust for levels of poverty and other factors, but every eligible school gets its share. The other kind of funding is called competitive, or discretionary funding. Schools, districts, and other entities have to apply for competitive funding, and no one is guaranteed a share. In many cases, federal funds are first given to states, and then schools or districts apply to their state departments of education to get a portion of it, but the state has to follow federal rules in awarding the funds.

Getting proven programs into widespread use can be relatively easy in competitive grants. Competitive grants are usually evaluated on a 100-point scale, with all sorts of “competitive preference points” for certain categories of applicants, such as for rural locations, inclusion of English language learners or children of military families, and so on. These preferences add perhaps one to five points to a proposal’s score, giving such applicants a leg up but not a sure thing. In the same way, I and others have proposed adding competitive preference points in competitive proposals for applicants who propose to adopt programs that meet established standards of evidence. For example, Title II SEED grants for professional development now require that applicants propose to use programs found to be effective in at least one rigorous study, and give five points if the programs have been proven effective in at least two rigorous studies. Schools qualifying for school improvement funding under ESSA are now required to select programs that meet ESSA evidence standards.

Adding competitive preference points for using proven programs in competitive grants is entirely sensible and pain-free. It costs nothing, and does not require applicants to use any particular program. In fact, applicants can forgo the preference points entirely, and hope to win without them. Preference points for proven programs are an excellent way to nudge the field toward evidence-based reform without top-down mandates or micromanagement. The federal government states a preference for proven programs, which will at least raise their profile among grant writers, but no school or district has to do anything different.

The much more difficult problem is how to get proven programs into formula funding (such as Title I). The great majority of federal funds are awarded by formula, so restricting evidence-based reform to competitive grants is only nibbling at the edges of practice. One solution to this would be to allocate incentive grants to districts if they agree to use formula funds to adopt and implement proven programs.

However, incentives cost money. Instead, imagine that districts and schools get their Title I formula funds, as they have since 1965. However, Congress might require that districts use at least 20% of their Title I, Part A funding to adopt and implement programs that meet a modest standard of evidence, similar to the “moderate” level in ESSA (which requires one quasi-experimental study with positive effects). The adopted program could be anything that meets other Title I requirements—reading, math, tutoring, technology—except that the program has to have evidence of effectiveness. The funds could pay for necessary staffing, professional development, materials, software, hardware, and so on. Obviously, schools could devote more than 20% if they choose to do so.

There are several key advantages to this 20% solution. First, of course, children would immediately benefit from receiving programs with at least moderate evidence of effectiveness. Second, the process would instantly make leaders of the roughly 55,000 Title I schools intensely interested in evidence. Third, the process could gradually shift discussion about Title I away from its historical focus on “how much?” to an additional focus on “for what purpose?” Publishers, software developers, academics, philanthropy, and government itself would perceive the importance of evidence, and would commission or carry out far more high-quality studies to meet the new standards. Over time, the standards of evidence might increase.

All of this would happen at no additional cost, and with a minimum of micromanagement. There are now many programs that would meet the “moderate” standards of evidence in reading, math, tutoring, whole-school reform, and other approaches, so schools would have a wide choice. No Child Left Behind required that low-performing schools devote 20% of their Title I funding to after-school tutoring programs and student transfer policies that research later showed to make little or no difference in outcomes. Why not spend the same on programs that are proven to work in advance, instead of once again rolling the dice with the educational futures of at-risk children?

Twenty percent of Title I is a lot of money, but if it can make 100% of Title I more impactful, it is more than worthwhile.

Evidence and Freedom

In 1776, a small group of American patriots had a vision of a government of, by, and for the people, and they risked their lives to make it so. Their commitment to liberty was not just ideological, it was also pragmatic. They knew that people who were empowered to make their own decisions were more likely to be committed to the implementation of those decisions. The same should apply to education today.

One of the most important aspects of the Every Student Succeeds Act (ESSA) is how it balances evidence with freedom. The Act defines proven programs and mentions evidence 60 times. It encourages use of proven programs throughout. It provides for additional preference points for proposals in seven areas that meet evidence requirements. Yet only in the area of school improvement for the lowest 5% of schools does it require use of proven programs. This is probably a good thing.

Americans, even more than other people, don’t like to be told what to do. If the evidence movement turns into a set of mandates, telling educators which programs they can or cannot implement, it will probably be doomed. Even when evidence for or against given programs is solid and widely replicated, many political forces opposing evidence-based reform would surely come into play if educators felt compelled to use certain programs and avoid others.

Years ago, I had an experience that reinforced my view that teachers respond better to proven practices if they are free to choose them. I was doing a cooperative learning workshop in a large urban district. A surly-looking teacher raised her hand. “Do we have to do this?” she asked. “Of course not,” I answered. “These are ideas for you to use or not, as you wish.”

“In this district,” said the teacher, “if we’re not required to use something, we’re not allowed to do it.”

How can we avoid compulsion? The answer is easy. Federal, state, and local policies need to provide incentives for schools to use certain programs with strong evidence of effectiveness from rigorous experiments, but not mandates to do so. That’s what ESSA will do in several areas. Incentives may mean providing a few points on competitive grant proposals, or modest financial incentives, for schools that adopt proven programs. These incentives should be enough to get educators’ attention, but not enough to force them to pick a given program.

Incentives should cause educators to eagerly volunteer to use proven programs, to raise their hands, not their hackles. They could lead educators to learn more about the proven programs available to them and about the research process itself. This in turn could encourage political leaders to support education R & D, as educators and the public at large begin to clamor for more programs and better research.

Government cannot and should not try to get 3 million teachers in 100,000 schools in 14,000 districts to use any particular set of programs, no matter what their evidence of effectiveness. What it can and should do is set in motion policies that gradually expand the availability, adoption, and spread of proven programs, eventually pushing less effective approaches to improve or disappear. Development and evaluation of promising programs continues under ESSA, in the new Education Innovation and Research (EIR) program, which along with R & D funded by other agencies will continuously add to the set of proven programs ready for adoption. As the number and quality of proven programs grow, educators will become more and more comfortable about using them.

From our nation’s founding, freedom to make informed choices has been an essential foundation stone of our system of governance. So it should be in education policy.

Evidence can inform key decisions for children, and government can encourage and incent adoption of proven programs. However, educators need the freedom to do what is right for their children, guided but not steered by valid and useful research.