Evidence for ESSA Celebrates its First Anniversary

On February 28, 2017, we launched Evidence for ESSA (www.evidenceforessa.org), our website providing the evidence to support educational programs according to the standards laid out in the Every Student Succeeds Act (ESSA), signed into law in December 2015.

Evidence for ESSA began earlier, of course. It really began one day in September, 2016, when I heard leaders of the Institute of Education Sciences (IES) and the What Works Clearinghouse (WWC) announce that the WWC would not be changed to align with the ESSA evidence standards. I realized that no one else was going to create a scientifically valid, rapid, and easy-to-use website providing educators with actionable information on programs meeting ESSA standards. We could do it because our group at Johns Hopkins University, and partners all over the world, had been working for many years creating and updating another website, the Best Evidence Encyclopedia (BEE; www.bestevidence.org). BEE reviews were not primarily designed for practitioners, and they did not align with ESSA standards, but at least we were not starting from scratch.

We assembled a group of large membership organizations to advise us and to help us reach thoughtful superintendents, principals, Title I directors, and others who would be users of the final product. They gave us invaluable advice along the way. We also assembled a technical working group (TWG) of distinguished researchers to advise us on key decisions in establishing our website.

It is interesting to note that we have not been able to obtain adequate funding to support Evidence for ESSA. Instead, it is mostly being written by volunteers and graduate students, all of whom are motivated only by a passion for evidence to improve the education of students.

A year after launch, Evidence for ESSA has been used by more than 36,000 unique users, and I hear that it is very useful in helping states and districts meet the ESSA evidence standards.

We get a lot of positive feedback, as well as complaints and concerns, to which we try to respond rapidly. Feedback has been important in changing some of our policies and correcting some errors and we are glad to get it.

At this moment we are thoroughly up-to-date on reading and math programs for grades pre-kindergarten to 12, and we are working on science, writing, social-emotional outcomes, and summer school. We are also continuing to update our more academic BEE reviews, which draw from our work on Evidence for ESSA.

In my view, the evidence revolution in education is truly a revolution. If the ESSA evidence standards ultimately prevail, education will at long last join fields such as medicine and agriculture in a dynamic of practice to development to evaluation to dissemination to better practice, in an ascending spiral that leads to constantly improving practices and outcomes.

In a previous revolution, Thomas Jefferson famously wrote that, given the choice, he would prefer “newspapers without a government” to “a government without newspapers.” In our evidence revolution in education, Evidence for ESSA, the WWC, and other evidence sources are our “newspapers,” providing the information that people of good will can use to make wise and informed decisions.

Evidence for ESSA is the work of many dedicated and joyful hands trying to provide our profession with the information it needs to improve student outcomes. The joy in it is the joy in seeing teachers, principals, and superintendents see new, attainable ways to serve their children.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Swallowing Camels

The New Testament contains a wonderful quotation that I use often, because it unfortunately applies to so much of educational research:

Ye blind guides, which strain at a gnat, and swallow a camel (Matthew 23:24).

The point of the quotation, of course, is that we are often fastidious about minor (research) sins while readily accepting major ones.

In educational research, “swallowing camels” applies to studies accepted in top journals or by the What Works Clearinghouse (WWC) despite substantial flaws that lead to major bias, such as use of measures slanted toward the experimental group, or measures administered and scored by the teachers who implemented the treatment. “Straining at gnats” applies to concerns that, while worth attending to, have little potential for bias, yet are often reasons for rejection by journals or downgrading by the WWC. For example, our profession considers p<.05 to indicate statistical significance, while p=.051 should never be mentioned in polite company.

As my faithful readers know, I have written a series of blogs on problems with policies of the What Works Clearinghouse, such as acceptance of researcher/developer-made measures, failure to weight by sample size, use of “substantively important but not statistically significant” as a qualifying criterion, and several others. However, in this blog, I wanted to share with you some of the very worst, most egregious examples of studies that should never have seen the light of day, yet were accepted by the WWC and remain in it to this day. Accepting the WWC as gospel means swallowing these enormous and ugly camels, and I wanted to make sure that those who use the WWC at least think before they gulp.

Camel #1: DaisyQuest. DaisyQuest is a computer-based program for teaching phonological awareness to children in pre-K to Grade 1. The WWC gives DaisyQuest its highest rating, “positive,” for alphabetics, and ranks it eighth among literacy programs for grades pre-K to 1.

There were four studies of DaisyQuest accepted by the WWC. In each, half of the students received DaisyQuest in groups of 3-4, working with an experimenter. In two of the studies, control students never had their hands on a computer before they took the final tests on a computer. In the other two, control students used math software, so they at least got some experience with computers. The outcome tests were all made by the experimenters and all were closely aligned with the content of the software, with the exception of two Woodcock scales used in one of the studies. All studies used a measure called “Undersea Challenge” that closely resembled the DaisyQuest game format and was taken on the computer. All four studies also used the other researcher-made measures. None of the Woodcock measures showed statistically significant differences, but the researcher-made measures, especially Undersea Challenge and other specific tests of phonemic awareness, segmenting, and blending, did show substantial significant differences.

Recall that in the mid- to late 1990s, when the studies were done, students in preschool and kindergarten were unlikely to receive any systematic teaching of phonemic awareness. So there is no reason to expect the control students to have learned anything that was tested on the posttests, and it is not surprising that effect sizes averaged +0.62. In the two studies in which control students had never touched a computer, effect sizes were +0.90 and +0.89, respectively.
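Since effect sizes come up constantly in these posts, here is a minimal sketch of how one is computed, as a standardized mean difference (Cohen's d). The test scores below are invented for illustration only; they are not data from the DaisyQuest studies.

```python
# Minimal sketch of an effect size as a standardized mean difference (Cohen's d).
# All numbers are invented for illustration; they are not from the DaisyQuest studies.
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """(treatment mean - control mean) divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

# Hypothetical posttest on a researcher-made phonemic awareness measure:
print(round(cohens_d(mean_t=78, mean_c=69, sd_t=14, sd_c=15, n_t=30, n_c=30), 2))
# -> 0.62, i.e., the treatment group scored about six tenths of a standard deviation higher.
```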

Camel #2: Brady (1990) study of Reciprocal Teaching

Reciprocal Teaching is a program that teaches students comprehension skills, mostly using science and social studies texts. A 1990 dissertation by P.L. Brady evaluated Reciprocal Teaching in one school, in grades 5-8. The study involved only 12 students, randomly assigned to Reciprocal Teaching or control conditions. The one experimental class was taught by…wait for it…P.L. Brady. The measures included science, social studies, and daily comprehension tests related to the content taught in Reciprocal Teaching but not the control group. They were created and scored by…(you guessed it) P.L. Brady. There were also two Gates-MacGinitie scales, but they had effect sizes much smaller than the researcher-made (and –scored) tests. The Brady study met WWC standards for “potentially positive” because it had a mean effect size of more than +0.25 but was not statistically significant.

Camel #3: Schwartz (2005) study of Reading Recovery

Reading Recovery is a one-to-one tutoring program for first graders that has a strong tradition of rigorous research, including a recent large-scale study by May et al. (2016). However, one of the earlier studies of Reading Recovery, by Schwartz (2005), is hard to swallow, so to speak.

In this study, 47 Reading Recovery (RR) teachers across 14 states were asked by e-mail to choose two very similar, low-achieving first graders at the beginning of the year. One student was randomly assigned to receive RR, and one was assigned to the control group, to receive RR in the second semester.

Both students were pre- and posttested on the Observation Survey, a set of measures made by Marie Clay, the developer of RR. In addition, students were tested on Degrees of Reading Power, a standardized test.

The problems with this study mostly have to do with the fact that the teachers who administered pre- and posttests were the very same teachers who provided the tutoring. No researcher or independent tester ever visited the schools. Teachers obviously knew the child they personally tutored. I’m sure the teachers were honest and wanted to be accurate. However, they would have had a strong motivation to see that the outcomes looked good, because they could be seen as evaluations of their own tutoring, and could have had consequences for continuation of the program in their schools.

Most Observation Survey scales involve difficult judgments, so it’s easy to see how teachers’ ratings could be affected by their belief in Reading Recovery.

Further, ten of the 47 teachers never submitted any data. This is a very high rate of attrition within a single school year (21%). Could some teachers, fully aware of their students’ less-than-expected scores, have made some excuse and withheld their data? We’ll never know.

Also recall that most of the tests used in this study were from the Observation Survey made by Clay, which had effect sizes ranging up to +2.49 (!!!). However, on the independent Degrees of Reading Power, the non-significant effect size was only +0.14.

It is important to note that across these “camel” studies, all except Brady (1990) were published. So it was not only the WWC that was taken in.

These “camel” studies are far from unique, and they may not even be the very worst to be swallowed whole by the WWC. But they do give readers an idea of the depth of the problem. No researcher I know of would knowingly accept an experiment in which the control group had never used the equipment on which they were to be posttested, or one with 12 students in which the 6 experimentals were taught by the experimenter, or in which the teachers who tutored students also individually administered the posttests to experimentals and controls. Yet somehow, WWC standards and procedures led the WWC to accept these studies. Swallowing these camels should have caused the WWC a stomach ache of biblical proportions.

 

References

Brady, P. L. (1990). Improving the reading comprehension of middle school students through reciprocal teaching and semantic mapping and strategies. Dissertation Abstracts International, 52(03A), 230-860.

May, H., Sirinides, P., Gray, A., & Goldsworthy, H. (2016). Reading Recovery: An evaluation of the four-year i3 scale-up. Newark, DE: University of Delaware, Center for Research in Education and Social Policy.

Schwartz, R. M. (2005). Literacy learning of at-risk first grade students in the Reading Recovery early intervention. Journal of Educational Psychology, 97(2), 257-267.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

“Substantively Important” Isn’t Substantive. It Also Isn’t Important

Since it began in 2002, the What Works Clearinghouse has played an important role in finding, rating, and publicizing findings of evaluations of educational programs. It performs a crucial function for evidence-based reform. For this very reason, it needs to be right. But in several important ways, it uses procedures that are indefensible and have a big impact on its conclusions.

One of these relates to a study rating called “substantively important-positive.” This refers to study outcomes with an effect size of at least +0.25, but that are not statistically significant. I’ve written about this before, but the WWC has recently released a database of information on its studies that makes it easy to analyze WWC data on a large scale, and we have learned a lot more about this topic.

Study outcomes rated as “substantively important-positive” can qualify a study as “potentially positive,” the second-highest WWC rating. “Substantively important-negative” findings (non-significant effect sizes less than -0.25) can cause a study to be rated as “potentially negative,” which can keep a program from ever receiving a positive rating: under current rules, a single “potentially negative” rating ensures that a program can never be rated better than “mixed,” even if other studies found hundreds of significant positive effects.
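For readers who find the rule easier to follow in code, here is a rough paraphrase of how an individual outcome gets classified. This is my simplification, not the WWC's official algorithm; the +/-0.25 thresholds are the ones described above, and the p < .05 cutoff is the conventional one.

```python
# Rough paraphrase of the WWC outcome classification described above.
# A simplification for illustration, not the official WWC algorithm.

def classify_outcome(effect_size: float, p_value: float) -> str:
    if p_value < 0.05:
        return ("statistically significant positive" if effect_size > 0
                else "statistically significant negative")
    if effect_size >= 0.25:
        return "substantively important-positive"   # can support a "potentially positive" rating
    if effect_size <= -0.25:
        return "substantively important-negative"   # can support a "potentially negative" rating
    return "indeterminate"

# A large but non-significant effect still counts toward "potentially positive":
print(classify_outcome(effect_size=0.40, p_value=0.30))
```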

People who follow the WWC and know about “substantively important” may assume that, strange as the rule is, it rarely matters in practice. That is not the case.

My graduate student, Amanda Inns, has just done an analysis of WWC data from their own database, and if you are a big fan of the WWC, this is going to be a shock. Amanda has looked at all WWC-accepted reading and math studies. Among these, she found a total of 339 individual outcomes rated “positive” or “potentially positive.” Of these, 155 (46%) reached the “potentially positive” level only because they had effect sizes over +0.25, but were not statistically significant.

Another 36 outcomes were rated “negative” or “potentially negative”; 26 of these (72%) were categorized as “potentially negative” only because they had effect sizes less than -0.25 and were not statistically significant. I’m sure the patterns would be similar for subjects other than reading and math.
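These percentages, and the combined figure in the next paragraph, can be reproduced directly from the counts reported above:

```python
# Reproducing the percentages from the counts in Amanda Inns' analysis.
positive_outcomes, positive_nonsignificant = 339, 155
negative_outcomes, negative_nonsignificant = 36, 26

print(round(100 * positive_nonsignificant / positive_outcomes))            # 46
print(round(100 * negative_nonsignificant / negative_outcomes))            # 72
print(round(100 * (positive_nonsignificant + negative_nonsignificant)
            / (positive_outcomes + negative_outcomes)))                    # 48
```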

Put another way, almost half (48%) of outcomes rated positive/potentially positive or negative/potentially negative by the WWC were not statistically significant. As one example of what I’m talking about, consider a program called The Expert Mathematician. It had just one study with only 70 students in 4 classrooms (2 experimental and 2 control). The WWC re-analyzed the data to account for clustering, and the outcomes were nowhere near statistically significant, though they were greater than +0.25. This tiny study, and this study alone, caused The Expert Mathematician to receive the WWC “potentially positive” rating and to be ranked seventh among all middle school math programs. Similarly, Waterford Early Learning received a “potentially positive” rating based on a single tiny study with only 70 kindergarteners in 6 schools. The outcomes ranged from -0.71 to +1.11, and though the mean was more than +0.25, the outcome was far from significant. Yet this study alone put Waterford on the WWC list of proven kindergarten programs.

I’m not taking any position on whether these particular programs are in fact effective. All I am saying is that these very small studies with non-significant outcomes say absolutely nothing of value about that question.

I’m sure that some of you nerdier readers who have followed me this far are saying to yourselves, “well, sure, these substantively important studies may not be statistically significant, but they are probably unbiased estimates of the true effect.”

More bad news. They are not. Not even close.

The problem, also revealed in Amanda Inns’ data, is that studies with large effect sizes but not statistical significance tend to have very small sample sizes (otherwise, they would have been significant). Across WWC reading and math studies that used individual-level assignment, median sample sizes were 48, 74, or 86, for substantively important, significant, or indeterminate (non-significant with ES < +0.25), respectively. For cluster studies, they were 10, 17, and 33 clusters respectively. In other words, “substantively important” outcomes averaged less than half the sample sizes of other outcomes.

And small-sample studies greatly overstate effect sizes. Among all factors that bias effect sizes, small sample size is the most important (only use of researcher/developer-made measures comes close). So a non-significant positive finding in a small study is not an unbiased point estimate that just needs a larger sample to show its significance. It is probably biased, in a consistent, positive direction. Studies with sample sizes less than 100 have about three times the mean effect sizes of studies with sample sizes over 1000, for example.
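A back-of-the-envelope calculation shows why “substantively important” outcomes cluster in small studies. Using a normal approximation (a deliberate simplification; an exact t-based calculation gives slightly larger values), the smallest effect size that can reach p < .05 in a simple two-group comparison looks roughly like this:

```python
# Approximate minimum effect size needed for p < .05 (two-tailed) in a
# two-group comparison with equal group sizes.  Normal approximation only;
# a rough illustration, not an exact power calculation.
import math

def min_significant_effect(total_n: int) -> float:
    n_per_group = total_n / 2
    se_of_d = math.sqrt(2 / n_per_group)   # approximate standard error of d
    return 1.96 * se_of_d

for total_n in (48, 86, 300, 1000):
    print(total_n, round(min_significant_effect(total_n), 2))
# 48 -> ~0.57, 86 -> ~0.42, 300 -> ~0.23, 1000 -> ~0.12
```

In other words, a 48-student study cannot declare anything smaller than roughly half a standard deviation significant, and any observed effect in the “substantively important” range is well within what sampling error alone can produce.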

But “substantively important” ratings can throw a monkey wrench into current policy. The ESSA evidence standards require statistically significant effects for all of its top three levels (strong, moderate, and promising). Yet many educational leaders are using the What Works Clearinghouse as a guide to which programs will meet ESSA evidence standards. They may logically assume that if the WWC says a program is effective, then the federal government stands behind it, regardless of what the ESSA evidence standards actually say. Yet in fact, based on the data analyzed by Amanda Inns for reading and math, 46% of the outcomes rated as positive/potentially positive by WWC (taken to correspond to “strong” or “moderate,” respectively, under ESSA evidence standards) are non-significant, and therefore do not qualify under ESSA.

The WWC needs to remove “substantively important” from its ratings as soon as possible, to avoid a collision with ESSA evidence standards, and to avoid misleading educators any further. Doing so would help make the WWC’s impact on ESSA substantive. And important.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

The WWC’s 25% Loophole

I am a big fan of the concept of the What Works Clearinghouse (WWC), though I have concerns about various WWC policies and practices. For example, I have written previously about WWC’s acceptance of measures made by researchers and developers, and about WWC’s policy against weighting effect sizes by sample size when computing mean effect sizes for programs. However, another WWC policy is a problem in itself, and it has been made more serious by recent Department of Education guidance on the ESSA evidence standards.

The WWC Standards and Procedures 3.0 manual sets rather tough standards for programs to be rated as having positive effects in studies meeting standards “without reservations” (essentially, randomized experiments) and “with reservations” (essentially, quasi-experiments, or matched studies). However, the WWC defines a special category of findings for which all caution is thrown to the winds. Such findings are called “substantively important,” and they are treated as though they met WWC standards. Quoting from Standards and Procedures 3.0: “For the WWC, effect sizes of +0.25 standard deviations or larger are considered to be substantively important…even if they might not reach statistical significance…” The “effect size greater than +0.25” loophole (the >0.25 loophole, for short) is problematic in itself, but it could lead to catastrophe for the ESSA evidence standards that now identify programs meeting “strong,” “moderate,” and “promising” levels of evidence.

The problem with the >0.25 loophole is that studies that meet the loophole criterion without meeting the usual methodological criteria are usually very, very, very bad studies, usually with a strong positive bias. These studies are often very small (far too small for statistical significance). They usually use measures made by the developers or researchers, or ones that are excessively aligned with the content of the experimental group but not the control group.

One example of the >0.25 loophole is a Brady (1990) study accepted as “substantively important” by the WWC. In it, 12 students in rural Alaska were randomly assigned to Reciprocal Teaching or to a control group. The literacy treatment was built around specific science content, but the control group never saw this content. Yet one of the outcome measures, focused on this content, was made by Mr. Brady, and two others were scored by him. Mr. Brady also happened to be the teacher of the experimental group. The effect size in this awful study was an extraordinary +0.65, though outcomes in other studies assessed on measures more fair to the control group were much smaller.

Because the WWC does not weight studies by sample size, this tiny, terrible study had the same impact in the WWC summary as studies with hundreds or thousands of students.

For the ESSA evidence standards, the >0.25 loophole can lead to serious errors. A single study meeting standards qualifies a program for one of the top three ESSA categories (strong, moderate, or promising). There can be financial consequences for schools using programs in the top three categories (for example, use of such programs is required for schools seeking school improvement grants). Yet a single study meeting the standards, including the awful 12-student study of Reciprocal Teaching, qualifies the program for the ESSA category, no matter what is found in all other studies (unless there are qualifying studies with negative impacts). The loophole also works in the negative direction: a small, terrible study could find an effect size less than -0.25, and no amount or quality of positive findings could make that program meet WWC standards.

The >0.25 loophole is bad enough for research that already exists, but for the future, the problem is even more serious. Program developers or commercial publishers could do many small studies of their programs or could commission studies using developer-made measures. Once a single study exceeds an effect size of +0.25, the program may be considered validated forever.
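A purely hypothetical calculation illustrates how this could play out. Suppose a program has no true effect at all, and a developer runs a series of small two-group studies of it; the 48-student study size and the normal approximation here are my assumptions, chosen to match the medians discussed in an earlier post.

```python
# Hypothetical illustration of the >0.25 loophole: how often would a program
# with ZERO true effect produce at least one small study with an observed
# effect size of +0.25 or more?  Normal approximation; study size is assumed.
import math
from statistics import NormalDist

def chance_one_study_exceeds(threshold: float, n_per_group: int) -> float:
    se_of_d = math.sqrt(2 / n_per_group)            # approximate standard error of d
    return 1 - NormalDist().cdf(threshold / se_of_d)

p_single = chance_one_study_exceeds(0.25, n_per_group=24)   # about 0.19
p_any_of_five = 1 - (1 - p_single) ** 5                      # about 0.66
print(round(p_single, 2), round(p_any_of_five, 2))
```

Under these assumptions, five small null studies give roughly a two-in-three chance that at least one of them crosses the +0.25 line by luck alone.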

To add to the problem, recent guidance from the U.S. Department of Education, in defining the ESSA “promising” category, specifically mentions that programs can meet the promising definition if they report statistically significant or substantively important outcomes. The guidance refers to the WWC standards for the “strong” and “moderate” categories, and the WWC standards themselves allow for the >0.25 loophole (even though this is not mentioned or implied by the law itself, which consistently requires statistically significant outcomes, not “substantively important” ones). In other words, programs that meet WWC standards for “positive” or “potentially positive” based on substantively important evidence alone explicitly do not meet ESSA standards, which require statistical significance. Yet the recent guidance does not recognize this problem.

The >0.25 loophole began, I’d assume, when the WWC was young and few programs met its standards. It was jokingly called the “Nothing Works Clearinghouse.” The loophole was probably added to increase the numbers of included programs. This loophole produced misleading conclusions, but since the WWC did not matter very much to educators, there were few complaints. Today, however, the WWC has greater importance because of the ESSA evidence standards.

Bad loopholes make bad laws. It is time to close this loophole, and eliminate the category of “substantively important.”

Pilot Studies: On the Path to Solid Evidence

This week, the Education Technology Industry Network (ETIN), a division of the Software & Information Industry Association (SIIA), released an updated guide to research methods, authored by a team at Empirical Education Inc. The guide is primarily intended to help software companies understand what is required for studies to meet current standards of evidence.

In government and among methodologists and well-funded researchers, there is general agreement about the kind of evidence needed to establish the effectiveness of an education program intended for broad dissemination. To meet its top rating (“meets standards without reservations”) the What Works Clearinghouse (WWC) requires an experiment in which schools, classes, or students are assigned at random to experimental or control groups, and it has a second category (“meets standards with reservations”) for matched studies.

These WWC categories more or less correspond to the Every Student Succeeds Act (ESSA) evidence standards (“strong” and “moderate” evidence of effectiveness, respectively), and ESSA adds a third category, “promising,” for correlational studies.

Our own Evidence for ESSA website follows the ESSA guidelines, of course. The SIIA guidelines explain all of this.

Despite the overall consensus about the top levels of evidence, the problem is that doing studies that meet these requirements is expensive and time-consuming. Software developers, especially small ones with limited capital, often do not have the resources or the patience to do such studies. Any organization that has developed something new may not want to invest substantial resources in large-scale evaluations until it has some indication that the program is likely to show well in a larger, longer, and better-designed evaluation. There is a path to high-quality evaluations, and it starts with pilot studies.

The SIIA Guide usefully discusses this problem, but I want to add some further thoughts on what to do when you can’t afford a large randomized study.

1. Design useful pilot studies. Evaluators need to make a clear distinction between full-scale evaluations, intended to meet WWC or ESSA standards, and pilot studies (the SIIA Guidelines call these “formative studies”), which are just meant for internal use, both to assess the strengths or weaknesses of the program and to give an early indicator of whether or not a program is ready for full-scale evaluation. The pilot study should be a miniature version of the large study. But whatever its findings, it should not be used in publicity. Results of pilot studies are important, but by definition a pilot study is not ready for prime time.

An early pilot study may be just a qualitative study, in which developers and others might observe classes, interview teachers, and examine computer-generated data on a limited scale. The problem in pilot studies is at the next level, when developers want an early indication of effects on achievement, but are not ready for a study likely to meet WWC or ESSA standards.

2. Worry about bias, not power. Small, inexpensive studies pose two types of problems. One is the possibility of bias, discussed in the next section. The other is lack of statistical power: not having a large enough sample to show that a potentially meaningful program impact is statistically significant, that is, unlikely to have happened by chance. To understand this, imagine that your favorite baseball team adopts a new strategy. After the first ten games, the team is doing better than it did last year, in comparison to other teams, but this could have happened by chance. After 100 games? Now the results are getting interesting. If 10 teams all adopt the strategy next year and they all see improvements on average? Now you’re headed toward proof.

During the pilot process, evaluators might compare multiple classes or multiple schools, perhaps assigned at random to experimental and control groups. There may not be enough classes or schools for statistical significance yet, but if the mini-study avoids bias, the results will at least be in the ballpark (so to speak); the simulation sketched after this list shows just how limited the power of a tiny pilot is.

3. Avoid bias. A small experiment can be fine as a pilot study, but every effort should be made to avoid bias. Otherwise, the pilot study will give a result far more positive than the full-scale study will, defeating the purpose of doing a pilot.

Examples of common sources of biases in smaller studies are as follows.

a. Use of measures made by developers or researchers. These measures typically produce greatly inflated impacts.

b. Implementation of gold-plated versions of the program. In small pilot studies, developers often implement versions of the program that could never be replicated. Examples include providing additional staff time that could not be repeated at scale.

c. Inclusion of highly motivated teachers or students in the experimental group, which gets the program, but not the control group. For example, matched studies of technology often exclude teachers who did not implement “enough” of the program. The problem is that the full-scale experiment (and real life) include all kinds of teachers, so excluding teachers who could not or did not want to engage with technology overstates the likely impact at scale in ordinary schools. Even worse, excluding students who did not use the technology enough may bias the study toward more capable students.

4. Learn from pilots. Evaluators, developers, and disseminators should learn as much as possible from pilots. Observations, interviews, focus groups, and other informal means should be used to understand what is working and what is not, so that when the program is evaluated at scale, it is at its best.
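To make the power point from item 2 concrete, here is a small simulation. Every number in it is an assumption chosen only for illustration: a program that truly works (a +0.20 SD effect), outcomes compared at the class level, and class-mean variation of 0.35 SD.

```python
# Illustrative simulation of statistical power in small class-level pilots.
# All parameter values are assumptions for illustration, not from any real study.
import random
from scipy.stats import ttest_ind

def pilot_power(classes_per_group: int, true_effect: float = 0.20,
                class_sd: float = 0.35, reps: int = 2000) -> float:
    """Fraction of simulated pilots that reach p < .05 at the class level."""
    hits = 0
    for _ in range(reps):
        treatment = [random.gauss(true_effect, class_sd) for _ in range(classes_per_group)]
        control = [random.gauss(0.0, class_sd) for _ in range(classes_per_group)]
        if ttest_ind(treatment, control).pvalue < 0.05:
            hits += 1
    return hits / reps

for classes in (3, 10, 30):
    print(classes, round(pilot_power(classes), 2))
# With 3 classes per condition the true effect is detected only rarely;
# with 30 classes per condition it is detected far more often.
```

A pilot this small can still be informative about implementation and about the direction of effects; that is the point of item 2: worry about bias, which a pilot can control, rather than about significance, which it usually cannot deliver.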

 

***

As evidence becomes more and more important, publishers and software developers will increasingly be called upon to prove that their products are effective. However, no program should have its first evaluation be a 50-school randomized experiment. Such studies are indeed the “gold standard,” but jumping from a two-class pilot to a 50-school experiment is a way to guarantee failure. Software developers and publishers should follow a path that leads to a top-tier evaluation, and learn along the way how to ensure that their programs and evaluations will produce positive outcomes for students at the end of the process.

 

This blog is sponsored by the Laura and John Arnold Foundation

Evidence for ESSA and the What Works Clearinghouse

In just a few weeks, we will launch Evidence for ESSA, a free web site designed to provide education leaders with information on programs that meet the evidence standards included in the Every Student Succeeds Act (ESSA). As most readers of this blog are aware, ESSA defines standards for strong, moderate, and promising levels of evidence, and it promotes the use of programs and practices that meet those standards.

One question I frequently get about Evidence for ESSA is how it is similar to and different from the What Works Clearinghouse (WWC), the federal service that reviews research on education programs and makes its findings available online. In making Evidence for ESSA, my colleagues and I have assumed that the WWC will continue to exist and do what it has always done. We see Evidence for ESSA as a supplement to, not a competitor of, the WWC. Evidence for ESSA will have a live link to the WWC for users who want more information. But the WWC was not designed to serve the ESSA evidence standards. Ruth Neild, the recently departed Acting Director of the Institute of Education Sciences (IES), which oversees the WWC, announced at a November 2016 meeting that the WWC would not try to align itself with the ESSA evidence standards.

Evidence for ESSA, in contrast, is specifically aligned with the ESSA evidence standards. It follows most WWC procedures and standards, using similar or identical methods for searching the literature for potentially qualifying studies, computing effect sizes, averaging effect sizes across studies, and so on.

However, the purpose of the ESSA evidence standards is different from that of the WWC, and Evidence for ESSA is correspondingly different. There are four big differences that have to be taken into account. First, ESSA evidence standards are written for superintendents, principals, teachers and parents, not for experts in research design. The WWC has vast information in it, and my colleagues and I depend on it and use it constantly. But not everyone has the time or inclination to navigate the WWC.

Second, ESSA evidence standards require only a single study with a positive outcome for membership in any given category. For example, to get into the “Strong” category, a program can have just one randomized study that found a significant positive effect, even if there were ten more that found zero impact (although U.S. Department of Education guidance does suggest that one significant negative finding can cancel out a positive one).  Personally, I do not like the one-study rule, but that’s the law. The law does specify that studies must be well-designed and well-implemented, and this allows and even compels us, within the law, to make sure that weak or flawed studies are not accepted as the one study qualifying a program for a given category. More on this in a moment.

Third, ESSA defines three levels of evidence: strong, moderate, and promising. Strong and moderate correspond, roughly, to the WWC “meets standards without reservations” (strong) and the “meets . . . with reservations” (moderate) categories, respectively. But WWC does not have anything corresponding to a “promising” category, so users of the WWC seeking all qualifying programs in a given area under ESSA would miss this crucial category. (There is also a fourth category under ESSA which is sometimes referred to as “evidence-building and under evaluation,” but this category is not well-defined enough to allow current research to be assigned to it.)
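For quick reference, the rough correspondence described above can be summarized as a simple lookup; this is my paraphrase for orientation, not an official crosswalk.

```python
# Rough correspondence between ESSA evidence levels and WWC ratings,
# as described in the text.  A paraphrase for orientation, not an official crosswalk.
ESSA_TO_WWC = {
    "strong":    "meets WWC standards without reservations (randomized studies)",
    "moderate":  "meets WWC standards with reservations (matched/quasi-experimental studies)",
    "promising": "no WWC equivalent (qualifying correlational evidence)",
}
```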

Finally, there has been an enormous amount of high-quality research appearing in recent years, and educators seeking proven programs want the very latest information. Recent investments by IES and by the Investing in Innovation (i3) program, in particular, are producing a flood of large, randomized evaluations of a broad range of programs for grades Pre-K to 12. More will be appearing in coming years. Decision makers will want and need up-to-date information on programs that exist today.

Evidence for ESSA was designed with the help of a distinguished technical working group to make common-sense adaptations that satisfy the requirements of the law. In so doing, we needed to introduce a number of technical enhancements to WWC structures and procedures.

Ease of Use and Interpretation

Evidence for ESSA will be very, very easy to use. From the home page, two clicks will get you to a display of all programs in a given area (e.g., programs for struggling elementary readers, or whole-class secondary math programs). The programs will be listed and color-coded by the ESSA level of evidence they meet, and within those categories, ranked by a combination of effect size, number and quality of studies, and overall sample size.

A third click will take you to a program page describing and giving additional practical details on a particular program.

Three clicks. Done.

Of course, there will be many additional options. You will be able to filter programs by urban/rural, for example, or according to groups studied, or according to program features. You will also see references to the individual studies that caused a program to qualify for an ESSA category.

You will be able to spend as much time on the site as you like, and there will be lots of information if you want it, including answers to “Frequently Asked Questions” that go into as much depth as you desire, as well as our entire “Standards and Procedures” manual. But most users will get where they need to go in three completely intuitive clicks, and can then root around to their hearts’ content.

Focus on High-Quality, Current Studies

We added a few additional requirements on top of the WWC standards to ensure that studies that qualify programs for ESSA categories are meaningful and important to educators. First, we excluded programs that are no longer in active dissemination. Second, we eliminated measures made by researchers, and those that are measures of minor skills or skills taught at other grade levels (such as phonics tests in secondary school). Third, the WWC has a loophole, counting as “meeting standards without reservations” studies that have major flaws but have effect sizes of at least +0.25. We eliminated such studies, which removed studies with sample sizes as small as 14.

Definition of Promising

The WWC does not have a rating corresponding to ESSA’s “Promising” category. Within the confines of the law, we established parameters to put “Promising” into practice. Our parameters include high-quality correlational studies, as well as studies that meet all other inclusion criteria and have statistically significant outcomes at the student level, but not enough clusters (schools, teachers) to find significant outcomes at the cluster level.
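The student-level versus cluster-level distinction can be made concrete with the standard design-effect adjustment. The intraclass correlation and sample sizes below are assumptions chosen for illustration; this is a textbook-style sketch, not the exact procedure we use for Evidence for ESSA.

```python
# Sketch of why an effect can be significant at the student level but not
# at the cluster level.  ICC and sample sizes are assumed for illustration;
# this is not the exact procedure used for Evidence for ESSA.
import math

def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation from analyzing clustered students as if independent."""
    return 1 + (cluster_size - 1) * icc

# Suppose 400 students in 8 classes of 50, an effect size of +0.25, ICC = 0.20:
deff = design_effect(cluster_size=50, icc=0.20)         # 10.8
effective_n = 400 / deff                                # about 37 students
z_student_level = 0.25 / math.sqrt(4 / 400)             # about 2.5 (p < .05)
z_cluster_adjusted = 0.25 / math.sqrt(4 / effective_n)  # about 0.76 (far from significant)
print(round(deff, 1), round(effective_n), round(z_student_level, 2), round(z_cluster_adjusted, 2))
```

Under these assumed numbers, a +0.25 effect looks comfortably significant when 400 students are treated as independent observations, but is nowhere near significant once the eight classes are taken into account, which is the kind of situation our “Promising” parameters are meant to capture.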

 

Rapid Inclusion

Evidence for ESSA will be updated regularly and quickly. Our commitment is to include qualifying studies brought to our attention on the website within two weeks.

Evidence for ESSA and the WWC will exist together, offering educators two complementary approaches to information on effective programs and practices. Over time, we will learn how to maximize the benefits of both facilities and how to coordinate them to make them as useful as possible for all of the audiences we serve. But we have to do this now so that the evidence provisions of ESSA will be meaningful rather than an exercise in minimalist compliance. It may be a long time before our field will have as good an opportunity to put evidence of effectiveness at the core of education policy and practice.

Is All That Glitters Gold-Standard?

In the world of experimental design, studies in which students, classes, or schools are assigned at random to experimental or control treatments (randomized clinical trials or RCTs) are often referred to as meeting the “gold standard.” Programs with at least one randomized study with a statistically significant positive outcome on an important measure qualify as having “strong evidence of effectiveness” under the definitions in the Every Student Succeeds Act (ESSA). RCTs virtually eliminate selection bias in experiments. That is, readers don’t have to worry that the teachers using an experimental program might have already been better or more motivated than those who were in the control group. Yet even RCTs can have such serious flaws as to call their outcomes into question.

A recent article by distinguished researchers Alan Ginsburg and Marshall Smith severely calls into question every single elementary and secondary math study accepted by the What Works Clearinghouse (WWC) as “meeting standards without reservations,” which in practice requires a randomized experiment. If they were right, then the whole concept of gold-standard randomized evaluations would go out the window, because the same concerns would apply to all subjects, not just math.

Fortunately, Ginsburg & Smith are mostly wrong. They identify, and then discard, 27 studies accepted by the WWC. In my view, they are right about five. They raise some useful issues about the rest, but not damning ones.

The one area in which I fully agree with Ginsburg & Smith (G&S henceforth) relates to studies that use measures made by the researchers. In a recent paper with Alan Cheung and an earlier one with Nancy Madden, I reported that use of researcher-made tests resulted in greatly overstated effect sizes. Neither WWC nor ESSA should accept such measures.

From this point on, however, G&S are overly critical. First, they reject all studies in which the developer was one of the report authors. However, the U.S. Department of Education has been requiring third-party evaluations in its larger grants for more than a decade. This is true of IES, i3, and NSF (scale-up) grants, for example, and of England’s Education Endowment Foundation (EEF). A developer may be listed as an author, but it has been a long time since a developer could get his or her thumb on the scale in federally funded research. Even studies funded by publishers almost universally use third-party evaluators.

G&S complain that 25 of 27 studies evaluated programs in their first year, compromising fidelity. This is indeed a problem, but it can only affect outcomes in a negative direction. Programs showing positive outcomes in their first year may be particularly powerful.

G&S express concern that half of studies did not state what curriculum the control group was using. This would be nice to know, but does not invalidate a study.

G&S complain that in many cases the amount of instructional time for the experimental group was greater than that for the control group. This could be a problem, but given the findings of research on allocated time, it is unlikely that time alone makes much of a difference in math learning. It may be more sensible to see extra time as a question of cost-effectiveness. Did 30 extra minutes of math per day implementing Program X justify the costs of Program X, including the cost of adding the time? Future studies might evaluate the value added of 30 extra minutes doing ordinary instruction, but does anyone expect this to be a large impact?

Finally, G&S complain that most curricula used in WWC-accepted RCTs are outdated. This could be a serious concern, especially as common core and other college- and career-ready standards are adopted in most states. However, recall that at the time RCTs are done, the experimental and the control groups were subject to the same standards, so if the experimental group did better, it is worth considering as an innovation. The reality is that any program in active dissemination must update its content to meet new standards. A program proven effective before common core and then updated to align with common core standards is not proven for certain to improve common core outcomes, for example, but it is a very good bet. A school or district considering adopting a given proven program might well check to see that it meets current standards, but it would be self-defeating and unnecessary to demand that every program re-prove its effectiveness every time standards change.

Randomized experiments in education are not perfect (neither are randomized experiments in medicine or other fields). However, they provide the best evidence science knows how to produce on the effectiveness of innovations. It is entirely legitimate to raise issues about RCTs, as Ginsburg & Smith do, but rejecting what we do know until perfection is achieved would cut off the best avenue we have for progress toward solid, scientifically defensible reform in our schools.