A Warm Welcome From Babe Ruth’s Home Town to the Registry of Efficacy and Effectiveness Studies (REES)

Every baseball season, players across the major leagues hit hundreds of home runs. But one home run in all of baseball history stands out for fans. In the 1932 World Series, Babe Ruth (born in Baltimore!) pointed to the center field fence. He then hit the next pitch over that fence, exactly where he said he would.

Just 86 years later, the U.S. Department of Education, in collaboration with the Society for Research on Educational Effectiveness (SREE), launched a new (figurative) center field fence for educational evaluation. It’s called the Registry of Efficacy and Effectiveness Studies (REES). The purpose of REES is to ask evaluators of educational programs to register their research designs, measures, analyses, and other features in advance. This is roughly the equivalent of asking researchers to point to the center field fence, announcing their intention to hit the ball right there. The reason this matters is that all too often, evaluators carry out evaluations that fail to produce the desired positive outcomes on some measures or in some analyses. They then report only the measures that did show positive outcomes, or use different analyses from those initially planned, or report outcomes for only a subset of their full sample. On this last point, I remember a colleague long ago who obtained and re-analyzed data from a large and important national study that collected data in several cities but reported findings only for Detroit. In her analyses of the other cities’ data, she found that the results the authors claimed appeared only in Detroit, not in any other city.

REES pre-registration will, over time, make it possible for researchers, reviewers, and funders to find out whether evaluators are reporting all of the findings and all of the analyses as they originally planned them. I would assume that within a few years, review facilities such as the What Works Clearinghouse will start requiring pre-registration before accepting studies for their top evidence categories. We will certainly do so for Evidence for ESSA. As pre-registration becomes common (as it surely will, if IES is suggesting or requiring it), review facilities such as the WWC and Evidence for ESSA will have to learn how to use the pre-registration information. Obviously, minor changes in research designs or measures may be allowed, especially small changes made before posttest results are known. For example, if some schools named in the pre-registration are not in the posttest sample, the evaluators might explain that the schools closed (not a problem if this did not upset pretest equivalence), but if they withdrew for other reasons, reviewers would want to know why, and would insist that withdrawn schools be included in any intent-to-treat (ITT) analysis. Other fields, including much of medical research, have used pre-registration for many years, and I’m sure REES and review facilities in education could learn from their experiences and policies.

What I find most heartening about REES and pre-registration is that they show how much educational research has matured in a short time. Ten years ago, REES could not realistically have been proposed. There was too little high-quality research to justify it, and frankly, few educators or policy makers cared very much about the findings of rigorous research. There is still a long way to go in this regard, but embracing pre-registration is one way we say to our profession and ourselves that the quality of evidence in education can stand up to that in any other field, and that we are willing to hold ourselves accountable to the highest standards.

[Photo: Babe Ruth]

In baseball history, Babe Ruth’s “pre-registered” home run in the 1932 series is referred to as the “called shot.” No one had ever done it before, and no one ever did it again. But in educational evaluation, we will soon be calling our shots all the time. And when we say in advance exactly what we are going to do, and then do it, just as we promised, showing real benefits for children, then educational evaluation will take a major step forward in increasing users’ confidence in the outcomes.


Photo credit: Babe Ruth, 1920, unattributed photo [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.



Small Studies, Big Problems

Everyone knows that “good things come in small packages.” But in research evaluating practical educational programs, this saying does not apply. Small studies are very susceptible to bias. In fact, among all the factors that can inflate effect sizes in educational experiments, small sample size is among the most powerful. This problem is widely known, and in reviewing large and small studies, most meta-analysts solve the problem by requiring minimum sample sizes and/or weighting effect sizes by their sample sizes. Problem solved.

[Photo: wrapped presents]

For some reason, the What Works Clearinghouse (WWC) has so far paid little attention to sample size. It has not weighted by sample size in computing mean effect sizes, although the WWC is talking about doing this in the future. It has not even set minimums for sample size for its reviews. I know of one accepted study with a total sample size of 12 (6 experimental, 6 control). These procedures greatly inflate WWC effect sizes.

As one indication of the problem, our review of 645 studies of reading, math, and science programs accepted by the Best Evidence Encyclopedia (www.bestevidence.org) found that studies with fewer than 250 subjects had twice the effect sizes of those with more than 250 (effect sizes = +0.30 vs. +0.16). Comparing studies with fewer than 100 students to those with more than 3,000, the ratio was 3.5 to 1 (see Cheung & Slavin [2016] at http://www.bestevidence.org/word/methodological_Sept_21_2015.pdf). Several other studies have found the same pattern.

In data from What Works Clearinghouse reading and math studies compiled by graduate student Marta Pellegrini (2017), sample size effects were also extraordinary. The mean effect size for samples of 60 or fewer students was +0.37; for samples of 60-250, +0.29; and for samples of more than 250, +0.13. Among all the design factors she studied, small sample size made the most difference in outcomes, rivaled only by researcher/developer-made measures. In fact, sample size is the more pernicious of the two, because while reviewers can exclude researcher/developer-made measures within a study and focus on independent measures, a study with a small sample has the same problem for all of its measures. Also, because small-sample studies are relatively inexpensive, there are quite a lot of them, so reviews that fail to attend to sample size can greatly over-estimate overall mean effect sizes.

My colleague Amanda Inns (2018) recently analyzed WWC reading and math studies to find out why small studies produce such inflated outcomes. There are many reasons small-sample studies may produce very large effect sizes. One is that in small studies, researchers can provide extraordinary amounts of assistance or support to the experimental group. This is called “superrealization.” Another is that when studies with small sample sizes find null effects, they tend not to be published or made available at all, deemed a “pilot” and forgotten. In contrast, a large study is likely to have been paid for by a grant, which will produce a report no matter what the outcome. It has long been understood that published studies report much higher effect sizes than unpublished studies, and one reason is that small studies are rarely published if their outcomes are not significant.

Whatever the reasons, there is no doubt that small studies greatly overstate effect sizes. In reviewing research, this well-known fact has long led meta-analysts to weight effect sizes by their sample sizes (usually using an inverse variance procedure). Yet as noted earlier, the WWC does not do this, but just averages effect sizes across studies without taking sample size into account.
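
For readers who have not seen it, the inverse-variance weighted mean is a standard formula, not anything specific to the WWC or the BEE. For studies with effect sizes d_i and treatment and control sample sizes n_Ti and n_Ci, it is:

$$\bar{d} = \frac{\sum_i w_i d_i}{\sum_i w_i}, \qquad w_i = \frac{1}{v_i}, \qquad v_i \approx \frac{n_{Ti} + n_{Ci}}{n_{Ti}\, n_{Ci}} + \frac{d_i^{2}}{2(n_{Ti} + n_{Ci})}$$

Because the variance v_i shrinks as samples grow, large studies receive large weights and dominate the average, which is exactly the point.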

One example of the problem of ignoring sample size in averaging is provided by Project CRISS. CRISS was evaluated in two studies. One had 231 students. On a staff-developed “free recall” measure, the effect size was +1.07. The other study had 2338 students, and an average effect size on standardized measures of -0.02. Clearly, the much larger study with an independent outcome measure should have swamped the effects of the small study with a researcher-made measure, but this is not what happened. The WWC just averaged the two effect sizes, obtaining a mean of +0.53.
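
To make the arithmetic concrete, here is a minimal sketch in Python using the CRISS figures quoted above. The even split of each study’s sample between treatment and control is an assumption made only for illustration, and the variance formula is the standard large-sample approximation.

    # CRISS effect sizes and total sample sizes as quoted above (illustrative only).
    studies = [
        {"n": 231, "d": 1.07},    # small study, researcher-made "free recall" measure
        {"n": 2338, "d": -0.02},  # large study, standardized measures
    ]

    # Simple average, ignoring sample size -- the approach described above.
    unweighted = sum(s["d"] for s in studies) / len(studies)

    # Inverse-variance weights, assuming each sample splits evenly between conditions.
    def inverse_variance_weight(n_total, d):
        n_t = n_c = n_total / 2
        v = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))  # large-sample variance of d
        return 1 / v

    weights = [inverse_variance_weight(s["n"], s["d"]) for s in studies]
    weighted = sum(w * s["d"] for w, s in zip(weights, studies)) / sum(weights)

    print(f"Unweighted mean effect size: {unweighted:+.2f}")  # about +0.53
    print(f"Weighted mean effect size:   {weighted:+.2f}")    # about +0.07, dominated by the large study

The weighted mean tells a very different story from the +0.53 produced by the simple average.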

How might the WWC set minimum sample sizes for studies to be included for review? Amanda Inns proposed a minimum of 60 students (at least 30 experimental and 30 control) for studies that analyze at the student level. She suggests a minimum of 12 clusters (6 and 6), such as classes or schools, for studies that analyze at the cluster level.
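
Stated as a screening rule, the proposal fits in a few lines of code. This is only a sketch; the function name and structure are mine, and only the thresholds come from the proposal.

    def meets_minimum_sample(analysis_level, n_treatment, n_control):
        # Proposed minimums: 30 students per condition (60 total) when the analysis is
        # at the student level; 6 clusters per condition (12 total), such as classes or
        # schools, when the analysis is at the cluster level.
        minimum_per_condition = {"student": 30, "cluster": 6}
        floor = minimum_per_condition[analysis_level]
        return n_treatment >= floor and n_control >= floor

    # The accepted WWC study mentioned earlier, with 6 experimental and 6 control
    # students analyzed at the student level, would fail this screen.
    print(meets_minimum_sample("student", 6, 6))  # False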

In educational research evaluating school programs, good things come in large packages. Small studies are fine as pilots, or for descriptive purposes. But when you want to know whether a program works in realistic circumstances, go big or go home, as they say.

The What Works Clearinghouse should exclude very small studies and should use weighting based on sample sizes in computing means. And there is no reason it should not start doing these things now.

References

Inns, A., & Slavin, R. (2018, August). Do small studies add up in the What Works Clearinghouse? Paper presented at the meeting of the American Psychological Association, San Francisco, CA.

Pellegrini, M. (2017, August). How do different standards lead to different conclusions? A comparison between meta-analyses of two research centers. Paper presented at the European Conference on Educational Research (ECER), Copenhagen, Denmark.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

The Curious Case of the Missing Programs

“Let me tell you, my dear Watson, about one of my most curious and vexing cases,” said Holmes. “I call it, ‘The Case of the Missing Programs’. A school superintendent from America sent me a letter.  It appears that whenever she looks in the What Works Clearinghouse to find a program her district wants to use, nine times out of ten there is nothing there!”

Watson was astonished. “But surely there has to be something. Perhaps the missing programs did not meet WWC standards, or did not have positive effects!”

“Not meeting standards or having disappointing outcomes would be something,” responded Holmes, “but the WWC often says nothing at all about a program. Users are apparently confused. They don’t know what to conclude.”

“The missing programs must make the whole WWC less useful and reliable,” mused Watson.

“Just so, my friend,” said Holmes, “and so we must take a trip to America to get to the bottom of this!”

[Illustration: Sherlock Holmes]

While Holmes and Watson are arranging steamship transportation to America, let me fill you in on this very curious case.

In the course of our work on Evidence for ESSA (www.evidenceforessa.org), we are occasionally asked by school district leaders why there is nothing in our website about a given program, text, or software. Whenever this happens, our staff immediately checks to see if there is any evidence we’ve missed. If we are pretty sure that there are no studies of the missing program that meet our standards, we add the program to our website, with a brief indication that there are no qualifying studies. If any studies do meet our standards, we review them as soon as possible and add them as meeting or not meeting ESSA standards.

Sometimes, districts or states send us their entire list of approved texts and software, and we check them all to see that all are included.

Having done this for more than a year, we now have an entry for most of the reading and math programs any district would come up with, though we keep adding more all the time.

All of this seems to us to be obviously essential. If users of Evidence for ESSA look up their favorite programs, or ones they are thinking of adopting, and find that there is no entry, they begin losing confidence in the whole enterprise. They cannot know whether the program they seek was ignored or missed for some reason, or has no evidence of effectiveness, or perhaps has been proven effective but has not been reviewed.

Recently, a large district sent me their list of 98 approved and supplementary texts, software, and other programs in reading and math. They had marked each according to the ratings given by the What Works Clearinghouse and Evidence for ESSA. At the time (a few weeks ago), Evidence for ESSA had listings for 67% of the programs. Today, of course, it has 100%, because we immediately set to work researching and adding in all the programs we’d missed.

What I found astonishing, however, is how few of the district’s programs were mentioned at all in the What Works Clearinghouse. Only 15% of the reading and math programs were in the WWC.

I’ve written previously about how far behind the WWC is in reviewing programs. But the problem with the district list was not just a question of slowness. Many of the programs the WWC missed have been around for some time.

I’m not sure how the WWC decides what to review, but they do not seem to be trying for completeness. I think this is counterproductive. Users of the WWC should expect to be able to find out about programs that meet standards for positive outcomes, those that have an evidence base that meets evidence standards but do not have positive outcomes, those that have evidence not meeting standards, and those that have no evidence at all. Yet it seems clear that the largest category in the WWC is “none of the above.” Most programs a user would be interested in do not appear at all in the WWC. Most often, a lack of a listing means a lack of evidence, but this is not always the case, especially when evidence is recent. One way or another, finding big gaps in any compendium undermines faith in the whole effort. It’s difficult to expect educational leaders to get into the habit of looking for evidence if most of the programs they consider are not listed.

Imagine, for example, that a telephone book was missing a significant fraction of the people who live in a given city. Users would be frustrated about not being able to find their friends, and the gaps would soon undermine confidence in the whole phone book.

****

When Holmes and Watson arrived in the U.S., they spoke with many educators who’d tried to find programs in the WWC, and they heard tales of frustration and impatience. Many former users said they no longer bothered to consult the WWC and had lost faith in evidence in their field. Fortunately, Holmes and Watson got a meeting with U.S. Department of Education officials, who immediately understood the problem and set to work to find the evidence base (or lack of evidence) for every reading and math program in America. Usage of the WWC soared, and support for evidence-based reform in education increased.

Of course, this outcome is fictional. But it need not remain fictional. The problem is real, and the solution is simple. Or as Holmes would say, “Elementary and secondary, my dear Watson!”

Photo credit: By Rumensz [CC0], from Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

John Hattie is Wrong

John Hattie is a professor at the University of Melbourne, Australia. He is famous for a book, Visible Learning, which claims to review every area of research that relates to teaching and learning. He uses a method called “meta-meta-analysis,” averaging effect sizes from many meta-analyses. The book rank-orders 138 factors in terms of their average effect sizes on achievement measures. Hattie is a great speaker, and many educators love the clarity and simplicity of his approach. How wonderful to have every known variable reviewed and ranked!

However, operating on the principle that anything that looks to be too good to be true probably is, I looked into Visible Learning to try to understand why it reports such large effect sizes. My colleague, Marta Pellegrini from the University of Florence (Italy), helped me track down the evidence behind Hattie’s claims. And sure enough, Hattie is profoundly wrong. He is merely shoveling meta-analyses containing massive bias into meta-meta-analyses that reflect the same biases.

[Photo: salvaged paper]

Part of Hattie’s appeal to educators is that his conclusions are so easy to understand. He even uses a system of dials with color-coded “zones,” where effect sizes of 0.00 to +0.15 are designated “developmental effects,” +0.15 to +0.40 “teacher effects” (i.e., what teachers can do without any special practices or programs), and +0.40 to +1.20 the “zone of desired effects.” Hattie makes a big deal of the magical effect size of +0.40, the “hinge point,” recommending that educators essentially ignore factors or programs below that point, because they are no better than what teachers produce each year, from fall to spring, on their own. In Hattie’s view, an effect size of +0.15 to +0.40 is just the effect that “any teacher” could produce, in comparison to students not being in school at all. He says, “When teachers claim that they are having a positive effect on achievement or when a policy improves achievement, this is almost always a trivial claim: Virtually everything works. One only needs a pulse and we can improve achievement” (Hattie, 2009, p. 16). An effect size of 0.00 to +0.15 is, he estimates, “what students could probably achieve if there were no schooling” (Hattie, 2009, p. 20). Yet this characterization of dials and zones misses the essential meaning of effect sizes, which are rarely used to measure how much teachers’ students gain from fall to spring, but rather how much students receiving a given treatment gain in comparison to similar students in a control group over the same period. So an effect size of, say, +0.15 or +0.25 could be very important.
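
For reference, the conventional definition of an effect size in a program evaluation (a standard formula, not specific to Hattie or to his critics) is the treatment-control difference at posttest, expressed in standard deviation units:

$$d = \frac{\bar{X}_{\text{treatment}} - \bar{X}_{\text{control}}}{SD_{\text{pooled}}}$$

Because the control group’s own growth over the same period is built into the control mean, an effect size does not measure how much students learn in a year; it measures how much more they learn with the treatment than they would have without it.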

Hattie’s core claims are these:

  • Almost everything works
  • Any effect size less than +0.40 is ignorable
  • It is possible to meaningfully rank educational factors in comparison to each other by averaging the findings of meta-analyses.

These claims appear appealing, simple, and understandable. But they are also wrong.

The essential problem with Hattie’s meta-meta-analyses is that they accept the results of the underlying meta-analyses without question. Yet many, perhaps most meta-analyses accept all sorts of individual studies of widely varying standards of quality. In Visible Learning, Hattie considers and then discards the possibility that there is anything wrong with individual meta-analyses, specifically rejecting the idea that the methods used in individual studies can greatly bias the findings.

To be fair, a great deal has been learned about the degree to which particular study characteristics bias study findings, always in a positive (i.e., inflated) direction. For example, there is now overwhelming evidence that effect sizes are significantly inflated in studies that have small sample sizes or brief durations, use measures made by researchers or developers, are published (vs. unpublished), or use quasi-experiments (vs. randomized experiments) (Cheung & Slavin, 2016). Many meta-analyses even include pre-post studies, or studies that have no pretests, or studies that have pretest differences but fail to control for them. For example, I once criticized a meta-analysis of gifted education in which some studies compared students accepted into gifted programs to students rejected for those programs, controlling for nothing!

A huge problem with meta-meta-analysis is that until recently, meta-analysts rarely screened individual studies to remove those with fatal methodological flaws. Hattie himself rejects this procedure: “There is…no reason to throw out studies automatically because of lower quality” (Hattie, 2009, p. 11).

In order to understand what is going on in the underlying meta-analyses in a meta-meta-analysis, it is crucial to look all the way down to the individual studies. As an illustration, I examined Hattie’s own meta-meta-analysis of feedback, his third-ranked factor, with a mean effect size of +0.79. Hattie & Timperley (2007) located 12 meta-analyses. I found some of the ones with the highest mean effect sizes.

At a mean of +1.24, the meta-analysis with the largest effect size in the Hattie & Timperley (2007) review was a review of research on various reinforcement treatments for students in special education by Skiba, Casey, & Center (1985-86). The reviewers required use of single-subject designs, so the review consisted of a total of 35 students treated one at a time, across 25 studies. Yet it is known that single-subject designs produce much larger effect sizes than ordinary group designs (see What Works Clearinghouse, 2017).

The second-highest effect size, +1.13, was from a meta-analysis by Lysakowski & Walberg (1982), on instructional cues, participation, and corrective feedback. Not enough information is provided to understand the individual studies, but there is one interesting note. A study using a single-subject design, involving two students, had an effect size of 11.81. That is the equivalent of raising a child’s IQ from 100 to 277! It was “winsorized” to the next-highest value of 4.99 (which is like adding 75 IQ points). Many of the studies were correlational, with no controls for inputs, or had no control group, or were pre-post designs.
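
To see where those IQ figures come from: IQ tests are normed to a mean of 100 and a standard deviation of 15, so an effect size converts to IQ points by multiplying by 15. A quick check of the arithmetic:

    IQ_SD = 15  # IQ tests are normed to a mean of 100 and a standard deviation of 15

    for effect_size in (11.81, 4.99):
        iq_points = effect_size * IQ_SD
        print(f"d = {effect_size:5.2f} -> about {iq_points:.0f} IQ points (100 -> {100 + iq_points:.0f})")
    # d = 11.81 -> about 177 IQ points (100 -> 277)
    # d =  4.99 -> about 75 IQ points (100 -> 175)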

A meta-analysis by Rummel and Feinberg (1988), with a reported effect size of +0.60, is perhaps the most humorous inclusion in the Hattie & Timperley (2007) meta-meta-analysis. It consists entirely of brief lab studies of the degree to which being paid or otherwise reinforced for engaging in an activity that was already intrinsically motivating would reduce subjects’ later participation in that activity. Rummel & Feinberg (1988) reported a positive effect size if subjects later did less of the activity they were paid to do. The reviewers decided to code studies positively if their findings corresponded to the theory (i.e., that feedback and reinforcement reduce later participation in previously favored activities), but in fact their “positive” effect size of +0.60 indicates a negative effect of feedback on performance.

I could go on (and on), but I think you get the point. Hattie’s meta-meta-analyses grab big numbers from meta-analyses of all kinds with little regard to the meaning or quality of the original studies, or of the meta-analyses.

If you are familiar with the What Works Clearinghouse (2017), or our own Best-Evidence Syntheses (www.bestevidence.org) or Evidence for ESSA (www.evidenceforessa.org), you will know that individual studies, except for studies of one-to-one tutoring, almost never have effect sizes as large as +0.40, Hattie’s “hinge point.” This is because the WWC, the BEE, and Evidence for ESSA all very carefully screen individual studies. We require control groups, controls for pretests, minimum sample sizes and durations, and measures independent of the treatments. Hattie applies no such standards, and in fact proclaims that they are not necessary.

It is possible, in fact essential, to make genuine progress using high-quality, rigorous research to inform educational decisions. But first we must agree on what standards to apply. Modest effect sizes from studies of practical treatments in real classrooms, over meaningful periods of time, on measures independent of the treatments, tell us how much a replicable treatment will actually improve student achievement, in comparison to what would have been achieved otherwise. I would much rather use a program with an effect size of +0.15 from such studies than programs or practices found to have effect sizes of +0.79 in studies with major flaws. If they understood the situation, I’m sure all educators would agree with me.

To create information that is fair and meaningful, meta-analysts cannot include studies of unknown and mostly low quality. Instead, they need to apply consistent standards of quality for each study, to look carefully at each one and judge its freedom from bias and major methodological flaws, as well as its relevance to practice. A meta-analysis cannot be any better than the studies that go into it. Hattie’s claims are deeply misleading because they are based on meta-analyses that themselves accepted studies of all levels of quality.

Evidence matters in education, now more than ever. Yet Hattie and others who uncritically accept all studies, good and bad, are undermining the value of evidence. This needs to stop if we are to make solid progress in educational practice and policy.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Hattie, J. (2009). Visible learning. New York, NY: Routledge.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 (1), 81-112.

Lysakowski, R., & Walberg, H. (1982). Instructional effects of cues, participation, and corrective feedback: A quantitative synthesis. American Educational Research Journal, 19 (4), 559-578.

Rummel, A., & Feinberg, R. (1988). Cognitive evaluation theory: A review of the literature. Social Behavior and Personality, 16 (2), 147-164.

Skiba, R., Casey, A., & Center, B. (1985-86). Nonaversive procedures in the treatment of classroom behavior problems. The Journal of Special Education, 19 (4), 459-481.

What Works Clearinghouse (2017). Procedures handbook 4.0. Washington, DC: Author.

Photo credit: U.S. Farm Security Administration [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


More Chinese Dragons: How the WWC Could Accelerate Its Pace

[Photo: Chinese dragon]

A few months ago, I wrote a blog entitled “The Mystery of the Chinese Dragon: Why Isn’t the WWC Up to Date?” It really had nothing to do with dragons, but compared the timeliness of the What Works Clearinghouse review of research on secondary reading programs and a Baye et al. (2017) review on the same topic. The graph depicting the difference looked a bit like a Chinese dragon with a long tail near the ground and huge jaws. The horizontal axis was the dates accepted studies had appeared, and the vertical axis was the number of studies. Here is the secondary reading graph.

[Graph: Secondary reading studies accepted by year of appearance, WWC vs. Baye et al. (2017)]

What the graph showed is that the WWC and the U.S. studies from the Baye et al. (2017) review were similar in coverage of studies appearing from 1987 to 2009, but after that diverged sharply, because the WWC is very slow to add new studies, in comparison to reviews using similar methods.

In the time since the Chinese Dragon for secondary reading studies appeared on my blog, my colleagues and I have completed two more reviews, one on programs for struggling readers by Inns et al. (2018) and one on programs for elementary math by Pellegrini et al. (2018). We made new Chinese Dragon graphs for each, which appear below.*

[Graphs: Studies accepted by year of appearance, WWC vs. Inns et al. (2018) for struggling readers and WWC vs. Pellegrini et al. (2018) for elementary math]

*Note: In the reading graph, the line for “Inns et al.” added numbers of studies from the Inns et al. (2018) review of programs for struggling readers to additional studies of programs for all elementary students in an unfinished report.

The new dragons look remarkably like the first. Again, what matters is the similar pattern of accepted studies before 2009 (the “tail”) and the sharply diverging rates in more recent years (the “jaws”).

There are two phenomena that cause the dragons’ “jaws” to be so wide open. The upper jaw, especially in secondary reading and elementary math, indicates that many high-quality, rigorous evaluations have appeared in recent years. Both the WWC inclusion standards and those of the Best Evidence Encyclopedia (BEE; www.bestevidence.org) require control groups, clustered analysis for clustered designs, samples that are well matched at pretest and have similar attrition by posttest, and other features indicating methodological rigor, of the kind expected by the ESSA evidence standards, for example.

The upper jaw of each dragon is increasing so rapidly because rigorous research is increasing rapidly in the U.S. (it is also increasing rapidly in the U.K., but the WWC does not include non-U.S. studies, and non-U.S. studies are removed from the graph for comparability). This increase is due to U.S. Department of Education funding of many rigorous studies in each topic area, through its Institute of Education Sciences (IES) and Investing in Innovation (i3) programs, and special purpose funding such as Striving Readers and Preschool Curriculum Evaluation Research. These recent studies are not only uniformly rigorous, they are also of great importance to educators, as they evaluate current programs being actively disseminated today. Many of the older programs whose evaluations appear on the dragons’ tails no longer exist, as a practical matter. If educators wanted to adopt them, the programs would have to be revised or reinvented. For example, Daisy Quest, still in the WWC, was evaluated on TRS-80 computers not manufactured since the 1980s. Yet exciting new programs with rigorous evaluations, highlighted in the BEE reviews, do not appear at all in the WWC.

I do not understand why the WWC is so slow to add new evaluations, but I suspect that the answer lies in the painstaking procedures any government has to follow to do . . ., well, anything. Perhaps there are very good reasons for this stately pace of progress. However, the result is clear. The graph below shows the publication dates of every study in every subject and grade level accepted by the WWC and entered on its database. This “half-dragon” graph shows that only 26 studies published or made available after 2013 appear on the entire WWC database. Of these, only two have appeared after 2015.

[Graph: Publication dates of all studies accepted by the WWC, across all subjects and grade levels]

The slow pace of the WWC is of particular concern in light of the appearance of the ESSA evidence standards. More educators than ever before must be consulting the WWC, and many must be wondering why programs they know to exist are not listed there, or why recent studies do not appear.

Assuming that there are good reasons for the slow pace of the WWC, or that for whatever reason the pace cannot be greatly accelerated, what can be done to bring the WWC up to date? I have a suggestion.

Imagine that the WWC commissioned someone to do rapid updating of all topics reviewed on the WWC website. The reviews would follow WWC guidelines, but would appear very soon after studies were published or issued. It’s clear that this is possible, because we do it for Evidence for ESSA (www.evidenceforessa.org). Also, the WWC has a number of “quick reviews,” “single study reports,” and so on, scattered around on its site, but not integrated with its main “Find What Works” reviews of various programs. These could be readily integrated with “Find What Works.”

The recent studies identified in this accelerated process might be identified as “provisionally reviewed,” much as the U.S. Patent Office has “patent pending” before inventions are fully patented. Users would have an option to look only at program reports containing fully reviewed studies, or could decide to look at reviews containing both fully and provisionally reviewed studies. If a more time-consuming full review of a study found results different from those of the provisional review, the study report and the program report in which it was contained would be revised, of course.

A process of this kind could bring the WWC up to date and keep it up to date, providing useful, actionable evidence in a timely fashion, while maintaining the current slower process, if there is a rationale for it.

The Chinese dragons we are finding in every subject we have examined indicate the rapid growth and improving quality of evidence on programs for schools and students. The U. S. Department of Education and our whole field should be proud of this, and should make it a beacon on a hill, not hide our light under a bushel. The WWC has the capacity and the responsibility to highlight current, high-quality studies as soon as they appear. When this happens, the Chinese dragons will retire to their caves, and all of us, government, researchers, educators, and students, will benefit.

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (2017). Effective reading programs for secondary students. Manuscript submitted for publication. Also see Baye, A., Lake, C., Inns, A. & Slavin, R. E. (2017, August). Effective reading programs for secondary students. Baltimore, MD: Johns Hopkins University, Center for Research and Reform in Education.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Photo credit: J Bar [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Evidence for ESSA Celebrates its First Anniversary

On February 28, 2017, we launched Evidence for ESSA (www.evidenceforessa.org), our website providing the evidence supporting educational programs according to the standards laid out in the Every Student Succeeds Act (ESSA), passed in December 2015.

Evidence for ESSA began earlier, of course. It really began one day in September 2016, when I heard leaders of the Institute of Education Sciences (IES) and the What Works Clearinghouse (WWC) announce that the WWC would not be changed to align with the ESSA evidence standards. I realized that no one else was going to create scientifically valid, rapid, and easy-to-use websites providing educators with actionable information on programs meeting ESSA standards. We could do it because our group at Johns Hopkins University, and partners all over the world, had been working for many years creating and updating another website, the Best Evidence Encyclopedia (BEE; www.bestevidence.org). BEE reviews were not primarily designed for practitioners, and they did not align with ESSA standards, but at least we were not starting from scratch.

We assembled a group of large membership organizations to advise us and to help us reach thoughtful superintendents, principals, Title I directors, and others who would be users of the final product. They gave us invaluable advice along the way. We also assembled a technical working group (TWG) of distinguished researchers to advise us on key decisions in establishing our website.

It is interesting to note that we have not been able to obtain adequate funding to support Evidence for ESSA. Instead, it is mostly being written by volunteers and graduate students, all of whom are motivated only by a passion for evidence to improve the education of students.

A year after launch, Evidence for ESSA has been used by more than 36,000 unique users, and I hear that it is very useful in helping states and districts meet the ESSA evidence standards.

We get a lot of positive feedback, as well as complaints and concerns, to which we try to respond rapidly. Feedback has been important in changing some of our policies and correcting some errors, and we are glad to get it.

At this moment we are thoroughly up-to-date on reading and math programs for grades pre-kindergarten to 12, and we are working on science, writing, social-emotional outcomes, and summer school. We are also continuing to update our more academic BEE reviews, which draw from our work on Evidence for ESSA.

In my view, the evidence revolution in education is truly a revolution. If the ESSA evidence standards ultimately prevail, education will at long last join fields such as medicine and agriculture in a dynamic of practice to development to evaluation to dissemination to better practice, in an ascending spiral that leads to constantly improving practices and outcomes.

In a previous revolution, Thomas Jefferson said, “If I had to choose between government without newspapers and newspapers without government, I’d take the newspapers.” In our evidence revolution in education, Evidence for ESSA, the WWC, and other evidence sources are our “newspapers,” providing the information that people of good will can use to make wise and informed decisions.

Evidence for ESSA is the work of many dedicated and joyful hands trying to provide our profession with the information it needs to improve student outcomes. The joy in it is the joy in seeing teachers, principals, and superintendents see new, attainable ways to serve their children.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Swallowing Camels

[Photo: camel]

The New Testament contains a wonderful quotation that I use often, because it unfortunately applies to so much of educational research:

Ye blind guides, which strain at a gnat, and swallow a camel (Matthew 23:24).

The point of the quotation, of course, is that we are often fastidious about minor (research) sins while readily accepting major ones.

In educational research, “swallowing camels” applies to studies accepted in top journals or by the What Works Clearinghouse (WWC) despite substantial flaws that lead to major bias, such as use of measures slanted toward the experimental group, or measures administered and scored by the teachers who implemented the treatment. “Straining at gnats” applies to concerns that, while worth attending to, have little potential for bias, yet are often reasons for rejection by journals or downgrading by the WWC. For example, our profession considers p<.05 to indicate statistical significance, while p<.051 should never be mentioned in polite company.

As my faithful readers know, I have written a series of blogs on problems with policies of the What Works Clearinghouse, such as acceptance of researcher/developer-made measures, failure to weight by sample size, use of “substantively important but not statistically significant” as a qualifying criterion, and several others. However, in this blog, I wanted to share with you some of the very worst, most egregious examples of studies that should never have seen the light of day, yet were accepted by the WWC and remain in it to this day. Accepting the WWC as gospel means swallowing these enormous and ugly camels, and I wanted to make sure that those who use the WWC at least think before they gulp.

Camel #1: DaisyQuest. DaisyQuest is a computer-based program for teaching phonological awareness to children in pre-K to Grade 1. The WWC gives DaisyQuest its highest rating, “positive,” for alphabetics, and ranks it eighth among literacy programs for grades pre-K to 1.

There were four studies of DaisyQuest accepted by the WWC. In each, half of the students received DaisyQuest in groups of 3-4, working with an experimenter. In two of the studies, control students never had their hands on a computer before they took the final tests on a computer. In the other two, control students used math software, so they at least got some experience with computers. The outcome tests were all made by the experimenters and all were closely aligned with the content of the software, with the exception of two Woodcock scales used in one of the studies. All studies used a measure called “Undersea Challenge” that closely resembled the DaisyQuest game format and was taken on the computer. All four studies also used the other researcher-made measures. None of the Woodcock measures showed statistically significant differences, but the researcher-made measures, especially Undersea Challenge and other specific tests of phonemic awareness, segmenting, and blending, did show substantial significant differences.

Recall that in the mid- to late 1990s, when these studies were done, students in preschool and kindergarten were unlikely to be getting any systematic teaching of phonemic awareness. So there is no reason to expect the control students to have been learning anything that was tested on the posttests, and it is not surprising that effect sizes averaged +0.62. In the two studies in which control students had never touched a computer, effect sizes were +0.90 and +0.89, respectively.

Camel #2: Brady (1990) study of Reciprocal Teaching

Reciprocal Teaching is a program that teaches students comprehension skills, mostly using science and social studies texts. A 1990 dissertation by P.L. Brady evaluated Reciprocal Teaching in one school, in grades 5-8. The study involved only 12 students, randomly assigned to Reciprocal Teaching or control conditions. The one experimental class was taught by…wait for it…P.L. Brady. The measures included science, social studies, and daily comprehension tests related to the content taught in Reciprocal Teaching but not in the control group. They were created and scored by…(you guessed it) P.L. Brady. There were also two Gates-MacGinitie scales, but they had effect sizes much smaller than the researcher-made (and researcher-scored) tests. The Brady study met WWC standards for “potentially positive” because its mean effect size was greater than +0.25, even though it was not statistically significant.

Camel #3: Schwartz (2005) study of Reading Recovery

Reading Recovery is a one-to-one tutoring program for first graders that has a strong tradition of rigorous research, including a recent large-scale study by May et al. (2016). However, one of the earlier studies of Reading Recovery, by Schwartz (2005), is hard to swallow, so to speak.

In this study, 47 Reading Recovery (RR) teachers across 14 states were asked by e-mail to choose two very similar, low-achieving first graders at the beginning of the year. One student was randomly assigned to receive RR, and one was assigned to the control group, to receive RR in the second semester.

Both students were pre- and posttested on the Observation Survey, a set of measures made by Marie Clay, the developer of RR. In addition, students were tested on Degrees of Reading Power, a standardized test.

The problems with this study mostly have to do with the fact that the teachers who administered pre- and posttests were the very same teachers who provided the tutoring. No researcher or independent tester ever visited the schools. Teachers obviously knew the child they personally tutored. I’m sure the teachers were honest and wanted to be accurate. However, they would have had a strong motivation to see that the outcomes looked good, because they could be seen as evaluations of their own tutoring, and could have had consequences for continuation of the program in their schools.

Most Observation Survey scales involve difficult judgments, so it’s easy to see how teachers’ ratings could be affected by their belief in Reading Recovery.

Further, ten of the 47 teachers never submitted any data. This is a very high rate of attrition within a single school year (21%). Could some teachers, fully aware of their students’ less-than-expected scores, have made some excuse and withheld their data? We’ll never know.

Also recall that most of the tests used in this study were from the Observation Survey made by Clay, which had effect sizes ranging up to +2.49 (!!!). However, on the independent Degrees of Reading Power, the non-significant effect size was only +0.14.

It is important to note that across these “camel” studies, all except Brady (1990) were published. So it was not only the WWC that was taken in.

These “camel” studies are far from unique, and they may not even be the very worst to be swallowed whole by the WWC. But they do give readers an idea of the depth of the problem. No researcher I know of would knowingly accept an experiment in which the control group had never used the equipment on which they were to be posttested, or one with 12 students in which the 6 experimentals were taught by the experimenter, or in which the teachers who tutored students also individually administered the posttests to experimentals and controls. Yet somehow, WWC standards and procedures led the WWC to accept these studies. Swallowing these camels should have caused the WWC a stomach ache of biblical proportions.


References

Brady, P. L. (1990). Improving the reading comprehension of middle school students through reciprocal teaching and semantic mapping and strategies. Dissertation Abstracts International, 52 (03A), 230-860.

May, H., Sirinides, P., Gray, A., & Goldsworthy, H. (2016). Reading Recovery: An evaluation of the four-year i3 scale-up. Newark, DE: University of Delaware, Center for Research in Education and Social Policy.

Schwartz, R. M. (2005). Literacy learning of at-risk first grade students in the Reading Recovery early intervention. Journal of Educational Psychology, 97 (2), 257-267.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.