On Reviews of Research in Education

Not so long ago, every middle class home had at least one encyclopedia. Encyclopedias were prominently displayed, a statement to all that this was a house that valued learning. People consulted the encyclopedia to find out about things of interest to them. Those who did not own encyclopedias found them in the local library, where they were heavily used. As a kid, I loved everything about encyclopedias. I loved to read them, but also loved their musty small, their weight, and their beautiful maps and photos.

There were two important advantages of an encyclopedia. First, it was encyclopedic, so users could be reasonably certain that whatever information they wanted was in there somewhere. Second, they were authoritative. Whatever it said in the encyclopedia was likely to be true, or at least carefully vetted by experts.

blog_10-17-19_encyclopediakid_500x331

In educational research, and all scientific fields, we have our own kinds of encyclopedias. One consists of articles in journals that publish reviews of research. In our field, the Review of Educational Research plays a pre-eminent role in this, but there are many others. Reviews are hugely popular. Invariably, review journals have a much higher citation count than even the most esteemed journals focusing on empirical research. In addition to journals, reviews appear I edited volumes, in online compendia, in technical reports, and other sources. At Johns Hopkins, we produce a bi-weekly newsletter, Best Evidence in Brief (BEiB; https://beibindex.wordpress.com/) that summarizes recent research in education. Two years ago we looked at analytics to find out the favorite articles from BEiB. Although BEiB mostly summarizes individual studies, almost all of its favorite articles were summaries of the findings of recent reviews.

Over time, RER and other review journals become “encyclopedias” of a sort.  However, they are not encyclopedic. No journal tries to ensure that key topics will all be covered over time. Instead, journal reviewers and editors evaluate each review sent to them on its own merits. I’m not criticizing this, but it is the way the system works.

Are reviews in journals authoritative? They are in one sense, because reviews accepted for publication have been carefully evaluated by distinguished experts on the topic at hand. However, review methods vary widely and reviews are written for many purposes. Some are written primarily for theory development, and some are really just essays with citations. In contrast, one category of reviews, meta-analyses, go to great lengths to locate and systematically include all relevant citations. These are not pure types, and most meta-analyses have at least some focus on theory building and discussion of current policy or research issues, even if their main purpose is to systematically review a well-defined set of studies.

Given the state of the art of research reviews in education, how could we create an “encyclopedia” of evidence from all sources on the effectiveness of programs and practices designed to improve student outcomes? The goal of such an activity would be to provide readers with something both encyclopedic and authoritative.

My colleagues and I created two websites that are intended to serve as a sort of encyclopedia of PK-12 instructional programs. The Best Evidence Encyclopedia (BEE; www.bestevidence.org) consists of meta-analyses written by our staff and students, all of which use similar inclusion criteria and review methods. These are used by a wide variety of readers, especially but not only researchers. The BEE has meta-analyses on elementary and secondary reading, reading for struggling readers, writing programs, programs for English learners, elementary and secondary mathematics, elementary and secondary science, early childhood programs, and other topics, so at least as far as achievement outcomes are concerned, it is reasonably encyclopedic. Our second website is Evidence for ESSA, designed more for educators. It seeks to include every program currently in existence, and therefore is truly encyclopedic in reading and mathematics. Sections on social emotional learning, attendance, and science are in progress.

Are the BEE and Evidence for ESSA authoritative as well as encyclopedic? You’ll have to judge for yourself. One important indicator of authoritativeness for the BEE is that all of the meta-analyses are eventually published, so the reviewers for those journals could be considered to be lending authority.

The What Works Clearinghouse (https://ies.ed.gov/ncee/wwc/) could be considered authoritative, as it is a carefully monitored online publication of the U.S. Department of Education. But is it encyclopedic? Probably not, for two reasons. One is that the WWC has difficulty keeping up with new research. Secondly, the WWC does not list programs that do not have any studies that meet its standards. As a result of both of these, a reader who types in the name of a current program may find nothing at all on it. Is this because the program did not meet WWC standards, or because the WWC has not yet reviewed it? There is no way to tell. Still, the WWC makes important contributions in the areas it has reviewed.

Beyond the websites focused on achievement, the most encyclopedic and authoritative source is Blueprints (www.blueprintsprograms.org). Blueprints focuses on drug and alcohol abuse, violence, bullying, social emotional learning, and other topics not extensively covered in other review sources.

In order to provide readers with easy access to all of the reviews meeting a specified level of quality on a given topic, it would be useful to have a source that briefly describes various reviews, regardless of where they appear. For example, a reader might want to know about all of the meta-analyses that focus on elementary mathematics, or dropout prevention, or attendance. These would include review articles published in scientific journals, technical reports, websites, edited volumes, and so on. To be cited in detail, the reviews should have to meet agreed-upon criteria, including a restriction to experimental-control comparison, a broad and well-documented search for eligible studies, documented efforts to include all studies (published or unpublished) that fall within well-specified parameters (e.g., subjects, grade levels, and start and end dates of studies included). Reviews that meet these standards might be highlighted, though others, including less systematic reviews, should be listed as well, as supplementary resources.

Creating such a virtual encyclopedia would be a difficult but straightforward task. At the end, the collection of rigorous reviews would offer readers encyclopedic, authoritative information on the topics of their interest, as well as providing something more important that no paper encyclopedias ever included: contrasting viewpoints from well-informed experts on each topic.

My imagined encyclopedia wouldn’t have the hypnotic musty smell, the impressive heft, or the beautiful maps and photos of the old paper encyclopedias. However, it would give readers access to up-to-date, curated, authoritative, quantitative reviews of key topics in education, with readable and appealing summaries of what was concluded in qualifying reviews.

Also, did I mention that unlike the encyclopedias of old, it would have to be free?

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Do School Districts Really Have Difficulty Meeting ESSA Evidence Standards?

The Center for Educational Policy recently released a report on how school districts are responding to the Every Student Succeeds Act (ESSA) requirement that schools seeking school improvement grants select programs that meet ESSA’s strong, moderate, or promising standards of evidence. Education Week ran a story on the CEP report.

The report noted that many states, districts, and schools are taking the evidence requirements seriously, and are looking at websites and consulting with researchers to help them identify programs that meet the standards. This is all to the good.

However, the report also notes continuing problems districts and schools are having finding out “what works.” Two particular problems were cited. One was that districts and schools were not equipped to review research to find out what works. The other was that rural districts and schools found few programs proven effective in rural schools.

I find these concerns astounding. The same concerns were expressed when ESSA was first passed, in 2015. But that was almost four years ago. Since 2015, the What Works Clearinghouse has added information to help schools identify programs that meet the top two ESSA evidence categories, strong and moderate. Our own Evidence for ESSA, launched in February, 2017, has up-to-date information on virtually all PK-12 reading and math programs currently in dissemination. Among hundreds of programs examined, 113 meet ESSA standards for strong, moderate, or promising evidence of effectiveness. WWC, Evidence for ESSA, and other sources are available online at no cost. The contents of the entire Evidence for ESSA website were imported into Ohio’s own website on this topic, and dozens of states, perhaps all of them, have informed their districts and schools about these sources.

The idea that districts and schools could not find information on proven programs if they wanted to do so is difficult to believe, especially among schools eligible for school improvement grants. Such schools, and the districts in which they are located, write a lot of grant proposals for federal and state funding. The application forms for school improvement grants always explain the evidence requirements, because that is the law. Someone in every state involved with federal funding knows about the WWC and Evidence for ESSA websites. More than 90,000 unique users have used Evidence for ESSA, and more than 800 more sign on each week.

blog_10-10-19_generickids_500x333

As to rural schools, it is true that many studies of educational programs have taken place in urban areas. However, 47 of the 113 programs qualified by Evidence for ESSA were validated in at least one rural study, or a study including a large enough rural sample to enable researchers to separately report program impacts for rural students. Also, almost all widely disseminated programs have been used in many rural schools. So rural districts and schools that care about evidence can find programs that have been evaluated in rural locations, or at least that were evaluated in urban or suburban schools but widely disseminated in rural schools.

Also, it is important to note that if a program was successfully evaluated only in urban or suburban schools, the program still meets the ESSA evidence standards. If no studies of a given outcome were done in rural locations, a rural school in need of better outcomes could, in effect, be asked to choose between a program proven to work somewhere and probably used in dissemination in rural schools, or they could choose a program not proven to work anywhere. Every school and district has to make the best choices for their kids, but if I were a rural superintendent or principal, I’d read up on proven programs, and then go visit some rural schools using that program nearby. Wouldn’t you?

I have no reason to suspect that the CEP survey is incorrect. There are many indications that district and school leaders often do feel that the ESSA evidence rules are too difficult to meet. So what is really going on?

My guess is that there are many district and school leaders who do not want to know about evidence on proven programs. For example, they may have longstanding, positive relationships with representatives of publishers or software developers, or they may be comfortable and happy with the materials and services they are already using, evidence-proven or not. If they do not have evidence of effectiveness that would pass muster with WWC or Evidence for ESSA, the publishers and software developers may push hard on state and district officials, put forward dubious claims for evidence (such as studies with no control groups), and do their best to get by in a system that increasingly demands evidence that they lack. In my experience, district and state officials often complain about having inadequate staff to review evidence of effectiveness, but their concern may be less often finding out what works as it is defending themselves from publishers, software developers, or current district or school users of programs, who maintain that they have been unfairly rated by WWC, Evidence for ESSA, or other reviews. State and district leaders who stand up to this pressure may have to spend a lot of time reviewing evidence or hearing arguments.

On the plus side, at the same time that publishers and software producers may be seeking recognition for their current products, many are also sponsoring evaluations of some of their products that they feel are mostly likely to perform well in rigorous evaluations. Some may be creating new programs that resemble programs that have met evidence standards. If the federal ESSA law continues to demand evidence for certain federal funding purposes, or even to expand this requirement to additional parts of federal grant-making, then over time the ESSA law will have its desired effect, rewarding the creation and evaluation of programs that do meet standards by making it easier to disseminate such programs. The difficulties the evidence movement is experiencing are likely to diminish over time as more proven programs appear, and as federal, state, district, and school leaders get comfortable with evidence.

Evidence-based reform was always going to be difficult, because of the amount of change it entails and the stakes involved. But sooner or later, it is the right thing to do, and leaders who insist on evidence will see increasing levels of learning among their students, at minimal cost beyond what they already spend on untested or ineffective approaches. Medicine went through a similar transition in 1962, when the U.S. Congress first required that medicines be rigorously evaluated for effectiveness and safety. At first, many leaders in the medical profession resisted the changes, but after a while, they came to insist on them. The key is political leadership willing to support the evidence requirement strongly and permanently, so that educators and vendors alike will see that the best way forward is to embrace evidence and make it work for kids.

Photo courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Send Us Your Evaluations!

In last week’s blog, I wrote about reasons that many educational leaders are wary of the ESSA evidence standards, and the evidence-based reform movement more broadly. Chief among these concerns was a complaint that few educational leaders had the training in education research methods to evaluate the validity of educational evaluations. My response to this was to note that it should not be necessary for educational leaders to read and assess individual evaluations of educational programs, because free, easy-to-interpret review websites, such as the What Works Clearinghouse and Evidence for ESSA, already do such reviews. Our Evidence for ESSA website (www.evidenceforessa.org) lists reading and math programs available for use anywhere in the U.S., and we are constantly on the lookout for any we might have missed. If we have done our job well, you should be able to evaluate the evidence base for any program, in perhaps five minutes.

Other evidence-based fields rely on evidence reviews. Why not education? Your physician may or may not know about medical research, but most rely on websites that summarize the evidence. Farmers may be outstanding in their fields, but they rely on evidence summaries. When you want to know about the safety and reliability of cars you might buy, you consult Consumer Reports. Do you understand exactly how they get their ratings? Neither do I, but I trust their expertise. Why should this not be the same for educational programs?

At Evidence for ESSA, we are aiming to provide information on every program available to you, if you are a school or district leader. At the moment, we cover reading and mathematics, grades pre-k to 12. We want to be sure that if a sales rep or other disseminator offers you a program, you can look it up on Evidence for ESSA and it will be there. If there are no studies of the program that meet our standards, we will say so. If there are qualifying studies that either do or do not have evidence of positive outcomes that meet ESSA evidence standards, we will say so. On our website, there is a white box on the homepage. If you type in the name of any reading or math program, the website should show you what we have been able to find out.

What we do not want to happen is that you type in a program title and find nothing. In our website, “nothing” has no useful meaning. We have worked hard to find every program anyone has heard of, and we have found hundreds. But if you know of any reading or math program that does not appear when you type in its name, please tell us. If you have studies of that program that might meet our inclusion criteria, please send them to us, or citations to them. We know that there are always additional programs entering use, and additional research on existing programs.

Why is this so important to us? The answer is simple, Evidence for ESSA exists because we believe it is essential for the progress of evidence-based reform for educators and policy makers to be confident that they can easily find the evidence on any program, not just the most widely used. Our vision is that someday, it will be routine for educators thinking of adopting educational programs to quickly consult Evidence for ESSA (or other reviews) to find out what has been proven to work, and what has not. I heard about a superintendent who, before meeting with any sales rep, asked them to show her the evidence for the effectiveness of their program on Evidence for ESSA or the What Works Clearinghouse. If they had it, “Come on in,” she’d say. If not, “Maybe later.”

Only when most superintendents and other school officials do this will program publishers and other providers know that it is worth their while to have high-quality evaluations done of each of their programs. Further, they will find it worthwhile to invest in the development of programs likely to work in rigorous evaluations, to provide enough quality professional development to give their programs a chance to succeed, and to insist that schools that adopt their proven programs incorporate the methods, materials, and professional development that their own research has told them are needed for success. Insisting on high-quality PD, for example, adds cost to a program, and providers may worry that demanding sufficient PD will price them out of the market. But if all programs are judged on their proven outcomes, they all will require adequate PD, to be sure that the programs will work when evaluated. That is how evidence will transform educational practice and outcomes.

So our attempt to find and fairly evaluate every program in existence is not due to our being nerds or obsessive compulsive neurotics (though these may be true, too). But thorough, rigorous review of the whole body of evidence in every subject and grade level, and for attendance, social emotional learning, and other non-academic outcomes, is part of a plan.

You can help us on this part of our plan. Tell us about anything we have missed, or any mistakes we have made. You will be making an important contribution to the progress of our profession, and to the success of all children.

blog_6-6-19_mail_500x381
Send us your evaluations!
Photo credit: George Grantham Bain Collection, Library of Congress [Public domain]

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Why Do Some Educators Push Back Against Evidence?

In December, 2015, the U.S. Congress passed the Every Student Succeeds Act, or ESSA. Among many other provisions, ESSA defined levels of evidence supporting educational programs: Strong (at least one randomized experiment with positive outcomes), moderate (at least one quasi-experimental study with positive outcomes), and promising (at least one correlational study with positive outcomes). For various forms of federal funding, schools are required (in school improvement) or encouraged (in seven other funding streams) to use programs falling into one of these top three categories. There is also a fourth category, “demonstrates a rationale,” but this one has few practical consequences.

3 ½  years later, the ESSA evidence standards are increasing interest in evidence of effectiveness for educational programs, especially among schools applying for school improvement funding and in state departments of education, which are responsible for managing the school improvement grant process. All of this is to the good, in my view.

On the other hand, evidence is not yet transforming educational practice. Even in portions of ESSA that encourage or require use of proven programs among schools seeking federal funding, schools and districts often try to find ways around the evidence requirements rather than truly embracing them. Even when schools do say they used evidence in their proposals, they may have just accepted assurances from publishers or developers stating that their programs meet ESSA standards, even when this is clearly not so.

blog_5-30-19_pushingcar_500x344
Why are these children in India pushing back on a car?  And why do many educators in our country push back on evidence?

Educators care a great deal about their children’s achievement, and they work hard to ensure their success. Implementing proven, effective programs does not guarantee success, but it greatly increases the chances. So why has evidence of effectiveness played such a limited role in program selection and implementation, even when ESSA, the national education law, defines evidence and requires use of proven programs under certain circumstances?

The Center on Education Policy Report

Not long ago, the Center on Education Policy (CEP) at George Washington University published a report of telephone interviews of state leaders in seven states. The interviews focused on problems states and districts were having with implementation of the ESSA evidence standards. Six themes emerged:

  1. Educational leaders are not comfortable with educational research methods.
  2. State leaders feel overwhelmed serving large numbers of schools qualifying for school improvement.
  3. Districts have to seriously re-evaluate longstanding relationships with vendors of education products.
  4. State and district staff are confused about the prohibition on using Title I school improvement funds on “Tier 4” programs (ones that demonstrate a rationale, but have not been successfully evaluated in a rigorous study).
  5. Some state officials complained that the U.S. Department of Education had not been sufficiently helpful with implementation of ESSA evidence standards.
  6. State leaders had suggestions to make education research more accessible to educators.

What is the Reality?

I’m sure that the concerns expressed by the state and district leaders in the CEP report are sincerely felt. But most of them raise issues that have already been solved at the federal, state, and/or district levels. If these concerns are as widespread as they appear to be, then we have serious problems of communication.

  1. The first theme in the CEP report is one I hear all the time. I find it astonishing, in light of the reality.

No educator needs to be a research expert to find evidence of effectiveness for educational programs. The federal What Works Clearinghouse (https://ies.ed.gov/ncee/wwc/) and our Evidence for ESSA (www.evidenceforessa.org) provide free information on the outcomes of programs, at least in reading and mathematics, that is easy to understand and interpret. Evidence for ESSA provides information on programs that do meet ESSA standards as well as those that do not. We are constantly scouring the literature for studies of replicable programs, and when asked, we review entire state and district lists of adopted programs and textbooks, at no cost. The What Works Clearinghouse is not as up-to-date and has little information on programs lacking positive findings, but it also provides easily interpreted information on what works in education.

In fact, few educational leaders anywhere are evaluating the effectiveness of individual programs by reading research reports one at a time. The What Works Clearinghouse and Evidence for ESSA employ experts who know how to find and evaluate outcomes of valid research and to describe the findings clearly. Why would every state and district re-do this job for themselves? It would be like having every state do its own version of Consumer Reports, or its own reviews of medical treatments. It just makes no sense. In fact, at least in the case of Evidence for ESSA, we know that more than 80,000 unique readers have used Evidence for ESSA since it launched in 2017. I’m sure even larger numbers have used the What Works Clearinghouse and other reviews. The State of Ohio took our entire Evidence for ESSA website and put it on its own state servers with some other information. Several other states have strongly promoted the site. The bottom line is that educational leaders do not have to be research mavens to know what works, and tens of thousands of them know where to find fair and useful information.

  1. State leaders are overwhelmed. I’m sure this is true, but most state departments of education have long been understaffed. This problem is not unique to ESSA.
  2. Districts have to seriously re-evaluate longstanding relationships with vendors. I suspect that this concern is at the core of the problem on evidence. The fact is that most commercial programs do not have adequate evidence of effectiveness. Either they have no qualifying studies (by far the largest number), or they do have qualifying evidence that is not significantly positive. A vendor with programs that do not meet ESSA standards is not going to be a big fan of evidence, or ESSA. These are often powerful organizations with deep personal relationships with state and district leaders. When state officials adhere to a strict definition of evidence, defined in ESSA, local vendors push back hard. Understaffed state departments are poorly placed to fight with vendors and their friends in district offices, so they may be forced to accept weak or no evidence.
  3. Confusions about Tier 4 evidence. ESSA is clear that to receive certain federal funds schools must use programs with evidence in Tiers 1, 2, or 3, but not 4. The reality is that definitions of Tier 4 are so weak that any program on Earth can meet this standard. What program anywhere does not have a rationale? The problem is that districts, states, and vendors have used confusion about Tier 4 to justify any program they wish. Some states are more sophisticated than others and do not allow this, but the very existence of Tier 4 in ESSA language creates a loophole that any clever sales rep or educator can use, or at least try to get away with.
  4. The U. S. Department of Education is not helpful enough. In reality, USDoE is understaffed and overwhelmed on many fronts. In any case, ESSA puts a lot of emphasis on state autonomy, so the feds feel unwelcome in performing oversight.

The Future of Evidence in Education

Despite the serious problems in implementation of ESSA, I still think it is a giant step forward. Every successful field, such as medicine, agriculture, and technology, has started its own evidence revolution fighting entrenched interests and anxious stakeholders. As late as the 1920s, surgeons refused to wash their hands before operations, despite substantial evidence going back to the 1800s that handwashing was essential. Evidence eventually triumphs, though it often takes many years. Education is just at the beginning of its evidence revolution, and it will take many years to prevail. But I am unaware of any field that embraced evidence, only to retreat in the face of opposition. Evidence eventually prevails because it is focused on improving outcomes for people, and people vote. Sooner or later, evidence will transform the practice of education, as it has in so many other fields.

Photo credit: Roger Price from Hong Kong, Hong Kong [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)]

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

A Warm Welcome From Babe Ruth’s Home Town to the Registry of Efficacy and Effectiveness Studies (REES)

Every baseball season, many home runs are hit by various players across the major leagues. But in all of history, there is one home run that stands out for baseball fans. In the 1932 World Series, Babe Ruth (born in Baltimore!) pointed to the center field fence. He then hit the next pitch over that fence, exactly where he said he would.

Just 86 years later, the U.S. Department of Education, in collaboration with the Society for Research on Educational Effectiveness (SREE), launched a new (figurative) center field fence for educational evaluation. It’s called the Registry of Efficacy and Effectiveness Studies (REES). The purpose of REES is to ask evaluators of educational programs to register their research designs, measures, analyses, and other features in advance. This is roughly the equivalent of asking researchers to point to the center field fence, announcing their intention to hit the ball right there. The reason this matters is that all too often, evaluators carry out evaluations that do not produce desired, positive outcomes on some measures or some analyses. They then report outcomes only on the measures that did show positive outcomes, or they might use different analyses from those initially planned, or only report outcomes for a subset of their full sample. On this last point, I remember a colleague long ago who obtained and re-analyzed data from a large and important national study that studied several cities but only reported data for Detroit. In her analyses of data from the other cities, she found that the results the authors claimed were seen only in Detroit, not in any other city.

REES pre-registration will, over time, make it possible for researchers, reviewers, and funders to find out whether evaluators are reporting all of the findings and all of the analyses as they originally planned them.  I would assume that within a period of years, review facilities such as the What Works Clearinghouse will start requiring pre-registration before accepting studies for its top evidence categories. We will certainly do so for Evidence for ESSA. As pre-registration becomes common (as it surely will, if IES is suggesting or requiring it), review facilities such as WWC and Evidence for ESSA will have to learn how to use the pre-registration information. Obviously, minor changes in research designs or measures may be allowed, especially small changes made before posttests are known. For example, if some schools named in pre-registration are not in the posttest sample, the evaluators might explain that the schools closed (not a problem if this did not upset pretest equivalence), but if they withdrew for other reasons, reviewers would want to know why, and would insist that withdrawn schools be included in any intent-to-treat (ITT) analysis. Other fields, including much of medical research, have been using pre-registration for many years, and I’m sure REES and review facilities in education could learn from their experiences and policies.

What I find most heartening in REES and pre-registration is that it is an indication of how much and how rapidly educational research has matured in a short time. Ten years ago REES could not have been realistically proposed. There was too little high-quality research to justify it, and frankly, few educators or policy makers cared very much about the findings of rigorous research. There is still a long way to go in this regard, but embracing pre-registration is one way we say to our profession and ourselves that the quality of evidence in education can stand up to that in any other field, and that we are willing to hold ourselves accountable for the highest standards.

blog_11-29-18_BabeRuth_374x500

In baseball history, Babe Ruth’s “pre-registered” home run in the 1932 series is referred to as the “called shot.” No one had ever done it before, and no one ever did it again. But in educational evaluation, we will soon be calling our shots all the time. And when we say in advance exactly what we are going to do, and then do it, just as we promised, showing real benefits for children, then educational evaluation will take a major step forward in increasing users’ confidence in the outcomes.

 

 

 

Photo credit: Babe Ruth, 1920, unattributed photo [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

Small Studies, Big Problems

Everyone knows that “good things come in small packages.” But in research evaluating practical educational programs, this saying does not apply. Small studies are very susceptible to bias. In fact, among all the factors that can inflate effect sizes in educational experiments, small sample size is among the most powerful. This problem is widely known, and in reviewing large and small studies, most meta-analysts solve the problem by requiring minimum sample sizes and/or weighting effect sizes by their sample sizes. Problem solved.

blog_9-13-18_presents_500x333

For some reason, the What Works Clearinghouse (WWC) has so far paid little attention to sample size. It has not weighted by sample size in computing mean effect sizes, although the WWC is talking about doing this in the future. It has not even set minimums for sample size for its reviews. I know of one accepted study with a total sample size of 12 (6 experimental, 6 control). These procedures greatly inflate WWC effect sizes.

As one indication of the problem, our review of 645 studies of reading, math, and science studies accepted by the Best Evidence Encyclopedia (www.bestevidence.org) found that studies with fewer than 250 subjects had twice the effect sizes of those with more than 250 (effect sizes=+0.30 vs. +0.16). Comparing studies with fewer than 100 students to those with more than 3000, the ratio was 3.5 to 1 (see Cheung & Slavin [2016] at http://www.bestevidence.org/word/methodological_Sept_21_2015.pdf). Several other studies have found the same effect.

Using data from the What Works Clearinghouse reading and math studies, obtained by graduate student Marta Pellegrini (2017), sample size effects were also extraordinary. The mean effect size for sample sizes of 60 or less was +0.37; for samples of 60-250, +0.29; and for samples of more than 250, +0.13. Among all design factors she studied, small sample size made the most difference in outcomes, rivaled only by researcher/developer-made measures. In fact, sample size is more pernicious, because while reviewers can exclude researcher/developer-made measures within a study and focus on independent measures, a study with a small sample has the same problem for all measures. Also, because small-sample studies are relatively inexpensive, there are quite a lot of them, so reviews that fail to attend to sample size can greatly over-estimate overall mean effect sizes.

My colleague Amanda Inns (2018) recently analyzed WWC reading and math studies to find out why small studies produce such inflate outcomes. There are many reasons small-sample studies may produce such large effect sizes. One is that in small studies, researchers can provide extraordinary amounts of assistance or support to the experimental group. This is called “superrealization.” Another is that when studies with small sample sizes find null effects, the studies tend not to be published or made available at all, deemed a “pilot” and forgotten. In contrast, a large study is likely to have been paid for by a grant, which will produce a report no matter what the outcome. There has long been an understanding that published studies produce much higher effect sizes than unpublished studies, and one reason is that small studies are rarely published if their outcomes are not significant.

Whatever the reasons, there is no doubt that small studies greatly overstate effect sizes. In reviewing research, this well-known fact has long led meta-analysts to weight effect sizes by their sample sizes (usually using an inverse variance procedure). Yet as noted earlier, the WWC does not do this, but just averages effect sizes across studies without taking sample size into account.

One example of the problem of ignoring sample size in averaging is provided by Project CRISS. CRISS was evaluated in two studies. One had 231 students. On a staff-developed “free recall” measure, the effect size was +1.07. The other study had 2338 students, and an average effect size on standardized measures of -0.02. Clearly, the much larger study with an independent outcome measure should have swamped the effects of the small study with a researcher-made measure, but this is not what happened. The WWC just averaged the two effect sizes, obtaining a mean of +0.53.

How might the WWC set minimum sample sizes for studies to be included for review? Amanda Inns proposed a minimum of 60 students (at least 30 experimental and 30 control) for studies that analyze at the student level. She suggests a minimum of 12 clusters (6 and 6), such as classes or schools, for studies that analyze at the cluster level.

In educational research evaluating school programs, good things come in large packages. Small studies are fine as pilots, or for descriptive purposes. But when you want to know whether a program works in realistic circumstances, go big or go home, as they say.

The What Works Clearinghouse should exclude very small studies and should use weighting based on sample sizes in computing means. And there is no reason it should not start doing these things now.

References

Inns, A. & Slavin, R. (2018 August). Do small studies add up in the What Works Clearinghouse? Paper presented at the meeting of the American Psychological Association, San Francisco, CA.

Pellegrini, M. (2017, August). How do different standards lead to different conclusions? A comparison between meta-analyses of two research centers. Paper presented at the European Conference on Educational Research (ECER), Copenhagen, Denmark.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

The Curious Case of the Missing Programs

“Let me tell you, my dear Watson, about one of my most curious and vexing cases,” said Holmes. “I call it, ‘The Case of the Missing Programs’. A school superintendent from America sent me a letter.  It appears that whenever she looks in the What Works Clearinghouse to find a program her district wants to use, nine times out of ten there is nothing there!”

Watson was astonished. “But surely there has to be something. Perhaps the missing programs did not meet WWC standards, or did not have positive effects!”

“Not meeting standards or having disappointing outcomes would be something,” responded Holmes, “but the WWC often says nothing at all about a program. Users are apparently confused. They don’t know what to conclude.”

“The missing programs must make the whole WWC less useful and reliable,” mused Watson.

“Just so, my friend,” said Holmes, “and so we must take a trip to America to get to the bottom of this!”

blog_9-6-18_SherlockProf_458x500

While Holmes and Watson are arranging steamship transportation to America, let me fill you in on this very curious case.

In the course of our work on Evidence for ESSA (www.evidenceforessa.org), we are occasionally asked by school district leaders why there is nothing in our website about a given program, text, or software. Whenever this happens, our staff immediately checks to see if there is any evidence we’ve missed. If we are pretty sure that there are no studies of the missing program that meet our standards, we add the program to our website, with a brief indication that there are no qualifying studies. If any studies do meet our standards, we review them as soon as possible and add them as meeting or not meeting ESSA standards.

Sometimes, districts or states send us their entire list of approved texts and software, and we check them all to see that all are included.

From having done this for more than a year, we now have an entry on most of the reading and math programs any district would come up with, though we keep getting more all the time.

All of this seems to us to be obviously essential. If users of Evidence for ESSA look up their favorite programs, or ones they are thinking of adopting, and find that there is no entry, they begin losing confidence in the whole enterprise. They cannot know whether the program they seek was ignored or missed for some reason, or has no evidence of effectiveness, or perhaps has been proven effective but has not been reviewed.

Recently, a large district sent me their list of 98 approved and supplementary texts, software, and other programs in reading and math. They had marked each according to the ratings given by the What Works Clearinghouse and Evidence for ESSA. At the time (a few weeks ago), Evidence for ESSA had listings for 67% of the programs. Today, of course, it has 100%, because we immediately set to work researching and adding in all the programs we’d missed.

What I found astonishing, however, is how few of the district’s programs were mentioned at all in the What Works Clearinghouse. Only 15% of the reading and math programs were in the WWC.

I’ve written previously about how far behind the WWC is in reviewing programs. But the problem with the district list was not just a question of slowness. Many of the programs the WWC missed have been around for some time.

I’m not sure how the WWC decides what to review, but they do not seem to be trying for completeness. I think this is counterproductive. Users of the WWC should expect to be able to find out about programs that meet standards for positive outcomes, those that have an evidence base that meets evidence standards but do not have positive outcomes, those that have evidence not meeting standards, and those that have no evidence at all. Yet it seems clear that the largest category in the WWC is “none of the above.” Most programs a user would be interested in do not appear at all in the WWC. Most often, a lack of a listing means a lack of evidence, but this is not always the case, especially when evidence is recent. One way or another, finding big gaps in any compendium undermines faith in the whole effort. It’s difficult to expect educational leaders to get into the habit of looking for evidence if most of the programs they consider are not listed.

Imagine, for example, that a telephone book was missing a significant fraction of the people who live in a given city. Users would be frustrated about not being able to find their friends, and the gaps would soon undermine confidence in the whole phone book.

****

When Holmes and Watson arrived in the U.S., they spoke with many educators who’d tried to find programs in the WWC, and they heard tales of frustration and impatience. Many former users said they no longer bothered to consult the WWC and had lost faith in evidence in their field. Fortunately, Holmes and Watson got a meeting with U.S. Department of Education officials, who immediately understood the problem and set to work to find the evidence base (or lack of evidence) for every reading and math program in America. Usage of the WWC soared, and support for evidence-based reform in education increased.

Of course, this outcome is fictional. But it need not remain fictional. The problem is real, and the solution is simple. Or as Holmes would say, “Elementary and secondary, my dear Watson!”

Photo credit: By Rumensz [CC0], from Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.