Even Magic Johnson Sometimes Had Bad Games: Why Research Reviews Should Not Be Limited to Published Studies

When my sons were young, they loved to read books about sports heroes, like Magic Johnson. These books would all start off with touching stories about the heroes’ early days, but as soon as they got to athletic feats, it was all victories, against overwhelming odds. Sure, there were a few disappointments along the way, but these only set the stage for ultimate triumph. If this weren’t the case, Magic Johnson would have just been known by his given name, Earvin, and no one would write a book about him.

Magic Johnson was truly a great athlete and is an inspiring leader, no doubt about it. However, like all athletes, he surely had good days and bad ones, good years and bad. Yet the published and electronic media naturally emphasize his very best days and years. The sports press distorts the reality to play up its heroes’ accomplishments, but no one really minds. It’s part of the fun.

Blog_2-13-20_magicjohnson_333x500In educational research evaluating replicable programs and practices, our objectives are quite different. Sports reporting builds up heroes, because that’s what readers want to hear about. But in educational research, we want fair, complete, and meaningful evidence documenting the effectiveness of practical means of improving achievement or other outcomes. The problem is that academic publications in education also distort understanding of outcomes of educational interventions, because studies with significant positive effects (analogous to Magic’s best days) are far more likely to be published than are studies with non-significant differences (like Magic’s worst days). Unlike the situation in sports, these distortions are harmful, usually overstating the impact of programs and practices. Then when educators implement interventions and fail to get the results reported in the journals, this undermines faith in the entire research process.

It has been known for a long time that studies reporting large, positive effects are far more likely to be published than are studies with smaller or null effects. One long-ago study, by Atkinson, Furlong, & Wampold (1982), randomly assigned APA consulting editors to review articles that were identical in all respects except that half got versions with significant positive effects and half got versions with the same outcomes but marked as not significant. The articles with outcomes marked “significant” were twice as likely as those marked “not significant” to be recommended for publication. Reviewers of the “significant” studies even tended to state that the research designs were excellent much more often than did those who reviewed the “non-significant” versions.

Not only do journals tend not to accept articles with null results, but authors of such studies are less likely to submit them, or to seek any sort of publicity. This is called the “file-drawer effect,” where less successful experiments disappear from public view (Glass et al., 1981).

The combination of reviewers’ preferences for significant findings and authors’ reluctance to submit failed experiments leads to a substantial bias in favor of published vs. unpublished sources (e.g., technical reports, dissertations, and theses, often collectively termed “gray literature”). A review of 645 K-12 reading, mathematics, and science studies by Cheung & Slavin (2016) found almost a two-to-one ratio of effect sizes between published and gray literature reports of experimental studies, +0.30 to +0.16. Lipsey & Wilson (1993) reported a difference of +0.53 (published) to +0.39 (unpublished) in a study of psychological, behavioral and educational interventions. Similar outcomes have been reported by Polanin, Tanner-Smith, & Hennessy (2016), and many others. Based on these long-established findings, Lipsey & Wilson (1993) suggested that meta-analyses should establish clear, rigorous criteria for study inclusion, but should then include every study that meets those standards, published or not.

The rationale for restricting interest (or meta-analyses) to published articles was always weak, but in recent years it is diminishing. An increasing proportion of the gray literature consists of technical reports, usually by third-party evaluators, of highly funded experiments. For example, experiments funded by IES and i3 in the U.S., the Education Endowment Foundation (EEF) in the U.K., and the World Bank and other funders in developing countries, provide sufficient resources to do thorough, high-quality implementations of experimental treatments, as well as state-of-the-art evaluations. These evaluations almost always meet the standards of the What Works Clearinghouse, Evidence for ESSA, and other review facilities, but they are rarely published, especially because third-party evaluators have little incentive to publish.

It is important to note that the number of high-quality unpublished studies is very large. Among the 645 studies reviewed by Cheung & Slavin (2016), all had to meet rigorous standards. Across all of them, 383 (59%) were unpublished. Excluding such studies would greatly diminish the number of high-quality experiments in any review.

I have the greatest respect for articles published in top refereed journals. Journal articles provide much that tech reports rarely do, such as extensive reviews of the literature, context for the study, and discussions of theory and policy. However, the fact that an experimental study appeared in a top journal does not indicate that the article’s findings are representative of all the research on the topic at hand.

The upshot of this discussion is clear. First, meta-analyses of experimental studies should always establish methodological criteria for inclusion (e.g., use of control groups, measures not overaligned or made by developers or researchers, duration, sample size), but never restrict studies to those that appeared in published sources. Second, readers of reviews of research on experimental studies should ignore the findings of reviews that were limited to published articles.

In the popular press, it’s fine to celebrate Magic Johnson’s triumphs and ignore his bad days. But if you want to know his stats, you need to include all of his games, not just the great ones. So it is with research in education. Focusing only on published findings can make us believe in magic, when what we need are the facts.

 References

Atkinson, D. R., Furlong, M. J., & Wampold, B. E. (1982). Statistical significance, reviewer evaluations, and the scientific process: Is there a (statistically) significant relationship? Journal of Counseling Psychology, 29(2), 189–194. https://doi.org/10.1037/0022-0167.29.2.189

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Glass, G. V., McGraw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills: Sage Publications.

Lipsey, M.W. & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181-1209.

Polanin, J. R., Tanner-Smith, E. E., & Hennessy, E. A. (2016). Estimating the difference between published and unpublished effect sizes: A meta-review. Review of Educational Research86(1), 207–236. https://doi.org/10.3102/0034654315582067

 

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

Compared to What? Getting Control Groups Right

Several years ago, I had a grant from the National Science Foundation to review research on elementary science programs. I therefore got to attend NSF conferences for principal investigators. At one such conference, we were asked to present poster sessions. The group next to mine was showing an experiment in science education that had remarkably large effect sizes. I got to talking with the very friendly researcher, and discovered that the experiment involved a four-week unit on a topic in middle school science. I think it was electricity. Initially, I was very excited, electrified even, but then I asked a few questions about the control group.

“Of course there was a control group,” he said. “They would have taught electricity too. It’s pretty much a required portion of middle school science.”

Then I asked, “When did the control group teach about electricity?”

“We had no way of knowing,” said my new friend.

“So it’s possible that they had a four-week electricity unit before the time when your program was in use?”

“Sure,” he responded.

“Or possibly after?”

“Could have been,” he said. “It would have varied.”

Being the nerdy sort of person I am, I couldn’t just let this go.

“I assume you pretested students at the beginning of your electricity unit and at the end?”

“Of course.”

“But wouldn’t this create the possibility that control classes that received their electricity unit before you began would have already finished the topic, so they would make no more progress in this topic during your experiment?”

“…I guess so.”

“And,” I continued, “students who received their electricity instruction after your experiment would make no progress either because they had no electricity instruction between pre- and posttest?”

I don’t recall how the conversation ended, but the point is, wonderful though my neighbor’s science program might be, the science achievement outcome of his experiment were, well, meaningless.

In the course of writing many reviews of research, my colleagues and I encounter misuses of control groups all the time, even in articles in respected journals written by well-known researchers. So I thought I’d write a blog on the fundamental issues involved in using control groups properly, and the ways in which control groups are often misused.

The purpose of a control group

The purpose of a control group in any experiment, randomized or matched, is to provide a valid estimate of what the experimental group would have achieved had it not received the experimental treatment, or if the study had not taken place at all. Through random assignment or matching, the experimental and control groups are essentially equal at pretest on all important variables (e.g., pretest scores, demographics), and nothing happens in the course of the experiment to upset this initial equality.

How control groups go wrong

Inequality in opportunities to learn tested content. Often, experiments appear to be legitimate (e.g., experimental and control groups are well matched at pretest), but the design contains major bias, because the content being taught in the experimental group is not the same as the content taught in the control group, and the final outcome measure is aligned to what the experimental group was taught but not what the control group was taught. My story at the start of this blog was an example of this. Between pre- and posttest, all students in the experimental group were learning about electricity, but many of those in the control group had already completed electricity or had not received it yet, so they might have been making great progress on other topics, which were not tested, but were unlikely to make much progress on the electricity content that was tested. In this case, the experimental and control groups could be said to be unequal in opportunities to learn electricity. In such a case, it matters little what the exact content or teaching methods were for the experimental program. Teaching a lot more about electricity is sure to add to learning of that topic regardless of how it is taught.

There are many other circumstances in which opportunities to learn are unequal. Many studies use unusual content, and then use tests partially or completely aligned to this unusual content, but not to what the control group was learning. Another common case is where experimental students learn something involving use of technology, but the control group uses paper and pencil to learn the same content. If the final test is given on the technology used by the experimental but not the control group, the potential for bias is obvious.

blog_2-20-20_schoolstudy_500x333 (2)

Unequal opportunities to learn (as a source of bias in experiments) relates to a topic I’ve written a lot about. Use of developer- or researcher-made outcome measures may introduce unequal opportunities to learn, because these measures are more aligned with what the experimental group was learning than what the control group was learning. However, the problem of unequal opportunities to learn is broader than that of developer/researcher-made measures. For example, the story that began this blog illustrated serious bias, but the measure could have been an off-the-shelf, valid measure of electricity concepts.

Problems with control groups that arise during the experiment. Many problems with control groups only arise after an experiment is under way, or completed. These involve situations in which there are different numbers of students/classes/schools that are not counted in the analysis. Usually, these are cases in which, in theory, experimental and control groups have equal opportunities to learn the tested content at the beginning of the experiment. However, some number of students assigned to the experimental group do not participate in the experiment enough to be considered to have truly received the treatment. Typical examples of this include after-school and summer-school programs. A group of students is randomly assigned to receive after-school services, for example, but perhaps only 60% of the students actually show up, or attend enough days to constitute sufficient participation. The problem is that the researchers know exactly who attended and who did not in the experimental group, but they have no idea which control students would or would not have attended if the control group had had the opportunity. The 40% of students who did not attend can probably be assumed to be less highly motivated, lower achieving, have less supportive parents, or to possess other characteristics that, on average, may identify students who are less likely to do well than students in general. If the researchers drop these 40% of students, the remaining 60% who did participate are likely (on average) to be more motivated, higher achieving, and so on, so the experimental program may look a lot more effective than it truly is. This kind of problem comes up quite often in studies of technology programs, because researchers can easily find out how often students in the experimental group actually logged in and did the required work. If they drop students who did not use the technology as prescribed, then the remaining students who did use the technology as intended are likely to perform better than control students, who will be a mix of students who would or would not have used the technology if they’d had the chance. Because these control groups contain more and less motivated students, while the experimental group only contains the students who were motivated to use the technology, the experimental group may have a huge advantage.

Problems of this kind can be avoided by using intent to treat (ITT) methods, in which all students who were pretested remain in the sample and are analyzed whether or not they used the software or attended the after-school program. Both the What Works Clearinghouse and Evidence for ESSA require use of ITT models in situations of this kind. The problem is that use of ITT analyses almost invariably reduces estimates of effect sizes, but to do otherwise may introduce quite a lot of bias in favor of the experimental groups.

Experiments without control groups

Of course, there are valid research designs that do not require use of control groups at all. These include regression discontinuity designs (in which long-term data trends are studied to see if there is a sharp change at the point when a treatment is introduced) and single-case experimental designs (in which as few as one student/class/school is observed frequently to see what happens when treatment conditions change). However, these designs have their own problems, and single case designs are rarely used outside of special education.

Control groups are essential in most rigorous experimental research in education, and with proper design they can do what they were intended to do with little bias. Education researchers are becoming increasingly sophisticated about fair use of control groups. Next time I go to an NSF conference, for example, I hope I won’t see posters on experiments that compare students who received an experimental treatment to those who did not even receive instruction on the same topic between pretest and posttest.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Queasy about Quasi-Experiments? How Rigorous Quasi-Experiments Can Minimize Bias

I once had a statistics professor who loved to start discussions of experimental design with the following:

“First, pick your favorite random number.”

Obviously, if you pick a favorite random number, it isn’t random. I was recalling this bit of absurdity recently when discussing with colleagues the relative value of randomized experiments (RCTs) and matched studies, or quasi-experimental designs (QED). In randomized experiments, students, teachers, classes, or schools are assigned at random to experimental or control conditions. In quasi-experiments, a group of students, teachers, classes, or schools is identified as the experimental group, and then other schools are located (usually in the same districts) and then matched on key variables, such as prior test scores, percent free lunch, ethnicity, and perhaps other factors. The ESSA evidence standards, the What Works Clearinghouse, Evidence for ESSA, and most methodologists favor randomized experiments over QEDs, but there are situations in which RCTs are not feasible. In a recent “Straight Talk on Evidence,” Jon Baron discussed how QEDs can approach the usefulness of RCTs. In this blog, I build on Baron’s article and go further into strategies for getting the best, most unbiased results possible from QEDs.

Randomized and quasi-experimental studies are very similar in most ways. Both almost always compare experimental and control schools that were very similar on key performance and demographic factors. Both use the same statistics, and require the same number of students or clusters for adequate power. Both apply the same logic, that the control group mean represents a good approximation of what the experimental group would have achieved, on average, if the experiment had never taken place.

However, there is one big difference between randomized and quasi-experiments. In a well-designed randomized experiment, the experimental and control groups can be assumed to be equal not only on observed variables, such as pretests and socio-economic status, but also on unobserved variables. The unobserved variables we worry most about have to do with selection bias. How did it happen (in a quasi-experiment) that the experimental group chose to use the experimental treatment, or was assigned to the experimental treatment? If a set of schools decided to use the experimental treatment on their own, then these schools might be composed of teachers or principals who are more oriented toward innovation, for example. Or if the experimental treatment is difficult, the teachers who would choose it might be more hard-working. If it is expensive, then perhaps the experimental schools have more money. Any of these factors could bias the study toward finding positive effects, because schools that have teachers who are motivated or hard-working, in schools with more resources, might perform better than control schools with or without the experimental treatment.

blog_1-16-20_normalcurve_500x333

Because of this problem of selection bias, studies that use quasi-experimental designs generally have larger effect sizes than do randomized experiments. Cheung & Slavin (2016) studied the effects of methodological features of studies on effect sizes. They obtained effect sizes from 645 studies of elementary and secondary reading, mathematics, and science, as well as early childhood programs. These studies had already passed a screening in which they would have been excluded if they had serious design flaws. The results were as follows:

  No. of studies Mean effect size
Quasi-experiments 449 +0.23
Randomized experiments 196 +0.16

Clearly, mean effect sizes were larger in the quasi-experiments, suggesting the possibility that there was bias. Compared to factors such as sample size and use of developer- or researcher-made measures, the amount of effect size inflation in quasi-experiments was modest, and some meta-analyses comparing randomized and quasi-experimental studies have found no difference at all.

Relative Advantages of Randomized and Quasi-Experiments

Because of the problems of selection bias, randomized experiments are preferred to quasi-experiments, all other factors being equal. However, there are times when quasi-experiments may be necessary for practical reasons. For example, it can be easier to recruit and serve schools in a quasi-experiment, and it can be less expensive. A randomized experiment requires that schools be recruited with the promise that they will receive an exciting program. Yet half of them will instead be in a control group, and to keep them willing to sign up, they may be given a lot of money, or an opportunity to receive the program later on. In a quasi-experiment, the experimental schools all get the treatment they want, and control schools just have to agree to be tested.  A quasi-experiment allows schools in a given district to work together, instead of insisting that experimental and control schools both exist in each district. This better simulates the reality schools are likely to face when a program goes into dissemination. If the problems of selection bias can be minimized, quasi-experiments have many attractions.

An ideal design for quasi-experiments would obtain the same unbiased outcomes as a randomized evaluation of the same treatment might do. The purpose of this blog is to discuss ways to minimize bias in quasi-experiments.

In practice, there are several distinct forms of quasi-experiments. Some have considerable likelihood of bias. However, others have much less potential for bias. In general, quasi-experiments to avoid are forms of post-hoc, or after-the-fact designs, in which determination of experimental and control groups takes place after the experiment. Quasi-experiments with much less likelihood of bias are pre-specified designs, in which experimental and control schools, classrooms, or students are identified and registered in advance. In the following sections, I will discuss these very different types of quasi-experiments.

Post-Hoc Designs

Post-hoc designs generally identify schools, teachers, classes, or students who participated in a given treatment, and then find matches for each in routinely collected data, such as district or school standardized test scores, attendance, or retention rates. The routinely collected data (such as state test scores or attendance) are collected as pre-and posttests from school records, so it may be that neither experimental nor control schools’ staffs are even aware that the experiment happened.

Post-hoc designs sound valid; the experimental and control groups were well matched at pretest, so if the experimental group gained more than the control group, that indicates an effective treatment, right?

Not so fast. There is much potential for bias in this design. First, the experimental schools are almost invariably those that actually implemented the treatment. Any schools that dropped out or (even worse) any that were deemed not to have implemented the treatment enough have disappeared from the study. This means that the surviving schools were different in some important way from those that dropped out. For example, imagine that in a study of computer-assisted instruction, schools were dropped if fewer than 50% of students used the software as much as the developers thought they should. The schools that dropped out must have had characteristics that made them unable to implement the program sufficiently. For example, they might have been deficient in teachers’ motivation, organization, skill with technology, or leadership, all factors that might also impact achievement with or without the computers. The experimental group is only keeping the “best” schools, but the control schools will represent the full range, from best to worst. That’s bias. Similarly, if individual students are included in the experimental group only if they actually used the experimental treatment a certain amount, that introduces bias, because the students who did not use the treatment may be less motivated, have lower attendance, or have other deficits.

As another example, developers or researchers may select experimental schools that they know did exceptionally well with the treatment. Then they may find control schools that match on pretest. The problem is that there could be unmeasured characteristics of the experimental schools that could cause these schools to get good results even without the treatment. This introduces serious bias. This is a particular problem if researchers pick experimental or control schools from a large database. The schools will be matched at pretest, but since the researchers may have many potential control schools to choose among, they may use selection rules that, while they maintain initial equality, introduce bias. The readers of the study might never be able to find out if this happened.

Pre-Specified Designs

The best way to minimize bias in quasi-experiments is to identify experimental and control schools in advance (as contrasted with post hoc), before the treatment is applied. After experimental and control schools, classes, or students are identified and matched on pretest scores and other factors, the names of schools, teachers, and possibly students on each list should be registered on the Registry of Efficacy and Effectiveness Studies. This way, all schools (and all students) involved in the study are counted in intent-to-treat (ITT) analyses, just as is expected in randomized studies. The total effect of the treatment is based on this list, even if some schools or students dropped out along the way. An ITT analysis reflects the reality of program effects, because it is rare that all schools or students actually use educational treatments. Such studies also usually report effects of treatment on the treated (TOT), focusing on schools and students who did implement for treatment, but such analyses are of only minor interest, as they are known to reflect bias in favor of the treatment group.

Because most government funders in effect require use of random assignment, the number of quasi-experiments is rapidly diminishing. All things being equal, randomized studies should be preferred. However, quasi-experiments may better fit the practical realities of a given treatment or population, and as such, I hope there can be a place for rigorous quasi-experiments. We need not be so queasy about quasi-experiments if they are designed to minimize bias.

References

Baron, J. (2019, December 12). Why most non-RCT program evaluation findings are unreliable (and a way to improve them). Washington, DC: Arnold Ventures.

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

 This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Proven Programs Can’t Replicate, Just Like Bees Can’t Fly

In the 1930’s, scientists in France announced that based on principles of aerodynamics, bees could not fly. The only evidence to the contrary was observational, atheoretical, quasi-scientific reports that bees do in fact fly.

The widely known story about bees’ ability to fly came up in a discussion about the dissemination of proven programs in education. Many education researchers and policy makers maintain that the research-development-evaluation-dissemination sequence relied upon for decades to create better ways to educate children has failed. Many observers note that few practitioners seek out research when they consider selection of programs intended to improve student learning or other important outcomes. Research Practice Partnerships, in which researchers work in partnership with local educators to solve problems of importance to the educators, is largely based on the idea that educators are unlikely to use programs or practices unless they personally were involved in creating them. Opponents of evidence-based education policies invariably complain that because schools are so diverse, they are unlikely to adopt programs developed and researched elsewhere, and this is why few research-based programs are widely disseminated.

Dissemination of proven programs is in fact difficult, and there is little evidence of how proven programs might be best disseminated. Recognizing these and many other problems, however, it is important to note one small fact in all this doom and gloom: Proven programs are disseminated. Among the 113 reading and mathematics programs that have met the stringent standards of Evidence for ESSA (www.evidenceforessa.org), most have been disseminated to dozens, hundreds, or thousands of schools. In fact, we do not accept programs that are not in active dissemination (because it is not terribly useful for educators, our target audience, to find out that a proven program is no longer available, or never was). Some (generally newer) programs may only operate in a few schools, but they intend to grow. But most programs, supported by non-profit or commercial organizations, are widely disseminated.

Examples of elementary reading programs with strong, moderate, or promising evidence of effectiveness (by ESSA standards) and wide dissemination include Reading Recovery, Success for All, Sound Partners, Lindamood, Targeted Reading Intervention, QuickReads, SMART, Reading Plus, Spell Read, Acuity, Corrective Reading, Reading Rescue, SuperKids, and REACH. For middle/high, effective and disseminated reading programs include SIM, Read180, Reading Apprenticeship, Comprehension Circuit Training, BARR, ITSS, Passport Reading Journeys, Expository Reading and Writing Course, Talent Development, Collaborative Strategic Reading, Every Classroom Every Day, and Word Generation.

In elementary math, effective and disseminated programs include Math in Focus, Math Expressions, Acuity, FocusMath, Math Recovery, Time to Know, Jump Math, ST Math, and Saxon Math. Middle/high school programs include ASSISTments, Every Classroom Every Day, eMINTS, Carnegie Learning, Core-Plus, and Larson Pre-Algebra.

These are programs that I know have strong, moderate, or promising evidence and are widely disseminated. There may be others I do not know about.

I hope this list convinces any doubters that proven programs can be disseminated. In light of this list, how can it be that so many educators, researchers, and policy makers think that proven educational programs cannot be disseminated?

One answer may be that dissemination of educational programs and practices almost never happens the way many educational researchers wish it did. Researchers put enormous energy into doing research and publishing their results in top journals. Then they are disappointed to find out that publishing in a research journal usually has no impact whatever on practice. They then often try to make their findings more accessible by writing them in plain English in more practitioner-oriented journals. Still, this usually has little or no impact on dissemination.

But writing in journals is rarely how serious dissemination happens. The way it does happen is that the developer or an expert partner (such as a publisher or software company) takes the research ideas and makes them into a program, one that solves a problem that is important to educators, is attractive, professional, and complete, and is not too expensive. Effective programs almost always provide extensive professional development, materials, and software. Programs that provide excellent, appealing, effective professional development, materials, and software become likely candidates for dissemination. I’d guess that virtually every one of the programs I listed earlier took a great idea and made it into an appealing program.

A depressing part of this process is that programs that have no evidence of effectiveness, or even have evidence of ineffectiveness, follow the same dissemination process as do proven programs. Until the 2015 ESSA evidence standards appeared, evidence had a very limited role in the whole development-dissemination process. So far, ESSA has pointed more of a spotlight on evidence of effectiveness, but it is still the case that having strong evidence of effectiveness does not provide a program with a decisive advantage over programs lacking positive evidence. Regardless of their actual evidence bases, most programs today make claims that their programs are “evidence-based” or at least “evidence-informed,” so users can easily be fooled.

However, this situation is changing. First, the government itself is identifying programs with evidence of effectiveness, and may publicize them. Government initiatives such as Investing in Innovation (i3; now called EIR) actually provide funding to proven programs to enable them to begin to scale up their programs. The What Works Clearinghouse (https://ies.ed.gov/ncee/wwc/), Evidence for ESSA (www.evidenceforessa.org), and other sources provide easy access to information on proven programs. In other words, government is starting to intervene to nudge the longstanding dissemination process toward programs proven to work.

blog_10-3-19_Bee_art_500x444Back to the bees, the 1930 conclusion that bees should not be able to fly was overturned in 2005, when American researchers observed what bees actually do when they fly, and discovered that bees do not flap their wings like birds. Instead, they push air forward and back with their wings, creating a low pressure zone above them. This pressure keeps them in the air.

In the same way, educational researchers might stop theorizing about how disseminating proven programs is impossible, but instead, observe several programs that have actually done it. Then we can design government policies to further assist proven programs to build the capital and the organizational capacity to effectively disseminate, and to provide incentives and assistance to help schools in need of proven programs to learn about and adopt them.

Perhaps we could call this Plan Bee.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Hummingbirds and Horses: On Research Reviews

Once upon a time, there was a very famous restaurant, called The Hummingbird.   It was known the world over for its unique specialty: Hummingbird Stew.  It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird?  You’d have to catch dozens of them just for one bowl of stew.

One day, an experienced restauranteur came to The Hummingbird, and asked to speak to the owner.  When they were alone, the visitor said, “You have quite an operation here!  But I have been in the restaurant business for many years, and I have always wondered how you do it.  No one can make money selling Hummingbird Stew!  Tell me how you make it work, and I promise on my honor to keep your secret to my grave.  Do you…mix just a little bit?”

blog_8-8-19_hummingbird_500x359

The Hummingbird’s owner looked around to be sure no one was listening.   “You look honest,” he said. “I will trust you with my secret.  We do mix in a bit of horsemeat.”

“I knew it!,” said the visitor.  “So tell me, what is the ratio?”

“One to one.”

“Really!,” said the visitor.  “Even that seems amazingly generous!”

“I think you misunderstand,” said the owner.  “I meant one hummingbird to one horse!”

In education, we write a lot of reviews of research.  These are often very widely cited, and can be very influential.  Because of the work my colleagues and I do, we have occasion to read a lot of reviews.  Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically.  Well- done reviews can reveal patterns of findings that can be of great value to both researchers and educators.  They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.

However, all too many reviews are deeply flawed.  Frequently, reviews of research make it impossible to check the validity of the findings of the original studies.  As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew.   Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies.  If the individual studies are of good quality, this is wonderfully useful.  But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles.  Few readers are willing to do this.

Recently, I have been looking at a lot of recent reviews, all of them published, often in top journals.  One published review only used pre-post gains.  Presumably, if the reviewers found a study with a control group, they would have ignored the control group data!  Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.

I have also recently seen reviews that include studies with and without control groups (i.e., pre-post gains), and those with and without pretests.  Without pretests, experimental and control groups may have started at very different points, and these differences just carry over to the posttests.  Accepting this jumble of experimental designs, a review makes no sense.  Treatments evaluated using pre-post designs will almost always look far more effective than those that use experimental-control comparisons.

Many published reviews include results from measures that were made up by program developers.  We have documented that analyses using such measures produce outcomes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies rather than longer ones, and from published studies rather than, for example, technical reports.

The biggest problem is that in many reviews, the designs of the individual studies are never described sufficiently to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out whether the review’s conclusions are based on rigorous research and how many are not. Many years ago, I looked into a widely cited review of research on achievement effects of class size.  Study details were lacking, so I had to find and read the original studies.   It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis!   The studies that reduced class size within the usual range (e.g., comparing reductions from 24 to 12) had very small achievement  impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.

It would be great if all reviews excluded studies that used procedures known to inflate effect sizes, but at bare minimum, reviewers should be routinely required to include tables showing critical details, and then analyzed to see if the reported outcomes might be due to studies that used procedures suspected to inflate effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Educational Policies vs. Educational Programs: Evidence from France

Ask any parent what their kids say when they ask them what they did in school today. Invariably, they respond, “Nuffin,” or some equivalent. My four-year-old granddaughter always says, “I played with my fwends.” All well and good.

However, in educational policy, policy makers often give the very same answer when asked, “What did the schools not using the (insert latest policy darling) do?”

“Nuffin’”. Or they say, “Whatever they usually do.” There’s nothing wrong with the latter answer if it’s true. But given the many programs now known to improve student achievement (see www.evidenceforessa.org), why don’t evaluators compare outcomes of new policy initiatives to those of proven educational programs known to improve the same outcomes the policy innovation is supposed to improve, perhaps at far lower cost per student? The evaluations should also compare to “business as usual,” but adding proven programs to evaluations of large policy innovations would help avoid declaring policy innovations to be successful when they are in fact just slightly more effective than “business as usual,” and much less effective or less cost-effective than alternative proven approaches? For example, when evaluating charter schools, why not routinely compare them to whole-school reform models that have similar objectives? When evaluating extending the school day or school year to help high-poverty schools, why not compare these innovations to using the same amount of additional money to hiring tutors to use proven tutoring models to help struggling students? In evaluating policies in which students are held back if they do not read at grade level by third grade, why not compare these approaches to intensive phonics instruction and tutoring in grades K-3, which are known to greatly improve student reading achievement?

blog_7-25-19_LeoandAdaya_375x500
There is nuffin like a good fwend.

As one example of research comparing a policy intervention to a promising educational intervention, I recently saw a very interesting pair of studies from France. Ecalle, Gomes, Auphan, Cros, & Magnan (2019) compared two interventions applied in special priority areas with high poverty levels. Both interventions focused on reading in first grade.

One of the interventions involved halving class size, from approximately 24 students to 12. The other provided intensive reading instruction in small groups (4-6 children) to students who were struggling in reading, as well as less intensive interventions to larger groups (10-12 students). Low achievers got two 30-minute interventions each day for a year, while the higher-performing readers got one 30-minute intervention each day. In both cases, the focus of instruction was on phonics. In all cases, the additional interventions were provided by the students’ usual teachers.

The students in small classes were compared to students in ordinary-sized classes, while the students in the educational intervention were compared to students in same-sized classes who did not get the group interventions. Similar measures and analyses were used in both comparisons.

The results were nearly identical for the class size policy and the educational intervention. Halving class size had effect sizes of +0.14 for word reading and +0.22 for spelling. Results for the educational intervention were +0.13 for word reading, +0.12 for spelling, +0.14 for a group test of reading comprehension, +0.32 for an individual test of comprehension, and +0.19 for fluency.

These studies are less than perfect in experimental design, but they are nevertheless interesting. Most importantly, the class size policy required an additional teacher for each class of 24. Using Maryland annual teacher salaries and benefits ($84,000), that means the cost in our state would be about $3500 per student. The educational intervention required one day of training and some materials. There was virtually no difference in outcomes, but the differences in cost were staggering.

The class size policy was mandated by the Ministry of Education. The educational intervention was offered to schools and provided by a university and a non-profit. As is so often the case, the policy intervention was simplistic, easy to describe in the newspaper, and minimally effective. The class size policy reminds me of a Florida program that extended the school schedule by an hour every day in high-poverty schools, mainly to provide more time for reading instruction. The cost per child was about $800 per year. The outcomes were minimal (ES=+0.05).

After many years of watching what schools do and reviewing research on outcomes of innovations, I find it depressing that policies mandated on a substantial scale are so often found to be ineffective. They are usually far more expensive than much more effective, rigorously evaluated programs that are, however, a bit more difficult to describe, and rarely arouse great debate in the political arena. It’s not that anyone is opposed to the educational intervention, but it is a lot easier to carry a placard saying “Reduce Class Size Now!” than to carry one saying “Provide Intensive Phonics in Small Groups with More Supplemental Teaching for the Lowest Achievers Now!” The latter just does not fit on a placard, and though easy to understand if explained, it does not lend itself to easy communication. Actually, there are much more effective first grade interventions than the one evaluated in France (see www.evidenceforessa.org). At a cost much less than $3500 per student, several one-to-one tutoring programs using well-trained teaching assistants as tutors would have been able to produce an effect size of more than +0.50 for all first graders on average. This would even fit on a placard: “Tutoring Now!”

I am all in favor of trying out policy innovations. But when parents of kids in a proven-program comparison group are asked what they did in school today, they shouldn’t say “nuffin’”. They should say, “My tooter taught me to read. And I played with my fwends.”

References

Ecalle, J., Gomes, C., Auphan, P., Cros, L., & Magnan, A. (2019). Effects of policy and educational interventions intended to reduce difficulties in literacy skills in grade 1. Studies in Educational Evaluation, 61, 12-20.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Charter Schools? Smarter Schools? Why Not Both?

I recently saw an editorial in the May 29 Washington Post, entitled “Denying Poor Children a Chance,” a pro-charter school opinion piece that makes dire predictions about the damage to poor and minority students that would follow if charter expansion were to be limited.  In education, it is common to see evidence-free opinions for and against charter schools, so I was glad to see actual data in the Post editorial.   In my view, if charter schools could routinely and substantially improve student outcomes, especially for disadvantaged students, I’d be a big fan.  My response to charter schools is the same as my response to everything else in education: Show me the evidence.

The Washington Post editorial cited a widely known 2015 Stanford CREDO study comparing urban charter schools to matched traditional public schools (TPS) in the same districts.  Evidence always attracts my attention, so I decided to look into this and other large, multi-district studies. Despite the Post’s enthusiasm for the data, the average effect size was only +0.055 for math and +0.04 for reading.  By anyone’s standards, these are very, very small outcomes.  Outcomes for poor, urban, African American students were somewhat higher, at +0.08 for math and +0.06 for reading, but on the other hand, average effect sizes for White students were negative, averaging -0.05 for math and -0.02 for reading.  Outcomes were also negative for Native American students: -0.10 for math, zero for reading.  With effect sizes so low, these small differences are probably just different flavors of zero.  A CREDO (2013) study of charter schools in 27 states, including non-urban as well as urban schools, found average effect sizes of +0.01 for math and -0.01 for reading. How much smaller can you get?

In fact, the CREDO studies have been widely criticized for using techniques that inflate test scores in charter schools.  They compare students in charter schools to students in traditional public schools, matching on pretests and ethnicity.  This ignores the obvious fact that students in charter schools chose to go there, or their parents chose for them to go.  There is every reason to believe that students who choose to attend charter schools are, on average, higher-achieving, more highly motivated, and better behaved than students who stay in traditional public schools.  Gleason et al. (2010) found that students who applied to charter schools started off 16 percentage points higher in reading and 13 percentage points higher in math than others in the same schools who did not apply.  Applicants were more likely to be White and less likely to be African American or Hispanic, and they were less likely to qualify for free lunch.  Self-selection is a particular problem in studies of students who choose or are sent to “no-excuses” charters, such as KIPP or Success Academies, because the students or their parents know students will be held to very high standards of behavior and accomplishment, and may be encouraged to leave the school if they do not meet those standards (this is not a criticism of KIPP or Success Academies, but when such charter systems use lotteries to select students, the students who show up for the lotteries were at least motivated to participate in a lottery to attend a very demanding school).

Well-designed studies of charter schools usually focus on schools that use lotteries to select students, and then they compare the students who were successful in the lottery to those who were not so lucky.  This eliminates the self-selection problem, as students were selected by a random process.  The CREDO studies do not do this, and this may be why their studies report higher (though still very small) effect sizes than those reported by syntheses of studies of students who all applied to charters, but may have been “lotteried in” or “lotteried out” at random.  A very rigorous WWC synthesis of such studies by Gleason et al. (2010) found that middle school students who were lotteried into charter schools in 32 states performed non-significantly worse than those lotteried out, in math (ES=-0.06) and in reading (ES=-0.08).  A 2015 update of the WWC study found very similar, slightly negative outcomes in reading and math.

It is important to note that “no-excuses” charter schools, mentioned earlier, have had more positive outcomes than other charters.  A recent review of lottery studies by Cheng et al. (2017) found effect sizes of +0.25 for math and +0.17 for reading.  However, such “no-excuses” charters are a tiny percentage of all charters nationwide.

blog_6-5-19_schoolmortorbd_500x422

Other meta-analyses of studies of achievement outcomes of charter schools also exist, but none found effect sizes as high as the CREDO urban study.  The means of +0.055 for math and +0.04 for reading represent upper bounds for effects of urban charter schools.

Charter Schools or Smarter Schools?

So far, every study of achievement effects of charters has focused on impacts of charters on achievement compared to those of traditional public schools.  However, this should not be the only question.  “Charters” and “non-charters” do not exhaust the range of possibilities.

What if we instead ask this question: Among the range of programs available, which are most likely to be most effective at scale?

To illustrate the importance of this question, consider a study in England, which evaluated a program called Engaging Parents Through Mobile Phones.  The program involves texting parents on cell phones to alert them to upcoming tests, inform them about whether students are completing their homework, and tell them what students were being taught in school.  A randomized evaluation (Miller et al, 2017) found effect sizes of +0.06 for math and +0.03 for reading, remarkably similar to the urban charter school effects reported by CREDO (2015).  The cost of the mobile phone program was £6 per student per year, or $7.80.  If you like the outcomes of charter schools, might you prefer to get the same outcomes for $7.80 per child per year, without all the political, legal, and financial stresses of charter schools?

The point here is that rather than arguing about the size of small charter effects, one could consider charters a “treatment” and compare them to other proven approaches.  In our Evidence for ESSA website, we list 112 reading and math programs that meet ESSA standards for “Strong,” “Moderate,” or “Promising” evidence of effectiveness.  Of these, 107 had effect sizes larger than those CREDO (2015) reports for urban charter schools.  In both math and reading, there are many programs with average effect sizes of +0.20, +0.30, up to more than +0.60.  If applied as they were in the research, the best of these programs could, for example, entirely overcome Black-White and Hispanic-White achievement gaps in one or two years.

A few charter school networks have their own proven educational approaches, but the many charters that do not have proven programs should be looking for them.  Most proven programs work just as well in charter schools as they do in traditional public schools, so there is no reason existing charter schools should not proactively seek proven programs to increase their outcomes.  For new charters, wouldn’t it make sense for chartering agencies to encourage charter applicants to systematically search for and propose to adopt programs that have strong evidence of effectiveness?  Many charter schools already use proven programs.  In fact, there are several that specifically became charters to enable them to adopt or maintain our Success for All whole-school reform program.

There is no reason for any conflict between charter schools and smarter schools.  The goal of every school, regardless of its governance, should be to help students achieve their full potential, and every leader of a charter or non-charter school would agree with this. Whatever we think about governance, all schools, traditional or charter, should get smarter, using proven programs of all sorts to improve student outcomes.

References

Cheng, A., Hitt, C., Kisida, B., & Mills, J. N. (2017). “No excuses” charter schools: A meta-analysis of the experimental evidence on student achievement. Journal of School Choice, 11 (2), 209-238.

Clark, M.A., Gleason, P. M., Tuttle, C. C., & Silverberg, M. K., (2015). Do charter schools improve student achievement? Educational Evaluation and Policy Analysis, 37 (4), 419-436.

Gleason, P.M., Clark, M. A., Tuttle, C. C., & Dwoyer, E. (2010).The evaluation of charter school impacts. Washington, DC: What Works Clearinghouse.

Miller, S., Davison, J, Yohanis, J., Sloan, S., Gildea, A., & Thurston, A. (2016). Texting parents: Evaluation report and executive summary. London: Education Endowment Foundation.

Washington Post: Denying poor children a chance. [Editorial]. (May 29, 2019). The Washington Post, A16.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.