When my sons were young, they loved to read books about sports heroes, like Magic Johnson. These books would all start off with touching stories about the heroes’ early days, but as soon as they got to athletic feats, it was all victories, against overwhelming odds. Sure, there were a few disappointments along the way, but these only set the stage for ultimate triumph. If this weren’t the case, Magic Johnson would have just been known by his given name, Earvin, and no one would write a book about him.
Magic Johnson was truly a great athlete and is an inspiring leader, no doubt about it. However, like all athletes, he surely had good days and bad ones, good years and bad. Yet the print and electronic media naturally emphasize his very best days and years. The sports press distorts reality to play up its heroes’ accomplishments, but no one really minds. It’s part of the fun.
In educational research evaluating replicable programs and practices, our objectives are quite different. Sports reporting builds up heroes, because that’s what readers want to hear about. But in educational research, we want fair, complete, and meaningful evidence documenting the effectiveness of practical means of improving achievement or other outcomes. The problem is that academic publications in education also give a distorted picture of the outcomes of educational interventions, because studies with significant positive effects (analogous to Magic’s best days) are far more likely to be published than are studies with non-significant differences (like Magic’s worst days). Unlike the situation in sports, these distortions are harmful, because they usually overstate the impact of programs and practices. Then, when educators implement interventions and fail to get the results reported in the journals, faith in the entire research process is undermined.
It has been known for a long time that studies reporting large, positive effects are far more likely to be published than are studies with smaller or null effects. One long-ago study, by Atkinson, Furlong, & Wampold (1982), randomly assigned APA consulting editors to review articles that were identical in all respects except that half got versions with significant positive effects and half got versions with the same outcomes but marked as not significant. The articles with outcomes marked “significant” were twice as likely as those marked “not significant” to be recommended for publication. Reviewers of the “significant” studies even tended to state that the research designs were excellent much more often than did those who reviewed the “non-significant” versions.
Not only do journals tend not to accept articles with null results, but authors of such studies are less likely to submit them, or to seek any sort of publicity. This is called the “file-drawer effect,” where less successful experiments disappear from public view (Glass et al., 1981).
The combination of reviewers’ preferences for significant findings and authors’ reluctance to submit failed experiments means that published sources report substantially larger effects than unpublished sources (e.g., technical reports, dissertations, and theses, often collectively termed “gray literature”). A review of 645 K-12 reading, mathematics, and science studies by Cheung & Slavin (2016) found almost a two-to-one ratio of effect sizes between published and gray literature reports of experimental studies, +0.30 vs. +0.16. Lipsey & Wilson (1993) reported a difference of +0.53 (published) vs. +0.39 (unpublished) in a study of psychological, behavioral, and educational interventions. Similar outcomes have been reported by Polanin, Tanner-Smith, & Hennessy (2016), and many others. Based on these long-established findings, Lipsey & Wilson (1993) suggested that meta-analyses should establish clear, rigorous criteria for study inclusion, but should then include every study that meets those standards, published or not.
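To see how a publication filter alone could open a gap of this kind, here is a minimal simulation sketch in Python. Everything in it is an assumption chosen for illustration, not a figure from the studies cited above: a single true effect size of 0.15 shared by every study, 50 students per group, and publication probabilities of 0.8 for significant positive results and 0.4 for everything else (loosely echoing the two-to-one reviewer preference). The function and parameter names are hypothetical.

```python
import random
import statistics


def simulate_study(rng, true_effect, n_per_group):
    """Simulate one two-group experiment; return (effect size d, significant?)."""
    treatment = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    pooled_sd = ((statistics.variance(treatment)
                  + statistics.variance(control)) / 2) ** 0.5
    d = (statistics.fmean(treatment) - statistics.fmean(control)) / pooled_sd
    # Large-sample approximation to the standard error of d with equal groups.
    standard_error = (2 / n_per_group) ** 0.5
    # "Significant" here means positive and beyond the usual 1.96 cutoff.
    return d, d / standard_error > 1.96


def simulate_literature(n_studies=2000, true_effect=0.15, n_per_group=50,
                        p_publish_significant=0.8, p_publish_other=0.4,
                        seed=1):
    """Split simulated studies into published and gray literature.

    All defaults are illustrative assumptions, not estimates from real data.
    """
    rng = random.Random(seed)
    published, gray = [], []
    for _ in range(n_studies):
        d, significant = simulate_study(rng, true_effect, n_per_group)
        p_publish = p_publish_significant if significant else p_publish_other
        (published if rng.random() < p_publish else gray).append(d)
    return published, gray


published, gray = simulate_literature()
print(f"published: n={len(published)}, mean d={statistics.fmean(published):+.2f}")
print(f"gray:      n={len(gray)}, mean d={statistics.fmean(gray):+.2f}")
print(f"all:       mean d={statistics.fmean(published + gray):+.2f}")
```

In this toy setup every study estimates the same underlying effect, so any difference between the published and gray averages comes from the filter alone. Only the average across all studies, published or not, stays close to the assumed true effect.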
The rationale for restricting interest (or meta-analyses) to published articles was always weak, and in recent years it has weakened further. An increasing proportion of the gray literature consists of technical reports, usually by third-party evaluators, of highly funded experiments. For example, IES and i3 in the U.S., the Education Endowment Foundation (EEF) in the U.K., and the World Bank and other funders in developing countries provide sufficient resources for thorough, high-quality implementations of experimental treatments, as well as state-of-the-art evaluations. These evaluations almost always meet the standards of the What Works Clearinghouse, Evidence for ESSA, and other review facilities, but they are rarely published, largely because third-party evaluators have little incentive to publish.
It is important to note that the number of high-quality unpublished studies is very large. Among the 645 studies reviewed by Cheung & Slavin (2016), all had to meet rigorous standards. Across all of them, 383 (59%) were unpublished. Excluding such studies would greatly diminish the number of high-quality experiments in any review.
I have the greatest respect for articles published in top refereed journals. Journal articles provide much that tech reports rarely do, such as extensive reviews of the literature, context for the study, and discussions of theory and policy. However, the fact that an experimental study appeared in a top journal does not indicate that the article’s findings are representative of all the research on the topic at hand.
The upshot of this discussion is clear. First, meta-analyses of experimental studies should always establish methodological criteria for inclusion (e.g., use of control groups, measures not overaligned or made by developers or researchers, duration, sample size), but never restrict studies to those that appeared in published sources. Second, readers of reviews of research on experimental studies should ignore the findings of reviews that were limited to published articles.
In the popular press, it’s fine to celebrate Magic Johnson’s triumphs and ignore his bad days. But if you want to know his stats, you need to include all of his games, not just the great ones. So it is with research in education. Focusing only on published findings can make us believe in magic, when what we need are the facts.
Atkinson, D. R., Furlong, M. J., & Wampold, B. E. (1982). Statistical significance, reviewer evaluations, and the scientific process: Is there a (statistically) significant relationship? Journal of Counseling Psychology, 29(2), 189–194. https://doi.org/10.1037/0022-0167.29.2.189
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292.
Glass, G. V., McGraw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills: Sage Publications.
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.
Polanin, J. R., Tanner-Smith, E. E., & Hennessy, E. A. (2016). Estimating the difference between published and unpublished effect sizes: A meta-review. Review of Educational Research, 86(1), 207–236. https://doi.org/10.3102/0034654315582067
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.