Whenever I talk to educators and mention effect sizes, someone inevitably complains. “We don’t understand effect sizes,” they say. I always explain that you don’t have to understand exactly what an effect size is: if you know that bigger effect sizes are good and smaller ones are bad, assuming the research they came from is of equal quality, why do you need to know precisely how they are computed? Sometimes I mention the car reliability ratings Consumer Reports uses, with full red circles at the top and full black circles at the bottom. Does anyone understand how they arrived at those ratings? I don’t, but I don’t care, because like everyone else, what I do know is that I don’t want a car with a reliability rating in the black.
People always tell me that they would like it better if I used “additional months of gain.” I do this when I have to, but I really do not like it, because these “months of gain” do not mean very much, and they work very differently in the early elementary grades than they do in high school.
So here is an idea that some people might find useful. The National Assessment of Educational Progress (NAEP) uses reading and math scales that have a theoretical standard deviation of 50. So an effect size of, say, +0.20 can be expressed as a gain of 10 NAEP points (0.20 x 50 = 10). That’s not really interesting yet, because most people also don’t know what NAEP scores mean.
But here’s another way to use such data that might be more fun and easier to understand. I think people could understand and care about their state’s rank on NAEP scores. For example, the highest-scoring state on 4th grade reading is Massachusetts, with a NAEP reading score of 231 in 2019. What if the 13th-ranked state, Nebraska (222), adopted a great reading program statewide and gained an average effect size of +0.20? That’s equivalent to 10 NAEP points. Such a gain would put Nebraska one point ahead of Massachusetts (if Massachusetts didn’t change). Number 1!
If we learned to speak in terms of how many ranks states would gain from a given effect size, I wonder if that would give educators more understanding of and respect for the findings of experiments. Even fairly small effect sizes, if replicated across a whole state, could propel a state past its traditional rivals. For example, 26th-ranked Wisconsin (220) could equal neighboring 12th-ranked Minnesota (222) with a statewide reading effect size gain of only +0.04. As a practical matter, Wisconsin could increase its fourth grade test scores by an effect size of +0.04, perhaps by using a program with an effect size of +0.20 with (say) the lowest-achieving fifth of its fourth graders.
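For readers who want to check the arithmetic, here is a minimal sketch of the conversion described above, using the 2019 fourth grade reading scores cited in the text. The function names are just illustrative, and the final line assumes the simplest possible case, in which a gain for one fifth of students averages evenly across the whole state.

```python
# Convert effect sizes to NAEP points and back, using NAEP's theoretical SD of 50.
NAEP_SD = 50

def effect_size_to_naep_points(effect_size):
    """Express an effect size in NAEP score points (ES x 50)."""
    return effect_size * NAEP_SD

def effect_size_needed(current_score, target_score):
    """Effect size a state would need to move from its current NAEP score to a target score."""
    return (target_score - current_score) / NAEP_SD

# Nebraska (222) adopting a program with ES = +0.20 statewide:
print(222 + effect_size_to_naep_points(0.20))   # 232.0, one point ahead of Massachusetts (231)

# Wisconsin (220) catching neighboring Minnesota (222):
print(effect_size_needed(220, 222))              # 0.04

# A statewide +0.04 from using a +0.20 program with the lowest-achieving fifth
# of fourth graders, assuming the effect simply averages across the state:
print(round(0.20 * 0.20, 2))                     # 0.04
```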
If only one could get states thinking this way, the meaning and importance of effect sizes would soon become clear. And as a side benefit, perhaps if Wisconsin invested its enthusiasm and money in a “Beat Minnesota” reading campaign, as it does to try to beat the University of Minnesota’s football team, Wisconsin’s students might actually benefit. I can hear it now:
On Wisconsin, On Wisconsin,
Raise effect size high!
We are not such lazy loafers
We can beat the Golden Gophers
Point-oh-four or point-oh-eight
We’ll surpass them, just you wait!
Well, a nerd can dream, can’t he?
_______
Note: No states were harmed in the writing of this blog.
This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.
Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org.
By Robert Slavin and Amanda Neitzel, Johns Hopkins University
In two recent blogs (here and here), I’ve written about Baltimore’s culinary glories: crabs and oysters. My point was just that in both cases, there is a lot you have to discard to get to what matters. But I was of course just setting the stage for a problem that is deadly serious, at least to anyone concerned with evidence-based reform in education.
Meta-analysis has contributed a great deal to educational research and reform, helping readers find out about the broad state of the evidence on practical approaches to instruction and school and classroom organization. Recent methodological developments in meta-analysis and meta-regression, and promotion of the use of these methods by agencies such as IES and NSF, have expanded awareness and use of modern methods.
Yet looking at large numbers of meta-analyses published over the past five years, even up to the present, the quality is highly uneven. That’s putting it nicely. The problem is that most meta-analyses in education are far too unselective with regard to the methodological quality of the studies they include. Actually, I’ve been ranting about this for many years, and along with colleagues have published several articles on it (e.g., Cheung & Slavin, 2016; Slavin & Madden, 2011; Wolf et al., 2020). But clearly, my colleagues and I are not making enough of a difference.
My colleague, Amanda Neitzel, and I thought of a simple way we could communicate the enormous difference it makes if a meta-analysis accepts studies that contain design elements known to inflate effect sizes. In this blog, we once again use the Kulik & Fletcher (2016) meta-analysis of research on computerized intelligent tutoring, which I critiqued in my blog a few weeks ago (here). As you may recall, the only methodological inclusion standards used by Kulik & Fletcher required that studies use RCTs or QEDs, and that they have a duration of at least 30 minutes (!!!). However, they included enough information to allow us to determine the effect sizes that would have resulted if they had a) weighted for sample size in computing means, which they did not, and b) excluded studies with various features known to inflate effect size estimates. Here is a table summarizing our findings when we additionally excluded studies containing procedures known to inflate mean effect sizes:
If you follow meta-analyses, this table should be shocking. It starts out with 50 studies and a very large effect size, ES=+0.65. Just weighting the mean for study sample sizes reduces this to +0.56. Eliminating small studies (n<60) cuts the number of studies almost in half (n=27) and cuts the effect size to +0.39. But the largest reductions are due to excluding “local” measures, which on inspection are always measures made by developers or researchers themselves. (The alternative was “standardized measures.”) By itself, excluding local measures (and weighting) cuts the number of included studies to 12, and the effect size to +0.10, which was not significantly different from zero (p=.17). Excluding small studies, brief studies, and “local” measures together only slightly changes this result, because both small and brief studies almost always use “local” (i.e., researcher-made) measures. Excluding all three, and weighting for sample size, leaves this review with only nine studies and an effect size of +0.09, which is not significantly different from zero (p=.21).
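To make the mechanics concrete, here is a minimal sketch of the kind of re-analysis described above: weighting each study’s effect size by its sample size and then applying exclusion rules for small samples, brief durations, and researcher/developer-made (“local”) measures. The study records below are made-up placeholders, not the actual Kulik & Fletcher data, and the thresholds (n of at least 60, duration of at least 12 weeks, independent measures) are the ones discussed in this blog.

```python
# A sketch of sample-size weighting plus methodological exclusions in a meta-analysis.
# The studies below are hypothetical placeholders, not the Kulik & Fletcher (2016) data.
studies = [
    {"n": 33,  "weeks": 1,  "measure": "local",       "es": 1.17},
    {"n": 48,  "weeks": 1,  "measure": "local",       "es": 0.95},
    {"n": 420, "weeks": 30, "measure": "independent", "es": 0.08},
    {"n": 260, "weeks": 24, "measure": "independent", "es": 0.12},
]

def weighted_mean_es(studies):
    """Mean effect size weighted by sample size (a simple stand-in for inverse-variance weights)."""
    total_weight = sum(s["n"] for s in studies)
    return sum(s["es"] * s["n"] for s in studies) / total_weight

def apply_selective_standards(studies, min_n=60, min_weeks=12):
    """Drop small studies, brief studies, and researcher/developer-made ('local') measures."""
    return [s for s in studies
            if s["n"] >= min_n and s["weeks"] >= min_weeks and s["measure"] == "independent"]

print(round(weighted_mean_es(studies), 2))                            # all studies, weighted
print(round(weighted_mean_es(apply_selective_standards(studies)), 2)) # selective standards only
```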
The estimates at the bottom of the chart represent what we call “selective standards.” These are the standards we apply in every meta-analysis we write (see www.bestevidence.org), and in Evidence for ESSA (www.evidenceforessa.org).
It is easy to see why this matters. Selective standards almost always produce much lower estimates of effect sizes than do less selective reviews, which include studies containing design features known to bias effect sizes upward. Consider how this affects mean effect sizes in meta-analyses. For example, imagine a study that uses two measures of achievement. One is a measure made by the researcher or developer specifically to be “sensitive” to the program’s outcomes. The other is a test independent of the program, such as the GRADE/GMADE or Woodcock, which are standardized tests, though not necessarily state tests. Imagine that the researcher-made measure obtains an effect size of +0.30, while the independent measure has an effect size of +0.10. A less selective meta-analysis would report a mean effect size of +0.20, a respectable-sounding impact. But a selective meta-analysis would report an effect size of +0.10, a very small impact. Which of these estimates represents an outcome with meaning for practice? Clearly, school leaders should not value the +0.30 or +0.20 estimates, which depend on a test designed to be “sensitive” to the treatment. They should care about the gains on the independent test, which represents what educators are trying to achieve and what they are held accountable for. The information from the researcher-made test may be valuable to the researchers, but it has little or no value to educators or students.
The point of this exercise is to illustrate that in meta-analyses, choices of methodological exclusions may entirely determine the outcomes. Had they chosen other exclusions, Kulik & Fletcher could have reported any effect size from +0.09 (n.s.) to +0.65 (p<.001).
The importance of these exclusions is not merely academic. Think how you’d explain the chart above to your sister the principal:
Principal Sis: I’m thinking of using one of those intelligent tutoring programs to improve achievement in our math classes. What do you suggest?
You: Well, it all depends. I saw a review of this in the top journal in education research. It says that if you include very small studies, very brief studies, and studies in which the researchers made the measures, you could have an effect size of +0.65! That’s like seven additional months of learning!
Principal Sis: I like those numbers! But why would I care about small or brief studies, or measures made by researchers? I have 500 kids, we teach all year, and our kids have to pass tests that we don’t get to make up!
You (sheepishly): I guess you’re right, Sis. Well, if you just look at the studies with large numbers of students, which continued for more than 12 weeks, and which used independent measures, the effect size was only +0.09, and that wasn’t even statistically significant.
Principal Sis: Oh. In that case, what kinds of programs should we use?
From a practical standpoint, study features such as small samples or researcher-made measures add a lot to effect sizes while adding nothing to the value of the programs or practices to the students or schools that want to know about them. They just add a lot of bias. It’s like trying to convince someone that corn on the cob is a lot more valuable than corn off the cob, because you get so much more quantity (by weight or volume) for the same money with corn on the cob. Most published meta-analyses only require that studies have control groups, and some do not even require that much. Few exclude researcher- or developer-made measures, or very small or brief studies. The result is that effect sizes in published meta-analyses are very often implausibly large.
Meta-analyses that include studies lacking control groups or studies with small samples, brief durations, pretest differences, or researcher-made measures report overall effect sizes that cannot be fairly compared to those from meta-analyses that excluded such studies. If outcomes depend not on the power of the particular programs but rather on the number of potentially biasing features the meta-analysts did or did not exclude, then the outcomes of meta-analyses are meaningless.
It is important to note that these two examples are not at all atypical. As we have begun to look systematically at published meta-analyses, most of them fail to exclude or control for key methodological factors known to contribute a great deal of bias. Something very serious has to be done to change this. Also, I’d remind readers that there are lots of programs that do meet strict standards and show positive effects based on reality, not on including biasing factors. At www.evidenceforessa.org, you can see more than 120 reading and math programs that meet selective standards for positive impacts. The problem is that in meta-analyses that include studies containing biasing factors, these truly effective programs are swamped by a blizzard of bias.
In my recent blog (here) I proposed a common set of methodological inclusion criteria that I would think most methodologists would agree to. If these (or a similar consensus list) were consistently used, we could make more valid comparisons both within and between meta-analyses. But as long as inclusion criteria remain highly variable from meta-analysis to meta-analysis, then all we can do is pick out the few that do use selective standards, and ignore the rest. What a terrible waste.
References
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.
Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: A meta-analytic review. Review of Educational Research, 86(1), 42-78.
Slavin, R. E., & Madden, N. A. (2011). Measures inherent to treatments in program effectiveness reviews. Journal of Research on Educational Effectiveness, 4, 370-380.
Wolf, R., Morrison, J. M., Inns, A., Slavin, R. E., & Risman, K. (2020). Average effect sizes in developer-commissioned and independent evaluations. Journal of Research on Educational Effectiveness. DOI: 10.1080/19345747.2020.1726537
Photo credit: Deeper Learning 4 All, (CC BY-NC 4.0)
This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.
Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org.
One of the best things about living in Baltimore is eating steamed hard shell crabs every summer. They are steamed in a very spicy seasoning mix, and with Maryland corn and Maryland beer, they define the very peak of existence for Marylanders. (To be precise, the true culture of the crab also extends into Virginia, but does not really exist more than 20 miles inland from the bay).
As every crab eater knows, a steamed crab comes with a lot of inedible shell and other inner furniture. So you get perhaps an ounce of delicious meat for every pound of whole crab. Here is a bit of crab math. Let’s say you have ten pounds of whole crabs, and I have 20 ounces of delicious crabmeat. Who gets more to eat? Obviously I do, because your ten pounds of crabs will only yield 10 ounces of meat.
How Baltimoreans learn about meta-analysis.
All Baltimoreans instinctively understand this from birth. So why is this same principle not understood by so many meta-analysts?
I recently ran across a meta-analysis of research on intelligent tutoring programs by Kulik & Fletcher (2016), published in the Review of Educational Research (RER). The meta-analysis reported an overall effect size of +0.66! Considering that the largest effect size found for one-to-one tutoring in mathematics was “only” +0.31 (Torgerson et al., 2013), it is just plain implausible that the average effect size for a computer-assisted instruction intervention is twice as large. Consider that a meta-analysis our group did on elementary mathematics programs found a mean effect size of +0.19 for all digital programs, across 38 rigorous studies (Slavin & Lake, 2008). So how did Kulik & Fletcher come up with +0.66?
The answer is clear. The authors excluded very few studies, only those of less than 30 minutes’ duration. The studies they included used methods known to greatly inflate effect sizes, but they did not exclude or control for them. To the authors’ credit, they then carefully documented the effects of some key methodological factors. For example, they found that “local” measures (presumably made by researchers) had a mean effect size of +0.73, while standardized measures had an effect size of +0.13, replicating findings of many other reviews (e.g., Cheung & Slavin, 2016). They found that studies with sample sizes less than 80 had an effect size of +0.78, while those with samples of more than 250 had an effect size of +0.30. Brief studies had higher effect sizes than longer studies, as found in many other reviews. All of this is nice to know, but even knowing it all, Kulik & Fletcher failed to control for any of it, not even to weight by sample size. So, for example, the implausible mean effect size of +0.66 includes a study with a sample size of 33, a duration of 80 minutes, and an effect size of +1.17 on a “local” test. Another had 48 students, a duration of 50 minutes, and an effect size of +0.95. Now, if you believe that 80 minutes on a computer is three times as effective for math achievement as months of one-to-one tutoring by a teacher, then I have a lovely bridge in Baltimore I’d like to sell you.
I’ve long been aware of these problems with meta-analyses that neither exclude nor control for characteristics of studies known to greatly inflate effect sizes. This was precisely the flaw for which I criticized John Hattie’s equally implausible reviews. But what I did not know until recently was just how widespread this is.
I was working on a proposal to do a meta-analysis of research on technology applications in mathematics. A colleague located every meta-analysis published on this topic since 2013. She found 20 of them. After looking at the remarkable outcomes reported in a few, I computed a median effect size across all twenty. It was +0.44. That is, to put it mildly, implausible. Looking further, I discovered that only one of the reviews adjusted for sample size (using inverse variance weights). Its mean effect size was +0.05. None of the other 19 meta-analyses, all in respectable journals, controlled for methodological features or excluded studies based on them, and they reported effect sizes as high as +1.02 and +1.05.
Meta-analyses are important, because they are widely read and widely cited, in comparison to individual studies. Yet until meta-analyses consistently exclude, or at least control for, studies with factors known to inflate mean effect sizes, they will have little if any meaning for practice. As things stand now, the overall mean impacts reported by meta-analyses in education depend on how stringent the inclusion standards were, not on how effective the interventions truly were.
This is a serious problem for evidence-based reform. Our field knows how to solve it, but all too many meta-analysts do not do so. This needs to change. We see meta-analyses claiming huge impacts, and then wonder why these effects do not transfer to practice. In fact, these big effect sizes do not transfer because they are due to methodological artifacts, not to actual impacts teachers are likely to obtain in real schools with real students.
Ten pounds (160 ounces) of crabs only appear to be more than 20 ounces of crabmeat, because the crabs contain a lot you need to discard. The same is true of meta-analyses. Small samples, brief durations, and researcher-made measures inflate effect sizes without adding anything to the actual impact of treatments for students. Our job as meta-analysts is to strip away the bias as best we can, and get to the actual impact. Then we can make comparisons and generalizations that make sense, and advance understanding of what really works in education.
In our research group, when we deal with thorny issues of meta-analysis, I often ask my colleagues to imagine that they have a sister who is a principal. “What would you say to her,” I ask, “if she asked what really works, all BS aside? Would you suggest a program that was very effective in a 30-minute study? One that has only been evaluated with 20 students? One that has only been shown to be effective if the researcher gets to make the measure? Principals are sharp, and appropriately skeptical. Your sister would never accept such evidence. Especially if she’s experienced with Baltimore crabs.”
References
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.
Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: A meta-analytic review. Review of Educational Research, 86(1), 42-78.
Slavin, R., & Lake, C. (2008). Effective programs in elementary mathematics: A best-evidence synthesis. Review of Educational Research, 78(3), 427-515.
Torgerson, C. J., Wiggins, A., Torgerson, D., Ainsworth, H., & Hewitt, C. (2013). Every Child Counts: Testing policy effectiveness using a randomised controlled trial, designed, conducted and reported to CONSORT standards. Research In Mathematics Education, 15(2), 141–153. doi:10.1080/14794802.2013.797746.
Photo credit: Kathleen Tyler Conklin/(CC BY 2.0)
This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.
Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org.
Everyone knows that cherry picking is bad. Bad, bad, bad, bad, bad, bad, bad. Cherry picking means showing non-representative examples to give a false impression of the quality of a product, an argument, or a scientific finding. In educational research, for example, cherry picking might mean a publisher or software developer showing off a school using their product that is getting great results, without saying anything about all the other schools using their product that are getting mediocre results, or worse. Very bad, and very common. The existence of cherry picking is one major reason that educational leaders should always look at valid research involving many schools to evaluate the likely impact of a program. The existence of cherry picking also explains why preregistration of experiments is so important, to make it difficult for developers to do many experiments and then publicize only the ones that got the best results, ignoring the others.
However, something that looks a bit like cherry picking can be entirely appropriate, and is in fact an important way to improve educational programs and outcomes. This is when there are variations in outcomes among various programs of a given type. The average across all programs of that type is unimpressive, but some individual programs have done very well, and have replicated their findings in multiple studies.
As an analogy, let’s move from cherries to apples. The first Delicious apple was grown by a farmer in Iowa in 1880. He happened to notice that fruit from one particular branch of one tree had a beautiful shape and a terrific flavor. The Stark Seed Company was looking for new apple varieties, and they bought his tree. They grafted the branch onto an ordinary rootstock, and (as apples are wont to do), every apple on the grafted tree looked and tasted like the ones from that one unusual branch.
Had the farmer been hoping to sell his whole orchard, and had he taken potential buyers to see this one tree and offered them apples picked from this particular branch, that would be gross cherry-picking. However, he knew (and the Stark Seed Company knew) all about grafting, so instead of using his exceptional branch to fool anyone (note that I am resisting the urge to mention “graft and corruption”), the farmer and Stark could replicate that amazing branch. The key here is the word “replicate.” If it were impossible to replicate the amazing branch, the farmer would have had a local curiosity at most, or perhaps just a delicious occasional snack. But with replication, this one branch transformed the eating apple for a century.
Now let’s get back to education. Imagine that there were a category of educational programs that generally had mediocre results in rigorous experiments. There is always variation in educational outcomes, so the developers of each program would know of individual schools using their program and getting fantastic results. This would be useful for marketing, but if the program developers are honest, they would make all studies of their program available, rather than claiming that the unusual super-duper schools represent what an average school that adopts their program is likely to obtain.
However, imagine that there is a program that resembles others in its category in most ways, yet time and again gets results far beyond those obtained by similar programs of the same type. Perhaps there is a “secret sauce,” some specific factor that explains the exceptional outcomes, or perhaps the organization that created and/or disseminates the program is exceptionally capable. Either way, any potential user would be missing something if they selected a program based on the mediocre average achievement outcomes for its category. If the outcomes for one or more programs are outstanding (and assuming costs and implementation characteristics are similar), then the average achievement effects for the category are no longer particularly relevant. Any educator who cares about evidence should be looking for the most effective programs; no one implements an entire category.
I was thinking about apples and cherries because of our group’s work reviewing research on various tutoring programs (Neitzel et al., 2020). As is typical of reviews, we were computing average effect sizes for achievement impacts of categories of programs. Yet these average impacts were much smaller than the replicated impacts for particular programs. For example, the mean effect size for one-to-small-group tutoring was +0.20. Yet various individual programs had mean effect sizes of +0.31, +0.39, +0.42, +0.43, +0.46, and +0.64. In light of these findings, is the practical impact of small-group tutoring truly +0.20, or is it somewhere in the range of +0.31 to +0.64? If educators chose programs based on evidence, they would be looking a lot harder at the programs with the larger impacts, not at the mean of all small-group tutoring approaches.
Educational programs cannot be replicated (grafted) as easily as apple trees can. But just as the value to the Stark Seed Company of the Iowa farmer’s orchard could not be determined by averaging ratings of a sampling of all of his apples, the value of a category of educational programs cannot be determined by its average effects on achievement. Rather, the value of the category should depend on the effectiveness of its best, replicated, and replicable examples.
At least, you have to admit it’s a delicious idea!
References
Neitzel, A., Lake, C., Pellegrini, M., & Slavin, R. (2020). A synthesis of quantitative research on programs for struggling readers in elementary schools. Available at www.bestevidence.org. Manuscript submitted for publication.
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.
Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org.
When my sons were young, they loved to read books about sports heroes, like Magic Johnson. These books would all start off with touching stories about the heroes’ early days, but as soon as they got to athletic feats, it was all victories, against overwhelming odds. Sure, there were a few disappointments along the way, but these only set the stage for ultimate triumph. If this weren’t the case, Magic Johnson would have just been known by his given name, Earvin, and no one would write a book about him.
Magic Johnson was truly a great athlete and is an inspiring leader, no doubt about it. However, like all athletes, he surely had good days and bad ones, good years and bad. Yet the published and electronic media naturally emphasize his very best days and years. The sports press distorts the reality to play up its heroes’ accomplishments, but no one really minds. It’s part of the fun.
In educational research evaluating replicable programs and practices, our objectives are quite different. Sports reporting builds up heroes, because that’s what readers want to hear about. But in educational research, we want fair, complete, and meaningful evidence documenting the effectiveness of practical means of improving achievement or other outcomes. The problem is that academic publications in education also distort understanding of outcomes of educational interventions, because studies with significant positive effects (analogous to Magic’s best days) are far more likely to be published than are studies with non-significant differences (like Magic’s worst days). Unlike the situation in sports, these distortions are harmful, usually overstating the impact of programs and practices. Then when educators implement interventions and fail to get the results reported in the journals, this undermines faith in the entire research process.
It has been known for a long time that studies reporting large, positive effects are far more likely to be published than are studies with smaller or null effects. One long-ago study, by Atkinson, Furlong, & Wampold (1982), randomly assigned APA consulting editors to review articles that were identical in all respects except that half got versions with significant positive effects and half got versions with the same outcomes but marked as not significant. The articles with outcomes marked “significant” were twice as likely as those marked “not significant” to be recommended for publication. Reviewers of the “significant” studies even tended to state that the research designs were excellent much more often than did those who reviewed the “non-significant” versions.
Not only do journals tend not to accept articles with null results, but authors of such studies are less likely to submit them, or to seek any sort of publicity. This is called the “file-drawer effect,” where less successful experiments disappear from public view (Glass et al., 1981).
The combination of reviewers’ preferences for significant findings and authors’ reluctance to submit failed experiments leads to a substantial bias in favor of published vs. unpublished sources (e.g., technical reports, dissertations, and theses, often collectively termed “gray literature”). A review of 645 K-12 reading, mathematics, and science studies by Cheung & Slavin (2016) found almost a two-to-one ratio of effect sizes between published and gray literature reports of experimental studies, +0.30 to +0.16. Lipsey & Wilson (1993) reported a difference of +0.53 (published) to +0.39 (unpublished) in a study of psychological, behavioral and educational interventions. Similar outcomes have been reported by Polanin, Tanner-Smith, & Hennessy (2016), and many others. Based on these long-established findings, Lipsey & Wilson (1993) suggested that meta-analyses should establish clear, rigorous criteria for study inclusion, but should then include every study that meets those standards, published or not.
The rationale for restricting interest (or meta-analyses) to published articles was always weak, and in recent years it has become even weaker. An increasing proportion of the gray literature consists of technical reports, usually by third-party evaluators, of well-funded experiments. For example, experiments funded by IES and i3 in the U.S., the Education Endowment Foundation (EEF) in the U.K., and the World Bank and other funders in developing countries provide sufficient resources to do thorough, high-quality implementations of experimental treatments, as well as state-of-the-art evaluations. These evaluations almost always meet the standards of the What Works Clearinghouse, Evidence for ESSA, and other review facilities, but they are rarely published, especially because third-party evaluators have little incentive to publish.
It is important to note that the number of high-quality unpublished studies is very large. Among the 645 studies reviewed by Cheung & Slavin (2016), all had to meet rigorous standards. Across all of them, 383 (59%) were unpublished. Excluding such studies would greatly diminish the number of high-quality experiments in any review.
I have the greatest respect for articles published in top refereed journals. Journal articles provide much that tech reports rarely do, such as extensive reviews of the literature, context for the study, and discussions of theory and policy. However, the fact that an experimental study appeared in a top journal does not indicate that the article’s findings are representative of all the research on the topic at hand.
The upshot of this discussion is clear. First, meta-analyses of experimental studies should always establish methodological criteria for inclusion (e.g., use of control groups, measures not overaligned or made by developers or researchers, duration, sample size), but never restrict studies to those that appeared in published sources. Second, readers of reviews of research on experimental studies should ignore the findings of reviews that were limited to published articles.
In the popular press, it’s fine to celebrate Magic Johnson’s triumphs and ignore his bad days. But if you want to know his stats, you need to include all of his games, not just the great ones. So it is with research in education. Focusing only on published findings can make us believe in magic, when what we need are the facts.
References
Atkinson, D. R., Furlong, M. J., & Wampold, B. E. (1982). Statistical significance, reviewer evaluations, and the scientific process: Is there a (statistically) significant relationship? Journal of Counseling Psychology, 29(2), 189–194. https://doi.org/10.1037/0022-0167.29.2.189
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills: Sage Publications.
Lipsey, M.W. & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181-1209.
Polanin, J. R., Tanner-Smith, E. E., & Hennessy, E. A. (2016). Estimating the difference between published and unpublished effect sizes: A meta-review. Review of Educational Research, 86(1), 207–236. https://doi.org/10.3102/0034654315582067
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.
I once had a statistics professor who loved to start discussions of experimental design with the following:
“First, pick your favorite random number.”
Obviously, if you pick a favorite random number, it isn’t random. I was recalling this bit of absurdity recently when discussing with colleagues the relative value of randomized experiments (RCTs) and matched studies, or quasi-experimental designs (QEDs). In randomized experiments, students, teachers, classes, or schools are assigned at random to experimental or control conditions. In quasi-experiments, a group of students, teachers, classes, or schools is identified as the experimental group, and then other schools, usually in the same districts, are located and matched on key variables, such as prior test scores, percent free lunch, ethnicity, and perhaps other factors. The ESSA evidence standards, the What Works Clearinghouse, Evidence for ESSA, and most methodologists favor randomized experiments over QEDs, but there are situations in which RCTs are not feasible. In a recent “Straight Talk on Evidence,” Jon Baron discussed how QEDs can approach the usefulness of RCTs. In this blog, I build on Baron’s article and go further into strategies for getting the best, most unbiased results possible from QEDs.
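As a concrete illustration of the matching step just described, here is a minimal sketch of nearest-neighbor matching of control schools to experimental schools on pretest scores and percent free lunch. It is a toy example of the general idea, not a description of any particular study’s procedure; the school records and the distance weights are made up.

```python
# A toy sketch of matching control schools to experimental schools in a quasi-experiment.
# Each record holds matching variables mentioned above: prior test scores and percent
# free lunch. All values are hypothetical.
experimental = [
    {"name": "School A", "pretest": 215, "pct_free_lunch": 72},
    {"name": "School B", "pretest": 228, "pct_free_lunch": 55},
]
candidate_controls = [
    {"name": "School X", "pretest": 214, "pct_free_lunch": 70},
    {"name": "School Y", "pretest": 230, "pct_free_lunch": 58},
    {"name": "School Z", "pretest": 201, "pct_free_lunch": 90},
]

def distance(a, b):
    """Simple weighted distance on the matching variables (the weights are arbitrary here)."""
    return abs(a["pretest"] - b["pretest"]) + 0.5 * abs(a["pct_free_lunch"] - b["pct_free_lunch"])

def match_controls(experimental, candidates):
    """Greedy nearest-neighbor matching without replacement."""
    remaining = list(candidates)
    matches = {}
    for school in experimental:
        best = min(remaining, key=lambda c: distance(school, c))
        matches[school["name"]] = best["name"]
        remaining.remove(best)
    return matches

print(match_controls(experimental, candidate_controls))
# {'School A': 'School X', 'School B': 'School Y'}
```

Matching like this can balance the groups on what was measured; as the rest of this blog explains, the unmeasured differences are where the trouble lies.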
Randomized and quasi-experimental studies are very similar in most ways. Both almost always compare experimental and control schools that were very similar on key performance and demographic factors. Both use the same statistics, and require the same number of students or clusters for adequate power. Both apply the same logic, that the control group mean represents a good approximation of what the experimental group would have achieved, on average, if the experiment had never taken place.
However, there is one big difference between randomized and quasi-experiments. In a well-designed randomized experiment, the experimental and control groups can be assumed to be equal not only on observed variables, such as pretests and socio-economic status, but also on unobserved variables. The unobserved variables we worry most about have to do with selection bias. How did it happen (in a quasi-experiment) that the experimental group chose to use the experimental treatment, or was assigned to the experimental treatment? If a set of schools decided to use the experimental treatment on their own, then these schools might be composed of teachers or principals who are more oriented toward innovation, for example. Or if the experimental treatment is difficult, the teachers who would choose it might be more hard-working. If it is expensive, then perhaps the experimental schools have more money. Any of these factors could bias the study toward finding positive effects, because schools that have teachers who are motivated or hard-working, in schools with more resources, might perform better than control schools with or without the experimental treatment.
Because of this problem of selection bias, studies that use quasi-experimental designs generally have larger effect sizes than do randomized experiments. Cheung & Slavin (2016) studied the effects of methodological features of studies on effect sizes. They obtained effect sizes from 645 studies of elementary and secondary reading, mathematics, and science, as well as early childhood programs. These studies had already passed a screening in which they would have been excluded if they had serious design flaws. The results were as follows:
                         No. of studies   Mean effect size
Quasi-experiments             449              +0.23
Randomized experiments        196              +0.16
Clearly, mean effect sizes were larger in the quasi-experiments, suggesting the possibility that there was bias. Compared to factors such as sample size and use of developer- or researcher-made measures, the amount of effect size inflation in quasi-experiments was modest, and some meta-analyses comparing randomized and quasi-experimental studies have found no difference at all.
Relative Advantages of Randomized and Quasi-Experiments
Because of the problems of selection bias, randomized experiments are preferred to quasi-experiments, all other factors being equal. However, there are times when quasi-experiments may be necessary for practical reasons. For example, it can be easier to recruit and serve schools in a quasi-experiment, and it can be less expensive. A randomized experiment requires that schools be recruited with the promise that they will receive an exciting program. Yet half of them will instead be in a control group, and to keep them willing to sign up, they may be given a lot of money, or an opportunity to receive the program later on. In a quasi-experiment, the experimental schools all get the treatment they want, and control schools just have to agree to be tested. A quasi-experiment allows schools in a given district to work together, instead of insisting that experimental and control schools both exist in each district. This better simulates the reality schools are likely to face when a program goes into dissemination. If the problems of selection bias can be minimized, quasi-experiments have many attractions.
An ideal design for quasi-experiments would obtain the same unbiased outcomes as a randomized evaluation of the same treatment might do. The purpose of this blog is to discuss ways to minimize bias in quasi-experiments.
In practice, there are several distinct forms of quasi-experiments. Some have considerable likelihood of bias. However, others have much less potential for bias. In general, quasi-experiments to avoid are forms of post-hoc, or after-the-fact designs, in which determination of experimental and control groups takes place after the experiment. Quasi-experiments with much less likelihood of bias are pre-specified designs, in which experimental and control schools, classrooms, or students are identified and registered in advance. In the following sections, I will discuss these very different types of quasi-experiments.
Post-Hoc Designs
Post-hoc designs generally identify schools, teachers, classes, or students who participated in a given treatment, and then find matches for each in routinely collected data, such as district or school standardized test scores, attendance, or retention rates. The routinely collected data (such as state test scores or attendance) are collected as pre- and posttests from school records, so it may be that neither experimental nor control schools’ staffs are even aware that the experiment happened.
Post-hoc designs sound valid; the experimental and control groups were well matched at pretest, so if the experimental group gained more than the control group, that indicates an effective treatment, right?
Not so fast. There is much potential for bias in this design. First, the experimental schools are almost invariably those that actually implemented the treatment. Any schools that dropped out or (even worse) any that were deemed not to have implemented the treatment enough have disappeared from the study. This means that the surviving schools were different in some important way from those that dropped out. For example, imagine that in a study of computer-assisted instruction, schools were dropped if fewer than 50% of students used the software as much as the developers thought they should. The schools that dropped out must have had characteristics that made them unable to implement the program sufficiently. For example, they might have been deficient in teachers’ motivation, organization, skill with technology, or leadership, all factors that might also impact achievement with or without the computers. The experimental group is only keeping the “best” schools, but the control schools will represent the full range, from best to worst. That’s bias. Similarly, if individual students are included in the experimental group only if they actually used the experimental treatment a certain amount, that introduces bias, because the students who did not use the treatment may be less motivated, have lower attendance, or have other deficits.
As another example, developers or researchers may select experimental schools that they know did exceptionally well with the treatment. Then they may find control schools that match on pretest. The problem is that there could be unmeasured characteristics of the experimental schools that could cause these schools to get good results even without the treatment. This introduces serious bias. This is a particular problem if researchers pick experimental or control schools from a large database. The schools will be matched at pretest, but since the researchers may have many potential control schools to choose among, they may use selection rules that, while they maintain initial equality, introduce bias. The readers of the study might never be able to find out if this happened.
Pre-Specified Designs
The best way to minimize bias in quasi-experiments is to identify experimental and control schools in advance (as contrasted with post hoc), before the treatment is applied. After experimental and control schools, classes, or students are identified and matched on pretest scores and other factors, the names of schools, teachers, and possibly students on each list should be registered on the Registry of Efficacy and Effectiveness Studies. This way, all schools (and all students) involved in the study are counted in intent-to-treat (ITT) analyses, just as is expected in randomized studies. The total effect of the treatment is based on this list, even if some schools or students dropped out along the way. An ITT analysis reflects the reality of program effects, because it is rare that all schools or students actually use educational treatments. Such studies also usually report effects of treatment on the treated (TOT), focusing on schools and students who did implement the treatment, but such analyses are of only minor interest, as they are known to reflect bias in favor of the treatment group.
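To illustrate the distinction, here is a minimal sketch of intent-to-treat (ITT) versus treatment-on-the-treated (TOT) means, using made-up school-level gain scores. The point is simply that ITT keeps every pre-registered school, including dropouts, while TOT keeps only the implementers, which tends to flatter the treatment.

```python
# A toy illustration of ITT vs. TOT analysis with made-up school-level gain scores.
# 'implemented' marks whether a pre-registered experimental school actually used the program.
experimental_schools = [
    {"gain": 8.0, "implemented": True},
    {"gain": 6.5, "implemented": True},
    {"gain": 1.0, "implemented": False},  # dropped out, but still counted in ITT
    {"gain": 2.5, "implemented": False},  # dropped out, but still counted in ITT
]

def mean(values):
    return sum(values) / len(values)

itt_mean = mean([s["gain"] for s in experimental_schools])                      # all registered schools
tot_mean = mean([s["gain"] for s in experimental_schools if s["implemented"]])  # implementers only

print(f"ITT mean gain: {itt_mean:.2f}")  # 4.50
print(f"TOT mean gain: {tot_mean:.2f}")  # 7.25 -- higher, reflecting selection toward implementers
```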
Because most government funders in effect require use of random assignment, the number of quasi-experiments is rapidly diminishing. All things being equal, randomized studies should be preferred. However, quasi-experiments may better fit the practical realities of a given treatment or population, and as such, I hope there can be a place for rigorous quasi-experiments. We need not be so queasy about quasi-experiments if they are designed to minimize bias.
References
Baron, J. (2019, December 12). Why most non-RCT program evaluation findings are unreliable (and a way to improve them). Washington, DC: Arnold Ventures.
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.
On a recent trip to Norway, I visited the Fram Museum in Oslo. The Fram was Roald Amundsen’s ship, used to transport a small crew to the South Pole in 1911. The museum is built around the Fram itself, and visitors can go aboard this amazing ship, surrounded by information and displays about polar exploration. What was most impressive about the Fram is the meticulous attention to detail in every aspect of the expedition. Amundsen had undertaken other trips to the polar seas to prepare for his trip, and had carefully studied the experiences of other polar explorers. The ship’s hull was specially built to withstand crushing from the shifting of polar ice. He carried many huskies to pull sleds over the ice, and trained them to work in teams. Every possible problem was carefully anticipated in light of experience, and exact amounts of food for men and dogs were allocated and stored. Amundsen said that forgetting “a single trouser button” could doom the effort. As it unfolded, everything worked as anticipated, and all the men and dogs returned safely after reaching the South Pole.
From At the South Pole by Roald Amundsen, 1913 [Public domain]
The story of Amundsen and the Fram is an illustration of how to overcome major obstacles to achieve audacious goals. I’d like to build on it to return to a topic I’ve touched on in two previous blogs. The audacious goal: Overcoming the substantial gap in elementary reading achievement between students who qualify for free lunch and those who do not, between African American and White students, and between Hispanic and non-Hispanic students. According to the National Assessment of Educational Progress (NAEP), each of these gaps is about one half of a standard deviation, also known as an effect size of +0.50. This is a very large gap, but it has been overcome in a very small number of intensive programs. These programs were able to increase the achievement of disadvantaged students by an effect size of more than +0.50, but few were able to reproduce these gains under normal circumstances. Our goal is to enable thousands of ordinary schools serving disadvantaged students to achieve such outcomes, at a cost of no more than 5% beyond ordinary per-pupil costs.
Educational Reform and Audacious Goals
Researchers have long been creating and evaluating many different approaches to improving reading achievement. This is necessary in the research and development process to find “what works” and build up from there. However, each individual program or practice has a modest effect on key outcomes, and we rarely combine proven programs to achieve an effect large enough to, for example, overcome the achievement gap. This is not what Amundsen, or the Wright Brothers, or the worldwide team that achieved the eradication of smallpox did. Instead, they set audacious goals and kept at them systematically, using what works, until they were achieved.
I would argue that we should and could do the same in education. The reading achievement gap is the largest problem of educational practice and policy in the U.S. We need to use everything we know how to do to solve it. This means stating in advance that our goal is to find strategies capable of eliminating reading gaps at scale, and refusing to declare victory until this goal is achieved. We need to establish that the goal can be achieved, by ordinary teachers and principals in ordinary schools serving disadvantaged students.
Tutoring Our Way to the Goal
In a previous blog I proposed that the goal of +0.50 could be reached by providing disadvantaged, low-achieving students tutoring in small groups or, when necessary, one-to-one. As I argued there and elsewhere, there is no reading intervention as effective as tutoring. Recent reviews of research have found that well-qualified teaching assistants using proven methods can achieve outcomes as good as those achieved by certified teachers working as tutors, thereby making tutoring much less expensive and more replicable (Inns et al., 2019). Providing schools with significant numbers of well-trained tutors is one likely means of reaching ES=+0.50 for disadvantaged students. Inns et al. (2019) found an average effect size of +0.38 for tutoring by teaching assistants, but several programs had effect sizes of +0.40 to +0.47. This is not +0.50, but it is within striking distance of the goal. However, each school would need multiple tutors in order to provide high-quality tutoring to most students, to extend the known positive effects of tutoring to the whole school.
Combining Intensive Tutoring With Success for All
Tutoring may be sufficient by itself, but research on tutoring has rarely used tutoring schoolwide, to benefit all students in high-poverty schools. It may be more effective to combine widespread tutoring for students who most need it with other proven strategies designed for the whole school, rather than simply extending a program designed for individuals and small groups. One logical strategy to reach the goal of +0.50 in reading might be to combine intensive tutoring with our Success for All whole-school reform model.
Success for All adds to intensive tutoring in several ways. It provides teachers with professional development on proven reading strategies, as well as cooperative learning and classroom management strategies at all levels. Strengthening core reading instruction reduces the number of children at great risk, and even for students who are receiving tutoring, it provides a setting in which students can apply and extend their skills. For students who do not need tutoring, Success for All provides acceleration. In high-poverty schools, students who are meeting reading standards are likely to still be performing below their potential, and improving instruction for all is likely to help these students excel.
Success for All was created in the late 1980s in an attempt to achieve a goal similar to the +0.50 challenge. In its first major evaluation, a matched study in six high-poverty Baltimore elementary schools, Success for All achieved a schoolwide reading effect size of at least +0.50 in grades 1-5 on individually administered reading measures. For students in the lowest 25% of the sample at pretest, the effect size averaged +0.75 (Madden et al., 1993). That experiment provided two to six certified teacher tutors per school, who worked one to one with the lowest-achieving first and second graders. The tutors supplemented a detailed reading program, which used cooperative learning, phonics, proven classroom management methods, parent involvement, frequent assessment, distributed leadership, and other elements (as Success for All still does).
An independent follow-up assessment found that the effect maintained to the eighth grade, and also showed a halving of retentions in grade and a halving of assignments to special education, compared to the control group (Borman & Hewes, 2002). Schools using Success for All since that time have rarely been able to afford so many tutors, instead averaging one or two tutors. Many schools using SFA have not been able to afford even one tutor. Still, across 28 qualifying studies, mostly by third parties, the Success for All effect size has averaged +0.27 (Cheung et al., in press). This is impressive, but it is not +0.50. For the lowest achievers, the mean effect size was +0.62, but again, our goal is +0.50 for all disadvantaged students, not just the lowest achievers.
Over a period of years, could schools using Success for All with five or more teaching assistant tutors reach the +0.50 goal? I’m certain of it. Could we go even further, perhaps creating a similar approach for secondary schools or adding in an emphasis on mathematics? That would be the next frontier.
The Policy Importance of +0.50
If we can routinely achieve an effect size of +0.50 in reading in most Title I schools, this would pose a real challenge for policy makers. Many policy makers argue that money does not make much difference in education, or that housing, employment, and other basic economic improvements are needed before major improvements in the education of disadvantaged students will be possible. But what if it became widely known that outcomes in high-poverty schools could be reliably and substantially improved at a cost that is modest compared to the benefits? Policy makers would hopefully focus on finding ways to provide the resources needed if they could be confident in the outcomes.
As Amundsen knew, difficult goals can be attained with meticulous planning and high-quality implementation. Every element of his expedition had been tested extensively in real polar conditions, and had been found to be effective and practical. We would propose taking a similar path to universal success in reading. Each component of a practical plan to reach an effect size of +0.50 or more must be proven to be effective in schools serving many disadvantaged students. By combining proven approaches, we can add enough to the reading achievement of disadvantaged students to enable them to perform as well as middle class students do. It just takes an audacious goal and the commitment and resources to accomplish it.
References
Borman, G., & Hewes, G. (2002). Long-term effects and cost effectiveness of Success for All. Educational Evaluation and Policy Analysis, 24 (2), 243-266.
Cheung, A., Xie, C., Zhang, T., & Slavin, R. E. (in press). Success for All: A quantitative synthesis of evaluations. Education Research Review.
Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2019). A synthesis of quantitative research on programs for struggling readers in elementary schools. Available at www.bestevidence.org. Manuscript submitted for publication.
Madden, N. A., Slavin, R. E., Karweit, N. L., Dolan, L., & Wasik, B. (1993). Success for All: Longitudinal effects of a schoolwide elementary restructuring program. American Educational Research Journal, 30, 123-148.
Madden, N. A., & Slavin, R. E. (2017). Evaluations of technology-assisted small-group tutoring for struggling readers. Reading & Writing Quarterly, 1-8. http://dx.doi.org/10.1080/10573569.2016.1255577
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.
What if people could make their own yardsticks, and all of a sudden people who did so gained two inches overnight, while people who used ordinary yardsticks did not change height? What if runners counted off time as they ran (one Mississippi, two Mississippi…), and then it so happened that these runners reduced their time in the 100-yard dash by 20%? What if archers could draw their own targets freehand and those who did got more bullseyes?
All of these examples are silly, you say. Of course people who make their own measures will do better on the measures they themselves create. Even the most honest and sincere people, trying to be fair, may give themselves the benefit of the doubt in such situations.
In educational research, it is frequently the case that researchers or developers make up their own measures of achievement or other outcomes. Numerous reviews of research (e.g., Baye et al., 2019; Cheung & Slavin, 2016; deBoer et al., 2014; Wolf et al., 2019) have found that studies that use measures made by developers or researchers obtain effect sizes that may be two or three times as large as measures independent of the developers or researchers. In fact, some studies (e.g., Wolf et al., 2019; Slavin & Madden, 2011) have compared outcomes on researcher/developer-made measures and independent measures within the same studies. In almost every study with both kinds of measures, the researcher/developer measures show much higher effect sizes.
I think anyone can see that researcher/developer measures tend to overstate effects, and the reasons why they would do so are readily apparent (though I will discuss them in a moment). I and other researchers have been writing about this problem in journals and other outlets for years. Yet journals still accept these measures, most authors of meta-analyses still average them into their findings, and life goes on.
I’ve written about this problem in several blogs in this series. In this one I hope to share observations about the persistence of this practice.
How Do Researchers Justify Use of Researcher/Developer-Made Measures?
Very few researchers in education are dishonest, and I do not believe that researchers set out to hoodwink readers by using measures they made up. Instead, researchers who make up their own measures or use developer-made measures express reasonable-sounding rationales for making their own measures. Some common rationales are discussed below.
Perhaps the most common rationale for using researcher/developer-made measures is that the alternative is to use standardized tests, which are felt to be too insensitive to any experimental treatment. Often researchers will use both a “distal” (i.e., standardized) measure and a “proximal” (i.e., researcher/developer-made) measure. For example, studies of vocabulary-development programs that focus on specific words will often create a test consisting primarily or entirely of these focal words. They may also use a broad-range standardized test of vocabulary. Typically, such studies find positive effects on the words taught in the experimental group, but not on vocabulary in general. However, the students in the control group did not focus on the focal words, so it is unlikely they would improve on them as much as students who spent considerable time with them, regardless of the teaching method. Control students may be making impressive gains on vocabulary, mostly on words other than those emphasized in the experimental group.
Many researchers make up their own tests to reflect their beliefs about how children should learn. For example, a researcher might believe that students should learn algebra in third grade. Because there are no third grade algebra tests, the researcher might make one. If others complain that of course the students taught algebra in third grade will do better on a test of the algebra they learned (but that the control group never saw), the researcher may give excellent reasons why algebra should be taught to third graders, and if the control group didn’t get that content, well, they should have.
Often, researchers say they used their own measures because there were no appropriate tests available focusing on whatever they taught. However, there are many tests of all kinds available, either from specialized publishers or from measures made by other researchers. A researcher who cannot find anything appropriate may be studying content so esoteric that no control group will ever have encountered it.
Sometimes, researchers studying technology applications will give the final test on the computer. This may, of course, give a huge advantage to the experimental group, which may have been using the specific computers and formats emphasized in the test. The control group may have much less experience with computers, or with the particular computer formats used in the experimental group. The researcher might argue that it would not be fair to teach on computers but test on paper. Yet every student knows how to write with a pencil, but not every student has extensive experience with the computers used for the test.
A Potential Solution to the Problem of Researcher/Developer Measures
Researcher/developer-made measures clearly inflate effect sizes considerably. Further, research in education, an applied field, should use measures like those for which schools and teachers are held accountable. No principal or teacher gets to make up his or her own test to use for accountability, and neither should researchers or developers have that privilege.
However, arguments for the use of researcher- and developer-made measures are not entirely foolish, as long as these measures are only used as supplements to independent measures. For example, in a vocabulary study, there may be a reason researchers want to know the effect of a program on the hundred words it emphasizes. This is at least a minimum expectation for such a treatment. If a vocabulary intervention that focused on only 100 words all year did not improve knowledge of those words, that would be an indication of trouble. Similarly, there may be good reasons to try out treatments based on unique theories of action and to test them using measures also aligned with that theory of action.
The problem comes in how such results are reported, and especially how they are treated in meta-analyses or other quantitative syntheses. My suggestions are as follows:
1. Results from researcher/developer-made measures should be reported in articles on the program being evaluated, but not emphasized or averaged with independent measures. Analyses of researcher/developer-made measures may provide useful information, but they are not a fair or meaningful evaluation of program impact. Effect sizes from researcher/developer measures should be treated as evidence of implementation, not as outcomes. The outcomes emphasized should be only those from independent measures.
2. In meta-analyses and other quantitative syntheses, only independent measures should be used in the calculations (a minimal sketch of what this might look like follows this list). Results from researcher/developer measures may be reported in program descriptions, but never averaged in with the independent measures.
3. Studies whose only achievement measures are made by researchers or developers should not be included in quantitative reviews.
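To make the second suggestion concrete, here is a minimal sketch of how a review might pool effect sizes while setting researcher/developer-made measures aside. The studies, field names, and the simple inverse-variance (fixed-effect) weighting are illustrative assumptions on my part, not a prescription for any particular meta-analysis.

```python
# Illustrative sketch: pool only effect sizes from independent measures.
# All study records and numbers below are hypothetical.

studies = [
    {"name": "Study A", "effect_size": 0.45, "se": 0.12, "measure": "developer-made"},
    {"name": "Study B", "effect_size": 0.15, "se": 0.08, "measure": "independent"},
    {"name": "Study C", "effect_size": 0.10, "se": 0.06, "measure": "independent"},
]

# Keep researcher/developer-made measures out of the quantitative synthesis;
# they can still be described in the program narrative.
independent = [s for s in studies if s["measure"] == "independent"]

# Simple fixed-effect, inverse-variance weighted mean (one common pooling choice).
weights = [1 / s["se"] ** 2 for s in independent]
pooled = sum(w * s["effect_size"] for w, s in zip(weights, independent)) / sum(weights)

print(f"Pooled effect size (independent measures only): {pooled:+.2f}")
```

Study A, with its developer-made measure, would still be described in the review, but its inflated effect size would not be averaged into the bottom line.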
Fields in which research plays a central and respected role in policy and practice always pay close attention to the validity and fairness of measures. If educational research is ever to achieve a similar status, it must relegate measures made by researchers or developers to a supporting role, and stop treating such data the same way it treats data from independent, valid measures.
References
Baye, A., Lake, C., Inns, A., & Slavin, R. (2019). Effective reading programs for secondary students. Reading Research Quarterly, 54 (2), 133-166.
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.
de Boer, H., Donker, A.S., & van der Werf, M.P.C. (2014). Effects of the attributes of educational interventions on students’ academic performance: A meta-analysis. Review of Educational Research, 84 (4), 509–545. https://doi.org/10.3102/0034654314540006
Slavin, R.E., & Madden, N.A. (2011). Measures inherent to treatments in program effectiveness reviews. Journal of Research on Educational Effectiveness, 4 (4), 370-380.
Wolf, R., Morrison, J., Inns, A., Slavin, R., & Risman, K. (2019). Differences in average effect sizes in developer-commissioned and independent studies. Manuscript submitted for publication.
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.
Once upon a time, there was a very famous restaurant, called The Hummingbird. It was known the world over for its unique specialty: Hummingbird Stew. It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird? You’d have to catch dozens of them just for one bowl of stew.
One day, an experienced restaurateur came to The Hummingbird and asked to speak to the owner. When they were alone, the visitor said, “You have quite an operation here! But I have been in the restaurant business for many years, and I have always wondered how you do it. No one can make money selling Hummingbird Stew! Tell me how you make it work, and I promise on my honor to keep your secret to my grave. Do you…mix in just a little bit?”
The Hummingbird’s owner looked around to be sure no one was listening. “You look honest,” he said. “I will trust you with my secret. We do mix in a bit of horsemeat.”
“I knew it!” said the visitor. “So tell me, what is the ratio?”
“One to one.”
“Really!” said the visitor. “Even that seems amazingly generous!”
“I think you misunderstand,” said the owner. “I meant one hummingbird to one horse!”
In education, we write a lot of reviews of research. These are often very widely cited, and can be very influential. Because of the work my colleagues and I do, we have occasion to read a lot of reviews. Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically. Well-done reviews can reveal patterns of findings that can be of great value to both researchers and educators. They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.
However, all too many reviews are deeply flawed. Frequently, reviews of research make it impossible to check the validity of the findings of the original studies. As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew. Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies. If the individual studies are of good quality, this is wonderfully useful. But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles. Few readers are willing to do this.
I have recently been looking at a lot of reviews, all of them published, often in top journals. One published review only used pre-post gains. Presumably, if the reviewers found a study with a control group, they would have ignored the control group data! Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control comparisons, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.
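To see how much this matters, here is a worked example with numbers I have made up purely for illustration. Suppose students would ordinarily gain 0.50 standard deviations over a year with no special program, and a new program adds another 0.10:

\[
\text{Pre-post effect size} = \frac{\bar{X}_{\text{post}} - \bar{X}_{\text{pre}}}{SD} = \frac{0.50 + 0.10}{1.0} = +0.60
\]

\[
\text{Experimental-control effect size} = \frac{\bar{X}_{E} - \bar{X}_{C}}{SD} = \frac{0.60 - 0.50}{1.0} = +0.10
\]

The pre-post figure credits the program with the 0.50 students would have gained anyway, making it look six times as effective as it really is.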
I have also recently seen reviews that include studies with and without control groups (i.e., pre-post gains), and studies with and without pretests. Without pretests, experimental and control groups may have started at very different points, and these differences simply carry over to the posttests. A review that accepts this jumble of experimental designs makes no sense. Treatments evaluated using pre-post designs will almost always look far more effective than those evaluated using experimental-control comparisons.
Many published reviews include results from measures that were made up by program developers. We have documented that analyses using such measures produce outcomes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies than from longer ones, and from published studies than from, for example, technical reports.
The biggest problem is that in many reviews, the designs of the individual studies are never described in enough detail to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out how many of the studies behind a review’s conclusions are rigorous and how many are not. Many years ago, I looked into a widely cited review of research on the achievement effects of class size. Study details were lacking, so I had to find and read the original studies. It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis! The studies that reduced class size within the usual range (e.g., comparing reductions from 24 to 12) had very small achievement impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.
It would be great if all reviews excluded studies that used procedures known to inflate effect sizes, but at a bare minimum, reviewers should routinely be required to include tables showing these critical details of each study, and to analyze whether the reported outcomes might be due to studies that used procedures suspected of inflating effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.
References
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.
In the 1984 mockumentary This is Spinal Tap, there is a running joke about a hapless band, Spinal Tap, which proudly bills itself “Britain’s Loudest Band.” A pesky reporter keeps asking the band’s leader, “But how can you prove that you are Britain’s loudest band?” The band leader explains, with declining patience, that while ordinary amplifiers’ sound controls only go up to 10, Spinal Tap’s go up to 11. “But those numbers are arbitrary,” says the reporter. “They don’t mean a thing!” “Don’t you get it?” asks the band leader. “ELEVEN is more than TEN! Anyone can see that!”
In educational research, we have an ongoing debate reminiscent of Spinal Tap. Educational researchers speaking to other researchers invariably express the impact of educational treatments as effect sizes (the difference in adjusted means for the experimental and control groups divided by the unadjusted standard deviation). All else being equal, higher effect sizes are better than lower ones.
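In symbols, that definition is simply:

\[
ES = \frac{\bar{X}_{\text{experimental}} - \bar{X}_{\text{control}}}{SD}
\]

where the means are adjusted (e.g., for pretest differences) and the standard deviation is unadjusted. To take made-up numbers, a treatment group scoring 5 points ahead of its control group on a test with a standard deviation of 25 points would have an effect size of 5/25 = +0.20.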
However, educators who are not trained in statistics often despise effect sizes. “What do they mean?” they ask. “Tell us how much difference the treatment makes in student learning!”
Researchers want to be understood, so they try to translate effect sizes into more educator-friendly equivalents. The problem is that the friendlier the units, the more statistically problematic they are. The friendliest of all is “additional months of learning.” Researchers or educators can look on a chart and, for any particular effect size, they can find the number of “additional months of learning.” The Education Endowment Foundation in England, which funds and reports on rigorous experiments, reports both effect sizes and additional months of learning, and provides tables to help people make the conversion. But here’s the rub. A recent article by Baird & Pane (2019) compared additional months of learning to three other translations of effect sizes. Additional months of learning was rated highest in ease of use, but lowest in four other categories, such as transparency and consistency. For example, a month of learning clearly has a different meaning in kindergarten than it does in tenth grade.
The other translations rated higher by Baird and Pane were, at least to me, just as hard to understand as effect sizes. For example, the What Works Clearinghouse presents, along with effect sizes, an “improvement index” that has the virtue of being equally incomprehensible to researchers and educators alike.
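For readers who are curious, the improvement index is essentially the effect size run through the normal curve: roughly, the expected percentile-point gain for an average control-group student. A tiny sketch of that conversion (my paraphrase, not the What Works Clearinghouse’s own code):

```python
from statistics import NormalDist

def improvement_index(effect_size: float) -> float:
    """Approximate expected percentile-point gain for an average control student."""
    return 100 * (NormalDist().cdf(effect_size) - 0.5)

print(round(improvement_index(0.20), 1))  # an effect size of +0.20 -> about 7.9 percentile points
```

Which, of course, only proves the point: it is no easier to grasp than the effect size it came from.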
On one hand, arguing about outcome metrics is as silly as arguing the relative virtues of Fahrenheit and Celsius. If one can be directly transformed into the other, who cares?
However, additional months of learning is often used to cover up very low effect sizes. I recently ran into an example of this in a series of studies by the Stanford Center for Research on Education Outcomes (CREDO), in which disadvantaged urban African American students gained 59 more “days of learning” than matched students not in charters in math, and 44 more days in reading. These numbers were cited in an editorial praising charter schools in the May 29 Washington Post.
However, these “days of learning” are misleading. The effect size for this same comparison was only +0.08 for math and +0.06 for reading. Any researcher will tell you that these are very small effects. They were only made to look big by reporting the gains in days, which not only magnifies the apparent differences but also obscures how unstable such small effects are. Would it interest you to know that White students in urban charter schools performed 36 days a year worse than matched students in math (ES = -0.05) and 14 days worse in reading (ES = -0.02)? How about Native American students in urban charter schools, whose scores were 70 days worse than matched students in non-charters in math (ES = -0.10), and equal in reading? I wrote about charter school studies in a recent blog. In the blog, I did not argue that charter schools are effective for disadvantaged African Americans but harmful for Whites and Native Americans. That seems unlikely. What I did argue is that the effects of charter schools are so small that the directions of the effects are unstable. The overall effects across all urban schools studied were only 40 days (ES = +0.055) in math and 28 days (ES = +0.04) in reading. These effects look big because of the “days of learning” transformation, but they are not.
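To see how the magnification works, note the conversion implied by the numbers above: 40 days corresponds to an effect size of about +0.055 and 28 days to +0.04, which works out to roughly 700–730 “days of learning” per full standard deviation. Here is a small sketch using that back-of-the-envelope factor (my inference from the figures cited, not CREDO’s published formula):

```python
# Hypothetical illustration of how "days of learning" magnify small effect sizes.
# The factor below is inferred from the figures cited above (e.g., 40 days / 0.055);
# it is not taken from CREDO's own documentation.
DAYS_PER_SD = 720  # approximate "days of learning" per standard deviation

def days_of_learning(effect_size: float) -> float:
    return effect_size * DAYS_PER_SD

for es in (0.08, 0.06, 0.055, 0.04, -0.05, -0.10):
    print(f"ES {es:+.3f}  ->  about {days_of_learning(es):+.0f} days of learning")
```

Multiply a negligible effect size by a factor of several hundred and, sure enough, you get a number that sounds like it matters.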
In This is Spinal Tap, the argument about whether or not Spinal Tap is Britain’s loudest band is absurd. Any band can turn its amplifiers to the top and blow out everyone’s eardrums, whether the top is marked eleven or ten. In education, however, it does matter a great deal that educators are taking evidence into account in their decisions about educational programs. Using effect sizes, perhaps supplemented by additional months of learning, is one way to help readers understand outcomes of educational experiments. Using “days of learning,” however, is misleading, making very small impacts look important. Why not additional hours or minutes of learning, while we’re at it? Spinal Tap would be proud.
References
Baird, M., & Pane, J. (2019). Translating standardized effects of education programs into more interpretable metrics. Educational Researcher. Advance online publication. https://doi.org/10.3102/0013189X19848729
CREDO (2015). Overview of the Urban Charter School Study. Stanford, CA: Author.
Denying poor children a chance [Editorial]. (2019, May 29). The Washington Post, p. A16.
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.