Meta-Analysis and Its Discontents

Everyone loves meta-analyses. We did an analysis of the most frequently opened articles on Best Evidence in Brief. Almost all of the most popular were meta-analyses. What’s so great about meta-analyses is that they condense a lot of evidence and synthesize it, so instead of just one study that might be atypical or incorrect, a meta-analysis seems authoritative, because it averages many individual studies to find the true effect of a given treatment or variable.

Meta-analyses can be wonderful summaries of useful information. But today I wanted to discuss how they can be misleading. Very misleading.

The problem is that there are no norms among journal editors or meta-analysts themselves about standards for including studies or, perhaps most importantly, how much or what kind of information needs to be reported about each individual study in a meta-analysis. Some meta-analyses are completely statistical. They report all sorts of statistics and very detailed information on exactly how the search for articles took place, but never say anything about even a single study. This is a problem for many reasons. Readers may have no real understanding of what the studies really say. Even if citations for the included studies are available, only a very motivated reader is going to go find any of them. Most meta-analyses do have a table listing studies, but the information in the table may be idiosyncratic or limited.

One reason all of this matters is that without clear information on each study, readers can be easily misled. I remember encountering this when meta-analysis first became popular in the 1980s. Gene Glass, who coined the very term, proposed some foundational procedures, and popularized the methods. Early on, he applied meta-analysis to determine the effects of class size, which by then had been studied several times and found to matter very little except in first grade. Reducing “class size” to one (i.e., one-to-one tutoring) also was known to make a big difference, but few people would include one-to-one tutoring in a review of class size. But Glass and Smith (1978) found a much higher effect, not limited to first grade or tutoring. It was a big deal at the time.

I wanted to understand why. I bought and read Glass' book on class size, but it was nearly impossible to tell what had happened. But then I found, in an obscure appendix, a distribution of effect sizes. Most studies had effect sizes near zero, as I expected. But one had a huge effect size of +1.25! It was hard to tell which particular study accounted for this amazing effect, but I searched by process of elimination and finally found it.

It was a study of tennis.

[Photo: tennis]

The outcome measure was the ability to “rally a ball against a wall so many times in 30 seconds.” Not surprisingly, when there were “large class sizes,” most students got very few chances to practice, while in “small class sizes,” they did.

If you removed the clearly irrelevant tennis study, the average effect size for class sizes (other than tutoring) dropped to near zero, as reported in all other reviews (Slavin, 1989).
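To see how much leverage a single outlier has in a simple average, here is a small sketch (the numbers are invented for illustration, not Glass and Smith's actual data):

```python
# Hypothetical illustration (invented numbers, not Glass and Smith's data):
# one extreme study can dominate an unweighted mean effect size.
class_size_effects = [0.02, -0.05, 0.10, 0.00, 0.04, -0.03, 0.05, 0.01]
tennis_effect = 1.25  # the outlier

with_outlier = sum(class_size_effects + [tennis_effect]) / (len(class_size_effects) + 1)
without_outlier = sum(class_size_effects) / len(class_size_effects)

print(f"Mean ES with the tennis study:    {with_outlier:+.2f}")
print(f"Mean ES without the tennis study: {without_outlier:+.2f}")
```

One aberrant study out of nine moves the average from essentially zero to what looks like a meaningful effect, and nothing in the summary statistics alone would reveal why.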

The problem went way beyond class size, of course. What was important, to me at least, was that Glass’ presentation of the data made it very difficult to find out what was really going on. He had attractive and compelling graphs and charts showing effects of class size, but they all depended on the one tennis study, and there was no easy way to find out.

Because of this review and several others appearing in the 1980s, I wrote an article criticizing numbers-only meta-analyses and arguing that reviewers should show all of the relevant information about the studies in their meta-analyses, and should even describe each study briefly to help readers understand what was happening. I made up a name for this, “best-evidence synthesis” (Slavin, 1986).

Neither the term nor the concept really took hold, I’m sad to say. You still see meta-analyses all the time that do not tell readers enough for them to know what’s really going on. Yet several developments have made the argument for something like best-evidence synthesis a lot more compelling.

One development is the increasing evidence that methodological features can be strongly correlated with effect sizes (Cheung & Slavin, 2016). The evidence is now overwhelming that effect sizes are greatly inflated when sample sizes are small, when study durations are brief, when measures are made by developers or researchers, or when quasi-experiments rather than randomized experiments are used, for example. Many meta-analyses check for the effects of these and other study characteristics, and may make adjustments if there are significant differences. But this is not sufficient, because in a particular meta-analysis, there may not be enough studies to make any study-level factors significant. For example, if Glass had tested “tennis vs. non-tennis,” there would have been no significant difference, because there was only one tennis study. Yet that one study dominated the means anyway. Eliminating studies using, for example, researcher/developer-made measures or very small sample sizes or very brief durations is one way to remove bias from meta-analyses, and this is what we do in our reviews. But at bare minimum, it is important to have enough information available in tables to enable readers or journal reviewers to look for such biasing factors so they can recompute or at least understand the main effects if they are so inclined.
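A screen of this kind is easy to express concretely. The sketch below is hypothetical (the study records and thresholds are invented, not our actual criteria), but it shows the idea of excluding studies with biasing features before computing a mean:

```python
# Hypothetical inclusion screen applied before averaging effect sizes.
# The study records and thresholds below are illustrative only.
studies = [
    {"es": 0.45, "n": 40,  "weeks": 8,  "measure": "developer",   "design": "quasi"},
    {"es": 0.12, "n": 800, "weeks": 30, "measure": "independent", "design": "randomized"},
    {"es": 0.18, "n": 350, "weeks": 24, "measure": "independent", "design": "randomized"},
    {"es": 0.90, "n": 25,  "weeks": 4,  "measure": "developer",   "design": "quasi"},
]

def eligible(study):
    # Exclude tiny samples, brief durations, and developer/researcher-made measures.
    return (study["n"] >= 60
            and study["weeks"] >= 12
            and study["measure"] == "independent")

included = [s for s in studies if eligible(s)]
mean_es = sum(s["es"] for s in included) / len(included)
print(f"{len(included)} of {len(studies)} studies included; mean ES = {mean_es:+.2f}")
```

Note how the two small, brief, developer-measured studies carry the largest effect sizes; screening them out changes the average far more than a moderator test ever could.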

The second development that makes it important to require more information on individual studies in meta-analyses is the increased popularity of meta-meta-analyses, where the average effect sizes from whole meta-analyses are averaged. These have even more potential for trouble than the worst statistics-only reviews, because it is extremely unlikely that many readers will follow the citations to each included meta-analysis and then follow those citations to look for individual studies. It would be awfully helpful if readers or reviewers could trust the individual meta-analyses (and therefore their averages), or at least see for themselves.

As evidence takes on greater importance, this would be a good time to discuss reasonable standards for meta-analyses. Otherwise, we’ll be rallying balls uselessly against walls forever.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.

Glass, G., & Smith, M. L. (1978). Meta-analysis of research on the relationship of class size and achievement. San Francisco: Far West Laboratory for Educational Research and Development.

Slavin, R. E. (1986). Best-evidence synthesis: An alternative to meta-analytic and traditional reviews. Educational Researcher, 15(9), 5-11.

Slavin, R. E. (1989). Class size and student achievement: Small effects of small classes. Educational Psychologist, 24, 99-110.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


When Developers Commission Studies, What Develops?

I have the greatest respect for commercial developers and disseminators of educational programs, software, and professional development. As individuals, I think they genuinely want to improve the practice of education, and help produce better outcomes for children. However, most developers are for-profit companies, and they have shareholders who are focused on the bottom line. So when developers carry out evaluations, or commission evaluation companies to do so on their behalf, perhaps it’s best to keep in mind a bit of dialogue from a Marx Brothers movie. Someone asks Groucho if Chico is honest. “Sure,” says Groucho, “As long as you watch him!”

[Photo: the Marx Brothers]

A healthy role for developers in evidence-based reform in education is desirable. Publishers, software developers, and other commercial companies have a lot of capital, and a strong motivation to create new products with evidence of effectiveness that will stand up to scrutiny. In medicine, most advances in practical drugs and treatments are made by drug companies. If you’re a cynic, this may sound disturbing. But the federal government has long encouraged drug companies to develop and evaluate new drugs, and it has strict rules about what counts as conclusive evidence. Basically, the government says, following Groucho, “Are drug companies honest? Sure, as long as you watch ‘em.”

In our field, we may want to think about how to do this. As one contribution, my colleague Betsy Wolf did some interesting research on outcomes of studies sponsored by developers, compared to those conducted by independent third parties. She looked at all reading/literacy and math studies listed on the What Works Clearinghouse database. The first thing she found was very disturbing. Sure enough, the effect sizes for the developer-commissioned studies (ES = +0.27, n=73) were twice as large as those for independent studies (ES = +0.13, n=96). That’s a huge difference.

Being a curious person, Betsy wanted to know why developer-commissioned studies had effect sizes that were so much larger than independent ones. We now know a lot about study characteristics that inflate effect sizes. The most inflationary are small sample sizes, use of measures made by researchers or developers (rather than independent measures), and use of quasi-experiments instead of randomized designs. Developer-commissioned studies were in fact much more likely to use researcher/developer-made measures (29%, vs. 8% in independent studies) and quasi-experimental designs (51%, vs. 15% in independent studies). However, sample sizes were similar in developer-commissioned and independent studies. And most surprising, statistically controlling for all of these factors did not reduce the developer effect by very much.

If there is so much inflation of effect sizes in developer-commissioned studies, then how come controlling for the largest factors that usually cause effect size inflation does not explain the developer effect?

There is a possible reason for this, which Betsy cautiously advances (since it cannot be proven). Perhaps the reason that effect sizes are inflated in developer-commissioned studies is not due to the nature of the studies we can find, but to the studies we cannot find. There has long been recognition of what is called the “file drawer effect,” which happens when studies that do not obtain a positive outcome disappear (into a file drawer). Perhaps developers are especially likely to hide disappointing findings. Unlike academic studies, which are likely to exist as technical reports or dissertations, perhaps commercial companies have no incentive to make null findings findable in any form.
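Betsy's file-drawer conjecture can be illustrated with a toy simulation (all numbers here are invented): even when the true effect is zero, if only the encouraging results are released, the visible literature looks solidly positive.

```python
import random
import statistics

# Toy simulation of the file-drawer effect. The true effect is zero,
# but suppose only studies with an effect size above +0.15 are released;
# the rest go into the file drawer. All parameters are illustrative.
random.seed(1)
all_studies = [random.gauss(0.0, 0.20) for _ in range(200)]  # 200 hypothetical studies
published = [es for es in all_studies if es > 0.15]          # only "good news" released

print(f"Mean ES, all studies:       {statistics.mean(all_studies):+.2f}")
print(f"Mean ES, published studies: {statistics.mean(published):+.2f}")
```

The published mean is positive by construction, even though the full set of studies averages out to nothing, and no amount of statistical adjustment of the visible studies can recover the hidden ones.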

This may not be true, or it may be true of some but not other developers. But if government is going to start taking evidence a lot more seriously, as it has done with the ESSA evidence standards (see www.evidenceforessa.org), it is important to prevent developers, or any researchers, from hiding their null findings.

There is a solution to this problem that is heading rapidly in our direction. This is pre-registration. In pre-registration, researchers or evaluators must file the study design, measures, and analyses to be used before a study begins, but perhaps most importantly, pre-registration announces that a study exists, or will exist soon. If a developer pre-registered a study but that study never showed up in the literature, this might be a cause for losing faith in the developer. Imagine that the What Works Clearinghouse, Evidence for ESSA, and journals refused to accept research reports on programs unless the study had been pre-registered, and unless all other studies of the program were made available.

Some areas of medicine use pre-registration, and the Society for Research on Educational Effectiveness is moving toward introducing a pre-registration process for education. Use of pre-registration and other safeguards could be a boon to commercial developers, as it is to drug companies, because it could build public confidence in developer-sponsored research. Admittedly, it would take many years and/or a lot more investment in educational research to make this practical, but there are concrete steps we could take in that direction.

I’m not sure I see any reason we shouldn’t move toward pre-registration. It would be good for Groucho, good for Chico, and good for kids. And that’s good enough for me!

Photo credit: By Paramount Pictures (source) [Public domain], via Wikimedia Commons


Effect Sizes and the 10-Foot Man

If you ever go into the Ripley’s Believe It or Not Museum in Baltimore, you will be greeted at the entrance by a statue of the tallest man who ever lived, Robert Pershing Wadlow, a gentle giant at 8 feet, 11 inches in his stocking feet. Kids and adults love to get their pictures taken standing by him, to provide a bit of perspective.

[Photo: Robert Pershing Wadlow]

I bring up Mr. Wadlow to explain a phrase I use whenever my colleagues come up with an effect size of more than 1.00. “That’s a 10-foot man,” I say. What I mean, of course, is that while it is not impossible that there could be a 10-foot man someday, it is extremely unlikely, because there has never been a man that tall in all of history. If someone reports seeing one, they are probably mistaken.

In the case of effect sizes you will never, or almost never, see an effect size of more than +1.00, assuming the following reasonable conditions:

  1. The effect size compares experimental and control groups (i.e., it is not pre-post).
  2. The experimental and control group started at the same level, or they started at similar levels and researchers statistically controlled for pretest differences.
  3. The measures involved were independent of the researcher and the treatment, not made by the developers or researchers. The test was not given by the teachers to their own students.
  4. The treatment was provided by ordinary teachers, not by researchers, and could in principle be replicated widely in ordinary schools. The experiment had a duration of at least 12 weeks.
  5. There were at least 30 students and 2 teachers in each treatment group (experimental and control).

If these conditions are met, the chances of finding effect sizes of more than +1.00 are about the same as the chances of finding a 10-foot man. That is, zero.
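For readers who want the arithmetic behind these numbers: the effect sizes discussed here are standardized mean differences, the experimental-control difference at posttest divided by a pooled standard deviation. A minimal computation, using invented scores:

```python
import statistics

# Minimal standardized mean difference (effect size) computation.
# The test scores below are invented for illustration.
experimental = [82, 75, 90, 68, 77, 85, 71, 80]
control      = [74, 70, 83, 65, 72, 78, 69, 73]

def effect_size(treated, comparison):
    # Difference in means divided by the pooled standard deviation.
    n1, n2 = len(treated), len(comparison)
    s1, s2 = statistics.stdev(treated), statistics.stdev(comparison)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treated) - statistics.mean(comparison)) / pooled_sd

print(f"ES = {effect_size(experimental, control):+.2f}")
```

These invented scores give an ES of about +0.85, near the top of what even well-conducted one-to-one tutoring studies have achieved; getting past +1.00 requires an experimental-control gap larger than a full standard deviation of the outcome measure.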

I was thinking about the 10-foot man when I was recently asked by a reporter about the “two sigma effect” claimed by Benjamin Bloom and much discussed in the 1980s. Bloom’s students did a series of experiments in which students were taught about a topic none of them knew anything about, usually principles of sailing. After a short period, students were tested. Those who did not achieve at least 80% (defined as “mastery”) on the tests were tutored by University of Chicago graduate students long enough to ensure that every tutored student reached mastery. The purpose of this demonstration was to make a claim that every student could learn whatever we wanted to teach them, and the only variable was instructional time, as some students need more time to learn than others. In a system in which enough time could be given to all, “ability” would disappear as a factor in outcomes. Also, in comparison to control groups who were not taught about sailing at all, the effect size was often more than 2.0, or two sigma. That’s why this principle was called the “two sigma effect.” Doesn’t the two sigma effect violate my 10-foot man principle?

No, it does not. The two sigma studies used experimenter-made tests of content taught to the experimental but not the control groups. They used University of Chicago graduate students providing far more tutoring (as a percentage of initial instruction) than any school could ever provide. The studies were very brief and sample sizes were small. The two sigma experiments were designed to prove a point, not to evaluate a feasible educational method.

A more recent example of the 10-foot man principle is found in Visible Learning, the currently fashionable book by John Hattie claiming huge effect sizes for all sorts of educational treatments. Hattie asks the reader to ignore any educational treatment with an effect size of less than +0.40, and reports many whole categories of teaching methods with average effect sizes of more than +1.00. How can this be?

The answer is that such effect sizes, like two sigma, do not incorporate the conditions I laid out. Instead, Hattie throws into his reviews entire meta-analyses which may include pre-post studies, studies using researcher-made measures, studies with tiny samples, and so on. For practicing educators, such effect sizes are useless. An educator knows that all children grow from pre- to posttest. They would not (and should not) accept measures made by researchers. The largest known effect sizes that do meet the above conditions come from one-to-one tutoring studies, with effect sizes up to +0.86. Still not +1.00. What could be more effective than the best one-to-one tutoring?

It’s fun to visit Mr. Wadlow at the museum, and to imagine what an even taller man could do on a basketball team, for example. But if you see a 10-foot man at Ripley’s Believe It or Not, or anywhere else, here’s my suggestion. Don’t believe it. And if you visit a museum of famous effect sizes that displays a whopper effect size of +1.00, don’t believe that, either. It doesn’t matter how big effect sizes are if they are not valid.

A 10-foot man would be a curiosity. An effect size of +1.00 is a distraction. Our work on evidence is too important to spend our time looking for 10-foot men, or effect sizes of +1.00, that don’t exist.

Photo credit: [Public domain], via Wikimedia Commons


The Mill and The School

On a recent trip to Scotland, I visited some very interesting oat mills. I always love to visit medieval mills, because I find it endlessly fascinating how people long ago used natural forces and materials – wind, water, and fire, stone, wood, and metal – to create advanced mechanisms that had a profound impact on society.

In Scotland, it’s all about oat mills (almost everywhere else, it’s wheat). These grain mills date back to the 10th century. In their time, they were a giant leap in technology. A mill is very complicated, but at its heart are two big innovations. In the center of the mill, a heavy millstone turns on top of another. The grain is poured through a hole in the top stone for grinding. The miller’s most difficult task is to maintain an exact distance between the stones. A few millimeters too far apart and no milling happens. A few millimeters too close and the heat of friction can ruin the machinery, possibly causing a fire.

The other key technology is the water wheel (except in windmills, of course). The water mill is part of a system that involves a carefully controlled flow of water from a millpond, which the miller uses to provide exactly the right amount of water to turn a giant wooden wheel, which powers the top millstone.

[Illustration: The Maid of the Mill]

The medieval grain mill is not a single innovation, but a closely integrated system of innovations. Millers learned to manage this complex technology in a system of apprenticeship over many years.

Mills enabled medieval millers to obtain far more nutrition from an acre of grain than was possible before. This made it possible for land to support many more people, and the population surged. The whole feudal system was built around the economics of mills, and mills thrived through the 19th century.

What does the mill have to do with the school? Mills only grind well-behaved grain into well-behaved flour, while schools work with far more complex children, families, and all the systems that surround them. The products of schools must include joy and discovery, knowledge and skills.

Yet as different as they are, mills have something to teach us. They show the importance of integrating diverse systems that can then efficiently deliver desired outcomes. Neither a mill nor an effective school comes into existence because someone in power tells it to. Instead, complex systems, mills or schools, must be created, tested, adapted to local needs, and constantly improved. Once we know how to create, manage, and disseminate effective mills or schools, policies can be readily devised to support their expansion and improvement.

Important progress in societies and economies almost always comes about from development of complex, multi-component innovations that, once developed, can be disseminated and continuously improved. The same is true of schools. Changes in governance or large-scale policies can enhance (or inhibit) the possibility of change, but the reality of reform depends on creation of complex, integrated systems, from mills to ships to combines to hospitals to schools.

For education, what this means is that system transformation will come only when we have whole-school improvement approaches that are known to greatly increase student outcomes. Whole-school change is necessary because many individual improvements are needed to make big changes, and these must be carefully aligned with each other. Just as the huge water wheel and the tiny millstone adjustment mechanism and other components must work together in the mill, the key parts of a school must work together in synchrony to produce maximum impact, or the whole system fails to work as well as it should.

For example, if you look at research on proven programs, you’ll find effective strategies for school management, for teaching, and for tutoring struggling readers. These are all well and good, but they work so much better if they are linked to each other.

To understand this, first consider tutoring. Especially in the elementary grades, there is no more effective strategy. Our recent review of research on programs for struggling readers finds that well-qualified teaching assistants can be as effective as teachers in tutoring struggling readers, and that while one-to-four tutoring is less effective than one-to-one, it is still a lot more effective than no tutoring. So an evidence-oriented educator might logically choose to implement proven one-to-one and/or one-to-small group tutoring programs to improve school outcomes.

However, tutoring only helps the students who receive it, and it is expensive. A wise school administrator might reason that tutoring alone is not sufficient, but improving the quality of classroom instruction is also essential, both to improve outcomes for students who do not need tutoring and to reduce the number of students who do need tutoring. There is an array of proven classroom methods the principal or district might choose to improve student outcomes in all subjects and grade levels (see www.evidenceforessa.org).

But now consider students who are at risk because they are not attending regularly, or have behavior problems, or need eyeglasses but do not have them. Flexible school-level systems are necessary to ensure that students are in school, eager to learn, well-behaved, and physically prepared to succeed.

In addition, there is a need to have school principals and other leaders learn strategies for making effective use of proven programs. These would include managing professional development, coaching, monitoring implementation and outcomes of proven programs, distributed leadership, and much more. Leadership also requires jointly setting school goals with all school staff and monitoring progress toward these goals.

These are all components of the education “mill” that have to be designed, tested, and (if effective) disseminated to ever-increasing numbers of schools. Like the mill, an effective school design integrates individual parts, makes them work in synchrony, constantly assesses their functioning and output, and adjusts procedures when necessary.

Many educational theorists argue that education will only change when systems change. Ferocious battles rage about charters vs. ordinary public schools, about adopting policies of countries that do well on international tests, and so on. These policies can be important, but they are unlikely to create substantial and lasting improvement unless they lead to development and dissemination of proven whole-school approaches.

Effective school improvement is not likely to come about from let-a-thousand-flowers-bloom local innovation, nor from top-level changes in policy or governance. Sufficient change will not come about by throwing individual small innovations into schools and hoping they will collectively make a difference. Instead, effective improvement will take root when we learn how to reliably create effective programs for schools, implement them in a coordinated and planful way, find them effective, and then disseminate them. Once such schools are widespread, we can build larger policies and systems around their needs.

Coordinated, schoolwide improvement approaches offer schools proven strategies for increasing the achievement and success of their children. There should be many programs of this kind, among which schools and districts can choose. A school is not the same as a mill, but the mill provides at least one image of how creating complex, integrated, replicable systems can change whole societies and economies. We should learn from this and many other examples of how to focus our efforts to improve outcomes for all children.

Photo credit: By Johnson, Helen Kendrik [Public domain], via Wikimedia Commons


More Chinese Dragons: How the WWC Could Accelerate Its Pace

[Photo: Chinese dragon]

A few months ago, I wrote a blog entitled “The Mystery of the Chinese Dragon: Why Isn’t the WWC Up to Date?” It really had nothing to do with dragons, but compared the timeliness of the What Works Clearinghouse review of research on secondary reading programs and a Baye et al. (2017) review on the same topic. The graph depicting the difference looked a bit like a Chinese dragon with a long tail near the ground and huge jaws. The horizontal axis was the dates accepted studies had appeared, and the vertical axis was the number of studies. Here is the secondary reading graph.

[Figure: accepted secondary reading studies by publication year, WWC vs. Baye et al. (2017)]

What the graph showed is that the WWC and the U.S. studies from the Baye et al. (2017) review were similar in coverage of studies appearing from 1987 to 2009, but after that diverged sharply, because the WWC is very slow to add new studies, in comparison to reviews using similar methods.

In the time since the Chinese Dragon for secondary reading studies appeared on my blog, my colleagues and I have completed two more reviews, one on programs for struggling readers by Inns et al. (2018) and one on programs for elementary math by Pellegrini et al. (2018). We made new Chinese Dragon graphs for each, which appear below.*

[Figures: accepted studies by publication year, WWC vs. Inns et al. (2018) and WWC vs. Pellegrini et al. (2018)]

*Note: In the reading graph, the line for “Inns et al.” added numbers of studies from the Inns et al. (2018) review of programs for struggling readers to additional studies of programs for all elementary students in an unfinished report.

The new dragons look remarkably like the first. Again, what matters is the similar pattern of accepted studies before 2009 (the “tail”) and the sharply diverging rates in more recent years (the “jaws”).

There are two phenomena that cause the dragons’ “jaws” to be so wide open. The upper jaw, especially in secondary reading and elementary math, indicates that many high-quality, rigorous evaluations are appearing in recent years. Both the WWC inclusion standards and those of the Best Evidence Encyclopedia (BEE; www.bestevidence.org) require control groups, clustered analysis for clustered designs, samples that are well-matched at pretest and have similar attrition by posttest, and other features indicating methodological rigor, of the kind expected by the ESSA evidence standards, for example.

The upper jaw of each dragon is increasing so rapidly because rigorous research is increasing rapidly in the U.S. (it is also increasing rapidly in the U.K., but the WWC does not include non-U.S. studies, and non-U.S. studies are removed from the graph for comparability). This increase is due to U.S. Department of Education funding of many rigorous studies in each topic area, through its Institute of Education Sciences (IES) and Investing in Innovation (i3) programs, and special purpose funding such as Striving Readers and Preschool Curriculum Evaluation Research. These recent studies are not only uniformly rigorous, they are also of great importance to educators, as they evaluate current programs being actively disseminated today. Many of the older programs whose evaluations appear on the dragons’ tails no longer exist, as a practical matter. If educators wanted to adopt them, the programs would have to be revised or reinvented. For example, Daisy Quest, still in the WWC, was evaluated on TRS-80 computers not manufactured since the 1980s. Yet exciting new programs with rigorous evaluations, highlighted in the BEE reviews, do not appear at all in the WWC.

I do not understand why the WWC is so slow to add new evaluations, but I suspect that the answer lies in the painstaking procedures any government has to follow to do, well, anything. Perhaps there are very good reasons for this stately pace of progress. However, the result is clear. The graph below shows the publication dates of every study in every subject and grade level accepted by the WWC and entered on its database. This “half-dragon” graph shows that only 26 studies published or made available after 2013 appear on the entire WWC database. Of these, only two have appeared after 2015.

[Figure: publication dates of all studies accepted in the WWC database]

The slow pace of the WWC is of particular concern in light of the appearance of the ESSA evidence standards. More educators than ever before must be consulting the WWC, and many must be wondering why programs they know to exist are not listed there, or why recent studies do not appear.

Assuming that there are good reasons for the slow pace of the WWC, or that for whatever reason the pace cannot be greatly accelerated, what can be done to bring the WWC up to date? I have a suggestion.

Imagine that the WWC commissioned someone to do rapid updating of all topics reviewed on the WWC website. The reviews would follow WWC guidelines, but would appear very soon after studies were published or issued. It’s clear that this is possible, because we do it for Evidence for ESSA (www.evidenceforessa.org). Also, the WWC has a number of “quick reviews,” “single study reports,” and so on, scattered around on its site, but not integrated with its main “Find What Works” reviews of various programs. These could be readily integrated with “Find What Works.”

The recent studies identified in this accelerated process might be flagged as “provisionally reviewed,” much as the U.S. Patent Office has “patent pending” before inventions are fully patented. Users would have an option to look only at program reports containing fully reviewed studies, or could decide to look at reviews containing both fully and provisionally reviewed studies. If a more time-consuming full review of a study found results different from those of the provisional review, the study report and the program report in which it was contained would be revised, of course.

A process of this kind could bring the WWC up to date and keep it up to date, providing useful, actionable evidence in a timely fashion, while maintaining the current slower process, if there is a rationale for it.

The Chinese dragons we are finding in every subject we have examined indicate the rapid growth and improving quality of evidence on programs for schools and students. The U. S. Department of Education and our whole field should be proud of this, and should make it a beacon on a hill, not hide our light under a bushel. The WWC has the capacity and the responsibility to highlight current, high-quality studies as soon as they appear. When this happens, the Chinese dragons will retire to their caves, and all of us, government, researchers, educators, and students, will benefit.

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (2017). Effective reading programs for secondary students. Manuscript submitted for publication. Also see Baye, A., Lake, C., Inns, A. & Slavin, R. E. (2017, August). Effective reading programs for secondary students. Baltimore, MD: Johns Hopkins University, Center for Research and Reform in Education.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Photo credit: J Bar [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Effect Sizes: How Big is Big?

An effect size is a measure of how much an experimental group exceeds a control group, controlling for pretests. As every quantitative researcher knows, the formula is (XT – XC)/SD, or adjusted treatment mean minus adjusted control mean divided by the unadjusted standard deviation. If this is all gobbledygook to you, I apologize, but sometimes us research types just have to let our inner nerd run free.
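To make the arithmetic concrete, here is a minimal sketch of the calculation in Python. The function names and the numbers are invented for illustration; in a real study the means would be adjusted for pretests first.

```python
def effect_size(treatment_mean, control_mean, sd):
    """Effect size: adjusted treatment mean minus adjusted control mean,
    divided by the unadjusted standard deviation."""
    return (treatment_mean - control_mean) / sd

# Illustrative posttest scores: treatment group averages 52, control 50,
# with a standard deviation of 10 points.
d = effect_size(52.0, 50.0, 10.0)
print(d)  # 0.2
```

So a two-point advantage on a test with a ten-point standard deviation yields an effect size of +0.20.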

Effect sizes have come to be accepted as a standard indicator of the impact an experimental treatment had on a posttest. As research becomes more important in policy and practice, understanding them is becoming increasingly important.

One constant question is how important a given effect size is. How big is big? Many researchers still use Cohen’s rule of thumb that +0.20 is “small,” +0.50 is “moderate,” and +0.80 or more is “large.” Yet Cohen himself disavowed these standards long ago.

High-quality experimental-control comparison research in schools rarely gets effect sizes as large as +0.20, and only one-to-one tutoring studies routinely get to +0.50. So Cohen’s rule of thumb was demanding effect sizes for rigorous school research far larger than those typically reported in practice.

An article by Hill, Bloom, Black, and Lipsey (2008) considered several ways to determine the importance of effect sizes. They noted that students learn more each year (in effect sizes) in the early elementary grades than do high school students. They suggested that therefore a given effect size for an experimental treatment may be more important in secondary school than the same effect size would be in elementary school. However, in four additional tables in the same article, they show that actual effect sizes from randomized studies are relatively consistent across the grades. They also found that effect sizes vary greatly depending on methodology and the nature of measures. They end up concluding that it is most reasonable to determine the importance of an effect size by comparing it to effect sizes in other studies with similar measures and designs.

A study done by Alan Cheung and myself (2016) reinforces the importance of methodology in determining what is an important effect size. We analyzed all findings from 645 high-quality studies included in all reviews in our Best Evidence Encyclopedia (www.bestevidence.org). We found that the most important factors in effect sizes were sample size and design (randomized vs. matched). Here is the key table.

Effects of Sample Size and Design on Effect Sizes

                  Sample Size
Design        Small     Large
Matched       +0.33     +0.17
Randomized    +0.23     +0.12

What this table shows is that matched studies with small sample sizes (fewer than 250 students) have much higher effect sizes, on average, than, say, large randomized studies (+0.33 vs. +0.12). These differences say nothing about the impact on children; they are due entirely to differences in study design.

If effect sizes differ so much due to study design, then we cannot have a single standard to tell us when an effect size is large or small. All we can do is note when an effect size is large compared to those of similar studies. For example, imagine that a study finds an effect size of +0.20. Is that big or small? If it were a matched study with a small sample size, +0.20 would be a rather small impact. If it were a randomized study with a large sample size, it might be considered quite a large impact.
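This kind of like-with-like comparison can be sketched as a simple lookup against the average effect sizes in the table above (from Cheung & Slavin, 2016). The function and its labels are illustrative, not a published formula:

```python
# Average effect sizes by design and sample size, taken from the table
# above ("small" = fewer than 250 students).
BENCHMARKS = {
    ("matched", "small"): 0.33,
    ("matched", "large"): 0.17,
    ("randomized", "small"): 0.23,
    ("randomized", "large"): 0.12,
}

def relative_size(effect, design, sample):
    """Judge an observed effect size against the average for studies
    with the same design and sample size."""
    benchmark = BENCHMARKS[(design, sample)]
    if effect > benchmark:
        return "above average"
    if effect < benchmark:
        return "below average"
    return "about average"

print(relative_size(0.20, "matched", "small"))     # below average
print(relative_size(0.20, "randomized", "large"))  # above average
```

The same +0.20 lands on opposite sides of the benchmark depending on the study’s design, which is exactly the point of the example above.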

Beyond study methods, a good general principle is to compare like with like. Some treatments may have very small effect sizes but be so inexpensive, or affect so many students, that even a small effect is important. For example, principal or superintendent training may affect very many students, and benchmark assessments may be so inexpensive that a small effect size is worthwhile, comparing favorably with equally inexpensive means of solving the same problem.

My colleagues and I will be developing a formula to enable researchers and readers to easily put in features of a study to produce an “expected effect size” to determine more accurately whether an effect size should be considered large or small.

Not long ago, it would not have mattered much how large effect sizes were considered, but now it does. That’s an indication of the progress we have made in recent years. Big indeed!

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

What if a Sears Catalogue Married Consumer Reports?

When I was in high school, I had a summer job delivering Sears catalogues. I borrowed my mother’s old Chevy station wagon and headed out fully laden into the wilds of the Maryland suburbs of Washington.

I immediately learned something surprising. I thought of a Sears catalogue as a big book of advertisements. But the people to whom I was delivering them often saw it as a book of dreams. They were excited to get their catalogues. When a neighborhood saw me coming, I became a minor celebrity.

Thinking back on those days, I was thinking about our Evidence for ESSA website (www.evidenceforessa.org). I realized that what I wanted it to be was a way to communicate to educators the wonderful array of programs they could use to improve outcomes for their children. Sort of like a Sears catalogue for education. However, it provides something that a Sears catalogue does not: Evidence about the effectiveness of each catalogue entry. Imagine a Sears catalogue that was married to Consumer Reports. Where a traditional Sears catalogue describes a kitchen gadget, “It slices and dices, with no muss, no fuss!”, the marriage with Consumer Reports would instead say, “Effective at slicing and dicing, but lots of muss. Also fuss.”

If this marriage took place, it might take some of the fun out of the Sears catalogue (making it a book of realities rather than a book of dreams), but it would give confidence to buyers, and help them make wise choices. And with proper wordsmithing, it could still communicate both enthusiasm, when warranted, and truth. But even more, it could have a huge impact on the producers of consumer goods, because they would know that their products would need to be rigorously tested and found to be able to back up their claims.

In enhancing the impact of research on the practice of education, we have two problems to solve. Just like the “Book of Dreams,” we have to help educators know the wonderful array of programs available to them, programs they may never have heard of. And beyond the particular programs, we need to build excitement about the opportunity to select among proven programs.

In education, we make choices not for ourselves, but on behalf of our children. Responsible educators want to choose programs and practices that improve the achievement of their students. Something like a marriage of the Sears catalogue and Consumer Reports is necessary to address educators’ dreams and their need for information on program outcomes. Users should be both excited and informed. Information usually does not excite. Excitement usually does not inform. We need a way to do both.

In Evidence for ESSA, we have tried to give educators a sense that there are many solutions to enduring instructional problems (excitement), and descriptions of programs, outcomes, costs, staffing requirements, professional development, and effects for particular subgroups, for example (information).

In contrast to Sears catalogues, Evidence for ESSA is light (Sears catalogues were huge, and ultimately broke the springs on my mother’s station wagon). In contrast to Consumer Reports, Evidence for ESSA is free. Every marriage has its problems, but our hope is that we can capture the excitement and the information from the marriage of these two approaches.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Picture source: Nationaal Archief, the Netherlands