Meta-Analysis and Its Discontents

Everyone loves meta-analyses. We did an analysis of the most frequently opened articles on Best Evidence in Brief. Almost all of the most popular were meta-analyses. What’s so great about meta-analyses is that they condense a lot of evidence and synthesize it, so instead of just one study that might be atypical or incorrect, a meta-analysis seems authoritative, because it averages many individual studies to find the true effect of a given treatment or variable.

Meta-analyses can be wonderful summaries of useful information. But today I wanted to discuss how they can be misleading. Very misleading.

The problem is that there are no norms among journal editors or meta-analysts themselves about standards for including studies or, perhaps most importantly, how much or what kind of information needs to be reported about each individual study in a meta-analysis. Some meta-analyses are completely statistical. They report all sorts of statistics and very detailed information on exactly how the search for articles took place, but never say anything about even a single study. This is a problem for many reasons. Readers may have no real understanding of what the studies really say. Even if citations for the included studies are available, only a very motivated reader is going to go find any of them. Most meta-analyses do have a table listing studies, but the information in the table may be idiosyncratic or limited.

One reason all of this matters is that without clear information on each study, readers can be easily misled. I remember encountering this when meta-analysis first became popular in the 1980s. Gene Glass coined the very term, proposed some foundational procedures, and popularized the methods. Early on, he applied meta-analysis to determine the effects of class size, which by then had been studied several times and found to matter very little except in first grade. Reducing “class size” to one (i.e., one-to-one tutoring) also was known to make a big difference, but few people would include one-to-one tutoring in a review of class size. But Glass and Smith (1978) found a much higher effect, not limited to first grade or tutoring. It was a big deal at the time.

I wanted to understand what happened. I bought and read Glass’ book on class size, but it was nearly impossible to tell what had happened. But then I found in an obscure appendix a distribution of effect sizes. Most studies had effect sizes near zero, as I expected. But one had a huge effect size, of +1.25! It was hard to tell which particular study accounted for this amazing effect but I searched by process of elimination and finally found it.

It was a study of tennis.

The outcome measure was the ability to “rally a ball against a wall so many times in 30 seconds.” Not surprisingly, when there were “large class sizes,” most students got very few chances to practice, while in “small class sizes,” they did.

If you removed the clearly irrelevant tennis study, the average effect size for class sizes (other than tutoring) dropped to near zero, as reported in all other reviews (Slavin, 1989).
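To see how a single outlier can carry an average, here is a minimal sketch. Only the +1.25 value comes from the account above; the other nine effect sizes are invented for illustration and are not Glass and Smith's data.

```python
from statistics import mean

# Hypothetical effect sizes: nine studies near zero plus one outlier.
# Only the +1.25 figure is taken from the text; the rest are made up.
effect_sizes = [0.02, -0.05, 0.04, 0.00, 0.07, -0.03, 0.01, 0.05, -0.01, 1.25]


def leave_one_out(values):
    """Return the mean of the remaining values after dropping each one in turn."""
    return [mean(values[:i] + values[i + 1:]) for i in range(len(values))]


overall = mean(effect_sizes)
without_outlier = mean(es for es in effect_sizes if es < 1.0)

print(f"mean with all studies:    {overall:+.3f}")
print(f"mean without the outlier: {without_outlier:+.3f}")

# Dropping each study in turn shows which single study moves the mean most.
for es, loo in zip(effect_sizes, leave_one_out(effect_sizes)):
    print(f"drop {es:+.2f} -> mean {loo:+.3f}")
```

A reader can only run this kind of check if the review lists each study's effect size, which is the point of the argument here.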

The problem went way beyond class size, of course. What was important, to me at least, was that Glass’ presentation of the data made it very difficult to find out what was really going on. He had attractive and compelling graphs and charts showing effects of class size, but they all depended on the one tennis study, and there was no easy way to find out.

Because of this review and several others appearing in the 1980s, I wrote an article criticizing numbers-only meta-analyses and arguing that reviewers should show all of the relevant information about the studies in their meta-analyses, and should even describe each study briefly to help readers understand what was happening. I made up a name for this, “best-evidence synthesis” (Slavin, 1986).

Neither the term nor the concept really took hold, I’m sad to say. You still see meta-analyses all the time that do not tell readers enough for them to know what’s really going on. Yet several developments have made the argument for something like best-evidence synthesis a lot more compelling.

One development is the increasing evidence that methodological features can be strongly correlated with effect sizes (Cheung & Slavin, 2016). The evidence is now overwhelming that effect sizes are greatly inflated when sample sizes are small, when study durations are brief, when measures are made by developers or researchers, or when quasi-experiments rather than randomized experiments are used, for example. Many meta-analyses check for the effects of these and other study characteristics, and may make adjustments if there are significant differences. But this is not sufficient, because in a particular meta-analysis, there may not be enough studies to make any study-level factors significant. For example, if Glass had tested “tennis vs. non-tennis,” there would have been no significant difference, because there was only one tennis study. Yet that one study dominated the means anyway. Eliminating studies using, for example, researcher/developer-made measures or very small sample sizes or very brief durations is one way to remove bias from meta-analyses, and this is what we do in our reviews. But at bare minimum, it is important to have enough information available in tables to enable readers or journal reviewers to look for such biasing factors so they can recompute or at least understand the main effects if they are so inclined.

The second development that makes it important to require more information on individual studies in meta-analyses is the increased popularity of meta-meta-analyses, where the average effect sizes from whole meta-analyses are averaged. These have even more potential for trouble than the worst statistics-only reviews, because it is extremely unlikely that many readers will follow the citations to each included meta-analysis and then follow those citations to look for individual studies. It would be awfully helpful if readers or reviewers could trust the individual meta-analyses (and therefore their averages), or at least see for themselves.

As evidence takes on greater importance, this would be a good time to discuss reasonable standards for meta-analyses. Otherwise, we’ll be rallying balls uselessly against walls forever.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Glass, G., & Smith, M. L. (1978). Meta-analysis of research on the relationship of class size and achievement. San Francisco: Far West Laboratory for Educational Research and Development.

Slavin, R. E. (1986). Best-evidence synthesis: An alternative to meta-analytic and traditional reviews. Educational Researcher, 15 (9), 5-11.

Slavin, R. E. (1989). Class size and student achievement: Small effects of small classes. Educational Psychologist, 24, 99-110.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

When Developers Commission Studies, What Develops?

I have the greatest respect for commercial developers and disseminators of educational programs, software, and professional development. As individuals, I think they genuinely want to improve the practice of education, and help produce better outcomes for children. However, most developers are for-profit companies, and they have shareholders who are focused on the bottom line. So when developers carry out evaluations, or commission evaluation companies to do so on their behalf, perhaps it’s best to keep in mind a bit of dialogue from a Marx Brothers movie. Someone asks Groucho if Chico is honest. “Sure,” says Groucho, “As long as you watch him!”

A healthy role for developers in evidence-based reform in education is desirable. Publishers, software developers, and other commercial companies have a lot of capital, and a strong motivation to create new products with evidence of effectiveness that will stand up to scrutiny. In medicine, most advances in practical drugs and treatments are made by drug companies. If you’re a cynic, this may sound disturbing. But for a long time, the federal government has encouraged drug companies to do development and evaluation of new drugs, while setting strict rules about what counts as conclusive evidence. Basically, the government says, following Groucho, “Are drug companies honest? Sure, as long as you watch ‘em.”

In our field, we may want to think about how to do this. As one contribution, my colleague Betsy Wolf did some interesting research on outcomes of studies sponsored by developers, compared to those conducted by independent third parties. She looked at all reading/literacy and math studies listed on the What Works Clearinghouse database. The first thing she found was very disturbing. Sure enough, the effect sizes for the developer-commissioned studies (ES = +0.27, n=73) were twice as large as those for independent studies (ES = +0.13, n=96). That’s a huge difference.

Being a curious person, Betsy wanted to know why developer-commissioned studies had effect sizes that were so much larger than independent ones. We now know a lot about study characteristics that inflate effect sizes. The most inflationary are small sample size, use of measures made by researchers or developers (rather than independent measures), and use of quasi-experiments instead of randomized designs. Developer-commissioned studies were in fact much more likely to use researcher/developer-made measures (29% in developer-commissioned vs. 8% in independent studies) and quasi-experiments rather than randomized designs (51% quasi-experiments for developer-commissioned studies vs. 15% for independent studies). However, sample sizes were similar in developer-commissioned and independent studies. And most surprising, statistically controlling for all of these factors did not reduce the developer effect by very much.

If there is so much inflation of effect sizes in developer-commissioned studies, then how come controlling for the largest factors that usually cause effect size inflation does not explain the developer effect?

There is a possible reason for this, which Betsy cautiously advances (since it cannot be proven). Perhaps the reason that effect sizes are inflated in developer-commissioned studies is not due to the nature of the studies we can find, but to the studies we cannot find. There has long been recognition of what is called the “file drawer effect,” which happens when studies that do not obtain a positive outcome disappear (into a file drawer). Perhaps developers are especially likely to hide disappointing findings. Unlike academic studies, which are likely to exist as technical reports or dissertations, perhaps commercial companies have no incentive to make null findings findable in any form.
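The file drawer effect is easy to demonstrate with a small simulation. This is a sketch under stated assumptions, not a model of Betsy's data: the true effect, the noise level, and the cutoff for "released" studies are all hypothetical choices.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_EFFECT = 0.13      # assumed true effect (hypothetical)
NOISE_SD = 0.20         # assumed study-to-study sampling noise (hypothetical)
RELEASE_CUTOFF = 0.10   # assumed: only estimates above this are ever released
N_STUDIES = 1000

# Each simulated study estimates the same true effect with random error.
all_studies = [random.gauss(TRUE_EFFECT, NOISE_SD) for _ in range(N_STUDIES)]

# File drawer: disappointing studies disappear and never reach reviewers.
released = [es for es in all_studies if es > RELEASE_CUTOFF]

print(f"studies run:              {len(all_studies)}")
print(f"studies released:         {len(released)}")
print(f"mean of all studies:      {mean(all_studies):+.2f}")
print(f"mean of released studies: {mean(released):+.2f}")
```

Every released study is honest on its own terms. The inflation comes entirely from which studies a reviewer is able to find, which is why no adjustment for study characteristics can remove it.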

This may not be true, or it may be true of some but not other developers. But if government is going to start taking evidence a lot more seriously, as it has done with the ESSA evidence standards (see www.evidenceforessa.org), it is important to prevent developers, or any researchers, from hiding their null findings.

There is a solution to this problem that is heading rapidly in our direction. This is pre-registration. In pre-registration, researchers or evaluators must file a study design, measures, and analyses about to be used in a study, but perhaps most importantly, pre-registration announces that a study exists, or will exist soon. If a developer pre-registered a study but that study never showed up in the literature, this might be a cause for losing faith in the developer. Imagine that the What Works Clearinghouse, Evidence for ESSA, and journals refused to accept research reports on programs unless the study had been pre-registered, and unless all other studies of the program were made available.

Some areas of medicine use pre-registration, and the Society for Research on Educational Effectiveness is moving toward introducing a pre-registration process for education. Use of pre-registration and other safeguards could be a boon to commercial developers, as it is to drug companies, because it could build public confidence in developer-sponsored research. Admittedly, it would take many years and/or a lot more investment in educational research to make this practical, but there are concrete steps we could take in that direction.

I’m not sure I see any reason we shouldn’t move toward pre-registration. It would be good for Groucho, good for Chico, and good for kids. And that’s good enough for me!

Photo credit: By Paramount Pictures (source) [Public domain], via Wikimedia Commons

The Good, the Bad, and the (Un)Promising

The ESSA evidence standards are finally beginning to matter. States are starting the process that will lead them to make school improvement awards to their lowest-achieving schools. The ESSA law is clear that for schools to qualify for these awards, they must agree to implement programs that meet the strong, moderate, or promising levels of the ESSA evidence standards. This is very exciting for those who believe in the power of proven programs to transform schools and benefit children. It is good news for kids, for teachers, and for our profession.

But inevitably, there is bad news with the good. If evidence is to be a standard for government funding, there are bound to be people who disseminate programs lacking high-quality evidence who will seek to bend the definitions to declare themselves “proven.” And there are also bound to be schools and districts that want to keep using what they have always used, or to keep choosing programs based on factors other than evidence, while doing the minimum the law requires.

The battleground is the ESSA “promising” criterion. “Strong” programs are pretty well defined as having significant positive evidence from high-quality randomized studies. “Moderate” programs are pretty well defined as having significant positive evidence from high-quality matched studies. Both “strong” and “moderate” are clearly defined in Evidence for ESSA (www.evidenceforessa.org), and, with a bit of translation, by the What Works Clearinghouse, both of which list specific programs that meet or do not meet these standards.

“Promising,” on the other hand, is kind of . . . squishy. The ESSA evidence standards do define programs meeting “promising” as ones that have statistically significant effects in “well-designed and well-implemented” correlational studies, with controls for inputs (e.g., pretests). This sounds good, but it is hard to nail down in practice. I’m seeing and hearing about a category of studies that perfectly illustrates the problem. Imagine that a developer commissions a study of a form of software. A set of schools and their 1000 students are assigned to use the software, while control schools and their 1000 students do not have access to the software but continue with business as usual.

Computers routinely produce “trace data” that automatically tells researchers all sorts of things about how much students used the software, what they did with it, how successful they were, and so on.

The problem is that typically, large numbers of students given software do not use it. They may never even hit a key, or they may use the software so little that the researchers rule the software use to be effectively zero. So in a not unusual situation, let’s assume that in the treatment group, the one that got the software, only 500 of the 1000 students actually used the software at an adequate level.

Now here’s the rub. Almost always, the 500 students will out-perform the 1000 controls, even after controlling for pretests. Yet this would be likely to happen even if the software were completely ineffective.

To understand this, think about the 500 students who did use the software and the 500 who did not. The users are probably more conscientious, hard-working, and well-organized. The 500 non-users are more likely to be absent a lot, to fool around in class, to use their technology to play computer games, or go on (non-school-related) social media, rather than to do math or science for example. Even if the pretest scores in the user and non-user groups were identical, they are not identical students, because their behavior with the software is not equal.

I once visited a secondary school in England that was a specially-funded model for universal use of technology. Along with colleagues, I went into several classes. The teachers were teaching their hearts out, making constant use of the technology that all students had on their desks. The students were well-behaved, but just a few dominated the discussion. Maybe the others were just a bit shy, we thought. From the front of each class, this looked like the classroom of the future.

But then, we filed to the back of each class, where we could see over students’ shoulders. And we immediately saw what was going on. Maybe 60 or 70 percent of the students were actually on social media unrelated to the content, paying no attention to the teacher or instructional software!

Now imagine that a study compared the 30-40% of students who were actually using the computers to students with similar pretests in other schools who had no computers at all. Again, the users would look terrific, but this is not a fair comparison, because all the goof-offs and laggards in the computer school had selected themselves out of the study while goof-offs and laggards in the control group were still included.

Rigorous researchers use a method called intent-to-treat, which in this case would include every student, whether or not they used the software or played non-educational computer games. “Not fair!” responds the software developer, because intent-to-treat includes a lot of students who never touched a key except to use social media. No sophisticated researcher accepts such an argument, however, because including only users gives the experimental group a big advantage.
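The difference between the two analyses can be sketched with a simulation in which the software does nothing at all. Everything here is an assumption made for illustration: a single "conscientiousness" trait that pretests do not capture drives both software use and posttest scores, and the more conscientious half of the treatment group are the users.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible


def simulate_student():
    """Return (conscientiousness, posttest) for one student.

    The software has NO effect: posttest depends only on the student.
    """
    conscientiousness = random.gauss(0, 1)
    posttest = 0.5 * conscientiousness + random.gauss(0, 1)
    return conscientiousness, posttest


treatment = [simulate_student() for _ in range(1000)]
control = [simulate_student() for _ in range(1000)]

# Assumption: the more conscientious half of the treatment group are the "users".
cutoff = sorted(c for c, _ in treatment)[500]
users = [post for c, post in treatment if c >= cutoff]

control_mean = mean(post for _, post in control)

# Intent-to-treat: every assigned student counts, user or not.
itt_diff = mean(post for _, post in treatment) - control_mean

# Users-only: the 500 adequate users versus all 1000 controls.
users_only_diff = mean(users) - control_mean

print(f"intent-to-treat difference: {itt_diff:+.2f}")
print(f"users-only difference:      {users_only_diff:+.2f}")
```

Because the non-users stay in the control group but are dropped from the treatment group, the users-only comparison shows an apparent benefit from software that, by construction, has none.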

Here’s what is happening at the policy level. Software developers are using data from studies that only include the students who made adequate use of the software. They are then claiming that such studies are correlational and meet the “promising” standard of ESSA.

Those who make this argument are correct in saying that such studies are correlational. But these studies are very, very, very bad, because they are biased toward the treatment. The ESSA standards specify well-designed and well-implemented studies, and these studies may be correlational, but they are not well-designed or well-implemented. Software developers and other vendors are very concerned about the ESSA evidence standards, and some may use the “promising” category as a loophole. Evidence for ESSA does not accept such studies, even as promising, and the What Works Clearinghouse does not even have any category that corresponds to “promising.” Yet vendors are flooding state departments of education and districts with studies they claim to meet the ESSA standards, though in the lowest category.

Recently, I heard something that could be a solution to this problem. Apparently, some states are announcing that for school improvement grants, and any other purpose that has financial consequences, they will only accept programs with “strong” and “moderate” evidence. They have the right to do this; the federal law says school improvement grants must support programs that at least meet the “promising” standard, but it does not say states cannot set a higher minimum standard.

One might argue that ignoring “promising” studies is going too far. In Evidence for ESSA (www.evidenceforessa.org), we accept studies as “promising” if they have weaknesses that do not lead to bias, such as clustered studies that were significant at the student but not the cluster level. But the danger posed by studies claiming to fit “promising” using biased designs is too great. Until the feds fix the definition of “promising” to exclude bias, the states may have to solve it for themselves.

I hope there will be further development of the “promising” standard to focus it on lower-quality but unbiased evidence, but as things are now, perhaps it is best for states themselves to declare that “promising” is no longer promising.

Eventually, evidence will prevail in education, as it has in many other fields, but on the way to that glorious future, we are going to have to make some adjustments. Requiring that “promising” be truly promising would be a good place to begin.

Ensuring That Proven Programs Stay Effective in Practice

On a recent trip to Scotland, I visited a ruined abbey. There, in what remained of its ancient cloister, was a sign containing a rule from the 1459 Statute of the Strasbourg Stonecutters’ Guild:

If a master mason has agreed to build a work and has made a drawing of the work as it is to be executed, he must not change this original. But he must carry out the work according to the plan that he has presented to the lords, towns, or villages, in such a way that the work will not be diminished or lessened in value.

Although the Stonecutters’ Guild was writing more than five centuries ago, it touched on an issue we face right now in evidence-based reform in education. Providers of educational programs may have excellent evidence that meets ESSA standards and demonstrates positive effects on educational outcomes. That’s terrific, of course. But the problem is that after a program has gone into dissemination, its developers may find that schools are not willing or able to pay for all of the professional development or software or materials used in the experiments that validated the program. So they may provide less, sometimes much less, to make the program cheaper or easier to adopt. This is the problem that concerned the Stonecutters of Strasbourg: Grand plans followed by inadequate construction.

In our work on Evidence for ESSA, we see this problem all the time. A study or studies show positive effects for a program. In writing up information on costs, personnel, and other factors, we usually look at the program’s website. All too often, we find that the program on the website provides much less than the program that was evaluated.  The studies might have provided weekly coaching, but the website promises two visits a year. A study of a tutoring program might have involved one-to-two tutoring, but the website sells or licenses the materials in sets of 20 for use with groups of that size. A study of a technology program may have provided laptops to every child and a full-time technology coordinator, while the website recommends one device for every four students and never mentions a technology coordinator.

Whenever we see this, we take on the role of the Stonecutters’ Guild, and we have to be as solid as a rock. We tell developers that we are planning to describe their program as it was implemented in their successful studies. This sometimes causes a ruckus, with vendors arguing that providing what they did in the study would make the program too expensive. “So would you like us to list your program (as it is in your website) as unevaluated?” we say. We are not unreasonable, but we are tough, because we see ourselves as helping schools make wise and informed choices, not helping vendors sell programs that may have little resemblance to the programs that were evaluated.

This is hard work, and I’m sure we do not get it right 100% of the time. And a developer may agree to an honest description but then quietly give discounts and provide less than what our descriptions say. All we can do is state the truth on our website about what was provided in the successful studies as best as we can, and the schools have to insist that they receive the program as described.

The Stonecutters’ Guild, and many other medieval guilds, represented the craftsmen, not the customers. Yet part of their function was to uphold high standards of quality. It was in the collective interest of all members of the guild to create and maintain a “brand,” indicating that any product of the guild’s members met the very highest standards. Someday, we hope publishers, software developers, professional development providers, and others who work with schools will themselves insist on an evidence base for their products, and then demand that providers ensure that their programs continue to be implemented in ways that maximize the probability that they will produce positive outcomes for children.

Stonecutters only build buildings. Educators affect the lives of children, which in turn affect families, communities, and societies. Long after a stonecutter’s work has fallen into ruin, well-educated people and their descendants and communities will still be making a difference. As researchers, developers, and educators, we have to take this responsibility at least as seriously as did the stone masons of long ago.

Effect Sizes and the 10-Foot Man

If you ever go into the Ripley’s Believe It or Not Museum in Baltimore, you will be greeted at the entrance by a statue of the tallest man who ever lived, Robert Pershing Wadlow, a gentle giant at 8 feet, 11 inches in his stocking feet. Kids and adults love to get their pictures taken standing by him, to provide a bit of perspective.

I bring up Mr. Wadlow to explain a phrase I use whenever my colleagues come up with an effect size of more than 1.00. “That’s a 10-foot man,” I say. What I mean, of course, is that while it is not impossible that there could be a 10-foot man someday, it is extremely unlikely, because there has never been a man that tall in all of history. If someone reports seeing one, they are probably mistaken.

In the case of effect sizes, you will never, or almost never, see an effect size of more than +1.00, assuming the following reasonable conditions:

  1. The effect size compares experimental and control groups (i.e., it is not pre-post).
  2. The experimental and control group started at the same level, or they started at similar levels and researchers statistically controlled for pretest differences.
  3. The measures involved were independent of the researcher and the treatment, not made by the developers or researchers. The test was not given by the teachers to their own students.
  4. The treatment was provided by ordinary teachers, not by researchers, and could in principle be replicated widely in ordinary schools. The experiment had a duration of at least 12 weeks.
  5. There were at least 30 students and 2 teachers in each treatment group (experimental and control).

If these conditions are met, the chances of finding effect sizes of more than +1.00 are about the same as the chances of finding a 10-foot man. That is, zero.
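The five conditions above can be turned into a simple screen for any reported effect size. This is only a sketch: the `Study` record, its field names, and the wording of the messages are hypothetical, and the thresholds are the ones stated in the list.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Hypothetical record of the study features named in the five conditions."""
    effect_size: float
    has_control_group: bool     # 1: experimental vs. control, not pre-post
    groups_equated: bool        # 2: same starting level, or pretest-adjusted
    independent_measure: bool   # 3: not made by developers or researchers
    ordinary_teachers: bool     # 4: delivered by ordinary teachers
    duration_weeks: int         # 4: at least 12 weeks
    students_per_group: int     # 5: at least 30 students per group
    teachers_per_group: int     # 5: at least 2 teachers per group


def unmet_conditions(s):
    """Return the names of the conditions this study fails."""
    checks = {
        "experimental-control comparison": s.has_control_group,
        "groups equated at pretest": s.groups_equated,
        "independent measure": s.independent_measure,
        "ordinary teachers": s.ordinary_teachers,
        "at least 12 weeks": s.duration_weeks >= 12,
        "at least 30 students per group": s.students_per_group >= 30,
        "at least 2 teachers per group": s.teachers_per_group >= 2,
    }
    return [name for name, ok in checks.items() if not ok]


def screen(s):
    """Classify a reported effect size using the 10-foot man rule."""
    if s.effect_size <= 1.00:
        return "within the usual range"
    unmet = unmet_conditions(s)
    if unmet:
        return "large, but conditions not met: " + ", ".join(unmet)
    return "10-foot man: look for a mistake"


# Invented example: a brief, small study using a researcher-made test.
example = Study(2.0, True, True, False, False, 3, 20, 1)
print(screen(example))
```

The screen does not say a large effect size is fraudulent. It says which design features make it uninformative for ordinary practice.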

I was thinking about the 10-foot man when I was recently asked by a reporter about the “two sigma effect” claimed by Benjamin Bloom and much discussed in the 1970s and 1980s. Bloom’s students did a series of experiments in which students were taught about a topic none of them knew anything about, usually principles of sailing. After a short period, students were tested. Those who did not achieve at least 80% (defined as “mastery”) on the tests were tutored by University of Chicago graduate students long enough to ensure that every tutored student reached mastery. The purpose of this demonstration was to make a claim that every student could learn whatever we wanted to teach them, and the only variable was instructional time, as some students need more time to learn than others. In a system in which enough time could be given to all, “ability” would disappear as a factor in outcomes. Also, in comparison to control groups who were not taught about sailing at all, the effect size was often more than 2.0, or two sigma. That’s why this principle was called the “two sigma effect.” Doesn’t the two sigma effect violate my 10-foot man principle?

No, it does not. The two sigma studies used experimenter-made tests of content taught to the experimental but not control groups. They used University of Chicago graduate students providing far more tutoring (as a percentage of initial instruction) than any school could ever provide. The studies were very brief and sample sizes were small. The two sigma experiments were designed to prove a point, not to evaluate a feasible educational method.

A more recent example of the 10-foot man principle is found in Visible Learning, the currently fashionable book by John Hattie claiming huge effect sizes for all sorts of educational treatments. Hattie asks the reader to ignore any educational treatment with an effect size of less than +0.40, and reports many whole categories of teaching methods with average effect sizes of more than +1.00. How can this be?

The answer is that such effect sizes, like two sigma, do not incorporate the conditions I laid out. Instead, Hattie throws into his reviews entire meta-analyses which may include pre-post studies, studies using researcher-made measures, studies with tiny samples, and so on. For practicing educators, such effect sizes are useless. An educator knows that all children grow from pre- to posttest. They would not (and should not) accept measures made by researchers. The largest known effect sizes that do meet the above conditions are one-to-one tutoring studies with effect sizes up to +0.86. Still not +1.00. What could be more effective than the best of 1-1 tutoring?

It’s fun to visit Mr. Wadlow at the museum, and to imagine what an even taller man could do on a basketball team, for example. But if you see a 10-foot man at Ripley’s Believe It or Not, or anywhere else, here’s my suggestion. Don’t believe it. And if you visit a museum of famous effect sizes that displays a whopper effect size of +1.00, don’t believe that, either. It doesn’t matter how big effect sizes are if they are not valid.

A 10-foot man would be a curiosity. An effect size of +1.00 is a distraction. Our work on evidence is too important to spend our time looking for 10-foot men, or effect sizes of +1.00, that don’t exist.

Photo credit: [Public domain], via Wikimedia Commons

Effect Sizes: How Big is Big?

An effect size is a measure of how much an experimental group exceeds a control group, controlling for pretests. As every quantitative researcher knows, the formula is (XT – XC)/SD, or adjusted treatment mean minus adjusted control mean divided by the unadjusted standard deviation. If this is all gobbledygook to you, I apologize, but sometimes us research types just have to let our inner nerd run free.
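For readers who prefer code to formulas, the calculation can be sketched as below. The function and parameter names are my own for illustration, and the example numbers are invented. The text does not say which unadjusted standard deviation is meant (pooled or control group), so the function simply takes it as an input.

```python
def effect_size(adj_treatment_mean, adj_control_mean, unadjusted_sd):
    """Compute (XT - XC) / SD.

    adj_treatment_mean: treatment posttest mean, adjusted for pretests
    adj_control_mean:   control posttest mean, adjusted for pretests
    unadjusted_sd:      unadjusted posttest standard deviation
    """
    if unadjusted_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (adj_treatment_mean - adj_control_mean) / unadjusted_sd


# Invented example: adjusted means of 52 and 50 on a test with an SD of 10.
es = effect_size(52.0, 50.0, 10.0)
print(f"effect size: {es:+.2f}")
```

In this made-up case, the experimental group ends up one-fifth of a standard deviation ahead of the control group.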

Effect sizes have come to be accepted as a standard indicator of the impact an experimental treatment had on a posttest. As research becomes more important in policy and practice, understanding them is becoming increasingly important.

One constant question is how important a given effect size is. How big is big? Many researchers still use a rule of thumb from Cohen to the effect that +0.20 is “small,” +0.50 is “moderate,” and +0.80 or more is “large.”  Yet Cohen himself disavowed these standards long ago.

High-quality experimental-control comparison research in schools rarely gets effect sizes as large as +0.20, and only one-to-one tutoring studies routinely get to +0.50. So Cohen’s rule of thumb was demanding effect sizes for rigorous school research far larger than those typically reported in practice.

An article by Hill, Bloom, Black, and Lipsey (2008) considered several ways to determine the importance of effect sizes. They noted that students learn more each year (in effect sizes) in the early elementary grades than do high school students. They suggested that therefore a given effect size for an experimental treatment may be more important in secondary school than the same effect size would be in elementary school. However, in four additional tables in the same article, they show that actual effect sizes from randomized studies are relatively consistent across the grades. They also found that effect sizes vary greatly depending on methodology and the nature of measures. They end up concluding that it is most reasonable to determine the importance of an effect size by comparing it to effect sizes in other studies with similar measures and designs.

A study by Alan Cheung and me (2016) reinforces the importance of methodology in determining what is an important effect size. We analyzed all findings from 645 high-quality studies included in all reviews in our Best Evidence Encyclopedia (www.bestevidence.org). We found that the most important factors in effect sizes were sample size and design (randomized vs. matched). Here is the key table.

Effects of Sample Size and Designs on Effect Sizes

              Sample Size
Design        Small    Large
Matched       +0.33    +0.17
Randomized    +0.23    +0.12

What this table shows is that matched studies with small sample sizes (fewer than 250 students) have much higher effect sizes, on average, than, say, large randomized studies (+0.33 vs. +0.12). These differences say nothing about the impact on children, but are completely due to differences in study design.

If effect sizes are so different due to study design, then we cannot have a single standard to tell us when an effect size is large or small. All we can do is note when an effect size is large compared to similar studies. For example, imagine that a study finds an effect size of +0.20. Is that big or small? If it were a matched study with a small sample size, +0.20 would be a rather small impact. If it were a randomized study with a large sample size, it might be considered quite a large impact.
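The comparison in this example can be sketched in a few lines of Python. This is only an illustration that uses the four averages from the table above. The ratio it computes is a hypothetical yardstick made up for this sketch, not a published formula, and the function and variable names are assumptions.

```python
# Average effect sizes by design and sample size, taken from the table above.
TYPICAL_ES = {
    ("matched", "small"): 0.33,
    ("matched", "large"): 0.17,
    ("randomized", "small"): 0.23,
    ("randomized", "large"): 0.12,
}


def relative_to_typical(es, design, sample):
    """Ratio of an observed effect size to the average for similar studies."""
    return es / TYPICAL_ES[(design, sample)]


# The same +0.20 judged against two different kinds of studies.
for design, sample in [("matched", "small"), ("randomized", "large")]:
    ratio = relative_to_typical(0.20, design, sample)
    print(f"{design}, {sample} sample: {ratio:.2f} times the typical effect size")
```

Under these assumptions, +0.20 is about 0.6 times the average for small matched studies but about 1.7 times the average for large randomized ones, which is the sense in which the same number can be "rather small" in one context and "quite large" in another.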

Beyond study methods, a good general principle is to compare like with like. Some treatments may have very small effect sizes, but they may be so inexpensive or may affect so many students that a small effect may be important. For example, principal or superintendent training may affect very many students, or benchmark assessments may be so inexpensive that a small effect size may be worthwhile, and may compare favorably with equally inexpensive means of solving the same problem.

My colleagues and I will be developing a formula to enable researchers and readers to easily put in features of a study to produce an “expected effect size” to determine more accurately whether an effect size should be considered large or small.

Not long ago, it would not have mattered much how large effect sizes were considered, but now it does. That’s an indication of the progress we have made in recent years. Big indeed!

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

What Kinds of Studies Are Likely to Replicate?


In the hard sciences, there is a publication called the Journal of Irreproducible Results.  It really has nothing to do with replication of experiments, but is a humor journal by and for scientists.  The reason I bring it up is that to chemists and biologists and astronomers and physicists, for example, an inability to replicate an experiment is a sure indication that the original experiment was wrong.  To the scientific mind, a Journal of Irreproducible Results is inherently funny, because it is a journal of nonsense.

Replication, the ability to repeat an experiment and get a similar result, is the hallmark of a mature science.  Sad to say, replication is rare in educational research, which says a lot about our immaturity as a science.  For example, in the What Works Clearinghouse, about half of programs across all topics are represented by a single evaluation.  When there are two or more, the results are often very different.  Relatively recent funding initiatives, especially studies supported by Investing in Innovation (i3) and the Institute of Education Sciences (IES), and targeted initiatives such as Striving Readers (secondary reading) and the Preschool Curriculum Evaluation Research (PCER), have added a great deal in this regard. They have funded many large-scale, randomized, very high-quality studies of all sorts of programs in the first place, and many of these are replications themselves, or they provide a good basis for replications later.  As my colleagues and I have done many reviews of research in every area of education, pre-kindergarten to grade 12 (see www.bestevidence.org), we have gained a good intuition about what kinds of studies are likely to replicate and what kinds are less likely.

First, let me define in more detail what I mean by “replication.”  There is no value in replicating biased studies, which may well consistently find the same biased results (as when, for example, both the original studies and the replication studies used the same researcher- or developer-made outcome measures that are slanted toward the content the experimental group experienced but not what the control group experienced; see http://www.tandfonline.com/doi/abs/10.1080/19345747.2011.558986).

Instead, I’d consider a successful replication one that shows positive outcomes both in the original studies and in at least one large-scale, rigorous replication. One obvious way to increase the chances that a program producing a positive outcome in one or more initial studies will succeed in such a rigorous replication evaluation is to use a similar, equally rigorous evaluation design in the first place. I think a lot of treatments that fail to replicate are ones that used weak methods in the original studies. In particular, small studies tend to produce greatly inflated effect sizes (see http://www.bestevidence.org/methods/methods.html), which are unlikely to replicate in larger evaluations.

Another factor likely to contribute to replicability is use in the earlier studies of methods or conditions that can be repeated in later studies, or in schools in general. For example, providing teachers with specific manuals, videos demonstrating the methods, and specific student materials all add to the chances that a successful program can be successfully replicated. Avoiding unusual pilot sites (such as schools known to have outstanding principals or staff) may contribute to replication, as these conditions are unlikely to be found in larger-scale studies. Having experimenters or their colleagues or graduate students extensively involved in the early studies diminishes replicability, of course, because those conditions will not exist in replications.

Replications are entirely possible. I wish there were a lot more of them in our field. Showing that programs can be effective in just two rigorous evaluations is way more convincing than just one. As evidence becomes more and more important, I hope and expect that replications, perhaps carried out by states or districts, will become more common.

The Journal of Irreproducible Results is fun, but it isn’t science. I’d love to see a Journal of Replications in Education to tell us what really works for kids.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.