Effect Sizes: How Big is Big?

An effect size is a measure of how much an experimental group exceeds a control group, controlling for pretests. As every quantitative researcher knows, the formula is (XT – XC)/SD: the adjusted treatment mean minus the adjusted control mean, divided by the unadjusted standard deviation. If this is all gobbledygook to you, I apologize, but sometimes we research types just have to let our inner nerd run free.
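For readers who prefer code to formulas, here is a minimal sketch of that calculation in Python, using made-up numbers rather than data from any real study:

```python
# A minimal sketch of the effect size computation described above,
# using invented numbers (not data from any actual study).
treatment_adjusted_mean = 52.0   # adjusted posttest mean, experimental group
control_adjusted_mean = 50.0     # adjusted posttest mean, control group
unadjusted_sd = 10.0             # unadjusted (pooled) standard deviation

effect_size = (treatment_adjusted_mean - control_adjusted_mean) / unadjusted_sd
print(f"Effect size: {effect_size:+.2f}")  # prints "Effect size: +0.20"
```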

Effect sizes have come to be accepted as a standard indicator of the impact an experimental treatment had on a posttest. As research plays a growing role in policy and practice, understanding effect sizes is becoming increasingly important.

One constant question is how important a given effect size is. How big is big? Many researchers still use a rule of thumb from Cohen to the effect that +0.20 is “small,” +0.50 is “moderate,” and +0.80 or more is “large.”  Yet Cohen himself disavowed these standards long ago.

High-quality experimental-control comparison research in schools rarely gets effect sizes as large as +0.20, and only one-to-one tutoring studies routinely get to +0.50. So Cohen’s rule of thumb was demanding effect sizes for rigorous school research far larger than those typically reported in practice.

An article by Hill, Bloom, Black, and Lipsey (2008) considered several ways to determine the importance of effect sizes. They noted that students learn more each year (in effect sizes) in the early elementary grades than do high school students. They suggested that therefore a given effect size for an experimental treatment may be more important in secondary school than the same effect size would be in elementary school. However, in four additional tables in the same article, they show that actual effect sizes from randomized studies are relatively consistent across the grades. They also found that effect sizes vary greatly depending on methodology and the nature of measures. They end up concluding that it is most reasonable to determine the importance of an effect size by comparing it to effect sizes in other studies with similar measures and designs.

A study by Alan Cheung and me (2016) reinforces the importance of methodology in determining what counts as an important effect size. We analyzed all findings from 645 high-quality studies included in all reviews in our Best Evidence Encyclopedia (www.bestevidence.org). We found that the most important factors affecting effect sizes were sample size and design (randomized vs. matched). Here is the key table.

Effects of Sample Size and Design on Effect Sizes

Design          Small sample (<250 students)   Large sample
Matched                   +0.33                    +0.17
Randomized                +0.23                    +0.12

What this table shows is that matched studies with small sample sizes (fewer than 250 students) have much higher effect sizes, on average, than, say, large randomized studies (+0.33 vs. +0.12). These differences say nothing about the actual impact on children; they are due entirely to differences in study design.

If effect sizes are so different due to study design, then we cannot have a single standard to tell us when an effect size is large or small. All we can do is note when an effect size is large compared to those of similar studies. For example, imagine that a study finds an effect size of +0.20. Is that big or small? If it were a matched study with a small sample size, +0.20 would be a rather small impact. If it were a randomized study with a large sample size, it might be considered quite a large impact.

Beyond study methods, a good general principle is to compare like with like. Some treatments may have very small effect sizes but be so inexpensive, or affect so many students, that even a small effect is important. Principal or superintendent training, for instance, may affect a very large number of students, and benchmark assessments may be so inexpensive that a small effect size is still worthwhile and compares favorably with equally inexpensive means of solving the same problem.

My colleagues and I will be developing a formula that lets researchers and readers enter the features of a study and produce an “expected effect size,” so they can determine more accurately whether a given effect size should be considered large or small.
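To make the idea concrete, here is a purely hypothetical Python sketch of how such a comparison might work, using only the four averages from the table above as benchmarks. The function, the names, and the cutoff are mine, for illustration only; this is not the formula we will eventually publish:

```python
# Hypothetical benchmarks drawn from the average effect sizes in the table
# above (small = fewer than 250 students). Not the actual forthcoming formula.
BENCHMARKS = {
    ("matched", "small"): 0.33,
    ("matched", "large"): 0.17,
    ("randomized", "small"): 0.23,
    ("randomized", "large"): 0.12,
}

def expected_effect_size(design: str, n_students: int) -> float:
    """Return the average effect size of studies with a similar design."""
    size = "small" if n_students < 250 else "large"
    return BENCHMARKS[(design, size)]

# Is an observed effect size of +0.20 big or small? It depends on the design.
observed = 0.20
print(f"{observed - expected_effect_size('matched', 150):+.2f}")      # -0.13: below average
print(f"{observed - expected_effect_size('randomized', 2000):+.2f}")  # +0.08: above average
```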

Not long ago, it would not have mattered much how large effect sizes were considered, but now it does. That’s an indication of the progress we have made in recent years. Big indeed!

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


What if a Sears Catalogue Married Consumer Reports?

When I was in high school, I had a summer job delivering Sears catalogues. I borrowed my mother’s old Chevy station wagon and headed out fully laden into the wilds of the Maryland suburbs of Washington.

I immediately learned something surprising. I thought of a Sears catalogue as a big book of advertisements. But the people to whom I was delivering them often saw it as a book of dreams. They were excited to get their catalogues. When a neighborhood saw me coming, I became a minor celebrity.

Thinking back on those days recently, I was reminded of our Evidence for ESSA website (www.evidenceforessa.org). I realized that what I wanted it to be was a way to communicate to educators the wonderful array of programs they could use to improve outcomes for their children. Sort of like a Sears catalogue for education. However, it provides something that a Sears catalogue does not: evidence about the effectiveness of each catalogue entry. Imagine a Sears catalogue that was married to Consumer Reports. Where a traditional Sears catalogue might describe a kitchen gadget with “It slices and dices, with no muss, no fuss!”, the marriage with Consumer Reports would instead say, “Effective at slicing and dicing, but lots of muss. Also fuss.”

If this marriage took place, it might take some of the fun out of the Sears catalogue (making it a book of realities rather than a book of dreams), but it would give confidence to buyers and help them make wise choices. And with proper wordsmithing, it could still communicate both enthusiasm, when warranted, and truth. But even more, it could have a huge impact on the producers of consumer goods, because they would know that their products would have to be rigorously tested and shown to back up their claims.

In enhancing the impact of research on the practice of education, we have two problems that have to be solved. Just like the “Book of Dreams,” we have to help educators know the wonderful array of programs available to them, programs they may never have heard of. And beyond the particular programs, we need to build excitement about the opportunity to select among proven programs.

In education, we make choices not for ourselves, but on behalf of our children. Responsible educators want to choose programs and practices that improve the achievement of their students. Something like a marriage of the Sears catalogue and Consumer Reports is necessary to address educators’ dreams and their need for information on program outcomes. Users should be both excited and informed. Information usually does not excite. Excitement usually does not inform. We need a way to do both.

In Evidence for ESSA, we have tried to give educators a sense that there are many solutions to enduring instructional problems (excitement), and descriptions of programs, outcomes, costs, staffing requirements, professional development, and effects for particular subgroups, for example (information).

In contrast to Sears catalogues, Evidence for ESSA is light (Sears catalogues were huge, and ultimately broke the springs on my mother’s station wagon). In contrast to Consumer Reports, Evidence for ESSA is free.  Every marriage has its problems, but our hope is that we can capture the excitement and the information from the marriage of these two approaches.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Picture source: Nationaal Archief, the Netherlands

 

What Kinds of Studies Are Likely to Replicate?


In the hard sciences, there is a publication called the Journal of Irreproducible Results.  It really has nothing to do with replication of experiments, but is a humor journal by and for scientists.  The reason I bring it up is that to chemists and biologists and astronomers and physicists, for example, an inability to replicate an experiment is a sure indication that the original experiment was wrong.  To the scientific mind, a Journal of Irreproducible Results is inherently funny, because it is a journal of nonsense.

Replication, the ability to repeat an experiment and get a similar result, is the hallmark of a mature science.  Sad to say, replication is rare in educational research, which says a lot about our immaturity as a science.  For example, in the What Works Clearinghouse, about half of the programs across all topics are represented by a single evaluation.  When there are two or more, the results are often very different.  Relatively recent funding initiatives, especially studies supported by Investing in Innovation (i3) and the Institute of Education Sciences (IES), and targeted initiatives such as Striving Readers (secondary reading) and the Preschool Curriculum Evaluation Research (PCER) program, have added a great deal in this regard. They have funded many large-scale, randomized, very high-quality studies of all sorts of programs; many of these are replications themselves, and others provide a good basis for replications later.  As my colleagues and I have done many reviews of research in every area of education, pre-kindergarten to grade 12 (see www.bestevidence.org), we have gained a good intuition about what kinds of studies are likely to replicate and what kinds are less likely.

First, let me define in more detail what I mean by “replication.”  There is no value in replicating biased studies, which may well consistently find the same biased results. This happens, for example, when both the original studies and the replication studies use the same researcher- or developer-made outcome measures, slanted toward content the experimental group experienced but the control group did not (see http://www.tandfonline.com/doi/abs/10.1080/19345747.2011.558986).

Instead, I’d consider a successful replication one that shows positive outcomes both in the original studies and in at least one large-scale, rigorous replication. One obvious way to increase the chances that a program producing a positive outcome in one or more initial studies will succeed in such a rigorous replication evaluation is to use a similar, equally rigorous evaluation design in the first place. I think a lot of treatments that fail to replicate are ones that used weak methods in the original studies. In particular, small studies tend to produce greatly inflated effect sizes (see http://www.bestevidence.org/methods/methods.html), which are unlikely to replicate in larger evaluations.

Another factor likely to contribute to replicability is use in the earlier studies of methods or conditions that can be repeated in later studies, or in schools in general. For example, providing teachers with specific manuals, videos demonstrating the methods, and specific student materials all add to the chances that a successful program can be successfully replicated. Avoiding unusual pilot sites (such as schools known to have outstanding principals or staff) may contribute to replication, as these conditions are unlikely to be found in larger-scale studies. Having experimenters or their colleagues or graduate students extensively involved in the early studies diminishes replicability, of course, because those conditions will not exist in replications.

Replications are entirely possible. I wish there were a lot more of them in our field. Showing that a program is effective in two rigorous evaluations is far more convincing than showing it in just one. As evidence becomes more and more important, I hope and expect that replications, perhaps carried out by states or districts, will become more common.

The Journal of Irreproducible Results is fun, but it isn’t science. I’d love to see a Journal of Replications in Education to tell us what really works for kids.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Maximizing the Promise of “Promising” in ESSA

As anyone who reads my blogs is aware, I’m a big fan of the ESSA evidence standards. Yet there are many questions about the specific meaning of the definitions of strength of evidence for given programs. “Strong” is pretty clear: at least one study that used a randomized design and found a significant positive effect. “Moderate” requires at least one study that used a quasi-experimental design and found significant positive effects. There are important technical questions with these, but the basic intention is clear.

Not so with the third category, “promising.” It sounds clear enough: At least one correlational study with significantly positive effects, controlling for pretests or other variables. Yet what does this mean in practice?

The biggest problem is that correlation does not imply causation. Imagine, for example, that a study found a significant correlation between the numbers of iPads in schools and student achievement. Does this imply that more iPads cause more learning? Or could wealthier schools happen to have more iPads (and children in wealthy families have many opportunities to learn that have nothing to do with their schools buying more iPads)? The ESSA definitions do require controlling for other variables, but correlational studies lend themselves to error when they try to control for big differences.

Another problem is that a correlational study may not specify how much of a given resource is needed to show an effect. In the case of the iPad study, did positive effects depend on one iPad per class, or thirty (one per student)? It’s not at all clear.

Despite these problems, the law clearly defines “promising” as requiring correlational studies, and as law-abiding citizens, we must obey. But the “promising” category also allows for some additional categories of studies that can fill some important gaps that otherwise lurk in the ESSA evidence standards.

The most important category involves studies in which schools or teachers (not individual students) were randomly assigned to experimental or control groups. Current statistical norms require that such studies use multilevel analyses, such as Hierarchical Linear Modeling (HLM). In essence, these are analyses at the cluster level (school or teacher), not the student level. The What Works Clearinghouse (WWC) requires use of statistics like HLM in clustered designs.

The problem is that it takes a lot of schools or teachers to have enough power to find significant effects. As a result, many otherwise excellent studies fail to find significant differences, and are not counted as meeting any standard in the WWC.

The Technical Working Group (TWG) that set the standards for our Evidence for ESSA website suggested a solution to this problem. Cluster randomized studies that fail to find significant effects are re-analyzed at the student level. If the student-level outcome is significantly positive, the program is rated as “promising” under ESSA. Note that all experiments are also correlational studies (just using a variable with only two possible values, experimental or control), and experiments in education almost always control for pretests and other factors, so our procedure meets the ESSA evidence standards’ definition for “promising.”

Another situation in which “promising” is used for “just-missed” experiments is in the case of quasi-experiments. Like randomized experiments, these should be analyzed at the cluster level if treatment was at the school or classroom level. So if a quasi-experiment did not find significantly positive outcomes at the cluster level but did find significant positive effects at the student level, we include it as “promising.”

These procedures are important for the ESSA standards, but they are also useful for programs that are not able to recruit a large enough sample of schools or teachers to do randomized or quasi-experimental studies. For example, imagine that a researcher evaluating a school-wide math program for tenth graders could only afford to recruit and serve 10 schools. She might deliberately use a design in which the 10 schools are randomly assigned to use the innovative math program (n=5) or serve as a control group (n=5). A cluster randomized experiment with only 10 clusters is extremely unlikely to find a significant positive effect at the school level, but with perhaps 1000 students per condition, would be very likely to find a significant effect at the student level, if the program is in fact effective. In this circumstance, the program could be rated, using our standard, as “promising,” an outcome true to the ordinary meaning of the word: not proven, but worth further investigation and investment.
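To show why the two levels of analysis can disagree, here is a hedged simulation sketch in Python (using numpy, pandas, and statsmodels), with invented parameter values rather than data from any real evaluation. With only 10 schools, the cluster-aware analysis usually cannot reach significance even when a naive student-level analysis does:

```python
# Hypothetical simulation of the scenario above: 10 schools, 5 randomly
# assigned to the new program and 5 to control, about 200 students each.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
rows = []
for school in range(10):
    treatment = 1 if school < 5 else 0
    school_effect = rng.normal(0, 0.25)           # schools differ from one another
    for _ in range(200):
        pretest = rng.normal(0, 1)
        posttest = (0.7 * pretest + school_effect
                    + 0.20 * treatment            # assumed true program effect
                    + rng.normal(0, 0.7))
        rows.append((school, treatment, pretest, posttest))
df = pd.DataFrame(rows, columns=["school", "treatment", "pretest", "posttest"])

# Cluster-aware analysis in the spirit of HLM: a mixed model with a random
# intercept for each school, so the school is effectively the unit of analysis.
hlm = smf.mixedlm("posttest ~ treatment + pretest", df, groups=df["school"]).fit()

# Naive student-level ANCOVA that ignores clustering.
ols = smf.ols("posttest ~ treatment + pretest", df).fit()

print(f"Cluster-aware p-value: {hlm.pvalues['treatment']:.3f}")
print(f"Student-level p-value: {ols.pvalues['treatment']:.3f}")
# The cluster-aware standard error is far larger because it accounts for
# school-to-school variation, so with only 10 schools its p-value will
# rarely fall below .05, while the student-level p-value often will.
```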

Using the “promising” category in this way may encourage smaller-scale, less well funded researchers to get into the evidence game, albeit at a lower rung on the ladder. But it is not good policy to set such high standards that few programs will qualify. Defining “promising” as we have in Evidence for ESSA does not promise anyone the Promised Land, but it broadens the number of programs schools may select, knowing that they must give up a degree of certainty in exchange for a broader selection of programs.

The WWC’s 25% Loophole

I am a big fan of the concept of the What Works Clearinghouse (WWC), though I have concerns about various WWC policies and practices. For example, I have written previously with concerns about WWC’s acceptance of measures made by researchers and developers and WWC’s policy against weighting effect sizes by sample sizes when computing mean effect sizes for various programs. However, there is another WWC policy that is a problem in itself, but this problem is made more serious in light of recent Department of Education guidance on the ESSA evidence standards.

The WWC Standards and Procedures 3.0 manual sets rather tough standards for programs to be rated as having positive effects in studies meeting standards “without reservations” (essentially, randomized experiments) and “with reservations” (essentially, quasi-experiments, or matched studies). However, the WWC defines a special category of programs for which all caution is thrown to the winds. Such studies are called “substantively important,” and are treated as though they met WWC standards. Quoting from Standards and Procedures 3.0: “For the WWC, effect sizes of +0.25 standard deviations or larger are considered to be substantively important…even if they might not reach statistical significance…” The “effect size greater than +0.25” loophole (the >0.25 loophole, for short) is problematic in itself, but could lead to catastrophe for the ESSA evidence standards that now identify programs that meet “strong,” “moderate,” and “promising” levels of evidence.

The problem with the >0.25 loophole is that studies that meet the loophole criterion without meeting the usual methodological criteria are usually very, very, very bad studies, usually with a strong positive bias. These studies are often very small (far too small for statistical significance). They usually use measures made by the developers or researchers, or ones that are excessively aligned with the content of the experimental group but not the control group.

One example of the >0.25 loophole is a Brady (1990) study accepted as “substantively important” by the WWC. In it, 12 students in rural Alaska were randomly assigned to Reciprocal Teaching or to a control group. The literacy treatment was built around specific science content, but the control group never saw this content. Yet one of the outcome measures, focused on this content, was made by Mr. Brady, and two others were scored by him. Mr. Brady also happened to be the teacher of the experimental group. The effect size in this awful study was an extraordinary +0.65, though outcomes in other studies assessed on measures more fair to the control group were much smaller.

Because the WWC does not weight studies by sample size, this tiny, terrible study had the same impact in the WWC summary as studies with hundreds or thousands of students.
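To illustrate the arithmetic with invented numbers (not the actual studies in any WWC review), here is a small Python sketch comparing an unweighted mean effect size with one weighted by sample size:

```python
# Invented example: one tiny study with a huge effect size alongside three
# large studies with modest effect sizes.
studies = [
    {"n": 12,   "es": 0.65},   # a tiny study like the one described above
    {"n": 800,  "es": 0.08},
    {"n": 1200, "es": 0.12},
    {"n": 2500, "es": 0.10},
]

unweighted = sum(s["es"] for s in studies) / len(studies)
weighted = sum(s["es"] * s["n"] for s in studies) / sum(s["n"] for s in studies)

print(f"Unweighted mean effect size:   {unweighted:+.2f}")  # +0.24
print(f"Sample-size-weighted mean:     {weighted:+.2f}")    # +0.10
```

In the unweighted average, the 12-student study pulls the summary far upward; weighting by sample size (or, as is more common in meta-analysis, by inverse variance) puts it in perspective.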

For the ESSA evidence standards, the >0.25 loophole can lead to serious errors. A single study meeting standards makes a program qualify for one of the top three ESSA categories (strong, moderate, or promising). There can be financial consequences for schools using programs in the top three categories (for example, use of such programs is required for schools seeking school improvement grants). Yet a single study meeting the standards, including the awful 12-student study of Reciprocal Teaching, qualifies the program for an ESSA category, no matter what is found in all other studies (unless there are qualifying studies with negative impacts). The loophole also works in the negative direction: a small, terrible study could find an effect size below -0.25, and no amount or quality of positive findings could then make the program meet WWC standards.

The >0.25 loophole is bad enough for research that already exists, but for the future, the problem is even more serious. Program developers or commercial publishers could do many small studies of their programs or could commission studies using developer-made measures. Once a single study exceeds an effect size of +0.25, the program may be considered validated forever.

To add to the problem, recent guidance from the U.S. Department of Education states that programs can meet the ESSA “promising” definition if they report outcomes that are statistically significant or substantively important. The guidance refers to the WWC standards for the “strong” and “moderate” categories, and the WWC standards themselves allow for the >0.25 loophole, even though the law itself never mentions or implies it; the law consistently requires statistically significant outcomes, not “substantively important” ones. In other words, programs that meet WWC standards for “positive” or “potentially positive” ratings based on substantively important evidence alone explicitly do not meet ESSA standards, which require statistical significance. Yet the recent regulations do not recognize this problem.

The >0.25 loophole began, I’d assume, when the WWC was young and few programs met its standards. It was jokingly called the “Nothing Works Clearinghouse.” The loophole was probably added to increase the numbers of included programs. This loophole produced misleading conclusions, but since the WWC did not matter very much to educators, there were few complaints. Today, however, the WWC has greater importance because of the ESSA evidence standards.

Bad loopholes make bad laws. It is time to close this loophole, and eliminate the category of “substantively important.”

Research and Development Saved Britain. Maybe They Will Save U.S. Education

One of my summer goals is to read the entire six-volume history of the Second World War by Winston Churchill. So far, I’m about halfway through the first volume, The Gathering Storm, which covers the period leading up to 1939.

The book is more or less a wonderfully written rant about the Allies’ shortsightedness. As Hitler built up his armaments, Britain, France, and their allies maintained a pacifist insistence on reducing theirs. Only in the mid-thirties, when war was inevitable, did Britain start investing in armaments, but even then at a very modest pace.

Churchill was a Member of Parliament but was out of government. However, he threw himself into the one thing he could do to help Britain prepare: research and development. In particular, he worked with top scientists to develop the capacity to track, identify, and shoot down enemy aircraft.

When the 1940 Battle of Britain came and German planes tried to destroy and demoralize Britain in advance of an invasion, the inventions by Churchill’s group were a key factor in defeating them.

Churchill’s story is a good analogue to the situation of education research and development. In the current environment, the best-evaluated, most effective programs are not in wide use in U.S. schools. But the research and development that creates and evaluates these programs is essential. It is useful right away in the hundreds of schools that do use proven programs already. But imagine what would happen if federal, state, or local governments anywhere decided to use proven programs to combat their most important education problems at scale. Such a decision would be laudable in principle, but where would the proven programs come from? How would they generate convincing evidence of effectiveness? How would they build robust and capable organizations to provide high-quality professional development, materials, and software?

The answer is research and development, of course. Just as Churchill and his scientific colleagues had to create new technologies before Britain was willing to invest in air defenses and air superiority at scale, so American education needs to prepare for the day when government at all levels is ready to invest seriously in proven educational programs.

I once visited a secondary school near London. It’s an ordinary school now, but in 1940 it was a private girls’ school. A German plane, shot down in the Battle of Britain, crash landed near the school. The girls ran out and captured the pilot!

The girls were courageous, as was the British pilot who shot down the German plane. But the advanced systems the British had worked out and tested before the war were also important to saving Britain. In education reform we are building and testing effective programs and organizations to support them. When government decides to improve student learning nationwide, we will be ready, if investments in research and development continue.

This blog is sponsored by the Laura and John Arnold Foundation

Research and Practice: “Tear Down This Wall”

I was recently in Berlin. Today, it’s a lively, entirely normal European capital. But the first time I saw it, it was 1970, and the wall still divided it. Like most tourists, I went through Checkpoint Charlie to the east side. The two sides were utterly different. West Berlin was pleasant, safe, and attractive. East Berlin was a different world. On my recent trip, I met a young researcher who grew up in West Berlin. He recalls his father being taken in for questioning because he accidentally brought a West Berlin newspaper across the border. Western people could visit, but western newspapers could get you arrested.

I remember John F. Kennedy’s “Ich bin ein Berliner” speech, and Ronald Reagan’s “Mr. Gorbachev, tear down this wall.” And one day, for reasons no one seems to understand, the wall was gone. Even today, I find it thrilling and incredible to walk down Unter den Linden and under the Brandenburg Gate. Not so long ago, this was impossible, even fatal.

The reason I bring up the Berlin Wall is that I want to use it as an analogy to another wall of less geopolitical consequence, perhaps, but very important to our profession. This is the wall between research and practice.

It is not my intention to disrespect the worlds on either side of the research/practice wall. People on both sides care deeply about children and bring enormous knowledge, skill, and effort to improving educational outcomes. In fact, that’s what is so sad about this wall. People on both sides have so much to teach and learn from the other, but all too often, they don’t.

In recent years, the federal government has in many ways been reinforcing the research/practice divide, at least until the passage of the Every Student Succeeds Act (ESSA) (more on this later). On one hand, government has invested in high-quality educational research and development, especially through Investing in Innovation (i3) and the Institute of Education Sciences (IES). As a result, over on the research side of the wall there is a growing stockpile of rigorously evaluated, ready-to-implement education programs for most subjects and grade levels.

On the practice side of the wall, however, government has implemented national policies that may or may not have a basis in research, but definitely do not focus on use of proven programs. Examples include accountability, teacher evaluation, and Common Core. Even federal School Improvement Grants (SIG) for the lowest-achieving 5% of schools in each state had loads of detailed requirements for schools to follow but said nothing at all about using proven programs or practices, until a proven whole-school reform option was permitted as one of six alternatives at the very end of No Child Left Behind. The huge Race to the Top funding program was similarly explicit about standards, assessments, teacher evaluations, and other issues, but said nothing about use of proven programs.

On the research side of the wall, developers and researchers were being encouraged by the U.S. Department of Education to write their findings clearly and “scale up” their findings to presumably eager potential adopters on the practice side. Yet the very same department was, at the same time, keeping education leaders on the practice side of the wall scrambling to meet federal standards to obtain Race to the Top, School Improvement Grants, and other funding, none of which had anything much to do with the evidence base building up on the research side of the wall. The problem posed by the Berlin Wall was not going to be resolved by sneaking well-written West Berlin newspapers into East Berlin, or East Berlin newspapers into West Berlin. Rather, someone had to tear down the wall.

The Every Student Succeeds Act (ESSA) is one attempt to tear down the research/practice wall. Its definitions of strong, moderate, and promising levels of evidence, and provision of funding incentives for using proven programs (especially in applications for school improvement), could go a long way toward tearing down the research/practice wall, but it’s too soon to tell. So far, these definitions are just words on a page. It will take national, state, and local leadership to truly make evidence central to education policy and practice.

On National Public Radio, I recently heard recorded recollections from people who were in Berlin the day the wall came down. One of them really stuck with me. West Berliners had climbed to the top of the wall and were singing and cheering as gaps were opened. Then, an East German man headed for a gap. The nearby soldiers, unsure what to do, pointed their rifles at him and told him to stop. He put his hands in the air. The West Germans on the wall fell silent, anxiously watching.

A soldier went to find the captain. The captain came out of a guardhouse and walked over to the East German man. He put his arm around his shoulders and personally walked him through the gap in the wall.

That’s leadership. That’s courage. It’s what we need to tear down our wall: leaders at all levels who actively encourage the world of research and the world of practice to become one, and who do it by personal and public example, so that educators understand that the rules have changed, and that communication between research and practice, and the use of proven programs and practices, will be encouraged and facilitated.

Our wall can come down. It’s only a question of leadership, and commitment to better outcomes for children.

This blog is sponsored by the Laura and John Arnold Foundation