Gambling With Our Children’s Futures

I recently took a business trip to Reno and Las Vegas. I don’t gamble, but it’s important to realize that casinos don’t gamble either. A casino license is permission to make a massive amount of money, risk free.

Think of a roulette table, for example, as a glitzy random number generator. You can bet on any of 38 numbers, and if your number comes up, you get back 36 times your bet. The gap between 38 and 36 is the “house percentage”: on average, the casino keeps about 2/38, or a bit over 5%, of everything wagered. So as long as the wheel is spinning and people are betting, the casino is making money, no matter how any particular spin turns out. Over the course of days, weeks, or months, that small percentage becomes big money. The same principle works in every game in the casino.

In educational research, we use statistics much as the casinos do, though for a very different purpose. We want to know what the effect of a given program is on students’ achievement. Think of each student in an experiment as a separate spin of the roulette wheel. If you have just a few students, or a few spins, the results may seem very good or very bad, on average. But when you have hundreds or thousands of students (or spins), the averages stabilize.

In educational experiments, some students usually get an experimental program and others serve as controls. If there are few students (spins) in each group, the difference between groups is unreliable. But as the numbers get larger, the difference between experimental and control groups becomes reliable.

This explains why educational experiments should involve large numbers of students. With small numbers, differences could be due to chance.

Several years ago, I wrote an article on the relationship between sample size and effect size in educational experiments. Small studies (e.g., fewer than 100 students in each group) had much larger experimental-control differences (effect sizes) than big ones. How could this be?

What I think was going on is that in small studies, effect sizes could be very positive or very negative (favoring the control group). When positive results are found, results are published and publicized. When results go the other way? Not so much. The studies may disappear.

To understand this, go back to the casino. Imagine that you bet on 20 spins, and you make big money. You go home and tell your friends you are a genius, or you credit your lucky system or your rabbit’s foot. But if you lose your shirt on 20 spins, you probably slink home and stay quiet about the whole experience.

Now imagine that you bet on 1,000 spins. It is statistically virtually certain that you will lose a predictable amount of money: about 2/38 of everything you bet, because of the 0 and 00 slots. This outcome is not interesting, but it tells you exactly how the system works.
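
If you want to see this stabilization for yourself, here is a minimal simulation sketch in Python (the function name and bet amounts are mine, invented for illustration): betting a single number on an American wheel, the per-dollar outcome swings wildly over 20 spins but settles toward the 2/38 house percentage as the number of spins grows.

```python
import random

def simulate_roulette(n_spins, bet=1.0, seed=None):
    """Bet the same single number every spin on an American wheel (38 slots).
    A winning number pays 35 to 1, so the bettor gets back 36x the stake."""
    rng = random.Random(seed)
    net = 0.0
    for _ in range(n_spins):
        if rng.randrange(38) == 0:   # our number came up: 1 chance in 38
            net += 35 * bet
        else:
            net -= bet
    return net

# Repeat each experiment 200 times to see how much the outcome can vary.
for spins in (20, 1_000, 100_000):
    per_dollar = [simulate_roulette(spins, seed=s) / spins for s in range(200)]
    print(f"{spins:>7} spins: net outcome per dollar ranged from "
          f"{min(per_dollar):+.3f} to {max(per_dollar):+.3f} "
          f"(expected {-2/38:+.3f})")
```

With 20 spins, some runs come out far ahead and others lose everything; with 100,000 spins, every run converges toward the expected loss of about five cents per dollar.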

In big studies in education, we can also produce reliable measures of “how the system works” by comparing hundreds or thousands of experimental and control students.

Critics of quantitative research in education seem to think we are doing some sort of statistical mumbo-jumbo with our computers and baffling reports. But what we are doing is trying to get to the truth, with enough “spins” of the roulette wheel to even out chance factors.

Ironically, what large-scale research in education is intended to do is to diminish the role of chance in educational decisions. We want to help educators avoid gambling with their children’s futures.

This blog is sponsored by the Laura and John Arnold Foundation


On Meta-Analysis: Eight Great Tomatoes

I remember a long-ago advertisement for Contadina tomato paste. It went something like this:

Eight great tomatoes in an itsy bitsy can!

This ad creates an appealing image, or at least a provocative one, that I suppose sold a lot of tomato paste.

In educational research, we do something a lot like “eight great tomatoes.” It’s called meta-analysis, or systematic review.  I am particularly interested in meta-analyses of experimental studies of educational programs.  For example, there are meta-analyses of reading and math and science programs.  I’ve written them myself, as have many others.  In each, some number of relevant studies are identified.  From each study, one or more “effect sizes” are computed to represent the impact of the program on important outcomes, such as scores on achievement tests. These are then averaged to get an overall impact for each program or type of program.  Think of the effect size as boiling down tomatoes to make concentrated paste, to fit into an itsy bitsy can.
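
For readers who want to see the mechanics, here is a minimal sketch of the “boiling down” step in Python. The study names, sample sizes, and effect sizes are invented for illustration; real meta-analyses typically weight each study by its sample size or inverse variance and also examine moderators.

```python
# Hypothetical effect sizes from several studies of the same type of program.
# effect size = (treatment mean - control mean) / control-group SD
studies = [
    {"name": "Study A", "n": 1200, "effect_size": 0.10},
    {"name": "Study B", "n": 300,  "effect_size": 0.18},
    {"name": "Study C", "n": 150,  "effect_size": 0.25},
]

simple_mean = sum(s["effect_size"] for s in studies) / len(studies)

# Weighting by sample size keeps one small study from dominating the can.
total_n = sum(s["n"] for s in studies)
weighted_mean = sum(s["effect_size"] * s["n"] for s in studies) / total_n

print(f"Unweighted mean effect size: {simple_mean:+.2f}")
print(f"Sample-size-weighted mean:   {weighted_mean:+.2f}")
```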

But here is the problem.  The Contadina ad specifies eight great tomatoes. If even one of those tomatoes is really lousy, the contents of the itsy bitsy can will be lousy too.  Ultimately, lousy tomato paste would bankrupt the company.

The same is true of meta-analyses.  Some meta-analyses include a broad range of studies – good, mediocre, and bad.  They may try to statistically control for various factors, but this does not do the job.  Bad studies lead to bad outcomes.  Years ago, I critiqued a meta-analysis of research on class size.  The studies of class size in ordinary classrooms found small effects.  But there was one study that involved teaching tennis.  In small classes, the kids got a lot more court time than kids in large classes did.  This study, and only this study, found substantial effects of class size, and it significantly inflated the average.  There were not eight great tomatoes; there was at least one lousy tomato, and it made the itsy bitsy can worthless.

The point I am making here is that when doing a meta-analysis, the studies must be pre-screened for quality and then carefully scrubbed.  Specifically, there are many factors that greatly (and falsely) inflate effect sizes.  Examples include assessments made by the researchers, assessments of content taught to the experimental group but not the control group, small samples, and excessive assistance to the teachers.

Some meta-analyses just shovel all the studies into a computer and report an average effect size.  More responsible ones shovel the studies into a computer and then test and control for various factors that might affect outcomes. This is better, but you just can’t control for lousy studies, because they are often lousy in many ways.

Instead, high-quality meta-analyses set specific inclusion criteria intended to minimize bias.  Studies often use both valid measures and crummy ones (such as measures biased toward the experimental group).  Good meta-analyses use the valid measures but exclude the crummy ones, according to criteria defined in advance.  Studies that only used crummy measures are excluded altogether.  And so on.

With systematic standards, systematically applied, meta-analyses can be of great value.  Call it the Contadina method.  In order to get great tomato paste, start with great tomatoes. The rest takes care of itself.

The Rapid Advance of Rigorous Research

My colleagues and I have been reviewing a lot of research lately, as you may have noticed in recent blogs on our reviews of research on secondary reading and our work on our web site, Evidence for ESSA, which summarizes research on all of elementary and secondary reading and math according to ESSA evidence standards.  In the course of this work, I’ve noticed some interesting trends, with truly revolutionary implications.

The first is that reports of rigorous research are appearing very, very fast.  In our secondary reading review, there were 64 studies that met our very stringent standards.  Fifty-five of these used random assignment, and even the 9 quasi-experiments all specified assignment to experimental or control conditions in advance.  We eliminated all researcher-made measures.  But the most interesting fact is that of the 64 studies, 19 had publication or report dates of 2015 or 2016, and 51 have appeared since 2011.  This surge of rigorous recent studies was greatly helped by the many evaluations funded by the federal Striving Readers program, but Striving Readers was not the only factor.  Seven of the studies were from England, funded by the Education Endowment Foundation (EEF).  Others were funded by the Institute of Education Sciences (IES) at the U.S. Department of Education, the federal Investing in Innovation (i3) program, and many publishers, who are increasingly realizing that the future of education belongs to those with evidence of effectiveness.

With respect to i3 and EEF, we are only at the front edge of seeing the fruits of these substantial investments, as there are many more studies in the pipeline right now, adding to the continuing build-up in the number and quality of studies started by IES and other funders.  Looking more broadly at all subjects and grade levels, there is an unmistakable conclusion: high-quality research on practical programs in elementary and secondary education is arriving in amounts we never could have imagined just a few years ago.

Another unavoidable conclusion from the flood of rigorous research is that in large-scale randomized experiments, effect sizes are modest.  In a recent review I did with my colleague Alan Cheung, we found that the mean effect size for large, randomized experiments across all of elementary and secondary reading, math, and science is only +0.13, much smaller than the effect sizes from smaller studies and from quasi-experiments.  However, unlike small and quasi-experimental studies, rigorous experiments using standardized outcome measures replicate.  These effect sizes may not be enormous, but you can take them to the bank.

In our secondary reading review, we found an extraordinary example of this. The University of Kansas has an array of programs for struggling readers in middle and high schools, collectively called the Strategic Instruction Model, or SIM.  In the Striving Readers grants, several states and districts used methods based on SIM.  In all, we found six large, randomized experiments, and one large quasi-experiment (which matched experimental and control groups).  The effect sizes across the seven varied from a low of 0.00 to +0.15, but most clustered closely around the weighted mean of +0.09.  This consistency was remarkable given that the contexts varied considerably.  Some studies were in middle schools, some in high schools, some in both.  Some studies gave students an extra period of reading each day, some did not.  Some studies went for multiple years, some did not.  Settings included inner-city and rural locations, and all parts of the U.S.

One might well argue that the SIM findings are depressing, because the effect sizes were quite modest (though usually statistically significant).  This may be true, but once we can replicate meaningful impacts, we can also start to make solid improvements.  Replication is the hallmark of a mature science, and we are getting there.  If we know how to replicate our findings, then the developers of SIM and many other programs can create better and better programs over time with confidence that once designed and thoughtfully implemented, better programs will reliably produce better outcomes, as measured in large, randomized experiments.  This means a lot.

Of course, large, randomized studies may also be reliable in telling us what does not work, or does not work yet.  When researchers get zero impacts and then seek funding to do the same treatment again, hoping for better luck, they and their funders are sure to be disappointed.  Researchers who find zero impacts may learn a lot, which may help them create something new that will, in fact, move the needle.  But they have to then use those learnings to do something meaningfully different if they expect to see meaningfully different outcomes.

Our reviews are finding that in every subject and grade level, there are programs right now that meet high standards of evidence and produce reliable impacts on student achievement.  Increasing numbers of these proven programs have been replicated with important positive outcomes in multiple high-quality studies.  If all 52,000 Title I schools adopted and implemented the best of these programs, those that reliably produce impacts of more than +0.20, the U.S. would soon rise in international rankings, achievement gaps would be cut in half, and we would have a basis for further gains as research and development build on what works to create approaches that work better.  And better.  And then better still.

There is bipartisan, totally non-political support for the idea that America’s schools should be using evidence to enhance outcomes.  However a school came into being, whoever governs it, whoever attends it, wherever it is located, at the end of the day the school exists to make a difference in the lives of children.  In every school there are teachers, principals, and parents who want and need to ensure that every child succeeds.  Research and development does not solve all problems, but it helps leverage the efforts of all educators and parents so that they can have a maximum positive impact on their children’s learning.  We have to continue to invest in that research and development, especially as we get smarter about what works and what does not, and as we get smarter about research designs that can produce reliable, replicable outcomes.  Ones you can take to the bank.

Why Rigorous Studies Get Smaller Effect Sizes

When I was a kid, I was a big fan of the hapless Washington Senators. They were awful. Year after year, they were dead last in the American League. They were the sort of team that builds diehard fans not despite but because of their hopelessness. Every once in a while, kids I knew would snap under the pressure and start rooting for the Baltimore Orioles. We shunned them forever, right up to this day.

With the Senators, any reason for hope was prized, and we were all very excited when some hotshot batter was brought up from the minor leagues. But they almost always got whammed, and were soon sent back down or traded, never to be heard from again. I’m sure this happens on every team. In fact, I just saw an actual study comparing batting averages for batters in their last year in the minors to their first year in the majors. The difference was dramatic. In the majors, the very same batters had much lower averages. The impact was equivalent to an effect size of -0.70. That’s huge. I’d call this effect the Curse of the Major Leagues.

Why am I carrying on about baseball? I think it provides an analogy to explain why large, randomized experiments in education have characteristically lower effect sizes than experiments that are quasi-experiments, smaller, or (especially) both.

In baseball, batting averages decline because the competition is tougher. The pitchers are faster, the fielders are better, and maybe the minor league parks are smaller, I don’t know. In education, large randomized experiments are tougher competition, too. Randomized experiments are tougher because the experimenter doesn’t get the benefit of self-selection by the schools or teachers choosing the program. In a randomized experiment everyone has to start fresh at the beginning of the study, so the experimenter does not get the benefit of working with teachers who may already be experienced in the experimental program.

In larger studies, the experimenter has more difficulty controlling every variable to ensure high-quality implementation. Large studies are more likely to use standardized tests rather than researcher-made tests. If these are state tests used for accountability, the control group can be assumed to be trying just as much as the experimental group to improve students’ scores on the objectives taught on those tests.

What these problems mean is that when a program is evaluated in a large randomized study, and the results are significantly positive, this is cause for real celebration, because the program had to overcome much tougher competition. The successful program is far more likely to work in realistic settings at serious scale because it has been tested under more life-like conditions. Other experimental designs are also valuable, of course, if only because they act like the minor leagues, nurturing promising prospects and then sending the best to the majors, where their mettle will really be tested. In a way, this is exactly the tiered evidence strategy used in Investing in Innovation (i3) and in the Institute of Education Sciences (IES) Goal 2-3-4 progression. In both cases, smaller grants are made available for development projects, which are nurtured and, if they show promise, may be funded at a higher level and sent to the majors (validation, scale-up) for rigorous, large-scale evaluation.

The Curse of the Major Leagues was really just the product of a system for fairly and efficiently bringing the best players into the major leagues. The same idea is the brightest hope we have for offering schools throughout the U.S. the very best instructional programs on a meaningful scale. After all those years rooting for the Washington Senators, I’m delighted to see something really powerful coming from our actual Senators in Washington. And I don’t mean baseball!

How Much Difference Does an Education Program Make?

When you use Consumer Reports car repair ratings to choose a reliable car, you are doing something a lot like what evidence-based reform in education is proposing. You look at the evidence and take it into account, but it does not drive you to a particular choice. There are other factors you’d also consider. For example, Consumer Reports might point you to reliable cars you can’t afford, or ones that are too large or too small or too ugly for your purposes and tastes, or ones with dealerships that are too far away. In the same way, there are many factors that school staffs or educational leaders might consider beyond effect size.

An effect size, or statistical significance, is only a starting point for estimating the impact a program or set of programs might have. I’d propose the term “potential impact” to subsume the following factors that a principal or staff might consider beyond effect size or statistical significance in adopting a program to improve education outcomes:

  • Cost-effectiveness
  • Evidence from similar schools
  • Immediate and long-term payoffs
  • Sustainability
  • Breadth of impact
  • Low-hanging fruit
  • Comprehensiveness

Cost-Effectiveness
Economists’ favorite criterion of effectiveness is cost-effectiveness. Cost-effectiveness is simple in concept (how much gain did the program cause at what cost?), but in fact there are two big elements of cost-effectiveness that are very difficult to determine:

1. Cost
2. Effectiveness

Cost should be easy, right? A school buys some service or technology and pays something for it. Well, it’s almost never so clear. When a school uses a given innovation, there are usually costs beyond the purchase price. For example, imagine that a school purchases digital devices for all students, loaded with all the software they will need. Easy, right? Wrong. Should you count in the cost of the time the teachers spend in professional development? The cost of tech support? Insurance? Security costs? The additional electricity required? Space for storage? Additional loaner units to replace lost or broken units? The opportunity costs for whatever else the school might have chosen to do?

Here is an even more difficult example. Imagine a school starts a tutoring program for struggling readers using paraprofessionals as tutors. Easy, right? Wrong. There is the cost for the paraprofessionals’ time, of course, but what if the paraprofessionals were already on the school’s staff? If so, then a tutoring program may be very inexpensive, but if additional people must be hired as tutors, then tutoring is a far more expensive proposition. Also, if paraprofessionals already in the school are no longer doing what they used to do, might this diminish student outcomes?

Then there is the problem with effectiveness. As I explained in a recent blog, the meaning of effect sizes depends on the nature of the studies that produced them, so comparing apples to apples may be difficult. A principal might look at effect sizes for two programs and decide they look very similar. Yet one effect size might be from large-scale randomized experiments, which tend to produce smaller (and more meaningful) effect sizes, while the other might be from less rigorous studies.

Nevertheless, issues of cost and effectiveness do need to be considered. Somehow.
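
To make the difficulty concrete, here is a hypothetical sketch of a cost-effectiveness comparison. Every number in it is invented; the point is only that the picture can shift once hidden costs are counted along with the sticker price.

```python
# All numbers below are hypothetical, purely for illustration.
# Each entry: (effect size, per-student cost under that costing assumption)
scenarios = {
    "Digital devices (sticker price only)":          (0.10, 400.0),
    "Digital devices (+ PD, support, storage)":      (0.10, 550.0),
    "Tutoring (paraprofessionals already on staff)": (0.20, 150.0),
    "Tutoring (new paraprofessionals hired)":        (0.20, 900.0),
}

for name, (effect, cost_per_student) in scenarios.items():
    # One simple metric: effect size purchased per $100 spent per student.
    es_per_100 = effect / (cost_per_student / 100.0)
    print(f"{name:<48} ES={effect:+.2f}  ${cost_per_student:>6,.0f}/student  "
          f"{es_per_100:.3f} ES per $100")
```

Which costs you decide to count changes which program looks like the better buy, which is exactly why the “cost” half of cost-effectiveness is so hard to pin down.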

Evidence from Similar Schools
Clearly, a school staff would want to know that a given program has been successful in schools like theirs. For example, schools serving many English learners, or schools in rural areas, or schools in inner-city locations, might be particularly interested in data from similar schools. At a minimum, they should want to know that the developers have worked in schools like theirs, even if the evidence only exists from less similar schools.

Immediate and Long-Term Payoffs
Another factor in program impacts is the likelihood that a program will solve a very serious problem that may ultimately have a big effect on individual students and perhaps save a lot of money over time. For example, a very expensive parent training program may make a big difference for students with serious behavior problems. If the program produces lasting effects (documented in the research), its high cost might be justified, especially if it reduces the need for even more expensive interventions, such as special education placement, expulsion, or incarceration.

Sustainability
Programs that either produce lasting impacts, or those that can be readily maintained over time, are clearly preferable to those that have short-term impacts only. In education, long-term impacts are not typically measured, but sustainability can be determined by the cost, effort, and other elements required to maintain an intervention. Most programs get a lot cheaper after the first year, so sustainability can usually be assumed. This means that even programs with modest effect sizes could bring about major changes over time.

Breadth of Impact
Some educational interventions with modest effect sizes might be justified because they apply across entire schools and for many years. For example, effective coaching for principals might have a small effect overall, but if that effect is seen across thousands of students over a period of years, it might be more than worthwhile. Similarly, training teachers in methods that become part of their permanent repertoire, such as cooperative learning, teaching metacognitive skills, or classroom management, might affect hundreds of students per teacher over time.

Low-Hanging Fruit
Some interventions may have either modest impacts on students in general, or strong outcomes for only a subset of students, but be so inexpensive or easy to adopt and implement that it would be foolish not to do so. One example might be making sure that disadvantaged students who need eyeglasses are assessed and given glasses. Not everyone needs glasses, but for those who do this makes a big difference at low cost. Another example might be implementing a whole-school behavior management approach like Positive Behavior Interventions and Support (PBIS), a low-cost, proven approach any school can implement.

Comprehensiveness
Schools have to solve many quite different problems, and they usually do this by pulling various solutions off of various shelves. The problem is that this approach can be uncoordinated and inefficient. The different elements may not link up well with each other, may compete for the time and attention of the staff, and may cost a lot more than a unified, comprehensive solution that addresses many objectives in a planful way. A comprehensive approach is likely to have a coherent plan for professional development, materials, software, and assessment across all program elements. It is likely to have a plan for sustaining its effects over time and extending into additional parts of the school or additional schools.

Potential Impact
Potential impact is the sum of all the factors that make a given program or a coordinated set of programs effective in the short and long term, broad in its impact, focused on preventing serious problems, and cost-effective. There is no numerical standard for potential impact, but the concept is just intended to give educators making important choices for their kids a set of things to consider, beyond effect size and statistical significance alone.

Sorry. I wish this were simple. But kids are complex, organizations are complex, and systems are complex. It’s always a good idea for education leaders to start with the evidence but then think through how programs can be used as tools to transform their particular schools.

Seeking Jewels, Not Boulders: Learning to Value Small, Well-Justified Effect Sizes

One of the most popular exhibits in the Smithsonian Museum of Natural History is the Hope Diamond, one of the largest and most valuable in the world. It’s always fun to see all the kids flow past it saying how wonderful it would be to own the Hope Diamond, how beautiful it is, and how such a small thing could make you rich and powerful.

The diamonds are at the end of the Hall of Minerals, which is crammed full of exotic minerals from all over the world. These are beautiful, rare, and amazing in themselves, yet most kids rush past them to get to the diamonds. But no one, ever, evaluates the minerals against one another according to their size. No one ever says, “you can have your Hope Diamond, but I’d rather have this giant malachite or feldspar.” Just getting into the Smithsonian, kids go by boulders on the National Mall far larger than anything in the Hall of Minerals, perhaps climbing on them but otherwise ignoring them completely.

Yet in educational research, we often focus on the size of study effects without considering their value. In a recent blog, I presented data from a paper with my colleague Alan Cheung analyzing effect sizes from 611 studies evaluating reading, math, and science programs, K-12, that met the inclusion standards of our Best Evidence Encyclopedia. One major finding was that in randomized evaluations with sample sizes of 250 students (10 classes) or more, the average effect size across 87 studies was only +0.11. Smaller randomized studies had effect sizes averaging +0.22, large matched quasi-experiments +0.17, and small quasi-experiments, +0.32. In this blog, I want to say more about how these findings should make us think differently about effect sizes as we increasingly apply evidence to policy and practice.

Large randomized experiments (RCTs) with significant positive outcomes are the diamonds of educational research: rare, often flawed, but incredibly valuable. The reason they are so valuable is that such studies are the closest indication of what will happen when a given program goes out into the real world of replication. Randomization removes the possibility that self-selection may account for program effects. The larger the sample size, the less likely it is that the experimenter or developer could closely monitor each class and mentor each teacher beyond what would be feasible in real-life scale up. Most large-scale RCTs use clustering, which usually means that the treatment and randomization take place at the level of the whole school. A cluster randomized experiment at the school level might require recruiting 40 to 50 schools, perhaps serving 20,000 to 25,000 students. Yet such studies might nevertheless be too “small” to detect an effect size of, say, 0.15, because it is the number of clusters, not the number of students, that matters most!
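
The “clusters matter more than students” point can be sketched with the standard minimum detectable effect size (MDES) approximation for a two-level cluster randomized trial. Everything in the sketch below is an assumption for illustration: the intraclass correlation, the covariate R-squared values, and the 2.8 multiplier (roughly alpha = .05, two-tailed, at 80% power); none of these numbers comes from the studies discussed in this post.

```python
from math import sqrt

def mdes_cluster_rct(n_schools, students_per_school, icc=0.20,
                     r2_between=0.75, r2_within=0.0, multiplier=2.8):
    """Approximate minimum detectable effect size for a two-level cluster RCT
    with half the schools assigned at random to the program.
    multiplier ~ 2.8 corresponds roughly to alpha=.05 (two-tailed), 80% power;
    icc and the R-squared values are illustrative assumptions."""
    j, n = n_schools, students_per_school
    between = icc * (1 - r2_between) / (0.25 * j)           # school-level term
    within = (1 - icc) * (1 - r2_within) / (0.25 * j * n)   # student-level term
    return multiplier * sqrt(between + within)

# The number of schools drives the MDES; piling on students barely moves it.
print(round(mdes_cluster_rct(50, 500), 2))    # ~0.18 with 50 schools, 25,000 students
print(round(mdes_cluster_rct(50, 5000), 2))   # ~0.18 with ten times as many students
print(round(mdes_cluster_rct(72, 500), 2))    # ~0.15 takes roughly 70+ schools
```

Under these assumptions, a 50-school study can detect an effect of roughly +0.18 no matter how many students each school enrolls, while detecting +0.15 takes on the order of 70 schools.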

The problem is that we have been looking for much larger effect sizes, and all too often not finding them.  Traditionally, researchers recruit enough schools or classes to reliably detect an effect size as small as +0.20.  This means that many studies report effect sizes that turn out to be larger than average for large RCTs, but are not statistically significant (at p<.05), because they are less than +0.20.  If researchers did recruit samples of schools large enough to detect an effect size of +0.15, this would greatly increase the costs of such studies.  Large RCTs are already very expensive, so substantially increasing sample sizes could end up requiring resources far beyond what educational research is likely to see any time soon, or could greatly reduce the number of studies that are funded.

These issues have taken on greater importance recently due to the passage of the Every Student Succeeds Act, or ESSA, which encourages use of programs that meet strong, moderate, or promising levels of evidence. The “strong” category requires that a program have at least one randomized experiment that found a significant positive effect. Such programs are rare.

If educational researchers were mineralogists, we’d be pretty good at finding really big diamonds, but the little, million-dollar diamonds, not so much. This makes no sense in diamonds, and no sense in educational research.

So what do we do? I’m glad you asked. Here are several ways we could proceed to increase the number of programs successfully evaluated in RCTs.

1. For cluster randomized experiments at the school level, something has to give. I’d suggest that for such studies, the p value should be increased to .10 or even .20. A p value of .05 is a long-established convention, indicating that there is only one chance in 20 that the outcomes are due to luck. Yet one chance in 10 (p=.10) may be sufficient in studies likely to have tens of thousands of students.

2. For studies in the past, as well as in the future, replication should be considered the same as large sample size. For example, imagine that two studies of Program X each have 30 schools. Each gets a respectable effect size of +0.20, which would not be significant in either case. Put the two studies together, however, and voilà! The combined analysis of 60 schools would be highly significant, even at p=.05 (see the sketch after this list).

3. Government or foundation funders might fund evaluations in stages. The first stage might involve a cluster randomized experiment of, say, 20 schools, which is very unlikely to produce a significant difference. But if the effect size were perhaps 0.20 or more, the funders might fund a second stage of 30 schools. The two samples together, 50 schools, would be enough to detect a small but important effect.
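
Here is a rough sketch of the combination step described in point 2, using fixed-effect (inverse-variance) pooling of two studies. The +0.20 effect sizes match the example above, but the standard errors are invented; they are just plausible values for cluster randomized studies of about 30 schools each.

```python
from math import sqrt, erf

def p_value_two_tailed(z):
    """Two-tailed p-value for a z statistic, using the normal CDF."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Two hypothetical 30-school studies of "Program X".
studies = [(0.20, 0.115), (0.20, 0.115)]   # (effect size, standard error)

for i, (es, se) in enumerate(studies, 1):
    z = es / se
    print(f"Study {i}:  ES={es:+.2f}  z={z:.2f}  p={p_value_two_tailed(z):.3f}")

# Fixed-effect (inverse-variance) pooling of the two studies.
weights = [1 / se ** 2 for _, se in studies]
pooled_es = sum(w * es for (es, _), w in zip(studies, weights)) / sum(weights)
pooled_se = sqrt(1 / sum(weights))
z = pooled_es / pooled_se
print(f"Pooled:   ES={pooled_es:+.2f}  z={z:.2f}  p={p_value_two_tailed(z):.3f}")
```

With these illustrative numbers, each study alone misses the conventional p<.05 bar, but the pooled estimate clears it comfortably.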

One might well ask why we should be interested in programs that only produce effect sizes of +0.15. Aren’t these effects too small to matter?

The answer is that they are not. Over time, I hope we will learn how to routinely produce better outcomes. Already, we know that much larger impacts are found in studies of certain approaches emphasizing professional development (e.g., cooperative learning, metacognitive skills) and certain forms of technology. I hope and expect that over time, more studies will evaluate programs using methods like those that have been proven to work, and fewer will evaluate those that do not, thereby raising the average effects we find. But even as they are, small but reliable effect sizes are making meaningful differences in the lives of children, and will make much more meaningful differences as we learn from efforts at the Institute of Education Sciences (IES) and the Investing in Innovation (i3)/Education Innovation and Research (EIR) programs.

Small effect sizes from large randomized experiments are the Hope Diamonds of our profession. They also are the best hope for evidence-based improvements for all students.

What Is a Large Effect Size?

Ever since Gene Glass popularized the effect size in the 1970s, readers of research have wanted to know how large an effect size has to be in order to be considered important. Well, stop the presses and sound the trumpet. I am about to tell you.

First let me explain what an effect size is and why you should care. An effect size sums up the difference between an experimental (treatment) group and a control group. It is a fraction in which the numerator is the posttest difference on a given measure, adjusted for pretests and other important factors, and the denominator is the unadjusted standard deviation of the control group or the whole sample. Here is the equation in symbols.

Effect size = (experimental posttest mean − control posttest mean, adjusted for pretests) / (unadjusted standard deviation of the control group or whole sample)
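
To make the formula concrete, here is a small sketch with invented data (the helper functions and all numbers are mine): the numerator is the experimental-control posttest difference adjusted for the pretest with a simple regression (ANCOVA-style) adjustment, and the denominator is the unadjusted standard deviation of the control group’s posttests.

```python
import random
from statistics import mean, stdev

random.seed(1)

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Invented data: pretest and posttest scores for treatment and control students.
def make_group(n, boost):
    pre = [random.gauss(100, 15) for _ in range(n)]
    post = [0.7 * p + random.gauss(30, 10) + boost for p in pre]
    return pre, post

pre_t, post_t = make_group(200, boost=3.0)   # treatment group gets a small true benefit
pre_c, post_c = make_group(200, boost=0.0)   # control group

# Numerator: posttest difference, adjusted for any chance pretest difference
# using the pooled pretest-posttest regression slope.
slope = cov(pre_t + pre_c, post_t + post_c) / cov(pre_t + pre_c, pre_t + pre_c)
adjusted_diff = (mean(post_t) - mean(post_c)) - slope * (mean(pre_t) - mean(pre_c))

# Denominator: the UNADJUSTED standard deviation of the control group's posttests.
effect_size = adjusted_diff / stdev(post_c)
print(f"Effect size: {effect_size:+.2f}")
```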

What is cool about effect sizes is that they can standardize the findings on all relevant measures. This enables researchers to compare and average effects across all sorts of experiments all over the world. Effect sizes are averaged in meta-analyses that combine the findings of many experiments and find trends that might not be easy to find just looking at experiments one at a time.

Are you with me so far?

One of the issues that has long puzzled readers of research is how to interpret effect sizes. When are they big enough to matter for practice? Researchers frequently cite statistician Jacob Cohen, who characterized an effect size of +0.20 as “small,” +0.50 as “medium,” and +0.80 as “large.” However, Bloom, Hill, Black, & Lipsey (2008) point out that Cohen never really supported these criteria. New Zealander John Hattie publishes numerous reviews of reviews of research, routinely finds effect sizes of +0.80 or more, and in fact suggests that educators ignore any teaching method with an average effect size of +0.40 or less. Yet Hattie includes literally everything in his meta-meta-analyses, including studies with no control groups and studies in which the control group never saw the content assessed by the posttest. In studies that do have control groups, and in which experimental and control groups were tested on material they were both taught, effect sizes as large as +0.80, or even +0.40, are very unusual, even in evaluations of one-to-one tutoring by certified teachers.

So what’s the right answer? The answer turns out to mainly depend on just two factors: Sample size, and whether or not students, classes/teachers, or schools were randomly assigned (or assigned by matching) to treatment and control groups. We recently did a review of twelve published meta-analyses including only the 611 studies that met the stringent inclusion requirements of our Best-Evidence Encyclopedia (BEE). (In brief, the BEE requires well-matched or randomized control groups and measures not made up by the researchers.) The average effect sizes in the four cells formed by quasi-experimental/randomized and small/large sample size (splitting at n=250) are as follows.

Average effect sizes by study design and sample size (split at n=250):

                              Small samples (<250)   Large samples (250+)
Matched quasi-experiments            +0.32                  +0.17
Randomized experiments               +0.22                  +0.11

Here is what this chart means. If you look at a small study that meets BEE standards, in which students were matched before being (non-randomly) assigned to treatment and control groups, then the average effect size is +0.32. Studies that use the same sample sizes and design would need to reach an effect size like this just to be at the average. In contrast, if you find a large randomized study, it will need an effect size of only +0.11 to be considered average for its type. If Program A reports an effect size of +0.20 and Program B reports the same, are the programs equally effective? Not if they used different designs. If Program A was evaluated in a large randomized study and Program B in a small quasi-experiment, then Program A is a leader in its class and Program B is a laggard.

This chart only applies to studies that meet our BEE standards, which removes a lot of the awful research that gives Hattie the false impression that everything works, and fabulously.

Using the average of all studies of a given type is not a perfect way to determine what is a large or small effect size, because it reflects only methodology, not educational importance. It’s sort of “grading on a curve,” comparing effect sizes to their peers rather than using a performance criterion. But I’d argue that until something better comes along, this is as good a way as any to say which effect sizes are worth paying attention to, and which are less important.