Is All That Glitters Gold-Standard?

In the world of experimental design, studies in which students, classes, or schools are assigned at random to experimental or control treatments (randomized controlled trials, or RCTs) are often referred to as meeting the “gold standard.” Programs with at least one randomized study with a statistically significant positive outcome on an important measure qualify as having “strong evidence of effectiveness” under the definitions in the Every Student Succeeds Act (ESSA). RCTs virtually eliminate selection bias in experiments. That is, readers don’t have to worry that the teachers using an experimental program might have already been better or more motivated than those who were in the control group. Yet even RCTs can have such serious flaws as to call their outcomes into question.

A recent article by distinguished researchers Alan Ginsburg and Marshall Smith severely calls into question every single elementary and secondary math study accepted by the What Works Clearinghouse (WWC) as “meeting standards without reservations,” which in practice requires a randomized experiment. If they were right, then the whole concept of gold-standard randomized evaluations would go out the window, because the same concerns would apply to all subjects, not just math.

Fortunately, Ginsburg & Smith are mostly wrong. They identify, and then discard, 27 studies accepted by the WWC. In my view, they are right about five. They raise some useful issues about the rest, but not damning ones.

The one area in which I fully agree with Ginsburg & Smith (G&S henceforth) relates to studies that use measures made by the researchers. In a recent paper with Alan Cheung and an earlier one with Nancy Madden, I reported that use of researcher-made tests resulted in greatly overstated effect sizes. Neither WWC nor ESSA should accept such measures.

From this point on, however, G&S are overly critical. First, they reject all studies in which the developer was one of the report authors. However, the U.S. Department of Education has been requiring third-party evaluations in its larger grants for more than a decade. This is true in IES, i3, and NSF (scale-up) grants, for example, and in England’s Education Endowment Foundation (EEF). A developer may be listed as an author, but it’s been a long time since a developer could get his or her thumb on the scale in federally funded research. Even studies funded by publishers almost universally use third-party evaluators.

G&S complain that 25 of 27 studies evaluated programs in their first year, compromising fidelity. This is indeed a problem, but it can only affect outcomes in a negative direction. Programs showing positive outcomes in their first year may be particularly powerful.

G&S express concern that half of studies did not state what curriculum the control group was using. This would be nice to know, but does not invalidate a study.

G&S complain that in many cases the amount of instructional time for the experimental group was greater than that for the control group. This could be a problem, but given the findings of research on allocated time, it is unlikely that time alone makes much of a difference in math learning. It may be more sensible to see extra time as a question of cost-effectiveness. Did 30 extra minutes of math per day implementing Program X justify the costs of Program X, including the cost of adding the time? Future studies might evaluate the value added of 30 extra minutes of ordinary instruction, but does anyone expect this to have a large impact?

Finally, G&S complain that most curricula used in WWC-accepted RCTs are outdated. This could be a serious concern, especially as Common Core and other college- and career-ready standards are adopted in most states. However, recall that at the time the RCTs were done, the experimental and control groups were subject to the same standards, so if the experimental group did better, the program is worth considering as an innovation. The reality is that any program in active dissemination must update its content to meet new standards. A program proven effective before Common Core and then updated to align with Common Core standards is not proven for certain to improve Common Core outcomes, for example, but it is a very good bet. A school or district considering adopting a given proven program might well check to see that it meets current standards, but it would be self-defeating and unnecessary to demand that every program re-prove its effectiveness every time standards change.

Randomized experiments in education are not perfect (neither are randomized experiments in medicine or other fields). However, they provide the best evidence science knows how to produce on the effectiveness of innovations. It is entirely legitimate to raise issues about RCTs, as Ginsburg & Smith do, but rejecting what we do know until perfection is achieved would cut off the best avenue we have for progress toward solid, scientifically defensible reform in our schools.

Educationists and Economists

I used to work part time in England, and I’ve traveled around the world a good bit speaking about evidence-based reform in education and related topics. One of the things I find striking in country after country is that at the higher levels, education is not run by educators. It is run by economists.

In the U.S., this is also true, though it’s somewhat less obvious. The main committees in Congress that deal with education are the House Education and the Workforce Committee and the Senate Health, Education, Labor, and Pensions (HELP) Committee. Did you notice the words “workforce” and “labor”? That’s economists. Further, politicians listen to economists, because they consider them tough-minded, data-driven, and fact-friendly. Economists see education as contributing to the quality of the workforce, now and in the future, and this makes them influential with politicians.

A lot of the policy prescriptions that get widely discussed and implemented broadly are the sorts of things economists love to dream up. For example, they are partial to market incentives, new forms of governance, rewards and punishments, and social impact bonds. Individual economists, and the politicians who listen to them, take diverse positions on these policies, but the point is that economists rather than educators often set the terms of the debates on both sides. As one example, educators have been talking about long-term impacts of quality preschool for 30 years, but when Nobel Prize-winning economist James Heckman took up the call, preschool became a top priority of the Obama Administration.

I have nothing against economists. Some of my best friends are economists. But here is why I am bringing them up.

Evidence-based reform is creating a link between educationists and economists, and thereby to the politicians who listen to them, that did not exist before. Evidence-based reform speaks the language that economists insist on: randomized evaluations of replicable programs and practices. When an educator develops a program, successfully evaluates it at scale, and shows it can be replicated, this gives economists a tangible tool they can show will make a difference in policy. Other research designs are simply not as respected or accepted. But an economist with a proven program in hand has a measurable, powerful means to affect policy and help politicians make wise use of resources.

If we want educational innovation and research to matter to public policy, we have to speak truth to power, in the language of power. And that language is increasingly the language of rigorous evidence. If we keep speaking it, our friends the economists will finally take evidence from educational research seriously, and that is how policy will change to improve outcomes for children on a grand scale.

Liberating the Camel

Many years ago, in the time of the Shah, I hitchhiked from London to Iran. The highlight of my trip to Iran was visiting Isfahan, a beautiful, ancient royal city.

I was visiting the fabled Isfahan bazaar, when a young man offered to show me around behind the scenes. Just behind a spice shop was a room with a large stone trough. Fitted in the trough was a huge mill wheel with a wooden axle, which was attached to a blindfolded camel. The camel pushed the axle and wheel around the trough, crushing turmeric seeds continually added to the trough.

As far as I could tell, the camel was happy about this arrangement. He seemed well fed and cared for, and he got to go on regular night journeys. By grinding turmeric and I’m sure other spices, the camel was contributing an important service.

I bring up this long-ago camel because at the moment, I am writing the 12th edition of my educational psychology text. That means 36 years of writing and revising. I follow a regular 3-year cycle of researching, writing, and revision. Like the camel, I’m well fed and cared for. Like the camel, I enjoy the journey, and I’m actually producing something of value, or so I hope.

The problem is that both I and the camel get in a rut after a while. In the case of education, progress up until recently has been slow, and a lot of what I added to my text each time I revised it was, let’s be honest, just updating established educational principles by citing the latest people to say or demonstrate them, rather than truly new methods or theories with strong evidence that could change practice for the better. Each revision did contain exciting advances, but not so many.

The evidence-based reform movement is slowly taking the (metaphorical) blindfold off the (metaphorical) camel. Instead of walking around the same well-trodden paths, I and others have more and more to say that is not just the same old accepted wisdom, but that genuinely moves the field forward. New teaching methods, new technologies, new professional development methods, and new understandings about how the brain works are opening up extraordinary possibilities for change. Someday, our camel friend will retire and be replaced with a mechanical spice grinder, lose his blindfold entirely, and open up his eyes to the fantastic possibilities that were always there but that he could not access.

Our liberated camel may no longer want to grind spices, but no matter. Similarly, the world of education may find far more effective strategies once it is liberated with evidence to see new paths to effective practice. And education will become exhilaration for students and their empowered teachers. No longer a grind.

What Is a Large Effect Size?

Ever since Gene Glass popularized the effect size in the 1970s, readers of research have wanted to know how large an effect size has to be in order to be considered important. Well, stop the presses and sound the trumpet. I am about to tell you.

First let me explain what an effect size is and why you should care. An effect size sums up the difference between an experimental (treatment) group and a control group. It is a fraction in which the numerator is the posttest difference on a given measure, adjusted for pretests and other important factors, and the denominator is the unadjusted standard deviation of the control group or the whole sample. Here is the equation in symbols.

ES = (adjusted posttest mean of experimental group − adjusted posttest mean of control group) / (unadjusted standard deviation)
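For readers who like to see the arithmetic, here is a minimal sketch of that fraction in Python. It assumes the adjusted means have already been computed elsewhere (for example, by a statistical model that controls for pretests), and it uses the control group's raw posttest scores for the denominator. The function name and all of the numbers are hypothetical, made up purely for illustration.

```python
from statistics import stdev


def effect_size(adjusted_mean_treatment, adjusted_mean_control, control_posttest_scores):
    """Adjusted posttest difference divided by the unadjusted control-group SD."""
    # Sample standard deviation of the raw (unadjusted) control-group posttests.
    sd_control = stdev(control_posttest_scores)
    return (adjusted_mean_treatment - adjusted_mean_control) / sd_control


# Hypothetical numbers, for illustration only (not from any real study).
control_scores = [40, 45, 50, 55, 60]
es = effect_size(53.0, 50.0, control_scores)
print(f"{es:+.2f}")
```

The design point is that the numerator is adjusted while the denominator is not, which is what the definition above calls for. Using the whole-sample standard deviation instead would just mean passing in all students' posttest scores.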


What is cool about effect sizes is that they can standardize the findings on all relevant measures. This enables researchers to compare and average effects across all sorts of experiments all over the world. Effect sizes are averaged in meta-analyses that combine the findings of many experiments and find trends that might not be easy to find just looking at experiments one at a time.
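To make the idea of averaging across experiments concrete, here is a small sketch. It assumes a simple sample-size-weighted mean, which is only one of several possible schemes (published meta-analyses often weight by inverse variance instead), and the function name and study values are hypothetical.

```python
def weighted_mean_effect_size(studies):
    """Average effect sizes, weighting each study by its sample size.

    `studies` is a list of (effect_size, sample_size) pairs.
    Weighting by sample size is an assumption made for this sketch.
    """
    total_n = sum(n for _, n in studies)
    if total_n == 0:
        raise ValueError("need at least one study with a nonzero sample size")
    return sum(es * n for es, n in studies) / total_n


# Hypothetical studies, for illustration only.
hypothetical_studies = [(0.30, 100), (0.10, 300)]
pooled = weighted_mean_effect_size(hypothetical_studies)
print(f"{pooled:+.2f}")
```

Because the larger study counts for more, the pooled value sits closer to its effect size than a simple unweighted average would.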

Are you with me so far?

One of the issues that has long puzzled readers of research is how to interpret effect sizes. When are they big enough to matter for practice? Researchers frequently cite statistician Jacob Cohen, who defined an effect size of +0.20 as “small,” +0.50 as “medium,” and +0.80 as “large.” However, Bloom, Hill, Black, & Lipsey (2008) claim that Cohen never really supported these criteria. New Zealander John Hattie publishes numerous reviews of reviews of research, and routinely finds effect sizes of +0.80 or more, and in fact suggests that educators ignore any teaching method with an average effect size of +0.40 or less. Yet Hattie includes literally everything in his meta-meta-analyses, including studies with no control groups, studies in which the control group never saw the content assessed by the posttest, and so on. In studies that do have control groups and in which experimental and control groups were tested on material they were both taught, effect sizes as large as +0.80, or even +0.40, are very unusual, even in evaluations of one-to-one tutoring by certified teachers.

So what’s the right answer? The answer turns out to depend mainly on just two factors: sample size, and whether students, classes/teachers, or schools were randomly assigned (or assigned by matching) to treatment and control groups. We recently did a review of twelve published meta-analyses including only the 611 studies that met the stringent inclusion requirements of our Best-Evidence Encyclopedia (BEE). (In brief, the BEE requires well-matched or randomized control groups and measures not made up by the researchers.) The average effect sizes in the four cells formed by quasi-experimental/randomized and small/large sample size (splitting at n=250) are as follows.


Here is what this chart means. If you look at a study that meets BEE standards and students were matched before being (non-randomly) assigned to treatment and control groups, then the average effect size is +0.32. Studies that use the same sample sizes and design would need to reach an effect size like this to be at the average. In contrast, if you find a large randomized study, it will need an effect size of only +0.11 to be considered average for its type. If Program A reports an effect size of +0.20 and Program B reports the same, are the programs equally effective? Not if they used different designs. If Program A used a large randomized study design and Program B a small quasi-experiment, then Program A is a leader in its class and Program B is a laggard.
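The “compare each study to its peers” logic can be sketched in a few lines of Python. Only the two averages quoted in the text are included. Treating +0.32 as the benchmark for small matched (quasi-experimental) studies is an assumption based on the Program B example, and the other two cells are left out because their values are not given here. The function and dictionary names are hypothetical.

```python
# Average effect sizes quoted in the text, keyed by (design, sample size).
AVERAGE_ES = {
    # Assumption: the +0.32 figure is the small matched (quasi-experimental) cell.
    ("quasi-experimental", "small"): 0.32,
    ("randomized", "large"): 0.11,
}


def compare_to_peers(effect_size, design, sample):
    """Say whether an effect size is above or below the average for its study type."""
    benchmark = AVERAGE_ES.get((design, sample))
    if benchmark is None:
        return "no benchmark available"
    if effect_size > benchmark:
        return "above average for its type"
    if effect_size < benchmark:
        return "below average for its type"
    return "average for its type"


# Programs A and B both report +0.20, but with different designs.
program_a = compare_to_peers(0.20, "randomized", "large")
program_b = compare_to_peers(0.20, "quasi-experimental", "small")
print("Program A:", program_a)
print("Program B:", program_b)
```

The same +0.20 lands on opposite sides of its benchmark depending on the design, which is the whole point of judging effect sizes against studies of the same type.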

This chart only applies to studies that meet our BEE standards, which removes a lot of the awful research that gives Hattie the false impression that everything works, and fabulously.

Using the average of all studies of a given type is not a perfect way to determine what is a large or small effect size, because this method only deals with methodology. It’s sort of “grading on a curve” by comparing effect sizes to their peers, rather than using a performance criterion. But I’d argue that until something better comes along, this is as good a way as any to say which effect sizes are worth paying attention to, and which are less important.

Trans-Atlantic Concord: Tutoring by Paraprofessionals Works

Whenever I speak to skeptical audiences about the enormous potential of evidence-based reform in education, three of the top complaints I always hear are as follows.

  1. In high-quality, randomized experiments, nothing works.
  2. Since educational outcomes depend so much on context, even programs that do work somewhere cannot be assumed to work elsewhere.
  3. Even if a given approach is found to be effective in many contexts, it is unlikely to be scalable to serve large numbers of students and schools.

In light of these criticisms, I was delighted to see a recent blog by Jonathan Sharples (a former colleague at the University of York) at the Education Endowment Foundation (EEF), the main funder of randomized evaluations of educational programs in England. The blog summarizes results from six experiments in England that used what they call teaching assistants (we call them paraprofessionals or aides) to tutor struggling students one-to-one or in small groups, in reading or math, at various grade levels.


Sharples included a table summarizing the results, which I have adapted here:


What is interesting about this table is that although every study was a third-party randomized experiment, the effect sizes fall within a range from moderately positive to very positive (+0.12 to +0.51).

Another interesting thing about the table is that it resembles findings in U.S. studies of tutoring by paraprofessionals. Here is a table of such studies:


The contents of Tables 1 and 2 are heartening in providing relatively consistent positive effects in rigorous studies for replicable, pragmatic interventions for struggling students, a population of great substantive importance. Because paraprofessionals are relatively inexpensive and usually poorly utilized in their current roles, providing them with good training materials and software to work with individuals and small groups of students in dire need of help in reading and math just makes good sense.

However, think back to the criticisms so often thrown at evidence-based reform in general. The findings from tutoring and small-group teaching studies devastate those criticisms:

  1. Nothing works. Really? Not everything works, and it would be nice to have a larger set of positive examples. But tutoring by paraprofessionals (and also by teachers and well-supervised and trained volunteers) definitely works, over and over. There are numerous other programs also proven to work in rigorous studies.
  2. Nothing replicates. Really? Context matters, but here we have relatively consistent findings across the U.S. and England, two very different systems. The effects vary for one-to-one and small-group tutoring, reading and math, and other factors, and we can learn from this variation. But it is clear that across very different contexts, positive effects do replicate.
  3. Nothing scales. Really? Various proven forms of tutoring – by teachers, paraprofessionals, and volunteers – are working right now in schools across the U.S., U.K., and many other countries. Reading Recovery alone, a one-to-one model that uses certified teachers as tutors, works with thousands of teachers worldwide. With the slightest encouragement, proven tutoring models could be expanded to serve many more schools and students, at modest cost.

Proven tutoring models of all types should be a standard offering for every school. More research is always needed to find more effective and cost-effective strategies. But there is no reason whatsoever not to use what we have now. And I hope this example will help critics of evidence-based reform move on to better arguments.