If you ever go into the Ripley’s Believe It or Not Museum in Baltimore, you will be greeted at the entrance by a statue of the tallest man who ever lived, Robert Pershing Wadlow, a gentle giant at 8 feet, 11 inches in his stocking feet. Kids and adults love to get their pictures taken standing by him, to provide a bit of perspective.
I bring up Mr. Wadlow to explain a phrase I use whenever my colleagues come up with an effect size of more than 1.00. “That’s a 10-foot man,” I say. What I mean, of course, is that while it is not impossible that there could be a 10-foot man someday, it is extremely unlikely, because there has never been a man that tall in all of history. If someone reports seeing one, they are probably mistaken.
In the case of effect sizes, you will never, or almost never, see an effect size of more than +1.00, assuming the following reasonable conditions:
- The effect size compares experimental and control groups (i.e., it is not pre-post).
- The experimental and control groups started at the same level, or they started at similar levels and the researchers statistically controlled for pretest differences.
- The measures involved were independent of the researcher and the treatment, not made by the developers or researchers. The test was not given by the teachers to their own students.
- The treatment was provided by ordinary teachers, not by researchers, and could in principle be replicated widely in ordinary schools. The experiment had a duration of at least 12 weeks.
- There were at least 30 students and 2 teachers in each treatment group (experimental and control).
If these conditions are met, the chances of finding effect sizes of more than +1.00 are about the same as the chances of finding a 10-foot man. That is, zero.
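For readers who want to see the arithmetic behind these numbers, here is a minimal sketch of how a standardized effect size (Cohen's d, the usual statistic behind numbers like +0.40 or +1.00) is computed and interpreted. The scores below are invented for illustration, not drawn from any real study.

```python
from statistics import NormalDist, mean, stdev

def cohens_d(treatment, control):
    """Standardized mean difference between two groups, using the pooled SD."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Invented posttest scores for two small groups:
treatment = [75, 82, 68, 90, 77, 85, 71, 80]
control = [72, 79, 65, 87, 74, 82, 68, 77]
d = cohens_d(treatment, control)

# One interpretation: the average treated student scores at roughly the
# 100 * Phi(d) percentile of the control group's distribution.
percentile = NormalDist().cdf(d) * 100
print(f"d = {d:.2f}; average treated student at control percentile {percentile:.0f}")
```

An effect size of +1.00 would put the average experimental student at about the 84th percentile of the control group, which gives a sense of why, under the conditions above, such a result is a 10-foot man.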
I was thinking about the 10-foot man when I was recently asked by a reporter about the “two sigma effect” claimed by Benjamin Bloom and much discussed in the 1970s and 1980s. Bloom’s students did a series of experiments in which students were taught about a topic none of them knew anything about, usually principles of sailing. After a short period, students were tested. Those who did not achieve at least 80% (defined as “mastery”) on the tests were tutored by University of Chicago graduate students long enough to ensure that every tutored student reached mastery. The purpose of this demonstration was to make a claim that every student could learn whatever we wanted to teach them, and the only variable was instructional time, as some students need more time to learn than others. In a system in which enough time could be given to all, “ability” would disappear as a factor in outcomes. Also, in comparison to control groups who were not taught about sailing at all, the effect size was often more than 2.0, or two sigma. That’s why this principle was called the “two sigma effect.” Doesn’t the two sigma effect violate my 10-foot man principle?
No, it does not. The two sigma studies used experimenter-made tests of content taught to the experimental but not the control groups. They used University of Chicago graduate students providing far more tutoring (as a percentage of initial instruction) than any school could ever provide. The studies were very brief, and the sample sizes were small. The two sigma experiments were designed to prove a point, not to evaluate a feasible educational method.
A more recent example of the 10-foot man principle is found in Visible Learning, the currently fashionable book by John Hattie claiming huge effect sizes for all sorts of educational treatments. Hattie asks the reader to ignore any educational treatment with an effect size of less than +0.40, and reports many whole categories of teaching methods with average effect sizes of more than +1.00. How can this be?
The answer is that such effect sizes, like two sigma, do not incorporate the conditions I laid out. Instead, Hattie throws into his reviews entire meta-analyses, which may include pre-post studies, studies using researcher-made measures, studies with tiny samples, and so on. For practicing educators, such effect sizes are useless. An educator knows that all children grow from pre- to posttest. They would not (and should not) accept measures made by researchers. The largest known effect sizes that do meet the conditions above come from studies of one-to-one tutoring, with effect sizes up to +0.86. Still not +1.00. And what could be more effective than the best one-to-one tutoring?
It’s fun to visit Mr. Wadlow at the museum, and to imagine what an even taller man could do on a basketball team, for example. But if you see a 10-foot man at Ripley’s Believe It or Not, or anywhere else, here’s my suggestion. Don’t believe it. And if you visit a museum of famous effect sizes that displays a whopping effect size of +1.00, don’t believe that, either. It doesn’t matter how big effect sizes are if they are not valid.
A 10-foot man would be a curiosity. An effect size of +1.00 is a distraction. Our work on evidence is too important to spend our time looking for 10-foot men, or effect sizes of +1.00, that don’t exist.
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.
8 thoughts on “Effect Sizes and the 10-Foot Man”
I truly enjoy your articles.
Can you elaborate on the first sentence of #3 in the 10-foot man list? I don’t understand what you mean by measures.
Thanks for your note. Like Mr. Wadlow, I can see you are a big fan!
What I meant in #3 is that measures of achievement (e.g., reading, math, science) must not be made by the researchers or developers. Research finds that effect sizes from such measures are about twice as large as those obtained in studies with independent measures (e.g., published standardized tests, state tests, etc.). For more on this, see the “Standards and Procedures” section of http://www.evidenceforessa.org, under FAQs.
Are you not using a ‘straw man’ technique here? You have found one error in Hattie’s work and used it to condemn the whole edifice.
You haven’t pointed out that the list produced by the UK Education Endowment Foundation, which follows your rules more carefully, produces broadly similar results to Hattie (once you have removed the correlations at the top of his list).
The problem with the “Hattie’s results are rubbish” argument is that it leaves the reader even more at sea than they were before they encountered evidence-based material. Our combined list, with five sources, including Hattie, shows a bigger picture.
The other problem with ‘straw man’ is that it has an unspoken implication: “Hattie is rubbish – so listen to me.”
What you are pointing at is not a ‘straw man’ fallacy but a ‘composition fallacy’. In your view, you believe that Mr. Slavin condemns Hattie’s results as total rubbish because part of what he presents is rubbish. That is like asserting that some company is treacherous because you believe some guy who works there is treacherous.
If Mr. Slavin should answer: “I did not state that Hattie’s results are total rubbish” (which he did not state), you are guilty of a straw man fallacy yourself, because you have attributed an erroneous point of view (or argument) to your opponent.
Furthermore, you commit the fallacy “ad consequentiam” by denying the validity of a particular argument (“Hattie’s results are rubbish”) because you fear a particular consequence (“it leaves the reader even more at sea”). Statements are not true or false just because they have a certain consequence that you like or dislike. Similarly, your “combined list with 5 sources” is not commendable per se, just because it would leave any reader less at sea.
Lastly, you erroneously accuse Mr. Slavin of a fallacy “ad ignorantiam”. If Mr. Slavin should claim, or imply, that he is totally right just because Hattie is wrong (or cannot be proven to be right), that would indeed be an argumentum ad ignorantiam. But Mr. Slavin says no such thing, and the ‘implication’ you mention stems from your own imagination.
This is an interesting blog post, but it still appears to have the problem of equating effect size with the effectiveness of an intervention. A research paper by Simpson (https://goo.gl/SJZqZp) shows this is not the case: exactly the same intervention with a different control treatment, a different sample, a different measure or a different analytic design can yield very different effect sizes.
So the claim that the average of a particular set of studies of one-to-one interventions is 0.89 tells us only that this group of studies is quite clear on average, not that one-to-one interventions are more or less effective than, for example, using intelligent tutoring systems.
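Simpson’s point can be illustrated with a few lines of arithmetic. In this sketch (all numbers invented), the intervention produces exactly the same raw gain in both cases; only the spread of the outcome measure changes, and the effect size more than doubles.

```python
# Invented numbers for illustration: the same raw gain in test points,
# measured on two different outcome scales.
raw_gain = 5.0  # treatment mean minus control mean

# A narrow, researcher-made test of just-taught content tends to have a
# small standard deviation; a broad standardized test has a larger one.
sd_narrow_test = 6.0
sd_broad_test = 15.0

# Effect size is the raw gain divided by the (pooled) standard deviation,
# so the identical intervention looks very different on the two measures.
d_narrow = raw_gain / sd_narrow_test
d_broad = raw_gain / sd_broad_test
print(f"d on narrow test: {d_narrow:.2f}")  # 0.83
print(f"d on broad test: {d_broad:.2f}")    # 0.33
```

Nothing about the intervention changed between the two lines; only the denominator did, which is why the effect size is a property of the whole study design, not of the intervention alone.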
This is a reason, Mike, that the EEF toolkit and Hattie have about the same ordering (after you adjust for Hattie’s horrible howler of combining correlation and intervention studies): it is easier to conduct clear studies in feedback and metacognition than in class size or behaviour interventions. It just doesn’t mean that feedback should therefore be considered more effective than behaviour interventions. And, by the way, the studies collected in the EEF toolkit don’t appear to be much more careful about following the rules in this blog post: very many of the included studies use researcher-designed measures, have groups that are not balanced by randomisation or matching, had treatments given by researchers, and so on.
The key mistake (which in a recent entertaining interview, https://goo.gl/4yrMnF, Simpson describes as “punching ourselves in the face”!) is treating effect size (which is a measure of the whole study) as a measure of the intervention (only one component of the study). There may be a Latin name for this fallacy which hminkema could look up for us!
Afterthought: I don’t know why ‘at least 12 weeks’ is in a list of features which keep effect sizes low. A larger dose of a treatment should give higher effect sizes. Counterfactually, if you randomly assigned children at birth to one-to-one tutoring by the best educators in the world or to the worst education in some war-torn country and then had them sit a standardised test, would they really not be more than one SD apart?
Hmm. Have you fallen for a similar error? You have found one paper which says that effect-size meta-analyses can have faults…and so we should dismiss them all.
The problem is (as Geoff Petty repeatedly has pointed out) that you need a mechanism for your decisions about what to try in your classroom to improve learning. If you don’t use meta-analyses, you will probably make your decision using your own judgement. We know for sure that doesn’t work!
For all their faults, compilations of lots of evidence are the most reliable thing we have.
Sorry. I *cited* one paper, but I’ve *found* lots. Three others worth reading are:
Berk, R. (2011). Evidence-based versus junk-based evaluation research: Some lessons from 35 years of the Evaluation Review. Evaluation Review, 35(3), 191–203.
Bergeron, P.-J., & Rivard, L. (2017). How to engage in pseudoscience with real data: A criticism of John Hattie’s arguments in Visible Learning from the perspective of a statistician. McGill Journal of Education / Revue des sciences de l’éducation de McGill, 52(1), 237–246.
Wiliam, D. (2016). Leadership for Teacher Learning (chapter 3).
But these and the Simpson paper cite still others. The issue is that rather than counting how many papers there are, we should read the arguments in them – Berk in particular argues that the assumptions needed for a meta-analysis to give a meaningful result can never occur in real-world studies.
The idea of ‘but we need a mechanism for decisions’ is beautifully dealt with by Simpson in the podcast (https://goo.gl/4yrMnF), though he takes the idea from the statistician David Freedman. What he doesn’t say is that there are many more alternatives than relying on what Berk calls ‘junk-based evaluation’ in the form of meta-analyses: read the original studies, read systematic reviews, look at realist evaluations. None of this is about uninformed judgement, but none of it is handing over responsibility for decision making to a numerical process which doesn’t work. I certainly see no evidence that meta-analysis is ‘the most reliable thing we have’.
What do you think is the most reliable method of choosing what to do?