Once upon a time, there was a very famous restaurant called The Hummingbird. It was known the world over for its unique specialty: Hummingbird Stew. It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird? You’d have to catch dozens of them just for one bowl of stew.
One day, an experienced restaurateur came to The Hummingbird and asked to speak to the owner. When they were alone, the visitor said, “You have quite an operation here! But I have been in the restaurant business for many years, and I have always wondered how you do it. No one can make money selling Hummingbird Stew! Tell me how you make it work, and I promise on my honor to keep your secret to my grave. Do you…mix just a little bit?”
The Hummingbird’s owner looked around to be sure no one was listening. “You look honest,” he said. “I will trust you with my secret. We do mix in a bit of horsemeat.”
“I knew it!” said the visitor. “So tell me, what is the ratio?”
“One to one.”
“Really!” said the visitor. “Even that seems amazingly generous!”
“I think you misunderstand,” said the owner. “I meant one hummingbird to one horse!”
In education, we write a lot of reviews of research. These are often very widely cited and can be very influential. Because of the work my colleagues and I do, we have occasion to read a lot of reviews. Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically. Well-done reviews can reveal patterns of findings that are of great value to both researchers and educators. They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.
However, all too many reviews are deeply flawed. Frequently, reviews of research make it impossible to check the validity of the findings of the original studies. As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew. Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies. If the individual studies are of good quality, this is wonderfully useful. But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles. Few readers are willing to do this.
Recently, I have been looking at a lot of reviews, all of them published, often in top journals. One published review used only pre-post gains. Presumably, if the reviewers found a study with a control group, they would have ignored the control group data! Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control comparisons do, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.
I have also recently seen reviews that mix studies with and without control groups (i.e., pre-post gains), and studies with and without pretests. Without pretests, experimental and control groups may have started at very different points, and these differences simply carry over to the posttests. A review that accepts this jumble of experimental designs makes no sense. Treatments evaluated using pre-post designs will almost always look far more effective than those evaluated using experimental-control comparisons.
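To make the arithmetic concrete, here is a toy sketch in Python with made-up numbers (the growth and effect figures are hypothetical, not drawn from any study). If all students gain, say, 0.4 standard deviations over a year regardless of treatment, a pre-post analysis credits the program with that entire gain, while an experimental-control comparison subtracts it out:

```python
# Hypothetical numbers: why pre-post gains exaggerate effects
# compared with experimental-control comparisons.

sd = 1.0               # pooled standard deviation of the test (assumed)
normal_growth = 0.40   # gain all students make anyway, in SD units (assumed)
program_effect = 0.10  # true added value of the program, in SD units (assumed)

# Pre-post design: credits the program with ALL of the year's gain.
pre_post_es = (normal_growth + program_effect) / sd

# Experimental-control design: the control group grows too,
# so only the program's added value remains.
exp_control_es = (normal_growth + program_effect - normal_growth) / sd

print(f"Pre-post effect size: {pre_post_es:.2f}")
print(f"Experimental-control effect size: {exp_control_es:.2f}")
```

With these illustrative numbers, the pre-post design reports an effect five times as large as the experimental-control design, for the very same program.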
Many published reviews include results from measures that were made up by program developers. We have documented that analyses using such measures produce outcomes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies rather than longer ones, and from published studies rather than, for example, technical reports.
The biggest problem is that in many reviews, the designs of the individual studies are never described in enough detail to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study themselves to find out which studies used rigorous methods and which did not. Many years ago, I looked into a widely cited review of research on the achievement effects of class size. Study details were lacking, so I had to find and read the original studies. It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis! The studies that reduced class size within the usual range (e.g., comparing classes of 24 to classes of 12) had very small achievement impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.
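The class size anecdote can be illustrated with a toy calculation (all effect sizes below are hypothetical, not taken from the actual review): a handful of large-effect tutoring or tennis studies can dominate the average of many near-zero class-size studies.

```python
# Hypothetical effect sizes illustrating how a few atypical studies
# can inflate the average effect in a review.

def mean(values):
    return sum(values) / len(values)

# Made-up effect sizes for ordinary class-size reductions (e.g., 24 -> 12)
class_size_studies = [0.05, 0.10, 0.00, 0.08, 0.04]

# Made-up effect sizes for one-to-one tutoring and the tennis study
outliers = [1.20, 0.95, 1.50]

print(f"Class-size studies alone:        {mean(class_size_studies):.2f}")
print(f"With tutoring/tennis averaged in: {mean(class_size_studies + outliers):.2f}")
```

A reader shown only the combined average would conclude that reducing class size has a large effect, when nearly all of that effect comes from the three atypical studies.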
It would be great if all reviews excluded studies that used procedures known to inflate effect sizes. At a bare minimum, reviewers should routinely be required to include tables showing the critical design details of each study, and to analyze whether the reported outcomes might be due to studies that used procedures suspected of inflating effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292.
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.