How Can You Tell When The Findings of a Meta-Analysis Are Likely to Be Valid?

In Baltimore, Faidley’s, founded in 1886, is a much loved seafood market inside Lexington Market. Faidley’s used to be a real old-fashioned market, with sawdust on the floor and an oyster bar in the center. People lined up behind their favorite oyster shucker. In a longstanding tradition, the oyster shuckers picked oysters out of crushed ice and tapped them with their oyster knives. If they sounded full, they opened them. But if they did not, the shuckers discarded them.

I always noticed that the line was longer behind the shucker who was discarding the most oysters. Why? Because everyone knew that the shucker who was pickier was more likely to come up with a dozen fat, delicious oysters, instead of, say, nine great ones and three…not so great.

I bring this up today to tell you how to pick full, fair meta-analyses on educational programs. No, you can’t tap them with an oyster knife, but otherwise, the process is similar. You want meta-analysts who are picky about what goes into their meta-analyses. Your goal is to make sure that a meta-analysis produces results that truly represent what teachers and schools are likely to see in practice when they thoughtfully implement an innovative program. If instead you pick the meta-analysis with the biggest effect sizes, you will always be disappointed.

As a special service to my readers, I’m going to let you in on a few trade secrets about how to quickly evaluate a meta-analysis in education.

One very easy way to evaluate a meta-analysis is to look at the overall effect size, probably shown in the abstract. If the overall mean effect size is more than about +0.40, you probably don’t have to read any further. Unless the treatment is tutoring or some other treatment that you would expect to make a massive difference in student achievement, it is rare to find a single legitimate study with an effect size that large, much less an average that large. A very large effect size is almost a guarantee that a meta-analysis is full of studies with design features that greatly inflate effect sizes, not studies with outstandingly effective treatments.

Next, go to the Methods section, which will have within it a section on inclusion (or selection) criteria. It should list the types of studies that were or were not accepted into the meta-analysis. Some of the criteria will have to do with the focus of the meta-analysis, specifying, for example, “studies of science programs for students in grades 6 to 12.” But your focus is on the criteria that specify how picky the meta-analysis is. As one example of a picky set of criteria, here are the main ones we use in Evidence for ESSA and in every analysis we write:

  1. Studies had to use random assignment or matching to assign students to experimental or control groups, with schools and students in each specified in advance.
  2. Students assigned to the experimental group had to be compared to very similar students in a control group, which uses business-as-usual. The experimental and control students must be well matched, within a quarter standard deviation at pretest (ES=+0.25), and attrition (loss of subjects) must be no more than 15% higher in one group than the other at the end of the study. Why? It is essential that experimental and control groups start and remain the same in all ways other than the treatment. Controls for initial differences do not work well when the differences are large.
  3. There must be at least 30 experimental and 30 control students. Analyses of combined effect sizes must control for sample sizes. Why? Evidence finds substantial inflation of effect sizes in very small studies.
  4. The treatments must be provided for at least 12 weeks. Why? Evidence finds major inflation of effect sizes in very brief studies, and brief studies do not represent the reality of the classroom.
  5. Outcome measures must be measures independent of the program developers and researchers. Usually, this means using national tests of achievement, though not necessarily standardized tests. Why? Research has found that tests made by researchers can inflate effect sizes by double, or more, and research-made measures do not represent the reality of classroom assessment.
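For readers who think in code, the five criteria above can be sketched as a simple screening function. This is only an illustration, not the actual Evidence for ESSA procedure: the `Study` record and its field names are hypothetical, and I am assuming that "15% higher" attrition means a gap of 15 percentage points between groups.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Study:
    # Hypothetical record layout; field names are illustrative only.
    name: str
    design: str                    # e.g. "randomized", "matched", "pre-post"
    n_experimental: int
    n_control: int
    pretest_gap_sd: float          # pretest difference in standard deviation units
    attrition_experimental: float  # proportion of students lost, 0.0 to 1.0
    attrition_control: float
    duration_weeks: float
    independent_measure: bool      # True if not made by developers or researchers


def exclusion_reasons(s: Study) -> List[str]:
    """Return the criteria a study fails; an empty list means it qualifies."""
    reasons = []
    # 1. Random assignment or matching only.
    if s.design not in ("randomized", "matched"):
        reasons.append("design")
    # 2. Groups within 0.25 SD at pretest, differential attrition within 15 points
    #    (the percentage-point reading is an assumption).
    if abs(s.pretest_gap_sd) > 0.25:
        reasons.append("pretest")
    if abs(s.attrition_experimental - s.attrition_control) > 0.15:
        reasons.append("attrition")
    # 3. At least 30 students per group.
    if s.n_experimental < 30 or s.n_control < 30:
        reasons.append("sample size")
    # 4. At least 12 weeks of treatment.
    if s.duration_weeks < 12:
        reasons.append("duration")
    # 5. Outcome measures independent of developers and researchers.
    if not s.independent_measure:
        reasons.append("measure")
    return reasons


# Two invented studies, purely for illustration.
solid = Study("Study A", "randomized", 120, 115, 0.10, 0.08, 0.10, 30, True)
weak = Study("Study B", "pre-post", 6, 0, 0.0, 0.0, 0.0, 1, False)
print(exclusion_reasons(solid))  # []
print(exclusion_reasons(weak))   # ['design', 'sample size', 'duration', 'measure']
```

A picky meta-analysis keeps only the studies for which this list comes back empty.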

There may be other details, but these are the most important. Note that these standards have a double focus. Each is intended both to minimize bias and to maximize similarity to the conditions faced by schools. What principal or teacher who cares about evidence would be interested in adopting a program evaluated in comparison to a very different control group? Or in a study with few subjects, or a very brief duration? Or in a study that used measures made by the developers or researchers? This set is very similar to what the What Works Clearinghouse (WWC) requires, except #5 (the WWC requires exclusion of “overaligned” measures, but not developer-/researcher-made measures).

If these criteria are all there in the “Inclusion Standards,” chances are you are looking at a top-quality meta-analysis. As a rule, it will have average effect sizes lower than those you’ll see in reviews without some or all of these standards, but the effect sizes you see will probably be close to what you will actually get in student achievement gains if your school implements a given program with fidelity and thoughtfulness.

What I find astonishing is how many meta-analyses do not have standards this high. Among experts, these criteria are not controversial, except for the last one, which shouldn’t be. Yet meta-analyses are often written, and accepted by journals, with much lower standards, thereby producing greatly inflated, unrealistic effect sizes.

As one example, there was a meta-analysis of Direct Instruction programs in reading, mathematics, and language, published in the Review of Educational Research (Stockard et al., 2018). I have great respect for Direct Instruction, which has been doing good work for many years. But this meta-analysis was very disturbing.

The inclusion and exclusion criteria in this meta-analysis did not require experimental-control comparisons, did not require well-matched samples, and did not require any minimum sample size or duration. It was not clear how many of the outcome measures were made by program developers or researchers, rather than independent of the program.

With these minimal inclusion standards, and a very long time span (back to 1966), it is not surprising that the review found a great many qualifying studies. 528, to be exact. The review also reported extraordinary effect sizes: +0.51 for reading, +0.55 for math, and +0.54 for language. If these effects were all true and meaningful, it would mean that DI is much more effective than one-to-one tutoring, for example.

But don’t get your hopes up. The article included an online appendix that showed the sample sizes, study designs, and outcomes of every study.

First, the authors identified eight experimental designs (plus single-subject designs, which were treated separately). Only two of these would meet anyone’s modern standards of meta-analysis: randomized and matched. The others included pre-post gains (no control group), comparisons to test norms, and other pre-scientific designs.

Sample sizes were often extremely small. Leaving aside single-case experiments, there were dozens of single-digit sample sizes (e.g., six students), often with very large effect sizes. Further, there was no indication of study duration.

What is truly astonishing is that RER accepted this study. RER is the top-rated journal in all of education, based on its citation count. Yet this review, and the Kulik & Fletcher (2016) review I cited in a recent blog, clearly did not meet minimal standards for meta-analyses.

My colleagues and I will be working in the coming months to better understand what has gone wrong with meta-analysis in education, and to propose solutions. Of course, our first step will be to spend a lot of time at oyster bars studying how they set such high standards. Oysters and beer will definitely be involved!

Photo credit: Annette White / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)

References

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research, 86(1), 42-78.

Stockard, J., Wood, T. W., Coughlin, C., & Rasplica Khoury, C. (2018). The effectiveness of Direct Instruction curricula: A meta-analysis of a half century of research. Review of Educational Research, 88(4), 479–507. https://doi.org/10.3102/0034654317751919

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

Meta-Analysis or Muddle-Analysis?

One of the best things about living in Baltimore is eating steamed hard shell crabs every summer. They are cooked in a very spicy blend of seasonings, and with Maryland corn and Maryland beer, these define the very peak of existence for Marylanders. (To be precise, the true culture of the crab also extends into Virginia, but does not really exist more than 20 miles inland from the bay.)

As every crab eater knows, a steamed crab comes with a lot of inedible shell and other inner furniture.  So you get perhaps an ounce of delicious meat for every pound of whole crab. Here is a bit of crab math.  Let’s say you have ten pounds of whole crabs, and I have 20 ounces of delicious crabmeat.  Who gets more to eat?  Obviously I do, because your ten pounds of crabs will only yield 10 ounces of meat. 

[Image: How Baltimoreans learn about meta-analysis.]

All Baltimoreans instinctively understand this from birth.  So why is this same principle not understood by so many meta-analysts?

I recently ran across a meta-analysis of research on intelligent tutoring programs by Kulik & Fletcher (2016),  published in the Review of Educational Research (RER). The meta-analysis reported an overall effect size of +0.66! Considering that the single largest effect size of one-to-one tutoring in mathematics was “only” +0.31 (Torgerson et al., 2013), it is just plain implausible that the average effect size for a computer-assisted instruction intervention is twice as large. Consider that a meta-analysis our group did on elementary mathematics programs found a mean effect size of +0.19 for all digital programs, across 38 rigorous studies (Slavin & Lake, 2008). So how did Kulik & Fletcher come up with +0.66?

The answer is clear. The authors excluded very few studies except for those of less than 30 minutes’ duration. The studies they included used methods known to greatly inflate effect sizes, but they did not exclude or control for them. To the authors’ credit, they then carefully documented the effects of some key methodological factors. For example, they found that “local” measures (presumably made by researchers) had a mean effect size of +0.73, while standardized measures had an effect size of +0.13, replicating findings of many other reviews (e.g., Cheung & Slavin, 2016). They found that studies with sample sizes less than 80 had an effect size of +0.78, while those with samples of more than 250 had an effect size of +0.30. Brief studies had higher effect sizes than longer ones, as many other studies have found. All of this is nice to know, but even knowing it all, Kulik & Fletcher failed to control for any of it, not even to weight by sample size. So, for example, the implausible mean effect size of +0.66 includes a study with a sample size of 33, a duration of 80 minutes, and an effect size of +1.17, on a “local” test. Another had 48 students, a duration of 50 minutes, and an effect size of +0.95. Now, if you believe that 80 minutes on a computer is three times as effective for math achievement as months of one-to-one tutoring by a teacher, then I have a lovely bridge in Baltimore I’d like to sell you.
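To see why failing to weight by sample size matters so much, here is a minimal sketch comparing a simple average of effect sizes with a sample-size-weighted average. The first two studies use the sample sizes and effect sizes quoted above; the third, large study is entirely hypothetical and is included only to show how weighting changes the picture. This is not the procedure Kulik & Fletcher used.

```python
def unweighted_mean(studies):
    """Simple average: every study counts equally, however small."""
    return sum(es for _, es in studies) / len(studies)


def sample_size_weighted_mean(studies):
    """Average in which each study counts in proportion to its sample size."""
    total_n = sum(n for n, _ in studies)
    return sum(n * es for n, es in studies) / total_n


# (sample size, effect size)
studies = [
    (33, 1.17),    # small, brief study described above
    (48, 0.95),    # small, brief study described above
    (1000, 0.10),  # HYPOTHETICAL large study, invented for illustration
]

print(f"{unweighted_mean(studies):.2f}")            # 0.74
print(f"{sample_size_weighted_mean(studies):.2f}")  # 0.17
```

In the unweighted average, two tiny studies with 81 students between them outvote a study of 1,000. Weighting by sample size puts the estimate much closer to what the large study found.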

I’ve long been aware of these problems with meta-analyses that neither exclude nor control for characteristics of studies known to greatly inflate effect sizes. This was precisely the flaw for which I criticized John Hattie’s equally implausible reviews. But what I did not know until recently was just how widespread this is.

I was working on a proposal to do a meta-analysis of research on technology applications in mathematics. A colleague located every meta-analysis published on this topic since 2013. She found 20 of them. After looking at the remarkable outcomes on a few, I computed a median effect size across all twenty. It was +0.44. That is, to put it mildly, implausible. Looking further, I discovered that only one of the reviews adjusted for sample size (inverse variances). Its mean effect size was +0.05. Every one of the other 19 meta-analyses, all in respectable journals, did not control for methodological features or exclude studies based on them, and reported effect sizes up to +1.02 and +1.05.
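Inverse-variance weighting, the adjustment mentioned above, can be sketched in a few lines. I am assuming the common large-sample approximation for the variance of a standardized mean difference and a simple fixed-effect pooled mean; the reviews discussed here may have used other formulas, and the two studies below are invented for illustration.

```python
def d_variance(d, n_e, n_c):
    """Approximate sampling variance of a standardized mean difference d,
    given experimental (n_e) and control (n_c) group sizes."""
    return (n_e + n_c) / (n_e * n_c) + d ** 2 / (2 * (n_e + n_c))


def inverse_variance_mean(studies):
    """Fixed-effect pooled effect size: each study is weighted by 1 / variance,
    so large, precise studies count for more than small, noisy ones."""
    weights = [1.0 / d_variance(d, n_e, n_c) for d, n_e, n_c in studies]
    weighted_sum = sum(w * d for w, (d, _, _) in zip(weights, studies))
    return weighted_sum / sum(weights)


# (effect size, experimental n, control n); both studies are HYPOTHETICAL.
studies = [
    (1.00, 10, 10),    # tiny study with a huge effect size
    (0.10, 500, 500),  # large study with a modest effect size
]

simple_mean = sum(d for d, _, _ in studies) / len(studies)
pooled = inverse_variance_mean(studies)
print(f"simple mean: {simple_mean:.2f}, inverse-variance mean: {pooled:.2f}")
```

The simple mean sits halfway between the two studies, while the inverse-variance mean lands close to the large study's estimate, because a study of 20 students is far too imprecise to carry equal weight.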

Meta-analyses are important, because they are widely read and widely cited, in comparison to individual studies. Yet until meta-analyses start consistently excluding, or at least controlling for, studies with factors known to inflate mean effect sizes, they will have little if any meaning for practice. As things stand now, the overall mean impacts reported by meta-analyses in education depend on how stringent the inclusion standards were, not how effective the interventions truly were.

This is a serious problem for evidence-based reform. Our field knows how to solve it, but all too many meta-analysts do not do so. This needs to change. We see meta-analyses claiming huge impacts, and then wonder why these effects do not transfer to practice. In fact, these big effect sizes do not transfer because they are due to methodological artifacts, not to actual impacts teachers are likely to obtain in real schools with real students.

Ten pounds (160 ounces) of crabs only appear to be more than 20 ounces of crabmeat, because the crabs contain a lot you need to discard. The same is true of meta-analyses. Using small samples, brief durations, and researcher-made measures in evaluations inflates effect sizes without adding anything to the actual impact of treatments for students. Our job as meta-analysts is to strip away the bias the best we can, and get to the actual impact. Then we can make comparisons and generalizations that make sense, and move forward understanding of what really works in education.

In our research group, when we deal with thorny issues of meta-analysis, I often ask my colleagues to consider that they had a sister who is a principal.  “What would you say to her,” I ask, “if she asked what really works, all BS aside?  Would you suggest a program that was very effective in a 30-minute study?  One that has only been evaluated with 20 students?  One that has only been shown to be effective if the researcher gets to make the measure?  Principals are sharp, and appropriately skeptical.  Your sister would never accept such evidence.  Especially if she’s experienced with Baltimore crabs.”

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research, 86(1), 42-78.

Slavin, R., & Lake, C. (2008). Effective programs in elementary mathematics: A best-evidence synthesis. Review of Educational Research, 78 (3), 427-515.

Torgerson, C. J., Wiggins, A., Torgerson, D., Ainsworth, H., & Hewitt, C. (2013). Every Child Counts: Testing policy effectiveness using a randomised controlled trial, designed, conducted and reported to CONSORT standards. Research In Mathematics Education, 15(2), 141–153. doi:10.1080/14794802.2013.797746.

Photo credit: Kathleen Tyler Conklin/(CC BY 2.0)

Would Your School or District Like to Participate in Research?

As research becomes more influential in educational practice, it becomes important that studies take place in all kinds of schools. However, this does not happen. In particular, the large-scale quantitative research evaluating practical solutions for schools tends to take place in large, urban districts near major research universities. Sometimes it takes place in large, suburban districts near major research universities. This is not terribly surprising, because in order to meet the highest standards of the What Works Clearinghouse or Evidence for ESSA, a study of a school-level program will need 40 to 50 schools willing to be assigned at random to either use a new program or to serve as a control group.

Naturally, researchers want to deal with only a small number of districts (to avoid having to deal with many different district-level rules and leaders), so they try to sign up districts in which they might find 40 or 50 schools willing to participate, or perhaps split between two or three districts at most. But there are not that many districts with that number of schools. Further, researchers do not want to spend their time or money flying around to visit schools, so they usually try to find schools close to home.

As a result of these dynamics, of course, it is easy to predict where high-quality quantitative research on innovative programs is not going to take place very often. Small districts (even urban ones) can be hard to serve, but the main category of schools left out of big studies is those in rural districts. This is not only unfair, but it deprives rural schools of a robust evidence base for practice. Also, it can be a good thing for schools and districts anywhere to participate in research. Typically, schools are paired and assigned at random to treatment or control groups. Treatment groups get the treatment, and control schools usually get some incentive, such as money, or an opportunity to use the innovative treatment a year after the experiment is over. So why should some places get all this attention and opportunity, while others complain that they never get to participate and that there are few programs evaluated in districts like theirs?

I have a solution to propose for this problem: A “Registry of Districts and Schools Seeking Research Opportunities.” The idea is that district leaders or principals could list information about themselves and the kinds of research they might be willing to host in their schools or districts. Researchers seeking district or school partners for proposals or funded projects could post invitations for participation. In this way, researchers could find out about districts they might never have otherwise considered, and district and school leaders could find out about research opportunities. Sort of like a dating site, but adapted to the interests of researchers and potential research partners (i.e., no photos would be required).

[Image: Scientists consulting a registry of volunteer participants.]

If this idea interests you, or if you would like to participate, please write to Susan Davis at sdavi168@jh.edu . If you wish, you can share any opinions and ideas about how such a registry might best accomplish its goals. If you represent a district or school and are interested in participating in research, tell us, and I’ll see what I can do.

If I get lots of encouragement, we might create such a directory and operate it on behalf of all districts, schools, and researchers, to benefit students. I’ll look forward to hearing from you!

Cherry Picking? Or Making Better Trees?

Everyone knows that cherry picking is bad. Bad, bad, bad, bad, bad, bad, bad. Cherry picking means showing non-representative examples to give a false impression of the quality of a product, an argument, or a scientific finding. In educational research, for example, cherry picking might mean a publisher or software developer showing off a school using their product that is getting great results, without saying anything about all the other schools using their product that are getting mediocre results, or worse. Very bad, and very common. The existence of cherry picking is one major reason that educational leaders should always look at valid research involving many schools to evaluate the likely impact of a program. The existence of cherry picking also explains why preregistration of experiments is so important, to make it difficult for developers to do many experiments and then publicize only the ones that got the best results, ignoring the others.

However, something that looks a bit like cherry picking can be entirely appropriate, and is in fact an important way to improve educational programs and outcomes. This is when there are variations in outcomes among various programs of a given type. The average across all programs of that type is unimpressive, but some individual programs have done very well, and have replicated their findings in multiple studies.

As an analogy, let’s move from cherries to apples. The first delicious apple was grown by a farmer in Iowa in 1880. He happened to notice that fruit from one particular branch of one tree had a beautiful shape and a terrific flavor. The Stark Seed Company was looking for new apple varieties, and they bought his tree. They grafted the branch on an ordinary rootstock, and (as apples are wont to do), every apple on the grafted tree looked and tasted like the ones from that one unusual branch.

Had the farmer been hoping to sell his whole orchard, and had he taken potential buyers to see this one tree, and offered potential buyers picked apples from this particular branch, then that would be gross cherry-picking. However, he knew (and the Stark Seed Company knew) all about grafting, so instead of using his exceptional branch to fool anyone (note that I am resisting the urge to mention “graft and corruption”), the farmer and Stark could replicate that amazing branch. The key here is the word “replicate.” If it were impossible to replicate the amazing branch, the farmer would have had a local curiosity at most, or perhaps just a delicious occasional snack. But with replication, this one branch transformed the eating apple for a century.

Now let’s get back to education. Imagine that there were a category of educational programs that generally had mediocre results in rigorous experiments. There is always variation in educational outcomes, so the developers of each program would know of individual schools using their program and getting fantastic results. This would be useful for marketing, but if the program developers are honest, they would make all studies of their program available, rather than claiming that the unusual super-duper schools represent what an average school that adopts their program is likely to obtain.

However, imagine that there is a program that resembles others in its category in most ways, yet time and again gets results far beyond those obtained by similar programs of the same type. Perhaps there is a “secret sauce,” some specific factor that explains the exceptional outcomes, or perhaps the organization that created and/or disseminates the program is exceptionally capable. Either way, any potential user would be missing something if they selected a program based on the mediocre average achievement outcomes for its category. If the outcomes for one or more programs are outstanding (and assuming costs and implementation characteristics are similar), then the average achievement effects for the category should no longer be particularly relevant, because any educator who cares about evidence should be looking for the most effective programs, since no one would want to implement an entire category.

I was thinking about apples and cherries because of our group’s work reviewing research on various tutoring programs (Neitzel et al., 2020). As is typical of reviews, we were computing average effect sizes for achievement impacts of categories. Yet these average impacts were much less than the replicated impacts for particular programs. For example, the mean effect size for one-to-small group tutoring was +0.20. Yet various individual programs had mean effect sizes of +0.31, +0.39, +0.42, +0.43, +0.46, and +0.64. In light of these findings, is the practical impact of small group tutoring truly +0.20, or is it somewhere in the range of +0.31 to +0.64? If educators chose programs based on evidence, they would be looking a lot harder at the programs with the larger impacts, not at the mean of all small-group tutoring approaches.

Educational programs cannot be replicated (grafted) as easily as apple trees can. But just as the value to the Stark Seed Company of the Iowa farmer’s orchard could not be determined by averaging ratings of a sampling of all of his apples, the value of a category of educational programs cannot be determined by its average effects on achievement. Rather, the value of the category should depend on the effectiveness of its best, replicated, and replicable examples.

At least, you have to admit it’s a delicious idea!

References

Neitzel, A., Lake, C., Pellegrini, M., & Slavin, R. (2020). A synthesis of quantitative research on programs for struggling readers in elementary schools. Available at www.bestevidence.org. Manuscript submitted for publication.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Preschool: A Step, Not a Journey

“A journey of a thousand miles begins with a single step.”

So said Lao Tzi (or Lao Tzu), the great Chinese scholar who lived in the 6th century BC.

For many years, especially since the extraordinary long-term outcomes of the Perry Preschool became known, many educators have seen high-quality preschool as an essential “first step” in a quality education. Truly, a first step in a journey of a thousand miles. Further, due to the Perry Preschool findings, educators, researchers, and policy makers have maintained that quality preschool is not only the first step in a quality education, but it is the most important, capable of making substantial differences in the lives of disadvantaged students.

I believe, based on the evidence, that high-quality preschool helps students enter kindergarten and, perhaps, first grade, with important advantages in academic and social skills. It is clear that quality preschool can provide a good start, and for this reason, I’d support investments in providing the best preschool experiences we can afford.

But the claims of most preschool advocates go far beyond benefits through kindergarten. We have been led to expect benefits that last throughout children’s lives.

Would that this were so, but it is not. The problem is that randomized studies rarely find long-term impacts. In such studies, children are randomly assigned to receive specific, high-quality preschool services or to serve in a control group, in which children may remain at home or may receive various daycare or preschool experiences of varying quality. The usual pattern of findings in these long-term studies shows positive effects on many measures at the end of the preschool year, fading effects at the end of kindergarten, and no differences in later years. One outstanding example is the Tennessee Voluntary Prekindergarten Program (Lipsey, Farran, & Durkin, 2018). A national study of Head Start by Puma, Bell, Cook, & Heid (2010) found the same pattern, as did randomized studies in England (Melhuish et al., 2010) and Australia (Claessens & Garrett, 2014). Reviews of research routinely identify this consistent pattern (Chambers, Cheung, & Slavin, 2016; Camilli et al., 2009; Melhuish et al., 2010).

So why do so many researchers and educators believe that there are long-term positive effects of preschool? There are two answers. One is the Perry Preschool, and the other is the use of matched rather than randomized study designs.

The Perry Preschool study (Schweinhart & Weikart, 1997) did use a randomized design, but it had many features that made it an interesting pilot rather than a conclusive demonstration of powerful and scalable impacts. First, the Perry Preschool study had a very small sample (initially, 123 students in a single school in Ypsilanti, Michigan). It allowed deviations from random assignment, such as assigning children whose mothers worked to the control group. It provided an extraordinary level of services, never intended to be broadly replicable. Further, the long-term effects were never seen on elementary achievement, but only appeared when students were in secondary school. It seems unlikely that powerful impacts could be seen after there were no detectable impacts in all of elementary school. No one can fully explain what happened, but it is important to note that no one has replicated anything like what the Perry Preschool did, in all the years since the program was implemented in 1962-1967.

With respect to matched study designs, which do sometimes find positive longitudinal effects, a likely explanation is that with preschool children, matching fails to adequately control for initial differences. Families that enroll their four-year-olds in preschool tend, on average, to be more positively oriented toward learning and more eager to promote their children’s academic success. Well-implemented matched designs in the elementary and secondary grades invariably control for prior achievement, and this usually does a good job of equalizing matched samples. With four-year-olds, however, early achievement or IQ tests are not very reliable or well-correlated with outcomes, so it is impossible to know how much matching has equalized the groups on key variables.

Preparing for a Journey

Lao Tzi’s observation reminds us that any great accomplishment is composed of many small, simple activities. Representing a student’s educational career as a journey, this fits. One grand intervention at one point in that journey may be necessary, but it is not sufficient to ensure the success of the journey. In the journey of education, it is surely important to begin with a positive experience, one that provides children with a positive orientation toward school, skills needed to get along with teachers and classmates, knowledge about how the world works, a love for books, stories, and drama, early mathematical ideas, and much more. This is the importance of preschool. Yet it is not enough. Major make-or-break objectives lie in the future. In the years after preschool, students must learn to read proficiently, they must learn basic concepts of mathematics, and they must continue to build social-emotional skills for the formal classroom setting. In the upper elementary grades, they must learn to use their reading and math skills to learn to write effectively, and to learn science and social studies. Then they must make a successful transition to master the challenges of secondary school, leading to successful graduation and entry into valued careers or post-secondary education. Each of these accomplishments, along with many others, requires the best teaching possible, and each is as important and as difficult to achieve for every child as is success in preschool.

A journey of a thousand miles may begin with a single step, but what matters is how the traveler negotiates all the challenges between the first step and the last one. This is true of education. We need to find effective and replicable methods to maximize the possibility that every student will succeed at every stage of the learning process. This can be done, and every year our profession finds more and better ways to improve outcomes at every grade level, in every subject. Preschool is only the first of a series of opportunities to enable all children to reach challenging goals. An important step, to be sure, but not the whole journey.

Photo courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action.

References

Camilli, G., Vargas, S., Ryan, S., & Barnett, S. (2009). Meta-analysis of the effects of early education interventions on cognitive and social development. Teachers College Record, 112 (3), 579-620.

Chambers, B., Cheung, A., & Slavin, R.E. (2016). Literacy and language outcomes of comprehensive and developmental-constructivist approaches to early childhood education: A systematic review. Educational Research Review, 18, 88-111.

Claessens, A., & Garrett, R. (2014). The role of early childhood settings for 4-5 year old children in early academic skills and later achievement in Australia. Early Childhood Research Quarterly, 29 (4), 550-561.

Lipsey, M., Farran, D., & Durkin, K. (2018). Effects of the Tennessee Prekindergarten Program on children’s achievement and behavior through third grade. Early Childhood Research Quarterly, 45 (4), 155-176.

Melhuish, E., Belsky, J., & Leyland, R. (2010). The impact of Sure Start local programmes on five year olds and their families. London: Department for Education.

Puma, M., Bell, S., Cook, R., & Heid, C. (2010). Head Start impact study: Final report. Washington, DC: U.S. Department of Health and Human Services.

Schweinhart, L. J., & Weikart, D. P. (1997). Lasting differences: The High/Scope Preschool curriculum comparison study through age 23 (Monographs of the High/Scope Educational Research Foundation No. 12). Ypsilanti, MI: High/Scope Press.


This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Even Magic Johnson Sometimes Had Bad Games: Why Research Reviews Should Not Be Limited to Published Studies

When my sons were young, they loved to read books about sports heroes, like Magic Johnson. These books would all start off with touching stories about the heroes’ early days, but as soon as they got to athletic feats, it was all victories, against overwhelming odds. Sure, there were a few disappointments along the way, but these only set the stage for ultimate triumph. If this weren’t the case, Magic Johnson would have just been known by his given name, Earvin, and no one would write a book about him.

Magic Johnson was truly a great athlete and is an inspiring leader, no doubt about it. However, like all athletes, he surely had good days and bad ones, good years and bad. Yet the published and electronic media naturally emphasize his very best days and years. The sports press distorts the reality to play up its heroes’ accomplishments, but no one really minds. It’s part of the fun.

In educational research evaluating replicable programs and practices, our objectives are quite different. Sports reporting builds up heroes, because that’s what readers want to hear about. But in educational research, we want fair, complete, and meaningful evidence documenting the effectiveness of practical means of improving achievement or other outcomes. The problem is that academic publications in education also distort understanding of outcomes of educational interventions, because studies with significant positive effects (analogous to Magic’s best days) are far more likely to be published than are studies with non-significant differences (like Magic’s worst days). Unlike the situation in sports, these distortions are harmful, usually overstating the impact of programs and practices. Then when educators implement interventions and fail to get the results reported in the journals, this undermines faith in the entire research process.

It has been known for a long time that studies reporting large, positive effects are far more likely to be published than are studies with smaller or null effects. One long-ago study, by Atkinson, Furlong, & Wampold (1982), randomly assigned APA consulting editors to review articles that were identical in all respects except that half got versions with significant positive effects and half got versions with the same outcomes but marked as not significant. The articles with outcomes marked “significant” were twice as likely as those marked “not significant” to be recommended for publication. Reviewers of the “significant” studies even tended to state that the research designs were excellent much more often than did those who reviewed the “non-significant” versions.

Not only do journals tend not to accept articles with null results, but authors of such studies are less likely to submit them, or to seek any sort of publicity. This is called the “file-drawer effect,” where less successful experiments disappear from public view (Glass et al., 1981).

The combination of reviewers’ preferences for significant findings and authors’ reluctance to submit failed experiments leads to a substantial bias in favor of published vs. unpublished sources (e.g., technical reports, dissertations, and theses, often collectively termed “gray literature”). A review of 645 K-12 reading, mathematics, and science studies by Cheung & Slavin (2016) found almost a two-to-one ratio of effect sizes between published and gray literature reports of experimental studies, +0.30 to +0.16. Lipsey & Wilson (1993) reported a difference of +0.53 (published) to +0.39 (unpublished) in a study of psychological, behavioral and educational interventions. Similar outcomes have been reported by Polanin, Tanner-Smith, & Hennessy (2016), and many others. Based on these long-established findings, Lipsey & Wilson (1993) suggested that meta-analyses should establish clear, rigorous criteria for study inclusion, but should then include every study that meets those standards, published or not.
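To see how selective publication alone can open a gap like the ones reported above, here is a minimal simulation sketch in Python. Every number in it is an assumption chosen for illustration, not an estimate from any of the studies cited: the true effect size (+0.15), the sample size per group (50), and the publication probabilities (90% for significant positive results, 30% for everything else). The function name simulate_publication_bias is hypothetical.

```python
import random
import statistics


def simulate_publication_bias(n_studies=5000, true_effect=0.15,
                              n_per_group=50, seed=0):
    """Return mean effect sizes for (all, published, unpublished) studies."""
    rng = random.Random(seed)
    # Approximate standard error of a standardized mean difference.
    standard_error = (2 / n_per_group) ** 0.5
    all_effects, published, unpublished = [], [], []
    for _ in range(n_studies):
        # Every study evaluates the same program with the same true effect;
        # only sampling error makes the observed effect sizes differ.
        observed = rng.gauss(true_effect, standard_error)
        significant = observed / standard_error > 1.96
        # Assumed publication odds: "significant" studies usually reach a
        # journal, the rest mostly stay in the file drawer.
        publish_probability = 0.9 if significant else 0.3
        all_effects.append(observed)
        if rng.random() < publish_probability:
            published.append(observed)
        else:
            unpublished.append(observed)
    return (statistics.fmean(all_effects),
            statistics.fmean(published),
            statistics.fmean(unpublished))


mean_all, mean_published, mean_unpublished = simulate_publication_bias()
print(f"all studies:         {mean_all:+.2f}")
print(f"published only:      {mean_published:+.2f}")
print(f"unpublished only:    {mean_unpublished:+.2f}")
```

Under these assumptions, the published studies average a larger effect size than the unpublished ones even though every simulated study evaluated the same program. Only a review that includes both groups recovers the true effect.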

The rationale for restricting interest (or meta-analyses) to published articles was always weak, and in recent years it has become weaker still. An increasing proportion of the gray literature consists of technical reports, usually by third-party evaluators, of highly funded experiments. For example, experiments funded by IES and i3 in the U.S., the Education Endowment Foundation (EEF) in the U.K., and the World Bank and other funders in developing countries, provide sufficient resources to do thorough, high-quality implementations of experimental treatments, as well as state-of-the-art evaluations. These evaluations almost always meet the standards of the What Works Clearinghouse, Evidence for ESSA, and other review facilities, but they are rarely published, especially because third-party evaluators have little incentive to publish.

It is important to note that the number of high-quality unpublished studies is very large. Among the 645 studies reviewed by Cheung & Slavin (2016), all had to meet rigorous standards. Across all of them, 383 (59%) were unpublished. Excluding such studies would greatly diminish the number of high-quality experiments in any review.

I have the greatest respect for articles published in top refereed journals. Journal articles provide much that tech reports rarely do, such as extensive reviews of the literature, context for the study, and discussions of theory and policy. However, the fact that an experimental study appeared in a top journal does not indicate that the article’s findings are representative of all the research on the topic at hand.

The upshot of this discussion is clear. First, meta-analyses of experimental studies should always establish methodological criteria for inclusion (e.g., use of control groups, measures not overaligned or made by developers or researchers, duration, sample size), but never restrict studies to those that appeared in published sources. Second, readers of reviews of research on experimental studies should ignore the findings of reviews that were limited to published articles.

In the popular press, it’s fine to celebrate Magic Johnson’s triumphs and ignore his bad days. But if you want to know his stats, you need to include all of his games, not just the great ones. So it is with research in education. Focusing only on published findings can make us believe in magic, when what we need are the facts.

References

Atkinson, D. R., Furlong, M. J., & Wampold, B. E. (1982). Statistical significance, reviewer evaluations, and the scientific process: Is there a (statistically) significant relationship? Journal of Counseling Psychology, 29(2), 189–194. https://doi.org/10.1037/0022-0167.29.2.189

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Glass, G. V., McGraw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills: Sage Publications.

Lipsey, M.W. & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181-1209.

Polanin, J. R., Tanner-Smith, E. E., & Hennessy, E. A. (2016). Estimating the difference between published and unpublished effect sizes: A meta-review. Review of Educational Research, 86(1), 207–236. https://doi.org/10.3102/0034654315582067

 


 

Compared to What? Getting Control Groups Right

Several years ago, I had a grant from the National Science Foundation to review research on elementary science programs. I therefore got to attend NSF conferences for principal investigators. At one such conference, we were asked to present poster sessions. The group next to mine was showing an experiment in science education that had remarkably large effect sizes. I got to talking with the very friendly researcher, and discovered that the experiment involved a four-week unit on a topic in middle school science. I think it was electricity. Initially, I was very excited, electrified even, but then I asked a few questions about the control group.

“Of course there was a control group,” he said. “They would have taught electricity too. It’s pretty much a required portion of middle school science.”

Then I asked, “When did the control group teach about electricity?”

“We had no way of knowing,” said my new friend.

“So it’s possible that they had a four-week electricity unit before the time when your program was in use?”

“Sure,” he responded.

“Or possibly after?”

“Could have been,” he said. “It would have varied.”

Being the nerdy sort of person I am, I couldn’t just let this go.

“I assume you pretested students at the beginning of your electricity unit and at the end?”

“Of course.”

“But wouldn’t this create the possibility that control classes that received their electricity unit before you began would have already finished the topic, so they would make no more progress in this topic during your experiment?”

“…I guess so.”

“And,” I continued, “students who received their electricity instruction after your experiment would make no progress either because they had no electricity instruction between pre- and posttest?”

I don’t recall how the conversation ended, but the point is, wonderful though my neighbor’s science program might be, the science achievement outcomes of his experiment were, well, meaningless.

In the course of writing many reviews of research, my colleagues and I encounter misuses of control groups all the time, even in articles in respected journals written by well-known researchers. So I thought I’d write a blog on the fundamental issues involved in using control groups properly, and the ways in which control groups are often misused.

The purpose of a control group

The purpose of a control group in any experiment, randomized or matched, is to provide a valid estimate of what the experimental group would have achieved had it not received the experimental treatment, or if the study had not taken place at all. Through random assignment or matching, the experimental and control groups are essentially equal at pretest on all important variables (e.g., pretest scores, demographics), and nothing happens in the course of the experiment to upset this initial equality.

How control groups go wrong

Inequality in opportunities to learn tested content. Often, experiments appear to be legitimate (e.g., experimental and control groups are well matched at pretest), but the design contains major bias, because the content being taught in the experimental group is not the same as the content taught in the control group, and the final outcome measure is aligned to what the experimental group was taught but not what the control group was taught. My story at the start of this blog was an example of this. Between pre- and posttest, all students in the experimental group were learning about electricity, but many of those in the control group had already completed electricity or had not received it yet, so they might have been making great progress on other topics, which were not tested, but were unlikely to make much progress on the electricity content that was tested. In this case, the experimental and control groups could be said to be unequal in opportunities to learn electricity. In such a case, it matters little what the exact content or teaching methods were for the experimental program. Teaching a lot more about electricity is sure to add to learning of that topic regardless of how it is taught.

There are many other circumstances in which opportunities to learn are unequal. Many studies use unusual content, and then use tests partially or completely aligned to this unusual content, but not to what the control group was learning. Another common case is where experimental students learn something involving use of technology, but the control group uses paper and pencil to learn the same content. If the final test is given on the technology used by the experimental but not the control group, the potential for bias is obvious.


Unequal opportunities to learn (as a source of bias in experiments) relates to a topic I’ve written a lot about. Use of developer- or researcher-made outcome measures may introduce unequal opportunities to learn, because these measures are more aligned with what the experimental group was learning than what the control group was learning. However, the problem of unequal opportunities to learn is broader than that of developer/researcher-made measures. For example, the story that began this blog illustrated serious bias, but the measure could have been an off-the-shelf, valid measure of electricity concepts.

Problems with control groups that arise during the experiment. Many problems with control groups only arise after an experiment is under way, or completed. These involve situations in which different numbers of students, classes, or schools in the experimental and control groups are left out of the analysis. Usually, these are cases in which, in theory, experimental and control groups have equal opportunities to learn the tested content at the beginning of the experiment. However, some number of students assigned to the experimental group do not participate in the experiment enough to be considered to have truly received the treatment. Typical examples of this include after-school and summer-school programs. A group of students is randomly assigned to receive after-school services, for example, but perhaps only 60% of the students actually show up, or attend enough days to constitute sufficient participation. The problem is that the researchers know exactly who attended and who did not in the experimental group, but they have no idea which control students would or would not have attended if the control group had had the opportunity. The 40% of students who did not attend can probably be assumed to be less highly motivated or lower achieving, to have less supportive parents, or to possess other characteristics that, on average, may identify students who are less likely to do well than students in general. If the researchers drop these 40% of students, the remaining 60% who did participate are likely (on average) to be more motivated, higher achieving, and so on, so the experimental program may look a lot more effective than it truly is.

This kind of problem comes up quite often in studies of technology programs, because researchers can easily find out how often students in the experimental group actually logged in and did the required work. If they drop students who did not use the technology as prescribed, then the remaining students who did use the technology as intended are likely to perform better than control students, who will be a mix of students who would or would not have used the technology if they’d had the chance. Because these control groups contain more and less motivated students, while the experimental group only contains the students who were motivated to use the technology, the experimental group may have a huge advantage.

Problems of this kind can be avoided by using intent to treat (ITT) methods, in which all students who were pretested remain in the sample and are analyzed whether or not they used the software or attended the after-school program. Both the What Works Clearinghouse and Evidence for ESSA require use of ITT models in situations of this kind. The problem is that use of ITT analyses almost invariably reduces estimates of effect sizes, but to do otherwise may introduce quite a lot of bias in favor of the experimental groups.
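The difference between an ITT analysis and an analysis that drops non-attenders can be sketched with a small Python simulation. This is an illustration under made-up assumptions, not a model of any real study: the program has no true effect at all, each student has an unobserved "motivation" score that raises both achievement and the chance of attending, and about 60% of students attend. The function name simulate_attrition_bias and all of the coefficients are hypothetical.

```python
import random
import statistics


def simulate_attrition_bias(n_per_group=20000, true_effect=0.0, seed=1):
    """Return (ITT estimate, attenders-only estimate, attendance rate)."""
    rng = random.Random(seed)

    def student():
        motivation = rng.gauss(0.0, 1.0)
        # Assumption: more motivated students are more likely to attend.
        # This threshold yields roughly 60% attendance.
        would_attend = motivation + rng.gauss(0.0, 1.0) > -0.36
        return motivation, would_attend

    experimental, attenders, control = [], [], []
    for _ in range(n_per_group):
        motivation, attends = student()
        score = 0.5 * motivation + rng.gauss(0.0, 1.0)
        if attends:
            score += true_effect  # only attenders can benefit
            attenders.append(score)
        experimental.append(score)  # ITT keeps every assigned student
    for _ in range(n_per_group):
        # Control students have the same mix of motivation, but nobody can
        # observe which of them would have attended.
        motivation, _unobserved = student()
        control.append(0.5 * motivation + rng.gauss(0.0, 1.0))

    control_mean = statistics.fmean(control)
    itt = statistics.fmean(experimental) - control_mean
    attenders_only = statistics.fmean(attenders) - control_mean
    return itt, attenders_only, len(attenders) / n_per_group


itt, attenders_only, rate = simulate_attrition_bias()
print(f"attendance rate:          {rate:.0%}")
print(f"ITT estimate:             {itt:+.2f}")
print(f"attenders-only estimate:  {attenders_only:+.2f}")
```

With a true effect of zero, the ITT estimate stays close to zero, while the attenders-only comparison shows a clearly positive "effect" that comes entirely from who was dropped, not from the program.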

Experiments without control groups

Of course, there are valid research designs that do not require use of control groups at all. These include regression discontinuity designs (in which students just above and just below a cutoff score used to assign the treatment are compared), interrupted time series designs (in which long-term data trends are studied to see if there is a sharp change at the point when a treatment is introduced), and single-case experimental designs (in which as few as one student/class/school is observed frequently to see what happens when treatment conditions change). However, these designs have their own problems, and single-case designs are rarely used outside of special education.

Control groups are essential in most rigorous experimental research in education, and with proper design they can do what they were intended to do with little bias. Education researchers are becoming increasingly sophisticated about fair use of control groups. Next time I go to an NSF conference, for example, I hope I won’t see posters on experiments that compare students who received an experimental treatment to those who did not even receive instruction on the same topic between pretest and posttest.


Queasy about Quasi-Experiments? How Rigorous Quasi-Experiments Can Minimize Bias

I once had a statistics professor who loved to start discussions of experimental design with the following:

“First, pick your favorite random number.”

Obviously, if you pick a favorite random number, it isn’t random. I was recalling this bit of absurdity recently when discussing with colleagues the relative value of randomized experiments (RCTs) and matched studies, or quasi-experimental designs (QED). In randomized experiments, students, teachers, classes, or schools are assigned at random to experimental or control conditions. In quasi-experiments, a group of students, teachers, classes, or schools is identified as the experimental group, and then other schools are located (usually in the same districts) and then matched on key variables, such as prior test scores, percent free lunch, ethnicity, and perhaps other factors. The ESSA evidence standards, the What Works Clearinghouse, Evidence for ESSA, and most methodologists favor randomized experiments over QEDs, but there are situations in which RCTs are not feasible. In a recent “Straight Talk on Evidence,” Jon Baron discussed how QEDs can approach the usefulness of RCTs. In this blog, I build on Baron’s article and go further into strategies for getting the best, most unbiased results possible from QEDs.

Randomized and quasi-experimental studies are very similar in most ways. Both almost always compare experimental and control schools that were very similar on key performance and demographic factors. Both use the same statistics, and require the same number of students or clusters for adequate power. Both apply the same logic, that the control group mean represents a good approximation of what the experimental group would have achieved, on average, if the experiment had never taken place.

However, there is one big difference between randomized and quasi-experiments. In a well-designed randomized experiment, the experimental and control groups can be assumed to be equal not only on observed variables, such as pretests and socio-economic status, but also on unobserved variables. The unobserved variables we worry most about have to do with selection bias. How did it happen (in a quasi-experiment) that the experimental group chose to use the experimental treatment, or was assigned to the experimental treatment? If a set of schools decided to use the experimental treatment on their own, then these schools might be composed of teachers or principals who are more oriented toward innovation, for example. Or if the experimental treatment is difficult, the teachers who would choose it might be more hard-working. If it is expensive, then perhaps the experimental schools have more money. Any of these factors could bias the study toward finding positive effects, because schools that have teachers who are motivated or hard-working, in schools with more resources, might perform better than control schools with or without the experimental treatment.
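A minimal Python sketch can make the role of unobserved variables concrete. It rests entirely on invented assumptions: each school has an observed pretest score and an unobserved "innovation" score, both of which raise posttest scores; the treatment itself has no effect; and in the quasi-experiment, the more innovative schools are the ones that opt in. The function names (make_schools, rct_estimate, qed_estimate) and all coefficients are hypothetical.

```python
import random
import statistics


def make_schools(n, rng):
    """Create schools whose posttest depends on pretest and an unobserved trait."""
    schools = []
    for _ in range(n):
        pretest = rng.gauss(0.0, 1.0)
        innovation = rng.gauss(0.0, 1.0)  # never seen by the researchers
        # The treatment adds nothing here: the true effect is zero.
        posttest = 0.7 * pretest + 0.3 * innovation + rng.gauss(0.0, 0.5)
        schools.append({"pretest": pretest, "innovation": innovation,
                        "posttest": posttest})
    return schools


def rct_estimate(schools, rng):
    """Random assignment balances observed and unobserved traits alike."""
    shuffled = schools[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    experimental, control = shuffled[:half], shuffled[half:]
    return (statistics.fmean(s["posttest"] for s in experimental)
            - statistics.fmean(s["posttest"] for s in control))


def qed_estimate(schools):
    """Self-selected schools, each matched to the nearest control on pretest."""
    experimental = [s for s in schools if s["innovation"] > 0.5]
    pool = [s for s in schools if s["innovation"] <= 0.5]
    differences = []
    for school in experimental:
        match = min(pool, key=lambda c: abs(c["pretest"] - school["pretest"]))
        differences.append(school["posttest"] - match["posttest"])
    return statistics.fmean(differences)


rng = random.Random(2)
schools = make_schools(2000, rng)
print(f"randomized estimate:        {rct_estimate(schools, rng):+.2f}")
print(f"matched (self-selected):    {qed_estimate(schools):+.2f}")
```

In this sketch the matched groups are nearly identical on pretest, yet the quasi-experimental estimate is well above zero because matching cannot reach the unobserved trait. The randomized estimate stays near zero.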


Because of this problem of selection bias, studies that use quasi-experimental designs generally have larger effect sizes than do randomized experiments. Cheung & Slavin (2016) studied the effects of methodological features of studies on effect sizes. They obtained effect sizes from 645 studies of elementary and secondary reading, mathematics, and science, as well as early childhood programs. These studies had already passed a screening in which they would have been excluded if they had serious design flaws. The results were as follows:

                          No. of studies    Mean effect size
Quasi-experiments              449               +0.23
Randomized experiments         196               +0.16

Clearly, mean effect sizes were larger in the quasi-experiments, suggesting the possibility that there was bias. Compared to factors such as sample size and use of developer- or researcher-made measures, the amount of effect size inflation in quasi-experiments was modest, and some meta-analyses comparing randomized and quasi-experimental studies have found no difference at all.

Relative Advantages of Randomized and Quasi-Experiments

Because of the problems of selection bias, randomized experiments are preferred to quasi-experiments, all other factors being equal. However, there are times when quasi-experiments may be necessary for practical reasons. For example, it can be easier to recruit and serve schools in a quasi-experiment, and it can be less expensive. A randomized experiment requires that schools be recruited with the promise that they will receive an exciting program. Yet half of them will instead be in a control group, and to keep them willing to sign up, they may be given a lot of money, or an opportunity to receive the program later on. In a quasi-experiment, the experimental schools all get the treatment they want, and control schools just have to agree to be tested. A quasi-experiment allows schools in a given district to work together, instead of insisting that experimental and control schools both exist in each district. This better simulates the reality schools are likely to face when a program goes into dissemination. If the problems of selection bias can be minimized, quasi-experiments have many attractions.

An ideal design for quasi-experiments would obtain the same unbiased outcomes as a randomized evaluation of the same treatment might do. The purpose of this blog is to discuss ways to minimize bias in quasi-experiments.

In practice, there are several distinct forms of quasi-experiments. Some have considerable likelihood of bias. However, others have much less potential for bias. In general, quasi-experiments to avoid are forms of post-hoc, or after-the-fact designs, in which determination of experimental and control groups takes place after the experiment. Quasi-experiments with much less likelihood of bias are pre-specified designs, in which experimental and control schools, classrooms, or students are identified and registered in advance. In the following sections, I will discuss these very different types of quasi-experiments.

Post-Hoc Designs

Post-hoc designs generally identify schools, teachers, classes, or students who participated in a given treatment, and then find matches for each in routinely collected data, such as district or school standardized test scores, attendance, or retention rates. The routinely collected data (such as state test scores or attendance) are collected as pre- and posttests from school records, so it may be that neither experimental nor control schools’ staffs are even aware that the experiment happened.

Post-hoc designs sound valid; the experimental and control groups were well matched at pretest, so if the experimental group gained more than the control group, that indicates an effective treatment, right?

Not so fast. There is much potential for bias in this design. First, the experimental schools are almost invariably those that actually implemented the treatment. Any schools that dropped out or (even worse) any that were deemed not to have implemented the treatment enough have disappeared from the study. This means that the surviving schools were different in some important way from those that dropped out. For example, imagine that in a study of computer-assisted instruction, schools were dropped if fewer than 50% of students used the software as much as the developers thought they should. The schools that dropped out must have had characteristics that made them unable to implement the program sufficiently. For example, they might have been deficient in teachers’ motivation, organization, skill with technology, or leadership, all factors that might also impact achievement with or without the computers. The experimental group is only keeping the “best” schools, but the control schools will represent the full range, from best to worst. That’s bias. Similarly, if individual students are included in the experimental group only if they actually used the experimental treatment a certain amount, that introduces bias, because the students who did not use the treatment may be less motivated, have lower attendance, or have other deficits.

As another example, developers or researchers may select experimental schools that they know did exceptionally well with the treatment. Then they may find control schools that match on pretest. The problem is that there could be unmeasured characteristics of the experimental schools that could cause these schools to get good results even without the treatment. This introduces serious bias. This is a particular problem if researchers pick experimental or control schools from a large database. The schools will be matched at pretest, but since the researchers may have many potential control schools to choose among, they may use selection rules that, while they maintain initial equality, introduce bias. The readers of the study might never be able to find out if this happened.

Pre-Specified Designs

The best way to minimize bias in quasi-experiments is to identify experimental and control schools in advance (as contrasted with post hoc), before the treatment is applied. After experimental and control schools, classes, or students are identified and matched on pretest scores and other factors, the names of schools, teachers, and possibly students on each list should be registered on the Registry of Efficacy and Effectiveness Studies. This way, all schools (and all students) involved in the study are counted in intent-to-treat (ITT) analyses, just as is expected in randomized studies. The total effect of the treatment is based on this list, even if some schools or students dropped out along the way. An ITT analysis reflects the reality of program effects, because it is rare that all schools or students actually use educational treatments. Such studies also usually report effects of treatment on the treated (TOT), focusing on schools and students who did implement the treatment, but such analyses are of only minor interest, as they are known to reflect bias in favor of the treatment group.

Because most government funders in effect require use of random assignment, the number of quasi-experiments is rapidly diminishing. All things being equal, randomized studies should be preferred. However, quasi-experiments may better fit the practical realities of a given treatment or population, and as such, I hope there can be a place for rigorous quasi-experiments. We need not be so queasy about quasi-experiments if they are designed to minimize bias.

References

Baron, J. (2019, December 12). Why most non-RCT program evaluation findings are unreliable (and a way to improve them). Washington, DC: Arnold Ventures.

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.


Proven Programs Can’t Replicate, Just Like Bees Can’t Fly

In the 1930s, scientists in France announced that based on principles of aerodynamics, bees could not fly. The only evidence to the contrary was observational, atheoretical, quasi-scientific reports that bees do in fact fly.

The widely known story about bees’ ability to fly came up in a discussion about the dissemination of proven programs in education. Many education researchers and policy makers maintain that the research-development-evaluation-dissemination sequence relied upon for decades to create better ways to educate children has failed. Many observers note that few practitioners seek out research when they consider selection of programs intended to improve student learning or other important outcomes. Research Practice Partnerships, in which researchers work in partnership with local educators to solve problems of importance to the educators, are largely based on the idea that educators are unlikely to use programs or practices unless they personally were involved in creating them. Opponents of evidence-based education policies invariably complain that because schools are so diverse, they are unlikely to adopt programs developed and researched elsewhere, and this is why few research-based programs are widely disseminated.

Dissemination of proven programs is in fact difficult, and there is little evidence of how proven programs might be best disseminated. Recognizing these and many other problems, however, it is important to note one small fact in all this doom and gloom: Proven programs are disseminated. Among the 113 reading and mathematics programs that have met the stringent standards of Evidence for ESSA (www.evidenceforessa.org), most have been disseminated to dozens, hundreds, or thousands of schools. In fact, we do not accept programs that are not in active dissemination (because it is not terribly useful for educators, our target audience, to find out that a proven program is no longer available, or never was). Some (generally newer) programs may only operate in a few schools, but they intend to grow. But most programs, supported by non-profit or commercial organizations, are widely disseminated.

Examples of elementary reading programs with strong, moderate, or promising evidence of effectiveness (by ESSA standards) and wide dissemination include Reading Recovery, Success for All, Sound Partners, Lindamood, Targeted Reading Intervention, QuickReads, SMART, Reading Plus, Spell Read, Acuity, Corrective Reading, Reading Rescue, SuperKids, and REACH. For middle/high, effective and disseminated reading programs include SIM, Read180, Reading Apprenticeship, Comprehension Circuit Training, BARR, ITSS, Passport Reading Journeys, Expository Reading and Writing Course, Talent Development, Collaborative Strategic Reading, Every Classroom Every Day, and Word Generation.

In elementary math, effective and disseminated programs include Math in Focus, Math Expressions, Acuity, FocusMath, Math Recovery, Time to Know, Jump Math, ST Math, and Saxon Math. Middle/high school programs include ASSISTments, Every Classroom Every Day, eMINTS, Carnegie Learning, Core-Plus, and Larson Pre-Algebra.

These are programs that I know have strong, moderate, or promising evidence and are widely disseminated. There may be others I do not know about.

I hope this list convinces any doubters that proven programs can be disseminated. In light of this list, how can it be that so many educators, researchers, and policy makers think that proven educational programs cannot be disseminated?

One answer may be that dissemination of educational programs and practices almost never happens the way many educational researchers wish it did. Researchers put enormous energy into doing research and publishing their results in top journals. Then they are disappointed to find out that publishing in a research journal usually has no impact whatever on practice. They then often try to make their findings more accessible by writing them in plain English in more practitioner-oriented journals. Still, this usually has little or no impact on dissemination.

But writing in journals is rarely how serious dissemination happens. The way it does happen is that the developer or an expert partner (such as a publisher or software company) takes the research ideas and makes them into a program, one that solves a problem that is important to educators, is attractive, professional, and complete, and is not too expensive. Effective programs almost always provide extensive professional development, materials, and software. Programs that provide excellent, appealing, effective professional development, materials, and software become likely candidates for dissemination. I’d guess that virtually every one of the programs I listed earlier took a great idea and made it into an appealing program.

A depressing part of this process is that programs that have no evidence of effectiveness, or even have evidence of ineffectiveness, follow the same dissemination process as do proven programs. Until the 2015 ESSA evidence standards appeared, evidence had a very limited role in the whole development-dissemination process. So far, ESSA has shone more of a spotlight on evidence of effectiveness, but it is still the case that having strong evidence of effectiveness does not provide a program with a decisive advantage over programs lacking positive evidence. Regardless of their actual evidence bases, most program providers today claim that their programs are “evidence-based” or at least “evidence-informed,” so users can easily be fooled.

However, this situation is changing. First, the government itself is identifying programs with evidence of effectiveness, and may publicize them. Government initiatives such as Investing in Innovation (i3; now called EIR) actually provide funding to proven programs to enable them to begin to scale up their programs. The What Works Clearinghouse (https://ies.ed.gov/ncee/wwc/), Evidence for ESSA (www.evidenceforessa.org), and other sources provide easy access to information on proven programs. In other words, government is starting to intervene to nudge the longstanding dissemination process toward programs proven to work.

Back to the bees: the 1930s conclusion that bees should not be able to fly was overturned in 2005, when American researchers observed what bees actually do when they fly, and discovered that bees do not flap their wings like birds. Instead, they push air forward and back with their wings, creating a low-pressure zone above them. This pressure keeps them in the air.

In the same way, educational researchers might stop theorizing about how disseminating proven programs is impossible, and instead observe several programs that have actually done it. Then we can design government policies to further assist proven programs to build the capital and the organizational capacity to effectively disseminate, and to provide incentives and assistance to help schools in need of proven programs to learn about and adopt them.

Perhaps we could call this Plan Bee.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Hummingbirds and Horses: On Research Reviews

Once upon a time, there was a very famous restaurant, called The Hummingbird.   It was known the world over for its unique specialty: Hummingbird Stew.  It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird?  You’d have to catch dozens of them just for one bowl of stew.

One day, an experienced restaurateur came to The Hummingbird, and asked to speak to the owner.  When they were alone, the visitor said, “You have quite an operation here!  But I have been in the restaurant business for many years, and I have always wondered how you do it.  No one can make money selling Hummingbird Stew!  Tell me how you make it work, and I promise on my honor to keep your secret to my grave.  Do you…mix just a little bit?”

The Hummingbird’s owner looked around to be sure no one was listening.   “You look honest,” he said. “I will trust you with my secret.  We do mix in a bit of horsemeat.”

“I knew it!” said the visitor.  “So tell me, what is the ratio?”

“One to one.”

“Really!” said the visitor.  “Even that seems amazingly generous!”

“I think you misunderstand,” said the owner.  “I meant one hummingbird to one horse!”

In education, we write a lot of reviews of research.  These are often very widely cited, and can be very influential.  Because of the work my colleagues and I do, we have occasion to read a lot of reviews.  Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically.  Well-done reviews can reveal patterns of findings that can be of great value to both researchers and educators.  They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.

However, all too many reviews are deeply flawed.  Frequently, reviews of research make it impossible to check the validity of the findings of the original studies.  As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew.   Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies.  If the individual studies are of good quality, this is wonderfully useful.  But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles.  Few readers are willing to do this.

Lately, I have been looking at a lot of recent reviews, all of them published, often in top journals.  One published review only used pre-post gains.  Presumably, if the reviewers found a study with a control group, they would have ignored the control group data!  Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control comparisons, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.

I have also recently seen reviews that include studies with and without control groups (those without being pre-post designs), and those with and without pretests.  Without pretests, experimental and control groups may have started at very different points, and these differences just carry over to the posttests.  A review that accepts this jumble of experimental designs makes no sense.  Treatments evaluated using pre-post designs will almost always look far more effective than those that use experimental-control comparisons.
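To make the arithmetic concrete, here is a minimal sketch of why the two designs give such different answers. Everything in it is an assumption for illustration: the test scores are invented, the function names are my own, and the formulas are simple textbook versions (a mean difference divided by a pooled standard deviation), not the procedure of any particular review.

```python
from statistics import mean, stdev


def pooled_sd(a, b):
    """Pooled standard deviation of two sets of scores."""
    na, nb = len(a), len(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5


def pre_post_effect_size(pre, post):
    """Pretest-to-posttest gain of a single group, in pooled SD units."""
    return (mean(post) - mean(pre)) / pooled_sd(pre, post)


def experimental_control_effect_size(exp_pre, exp_post, ctl_pre, ctl_post):
    """Difference between the two groups' gains, in pooled posttest SD units."""
    exp_gain = mean(exp_post) - mean(exp_pre)
    ctl_gain = mean(ctl_post) - mean(ctl_pre)
    return (exp_gain - ctl_gain) / pooled_sd(exp_post, ctl_post)


# Invented scores: both groups start at the same point. The control group
# gains 10 points over the year with no special treatment; the experimental
# group gains 12.
exp_pre = [40, 45, 50, 55, 60]
exp_post = [52, 57, 62, 67, 72]
ctl_pre = [40, 45, 50, 55, 60]
ctl_post = [50, 55, 60, 65, 70]

pre_post_es = pre_post_effect_size(exp_pre, exp_post)
exp_ctl_es = experimental_control_effect_size(exp_pre, exp_post, ctl_pre, ctl_post)

print(f"Pre-post effect size:             {pre_post_es:+.2f}")
print(f"Experimental-control effect size: {exp_ctl_es:+.2f}")
```

With these made-up numbers, the pre-post effect size comes out several times larger than the experimental-control one, because the pre-post calculation credits the program with the 10 points the students would have gained anyway.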

Many published reviews include results from measures that were made up by program developers.  We have documented that analyses using such measures produce outcomes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies rather than longer ones, and from published studies rather than, for example, technical reports.

The biggest problem is that in many reviews, the designs of the individual studies are never described sufficiently to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out how many of the studies behind the review’s conclusions are rigorous and how many are not. Many years ago, I looked into a widely cited review of research on achievement effects of class size.  Study details were lacking, so I had to find and read the original studies.   It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis!   The studies that reduced class size within the usual range (e.g., comparing reductions from 24 to 12) had very small achievement impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.

It would be great if all reviews excluded studies that used procedures known to inflate effect sizes, but at bare minimum, reviewers should be routinely required to include tables showing critical details of each study, and then to analyze those details to see if the reported outcomes might be due to studies that used procedures suspected to inflate effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.
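As a rough sketch of what such a table and analysis could look like, the example below lists a handful of studies with their design features and then compares mean effect sizes across those features. The studies, their effect sizes, the field names, and the helper functions are all hypothetical, and the means are unweighted for simplicity; a real meta-analysis would usually weight studies, for example by sample size.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Study:
    name: str
    effect_size: float
    has_control_group: bool
    has_pretest: bool
    developer_made_measure: bool
    sample_size: int


# Hypothetical studies, invented purely for illustration.
STUDIES = [
    Study("Study A", 0.10, True, True, False, 800),
    Study("Study B", 0.15, True, True, False, 450),
    Study("Study C", 0.55, True, False, True, 60),
    Study("Study D", 0.90, False, True, True, 40),
]


def design_table(studies):
    """Return one text row per study showing its critical design details."""
    header = f"{'Study':<10}{'ES':>6}{'Control':>9}{'Pretest':>9}{'DevMeas':>9}{'N':>6}"
    rows = [header]
    for s in studies:
        rows.append(
            f"{s.name:<10}{s.effect_size:>+6.2f}"
            f"{str(s.has_control_group):>9}{str(s.has_pretest):>9}"
            f"{str(s.developer_made_measure):>9}{s.sample_size:>6}"
        )
    return rows


def mean_effect_by(studies, feature):
    """Unweighted mean effect size for each value of one design feature."""
    groups = defaultdict(list)
    for s in studies:
        groups[getattr(s, feature)].append(s.effect_size)
    return {value: mean(sizes) for value, sizes in groups.items()}


overall = mean(s.effect_size for s in STUDIES)
by_control = mean_effect_by(STUDIES, "has_control_group")
by_measure = mean_effect_by(STUDIES, "developer_made_measure")

print("\n".join(design_table(STUDIES)))
print(f"Overall mean effect size: {overall:+.2f}")
print("Mean by control group:", by_control)
print("Mean by developer-made measure:", by_measure)
```

In this invented example, the overall mean is pulled up by the two small studies that use developer-made measures, which is exactly the kind of pattern a reader could spot at a glance if the table were published with the review.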

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.
