Science of Reading: Can We Get Beyond Our 30-Year Pillar Fight?

How is it possible that the “reading wars” are back on? The reading wars primarily revolve around what are often called the five pillars of early reading: phonemic awareness, phonics, comprehension, vocabulary, and fluency. Actually, there is little debate about the importance of comprehension, vocabulary, or fluency, so the reading wars are mainly about phonemic awareness and phonics. Diehard anti-phonics advocates exist, but in all of educational research, there are few issues that have been more convincingly settled by high-quality evidence. The National Reading Panel (2000), the source of the five pillars, has been widely cited as conclusive evidence that success in the early stages of reading depends on ensuring that all students are successful in phonemic awareness, phonics, and the other pillars. I was invited to serve on that panel, but declined, because I thought it was redundant. Just a short time earlier, the National Research Council’s Committee on the Prevention of Reading Difficulties (Snow, Burns, & Griffin, 1998) had covered essentially the same ground and come to essentially the same conclusion, as had Marilyn Adams’ (1990) Beginning to Read, and many individual studies. To my knowledge, there is little credible evidence to the contrary. Certainly, then as now, many students have learned to read successfully with or without a focus on phonemic awareness and phonics. However, I do not think there are many students who could succeed with non-phonetic approaches but could not learn to read with phonics-emphasis methods. In other words, there is little if any evidence that phonemic awareness or phonics cause harm, but a great deal of evidence that for perhaps more than half of students, effective instruction emphasizing phonemic awareness and phonics is essential.
Since it is impossible to know in advance which students will need phonics and which will not, it just makes sense to teach using methods likely to maximize the chances that all children (those who need phonics and those who would succeed with or without them) will succeed in reading.

However…

The importance of the five pillars of the National Reading Panel (NRP) catechism is not in doubt among people who believe in rigorous evidence, as far as I know. The reading wars ended in the 2000s, and the five pillars won. However, this does not mean that knowing all about these pillars and the evidence behind them is sufficient to solve America’s reading problems. The NRP pillars describe essential elements of curriculum, but not of instruction.

Improving reading outcomes for all children requires the five pillars, but they are not enough. The five pillars could be extensively and accurately taught in every school of education, and this would surely help, but it would not solve the problem. State and district standards could emphasize the five pillars and this would help, but would not solve the problem. Reading textbooks, software, and professional development could emphasize the five pillars and this would help, but it would not solve the problem.

The reason that such necessary policies would still not be sufficient is that teaching effectiveness does not just depend on getting curriculum right. It also depends on the nature of instruction, classroom management, grouping, and other factors. Teaching reading without teaching phonics is surely harmful to large numbers of students, but teaching phonics does not guarantee success.

As one example, consider grouping. For a very long time, most reading teachers have used homogeneous reading groups. For example, the “Stars” might contain the highest-performing readers, the “Rockets” the middle readers, and the “Planets” the lowest readers. The teacher calls up groups one at a time. No problem there, but what are the students doing back at their desks? Mostly worksheets, on paper or computers. The problem is that if there are three groups, each student spends two thirds of reading class time doing, well, not much of value. Worse, the students are sitting for long periods of time, with not much to do, and the teacher is fully occupied elsewhere. Does anyone see the potential for idle hands to become the devil’s playground? The kids do.

There are alternatives to reading groups, such as the Joplin Plan (cross-grade grouping by reading level), forms of whole-class instruction, or forms of cooperative learning. These provide active teaching to all students all period. There is good evidence for these alternatives (Slavin, 1994, 2017). My main point is that a reading strategy that follows NRP guidelines 100% may still succeed or fail based on its grouping strategy. The same could be true of the use of proven classroom management strategies or motivational strategies during reading periods.

To make the point most strongly, imagine that a district’s teachers have all thoroughly mastered all five pillars of the science of reading, which (we’ll assume) are strongly supported by their district and state. In an experiment, 40 schools serving grades 1 to 3 are selected, and 20 of these are chosen at random to receive sufficient tutors to work with their lowest-achieving 33% of students in groups of four, using a proven model based on science of reading principles. The other 20 schools just use their usual materials and methods, also emphasizing science of reading curricula and methods.

The evidence from many studies of tutoring (Inns et al., 2020), as well as common sense, tells us what would happen. The teachers supported by tutors would produce far greater achievement among their lowest readers than would the other equally science-of-reading-oriented teachers in the control group.

None of these examples diminishes the importance of science of reading. But they illustrate that knowing science of reading is not enough.

At www.evidenceforessa.org, you can find 65 elementary reading programs of all kinds that meet high standards of effectiveness. Almost all of these use approaches that emphasize the five pillars. Yet Evidence for ESSA also lists many programs that equally emphasize the five pillars and yet have not found positive impacts. Rather than re-starting our thirty-year-old pillar fight, don’t you think we might move on to advocating programs that not only use the right curricula, but are also proven to get excellent results for kids?

References

Adams, M. J. (1990). Beginning to read: Thinking and learning about print. Cambridge, MA: MIT Press.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2020). A synthesis of quantitative research on programs for struggling readers in elementary schools. Available at www.bestevidence.org. Manuscript submitted for publication.

National Reading Panel (2000). Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction. Rockville, MD: National Institute of Child Health and Human Development.

Slavin, R. E. (1994). School and classroom organization in beginning reading: Class size, aides, and instructional grouping. In R. E. Slavin, N. L. Karweit, & B. A. Wasik (Eds.), Preventing early school failure. Boston: Allyn and Bacon.

Slavin, R. E. (2017). Instruction based on cooperative learning. In R. Mayer & P. Alexander (Eds.), Handbook of research on learning and instruction. New York: Routledge.

Snow, C. E., Burns, S. M., & Griffin, P. (Eds.) (1998). Preventing reading difficulties in young children. Washington, DC: National Academy Press.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

 

After the Pandemic: Can We Welcome Students Back to Better Schools?

I am writing in March, 2020, at what may be the scariest point in the COVID-19 pandemic in the U.S. We are just now beginning to understand the potential catastrophe, and also to begin taking actions most likely to reduce the incidence of the disease.

One of the most important preventive measures is school closure. At this writing, thirty entire states have closed their schools, as have many individual districts, including Los Angeles. It is clear that school closures will go far beyond this, both in the U.S. and elsewhere.

I am not an expert on epidemiology, but I did want to make some observations about how widespread school closure could affect education, and (ever the optimist) how this disaster could provide a basis for major improvements in the long run.

Right now, schools are closing for a few weeks, with an expectation that after spring break, all will be well again, and schools might re-open. From what I read, this is unlikely. The virus will continue to spread until it runs out of vulnerable people. The purpose of school closures is to reduce the rate of transmission. Children themselves tend not to get the disease, for some reason, but they do transmit the disease, mostly at school (and then to adults). Only when there are few new cases to transmit can schools be responsibly re-opened. No one knows for sure, but a recent article in Education Week predicted that schools will probably not re-open this school year (Will, 2020). Kansas is the first state to announce that schools will be closed for the rest of the school year, but others will surely follow.

Will students suffer from school closure? There will be lasting damage if students lose parents, grandparents, and other relatives, of course. Their achievement may take a dip, but a remarkable study reported by Ceci (1991) examined the impact of two or more years of school closures in the Netherlands during World War II, and found that an initial loss in IQ scores quickly rebounded once schools re-opened after the war. From an educational perspective, the long-term impact of closure itself may not be so bad. A colleague, Nancy Karweit (1989), studied achievement in districts with long teacher strikes, and did not find much lasting impact.

In fact, there is a way in which wise state and local governments might use an opportunity presented by school closures. If schools closing now stay closed through the end of the school year, that could leave large numbers of teachers and administrators with not much to do (assuming they are not furloughed, which could happen). Imagine that, where feasible, this time were used for school leaders to consider how they could welcome students back to much improved schools, and to provide teachers with (electronic) professional development to implement proven programs. This might involve local, regional, or national conversations focused on what strategies are known to be effective for each of the key objectives of schooling. For example, a national series of conversations could take place on proven strategies for beginning reading, for middle school mathematics, for high school science, and so on. By design, the conversations would be focused not just on opinions, but on rigorous evidence of what works. A focus on improving health and disease prevention would be particularly relevant to the current crisis, along with implementing proven academic solutions.

Particular districts might decide to implement proven programs, and then use school closure to provide time for high-quality professional development on instructional strategies that meet the ESSA evidence standards.

Of course, all of the discussion and professional development would have to be done using electronic communications, for obvious reasons of public health. But might it be possible to make wise use of school closure to improve the outcomes of schooling using professional development in proven strategies? With rapid rollout of existing proven programs and dedicated funding, it certainly seems possible.

States and districts are making a wide variety of decisions about what to do during the time that schools are closed. Many are moving to e-learning, but this may be of little help in areas where many students lack computers or access to the internet at home. In some places, a focus on professional development for next school year may be the best way to make the best of a difficult situation.

There have been many times in the past when disasters have led to lasting improvements in health and education. This could be one of these opportunities, if we seize the moment.

Photo credit: Liam Griesacker

References

Ceci, S. J. (1991). How much does schooling influence general intelligence and its cognitive components? A reassessment of the evidence. Developmental Psychology, 27(5), 703–722. https://doi.org/10.1037/0012-1649.27.5.703

Karweit, N. (1989). Time and learning: A review. In R. E. Slavin (Ed.), School and Classroom Organization. Hillsdale, NJ: Erlbaum.

Will, M. (2020, March 15). School closure for the coronavirus could extend to the end of the school year, some say. Education Week.



Florence Nightingale, Statistician

Everyone knows about Florence Nightingale, whose 200th birthday is this year. You probably know of her courageous reform of hospitals and aid stations in the Crimean War, and her insistence on sanitary conditions for wounded soldiers that saved thousands of lives. You may know that she founded the world’s first school for nurses, and of her lifelong fight for the professionalization of nursing, formerly a refuge for uneducated, often alcoholic young women who had no other way to support themselves. You may know her as a bold feminist, who taught by example what women could accomplish.

But did you know that she was also a statistician? In fact, she was the first woman ever to be admitted to Britain’s Royal Statistical Society, in 1858.

Nightingale was not only a statistician; she was an innovator among statisticians. Her life’s goal was to improve medical care, public health, and nursing for all, but especially for people in poverty. In her time, landless people were pouring into large, filthy industrial cities. Death rates from unclean water and air, and unsafe working conditions, were appalling. Women suffered most, and deaths from childbirth in unsanitary hospitals were all too common. This was the sentimental Victorian age, and there were people who wanted to help. But how could they link particular conditions to particular outcomes? Opponents of investments in prevention and health care argued that the poor brought the problems on themselves, through alcoholism or slovenly behavior, or that these problems had always existed, or even that they were God’s will. The numbers of people and variables involved were enormous. How could these numbers be summarized in a way that would stand up to scrutiny, but also communicate the essence of the process leading from cause to effect?

As a child, Nightingale and her sister were taught by her brilliant and liberal father. He gave his daughters a mathematics education that few (male) students in the very finest schools could match. She put these skills to use in her hospital reform work, demonstrating, for example, that when her hospital in the Crimean War ordered reforms such as cleaning out latrines and cesspools, the mortality rate dropped from 42.7 percent to 2.2 percent in a few months. She invented a circular graph that showed changes month by month, as the reforms were implemented. The graph also made it immediately clear to anyone that deaths due to disease far outnumbered those due to war wounds. No numbers, just colors and patterns, made the situation obvious to the least mathematical of readers.
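Nightingale’s circular graph, now usually called a polar-area or “rose” diagram, gave each month a wedge of equal angle and scaled each wedge’s area (not its radius) to the death count, so even casual readers could compare months honestly at a glance. A minimal sketch of that scaling rule follows; the monthly counts are invented for illustration, not taken from her actual data:

```python
import math

# Polar-area ("rose") diagram principle: every month gets an equal-angle
# wedge, and the wedge's AREA is proportional to that month's deaths,
# so the radius scales with the square root of the count.
monthly_deaths = {"Jan": 2400, "Feb": 2100, "Mar": 1200, "Apr": 480}
angle = 2 * math.pi / 12  # each month occupies a 30-degree wedge

def wedge_radius(count, scale=1.0):
    """Radius that makes the wedge's area equal scale * count.

    Wedge area = 0.5 * r**2 * angle, so r = sqrt(2 * area / angle).
    """
    return math.sqrt(2 * scale * count / angle)

radii = {month: wedge_radius(c) for month, c in monthly_deaths.items()}

def wedge_area(r):
    return 0.5 * r ** 2 * angle

# The design guarantee: area ratios equal death-count ratios.
ratio = wedge_area(radii["Jan"]) / wedge_area(radii["Apr"])
print(f"Jan/Apr area ratio: {ratio:.2f} "
      f"(count ratio: {monthly_deaths['Jan'] / monthly_deaths['Apr']:.2f})")
```

Scaling area rather than radius matters: if the radius itself were proportional to the count, a month with five times the deaths would get a wedge twenty-five times as large, exaggerating exactly the differences the graph is meant to communicate.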

When she returned from Crimea, Nightingale had a disease, probably spondylitis, that forced her to be bedridden much of the time for the rest of her life. Yet this did not dim her commitment to health reform. In fact, it gave her a lot of time to focus on her statistical work, often published in the top newspapers of the day. From her bedroom, she had a profound effect on the reform of Britain’s Poor Laws, and the repeal of the Contagious Diseases Act, which her statistics showed to be counterproductive.

Note that so far, I haven’t said a word about education. In many ways, the analogy is obvious. But I’d like to emphasize one contribution of Nightingale’s work that has particular importance to our field.

Everyone who works in education cares deeply for all children, and especially for disadvantaged, underserved children. As a consequence of our profound concern, we advocate fiercely for policies and solutions that we believe to be good for children. Each of us comes down on one side or another of controversial policies, and then advocates for our positions, certain that our favored position would be hugely beneficial if it prevails, and disastrous if it does not. The same was true in Victorian Britain, where people had heated, interminable arguments about all sorts of public policy.

What Florence Nightingale did, more than a century ago, was to subject various policies affecting the health and welfare of poor people to statistical analysis. She worked hard to be sure that her findings were correct and that they communicated to readers. Then she advocated in the public arena for the policies that were beneficial, and against those that were counterproductive.

In education, we have loads of statistics that bear on various policies, but we do not often commit ourselves to advocate for the ones that actually work. As one example, there have been arguments for decades about charter schools. Yet a national CREDO (2013) study found that, on average, charter schools made no difference at all on reading or math performance. A later CREDO (2015) study found that effects were slightly more positive in urban settings, but these effects were tiny. Other studies have had similar results, although outcomes are more positive for “no-excuses” charters such as KIPP, a small percentage of all charter schools.

If charters make no major differences in student learning, I suppose one might conclude that they might be maintained or not maintained based on other factors. Yet neither side can plausibly argue, based on evidence of achievement outcomes, that charters should be an important policy focus in the quest for higher achievement. In contrast, there are many programs that have impacts on achievement far greater than those of charters. Yet use of such programs is not particularly controversial, and is not part of anyone’s political agenda.

The principle that Florence Nightingale established in public health was simple: Follow the data. This principle now dominates policy and practice in medicine. Yet more than a hundred years after Nightingale’s death, have we arrived at that common-sense conclusion in educational policy and practice? We’re moving in that direction, but at the current rate, I’m afraid it will be a very long time before this becomes the core of educational policy or practice.

Photo credit: Florence Nightingale, Illustrated London News (February 24, 1855)

References

CREDO (2013). National charter school study. At http://credo.stanford.edu

CREDO (2015). Urban charter school study. At http://credo.stanford.edu



Cooperative Learning and Achievement

Once upon a time, two teachers went together to an evening workshop on effective teaching strategies. The speaker was dynamic, her ideas were interesting, and everyone in the large audience enjoyed the speech. Afterwards, the two teachers drove back to the town where they lived. The driver talked excitedly with her friend about all the wonderful ideas they’d heard, raised questions about how to put them into practice, and related them to things she’d read, heard, and experienced before.

After an hour’s drive, however, the driver realized that her friend had been asleep for the whole return trip.

Now here’s my question: who learned the most from the speech? Both the driver and her friend were equally excited by the speech and paid equal attention to it. Yet no one would doubt that the driver learned much more, because after the lecture, she talked all about it, thinking her friend was awake.

Every teacher knows how much they learn about any topic by teaching it, or discussing it with others. Imagine how much more the driver and her friend would have learned from the lecture if they had both been participating fully, sharing ideas, perceptions, agreements, disagreements, and new ideas.

So far, this is all obvious, right? Everyone knows that people learn when they are engaged, when they have opportunities to discuss with others, explain to others, ask questions of others, and receive explanations.

Yet in traditionally organized classes, learning does not often happen like this. Teachers teach, students listen, and if genuine discussion takes place at all, it is between the teacher and a small minority of students who always raise their hands and ask good questions. Even in the most exciting and interactive of classes, many students, often a majority, say little or nothing. They may give an answer if called upon, but “giving an answer” is not at all the same as engagement. Even in classes that are organized in groups and encourage group interaction, some students do most of the participating, while others just watch, at best. Evidence from research, especially studies by Noreen Webb (2008), finds that the students who learn the most in group settings are those who give full explanations to others. These are the drivers, returning to my opening story. Those who receive a lot of explanations also learn. Who learns least? Those who neither explain nor receive explanations.

For achievement outcomes, it is not enough to put students into groups and let them talk. Research finds that cooperative learning works best when there are group goals and individual accountability. That is, groups can earn recognition or small privileges (e.g., lining up first for recess) if the average of each team member’s score meets a high standard. The purpose of group goals and individual accountability is to incentivize team members to help and encourage each other to excel, and to avoid having, for example, one student do all the work while the others watch (Chapman, 2001). Students can be silent in groups, as they can be in class, but this is less likely if they are working with others toward a common goal that they can achieve only if all team members succeed.
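The group-goal rule described above reduces to a simple check: a team earns recognition only if the average of its members’ individual scores meets a preset standard, so no member’s work can be ignored. A rough sketch, in which the team names, scores, and the 85-point threshold are all invented for illustration:

```python
# "Group goals with individual accountability": recognition depends on the
# average of each member's INDIVIDUAL score, so every student's learning
# counts toward the team goal.
THRESHOLD = 85  # the "high standard" a team average must meet (invented)

teams = {
    "Tigers": [90, 88, 84, 92],  # average 88.5 -> meets the standard
    "Comets": [70, 95, 60, 80],  # average 76.25 -> falls short
}

def earns_recognition(member_scores, threshold=THRESHOLD):
    """True if the team's average individual score meets the standard."""
    return sum(member_scores) / len(member_scores) >= threshold

for name, scores in teams.items():
    avg = sum(scores) / len(scores)
    status = "earns recognition" if earns_recognition(scores) else "keeps working"
    print(f"{name}: average {avg:.1f} -> {status}")
```

Note what the averaging rule implies for the Comets: one member scoring 95 cannot carry the team past the standard, so teammates have a direct incentive to explain and encourage until the 60- and 70-point members improve, which is exactly the dynamic the research describes.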


The effectiveness of cooperative learning for enhancing achievement has been known for a long time (see Rohrbeck et al., 2003; Roseth et al., 2008; Slavin, 1995, 2014). Forms of cooperative learning are frequently seen in elementary and secondary schools, but they are far from standard practice. Forms of cooperative learning that use group goals and individual accountability are even more rare.

There are many examples of programs that incorporate cooperative learning and meet the ESSA Strong or Moderate standards in reading, math, SEL, and attendance. You can see descriptions of the programs by visiting www.evidenceforessa.org and clicking on the cooperative learning filter. As you can see, it is remarkable how many of the programs identified as effective for improving student achievement by the What Works Clearinghouse or Evidence for ESSA make use of well-structured cooperative learning, usually with students working in teams or groups of 4-5 students, mixed in past performance. In fact, in reading and mathematics, only one-to-one or small-group tutoring are more effective than approaches that make extensive use of cooperative learning.

There are many successful approaches to cooperative learning adapted for different subjects, specific objectives, and age levels (see Slavin, 1995). There is no magic to cooperative learning; outcomes depend on use of proven strategies and high-quality implementation. The successful forms of cooperative learning provide at least a good start for educators seeking ways to make school engaging, exciting, social, and effective for learning. Students not only learn from cooperation in small groups, but they love to do so. They are typically eager to work with their classmates. Why shouldn’t we routinely give them this opportunity?

References

Chapman, E. (2001, April). More on moderations in cooperative learning outcomes. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.

Rohrbeck, C. A., Ginsburg-Block, M. D., Fantuzzo, J. W., & Miller, T. R. (2003). Peer-assisted learning interventions with elementary school students: A meta-analytic review. Journal of Educational Psychology, 94(2), 240–257.

Roseth, C., Johnson, D., & Johnson, R. (2008). Promoting early adolescents’ achievement and peer relationships: The effects of cooperative, competitive, and individualistic goal structures. Psychological Bulletin, 134(2), 223–246.

Slavin, R. E. (1995). Cooperative learning: Theory, research, and practice (2nd ed.). Boston, MA: Allyn & Bacon.

Slavin, R. E. (2014). Make cooperative learning powerful: Five essential strategies to make cooperative learning effective. Educational Leadership, 72(2), 22–26.

Webb, N. M. (2008). Learning in small groups. In T. L. Good (Ed.), 21st century learning (Vol. 1, pp. 203–211). Thousand Oaks, CA: Sage.

Photo courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action.



 

Even Magic Johnson Sometimes Had Bad Games: Why Research Reviews Should Not Be Limited to Published Studies

When my sons were young, they loved to read books about sports heroes, like Magic Johnson. These books would all start off with touching stories about the heroes’ early days, but as soon as they got to athletic feats, it was all victories, against overwhelming odds. Sure, there were a few disappointments along the way, but these only set the stage for ultimate triumph. If this weren’t the case, Magic Johnson would have just been known by his given name, Earvin, and no one would write a book about him.

Magic Johnson was truly a great athlete and is an inspiring leader, no doubt about it. However, like all athletes, he surely had good days and bad ones, good years and bad. Yet the published and electronic media naturally emphasize his very best days and years. The sports press distorts the reality to play up its heroes’ accomplishments, but no one really minds. It’s part of the fun.

In educational research evaluating replicable programs and practices, our objectives are quite different. Sports reporting builds up heroes, because that’s what readers want to hear about. But in educational research, we want fair, complete, and meaningful evidence documenting the effectiveness of practical means of improving achievement or other outcomes. The problem is that academic publications in education also distort understanding of outcomes of educational interventions, because studies with significant positive effects (analogous to Magic’s best days) are far more likely to be published than are studies with non-significant differences (like Magic’s worst days). Unlike the situation in sports, these distortions are harmful, usually overstating the impact of programs and practices. Then when educators implement interventions and fail to get the results reported in the journals, this undermines faith in the entire research process.

It has been known for a long time that studies reporting large, positive effects are far more likely to be published than are studies with smaller or null effects. One long-ago study, by Atkinson, Furlong, & Wampold (1982), randomly assigned APA consulting editors to review articles that were identical in all respects except that half got versions with significant positive effects and half got versions with the same outcomes but marked as not significant. The articles with outcomes marked “significant” were twice as likely as those marked “not significant” to be recommended for publication. Reviewers of the “significant” studies even tended to state that the research designs were excellent much more often than did those who reviewed the “non-significant” versions.

Not only do journals tend not to accept articles with null results, but authors of such studies are less likely to submit them, or to seek any sort of publicity. This is called the “file-drawer effect,” where less successful experiments disappear from public view (Glass et al., 1981).

The combination of reviewers’ preferences for significant findings and authors’ reluctance to submit failed experiments leads to a substantial bias in favor of published vs. unpublished sources (e.g., technical reports, dissertations, and theses, often collectively termed “gray literature”). A review of 645 K-12 reading, mathematics, and science studies by Cheung & Slavin (2016) found almost a two-to-one ratio of effect sizes between published and gray literature reports of experimental studies, +0.30 to +0.16. Lipsey & Wilson (1993) reported a difference of +0.53 (published) to +0.39 (unpublished) in a study of psychological, behavioral and educational interventions. Similar outcomes have been reported by Polanin, Tanner-Smith, & Hennessy (2016), and many others. Based on these long-established findings, Lipsey & Wilson (1993) suggested that meta-analyses should establish clear, rigorous criteria for study inclusion, but should then include every study that meets those standards, published or not.
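The file-drawer mechanism is easy to demonstrate with a small simulation: if only statistically significant results see print, the mean published effect size will sit well above the true effect. The true effect of +0.20, the sample sizes, and the number of studies below are all invented for illustration, not taken from the reviews cited:

```python
import random
import statistics

# Simulate the "file-drawer effect": many small studies of the same
# intervention, but only those reaching p < .05 get "published."
random.seed(42)

TRUE_EFFECT = 0.20   # true standardized effect size (assumed)
N_PER_GROUP = 60     # students per condition in each study (assumed)
N_STUDIES = 2000

def one_study():
    """Run one simulated experiment; return its observed effect size and
    whether the treatment-control difference exceeds 1.96 standard errors
    (the conventional significance cutoff)."""
    treat = [random.gauss(TRUE_EFFECT, 1) for _ in range(N_PER_GROUP)]
    ctrl = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    diff = statistics.mean(treat) - statistics.mean(ctrl)
    se = (2 / N_PER_GROUP) ** 0.5  # SE of the mean difference (sd = 1)
    return diff, diff / se > 1.96

results = [one_study() for _ in range(N_STUDIES)]
all_mean = statistics.mean(d for d, _ in results)
published = [d for d, significant in results if significant]
pub_mean = statistics.mean(published)

print(f"Mean effect, all {N_STUDIES} studies: {all_mean:.2f}")
print(f"Mean effect, 'published' subset:  {pub_mean:.2f}")
```

Because only the largest observed effects clear the significance bar in small samples, the “published” subset overstates the true effect by a wide margin, mirroring the published-versus-gray-literature gaps such as the +0.30 versus +0.16 found by Cheung & Slavin (2016).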

The rationale for restricting meta-analyses to published articles was always weak, and in recent years it has grown weaker still. An increasing proportion of the gray literature consists of technical reports, usually by third-party evaluators, of highly funded experiments. For example, experiments funded by IES and i3 in the U.S., the Education Endowment Foundation (EEF) in the U.K., and the World Bank and other funders in developing countries, provide sufficient resources to do thorough, high-quality implementations of experimental treatments, as well as state-of-the-art evaluations. These evaluations almost always meet the standards of the What Works Clearinghouse, Evidence for ESSA, and other review facilities, but they are rarely published, especially because third-party evaluators have little incentive to publish.

It is important to note that the number of high-quality unpublished studies is very large. Among the 645 studies reviewed by Cheung & Slavin (2016), all had to meet rigorous standards. Across all of them, 383 (59%) were unpublished. Excluding such studies would greatly diminish the number of high-quality experiments in any review.

I have the greatest respect for articles published in top refereed journals. Journal articles provide much that tech reports rarely do, such as extensive reviews of the literature, context for the study, and discussions of theory and policy. However, the fact that an experimental study appeared in a top journal does not indicate that the article’s findings are representative of all the research on the topic at hand.

The upshot of this discussion is clear. First, meta-analyses of experimental studies should always establish methodological criteria for inclusion (e.g., use of control groups, measures not overaligned or made by developers or researchers, duration, sample size), but never restrict studies to those that appeared in published sources. Second, readers of reviews of research on experimental studies should ignore the findings of reviews that were limited to published articles.

In the popular press, it’s fine to celebrate Magic Johnson’s triumphs and ignore his bad days. But if you want to know his stats, you need to include all of his games, not just the great ones. So it is with research in education. Focusing only on published findings can make us believe in magic, when what we need are the facts.

References

Atkinson, D. R., Furlong, M. J., & Wampold, B. E. (1982). Statistical significance, reviewer evaluations, and the scientific process: Is there a (statistically) significant relationship? Journal of Counseling Psychology, 29(2), 189–194. https://doi.org/10.1037/0022-0167.29.2.189

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292.

Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage Publications.

Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.

Polanin, J. R., Tanner-Smith, E. E., & Hennessy, E. A. (2016). Estimating the difference between published and unpublished effect sizes: A meta-review. Review of Educational Research, 86(1), 207–236. https://doi.org/10.3102/0034654315582067

 

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

Compared to What? Getting Control Groups Right

Several years ago, I had a grant from the National Science Foundation to review research on elementary science programs. I therefore got to attend NSF conferences for principal investigators. At one such conference, we were asked to present poster sessions. The group next to mine was showing an experiment in science education that had remarkably large effect sizes. I got to talking with the very friendly researcher, and discovered that the experiment involved a four-week unit on a topic in middle school science. I think it was electricity. Initially, I was very excited, electrified even, but then I asked a few questions about the control group.

“Of course there was a control group,” he said. “They would have taught electricity too. It’s pretty much a required portion of middle school science.”

Then I asked, “When did the control group teach about electricity?”

“We had no way of knowing,” said my new friend.

“So it’s possible that they had a four-week electricity unit before the time when your program was in use?”

“Sure,” he responded.

“Or possibly after?”

“Could have been,” he said. “It would have varied.”

Being the nerdy sort of person I am, I couldn’t just let this go.

“I assume you pretested students at the beginning of your electricity unit and at the end?”

“Of course.”

“But wouldn’t this create the possibility that control classes that received their electricity unit before you began would have already finished the topic, so they would make no more progress in this topic during your experiment?”

“…I guess so.”

“And,” I continued, “students who received their electricity instruction after your experiment would make no progress either because they had no electricity instruction between pre- and posttest?”

I don’t recall how the conversation ended, but the point is, wonderful though my neighbor’s science program might be, the science achievement outcomes of his experiment were, well, meaningless.

In the course of writing many reviews of research, my colleagues and I encounter misuses of control groups all the time, even in articles in respected journals written by well-known researchers. So I thought I’d write a blog on the fundamental issues involved in using control groups properly, and the ways in which control groups are often misused.

The purpose of a control group

The purpose of a control group in any experiment, randomized or matched, is to provide a valid estimate of what the experimental group would have achieved had it not received the experimental treatment, or if the study had not taken place at all. Through random assignment or matching, the experimental and control groups should be essentially equal at pretest on all important variables (e.g., pretest scores, demographics), and nothing should happen in the course of the experiment to upset this initial equality.

How control groups go wrong

Inequality in opportunities to learn tested content. Often, experiments appear to be legitimate (e.g., experimental and control groups are well matched at pretest), but the design contains major bias, because the content being taught in the experimental group is not the same as the content taught in the control group, and the final outcome measure is aligned to what the experimental group was taught but not what the control group was taught. My story at the start of this blog was an example of this. Between pre- and posttest, all students in the experimental group were learning about electricity, but many of those in the control group had already completed electricity or had not received it yet, so they might have been making great progress on other topics, which were not tested, but were unlikely to make much progress on the electricity content that was tested. In this case, the experimental and control groups could be said to be unequal in opportunities to learn electricity. In such a case, it matters little what the exact content or teaching methods were for the experimental program. Teaching a lot more about electricity is sure to add to learning of that topic regardless of how it is taught.

There are many other circumstances in which opportunities to learn are unequal. Many studies use unusual content, and then use tests partially or completely aligned to this unusual content, but not to what the control group was learning. Another common case is where experimental students learn something involving use of technology, but the control group uses paper and pencil to learn the same content. If the final test is given on the technology used by the experimental but not the control group, the potential for bias is obvious.


Unequal opportunities to learn (as a source of bias in experiments) relates to a topic I’ve written a lot about. Use of developer- or researcher-made outcome measures may introduce unequal opportunities to learn, because these measures are more aligned with what the experimental group was learning than what the control group was learning. However, the problem of unequal opportunities to learn is broader than that of developer/researcher-made measures. For example, the story that began this blog illustrated serious bias, but the measure could have been an off-the-shelf, valid measure of electricity concepts.

Problems with control groups that arise during the experiment. Many problems with control groups only arise after an experiment is under way, or completed. These involve situations in which some of the students/classes/schools in the study are not counted in the analysis. Usually, these are cases in which, in theory, experimental and control groups have equal opportunities to learn the tested content at the beginning of the experiment. However, some number of students assigned to the experimental group do not participate in the experiment enough to be considered to have truly received the treatment. Typical examples of this include after-school and summer-school programs. A group of students is randomly assigned to receive after-school services, for example, but perhaps only 60% of the students actually show up, or attend enough days to constitute sufficient participation.

The problem is that the researchers know exactly who attended and who did not in the experimental group, but they have no idea which control students would or would not have attended if the control group had had the opportunity. The 40% of students who did not attend can probably be assumed to be less highly motivated, lower achieving, to have less supportive parents, or to possess other characteristics that, on average, may identify students who are less likely to do well than students in general. If the researchers drop these 40% of students, the remaining 60% who did participate are likely (on average) to be more motivated, higher achieving, and so on, so the experimental program may look a lot more effective than it truly is. This kind of problem comes up quite often in studies of technology programs, because researchers can easily find out how often students in the experimental group actually logged in and did the required work. If they drop students who did not use the technology as prescribed, then the remaining students who did use the technology as intended are likely to perform better than control students, who will be a mix of students who would or would not have used the technology if they’d had the chance. Because these control groups contain more and less motivated students, while the experimental group only contains the students who were motivated to use the technology, the experimental group may have a huge advantage.

Problems of this kind can be avoided by using intent to treat (ITT) methods, in which all students who were pretested remain in the sample and are analyzed whether or not they used the software or attended the after-school program. Both the What Works Clearinghouse and Evidence for ESSA require use of ITT models in situations of this kind. The problem is that use of ITT analyses almost invariably reduces estimates of effect sizes, but to do otherwise may introduce quite a lot of bias in favor of the experimental groups.
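The size of the bias from dropping non-participants can be surprising. Here is a hedged toy simulation of the scenario described above (all numbers are invented for illustration): the program is given zero true effect, attendance is driven by motivation, and motivation also drives achievement. The intent-to-treat comparison correctly finds nothing, while dropping the no-shows manufactures a positive "effect."

```python
import random
import statistics

random.seed(2)

# Illustrative simulation: the program has NO true effect, but only the more
# motivated ~60% of assigned students actually attend, and motivation also
# raises achievement. All parameters below are invented for illustration.
N = 20_000

def simulate_student():
    motivation = random.gauss(0, 1)
    attends = motivation > random.gauss(-0.25, 1)  # attenders skew motivated
    score = 0.5 * motivation + random.gauss(0, 1)  # achievement tracks motivation only
    return motivation, attends, score

treatment = [simulate_student() for _ in range(N)]
control = [simulate_student() for _ in range(N)]

# Intent-to-treat: compare everyone as assigned, attenders or not.
itt = (statistics.mean(s for _, _, s in treatment)
       - statistics.mean(s for _, _, s in control))

# Per-protocol: drop treatment students who never attended, keep all controls.
attenders = [s for _, a, s in treatment if a]
per_protocol = attenders and (statistics.mean(attenders)
                              - statistics.mean(s for _, _, s in control))

print(f"ITT estimate:          {itt:+.2f}")           # near zero, as it should be
print(f"Per-protocol estimate: {per_protocol:+.2f}")  # spuriously positive
```

In this sketch the per-protocol comparison credits the program with the motivation gap between attenders and the full control group, which is exactly the bias that ITT analysis is designed to prevent.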

Experiments without control groups

Of course, there are valid research designs that do not require use of control groups at all. These include regression discontinuity designs (in which students or schools just above and below a cutoff score that determines who receives a treatment are compared) and single-case experimental designs (in which as few as one student/class/school is observed frequently to see what happens when treatment conditions change). However, these designs have their own problems, and single-case designs are rarely used outside of special education.

Control groups are essential in most rigorous experimental research in education, and with proper design they can do what they were intended to do with little bias. Education researchers are becoming increasingly sophisticated about fair use of control groups. Next time I go to an NSF conference, for example, I hope I won’t see posters on experiments that compare students who received an experimental treatment to those who did not even receive instruction on the same topic between pretest and posttest.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Getting Schools Excited About Participating in Research

If America’s school leaders are ever going to get excited about evidence, they need to participate in it. It’s not enough to just make school leaders aware of programs and practices. Instead, they need to serve as sites for experiments evaluating programs that they are eager to implement, or at least have friends or peers nearby who are doing so.

The U.S. Department of Education has funded quite a lot of research on attractive programs. Many of the studies it has funded have not shown positive impacts, but many others have been found to be effective. Those effective programs could provide a means of engaging many schools in rigorous research, while at the same time serving as examples of how evidence can help schools improve their results.

Here is my proposal. It quite often happens that some part of the U.S. Department of Education wants to expand the use of proven programs on a given topic. For example, imagine that they wanted to expand use of proven reading programs for struggling readers in elementary schools, or proven mathematics programs in Title I middle schools.

Rather than putting out the usual request for proposals, the Department might announce that schools could qualify for funding to implement a qualifying proven program, but in order to participate they had to agree to participate in an evaluation of the program. They would have to identify two similar schools from a district, or from neighboring districts, that would agree to participate if their proposal is successful. One school in each pair would be assigned at random to use a given program in the first year or two, and the second school could start after the one- or two-year evaluation period was over. Schools would select from a list of proven programs and choose one that seems appropriate to their needs.
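The random assignment step in this proposal is mechanically simple. Here is a minimal sketch (the school names and pairings are invented for illustration) of assigning one school in each matched pair, at random, to start the program now, with the other serving as a delayed-start control:

```python
import random

random.seed(3)

# Hypothetical matched pairs of similar schools proposed by applicants;
# names are made up for illustration.
pairs = [
    ("Adams Elementary", "Baker Elementary"),
    ("Carver Middle", "Douglass Middle"),
    ("Evergreen K-8", "Franklin K-8"),
]

assignments = []
for school_a, school_b in pairs:
    # A fair coin flip within each pair: one school starts now, the other
    # waits until the one- or two-year evaluation period is over.
    first, delayed = random.sample([school_a, school_b], k=2)
    assignments.append({"starts_now": first, "starts_later": delayed})

for a in assignments:
    print(f"{a['starts_now']} -> program now; {a['starts_later']} -> after evaluation")
```

Because the flip happens within each pair, the two arms stay balanced on whatever made the paired schools similar in the first place, which is the point of pairing before randomizing.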

Many pairs of schools would be funded to use each proven program, so across all schools involved, this would create many large, randomized experiments. Independent evaluation groups would carry out the experiments. Students in participating schools would be pretested at the beginning of the evaluation period (one or two years), and posttested at the end, using tests independent of the developers or researchers.

There are many attractions to this plan. First, large randomized evaluations on promising programs could be carried out nationwide in real schools under normal conditions. Second, since the Department was going to fund expansion of promising programs anyway, the additional cost might be minimal, just the evaluation cost. Third, the experiment would provide a side-by-side comparison of many programs focusing on high-priority topics in very diverse locations. Fourth, the school leaders would have the opportunity to select the program they want, and would be motivated, presumably, to put energy into high-quality implementation. At the end of such a study, we would know a great deal about which programs really work in ordinary circumstances with many types of students and schools. But just as importantly, the many schools that participated would have had a positive experience, implementing a program they believe in and finding out in their own schools what outcomes the program can bring them. Their friends and peers would be envious and eager to get into the next study.

A few sets of studies of this kind could build a constituency of educators that might support the very idea of evidence. And this could transform the evidence movement, providing it with a national, enthusiastic audience for research.

Wouldn’t that be great?

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.