Lessons for Educational Research from the COVID-19 Vaccines

Since the beginning of the COVID-19 pandemic, more than 130 biotech companies have launched major efforts to develop and test vaccines. Only four have been approved so far (Pfizer, Moderna, Johnson & Johnson, and AstraZeneca). Among the others, many have failed outright, and others are considered highly unlikely to succeed. Some of the failed efforts came from small, fringe companies, but they also include some of the largest and most successful drug companies in the world: Merck (U.S.), GlaxoSmithKline (U.K.), and Sanofi (France).

Photo: Kamala Harris gets her vaccine (courtesy of NIH).

If no further companies succeed, the score is something like 4 successes and 126 failures. Based on this, is the COVID vaccine effort a triumph of science, or a failure? Obviously, if you believe that even one of the successful programs is truly effective, you would have to agree that this is one of the most extraordinary successes in the history of medicine. In less than one year, companies were able to create, evaluate, and roll out successful vaccines, already saving hundreds of thousands of lives worldwide.

Meanwhile, Back in Education . . .

The example of COVID vaccines contrasts sharply with the way research findings are treated in education. As one example, Borman et al. (2003) reviewed research on 33 comprehensive school reform programs. Only three of these had solid evidence of effectiveness, according to the authors (one of these was our program, Success for All; see Cheung et al., in press). Actually, few of the programs failed; most had just not been evaluated adequately. Yet the response from government and educational leaders was “comprehensive school reform doesn’t work” rather than, “How wonderful! Let’s use the programs proven to work.” As a result, a federal program supporting comprehensive school reform was canceled, use of comprehensive school reform plummeted, and most CSR programs went out of operation (we survived, just barely, but the other two successful programs soon disappeared).

Similarly, the What Works Clearinghouse, and our Evidence for ESSA website (www.evidenceforessa.org), are often criticized because so few of the programs we review turn out to have significant positive outcomes in rigorous studies.

The reality is that in any field in which rigorous experiments are used to evaluate innovations, most of the innovations fail. Mature science-focused fields, like medicine and agriculture, expect this and honor it, because the only way to prevent failures is to do no experiments at all, or only flawed experiments. Without rigorous experiments, we would have no reliable successes. Also, we learn from failures, as scientists are learning from the evaluations of all 130 of the COVID vaccine candidates.

Unfortunately, education is not a mature science-focused field, and in our field, failure to show positive effects in rigorous experiments leads to cover-ups, despair, abandonment of proven and promising approaches, or abandonment of rigorous research itself. About 20 years ago, a popular federally-funded education program was found to be ineffective in a large, randomized experiment. Supporters of this program actually got Congress to enact legislation that forbade the use of randomized experiments to evaluate this program!

Research has improved in the past two decades, and acceptance of research has improved as well. Yet we are a long way from medicine, for example, which accepts both success and failure as part of a process of using science to improve health. In our field, we need to commit to broad scale, rigorous evaluations of promising approaches, wide dissemination of programs that work, and learning from experiments that do not (yet) show positive outcomes. In this way, we could achieve the astonishing gains that take place in medicine, and learn how to produce these gains even faster using all the knowledge acquired in experiments, successful or not.

References

Borman, G. D., Hewes, G. M., Overman, L. T., & Brown, S. (2003). Comprehensive school reform and achievement: A meta-analysis. Review of Educational Research, 73(2), 125-230.

Cheung, A., Xie, C., Zhang, T., & Slavin, R. E. (in press). Success for All: A quantitative synthesis of evaluations. Journal of Research on Educational Effectiveness.

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

When Scientific Literacy is a Matter of Life and Death

The Covid-19 crisis has put a spotlight on the importance of science.  More than at any time I can recall (with the possible exception of panic over the Soviet launch of Sputnik), scientists are in the news.  We count on them to find a cure for people with the Covid-19 virus and a vaccine to prevent new cases.  We count on them to predict the progression of the pandemic, and to discover public health strategies to minimize its spread.  We are justifiably proud of the brilliance, dedication, and hard work scientists are exhibiting every day.

Yet the Covid-19 pandemic is also throwing a harsh light on the scientific understanding of the whole population.  Today, scientific literacy can be a matter of life or death.  Although political leaders, advised by science experts, may recommend what we should do to minimize risks to ourselves and our families, people have to make their own judgments about what is safe and what is not.  The graphs in the newspaper showing how new infections and deaths are trending have real meaning.  They should inform what choices people make.  We are bombarded with advice on the Internet, from friends and neighbors, from television, in the news.  Yet these sources are likely to conflict.  Which should we believe?  Is it safe to go for a walk?  To the grocery store?  To church?  To a party?  Is Grandpa safer at home or in assisted living?

Scientific literacy is something we all should have learned in school. I would define scientific literacy as an understanding of scientific method, a basic understanding of how things work in nature and in technology, and an understanding of how scientists generate new knowledge and subject possible treatments, such as medicines, to rigorous tests. All of these understandings, and many more, are ordinarily useful for making sense of the news, for example, but for most people they do not have major personal consequences. But now they do, and it is terrifying to hear the misconceptions and misinformation people have. In the current situation, a misconception or misinformation can kill you, or cause you to make decisions that can lead to the death of a family member.


The importance of scientific literacy in the whole population is now apparent in everyday life.  Yet scientific literacy has not been emphasized in our schools.  Especially in elementary schools, science has taken a back seat, because reading and mathematics are tested every year on state tests, beginning in third grade, but science is not tested in most years.  Many elementary teachers will admit that their own preparation in science was insufficient.  In secondary schools, science classes seem to have been developed to produce scientists, which is of course necessary, but not to produce a population that values and understands scientific information.  And now we are paying the price for this limited focus.

One indicator of our limited focus on science education is the substantial imbalance between the amount of rigorous research in science compared to the amount in mathematics and reading.  I have written reviews of research in each of these areas (see www.bestevidence.org), and it is striking how many fewer experimental studies there are in elementary and secondary science.  Take a look at the What Works Clearinghouse, for another example.  There are many programs in the WWC that focus on reading and mathematics, but science?  Not so many.   Given the obvious importance of science and technology to our economy, you would imagine that investments in research in science education would be a top priority, but judging from the numbers of studies of science programs for elementary and secondary schools, that is certainly not taking place.

The Covid-19 pandemic is giving us a hard lesson in the importance of science for all Americans, not just those preparing to become scientists.  I hope we are learning this lesson, and when the crisis is over, I hope our government and private foundations will greatly increase their investments in research, development, evaluation, and dissemination of proven science approaches for all students.

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

Photo credit: Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action.

Would Your School or District Like to Participate in Research?

As research becomes more influential in educational practice, it becomes important that studies take place in all kinds of schools. However, this does not happen. In particular, the large-scale quantitative research evaluating practical solutions for schools tends to take place in large, urban districts near major research universities. Sometimes they take place in large, suburban districts near major research universities. This is not terribly surprising, because in order to meet the highest standards of the What Works Clearinghouse or Evidence for ESSA, a study of a school-level program will need 40 to 50 schools willing to be assigned at random to either use a new program or to serve as a control group.
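To give a sense of why so many schools are needed, here is a minimal sketch of a minimum detectable effect size (MDES) calculation for a cluster-randomized design, written in Python. The formula is the standard two-level approximation; the intraclass correlation, the number of tested students per school, and the pretest covariate values are illustrative assumptions, not figures from this post.

```python
import math

def mdes_cluster_rct(n_schools, students_per_school, icc,
                     r2_school=0.0, r2_student=0.0, multiplier=2.8):
    """Approximate minimum detectable effect size (MDES) for a two-arm
    cluster-randomized trial with schools split 50/50 between conditions.
    A multiplier of about 2.8 corresponds to 80% power at a two-tailed
    alpha of .05, ignoring small degrees-of-freedom corrections."""
    between = icc * (1 - r2_school)                              # school-level variance left after covariates
    within = (1 - icc) * (1 - r2_student) / students_per_school  # student-level variance in the school mean
    return multiplier * math.sqrt(4 * (between + within) / n_schools)

# Illustrative assumptions: ICC = .20, 60 tested students per school,
# and a pretest that explains half the variance at each level.
for schools in (20, 40, 50):
    print(schools, "schools -> MDES of about", round(mdes_cluster_rct(schools, 60, 0.20, 0.5, 0.5), 2))
```

Under these illustrative assumptions, a study with about 20 schools can only detect quite large effects (around +0.40), while 40 to 50 schools bring the detectable effect down toward the +0.25 range, which is why school-level evaluations typically need samples of that size.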

Naturally, researchers prefer to deal with a small number of districts (to avoid having to work with many different district-level rules and leaders), so they try to sign up districts in which they might find 40 or 50 schools willing to participate, or perhaps split the sample across two or three districts at most. But there are not that many districts with that number of schools. Further, researchers do not want to spend their time or money flying around to visit schools, so they usually try to find schools close to home.

As a result of these dynamics, of course, it is easy to predict where high-quality quantitative research on innovative programs is not going to take place very often. Small districts (even urban ones) can be hard to serve, but the main category of schools left out of big studies is schools in rural districts. This is not only unfair, but it deprives rural schools of a robust evidence base for practice. Also, it can be a good thing for schools and districts anywhere to participate in research. Typically, schools are paired and assigned at random to treatment or control groups. Treatment groups get the treatment, and control schools usually get some incentive, such as money, or an opportunity to use the innovative treatment a year after the experiment is over. So why should some places get all this attention and opportunity, while others complain that they never get to participate and that there are few programs evaluated in districts like theirs?

I have a solution to propose for this problem: A “Registry of Districts and Schools Seeking Research Opportunities.” The idea is that district leaders or principals could list information about themselves and the kinds of research they might be willing to host in their schools or districts. Researchers seeking district or school partners for proposals or funded projects could post invitations for participation. In this way, researchers could find out about districts they might never have otherwise considered, and district and school leaders could find out about research opportunities. Sort of like a dating site, but adapted to the interests of researchers and potential research partners (i.e., no photos would be required).

Scientists consulting a registry of volunteer participants.

If this idea interests you, or if you would like to participate, please write to Susan Davis at sdavi168@jh.edu. If you wish, you can share any opinions and ideas about how such a registry might best accomplish its goals. If you represent a district or school and are interested in participating in research, tell us, and I’ll see what I can do.

If I get lots of encouragement, we might create such a directory and operate it on behalf of all districts, schools, and researchers, to benefit students. I’ll look forward to hearing from you!

 This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

Getting Schools Excited About Participating in Research

If America’s school leaders are ever going to get excited about evidence, they need to participate in it. It’s not enough to just make school leaders aware of programs and practices. Instead, they need to serve as sites for experiments evaluating programs that they are eager to implement, or at least have friends or peers nearby who are doing so.

The U.S. Department of Education has funded quite a lot of research on attractive programs. A lot of the studies they have funded have not shown positive impacts, but many have been found to be effective. Those effective programs could provide a means of engaging many schools in rigorous research, while at the same time serving as examples of how evidence can help schools improve their results.

Here is my proposal. It quite often happens that some part of the U.S. Department of Education wants to expand the use of proven programs on a given topic. For example, imagine that they wanted to expand use of proven reading programs for struggling readers in elementary schools, or proven mathematics programs in Title I middle schools.

Rather than putting out the usual request for proposals, the Department might announce that schools could qualify for funding to implement a qualifying proven program, but in order to participate they had to agree to participate in an evaluation of the program. They would have to identify two similar schools from a district, or from neighboring districts, that would agree to participate if their proposal is successful. One school in each pair would be assigned at random to use a given program in the first year or two, and the second school could start after the one- or two-year evaluation period was over. Schools would select from a list of proven programs and choose one that seems appropriate to their needs.
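As a sketch of how the pairing and assignment step might be operationalized, the few lines below flip a coin within each matched pair so that one school starts the program immediately and the other waits until the evaluation period ends. The school names are placeholders, not schools from any actual study.

```python
import random

# Matched pairs of similar schools (placeholder names, not real schools).
pairs = [("School A1", "School A2"),
         ("School B1", "School B2"),
         ("School C1", "School C2")]

random.seed(2021)  # fixed seed so the assignment is reproducible and auditable
for first, second in pairs:
    treatment, delayed = random.sample((first, second), 2)  # coin flip within the pair
    print(f"{treatment}: program in years 1-2  |  {delayed}: delayed control, starts later")
```

Documenting the seed and the pair list in advance is one simple way to make the assignment auditable for the funder and the participating districts.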

Many pairs of schools would be funded to use each proven program, so across all schools involved, this would create many large, randomized experiments. Independent evaluation groups would carry out the experiments. Students in participating schools would be pretested at the beginning of the evaluation period (one or two years), and posttested at the end, using tests independent of the developers or researchers.

There are many attractions to this plan. First, large randomized evaluations on promising programs could be carried out nationwide in real schools under normal conditions. Second, since the Department was going to fund expansion of promising programs anyway, the additional cost might be minimal, just the evaluation cost. Third, the experiment would provide a side-by-side comparison of many programs focusing on high-priority topics in very diverse locations. Fourth, the school leaders would have the opportunity to select the program they want, and would be motivated, presumably, to put energy into high-quality implementation. At the end of such a study, we would know a great deal about which programs really work in ordinary circumstances with many types of students and schools. But just as importantly, the many schools that participated would have had a positive experience, implementing a program they believe in and finding out in their own schools what outcomes the program can bring them. Their friends and peers would be envious and eager to get into the next study.

A few sets of studies of this kind could build a constituency of educators that might support the very idea of evidence. And this could transform the evidence movement, providing it with a national, enthusiastic audience for research.

Wouldn’t that be great?

 This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Queasy about Quasi-Experiments? How Rigorous Quasi-Experiments Can Minimize Bias

I once had a statistics professor who loved to start discussions of experimental design with the following:

“First, pick your favorite random number.”

Obviously, if you pick a favorite random number, it isn’t random. I was recalling this bit of absurdity recently when discussing with colleagues the relative value of randomized experiments (RCTs) and matched studies, or quasi-experimental designs (QEDs). In randomized experiments, students, teachers, classes, or schools are assigned at random to experimental or control conditions. In quasi-experiments, a group of students, teachers, classes, or schools is identified as the experimental group, and then other schools (usually in the same districts) are located and matched on key variables, such as prior test scores, percent free lunch, ethnicity, and perhaps other factors. The ESSA evidence standards, the What Works Clearinghouse, Evidence for ESSA, and most methodologists favor randomized experiments over QEDs, but there are situations in which RCTs are not feasible. In a recent “Straight Talk on Evidence,” Jon Baron discussed how QEDs can approach the usefulness of RCTs. In this blog, I build on Baron’s article and go further into strategies for getting the best, most unbiased results possible from QEDs.
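To make the matching step concrete, here is a minimal sketch of nearest-neighbor matching for a quasi-experiment. The schools and numbers are invented, and only two matching variables are used (pretest score and percent free lunch); a real study would match on more factors, typically drawing comparison schools from the same or similar districts.

```python
# Invented schools and matching variables, for illustration only.
experimental = [
    {"school": "Exp 1", "pretest": 48.0, "pct_free_lunch": 72.0},
    {"school": "Exp 2", "pretest": 55.0, "pct_free_lunch": 40.0},
]
candidates = [
    {"school": "Ctrl A", "pretest": 47.5, "pct_free_lunch": 70.0},
    {"school": "Ctrl B", "pretest": 62.0, "pct_free_lunch": 25.0},
    {"school": "Ctrl C", "pretest": 54.0, "pct_free_lunch": 43.0},
]

def distance(a, b):
    # Rough standardized distance across the matching variables
    # (the divisors stand in for each variable's standard deviation).
    return (((a["pretest"] - b["pretest"]) / 10.0) ** 2
            + ((a["pct_free_lunch"] - b["pct_free_lunch"]) / 20.0) ** 2)

available = list(candidates)
for exp in experimental:
    match = min(available, key=lambda c: distance(exp, c))
    available.remove(match)  # match without replacement
    print(exp["school"], "matched to", match["school"])
```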

Randomized and quasi-experimental studies are very similar in most ways. Both almost always compare experimental and control schools that were very similar on key performance and demographic factors. Both use the same statistics, and require the same number of students or clusters for adequate power. Both apply the same logic, that the control group mean represents a good approximation of what the experimental group would have achieved, on average, if the experiment had never taken place.

However, there is one big difference between randomized and quasi-experiments. In a well-designed randomized experiment, the experimental and control groups can be assumed to be equal not only on observed variables, such as pretests and socio-economic status, but also on unobserved variables. The unobserved variables we worry most about have to do with selection bias. How did it happen (in a quasi-experiment) that the experimental group chose to use the experimental treatment, or was assigned to the experimental treatment? If a set of schools decided to use the experimental treatment on their own, then these schools might be composed of teachers or principals who are more oriented toward innovation, for example. Or if the experimental treatment is difficult, the teachers who would choose it might be more hard-working. If it is expensive, then perhaps the experimental schools have more money. Any of these factors could bias the study toward finding positive effects, because schools that have teachers who are motivated or hard-working, in schools with more resources, might perform better than control schools with or without the experimental treatment.


Because of this problem of selection bias, studies that use quasi-experimental designs generally have larger effect sizes than do randomized experiments. Cheung & Slavin (2016) studied the effects of methodological features of studies on effect sizes. They obtained effect sizes from 645 studies of elementary and secondary reading, mathematics, and science, as well as early childhood programs. These studies had already passed a screening in which they would have been excluded if they had serious design flaws. The results were as follows:

                          No. of studies    Mean effect size
Quasi-experiments               449               +0.23
Randomized experiments          196               +0.16

Clearly, mean effect sizes were larger in the quasi-experiments, suggesting the possibility that there was bias. Compared to factors such as sample size and use of developer- or researcher-made measures, the amount of effect size inflation in quasi-experiments was modest, and some meta-analyses comparing randomized and quasi-experimental studies have found no difference at all.

Relative Advantages of Randomized and Quasi-Experiments

Because of the problems of selection bias, randomized experiments are preferred to quasi-experiments, all other factors being equal. However, there are times when quasi-experiments may be necessary for practical reasons. For example, it can be easier to recruit and serve schools in a quasi-experiment, and it can be less expensive. A randomized experiment requires that schools be recruited with the promise that they will receive an exciting program. Yet half of them will instead be in a control group, and to keep them willing to sign up, they may be given a lot of money, or an opportunity to receive the program later on. In a quasi-experiment, the experimental schools all get the treatment they want, and control schools just have to agree to be tested.  A quasi-experiment allows schools in a given district to work together, instead of insisting that experimental and control schools both exist in each district. This better simulates the reality schools are likely to face when a program goes into dissemination. If the problems of selection bias can be minimized, quasi-experiments have many attractions.

An ideal design for quasi-experiments would obtain the same unbiased outcomes as a randomized evaluation of the same treatment might do. The purpose of this blog is to discuss ways to minimize bias in quasi-experiments.

In practice, there are several distinct forms of quasi-experiments. Some have considerable likelihood of bias. However, others have much less potential for bias. In general, the quasi-experiments to avoid are post-hoc, or after-the-fact, designs, in which experimental and control groups are determined after the treatment has already been implemented. Quasi-experiments with much less likelihood of bias are pre-specified designs, in which experimental and control schools, classrooms, or students are identified and registered in advance. In the following sections, I will discuss these very different types of quasi-experiments.

Post-Hoc Designs

Post-hoc designs generally identify schools, teachers, classes, or students who participated in a given treatment, and then find matches for each in routinely collected data, such as district or school standardized test scores, attendance, or retention rates. The routinely collected data (such as state test scores or attendance) are collected as pre- and posttests from school records, so it may be that neither experimental nor control schools’ staffs are even aware that the experiment happened.

Post-hoc designs sound valid; the experimental and control groups were well matched at pretest, so if the experimental group gained more than the control group, that indicates an effective treatment, right?

Not so fast. There is much potential for bias in this design. First, the experimental schools are almost invariably those that actually implemented the treatment. Any schools that dropped out or (even worse) any that were deemed not to have implemented the treatment enough have disappeared from the study. This means that the surviving schools were different in some important way from those that dropped out. For example, imagine that in a study of computer-assisted instruction, schools were dropped if fewer than 50% of students used the software as much as the developers thought they should. The schools that dropped out must have had characteristics that made them unable to implement the program sufficiently. For example, they might have been deficient in teachers’ motivation, organization, skill with technology, or leadership, all factors that might also impact achievement with or without the computers. The experimental group is only keeping the “best” schools, but the control schools will represent the full range, from best to worst. That’s bias. Similarly, if individual students are included in the experimental group only if they actually used the experimental treatment a certain amount, that introduces bias, because the students who did not use the treatment may be less motivated, have lower attendance, or have other deficits.

As another example, developers or researchers may select experimental schools that they know did exceptionally well with the treatment. Then they may find control schools that match on pretest. The problem is that there could be unmeasured characteristics of the experimental schools that could cause these schools to get good results even without the treatment. This introduces serious bias. This is a particular problem if researchers pick experimental or control schools from a large database. The schools will be matched at pretest, but since the researchers may have many potential control schools to choose among, they may use selection rules that, while they maintain initial equality, introduce bias. The readers of the study might never be able to find out if this happened.

Pre-Specified Designs

The best way to minimize bias in quasi-experiments is to identify experimental and control schools in advance (as contrasted with post hoc), before the treatment is applied. After experimental and control schools, classes, or students are identified and matched on pretest scores and other factors, the names of schools, teachers, and possibly students on each list should be registered on the Registry of Efficacy and Effectiveness Studies. This way, all schools (and all students) involved in the study are counted in intent-to-treat (ITT) analyses, just as is expected in randomized studies. The total effect of the treatment is based on this list, even if some schools or students dropped out along the way. An ITT analysis reflects the reality of program effects, because it is rare that all schools or students actually use educational treatments. Such studies also usually report effects of treatment on the treated (TOT), focusing on schools and students who actually implemented the treatment, but such analyses are of only minor interest, as they are known to reflect bias in favor of the treatment group.
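The contrast between intent-to-treat and treatment-on-the-treated analyses can be illustrated with a small sketch using invented gain scores. The point is simply that counting only the schools that implemented the program inflates the apparent effect relative to the full registered list.

```python
# Invented gain scores for all registered treatment schools; "implemented"
# flags whether a school actually used the program as intended.
treatment_schools = [
    {"school": "T1", "gain": 8.0, "implemented": True},
    {"school": "T2", "gain": 7.0, "implemented": True},
    {"school": "T3", "gain": 2.0, "implemented": False},  # dropped out mid-year
    {"school": "T4", "gain": 1.0, "implemented": False},
]
control_mean_gain = 3.0  # invented control-group mean gain

def mean(values):
    return sum(values) / len(values)

itt = mean([s["gain"] for s in treatment_schools]) - control_mean_gain
tot = mean([s["gain"] for s in treatment_schools if s["implemented"]]) - control_mean_gain

print(f"Intent-to-treat estimate:          {itt:+.1f} points")  # every registered school counts
print(f"Treatment-on-the-treated estimate: {tot:+.1f} points")  # only implementers; looks much larger
```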

Because most government funders in effect require use of random assignment, the number of quasi-experiments is rapidly diminishing. All things being equal, randomized studies should be preferred. However, quasi-experiments may better fit the practical realities of a given treatment or population, and as such, I hope there can be a place for rigorous quasi-experiments. We need not be so queasy about quasi-experiments if they are designed to minimize bias.

References

Baron, J. (2019, December 12). Why most non-RCT program evaluation findings are unreliable (and a way to improve them). Washington, DC: Arnold Ventures.

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.

 This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.