I once had a statistics professor who loved to start discussions of experimental design with the following:
“First, pick your favorite random number.”
Obviously, if you pick a favorite random number, it isn’t random. I was recalling this bit of absurdity recently when discussing with colleagues the relative value of randomized experiments (RCTs) and matched studies, or quasi-experimental designs (QEDs). In randomized experiments, students, teachers, classes, or schools are assigned at random to experimental or control conditions. In quasi-experiments, a group of students, teachers, classes, or schools is identified as the experimental group, and then other schools are located (usually in the same districts) and matched on key variables, such as prior test scores, percent free lunch, ethnicity, and perhaps other factors. The ESSA evidence standards, the What Works Clearinghouse, Evidence for ESSA, and most methodologists favor randomized experiments over QEDs, but there are situations in which RCTs are not feasible. In a recent “Straight Talk on Evidence,” Jon Baron discussed how QEDs can approach the usefulness of RCTs. In this blog, I build on Baron’s article and go further into strategies for getting the best, most unbiased results possible from QEDs.
Randomized and quasi-experimental studies are very similar in most ways. Both almost always compare experimental and control schools that were very similar on key performance and demographic factors. Both use the same statistics, and require the same number of students or clusters for adequate power. Both apply the same logic, that the control group mean represents a good approximation of what the experimental group would have achieved, on average, if the experiment had never taken place.
However, there is one big difference between randomized and quasi-experiments. In a well-designed randomized experiment, the experimental and control groups can be assumed to be equal not only on observed variables, such as pretests and socio-economic status, but also on unobserved variables. The unobserved variables we worry most about have to do with selection bias. How did it happen (in a quasi-experiment) that the experimental group chose to use the experimental treatment, or was assigned to the experimental treatment? If a set of schools decided to use the experimental treatment on their own, then these schools might be composed of teachers or principals who are more oriented toward innovation, for example. Or if the experimental treatment is difficult, the teachers who would choose it might be more hard-working. If it is expensive, then perhaps the experimental schools have more money. Any of these factors could bias the study toward finding positive effects, because schools that have teachers who are motivated or hard-working, in schools with more resources, might perform better than control schools with or without the experimental treatment.
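To make the selection-bias argument concrete, here is a minimal simulation sketch. Everything in it is a hypothetical assumption for illustration: the unobserved "motivation" trait, the 0.5 weight it gets in the outcome, and the rule by which schools opt in. With a true treatment effect of zero, random assignment should give a difference near zero, while self-selection should give a positive difference driven entirely by motivation.

```python
import random
import statistics

def simulate(n_schools=2000, true_effect=0.0, seed=0):
    """Compare an RCT with a self-selected QED when an unobserved trait matters.

    All parameters and the data-generating model are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Hypothetical unobserved trait (e.g., staff motivation), standard normal.
    motivation = [rng.gauss(0, 1) for _ in range(n_schools)]

    def outcome(m, treated):
        # Outcome depends on motivation plus noise; treatment adds true_effect.
        return 0.5 * m + true_effect * treated + rng.gauss(0, 1)

    # RCT: a coin flip decides treatment, independent of motivation.
    rct_t, rct_c = [], []
    for m in motivation:
        treated = rng.random() < 0.5
        (rct_t if treated else rct_c).append(outcome(m, treated))

    # QED: more motivated schools tend to choose the treatment themselves.
    qed_t, qed_c = [], []
    for m in motivation:
        treated = m + rng.gauss(0, 1) > 0
        (qed_t if treated else qed_c).append(outcome(m, treated))

    rct_diff = statistics.mean(rct_t) - statistics.mean(rct_c)
    qed_diff = statistics.mean(qed_t) - statistics.mean(qed_c)
    return rct_diff, qed_diff

rct_diff, qed_diff = simulate()
print(f"RCT difference: {rct_diff:+.2f}   QED difference: {qed_diff:+.2f}")
```

Because motivation is unobserved in this sketch, matching on pretests or demographics would not remove the gap in the QED arm. That is the sense in which randomization protects against variables nobody measured.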
Because of this problem of selection bias, studies that use quasi-experimental designs generally have larger effect sizes than do randomized experiments. Cheung & Slavin (2016) studied the effects of methodological features of studies on effect sizes. They obtained effect sizes from 645 studies of elementary and secondary reading, mathematics, and science, as well as early childhood programs. These studies had already passed a screening in which they would have been excluded if they had serious design flaws. The results were as follows:
|No. of studies||Mean effect size|
Clearly, mean effect sizes were larger in the quasi-experiments, suggesting the possibility of bias. However, compared to factors such as sample size and use of developer- or researcher-made measures, the amount of effect size inflation in quasi-experiments was modest, and some meta-analyses comparing randomized and quasi-experimental studies have found no difference at all.
Relative Advantages of Randomized and Quasi-Experiments
Because of the problems of selection bias, randomized experiments are preferred to quasi-experiments, all other factors being equal. However, there are times when quasi-experiments may be necessary for practical reasons. For example, it can be easier to recruit and serve schools in a quasi-experiment, and it can be less expensive. A randomized experiment requires that schools be recruited with the promise that they will receive an exciting program. Yet half of them will instead be in a control group, and to keep them willing to sign up, they may be given a lot of money, or an opportunity to receive the program later on. In a quasi-experiment, the experimental schools all get the treatment they want, and control schools just have to agree to be tested. A quasi-experiment allows schools in a given district to work together, instead of insisting that experimental and control schools both exist in each district. This better simulates the reality schools are likely to face when a program goes into dissemination. If the problems of selection bias can be minimized, quasi-experiments have many attractions.
An ideal design for quasi-experiments would obtain the same unbiased outcomes as a randomized evaluation of the same treatment would. The purpose of this blog is to discuss ways to minimize bias in quasi-experiments.
In practice, there are several distinct forms of quasi-experiments. Some have considerable likelihood of bias. However, others have much less potential for bias. In general, quasi-experiments to avoid are forms of post-hoc, or after-the-fact designs, in which determination of experimental and control groups takes place after the experiment. Quasi-experiments with much less likelihood of bias are pre-specified designs, in which experimental and control schools, classrooms, or students are identified and registered in advance. In the following sections, I will discuss these very different types of quasi-experiments.
Post-hoc designs generally identify schools, teachers, classes, or students who participated in a given treatment, and then find matches for each in routinely collected data, such as district or school standardized test scores, attendance, or retention rates. These data are drawn from school records to serve as pre- and posttests, so it may be that neither experimental nor control schools’ staffs are even aware that the experiment happened.
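The matching step described above can be sketched as follows. This is only one simple approach (greedy one-to-one nearest-neighbor matching on a single pretest score), not the procedure of any particular study. The school names, scores, and the `caliper` threshold are hypothetical.

```python
def match_controls(treated, pool, caliper=0.25):
    """Greedy 1:1 nearest-neighbor matching on pretest, without replacement.

    treated, pool: dicts mapping a school name to its pretest mean.
    caliper: largest pretest gap accepted for a match (hypothetical value).
    Returns a dict mapping each matched treated school to its control.
    """
    available = dict(pool)  # copy so the caller's pool is not modified
    pairs = {}
    # Process treated schools from lowest to highest pretest score.
    for name, score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control school on pretest.
        best = min(available, key=lambda c: abs(available[c] - score))
        if abs(available[best] - score) <= caliper:
            pairs[name] = best
            del available[best]  # each control is used at most once
    return pairs

treated = {"A": 0.10, "B": -0.40}
pool = {"C1": 0.15, "C2": -0.35, "C3": 1.20}
print(match_controls(treated, pool))
```

Here school A is paired with C1 and school B with C2, while C3 goes unused. The sketch matches only on what is measured, which is exactly the limitation discussed next.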
Post-hoc designs sound valid; the experimental and control groups were well matched at pretest, so if the experimental group gained more than the control group, that indicates an effective treatment, right?
Not so fast. There is much potential for bias in this design. First, the experimental schools are almost invariably those that actually implemented the treatment. Any schools that dropped out or (even worse) any that were deemed not to have implemented the treatment enough have disappeared from the study. This means that the surviving schools were different in some important way from those that dropped out. For example, imagine that in a study of computer-assisted instruction, schools were dropped if fewer than 50% of students used the software as much as the developers thought they should. The schools that dropped out must have had characteristics that made them unable to implement the program sufficiently. For example, they might have been deficient in teachers’ motivation, organization, skill with technology, or leadership, all factors that might also impact achievement with or without the computers. The experimental group is only keeping the “best” schools, but the control schools will represent the full range, from best to worst. That’s bias. Similarly, if individual students are included in the experimental group only if they actually used the experimental treatment a certain amount, that introduces bias, because the students who did not use the treatment may be less motivated, have lower attendance, or have other deficits.
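A small simulation can illustrate why dropping low implementers biases the comparison. It is a sketch under stated assumptions: a hypothetical school "capacity" trait drives both implementation and achievement gains, the treatment itself has no effect at all, and schools are kept only if capacity exceeds a threshold.

```python
import random
import statistics

def dropout_bias(n=2000, threshold=0.0, seed=1):
    """Show the bias from keeping only 'sufficient implementers'.

    The model and all numbers are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)

    def school():
        # Hypothetical capacity (motivation, organization, leadership).
        capacity = rng.gauss(0, 1)
        # Gains depend on capacity and noise; there is NO treatment effect.
        gain = 0.5 * capacity + rng.gauss(0, 1)
        return capacity, gain

    experimental = [school() for _ in range(n)]
    control = [school() for _ in range(n)]
    control_mean = statistics.mean([g for _, g in control])

    # Honest comparison: every experimental school is counted.
    full = statistics.mean([g for _, g in experimental]) - control_mean
    # Post-hoc rule: keep only schools that implemented "enough",
    # proxied here by capacity above the threshold.
    kept = [g for c, g in experimental if c > threshold]
    survivors = statistics.mean(kept) - control_mean
    return full, survivors

full, survivors = dropout_bias()
print(f"All schools: {full:+.2f}   Implementers only: {survivors:+.2f}")
```

In this setup the all-schools difference should hover near zero, while the implementers-only difference comes out clearly positive even though the program does nothing. The control group still contains its weakest schools, and the experimental group no longer does.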
As another example, developers or researchers may select experimental schools that they know did exceptionally well with the treatment. Then they may find control schools that match on pretest. The problem is that there could be unmeasured characteristics of the experimental schools that could cause these schools to get good results even without the treatment. This introduces serious bias. This is a particular problem if researchers pick experimental or control schools from a large database. The schools will be matched at pretest, but since the researchers may have many potential control schools to choose among, they may use selection rules that, while they maintain initial equality, introduce bias. The readers of the study might never be able to find out if this happened.
The best way to minimize bias in quasi-experiments is to identify experimental and control schools in advance (as contrasted with post hoc), before the treatment is applied. After experimental and control schools, classes, or students are identified and matched on pretest scores and other factors, the names of schools, teachers, and possibly students on each list should be registered on the Registry of Efficacy and Effectiveness Studies. This way, all schools (and all students) involved in the study are counted in intent-to-treat (ITT) analyses, just as is expected in randomized studies. The total effect of the treatment is based on this list, even if some schools or students dropped out along the way. An ITT analysis reflects the reality of program effects, because it is rare that all schools or students actually use educational treatments. Such studies also usually report effects of treatment on the treated (TOT), focusing on schools and students who did implement the treatment, but such analyses are of only minor interest, as they are known to reflect bias in favor of the treatment group.
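The difference between the two analyses can be shown with a tiny worked example. The gains below are made-up numbers, and the "TOT" here is the simple treated-only comparison described above, not a formal instrumental-variable estimate.

```python
from statistics import mean

def itt_and_tot(registered, control_gains):
    """Return (ITT, treated-only) mean differences against the control group.

    registered: list of (gain, implemented) pairs for EVERY pre-registered
        experimental school, whether or not it ended up using the program.
    control_gains: gains for every pre-registered control school.
    """
    control_mean = mean(control_gains)
    # ITT: everyone on the registered list counts.
    itt = mean([g for g, _ in registered]) - control_mean
    # Treated-only: schools that did not implement are silently dropped.
    tot = mean([g for g, used in registered if used]) - control_mean
    return itt, tot

# Hypothetical gains in standard-deviation units.
registered = [(0.6, True), (0.5, True), (0.1, False), (0.0, False)]
control = [0.2, 0.1, 0.3, 0.2]

itt, tot = itt_and_tot(registered, control)
print(f"ITT: {itt:+.2f}   Treated-only: {tot:+.2f}")
```

With these invented numbers the ITT difference is about 0.10 and the treated-only difference is about 0.35. The gap comes entirely from leaving out the two registered schools that never implemented the program.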
Because most government funders in effect require use of random assignment, the number of quasi-experiments is rapidly diminishing. All things being equal, randomized studies should be preferred. However, quasi-experiments may better fit the practical realities of a given treatment or population, and as such, I hope there can be a place for rigorous quasi-experiments. We need not be so queasy about quasi-experiments if they are designed to minimize bias.
Baron, J. (2019, December 12). Why most non-RCT program evaluation findings are unreliable (and a way to improve them). Washington, DC: Arnold Ventures.
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292.
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.