“To err is human. But it takes a computer to really mess things up.” – Anonymous
Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.
Actually, there are many problems, all stemming from the serious bias that can be created when computers are used to collect “big data” on computer-based instruction. (Note that I am not accusing computers of being biased in favor of their electronic pals. Computers do not have biases; they do what their operators ask them to do. So far. The problem is that “big data” often contains “big bias.”)
Here is one common problem. Evaluators of computer-based instruction almost always have available massive amounts of data indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.
Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.
This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?
The answer is that students who did use the computers an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features about those classes or schools that pretests do not capture.
Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or in which some students take “credit recovery” classes online to replace classes they have failed.
Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to schools without access to computers can be just as bad, if only the computer-using students in the computer-using schools are included. To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot be compared to the whole non-computer (control) schools. The reason is that in the control schools, we have to assume that given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.
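To make the selection problem concrete, here is a minimal simulation sketch (my own illustration, not from the post; all parameter values are assumptions). The program has zero true effect, and the user and non-user groups have identical average pretests, yet the comparison of users to non-users still shows a large apparent “effect,” because unobserved motivation drives both usage and posttest scores:

```python
import random
import statistics

random.seed(1)

# Hypothetical simulation: the program has ZERO true effect, but students
# with higher unobserved motivation both use the computers and score higher
# at posttest. Pretests do not capture motivation, so the groups look
# "matched" at pretest yet differ at posttest.
N = 10_000
motivation = [random.gauss(0, 1) for _ in range(N)]
pretest = [random.gauss(0, 1) for _ in range(N)]         # independent of motivation
used = [m + random.gauss(0, 1) > 0 for m in motivation]  # motivated students use more
posttest = [p + 0.5 * m + random.gauss(0, 1)             # true program effect = 0
            for p, m in zip(pretest, motivation)]

users = [y for y, u in zip(posttest, used) if u]
nonusers = [y for y, u in zip(posttest, used) if not u]
pre_users = [p for p, u in zip(pretest, used) if u]
pre_nonusers = [p for p, u in zip(pretest, used) if not u]

print(f"pretest gap:  {statistics.mean(pre_users) - statistics.mean(pre_nonusers):+.2f}")
print(f"posttest gap: {statistics.mean(users) - statistics.mean(nonusers):+.2f}")
```

The pretest gap comes out near zero while the posttest gap is substantial, which is exactly the pattern that makes these comparisons look rigorous while being badly biased.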
Another evaluation design particularly susceptible to bias is the study in which, say, schools using a program are matched (based on pretests, demographics, and so on) with other schools that did not use the program after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will “cherry-pick” their favorite experimental schools and “crabapple-pick” control schools known to have done poorly.
Solutions to Problems in Evaluating Computer-based Programs.
Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.
Randomly Assigning Schools.
The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some number of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group) and half to a business-as-usual control group. Measure achievement at pre- and posttest, and analyze using HLM or some other multi-level method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance. Randomized experiments produce “intent to treat” (ITT) estimates of program impacts that include all students, whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes. Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.
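The logic of the cluster-randomized ITT analysis can be sketched in a toy simulation (my own illustration; the sample sizes, the true effect of 0.2, and the school-level variance are all assumed). The key points it encodes: assignment is random at the school level, every student in an assigned school counts regardless of usage, and the analysis respects the school as the unit of assignment:

```python
import random
import statistics

random.seed(2)

# Hypothetical sketch of a cluster-randomized design: 50 schools are randomly
# assigned, and the ITT estimate compares ALL students in treatment schools
# to all students in control schools, whether or not individual students
# ever touched a computer.
schools = list(range(50))
random.shuffle(schools)            # random assignment at the school level
treatment_schools = schools[:25]
control_schools = schools[25:]

def school_mean(treated, true_effect=0.2, n_students=100):
    # assumed true program effect of 0.2 SD, plus a school-level random effect
    school_effect = random.gauss(0, 0.15)
    scores = [random.gauss(school_effect + (true_effect if treated else 0), 1)
              for _ in range(n_students)]
    return statistics.mean(scores)

treat_means = [school_mean(True) for _ in treatment_schools]
control_means = [school_mean(False) for _ in control_schools]

# Analyze at the cluster (school) level, respecting the unit of assignment;
# a full analysis would use HLM, but the school-level contrast is the idea.
itt = statistics.mean(treat_means) - statistics.mean(control_means)
print(f"ITT estimate: {itt:.2f}")
```

Averaging within schools first, then contrasting the two groups of schools, is a crude stand-in for the multi-level model, but it makes the point: the estimate is anchored to the schools assigned in advance, not to whoever happened to log on.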
High-Quality Matched Studies.
It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.
In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.
In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.
All students in the selected experimental and control schools in the grades receiving the treatment would be included in the study, producing an ITT estimate. In this design there is not much interest in “big data” on how much individual students used the program, but such data would produce a “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound of program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).
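The ITT/TOT relationship described above can also be sketched in a toy simulation (my own illustration; the true effect of 0.10, the 50% uptake rate, and the role of motivation are all assumptions). With partial uptake and positive selection into use, ITT dilutes the true effect while TOT inflates it, which is why TOT serves only as an upper bound:

```python
import random
import statistics

random.seed(3)

# Hypothetical sketch: half of the assigned students actually use the
# program, and only the more motivated ones do. ITT includes every assigned
# student; TOT compares users alone to the control group.
N = 20_000
true_effect = 0.10
motivation = [random.gauss(0, 1) for _ in range(N)]
assigned = [i < N // 2 for i in range(N)]                   # half assigned to the program
used = [a and m > 0 for a, m in zip(assigned, motivation)]  # motivated students use it
score = [0.4 * m + (true_effect if u else 0) + random.gauss(0, 1)
         for m, u in zip(motivation, used)]

def mean_where(flags):
    return statistics.mean(s for s, f in zip(score, flags) if f)

control = [not a for a in assigned]
itt = mean_where(assigned) - mean_where(control)
tot = mean_where(used) - mean_where(control)
print(f"true effect {true_effect:.2f}, ITT {itt:.2f}, TOT {tot:.2f}")
```

In this setup the ITT estimate lands near half the true effect (half the students used the program), while the TOT estimate far exceeds the true effect because the users were more motivated to begin with. A null TOT result really would be bad news.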
This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.
Yup. That’s all. These designs do not make use of the “big data” cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turn-around times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are sources of bias, of interest in after-the-fact TOT analyses but not as headline ITT outcomes.
If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.
4 thoughts on “How Computers Can Help Do Bad Research”
Dr. Slavin wants to throw away relevance to school decision-makers in order to maintain an unnecessary purity of research design. School decision-makers care whether the product is likely to work with their school’s population and available resources. Can it solve their problem (e.g., reduce achievement gaps among demographic categories) if they can implement it adequately? I’ve posted a reply here: https://www.empiricaleducation.com/blog/view-from-the-west-coast/
Empirical Education Inc.
Palo Alto, CA
In his blog, Denis Newman makes a rather extraordinary claim: that the ESSA evidence standards allow anything they do not specifically prohibit. Actually, each of the standards contains the requirement that studies be well-designed and well-implemented. This has been universally interpreted as meaning that they must meet modern standards for randomized, quasi-experimental, and correlational research, respectively. If all research met ESSA standards unless it used methods specifically excluded by the ESSA language, the ESSA standards would be meaningless. Neither the What Works Clearinghouse nor Evidence for ESSA nor the EEF standards in England accept what Newman is proposing: defining the experimental group not as the students assigned in advance to receive a given program, as all sophisticated methodologists demand, but as the students who happen to have used a given program. The potential for bias is not merely theoretical, as Newman states, but has been demonstrated many times in real experiments.
We don’t believe there is a monolithic, universal agreement among methodologists as to what counts as evidence about the workings of educational programs, as Dr. Slavin claims. But the more important difference between the position he takes and what our team is doing is the purpose of evaluation. For Slavin, the goal is to rank programs on the strength of the available evidence: to amass, for any program or intervention, experimental effect-size estimates allowing programs to be ranked from least to most effective. For our team, the goal is to improve the design and implementation of programs used in schools: to amass information about how programs work, for whom, and with what resources. This requires opening up the “black box” and understanding the mediators inherent in implementation. Slavin’s rules narrow the analytic options and ignore the information that will help educators and developers build and implement education programs.
Denis Newman is correct that there are many purposes for research, and methods should match their purposes. In my response to his comment, I was addressing the purposes posed by the ESSA evidence standards, the What Works Clearinghouse, and Evidence for ESSA, and the purposes emphasized by IES, i3/EIR, and EEF. As long as researchers are clear that studies that do not meet these standards are not intended to be listed in the WWC or Evidence for ESSA, for example, and their results are not used in any way for marketing, then such studies are perfectly fine as internal evaluations to inform their developers, and nothing more. The problem is that such research is, in fact, submitted to the WWC or Evidence for ESSA and is used in marketing, and this has great potential to mislead educational leaders.