How Computers Can Help Do Bad Research

“To err is human. But it takes a computer to really (mess) things up.” – Anonymous

Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.

Actually, there are many problems, all sharing the potential for serious bias created when computers are used to collect “big data” on computer-based instruction. (Note that I am not accusing computers of being biased in favor of their electronic pals! The problem is that “big data” often contains “big bias.” Computers do not have biases. They do what their operators ask them to do. So far.)

Here is one common problem.  Evaluators of computer-based instruction almost always have available massive amounts of data indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.

Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.

This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?

The answer is that students who did use the computers an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features about those classes or schools that pretests do not capture.

Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or in which some students are taking ‘credit recovery’ classes online to replace classes they have failed.

Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to schools without access to computers can be just as bad, if only the computer-using students in the computer-using schools are included.  To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot be compared to all of the students in the non-computer (control) schools. The reason is that in the control schools, we have to assume that given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.
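For readers who like to see the bias in numbers, here is a minimal simulation sketch of the scenario above. Everything in it is an assumption made for illustration: the function name simulate_usage_bias, the score model, and the rule that the more motivated half of students in program schools are the ones who use the computers. The program is built to have no true effect at all, and motivation is built to be invisible to the pretest.

```python
import random
import statistics


def simulate_usage_bias(n_students=20000, seed=0):
    """Compare an all-students contrast with a users-only contrast.

    Hypothetical model: the program has NO true effect. Posttest scores
    depend on the pretest and on unmeasured motivation, and only the more
    motivated half of students in program schools use the computers.
    """
    rng = random.Random(seed)

    def make_student():
        motivation = rng.gauss(0, 1)  # invisible to the pretest
        pretest = rng.gauss(0, 1)     # independent of motivation by construction
        posttest = 0.5 * pretest + 0.5 * motivation + rng.gauss(0, 1)
        return motivation, pretest, posttest

    program = [make_student() for _ in range(n_students)]
    control = [make_student() for _ in range(n_students)]

    # Students in program schools who "met the usage criterion".
    users = [s for s in program if s[0] > 0]

    def mean_pre(students):
        return statistics.fmean(s[1] for s in students)

    def mean_post(students):
        return statistics.fmean(s[2] for s in students)

    return {
        # Everyone in program schools vs. everyone in control schools.
        "itt": mean_post(program) - mean_post(control),
        # Only the users vs. everyone in control schools.
        "users_only": mean_post(users) - mean_post(control),
        # Users and controls look alike on the pretest.
        "pretest_gap_users": mean_pre(users) - mean_pre(control),
    }


if __name__ == "__main__":
    for name, value in simulate_usage_bias().items():
        print(f"{name}: {value:+.3f}")
```

Under these assumptions the all-students contrast should hover near zero, while the users-only contrast should come out clearly positive even though users and controls are nearly identical on the pretest. That gap is pure selection, not program impact.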

Another evaluation design particularly susceptible to bias is one in which, say, schools using a program are matched (based on pretests, demographics, and so on) with other schools that did not use the program, after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will “cherry-pick” their favorite experimental schools and “crabapple-pick” control schools known to have done poorly.


Solutions to Problems in Evaluating Computer-based Programs.

Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.

Randomly Assigning Schools.

The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some number of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group), and half to a business-as-usual control group. Measure achievement at pre- and post-test, and analyze using HLM or some other multi-level method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance.  Randomized experiments produce “intent to treat” (ITT) estimates of program impacts that include all students whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes.  Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.
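The assignment step itself is simple enough to sketch in a few lines. This is only an illustration: the function name assign_schools, the school IDs, and the seed are hypothetical, and a real study might well randomize within matched pairs or blocks rather than across the whole list.

```python
import random


def assign_schools(school_ids, seed):
    """Randomly split schools into experimental and control halves.

    A fixed seed makes the assignment reproducible, so the list can be
    documented before the program starts.
    """
    rng = random.Random(seed)
    shuffled = list(school_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "experimental": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


if __name__ == "__main__":
    # Hypothetical IDs for a 50-school sample.
    schools = [f"school_{i:02d}" for i in range(1, 51)]
    groups = assign_schools(schools, seed=2024)
    print(len(groups["experimental"]), len(groups["control"]))
```

The point of the sketch is that every student in every assigned school stays in the analysis, which is what makes the resulting estimate ITT.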

High-Quality Matched Studies.

It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.

In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.
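To make the matching step concrete, here is a rough sketch that pairs each program school with the not-yet-used comparison school closest to it on prior test scores. It is a deliberate simplification: the function name match_controls, the school names, and the scores are all hypothetical, and it matches on a single prior score, whereas a real study would also match on demographics and other features.

```python
def match_controls(program_schools, candidate_schools):
    """Greedy nearest-neighbor matching on prior mean test score.

    Both arguments map school name -> prior mean score. Each candidate
    can serve as a control for at most one program school.
    """
    available = dict(candidate_schools)
    pairs = []
    for name, score in sorted(program_schools.items()):
        if not available:
            raise ValueError("not enough candidate schools to match")
        # Closest prior score wins; ties are broken by school name.
        best = min(available, key=lambda c: (abs(available[c] - score), c))
        pairs.append((name, best))
        del available[best]
    return pairs


if __name__ == "__main__":
    # Hypothetical prior-year state test means.
    program = {"P1": 50.0, "P2": 60.0}
    candidates = {"C1": 49.0, "C2": 61.0, "C3": 80.0}
    print(match_controls(program, candidates))
```

What matters for bias is less the matching algorithm than the timing: the pairs are fixed using only data available before the program starts.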

In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.

All students in the selected experimental and control schools in the grades receiving the treatment would be included in the study, producing an ITT estimate. This design has little use for “big data” on how much individual students used the program, but such data would produce a “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound of program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).

This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.

That’s All?

Yup.  That’s all.  These designs do not make use of the “big data” cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turn-around times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are considered bias, of interest in after-the-fact TOT analyses but not as headline ITT outcomes.

If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.


How do Textbooks Fit Into Evidence-Based Reform?

In a blog I wrote recently, “Evidence, Standards, and Chicken Feathers,” I discussed my perception that states, districts, and schools, in choosing textbooks and other educational materials, put a lot of emphasis on alignment with standards, and very little on evidence of effectiveness.  My colleague Steve Ross objected, at least in the case of textbooks.  He noted that it was very difficult for a textbook to prove its effectiveness, because textbooks so closely resemble other textbooks that showing a difference between them is somewhere between difficult and impossible.  Since the great majority of classrooms use textbooks (paper or digital) or sets of reading materials that collectively resemble textbooks, the control group in any educational experiment is almost certainly also using a textbook (or equivalents).  So as evidence becomes more and more important, is it fair to hold textbooks to such a difficult standard of evidence? Steve and I had an interesting conversation about this point, so I thought I would share it with other readers of my blog.


First, let me define a couple of key words.  Most of what schools purchase could be called commodities.  These include desks, lighting, carpets, non-electronic whiteboards, playground equipment, and so on. Schools need these resources to provide students with safe, pleasant, attractive places in which to learn. I’m happy to pay taxes to ensure that every child has all of the facilities and materials they need. However, no one should expect such expenditures to make a measurable difference in achievement beyond ordinary levels.

In contrast, other expenditures are interventions.  These include teacher preparation, professional development, innovative technology, tutoring, and other services clearly intended to improve achievement beyond ordinary levels.   Educators would generally agree that such investments should be asked to justify themselves by showing their effectiveness in raising achievement scores, since that is their goal.

By analogy, hospitals invest a great deal in their physical plants, furniture, lighting, carpets, and so on. These are all necessary commodities.   No one should have to go to a hospital that is not attractive, bright, airy, comfortable, and convenient, with plenty of parking.  These things may contribute to patients’ wellness in subtle ways, but no one would expect them to make major differences in patient health.  What does make a measurable difference is the preparation and training provided to the staff, medicines, equipment, and procedures, all of which can be (and are) constantly improved through ongoing research, development, and dissemination.

So is a textbook a commodity or an intervention?  If we accept that every classroom must have a textbook or its equivalent (such as a digital text), then a textbook is a commodity, just an ordinary, basic requirement for every classroom.  We would expect textbooks-as-commodities to be well written, up-to-date, attractive, and pedagogically sensible, and, if possible, aligned with state and national standards.  But it might be unfair and perhaps futile to expect textbooks-as-commodities to significantly increase student achievement in comparison to business as usual, because they are, in effect, business as usual.

If, somehow, a print or digital textbook, with associated professional development, digital add-ons, and so forth, turns out to be significantly more effective than alternative, state-of-the-art textbooks, then a textbook could also be considered an intervention, and marketed as such.  It would then be considered in comparison to other interventions that exist only, or primarily, to increase achievement beyond ordinary levels.

The distinction between commodities and interventions would be academic but for the appearance of the ESSA evidence standards.  The ESSA law requires that schools seeking school improvement funding select and implement programs that meet one of the top three standards (strong, moderate, or promising). It gives preference points on other federal grants, especially Title II (professional development), to applicants who promise to implement proven programs. Some states have applied more stringent criteria, and some have extended use of the standards to additional funding initiatives, including state initiatives.  These are all very positive developments. However, they are making textbook publishers anxious. How are they going to meet the new standards, given that their products are not so different from others now in use?

My answer is that I do not think it was the intent of the ESSA standards to forbid schools from using textbooks that lack evidence of effectiveness. To do so would be unrealistic, as it would wipe out at least 90% of textbooks.  Instead, the purpose of the ESSA evidence standards was to encourage and incentivize the use of interventions proven to be effective.  The concept, I think, was to assume that other funding (especially state and local funds) would support the purchase of commodities, including ordinary textbooks.  In contrast, the federal role was intended to focus on interventions to boost achievement in high-poverty and low-achieving schools.  Ordinary textbooks that are no more effective than any others are clearly not appropriate for those purposes, where there is an urgent need for approaches proven to have significantly greater impacts than methods in use today.

It would be a great step forward if federal, state, and local funding intended to support major improvements in student outcomes were held to tough standards of evidence.  Such programs should be eligible for generous and strategic funding from federal, state, and local sources dedicated to the enhancement of student outcomes.  But no one should limit schools in spending their funds on attractive desks, safe and fun playground equipment, and well-written textbooks, even though these necessary commodities are unlikely to accelerate student achievement beyond current expectations.

Photo credit: Laurentius de Voltolina [Public domain]

 This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.