Bad Science II: Brief, Small, and Artificial Studies


“We learned from correlational research that students who speak Latin do better in school. So this year we’re teaching everything in Latin.”

The oldest joke in academia goes like this. A professor is shown the results of an impressive experiment. “That may work in practice,” she says, “but how will it work in the laboratory?”

For practitioners trying to make sense of the findings of educational research, this is no laughing matter. They are often left to figure out whether there is meaningful evidence supporting a given practice or policy. Yet all too often, academics report findings from experiments that are too brief, too small, or too artificial to be a reliable basis for educational decisions.

This problem is easy to see in the original articles. Would you use or recommend a classroom management approach that has been successfully evaluated in a one-hour experiment? Or one evaluated with only 20 students? Or one evaluated in a situation in which teachers in the experimental group had graduate students helping them in class every day?

The problem comes when busy educators or researchers rely on reviews of research. The reviews may make sweeping statements about the effects of various practices based on very brief, small, or artificial experiments, yet a lot of detective work may be necessary to find this out. Years ago, I was re-analyzing a review of research on class size and found one study with a far larger effect than all the others. After much sleuthing, I found out why: it was a study of tennis instruction, where students in larger tennis groups got far less court time.

So what should a reader do? Some reviews, including Social Programs that Work, Blueprints for Violence Prevention, and our own Best Evidence Encyclopedia, take sample size, duration, and artificiality into account. Otherwise, if you want to know for sure, you’ll have to put on your own deerstalker and do your own detective work, finding the essential experiments that took place in real schools, over real periods of time, under realistic conditions. Evidence-based reform in education won’t really take hold until readers can consistently find reliable, easily interpretable, and unbiased information on practical programs and practices.

In case you missed the first part in this series last week, check it out here: Bad Science I: Bad Measures

Illustration: Slavin, R.E. (2007). Educational research in the age of accountability. Boston: Allyn & Bacon. Reprinted with permission of the author.

Find Bob Slavin on Facebook!

Bad Science I: Bad Measures


“My multiple choice test on bike riding was very reliable.
How come none of my kids can ride a bike?”

As an advocate for evidence-based reform in education, I’m always celebrating the glorious possibilities of basing educational policies and practices on the findings of “rigorous” research. Who could disagree? For this idea to have any bite, however, it is important to understand what I mean by “rigorous.”

In general, a rigorous study evaluating an educational program is one that compares, say, some number of teachers or schools in an experimental group using program X to others with very similar characteristics in a control group using program Y, which may just be traditional education. Clear enough so far.

One problem arises when we ask, “On what measures should programs X and Y be compared?” Often, this debate revolves around measures felt to be insensitive to real learning gains, as when a study of a science program is evaluated with a multiple-choice science test. Such studies tend to understate likely program effects.

An even bigger problem occurs when experimenters make up their own measures that are closely aligned with the experimental program (X) but not the control program (Y). For example, imagine that a researcher develops a vocabulary-building treatment for learning English and then creates a test around the words emphasized in the program (words that may never even have been introduced to the control group). Or imagine that a researcher develops a science program that spends twice as much time as usual on properties of light, and then develops a test heavily weighted toward the very concepts about light added in the extra time. Or a researcher introduces a topic earlier than usual (such as certain mathematics topics in preschool) and then uses a measure of that content, to which the control group was never exposed. In each of these cases, the experimental group has a huge advantage over the control group, simply because it received far more teaching on the topic being assessed.

There is a simple solution to this problem: hold the content of instruction constant while varying the methods, or use widely accepted measures not developed by the experimenter. Studies using measures that are fair to both the experimental and control groups tend to report much smaller impacts, but those impacts are far more believable than those from studies using measures slanted toward the experimental treatment.

Illustration: Slavin, R.E. (2007). Educational research in the age of accountability. Boston: Allyn & Bacon. Reprinted with permission of the author.

Next Week: Bad Science II: Brief, Small, and Artificial Studies
Find Bob Slavin on Facebook!