If you follow my blogs, you’ve probably noticed that I stay away from three topics on which reasoned discourse is impossible: religion, politics, and statistics. However, just this once I’d like to break my own rule and talk about statistics, or rather research design. And I promise not to be too nerdy.
While there is little argument about basic principles of statistics and research design, things do get a bit dicey in the real world. Some of my colleagues resolve any situation that is less than ideal by ignoring studies with the slightest flaw. I think that can be a huge waste of (usually) government money, and can deprive researchers and educators alike of valuable information.
My personal position is that all flaws are not created equal. In particular, some flaws introduce bias and some do not. For example, use of researcher-made measures, small sample sizes, and matched rather than randomized designs introduce bias, so they should be avoided or minimized in importance.
On the other hand, accounting for clustering in designs in which students are grouped in classes or schools is now considered essential. That is, if you randomly assign 20 schools to experimental (n=10) or control (n=10) conditions, you might have 5000 students per treatment. Randomly assigning 5000 students one at a time would be a huge study. In fact, 300 students might be enough. However, in a clustered study, 5000 per treatment may be too small. Current statistical principles demand that you use a method called Hierarchical Linear Modeling (HLM) to analyze the data, and unless the effect size is very large, 20 schools will not be sufficient for statistical significance.
Yet here’s the rub: failing to account for clustering does not introduce bias. That is, if you (mistakenly) analyzed at the student level in a study in which treatments were implemented at the class or school level, the effect size would be about the same. All that would change would be statistical significance. That is, you would overstate the number of experimental-control differences claimed to be significant (i.e., beyond what you’d expect by chance).
All right, let’s accept that clustered data should be analyzed using HLM, which accounts for clustering. But while we are straining at the clustering gnat, what camels are we swallowing?
My personal bugbear is researcher-made measures. Often, the very same researchers who take an unyielding position on clustering happily accept research designs in which the researcher made the test, even if the test is clearly aligned with the content the experimental group (but not the control group) was taught. In some studies, the teachers who provided tutoring, for example, also gave the tests. Strict-on-clustering researchers also often accept studies that were very brief, sometimes a week or less, or often just an hour. They may accept studies in which conditions in the experimental groups were substantially enhanced beyond what could ever be done in real life, as in technology studies in which a graduate student is placed in every class or even every small group every day to “help with the technology.”
All of these research designs are far more likely to produce misleading findings than are studies that only suffer from clustering problems, and worse, these effects introduce bias, while failing to attend to clustering does not.
Why is this of importance to non-statisticians? It matters because in education, students are usually taught in large groups, so except for studies of one-to-one or small-group tutoring, clustering almost always has to be accounted for, and as a consequence, randomized experiments typically must involve 40-50 schools (20-25 per treatment) to detect an effect size as small as 0.20. Such experiments are very expensive, and they are difficult to do if you are not an expert already. The clustering requirement, therefore, makes it difficult for researchers early in their careers to get funding and to show success if they do, because managing implementation and collecting data in 50 schools is really, really hard.
I do not have a good solution for this problem, and I upset my colleagues when I bring it up. But we have to face it. Making accounting for clustering an absolute makes educational research too expensive, and put another way it means that we can do too few studies for the dollars we do invest. And this requirement bars entry to the field to those unable to get multi-million dollar grants or to manage large field experiments.
One solution to the cluster problem might be to have research funders fund step-by-step studies. For example, imagine that funding were offered for studies of 10 schools to be analyzed at the cluster level (correct but hopelessly underpowered) and at the student level (Bad! But affordable.). If the outcomes are promising, funders could fund another 10-school study, and researchers could combine the samples, repeating this process until there are enough schools to collectively justify a proper clustered analysis. This would also enable neophyte researchers to learn from experience, it would allow everyone to learn over time what the potential impacts are, and it could save billions of dollars now being spent on monster randomized studies of programs never before having shown promising effects (which then turn out to be ineffective).
A gradual approach to clustering might enable the field of education to focus on the real enemy, which is bias. If we systematically stamp out design elements that add bias, then over time the field will converge upon truth, and will cost-effectively move forward knowledge of what works, in time to benefit today’s children. The curse of the cluster is holding back the whole field. With all due respect to the real problems clustered designs present, let’s find ways to compromise so we can learn from unbiased but modest-sized studies and go step-by-step toward better information for practice.