How Computers Can Help Do Bad Research

“To err is human. But it takes a computer to really (mess) things up.” – Anonymous

Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.

Actually, there are many problems, all sharing the possibility of serious bias created when computers are used to collect “big data” on computer-based instruction. (Note that I am not accusing computers of being biased in favor of their electronic pals! Computers do not have biases. They do what their operators ask them to do. So far.) The problem is that “big data” often contains “big bias.”

Here is one common problem.  Evaluators of computer-based instruction almost always have available massive amounts of data indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.

Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.

This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?

The answer is that students who did use the computers an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features about those classes or schools that pretests do not capture.

Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or some students may be taking ‘credit recovery’ classes online to replace classes they have failed.

Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to schools without access to computers can be just as bad, if only the computer-using students in the computer-using schools are included.  To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot be compared to the whole non-computer (control) schools. The reason is that in the control schools, we have to assume that given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.
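For readers who like to see this in numbers, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: a hypothetical unmeasured “motivation” trait that pretests miss, a program with a true effect of exactly zero, and the rule that the more motivated half of the program school are the ones who use the software as recommended. The function name and parameters are invented for this sketch.

```python
import random
import statistics


def simulate_usage_bias(n_per_group=5000, seed=0):
    """Simulate a program with zero true effect and compare two analyses."""
    rng = random.Random(seed)

    def make_students(n):
        students = []
        for _ in range(n):
            # Hypothetical unmeasured trait (motivation, skill) that pretests miss.
            motivation = rng.gauss(0, 1)
            # Posttest depends on the trait plus noise; the program adds nothing.
            posttest = motivation + rng.gauss(0, 1)
            students.append((motivation, posttest))
        return students

    program_school = make_students(n_per_group)
    control_school = make_students(n_per_group)

    # Assumption: the more motivated half of the program school are the
    # students who use the software as much as the developers recommend.
    cutoff = statistics.median(m for m, _ in program_school)
    heavy_users = [post for m, post in program_school if m > cutoff]

    control_mean = statistics.fmean(post for _, post in control_school)
    program_mean = statistics.fmean(post for _, post in program_school)
    itt_difference = program_mean - control_mean
    users_only_difference = statistics.fmean(heavy_users) - control_mean
    return itt_difference, users_only_difference


itt, users_only = simulate_usage_bias()
print(f"All students vs. control: {itt:+.2f}")
print(f"Heavy users vs. control:  {users_only:+.2f}")
```

Under these assumptions, the all-students comparison hovers near zero, as it should, while the heavy-users comparison shows a sizable “effect” for a program that does nothing at all.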

Another evaluation design particularly susceptible to bias is studies in which, say, schools using any program are matched (based on pretests, demographics, and so on) with other schools that did not use the program, after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will “cherry-pick” their favorite experimental schools and “crabapple-pick” control schools known to have done poorly.


Solutions to Problems in Evaluating Computer-based Programs.

Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.

Randomly Assigning Schools.

The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some number of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group), and half to a business-as-usual control group. Measure achievement at pre- and post-test, and analyze using HLM or some other multi-level method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance. Randomized experiments produce “intent to treat” (ITT) estimates of program impacts that include all students whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes. Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.

High-Quality Matched Studies.

It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.

In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.

In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.

All students in the selected experimental and control schools in the grades receiving the treatment would be included in the study, producing an ITT estimate. This design has little use for “big data” on how much individual students used the program, but such data would produce a “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound of program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).

This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.

That’s All?

Yup. That’s all. These designs do not make use of the “big data” cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turn-around times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are considered bias, of interest in after-the-fact TOT analyses but not as headline ITT outcomes.

If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.

Small Studies, Big Problems

Everyone knows that “good things come in small packages.” But in research evaluating practical educational programs, this saying does not apply. Small studies are very susceptible to bias. In fact, among all the factors that can inflate effect sizes in educational experiments, small sample size is among the most powerful. This problem is widely known, and in reviewing large and small studies, most meta-analysts solve the problem by requiring minimum sample sizes and/or weighting effect sizes by their sample sizes. Problem solved.


For some reason, the What Works Clearinghouse (WWC) has so far paid little attention to sample size. It has not weighted by sample size in computing mean effect sizes, although the WWC is talking about doing this in the future. It has not even set minimums for sample size for its reviews. I know of one accepted study with a total sample size of 12 (6 experimental, 6 control). These procedures greatly inflate WWC effect sizes.

As one indication of the problem, our review of 645 reading, math, and science studies accepted by the Best Evidence Encyclopedia (www.bestevidence.org) found that studies with fewer than 250 subjects had twice the effect sizes of those with more than 250 (effect sizes = +0.30 vs. +0.16). Comparing studies with fewer than 100 students to those with more than 3000, the ratio was 3.5 to 1 (see Cheung & Slavin [2016] at http://www.bestevidence.org/word/methodological_Sept_21_2015.pdf). Several other studies have found the same effect.

In data from the What Works Clearinghouse reading and math studies, obtained by graduate student Marta Pellegrini (2017), sample size effects were also extraordinary. The mean effect size for sample sizes of 60 or less was +0.37; for samples of 60-250, +0.29; and for samples of more than 250, +0.13. Among all design factors she studied, small sample size made the most difference in outcomes, rivaled only by researcher/developer-made measures. In fact, sample size is more pernicious, because while reviewers can exclude researcher/developer-made measures within a study and focus on independent measures, a study with a small sample has the same problem for all measures. Also, because small-sample studies are relatively inexpensive, there are quite a lot of them, so reviews that fail to attend to sample size can greatly over-estimate overall mean effect sizes.

My colleague Amanda Inns (2018) recently analyzed WWC reading and math studies to find out why small studies produce such inflated outcomes. There are many reasons small-sample studies may produce such large effect sizes. One is that in small studies, researchers can provide extraordinary amounts of assistance or support to the experimental group. This is called “superrealization.” Another is that when studies with small sample sizes find null effects, the studies tend not to be published or made available at all, deemed a “pilot” and forgotten. In contrast, a large study is likely to have been paid for by a grant, which will produce a report no matter what the outcome. There has long been an understanding that published studies produce much higher effect sizes than unpublished studies, and one reason is that small studies are rarely published if their outcomes are not significant.
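The “forgotten pilot” mechanism is easy to sketch in code. The simulation below is an illustration under stated assumptions, not an analysis of any real data: the true effect of +0.10, the normal approximation to the sampling distribution of an effect size, the rough standard error of sqrt(4/N), and the rule that small studies only see the light of day when significantly positive are all hypothetical choices made for this sketch.

```python
import math
import random
import statistics


def mean_available_effect(total_n, true_effect=0.10, n_studies=2000,
                          only_significant=False, seed=1):
    """Average effect size across the simulated studies that get reported."""
    rng = random.Random(seed)
    # Rough standard error of a standardized mean difference
    # when total_n students are split into two equal groups.
    se = math.sqrt(4 / total_n)
    available = []
    for _ in range(n_studies):
        observed = rng.gauss(true_effect, se)
        significant = observed / se > 1.96
        # Assumed reporting rule: if only_significant is set, studies
        # with non-significant results are filed away as "pilots".
        if significant or not only_significant:
            available.append(observed)
    return statistics.fmean(available)


small = mean_available_effect(total_n=40, only_significant=True)
large = mean_available_effect(total_n=2000, only_significant=False)
print(f"Small studies that were reported: {small:+.2f}")
print(f"Large studies, all reported:      {large:+.2f}")
```

With 40 students, a study has to observe an effect size above roughly +0.62 just to reach significance, so the reported small studies necessarily average well above that, even though the true effect in this sketch is only +0.10. The large studies, all of which are reported, land close to the truth.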

Whatever the reasons, there is no doubt that small studies greatly overstate effect sizes. In reviewing research, this well-known fact has long led meta-analysts to weight effect sizes by their sample sizes (usually using an inverse variance procedure). Yet as noted earlier, the WWC does not do this, but just averages effect sizes across studies without taking sample size into account.

One example of the problem of ignoring sample size in averaging is provided by Project CRISS. CRISS was evaluated in two studies. One had 231 students. On a staff-developed “free recall” measure, the effect size was +1.07. The other study had 2338 students, and an average effect size on standardized measures of -0.02. Clearly, the much larger study with an independent outcome measure should have swamped the effects of the small study with a researcher-made measure, but this is not what happened. The WWC just averaged the two effect sizes, obtaining a mean of +0.53.
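The arithmetic is worth seeing side by side. The sketch below uses the two CRISS sample sizes and effect sizes given above. For simplicity it weights by sample size directly, which is an assumption standing in for the inverse variance procedure meta-analysts usually use; the function names are invented for this illustration.

```python
# The two Project CRISS studies as described in the text.
criss_studies = [
    {"n": 231, "effect_size": 1.07},
    {"n": 2338, "effect_size": -0.02},
]


def simple_mean(studies):
    """Unweighted average: every study counts the same."""
    return sum(s["effect_size"] for s in studies) / len(studies)


def sample_size_weighted_mean(studies):
    """Average weighted by sample size (a simple stand-in for
    inverse variance weighting)."""
    total_n = sum(s["n"] for s in studies)
    return sum(s["n"] * s["effect_size"] for s in studies) / total_n


unweighted = simple_mean(criss_studies)
weighted = sample_size_weighted_mean(criss_studies)
print(f"Unweighted mean: {unweighted:+.3f}")
print(f"Weighted mean:   {weighted:+.3f}")
```

The unweighted mean is the +0.53 (more precisely, +0.525) reported above. Weighting by sample size gives about +0.08, which is much closer to what the large study found.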

How might the WWC set minimum sample sizes for studies to be included for review? Amanda Inns proposed a minimum of 60 students (at least 30 experimental and 30 control) for studies that analyze at the student level. She suggests a minimum of 12 clusters (6 and 6), such as classes or schools, for studies that analyze at the cluster level.
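As a sketch of how simple such a screen would be, the proposed minimums can be written as a few lines of code. The function name and the "student"/"cluster" labels are hypothetical; only the thresholds (30 per group for student-level analyses, 6 per group for cluster-level analyses) come from the proposal described above.

```python
# Per-group minimums from the proposal: 30 students or 6 clusters per condition.
PER_GROUP_MINIMUM = {"student": 30, "cluster": 6}


def meets_minimum_sample(unit_of_analysis, n_experimental, n_control):
    """Return True if both groups reach the proposed minimum size."""
    minimum = PER_GROUP_MINIMUM[unit_of_analysis]
    return n_experimental >= minimum and n_control >= minimum


# A student-level study with 6 experimental and 6 control students fails.
print(meets_minimum_sample("student", 6, 6))
# A student-level study with 30 and 30 students passes.
print(meets_minimum_sample("student", 30, 30))
# A cluster-level study with 6 and 6 schools passes.
print(meets_minimum_sample("cluster", 6, 6))
```

Under this rule, the accepted study with a total sample of 12 students mentioned earlier would not have qualified for review.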

In educational research evaluating school programs, good things come in large packages. Small studies are fine as pilots, or for descriptive purposes. But when you want to know whether a program works in realistic circumstances, go big or go home, as they say.

The What Works Clearinghouse should exclude very small studies and should use weighting based on sample sizes in computing means. And there is no reason it should not start doing these things now.

References

Inns, A. & Slavin, R. (2018, August). Do small studies add up in the What Works Clearinghouse? Paper presented at the meeting of the American Psychological Association, San Francisco, CA.

Pellegrini, M. (2017, August). How do different standards lead to different conclusions? A comparison between meta-analyses of two research centers. Paper presented at the European Conference on Educational Research (ECER), Copenhagen, Denmark.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Higher Ponytails (And Researcher-Made Measures)


Some time ago, I coached my daughter’s fifth grade basketball team. I knew next to nothing about basketball (my sport was…well, chess), but fortunately my research assistant, Holly Roback, eagerly volunteered. She’d played basketball in college, so our girls got outstanding coaching. However, they got whammed. My assistant coach explained it after another disastrous game, “The other team’s ponytails were just higher than ours.” Basically, our girls were terrific at ball handling and free shots, but they came up short in the height department.

Now imagine that in addition to being our team’s coach I was also the league’s commissioner. Imagine that I changed the rules. From now on, lay-ups and jump shots were abolished, and the ball had to be passed three times from player to player before a team could score.

My new rules could be fairly and consistently enforced, but their entire effect would be to diminish the importance of height and enhance the importance of ball handling and set shots.

Of course, I could never get away with this. Every fifth grader, not to mention their parents and coaches, would immediately understand that my rule changes unfairly favored my own team, and disadvantaged theirs (at least the ones with the higher ponytails).

This blog is not about basketball, of course. It is about researcher-made measures or developer-made measures. (I’m using “researcher-made” to refer to both). I’ve been writing a lot about such measures in various blogs on the What Works Clearinghouse (https://wordpress.com/post/robertslavinsblog.wordpress.com/795 and https://wordpress.com/post/robertslavinsblog.wordpress.com/792).

The reason I’m writing again about this topic is that I’ve gotten some criticism for my criticism of researcher-made measures, and I wanted to respond to these concerns.

First, here is my case, simply put. Measures made by researchers or developers are likely to favor whatever content was taught in the experimental group. I’m not in any way suggesting that researchers or developers are deliberately making measures to favor the experimental group. However, it usually works out that way. If the program teaches unusual content, no matter how laudable that content may be, and the control group never saw that content, then the potential for bias is obvious. If the experimental group was taught on computers and the control group was not, and the test was given on a computer, the bias is obvious. If the experimental treatment emphasized certain vocabulary, and the control group did not, then a test of those particular words has obvious bias. If a math program spends a lot of time teaching students to do mental rotations of shapes, and the control treatment never did such exercises, a test that includes mental rotations is obviously biased. In our BEE full-scale reviews of pre-K to 12 reading, math, and science programs, available at www.bestevidence.org, we have long excluded such measures, calling them “treatment-inherent.” The WWC calls such measures “over-aligned,” and says it excludes them.

However, the problem turns out to be much deeper. In a 2016 article in the Educational Researcher, Alan Cheung and I tested outcomes from all 645 studies in the BEE achievement reviews, and found that even after excluding treatment-inherent measures, measures made by researchers or developers had effect sizes that were far higher than those for measures not made by researchers or developers, by a ratio of two to one (effect sizes = +0.40 for researcher-made measures, +0.20 for independent measures). Graduate student Marta Pellegrini more recently analyzed data from all WWC reading and math studies. The ratio among WWC studies was 2.7 to 1 (effect sizes = +0.52 for researcher-made measures, +0.19 for independent ones). Again, the WWC was supposed to have already removed overaligned measures, all of which (I’d assume) were also researcher-made.

Some of my critics argue that because the WWC already excludes overaligned measures, they have already taken care of the problem. But if that were true, there would not be a ratio of 2.7 to 1 in effect sizes between researcher-made and independent measures, after removing measures considered by the WWC to be overaligned.

Other critics express concern that my analyses (of bias due to researcher-made measures) have only involved reading, math, and science measures, and the situation might be different for measures of social-emotional outcomes, for example, where appropriate measures may not exist.

I will admit that in areas other than achievement the issues are different, and I’ve written about them. So I’ll be happy to limit the simple version of “no researcher-made measures” to achievement measures. The problems of measuring social-emotional outcomes fairly are far more complex, and for another day.

Other critics express concern that even on achievement measures, there are situations in which appropriate measures don’t exist. That may be so, but in policy-oriented reviews such as the WWC or Evidence for ESSA, it’s hard to imagine that there would be no existing measures of reading, writing, math, science, or other achievement outcomes. An achievement objective so rarified that it has never been measured is probably not particularly relevant for policy or practice.

The WWC is not an academic journal, and it is not primarily intended for academics. If a researcher needs to develop a new measure to test a question of theoretical interest, they should do so by all means. But the findings from that measure should not be accepted or reported by the WWC, even if a journal might accept it.

Another version of this criticism is that researchers often have a strong argument that the program they are evaluating emphasizes standards that should be taught to all students, but are not. Therefore, enhanced performance on a (researcher-made) measure of the better standard is prima facie evidence of a positive program impact. This argument confuses the purpose of experimental evaluations with the purpose of standards. Standards exist to express what we want students to know and be able to do. Arguing for a given standard involves considerations of the needs of the economy, standards of other states or countries, norms of the profession, technological or social developments, and so on—but not comparisons of experimental groups scoring well on tests of a new proposed standard to control groups never exposed to content relating to that standard. It’s just not fair.

To get back to basketball, I could have argued that the rules should be changed to emphasize ball handling and reduce the importance of height. Perhaps this would be a good idea, for all I know. But what I could not do was change the rules to benefit my team. In the same way, researchers cannot make their own measures and then celebrate higher scores on them as indicating higher or better standards. As any fifth grader could tell you, advocating for better rules is fine, but changing the rules in the middle of the season is wrong.


Swallowing Camels


The New Testament contains a wonderful quotation that I use often, because it unfortunately applies to so much of educational research:

Ye blind guides, which strain at a gnat, and swallow a camel (Matthew 23:24).

The point of the quotation, of course, is that we are often fastidious about minor (research) sins while readily accepting major ones.

In educational research, “swallowing camels” applies to studies accepted in top journals or by the What Works Clearinghouse (WWC) despite substantial flaws that lead to major bias, such as use of measures slanted toward the experimental group, or measures administered and scored by the teachers who implemented the treatment. “Straining at gnats” applies to concerns that, while worth attending to, have little potential for bias, yet are often reasons for rejection by journals or downgrading by the WWC. For example, our profession considers p<.05 to indicate statistical significance, while p<.051 should never be mentioned in polite company.

As my faithful readers know, I have written a series of blogs on problems with policies of the What Works Clearinghouse, such as acceptance of researcher/developer-made measures, failure to weight by sample size, use of “substantively important but not statistically significant” as a qualifying criterion, and several others. However, in this blog, I wanted to share with you some of the very worst, most egregious examples of studies that should never have seen the light of day, yet were accepted by the WWC and remain in it to this day. Accepting the WWC as gospel means swallowing these enormous and ugly camels, and I wanted to make sure that those who use the WWC at least think before they gulp.

Camel #1: DaisyQuest. DaisyQuest is a computer-based program for teaching phonological awareness to children in pre-K to Grade 1. The WWC gives DaisyQuest its highest rating, “positive,” for alphabetics, and ranks it eighth among literacy programs for grades pre-K to 1.

There were four studies of DaisyQuest accepted by the WWC. In each, half of the students received DaisyQuest in groups of 3-4, working with an experimenter. In two of the studies, control students never had their hands on a computer before they took the final tests on a computer. In the other two, control students used math software, so they at least got some experience with computers. The outcome tests were all made by the experimenters and all were closely aligned with the content of the software, with the exception of two Woodcock scales used in one of the studies. All studies used a measure called “Undersea Challenge” that closely resembled the DaisyQuest game format and was taken on the computer. All four studies also used the other researcher-made measures. None of the Woodcock measures showed statistically significant differences, but the researcher-made measures, especially Undersea Challenge and other specific tests of phonemic awareness, segmenting, and blending, did show substantial significant differences.

Recall that in the mid- to late-1990s, when the studies were done, students in preschool and kindergarten were unlikely to be getting any systematic teaching of phonemic awareness. So there is no reason to expect the control students to be learning anything that was tested on the posttests, and it is not surprising that effect sizes averaged +0.62. In the two studies in which control students had never touched a computer, effect sizes were +0.90 and +0.89, respectively.

Camel #2: Brady (1990) study of Reciprocal Teaching

Reciprocal Teaching is a program that teaches students comprehension skills, mostly using science and social studies texts. A 1990 dissertation by P.L. Brady evaluated Reciprocal Teaching in one school, in grades 5-8. The study involved only 12 students, randomly assigned to Reciprocal Teaching or control conditions. The one experimental class was taught by…wait for it…P.L. Brady. The measures included science, social studies, and daily comprehension tests related to the content taught in Reciprocal Teaching but not the control group. They were created and scored by…(you guessed it) P.L. Brady. There were also two Gates-MacGinitie scales, but they had effect sizes much smaller than the researcher-made (and -scored) tests. The Brady study met WWC standards for “potentially positive” because it had a mean effect size of more than +0.25 but was not statistically significant.

Camel #3: Schwartz (2005) study of Reading Recovery

Reading Recovery is a one-to-one tutoring program for first graders that has a strong tradition of rigorous research, including a recent large-scale study by May et al. (2016). However, one of the earlier studies of Reading Recovery, by Schwartz (2005), is hard to swallow, so to speak.

In this study, 47 Reading Recovery (RR) teachers across 14 states were asked by e-mail to choose two very similar, low-achieving first graders at the beginning of the year. One student was randomly assigned to receive RR, and one was assigned to the control group, to receive RR in the second semester.

Both students were pre- and posttested on the Observation Survey, a set of measures made by Marie Clay, the developer of RR. In addition, students were tested on Degrees of Reading Power, a standardized test.

The problems with this study mostly have to do with the fact that the teachers who administered pre- and posttests were the very same teachers who provided the tutoring. No researcher or independent tester ever visited the schools. Teachers obviously knew the child they personally tutored. I’m sure the teachers were honest and wanted to be accurate. However, they would have had a strong motivation to see that the outcomes looked good, because they could be seen as evaluations of their own tutoring, and could have had consequences for continuation of the program in their schools.

Most Observation Survey scales involve difficult judgments, so it’s easy to see how teachers’ ratings could be affected by their belief in Reading Recovery.

Further, ten of the 47 teachers never submitted any data. This is a very high rate of attrition within a single school year (21%). Could some teachers, fully aware of their students’ less-than-expected scores, have made some excuse and withheld their data? We’ll never know.

Also recall that most of the tests used in this study were from the Observation Survey made by Clay, which had effect sizes ranging up to +2.49 (!!!). However, on the independent Degrees of Reading Power, the non-significant effect size was only +0.14.

It is important to note that across these “camel” studies, all except Brady (1990) were published. So it was not only the WWC that was taken in.

These “camel” studies are far from unique, and they may not even be the very worst to be swallowed whole by the WWC. But they do give readers an idea of the depth of the problem. No researcher I know of would knowingly accept an experiment in which the control group had never used the equipment on which they were to be posttested, or one with 12 students in which the 6 experimentals were taught by the experimenter, or in which the teachers who tutored students also individually administered the posttests to experimentals and controls. Yet somehow, WWC standards and procedures led the WWC to accept these studies. Swallowing these camels should have caused the WWC a stomach ache of biblical proportions.

 

References

Brady, P. L. (1990). Improving the reading comprehension of middle school students through reciprocal teaching and semantic mapping and strategies. Dissertation Abstracts International, 52 (03A), 230-860.

May, H., Sirinides, P., Gray, A., & Goldsworthy, H. (2016). Reading Recovery: An evaluation of the four-year i3 scale-up. Newark, DE: University of Delaware, Center for Research in Education and Social Policy.

Schwartz, R. M. (2005). Literacy learning of at-risk first grade students in the Reading Recovery early intervention. Journal of Educational Psychology, 97 (2), 257-267.


…But It Was The Very Best Butter! How Tests Can Be Reliable, Valid, and Worthless

I was recently having a conversation with a very well-informed, statistically savvy, and experienced researcher, who was upset that we do not accept researcher- or developer-made measures for our Evidence for ESSA website (www.evidenceforessa.org). “But what if a test is reliable and valid,” she said, “Why shouldn’t it qualify?”

I inwardly sighed. I get this question a lot. So I thought I’d write a blog on the topic, so at least the people who read it, and perhaps their friends and colleagues, will know the answer.

Before I get into the psychometric stuff, I should say in plain English what is going on here, and why it matters. Evidence for ESSA excludes researcher- and developer-made measures because they enormously overstate effect sizes. Marta Pellegrini, at the University of Florence in Italy, recently analyzed data from every reading and math study accepted for review by the What Works Clearinghouse (WWC). She compared outcomes on tests made up by researchers or developers to those that were independent. The average effect sizes across hundreds of studies were +0.52 for researcher/developer-made measures, and +0.19 for independent measures. Almost three to one. We have also made similar comparisons within the very same studies, and the differences in effect sizes averaged 0.48 in reading and 0.45 in math.

Wow.

How could there be such a huge difference? The answer is that researchers’ and developers’ tests often focus on what they knew would be taught in the experimental group but not the control group. A vocabulary experiment might use a test that contains the specific words emphasized in the program. A science experiment might use a test that emphasizes the specific concepts taught in the experimental units but not in the control group. A program using technology might test students on a computer, which the control group did not experience. Researchers and developers may give tests that use response formats like those used in the experimental materials, but not those used in control classes.

Very often, researchers or developers have a strong opinion about what students should be learning in their subject, and they make a test that represents to them what all students should know, in an ideal world. However, if only the experimental group experienced content aligned with that curricular philosophy, then they have a huge unfair advantage over the control group.
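To make the mechanism concrete, here is a small simulation sketch. All of the numbers are hypothetical: students answer an item correctly with probability 0.7 if the content was taught in their class and 0.5 if it was not, the researcher-made test covers only content taught in the experimental group, and the independent test covers content taught equally in both groups. The program is assumed to have no effect whatsoever on the content both groups share.

```python
import math
import random
import statistics

P_TAUGHT = 0.7    # assumed chance of a correct answer on taught content
P_UNTAUGHT = 0.5  # assumed chance of a correct answer on untaught content


def effect_size(experimental, control):
    """Cohen's d using the pooled standard deviation of two equal-sized groups."""
    pooled_sd = math.sqrt(
        (statistics.variance(experimental) + statistics.variance(control)) / 2
    )
    return (statistics.fmean(experimental) - statistics.fmean(control)) / pooled_sd


def simulated_score(rng, n_items, p_correct):
    """Number of items answered correctly by one simulated student."""
    return sum(rng.random() < p_correct for _ in range(n_items))


def simulate_measures(n_students=1000, n_items=20, seed=2):
    rng = random.Random(seed)
    # Researcher-made test: content only the experimental group was taught.
    researcher_exp = [simulated_score(rng, n_items, P_TAUGHT) for _ in range(n_students)]
    researcher_ctl = [simulated_score(rng, n_items, P_UNTAUGHT) for _ in range(n_students)]
    # Independent test: content pursued equally in both groups.
    independent_exp = [simulated_score(rng, n_items, P_TAUGHT) for _ in range(n_students)]
    independent_ctl = [simulated_score(rng, n_items, P_TAUGHT) for _ in range(n_students)]
    return (effect_size(researcher_exp, researcher_ctl),
            effect_size(independent_exp, independent_ctl))


researcher_d, independent_d = simulate_measures()
print(f"Researcher-made measure: {researcher_d:+.2f}")
print(f"Independent measure:     {independent_d:+.2f}")
```

In this sketch the researcher-made measure shows a very large effect size while the independent measure shows essentially none. The difference comes entirely from which content ended up on the test, not from anything the program did for students’ broader achievement.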

So how can it be that using even the most reliable and valid tests doesn’t solve this problem?

In Alice in Wonderland, the Mad Hatter tries to fix the White Rabbit’s watch by opening it and putting butter in the works. This does not help at all, and the Mad Hatter remarks, “But it was the very best butter!”

The point of the “very best butter” conversation in Alice in Wonderland is that something can be excellent for one purpose (e.g., spreading on bread), but worse than useless for another (e.g., fixing watches).

Returning to assessment, a test made by a researcher or developer might be ideal for determining whether students are making progress in the intended curriculum, but worthless for comparing experimental to control students.

Reliability (the ability of a test to give the same answer each time it is given) has nothing at all to do with the situation. Validity comes into play where the rubber hits the road (or the butter hits the watch).

Validity can mean many things. As reported in test manuals, it usually just means that a test’s scores correlate with scores on other tests intended to measure the same thing (convergent validity), or possibly that it correlates better with things it should correlate with than with things it shouldn’t, as when a reading test correlates better with other reading tests than with math tests (discriminant validity). However, no test manual ever addresses validity for use as an outcome measure in an experiment. For a test to be valid for that use, it must measure content being pursued equally in experimental and control classes, not biased toward the experimental curriculum.

Any test that reports very high reliability and validity in its test manual or research report may be admirable for many purposes, but like “the very best butter” for fixing watches, a researcher- or developer-made measure is worse than worthless for evaluating experimental programs, no matter how high it is in reliability and validity.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

The Curse of the Cluster

If you follow my blogs, you’ve probably noticed that I stay away from three topics on which reasoned discourse is impossible: religion, politics, and statistics. However, just this once I’d like to break my own rule and talk about statistics, or rather research design. And I promise not to be too nerdy.

While there is little argument about basic principles of statistics and research design, things do get a bit dicey in the real world. Some of my colleagues resolve any situation that is less than ideal by ignoring studies with the slightest flaw. I think that can be a huge waste of (usually) government money, and can deprive researchers and educators alike of valuable information.

My personal position is that not all flaws are created equal. In particular, some flaws introduce bias and some do not. For example, researcher-made measures, small sample sizes, and matched rather than randomized designs all introduce bias, so they should be avoided or given less weight.

On the other hand, accounting for clustering in designs in which students are grouped in classes or schools is now considered essential. That is, if you randomly assign 20 schools to experimental (n=10) or control (n=10) conditions, you might have 5000 students per treatment. Randomly assigning 5000 students one at a time would be a huge study. In fact, 300 students might be enough. However, in a clustered study, 5000 per treatment may be too small. Current statistical principles demand that you use a method called Hierarchical Linear Modeling (HLM) to analyze the data, and unless the effect size is very large, 20 schools will not be sufficient for statistical significance.
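To put rough numbers on this, here is a minimal sketch in Python of the standard "design effect" arithmetic, 1 + (m - 1) x ICC, where m is the number of students per school and ICC is the intraclass correlation (the share of score variance that lies between schools rather than within them). The ICC of 0.20 and the assumption that each school contributes 500 students are mine, chosen only for illustration; real values vary by outcome and grade level.

```python
def design_effect(students_per_cluster, icc):
    """Variance inflation caused by assigning whole clusters: 1 + (m - 1) * ICC."""
    return 1 + (students_per_cluster - 1) * icc


def effective_sample_size(n_students, students_per_cluster, icc):
    """Number of independently assigned students carrying the same information."""
    return n_students / design_effect(students_per_cluster, icc)


schools_per_arm = 10
students_per_school = 500   # so 5000 students per treatment, as in the example
icc = 0.20                  # ASSUMED intraclass correlation, for illustration only

n_per_arm = schools_per_arm * students_per_school
deff = design_effect(students_per_school, icc)
n_eff = effective_sample_size(n_per_arm, students_per_school, icc)

print(f"Students per treatment:  {n_per_arm}")
print(f"Design effect:           {deff:.1f}")
print(f"Effective n per treatment: {n_eff:.1f}")
```

Under these assumptions, 5000 students per treatment carry about as much information as roughly 50 students assigned one at a time, which is why a clustered study that looks enormous can still be underpowered.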

Yet here’s the rub: failing to account for clustering does not introduce bias. That is, if you (mistakenly) analyzed at the student level in a study in which treatments were implemented at the class or school level, the effect size would be about the same. All that would change would be statistical significance. That is, you would overstate the number of experimental-control differences claimed to be significant (i.e., beyond what you’d expect by chance).
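A small simulation can illustrate the point. The sketch below is my own illustration, not an analysis from any actual study. It assumes 10 schools per treatment, 50 students per school, an ICC of 0.20, and normally distributed scores with a total variance of 1, so that a difference in means is already in effect-size units. It compares a (mistaken) student-level t-test with a simple t-test on school means, which stands in here for a full HLM analysis.

```python
import random

T_CRIT_STUDENT = 1.96   # approx. two-tailed 5% critical value with ~1000 students
T_CRIT_SCHOOL = 2.101   # two-tailed 5% critical value with 18 degrees of freedom


def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def diff_and_t(a, b):
    """Difference in means and the two-sample t statistic."""
    ma, va = mean_and_var(a)
    mb, vb = mean_and_var(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return ma - mb, (ma - mb) / se


def simulate(true_effect, reps=400, schools_per_arm=10,
             students_per_school=50, icc=0.20, seed=1):
    """Repeat a school-randomized study many times and summarize both analyses."""
    rng = random.Random(seed)
    school_sd, student_sd = icc ** 0.5, (1 - icc) ** 0.5
    diffs, student_rejects, school_rejects = [], 0, 0
    for _rep in range(reps):
        students = {"treatment": [], "control": []}
        school_means = {"treatment": [], "control": []}
        for arm, shift in (("treatment", true_effect), ("control", 0.0)):
            for _school in range(schools_per_arm):
                school_effect = rng.gauss(0.0, school_sd)  # shared by the whole school
                scores = [shift + school_effect + rng.gauss(0.0, student_sd)
                          for _student in range(students_per_school)]
                students[arm].extend(scores)
                school_means[arm].append(sum(scores) / len(scores))
        # Student-level analysis ignores clustering; school-level analysis respects it.
        diff, t_student = diff_and_t(students["treatment"], students["control"])
        _, t_school = diff_and_t(school_means["treatment"], school_means["control"])
        diffs.append(diff)
        student_rejects += abs(t_student) > T_CRIT_STUDENT
        school_rejects += abs(t_school) > T_CRIT_SCHOOL
    return {
        "mean_estimate": sum(diffs) / reps,
        "student_level_reject_rate": student_rejects / reps,
        "school_level_reject_rate": school_rejects / reps,
    }


null_result = simulate(true_effect=0.0)
effect_result = simulate(true_effect=0.20)
print("No true effect:  ", null_result)
print("True effect 0.20:", effect_result)
```

If the sketch behaves as intended, the average estimated effect should sit close to the true value in both runs (with equal school sizes, the two analyses share the same difference in means), while the student-level test should declare "significance" far more often than 5% of the time when there is no effect at all. The school-level test should stay near the nominal 5%.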

All right, let’s accept that clustered data should be analyzed using HLM, which accounts for clustering. But while we are straining at the clustering gnat, what camels are we swallowing?

My personal bugbear is researcher-made measures. Often, the very same researchers who take an unyielding position on clustering happily accept research designs in which the researcher made the test, even if the test is clearly aligned with the content the experimental group (but not the control group) was taught. In some studies, the teachers who provided tutoring, for example, also gave the tests. Strict-on-clustering researchers also often accept studies that were very brief, sometimes a week or less, or often just an hour. They may accept studies in which conditions in the experimental groups were substantially enhanced beyond what could ever be done in real life, as in technology studies in which a graduate student is placed in every class or even every small group every day to “help with the technology.”

All of these design features are far more likely to produce misleading findings than are clustering problems alone, and worse, they introduce bias, while failing to attend to clustering does not.

Why is this of importance to non-statisticians? It matters because in education, students are usually taught in large groups, so except for studies of one-to-one or small-group tutoring, clustering almost always has to be accounted for, and as a consequence, randomized experiments typically must involve 40-50 schools (20-25 per treatment) to detect an effect size as small as 0.20. Such experiments are very expensive, and they are difficult to do if you are not an expert already. The clustering requirement, therefore, makes it difficult for researchers early in their careers to get funding and to show success if they do, because managing implementation and collecting data in 50 schools is really, really hard.
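For readers who want to see where numbers like these come from, here is a hedged sketch of a common minimum detectable effect size (MDES) approximation for school-randomized designs. Everything specific in it is an assumption of mine rather than a figure from a particular study: an ICC of 0.20, 100 tested students per school, a pretest that explains 75% of the variance at both the school and student levels, 80% power, and a normal approximation to the critical values (which is slightly optimistic when there are few schools).

```python
from statistics import NormalDist


def mdes(n_schools, students_per_school, icc,
         r2_school=0.0, r2_student=0.0,
         alpha=0.05, power=0.80, p_treated=0.5):
    """Approximate minimum detectable effect size for a school-randomized trial.

    Uses a normal approximation for the multiplier (about 2.8 for a
    two-tailed 5% test with 80% power), so it understates the MDES
    somewhat when the number of schools is small.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Variance left unexplained at each level, after any pretest adjustment.
    school_part = icc * (1 - r2_school)
    student_part = (1 - icc) * (1 - r2_student) / students_per_school
    variance = (school_part + student_part) / (p_treated * (1 - p_treated) * n_schools)
    return multiplier * variance ** 0.5


ICC = 0.20          # ASSUMED intraclass correlation
PER_SCHOOL = 100    # ASSUMED tested students per school
R2 = 0.75           # ASSUMED variance explained by the pretest at both levels

for schools in (20, 40, 50):
    with_pretest = mdes(schools, PER_SCHOOL, ICC, R2, R2)
    no_pretest = mdes(schools, PER_SCHOOL, ICC)
    print(f"{schools} schools: MDES {with_pretest:.2f} with pretest, "
          f"{no_pretest:.2f} without")
```

Under these assumptions, about 40 schools gives an MDES close to 0.20, while 20 schools gives something nearer 0.29. Without a pretest covariate, even 50 schools falls well short of detecting 0.20.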

I do not have a good solution for this problem, and I upset my colleagues when I bring it up. But we have to face it. Making accounting for clustering an absolute requirement makes educational research too expensive. Put another way, it means that we can do too few studies for the dollars we do invest. This requirement also bars entry to the field for those unable to get multi-million dollar grants or to manage large field experiments.

One solution to the cluster problem might be to have research funders fund step-by-step studies. For example, imagine that funding were offered for studies of 10 schools to be analyzed at the cluster level (correct but hopelessly underpowered) and at the student level (Bad! But affordable.). If the outcomes are promising, funders could fund another 10-school study, and researchers could combine the samples, repeating this process until there are enough schools to collectively justify a proper clustered analysis. This would enable neophyte researchers to learn from experience, and it would allow everyone to learn over time what the potential impacts are. It could also save billions of dollars now being spent on monster randomized studies of programs that have never before shown promising effects (and that then turn out to be ineffective).

A gradual approach to clustering might enable the field of education to focus on the real enemy, which is bias. If we systematically stamp out design elements that add bias, then over time the field will converge upon truth, and will cost-effectively move forward knowledge of what works, in time to benefit today’s children. The curse of the cluster is holding back the whole field. With all due respect to the real problems clustered designs present, let’s find ways to compromise so we can learn from unbiased but modest-sized studies and go step-by-step toward better information for practice.

Beware of Do-It-Yourself Assessments

Faithful readers of this blog, and followers of the Best Evidence Encyclopedia (BEE), will know that I am always cautioning readers of program evaluations to pay no attention to findings from measures overly aligned with the experimental but not the control treatment. For example, when researchers teach a set of vocabulary words to the experimental students (but not the controls), it is not surprising to find strong impacts. Unfortunately this happens all too often, but we carefully winnow such measures out of our BEE reviews.

In a recent paper, my colleague Alan Cheung and I looked at 645 studies accepted across all BEE reviews done so far to find out which methodological factors are associated with excessive, improbable effect sizes. In an earlier blog I wrote about the profound impact of sample size: small studies get (improbably) big effect sizes.

Another important factor, however, was the use of experimenter-made measures. Even after our careful, conservative weeding out of studies with over-aligned measures, we were surprised to find out that effect sizes on measures made by experimenters were twice as high as effect sizes on measures made by someone else (usually standardized tests).

It may be going too far to suggest that no one should ever use or accept experimenter-made measures, no matter how fair they appear to be to the experimental and control groups. However, what this finding does say is that we need to be very cautious in accepting experimenter-made measures. Standardized tests are far from perfect, but they are almost always fair to experimental and control groups, as control teachers can be assumed to be trying as hard as experimental teachers to improve outcomes on these measures. This may not be so on experimenter-made tests.

I’m all for do-it-yourself cooking, home repairs, and other projects. But when it comes to do-it-yourself educational measurement, let the reader beware!