Small Studies, Big Problems

Everyone knows that “good things come in small packages.” But in research evaluating practical educational programs, this saying does not apply. Small studies are very susceptible to bias. In fact, of all the factors that can inflate effect sizes in educational experiments, small sample size is among the most powerful. This problem is widely known, and in reviewing large and small studies, most meta-analysts solve the problem by requiring minimum sample sizes and/or weighting effect sizes by their sample sizes. Problem solved.


For some reason, the What Works Clearinghouse (WWC) has so far paid little attention to sample size. It has not weighted by sample size in computing mean effect sizes, although the WWC is talking about doing this in the future. It has not even set minimums for sample size for its reviews. I know of one accepted study with a total sample size of 12 (6 experimental, 6 control). These procedures greatly inflate WWC effect sizes.

As one indication of the problem, our review of 645 studies of reading, math, and science programs accepted by the Best Evidence Encyclopedia (www.bestevidence.org) found that studies with fewer than 250 subjects had twice the effect sizes of those with more than 250 (effect sizes=+0.30 vs. +0.16). Comparing studies with fewer than 100 students to those with more than 3000, the ratio was 3.5 to 1 (see Cheung & Slavin [2016] at http://www.bestevidence.org/word/methodological_Sept_21_2015.pdf). Several other studies have found the same effect.

In data from What Works Clearinghouse reading and math studies compiled by graduate student Marta Pellegrini (2017), sample size effects were also extraordinary. The mean effect size for sample sizes of 60 or less was +0.37; for samples of 60-250, +0.29; and for samples of more than 250, +0.13. Among all design factors she studied, small sample size made the most difference in outcomes, rivaled only by researcher/developer-made measures. In fact, sample size is more pernicious, because while reviewers can exclude researcher/developer-made measures within a study and focus on independent measures, a study with a small sample has the same problem for all of its measures. Also, because small-sample studies are relatively inexpensive, there are quite a lot of them, so reviews that fail to attend to sample size can greatly overestimate overall mean effect sizes.

My colleague Amanda Inns (2018) recently analyzed WWC reading and math studies to find out why small studies produce such inflated outcomes. There are many reasons small-sample studies may produce such large effect sizes. One is that in small studies, researchers can provide extraordinary amounts of assistance or support to the experimental group. This is called “superrealization.” Another is that when studies with small sample sizes find null effects, the studies tend not to be published or made available at all, each deemed a “pilot” and forgotten. In contrast, a large study is likely to have been paid for by a grant, which will produce a report no matter what the outcome. There has long been an understanding that published studies produce much higher effect sizes than unpublished studies, and one reason is that small studies are rarely published if their outcomes are not significant.

Whatever the reasons, there is no doubt that small studies greatly overstate effect sizes. In reviewing research, this well-known fact has long led meta-analysts to weight effect sizes by their sample sizes (usually using an inverse variance procedure). Yet as noted earlier, the WWC does not do this, but just averages effect sizes across studies without taking sample size into account.

One example of the problem of ignoring sample size in averaging is provided by Project CRISS. CRISS was evaluated in two studies. One had 231 students. On a staff-developed “free recall” measure, the effect size was +1.07. The other study had 2338 students, and an average effect size on standardized measures of -0.02. Clearly, the much larger study with an independent outcome measure should have swamped the effects of the small study with a researcher-made measure, but this is not what happened. The WWC just averaged the two effect sizes, obtaining a mean of +0.53.
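To make the weighting point concrete, here is a minimal sketch in Python of inverse-variance weighting applied to these two CRISS effect sizes. The 50/50 split of each study’s sample between conditions and the large-sample variance formula for a standardized mean difference are assumptions of the sketch, not details reported by the WWC; under those assumptions, the weighted mean comes out near +0.07 rather than +0.53.

```python
# Minimal sketch: inverse-variance weighting of two effect sizes.
# Assumes each study split its total sample evenly between conditions.

def d_variance(d, n_total):
    """Approximate sampling variance of a standardized mean difference."""
    n1 = n2 = n_total / 2
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

studies = [
    {"d": 1.07, "n": 231},    # small study, researcher-made "free recall" measure
    {"d": -0.02, "n": 2338},  # large study, standardized measures
]

weights = [1 / d_variance(s["d"], s["n"]) for s in studies]
weighted_mean = sum(w * s["d"] for w, s in zip(weights, studies)) / sum(weights)
simple_mean = sum(s["d"] for s in studies) / len(studies)

print(f"Unweighted mean effect size: {simple_mean:+.2f}")   # about +0.53
print(f"Weighted mean effect size:   {weighted_mean:+.2f}") # about +0.07
```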

How might the WWC set minimum sample sizes for studies to be included for review? Amanda Inns proposed a minimum of 60 students (at least 30 experimental and 30 control) for studies that analyze at the student level. She suggests a minimum of 12 clusters (6 and 6), such as classes or schools, for studies that analyze at the cluster level.

In educational research evaluating school programs, good things come in large packages. Small studies are fine as pilots, or for descriptive purposes. But when you want to know whether a program works in realistic circumstances, go big or go home, as they say.

The What Works Clearinghouse should exclude very small studies and should use weighting based on sample sizes in computing means. And there is no reason it should not start doing these things now.

References

Inns, A., & Slavin, R. (2018, August). Do small studies add up in the What Works Clearinghouse? Paper presented at the meeting of the American Psychological Association, San Francisco, CA.

Pellegrini, M. (2017, August). How do different standards lead to different conclusions? A comparison between meta-analyses of two research centers. Paper presented at the European Conference on Educational Research (ECER), Copenhagen, Denmark.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Higher Ponytails (And Researcher-Made Measures)


Some time ago, I coached my daughter’s fifth grade basketball team. I knew next to nothing about basketball (my sport was…well, chess), but fortunately my research assistant, Holly Roback, eagerly volunteered. She’d played basketball in college, so our girls got outstanding coaching. However, they got whammed. My assistant coach explained it after another disastrous game, “The other team’s ponytails were just higher than ours.” Basically, our girls were terrific at ball handling and free shots, but they came up short in the height department.

Now imagine that in addition to being our team’s coach I was also the league’s commissioner. Imagine that I changed the rules. From now on, lay-ups and jump shots were abolished, and the ball had to be passed three times from player to player before a team could score.

My new rules could be fairly and consistently enforced, but their entire effect would be to diminish the importance of height and enhance the importance of ball handling and set shots.

Of course, I could never get away with this. Every fifth grader, not to mention their parents and coaches, would immediately understand that my rule changes unfairly favored my own team, and disadvantaged theirs (at least the ones with the higher ponytails).

This blog is not about basketball, of course. It is about researcher-made measures or developer-made measures. (I’m using “researcher-made” to refer to both). I’ve been writing a lot about such measures in various blogs on the What Works Clearinghouse (https://wordpress.com/post/robertslavinsblog.wordpress.com/795 and https://wordpress.com/post/robertslavinsblog.wordpress.com/792).

The reason I’m writing again about this topic is that I’ve gotten some criticism for my criticism of researcher-made measures, and I wanted to respond to these concerns.

First, here is my case, simply put. Measures made by researchers or developers are likely to favor whatever content was taught in the experimental group. I’m not in any way suggesting that researchers or developers are deliberately making measures to favor the experimental group. However, it usually works out that way. If the program teaches unusual content, no matter how laudable that content may be, and the control group never saw that content, then the potential for bias is obvious. If the experimental group was taught on computers and the control group was not, and the test was given on a computer, the bias is obvious. If the experimental treatment emphasized certain vocabulary, and the control group did not, then a test of those particular words has obvious bias. If a math program spends a lot of time teaching students to do mental rotations of shapes, and the control treatment never did such exercises, a test that includes mental rotations is obviously biased. In our BEE full-scale reviews of pre-K to 12 reading, math, and science programs, available at www.bestevidence.org, we have long excluded such measures, calling them “treatment-inherent.” The WWC calls such measures “over-aligned,” and says it excludes them.

However, the problem turns out to be much deeper. In a 2016 article in Educational Researcher, Alan Cheung and I tested outcomes from all 645 studies in the BEE achievement reviews, and found that even after excluding treatment-inherent measures, measures made by researchers or developers had effect sizes far higher than those for independent measures, by a ratio of two to one (effect sizes = +0.40 for researcher-made measures, +0.20 for independent measures). Graduate student Marta Pellegrini more recently analyzed data from all WWC reading and math studies. The ratio among WWC studies was 2.7 to 1 (effect sizes = +0.52 for researcher-made measures, +0.19 for independent ones). Again, the WWC was supposed to have already removed over-aligned measures, all of which (I’d assume) were also researcher-made.

Some of my critics argue that because the WWC already excludes overaligned measures, they have already taken care of the problem. But if that were true, there would not be a ratio of 2.7 to 1 in effect sizes between researcher-made and independent measures, after removing measures considered by the WWC to be overaligned.

Other critics express concern that my analyses (of bias due to researcher-made measures) have only involved reading, math, and science measures, and the situation might be different for measures of social-emotional outcomes, for example, where appropriate measures may not exist.

I will admit that in areas other than achievement the issues are different, and I’ve written about them. So I’ll be happy to limit the simple version of “no researcher-made measures” to achievement measures. The problems of measuring social-emotional outcomes fairly are far more complex, and for another day.

Other critics express concern that even on achievement measures, there are situations in which appropriate measures don’t exist. That may be so, but in policy-oriented reviews such as the WWC or Evidence for ESSA, it’s hard to imagine that there would be no existing measures of reading, writing, math, science, or other achievement outcomes. An achievement objective so rarified that it has never been measured is probably not particularly relevant for policy or practice.

The WWC is not an academic journal, and it is not primarily intended for academics. If a researcher needs to develop a new measure to test a question of theoretical interest, they should do so by all means. But the findings from that measure should not be accepted or reported by the WWC, even if a journal might accept it.

Another version of this criticism is that researchers often have a strong argument that the program they are evaluating emphasizes standards that should be taught to all students, but are not. Therefore, enhanced performance on a (researcher-made) measure of the better standard is prima facie evidence of a positive program impact. This argument confuses the purpose of experimental evaluations with the purpose of standards. Standards exist to express what we want students to know and be able to do. Arguing for a given standard involves considerations of the needs of the economy, standards of other states or countries, norms of the profession, technological or social developments, and so on—but not comparisons of experimental groups scoring well on tests of a new proposed standard to control groups never exposed to content relating to that standard. It’s just not fair.

To get back to basketball, I could have argued that the rules should be changed to emphasize ball handling and reduce the importance of height. Perhaps this would be a good idea, for all I know. But what I could not do was change the rules to benefit my team. In the same way, researchers cannot make their own measures and then celebrate higher scores on them as indicating higher or better standards. As any fifth grader could tell you, advocating for better rules is fine, but changing the rules in the middle of the season is wrong.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Swallowing Camels


The New Testament contains a wonderful quotation that I use often, because it unfortunately applies to so much of educational research:

Ye blind guides, which strain at a gnat, and swallow a camel (Matthew 23:24).

The point of the quotation, of course, is that we are often fastidious about minor (research) sins while readily accepting major ones.

In educational research, “swallowing camels” applies to studies accepted in top journals or by the What Works Clearinghouse (WWC) despite substantial flaws that lead to major bias, such as use of measures slanted toward the experimental group, or measures administered and scored by the teachers who implemented the treatment. “Straining at gnats” applies to concerns that, while worth attending to, have little potential for bias, yet are often reasons for rejection by journals or downgrading by the WWC. For example, our profession considers p<.05 to indicate statistical significance, while p=.051 should never be mentioned in polite company.

As my faithful readers know, I have written a series of blogs on problems with policies of the What Works Clearinghouse, such as acceptance of researcher/developer-made measures, failure to weight by sample size, use of “substantively important but not statistically significant” as a qualifying criterion, and several others. However, in this blog, I wanted to share with you some of the very worst, most egregious examples of studies that should never have seen the light of day, yet were accepted by the WWC and remain in it to this day. Accepting the WWC as gospel means swallowing these enormous and ugly camels, and I wanted to make sure that those who use the WWC at least think before they gulp.

Camel #1: DaisyQuest

DaisyQuest is a computer-based program for teaching phonological awareness to children in pre-K to Grade 1. The WWC gives DaisyQuest its highest rating, “positive,” for alphabetics, and ranks it eighth among literacy programs for grades pre-K to 1.

There were four studies of DaisyQuest accepted by the WWC. In each, half of the students received DaisyQuest in groups of 3-4, working with an experimenter. In two of the studies, control students never had their hands on a computer before they took the final tests on a computer. In the other two, control students used math software, so they at least got some experience with computers. The outcome tests were all made by the experimenters and all were closely aligned with the content of the software, with the exception of two Woodcock scales used in one of the studies. All studies used a measure called “Undersea Challenge” that closely resembled the DaisyQuest game format and was taken on the computer. All four studies also used the other researcher-made measures. None of the Woodcock measures showed statistically significant differences, but the researcher-made measures, especially Undersea Challenge and other specific tests of phonemic awareness, segmenting, and blending, did show substantial significant differences.

Recall that in the mid- to late 1990s, when the studies were done, students in preschool and kindergarten were unlikely to be getting any systematic teaching of phonemic awareness. So there is no reason to expect the control students to have learned anything that was tested on the posttests, and it is not surprising that effect sizes averaged +0.62. In the two studies in which control students had never touched a computer, effect sizes were +0.90 and +0.89, respectively.

Camel #2: Brady (1990) study of Reciprocal Teaching

Reciprocal Teaching is a program that teaches students comprehension skills, mostly using science and social studies texts. A 1990 dissertation by P.L. Brady evaluated Reciprocal Teaching in one school, in grades 5-8. The study involved only 12 students, randomly assigned to Reciprocal Teaching or control conditions. The one experimental class was taught by…wait for it…P.L. Brady. The measures included science, social studies, and daily comprehension tests related to the content taught in Reciprocal Teaching but not in the control group. They were created and scored by…(you guessed it) P.L. Brady. There were also two Gates-MacGinitie scales, but they had effect sizes much smaller than the researcher-made (and -scored) tests. The Brady study met WWC standards for “potentially positive” because it had a mean effect size of more than +0.25, even though the difference was not statistically significant.

Camel #3: Schwartz (2005) study of Reading Recovery

Reading Recovery is a one-to-one tutoring program for first graders that has a strong tradition of rigorous research, including a recent large-scale study by May et al. (2016). However, one of the earlier studies of Reading Recovery, by Schwartz (2005), is hard to swallow, so to speak.

In this study, 47 Reading Recovery (RR) teachers across 14 states were asked by e-mail to choose two very similar, low-achieving first graders at the beginning of the year. One student was randomly assigned to receive RR, and one was assigned to the control group, to receive RR in the second semester.

Both students were pre- and posttested on the Observation Survey, a set of measures made by Marie Clay, the developer of RR. In addition, students were tested on Degrees of Reading Power, a standardized test.

The problems with this study mostly have to do with the fact that the teachers who administered pre- and posttests were the very same teachers who provided the tutoring. No researcher or independent tester ever visited the schools. Teachers obviously knew the child they personally tutored. I’m sure the teachers were honest and wanted to be accurate. However, they would have had a strong motivation to see that the outcomes looked good, because they could be seen as evaluations of their own tutoring, and could have had consequences for continuation of the program in their schools.

Most Observation Survey scales involve difficult judgments, so it’s easy to see how teachers’ ratings could be affected by their belief in Reading Recovery.

Further, ten of the 47 teachers never submitted any data. This is a very high rate of attrition within a single school year (21%). Could some teachers, fully aware of their students’ less-than-expected scores, have made some excuse and withheld their data? We’ll never know.

Also recall that most of the tests used in this study were from the Observation Survey made by Clay, which had effect sizes ranging up to +2.49 (!!!). However, on the independent Degrees of Reading Power, the non-significant effect size was only +0.14.

It is important to note that across these “camel” studies, all except Brady (1990) were published. So it was not only the WWC that was taken in.

These “camel” studies are far from unique, and they may not even be the very worst to be swallowed whole by the WWC. But they do give readers an idea of the depth of the problem. No researcher I know of would knowingly accept an experiment in which the control group had never used the equipment on which they were to be posttested, or one with 12 students in which the 6 experimentals were taught by the experimenter, or in which the teachers who tutored students also individually administered the posttests to experimentals and controls. Yet somehow, WWC standards and procedures led the WWC to accept these studies. Swallowing these camels should have caused the WWC a stomach ache of biblical proportions.


References

Brady, P. L. (1990). Improving the reading comprehension of middle school students through reciprocal teaching and semantic mapping and strategies. Dissertation Abstracts International, 52 (03A), 230-860.

May, H., Sirinides, P., Gray, A., & Goldsworthy, H. (2016). Reading Recovery: An evaluation of the four-year i3 scale-up. Newark, DE: University of Delaware, Center for Research in Education and Social Policy.

Schwartz, R. M. (2005). Literacy learning of at-risk first grade students in the Reading Recovery early intervention. Journal of Educational Psychology, 97 (2), 257-267.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

…But It Was The Very Best Butter! How Tests Can Be Reliable, Valid, and Worthless

I was recently having a conversation with a very well-informed, statistically savvy, and experienced researcher, who was upset that we do not accept researcher- or developer-made measures for our Evidence for ESSA website (www.evidenceforessa.org). “But what if a test is reliable and valid,” she said, “Why shouldn’t it qualify?”

I inwardly sighed. I get this question a lot. So I thought I’d write a blog on the topic, so at least the people who read it, and perhaps their friends and colleagues, will know the answer.

Before I get into the psychometric stuff, I should say in plain English what is going on here, and why it matters. Evidence for ESSA excludes researcher- and developer-made measures because they enormously overstate effect sizes. Marta Pellegrini, at the University of Florence in Italy, recently analyzed data from every reading and math study accepted for review by the What Works Clearinghouse (WWC). She compared outcomes on tests made up by researchers or developers to those that were independent. The average effect sizes across hundreds of studies were +0.52 for researcher/developer-made measures, and +0.19 for independent measures. Almost three to one. We have also made similar comparisons within the very same studies, and the differences in effect sizes averaged 0.48 in reading and 0.45 in math.

Wow.

How could there be such a huge difference? The answer is that researchers’ and developers’ tests often focus on what they knew would be taught in the experimental group but not the control group. A vocabulary experiment might use a test that contains the specific words emphasized in the program. A science experiment might use a test that emphasizes the specific concepts taught in the experimental units but not in the control group. A program using technology might test students on a computer, which the control group did not experience. Researchers and developers may give tests that use response formats like those used in the experimental materials, but not those used in control classes.

Very often, researchers or developers have a strong opinion about what students should be learning in their subject, and they make a test that represents to them what all students should know, in an ideal world. However, if only the experimental group experienced content aligned with that curricular philosophy, then they have a huge unfair advantage over the control group.

So how can it be that using even the most reliable and valid tests doesn’t solve this problem?

In Alice in Wonderland, the March Hare tries to fix the Mad Hatter’s watch by opening it and putting butter in the works. This does not help at all, and the March Hare protests, “But it was the very best butter!”

The point of the “very best butter” conversation in Alice in Wonderland is that something can be excellent for one purpose (e.g., spreading on bread), but worse than useless for another (e.g., fixing watches).

Returning to assessment, a test made by a researcher or developer might be ideal for determining whether students are making progress in the intended curriculum, but worthless for comparing experimental to control students.

Reliability (the ability of a test to give the same answer each time it is given) has nothing at all to do with the situation. Validity comes into play where the rubber hits the road (or the butter hits the watch).

Validity can mean many things. As reported in test manuals, it usually just means that a test’s scores correlate with other scores on tests intended to measure the same thing (convergent validity), or possibly that it correlates better with things it should correlate with than with things it shouldn’t, as when a reading test correlates better with other reading tests than with math tests (discriminant validity). However, no test manual ever addresses validity for use as an outcome measure in an experiment. For a test to be valid for that use, it must measure content being pursued equally in experimental and control classes, not content biased toward the experimental curriculum.

Any test that reports very high reliability and validity in its test manual or research report may be admirable for many purposes, but like “the very best butter” for fixing watches, a researcher- or developer-made measure is worse than worthless for evaluating experimental programs, no matter how high it is in reliability and validity.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

The Curse of the Cluster

If you follow my blogs, you’ve probably noticed that I stay away from three topics on which reasoned discourse is impossible: religion, politics, and statistics. However, just this once I’d like to break my own rule and talk about statistics, or rather research design. And I promise not to be too nerdy.

While there is little argument about basic principles of statistics and research design, things do get a bit dicey in the real world. Some of my colleagues resolve any situation that is less than ideal by ignoring studies with the slightest flaw. I think that can be a huge waste of (usually) government money, and can deprive researchers and educators alike of valuable information.

My personal position is that all flaws are not created equal. In particular, some flaws introduce bias and some do not. For example, use of researcher-made measures, small sample sizes, and matched rather than randomized designs introduce bias, so they should be avoided or minimized in importance.

On the other hand, accounting for clustering in designs in which students are grouped in classes or schools is now considered essential. Suppose you randomly assign 20 schools to experimental (n=10) or control (n=10) conditions; you might have 5000 students per treatment. If you could randomly assign students one at a time, 5000 per treatment would be a huge study; in fact, 300 students might be enough. However, in a clustered study, 5000 per treatment may be too small. Current statistical principles demand that you use a method called Hierarchical Linear Modeling (HLM) to analyze the data, and unless the effect size is very large, 20 schools will not be sufficient for statistical significance.
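To see why, here is a rough sketch of the standard design-effect calculation, 1 + (m - 1) × ICC, where m is the number of students per school. The intraclass correlation of 0.20 is my assumption (a fairly typical school-level value for achievement), not a figure from this post; under it, 5000 students in 10 schools per condition carry about as much information as 50 independently sampled students per condition.

```python
# Rough sketch: how clustering shrinks the effective sample size.
# The ICC of 0.20 is an assumed, fairly typical school-level value.

students_per_arm = 5000
schools_per_arm = 10
icc = 0.20

m = students_per_arm / schools_per_arm          # students per school
design_effect = 1 + (m - 1) * icc               # about 100.8
effective_n = students_per_arm / design_effect  # about 50 per condition

print(f"Design effect: {design_effect:.1f}")
print(f"Effective sample size per condition: {effective_n:.0f}")
```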

Yet here’s the rub: failing to account for clustering does not introduce bias. That is, if you (mistakenly) analyzed at the student level in a study in which treatments were implemented at the class or school level, the effect size would be about the same. All that would change would be statistical significance. That is, you would overstate the number of experimental-control differences claimed to be significant (i.e., beyond what you’d expect by chance).
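To illustrate that claim under stated assumptions (no true treatment effect, an intraclass correlation of 0.20, and 10 schools of 250 students per condition, all numbers of my choosing), here is a small simulation sketch; it assumes NumPy and SciPy are available.

```python
# Simulation sketch: ignoring clustering leaves the effect size unbiased
# but produces far too many "significant" findings when the true effect is zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
schools_per_arm, students_per_school, icc, reps = 10, 250, 0.20, 2000

def simulate_arm():
    """One condition: school random effects plus student noise, total variance 1."""
    school_effects = rng.normal(0, np.sqrt(icc), schools_per_arm)
    students = rng.normal(0, np.sqrt(1 - icc),
                          (schools_per_arm, students_per_school))
    return (school_effects[:, None] + students).ravel()

false_positives, effect_sizes = 0, []
for _ in range(reps):
    treat, control = simulate_arm(), simulate_arm()
    _, p = stats.ttest_ind(treat, control)   # student-level test, ignores clustering
    false_positives += p < 0.05
    pooled_sd = np.sqrt((treat.var(ddof=1) + control.var(ddof=1)) / 2)
    effect_sizes.append((treat.mean() - control.mean()) / pooled_sd)

print(f"Mean estimated effect size: {np.mean(effect_sizes):+.3f}")      # close to zero: no bias
print(f"Share of 'significant' results: {false_positives / reps:.0%}")  # far above the nominal 5%
```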

All right, let’s accept that clustered data should be analyzed using HLM, which accounts for clustering. But while we are straining at the clustering gnat, what camels are we swallowing?

My personal bugbear is researcher-made measures. Often, the very same researchers who take an unyielding position on clustering happily accept research designs in which the researcher made the test, even if the test is clearly aligned with the content the experimental group (but not the control group) was taught. In some studies, the teachers who provided tutoring, for example, also gave the tests. Strict-on-clustering researchers also often accept studies that were very brief, sometimes a week or less, or often just an hour. They may accept studies in which conditions in the experimental groups were substantially enhanced beyond what could ever be done in real life, as in technology studies in which a graduate student is placed in every class or even every small group every day to “help with the technology.”

All of these research designs are far more likely to produce misleading findings than are studies that only suffer from clustering problems, and worse, these effects introduce bias, while failing to attend to clustering does not.

Why is this of importance to non-statisticians? It matters because in education, students are usually taught in large groups, so except for studies of one-to-one or small-group tutoring, clustering almost always has to be accounted for, and as a consequence, randomized experiments typically must involve 40-50 schools (20-25 per treatment) to detect an effect size as small as 0.20. Such experiments are very expensive, and they are difficult to do if you are not an expert already. The clustering requirement, therefore, makes it difficult for researchers early in their careers to get funding and to show success if they do, because managing implementation and collecting data in 50 schools is really, really hard.
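For readers curious where figures like 40-50 schools come from, here is a back-of-the-envelope sketch using a standard minimum detectable effect size (MDES) approximation for cluster-randomized trials. The intraclass correlation of 0.05 (roughly what might remain after adjusting for pretests) and 100 tested students per school are illustrative assumptions, not numbers from this post.

```python
# Back-of-the-envelope MDES for a two-level cluster-randomized trial:
#   MDES ~ M * sqrt( icc/(P(1-P)J) + (1-icc)/(P(1-P)Jn) )
# with M ~ 2.8 for 80% power, two-tailed alpha = .05, and ample df.
from math import sqrt

def mdes(total_schools, students_per_school, icc, p_treated=0.5, multiplier=2.8):
    j, n, p = total_schools, students_per_school, p_treated
    return multiplier * sqrt(icc / (p * (1 - p) * j)
                             + (1 - icc) / (p * (1 - p) * j * n))

for j in (20, 40, 50):
    print(f"{j} schools: MDES = {mdes(j, 100, icc=0.05):.2f}")
# Roughly 0.31, 0.22, and 0.19 -- in line with needing 40-50 schools
# to detect an effect size near 0.20.
```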

I do not have a good solution for this problem, and I upset my colleagues when I bring it up. But we have to face it. Making accounting for clustering an absolute makes educational research too expensive; put another way, it means that we can do too few studies for the dollars we do invest. And this requirement bars entry to the field to those unable to get multi-million dollar grants or to manage large field experiments.

One solution to the cluster problem might be to have research funders fund step-by-step studies. For example, imagine that funding were offered for studies of 10 schools to be analyzed at the cluster level (correct but hopelessly underpowered) and at the student level (Bad! But affordable.). If the outcomes are promising, funders could fund another 10-school study, and researchers could combine the samples, repeating this process until there are enough schools to collectively justify a proper clustered analysis. This would also enable neophyte researchers to learn from experience, it would allow everyone to learn over time what the potential impacts are, and it could save billions of dollars now being spent on monster randomized studies of programs never before having shown promising effects (which then turn out to be ineffective).

A gradual approach to clustering might enable the field of education to focus on the real enemy, which is bias. If we systematically stamp out design elements that add bias, then over time the field will converge upon truth, and will cost-effectively move forward knowledge of what works, in time to benefit today’s children. The curse of the cluster is holding back the whole field. With all due respect to the real problems clustered designs present, let’s find ways to compromise so we can learn from unbiased but modest-sized studies and go step-by-step toward better information for practice.

Beware of Do-It-Yourself Assessments

Faithful readers of this blog, and followers of the Best Evidence Encyclopedia (BEE), will know that I am always cautioning readers of program evaluations to pay no attention to findings from measures overly aligned with the experimental but not the control treatment. For example, when researchers teach a set of vocabulary words to the experimental students (but not the controls), it is not surprising to find strong impacts. Unfortunately this happens all too often, but we carefully winnow such measures out of our BEE reviews.

In a recent paper written with my colleague Alan Cheung, we looked at 645 studies accepted across all BEE reviews done so far to find out which methodological factors are associated with excessive, improbable effect sizes. In an earlier blog I wrote about the profound impact of sample size: small studies get (improbably) big effect sizes.

Another important factor, however, was the use of experimenter-made measures. Even after our careful, conservative weeding out of studies with over-aligned measures, we were surprised to find out that effect sizes on measures made by experimenters were twice as high as effect sizes on measures made by someone else (usually standardized tests).

It may be going too far to suggest that no one should ever use or accept experimenter-made measures, no matter how fair they appear to be to the experimental and control groups. However, the evidence does say that we need to be very cautious in accepting experimenter-made measures. Standardized tests are far from perfect, but they are almost always fair to experimental and control groups, as control teachers can be assumed to be trying as hard as experimental teachers to improve outcomes on these measures. This may not be so on experimenter-made tests.

I’m all for do-it-yourself cooking, home repairs, and other projects. But when it comes to do-it-yourself educational measurement, let the reader beware!

Who Opposes Evidence-Based Reform?

The slow and uncertain pace of progress in evidence-based reform in education seems surprising at one level. How could anyone be against anything so obviously beneficial to children? It must indeed be embarrassing to come out openly against evidence. Who argues for ignorance? Yet while few would stand up and condemn it, I would guess that many educators and researchers would be (secretly) happy if the movement just shriveled up and died.

To illustrate part of the problem, let me tell you about a couple of conversations I had at a dinner for new department heads at the University of York, in England. At the dinner, I chatted with the person on my right, who was the chair in biology, as I recall. I told him I was in York to promote evidence-based reform in education. “I’m against that,” he said. “My daughter is a very gifted high school student. If someone found programs that worked on average, her school might use them. Yet the system is serving my daughter very well.”

I turned to my left and chatted with the chair of the physics department. His response was almost identical to that of the biology chair. He also had a brilliant daughter, and the system was working very well for her, thank you very much.

So from this and many other experiences, I have learned that one reason for lack of enthusiasm for evidence is that the system we have is built by and for the people who benefit from it. (A privileged glimpse into the perfectly obvious.) High quality, widely disseminated evidence cannot be controlled, so it might actually cause change, thereby disrupting the system for those for whom it fortunately works. I once heard a respected state superintendent, speaking entirely without irony to an audience of researchers, say the following:

“If research confirms what I believe, it is good research. If it does not, it is bad research.”

The problem with rigorous research is that it can and often does contradict what its funders and advocates originally hope for, and this makes it dangerous. Ignoring or twisting research makes life so much easier for stakeholders of all kinds.

Another group of evidence skeptics are fellow researchers concerned that the kind of quantitative, experimental research emphasized in evidence-based reform is not what they do. So if it prevails, their funding or esteem might be diminished.

Many teachers are uneasy about evidence, because they see it as one more way they may be oppressed by standardized tests, or fear that they may be forced to implement proven programs. I’m sympathetic to teachers’ concerns in these arenas, but policies to allay these concerns are possible, for example, by allowing teachers to vote on adopting proven programs (as we do in our Success for All whole-school approach). Also, most teachers, in my experience, are delighted to have tools that make them more effective in their jobs.

If support for evidence-based reform comes only from those who benefit from it personally or institutionally, we are doomed. The movement will only prevail if the issue is posed this way:

“How can we use evidence to make sure that students get the best possible outcomes from their education?”

As long as we think only about what is best for kids, evidence-based reform will succeed. There are many legitimate debates to be had about methods and mechanisms, but if we could all agree that students would be better off receiving programs that have been rigorously tested and found to be effective, we’d be 90% of the way to our goal. Anyone who is in education because they want to see kids succeed, which is nearly everyone, should be able to agree. Start with the kids and everything else falls into place. Isn’t that always the case?