Getting the Best Mileage from Proven Programs

Wouldn’t you love to have a car that gets 200 miles to the gallon? Or one that can go hundreds of miles on a battery charge? Or one that can accelerate from zero to sixty twice as fast as any on the road?

Such cars exist, but you can’t have them. They are experimental vehicles or race cars that can only be used on a track or in a lab. They may be made of exotic materials, or may not carry passengers or groceries, or may be dangerous on real roads.

In working on our Evidence for ESSA website (www.evidenceforessa.org), we see a lot of studies that are like these experimental cars. For example, there are studies of programs in which the researcher or her graduate students actually did the teaching, or in which students used innovative technology with one adult helper for every machine or every few machines. Such studies are fine for theory building or as pilots, but we do not accept them for Evidence for ESSA, because they could never be replicated in real schools.

However, there is a much more common situation to which we pay very close attention. These are studies in which, for example, teachers receive a great deal of training and coaching, but an amount that seems replicable, in principle. For example, we would reject a study in which the experimenters taught the program, but not one in which they taught ordinary teachers how to use the program.

In such studies, the problem comes in dissemination. If the studies validating a program provided a lot of professional development, we accept the program only if the disseminator provides a similar level of professional development and its estimates of cost and personnel take that training into account. Our website states clearly that these services must be provided at a level similar to what was provided in the research if the positive outcomes seen in the research are to be obtained.

The problem is that disseminators often offer schools a form of the program that was never evaluated, to keep costs low. They know that schools don’t like to spend a lot on professional development, and they are concerned that if they require the needed levels of PD or other services or materials, schools won’t buy their program. At the extreme, some programs that were successfully evaluated with extensive professional development then simply put their teacher’s manual on the web for schools to use for free.

A recent study of a program called Mathalicious illustrated the situation. Mathalicious is an on-line math course for middle school. An evaluation found that teachers randomly assigned to just get a license, with minimal training, did not obtain significant positive impacts, compared to a control group. Those who received extensive on-line training, however, did see a significant improvement in math scores, compared to controls.

When we write our program descriptions, we compare the implementation details reported in the research to what is said or required on the program’s website. If these do not match, within reason, we try to make clear which elements were key to the program’s success.

Going back to the car analogy, our procedures eliminate those amazing cars that can only operate on special tracks, but we accept cars that can run on streets, carry children and groceries, and generally do what cars are expected to do. But if outstanding cars require frequent recharging, or premium gasoline, or have other important requirements, we’ll say so, in consultation with the disseminator.

In our view, evidence in education is not for academics, it’s for kids. If there is no evidence that a program as disseminated benefits kids, we don’t want to mislead educators who are trying to use evidence to benefit their children.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Higher Ponytails (And Researcher-Made Measures)


Some time ago, I coached my daughter’s fifth grade basketball team. I knew next to nothing about basketball (my sport was…well, chess), but fortunately my research assistant, Holly Roback, eagerly volunteered. She’d played basketball in college, so our girls got outstanding coaching. However, they got whammed. My assistant coach explained it after another disastrous game, “The other team’s ponytails were just higher than ours.” Basically, our girls were terrific at ball handling and free throws, but they came up short in the height department.

Now imagine that in addition to being our team’s coach I was also the league’s commissioner. Imagine that I changed the rules. From now on, lay-ups and jump shots were abolished, and the ball had to be passed three times from player to player before a team could score.

My new rules could be fairly and consistently enforced, but their entire effect would be to diminish the importance of height and enhance the importance of ball handling and set shots.

Of course, I could never get away with this. Every fifth grader, not to mention their parents and coaches, would immediately understand that my rule changes unfairly favored my own team, and disadvantaged theirs (at least the ones with the higher ponytails).

This blog is not about basketball, of course. It is about researcher-made measures or developer-made measures. (I’m using “researcher-made” to refer to both). I’ve been writing a lot about such measures in various blogs on the What Works Clearinghouse (https://wordpress.com/post/robertslavinsblog.wordpress.com/795 and https://wordpress.com/post/robertslavinsblog.wordpress.com/792).

The reason I’m writing again about this topic is that I’ve gotten some criticism for my criticism of researcher-made measures, and I wanted to respond to these concerns.

First, here is my case, simply put. Measures made by researchers or developers are likely to favor whatever content was taught in the experimental group. I’m not in any way suggesting that researchers or developers are deliberately making measures to favor the experimental group. However, it usually works out that way. If the program teaches unusual content, no matter how laudable that content may be, and the control group never saw that content, then the potential for bias is obvious. If the experimental group was taught on computers and the control group was not, and the test was given on a computer, the bias is obvious. If the experimental treatment emphasized certain vocabulary and the control group did not, then a test of those particular words has obvious bias. If a math program spends a lot of time teaching students to do mental rotations of shapes, and the control treatment never did such exercises, a test that includes mental rotations is obviously biased. In our BEE full-scale reviews of pre-K to 12 reading, math, and science programs, available at www.bestevidence.org, we have long excluded such measures, calling them “treatment-inherent.” The WWC calls such measures “over-aligned,” and says it excludes them.

However, the problem turns out to be much deeper. In a 2016 article in Educational Researcher, Alan Cheung and I examined outcomes from all 645 studies in the BEE achievement reviews and found that, even after excluding treatment-inherent measures, effect sizes on measures made by researchers or developers were about twice as large as those on independent measures (effect sizes = +0.40 for researcher-made measures, +0.20 for independent measures). Graduate student Marta Pellegrini more recently analyzed data from all WWC reading and math studies. The ratio among WWC studies was 2.7 to 1 (effect sizes = +0.52 for researcher-made measures, +0.19 for independent ones). Again, the WWC was supposed to have already removed overaligned measures, all of which (I’d assume) were also researcher-made.
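
For readers who want to see exactly what is being averaged here, an “effect size” in these reviews is a standardized mean difference (Cohen’s d). Below is a minimal sketch of the arithmetic, with made-up scores; the +0.40 and +0.20 at the end simply restate the published averages as a ratio.

```python
# Minimal sketch (not the BEE or WWC code) of the "effect size" arithmetic:
# Cohen's d = (treatment mean - control mean) / pooled standard deviation.
import statistics

def effect_size(treatment_scores, control_scores):
    n_t, n_c = len(treatment_scores), len(control_scores)
    pooled_var = ((n_t - 1) * statistics.variance(treatment_scores) +
                  (n_c - 1) * statistics.variance(control_scores)) / (n_t + n_c - 2)
    return (statistics.mean(treatment_scores) -
            statistics.mean(control_scores)) / pooled_var ** 0.5

# Made-up posttest scores for a hypothetical study:
treatment = [14, 18, 15, 20, 16, 19, 17, 15, 18, 16]
control = [15, 17, 14, 18, 15, 16, 16, 14, 17, 15]
print("effect size:", round(effect_size(treatment, control), 2))

# The two-to-one comparison in the text is just a ratio of average effect sizes:
print("ratio:", 0.40 / 0.20)
```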

Some of my critics argue that because the WWC already excludes overaligned measures, they have already taken care of the problem. But if that were true, there would not be a ratio of 2.7 to 1 in effect sizes between researcher-made and independent measures, after removing measures considered by the WWC to be overaligned.

Other critics express concern that my analyses (of bias due to researcher-made measures) have only involved reading, math, and science measures, and the situation might be different for measures of social-emotional outcomes, for example, where appropriate measures may not exist.

I will admit that in areas other than achievement the issues are different, and I’ve written about them. So I’ll be happy to limit the simple version of “no researcher-made measures” to achievement measures. The problems of measuring social-emotional outcomes fairly are far more complex, and for another day.

Other critics express concern that even on achievement measures, there are situations in which appropriate measures don’t exist. That may be so, but in policy-oriented reviews such as the WWC or Evidence for ESSA, it’s hard to imagine that there would be no existing measures of reading, writing, math, science, or other achievement outcomes. An achievement objective so rarified that it has never been measured is probably not particularly relevant for policy or practice.

The WWC is not an academic journal, and it is not primarily intended for academics. If a researcher needs to develop a new measure to test a question of theoretical interest, they should do so by all means. But the findings from that measure should not be accepted or reported by the WWC, even if a journal might accept it.

Another version of this criticism is that researchers often have a strong argument that the program they are evaluating emphasizes standards that should be taught to all students, but are not. Therefore, enhanced performance on a (researcher-made) measure of the better standard is prima facie evidence of a positive program impact. This argument confuses the purpose of experimental evaluations with the purpose of standards. Standards exist to express what we want students to know and be able to do. Arguing for a given standard involves considerations of the needs of the economy, standards of other states or countries, norms of the profession, technological or social developments, and so on—but not comparisons of experimental groups scoring well on tests of a new proposed standard to control groups never exposed to content relating to that standard. It’s just not fair.

To get back to basketball, I could have argued that the rules should be changed to emphasize ball handling and reduce the importance of height. Perhaps this would be a good idea, for all I know. But what I could not do was change the rules to benefit my team. In the same way, researchers cannot make their own measures and then celebrate higher scores on them as indicating higher or better standards. As any fifth grader could tell you, advocating for better rules is fine, but changing the rules in the middle of the season is wrong.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Swallowing Camels


The New Testament contains a wonderful quotation that I use often, because it unfortunately applies to so much of educational research:

Ye blind guides, which strain at a gnat, and swallow a camel (Matthew 23:24).

The point of the quotation, of course, is that we are often fastidious about minor (research) sins while readily accepting major ones.

In educational research, “swallowing camels” applies to studies accepted in top journals or by the What Works Clearinghouse (WWC) despite substantial flaws that lead to major bias, such as use of measures slanted toward the experimental group, or measures administered and scored by the teachers who implemented the treatment. “Straining at gnats” applies to concerns that, while worth attending to, have little potential for bias, yet are often reasons for rejection by journals or downgrading by the WWC. For example, our profession considers p<.05 to indicate statistical significance, while p<.051 should never be mentioned in polite company.

As my faithful readers know, I have written a series of blogs on problems with policies of the What Works Clearinghouse, such as acceptance of researcher/developer-made measures, failure to weight by sample size, use of “substantively important but not statistically significant” as a qualifying criterion, and several others. However, in this blog, I wanted to share with you some of the very worst, most egregious examples of studies that should never have seen the light of day, yet were accepted by the WWC and remain in it to this day. Accepting the WWC as gospel means swallowing these enormous and ugly camels, and I wanted to make sure that those who use the WWC at least think before they gulp.

Camel #1: DaisyQuest. DaisyQuest is a computer-based program for teaching phonological awareness to children in pre-K to Grade 1. The WWC gives DaisyQuest its highest rating, “positive,” for alphabetics, and ranks it eighth among literacy programs for grades pre-K to 1.

There were four studies of DaisyQuest accepted by the WWC. In each, half of the students received DaisyQuest in groups of 3-4, working with an experimenter. In two of the studies, control students never had their hands on a computer before they took the final tests on a computer. In the other two, control students used math software, so they at least got some experience with computers. The outcome tests were all made by the experimenters and all were closely aligned with the content of the software, with the exception of two Woodcock scales used in one of the studies. All studies used a measure called “Undersea Challenge” that closely resembled the DaisyQuest game format and was taken on the computer. All four studies also used the other researcher-made measures. None of the Woodcock measures showed statistically significant differences, but the researcher-made measures, especially Undersea Challenge and other specific tests of phonemic awareness, segmenting, and blending, did show substantial significant differences.

Recall that in the mid- to late 1990s, when the studies were done, students in preschool and kindergarten were unlikely to be getting any systematic teaching of phonemic awareness. So there was no reason to expect the control students to have learned anything that was tested on the posttests, and it is not surprising that effect sizes averaged +0.62. In the two studies in which control students had never touched a computer, effect sizes were +0.90 and +0.89, respectively.

Camel #2: Brady (1990) study of Reciprocal Teaching

Reciprocal Teaching is a program that teaches students comprehension skills, mostly using science and social studies texts. A 1990 dissertation by P.L. Brady evaluated Reciprocal Teaching in one school, in grades 5-8. The study involved only 12 students, randomly assigned to Reciprocal Teaching or control conditions. The one experimental class was taught by…wait for it…P.L. Brady. The measures included science, social studies, and daily comprehension tests related to the content taught in Reciprocal Teaching but not in the control group. They were created and scored by…(you guessed it) P.L. Brady. There were also two Gates-MacGinitie scales, but their effect sizes were much smaller than those on the researcher-made (and -scored) tests. The Brady study met the WWC criteria for “potentially positive” because its mean effect size was greater than +0.25, even though the differences were not statistically significant.

Camel #3: Schwartz (2005) study of Reading Recovery

Reading Recovery is a one-to-one tutoring program for first graders that has a strong tradition of rigorous research, including a recent large-scale study by May et al. (2016). However, one of the earlier studies of Reading Recovery, by Schwartz (2005), is hard to swallow, so to speak.

In this study, 47 Reading Recovery (RR) teachers across 14 states were asked by e-mail to choose two very similar, low-achieving first graders at the beginning of the year. One student was randomly assigned to receive RR, and one was assigned to the control group, to receive RR in the second semester.

Both students were pre- and posttested on the Observation Survey, a set of measures made by Marie Clay, the developer of RR. In addition, students were tested on Degrees of Reading Power, a standardized test.

The problems with this study mostly have to do with the fact that the teachers who administered pre- and posttests were the very same teachers who provided the tutoring. No researcher or independent tester ever visited the schools. Teachers obviously knew the child they personally tutored. I’m sure the teachers were honest and wanted to be accurate. However, they would have had a strong motivation to see that the outcomes looked good, because they could be seen as evaluations of their own tutoring, and could have had consequences for continuation of the program in their schools.

Most Observation Survey scales involve difficult judgments, so it’s easy to see how teachers’ ratings could be affected by their belief in Reading Recovery.

Further, ten of the 47 teachers never submitted any data. This is a very high rate of attrition within a single school year (21%). Could some teachers, fully aware of their students’ less-than-expected scores, have made some excuse and withheld their data? We’ll never know.

Also recall that most of the tests used in this study were from the Observation Survey made by Clay, which had effect sizes ranging up to +2.49 (!!!). However, on the independent Degrees of Reading Power, the non-significant effect size was only +0.14.

It is important to note that across these “camel” studies, all except Brady (1990) were published. So it was not only the WWC that was taken in.

These “camel” studies are far from unique, and they may not even be the very worst to be swallowed whole by the WWC. But they do give readers an idea of the depth of the problem. No researcher I know of would knowingly accept an experiment in which the control group had never used the equipment on which they were to be posttested, or one with 12 students in which the 6 experimentals were taught by the experimenter, or in which the teachers who tutored students also individually administered the posttests to experimentals and controls. Yet somehow, WWC standards and procedures led the WWC to accept these studies. Swallowing these camels should have caused the WWC a stomach ache of biblical proportions.

 

References

Brady, P. L. (1990). Improving the reading comprehension of middle school students through reciprocal teaching and semantic mapping and strategies. Dissertation Abstracts International, 52 (03A), 230-860.

May, H., Sirinides, P., Gray, A., & Goldsworthy, H. (2016). Reading Recovery: An evaluation of the four-year i3 scale-up. Newark, DE: University of Delaware, Center for Research in Education and Social Policy.

Schwartz, R. M. (2005). Literacy learning of at-risk first grade students in the Reading Recovery early intervention. Journal of Educational Psychology, 97 (2), 257-267.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

…But It Was The Very Best Butter! How Tests Can Be Reliable, Valid, and Worthless

I was recently having a conversation with a very well-informed, statistically savvy, and experienced researcher, who was upset that we do not accept researcher- or developer-made measures for our Evidence for ESSA website (www.evidenceforessa.org). “But what if a test is reliable and valid,” she said. “Why shouldn’t it qualify?”

I inwardly sighed. I get this question a lot. So I thought I’d write a blog on the topic, so at least the people who read it, and perhaps their friends and colleagues, will know the answer.

Before I get into the psychometric stuff, I should say in plain English what is going on here, and why it matters. Evidence for ESSA excludes researcher- and developer-made measures because they enormously overstate effect sizes. Marta Pellegrini, at the University of Florence in Italy, recently analyzed data from every reading and math study accepted for review by the What Works Clearinghouse (WWC). She compared outcomes on tests made up by researchers or developers to those that were independent. The average effect sizes across hundreds of studies were +0.52 for researcher/developer-made measures, and +0.19 for independent measures. Almost three to one. We have also made similar comparisons within the very same studies, and the differences in effect sizes averaged 0.48 in reading and 0.45 in math.

Wow.

How could there be such a huge difference? The answer is that researchers’ and developers’ tests often focus on what they knew would be taught in the experimental group but not the control group. A vocabulary experiment might use a test that contains the specific words emphasized in the program. A science experiment might use a test that emphasizes the specific concepts taught in the experimental units but not in the control group. A program using technology might test students on a computer, which the control group did not experience. Researchers and developers may give tests that use response formats like those used in the experimental materials, but not those used in control classes.

Very often, researchers or developers have a strong opinion about what students should be learning in their subject, and they make a test that represents to them what all students should know, in an ideal world. However, if only the experimental group experienced content aligned with that curricular philosophy, then they have a huge unfair advantage over the control group.

So how can it be that using even the most reliable and valid tests doesn’t solve this problem?

In Alice in Wonderland, the March Hare tries to fix the Mad Hatter’s watch by opening it and putting butter in the works. This does not help at all, and the March Hare remarks, “But it was the very best butter!”

The point of the “very best butter” conversation in Alice in Wonderland is that something can be excellent for one purpose (e.g., spreading on bread), but worse than useless for another (e.g., fixing watches).

Returning to assessment, a test made by a researcher or developer might be ideal for determining whether students are making progress in the intended curriculum, but worthless for comparing experimental to control students.

Reliability (the ability of a test to give the same answer each time it is given) has nothing at all to do with the situation. Validity comes into play where the rubber hits the road (or the butter hits the watch).

Validity can mean many things. As reported in test manuals, it usually just means that a test’s scores correlate with scores on other tests intended to measure the same thing (convergent validity), or that it correlates more strongly with measures it should relate to than with ones it shouldn’t, as when a reading test correlates better with other reading tests than with math tests (discriminant validity). However, no test manual ever addresses validity for use as an outcome measure in an experiment. For a test to be valid for that use, it must measure content being pursued equally in experimental and control classes, not content biased toward the experimental curriculum.
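
To make this concrete, here is a small simulation of my own devising (purely illustrative; the numbers are invented, and statistics.correlation requires Python 3.10 or later). The simulated “researcher-made” test is highly reliable and correlates strongly with an independent test, yet it inflates the experimental-control effect size because it rewards content taught only to the experimental group.

```python
# Illustrative simulation: a test can show excellent reliability and convergent
# validity yet still be biased as an outcome measure in an experiment.
import random
import statistics

random.seed(1)
n = 2000
treated = [True] * n + [False] * n              # experimental, then control students
ability = [random.gauss(0, 1) for _ in range(2 * n)]

def researcher_test(i):
    # General ability + noise + a bonus available only to treated students,
    # because only they were taught the specific content the test emphasizes.
    return ability[i] + random.gauss(0, 0.5) + (0.5 if treated[i] else 0.0)

def independent_test(i):
    return ability[i] + random.gauss(0, 0.5)

form_a = [researcher_test(i) for i in range(2 * n)]
form_b = [researcher_test(i) for i in range(2 * n)]   # a retest with the same test
indep = [independent_test(i) for i in range(2 * n)]

def effect_size(scores):
    t = [s for s, is_t in zip(scores, treated) if is_t]
    c = [s for s, is_t in zip(scores, treated) if not is_t]
    pooled_sd = ((statistics.variance(t) + statistics.variance(c)) / 2) ** 0.5
    return (statistics.mean(t) - statistics.mean(c)) / pooled_sd

print("test-retest reliability:", round(statistics.correlation(form_a, form_b), 2))
print("correlation with independent test:", round(statistics.correlation(form_a, indep), 2))
print("effect size, researcher-made test:", round(effect_size(form_a), 2))
print("effect size, independent test:", round(effect_size(indep), 2))
```

The simulated test passes the usual psychometric checks, but for comparing experimental and control groups it answers the wrong question.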

A test with very high reliability and validity in its test manual or research report may be admirable for many purposes, but like “the very best butter” for fixing watches, a researcher- or developer-made measure is worse than worthless for evaluating experimental programs, no matter how high its reliability and validity.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Pilot Studies: On the Path to Solid Evidence

This week, the Education Technology Industry Network (ETIN), a division of the Software & Information Industry Association (SIIA), released an updated guide to research methods, authored by a team at Empirical Education Inc. The guide is primarily intended to help software companies understand what is required for studies to meet current standards of evidence.

In government and among methodologists and well-funded researchers, there is general agreement about the kind of evidence needed to establish the effectiveness of an education program intended for broad dissemination. To meet its top rating (“meets standards without reservations”) the What Works Clearinghouse (WWC) requires an experiment in which schools, classes, or students are assigned at random to experimental or control groups, and it has a second category (“meets standards with reservations”) for matched studies.

These WWC categories more or less correspond to the Every Student Succeeds Act (ESSA) evidence standards (“strong” and “moderate” evidence of effectiveness, respectively), and ESSA adds a third category, “promising,” for correlational studies.

Our own Evidence for ESSA website follows the ESSA guidelines, of course. The SIIA guidelines explain all of this.

Despite the overall consensus about the top levels of evidence, the problem is that doing studies that meet these requirements is expensive and time-consuming. Software companies, especially small ones with limited capital, often do not have the resources or the patience to do such studies. Any organization that has developed something new may be reluctant to invest substantial resources in large-scale evaluations until it has some indication that the program is likely to show well in a larger, longer, and better-designed evaluation. There is a path to high-quality evaluations, and it starts with pilot studies.

The SIIA Guide usefully discusses this problem, but I want to add some further thoughts on what to do when you can’t afford a large randomized study.

1. Design useful pilot studies. Evaluators need to make a clear distinction between full-scale evaluations, intended to meet WWC or ESSA standards, and pilot studies (the SIIA Guidelines call these “formative studies”), which are just meant for internal use, both to assess the strengths or weaknesses of the program and to give an early indicator of whether or not a program is ready for full-scale evaluation. The pilot study should be a miniature version of the large study. But whatever its findings, it should not be used in publicity. Results of pilot studies are important, but by definition a pilot study is not ready for prime time.

An early pilot study may be just a qualitative study, in which developers and others might observe classes, interview teachers, and examine computer-generated data on a limited scale. The problem in pilot studies is at the next level, when developers want an early indication of effects on achievement, but are not ready for a study likely to meet WWC or ESSA standards.

2. Worry about bias, not power. Small, inexpensive studies pose two types of problems. One is the possibility of bias, discussed in the next section. The other is lack of power: not having a sample large enough to establish that a potentially meaningful program impact is statistically significant, that is, unlikely to have happened by chance. To understand this, imagine that your favorite baseball team adopts a new strategy. After the first ten games, the team is doing better than it did last year, in comparison to other teams, but this could have happened by chance. After 100 games? Now the results are getting interesting. If 10 teams all adopt the strategy next year and they all see improvements, on average? Now you’re headed toward proof.

During the pilot process, evaluators might compare multiple classes or multiple schools, perhaps assigned at random to experimental and control groups. There may not be enough classes or schools for statistical significance yet, but if the mini-study avoids bias, the results will at least be in the ballpark (so to speak).
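
For a rough sense of what “enough” means, here is a back-of-the-envelope power calculation using the standard normal approximation for comparing two means (a sketch only; it is not from the SIIA Guide, and it ignores clustering, which raises the required numbers further when whole classes or schools are randomized).

```python
# Approximate students needed per group for 80% power at alpha = .05, using
# n per group ~= 2 * (z_alpha/2 + z_power)^2 / d^2 for a two-group mean comparison.
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2

for d in (0.20, 0.40, 0.80):
    print(f"effect size {d:.2f}: about {round(n_per_group(d))} students per group")
# Roughly 392, 98, and 25 students per group, respectively: modest, realistic
# effects demand samples far beyond what a small pilot can supply.
```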

3. Avoid bias. A small experiment can be fine as a pilot study, but every effort should be made to avoid bias. Otherwise, the pilot study will give a result far more positive than the full-scale study will, defeating the purpose of doing a pilot.

Examples of common sources of bias in smaller studies are as follows.

a. Use of measures made by developers or researchers. These measures typically produce greatly inflated impacts.

b. Implementation of gold-plated versions of the program. In small pilot studies, evaluators often implement versions of the program that could never be replicated, for example by providing additional staff time that could not be repeated at scale.

c. Inclusion of highly motivated teachers or students in the experimental group, but not the control group. For example, matched studies of technology often exclude teachers who did not implement “enough” of the program. The problem is that the full-scale experiment (and real life) will include all kinds of teachers, so excluding teachers who could not or did not want to engage with technology overstates the likely impact at scale in ordinary schools. Even worse, excluding students who did not use the technology enough may bias the study toward more capable students.

4. Learn from pilots. Evaluators, developers, and disseminators should learn as much as possible from pilots. Observations, interviews, focus groups, and other informal means should be used to understand what is working and what is not, so that when the program is evaluated at scale, it is at its best.

 

***

As evidence becomes more and more important, publishers and software developers will increasingly be called upon to prove that their products are effective. However, no program should have its first evaluation be a 50-school randomized experiment. Such studies are indeed the “gold standard,” but jumping from a two-class pilot to a 50-school experiment is a way to guarantee failure. Software developers and publishers should follow a path that leads to a top-tier evaluation, and learn along the way how to ensure that their programs and evaluations will produce positive outcomes for students at the end of the process.

 

This blog is sponsored by the Laura and John Arnold Foundation

Gambling With Our Children’s Futures

I recently took a business trip to Reno and Las Vegas. I don’t gamble, but it’s important to realize that casinos don’t gamble either. A casino license is permission to make a massive amount of money, risk free.

Think of a roulette table, for example, as a glitzy random number generator. People can bet on any of 38 numbers, and if that number comes up, you get back 36 times your bet. The difference between 38 and 36 is the “house percentage.” So as long as the wheel is spinning and people are betting, the casino is making money, no matter what the result of any particular spin may be. This is true because over the course of days, weeks, or months, that small percentage becomes big money. The same principle works in every game in the casino.

In educational research, we use statistics much as the casinos do, though for a very different purpose. We want to know what the effect of a given program is on students’ achievement. Think of each student in an experiment as a separate spin of the roulette wheel. If you have just a few students, or a few spins, the results may seem very good or very bad, on average. But when you have hundreds or thousands of students (or spins), the averages stabilize.

In educational experiments, some students usually get an experimental program and others serve as controls. If there are few students (spins) in each group, the differences are unreliable. But as the numbers get larger, the difference between experimental and control groups becomes reliable.

This explains why educational experiments should involve large numbers of students. With small numbers, differences could be due to chance.

Several years ago, I wrote an article on the relationship between sample size and effect size in educational experiments. Small studies (e.g., fewer than 100 students in each group) had much larger experimental-control differences (effect sizes) than big ones. How could this be?

What I think was going on is that in small studies, effect sizes could be very positive or very negative (favoring the control group). When positive results are found, results are published and publicized. When results go the other way? Not so much. The studies may disappear.
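
That selective-publication story is easy to simulate (my own illustration, with invented numbers): give every study the same true effect, make small studies noisier, and “publish” only the statistically significant results. The published small studies then look far better than the published large ones.

```python
# Illustrative simulation of publication bias. Every study estimates the same
# true effect; only statistically significant results get written up.
import random
import statistics

random.seed(2)
TRUE_EFFECT = 0.10

def published_effects(n_per_group, n_studies=10000):
    se = (2 / n_per_group) ** 0.5          # rough standard error of an effect size
    observed = [random.gauss(TRUE_EFFECT, se) for _ in range(n_studies)]
    return [d for d in observed if d >= 1.96 * se]   # the rest go in the file drawer

for n in (30, 300):
    pub = published_effects(n)
    print(f"n per group = {n:>3}: mean published effect size = "
          f"{statistics.mean(pub):.2f} ({len(pub)} of 10000 studies published)")
```

Under these made-up assumptions, the published small studies average an effect several times the true one, while the published large studies stay close to it.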

To understand this, go back to the casino. Imagine that you bet on 20 spins, and you make big money. You go home and tell your friends you are a genius, or you credit your lucky system or your rabbit’s foot. But if you lose your shirt on 20 spins, you probably slink home and stay quiet about the whole experience.

Now imagine that you bet on 1,000 spins. It is virtually certain that you will lose money, about 2/38 (5.3%) of the total amount you bet, because of the 0 and 00. This outcome is not interesting, but it tells you exactly how the system works.
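
Here is that arithmetic as a tiny simulation (illustrative only): betting one unit per spin on a single number of a 38-pocket wheel that pays 35 to 1, so a win returns 36 times the bet.

```python
# House edge on a straight-up roulette bet: expected return = 36/38 - 1 = -2/38.
import random

random.seed(3)

def average_result_per_spin(spins):
    total = 0
    for _ in range(spins):
        total += 35 if random.randrange(38) == 0 else -1   # net result of a 1-unit bet
    return total / spins

for spins in (20, 1000, 1_000_000):
    print(f"{spins:>9} spins: average result per spin = {average_result_per_spin(spins):+.3f}")
print(f"expected value per spin: {-2 / 38:+.3f}")
```

Twenty spins can come out almost anywhere; a million spins pin the average very close to the expected loss of about 0.053 units per spin.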

In big studies in education, we can also produce reliable measures of “how the system works” by comparing hundreds or thousands of experimental and control students.

Critics of quantitative research in education seem to think we are doing some sort of statistical mumbo-jumbo with our computers and baffling reports. But what we are doing is trying to get to the truth, with enough “spins” of the roulette wheel to even out chance factors.

Ironically, what large-scale research in education is intended to do is to diminish the role of chance in educational decisions. We want to help educators avoid gambling with their children’s futures.

This blog is sponsored by the Laura and John Arnold Foundation

On Meta-Analysis: Eight Great Tomatoes

I remember a long-ago advertisement for Contadina tomato paste. It went something like this:

Eight great tomatoes in an itsy bitsy can!

This ad creates an appealing image, or at least a provocative one, that I suppose sold a lot of tomato paste.

In educational research, we do something a lot like “eight great tomatoes.” It’s called meta-analysis, or systematic review.  I am particularly interested in meta-analyses of experimental studies of educational programs.  For example, there are meta-analyses of reading and math and science programs.  I’ve written them myself, as have many others.  In each, some number of relevant studies are identified.  From each study, one or more “effect sizes” are computed to represent the impact of the program on important outcomes, such as scores on achievement tests. These are then averaged to get an overall impact for each program or type of program.  Think of the effect size as boiling down tomatoes to make concentrated paste, to fit into an itsy bitsy can.

But here is the problem.  The Contadina ad specifies eight great tomatoes. If even one tomato is instead a really lousy one, the contents of the itsy bitsy can will be lousy.  Ultimately, lousy tomato pastes would bankrupt the company.

The same is true of meta-analyses. Some meta-analyses include a broad range of studies: good, mediocre, and bad. They may try to statistically control for various factors, but this does not do the job. Bad studies lead to bad outcomes. Years ago, I critiqued a meta-analysis of class size research. The studies of class size in ordinary classrooms found small effects. But there was one study that involved teaching tennis. In small classes, the kids got a lot more court time than did kids in large classes. This study, and only this study, found substantial effects of class size, and it pulled up the average considerably. There were not eight great tomatoes; there was at least one lousy tomato, which made the itsy bitsy can worthless.
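
A few lines of arithmetic make the one-lousy-tomato point. The numbers below are hypothetical, not the actual class size literature, but they show how a single aberrant study can triple the average effect size of an otherwise unremarkable set of studies.

```python
# Hypothetical effect sizes: seven ordinary-classroom studies plus one outlier
# from a setting (like tennis instruction) unlike ordinary classrooms.
import statistics

ordinary_studies = [0.05, 0.10, 0.00, 0.08, 0.12, 0.03, 0.07]
outlier_study = 1.25

print("mean without the outlier:", round(statistics.mean(ordinary_studies), 2))
print("mean with the outlier:   ", round(statistics.mean(ordinary_studies + [outlier_study]), 2))
```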

The point I am making here is that when doing meta-analysis, the studies must be pre-screened for quality, and then carefully scrubbed.  Specifically, there are many factors that greatly (and falsely) inflate effect size.  Examples include use of assessments made by the researchers and ones that assess what was taught in the experimental group but not the control group, use of small samples, and provision of excessive assistance to the teachers.

Some meta-analyses just shovel all the studies onto a computer and report an average effect size.  More responsible ones shovel the studies into a computer and then test for and control for various factors that might affect outcomes. This is better, but you just can’t control for lousy studies, because they are often lousy in many ways.

Instead, high-quality meta-analyses set specific inclusion criteria, defined in advance, that are intended to minimize bias. Studies often use both valid measures and crummy measures (such as those biased toward the experimental group). Good meta-analyses use the good measures but not the crummy ones, and studies that used only crummy measures are excluded. And so on.

With systematic standards, systematically applied, meta-analyses can be of great value.  Call it the Contadina method.  In order to get great tomato paste, start with great tomatoes. The rest takes care of itself.