In Meta-Analyses, Weak Inclusion Standards Lead to Misleading Conclusions. Here’s Proof.

By Robert Slavin and Amanda Neitzel, Johns Hopkins University

In two recent blogs (here and here), I’ve written about Baltimore’s culinary glories: crabs and oysters. My point was just that in both cases, there is a lot you have to discard to get to what matters. But I was of course just setting the stage for a problem that is deadly serious, at least to anyone concerned with evidence-based reform in education.

Meta-analysis has contributed a great deal to educational research and reform, helping readers find out about the broad state of the evidence on practical approaches to instruction and school and classroom organization. Recent methodological developments in meta-analysis and meta-regression, and promotion of the use of these methods by agencies such as IES and NSF, have expanded awareness and use of modern methods.

Yet across the large numbers of meta-analyses published over the past five years, even up to the present, the quality is highly uneven. That’s putting it nicely. The problem is that most meta-analyses in education are far too unselective with regard to the methodological quality of the studies they include. Actually, I’ve been ranting about this for many years, and along with colleagues, have published several articles on it (e.g., Cheung & Slavin, 2016; Slavin & Madden, 2011; Wolf et al., 2020). But clearly, my colleagues and I are not making enough of a difference.

My colleague, Amanda Neitzel, and I thought of a simple way we could communicate the enormous difference it makes if a meta-analysis accepts studies that contain design elements known to inflate effect sizes. In this blog, we once again use the Kulik & Fletcher (2016) meta-analysis of research on computerized intelligent tutoring, which I critiqued in my blog a few weeks ago (here). As you may recall, the only methodological inclusion standards used by Kulik & Fletcher required that studies use RCTs or QEDs, and that they have a duration of at least 30 minutes (!!!). However, they included enough information to allow us to determine the effect sizes that would have resulted if they had a) weighted for sample size in computing means, which they did not, and b) excluded studies with various features known to inflate effect size estimates. Here is a table summarizing our findings when we additionally excluded studies containing procedures known to inflate mean effect sizes:

If you follow meta-analyses, this table should be shocking. It starts out with 50 studies and a very large effect size, ES=+0.65. Just weighting the mean for study sample sizes reduces this to +0.56. Eliminating small studies (n<60) cuts the number of studies almost in half (to 27) and cuts the effect size to +0.39. But the largest reductions are due to excluding “local” measures, which on inspection are always measures made by developers or researchers themselves. (The alternative was “standardized measures.”) By itself, excluding local measures (and weighting) cuts the number of included studies to 12, and the effect size to +0.10, which is not significantly different from zero (p=.17). Excluding small studies, brief studies, and “local” measures together only slightly changes the results, because both small and brief studies almost always use “local” (i.e., researcher-made) measures. Excluding all three, and weighting for sample size, leaves this review with only nine studies and an effect size of +0.09, which is not significantly different from zero (p=.21).

The estimates at the bottom of the chart represent what we call “selective standards.” These are the standards we apply in every meta-analysis we write (see www.bestevidence.org), and in Evidence for ESSA (www.evidenceforessa.org).

It is easy to see why this matters. Selective standards almost always produce much lower estimates of effect sizes than do reviews with much less selective standards, which therefore include studies containing design features that have a strong positive bias on effect sizes. Consider how this affects mean effect sizes in meta-analyses. For example, imagine a study that uses two measures of achievement. One is a measure made by the researcher or developer specifically to be “sensitive” to the program’s outcomes. The other is a test independent of the program, such as GRADE/GMADE or Woodcock, standardized tests but not necessarily state tests. Imagine that the researcher-made measure obtains an effect size of +0.30, while the independent measure has an effect size of +0.10. A less-selective meta-analysis would report a mean effect size of +0.20, a respectable-sounding impact. But a selective meta-analysis would report an effect size of +0.10, a very small impact. Which of these estimates represents an outcome with meaning for practice? Clearly, school leaders should not value the +0.30 or +0.20 estimates, which require use of a test designed to be “sensitive” to the treatment. They should care about the gains on the independent test, which represents what educators are trying to achieve and what they are held accountable for. The information from the researcher-made test may be valuable to the researchers, but it has little or no value to educators or students.

The point of this exercise is to illustrate that in meta-analyses, choices of methodological exclusions may entirely determine the outcomes. Had they chosen other exclusions, the Kulik & Fletcher meta-analysis could have reported any effect size from +0.09 (n.s.) to +0.65 (p<.001).

The importance of these exclusions is not merely academic. Think how you’d explain the chart above to your sister the principal:

            Principal Sis: I’m thinking of using one of those intelligent tutoring programs to improve achievement in our math classes. What do you suggest?

            You:  Well, it all depends. I saw a review of this in the top journal in education research. It says that if you include very small studies, very brief studies, and studies in which the researchers made the measures, you could have an effect size of +0.65! That’s like seven additional months of learning!

            Principal Sis:  I like those numbers! But why would I care about small or brief studies, or measures made by researchers? I have 500 kids, we teach all year, and our kids have to pass tests that we don’t get to make up!

            You (sheepishly):  I guess you’re right, Sis. Well, if you just look at the studies with large numbers of students, which continued for more than 12 weeks, and which used independent measures, the effect size was only +0.09, and that wasn’t even statistically significant.

            Principal Sis:  Oh. In that case, what kinds of programs should we use?

From a practical standpoint, study features such as small samples or researcher-made measures add a lot to effect sizes while adding nothing to the value to students or schools of the programs or practices they want to know about. They just add a lot of bias. It’s like trying to convince someone that corn on the cob is a lot more valuable than corn off the cob, because you get so much more quantity (by weight or volume) for the same money with corn on the cob.

Most published meta-analyses only require that studies have control groups, and some do not even require that much. Few exclude researcher- or developer-made measures, or very small or brief studies. The result is that effect sizes in published meta-analyses are very often implausibly large.

Meta-analyses that include studies lacking control groups or studies with small samples, brief durations, pretest differences, or researcher-made measures report overall effect sizes that cannot be fairly compared to other meta-analyses that excluded such studies. If outcomes do not depend on the power of the particular program but rather on the number of potentially biasing features they did or did not exclude, then outcomes of meta-analyses are meaningless.

It is important to note that these two examples are not at all atypical. As we have begun to look systematically at published meta-analyses, most of them fail to exclude or control for key methodological factors known to contribute a great deal of bias. Something very serious has to be done to change this. Also, I’d remind readers that there are lots of programs that do meet strict standards and show positive effects based on reality, not on including biasing factors. At www.evidenceforessa.org, you can see more than 120 reading and math programs that meet selective standards for positive impacts. The problem is that in meta-analyses that include studies containing biasing factors, these truly effective programs are swamped by a blizzard of bias.

In my recent blog (here) I proposed a common set of methodological inclusion criteria that I would think most methodologists would agree to.  If these (or a similar consensus list) were consistently used, we could make more valid comparisons both within and between meta-analyses. But as long as inclusion criteria remain highly variable from meta-analysis to meta-analysis, then all we can do is pick out the few that do use selective standards, and ignore the rest. What a terrible waste.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research, 86(1), 42-78.

Slavin, R. E., & Madden, N. A. (2011). Measures inherent to treatments in program effectiveness reviews. Journal of Research on Educational Effectiveness, 4, 370-380.

Wolf, R., Morrison, J.M., Inns, A., Slavin, R. E., & Risman, K. (2020). Average effect sizes in developer-commissioned and independent evaluations. Journal of Research on Educational Effectiveness. DOI: 10.1080/19345747.2020.1726537

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Queasy about Quasi-Experiments? How Rigorous Quasi-Experiments Can Minimize Bias

I once had a statistics professor who loved to start discussions of experimental design with the following:

“First, pick your favorite random number.”

Obviously, if you pick a favorite random number, it isn’t random. I was recalling this bit of absurdity recently when discussing with colleagues the relative value of randomized experiments (RCTs) and matched studies, or quasi-experimental designs (QED). In randomized experiments, students, teachers, classes, or schools are assigned at random to experimental or control conditions. In quasi-experiments, a group of students, teachers, classes, or schools is identified as the experimental group, and then other schools are located (usually in the same districts) and then matched on key variables, such as prior test scores, percent free lunch, ethnicity, and perhaps other factors. The ESSA evidence standards, the What Works Clearinghouse, Evidence for ESSA, and most methodologists favor randomized experiments over QEDs, but there are situations in which RCTs are not feasible. In a recent “Straight Talk on Evidence,” Jon Baron discussed how QEDs can approach the usefulness of RCTs. In this blog, I build on Baron’s article and go further into strategies for getting the best, most unbiased results possible from QEDs.

Randomized and quasi-experimental studies are very similar in most ways. Both almost always compare experimental and control schools that were very similar on key performance and demographic factors. Both use the same statistics, and require the same number of students or clusters for adequate power. Both apply the same logic, that the control group mean represents a good approximation of what the experimental group would have achieved, on average, if the experiment had never taken place.

However, there is one big difference between randomized and quasi-experiments. In a well-designed randomized experiment, the experimental and control groups can be assumed to be equal not only on observed variables, such as pretests and socio-economic status, but also on unobserved variables. The unobserved variables we worry most about have to do with selection bias. How did it happen (in a quasi-experiment) that the experimental group chose to use the experimental treatment, or was assigned to the experimental treatment? If a set of schools decided to use the experimental treatment on their own, then these schools might be composed of teachers or principals who are more oriented toward innovation, for example. Or if the experimental treatment is difficult, the teachers who would choose it might be more hard-working. If it is expensive, then perhaps the experimental schools have more money. Any of these factors could bias the study toward finding positive effects, because schools that have teachers who are motivated or hard-working, in schools with more resources, might perform better than control schools with or without the experimental treatment.

Because of this problem of selection bias, studies that use quasi-experimental designs generally have larger effect sizes than do randomized experiments. Cheung & Slavin (2016) studied the effects of methodological features of studies on effect sizes. They obtained effect sizes from 645 studies of elementary and secondary reading, mathematics, and science, as well as early childhood programs. These studies had already passed a screening in which they would have been excluded if they had serious design flaws. The results were as follows:

Design                    No. of studies    Mean effect size
Quasi-experiments         449               +0.23
Randomized experiments    196               +0.16

Clearly, mean effect sizes were larger in the quasi-experiments, suggesting the possibility that there was bias. Compared to factors such as sample size and use of developer- or researcher-made measures, the amount of effect size inflation in quasi-experiments was modest, and some meta-analyses comparing randomized and quasi-experimental studies have found no difference at all.

Relative Advantages of Randomized and Quasi-Experiments

Because of the problems of selection bias, randomized experiments are preferred to quasi-experiments, all other factors being equal. However, there are times when quasi-experiments may be necessary for practical reasons. For example, it can be easier to recruit and serve schools in a quasi-experiment, and it can be less expensive. A randomized experiment requires that schools be recruited with the promise that they will receive an exciting program. Yet half of them will instead be in a control group, and to keep them willing to sign up, they may be given a lot of money, or an opportunity to receive the program later on. In a quasi-experiment, the experimental schools all get the treatment they want, and control schools just have to agree to be tested.  A quasi-experiment allows schools in a given district to work together, instead of insisting that experimental and control schools both exist in each district. This better simulates the reality schools are likely to face when a program goes into dissemination. If the problems of selection bias can be minimized, quasi-experiments have many attractions.

An ideal design for quasi-experiments would obtain the same unbiased outcomes as a randomized evaluation of the same treatment might do. The purpose of this blog is to discuss ways to minimize bias in quasi-experiments.

In practice, there are several distinct forms of quasi-experiments. Some have considerable likelihood of bias. However, others have much less potential for bias. In general, quasi-experiments to avoid are forms of post-hoc, or after-the-fact designs, in which determination of experimental and control groups takes place after the experiment. Quasi-experiments with much less likelihood of bias are pre-specified designs, in which experimental and control schools, classrooms, or students are identified and registered in advance. In the following sections, I will discuss these very different types of quasi-experiments.

Post-Hoc Designs

Post-hoc designs generally identify schools, teachers, classes, or students who participated in a given treatment, and then find matches for each in routinely collected data, such as district or school standardized test scores, attendance, or retention rates. These routinely collected data are taken from school records as pre- and posttests, so it may be that neither experimental nor control schools’ staffs are even aware that the experiment happened.

Post-hoc designs sound valid; the experimental and control groups were well matched at pretest, so if the experimental group gained more than the control group, that indicates an effective treatment, right?

Not so fast. There is much potential for bias in this design. First, the experimental schools are almost invariably those that actually implemented the treatment. Any schools that dropped out or (even worse) any that were deemed not to have implemented the treatment enough have disappeared from the study. This means that the surviving schools were different in some important way from those that dropped out. For example, imagine that in a study of computer-assisted instruction, schools were dropped if fewer than 50% of students used the software as much as the developers thought they should. The schools that dropped out must have had characteristics that made them unable to implement the program sufficiently. For example, they might have been deficient in teachers’ motivation, organization, skill with technology, or leadership, all factors that might also impact achievement with or without the computers. The experimental group is only keeping the “best” schools, but the control schools will represent the full range, from best to worst. That’s bias. Similarly, if individual students are included in the experimental group only if they actually used the experimental treatment a certain amount, that introduces bias, because the students who did not use the treatment may be less motivated, have lower attendance, or have other deficits.

As another example, developers or researchers may select experimental schools that they know did exceptionally well with the treatment. Then they may find control schools that match on pretest. The problem is that there could be unmeasured characteristics of the experimental schools that could cause these schools to get good results even without the treatment. This introduces serious bias. This is a particular problem if researchers pick experimental or control schools from a large database. The schools will be matched at pretest, but since the researchers may have many potential control schools to choose among, they may use selection rules that, while they maintain initial equality, introduce bias. The readers of the study might never be able to find out if this happened.

Pre-Specified Designs

The best way to minimize bias in quasi-experiments is to identify experimental and control schools in advance (as contrasted with post hoc), before the treatment is applied. After experimental and control schools, classes, or students are identified and matched on pretest scores and other factors, the names of schools, teachers, and possibly students on each list should be registered on the Registry of Efficacy and Effectiveness Studies. This way, all schools (and all students) involved in the study are counted in intent-to-treat (ITT) analyses, just as is expected in randomized studies. The total effect of the treatment is based on this list, even if some schools or students dropped out along the way. An ITT analysis reflects the reality of program effects, because it is rare that all schools or students actually use educational treatments. Such studies also usually report effects of treatment on the treated (TOT), focusing on schools and students who did implement the treatment, but such analyses are of only minor interest, as they are known to reflect bias in favor of the treatment group.

Because most government funders in effect require use of random assignment, the number of quasi-experiments is rapidly diminishing. All things being equal, randomized studies should be preferred. However, quasi-experiments may better fit the practical realities of a given treatment or population, and as such, I hope there can be a place for rigorous quasi-experiments. We need not be so queasy about quasi-experiments if they are designed to minimize bias.

References

Baron, J. (2019, December 12). Why most non-RCT program evaluation findings are unreliable (and a way to improve them). Washington, DC: Arnold Ventures.

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.

 This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

How Computers Can Help Do Bad Research

“To err is human. But it takes a computer to really (mess) things up.” – Anonymous

Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.

Actually, there are many problems, all sharing the potential for serious bias that arises when computers are used to collect “big data” on computer-based instruction. Note that I am not accusing computers of being biased in favor of their electronic pals! Computers do not have biases. They do what their operators ask them to do (so far). The problem is that “big data” often contains “big bias.”

Here is one common problem.  Evaluators of computer-based instruction almost always have available massive amounts of data indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.

Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.

This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?

The answer is that students who did use the computers an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features about those classes or schools that pretests do not capture.

Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or some students may be taking ‘credit recovery’ classes online to replace classes they have failed.

Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to schools without access to computers can be just as bad, if only the computer-using students in the computer-using schools are included.  To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot be compared to the whole non-computer (control) schools. The reason is that in the control schools, we have to assume that given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.
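For readers who like to see the arithmetic, here is a minimal simulation sketch of this problem. Everything in it is an assumption made up for illustration: the program has no true effect at all, each student has an unobserved "motivation" score that raises posttest scores, and the more motivated half of students in the program school are the ones who use the computers as recommended.

```python
import random
from statistics import mean

def simulate(n_per_school=20000, seed=1):
    """Compare a users-only estimate with an all-students estimate
    when the program truly has zero effect (hypothetical data)."""
    rng = random.Random(seed)

    def student():
        motivation = rng.gauss(0, 1)             # unobserved; not captured by pretests
        posttest = motivation + rng.gauss(0, 1)  # no program effect in either school
        return motivation, posttest

    program = [student() for _ in range(n_per_school)]
    control = [student() for _ in range(n_per_school)]

    # Assumption: the more motivated half are the ones who use the computers enough.
    users = [post for motivation, post in program if motivation > 0]

    control_mean = mean(post for _, post in control)
    users_only = mean(users) - control_mean                          # biased comparison
    all_students = mean(post for _, post in program) - control_mean  # ITT-style comparison
    return users_only, all_students

users_only, all_students = simulate()
print(f"Users only vs. control:   {users_only:+.2f}")
print(f"All students vs. control: {all_students:+.2f}")
```

Under these assumptions the users-only comparison shows a large positive difference even though the program does nothing, while the comparison that keeps every student in the program school stays close to zero.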

Another evaluation design particularly susceptible to bias is one in which, say, schools using a program are matched (based on pretests, demographics, and so on) with other schools that did not use the program, after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will “cherry-pick” their favorite experimental schools and “crabapple-pick” control schools known to have done poorly.

Solutions to Problems in Evaluating Computer-based Programs.

Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.

Randomly Assigning Schools.

The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some number of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group), and half to a business-as-usual control group. Measure achievement at pre- and post-test, and analyze using HLM or some other multi-level method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance. Randomized experiments produce “intent to treat” (ITT) estimates of program impacts that include all students whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes. Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.
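The assignment step itself is simple. Here is a minimal sketch, assuming 50 schools with made-up names and an arbitrary seed recorded so the assignment can be reproduced and audited. The multi-level analysis (HLM or similar) is a separate step and is not shown.

```python
import random

def randomly_assign(schools, seed):
    """Shuffle the schools and split them in half: experimental vs. control."""
    rng = random.Random(seed)   # fixed seed so the assignment can be reproduced
    shuffled = list(schools)    # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "experimental": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }

# Hypothetical school names, invented for illustration.
schools = [f"School {i:02d}" for i in range(1, 51)]
assignment = randomly_assign(schools, seed=2018)
print(len(assignment["experimental"]), len(assignment["control"]))
```

Every school on both lists stays in the analysis, whatever happens afterward. That is what makes the resulting estimate an ITT estimate.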

High-Quality Matched Studies.

It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.

In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.

In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.
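Here is a rough sketch of what specifying the list in advance could look like. The school names and prior scores are invented, and the greedy one-to-one matching on a single prior test score is just one simple choice made for illustration. A real study would also match on demographics and other features, and registration with REES is a separate step not shown here.

```python
def match_controls(program_schools, candidate_schools):
    """Greedy 1:1 matching: pair each program school with the closest unused
    candidate on prior mean test score, using only pre-program data."""
    available = dict(candidate_schools)   # name -> prior score; a copy we can shrink
    pairs = []
    for name, score in sorted(program_schools.items()):
        best = min(available, key=lambda c: abs(available[c] - score))
        pairs.append((name, best))
        del available[best]               # match without replacement
    return pairs

# Hypothetical prior-year mean scores, invented for illustration.
program_schools = {"Adams": 48.0, "Baker": 55.5, "Clark": 61.0}
candidate_schools = {
    "Davis": 60.0, "Evans": 47.0, "Floyd": 56.0, "Grant": 70.0, "Hayes": 52.0,
}

prespecified_pairs = match_controls(program_schools, candidate_schools)
for experimental, control in prespecified_pairs:
    print(experimental, "<->", control)
```

The key point is not the matching rule but the timing: the function sees only data that exist before the program starts, so nobody can choose schools based on how they turned out.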

All students in the selected experimental and control schools in the grades receiving the treatment would be included in the study, producing an ITT estimate. There is not much interest in this design in “big data” on how much individual students used the program, but such data would produce a  “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound of program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).

This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.

That’s All?

Yup. That’s all. These designs do not make use of the “big data” cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turn-around times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are considered bias, of interest in after-the-fact TOT analyses but not as headline ITT outcomes.

If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.

Small Studies, Big Problems

Everyone knows that “good things come in small packages.” But in research evaluating practical educational programs, this saying does not apply. Small studies are very susceptible to bias. In fact, among all the factors that can inflate effect sizes in educational experiments, small sample size is among the most powerful. This problem is widely known, and in reviewing large and small studies, most meta-analysts solve the problem by requiring minimum sample sizes and/or weighting effect sizes by their sample sizes. Problem solved.

For some reason, the What Works Clearinghouse (WWC) has so far paid little attention to sample size. It has not weighted by sample size in computing mean effect sizes, although the WWC is talking about doing this in the future. It has not even set minimums for sample size for its reviews. I know of one accepted study with a total sample size of 12 (6 experimental, 6 control). These procedures greatly inflate WWC effect sizes.

As one indication of the problem, our review of 645 reading, math, and science studies accepted by the Best Evidence Encyclopedia (www.bestevidence.org) found that studies with fewer than 250 subjects had twice the effect sizes of those with more than 250 (effect sizes=+0.30 vs. +0.16). Comparing studies with fewer than 100 students to those with more than 3000, the ratio was 3.5 to 1 (see Cheung & Slavin [2016] at http://www.bestevidence.org/word/methodological_Sept_21_2015.pdf). Several other studies have found the same effect.

In data from the What Works Clearinghouse reading and math studies, obtained by graduate student Marta Pellegrini (2017), sample size effects were also extraordinary. The mean effect size for sample sizes of 60 or fewer was +0.37; for samples of 60-250, +0.29; and for samples of more than 250, +0.13. Among all design factors she studied, small sample size made the most difference in outcomes, rivaled only by researcher/developer-made measures. In fact, sample size is more pernicious, because while reviewers can exclude researcher/developer-made measures within a study and focus on independent measures, a study with a small sample has the same problem for all measures. Also, because small-sample studies are relatively inexpensive, there are quite a lot of them, so reviews that fail to attend to sample size can greatly over-estimate overall mean effect sizes.

My colleague Amanda Inns (2018) recently analyzed WWC reading and math studies to find out why small studies produce such inflated outcomes. There are many reasons small-sample studies may produce such large effect sizes. One is that in small studies, researchers can provide extraordinary amounts of assistance or support to the experimental group. This is called “superrealization.” Another is that when studies with small sample sizes find null effects, the studies tend not to be published or made available at all, deemed a “pilot” and forgotten. In contrast, a large study is likely to have been paid for by a grant, which will produce a report no matter what the outcome. There has long been an understanding that published studies produce much higher effect sizes than unpublished studies, and one reason is that small studies are rarely published if their outcomes are not significant.

Whatever the reasons, there is no doubt that small studies greatly overstate effect sizes. In reviewing research, this well-known fact has long led meta-analysts to weight effect sizes by their sample sizes (usually using an inverse variance procedure). Yet as noted earlier, the WWC does not do this, but just averages effect sizes across studies without taking sample size into account.

One example of the problem of ignoring sample size in averaging is provided by Project CRISS. CRISS was evaluated in two studies. One had 231 students. On a staff-developed “free recall” measure, the effect size was +1.07. The other study had 2338 students, and an average effect size on standardized measures of -0.02. Clearly, the much larger study with an independent outcome measure should have swamped the effects of the small study with a researcher-made measure, but this is not what happened. The WWC just averaged the two effect sizes, obtaining a mean of +0.53.
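To make the arithmetic concrete, here is a minimal sketch in Python that compares the simple average the WWC used with a sample-size-weighted average of the two CRISS effect sizes. Weighting directly by the number of students is a simplifying assumption for illustration (meta-analysts more often use inverse-variance weights), and the function names are hypothetical.

```python
def unweighted_mean(effect_sizes):
    """Simple average that ignores how many students each study had."""
    return sum(effect_sizes) / len(effect_sizes)


def sample_size_weighted_mean(effect_sizes, sample_sizes):
    """Average in which each study counts in proportion to its sample size.

    Assumption: weights are raw student counts, a stand-in for the
    inverse-variance weights usually used in meta-analysis.
    """
    total_n = sum(sample_sizes)
    weighted_sum = sum(es * n for es, n in zip(effect_sizes, sample_sizes))
    return weighted_sum / total_n


# The two Project CRISS studies described above.
criss_effect_sizes = [1.07, -0.02]
criss_sample_sizes = [231, 2338]

simple = unweighted_mean(criss_effect_sizes)
weighted = sample_size_weighted_mean(criss_effect_sizes, criss_sample_sizes)

print(f"Unweighted mean effect size:  {simple:+.3f}")
print(f"Sample-size-weighted mean:    {weighted:+.3f}")
```

Under this simple weighting, the mean comes out at about +0.08, far below the reported +0.53, because the large study dominates the small one.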

How might the WWC set minimum sample sizes for studies to be included for review? Amanda Inns proposed a minimum of 60 students (at least 30 experimental and 30 control) for studies that analyze at the student level. She suggests a minimum of 12 clusters (6 and 6), such as classes or schools, for studies that analyze at the cluster level.
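As a rough sketch, the proposed minimums could be written as a simple screening rule. The thresholds come from the proposal above; the function name and the `unit` argument are hypothetical, introduced only for illustration.

```python
MIN_STUDENTS_PER_GROUP = 30  # at least 30 experimental and 30 control students
MIN_CLUSTERS_PER_GROUP = 6   # at least 6 experimental and 6 control clusters


def meets_minimum_sample(n_experimental, n_control, unit="student"):
    """Return True if a study meets the proposed minimum sample size.

    The counts are students when unit="student" and classes or schools
    when unit="cluster", matching the level at which the study analyzes data.
    """
    if unit == "student":
        minimum = MIN_STUDENTS_PER_GROUP
    elif unit == "cluster":
        minimum = MIN_CLUSTERS_PER_GROUP
    else:
        raise ValueError("unit must be 'student' or 'cluster'")
    return n_experimental >= minimum and n_control >= minimum


# The accepted study with 6 experimental and 6 control students would fail.
print(meets_minimum_sample(6, 6, unit="student"))
# A study that randomized 6 schools to each condition would pass.
print(meets_minimum_sample(6, 6, unit="cluster"))
```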

In educational research evaluating school programs, good things come in large packages. Small studies are fine as pilots, or for descriptive purposes. But when you want to know whether a program works in realistic circumstances, go big or go home, as they say.

The What Works Clearinghouse should exclude very small studies and should use weighting based on sample sizes in computing means. And there is no reason it should not start doing these things now.

References

Inns, A., & Slavin, R. (2018, August). Do small studies add up in the What Works Clearinghouse? Paper presented at the meeting of the American Psychological Association, San Francisco, CA.

Pellegrini, M. (2017, August). How do different standards lead to different conclusions? A comparison between meta-analyses of two research centers. Paper presented at the European Conference on Educational Research (ECER), Copenhagen, Denmark.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Higher Ponytails (And Researcher-Made Measures)

Some time ago, I coached my daughter’s fifth grade basketball team. I knew next to nothing about basketball (my sport was…well, chess), but fortunately my research assistant, Holly Roback, eagerly volunteered. She’d played basketball in college, so our girls got outstanding coaching. However, they got whammed. My assistant coach explained it after another disastrous game, “The other team’s ponytails were just higher than ours.” Basically, our girls were terrific at ball handling and free shots, but they came up short in the height department.

Now imagine that in addition to being our team’s coach I was also the league’s commissioner. Imagine that I changed the rules. From now on, lay-ups and jump shots were abolished, and the ball had to be passed three times from player to player before a team could score.

My new rules could be fairly and consistently enforced, but their entire effect would be to diminish the importance of height and enhance the importance of ball handling and set shots.

Of course, I could never get away with this. Every fifth grader, not to mention their parents and coaches, would immediately understand that my rule changes unfairly favored my own team, and disadvantaged theirs (at least the ones with the higher ponytails).

This blog is not about basketball, of course. It is about researcher-made measures or developer-made measures. (I’m using “researcher-made” to refer to both.) I’ve been writing a lot about such measures in various blogs on the What Works Clearinghouse (https://wordpress.com/post/robertslavinsblog.wordpress.com/795 and https://wordpress.com/post/robertslavinsblog.wordpress.com/792).

The reason I’m writing again about this topic is that I’ve gotten some criticism for my criticism of researcher-made measures, and I wanted to respond to these concerns.

First, here is my case, simply put. Measures made by researchers or developers are likely to favor whatever content was taught in the experimental group. I’m not in any way suggesting that researchers or developers are deliberately making measures to favor the experimental group. However, it usually works out that way. If the program teaches unusual content, no matter how laudable that content may be, and the control group never saw that content, then the potential for bias is obvious. If the experimental group was taught on computers and the control group was not, and the test was given on a computer, the bias is obvious. If the experimental treatment emphasized certain vocabulary, and the control group did not, then a test of those particular words has obvious bias. If a math program spends a lot of time teaching students to do mental rotations of shapes, and the control treatment never did such exercises, a test that includes mental rotations is obviously biased. In our BEE full-scale reviews of pre-K to 12 reading, math, and science programs, available at www.bestevidence.org, we have long excluded such measures, calling them “treatment-inherent.” The WWC calls such measures “over-aligned,” and says it excludes them.

However, the problem turns out to be much deeper. In a 2016 article in the Educational Researcher, Alan Cheung and I tested outcomes from all 645 studies in the BEE achievement reviews, and found that even after excluding treatment-inherent measures, measures from studies that were made by researchers or developers had effect sizes that were far higher than those for measures not made by researchers or developers, by a ratio of two to one (effect sizes = +0.40 for researcher-made measures, +0.20 for independent measures). Graduate student Marta Pellegrini more recently analyzed data from all WWC reading and math studies. The ratio among WWC studies was 2.7 to 1 (effect sizes = +0.52 for researcher-made measures, +0.19 for independent ones). Again, the WWC was supposed to have already removed overaligned measures, all of which (I’d assume) were also researcher-made.

Some of my critics argue that because the WWC already excludes overaligned measures, they have already taken care of the problem. But if that were true, there would not be a ratio of 2.7 to 1 in effect sizes between researcher-made and independent measures, after removing measures considered by the WWC to be overaligned.

Other critics express concern that my analyses (of bias due to researcher-made measures) have only involved reading, math, and science measures, and the situation might be different for measures of social-emotional outcomes, for example, where appropriate measures may not exist.

I will admit that in areas other than achievement the issues are different, and I’ve written about them. So I’ll be happy to limit the simple version of “no researcher-made measures” to achievement measures. The problems of measuring social-emotional outcomes fairly are far more complex, and for another day.

Other critics express concern that even on achievement measures, there are situations in which appropriate measures don’t exist. That may be so, but in policy-oriented reviews such as the WWC or Evidence for ESSA, it’s hard to imagine that there would be no existing measures of reading, writing, math, science, or other achievement outcomes. An achievement objective so rarified that it has never been measured is probably not particularly relevant for policy or practice.

The WWC is not an academic journal, and it is not primarily intended for academics. If a researcher needs to develop a new measure to test a question of theoretical interest, they should do so by all means. But the findings from that measure should not be accepted or reported by the WWC, even if a journal might accept it.

Another version of this criticism is that researchers often have a strong argument that the program they are evaluating emphasizes standards that should be taught to all students, but are not. Therefore, enhanced performance on a (researcher-made) measure of the better standard is prima facie evidence of a positive program impact. This argument confuses the purpose of experimental evaluations with the purpose of standards. Standards exist to express what we want students to know and be able to do. Arguing for a given standard involves considerations of the needs of the economy, standards of other states or countries, norms of the profession, technological or social developments, and so on—but not comparisons of experimental groups scoring well on tests of a new proposed standard to control groups never exposed to content relating to that standard. It’s just not fair.

To get back to basketball, I could have argued that the rules should be changed to emphasize ball handling and reduce the importance of height. Perhaps this would be a good idea, for all I know. But what I could not do was change the rules to benefit my team. In the same way, researchers cannot make their own measures and then celebrate higher scores on them as indicating higher or better standards. As any fifth grader could tell you, advocating for better rules is fine, but changing the rules in the middle of the season is wrong.

Swallowing Camels

The New Testament contains a wonderful quotation that I use often, because it unfortunately applies to so much of educational research:

Ye blind guides, which strain at a gnat, and swallow a camel (Matthew 23:24).

The point of the quotation, of course, is that we are often fastidious about minor (research) sins while readily accepting major ones.

In educational research, “swallowing camels” applies to studies accepted in top journals or by the What Works Clearinghouse (WWC) despite substantial flaws that lead to major bias, such as use of measures slanted toward the experimental group, or measures administered and scored by the teachers who implemented the treatment. “Straining at gnats” applies to concerns that, while worth attending to, have little potential for bias, yet are often reasons for rejection by journals or downgrading by the WWC. For example, our profession considers p<.05 to indicate statistical significance, while p<.051 should never be mentioned in polite company.

As my faithful readers know, I have written a series of blogs on problems with policies of the What Works Clearinghouse, such as acceptance of researcher/developer-made measures, failure to weight by sample size, use of “substantively important but not statistically significant” as a qualifying criterion, and several others. However, in this blog, I wanted to share with you some of the very worst, most egregious examples of studies that should never have seen the light of day, yet were accepted by the WWC and remain in it to this day. Accepting the WWC as gospel means swallowing these enormous and ugly camels, and I wanted to make sure that those who use the WWC at least think before they gulp.

Camel #1: DaisyQuest. DaisyQuest is a computer-based program for teaching phonological awareness to children in pre-K to Grade 1. The WWC gives DaisyQuest its highest rating, “positive,” for alphabetics, and ranks it eighth among literacy programs for grades pre-K to 1.

There were four studies of DaisyQuest accepted by the WWC. In each, half of the students received DaisyQuest in groups of 3-4, working with an experimenter. In two of the studies, control students never had their hands on a computer before they took the final tests on a computer. In the other two, control students used math software, so they at least got some experience with computers. The outcome tests were all made by the experimenters and all were closely aligned with the content of the software, with the exception of two Woodcock scales used in one of the studies. All studies used a measure called “Undersea Challenge” that closely resembled the DaisyQuest game format and was taken on the computer. All four studies also used the other researcher-made measures. None of the Woodcock measures showed statistically significant differences, but the researcher-made measures, especially Undersea Challenge and other specific tests of phonemic awareness, segmenting, and blending, did show substantial significant differences.

Recall that in the mid- to late-1990s, when the studies were done, students in preschool and kindergarten were unlikely to be getting any systematic teaching of phonemic awareness. So there is no reason to expect the control students to be learning anything that was tested on the posttests, and it is not surprising that effect sizes averaged +0.62. In the two studies in which control students had never touched a computer, effect sizes were +0.90 and +0.89, respectively.

Camel #2: Brady (1990) study of Reciprocal Teaching

Reciprocal Teaching is a program that teaches students comprehension skills, mostly using science and social studies texts. A 1990 dissertation by P.L. Brady evaluated Reciprocal Teaching in one school, in grades 5-8. The study involved only 12 students, randomly assigned to Reciprocal Teaching or control conditions. The one experimental class was taught by…wait for it…P.L. Brady. The measures included science, social studies, and daily comprehension tests related to the content taught in Reciprocal Teaching but not the control group. They were created and scored by…(you guessed it) P.L. Brady. There were also two Gates-MacGinitie scales, but they had effect sizes much smaller than the researcher-made (and -scored) tests. The Brady study met WWC standards for “potentially positive” because it had a mean effect size of more than +0.25 but was not statistically significant.

Camel #3: Schwartz (2005) study of Reading Recovery

Reading Recovery is a one-to-one tutoring program for first graders that has a strong tradition of rigorous research, including a recent large-scale study by May et al. (2016). However, one of the earlier studies of Reading Recovery, by Schwartz (2005), is hard to swallow, so to speak.

In this study, 47 Reading Recovery (RR) teachers across 14 states were asked by e-mail to choose two very similar, low-achieving first graders at the beginning of the year. One student was randomly assigned to receive RR, and one was assigned to the control group, to receive RR in the second semester.

Both students were pre- and posttested on the Observation Survey, a set of measures made by Marie Clay, the developer of RR. In addition, students were tested on Degrees of Reading Power, a standardized test.

The problems with this study mostly have to do with the fact that the teachers who administered pre- and posttests were the very same teachers who provided the tutoring. No researcher or independent tester ever visited the schools. Teachers obviously knew the child they personally tutored. I’m sure the teachers were honest and wanted to be accurate. However, they would have had a strong motivation to see that the outcomes looked good, because they could be seen as evaluations of their own tutoring, and could have had consequences for continuation of the program in their schools.

Most Observation Survey scales involve difficult judgments, so it’s easy to see how teachers’ ratings could be affected by their belief in Reading Recovery.

Further, ten of the 47 teachers never submitted any data. This is a very high rate of attrition within a single school year (21%). Could some teachers, fully aware of their students’ less-than-expected scores, have made some excuse and withheld their data? We’ll never know.

Also recall that most of the tests used in this study were from the Observation Survey made by Clay, which had effect sizes ranging up to +2.49 (!!!). However, on the independent Degrees of Reading Power, the non-significant effect size was only +0.14.

It is important to note that across these “camel” studies, all except Brady (1990) were published. So it was not only the WWC that was taken in.

These “camel” studies are far from unique, and they may not even be the very worst to be swallowed whole by the WWC. But they do give readers an idea of the depth of the problem. No researcher I know of would knowingly accept an experiment in which the control group had never used the equipment on which they were to be posttested, or one with 12 students in which the 6 experimentals were taught by the experimenter, or in which the teachers who tutored students also individually administered the posttests to experimentals and controls. Yet somehow, WWC standards and procedures led the WWC to accept these studies. Swallowing these camels should have caused the WWC a stomach ache of biblical proportions.

 

References

Brady, P. L. (1990). Improving the reading comprehension of middle school students through reciprocal teaching and semantic mapping and strategies. Dissertation Abstracts International, 52 (03A), 230-860.

May, H., Sirinides, P., Gray, A., & Goldsworthy, H. (2016). Reading Recovery: An evaluation of the four-year i3 scale-up. Newark, DE: University of Delaware, Center for Research in Education and Social Policy.

Schwartz, R. M. (2005). Literacy learning of at-risk first grade students in the Reading Recovery early intervention. Journal of Educational Psychology, 97 (2), 257-267.

…But It Was The Very Best Butter! How Tests Can Be Reliable, Valid, and Worthless

I was recently having a conversation with a very well-informed, statistically savvy, and experienced researcher, who was upset that we do not accept researcher- or developer-made measures for our Evidence for ESSA website (www.evidenceforessa.org). “But what if a test is reliable and valid,” she said, “Why shouldn’t it qualify?”

I inwardly sighed. I get this question a lot. So I thought I’d write a blog on the topic, so at least the people who read it, and perhaps their friends and colleagues, will know the answer.

Before I get into the psychometric stuff, I should say in plain English what is going on here, and why it matters. Evidence for ESSA excludes researcher- and developer-made measures because they enormously overstate effect sizes. Marta Pellegrini, at the University of Florence in Italy, recently analyzed data from every reading and math study accepted for review by the What Works Clearinghouse (WWC). She compared outcomes on tests made up by researchers or developers to those that were independent. The average effect sizes across hundreds of studies were +0.52 for researcher/developer-made measures, and +0.19 for independent measures. Almost three to one. We have also made similar comparisons within the very same studies, and the differences in effect sizes averaged 0.48 in reading and 0.45 in math.

Wow.

How could there be such a huge difference? The answer is that researchers’ and developers’ tests often focus on what they knew would be taught in the experimental group but not the control group. A vocabulary experiment might use a test that contains the specific words emphasized in the program. A science experiment might use a test that emphasizes the specific concepts taught in the experimental units but not in the control group. A program using technology might test students on a computer, which the control group did not experience. Researchers and developers may give tests that use response formats like those used in the experimental materials, but not those used in control classes.

Very often, researchers or developers have a strong opinion about what students should be learning in their subject, and they make a test that represents to them what all students should know, in an ideal world. However, if only the experimental group experienced content aligned with that curricular philosophy, then they have a huge unfair advantage over the control group.

So how can it be that using even the most reliable and valid tests doesn’t solve this problem?

In Alice in Wonderland, the Mad Hatter tries to fix the White Rabbit’s watch by opening it and putting butter in the works. This does not help at all, and the Mad Hatter remarks, “But it was the very best butter!”

The point of the “very best butter” conversation in Alice in Wonderland is that something can be excellent for one purpose (e.g., spreading on bread), but worse than useless for another (e.g., fixing watches).

Returning to assessment, a test made by a researcher or developer might be ideal for determining whether students are making progress in the intended curriculum, but worthless for comparing experimental to control students.

Reliability (the ability of a test to give the same answer each time it is given) has nothing at all to do with the situation. Validity comes into play where the rubber hits the road (or the butter hits the watch).

Validity can mean many things. As reported in test manuals, it usually just means that a test’s scores correlate with other scores on tests intended to measure the same thing (convergent validity), or possibly that it correlates better with things it should correlate with than with things it shouldn’t, as when a reading test correlates better with other reading tests than with math tests (discriminant validity). However, no test manual ever addresses validity for use as an outcome measure in an experiment. For a test to be valid for that use, it must measure content being pursued equally in experimental and control classes, not biased toward the experimental curriculum.

Any test that reports very high reliability and validity in its test manual or research report may be admirable for many purposes, but like “the very best butter” for fixing watches, a researcher- or developer-made measure is worse than worthless for evaluating experimental programs, no matter how high it is in reliability and validity.

The Curse of the Cluster

If you follow my blogs, you’ve probably noticed that I stay away from three topics on which reasoned discourse is impossible: religion, politics, and statistics. However, just this once I’d like to break my own rule and talk about statistics, or rather research design. And I promise not to be too nerdy.

While there is little argument about basic principles of statistics and research design, things do get a bit dicey in the real world. Some of my colleagues resolve any situation that is less than ideal by ignoring studies with the slightest flaw. I think that can be a huge waste of (usually) government money, and can deprive researchers and educators alike of valuable information.

My personal position is that all flaws are not created equal. In particular, some flaws introduce bias and some do not. For example, use of researcher-made measures, small sample sizes, and matched rather than randomized designs introduce bias, so they should be avoided or minimized in importance.

On the other hand, accounting for clustering in designs in which students are grouped in classes or schools is now considered essential. That is, if you randomly assign 20 schools to experimental (n=10) or control (n=10) conditions, you might have 5000 students per treatment. Randomly assigning 5000 students one at a time would be a huge study. In fact, 300 students might be enough. However, in a clustered study, 5000 per treatment may be too small. Current statistical principles demand that you use a method called Hierarchical Linear Modeling (HLM) to analyze the data, and unless the effect size is very large, 20 schools will not be sufficient for statistical significance.
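To see why 5000 students per treatment can be "too small," here is a rough sketch using the standard design effect for equal-sized clusters, 1 + (m - 1) × ICC, where m is the number of students per school. The intraclass correlation of 0.15 is purely an assumption for illustration, as are the function names; real values vary by outcome and grade.

```python
def design_effect(cluster_size, icc):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_students, cluster_size, icc):
    """Number of independently assigned students carrying the same information."""
    return n_students / design_effect(cluster_size, icc)


# The example above: 10 schools and 5000 students per treatment.
students_per_treatment = 5000
schools_per_treatment = 10
students_per_school = students_per_treatment // schools_per_treatment  # 500

assumed_icc = 0.15  # assumption for illustration, not a figure from this blog

deff = design_effect(students_per_school, assumed_icc)
n_eff = effective_sample_size(students_per_treatment, students_per_school, assumed_icc)

print(f"Design effect: {deff:.2f}")
print(f"Effective students per treatment: {n_eff:.1f}")
```

Under that assumed ICC, 5000 students spread across 10 schools carry about as much statistical information as roughly 66 students assigned one at a time.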

Yet here’s the rub: failing to account for clustering does not introduce bias. That is, if you (mistakenly) analyzed at the student level in a study in which treatments were implemented at the class or school level, the effect size would be about the same. All that would change would be statistical significance. That is, you would overstate the number of experimental-control differences claimed to be significant (i.e., beyond what you’d expect by chance).
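A small simulation can illustrate the point. The sketch below invents data for 10 schools per condition (every number in it is made up for illustration) and compares a student-level analysis with an analysis of school means. With equal school sizes the estimated difference in means is identical either way; only the standard error changes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable


def simulate_arm(n_schools, n_per_school, arm_effect, school_sd=0.5, student_sd=1.0):
    """Return a list of schools, each a list of simulated student scores."""
    schools = []
    for _ in range(n_schools):
        school_shift = random.gauss(0, school_sd)  # shared by everyone in the school
        schools.append([
            arm_effect + school_shift + random.gauss(0, student_sd)
            for _ in range(n_per_school)
        ])
    return schools


def flatten(schools):
    return [score for school in schools for score in school]


def se_of_difference(group_a, group_b):
    """Standard error of a difference between two independent group means."""
    return (statistics.variance(group_a) / len(group_a)
            + statistics.variance(group_b) / len(group_b)) ** 0.5


treatment = simulate_arm(10, 100, arm_effect=0.20)
control = simulate_arm(10, 100, arm_effect=0.0)

# Student-level analysis: treats all 1000 students per arm as independent.
t_students, c_students = flatten(treatment), flatten(control)
diff_student = statistics.mean(t_students) - statistics.mean(c_students)
se_student = se_of_difference(t_students, c_students)

# Cluster-level analysis: treats the 10 school means per arm as the data.
t_means = [statistics.mean(school) for school in treatment]
c_means = [statistics.mean(school) for school in control]
diff_cluster = statistics.mean(t_means) - statistics.mean(c_means)
se_cluster = se_of_difference(t_means, c_means)

print(f"Student-level: difference = {diff_student:.3f}, SE = {se_student:.3f}")
print(f"Cluster-level: difference = {diff_cluster:.3f}, SE = {se_cluster:.3f}")
```

The two differences match, but the student-level standard error is much smaller, which is exactly how ignoring clustering overstates statistical significance without changing the estimate itself.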

All right, let’s accept that clustered data should be analyzed using HLM, which accounts for clustering. But while we are straining at the clustering gnat, what camels are we swallowing?

My personal bugbear is researcher-made measures. Often, the very same researchers who take an unyielding position on clustering happily accept research designs in which the researcher made the test, even if the test is clearly aligned with the content the experimental group (but not the control group) was taught. In some studies, the teachers who provided tutoring, for example, also gave the tests. Strict-on-clustering researchers also often accept studies that were very brief, sometimes a week or less, or often just an hour. They may accept studies in which conditions in the experimental groups were substantially enhanced beyond what could ever be done in real life, as in technology studies in which a graduate student is placed in every class or even every small group every day to “help with the technology.”

All of these research designs are far more likely to produce misleading findings than are studies that only suffer from clustering problems, and worse, these effects introduce bias, while failing to attend to clustering does not.

Why is this of importance to non-statisticians? It matters because in education, students are usually taught in large groups, so except for studies of one-to-one or small-group tutoring, clustering almost always has to be accounted for, and as a consequence, randomized experiments typically must involve 40-50 schools (20-25 per treatment) to detect an effect size as small as 0.20. Such experiments are very expensive, and they are difficult to do if you are not an expert already. The clustering requirement, therefore, makes it difficult for researchers early in their careers to get funding and to show success if they do, because managing implementation and collecting data in 50 schools is really, really hard.

I do not have a good solution for this problem, and I upset my colleagues when I bring it up. But we have to face it. Making accounting for clustering an absolute requirement makes educational research too expensive. Put another way, it means that we can do too few studies for the dollars we do invest. And this requirement bars entry to the field to those unable to get multi-million dollar grants or to manage large field experiments.

One solution to the cluster problem might be to have research funders fund step-by-step studies. For example, imagine that funding were offered for studies of 10 schools to be analyzed at the cluster level (correct but hopelessly underpowered) and at the student level (Bad! But affordable.). If the outcomes are promising, funders could fund another 10-school study, and researchers could combine the samples, repeating this process until there are enough schools to collectively justify a proper clustered analysis. This would also enable neophyte researchers to learn from experience, it would allow everyone to learn over time what the potential impacts are, and it could save billions of dollars now being spent on monster randomized studies of programs never before having shown promising effects (which then turn out to be ineffective).

A gradual approach to clustering might enable the field of education to focus on the real enemy, which is bias. If we systematically stamp out design elements that add bias, then over time the field will converge upon truth, and will cost-effectively move forward knowledge of what works, in time to benefit today’s children. The curse of the cluster is holding back the whole field. With all due respect to the real problems clustered designs present, let’s find ways to compromise so we can learn from unbiased but modest-sized studies and go step-by-step toward better information for practice.

Beware of Do-It-Yourself Assessments

Faithful readers of this blog, and followers of the Best Evidence Encyclopedia (BEE), will know that I am always cautioning readers of program evaluations to pay no attention to findings from measures overly aligned with the experimental but not the control treatment. For example, when researchers teach a set of vocabulary words to the experimental students (but not the controls), it is not surprising to find strong impacts. Unfortunately this happens all too often, but we carefully winnow such measures out of our BEE reviews.

In a recent paper written with my colleague Alan Cheung, we looked at 645 studies accepted across all BEE reviews done so far to find out which methodological factors are associated with excessive, improbable effect sizes. In an earlier blog I wrote about the profound impact of sample size: small studies get (improbably) big effect sizes.

Another important factor, however, was the use of experimenter-made measures. Even after our careful, conservative weeding out of studies with over-aligned measures, we were surprised to find out that effect sizes on measures made by experimenters were twice as high as effect sizes on measures made by someone else (usually standardized tests).

It may be going too far to suggest that no one should ever use or accept experimenter-made measures, no matter how fair they appear to be to the experimental and control groups. However, what it does say is that we need to be very cautious in accepting experimenter-made measures. Standardized tests are far from perfect, but they are almost always fair to experimental and control groups, as control teachers can be assumed to be trying as hard as experimental teachers to improve outcomes on these measures. This may not be so on experimenter-made tests.

I’m all for do-it-yourself cooking, home repairs, and other projects. But when it comes to do-it-yourself educational measurement, let the reader beware!

Who Opposes Evidence-Based Reform?

The slow and uncertain pace of progress in evidence-based reform in education seems surprising at one level. How could anyone be against anything so obviously beneficial to children? It must indeed be embarrassing to come out openly against evidence. Who argues for ignorance? Yet while few would stand up and condemn it, I would guess that many educators and researchers would be (secretly) happy if the movement just shriveled up and died.

To illustrate part of the problem, let me tell you about a couple of conversations I had at a dinner for new department heads at the University of York, in England. At the dinner, I chatted with the person on my right, who was the chair in biology, as I recall. I told him I was in York to promote evidence-based reform in education. “I’m against that,” he said. “My daughter is a very gifted high school student. If someone found programs that worked on average, her school might use them. Yet the system is serving my daughter very well.”

I turned to my left and chatted with the chair of the physics department. His response was almost identical to that of the biology chair. He also had a brilliant daughter, and the system was working very well for her, thank you very much.

So from this and many other experiences, I have learned that one reason for lack of enthusiasm for evidence is that the system we have is built by and for the people who benefit from it. (A privileged glimpse into the perfectly obvious.) High-quality, widely disseminated evidence cannot be controlled, so it might actually cause change, thereby disrupting the system for those for whom it fortunately works. I once heard a respected state superintendent, speaking entirely without irony to an audience of researchers, say the following:

“If research confirms what I believe, it is good research. If it does not, it is bad research.”

The problem with rigorous research is that it can and often does contradict what its funders and advocates originally hope for, and this makes it dangerous. Ignoring or twisting research makes life so much easier for stakeholders of all kinds.

Another group of evidence skeptics are fellow researchers concerned that the kind of quantitative, experimental research emphasized in evidence-based reform is not what they do. So if it prevails, their funding or esteem might be diminished.

Many teachers are uneasy about evidence, because they see it as one more way they may be oppressed by standardized tests, or because they fear being forced to implement proven programs. I’m sympathetic to teachers’ concerns in these arenas, but policies to allay these concerns are possible, for example, by allowing teachers to vote on adopting proven programs (as we do in our Success for All whole-school approach). Also, most teachers, in my experience, are delighted to have proven tools that make them more effective in their jobs.

If support for evidence-based reform comes only from those who benefit from it personally or institutionally, we are doomed. The movement will only prevail if the issue is posed this way:

“How can we use evidence to make sure that students get the best possible outcomes from their education?”

As long as we think only about what is best for kids, evidence-based reform will succeed. There are many legitimate debates to be had about methods and mechanisms, but if we could all agree that students would be better off receiving programs that have been rigorously tested and found to be effective, we’d be 90% of the way to our goal. Anyone who is in education because they want to see kids succeed, which is nearly everyone, should be able to agree. Start with the kids and everything else falls into place. Isn’t that always the case?