Developer- and Researcher-Made Measures

What if people could make their own yardsticks, and all of a sudden people who did so gained two inches overnight, while people who used ordinary yardsticks did not change height? What if runners counted off time as they ran (one Mississippi, two Mississippi…), and then it so happened that these runners reduced their time in the 100-yard dash by 20%? What if archers could draw their own targets freehand and those who did got more bullseyes?

All of these examples are silly, you say. Of course people who make their own measures will do better on the measures they themselves create. Even the most honest and sincere people, trying to be fair, may give themselves the benefit of the doubt in such situations.

In educational research, it is frequently the case that researchers or developers make up their own measures of achievement or other outcomes. Numerous reviews of research (e.g., Baye et al., 2019; Cheung & Slavin, 2016; de Boer et al., 2014; Wolf et al., 2019) have found that studies that use measures made by developers or researchers obtain effect sizes that may be two or three times as large as those from measures independent of the developers or researchers. In fact, some studies (e.g., Wolf et al., 2019; Slavin & Madden, 2011) have compared outcomes on researcher/developer-made measures and independent measures within the same studies. In almost every study with both kinds of measures, the researcher/developer measures show much higher effect sizes.

I think anyone can see that researcher/developer measures tend to overstate effects, and the reasons why they would do so are readily apparent (though I will discuss them in a moment). I and other researchers have been writing about this problem in journals and other outlets for years. Yet journals still accept these measures, most authors of meta-analyses still average them into their findings, and life goes on.

I’ve written about this problem in several blogs in this series. In this one I hope to share observations about the persistence of this practice.

How Do Researchers Justify Use of Researcher/Developer-Made Measures?

Very few researchers in education are dishonest, and I do not believe that researchers set out to hoodwink readers by using measures they made up. Instead, researchers who make up their own measures or use developer-made measures express reasonable-sounding rationales for making their own measures. Some common rationales are discussed below.

  1. Perhaps the most common rationale for using researcher/developer-made measures is that the alternative is to use standardized tests, which are felt to be too insensitive to any experimental treatment. Often researchers will use both a “distal” (i.e., standardized) measure and a “proximal” (i.e., researcher/developer-made) measure. For example, studies of vocabulary-development programs that focus on specific words will often create a test consisting primarily or entirely of these focal words. They may also use a broad-range standardized test of vocabulary. Typically, such studies find positive effects on the words taught in the experimental group, but not on vocabulary in general. However, the students in the control group did not focus on the focal words, so it is unlikely they would improve on them as much as students who spent considerable time with them, regardless of the teaching method. Control students may be making impressive gains on vocabulary, mostly on words other than those emphasized in the experimental group.
  2. Many researchers make up their own tests to reflect their beliefs about how children should learn. For example, a researcher might believe that students should learn algebra in third grade. Because there are no third-grade algebra tests, the researcher might make one. If others complain that of course the students taught algebra in third grade will do better on a test of the algebra they learned (but that the control group never saw), the researcher may give excellent reasons why algebra should be taught to third graders, and if the control group didn’t get that content, well, they should.
  3. Often, researchers say they used their own measures because there were no appropriate tests available focusing on whatever they taught. However, there are many tests of all kinds available either from specialized publishers or from measures made by other researchers. A researcher who cannot find anything appropriate is perhaps studying something so esoteric that it will not have ever been seen by any control group.
  4. Sometimes, researchers studying technology applications will give the final test on the computer. This may, of course, give a huge advantage to the experimental group, which may have been using the specific computers and formats emphasized in the test. The control group may have much less experience with computers, or with the particular computer formats used in the experimental group. The researcher might argue that it would not be fair to teach on computers but test on paper. Yet every student knows how to write with a pencil, but not every student has extensive experience with the computers used for the test.


A Potential Solution to the Problem of Researcher/Developer Measures

Researcher/developer-made measures clearly inflate effect sizes considerably. Further, research in education, an applied field, should use measures like those for which schools and teachers are held accountable. No principal or teacher gets to make up his or her own test to use for accountability, and neither should researchers or developers have that privilege.

However, arguments for the use of researcher- and developer-made measures are not entirely foolish, as long as these measures are only used as supplements to independent measures. For example, in a vocabulary study, there may be a reason researchers want to know the effect of a program on the hundred words it emphasizes. This is at least a minimum expectation for such a treatment. If a vocabulary intervention that focused on only 100 words all year did not improve knowledge of those words, that would be an indication of trouble. Similarly, there may be good reasons to try out treatments based on unique theories of action and to test them using measures also aligned with that theory of action.

The problem comes in how such results are reported, and especially how they are treated in meta-analyses or other quantitative syntheses. My suggestions are as follows:

  1. Results from researcher/developer-made measures should be reported in articles on the program being evaluated, but not emphasized or averaged with independent measures. Analyses of researcher/developer-made measures may provide information, but not a fair or meaningful evaluation of the program impact. Reports of effect sizes from researcher/developer measures should be treated as implementation measures, not outcomes. The outcomes emphasized should only be those from independent measures.
  2. In meta-analyses and other quantitative syntheses, only independent measures should be used in calculations. Results from researcher/developer measures may be reported in program descriptions, but never averaged in with the independent measures.
  3. Studies whose only achievement measures are made by researchers or developers should not be included in quantitative reviews.

Fields in which research plays a central and respected role in policy and practice always pay close attention to the validity and fairness of measures. If educational research is ever to achieve a similar status, it must relegate measures made by researchers or developers to a supporting role, and stop treating such data the same way it treats data from independent, valid measures.

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (2019). Effective reading programs for secondary students. Reading Research Quarterly, 54 (2), 133-166.

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

de Boer, H., Donker, A.S., & van der Werf, M.P.C. (2014). Effects of the attributes of educational interventions on students’ academic performance: A meta-analysis. Review of Educational Research, 84 (4), 509-545. https://doi.org/10.3102/0034654314540006

Slavin, R.E., & Madden, N.A. (2011). Measures inherent to treatments in program effectiveness reviews. Journal of Research on Educational Effectiveness, 4 (4), 370-380.

Wolf, R., Morrison, J., Inns, A., Slavin, R., & Risman, K. (2019). Differences in average effect sizes in developer-commissioned and independent studies. Manuscript submitted for publication.

Photo Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

Hummingbirds and Horses: On Research Reviews

Once upon a time, there was a very famous restaurant, called The Hummingbird.   It was known the world over for its unique specialty: Hummingbird Stew.  It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird?  You’d have to catch dozens of them just for one bowl of stew.

One day, an experienced restaurateur came to The Hummingbird, and asked to speak to the owner.  When they were alone, the visitor said, “You have quite an operation here!  But I have been in the restaurant business for many years, and I have always wondered how you do it.  No one can make money selling Hummingbird Stew!  Tell me how you make it work, and I promise on my honor to keep your secret to my grave.  Do you…mix just a little bit?”


The Hummingbird’s owner looked around to be sure no one was listening.   “You look honest,” he said. “I will trust you with my secret.  We do mix in a bit of horsemeat.”

“I knew it!” said the visitor.  “So tell me, what is the ratio?”

“One to one.”

“Really!” said the visitor.  “Even that seems amazingly generous!”

“I think you misunderstand,” said the owner.  “I meant one hummingbird to one horse!”

In education, we write a lot of reviews of research.  These are often very widely cited, and can be very influential.  Because of the work my colleagues and I do, we have occasion to read a lot of reviews.  Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically.  Well-done reviews can reveal patterns of findings that can be of great value to both researchers and educators.  They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.

However, all too many reviews are deeply flawed.  Frequently, reviews of research make it impossible to check the validity of the findings of the original studies.  As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew.   Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies.  If the individual studies are of good quality, this is wonderfully useful.  But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles.  Few readers are willing to do this.

Recently, I have been looking at a lot of recent reviews, all of them published, often in top journals.  One published review only used pre-post gains.  Presumably, if the reviewers found a study with a control group, they would have ignored the control group data!  Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.
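To make the arithmetic behind this point concrete, here is a minimal Python sketch. All of the numbers are hypothetical, chosen only for illustration, and the helper name `standardized_difference` is my own; none of this comes from any particular review.

```python
def standardized_difference(mean_a, mean_b, sd):
    """Difference between two means expressed in standard deviation units."""
    return (mean_a - mean_b) / sd

# Hypothetical test scores (assumed for illustration only).
SD = 10.0
program_pre, program_post = 50.0, 58.0   # students who received the program
control_pre, control_post = 50.0, 56.0   # similar students who did not

# A pre-post analysis credits the program with the entire gain,
# including growth the students would have made anyway.
pre_post_es = standardized_difference(program_post, program_pre, SD)

# An experimental-control comparison credits the program only with
# the gain beyond what the control group achieved.
experimental_control_es = standardized_difference(program_post, control_post, SD)

print(f"pre-post: {pre_post_es:+.2f}, experimental-control: {experimental_control_es:+.2f}")
```

In this made-up case the same program looks four times as effective under the pre-post analysis, simply because the control group's ordinary growth has been ignored.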

I have also recently seen reviews that include studies with and without control groups (i.e., pre-post gains), and those with and without pretests.  Without pretests, experimental and control groups may have started at very different points, and these differences just carry over to the posttests.  Accepting this jumble of experimental designs, a review makes no sense.  Treatments evaluated using pre-post designs will almost always look far more effective than those that use experimental-control comparisons.

Many published reviews include results from measures that were made up by program developers.  We have documented that analyses using such measures produce outcomes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies rather than longer ones, and from published studies rather than, for example, technical reports.

The biggest problem is that in many reviews, the designs of the individual studies are never described sufficiently to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out how many of the studies behind the review’s conclusions are rigorous and how many are not. Many years ago, I looked into a widely cited review of research on achievement effects of class size. Study details were lacking, so I had to find and read the original studies. It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis! The studies that reduced class size within the usual range (e.g., from 24 to 12) had very small achievement impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.
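A small Python sketch shows how this can happen. The effect sizes below are invented for illustration and are not the values from the class size review; the sketch also uses a simple unweighted mean, whereas real meta-analyses usually weight studies.

```python
import statistics

# Hypothetical effect sizes (assumed, not taken from any actual review).
usual_range_studies = [0.05, 0.03, 0.04, 0.06, 0.02]   # ordinary class size reductions
outlier_studies = [0.90, 1.20]                          # e.g., tutoring-like studies

mean_usual = statistics.mean(usual_range_studies)
mean_all = statistics.mean(usual_range_studies + outlier_studies)

print(f"usual-range studies only: {mean_usual:+.2f}")
print(f"all studies averaged:     {mean_all:+.2f}")
```

With these assumed values, two outlying studies are enough to turn a very small average into one that looks substantial, which is why tables of study-level details matter.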

It would be great if all reviews excluded studies that used procedures known to inflate effect sizes, but at bare minimum, reviewers should be routinely required to include tables showing critical details of each study, and then to analyze whether the reported outcomes might be due to studies that used procedures suspected to inflate effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.


Effect Sizes and Additional Months of Gain: Can’t We Just Agree That More is Better?

In the 1984 mockumentary This is Spinal Tap, there is a running joke about a hapless band, Spinal Tap, which proudly bills itself “Britain’s Loudest Band.”  A pesky reporter keeps asking the band’s leader, “But how can you prove that you are Britain’s loudest band?” The band leader explains, with declining patience, that while ordinary amplifiers’ sound controls only go up to 10, Spinal Tap’s go up to 11.  “But those numbers are arbitrary,” says the reporter.  “They don’t mean a thing!”  “Don’t you get it?” asks the band leader.  “ELEVEN is more than TEN!  Anyone can see that!”

In educational research, we have an ongoing debate reminiscent of Spinal Tap.  Educational researchers speaking to other researchers invariably express the impact of educational treatments as effect sizes (the difference in adjusted means for the experimental and control groups divided by the unadjusted standard deviation).  All else being equal, higher effect sizes are better than lower ones.
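The definition in parentheses can be written as a one-line calculation. This is a minimal Python sketch; the function name `effect_size` and the example numbers are assumptions made for illustration, and in practice the adjusted means would come from a statistical model rather than being typed in.

```python
def effect_size(adj_mean_experimental, adj_mean_control, unadjusted_sd):
    """Difference in adjusted means divided by the unadjusted standard deviation."""
    if unadjusted_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (adj_mean_experimental - adj_mean_control) / unadjusted_sd

# Hypothetical numbers: adjusted means of 52 and 50 on a test with an SD of 10.
example = effect_size(52.0, 50.0, 10.0)
print(f"effect size: {example:+.2f}")
```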

However, educators who are not trained in statistics often despise effect sizes.  “What do they mean?” they ask.  “Tell us how much difference the treatment makes in student learning!”

Researchers want to be understood, so they try to translate effect sizes into more educator-friendly equivalents.  The problem is that the friendlier the units, the more statistically problematic they are.  The friendliest of all is “additional months of learning.”  Researchers or educators can look on a chart and, for any particular effect size, they can find the number of “additional months of learning.”  The Education Endowment Foundation in England, which funds and reports on rigorous experiments, reports both effect sizes and additional months of learning, and provides tables to help people make the conversion.  But here’s the rub.  A recent article by Baird & Pane (2019) compared additional months of learning to three other translations of effect sizes.  Additional months of learning was rated highest in ease of use, but lowest in four other categories, such as transparency and consistency. For example, a month of learning clearly has a different meaning in kindergarten than it does in tenth grade.

The other translations rated higher by Baird and Pane were, at least to me, just as hard to understand as effect sizes.  For example, the What Works Clearinghouse presents, along with effect sizes, an “improvement index” that has the virtue of being equally incomprehensible to researchers and educators alike.

On one hand, arguing about outcome metrics is as silly as arguing the relative virtues of Fahrenheit and Celsius. If one unit can be directly transformed into the other, who cares?

However, additional months of learning is often used to cover up very low effect sizes. I recently ran into an example of this in a series of studies by the Stanford Center for Research on Education Outcomes (CREDO), in which disadvantaged urban African American students gained 59 more “days of learning” than matched students not in charters in math, and 44 more days in reading. These numbers were cited in an editorial praising charter schools in the May 29 Washington Post.

However, these “days of learning” are misleading. The effect size for this same comparison was only +0.08 for math, and +0.06 for reading. Any researcher will tell you that these are very small effects. They were only made to look big by reporting the gains in days. These not only magnify the apparent differences, but they also make them unstable. Would it interest you to know that White students in urban charter schools performed 36 days a year worse than matched students in math (ES= -0.05) and 14 days worse in reading (ES= -0.02)? How about Native American students in urban charter schools, whose scores were 70 days worse than matched students in non-charters in math (ES= -0.10), and equal in reading? I wrote about charter school studies in a recent blog. In the blog, I did not argue that charter schools are effective for disadvantaged African Americans but harmful for Whites and Native Americans. That seems unlikely. What I did argue is that the effects of charter schools are so small that the directions of the effects are unstable. The overall effects across all urban schools studied were only 40 days (ES=+0.055) in math and 28 days (ES=+0.04) in reading. These effects look big because of the “days of learning” transformation, but they are not.
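The size of the magnification can be read off the figures quoted above. The following Python sketch simply divides each reported number of days by its effect size. It is back-of-the-envelope arithmetic on rounded published figures, not CREDO's documented conversion procedure, and the labels are my own shorthand.

```python
# (label, days of learning, effect size), all as quoted in the text above.
reported = [
    ("African American, math", 59, 0.08),
    ("African American, reading", 44, 0.06),
    ("White, math", -36, -0.05),
    ("White, reading", -14, -0.02),
    ("Native American, math", -70, -0.10),
    ("All urban, math", 40, 0.055),
    ("All urban, reading", 28, 0.04),
]

# Days of learning implied by one full standard deviation, per comparison.
implied_days_per_sd = {label: days / es for label, days, es in reported}

for label, ratio in implied_days_per_sd.items():
    print(f"{label:28s} about {ratio:.0f} days per standard deviation")
```

Every comparison implies something in the neighborhood of 700 days per standard deviation, so an effect size of just +0.01 shows up as about a week of “learning.”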

In This is Spinal Tap, the argument about whether or not Spinal Tap is Britain’s loudest band is absurd.  Any band can turn its amplifiers to the top and blow out everyone’s eardrums, whether the top is marked eleven or ten.  In education, however, it does matter a great deal that educators are taking evidence into account in their decisions about educational programs. Using effect sizes, perhaps supplemented by additional months of learning, is one way to help readers understand outcomes of educational experiments. Using “days of learning,” however, is misleading, making very small impacts look important. Why not additional hours or minutes of learning, while we’re at it? Spinal Tap would be proud.

References

Baird, M., & Pane, J. (2019). Translating standardized effects of education programs into more interpretable metrics. Educational Researcher. Advance online publication. doi.org/10.3102/0013189X19848729

CREDO (2015). Overview of the Urban Charter School Study. Stanford, CA: Author.

Washington Post (2019, May 29). Denying poor children a chance [Editorial]. The Washington Post, A16.

 


Measuring Social Emotional Skills in Schools: Return of the MOOSES

Throughout the U.S., there is huge interest in improving students’ social emotional skills and related behaviors. This is indeed important as a means of building tomorrow’s society. However, measuring SEL skills is terribly difficult. Not that measuring reading, math, or science learning is easy, but there are at least accepted measures in those areas. In SEL, almost anything goes, and measures cover an enormous range. Some measures might be fine for theoretical research and some would be all right if they were given independently of the teachers who administered the treatment, but SEL measures are inherently squishy.

A few months ago, I wrote a blog on measurement of social emotional skills. In it, I argued that social emotional skills should be measured in pragmatic school research as objectively as possible, especially to avoid measures that merely reflect having students in experimental groups repeating back attitudes or terminology they learned in the program. I expressed the ideal for social emotional measurement in school experiments as MOOSES: Measurable, Observable, Objective, Social Emotional Skills.

Since that time, our group at Johns Hopkins University has received a generous grant from the Gates Foundation to add research on social emotional skills and attendance to our Evidence for ESSA website. This has enabled our group to dig a lot deeper into measures for social emotional learning. In particular, JHU graduate student Sooyeon Byun created a typology of SEL measures arrayed from least to most MOOSE-like. This is as follows.

  1. Cognitive Skills or Low-Level SEL Skills.

Examples include executive functioning tasks such as pencil tapping, the Stroop test, and other measures of cognitive regulation, as well as recognition of emotions. These skills may be of importance as part of theories of action leading to social emotional skills of importance to schools, but they are not goals of obvious importance to educators in themselves.

  2. Attitudes toward SEL (non-behavioral).

These include agreement with statements such as “bullying is wrong,” and statements about why other students engage in certain behaviors (e.g., “He spilled the milk because he was mean.”).

  3. Intention for SEL behaviors (quasi-behavioral).

Scenario-based measures (e.g., what would you do in this situation?).

  4. SEL behaviors based on self-report (semi-behavioral).

Reports of actual behaviors of self, or observations of others, often with frequencies (e.g., “How often have you seen bullying in this school during this school year?” or “How often do you feel anxious or afraid in class in this school?”).

This category was divided according to who is reporting:

4a. Interested party (e.g., report by teachers or parents who implemented the program and may have reason to want to give a positive report)

4b. Disinterested party (e.g., report by students or by teachers or parents who did not administer the treatment)

  5. MOOSES (Measurable, Observable, Objective Social Emotional Skills)
  • Behaviors observed by independent observers, either by researchers, ideally unaware of treatment assignment, or by school officials reporting on behaviors as they always would, not as part of a study (e.g., regular reports of office referrals for various infractions, suspensions, or expulsions).
  • Standardized tests
  • Other school records


Uses for MOOSES

All other things being equal, school researchers and educators should want to know about measures as high as possible on the MOOSES scale. However, all things are never equal, and in practice, some measures lower on the MOOSES scale may be all that exists or ever could exist. For example, it is unlikely that school officials or independent observers could determine students’ anxiety or fear, so self-report (level 4b) may be essential. MOOSES measures (level 5) may be objectively reported by school officials, but limiting attention to such measures may restrict SEL measurement to readily observable behaviors, such as aggression, truancy, and other behaviors of importance to school management, and leave out difficult-to-observe behaviors such as bullying.

Still, we expect to find in our ongoing review of the SEL literature that there will be enough research on outcomes measured at level 3 or above to enable us to downplay levels 1 and 2 for school audiences, and in many cases to downplay reports by interested parties in level 4a, where teachers or parents who implement a program then rate the behavior of the children they served.

Social emotional learning is important, and we need measures that reflect its importance, minimizing potential bias and staying as close as possible to independent, meaningful measures of behaviors that are of the greatest importance to educators. In our research team, we have very productive arguments about these measurement issues in the course of reviewing individual articles. I placed a cardboard cutout of a “principal” called “Norm” in our conference room. Whenever things get too theoretical, we consult “Norm” for his advice. For example, “Norm” is not too interested in pencil tapping and Stroop tests, but he sure cares a lot about bullying, aggression, and truancy. Of course, as part of our review we will be discussing our issues and initial decisions with real principals and educators, as well as other experts on SEL.

The growing number of studies of SEL in recent years enables reviewers to set higher standards than would have been feasible even just a few years ago. We still have to maintain a balance in which we can be as rigorous as possible but not end up with too few studies to review. We can all aspire to be MOOSES, but that is not practical for some measures. Instead, it is useful to have a model of the ideal and what approaches the ideal, so we can make sense of the studies that exist today, with all due recognition of when we are accepting measures that are nearly MOOSES but not quite the real Bullwinkle.


How Computers Can Help Do Bad Research

“To err is human. But it takes a computer to really (mess) things up.” – Anonymous

Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.

Actually, there are many problems, all sharing the possibility of serious bias created when computers are used to collect “big data” on computer-based instruction. (Note that I am not accusing computers of being biased in favor of their electronic pals! The problem is that “big data” often contains “big bias.” Computers do not have biases. They do what their operators ask them to do. So far.)

Here is one common problem.  Evaluators of computer-based instruction almost always have available massive amounts of data indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.

Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.

This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?

The answer is that students who did use the computers an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features about those classes or schools that pretests do not capture.

Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or some students may be taking ‘credit recovery’ classes online to replace classes they have failed.

Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to schools without access to computers can be just as bad, if only the computer-using students in the computer-using schools are included.  To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot be compared to the whole non-computer (control) schools. The reason is that in the control schools, we have to assume that given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.
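A small simulation can illustrate the argument. In this Python sketch, every choice is an assumption made for illustration: the program has no true effect at all, students differ in an unmeasured “motivation” that raises their posttest scores, and the more motivated half of students in program schools are the ones who use the computers.

```python
import random
import statistics

def simulate_null_program(n_students=5000, seed=0):
    """Compare 'users only' and intent-to-treat estimates when the true effect is zero."""
    rng = random.Random(seed)

    def draw_student():
        motivation = rng.gauss(0, 1)             # not captured by any pretest
        posttest = motivation + rng.gauss(0, 1)  # the program itself adds nothing
        return motivation, posttest

    program_schools = [draw_student() for _ in range(n_students)]
    control_schools = [draw_student() for _ in range(n_students)]

    control_scores = [post for _, post in control_schools]
    all_program_scores = [post for _, post in program_schools]
    # Only the more motivated half actually use the computers.
    user_scores = [post for motivation, post in program_schools if motivation > 0]

    sd = statistics.pstdev(control_scores)
    control_mean = statistics.mean(control_scores)
    users_only_es = (statistics.mean(user_scores) - control_mean) / sd
    itt_es = (statistics.mean(all_program_scores) - control_mean) / sd
    return users_only_es, itt_es

users_only_es, itt_es = simulate_null_program()
print(f"users vs. whole control schools: {users_only_es:+.2f}")
print(f"all students (intent to treat):  {itt_es:+.2f}")
```

Under these assumptions, comparing only the users to the whole control schools yields an apparent effect of around half a standard deviation for a program that does nothing, while the comparison that includes all students stays close to zero.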

Another evaluation design particularly susceptible to bias is one in which, say, schools using a program are matched (based on pretests, demographics, and so on) with other schools that did not use the program, after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will “cherry-pick” their favorite experimental schools and “crabapple-pick” control schools known to have done poorly.


Solutions to Problems in Evaluating Computer-based Programs

Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.

Randomly Assigning Schools.

The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some number of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group), and half to a business-as-usual control group. Measure achievement at pre- and post-test, and analyze using HLM or some other multi-level method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance.  Randomized experiments produce “intent to treat” (ITT) estimates of program impacts that include all students whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes.  Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.
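The random assignment step itself is simple. Here is a minimal Python sketch of assigning half of a list of schools to each condition; the school names, the function name `assign_schools`, and the seed are placeholders I made up, and the multi-level analysis that follows assignment is not shown.

```python
import random

def assign_schools(school_ids, seed):
    """Randomly split schools into experimental and control halves."""
    rng = random.Random(seed)     # a recorded seed makes the assignment reproducible
    shuffled = list(school_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])

# Hypothetical sample of 50 schools.
schools = [f"school_{i:02d}" for i in range(1, 51)]
experimental, control = assign_schools(schools, seed=2019)

print(len(experimental), "experimental schools;", len(control), "control schools")
```

Every student in the assigned schools and grades then counts in the analysis, whether or not they ever logged on, which is what makes the estimate an ITT estimate.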

High-Quality Matched Studies.

It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.

In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.

In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.
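The matching-in-advance step can be sketched very simply. The version below is an assumption-laden illustration, not a recommended procedure: it matches on a single hypothetical covariate (prior mean test score) using a greedy nearest-neighbor rule, whereas a real study would match on several pretest and demographic variables. The school names and scores are invented.

```python
def match_controls(program_schools, candidate_schools):
    """Greedy one-to-one matching on prior mean test score.

    Both arguments map school name -> prior score (hypothetical data).
    Each program school takes the closest still-unmatched candidate.
    """
    available = dict(candidate_schools)
    pairs = []
    for name, score in sorted(program_schools.items()):
        # Ties in distance are broken by school name so the result is deterministic.
        best = min(available, key=lambda c: (abs(available[c] - score), c))
        pairs.append((name, best))
        del available[best]
    return pairs


program = {"P1": 50.0, "P2": 62.0}
candidates = {"C1": 49.0, "C2": 61.5, "C3": 70.0}

# This list is what would be fixed and registered BEFORE the program starts.
registered_pairs = match_controls(program, candidates)
print(registered_pairs)  # [('P1', 'C1'), ('P2', 'C2')]
```

The important design choice is not the matching rule but the timing: the function only ever sees prior scores, so no posttest information can influence which schools end up in the analysis.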

All students in the selected experimental and control schools in the grades receiving the treatment would be included in the study, producing an ITT estimate. There is not much interest in this design in “big data” on how much individual students used the program, but such data would produce a  “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound of program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).

This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.

That’s All?

Yup. That’s all. These designs do not make use of the “big data” cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turn-around times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are considered bias, of interest in after-the-fact TOT analyses but not as headline ITT outcomes.

If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.

John Hattie is Wrong

John Hattie is a professor at the University of Melbourne, Australia. He is famous for a book, Visible Learning, which claims to review every area of research that relates to teaching and learning. He uses a method called “meta-meta-analysis,” averaging effect sizes from many meta-analyses. The book ranks factors from one to 138 in terms of their effect sizes on achievement measures. Hattie is a great speaker, and many educators love the clarity and simplicity of his approach. How wonderful to have every known variable reviewed and ranked!

However, operating on the principle that anything that looks to be too good to be true probably is, I looked into Visible Learning to try to understand why it reports such large effect sizes. My colleague, Marta Pellegrini from the University of Florence (Italy), helped me track down the evidence behind Hattie’s claims. And sure enough, Hattie is profoundly wrong. He is merely shoveling meta-analyses containing massive bias into meta-meta-analyses that reflect the same biases.

Part of Hattie’s appeal to educators is that his conclusions are so easy to understand. He even uses a system of dials with color-coded “zones,” where effect sizes of 0.00 to +0.15 are designated “developmental effects,” +0.15 to +0.40 “teacher effects” (i.e., what teachers can do without any special practices or programs), and +0.40 to +1.20 the “zone of desired effects.” Hattie makes a big deal of the magical effect size +0.40, the “hinge point,” recommending that educators essentially ignore factors or programs below that point, because they are no better than what teachers produce each year, from fall to spring, on their own. In Hattie’s view, an effect size of from +0.15 to +0.40 is just the effect that “any teacher” could produce, in comparison to students not being in school at all. He says, “When teachers claim that they are having a positive effect on achievement or when a policy improves achievement, this is almost always a trivial claim: Virtually everything works. One only needs a pulse and we can improve achievement.” (Hattie, 2009, p. 16). An effect size of 0.00 to +0.15 is, he estimates, “what students could probably achieve if there were no schooling” (Hattie, 2009, p. 20). Yet this characterization of dials and zones misses the essential meaning of effect sizes, which are rarely used to measure the amount teachers’ students gain from fall to spring, but rather the amount students receiving a given treatment gained in comparison to gains made by similar students in a control group over the same period. So an effect size of, say, +0.15 or +0.25 could be very important.
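To make the treatment-versus-control meaning of an effect size concrete, here is a minimal sketch of a standardized mean difference (Cohen's d with a pooled standard deviation). The gain scores are invented for illustration, and this bare formula is a simplification: real evaluations adjust for pretests and for clustering.

```python
from statistics import mean, variance


def effect_size(treatment_gains, control_gains):
    """Standardized mean difference (Cohen's d, pooled SD).

    Compares gains in a treatment group with gains made by a control
    group over the same period, not gains from fall to spring alone.
    """
    n_t, n_c = len(treatment_gains), len(control_gains)
    pooled_var = (
        (n_t - 1) * variance(treatment_gains)
        + (n_c - 1) * variance(control_gains)
    ) / (n_t + n_c - 2)
    return (mean(treatment_gains) - mean(control_gains)) / pooled_var ** 0.5


# Hypothetical fall-to-spring gains. Both groups improve a lot,
# but the effect size reflects only the difference between them.
treatment = [12, 14, 16, 18, 20]
control = [11, 13, 15, 17, 19]
print(round(effect_size(treatment, control), 2))  # 0.32
```

In this toy example both groups gain about 15 points over the year, yet the effect size is driven only by the one-point difference between them. That is why a value such as +0.15 or +0.25 against a real control group can matter.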

Hattie’s core claims are these:

  • Almost everything works
  • Any effect size less than +0.40 is ignorable
  • It is possible to meaningfully rank educational factors in comparison to each other by averaging the findings of meta-analyses.

These claims appear appealing, simple, and understandable. But they are also wrong.

The essential problem with Hattie’s meta-meta-analyses is that they accept the results of the underlying meta-analyses without question. Yet many, perhaps most meta-analyses accept all sorts of individual studies of widely varying standards of quality. In Visible Learning, Hattie considers and then discards the possibility that there is anything wrong with individual meta-analyses, specifically rejecting the idea that the methods used in individual studies can greatly bias the findings.

To be fair, a great deal has been learned about the degree to which particular study characteristics bias study findings, always in a positive (i.e., inflated) direction. For example, there is now overwhelming evidence that effect sizes are significantly inflated in studies that have small sample sizes or brief durations, use measures made by researchers or developers, are published (vs. unpublished), or use quasi-experiments (vs. randomized experiments) (Cheung & Slavin, 2016). Many meta-analyses even include pre-post studies, or studies that do not have pretests, or have pretest differences but fail to control for them. For example, I once criticized a meta-analysis of gifted education in which some studies compared students accepted into gifted programs to students rejected for those programs, controlling for nothing!

A huge problem with meta-meta-analysis is that until recently, meta-analysts rarely screened individual studies to remove those with fatal methodological flaws. Hattie himself rejects this procedure: “There is…no reason to throw out studies automatically because of lower quality” (Hattie, 2009, p. 11).

In order to understand what is going on in the underlying meta-analyses in a meta-meta-analysis, it is crucial to look all the way down to the individual studies. As a point of illustration, I examined Hattie’s own meta-meta-analysis of feedback, his third ranked factor, with a mean effect size of +0.79. Hattie & Timperley (2007) located 12 meta-analyses. I found some of the ones with the highest mean effect sizes.

At a mean of +1.24, the meta-analysis with the largest effect size in the Hattie & Timperley (2007) review was a review of research on various reinforcement treatments for students in special education by Skiba, Casey, & Center (1985-86). The reviewers required use of single-subject designs, so the review consisted of a total of 35 students treated one at a time, across 25 studies. Yet it is known that single-subject designs produce much larger effect sizes than ordinary group designs (see What Works Clearinghouse, 2017).

The second-highest effect size, +1.13, was from a meta-analysis by Lysakowski & Walberg (1982), on instructional cues, participation, and corrective feedback. Not enough information is provided to understand the individual studies, but there is one interesting note. A study using a single-subject design, involving two students, had an effect size of 11.81. That is the equivalent of raising a child’s IQ from 100 to 277! It was “winsorized” to the next-highest value of 4.99 (which is like adding 75 IQ points). Many of the studies were correlational, with no controls for inputs, or had no control group, or were pre-post designs.
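The IQ comparisons above are simple arithmetic: an effect size is expressed in standard deviation units, and IQ scales are conventionally normed to a mean of 100 and a standard deviation of 15. A short sketch, assuming that convention (the function name is just for illustration):

```python
IQ_MEAN = 100  # conventional IQ scale mean
IQ_SD = 15     # conventional IQ scale standard deviation (assumed here)


def iq_equivalent(effect_size, start=IQ_MEAN):
    """Translate an effect size (in SD units) into an IQ score."""
    return start + effect_size * IQ_SD


print(round(iq_equivalent(11.81)))  # 277
print(round(4.99 * IQ_SD))          # 75 IQ points added
```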

A meta-analysis by Rummel and Feinberg (1988), with a reported effect size of +0.60, is perhaps the most humorous inclusion in the Hattie & Timperley (2007) meta-meta-analysis. It consists entirely of brief lab studies of the degree to which being paid or otherwise reinforced for engaging in an activity that was already intrinsically motivating would reduce subjects’ later participation in that activity. Rummel & Feinberg (1988) reported a positive effect size if subjects later did less of the activity they were paid to do. The reviewers decided to code studies positively if their findings corresponded to the theory (i.e., that feedback and reinforcement reduce later participation in previously favored activities), but in fact their “positive” effect size of +0.60 indicates a negative effect of feedback on performance.

I could go on (and on), but I think you get the point. Hattie’s meta-meta-analyses grab big numbers from meta-analyses of all kinds with little regard to the meaning or quality of the original studies, or of the meta-analyses.

If you are familiar with the What Works Clearinghouse (2017), or our own Best-Evidence Syntheses (www.bestevidence.org) or Evidence for ESSA (www.evidenceforessa.org), you will know that individual studies, except for studies of one-to-one tutoring, almost never have effect sizes as large as +0.40, Hattie’s “hinge point.” This is because WWC, BEE, and Evidence for ESSA all very carefully screen individual studies. We require control groups, controls for pretests, minimum sample sizes and durations, and measures independent of the treatments. Hattie applies no such standards, and in fact proclaims that they are not necessary.

It is possible, in fact essential, to make genuine progress using high-quality rigorous research to inform educational decisions. But first we must agree on what standards to apply.  Modest effect sizes from studies of practical treatments in real classrooms over meaningful periods of time on measures independent of the treatments tell us how much a replicable treatment will actually improve student achievement, in comparison to what would have been achieved otherwise. I would much rather use a program with an effect size of +0.15 from such studies than to use programs or practices found in studies with major flaws to have effect sizes of +0.79. If they understand the situation, I’m sure all educators would agree with me.

To create information that is fair and meaningful, meta-analysts cannot include studies of unknown and mostly low quality. Instead, they need to apply consistent standards of quality for each study, to look carefully at each one and judge its freedom from bias and major methodological flaws, as well as its relevance to practice. A meta-analysis cannot be any better than the studies that go into it. Hattie’s claims are deeply misleading because they are based on meta-analyses that themselves accepted studies of all levels of quality.

Evidence matters in education, now more than ever. Yet Hattie and others who uncritically accept all studies, good and bad, are undermining the value of evidence. This needs to stop if we are to make solid progress in educational practice and policy.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Hattie, J. (2009). Visible learning. New York, NY: Routledge.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 (1), 81-112.

Lysakowski, R., & Walberg, H. (1982). Instructional effects of cues, participation, and corrective feedback: A quantitative synthesis. American Educational Research Journal, 19 (4), 559-578.

Rummel, A., & Feinberg, R. (1988). Cognitive evaluation theory: A review of the literature. Social Behavior and Personality, 16 (2), 147-164.

Skiba, R., Casey, A., & Center, B. (1985-86). Nonaversive procedures in the treatment of classroom behavior problems. The Journal of Special Education, 19 (4), 459-481.

What Works Clearinghouse (2017). Procedures handbook 4.0. Washington, DC: Author.

Photo credit: U.S. Farm Security Administration [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

The Good, the Bad, and the (Un)Promising

The ESSA evidence standards are finally beginning to matter. States are starting the process that will lead them to make school improvement awards to their lowest-achieving schools. The ESSA law is clear that for schools to qualify for these awards, they must agree to implement programs that meet the strong, moderate, or promising levels of the ESSA evidence standards. This is very exciting for those who believe in the power of proven programs to transform schools and benefit children. It is good news for kids, for teachers, and for our profession.

But inevitably, there is bad news with the good. If evidence is to be a standard for government funding, there are bound to be people who disseminate programs lacking high-quality evidence who will seek to bend the definitions to declare themselves “proven.” And there are also bound to be schools and districts that want to keep using what they have always used, or to keep choosing programs based on factors other than evidence, while doing the minimum the law requires.

The battleground is the ESSA “promising” criterion. “Strong” programs are pretty well defined as having significant positive evidence from high-quality randomized studies. “Moderate” programs are pretty well defined as having significant positive evidence from high-quality matched studies. Both “strong” and “moderate” are clearly defined in Evidence for ESSA (www.evidenceforessa.org), and, with a bit of translation, by the What Works Clearinghouse, both of which list specific programs that meet or do not meet these standards.

“Promising,” on the other hand, is kind of . . . squishy. The ESSA evidence standards do define programs meeting “promising” as ones that have statistically significant effects in “well-designed and well-implemented” correlational studies, with controls for inputs (e.g., pretests). This sounds good, but it is hard to nail down in practice. I’m seeing and hearing about a category of studies that perfectly illustrate the problem. Imagine that a developer commissions a study of a form of software. A set of schools and their 1000 students are assigned to use the software, while control schools and their 1000 students do not have access to the software but continue with business as usual.

Computers routinely produce “trace data” that automatically tells researchers all sorts of things about how much students used the software, what they did with it, how successful they were, and so on.

The problem is that typically, large numbers of students given software do not use it. They may never even hit a key, or they may use the software so little that the researchers rule the software use to be effectively zero. So in a not unusual situation, let’s assume that in the treatment group, the one that got the software, only 500 of the 1000 students actually used the software at an adequate level.

Now here’s the rub. Almost always, the 500 students will out-perform the 1000 controls, even after controlling for pretests. Yet this would be likely to happen even if the software were completely ineffective.

To understand this, think about the 500 students who did use the software and the 500 who did not. The users are probably more conscientious, hard-working, and well-organized. The 500 non-users are more likely to be absent a lot, to fool around in class, to use their technology to play computer games, or go on (non-school-related) social media, rather than to do math or science for example. Even if the pretest scores in the user and non-user groups were identical, they are not identical students, because their behavior with the software is not equal.
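A small simulation can make this concrete. The sketch below is purely illustrative and every number in it is an assumption: the software has exactly zero effect, each student has an unobserved “diligence” trait that drives achievement gains, and the more diligent half of the treatment group are the ones who end up as adequate users. Gains (posttest minus pretest) stand in for “controlling for pretests.”

```python
import random
from statistics import mean


def simulate(n=1000, seed=1):
    """Compare users-only and intent-to-treat estimates for useless software."""
    rng = random.Random(seed)

    def student():
        diligence = rng.gauss(0, 1)  # unobserved trait, not captured by pretests
        # Pre-to-post gain depends on diligence only: the software adds nothing.
        gain = 5 + 3 * diligence + rng.gauss(0, 3)
        return {"diligence": diligence, "gain": gain}

    treatment = [student() for _ in range(n)]
    control = [student() for _ in range(n)]

    # The more diligent half of the treatment group become the "adequate users."
    users = sorted(treatment, key=lambda s: s["diligence"], reverse=True)[: n // 2]

    control_gain = mean(s["gain"] for s in control)
    users_only = mean(s["gain"] for s in users) - control_gain
    itt = mean(s["gain"] for s in treatment) - control_gain
    return users_only, itt


users_only, itt = simulate()
print(f"500 users vs. all 1000 controls: {users_only:+.2f}")
print(f"intent-to-treat (all 1000):      {itt:+.2f}")
```

Under these assumptions the users-only comparison shows a clear “benefit” even though the software does nothing, while the intent-to-treat comparison stays close to zero. The apparent effect comes entirely from who selected themselves into the user group.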

I once visited a secondary school in England that was a specially-funded model for universal use of technology. Along with colleagues, I went into several classes. The teachers were teaching their hearts out, making constant use of the technology that all students had on their desks. The students were well-behaved, but just a few dominated the discussion. Maybe the others were just a bit shy, we thought. From the front of each class, this looked like the classroom of the future.

But then, we filed to the back of each class, where we could see over students’ shoulders. And we immediately saw what was going on. Maybe 60 or 70 percent of the students were actually on social media unrelated to the content, paying no attention to the teacher or instructional software!

Now imagine that a study compared the 30-40% of students who were actually using the computers to students with similar pretests in other schools who had no computers at all. Again, the users would look terrific, but this is not a fair comparison, because all the goof-offs and laggards in the computer school had selected themselves out of the study while goof-offs and laggards in the control group were still included.

Rigorous researchers use a method called intent-to-treat, which in this case would include every student, whether or not they used the software or played non-educational computer games. “Not fair!” responds the software developer, because intent-to-treat includes a lot of students who never touched a key except to use social media. No sophisticated researcher accepts such an argument, however, because including only users gives the experimental group a big advantage.

Here’s what is happening at the policy level. Software developers are using data from studies that only include the students who made adequate use of the software. They are then claiming that such studies are correlational and meet the “promising” standard of ESSA.

Those who make this argument are correct in saying that such studies are correlational. But these studies are very, very, very bad, because they are biased toward the treatment. The ESSA standards specify well-designed and well-implemented studies, and these studies may be correlational, but they are not well-designed or well-implemented. Software developers and other vendors are very concerned about the ESSA evidence standards, and some may use the “promising” category as a loophole. Evidence for ESSA does not accept such studies, even as promising, and the What Works Clearinghouse does not even have any category that corresponds to “promising.” Yet vendors are flooding state departments of education and districts with studies they claim to meet the ESSA standards, though in the lowest category.

Recently, I heard something that could be a solution to this problem. Apparently, some states are announcing that for school improvement grants, and any other purpose that has financial consequences, they will only accept programs with “strong” and “moderate” evidence. They have the right to do this; the federal law says school improvement grants must support programs that at least meet the “promising” standard, but it does not say states cannot set a higher minimum standard.

One might argue that ignoring “promising” studies is going too far. In Evidence for ESSA (www.evidenceforessa.org), we accept studies as “promising” if they have weaknesses that do not lead to bias, such as clustered studies that were significant at the student but not the cluster level. But the danger posed by studies claiming to fit “promising” using biased designs is too great. Until the feds fix the definition of “promising” to exclude bias, the states may have to solve it for themselves.

I hope there will be further development of the “promising” standard to focus it on lower-quality but unbiased evidence, but as things are now, perhaps it is best for states themselves to declare that “promising” is no longer promising.

Eventually, evidence will prevail in education, as it has in many other fields, but on the way to that glorious future, we are going to have to make some adjustments. Requiring that “promising” be truly promising would be a good place to begin.


What Kinds of Studies Are Likely to Replicate?

In the hard sciences, there is a publication called the Journal of Irreproducible Results.  It really has nothing to do with replication of experiments, but is a humor journal by and for scientists.  The reason I bring it up is that to chemists and biologists and astronomers and physicists, for example, an inability to replicate an experiment is a sure indication that the original experiment was wrong.  To the scientific mind, a Journal of Irreproducible Results is inherently funny, because it is a journal of nonsense.

Replication, the ability to repeat an experiment and get a similar result, is the hallmark of a mature science. Sad to say, replication is rare in educational research, which says a lot about our immaturity as a science. For example, in the What Works Clearinghouse, about half of programs across all topics are represented by a single evaluation. When there are two or more, the results are often very different. Relatively recent funding initiatives, especially studies supported by Investing in Innovation (i3) and the Institute of Education Sciences (IES), and targeted initiatives such as Striving Readers (secondary reading) and the Preschool Curriculum Evaluation Research (PCER), have added a great deal in this regard. They have funded many large-scale, randomized, very high-quality studies of all sorts of programs in the first place, and many of these are replications themselves, or they provide a good basis for replications later. As my colleagues and I have done many reviews of research in every area of education, pre-kindergarten to grade 12 (see www.bestevidence.org), we have gained a good intuition about what kinds of studies are likely to replicate and what kinds are less likely.

First, let me define in more detail what I mean by “replication.” There is no value in replicating biased studies, which may well consistently find the same biased results (as when, for example, both the original studies and the replication studies used the same researcher- or developer-made outcome measures that are slanted toward the content the experimental group experienced but not what the control group experienced; see http://www.tandfonline.com/doi/abs/10.1080/19345747.2011.558986).

Instead, I’d consider a successful replication one that shows positive outcomes both in the original studies and in at least one large-scale, rigorous replication. One obvious way to increase the chances that a program producing a positive outcome in one or more initial studies will succeed in such a rigorous replication evaluation is to use a similar, equally rigorous evaluation design in the first place. I think a lot of treatments that fail to replicate are ones that used weak methods in the original studies. In particular, small studies tend to produce greatly inflated effect sizes (see http://www.bestevidence.org/methods/methods.html), which are unlikely to replicate in larger evaluations.

Another factor likely to contribute to replicability is use in the earlier studies of methods or conditions that can be repeated in later studies, or in schools in general. For example, providing teachers with specific manuals, videos demonstrating the methods, and specific student materials all add to the chances that a successful program can be successfully replicated. Avoiding unusual pilot sites (such as schools known to have outstanding principals or staff) may contribute to replication, as these conditions are unlikely to be found in larger-scale studies. Having experimenters or their colleagues or graduate students extensively involved in the early studies diminishes replicability, of course, because those conditions will not exist in replications.

Replications are entirely possible. I wish there were a lot more of them in our field. Showing that programs can be effective in just two rigorous evaluations is way more convincing than just one. As evidence becomes more and more important, I hope and expect that replications, perhaps carried out by states or districts, will become more common.

The Journal of Irreproducible Results is fun, but it isn’t science. I’d love to see a Journal of Replications in Education to tell us what really works for kids.

Higher Ponytails (And Researcher-Made Measures)

Some time ago, I coached my daughter’s fifth grade basketball team. I knew next to nothing about basketball (my sport was…well, chess), but fortunately my research assistant, Holly Roback, eagerly volunteered. She’d played basketball in college, so our girls got outstanding coaching. However, they got whammed. My assistant coach explained it after another disastrous game, “The other team’s ponytails were just higher than ours.” Basically, our girls were terrific at ball handling and free shots, but they came up short in the height department.

Now imagine that in addition to being our team’s coach I was also the league’s commissioner. Imagine that I changed the rules. From now on, lay-ups and jump shots were abolished, and the ball had to be passed three times from player to player before a team could score.

My new rules could be fairly and consistently enforced, but their entire effect would be to diminish the importance of height and enhance the importance of ball handling and set shots.

Of course, I could never get away with this. Every fifth grader, not to mention their parents and coaches, would immediately understand that my rule changes unfairly favored my own team, and disadvantaged theirs (at least the ones with the higher ponytails).

This blog is not about basketball, of course. It is about researcher-made measures or developer-made measures. (I’m using “researcher-made” to refer to both). I’ve been writing a lot about such measures in various blogs on the What Works Clearinghouse (https://wordpress.com/post/robertslavinsblog.wordpress.com/795 and https://wordpress.com/post/robertslavinsblog.wordpress.com/792).

The reason I’m writing again about this topic is that I’ve gotten some criticism for my criticism of researcher-made measures, and I wanted to respond to these concerns.

First, here is my case, simply put. Measures made by researchers or developers are likely to favor whatever content was taught in the experimental group. I’m not in any way suggesting that researchers or developers are deliberately making measures to favor the experimental group. However, it usually works out that way. If the program teaches unusual content, no matter how laudable that content may be, and the control group never saw that content, then the potential for bias is obvious. If the experimental group was taught on computers and the control group was not, and the test was given on a computer, the bias is obvious. If the experimental treatment emphasized certain vocabulary, and the control group did not, then a test of those particular words has obvious bias. If a math program spends a lot of time teaching students to do mental rotations of shapes, and the control treatment never did such exercises, a test that includes mental rotations is obviously biased. In our BEE full-scale reviews of pre-K to 12 reading, math, and science programs, available at www.bestevidence.org, we have long excluded such measures, calling them “treatment-inherent.” The WWC calls such measures “over-aligned,” and says it excludes them.

However, the problem turns out to be much deeper. In a 2016 article in the Educational Researcher, Alan Cheung and I tested outcomes from all 645 studies in the BEE achievement reviews, and found that even after excluding treatment-inherent measures, measures from studies that were made by researchers or developers had effect sizes that were far higher than those for measures not made by researchers or developers, by a ratio of two to one (effect sizes = +0.40 for researcher-made measures, +0.20 for independent measures). Graduate student Marta Pellegrini more recently analyzed data from all WWC reading and math studies. The ratio among WWC studies was 2.7 to 1 (effect sizes = +0.52 for researcher-made measures, +0.19 for independent ones). Again, the WWC was supposed to have already removed overaligned studies, all of which (I’d assume) were also researcher-made.
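The two ratios quoted above follow directly from the reported mean effect sizes. A quick check in Python, using only the figures given in the text (the dictionary layout and labels are just for illustration):

```python
# Mean effect sizes as reported in the text: (researcher-made, independent)
reported = {
    "BEE reviews": (0.40, 0.20),
    "WWC reading and math": (0.52, 0.19),
}

ratios = {
    source: round(researcher_made / independent, 1)
    for source, (researcher_made, independent) in reported.items()
}

for source, ratio in ratios.items():
    print(f"{source}: {ratio} to 1")
```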

Some of my critics argue that because the WWC already excludes overaligned measures, they have already taken care of the problem. But if that were true, there would not be a ratio of 2.7 to 1 in effect sizes between researcher-made and independent measures, after removing measures considered by the WWC to be overaligned.

Other critics express concern that my analyses (of bias due to researcher-made measures) have only involved reading, math, and science measures, and the situation might be different for measures of social-emotional outcomes, for example, where appropriate measures may not exist.

I will admit that in areas other than achievement the issues are different, and I’ve written about them. So I’ll be happy to limit the simple version of “no researcher-made measures” to achievement measures. The problems of measuring social-emotional outcomes fairly are far more complex, and for another day.

Other critics express concern that even on achievement measures, there are situations in which appropriate measures don’t exist. That may be so, but in policy-oriented reviews such as the WWC or Evidence for ESSA, it’s hard to imagine that there would be no existing measures of reading, writing, math, science, or other achievement outcomes. An achievement objective so rarified that it has never been measured is probably not particularly relevant for policy or practice.

The WWC is not an academic journal, and it is not primarily intended for academics. If a researcher needs to develop a new measure to test a question of theoretical interest, they should do so by all means. But the findings from that measure should not be accepted or reported by the WWC, even if a journal might accept it.

Another version of this criticism is that researchers often have a strong argument that the program they are evaluating emphasizes standards that should be taught to all students, but are not. Therefore, enhanced performance on a (researcher-made) measure of the better standard is prima facie evidence of a positive program impact. This argument confuses the purpose of experimental evaluations with the purpose of standards. Standards exist to express what we want students to know and be able to do. Arguing for a given standard involves considerations of the needs of the economy, standards of other states or countries, norms of the profession, technological or social developments, and so on—but not comparisons of experimental groups scoring well on tests of a new proposed standard to control groups never exposed to content relating to that standard. It’s just not fair.

To get back to basketball, I could have argued that the rules should be changed to emphasize ball handling and reduce the importance of height. Perhaps this would be a good idea, for all I know. But what I could not do was change the rules to benefit my team. In the same way, researchers cannot make their own measures and then celebrate higher scores on them as indicating higher or better standards. As any fifth grader could tell you, advocating for better rules is fine, but changing the rules in the middle of the season is wrong.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Swallowing Camels

The New Testament contains a wonderful quotation that I use often, because it unfortunately applies to so much of educational research:

Ye blind guides, which strain at a gnat, and swallow a camel (Matthew 23:24).

The point of the quotation, of course, is that we are often fastidious about minor (research) sins while readily accepting major ones.

In educational research, “swallowing camels” applies to studies accepted in top journals or by the What Works Clearinghouse (WWC) despite substantial flaws that lead to major bias, such as use of measures slanted toward the experimental group, or measures administered and scored by the teachers who implemented the treatment. “Straining at gnats” applies to concerns that, while worth attending to, have little potential for bias, yet are often reasons for rejection by journals or downgrading by the WWC. For example, our profession considers p<.05 to indicate statistical significance, while p<.051 should never be mentioned in polite company.

As my faithful readers know, I have written a series of blogs on problems with policies of the What Works Clearinghouse, such as acceptance of researcher/developer-made measures, failure to weight by sample size, use of “substantively important but not statistically significant” as a qualifying criterion, and several others. However, in this blog, I wanted to share with you some of the very worst, most egregious examples of studies that should never have seen the light of day, yet were accepted by the WWC and remain in it to this day. Accepting the WWC as gospel means swallowing these enormous and ugly camels, and I wanted to make sure that those who use the WWC at least think before they gulp.

Camel #1: DaisyQuest. DaisyQuest is a computer-based program for teaching phonological awareness to children in pre-K to Grade 1. The WWC gives DaisyQuest its highest rating, “positive,” for alphabetics, and ranks it eighth among literacy programs for grades pre-K to 1.

There were four studies of DaisyQuest accepted by the WWC. In each, half of the students received DaisyQuest in groups of 3-4, working with an experimenter. In two of the studies, control students never had their hands on a computer before they took the final tests on a computer. In the other two, control students used math software, so they at least got some experience with computers. The outcome tests were all made by the experimenters and all were closely aligned with the content of the software, with the exception of two Woodcock scales used in one of the studies. All studies used a measure called “Undersea Challenge” that closely resembled the DaisyQuest game format and was taken on the computer. All four studies also used the other researcher-made measures. None of the Woodcock measures showed statistically significant differences, but the researcher-made measures, especially Undersea Challenge and other specific tests of phonemic awareness, segmenting, and blending, did show substantial significant differences.

Recall that in the mid- to late-1990s, when the studies were done, students in preschool and kindergarten were unlikely to be getting any systematic teaching of phonemic awareness. So there is no reason to expect the control students to be learning anything that was tested on the posttests, and it is not surprising that effect sizes averaged +0.62. In the two studies in which control students had never touched a computer, effect sizes were +0.90 and +0.89, respectively.

Camel #2: Brady (1990) study of Reciprocal Teaching

Reciprocal Teaching is a program that teaches students comprehension skills, mostly using science and social studies texts. A 1990 dissertation by P.L. Brady evaluated Reciprocal Teaching in one school, in grades 5-8. The study involved only 12 students, randomly assigned to Reciprocal Teaching or control conditions. The one experimental class was taught by…wait for it…P.L. Brady. The measures included science, social studies, and daily comprehension tests related to the content taught in the Reciprocal Teaching group but not in the control group. They were created and scored by…(you guessed it) P.L. Brady. There were also two Gates-MacGinitie scales, but they had effect sizes much smaller than the researcher-made (and -scored) tests. The Brady study met WWC standards for “potentially positive” because it had a mean effect size of more than +0.25, even though the effect was not statistically significant.

Camel #3: Schwartz (2005) study of Reading Recovery

Reading Recovery is a one-to-one tutoring program for first graders that has a strong tradition of rigorous research, including a recent large-scale study by May et al. (2016). However, one of the earlier studies of Reading Recovery, by Schwartz (2005), is hard to swallow, so to speak.

In this study, 47 Reading Recovery (RR) teachers across 14 states were asked by e-mail to choose two very similar, low-achieving first graders at the beginning of the year. One student was randomly assigned to receive RR, and one was assigned to the control group, to receive RR in the second semester.

Both students were pre- and posttested on the Observation Survey, a set of measures made by Marie Clay, the developer of RR. In addition, students were tested on Degrees of Reading Power, a standardized test.

The problems with this study mostly have to do with the fact that the teachers who administered pre- and posttests were the very same teachers who provided the tutoring. No researcher or independent tester ever visited the schools. Teachers obviously knew the child they personally tutored. I’m sure the teachers were honest and wanted to be accurate. However, they would have had a strong motivation to see that the outcomes looked good, because they could be seen as evaluations of their own tutoring, and could have had consequences for continuation of the program in their schools.

Most Observation Survey scales involve difficult judgments, so it’s easy to see how teachers’ ratings could be affected by their belief in Reading Recovery.

Further, ten of the 47 teachers never submitted any data. This is a very high rate of attrition within a single school year (21%). Could some teachers, fully aware of their students’ less-than-expected scores, have made some excuse and withheld their data? We’ll never know.
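The attrition figure is easy to verify. The short Python sketch below uses only the two counts given above; the variable names are my own.

```python
# Counts taken from the text: 47 teachers recruited, 10 submitted no data.
teachers_recruited = 47
teachers_missing = 10

# Teacher-level attrition and the number of teachers whose data were analyzed.
attrition_rate = teachers_missing / teachers_recruited
teachers_reporting = teachers_recruited - teachers_missing

print(f"Attrition: {attrition_rate:.0%} ({teachers_reporting} of {teachers_recruited} teachers reported)")
```

Each teacher contributed one tutored student and one control student, so every missing teacher removes a matched pair from the analysis.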

Also recall that most of the tests used in this study were from the Observation Survey made by Clay, which had effect sizes ranging up to +2.49 (!!!). However, on the independent Degrees of Reading Power, the non-significant effect size was only +0.14.

It is important to note that across these “camel” studies, all except Brady (1990) were published. So it was not only the WWC that was taken in.

These “camel” studies are far from unique, and they may not even be the very worst to be swallowed whole by the WWC. But they do give readers an idea of the depth of the problem. No researcher I know of would knowingly accept an experiment in which the control group had never used the equipment on which they were to be posttested, or one with 12 students in which the 6 experimentals were taught by the experimenter, or in which the teachers who tutored students also individually administered the posttests to experimentals and controls. Yet somehow, WWC standards and procedures led the WWC to accept these studies. Swallowing these camels should have caused the WWC a stomach ache of biblical proportions.

 

References

Brady, P. L. (1990). Improving the reading comprehension of middle school students through reciprocal teaching and semantic mapping and strategies. Dissertation Abstracts International, 52 (03A), 230-860.

May, H., Sirinides, P., Gray, A., & Goldsworthy, H. (2016). Reading Recovery: An evaluation of the four-year i3 scale-up. Newark, DE: University of Delaware, Center for Research in Education and Social Policy.

Schwartz, R. M. (2005). Literacy learning of at-risk first grade students in the Reading Recovery early intervention. Journal of Educational Psychology, 97 (2), 257-267.
