Hummingbirds and Horses: On Research Reviews

Once upon a time, there was a very famous restaurant, called The Hummingbird.   It was known the world over for its unique specialty: Hummingbird Stew.  It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird?  You’d have to catch dozens of them just for one bowl of stew.

One day, an experienced restauranteur came to The Hummingbird, and asked to speak to the owner.  When they were alone, the visitor said, “You have quite an operation here!  But I have been in the restaurant business for many years, and I have always wondered how you do it.  No one can make money selling Hummingbird Stew!  Tell me how you make it work, and I promise on my honor to keep your secret to my grave.  Do you…mix just a little bit?”

blog_8-8-19_hummingbird_500x359

The Hummingbird’s owner looked around to be sure no one was listening.   “You look honest,” he said. “I will trust you with my secret.  We do mix in a bit of horsemeat.”

“I knew it!,” said the visitor.  “So tell me, what is the ratio?”

“One to one.”

“Really!,” said the visitor.  “Even that seems amazingly generous!”

“I think you misunderstand,” said the owner.  “I meant one hummingbird to one horse!”

In education, we write a lot of reviews of research.  These are often very widely cited, and can be very influential.  Because of the work my colleagues and I do, we have occasion to read a lot of reviews.  Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically.  Well- done reviews can reveal patterns of findings that can be of great value to both researchers and educators.  They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.

However, all too many reviews are deeply flawed.  Frequently, reviews of research make it impossible to check the validity of the findings of the original studies.  As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew.   Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies.  If the individual studies are of good quality, this is wonderfully useful.  But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles.  Few readers are willing to do this.

Recently, I have been looking at a lot of recent reviews, all of them published, often in top journals.  One published review only used pre-post gains.  Presumably, if the reviewers found a study with a control group, they would have ignored the control group data!  Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.

I have also recently seen reviews that include studies with and without control groups (i.e., pre-post gains), and those with and without pretests.  Without pretests, experimental and control groups may have started at very different points, and these differences just carry over to the posttests.  Accepting this jumble of experimental designs, a review makes no sense.  Treatments evaluated using pre-post designs will almost always look far more effective than those that use experimental-control comparisons.

Many published reviews include results from measures that were made up by program developers.  We have documented that analyses using such measures produce outcomes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies rather than longer ones, and from published studies rather than, for example, technical reports.

The biggest problem is that in many reviews, the designs of the individual studies are never described sufficiently to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out whether the review’s conclusions are based on rigorous research and how many are not. Many years ago, I looked into a widely cited review of research on achievement effects of class size.  Study details were lacking, so I had to find and read the original studies.   It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis!   The studies that reduced class size within the usual range (e.g., comparing reductions from 24 to 12) had very small achievement  impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.

It would be great if all reviews excluded studies that used procedures known to inflate effect sizes, but at bare minimum, reviewers should be routinely required to include tables showing critical details, and then analyzed to see if the reported outcomes might be due to studies that used procedures suspected to inflate effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Advertisements

Effect Sizes and Additional Months of Gain: Can’t We Just Agree That More is Better?

In the 1984 mockumentary This is Spinal Tap, there is a running joke about a hapless band, Spinal Tap, which proudly bills itself “Britain’s Loudest Band.”  A pesky reporter keeps asking the band’s leader, “But how can you prove that you are Britain’s loudest band?” The band leader explains, with declining patience, that while ordinary amplifiers’ sound controls only go up to 10, Spinal Tap’s go up to 11.  “But those numbers are arbitrary,” says the reporter.  “They don’t mean a thing!”  “Don’t you get it?” asks the band leader.  “ELEVEN is more than TEN!  Anyone can see that!”

In educational research, we have an ongoing debate reminiscent of Spinal Tap.  Educational researchers speaking to other researchers invariably express the impact of educational treatments as effect sizes (the difference in adjusted means for the experimental and control groups divided by the unadjusted standard deviation).  All else being equal, higher effect sizes are better than lower ones.

However, educators who are not trained in statistics often despise effect sizes.  “What do they mean?” they ask.  “Tell us how much difference the treatment makes in student learning!”

Researchers want to be understood, so they try to translate effect sizes into more educator-friendly equivalents.  The problem is that the friendlier the units, the more statistically problematic they are.  The friendliest of all is “additional months of learning.”  Researchers or educators can look on a chart and, for any particular effect size, they can find the number of “additional months of learning.”  The Education Endowment Foundation in England, which funds and reports on rigorous experiments, reports both effect sizes and additional months of learning, and provides tables to help people make the conversion.  But here’s the rub.  A recent article by Baird & Pane (2019) compared additional months of learning to three other translations of effect sizes.  Additional months of learning was rated highest in ease of use, but lowest in four other categories, such as transparency and consistency. For example, a month of learning clearly has a different meaning in kindergarten than it does in tenth grade.

The other translations rated higher by Baird and Pane were, at least to me, just as hard to understand as effect sizes.  For example, the What Works Clearinghouse presents, along with effect sizes, an “improvement index” that has the virtue of being equally incomprehensible to researchers and educators alike.

On one hand, arguing about outcome metrics is as silly as arguing the relative virtues of Fahrenheit and Celsius. If they can be directly transformed into the other unit, who cares?

However, additional months of learning is often used to cover up very low effect sizes. I recently ran into an example of this in a series of studies by the Stanford Center for Research on Education Outcomes (CREDO), in which disadvantaged urban African American students gained 59 more “days of learning” than matched students not in charters in math, and 44 more days in reading. These numbers were cited in an editorial praising charter schools in the May 29 Washington Post.

However, these “days of learning” are misleading. The effect size for this same comparison was only +0.08 for math, and +0.06 for reading. Any researcher will tell you that these are very small effects. They were only made to look big by reporting the gains in days. These not only magnify the apparent differences, but they also make them unstable. Would it interest you to know that White students in urban charter schools performed 36 days a year worse than matched students in math (ES= -0.05) and 14 days worse in reading (ES= -0.02)? How about Native American students in urban charter schools, whose scores were 70 days worse than matched students in non-charters in math (ES= -0.10), and equal in reading. I wrote about charter school studies in a recent blog. In the blog, I did not argue that charter schools are effective for disadvantaged African Americans but harmful for Whites and Native Americans. That seems unlikely. What I did argue is that the effects of charter schools are so small that the directions of the effects are unstable. The overall effects across all urban schools studied were only 40 days (ES=+0.055) in math and 28 days (ES=+0.04) in reading. These effects look big because of the “days of learning” transformation, but they are not.

blog_6-13-19_volume_500x375In This is Spinal Tap, the argument about whether or not Spinal Tap is Britain’s loudest band is absurd.  Any band can turn its amplifiers to the top and blow out everyone’s eardrums, whether the top is marked eleven or ten.  In education, however, it does matter a great deal that educators are taking evidence into account in their decisions about educational programs. Using effect sizes, perhaps supplemented by additional months of learning, is one way to help readers understand outcomes of educational experiments. Using “days of learning,” however, is misleading, making very small impacts look important. Why not additional hours or minutes of learning, while we’re at it? Spinal Tap would be proud.

References

Baird, M., & Paine, J. (2019). Translating standardized effects of education programs into more interpretable metrics. Educational Researcher. Advance online publication. doi.org/10.3102/0013189X19848729

CREDO (2015). Overview of the Urban Charter School Study. Stanford, CA: Author.

Washington Post: Denying poor children a chance. [Editorial]. (May 29, 2019). The Washington Post, A16.

 

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Measuring Social Emotional Skills in Schools: Return of the MOOSES

Throughout the U. S., there is huge interest in improving students’ social emotional skills and related behaviors. This is indeed important as a means of building tomorrow’s society. However, measuring SEL skills is terribly difficult. Not that measuring reading, math, or science learning is easy, but there are at least accepted measures in those areas. In SEL, almost anything goes, and measures cover an enormous range. Some measures might be fine for theoretical research and some would be all right if they were given independently of the teachers who administered the treatment, but SEL measures are inherently squishy.

A few months ago, I wrote a blog on measurement of social emotional skills. In it, I argued that social emotional skills should be measured in pragmatic school research as objectively as possible, especially to avoid measures that merely reflect having students in experimental groups repeating back attitudes or terminology they learned in the program. I expressed the ideal for social emotional measurement in school experiments as MOOSES: Measurable, Observable, Objective, Social Emotional Skills.

Since that time, our group at Johns Hopkins University has received a generous grant from the Gates Foundation to add research on social emotional skills and attendance to our Evidence for ESSA website. This has enabled our group to dig a lot deeper into measures for social emotional learning. In particular, JHU graduate student Sooyeon Byun created a typology of SEL measures arrayed from least to most MOOSE-like. This is as follows.

  1. Cognitive Skills or Low-Level SEL Skills.

Examples include executive functioning tasks such as pencil tapping, the Stroop test, and other measures of cognitive regulation, as well as recognition of emotions. These skills may be of importance as part of theories of action leading to social emotional skills of importance to schools, but they are not goals of obvious importance to educators in themselves.

  1. Attitudes toward SEL (non-behavioral).

These include agreement with statements such as “bullying is wrong,” and statements about why other students engage in certain behaviors (e.g., “He spilled the milk because he was mean.”).

  1. Intention for SEL behaviors (quasi-behavioral).

Scenario-based measures (e.g., what would you do in this situation?).

  1. SEL behaviors based on self-report (semi-behavioral).

Reports of actual behaviors of self, or observations of others, often with frequencies (e.g., “How often have you seen bullying in this school during this school year?”) or “How often do you feel anxious or afraid in class in this school?”)

This category was divided according to who is reporting:

4a. Interested party (e.g., report by teachers or parents who implemented the program and may have reason to want to give a positive report)

4b. Disinterested party (e.g., report by students or by teachers or parents who did not administer the treatment)

  1. MOOSES (Measurable, Observable, Objective Social Emotional Skills)
  • Behaviors observed by independent observers, either researchers, ideally unaware of treatment assignment, or by school officials reporting on behaviors as they always would, not as part of a study (e.g., regular reports of office referrals for various infractions, suspensions, or expulsions).
  • Standardized tests
  • Other school records

blog_2-21-19_twomoose_500x333

Uses for MOOSES

All other things being equal, school researchers and educators should want to know about measures as high as possible on the MOOSES scale. However, all things are never equal, and in practice, some measures lower on the MOOSES scale may be all that exists or ever could exist. For example, it is unlikely that school officials or independent observers could determine students’ anxiety or fear, so self-report (level 4b) may be essential. MOOSES measures (level 5) may be objectively reported by school officials, but limiting attention to such measures may limit SEL measurement to readily observable behaviors, such as aggression, truancy, and other behaviors of importance to school management, and not on difficult-to-observe behaviors such as bullying.

Still, we expect to find in our ongoing review of the SEL literature that there will be enough research on outcomes measured at level 3 or above to enable us to downplay levels 1 and 2 for school audiences, and in many cases to downplay reports by interested parties in level 4a, where teachers or parents who implement a program then rate the behavior of the children they served.

Social emotional learning is important, and we need measures that reflect their importance, minimizing potential bias and staying as close as possible to independent, meaningful measures of behaviors that are of the greatest importance to educators. In our research team, we have very productive arguments about these measurement issues in the course of reviewing individual articles. I placed a cardboard cutout of a “principal” called “Norm” in our conference room. Whenever things get too theoretical, we consult “Norm” for his advice. For example, “Norm” is not too interested in pencil tapping and Stroop tests, but he sure cares a lot about bullying, aggression, and truancy. Of course, as part of our review we will be discussing our issues and initial decisions with real principals and educators, as well as other experts on SEL.

The growing number of studies of SEL in recent years enables reviewers to set higher standards than would have been feasible even just a few years ago. We still have to maintain a balance in which we can be as rigorous as possible but not end up with too few studies to review.  We can all aspire to be MOOSES, but that is not practical for some measures. Instead, it is useful to have a model of the ideal and what approaches the ideal, so we can make sense of the studies that exist today, with all due recognition of when we are accepting measures that are nearly MOOSES but not quite the real Bullwinkle

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

How Computers Can Help Do Bad Research

“To err is human. But it takes a computer to really (mess) things up.” – Anonymous

Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.

Actually, there are many problems, all sharing the possibilities for serious bias created when computers are used to collect “big data” on computer-based instruction (note that I am not accusing computers of being biased in favor of their electronic pals!  The problem is that “big data” often contains “big bias.” Computers do not have biases. They do what their operators ask them to do.) (So far).

Here is one common problem.  Evaluators of computer-based instruction almost always have available massive amounts of data indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.

Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.

This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?

The answer is that students who did use the computers an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features about those classes or schools that pretests do not capture.

Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or some students may be taking ‘credit recovery’ classes online to replace classes they have failed.

Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to schools without access to computers can be just as bad, if only the computer-using students in the computer-using schools are included.  To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot be compared to the whole non-computer (control) schools. The reason is that in the control schools, we have to assume that given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.

Another evaluation design particularly susceptible to bias is studies in which, say, schools using any program are matched (based on pretests, demographics, and so on) with other schools that did use the program after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will “cherry-pick” their favorite experimental schools and “crabapple-pick” control schools known to have done poorly.

blog_12-13-18_evilcomputer_500x403

Solutions to Problems in Evaluating Computer-based Programs.

Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.

Randomly Assigning Schools.

The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some numbers of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group), and half to a business-as-usual control group. Measure achievement at pre- and post-test, and analyze using HLM or some other multi-level method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance.  Randomized experiments produce “intent to treat” (ITT) estimates of program impacts that include all students whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes.  Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.

High-Quality Matched Studies.

It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.

In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.

In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.

All students in the selected experimental and control schools in the grades receiving the treatment would be included in the study, producing an ITT estimate. There is not much interest in this design in “big data” on how much individual students used the program, but such data would produce a  “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound of program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).

This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.

That’s All?

Yup.  That’s all.  These designs do not make use of the “big data “cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turn-around times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are considered bias, of interest in after-the-fact TOT analyses but not as headline ITT outcomes.

If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.

John Hattie is Wrong

John Hattie is a professor at the University of Melbourne, Australia. He is famous for a book, Visible Learning, which claims to review every area of research that relates to teaching and learning. He uses a method called “meta-meta-analysis,” averaging effect sizes from many meta-analyses. The book ranks factors from one to 138 in terms of their effect sizes on achievement measures. Hattie is a great speaker, and many educators love the clarity and simplicity of his approach. How wonderful to have every known variable reviewed and ranked!

However, operating on the principle that anything that looks to be too good to be true probably is, I looked into Visible Learning to try to understand why it reports such large effect sizes. My colleague, Marta Pellegrini from the University of Florence (Italy), helped me track down the evidence behind Hattie’s claims. And sure enough, Hattie is profoundly wrong. He is merely shoveling meta-analyses containing massive bias into meta-meta-analyses that reflect the same biases.

blog_6-21-18_salvagepaper_476x500

Part of Hattie’s appeal to educators is that his conclusions are so easy to understand. He even uses a system of dials with color-coded “zones,” where effect sizes of 0.00 to +0.15 are designated “developmental effects,” +0.15 to +0.40 “teacher effects” (i.e., what teachers can do without any special practices or programs), and +0.40 to +1.20 the “zone of desired effects.” Hattie makes a big deal of the magical effect size +0.40, the “hinge point,” recommending that educators essentially ignore factors or programs below that point, because they are no better than what teachers produce each year, from fall to spring, on their own. In Hattie’s view, an effect size of from +0.15 to +0.40 is just the effect that “any teacher” could produce, in comparison to students not being in school at all. He says, “When teachers claim that they are having a positive effect on achievement or when a policy improves achievement, this is almost always a trivial claim: Virtually everything works. One only needs a pulse and we can improve achievement.” (Hattie, 2009, p. 16). An effect size of 0.00 to +0.15 is, he estimates, “what students could probably achieve if there were no schooling” (Hattie, 2009, p. 20). Yet this characterization of dials and zones misses the essential meaning of effect sizes, which are rarely used to measure the amount teachers’ students gain from fall to spring, but rather the amount students receiving a given treatment gained in comparison to gains made by similar students in a control group over the same period. So an effect size of, say, +0.15 or +0.25 could be very important.

Hattie’s core claims are these:

  • Almost everything works
  • Any effect size less than +0.40 is ignorable
  • It is possible to meaningfully rank educational factors in comparison to each other by averaging the findings of meta-analyses.

These claims appear appealing, simple, and understandable. But they are also wrong.

The essential problem with Hattie’s meta-meta-analyses is that they accept the results of the underlying meta-analyses without question. Yet many, perhaps most meta-analyses accept all sorts of individual studies of widely varying standards of quality. In Visible Learning, Hattie considers and then discards the possibility that there is anything wrong with individual meta-analyses, specifically rejecting the idea that the methods used in individual studies can greatly bias the findings.

To be fair, a great deal has been learned about the degree to which particular study characteristics bias study findings, always in a positive (i.e., inflated) direction. For example, there is now overwhelming evidence that effect sizes are significantly inflated in studies with small sample sizes, brief durations, use measures made by researchers or developers, are published (vs. unpublished), or use quasi-experiments (vs. randomized experiments) (Cheung & Slavin, 2016). Many meta-analyses even include pre-post studies, or studies that do not have pretests, or have pretest differences but fail to control for them. For example, I once criticized a meta-analysis of gifted education in which some studies compared students accepted into gifted programs to students rejected for those programs, controlling for nothing!

A huge problem with meta-meta-analysis is that until recently, meta-analysts rarely screened individual studies to remove those with fatal methodological flaws. Hattie himself rejects this procedure: “There is…no reason to throw out studies automatically because of lower quality” (Hattie, 2009, p. 11).

In order to understand what is going on in the underlying meta-analyses in a meta-meta-analysis, is it crucial to look all the way down to the individual studies. As a point of illustration, I examined Hattie’s own meta-meta-analysis of feedback, his third ranked factor, with a mean effect size of +0.79. Hattie & Timperly (2007) located 12 meta-analyses. I found some of the ones with the highest mean effect sizes.

At a mean of +1.24, the meta-analysis with the largest effect size in the Hattie & Timperley (2007) review was a review of research on various reinforcement treatments for students in special education by Skiba, Casey, & Center (1985-86). The reviewers required use of single-subject designs, so the review consisted of a total of 35 students treated one at a time, across 25 studies. Yet it is known that single-subject designs produce much larger effect sizes than ordinary group designs (see What Works Clearinghouse, 2017).

The second-highest effect size, +1.13, was from a meta-analysis by Lysakowski & Walberg (1982), on instructional cues, participation, and corrective feedback. Not enough information is provided to understand the individual studies, but there is one interesting note. A study using a single-subject design, involving two students, had an effect size of 11.81. That is the equivalent of raising a child’s IQ from 100 to 277! It was “winsorized” to the next-highest value of 4.99 (which is like adding 75 IQ points). Many of the studies were correlational, with no controls for inputs, or had no control group, or were pre-post designs.

A meta-analysis by Rummel and Feinberg (1988), with a reported effect size of +0.60, is perhaps the most humorous inclusion in the Hattie & Timperley (2007) meta-meta-analysis. It consists entirely of brief lab studies of the degree to which being paid or otherwise reinforced for engaging in an activity that was already intrinsically motivating would reduce subjects’ later participation in that activity. Rummel & Feinberg (1988) reported a positive effect size if subjects later did less of the activity they were paid to do. The reviewers decided to code studies positively if their findings corresponded to the theory (i.e., that feedback and reinforcement reduce later participation in previously favored activities), but in fact their “positive” effect size of +0.60 indicates a negative effect of feedback on performance.

I could go on (and on), but I think you get the point. Hattie’s meta-meta-analyses grab big numbers from meta-analyses of all kinds with little regard to the meaning or quality of the original studies, or of the meta-analyses.

If you are familiar with the What Works Clearinghouse (2007), or our own Best-Evidence Syntheses (www.bestevidence.org) or Evidence for ESSA (www.evidenceforessa.org), you will know that individual studies, except for studies of one-to-one tutoring, almost never have effect sizes as large as +0.40, Hattie’s “hinge point.” This is because WWC, BEE, and Evidence for ESSA all very carefully screen individual studies. We require control groups, controls for pretests, minimum sample sizes and durations, and measures independent of the treatments. Hattie applies no such standards, and in fact proclaims that they are not necessary.

It is possible, in fact essential, to make genuine progress using high-quality rigorous research to inform educational decisions. But first we must agree on what standards to apply.  Modest effect sizes from studies of practical treatments in real classrooms over meaningful periods of time on measures independent of the treatments tell us how much a replicable treatment will actually improve student achievement, in comparison to what would have been achieved otherwise. I would much rather use a program with an effect size of +0.15 from such studies than to use programs or practices found in studies with major flaws to have effect sizes of +0.79. If they understand the situation, I’m sure all educators would agree with me.

To create information that is fair and meaningful, meta-analysts cannot include studies of unknown and mostly low quality. Instead, they need to apply consistent standards of quality for each study, to look carefully at each one and judge its freedom from bias and major methodological flaws, as well as its relevance to practice. A meta-analysis cannot be any better than the studies that go into it. Hattie’s claims are deeply misleading because they are based on meta-analyses that themselves accepted studies of all levels of quality.

Evidence matters in education, now more than ever. Yet Hattie and others who uncritically accept all studies, good and bad, are undermining the value of evidence. This needs to stop if we are to make solid progress in educational practice and policy.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Hattie, J. (2009). Visible learning. New York, NY: Routledge.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 (1), 81-112.

Lysakowski, R., & Walberg, H. (1982). Instructional effects of cues, participation, and corrective feedback: A quantitative synthesis. American Educational Research Journal, 19 (4), 559-578.

Rummel, A., & Feinberg, R. (1988). Cognitive evaluation theory: A review of the literature. Social Behavior and Personality, 16 (2), 147-164.

Skiba, R., Casey, A., & Center, B. (1985-86). Nonaversive procedures I the treatment of classroom behavior problems. The Journal of Special Education, 19 (4), 459-481.

What Works Clearinghouse (2017). Procedures handbook 4.0. Washington, DC: Author.

Photo credit: U.S. Farm Security Administration [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

The Good, the Bad, and the (Un)Promising

The ESSA evidence standards are finally beginning to matter. States are starting the process that will lead them to make school improvement awards to their lowest-achieving schools. The ESSA law is clear that for schools to qualify for these awards, they must agree to implement programs that meet the strong, moderate, or promising levels of the ESSA evidence standards. This is very exciting for those who believe in the power of proven programs to transform schools and benefit children. It is good news for kids, for teachers, and for our profession.

But inevitably, there is bad news with the good. If evidence is to be a standard for government funding, there are bound to be people who disseminate programs lacking high-quality evidence who will seek to bend the definitions to declare themselves “proven.” And there are also bound to be schools and districts that want to keep using what they have always used, or to keep choosing programs based on factors other than evidence, while doing the minimum the law requires.

The battleground is the ESSA “promising” criterion. “Strong” programs are pretty well defined as having significant positive evidence from high-quality randomized studies. “Moderate” programs are pretty well defined as having significant positive evidence from high-quality matched studies. Both “strong” and “moderate” are clearly defined in Evidence for ESSA (www.evidenceforessa.org), and, with a bit of translation, by the What Works Clearinghouse, both of which list specific programs that meet or do not meet these standards.

“Promising,” on the other hand is kind  of . . . squishy. The ESSA evidence standards do define programs meeting “promising” as ones that have statistically significant effects in “well-designed and well-implemented” correlational studies, with controls for inputs (e.g., pretests).  This sounds good, but it is hard to nail down in practice. I’m seeing and hearing about a category of studies that perfectly illustrate the problem. Imagine that a developer commissions a study of a form of software. A set of schools and their 1000 students are assigned to use the software, while control schools and their 1000 students do not have access to the software but continue with business as usual.

Computers routinely produce “trace data” that automatically tells researchers all sorts of things about how much students used the software, what they did with it, how successful they were, and so on.

The problem is that typically, large numbers of students given software do not use it. They may never even hit a key, or they may use the software so little that the researchers rule the software use to be effectively zero. So in a not unusual situation, let’s assume that in the treatment group, the one that got the software, only 500 of the 1000 students actually used the software at an adequate level.

Now here’s the rub. Almost always, the 500 students will out-perform the 1000 controls, even after controlling for pretests. Yet this would be likely to happen even if the software were completely ineffective.

To understand this, think about the 500 students who did use the software and the 500 who did not. The users are probably more conscientious, hard-working, and well-organized. The 500 non-users are more likely to be absent a lot, to fool around in class, to use their technology to play computer games, or go on (non-school-related) social media, rather than to do math or science for example. Even if the pretest scores in the user and non-user groups were identical, they are not identical students, because their behavior with the software is not equal.

I once visited a secondary school in England that was a specially-funded model for universal use of technology. Along with colleagues, I went into several classes. The teachers were teaching their hearts out, making constant use of the technology that all students had on their desks. The students were well-behaved, but just a few dominated the discussion. Maybe the others were just a bit shy, we thought. From the front of each class, this looked like the classroom of the future.

But then, we filed to the back of each class, where we could see over students’ shoulders. And we immediately saw what was going on. Maybe 60 or 70 percent of the students were actually on social media unrelated to the content, paying no attention to the teacher or instructional software!

blog_5-24-18_DistStudents_500x332

Now imagine that a study compared the 30-40% of students who were actually using the computers to students with similar pretests in other schools who had no computers at all. Again, the users would look terrific, but this is not a fair comparison, because all the goof-offs and laggards in the computer school had selected themselves out of the study while goof-offs and laggards in the control group were still included.

Rigorous researchers use a method called intent-to-treat, which in this case would include every student, whether or not they used the software or played non-educational computer games. “Not fair!” responds the software developer, because intent-to-treat includes a lot of students who never touched a key except to use social media. No sophisticated researcher accepts such an argument, however, because including only users gives the experimental group a big advantage.

Here’s what is happening at the policy level. Software developers are using data from studies that only include the students who made adequate use of the software. They are then claiming that such studies are correlational and meet the “promising” standard of ESSA.

Those who make this argument are correct in saying that such studies are correlational. But these studies are very, very, very bad, because they are biased toward the treatment. The ESSA standards specify well-designed and well-implemented studies, and these studies may be correlational, but they are not well-designed or well-implemented. Software developers and other vendors are very concerned about the ESSA evidence standards, and some may use the “promising” category as a loophole. Evidence for ESSA does not accept such studies, even as promising, and the What Works Clearinghouse does not even have any category that corresponds to “promising.” Yet vendors are flooding state departments of education and districts with studies they claim to meet the ESSA standards, though in the lowest category.

Recently, I heard something that could be a solution to this problem. Apparently, some states are announcing that for school improvement grants, and any other purpose that has financial consequences, they will only accept programs with “strong” and “moderate” evidence. They have the right to do this; the federal law says school improvement grants must support programs that at least meet the “promising” standard, but it does not say states cannot set a higher minimum standard.

One might argue that ignoring “promising” studies is going too far. In Evidence for ESSA (www.evidenceforessa.org), we accept studies as “promising” if they have weaknesses that do not lead to bias, such as clustered studies that were significant at the student but not the cluster level. But the danger posed by studies claiming to fit “promising” using biased designs is too great. Until the feds fix the definition of “promising” to exclude bias, the states may have to solve it for themselves.

I hope there will be further development of the “promising” standard to focus it on lower-quality but unbiased evidence, but as things are now, perhaps it is best for states themselves to declare that “promising” is no longer promising.

Eventually, evidence will prevail in education, as it has in many other fields, but on the way to that glorious future, we are going to have to make some adjustments. Requiring that “promising” be truly promising would be a good place to begin.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

What Kinds of Studies Are Likely to Replicate?

Replicated scientists 03 01 18

In the hard sciences, there is a publication called the Journal of Irreproducible Results.  It really has nothing to do with replication of experiments, but is a humor journal by and for scientists.  The reason I bring it up is that to chemists and biologists and astronomers and physicists, for example, an inability to replicate an experiment is a sure indication that the original experiment was wrong.  To the scientific mind, a Journal of Irreproducible Results is inherently funny, because it is a journal of nonsense.

Replication, the ability to repeat an experiment and get a similar result, is the hallmark of a mature science.  Sad to say, replication is rare in educational research, which says a lot about our immaturity as a science.  For example, in the What Works Clearinghouse, about half of programs across all topics are represented by a single evaluation.  When there are two or more, the results are often very different.  Relatively recent funding initiatives, especially studies supported by Investing in Innovation (i3) and the Institute for Education Sciences (IES), and targeted initiatives such as Striving Readers (secondary reading) and the Preschool Curriculum Evaluation Research (PCER), have added a great deal in this regard. They have funded many large-scale, randomized, very high-quality studies of all sorts of programs in the first place, and many of these are replications themselves, or they provide a good basis for replications later.  As my colleagues and I have done many reviews of research in every area of education, pre-kindergarten to grade 12 (see www.bestevidence.org), we have gained a good intuition about what kinds of studies are likely to replicate and what kinds are less likely.

First, let me define in more detail what I mean by “replication.”  There is no value in replicating biased studies, which may well consistently find the same biased results (as when, for example, both the original studies and the replication studies used the same researcher- or developer-made outcome measures that are slanted toward the content the experimental group experienced but not what the control group experienced) (See http://www.tandfonline.com/doi/abs/10.1080/19345747.2011.558986.)

Instead, I’d consider a successful replication one that shows positive outcomes both in the original studies and in at least one large-scale, rigorous replication. One obvious way to increase the chances that a program producing a positive outcome in one or more initial studies will succeed in such a rigorous replication evaluation is to use a similar, equally rigorous evaluation design in the first place. I think a lot of treatments that fail to replicate are ones that used weak methods in the original studies. In particular, small studies tend to produce greatly inflated effect sizes (see http://www.bestevidence.org/methods/methods.html), which are unlikely to replicate in larger evaluations.

Another factor likely to contribute to replicability is use in the earlier studies of methods or conditions that can be repeated in later studies, or in schools in general. For example, providing teachers with specific manuals, videos demonstrating the methods, and specific student materials all add to the chances that a successful program can be successfully replicated. Avoiding unusual pilot sites (such as schools known to have outstanding principals or staff) may contribute to replication, as these conditions are unlikely to be found in larger-scale studies. Having experimenters or their colleagues or graduate students extensively involved in the early studies diminishes replicability, of course, because those conditions will not exist in replications.

Replications are entirely possible. I wish there were a lot more of them in our field. Showing that programs can be effective in just two rigorous evaluations is way more convincing than just one. As evidence becomes more and more important, I hope and expect that replications, perhaps carried out by states or districts, will become more common.

The Journal of Irreproducible Results is fun, but it isn’t science. I’d love to see a Journal of Replications in Education to tell us what really works for kids.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.