The Curious Case of the Missing Programs

“Let me tell you, my dear Watson, about one of my most curious and vexing cases,” said Holmes. “I call it, ‘The Case of the Missing Programs’. A school superintendent from America sent me a letter.  It appears that whenever she looks in the What Works Clearinghouse to find a program her district wants to use, nine times out of ten there is nothing there!”

Watson was astonished. “But surely there has to be something. Perhaps the missing programs did not meet WWC standards, or did not have positive effects!”

“Not meeting standards or having disappointing outcomes would be something,” responded Holmes, “but the WWC often says nothing at all about a program. Users are apparently confused. They don’t know what to conclude.”

“The missing programs must make the whole WWC less useful and reliable,” mused Watson.

“Just so, my friend,” said Holmes, “and so we must take a trip to America to get to the bottom of this!”


While Holmes and Watson are arranging steamship transportation to America, let me fill you in on this very curious case.

In the course of our work on Evidence for ESSA (www.evidenceforessa.org), we are occasionally asked by school district leaders why there is nothing in our website about a given program, text, or software. Whenever this happens, our staff immediately checks to see if there is any evidence we’ve missed. If we are pretty sure that there are no studies of the missing program that meet our standards, we add the program to our website, with a brief indication that there are no qualifying studies. If any studies do meet our standards, we review them as soon as possible and add them as meeting or not meeting ESSA standards.

Sometimes, districts or states send us their entire list of approved texts and software, and we check them all to see that all are included.

From having done this for more than a year, we now have an entry on most of the reading and math programs any district would come up with, though we keep getting more all the time.

All of this seems to us to be obviously essential. If users of Evidence for ESSA look up their favorite programs, or ones they are thinking of adopting, and find that there is no entry, they begin losing confidence in the whole enterprise. They cannot know whether the program they seek was ignored or missed for some reason, or has no evidence of effectiveness, or perhaps has been proven effective but has not been reviewed.

Recently, a large district sent me their list of 98 approved and supplementary texts, software, and other programs in reading and math. They had marked each according to the ratings given by the What Works Clearinghouse and Evidence for ESSA. At the time (a few weeks ago), Evidence for ESSA had listings for 67% of the programs. Today, of course, it has 100%, because we immediately set to work researching and adding in all the programs we’d missed.

What I found astonishing, however, is how few of the district’s programs were mentioned at all in the What Works Clearinghouse. Only 15% of the reading and math programs were in the WWC.

I’ve written previously about how far behind the WWC is in reviewing programs. But the problem with the district list was not just a question of slowness. Many of the programs the WWC missed have been around for some time.

I’m not sure how the WWC decides what to review, but they do not seem to be trying for completeness. I think this is counterproductive. Users of the WWC should expect to be able to find out about programs that meet standards for positive outcomes, those that have an evidence base that meets evidence standards but do not have positive outcomes, those that have evidence not meeting standards, and those that have no evidence at all. Yet it seems clear that the largest category in the WWC is “none of the above.” Most programs a user would be interested in do not appear at all in the WWC. Most often, a lack of a listing means a lack of evidence, but this is not always the case, especially when evidence is recent. One way or another, finding big gaps in any compendium undermines faith in the whole effort. It’s difficult to expect educational leaders to get into the habit of looking for evidence if most of the programs they consider are not listed.

Imagine, for example, that a telephone book was missing a significant fraction of the people who live in a given city. Users would be frustrated about not being able to find their friends, and the gaps would soon undermine confidence in the whole phone book.

****

When Holmes and Watson arrived in the U.S., they spoke with many educators who’d tried to find programs in the WWC, and they heard tales of frustration and impatience. Many former users said they no longer bothered to consult the WWC and had lost faith in evidence in their field. Fortunately, Holmes and Watson got a meeting with U.S. Department of Education officials, who immediately understood the problem and set to work to find the evidence base (or lack of evidence) for every reading and math program in America. Usage of the WWC soared, and support for evidence-based reform in education increased.

Of course, this outcome is fictional. But it need not remain fictional. The problem is real, and the solution is simple. Or as Holmes would say, “Elementary and secondary, my dear Watson!”

Photo credit: By Rumensz [CC0], from Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


What’s the Evidence that Evidence Works?

I recently gave a couple of speeches on evidence-based reform in education in Barcelona.  In preparing for them, one of the organizers asked me an interesting question: “What is your evidence that evidence works?”

At one level, this is a trivial question. If schools select proven programs and practices aligned with their needs and implement them with fidelity and intelligence, with levels of resources similar to those used in the original successful research, then of course they’ll work, right? And if a school district adopts proven programs, encourages and funds them, and monitors their implementation and outcomes, then of course the appropriate use of all these programs is sure to enhance achievement district-wide, right?

Although logic suggests that a policy of encouraging and funding proven programs is sure to increase achievement on a broad scale, I like to be held to a higher standard: Evidence. And, as it happens, I have some evidence on this very topic. This evidence came from a large-scale evaluation of an ambitious, national effort to increase use of proven and promising schoolwide programs in elementary and middle schools, in a research center funded by the Institute of Education Sciences (IES) called the Center for Data-Driven Reform in Education, or CDDRE (see Slavin, Cheung, Holmes, Madden, & Chamberlain, 2013). The name of the program the experimental schools used was Raising the Bar.

How Raising the Bar Raised the Bar

The idea behind Raising the Bar was to help schools analyze their own needs and strengths, and then select whole-school reform models likely to help them meet their achievement goals. CDDRE consultants provided about 30 days of on-site professional development to each district over a 2-year period. The PD focused on review of data, effective use of benchmark assessments, school walk-throughs by district leaders to see the degree to which schools were already using the programs they claimed to be using, and then exposing district and school leaders to information and data on schoolwide programs available to them, from several providers. If districts selected a program to implement, their district and school received PD on ensuring effective implementation and principals and teachers received PD on the programs they chose.


Evaluating Raising the Bar

In the study of Raising the Bar we recruited a total of 397 elementary and 225 middle schools in 59 districts in 7 states (AL, AZ, IN, MS, OH, TN). All schools were Title I schools in rural and mid-sized urban districts. Overall, 30% of students were African-American, 20% were Hispanic, and 47% were White. Across three cohorts, starting in 2005, 2006, or 2007, schools were randomly assigned to either use Raising the Bar, or to continue with what they were doing. The study ended in 2009, so schools could have been in the Raising the Bar group for two, three, or four years.

Did We Raise the Bar?

State test scores were obtained from all schools and transformed to z-scores so they could be combined across states. The analyses focused on grades 5 and 8, as these were the only grades tested in some states at the time. Hierarchical linear modeling, with schools nested within districts, was used for analysis.
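As a rough illustration of why z-scores allow results from different state tests to be combined, here is a minimal Python sketch. The scores and state labels are hypothetical, and the use of the population standard deviation within each state is an assumption; the text does not say exactly how the study standardized scores.

```python
from statistics import mean, pstdev

def to_z_scores(scores):
    """Rescale scores to mean 0 and standard deviation 1."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population SD within each state
    return [(s - mu) / sigma for s in scores]

# Hypothetical school-level scores on two state tests with different scales.
state_scores = {
    "State A": [310.0, 325.0, 340.0, 355.0],
    "State B": [48.0, 52.0, 56.0, 60.0],
}

# Standardize within each state, then pool the now-comparable values.
z_by_state = {state: to_z_scores(scores) for state, scores in state_scores.items()}
pooled = [z for zs in z_by_state.values() for z in zs]
```

After standardization, a school one standard deviation above its own state's mean has the same value regardless of which test it took, which is what makes pooling across states meaningful.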

For reading in fifth grade, outcomes were very good. Effects were significant by Year 3, with individual-level effect sizes of +0.10 in Year 3 and +0.19 in Year 4. In middle school reading, the effect size reached +0.10 by Year 4.

Effects were also very good in fifth grade math, with significant effects of +0.10 in Year 3 and +0.13 in Year 4. Effect sizes in middle school math were also significant in Year 4 (ES=+0.12).

Note that these effects are for all schools, whether they adopted a program or not. Non-experimental analyses found that by Year 4, elementary schools that had chosen and implemented a reading program (33% of schools by Year 3, 42% by Year 4) scored better than matched controls in reading. Schools that chose any reading program usually chose our Success for All reading program, but some chose other models. Even in schools that did not adopt reading or math programs, scores were always higher on average (though not always significantly higher) than for schools that did not choose programs.

How Much Did We Raise the Bar?

The CDDRE project was exceptional because of its size and scope. The 622 schools, in 59 districts in 7 states, were collectively equivalent to a medium-sized state. So if anyone asks what evidence-based reform could do to help an entire state, this study provides one estimate. The student-level outcome in elementary reading, an effect size of +0.19, applied to NAEP scores, would be enough to move 43 states to the scores now only attained by the top 10. If applied successfully to schools serving mostly African American and Hispanic students or to students receiving free- or reduced-price lunches regardless of ethnicity, it would reduce the achievement gap between these and White or middle-class students by about 38%. All in four years, at very modest cost.

Actually, implementing something like Raising the Bar could be done much more easily and effectively today than it could in 2005-2009. First, there are a lot more proven programs to choose from than there were then. Second, the U.S. Congress, in the Every Student Succeeds Act (ESSA), now has definitions of strong, moderate, and promising levels of evidence, and restricts school improvement grants to schools that choose such programs. The reason only 42% of Raising the Bar schools selected a program is that they had to pay for it, and many could not afford to do so. Today, there are resources to help with this.

The evidence is both logical and clear: Evidence works.

Reference

Slavin, R. E., Cheung, A., Holmes, G., Madden, N. A., & Chamberlain, A. (2013). Effects of a data-driven district reform model on state assessment outcomes. American Educational Research Journal, 50 (2), 371-396.

Photo by Sebastian Mary/Gio JL [CC BY-SA 2.0  (https://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons


John Hattie is Wrong

John Hattie is a professor at the University of Melbourne, Australia. He is famous for a book, Visible Learning, which claims to review every area of research that relates to teaching and learning. He uses a method called “meta-meta-analysis,” averaging effect sizes from many meta-analyses. The book ranks factors from one to 138 in terms of their effect sizes on achievement measures. Hattie is a great speaker, and many educators love the clarity and simplicity of his approach. How wonderful to have every known variable reviewed and ranked!

However, operating on the principle that anything that looks to be too good to be true probably is, I looked into Visible Learning to try to understand why it reports such large effect sizes. My colleague, Marta Pellegrini from the University of Florence (Italy), helped me track down the evidence behind Hattie’s claims. And sure enough, Hattie is profoundly wrong. He is merely shoveling meta-analyses containing massive bias into meta-meta-analyses that reflect the same biases.


Part of Hattie’s appeal to educators is that his conclusions are so easy to understand. He even uses a system of dials with color-coded “zones,” where effect sizes of 0.00 to +0.15 are designated “developmental effects,” +0.15 to +0.40 “teacher effects” (i.e., what teachers can do without any special practices or programs), and +0.40 to +1.20 the “zone of desired effects.” Hattie makes a big deal of the magical effect size +0.40, the “hinge point,” recommending that educators essentially ignore factors or programs below that point, because they are no better than what teachers produce each year, from fall to spring, on their own. In Hattie’s view, an effect size from +0.15 to +0.40 is just the effect that “any teacher” could produce, in comparison to students not being in school at all. He says, “When teachers claim that they are having a positive effect on achievement or when a policy improves achievement, this is almost always a trivial claim: Virtually everything works. One only needs a pulse and we can improve achievement.” (Hattie, 2009, p. 16). An effect size of 0.00 to +0.15 is, he estimates, “what students could probably achieve if there were no schooling” (Hattie, 2009, p. 20). Yet this characterization of dials and zones misses the essential meaning of effect sizes, which are rarely used to measure the amount teachers’ students gain from fall to spring, but rather the amount students receiving a given treatment gained in comparison to gains made by similar students in a control group over the same period. So an effect size of, say, +0.15 or +0.25 could be very important.
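To make the treatment-versus-control meaning of an effect size concrete, here is a minimal Python sketch. It uses one common formula, the difference in group means divided by the pooled standard deviation; that choice, the assumption that the two groups started out equivalent, and the scores themselves are illustrative assumptions, not details taken from any study discussed here.

```python
from math import sqrt
from statistics import mean, stdev

def standardized_mean_difference(treatment, control):
    """Difference in group means, in units of the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)

# Hypothetical posttest scores for two groups assumed equivalent at pretest.
control = [48.0, 50.0, 52.0, 54.0, 56.0]
treatment = [score + 1.0 for score in control]  # every student one point higher

effect_size = standardized_mean_difference(treatment, control)
```

The point of the sketch is that the control group sets the baseline: both groups have a teacher and a year of school, so the effect size reflects only what the treatment adds beyond that.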

Hattie’s core claims are these:

  • Almost everything works
  • Any effect size less than +0.40 is ignorable
  • It is possible to meaningfully rank educational factors in comparison to each other by averaging the findings of meta-analyses.

These claims appear appealing, simple, and understandable. But they are also wrong.

The essential problem with Hattie’s meta-meta-analyses is that they accept the results of the underlying meta-analyses without question. Yet many, perhaps most meta-analyses accept all sorts of individual studies of widely varying standards of quality. In Visible Learning, Hattie considers and then discards the possibility that there is anything wrong with individual meta-analyses, specifically rejecting the idea that the methods used in individual studies can greatly bias the findings.

To be fair, a great deal has been learned about the degree to which particular study characteristics bias study findings, always in a positive (i.e., inflated) direction. For example, there is now overwhelming evidence that effect sizes are significantly inflated in studies that have small sample sizes or brief durations, use measures made by researchers or developers, are published (vs. unpublished), or use quasi-experiments (vs. randomized experiments) (Cheung & Slavin, 2016). Many meta-analyses even include pre-post studies, or studies that do not have pretests, or have pretest differences but fail to control for them. For example, I once criticized a meta-analysis of gifted education in which some studies compared students accepted into gifted programs to students rejected for those programs, controlling for nothing!

A huge problem with meta-meta-analysis is that until recently, meta-analysts rarely screened individual studies to remove those with fatal methodological flaws. Hattie himself rejects this procedure: “There is…no reason to throw out studies automatically because of lower quality” (Hattie, 2009, p. 11).

In order to understand what is going on in the underlying meta-analyses in a meta-meta-analysis, it is crucial to look all the way down to the individual studies. As a point of illustration, I examined Hattie’s own meta-meta-analysis of feedback, his third ranked factor, with a mean effect size of +0.79. Hattie & Timperley (2007) located 12 meta-analyses. I found some of the ones with the highest mean effect sizes.

At a mean of +1.24, the meta-analysis with the largest effect size in the Hattie & Timperley (2007) review was a review of research on various reinforcement treatments for students in special education by Skiba, Casey, & Center (1985-86). The reviewers required use of single-subject designs, so the review consisted of a total of 35 students treated one at a time, across 25 studies. Yet it is known that single-subject designs produce much larger effect sizes than ordinary group designs (see What Works Clearinghouse, 2017).

The second-highest effect size, +1.13, was from a meta-analysis by Lysakowski & Walberg (1982), on instructional cues, participation, and corrective feedback. Not enough information is provided to understand the individual studies, but there is one interesting note. A study using a single-subject design, involving two students, had an effect size of 11.81. That is the equivalent of raising a child’s IQ from 100 to 277! It was “winsorized” to the next-highest value of 4.99 (which is like adding 75 IQ points). Many of the studies were correlational, with no controls for inputs, or had no control group, or were pre-post designs.
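The IQ comparisons above can be checked with a few lines of Python. This is a sketch that assumes the conventional IQ scale (mean 100, standard deviation 15), which the text does not state but which reproduces its figures. The list of effect sizes is hypothetical apart from the 11.81 and 4.99 values taken from the text.

```python
IQ_MEAN = 100.0
IQ_SD = 15.0  # assumption: conventional IQ standard deviation

def iq_equivalent(effect_size, baseline=IQ_MEAN, sd=IQ_SD):
    """Translate an effect size (in SD units) into an IQ-scale score."""
    return baseline + effect_size * sd

def winsorize_max_to_next(values):
    """Replace the largest value with the next-highest value."""
    next_highest = sorted(values)[-2]
    return [min(v, next_highest) for v in values]

# Only 4.99 and 11.81 come from the text; the other values are made up.
effect_sizes = [0.30, 0.85, 4.99, 11.81]

raw_iq = iq_equivalent(11.81)             # about 277
winsorized = winsorize_max_to_next(effect_sizes)
capped_gain = max(winsorized) * IQ_SD     # about 75 IQ points
```

Even after winsorizing, the capped value still implies an implausibly large gain, which is why trimming one extreme number does not repair a study design that produces such numbers.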

A meta-analysis by Rummel and Feinberg (1988), with a reported effect size of +0.60, is perhaps the most humorous inclusion in the Hattie & Timperley (2007) meta-meta-analysis. It consists entirely of brief lab studies of the degree to which being paid or otherwise reinforced for engaging in an activity that was already intrinsically motivating would reduce subjects’ later participation in that activity. Rummel & Feinberg (1988) reported a positive effect size if subjects later did less of the activity they were paid to do. The reviewers decided to code studies positively if their findings corresponded to the theory (i.e., that feedback and reinforcement reduce later participation in previously favored activities), but in fact their “positive” effect size of +0.60 indicates a negative effect of feedback on performance.
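A small Python sketch shows how much the coding direction of a single meta-analysis can matter when means are averaged. It uses only the three mean effect sizes quoted above, not the full set of 12 meta-analyses, so the averages are illustrative and are not Hattie & Timperley's reported figure. Recoding the Rummel & Feinberg value as -0.60 is an assumption that follows the reading given in the text.

```python
from statistics import mean

# Mean effect sizes as quoted in the text for three of the 12 meta-analyses.
reported = {
    "Skiba, Casey, & Center (1985-86)": 1.24,
    "Lysakowski & Walberg (1982)": 1.13,
    "Rummel & Feinberg (1988)": 0.60,
}

# Assumption: recode so that positive means better later performance.
recoded = dict(reported)
recoded["Rummel & Feinberg (1988)"] = -0.60

mean_reported = mean(list(reported.values()))
mean_recoded = mean(list(recoded.values()))
```

With these three values alone, the average falls from about 0.99 to about 0.59 once the one sign is reversed, even though no study result has changed.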

I could go on (and on), but I think you get the point. Hattie’s meta-meta-analyses grab big numbers from meta-analyses of all kinds with little regard to the meaning or quality of the original studies, or of the meta-analyses.

If you are familiar with the What Works Clearinghouse (2017), or our own Best-Evidence Syntheses (www.bestevidence.org) or Evidence for ESSA (www.evidenceforessa.org), you will know that individual studies, except for studies of one-to-one tutoring, almost never have effect sizes as large as +0.40, Hattie’s “hinge point.” This is because WWC, BEE, and Evidence for ESSA all very carefully screen individual studies. We require control groups, controls for pretests, minimum sample sizes and durations, and measures independent of the treatments. Hattie applies no such standards, and in fact proclaims that they are not necessary.

It is possible, in fact essential, to make genuine progress using high-quality rigorous research to inform educational decisions. But first we must agree on what standards to apply. Modest effect sizes from studies of practical treatments in real classrooms over meaningful periods of time on measures independent of the treatments tell us how much a replicable treatment will actually improve student achievement, in comparison to what would have been achieved otherwise. I would much rather use a program with an effect size of +0.15 from such studies than use programs or practices found in studies with major flaws to have effect sizes of +0.79. If they understand the situation, I’m sure all educators would agree with me.

To create information that is fair and meaningful, meta-analysts cannot include studies of unknown and mostly low quality. Instead, they need to apply consistent standards of quality for each study, to look carefully at each one and judge its freedom from bias and major methodological flaws, as well as its relevance to practice. A meta-analysis cannot be any better than the studies that go into it. Hattie’s claims are deeply misleading because they are based on meta-analyses that themselves accepted studies of all levels of quality.

Evidence matters in education, now more than ever. Yet Hattie and others who uncritically accept all studies, good and bad, are undermining the value of evidence. This needs to stop if we are to make solid progress in educational practice and policy.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Hattie, J. (2009). Visible learning. New York, NY: Routledge.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 (1), 81-112.

Lysakowski, R., & Walberg, H. (1982). Instructional effects of cues, participation, and corrective feedback: A quantitative synthesis. American Educational Research Journal, 19 (4), 559-578.

Rummel, A., & Feinberg, R. (1988). Cognitive evaluation theory: A review of the literature. Social Behavior and Personality, 16 (2), 147-164.

Skiba, R., Casey, A., & Center, B. (1985-86). Nonaversive procedures in the treatment of classroom behavior problems. The Journal of Special Education, 19 (4), 459-481.

What Works Clearinghouse (2017). Procedures handbook 4.0. Washington, DC: Author.

Photo credit: U.S. Farm Security Administration [Public domain], via Wikimedia Commons


 

Meta-Analysis and Its Discontents

Everyone loves meta-analyses. We did an analysis of the most frequently opened articles on Best Evidence in Brief. Almost all of the most popular were meta-analyses. What’s so great about meta-analyses is that they condense a lot of evidence and synthesize it, so instead of just one study that might be atypical or incorrect, a meta-analysis seems authoritative, because it averages many individual studies to find the true effect of a given treatment or variable.

Meta-analyses can be wonderful summaries of useful information. But today I wanted to discuss how they can be misleading. Very misleading.

The problem is that there are no norms among journal editors or meta-analysts themselves about standards for including studies or, perhaps most importantly, how much or what kind of information needs to be reported about each individual study in a meta-analysis. Some meta-analyses are completely statistical. They report all sorts of statistics and very detailed information on exactly how the search for articles took place, but never say anything about even a single study. This is a problem for many reasons. Readers may have no real understanding of what the studies really say. Even if citations for the included studies are available, only a very motivated reader is going to go find any of them. Most meta-analyses do have a table listing studies, but the information in the table may be idiosyncratic or limited.

One reason all of this matters is that without clear information on each study, readers can be easily misled. I remember encountering this when meta-analysis first became popular in the 1980s. Gene Glass coined the very term, proposed some foundational procedures, and popularized the methods. Early on, he applied meta-analysis to determine the effects of class size, which by then had been studied several times and found to matter very little except in first grade. Reducing “class size” to one (i.e., one-to-one tutoring) also was known to make a big difference, but few people would include one-to-one tutoring in a review of class size. But Glass and Smith (1978) found a much higher effect, not limited to first grade or tutoring. It was a big deal at the time.

I wanted to understand what happened. I bought and read Glass’ book on class size, but it was nearly impossible to tell what had happened. But then I found in an obscure appendix a distribution of effect sizes. Most studies had effect sizes near zero, as I expected. But one had a huge effect size, of +1.25! It was hard to tell which particular study accounted for this amazing effect but I searched by process of elimination and finally found it.

It was a study of tennis.


The outcome measure was the ability to “rally a ball against a wall so many times in 30 seconds.” Not surprisingly, when there were “large class sizes,” most students got very few chances to practice, while in “small class sizes,” they did.

If you removed the clearly irrelevant tennis study, the average effect size for class sizes (other than tutoring) dropped to near zero, as reported in all other reviews (Slavin, 1989).
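The effect of a single outlying study on an average can be sketched in Python. The nine near-zero effect sizes below are made up; only the +1.25 value comes from the account above, so the resulting means are not the ones in Glass and Smith (1978). The leave-one-out helper is a hypothetical check a reader could run if a review listed every study's effect size.

```python
from statistics import mean

# Hypothetical effect sizes: nine studies near zero plus the +1.25 outlier.
effect_sizes = [-0.05, 0.02, 0.00, 0.04, -0.03, 0.01, 0.03, -0.02, 0.00, 1.25]

def leave_one_out_means(values):
    """Mean of the remaining values after dropping each study in turn."""
    return [mean(values[:i] + values[i + 1:]) for i in range(len(values))]

mean_with_outlier = mean(effect_sizes)          # pulled up by one study
mean_without_outlier = mean(effect_sizes[:-1])  # essentially zero

# Index of the study whose removal lowers the average the most.
loo = leave_one_out_means(effect_sizes)
most_influential = min(range(len(loo)), key=lambda i: loo[i])
```

A check like this is only possible when the review reports each study's effect size, which is the practical argument for listing study-level information.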

The problem went way beyond class size, of course. What was important, to me at least, was that Glass’ presentation of the data made it very difficult to find out what was really going on. He had attractive and compelling graphs and charts showing effects of class size, but they all depended on the one tennis study, and there was no easy way to find out.

Because of this review and several others appearing in the 1980s, I wrote an article criticizing numbers-only meta-analyses and arguing that reviewers should show all of the relevant information about the studies in their meta-analyses, and should even describe each study briefly to help readers understand what was happening. I made up a name for this, “best-evidence synthesis” (Slavin, 1986).

Neither the term nor the concept really took hold, I’m sad to say. You still see meta-analyses all the time that do not tell readers enough for them to know what’s really going on. Yet several developments have made the argument for something like best-evidence synthesis a lot more compelling.

One development is the increasing evidence that methodological features can be strongly correlated with effect sizes (Cheung & Slavin, 2016). The evidence is now overwhelming that effect sizes are greatly inflated when sample sizes are small, when study durations are brief, when measures are made by developers or researchers, or when quasi-experiments rather than randomized experiments are used, for example. Many meta-analyses check for the effects of these and other study characteristics, and may make adjustments if there are significant differences. But this is not sufficient, because in a particular meta-analysis, there may not be enough studies to make any study-level factors significant. For example, if Glass had tested “tennis vs. non-tennis,” there would have been no significant difference, because there was only one tennis study. Yet that one study dominated the means anyway. Eliminating studies using, for example, researcher/developer-made measures or very small sample sizes or very brief durations is one way to remove bias from meta-analyses, and this is what we do in our reviews. But at bare minimum, it is important to have enough information available in tables to enable readers or journal reviewers to look for such biasing factors so they can recompute or at least understand the main effects if they are so inclined.

The second development that makes it important to require more information on individual studies in meta-analyses is the increased popularity of meta-meta-analyses, where the average effect sizes from whole meta-analyses are averaged. These have even more potential for trouble than the worst statistics-only reviews, because it is extremely unlikely that many readers will follow the citations to each included meta-analysis and then follow those citations to look for individual studies. It would be awfully helpful if readers or reviewers could trust the individual meta-analyses (and therefore their averages), or at least see for themselves.

As evidence takes on greater importance, this would be a good time to discuss reasonable standards for meta-analyses. Otherwise, we’ll be rallying balls uselessly against walls forever.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Glass, G., & Smith, M. L. (1978). Meta-Analysis of research on the relationship of class size and achievement. San Francisco: Far West Laboratory for Educational Research and Development.

Slavin, R.E. (1986). Best-evidence synthesis: An alternative to meta-analytic and traditional reviews. Educational Researcher, 15 (9), 5-11.

Slavin, R. E. (1989). Class size and student achievement:  Small effects of small classes. Educational Psychologist, 24, 99-110.


When Developers Commission Studies, What Develops?

I have the greatest respect for commercial developers and disseminators of educational programs, software, and professional development. As individuals, I think they genuinely want to improve the practice of education, and help produce better outcomes for children. However, most developers are for-profit companies, and they have shareholders who are focused on the bottom line. So when developers carry out evaluations, or commission evaluation companies to do so on their behalf, perhaps it’s best to keep in mind a bit of dialogue from a Marx Brothers movie. Someone asks Groucho if Chico is honest. “Sure,” says Groucho, “As long as you watch him!”


A healthy role for developers in evidence-based reform in education is desirable. Publishers, software developers, and other commercial companies have a lot of capital, and a strong motivation to create new products with evidence of effectiveness that will stand up to scrutiny. In medicine, most advances in practical drugs and treatments are made by drug companies. If you’re a cynic, this may sound disturbing. But for a long time, the federal government has encouraged drug companies to develop and evaluate new drugs, while enforcing strict rules about what counts as conclusive evidence. Basically, the government says, following Groucho, “Are drug companies honest? Sure, as long as you watch ‘em.”

In our field, we may want to think about how to do this. As one contribution, my colleague Betsy Wolf did some interesting research on outcomes of studies sponsored by developers, compared to those conducted by independent, third parties. She looked at all reading/literacy and math studies listed on the What Works Clearinghouse database. The first thing she found was very disturbing. Sure enough, the effect sizes for the developer-commissioned studies (ES = +0.27, n=73) were twice as large as those for independent studies (ES = +0.13, n=96). That’s a huge difference.

Being a curious person, Betsy wanted to know why developer-commissioned studies had effect sizes that were so much larger than independent ones. We now know a lot about study characteristics that inflate effect sizes. The most inflationary are small sample size, use of measures made by researchers or developers (rather than independent measures), and use of quasi-experiments instead of randomized designs. Developer-commissioned studies were in fact much more likely to use researcher/developer-made measures (29% in developer-commissioned vs. 8% in independent studies) and to use quasi-experiments rather than randomized designs (51% quasi-experiments for developer-commissioned studies vs. 15% for independent studies). However, sample sizes were similar in developer-commissioned and independent studies. And most surprising, statistically controlling for all of these factors did not reduce the developer effect by very much.

If there is so much inflation of effect sizes in developer-commissioned studies, then how come controlling for the largest factors that usually cause effect size inflation does not explain the developer effect?

There is a possible reason for this, which Betsy cautiously advances (since it cannot be proven). Perhaps the reason that effect sizes are inflated in developer-commissioned studies is not due to the nature of the studies we can find, but to the studies we cannot find. There has long been recognition of what is called the “file drawer effect,” which happens when studies that do not obtain a positive outcome disappear (into a file drawer). Perhaps developers are especially likely to hide disappointing findings. Unlike academic studies, which are likely to exist as technical reports or dissertations, perhaps commercial companies have no incentive to make null findings findable in any form.
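The file drawer effect is easy to demonstrate with a small simulation. The sketch below is purely hypothetical and is not Betsy's analysis: it assumes every study estimates the same true effect (set to +0.13 only for illustration), adds random sampling error, and then "publishes" only the studies whose result is significantly positive. The reporting rule, the standard error, and the function name are all assumptions made for the example.

```python
import random
from statistics import mean


def simulate_file_drawer(true_effect=0.13, n_studies=2000,
                         standard_error=0.15, seed=0):
    """Compare the average of all studies with the average of reported ones.

    Hypothetical rule: a study is reported only if its observed effect is
    significantly positive (z > 1.96). Everything else goes in the drawer.
    """
    rng = random.Random(seed)
    # Each study observes the true effect plus normally distributed noise.
    observed = [rng.gauss(true_effect, standard_error)
                for _ in range(n_studies)]
    reported = [es for es in observed if es / standard_error > 1.96]
    return mean(observed), mean(reported), len(reported)


all_mean, reported_mean, n_reported = simulate_file_drawer()
print(f"Mean ES, all studies:      {all_mean:+.2f}")
print(f"Mean ES, reported studies: {reported_mean:+.2f} "
      f"({n_reported} of 2000 reported)")
```

Under these assumptions, the studies that stay in the drawer do not change the true effect at all, yet the average of the reported studies comes out well above it.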

This may not be true, or it may be true of some but not other developers. But if government is going to start taking evidence a lot more seriously, as it has done with the ESSA evidence standards (see www.evidenceforessa.org), it is important to prevent developers, or any researchers, from hiding their null findings.

There is a solution to this problem that is heading rapidly in our direction. This is pre-registration. In pre-registration, researchers or evaluators must file the design, measures, and analyses they are about to use in a study. Perhaps most importantly, pre-registration announces that a study exists, or will exist soon. If a developer pre-registered a study but that study never showed up in the literature, this might be a cause for losing faith in the developer. Imagine that the What Works Clearinghouse, Evidence for ESSA, and journals refused to accept research reports on programs unless the study had been pre-registered, and unless all other studies of the program were made available.

Some areas of medicine use pre-registration, and the Society for Research on Educational Effectiveness is moving toward introducing a pre-registration process for education. Use of pre-registration and other safeguards could be a boon to commercial developers, as it is to drug companies, because it could build public confidence in developer-sponsored research. Admittedly, it would take many years and/or a lot more investment in educational research to make this practical, but there are concrete steps we could take in that direction.

I’m not sure I see any reason we shouldn’t move toward pre-registration. It would be good for Groucho, good for Chico, and good for kids. And that’s good enough for me!

Photo credit: By Paramount Pictures (source) [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Effect Sizes and the 10-Foot Man

If you ever go into the Ripley’s Believe It or Not Museum in Baltimore, you will be greeted at the entrance by a statue of the tallest man who ever lived, Robert Pershing Wadlow, a gentle giant at 8 feet, 11 inches in his stocking feet. Kids and adults love to get their pictures taken standing by him, to provide a bit of perspective.

I bring up Mr. Wadlow to explain a phrase I use whenever my colleagues come up with an effect size of more than 1.00. “That’s a 10-foot man,” I say. What I mean, of course, is that while it is not impossible that there could be a 10-foot man someday, it is extremely unlikely, because there has never been a man that tall in all of history. If someone reports seeing one, they are probably mistaken.

In the case of effect sizes, you will never, or almost never, see an effect size of more than +1.00, assuming the following reasonable conditions:

  1. The effect size compares experimental and control groups (i.e., it is not pre-post).
  2. The experimental and control group started at the same level, or they started at similar levels and researchers statistically controlled for pretest differences.
  3. The measures involved were independent of the researcher and the treatment, not made by the developers or researchers. The test was not given by the teachers to their own students.
  4. The treatment was provided by ordinary teachers, not by researchers, and could in principle be replicated widely in ordinary schools. The experiment had a duration of at least 12 weeks.
  5. There were at least 30 students and 2 teachers in each treatment group (experimental and control).

If these conditions are met, the chances of finding effect sizes of more than +1.00 are about the same as the chances of finding a 10-foot man. That is, zero.
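The five conditions above can be written as a simple screening checklist. The sketch below is only an illustration: the `Study` fields, the function name, and both example studies are hypothetical, and real evidence reviews apply many more criteria than these.

```python
from dataclasses import dataclass


@dataclass
class Study:
    effect_size: float
    has_control_group: bool        # condition 1: not a pre-post comparison
    equivalent_at_pretest: bool    # condition 2: equal, or statistically adjusted
    independent_measure: bool      # condition 3: not made by developers/researchers
    tested_by_own_teachers: bool   # condition 3: teachers tested their own students
    ordinary_teachers: bool        # condition 4: not delivered by researchers
    duration_weeks: int            # condition 4: at least 12 weeks
    students_per_group: int        # condition 5: smallest group, at least 30
    teachers_per_group: int        # condition 5: smallest group, at least 2


def failed_conditions(study):
    """Return the labels of the conditions the study does not meet."""
    checks = [
        (study.has_control_group, "1: experimental vs. control comparison"),
        (study.equivalent_at_pretest, "2: groups equivalent or adjusted at pretest"),
        (study.independent_measure and not study.tested_by_own_teachers,
         "3: independent measure, independently administered"),
        (study.ordinary_teachers and study.duration_weeks >= 12,
         "4: ordinary teachers, at least 12 weeks"),
        (study.students_per_group >= 30 and study.teachers_per_group >= 2,
         "5: at least 30 students and 2 teachers per group"),
    ]
    return [label for ok, label in checks if not ok]


# Hypothetical brief demonstration with a huge effect size.
brief_demo = Study(2.0, True, True, False, False, False, 3, 20, 1)
# Hypothetical school-based trial with a modest effect size.
field_trial = Study(0.25, True, True, True, False, True, 30, 400, 12)

for name, study in [("brief demo", brief_demo), ("field trial", field_trial)]:
    print(name, f"ES = {study.effect_size:+.2f}", failed_conditions(study))
```

The point of listing the failed conditions, rather than returning a simple yes or no, is that a very large effect size is usually explained by whichever conditions the study did not meet.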

I was thinking about the 10-foot man when I was recently asked by a reporter about the “two sigma effect” claimed by Benjamin Bloom and much discussed in the 1970s and 1980s. Bloom’s students did a series of experiments in which students were taught about a topic none of them knew anything about, usually principles of sailing. After a short period, students were tested. Those who did not achieve at least 80% (defined as “mastery”) on the tests were tutored by University of Chicago graduate students long enough to ensure that every tutored student reached mastery. The purpose of this demonstration was to make a claim that every student could learn whatever we wanted to teach them, and the only variable was instructional time, as some students need more time to learn than others. In a system in which enough time could be given to all, “ability” would disappear as a factor in outcomes. Also, in comparison to control groups who were not taught about sailing at all, the effect size was often more than 2.0, or two sigma. That’s why this principle was called the “two sigma effect.” Doesn’t the two sigma effect violate my 10-foot man principle?

No, it does not. The two sigma studies used experimenter-made tests of content taught to the experimental but not control groups. They used University of Chicago graduate students providing far more tutoring (as a percentage of initial instruction) than any school could ever provide. The studies were very brief and sample sizes were small. The two sigma experiments were designed to prove a point, not to evaluate a feasible educational method.

A more recent example of the 10-foot man principle is found in Visible Learning, the currently fashionable book by John Hattie claiming huge effect sizes for all sorts of educational treatments. Hattie asks the reader to ignore any educational treatment with an effect size of less than +0.40, and reports many whole categories of teaching methods with average effect sizes of more than +1.00. How can this be?

The answer is that such effect sizes, like two sigma, do not incorporate the conditions I laid out. Instead, Hattie throws into his reviews entire meta-analyses which may include pre-post studies, studies using researcher-made measures, studies with tiny samples, and so on. For practicing educators, such effect sizes are useless. An educator knows that all children grow from pre- to posttest. They would not (and should not) accept measures made by researchers. The largest known effect sizes that do meet the above conditions are one-to-one tutoring studies with effect sizes up to +0.86. Still not +1.00. What could be more effective than the best of 1-1 tutoring?

It’s fun to visit Mr. Wadlow at the museum, and to imagine what an even taller man could do on a basketball team, for example. But if you see a 10-foot man at Ripley’s Believe It or Not, or anywhere else, here’s my suggestion. Don’t believe it. And if you visit a museum of famous effect sizes that displays a whopper effect size of +1.00, don’t believe that, either. It doesn’t matter how big effect sizes are if they are not valid.

A 10-foot man would be a curiosity. An effect size of +1.00 is a distraction. Our work on evidence is too important to spend our time looking for 10-foot men, or effect sizes of +1.00, that don’t exist.

Photo credit: [Public domain], via Wikimedia Commons

The Mill and The School

 

On a recent trip to Scotland, I visited some very interesting oat mills. I always love to visit medieval mills, because I find it endlessly fascinating how people long ago used natural forces and materials – wind, water, and fire, stone, wood, and metal – to create advanced mechanisms that had a profound impact on society.

In Scotland, it’s all about oat mills (almost everywhere else, it’s wheat). These grain mills date back to the 10th century. In their time, they were a giant leap in technology. A mill is very complicated, but at its heart are two big innovations. In the center of the mill, a heavy millstone turns on top of another. The grain is poured through a hole in the top stone for grinding. The miller’s most difficult task is to maintain an exact distance between the stones. A few millimeters too far apart and no milling happens. A few millimeters too close and the heat of friction can ruin the machinery, possibly causing a fire.

The other key technology is the water wheel (except in windmills, of course). The water mill is part of a system that involves a carefully controlled flow of water from a millpond, which the miller uses to provide exactly the right amount of water to turn a giant wooden wheel, which powers the top millstone.

The medieval grain mill is not a single innovation, but a closely integrated system of innovations. Millers learned to manage this complex technology in a system of apprenticeship over many years.

Mills enabled medieval millers to obtain far more nutrition from an acre of grain than was possible before. This made it possible for land to support many more people, and the population surged. The whole feudal system was built around the economics of mills, and mills thrived through the 19th century.

What does the mill have to do with the school? Mills only grind well-behaved grain into well-behaved flour, while schools work with far more complex children, families, and all the systems that surround them. The products of schools must include joy and discovery, knowledge and skills.

Yet as different as they are, mills have something to teach us. They show the importance of integrating diverse systems that can then efficiently deliver desired outcomes. Neither a mill nor an effective school comes into existence because someone in power tells it to. Instead, complex systems, mills or schools, must be created, tested, adapted to local needs, and constantly improved. Once we know how to create, manage, and disseminate effective mills or schools, policies can be readily devised to support their expansion and improvement.

Important progress in societies and economies almost always comes about from development of complex, multi-component innovations that, once developed, can be disseminated and continuously improved. The same is true of schools. Changes in governance or large-scale policies can enhance (or inhibit) the possibility of change, but the reality of reform depends on creation of complex, integrated systems, from mills to ships to combines to hospitals to schools.

For education, what this means is that system transformation will come only when we have whole-school improvement approaches that are known to greatly increase student outcomes. Whole-school change is necessary because many individual improvements are needed to make big changes, and these must be carefully aligned with each other. Just as the huge water wheel and the tiny millstone adjustment mechanism and other components must work together in the mill, the key parts of a school must work together in synchrony to produce maximum impact, or the whole system fails to work as well as it should.

For example, if you look at research on proven programs, you’ll find effective strategies for school management, for teaching, and for tutoring struggling readers. These are all well and good, but they work so much better if they are linked to each other.

To understand this, first consider tutoring. Especially in the elementary grades, there is no more effective strategy. Our recent review of research on programs for struggling readers finds that well-qualified teaching assistants can be as effective as teachers in tutoring struggling readers, and that while one-to-four tutoring is less effective than one-to-one, it is still a lot more effective than no tutoring. So an evidence-oriented educator might logically choose to implement proven one-to-one and/or one-to-small group tutoring programs to improve school outcomes.

However, tutoring only helps the students who receive it, and it is expensive. A wise school administrator might reason that tutoring alone is not sufficient, but improving the quality of classroom instruction is also essential, both to improve outcomes for students who do not need tutoring and to reduce the number of students who do need tutoring. There is an array of proven classroom methods the principal or district might choose to improve student outcomes in all subjects and grade levels (see www.evidenceforessa.org).

But now consider students who are at risk because they are not attending regularly, or have behavior problems, or need eyeglasses but do not have them. Flexible school-level systems are necessary to ensure that students are in school, eager to learn, well-behaved, and physically prepared to succeed.

In addition, there is a need to have school principals and other leaders learn strategies for making effective use of proven programs. These would include managing professional development, coaching, monitoring implementation and outcomes of proven programs, distributed leadership, and much more. Leadership also requires jointly setting school goals with all school staff and monitoring progress toward these goals.

These are all components of the education “mill” that have to be designed, tested, and (if effective) disseminated to ever-increasing numbers of schools. Like the mill, an effective school design integrates individual parts, makes them work in synchrony, constantly assesses their functioning and output, and adjusts procedures when necessary.

Many educational theorists argue that education will only change when systems change. Ferocious battles rage about charters vs. ordinary public schools, about adopting policies of countries that do well on international tests, and so on. These policies can be important, but they are unlikely to create substantial and lasting improvement unless they lead to development and dissemination of proven whole-school approaches.

Effective school improvement is not likely to come about from let-a-thousand-flowers-bloom local innovation, nor from top-level changes in policy or governance. Sufficient change will not come about by throwing individual small innovations into schools and hoping they will collectively make a difference. Instead, effective improvement will take root when we learn how to reliably create effective programs for schools, implement them in a coordinated and planful way, find them effective, and then disseminate them. Once such schools are widespread, we can build larger policies and systems around their needs.

Coordinated, schoolwide improvement approaches offer schools proven strategies for increasing the achievement and success of their children. There should be many programs of this kind, among which schools and districts can choose. A school is not the same as a mill, but the mill provides at least one image of how creating complex, integrated, replicable systems can change whole societies and economies. We should learn from this and many other examples of how to focus our efforts to improve outcomes for all children.

Photo credit: By Johnson, Helen Kendrik [Public domain], via Wikimedia Commons
