Succeeding Faster in Education

“If you want to increase your success rate, double your failure rate.” So said Thomas Watson, the founder of IBM. What he meant, of course, is that people and organizations thrive when they try many experiments, even though most experiments fail. Failing twice as often means trying twice as many experiments, leading to twice as many failures—but also, he was saying, many more successes.

[Photo: Thomas Watson]

In education research and innovation circles, many people know this quote, and use it to console colleagues who have done an experiment that did not produce significant positive outcomes. A lot of consolation is necessary, because most high-quality experiments in education do not produce significant positive outcomes. In studies funded by the Institute of Education Sciences (IES), Investing in Innovation (i3), and England’s Education Endowment Foundation (EEF), all of which require very high standards of evidence, fewer than 20% of experiments show significant positive outcomes.

The high rate of failure in educational experiments is often shocking to non-researchers, especially the government agencies, foundations, publishers, and software developers who commission the studies. I was at a conference recently in which a Peruvian researcher presented the devastating results of an experiment in which high-poverty, mostly rural schools in Peru were randomly assigned to receive computers for all of their students, or to continue with usual instruction. The Peruvian Ministry of Education was so confident that the computers would be effective that they had built a huge model of the specific computers used in the experiment and attached it to the Ministry headquarters. When the results showed no positive outcomes (except for the ability to operate computers), the Ministry quietly removed the computer statue from the top of their building.

Improving Success Rates

Much as I believe Watson’s admonition (“fail more”), there is another principle that he was implying, or so I expect: We have to learn from failure, so we can increase the rate of success. It is not realistic to expect government to continue to invest substantial funding in high-quality educational experiments if the success rate remains below 20%. We have to get smarter, so we can succeed more often. Fortunately, qualitative measures, such as observations, interviews, and questionnaires, are becoming required elements of funded research, making it easier for researchers to find out what happened and what went wrong. Was the experimental program faithfully implemented? Were there unexpected responses toward the program by teachers or students?

In the course of my work reviewing positive and disappointing outcomes of educational innovations, I’ve noticed some patterns that often predict that a given program is likely or unlikely to be effective in a well-designed evaluation. Some of these are as follows.

  1. Small changes lead to small (or zero) impacts. In every subject and grade level, researchers have evaluated new textbooks, in comparison to existing texts. These almost never show positive effects. The reason is that textbooks are just not that different from each other. Approaches that do show positive effects are usually markedly different from ordinary practices or texts.
  2. Successful programs almost always provide a lot of professional development. The programs that have significant positive effects on learning are ones that markedly improve pedagogy. Changing teachers’ daily instructional practices usually requires initial training followed by on-site coaching by well-trained and capable coaches. Lots of PD does not guarantee success, but minimal PD virtually guarantees failure. Sufficient professional development can be expensive, but education itself is expensive, and adding a modest amount to per-pupil cost for professional development and other requirements of effective implementation is often the best way to substantially enhance outcomes.
  3. Effective programs are usually well-specified, with clear procedures and materials. Rarely do programs work if they are unclear about what teachers are expected to do, or if teachers are not helped to do it. In the Peruvian study of one-to-one computers, for example, students were given tablet computers at a per-pupil cost of $438. Teachers were expected to figure out how best to use them. In fact, a qualitative study found that the computers were considered so valuable that many teachers locked them up except for specific times when they were to be used. The schools lacked specific instructional software, and teachers lacked the professional development they would have needed to create it. No wonder “it” didn’t work. Other than the physical computers, there was no “it.”
  4. Technology is not magic. Technology can create opportunities for improvement, but there is little understanding of how to use technology to greatest effect. My colleagues and I have done reviews of research on effects of modern technology on learning. We found near-zero effects of a variety of elementary and secondary reading software (Inns et al., 2018; Baye et al., in press), with a mean effect size of +0.05 in elementary reading and +0.00 in secondary. In math, effects were slightly more positive (ES=+0.09), but still quite small, on average (Pellegrini et al., 2018). Some technology approaches had more promise than others, but it is time that we learned from disappointing as well as promising applications. The widespread belief that technology is the future must eventually be right, but at present we have little reason to believe that technology is transformative, and we don’t know which form of technology is most likely to be transformative.
  5. Tutoring is the most solid approach we have. Reviews of elementary reading for struggling readers (Inns et al., 2018) and secondary struggling readers (Baye et al., in press), as well as elementary math (Pellegrini et al., 2018), find outcomes for various forms of tutoring that are far beyond effects seen for any other type of treatment. Everyone knows this, but thinking about tutoring falls into two camps. One, typified by advocates of Reading Recovery, takes the view that tutoring is so effective for struggling first graders that it should be used no matter what the cost. The other, also perhaps thinking about Reading Recovery, rejects this approach because of its cost. Yet recent research on tutoring methods is finding strategies that are cost-effective and feasible. First, studies in both reading (Inns et al., 2018) and math (Pellegrini et al., 2018) find no difference in outcomes between certified teachers and paraprofessionals using structured one-to-one or one-to-small group tutoring models. Second, although one-to-one tutoring is more effective than one-to-small group, one-to-small group is far more cost-effective, as one trained tutor can work with 4 to 6 students at a time. Also, recent studies have found that tutoring can be just as effective in the upper elementary and middle grades as in first grade, so this strategy may have broader applicability than it has had in the past. The real challenge for research on tutoring is to develop and evaluate models that increase cost-effectiveness of this clearly effective family of approaches.

The extraordinary advances in the quality and quantity of research in education, led by investments from IES, i3, and the EEF, have raised expectations for research-based reform. However, the modest percentage of recent studies meeting current rigorous standards of evidence has caused disappointment in some quarters. Rather than a cause for disappointment, all findings, whether immediately successful or not, should be seen as crucial information. Some studies identify programs ready for prime time right now, but the whole body of work can and must inform us about areas worthy of expanded investment, as well as areas in need of serious rethinking and redevelopment. The evidence movement, in the form it exists today, is completing its first decade. It’s still early days. There is much more we can learn and do to develop, evaluate, and disseminate effective strategies, especially for students in great need of proven approaches.

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (in press). Effective reading programs for secondary students. Reading Research Quarterly.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

 Photo credit: IBM [CC BY-SA 3.0  (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 


Small Studies, Big Problems

Everyone knows that “good things come in small packages.” But in research evaluating practical educational programs, this saying does not apply. Small studies are very susceptible to bias. In fact, among all the factors that can inflate effect sizes in educational experiments, small sample size is among the most powerful. This problem is widely known, and in reviewing large and small studies, most meta-analysts solve the problem by requiring minimum sample sizes and/or weighting effect sizes by their sample sizes. Problem solved.


For some reason, the What Works Clearinghouse (WWC) has so far paid little attention to sample size. It has not weighted by sample size in computing mean effect sizes, although the WWC is talking about doing this in the future. It has not even set minimums for sample size for its reviews. I know of one accepted study with a total sample size of 12 (6 experimental, 6 control). These procedures greatly inflate WWC effect sizes.

As one indication of the problem, our review of 645 reading, math, and science studies accepted by the Best Evidence Encyclopedia (www.bestevidence.org) found that studies with fewer than 250 subjects had nearly twice the effect sizes of those with more than 250 (effect sizes=+0.30 vs. +0.16). Comparing studies with fewer than 100 students to those with more than 3000, the ratio was 3.5 to 1 (see Cheung & Slavin [2016] at http://www.bestevidence.org/word/methodological_Sept_21_2015.pdf). Several other studies have found the same effect.

In data from the What Works Clearinghouse reading and math studies, obtained by graduate student Marta Pellegrini (2017), sample size effects were also extraordinary. The mean effect size for sample sizes of 60 or less was +0.37; for samples of 60-250, +0.29; and for samples of more than 250, +0.13. Among all design factors she studied, small sample size made the most difference in outcomes, rivaled only by researcher/developer-made measures. In fact, sample size is more pernicious, because while reviewers can exclude researcher/developer-made measures within a study and focus on independent measures, a study with a small sample has the same problem for all measures. Also, because small-sample studies are relatively inexpensive, there are quite a lot of them, so reviews that fail to attend to sample size can greatly over-estimate overall mean effect sizes.

My colleague Amanda Inns (2018) recently analyzed WWC reading and math studies to find out why small studies produce such inflated outcomes. There are many reasons small-sample studies may produce such large effect sizes. One is that in small studies, researchers can provide extraordinary amounts of assistance or support to the experimental group. This is called “superrealization.” Another is that when studies with small sample sizes find null effects, the studies tend not to be published or made available at all; they are deemed “pilots” and forgotten. In contrast, a large study is likely to have been paid for by a grant, which will produce a report no matter what the outcome. There has long been an understanding that published studies produce much higher effect sizes than unpublished studies, and one reason is that small studies are rarely published if their outcomes are not significant.

Whatever the reasons, there is no doubt that small studies greatly overstate effect sizes. In reviewing research, this well-known fact has long led meta-analysts to weight effect sizes by their sample sizes (usually using an inverse variance procedure). Yet as noted earlier, the WWC does not do this, but just averages effect sizes across studies without taking sample size into account.

One example of the problem of ignoring sample size in averaging is provided by Project CRISS. CRISS was evaluated in two studies. One had 231 students. On a staff-developed “free recall” measure, the effect size was +1.07. The other study had 2338 students, and an average effect size on standardized measures of -0.02. Clearly, the much larger study with an independent outcome measure should have swamped the effects of the small study with a researcher-made measure, but this is not what happened. The WWC just averaged the two effect sizes, obtaining a mean of +0.53.
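The arithmetic behind the CRISS example can be sketched in a few lines of Python. This is only an illustration, not the WWC's or anyone else's actual procedure: the function names are invented here, and the inverse-variance version assumes each study's students were split evenly between experimental and control groups and analyzed at the student level, which the text above does not state.

```python
def unweighted_mean(studies):
    """Simple average of effect sizes, ignoring sample size."""
    return sum(es for es, _ in studies) / len(studies)


def sample_size_weighted_mean(studies):
    """Average of effect sizes, each weighted by its total sample size."""
    total_n = sum(n for _, n in studies)
    return sum(es * n for es, n in studies) / total_n


def inverse_variance_weighted_mean(studies):
    """Average of effect sizes, each weighted by 1 / (approximate variance).

    ASSUMPTION: equal experimental and control groups (n/2 each) and
    student-level analysis, so the usual large-sample variance of a
    standardized mean difference reduces to 4/n + es^2 / (2n).
    """
    def variance(es, n):
        return 4 / n + es ** 2 / (2 * n)

    weights = [1 / variance(es, n) for es, n in studies]
    weighted_sum = sum(w * es for w, (es, _) in zip(weights, studies))
    return weighted_sum / sum(weights)


# (effect size, total sample size) for the two CRISS studies described above
criss = [(1.07, 231), (-0.02, 2338)]

print(f"unweighted:         {unweighted_mean(criss):+.3f}")
print(f"weighted by n:      {sample_size_weighted_mean(criss):+.3f}")
print(f"inverse variance:   {inverse_variance_weighted_mean(criss):+.3f}")
```

Under these assumptions, either form of weighting pulls the CRISS mean down from the unweighted +0.53 to below +0.10, because the 2338-student study counts for roughly ten times as much as the 231-student study.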

How might the WWC set minimum sample sizes for studies to be included for review? Amanda Inns proposed a minimum of 60 students (at least 30 experimental and 30 control) for studies that analyze at the student level. She suggests a minimum of 12 clusters (6 and 6), such as classes or schools, for studies that analyze at the cluster level.
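As a rough sketch, the proposed minimums could be applied as a simple screening rule. The `Study` class, its field names, and the example studies below are hypothetical, made up for illustration; only the thresholds (30 students or 6 clusters per group) come from the proposal described above.

```python
from dataclasses import dataclass

MIN_STUDENTS_PER_GROUP = 30  # student-level analysis: at least 30 + 30
MIN_CLUSTERS_PER_GROUP = 6   # cluster-level analysis: at least 6 + 6


@dataclass
class Study:
    """Hypothetical record of one study's design (illustrative only)."""
    name: str
    n_experimental: int  # students, or clusters if cluster_level is True
    n_control: int
    cluster_level: bool = False


def meets_minimum_sample(study: Study) -> bool:
    """Return True if both groups reach the proposed minimum size."""
    if study.cluster_level:
        minimum = MIN_CLUSTERS_PER_GROUP
    else:
        minimum = MIN_STUDENTS_PER_GROUP
    return study.n_experimental >= minimum and study.n_control >= minimum


# Made-up examples; the first mirrors the 6-vs-6 student study mentioned earlier.
studies = [
    Study("tiny student-level study", 6, 6),
    Study("student-level study at the minimum", 30, 30),
    Study("cluster-level study with 6 and 6 schools", 6, 6, cluster_level=True),
    Study("cluster-level study with 5 and 7 schools", 5, 7, cluster_level=True),
]

for study in studies:
    verdict = "include" if meets_minimum_sample(study) else "exclude"
    print(f"{study.name}: {verdict}")
```

Note that the same counts (6 and 6) fail the rule when the units are students but pass when the units are classes or schools, which is why the level of analysis has to be recorded for each study.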

In educational research evaluating school programs, good things come in large packages. Small studies are fine as pilots, or for descriptive purposes. But when you want to know whether a program works in realistic circumstances, go big or go home, as they say.

The What Works Clearinghouse should exclude very small studies and should use weighting based on sample sizes in computing means. And there is no reason it should not start doing these things now.

References

Inns, A., & Slavin, R. (2018, August). Do small studies add up in the What Works Clearinghouse? Paper presented at the meeting of the American Psychological Association, San Francisco, CA.

Pellegrini, M. (2017, August). How do different standards lead to different conclusions? A comparison between meta-analyses of two research centers. Paper presented at the European Conference on Educational Research (ECER), Copenhagen, Denmark.


“But It Worked in the Lab!” How Lab Research Misleads Educators

In researching John Hattie’s meta-meta-analyses, and digging into the original studies, I discovered one underlying factor that more than anything explains why he consistently comes up with greatly inflated effect sizes: Most studies in the meta-analyses that he synthesizes are brief, small, artificial lab studies. And lab studies produce very large effect sizes that have little if any relevance to classroom practice.

This discovery reminds me of one of the oldest science jokes in existence: (One scientist to another): “Your treatment worked very well in practice, but how will it work in the lab?” (Or “…in theory?”)


The point of the joke, of course, is to poke fun at scientists more interested in theory than in practical impacts on real problems. Personally, I have great respect for theory and lab studies. My very first publication as a psychology undergraduate involved an experiment on rats.

Now, however, I work in a rapidly growing field that applies scientific methods to the study and improvement of classroom practice.  In our field, theory also has an important role. But lab studies?  Not so much.

A lab study in education is, in my view, any experiment that tests a treatment so brief, so small, or so artificial that it could never be used all year. Also, an evaluation of any treatment that could never be replicated, such as a technology program in which a graduate student is standing by every four students every day of the experiment, or a tutoring program in which the study author or his or her students provide the tutoring, might be considered a lab study, even if it went on for several months.

Our field exists to try to find practical solutions to practical problems in an applied discipline.  Lab studies have little importance in this process, because they are designed to eliminate all factors other than the variables of interest. A one-hour study in which children are asked to do some task under very constrained circumstances may produce very interesting findings, but cannot recommend practices for real teachers in real classrooms.  Findings of lab studies may suggest practical treatments, but by themselves they never, ever validate practices for classroom use.

Lab studies are almost invariably doomed to success. Their conditions are carefully set up to support a given theory. Because they are small, brief, and highly controlled, they produce huge effect sizes. (Because they are relatively easy and inexpensive to do, it is also very easy to discard them if they do not work out, contributing to the universally reported tendency of studies appearing in published sources to report much higher effects than reports in unpublished sources).  Lab studies are so common not only because researchers believe in them, but also because they are easy and inexpensive to do, while meaningful field experiments are difficult and expensive.   Need a publication?  Randomly assign your college sophomores to two artificial treatments and set up an experiment that cannot fail to show significant differences.  Need a dissertation topic?  Do the same in your third-grade class, or in your friend’s tenth grade English class.  Working with some undergraduates, we once did three lab studies in a single day. All were published. As with my own sophomore rat study, lab experiments are a good opportunity to learn to do research.  But that does not make them relevant to practice, even if they happen to take place in a school building.

By doing meta-analyses, or meta-meta-analyses, Hattie and others who do similar reviews obscure the fact that many and usually most of the studies they include are very brief, very small, and very artificial, and therefore produce very inflated effect sizes.  They do this by covering over the relevant information with numbers and statistics rather than information on individual studies, and by including such large numbers of studies that no one wants to dig deeper into them.  In Hattie’s case, he claims that Visible Learning meta-meta-analyses contain 52,637 individual studies.  Who wants to read 52,637 individual studies, only to find out that most are lab studies and have no direct bearing on classroom practice?  It is difficult for readers to do anything but assume that the 52,637 studies must have taken place in real classrooms, and achieved real outcomes over meaningful periods of time.  But in fact, the few that did this are overwhelmed by the thousands of lab studies that did not.

Educators have a right to data that are meaningful for the practice of education.  Anyone who recommends practices or programs for educators to use needs to be open about where that evidence comes from, so educators can judge for themselves whether or not one-hour or one-week studies under artificial conditions tell them anything about how they should teach. I think the question answers itself.


John Hattie is Wrong

John Hattie is a professor at the University of Melbourne, Australia. He is famous for a book, Visible Learning, which claims to review every area of research that relates to teaching and learning. He uses a method called “meta-meta-analysis,” averaging effect sizes from many meta-analyses. The book ranks factors from one to 138 in terms of their effect sizes on achievement measures. Hattie is a great speaker, and many educators love the clarity and simplicity of his approach. How wonderful to have every known variable reviewed and ranked!

However, operating on the principle that anything that looks to be too good to be true probably is, I looked into Visible Learning to try to understand why it reports such large effect sizes. My colleague, Marta Pellegrini from the University of Florence (Italy), helped me track down the evidence behind Hattie’s claims. And sure enough, Hattie is profoundly wrong. He is merely shoveling meta-analyses containing massive bias into meta-meta-analyses that reflect the same biases.


Part of Hattie’s appeal to educators is that his conclusions are so easy to understand. He even uses a system of dials with color-coded “zones,” where effect sizes of 0.00 to +0.15 are designated “developmental effects,” +0.15 to +0.40 “teacher effects” (i.e., what teachers can do without any special practices or programs), and +0.40 to +1.20 the “zone of desired effects.” Hattie makes a big deal of the magical effect size +0.40, the “hinge point,” recommending that educators essentially ignore factors or programs below that point, because they are no better than what teachers produce each year, from fall to spring, on their own. In Hattie’s view, an effect size from +0.15 to +0.40 is just the effect that “any teacher” could produce, in comparison to students not being in school at all. He says, “When teachers claim that they are having a positive effect on achievement or when a policy improves achievement, this is almost always a trivial claim: Virtually everything works. One only needs a pulse and we can improve achievement.” (Hattie, 2009, p. 16). An effect size of 0.00 to +0.15 is, he estimates, “what students could probably achieve if there were no schooling” (Hattie, 2009, p. 20). Yet this characterization of dials and zones misses the essential meaning of effect sizes, which are rarely used to measure the amount teachers’ students gain from fall to spring, but rather the amount students receiving a given treatment gained in comparison to gains made by similar students in a control group over the same period. So an effect size of, say, +0.15 or +0.25 could be very important.

Hattie’s core claims are these:

  • Almost everything works
  • Any effect size less than +0.40 is ignorable
  • It is possible to meaningfully rank educational factors in comparison to each other by averaging the findings of meta-analyses.

These claims appear appealing, simple, and understandable. But they are also wrong.

The essential problem with Hattie’s meta-meta-analyses is that they accept the results of the underlying meta-analyses without question. Yet many, perhaps most meta-analyses accept all sorts of individual studies of widely varying standards of quality. In Visible Learning, Hattie considers and then discards the possibility that there is anything wrong with individual meta-analyses, specifically rejecting the idea that the methods used in individual studies can greatly bias the findings.

To be fair, a great deal has been learned about the degree to which particular study characteristics bias study findings, always in a positive (i.e., inflated) direction. For example, there is now overwhelming evidence that effect sizes are significantly inflated in studies that have small sample sizes or brief durations, use measures made by researchers or developers, are published (vs. unpublished), or use quasi-experiments (vs. randomized experiments) (Cheung & Slavin, 2016). Many meta-analyses even include pre-post studies, or studies that do not have pretests, or have pretest differences but fail to control for them. For example, I once criticized a meta-analysis of gifted education in which some studies compared students accepted into gifted programs to students rejected for those programs, controlling for nothing!

A huge problem with meta-meta-analysis is that until recently, meta-analysts rarely screened individual studies to remove those with fatal methodological flaws. Hattie himself rejects this procedure: “There is…no reason to throw out studies automatically because of lower quality” (Hattie, 2009, p. 11).

In order to understand what is going on in the underlying meta-analyses in a meta-meta-analysis, it is crucial to look all the way down to the individual studies. As a point of illustration, I examined Hattie’s own meta-meta-analysis of feedback, his third-ranked factor, with a mean effect size of +0.79. Hattie & Timperley (2007) located 12 meta-analyses. I found some of the ones with the highest mean effect sizes.

At a mean of +1.24, the meta-analysis with the largest effect size in the Hattie & Timperley (2007) review was a review of research on various reinforcement treatments for students in special education by Skiba, Casey, & Center (1985-86). The reviewers required use of single-subject designs, so the review consisted of a total of 35 students treated one at a time, across 25 studies. Yet it is known that single-subject designs produce much larger effect sizes than ordinary group designs (see What Works Clearinghouse, 2017).

The second-highest effect size, +1.13, was from a meta-analysis by Lysakowski & Walberg (1982), on instructional cues, participation, and corrective feedback. Not enough information is provided to understand the individual studies, but there is one interesting note. A study using a single-subject design, involving two students, had an effect size of 11.81. That is the equivalent of raising a child’s IQ from 100 to 277! It was “winsorized” to the next-highest value of 4.99 (which is like adding 75 IQ points). Many of the studies were correlational, with no controls for inputs, or had no control group, or were pre-post designs.
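The IQ comparisons above follow from simple arithmetic, sketched here in Python. The sketch assumes the conventional IQ scale with a mean of 100 and a standard deviation of 15, which is not stated in the text but reproduces its numbers; the function name is made up for illustration.

```python
IQ_MEAN = 100  # conventional IQ scale (assumed)
IQ_SD = 15     # conventional IQ standard deviation (assumed)


def iq_points(effect_size):
    """Convert an effect size (in standard deviations) to IQ points."""
    return effect_size * IQ_SD


original = 11.81    # effect size of the two-student single-subject study
winsorized = 4.99   # value it was reduced to

print(f"ES {original}: IQ {IQ_MEAN} -> {IQ_MEAN + iq_points(original):.0f}")
print(f"ES {winsorized}: adds {iq_points(winsorized):.0f} IQ points")
```

Even after winsorizing, the retained value corresponds to a gain of about five standard deviations, far outside anything seen in ordinary group-design studies.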

A meta-analysis by Rummel and Feinberg (1988), with a reported effect size of +0.60, is perhaps the most humorous inclusion in the Hattie & Timperley (2007) meta-meta-analysis. It consists entirely of brief lab studies of the degree to which being paid or otherwise reinforced for engaging in an activity that was already intrinsically motivating would reduce subjects’ later participation in that activity. Rummel & Feinberg (1988) reported a positive effect size if subjects later did less of the activity they were paid to do. The reviewers decided to code studies positively if their findings corresponded to the theory (i.e., that feedback and reinforcement reduce later participation in previously favored activities), but in fact their “positive” effect size of +0.60 indicates a negative effect of feedback on performance.

I could go on (and on), but I think you get the point. Hattie’s meta-meta-analyses grab big numbers from meta-analyses of all kinds with little regard to the meaning or quality of the original studies, or of the meta-analyses.

If you are familiar with the What Works Clearinghouse (2007), or our own Best-Evidence Syntheses (www.bestevidence.org) or Evidence for ESSA (www.evidenceforessa.org), you will know that individual studies, except for studies of one-to-one tutoring, almost never have effect sizes as large as +0.40, Hattie’s “hinge point.” This is because WWC, BEE, and Evidence for ESSA all very carefully screen individual studies. We require control groups, controls for pretests, minimum sample sizes and durations, and measures independent of the treatments. Hattie applies no such standards, and in fact proclaims that they are not necessary.

It is possible, in fact essential, to make genuine progress using high-quality rigorous research to inform educational decisions. But first we must agree on what standards to apply.  Modest effect sizes from studies of practical treatments in real classrooms over meaningful periods of time on measures independent of the treatments tell us how much a replicable treatment will actually improve student achievement, in comparison to what would have been achieved otherwise. I would much rather use a program with an effect size of +0.15 from such studies than to use programs or practices found in studies with major flaws to have effect sizes of +0.79. If they understand the situation, I’m sure all educators would agree with me.

To create information that is fair and meaningful, meta-analysts cannot include studies of unknown and mostly low quality. Instead, they need to apply consistent standards of quality for each study, to look carefully at each one and judge its freedom from bias and major methodological flaws, as well as its relevance to practice. A meta-analysis cannot be any better than the studies that go into it. Hattie’s claims are deeply misleading because they are based on meta-analyses that themselves accepted studies of all levels of quality.

Evidence matters in education, now more than ever. Yet Hattie and others who uncritically accept all studies, good and bad, are undermining the value of evidence. This needs to stop if we are to make solid progress in educational practice and policy.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Hattie, J. (2009). Visible learning. New York, NY: Routledge.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 (1), 81-112.

Lysakowski, R., & Walberg, H. (1982). Instructional effects of cues, participation, and corrective feedback: A quantitative synthesis. American Educational Research Journal, 19 (4), 559-578.

Rummel, A., & Feinberg, R. (1988). Cognitive evaluation theory: A review of the literature. Social Behavior and Personality, 16 (2), 147-164.

Skiba, R., Casey, A., & Center, B. (1985-86). Nonaversive procedures in the treatment of classroom behavior problems. The Journal of Special Education, 19 (4), 459-481.

What Works Clearinghouse (2017). Procedures handbook 4.0. Washington, DC: Author.

Photo credit: U.S. Farm Security Administration [Public domain], via Wikimedia Commons


Meta-Analysis and Its Discontents

Everyone loves meta-analyses. We did an analysis of the most frequently opened articles on Best Evidence in Brief. Almost all of the most popular were meta-analyses. What’s so great about meta-analyses is that they condense a lot of evidence and synthesize it, so instead of just one study that might be atypical or incorrect, a meta-analysis seems authoritative, because it averages many individual studies to find the true effect of a given treatment or variable.

Meta-analyses can be wonderful summaries of useful information. But today I wanted to discuss how they can be misleading. Very misleading.

The problem is that there are no norms among journal editors or meta-analysts themselves about standards for including studies or, perhaps most importantly, how much or what kind of information needs to be reported about each individual study in a meta-analysis. Some meta-analyses are completely statistical. They report all sorts of statistics and very detailed information on exactly how the search for articles took place, but never say anything about even a single study. This is a problem for many reasons. Readers may have no real understanding of what the studies really say. Even if citations for the included studies are available, only a very motivated reader is going to go find any of them. Most meta-analyses do have a table listing studies, but the information in the table may be idiosyncratic or limited.

One reason all of this matters is that without clear information on each study, readers can be easily misled. I remember encountering this when meta-analysis first became popular in the 1980s. Gene Glass coined the very term, proposed some foundational procedures, and popularized the methods. Early on, he applied meta-analysis to determine the effects of class size, which by then had been studied several times and found to matter very little except in first grade. Reducing “class size” to one (i.e., one-to-one tutoring) was also known to make a big difference, but few people would include one-to-one tutoring in a review of class size. Yet Glass and Smith (1978) found a much higher effect, not limited to first grade or tutoring. It was a big deal at the time.

I wanted to understand what happened. I bought and read Glass’ book on class size, but it was nearly impossible to tell what had happened. Then I found, in an obscure appendix, a distribution of effect sizes. Most studies had effect sizes near zero, as I expected. But one had a huge effect size, of +1.25! It was hard to tell which particular study accounted for this amazing effect, but I searched by process of elimination and finally found it.

It was a study of tennis.

The outcome measure was the ability to “rally a ball against a wall so many times in 30 seconds.” Not surprisingly, when there were “large class sizes,” most students got very few chances to practice, while in “small class sizes,” they got many more.

If you removed the clearly irrelevant tennis study, the average effect size for class sizes (other than tutoring) dropped to near zero, as reported in all other reviews (Slavin, 1989).
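A simple way to catch a study like this is a leave-one-out check: recompute the average with each study removed in turn and see whose removal moves it most. The Python sketch below is only an illustration. The +1.25 value comes from the story above; every other effect size and all of the study names are invented, and an unweighted mean is assumed for simplicity.

```python
from statistics import mean

# Hypothetical effect sizes: most near zero, plus one outlier of +1.25.
# Only the +1.25 value comes from the text; the rest are invented.
effect_sizes = {
    "study_a": 0.02,
    "study_b": -0.05,
    "study_c": 0.04,
    "study_d": 0.00,
    "study_e": -0.01,
    "tennis": 1.25,
}


def leave_one_out(effects):
    """Return {study: unweighted mean of all the other studies}."""
    return {
        left_out: mean(es for name, es in effects.items() if name != left_out)
        for left_out in effects
    }


overall = mean(effect_sizes.values())
without = leave_one_out(effect_sizes)

# The study whose removal shifts the mean the most deserves a closer look.
most_influential = max(without, key=lambda name: abs(overall - without[name]))

print(f"Mean of all studies: {overall:+.2f}")
print(f"Most influential study: {most_influential}")
print(f"Mean without it: {without[most_influential]:+.2f}")
```

With these made-up numbers, one study out of six lifts the average from about zero to roughly +0.21. A reader can only run this kind of check if the review reports each study's effect size.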

The problem went way beyond class size, of course. What was important, to me at least, was that Glass’ presentation of the data made it very difficult to find out what was really going on. He had attractive and compelling graphs and charts showing effects of class size, but they all depended on the one tennis study, and there was no easy way to find out.

Because of this review and several others appearing in the 1980s, I wrote an article criticizing numbers-only meta-analyses and arguing that reviewers should show all of the relevant information about the studies in their meta-analyses, and should even describe each study briefly to help readers understand what was happening. I made up a name for this, “best-evidence synthesis” (Slavin, 1986).

Neither the term nor the concept really took hold, I’m sad to say. You still see meta-analyses all the time that do not tell readers enough for them to know what’s really going on. Yet several developments have made the argument for something like best-evidence synthesis a lot more compelling.

One development is the increasing evidence that methodological features can be strongly correlated with effect sizes (Cheung & Slavin, 2016). The evidence is now overwhelming that effect sizes are greatly inflated when sample sizes are small, when study durations are brief, when measures are made by developers or researchers, or when quasi-experiments rather than randomized experiments are used, for example. Many meta-analyses check for the effects of these and other study characteristics, and may make adjustments if there are significant differences. But this is not sufficient, because in a particular meta-analysis, there may not be enough studies to make any study-level factors significant. For example, if Glass had tested “tennis vs. non-tennis,” there would have been no significant difference, because there was only one tennis study. Yet that one study dominated the means anyway. Eliminating studies using, for example, researcher/developer-made measures or very small sample sizes or very brief durations is one way to remove bias from meta-analyses, and this is what we do in our reviews. But at bare minimum, it is important to have enough information available in tables to enable readers or journal reviewers to look for such biasing factors so they can recompute or at least understand the main effects if they are so inclined.
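To show what such a table makes possible, here is a minimal Python sketch that applies inclusion criteria to a list of studies and recomputes the average. All of the studies, the field names, and the two thresholds (`MIN_STUDENTS`, `MIN_WEEKS`) are assumptions made up for this example. They are not the criteria used in any particular review.

```python
from statistics import mean

# Hypothetical thresholds, chosen only for illustration.
MIN_STUDENTS = 60
MIN_WEEKS = 12

# Hypothetical study table: the per-study information a reader would need.
studies = [
    {"name": "A", "es": 0.60, "n": 40, "weeks": 4, "independent_measure": False},
    {"name": "B", "es": 0.45, "n": 50, "weeks": 6, "independent_measure": True},
    {"name": "C", "es": 0.10, "n": 400, "weeks": 30, "independent_measure": True},
    {"name": "D", "es": 0.15, "n": 250, "weeks": 24, "independent_measure": True},
    {"name": "E", "es": 0.55, "n": 300, "weeks": 20, "independent_measure": False},
]


def exclusion_reasons(study):
    """Return the reasons a study would be excluded (empty list if none)."""
    reasons = []
    if not study["independent_measure"]:
        reasons.append("researcher/developer-made measure")
    if study["n"] < MIN_STUDENTS:
        reasons.append("very small sample")
    if study["weeks"] < MIN_WEEKS:
        reasons.append("very brief duration")
    return reasons


included = [s for s in studies if not exclusion_reasons(s)]

mean_all = mean(s["es"] for s in studies)
mean_included = mean(s["es"] for s in included)

for s in studies:
    reasons = exclusion_reasons(s)
    print(s["name"], f"{s['es']:+.2f}", "; ".join(reasons) or "included")
print(f"Mean, all studies: {mean_all:+.2f}")
print(f"Mean, included studies: {mean_included:+.2f}")
```

The numbers are invented, so the size of the drop means nothing. The point is that the recomputation takes a few lines once the study-level information is on the page.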

The second development that makes it important to require more information on individual studies in meta-analyses is the increased popularity of meta-meta-analyses, where the average effect sizes from whole meta-analyses are averaged. These have even more potential for trouble than the worst statistics-only reviews, because it is extremely unlikely that many readers will follow the citations to each included meta-analysis and then follow those citations to look for individual studies. It would be awfully helpful if readers or reviewers could trust the individual meta-analyses (and therefore their averages), or at least see for themselves.

As evidence takes on greater importance, this would be a good time to discuss reasonable standards for meta-analyses. Otherwise, we’ll be rallying balls uselessly against walls forever.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Glass, G., & Smith, M. L. (1978). Meta-analysis of research on the relationship of class size and achievement. San Francisco: Far West Laboratory for Educational Research and Development.

Slavin, R. E. (1986). Best-evidence synthesis: An alternative to meta-analytic and traditional reviews. Educational Researcher, 15 (9), 5-11.

Slavin, R. E. (1989). Class size and student achievement: Small effects of small classes. Educational Psychologist, 24, 99-110.

When Developers Commission Studies, What Develops?

I have the greatest respect for commercial developers and disseminators of educational programs, software, and professional development. As individuals, I think they genuinely want to improve the practice of education, and help produce better outcomes for children. However, most developers are for-profit companies, and they have shareholders who are focused on the bottom line. So when developers carry out evaluations, or commission evaluation companies to do so on their behalf, perhaps it’s best to keep in mind a bit of dialogue from a Marx Brothers movie. Someone asks Groucho if Chico is honest. “Sure,” says Groucho, “As long as you watch him!”

A healthy role for developers in evidence-based reform in education is desirable. Publishers, software developers, and other commercial companies have a lot of capital, and a strong motivation to create new products with evidence of effectiveness that will stand up to scrutiny. In medicine, most advances in practical drugs and treatments are made by drug companies. If you’re a cynic, this may sound disturbing. But for a long time, the federal government has encouraged drug companies to do development and evaluation of new drugs, while holding them to strict rules about what counts as conclusive evidence. Basically, the government says, following Groucho, “Are drug companies honest? Sure, as long as you watch ’em.”

In our field, we may want to think about how to do this. As one contribution, my colleague Betsy Wolf did some interesting research on outcomes of studies sponsored by developers, compared to those conducted by independent third parties. She looked at all reading/literacy and math studies listed on the What Works Clearinghouse database. The first thing she found was very disturbing. Sure enough, the effect sizes for the developer-commissioned studies (ES = +0.27, n=73) were twice as large as those for independent studies (ES = +0.13, n=96). That’s a huge difference.

Being a curious person, Betsy wanted to know why developer-commissioned studies had effect sizes that were so much larger than independent ones. We now know a lot about study characteristics that inflate effect sizes. The most inflationary are small sample size, use of measures made by researchers or developers (rather than independent measures), and use of quasi-experiments instead of randomized designs. Developer-commissioned studies were in fact much more likely to use researcher/developer-made measures (29% in developer-commissioned vs. 8% in independent studies), and to use quasi-experiments rather than randomized designs (51% quasi-experiments for developer-commissioned studies vs. 15% for independent studies). However, sample sizes were similar in developer-commissioned and independent studies. And most surprisingly, statistically controlling for all of these factors did not reduce the developer effect by very much.

If there is so much inflation of effect sizes in developer-commissioned studies, then how come controlling for the largest factors that usually cause effect size inflation does not explain the developer effect?

There is a possible reason for this, which Betsy cautiously advances (since it cannot be proven). Perhaps the reason that effect sizes are inflated in developer-commissioned studies is not due to the nature of the studies we can find, but to the studies we cannot find. There has long been recognition of what is called the “file drawer effect,” which happens when studies that do not obtain a positive outcome disappear (into a file drawer). Perhaps developers are especially likely to hide disappointing findings. Unlike academic studies, which are likely to exist as technical reports or dissertations, perhaps commercial companies have no incentive to make null findings findable in any form.
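A small simulation shows how a file drawer alone can inflate an average. The Python sketch below is an illustration under stated assumptions, not a model of Betsy's data: the true effect is set to the +0.13 independent-study average mentioned above, while the noise level, the number of studies, and the rule that every estimate at or below zero disappears are all invented.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_EFFECT = 0.13        # assumption: borrowed from the independent-study average
SAMPLING_SD = 0.20        # hypothetical study-to-study noise
N_STUDIES = 2000          # hypothetical number of studies carried out
VISIBLE_ABOVE = 0.0       # hypothetical rule: estimates at or below this vanish

# Each simulated study observes the true effect plus random noise.
observed = [random.gauss(TRUE_EFFECT, SAMPLING_SD) for _ in range(N_STUDIES)]

# The file drawer: only studies with positive estimates remain findable.
findable = [es for es in observed if es > VISIBLE_ABOVE]

mean_all = mean(observed)
mean_findable = mean(findable)

print(f"Studies carried out: {len(observed)}, findable: {len(findable)}")
print(f"Mean of all studies: {mean_all:+.2f}")
print(f"Mean of findable studies: {mean_findable:+.2f}")
```

Nothing about the individual studies is changed here. The findable studies are perfectly honest, yet their average overstates the true effect because the disappointing ones are missing, and no statistical control for study characteristics would detect that.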

This may not be true, or it may be true of some but not other developers. But if government is going to start taking evidence a lot more seriously, as it has done with the ESSA evidence standards (see www.evidenceforessa.org), it is important to prevent developers, or any researchers, from hiding their null findings.

There is a solution to this problem that is heading rapidly in our direction. This is pre-registration. In pre-registration, researchers or evaluators must file the design, measures, and analyses to be used in a study before it begins. Perhaps most importantly, pre-registration announces that a study exists, or will exist soon. If a developer pre-registered a study but that study never showed up in the literature, this might be a cause for losing faith in the developer. Imagine that the What Works Clearinghouse, Evidence for ESSA, and journals refused to accept research reports on programs unless the study had been pre-registered, and unless all other studies of the program were made available.
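In practice, the check that pre-registration makes possible is just a comparison of two lists: the studies a developer registered and the studies that were eventually reported. Here is a minimal Python sketch of that idea. The developer names, study IDs, and data layout are all hypothetical, and no real registry works exactly this way.

```python
# Hypothetical registry: study IDs each developer pre-registered.
registered = {
    "developer_x": {"X-001", "X-002", "X-003"},
    "developer_y": {"Y-001", "Y-002"},
}

# Hypothetical literature: study IDs that were eventually reported.
reported = {
    "developer_x": {"X-001"},
    "developer_y": {"Y-001", "Y-002"},
}


def missing_studies(registered, reported):
    """Return {developer: sorted IDs that were registered but never reported}."""
    missing = {}
    for developer, study_ids in registered.items():
        unreported = study_ids - reported.get(developer, set())
        if unreported:
            missing[developer] = sorted(unreported)
    return missing


for developer, ids in missing_studies(registered, reported).items():
    print(f"{developer}: registered but never reported: {', '.join(ids)}")
```

Without a registry, the left-hand list does not exist, and there is nothing to compare the literature against.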

Some areas of medicine use pre-registration, and the Society for Research on Educational Effectiveness is moving toward introducing a pre-registration process for education. Use of pre-registration and other safeguards could be a boon to commercial developers, as it is to drug companies, because it could build public confidence in developer-sponsored research. Admittedly, it would take many years and/or a lot more investment in educational research to make this practical, but there are concrete steps we could take in that direction.

I’m not sure I see any reason we shouldn’t move toward pre-registration. It would be good for Groucho, good for Chico, and good for kids. And that’s good enough for me!

Photo credit: By Paramount Pictures (source) [Public domain], via Wikimedia Commons

Effect Sizes and the 10-Foot Man

If you ever go into the Ripley’s Believe It or Not Museum in Baltimore, you will be greeted at the entrance by a statue of the tallest man who ever lived, Robert Pershing Wadlow, a gentle giant at 8 feet, 11 inches in his stocking feet. Kids and adults love to get their pictures taken standing by him, to provide a bit of perspective.

I bring up Mr. Wadlow to explain a phrase I use whenever my colleagues come up with an effect size of more than 1.00. “That’s a 10-foot man,” I say. What I mean, of course, is that while it is not impossible that there could be a 10-foot man someday, it is extremely unlikely, because there has never been a man that tall in all of history. If someone reports seeing one, they are probably mistaken.

In the case of effect sizes, you will never, or almost never, see an effect size of more than +1.00, assuming the following reasonable conditions:

  1. The effect size compares experimental and control groups (i.e., it is not pre-post).
  2. The experimental and control groups started at the same level, or they started at similar levels and researchers statistically controlled for pretest differences.
  3. The measures involved were independent of the researcher and the treatment, not made by the developers or researchers. The test was not given by the teachers to their own students.
  4. The treatment was provided by ordinary teachers, not by researchers, and could in principle be replicated widely in ordinary schools. The experiment had a duration of at least 12 weeks.
  5. There were at least 30 students and 2 teachers in each treatment group (experimental and control).

If these conditions are met, the chances of finding effect sizes of more than +1.00 are about the same as the chances of finding a 10-foot man. That is, zero.
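The five conditions can be written down as a simple checklist. The Python sketch below is one possible encoding, with field names I made up for the purpose. The thresholds (12 weeks, 30 students, 2 teachers, +1.00) come from the list above, while the two example studies are invented apart from the +0.86 and 2.0 effect sizes discussed in this post.

```python
def failed_conditions(study):
    """Return the numbers of the conditions (1-5 above) that a study fails."""
    failed = []
    if not study["experimental_vs_control"]:
        failed.append(1)
    if not study["equivalent_at_pretest"]:
        failed.append(2)
    if not study["independent_measure"] or study["tested_by_own_teachers"]:
        failed.append(3)
    if not study["ordinary_teachers"] or study["duration_weeks"] < 12:
        failed.append(4)
    if study["students_per_group"] < 30 or study["teachers_per_group"] < 2:
        failed.append(5)
    return failed


def review(study):
    """Classify a reported effect size in light of the five conditions."""
    failed = failed_conditions(study)
    if failed:
        return "not comparable: fails condition(s) " + ", ".join(map(str, failed))
    if study["effect_size"] > 1.00:
        return "10-foot man: recheck before believing it"
    return "plausible"


# Hypothetical study resembling a brief, small, researcher-run demonstration.
demonstration = {
    "effect_size": 2.0, "experimental_vs_control": True,
    "equivalent_at_pretest": True, "independent_measure": False,
    "tested_by_own_teachers": False, "ordinary_teachers": False,
    "duration_weeks": 3, "students_per_group": 20, "teachers_per_group": 1,
}

# Hypothetical study that meets every condition.
tutoring = {
    "effect_size": 0.86, "experimental_vs_control": True,
    "equivalent_at_pretest": True, "independent_measure": True,
    "tested_by_own_teachers": False, "ordinary_teachers": True,
    "duration_weeks": 30, "students_per_group": 60, "teachers_per_group": 4,
}

print(review(demonstration))
print(review(tutoring))
```

A checklist like this cannot judge a study's quality. It only separates effect sizes that can fairly be compared with one another from those that cannot.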

I was thinking about the 10-foot man when I was recently asked by a reporter about the “two sigma effect” claimed by Benjamin Bloom and much discussed in the 1970s and 1980s. Bloom’s students did a series of experiments in which students were taught about a topic none of them knew anything about, usually principles of sailing. After a short period, students were tested. Those who did not achieve at least 80% (defined as “mastery”) on the tests were tutored by University of Chicago graduate students long enough to ensure that every tutored student reached mastery. The purpose of this demonstration was to make a claim that every student could learn whatever we wanted to teach them, and the only variable was instructional time, as some students need more time to learn than others. In a system in which enough time could be given to all, “ability” would disappear as a factor in outcomes. Also, in comparison to control groups who were not taught about sailing at all, the effect size was often more than 2.0, or two sigma. That’s why this principle was called the “two sigma effect.” Doesn’t the two sigma effect violate my 10-foot man principle?

No, it does not. The two sigma studies used experimenter-made tests of content taught to the experimental but not control groups. They used University of Chicago graduate students providing far more tutoring (as a percentage of initial instruction) than any school could ever provide. The studies were very brief and sample sizes were small. The two sigma experiments were designed to prove a point, not to evaluate a feasible educational method.

A more recent example of the 10-foot man principle is found in Visible Learning, the currently fashionable book by John Hattie claiming huge effect sizes for all sorts of educational treatments. Hattie asks the reader to ignore any educational treatment with an effect size of less than +0.40, and reports many whole categories of teaching methods with average effect sizes of more than +1.00. How can this be?

The answer is that such effect sizes, like two sigma, do not incorporate the conditions I laid out. Instead, Hattie throws into his reviews entire meta-analyses which may include pre-post studies, studies using researcher-made measures, studies with tiny samples, and so on. For practicing educators, such effect sizes are useless. An educator knows that all children grow from pre- to posttest. They would not (and should not) accept measures made by researchers. The largest known effect sizes that do meet the above conditions are one-to-one tutoring studies with effect sizes up to +0.86. Still not +1.00. What could be more effective than the best of one-to-one tutoring?

It’s fun to visit Mr. Wadlow at the museum, and to imagine what an even taller man could do on a basketball team, for example. But if you see a 10-foot man at Ripley’s Believe It or Not, or anywhere else, here’s my suggestion. Don’t believe it. And if you visit a museum of famous effect sizes that displays a whopper effect size of +1.00, don’t believe that, either. It doesn’t matter how big effect sizes are if they are not valid.

A 10-foot man would be a curiosity. An effect size of +1.00 is a distraction. Our work on evidence is too important to spend our time looking for 10-foot men, or effect sizes of +1.00, that don’t exist.

Photo credit: [Public domain], via Wikimedia Commons
