First There Must be Love. Then There Must be Technique.

I recently went to Barcelona. This was my third time in this wonderful city, and for the third time I visited La Sagrada Familia, Antoni Gaudi’s breathtaking church. It was begun in the 1880s, and Gaudi worked on it from the time he was 31 until he died in 1926 at 74. It is due to be completed in 2026.

Every time I go, La Sagrada Familia has grown even more astonishing. In the nave, massive columns branching into tree shapes hold up the spectacular roof. The architecture is extremely creative, and wonders lie around every corner.


I visited a new museum under the church. At the entrance, it had a Gaudi quote:

First there must be love.

Then there must be technique.

This quote sums up La Sagrada Familia. Gaudi used complex mathematics to plan his constructions. He was a master of technique. But he knew that it all meant nothing without love.

In writing about educational research, I try to remind my readers of this from time to time. There is much technique to master in creating educational programs, evaluating them, and fairly summarizing their effects. There is even more technique in implementing proven programs in schools and classrooms, and in creating policies to support use of proven programs. But what Gaudi reminds us of is just as essential in our field as it was in his. We must care about technique because we care about children. Caring about technique just for its own sake is of little value. Too many children in our schools are failing to learn adequately. We cannot say, “That’s not my problem, I’m a statistician,” or “that’s not my problem, I’m a policymaker,” or “that’s not my problem, I’m an economist.” If we love children and we know that our research can help them, then it’s all of our problems. All of us go into education to solve real problems in real classrooms. That’s the structure we are all building together over many years. Building this structure takes technique, and the skilled efforts of many researchers, developers, statisticians, superintendents, principals, and teachers.

Each of us brings his or her own skills and efforts to this task. None of us will live to see our structure completed, because education keeps growing in techniques and capability. But as Gaudi reminds us, it’s useful to stop from time to time and remember why we do what we do, and for whom.

Photo credit: By Txllxt TxllxT [CC BY-SA 4.0  (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Teachers’ Crucial Role in School Health

These days, many schools are becoming “community schools,” or centering on the “whole child.”  A focus of these efforts is often on physical health, especially vision and hearing. As readers of this blog may know, our group at Johns Hopkins School of Education is collaborating on creating, evaluating, and disseminating a strategy, called Vision for Baltimore, to provide every Baltimore child in grades PK to 8 who needs them with eyeglasses. Our partners are the Johns Hopkins Hospital’s Wilmer Eye Institute, the Baltimore City Schools, the Baltimore City Health Department, Vision to Learn, a philanthropy that provides vision vans to do vision testing, and Warby Parker, which provides children with free eyeglasses.  We are two years into the project, and we are learning a lot.

The most important news is this:  you cannot hope to solve the substantial problem of disadvantaged children who need glasses but don’t have them by just providing vision testing and glasses.  There are too many steps to the solution.  If things go wrong and are not corrected at any step, children end up without glasses.   Children need to be screened, then (in most places) they need parent permission, then they need to be tested, then they need to get correct and attractive glasses, then they have to wear them every day as needed, and then they have to replace them if they are lost or broken.


School of Education graduate student Amanda Inns is writing her dissertation on many of these issues. By listening to teachers, learning from our partners, and trial and error, Amanda, ophthalmologist Megan Collins, Lead School Vision Advocate Christine Levy, and others involved in the project came up with practical solutions to a number of problems. I’ll tell you about them in a moment, but here’s the overarching discovery that Amanda reports:

Nothing can stop a motivated and informed teacher.

What Amanda found out is that all aspects of the vision program tended to work when teachers and principals were engaged and empowered to try to ensure that all children who needed glasses got them, and they tended not to work if teachers were not involved. A moment's reflection tells you why. Unlike anyone else, teachers see kids every day. Unlike anyone else, they are likely to have good relationships with parents. In a city like Baltimore, parents may not answer calls from most people in their child's school, and are even less likely to answer calls from any other city agency. But they will answer their own child's teacher. In fact, the teacher may be the only person in the school, or indeed in the entire city, who knows the parents' real, current phone number.

In places like Baltimore, many parents do not trust most city institutions. But if they trust anyone, it’s their own child’s teacher.

Cost-effective citywide systems are needed to solve widespread health problems, such as vision (about 30% of Baltimore children need glasses, and few had them before Vision for Baltimore began).  Building such systems should start, we’ve learned, with the question of how teachers can be enabled and supported to ensure that services are actually reaching children and parents. Then you have to work backward to fill in the rest of the system.

Obviously, teachers cannot do it alone. In Vision for Baltimore, the Health Department expanded its screening to include all grades, not just state-mandated grades (PK, 1, 8, and new entrants).  Vision to Learn’s vision vans and vision professionals are extremely effective at providing testing. Free eyeglasses, or ones paid for by Medicaid, are essential. The fact that kids (and teachers) love Warby Parker glasses is of no small consequence. School nurses and parent coordinators play a key role across all health issues, not just vision. But even with these services, universal provision of eyeglasses to students who need them, long-term use, and replacement of lost glasses, are still not guaranteed.

Our basic approach is to provide services and incentives to help teachers be as effective as possible in supporting vision health, using their special position to solve key problems. We hired three School Vision Advocates (SVAs) to work with about 43 schools entering the vision system each year. The SVAs work with teachers and principals to jointly plan how to ensure that all students who need them will get, use, and maintain their glasses. They interact with principals and staff members to build excitement about eyeglasses, offering posters to place around each school. They organize data to make it easy for teachers to know which students need glasses, and which students still need parent permission forms. They provide fun prizes, $20 worth of school supplies, to all teachers in schools in which 80% of parents sent in their forms.  They get to know school office staff, also critical to such efforts.  They listen to teachers and get their ideas about how to adapt their approaches to the school’s needs.  They do observations in classes to see what percentage of students are wearing their glasses. They arrange to replace lost or broken glasses.  They advocate for the program in the district office, and find ways to get the superintendent and other district leaders to show support for the teachers’ activities directed toward vision.

Amanda’s research found that introducing SVAs into schools had substantial impacts on rates of parent permission, and adding the school prizes added another substantial amount. Many students are wearing their glasses. There is still more to do to get to 100%, but the schools have made unprecedented gains.

Urban teachers are very busy, and adding vision to their list of responsibilities can only work if the teachers see the value, feel respected and engaged, and have help in doing their part. What Amanda’s research illustrates is how modest investments in friendly and capable people targeting high-leverage activities can make a big difference across an entire city.

Ensuring that all students have good vision is critical, and a hit-or-miss strategy is not sufficient. Schools need systems to bring cost-effective services to thousands of kids who need help with health issues that affect very large numbers, such as vision, hearing, and asthma. Teachers still must focus first on their roles as instructors, but with help, they can also provide essential assistance to build the health of their students.  No system to solve health problems that require daily, long-term monitoring of children’s behavior can work at this scale in urban schools without engaging teachers.

Photo credit: By SSgt Sara Csurilla [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

“But It Worked in the Lab!” How Lab Research Misleads Educators

In researching John Hattie's meta-meta-analyses, and digging into the original studies, I discovered one underlying factor that more than anything explains why he consistently comes up with greatly inflated effect sizes: most studies in the meta-analyses that he synthesizes are brief, small, artificial lab studies. And lab studies produce very large effect sizes that have little if any relevance to classroom practice.

This discovery reminds me of one of the oldest science jokes in existence: (One scientist to another): “Your treatment worked very well in practice, but how will it work in the lab?” (Or “…in theory?”)


The point of the joke, of course, is to poke fun at scientists more interested in theory than in practical impacts on real problems. Personally, I have great respect for theory and lab studies. My very first publication as a psychology undergraduate involved an experiment on rats.

Now, however, I work in a rapidly growing field that applies scientific methods to the study and improvement of classroom practice.  In our field, theory also has an important role. But lab studies?  Not so much.

A lab study in education is, in my view, any experiment that tests a treatment so brief, so small, or so artificial that it could never be used all year. Also, an evaluation of any treatment that could never be replicated, such as a technology program in which a graduate student is standing by every four students every day of the experiment, or a tutoring program in which the study author or his or her students provide the tutoring, might be considered a lab study, even if it went on for several months.

Our field exists to try to find practical solutions to practical problems in an applied discipline.  Lab studies have little importance in this process, because they are designed to eliminate all factors other than the variables of interest. A one-hour study in which children are asked to do some task under very constrained circumstances may produce very interesting findings, but cannot recommend practices for real teachers in real classrooms.  Findings of lab studies may suggest practical treatments, but by themselves they never, ever validate practices for classroom use.

Lab studies are almost invariably doomed to success. Their conditions are carefully set up to support a given theory. Because they are small, brief, and highly controlled, they produce huge effect sizes. (Because they are relatively easy and inexpensive to do, it is also very easy to discard them if they do not work out, contributing to the universally reported tendency of studies appearing in published sources to report much higher effects than reports in unpublished sources).  Lab studies are so common not only because researchers believe in them, but also because they are easy and inexpensive to do, while meaningful field experiments are difficult and expensive.   Need a publication?  Randomly assign your college sophomores to two artificial treatments and set up an experiment that cannot fail to show significant differences.  Need a dissertation topic?  Do the same in your third-grade class, or in your friend’s tenth grade English class.  Working with some undergraduates, we once did three lab studies in a single day. All were published. As with my own sophomore rat study, lab experiments are a good opportunity to learn to do research.  But that does not make them relevant to practice, even if they happen to take place in a school building.

By doing meta-analyses, or meta-meta-analyses, Hattie and others who do similar reviews obscure the fact that many and usually most of the studies they include are very brief, very small, and very artificial, and therefore produce very inflated effect sizes.  They do this by covering over the relevant information with numbers and statistics rather than information on individual studies, and by including such large numbers of studies that no one wants to dig deeper into them.  In Hattie’s case, he claims that Visible Learning meta-meta-analyses contain 52,637 individual studies.  Who wants to read 52,637 individual studies, only to find out that most are lab studies and have no direct bearing on classroom practice?  It is difficult for readers to do anything but assume that the 52,637 studies must have taken place in real classrooms, and achieved real outcomes over meaningful periods of time.  But in fact, the few that did this are overwhelmed by the thousands of lab studies that did not.

Educators have a right to data that are meaningful for the practice of education.  Anyone who recommends practices or programs for educators to use needs to be open about where that evidence comes from, so educators can judge for themselves whether or not one-hour or one-week studies under artificial conditions tell them anything about how they should teach. I think the question answers itself.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

John Hattie is Wrong

John Hattie is a professor at the University of Melbourne, Australia. He is famous for a book, Visible Learning, which claims to review every area of research that relates to teaching and learning. He uses a method called “meta-meta-analysis,” averaging effect sizes from many meta-analyses. The book ranks factors from one to 138 in terms of their effect sizes on achievement measures. Hattie is a great speaker, and many educators love the clarity and simplicity of his approach. How wonderful to have every known variable reviewed and ranked!

However, operating on the principle that anything that looks to be too good to be true probably is, I looked into Visible Learning to try to understand why it reports such large effect sizes. My colleague, Marta Pellegrini from the University of Florence (Italy), helped me track down the evidence behind Hattie’s claims. And sure enough, Hattie is profoundly wrong. He is merely shoveling meta-analyses containing massive bias into meta-meta-analyses that reflect the same biases.


Part of Hattie’s appeal to educators is that his conclusions are so easy to understand. He even uses a system of dials with color-coded “zones,” where effect sizes of 0.00 to +0.15 are designated “developmental effects,” +0.15 to +0.40 “teacher effects” (i.e., what teachers can do without any special practices or programs), and +0.40 to +1.20 the “zone of desired effects.” Hattie makes a big deal of the magical effect size +0.40, the “hinge point,” recommending that educators essentially ignore factors or programs below that point, because they are no better than what teachers produce each year, from fall to spring, on their own. In Hattie’s view, an effect size from +0.15 to +0.40 is just the effect that “any teacher” could produce, in comparison to students not being in school at all. He says, “When teachers claim that they are having a positive effect on achievement or when a policy improves achievement, this is almost always a trivial claim: Virtually everything works. One only needs a pulse and we can improve achievement.” (Hattie, 2009, p. 16). An effect size of 0.00 to +0.15 is, he estimates, “what students could probably achieve if there were no schooling” (Hattie, 2009, p. 20). Yet this characterization of dials and zones misses the essential meaning of effect sizes, which are rarely used to measure the amount teachers’ students gain from fall to spring, but rather the amount students receiving a given treatment gained in comparison to gains made by similar students in a control group over the same period. So an effect size of, say, +0.15 or +0.25 could be very important.
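This definition of an effect size (a standardized mean difference between treatment and control groups, not a fall-to-spring gain) can be made concrete. Here is a minimal sketch; the posttest scores are invented purely for illustration:

```python
from math import sqrt
from statistics import mean, stdev

def effect_size(treatment, control):
    """Standardized mean difference (Cohen's d): the treatment
    group's mean minus the control group's mean, divided by the
    pooled standard deviation of the two groups."""
    n_t, n_c = len(treatment), len(control)
    pooled_sd = sqrt(((n_t - 1) * stdev(treatment) ** 2 +
                      (n_c - 1) * stdev(control) ** 2) / (n_t + n_c - 2))
    return (mean(treatment) - mean(control)) / pooled_sd

# Hypothetical posttest scores for two comparable groups.
treatment = [78, 82, 85, 90, 88, 76, 84, 91]
control   = [75, 80, 83, 86, 84, 73, 81, 87]
print(round(effect_size(treatment, control), 2))  # → 0.6
```

The key point is the denominator: because the difference is scaled by the standard deviation, an effect size is comparable across different tests and samples, which is exactly why comparing it to an unscaled "fall-to-spring gain" is a category error.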

Hattie’s core claims are these:

  • Almost everything works.
  • Any effect size less than +0.40 is ignorable.
  • It is possible to meaningfully rank educational factors in comparison to each other by averaging the findings of meta-analyses.

These claims appear appealing, simple, and understandable. But they are also wrong.

The essential problem with Hattie’s meta-meta-analyses is that they accept the results of the underlying meta-analyses without question. Yet many, perhaps most meta-analyses accept all sorts of individual studies of widely varying standards of quality. In Visible Learning, Hattie considers and then discards the possibility that there is anything wrong with individual meta-analyses, specifically rejecting the idea that the methods used in individual studies can greatly bias the findings.

To be fair, a great deal has been learned about the degree to which particular study characteristics bias study findings, always in a positive (i.e., inflated) direction. For example, there is now overwhelming evidence that effect sizes are significantly inflated in studies that have small sample sizes or brief durations, use measures made by researchers or developers, are published (vs. unpublished), or use quasi-experiments (vs. randomized experiments) (Cheung & Slavin, 2016). Many meta-analyses even include pre-post studies, or studies that do not have pretests, or that have pretest differences but fail to control for them. For example, I once criticized a meta-analysis of gifted education in which some studies compared students accepted into gifted programs to students rejected for those programs, controlling for nothing!

A huge problem with meta-meta-analysis is that until recently, meta-analysts rarely screened individual studies to remove those with fatal methodological flaws. Hattie himself rejects this procedure: “There is…no reason to throw out studies automatically because of lower quality” (Hattie, 2009, p. 11).

In order to understand what is going on in the underlying meta-analyses in a meta-meta-analysis, it is crucial to look all the way down to the individual studies. As a point of illustration, I examined Hattie’s own meta-meta-analysis of feedback, his third-ranked factor, with a mean effect size of +0.79. Hattie & Timperley (2007) located 12 meta-analyses. I found some of the ones with the highest mean effect sizes.

At a mean of +1.24, the meta-analysis with the largest effect size in the Hattie & Timperley (2007) review was a review of research on various reinforcement treatments for students in special education by Skiba, Casey, & Center (1985-86). The reviewers required use of single-subject designs, so the review consisted of a total of 35 students treated one at a time, across 25 studies. Yet it is known that single-subject designs produce much larger effect sizes than ordinary group designs (see What Works Clearinghouse, 2017).

The second-highest effect size, +1.13, was from a meta-analysis by Lysakowski & Walberg (1982), on instructional cues, participation, and corrective feedback. Not enough information is provided to understand the individual studies, but there is one interesting note. A study using a single-subject design, involving two students, had an effect size of 11.81. That is the equivalent of raising a child’s IQ from 100 to 277! It was “winsorized” to the next-highest value of 4.99 (which is like adding 75 IQ points). Many of the studies were correlational, with no controls for inputs, or had no control group, or were pre-post designs.
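The IQ comparison is simple arithmetic worth making explicit: effect sizes are expressed in standard-deviation units, and the IQ scale has a standard deviation of 15, so an effect size converts to IQ points by multiplying by 15. A quick check (the conversion is standard; the script itself is just illustrative):

```python
# Effect sizes are in standard-deviation units; the IQ scale has
# mean 100 and SD 15, so an effect size d corresponds to d * 15 points.
IQ_SD = 15

for d in (11.81, 4.99):
    points = d * IQ_SD
    print(f"d = {d}: +{points:.0f} IQ points (100 -> {100 + points:.0f})")
# d = 11.81: +177 IQ points (100 -> 277)
# d = 4.99:  +75 IQ points (100 -> 175)
```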

A meta-analysis by Rummel and Feinberg (1988), with a reported effect size of +0.60, is perhaps the most humorous inclusion in the Hattie & Timperley (2007) meta-meta-analysis. It consists entirely of brief lab studies of the degree to which being paid or otherwise reinforced for engaging in an activity that was already intrinsically motivating would reduce subjects’ later participation in that activity. Rummel & Feinberg (1988) reported a positive effect size if subjects later did less of the activity they were paid to do. The reviewers decided to code studies positively if their findings corresponded to the theory (i.e., that feedback and reinforcement reduce later participation in previously favored activities), but in fact their “positive” effect size of +0.60 indicates a negative effect of feedback on performance.

I could go on (and on), but I think you get the point. Hattie’s meta-meta-analyses grab big numbers from meta-analyses of all kinds with little regard to the meaning or quality of the original studies, or of the meta-analyses.

If you are familiar with the What Works Clearinghouse (2017), or our own Best-Evidence Syntheses (www.bestevidence.org) or Evidence for ESSA (www.evidenceforessa.org), you will know that individual studies, except for studies of one-to-one tutoring, almost never have effect sizes as large as +0.40, Hattie’s “hinge point.” This is because the WWC, BEE, and Evidence for ESSA all very carefully screen individual studies. We require control groups, controls for pretests, minimum sample sizes and durations, and measures independent of the treatments. Hattie applies no such standards, and in fact proclaims that they are not necessary.

It is possible, in fact essential, to make genuine progress using high-quality rigorous research to inform educational decisions. But first we must agree on what standards to apply.  Modest effect sizes from studies of practical treatments in real classrooms over meaningful periods of time on measures independent of the treatments tell us how much a replicable treatment will actually improve student achievement, in comparison to what would have been achieved otherwise. I would much rather use a program with an effect size of +0.15 from such studies than to use programs or practices found in studies with major flaws to have effect sizes of +0.79. If they understand the situation, I’m sure all educators would agree with me.

To create information that is fair and meaningful, meta-analysts cannot include studies of unknown and mostly low quality. Instead, they need to apply consistent standards of quality for each study, to look carefully at each one and judge its freedom from bias and major methodological flaws, as well as its relevance to practice. A meta-analysis cannot be any better than the studies that go into it. Hattie’s claims are deeply misleading because they are based on meta-analyses that themselves accepted studies of all levels of quality.

Evidence matters in education, now more than ever. Yet Hattie and others who uncritically accept all studies, good and bad, are undermining the value of evidence. This needs to stop if we are to make solid progress in educational practice and policy.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Hattie, J. (2009). Visible learning. New York, NY: Routledge.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 (1), 81-112.

Lysakowski, R., & Walberg, H. (1982). Instructional effects of cues, participation, and corrective feedback: A quantitative synthesis. American Educational Research Journal, 19 (4), 559-578.

Rummel, A., & Feinberg, R. (1988). Cognitive evaluation theory: A review of the literature. Social Behavior and Personality, 16 (2), 147-164.

Skiba, R., Casey, A., & Center, B. (1985-86). Nonaversive procedures in the treatment of classroom behavior problems. The Journal of Special Education, 19 (4), 459-481.

What Works Clearinghouse (2017). Procedures handbook 4.0. Washington, DC: Author.

Photo credit: U.S. Farm Security Administration [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Fads and Evidence in Education

York, England, has a famous racecourse. When I lived there I never saw a horse race, but I did see women in town for the race all dressed up and wearing very strange contraptions in their hair, called fascinators. The picture below shows a couple of examples. They could be twisted pieces of metal or wire or feathers or just about anything, as long as they were . . . well, fascinating. The women paraded down Micklegate, York’s main street, showing off their fancy clothes and especially, I’d guess, their fascinators.


The reason I bring up fascinators is to contrast the world of fashion and the world of science. In fashion, change happens constantly, but it is usually change for the sake of change. Fascinators, I’d assume, derived from hats, which women have been wearing to fancy horse races as long as there have been fancy horse races. Hats themselves change all the time. I’m guessing that what’s fascinating about a fascinator is that it maintains the concept of a racing-day hat in the most minimalist way possible, almost mocking the hat tradition while at the same time honoring it. The point is, fascinators got thinner because hats used to be giant, floral contraptions. In art, there was realism, and then there were all sorts of non-realist movements. In music there was Frank Sinatra, then Elvis, then the Beatles, then disco. Eventually there was hip hop. Change happens, but it’s all about taste. People get tired of what once was popular, so something new comes along.

Science-based fields have a totally different pattern of change. In medicine, engineering, agriculture, and other fields, evidence guides changes. These fields are not 100% fad-free, but ultimately, on big issues, evidence wins out. In these fields, there is plenty of high-quality evidence, and there are very serious consequences for making or not making evidence-based policies and practices. If someone develops an artificial heart valve that is 2% more effective than the existing valves, with no more side effects, surgeons will move toward that valve to save lives (and avoid lawsuits).

In education, which model do we follow? Very, very slowly we are beginning to consider evidence. But most often, our model of change is more like the fascinators. New trends in education take the schools by storm, and often a few years later, the opposite policy or practice will become popular. Over long periods, very similar policies and practices keep appearing, disappearing, and reappearing, perhaps under a different name.

It’s not that we don’t have evidence. We do, and more keeps coming every year. Yet our profession, by and large, prefers to rush from one enthusiasm to another, without the slightest interest in evidence.

Here’s an exercise you might enjoy. List the top ten things schools and districts are emphasizing right now. Put your list into a “time capsule” envelope and file it somewhere. Then take it out in five years, and then ten years. Will those same things be the emphasis in schools and districts then? To really punish yourself, write down the NAEP reading and math scores overall and by ethnic group at fourth and eighth grade. Will those scores be a lot better in five or ten years? Will gaps be diminishing? Not if current trends continue and if we continue to give only lip service to evidence.

Change + no evidence = fashion

Change + evidence = systematic improvement

We can make a different choice. But it will take real leadership. Until that leadership appears, we’ll be doing what we’ve always done, and the results will not change.

Isn’t that fascinating?

Photo credit: Both photos by Chris Phutully [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Meta-Analysis and Its Discontents

Everyone loves meta-analyses. We did an analysis of the most frequently opened articles on Best Evidence in Brief. Almost all of the most popular were meta-analyses. What’s so great about meta-analyses is that they condense a lot of evidence and synthesize it, so instead of just one study that might be atypical or incorrect, a meta-analysis seems authoritative, because it averages many individual studies to find the true effect of a given treatment or variable.

Meta-analyses can be wonderful summaries of useful information. But today I wanted to discuss how they can be misleading. Very misleading.

The problem is that there are no norms among journal editors or meta-analysts themselves about standards for including studies or, perhaps most importantly, how much or what kind of information needs to be reported about each individual study in a meta-analysis. Some meta-analyses are completely statistical. They report all sorts of statistics and very detailed information on exactly how the search for articles took place, but never say anything about even a single study. This is a problem for many reasons. Readers may have no real understanding of what the studies really say. Even if citations for the included studies are available, only a very motivated reader is going to go find any of them. Most meta-analyses do have a table listing studies, but the information in the table may be idiosyncratic or limited.

One reason all of this matters is that without clear information on each study, readers can be easily misled. I remember encountering this when meta-analysis first became popular in the 1980s. Gene Glass, who coined the very term, proposed some foundational procedures, and popularized the methods. Early on, he applied meta-analysis to determine the effects of class size, which by then had been studied several times and found to matter very little except in first grade. Reducing “class size” to one (i.e., one-to-one tutoring) also was known to make a big difference, but few people would include one-to-one tutoring in a review of class size. But Glass and Smith (1978) found a much higher effect, not limited to first grade or tutoring. It was a big deal at the time.

I wanted to understand what happened. I bought and read Glass’ book on class size, but it was nearly impossible to tell. Then I found, in an obscure appendix, a distribution of effect sizes. Most studies had effect sizes near zero, as I expected. But one had a huge effect size of +1.25! It was hard to tell which particular study accounted for this amazing effect, but I searched by process of elimination and finally found it.

It was a study of tennis.


The outcome measure was the ability to “rally a ball against a wall so many times in 30 seconds.” Not surprisingly, when there were “large class sizes,” most students got very few chances to practice, while in “small class sizes,” they did.

If you removed the clearly irrelevant tennis study, the average effect size for class sizes (other than tutoring) dropped to near zero, as reported in all other reviews (Slavin, 1989).

The problem went way beyond class size, of course. What was important, to me at least, was that Glass' presentation of the data made it very difficult to find out what was really going on. He had attractive and compelling graphs and charts showing effects of class size, but they all depended on that one tennis study, and there was no easy way to discover this.

Because of this review and several others appearing in the 1980s, I wrote an article criticizing numbers-only meta-analyses and arguing that reviewers should show all of the relevant information about the studies in their meta-analyses, and should even describe each study briefly to help readers understand what was happening. I made up a name for this approach: "best-evidence synthesis" (Slavin, 1986).

Neither the term nor the concept really took hold, I’m sad to say. You still see meta-analyses all the time that do not tell readers enough for them to know what’s really going on. Yet several developments have made the argument for something like best-evidence synthesis a lot more compelling.

One development is the increasing evidence that methodological features can be strongly correlated with effect sizes (Cheung & Slavin, 2016). The evidence is now overwhelming that effect sizes are greatly inflated when sample sizes are small, when study durations are brief, when measures are made by developers or researchers, or when quasi-experiments rather than randomized experiments are used, for example. Many meta-analyses check for the effects of these and other study characteristics, and may make adjustments if there are significant differences. But this is not sufficient, because in a particular meta-analysis, there may not be enough studies to make any study-level factors significant. For example, if Glass had tested “tennis vs. non-tennis,” there would have been no significant difference, because there was only one tennis study. Yet that one study dominated the means anyway. Eliminating studies using, for example, researcher/developer-made measures or very small sample sizes or very brief durations is one way to remove bias from meta-analyses, and this is what we do in our reviews. But at bare minimum, it is important to have enough information available in tables to enable readers or journal reviewers to look for such biasing factors so they can recompute or at least understand the main effects if they are so inclined.
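The tennis example is easy to sketch numerically. The effect sizes below are invented for illustration (they are not Glass and Smith's actual data): a handful of near-zero class-size studies plus one +1.25 outlier is enough to drag the overall mean well above zero, even though a "tennis vs. non-tennis" moderator test with a single tennis study could never reach significance.

```python
# Illustration only: invented effect sizes, not Glass & Smith's actual data.
# Six near-zero "class size" studies plus one +1.25 outlier (the tennis study).
import statistics

effect_sizes = [0.02, -0.05, 0.04, 0.00, -0.03, 0.02, 1.25]  # last = tennis

mean_with_tennis = statistics.mean(effect_sizes)
mean_without_tennis = statistics.mean(effect_sizes[:-1])

print(f"Mean effect size with tennis study:    {mean_with_tennis:+.2f}")
print(f"Mean effect size without tennis study: {mean_without_tennis:+.2f}")
```

One study out of seven accounts for essentially the entire average, which is why a table showing each study's effect size matters more than any summary statistic.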

The second development that makes it important to require more information on individual studies in meta-analyses is the increased popularity of meta-meta-analyses, where the average effect sizes from whole meta-analyses are averaged. These have even more potential for trouble than the worst statistics-only reviews, because it is extremely unlikely that many readers will follow the citations to each included meta-analysis and then follow those citations to look for individual studies. It would be awfully helpful if readers or reviewers could trust the individual meta-analyses (and therefore their averages), or at least see for themselves.

As evidence takes on greater importance, this would be a good time to discuss reasonable standards for meta-analyses. Otherwise, we’ll be rallying balls uselessly against walls forever.

References

Cheung, A., & Slavin, R. E. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.

Glass, G., & Smith, M. L. (1978). Meta-analysis of research on the relationship of class size and achievement. San Francisco: Far West Laboratory for Educational Research and Development.

Slavin, R. E. (1986). Best-evidence synthesis: An alternative to meta-analytic and traditional reviews. Educational Researcher, 15(9), 5-11.

Slavin, R. E. (1989). Class size and student achievement: Small effects of small classes. Educational Psychologist, 24, 99-110.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

When Developers Commission Studies, What Develops?

I have the greatest respect for commercial developers and disseminators of educational programs, software, and professional development. As individuals, I think they genuinely want to improve the practice of education, and help produce better outcomes for children. However, most developers are for-profit companies, and they have shareholders who are focused on the bottom line. So when developers carry out evaluations, or commission evaluation companies to do so on their behalf, perhaps it’s best to keep in mind a bit of dialogue from a Marx Brothers movie. Someone asks Groucho if Chico is honest. “Sure,” says Groucho, “As long as you watch him!”

[Image: the Marx Brothers]

A healthy role for developers in evidence-based reform in education is desirable. Publishers, software developers, and other commercial companies have a lot of capital, and a strong motivation to create new products with evidence of effectiveness that will stand up to scrutiny. In medicine, most advances in practical drugs and treatments are made by drug companies. If you're a cynic, this may sound disturbing. But the federal government has long encouraged drug companies to develop and evaluate new drugs, while enforcing strict rules about what counts as conclusive evidence. Basically, the government says, following Groucho, "Are drug companies honest? Sure, as long as you watch 'em."

In our field, we may want to think about how to do this. As one contribution, my colleague Betsy Wolf did some interesting research on outcomes of studies sponsored by developers, compared to those conducted by independent third parties. She looked at all reading/literacy and math studies listed on the What Works Clearinghouse database. The first thing she found was very disturbing. Sure enough, the effect sizes for the developer-commissioned studies (ES = +0.27, n=73) were twice as large as those for independent studies (ES = +0.13, n=96). That's a huge difference.

Being a curious person, Betsy wanted to know why developer-commissioned studies had effect sizes that were so much larger than independent ones. We now know a lot about study characteristics that inflate effect sizes. The most inflationary are small sample sizes, use of measures made by researchers or developers (rather than independent measures), and use of quasi-experiments instead of randomized designs. Developer-commissioned studies were in fact much more likely to use researcher/developer-made measures (29% vs. 8% in independent studies) and quasi-experiments rather than randomized designs (51% vs. 15%). However, sample sizes were similar in developer-commissioned and independent studies. And most surprising, statistically controlling for all of these factors did not reduce the developer effect by very much.

If there is so much inflation of effect sizes in developer-commissioned studies, then how come controlling for the largest factors that usually cause effect size inflation does not explain the developer effect?

There is a possible reason for this, which Betsy cautiously advances (since it cannot be proven). Perhaps the reason that effect sizes are inflated in developer-commissioned studies is not due to the nature of the studies we can find, but to the studies we cannot find. There has long been recognition of what is called the “file drawer effect,” which happens when studies that do not obtain a positive outcome disappear (into a file drawer). Perhaps developers are especially likely to hide disappointing findings. Unlike academic studies, which are likely to exist as technical reports or dissertations, perhaps commercial companies have no incentive to make null findings findable in any form.
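The file drawer mechanism is easy to see in a quick simulation. All of the numbers below are assumptions chosen purely for illustration: suppose a program's true effect size is +0.10, many small studies estimate it with a standard error of 0.20, and only the studies that reach statistical significance in the positive direction ever become findable.

```python
# Hypothetical simulation of the "file drawer" effect. All numbers are
# illustrative assumptions, not estimates from any real literature.
import random
import statistics

random.seed(1)
TRUE_EFFECT = 0.10   # the program's actual effect size
SE = 0.20            # standard error of a typical small study
Z_CRITICAL = 1.96    # two-tailed significance threshold (p < .05)

# Each study's estimate is the true effect plus sampling error.
estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(1000)]

# Only positive, statistically significant results leave the file drawer.
published = [es for es in estimates if es / SE > Z_CRITICAL]

mean_all = statistics.mean(estimates)
mean_published = statistics.mean(published)

print(f"Mean of all {len(estimates)} studies: {mean_all:+.2f}")
print(f"Mean of the {len(published)} 'published' studies: {mean_published:+.2f}")
```

Under these assumptions, the mean of the findable studies lands several times higher than the true effect, even though no individual study is wrong. Selective reporting alone is enough to produce the kind of inflation Betsy observed.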

This may not be true, or it may be true of some but not other developers. But if government is going to start taking evidence a lot more seriously, as it has done with the ESSA evidence standards (see www.evidenceforessa.org), it is important to prevent developers, or any researchers, from hiding their null findings.

There is a solution to this problem that is heading rapidly in our direction: pre-registration. In pre-registration, researchers or evaluators must file the design, measures, and planned analyses before a study begins, but perhaps most importantly, pre-registration announces that the study exists, or will exist soon. If a developer pre-registered a study that never showed up in the literature, this might be cause for losing faith in the developer. Imagine that the What Works Clearinghouse, Evidence for ESSA, and journals refused to accept research reports on programs unless the study had been pre-registered, and unless all other studies of the program were made available.

Some areas of medicine use pre-registration, and the Society for Research on Educational Effectiveness is moving toward introducing a pre-registration process for education. Use of pre-registration and other safeguards could be a boon to commercial developers, as it is to drug companies, because it could build public confidence in developer-sponsored research. Admittedly, it would take many years and/or a lot more investment in educational research to make this practical, but there are concrete steps we could take in that direction.

I’m not sure I see any reason we shouldn’t move toward pre-registration. It would be good for Groucho, good for Chico, and good for kids. And that’s good enough for me!

Photo credit: By Paramount Pictures (source) [Public domain], via Wikimedia Commons
