Bad Science II: Brief, Small, and Artificial Studies

“We learned from correlational research that students who speak Latin do better in school. So this year we’re teaching everything in Latin.”

The oldest joke in academia goes like this. A professor is shown the results of an impressive experiment. “That may work in practice,” she says, “but how will it work in the laboratory?”

For practitioners trying to make sense of the findings of educational research, this is no laughing matter. They are often left to figure out whether or not there is meaningful evidence supporting a given practice or policy. Yet all too often academics report findings from experiments that are too brief, too small, and/or too artificial to be reliable for making educational decisions.

Looking at the original articles, this problem is easy to see. Would you use or recommend a classroom management approach that has been successfully evaluated in a one-hour experiment? Or one evaluated with only 20 students? Or evaluated in a situation in which teachers in the experimental group had graduate students helping them in class every day?

The problem comes when busy educators or researchers rely on reviews of research. The reviews may make sweeping statements about the effects of various practices based on very brief, small, or artificial experiments, yet a lot of detective work may be necessary to find this out. Years ago, I was re-analyzing a review of research on class size and found one study with a far larger effect than all others. After much sleuthing I found out why: It was a study of tennis instruction, where students in larger tennis groups get a lot less court time.

So what should a reader do? Some reviews, including Social Programs that Work, Blueprints for Violence Prevention, and our own Best Evidence Encyclopedia, take sample size, duration, and artificiality into account. Otherwise, if you want to know for sure, you’ll have to put on your own deerstalker and do your own detective work, finding the essential experiments that took place in real schools over real periods of time, under realistic conditions. Evidence-based reform in education won’t really take hold until readers can consistently find reliable, easily interpretable and unbiased information on practical programs and practices available to them.

In case you missed last week’s first part in the series, check it out here: Bad Science I: Bad Measures

Illustration: Slavin, R.E. (2007). Educational research in the age of accountability. Boston: Allyn & Bacon. Reprinted with permission of the author.

On High School Graduation Rates: Want to Buy My Bridge?

Francis Scott Key Bridge (Baltimore) By Artondra Hall [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons, edited for size

I happen to own the Francis Scott Key Bridge in Baltimore, pictured here. It’s lovely in itself, has beautiful views of downtown and the outer harbor, and rakes in more than $11 million in tolls each year. But I’m willing to sell it to you, cheap!

If you believe that I own a bridge in Baltimore, then let me try out an even more fantastic idea on you. Since 1992, the achievement of America’s 12th graders on NAEP reading and math tests has been unchanged. Yet high school graduation rates have been soaring. From 2006 to 2016, U.S. graduation rates increased from 73% to 84%, an all-time record. Does this sound plausible to you?

Recently, the Washington Post (https://www.washingtonpost.com/local/education/fbi-us-education-department-investigating-ballou-graduation-scandal/2018/02/02/b307e57c-07ab-11e8-b48c-b07fea957bd5_story.html?utm_term=.84c1176bb8ff) reported a scandal about graduation rates at Ballou High School in Washington, DC, a high-poverty school not known (in the past) for its graduation rates. In 2017, 100% of Ballou students graduated, and 100% were accepted into college. An investigation by radio station WAMU, however, found that a large proportion of the graduating seniors had very poor attendance, poor achievement, and other problems. In fact, the Post reported that one third of all graduating seniors in DC did not meet district graduation standards. Ballou’s principal and the DC Director of Secondary Schools resigned, and there are ongoing investigations. The FBI has recently gotten involved.

In response to these stories, teachers across America wrote to express their views. Almost without exception, the teachers said that the situation in their districts is similar to that in DC. They said they are pressured, even threatened, to promote and then graduate every student possible. Students who fail courses are often offered “credit recovery” programs to obtain their needed credits, and these were found in an investigation by the Los Angeles Times to have extremely low standards (https://robertslavinsblog.wordpress.com/2017/08/17/the-high-school-graduation-miracle/). Failing students may also be allowed to do projects or otherwise show their knowledge in alternative ways, but these are derided as “Mickey Mouse.” And then there are students like some of those at Ballou, who did not even bother to show up for credit recovery or Mickey Mouse, but were graduated anyway.

The point is, it’s not just Ballou. It’s not just DC. In high-poverty districts coast to coast, standards for graduation have declined. My colleague, Bob Balfanz, coined the term “dropout factories” many years ago to describe high schools, almost always serving high-poverty areas, that produced a high proportion of all dropouts nationwide. In response, our education system got right to work on what it does best: Change the numbers to make the problem appear to go away. The FBI might make an example of DC, but if DC is in fact doing what many high-poverty districts are doing throughout the country, is it fair to punish it disproportionately? It’s not up to me to judge the legalities or ethics involved, but clearly, the problem is much, much bigger.

Some people have argued with me on this issue. “Where’s the harm,” they ask, “in letting students graduate? So many of these students encounter serious barriers to educational success. Why not give them a break?”

I will admit to a sympathy for giving high school students who just barely miss standards legitimate opportunities to graduate, such as taking appropriately demanding makeup courses. But what is happening in DC and elsewhere is very far from this reasonable compromise with reality.

I have done some research in inner-city high schools. In just about every class, there are students who are actively engaged in lessons, and others who would become actively engaged if their teachers used proven programs (in my case it was cooperative learning). But even with the best programs, there were kids in the back of the class with headphones on, who were totally disengaged, no matter what the teacher did. And those were the ones who actually showed up at all.

The kids who were engaged, or became engaged because of excellent instruction, should have a path to graduation, one way or another. The rest should have every opportunity, encouragement, and assistance to reach this goal. Some will choose to take advantage, some will not, but that must be their choice, with appropriate consequences.

Making graduation too easy not only undermines the motivations of students (and teachers). It also undermines the motivation of the entire system to introduce and effectively implement effective programs, from preschool to 12th grade. If educators can keep doing what they’ve always done, knowing that numbers will be fiddled with at the end to make everything come out all right, then the whole system can and will lose a major institutional incentive for improvement.

The high dropout rate of inner-city schools is indeed a crisis. It needs to be treated as such: not a crisis of numbers, but a crisis encountered by hundreds of thousands of vulnerable, valuable students. Loosening standards and then declaring success, which every educator knows to be false, corrupts the system, undermining confidence in the numbers even when they are legitimate. It fosters cynicism that nothing can be done.

Is it too much to expect that we can create and implement effective strategies that would enable virtually all students to succeed on appropriate standards in elementary, middle, and high school, so that virtually all can meet rigorous requirements and walk across a stage, head held high, knowing that they truly attained what a high school diploma is supposed to certify?

If you agree that high school graduation standards have gone off the rails, it is not enough to demand tougher standards. You also have to advocate for and work for application of proven approaches to make deserved and meaningful graduation accessible to all.

On the other hand, if you think the graduation rate has legitimately skyrocketed in the absence of any corresponding improvement in reading or math achievement, please contact me at www.buy-my-bridge.com. It really is a lovely bridge.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Getting the Best Mileage from Proven Programs

Wouldn’t you love to have a car that gets 200 miles to the gallon? Or one that can go hundreds of miles on a battery charge? Or one that can accelerate from zero to sixty twice as fast as any on the road?

Such cars exist, but you can’t have them. They are experimental vehicles or race cars that can only be used on a track or in a lab. They may be made of exotic materials, or may not carry passengers or groceries, or may be dangerous on real roads.

In working on our Evidence for ESSA website (www.evidenceforessa.org), we see a lot of studies that are like these experimental cars. For example, there are studies of programs in which the researcher or her graduate students actually did the teaching, or in which students used innovative technology with one adult helper for every machine or every few machines. Such studies are fine for theory building or as pilots, but we do not accept them for Evidence for ESSA, because they could never be replicated in real schools.

However, there is a much more common situation to which we pay very close attention. These are studies in which, for example, teachers receive a great deal of training and coaching, but an amount that seems replicable, in principle. For example, we would reject a study in which the experimenters taught the program, but not one in which they taught ordinary teachers how to use the program.

In such studies, the problem comes in dissemination. If studies validating a program provided a lot of professional development, we would accept it only if the disseminator provides a similar level of professional development, and their estimates of cost and personnel take this level of professional development into account. We put on our website clear expectations that these services be provided at a level similar to what was provided in the research, if the positive outcomes seen in the research are to be obtained.

The problem is that disseminators often offer schools a form of the program that was never evaluated, to keep costs low. They know that schools don’t like to spend a lot on professional development, and they are concerned that if they require the needed levels of PD or other services or materials, schools won’t buy their program. At the extreme end of this, there are programs that were successfully evaluated using extensive professional development, whose developers then put the teacher’s manual on the web for schools to use for free.

A recent study of a program called Mathalicious illustrated the situation. Mathalicious is an on-line math course for middle school. An evaluation found that teachers randomly assigned to just get a license, with minimal training, did not obtain significant positive impacts, compared to a control group. Those who received extensive on-line training, however, did see a significant improvement in math scores, compared to controls.

When we write our program descriptions, we compare program implementation details in the research to what is said or required on the program’s website. If these do not match, within reason, we try to make it clear which key elements were necessary for success.

Going back to the car analogy, our procedures eliminate those amazing cars that can only operate on special tracks, but we accept cars that can run on streets, carry children and groceries, and generally do what cars are expected to do. But if outstanding cars require frequent recharging, or premium gasoline, or have other important requirements, we’ll say so, in consultation with the disseminator.

In our view, evidence in education is not for academics, it’s for kids. If there is no evidence that a program as disseminated benefits kids, we don’t want to mislead educators who are trying to use evidence to benefit their children.

Evidence-Based Does Not Equal Evidence-Proven

As I speak to educational leaders about using evidence to help them improve outcomes for students, there are two words I hear all the time that give me the fantods (as Mark Twain would say):

Evidence-based

I like the first word, “evidence,” just fine, but the second word, “based,” sort of negates the first one. The ESSA evidence standards require programs that are evidence-proven, not just evidence-based, for various purposes.

“Evidence-proven” means that a given program, practice, or policy has been put to the test. Ideally, students, teachers, or schools have been assigned at random to use the experimental program or to remain in a control group. The program is provided to the experimental group for a significant period of time, at least a semester, and then final performance on tests that are fair to both groups is compared, using appropriate statistics.

If your doctor gives you medicine, it is evidence-proven. It isn’t just the same color or flavor as something proven, and it isn’t just generally in line with what research suggests might be a good idea. Instead, it has been found to be effective, compared to current standards of care, in rigorous studies.

“Evidence-based,” on the other hand, is one of those wiggle words that educators love to use to indicate that they are up-to-date and know what’s expected, but don’t actually intend to do anything different from what they are doing now.

Evidence-based is today’s equivalent of “based on scientifically-based research” in No Child Left Behind. It sure sounded good, but what educational program or practice can’t be said to be “based on” some scientific principle?

In a recent Brookings article, Mark Dynarski wrote about state ESSA plans, and conversations he’s heard among educators. He says that the plans are loaded with the words “evidence-based,” but with little indication of what specific proven programs they plan to implement, or how they plan to identify, disseminate, implement, and evaluate them.

I hope the ESSA evidence standards give leaders in even a few states the knowledge and the courage to insist on evidence-proven programs, especially in very low-achieving “school improvement” schools that desperately need the very best approaches. I remain optimistic that ESSA can be used to expand evidence-proven practices. But will it in fact have this impact? That remains to be proven.

Higher Ponytails (And Researcher-Made Measures)

Some time ago, I coached my daughter’s fifth grade basketball team. I knew next to nothing about basketball (my sport was…well, chess), but fortunately my research assistant, Holly Roback, eagerly volunteered. She’d played basketball in college, so our girls got outstanding coaching. However, they got whammed. My assistant coach explained it after another disastrous game, “The other team’s ponytails were just higher than ours.” Basically, our girls were terrific at ball handling and free shots, but they came up short in the height department.

Now imagine that in addition to being our team’s coach I was also the league’s commissioner. Imagine that I changed the rules. From now on, lay-ups and jump shots were abolished, and the ball had to be passed three times from player to player before a team could score.

My new rules could be fairly and consistently enforced, but their entire effect would be to diminish the importance of height and enhance the importance of ball handling and set shots.

Of course, I could never get away with this. Every fifth grader, not to mention their parents and coaches, would immediately understand that my rule changes unfairly favored my own team, and disadvantaged theirs (at least the ones with the higher ponytails).

This blog is not about basketball, of course. It is about researcher-made measures or developer-made measures. (I’m using “researcher-made” to refer to both.) I’ve been writing a lot about such measures in various blogs on the What Works Clearinghouse (https://wordpress.com/post/robertslavinsblog.wordpress.com/795 and https://wordpress.com/post/robertslavinsblog.wordpress.com/792).

The reason I’m writing again about this topic is that I’ve gotten some criticism for my criticism of researcher-made measures, and I wanted to respond to these concerns.

First, here is my case, simply put. Measures made by researchers or developers are likely to favor whatever content was taught in the experimental group. I’m not in any way suggesting that researchers or developers are deliberately making measures to favor the experimental group. However, it usually works out that way. If the program teaches unusual content, no matter how laudable that content may be, and the control group never saw that content, then the potential for bias is obvious. If the experimental group was taught on computers and the control group was not, and the test was given on a computer, the bias is obvious. If the experimental treatment emphasized certain vocabulary, and the control group did not, then a test of those particular words has obvious bias. If a math program spends a lot of time teaching students to do mental rotations of shapes, and the control treatment never did such exercises, a test that includes mental rotations is obviously biased. In our BEE full-scale reviews of pre-K to 12 reading, math, and science programs, available at www.bestevidence.org, we have long excluded such measures, calling them “treatment-inherent.” The WWC calls such measures “over-aligned,” and says it excludes them.

However, the problem turns out to be much deeper. In a 2016 article in the Educational Researcher, Alan Cheung and I tested outcomes from all 645 studies in the BEE achievement reviews, and found that even after excluding treatment-inherent measures, measures from studies that were made by researchers or developers had effect sizes that were far higher than those for measures not made by researchers or developers, by a ratio of two to one (effect sizes = +0.40 for researcher-made measures, +0.20 for independent measures). Graduate student Marta Pellegrini more recently analyzed data from all WWC reading and math studies. The ratio among WWC studies was 2.7 to 1 (effect sizes = +0.52 for researcher-made measures, +0.19 for independent ones). Again, the WWC was supposed to have already removed over-aligned studies, all of which (I’d assume) were also researcher-made.

Some of my critics argue that because the WWC already excludes overaligned measures, they have already taken care of the problem. But if that were true, there would not be a ratio of 2.7 to 1 in effect sizes between researcher-made and independent measures, after removing measures considered by the WWC to be overaligned.
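For readers who want to check the arithmetic, here is a minimal sketch that recomputes the two ratios from the mean effect sizes quoted above. The dictionary labels and the function name are mine, chosen only for illustration; the four effect sizes are the only figures taken from the text.

```python
# Mean effect sizes quoted in the text, in standard deviation units.
# The labels are illustrative; only the four numbers come from the source.
reported = {
    "BEE reviews": {"researcher_made": 0.40, "independent": 0.20},
    "WWC reading and math": {"researcher_made": 0.52, "independent": 0.19},
}


def inflation_ratio(researcher_made: float, independent: float) -> float:
    """Mean researcher-made effect size divided by mean independent effect size."""
    return researcher_made / independent


ratios = {name: inflation_ratio(**sizes) for name, sizes in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} to 1")
```

If over-aligned measures accounted for the whole problem, removing them should bring these ratios close to 1 to 1. They stay at roughly 2 to 1 and 2.7 to 1.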

Other critics express concern that my analyses (of bias due to researcher-made measures) have only involved reading, math, and science measures, and the situation might be different for measures of social-emotional outcomes, for example, where appropriate measures may not exist.

I will admit that in areas other than achievement the issues are different, and I’ve written about them. So I’ll be happy to limit the simple version of “no researcher-made measures” to achievement measures. The problems of measuring social-emotional outcomes fairly are far more complex, and for another day.

Other critics express concern that even on achievement measures, there are situations in which appropriate measures don’t exist. That may be so, but in policy-oriented reviews such as the WWC or Evidence for ESSA, it’s hard to imagine that there would be no existing measures of reading, writing, math, science, or other achievement outcomes. An achievement objective so rarified that it has never been measured is probably not particularly relevant for policy or practice.

The WWC is not an academic journal, and it is not primarily intended for academics. If a researcher needs to develop a new measure to test a question of theoretical interest, they should do so by all means. But the findings from that measure should not be accepted or reported by the WWC, even if a journal might accept it.

Another version of this criticism is that researchers often have a strong argument that the program they are evaluating emphasizes standards that should be taught to all students, but are not. Therefore, enhanced performance on a (researcher-made) measure of the better standard is prima facie evidence of a positive program impact. This argument confuses the purpose of experimental evaluations with the purpose of standards. Standards exist to express what we want students to know and be able to do. Arguing for a given standard involves considerations of the needs of the economy, standards of other states or countries, norms of the profession, technological or social developments, and so on—but not comparisons of experimental groups scoring well on tests of a new proposed standard to control groups never exposed to content relating to that standard. It’s just not fair.

To get back to basketball, I could have argued that the rules should be changed to emphasize ball handling and reduce the importance of height. Perhaps this would be a good idea, for all I know. But what I could not do was change the rules to benefit my team. In the same way, researchers cannot make their own measures and then celebrate higher scores on them as indicating higher or better standards. As any fifth grader could tell you, advocating for better rules is fine, but changing the rules in the middle of the season is wrong.

“We Don’t Do Lists”

Watching the slow, uneven, uncertain rollout of the ESSA evidence standards gives me a mixture of hope and despair. The hope stems from the fact that from coast to coast, educational leaders are actually talking about proven programs and practices at all. That was certainly rare before ESSA. The despair stems from hearing many educational leaders trying to find the absolute least their states and districts can do to just barely comply with the law.

The ESSA evidence standards apply in particular to schools seeking school improvement funding, which are those in the lowest 5% of their states in academic performance. A previous program with a similar name but more capital letters, School Improvement, was used under NCLB, before ESSA. A large-scale evaluation by MDRC found that the earlier School Improvement made no difference in student achievement, despite billions of dollars in investments. So you’d imagine that this time around, educators responsible for school improvement would be eager to use the new law to introduce proven programs into their lowest-achieving schools. In fact, there are individual leaders, districts, and states who have exactly this intention, and may ultimately provide good examples to the rest. But they face substantial obstacles.

One of the obstacles I hear about often is an opposition among state departments of education to disseminating lists of proven programs. I very much understand and sympathize with their reluctance, as schools have been over-regulated for a long time. However, I do not see how the ESSA evidence standards can make much of a difference if everyone makes their own list of programs. Determining which studies meet ESSA evidence standards is difficult, and requires a great deal of knowledge about research (I know this, of course, because we do such reviews ourselves; see www.evidenceforessa.org).

Some say that they want programs that have been evaluated in their own states. But after taking into account demographics (e.g., urban/rural, ELL/not ELL), are state-to-state differences so great as to require different research in each? We used to work with a school located on the Ohio-Indiana border, which ran right through the building. Were there really programs that were effective on one side of the building but not on the other?

Further, state department leaders frequently complain that they have too few staff to adequately manage school improvement across their states. Should that capacity be concentrated on reviewing research to determine which programs meet ESSA evidence standards and which do not?

The irony of opposing lists for ESSA evidence standards is that most states are chock full of lists that restrict the textbooks, software, and professional development schools can select using state funds. These lists may focus on paper weight, binding, and other minimum quality issues, but they almost never have anything to do with evidence of effectiveness. One state asked us to review their textbook adoption lists for reading and math, grades K-12. Collectively, there were hundreds of books, but just a handful had even a shred of evidence of effectiveness.

Educational leaders are constantly buffeted by opposing interest groups, from politicians to school board members to leaders of unions, from PTA presidents to university presidents, to for-profit companies promoting their own materials and programs. Educational leaders need a consistent way to ensure that the decisions they make are in the best interests of children, not the often self-serving interests of adults. The ESSA evidence standards, if used wisely, give education leaders an opportunity to say to the whole cacophony of cries for special consideration, “I’d love to help you all, but we can only approve programs for our lowest-achieving schools that are known from rigorous research to benefit our children. We say this because it is the law, but also because we believe our children, and especially our lowest achievers, deserve the most effective programs, no matter what the law says.”

To back up such a radical statement, educational leaders need clarity about what their standards are and which specific programs meet those standards. Otherwise, they either have an “anything goes” strategy that in effect means that evidence does not matter, or they have competing vendors claiming an evidence base for their favored program. Lists of proven programs can disappoint those whose programs aren’t on the list, but they are at least clear and unambiguous, and communicate to those who want to add to the list exactly what kind of evidence they will need.

States or large districts can create lists of proven programs by starting with existing national lists (such as the What Works Clearinghouse or Evidence for ESSA) and then modifying them, perhaps by adding additional programs that meet the same standards and/or eliminating programs not available in a given location. Over time, existing or new programs can be added as new evidence appears. We, at Evidence for ESSA, are willing to review programs being considered by state or local educators for addition to their own lists, and we will do it for free and in about two weeks. Then we’ll add them to our national list if they qualify.

It is important to say that while lists are necessary, they are not sufficient. Thoughtful needs assessments, information on proven programs (such as effective methods fairs and visits to local users of proven programs), and planning for high-quality implementation of proven programs are also necessary. However, students in struggling schools cannot wait for every school, district, and state to reinvent the wheel. They need the best we can give them right now, while the field is working on even better solutions for the future.

Whether a state or district uses a national list, or starts with such a list and modifies it for its own purposes, a list of proven programs provides an excellent starting point for struggling schools. It plants a flag for all to see, one that says “Because this (state/district/school) is committed to the success of every child, we select and carefully implement programs known to work. Please join us in this enterprise.”

Swallowing Camels

The New Testament contains a wonderful quotation that I use often, because it unfortunately applies to so much of educational research:

Ye blind guides, which strain at a gnat, and swallow a camel (Matthew 23:24).

The point of the quotation, of course, is that we are often fastidious about minor (research) sins while readily accepting major ones.

In educational research, “swallowing camels” applies to studies accepted in top journals or by the What Works Clearinghouse (WWC) despite substantial flaws that lead to major bias, such as use of measures slanted toward the experimental group, or measures administered and scored by the teachers who implemented the treatment. “Straining at gnats” applies to concerns that, while worth attending to, have little potential for bias, yet are often reasons for rejection by journals or downgrading by the WWC. For example, our profession considers p<.05 to indicate statistical significance, while p<.051 should never be mentioned in polite company.

As my faithful readers know, I have written a series of blogs on problems with policies of the What Works Clearinghouse, such as acceptance of researcher/developer-made measures, failure to weight by sample size, use of “substantively important but not statistically significant” as a qualifying criterion, and several others. However, in this blog, I wanted to share with you some of the very worst, most egregious examples of studies that should never have seen the light of day, yet were accepted by the WWC and remain in it to this day. Accepting the WWC as gospel means swallowing these enormous and ugly camels, and I wanted to make sure that those who use the WWC at least think before they gulp.

Camel #1: DaisyQuest. DaisyQuest is a computer-based program for teaching phonological awareness to children in pre-K to Grade 1. The WWC gives DaisyQuest its highest rating, “positive,” for alphabetics, and ranks it eighth among literacy programs for grades pre-K to 1.

There were four studies of DaisyQuest accepted by the WWC. In each, half of the students received DaisyQuest in groups of 3-4, working with an experimenter. In two of the studies, control students never had their hands on a computer before they took the final tests on a computer. In the other two, control students used math software, so they at least got some experience with computers. The outcome tests were all made by the experimenters and all were closely aligned with the content of the software, with the exception of two Woodcock scales used in one of the studies. All studies used a measure called “Undersea Challenge” that closely resembled the DaisyQuest game format and was taken on the computer. All four studies also used the other researcher-made measures. None of the Woodcock measures showed statistically significant differences, but the researcher-made measures, especially Undersea Challenge and other specific tests of phonemic awareness, segmenting, and blending, did show substantial significant differences.

Recall that in the mid- to late-1990s, when the studies were done, students in preschool and kindergarten were unlikely to be getting any systematic teaching of phonemic awareness. So there is no reason to expect the control students to be learning anything that was tested on the posttests, and it is not surprising that effect sizes averaged +0.62. In the two studies in which control students had never touched a computer, effect sizes were +0.90 and +0.89, respectively.

Camel #2: Brady (1990) study of Reciprocal Teaching

Reciprocal Teaching is a program that teaches students comprehension skills, mostly using science and social studies texts. A 1990 dissertation by P.L. Brady evaluated Reciprocal Teaching in one school, in grades 5-8. The study involved only 12 students, randomly assigned to Reciprocal Teaching or control conditions. The one experimental class was taught by…wait for it…P.L. Brady. The measures included science, social studies, and daily comprehension tests related to the content taught in the Reciprocal Teaching group but not in the control group. They were created and scored by…(you guessed it) P.L. Brady. There were also two Gates-MacGinitie scales, but they had effect sizes much smaller than the researcher-made (and researcher-scored) tests. The Brady study met WWC standards for “potentially positive” because it had a mean effect size of more than +0.25 but was not statistically significant.

Camel #3: Schwartz (2005) study of Reading Recovery

Reading Recovery is a one-to-one tutoring program for first graders that has a strong tradition of rigorous research, including a recent large-scale study by May et al. (2016). However, one of the earlier studies of Reading Recovery, by Schwartz (2005), is hard to swallow, so to speak.

In this study, 47 Reading Recovery (RR) teachers across 14 states were asked by e-mail to choose two very similar, low-achieving first graders at the beginning of the year. One student was randomly assigned to receive RR, and one was assigned to the control group, to receive RR in the second semester.

Both students were pre- and posttested on the Observation Survey, a set of measures made by Marie Clay, the developer of RR. In addition, students were tested on Degrees of Reading Power, a standardized test.

The problems with this study mostly have to do with the fact that the teachers who administered pre- and posttests were the very same teachers who provided the tutoring. No researcher or independent tester ever visited the schools. Teachers obviously knew the child they personally tutored. I’m sure the teachers were honest and wanted to be accurate. However, they would have had a strong motivation to see that the outcomes looked good, because they could be seen as evaluations of their own tutoring, and could have had consequences for continuation of the program in their schools.

Most Observation Survey scales involve difficult judgments, so it’s easy to see how teachers’ ratings could be affected by their belief in Reading Recovery.

Further, ten of the 47 teachers never submitted any data. This is a very high rate of attrition within a single school year (21%). Could some teachers, fully aware of their students’ less-than-expected scores, have made some excuse and withheld their data? We’ll never know.

Also recall that most of the tests used in this study were from the Observation Survey made by Clay, which had effect sizes ranging up to +2.49 (!!!). However, on the independent Degrees of Reading Power, the non-significant effect size was only +0.14.

It is important to note that across these “camel” studies, all except Brady (1990) were published. So it was not only the WWC that was taken in.

These “camel” studies are far from unique, and they may not even be the very worst to be swallowed whole by the WWC. But they do give readers an idea of the depth of the problem. No researcher I know of would knowingly accept an experiment in which the control group had never used the equipment on which they were to be posttested, or one with 12 students in which the 6 experimentals were taught by the experimenter, or in which the teachers who tutored students also individually administered the posttests to experimentals and controls. Yet somehow, WWC standards and procedures led the WWC to accept these studies. Swallowing these camels should have caused the WWC a stomach ache of biblical proportions.

 

References

Brady, P. L. (1990). Improving the reading comprehension of middle school students through reciprocal teaching and semantic mapping strategies. Dissertation Abstracts International, 52 (03A), 230-860.

May, H., Sirinides, P., Gray, A., & Goldsworthy, H. (2016). Reading Recovery: An evaluation of the four-year i3 scale-up. Newark, DE: University of Delaware, Center for Research in Education and Social Policy.

Schwartz, R. M. (2005). Literacy learning of at-risk first grade students in the Reading Recovery early intervention. Journal of Educational Psychology, 97 (2), 257-267.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

“Substantively Important” Isn’t Substantive. It Also Isn’t Important

Since it began in 2002, the What Works Clearinghouse has played an important role in finding, rating, and publicizing findings of evaluations of educational programs. It performs a crucial function for evidence-based reform. For this very reason, it needs to be right. But in several important ways, it uses procedures that are indefensible and have a big impact on its conclusions.

One of these relates to a study rating called “substantively important-positive.” This refers to study outcomes with an effect size of at least +0.25, but that are not statistically significant. I’ve written about this before, but the WWC has recently released a database of information on its studies that makes it easy to analyze WWC data on a large scale, and we have learned a lot more about this topic.

Study outcomes rated as “substantively important-positive” can qualify a study as “potentially positive,” the second-highest WWC rating. “Substantively important-negative” findings (non-significant effect sizes less than -0.25) can cause a study to be rated “potentially negative.” Under current rules, a single “potentially negative” rating ensures that a program can never receive a rating better than “mixed,” even if other studies found hundreds of significant positive effects.

People who follow the WWC and know about “substantively important” may assume that it is a strange rule but one rarely applied in practice. That is not true.

My graduate student, Amanda Inns, has just done an analysis of WWC data from their own database, and if you are a big fan of the WWC, this is going to be a shock. Amanda has looked at all WWC-accepted reading and math studies. Among these, she found a total of 339 individual outcomes rated “positive” or “potentially positive.” Of these, 155 (46%) reached the “potentially positive” level only because they had effect sizes over +0.25, but were not statistically significant.

Another 36 outcomes were rated “negative” or “potentially negative.” 26 of these (72%) were categorized as “potentially negative” only because they had effect sizes less than -0.25 and were not significant. I’m sure patterns would be similar for subjects other than reading and math.

Put another way, almost half (48%) of outcomes rated positive/potentially positive or negative/potentially negative by the WWC were not statistically significant.

As one example of what I’m talking about, consider a program called The Expert Mathematician. It had just one study with only 70 students in 4 classrooms (2 experimental and 2 control). The WWC re-analyzed the data to account for clustering, and the outcomes were nowhere near statistically significant, though they were greater than +0.25. This tiny study, and this study alone, caused The Expert Mathematician to receive the WWC “potentially positive” rating and to be ranked seventh among all middle school math programs.

Similarly, Waterford Early Learning received a “potentially positive” rating based on a single tiny study with only 70 kindergarteners in 6 schools. The outcomes ranged from -0.71 to +1.11, and though the mean was more than +0.25, the outcome was far from significant. Yet this study alone put Waterford on the WWC list of proven kindergarten programs.
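For readers who want to check the arithmetic, the percentages quoted above can be re-derived from the counts reported in the text. The short Python sketch below does only that; the variable names are mine, and nothing here comes from the WWC database itself.

```python
# Counts exactly as reported in the text (WWC reading and math outcomes).
positive_total = 339    # outcomes rated "positive" or "potentially positive"
positive_nonsig = 155   # of those, non-significant with effect size over +0.25
negative_total = 36     # outcomes rated "negative" or "potentially negative"
negative_nonsig = 26    # of those, non-significant with effect size under -0.25


def share(part, whole):
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


pct_positive = share(positive_nonsig, positive_total)
pct_negative = share(negative_nonsig, negative_total)
pct_combined = share(positive_nonsig + negative_nonsig,
                     positive_total + negative_total)

print(pct_positive, pct_negative, pct_combined)  # 46 72 48
```

The combined figure is simply 181 non-significant outcomes out of 375 rated outcomes.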

I’m not taking any position on whether these particular programs are in fact effective. All I am saying is that these very small studies with non-significant outcomes say absolutely nothing of value about that question.

I’m sure that some of you nerdier readers who have followed me this far are saying to yourselves, “well, sure, these substantively important studies may not be statistically significant, but they are probably unbiased estimates of the true effect.”

More bad news. They are not. Not even close.

The problem, also revealed in Amanda Inns’ data, is that studies with large effect sizes but not statistical significance tend to have very small sample sizes (otherwise, they would have been significant). Across WWC reading and math studies that used individual-level assignment, median sample sizes were 48, 74, and 86 for substantively important, significant, and indeterminate (non-significant with ES < +0.25) outcomes, respectively. For cluster studies, the medians were 10, 17, and 33 clusters, respectively. In other words, “substantively important” outcomes came from markedly smaller samples than other outcomes, with medians only about 30% to 65% as large.
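To see why an effect size over +0.25 in a study this small would not be significant, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions of my own, not the WWC's: two equal-sized groups, individual-level assignment, no covariate adjustment, a normal approximation with a 1.96 cutoff, and the common simplified standard error for a standardized mean difference (dropping the small term that depends on the effect size itself). Real studies that adjust for pretests would have somewhat more power than this suggests.

```python
import math

Z_CRIT = 1.96  # two-sided 5% critical value under a normal approximation


def approx_se(total_n):
    """Approximate standard error of a standardized mean difference for two
    equal groups, ignoring the small d**2 term and any covariates."""
    n_per_group = total_n / 2
    return math.sqrt(2 / n_per_group)


def min_significant_es(total_n):
    """Smallest effect size that would reach p < .05 under these assumptions."""
    return Z_CRIT * approx_se(total_n)


# Median individual-level sample sizes quoted in the text.
for n in (48, 74, 86):
    print(n, round(min_significant_es(n), 2))
```

Under these simplifying assumptions, a study with 48 students needs an effect size of roughly +0.57 to reach significance, so anything between +0.25 and about +0.57 lands in the “substantively important” category.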

And small-sample studies greatly overstate effect sizes. Among all factors that bias effect sizes, small sample size is the most important (only use of researcher/developer-made measures comes close). So a non-significant positive finding in a small study is not an unbiased point estimate that just needs a larger sample to show its significance. It is probably biased, in a consistent, positive direction. Studies with sample sizes less than 100 have about three times the mean effect sizes of studies with sample sizes over 1000, for example.
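One mechanism behind this overstatement can be shown with a toy simulation. This is not a reconstruction of Amanda Inns’ analysis, and it captures only the selection effect of the +0.25 cutoff, not other sources of small-study bias. The true effect of +0.10, the number of simulated studies, and the simplified standard error are all hypothetical choices of mine. The sketch draws observed effect sizes for many imaginary studies and keeps only those that would be labeled “substantively important” (at least +0.25 but not significant).

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_EFFECT = 0.10  # hypothetical true effect size, for illustration only
THRESHOLD = 0.25    # the "substantively important" cutoff described in the text
Z_CRIT = 1.96       # two-sided 5% critical value, normal approximation


def simulate_selected(total_n, n_studies=20000):
    """Simulate observed effect sizes for many equal-sized studies and keep
    those that are >= THRESHOLD but not statistically significant."""
    se = math.sqrt(4 / total_n)  # rough SE for two equal groups, no covariates
    kept = []
    for _ in range(n_studies):
        observed = random.gauss(TRUE_EFFECT, se)
        if observed >= THRESHOLD and observed / se < Z_CRIT:
            kept.append(observed)
    return kept


small = simulate_selected(48)    # median sample size quoted in the text
large = simulate_selected(1000)  # a large study, for contrast

print(len(small), round(statistics.mean(small), 2))
print(len(large))  # 0: at this size, any effect >= 0.25 is already significant
```

Every simulated small study that earns the label reports at least +0.25, even though the true effect was set to +0.10, so the average of the labeled studies must overstate it. With 1,000 students the category is empty, because an observed effect that large would be significant.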

But “substantively important” ratings can throw a monkey wrench into current policy. The ESSA evidence standards require statistically significant effects for all of their top three levels (strong, moderate, and promising). Yet many educational leaders are using the What Works Clearinghouse as a guide to which programs will meet ESSA evidence standards. They may logically assume that if the WWC says a program is effective, then the federal government stands behind it, regardless of what the ESSA evidence standards actually say. Yet in fact, based on the data analyzed by Amanda Inns for reading and math, 46% of the outcomes rated as positive/potentially positive by the WWC (taken to correspond to “strong” or “moderate,” respectively, under ESSA evidence standards) are non-significant, and therefore do not qualify under ESSA.

The WWC needs to remove “substantively important” from its ratings as soon as possible, to avoid a collision with ESSA evidence standards, and to avoid misleading educators any further. Doing so would help make the WWC’s impact on ESSA substantive. And important.
