Beyond the Spaghetti Bridge: Why Response to Intervention is Not Enough

I know an engineer at Johns Hopkins University who invented the Spaghetti Bridge Challenge. Teams of students are given dry, uncooked spaghetti and glue, and are challenged to build a bridge over a 500-millimeter gap. The bridge that can support the most weight wins.

blog_9-27-18_spaghettibridge_500x333

Spaghetti Bridge tournaments are now held all over the world, and they are wonderful for building interest in engineering. But I don’t think any engineer would actually build a real bridge based on a winning spaghetti bridge prototype. Much as spaghetti bridges do resemble the designs of real bridges, there are many more factors a real engineer has to take into account: Weight of materials, tensile strength, flexibility (in case of high winds or earthquakes), durability, and so on.

In educational innovation and reform, we have lots of great ideas that resemble spaghetti bridges. That’s because they would probably work great if only their components were ideal. An example like this is Response to Intervention (RTI), or its latest version, Multi-Tiered Systems of Supports (MTSS). Both RTI and MTSS start with a terrific idea: Instead of just testing struggling students to decide whether or not to assign them to special education, provide them with high-quality instruction (Tier 1), supplemented by modest assistance if that is not sufficient (Tier 2), supplemented by intensive instruction if Tier 2 is not sufficient (Tier 3). In law, or at least in theory, struggling readers must have had a chance to succeed in high-quality Tier 1, Tier 2, and Tier 3 instruction before they can be assigned to special education.

The problem is that there is no way to ensure that struggling students truly received high-quality instruction at each tier level. Teachers do their best, but it is difficult to make up effective approaches from scratch. MTSS or RTI is a great idea, but their success depends on the effectiveness of whatever struggling students receive as Tier 1, 2, and 3 instruction.

This is where spaghetti bridges come in. Many bridge designs can work in theory (or in spaghetti), but whether or not a bridge really works in the real world depends on how it is made, and with what materials in light of the demands that will be placed on it.

The best way to ensure that all components of RTI or MTSS policy are likely to be effective is to select approaches for each tier that have themselves been proven to work. Fortunately, there is now a great deal of research establishing the effectiveness of programs, proven effective for struggling students that use whole-school or whole-class methods (Tier 1), one-to-small group tutoring (Tier 2), or one-to-one tutoring (Tier 3). Many of these tutoring models are particularly cost-effective because they successfully provide struggling readers with tutoring from well-qualified paraprofessionals, usually ones with bachelor’s degrees but not teaching certificates. Research on both reading and math tutoring has clearly established that such paraprofessional tutors, using structured models, have tutees who gain at least as much as do tutors who are certified teachers. This is important not only because paraprofessionals cost about half as much as teachers, but also because there are chronic teacher shortages in high-poverty areas, such as inner-city and rural locations, so certified teacher tutors may not be available at any cost.

If schools choose proven components for their MTSS/RTI models, and implement them with thought and care, they are sure to see enhanced outcomes for their struggling students. The concept of MTSS/RTI is sound, and the components are proven. How could the outcomes be less than stellar? And in addition to improved achievement for vulnerable learners, hiring many paraprofessionals to serve as tutors in disadvantaged schools could enable schools to attract and identify capable, caring young people with bachelor’s degrees to offer accelerated certification, enriching the local teaching force.

With a spaghetti bridge, a good design is necessary but not sufficient. The components of that design, its ingredients, and its implementation, determine whether the bridge stands or falls in practice. So it is with MTSS and RTI. An approach based on strong evidence of effectiveness is essential to enable these good designs achieve their goals.

Photo credit: CSUF Photos (CC BY-NC-SA 2.0), via flickr

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Rethinking Technology in Education

Antonine de Saint Exupéry, in his 1931 classic Night Flight, had a wonderful line about early airmail service in Patagonia, South America:

“When you are crossing the Andes and your engine falls out, well, there’s nothing to do but throw in your hand.”

blog_10-4-18_Saint_Exupery_363x500

I had reason to think about this quote recently, as I was attending a conference in Santiago, Chile, the presumed destination of the doomed pilot. The conference focused on evidence-based reform in education.

Three of the papers described large scale, randomized evaluations of technology applications in Latin America, funded by the Inter-American Development Bank (IDB). Two of them documented disappointing outcomes of large-scale, traditional uses of technology. One described a totally different application.

One of the studies, reported by Santiago Cueto (Cristia et al., 2017), randomly assigned 318 high-poverty, mostly rural primary schools in Peru to receive sturdy, low-cost, practical computers, or to serve as a control group. Teachers were given great latitude in how to use the computers, but limited professional development in how to use them as pedagogical resources. Worse, the computers had software with limited alignment to the curriculum, and teachers were expected to overcome this limitation. Few did. Outcomes were essentially zero in reading and math.

In another study (Berlinski & Busso, 2017), the IDB funded a very well-designed study in 85 schools in Costa Rica. Schools were randomly assigned to receive one of five approaches. All used the same content on the same schedule to teach geometry to seventh graders. One group used traditional lectures and questions with no technology. The others used active learning, active learning plus interactive whiteboards, active learning plus a computer lab, or active learning plus one computer per student. “Active learning” emphasized discussions, projects, and practical exercises.

On a paper-and-pencil test covering the content studied by all classes, all four of the experimental groups scored significantly worse than the control group. The lowest performance was seen in the computer lab condition, and, worst of all, the one computer per child condition.

The third study, in Chile (Araya, Arias, Bottan, & Cristia, 2018), was funded by the IDB and the International Development Research Center of the Canadian government. It involved a much more innovative and unusual application of technology. Fourth grade classes within 24 schools were randomly assigned to experimental or control conditions. In the experimental group, classes in similar schools were assigned to serve as competitors to each other. Within the math classes, students studied with each other and individually for a bi-monthly “tournament,” in which students in each class were individually given questions to answer on the computers. Students were taught cheers and brought to fever pitch in their preparations. The participating classes were compared to the control classes, which studied the same content using ordinary methods. All classes, experimental and control, were studying the national curriculum on the same schedule, and all used computers, so all that differed was the tournaments and the cooperative studying to prepare for the tournaments.

The outcomes were frankly astonishing. The students in the experimental schools scored much higher on national tests than controls, with an effect size of +0.30.

The differences in the outcomes of these three approaches are clear. What might explain them, and what do they tell us about applications of technology in Latin America and anywhere?

In Peru, the computers were distributed as planned and generally functioned, but teachers receive little professional development. In fact, teachers were not given specific strategies for using the computers, but were expected to come up with their own uses for them.

The Costa Rica study did provide computer users with specific approaches to math and gave teachers much associated professional development. Yet the computers may have been seen as replacements for teachers, and the computers may just not have been as effective as teachers. Alternatively, despite extensive PD, all four of the experimental approaches were very new to the teachers and may have not been well implemented.

In contrast, in the Chilean study, tournaments and cooperative study were greatly facilitated by the computers, but the computers were not central to program effectiveness. The theory of action emphasized enhanced motivation to engage in cooperative study of math. The computers were only a tool to achieve this goal. The tournament strategy resembles a method from the 1970s called Teams-Games-Tournaments (TGT) (DeVries & Slavin, 1978). TGT was very effective, but was complicated for teachers to use, which is why it was not widely adopted. In Chile, computers helped solve the problems of complexity.

It is important to note that in the United States, technology solutions are also not producing major gains in student achievement. Reviews of research on elementary reading (ES=+0.05; Inns et al. 2018) and secondary reading (ES= -0.01; Baye et al., in press) have reported near-zero effects of technology-assisted effects of technology-assisted approaches. Outcomes in elementary math are only somewhat better, averaging an effect size of +0.09 (Pellegrini et al., 2018).

The findings of these rigorous studies of technology in the U.S. and Latin America lead to a conclusion that there is nothing magic about technology. Applications of technology can work if the underlying approach is sound. Perhaps it is best to consider which non-technology approaches are proven or likely to increase learning, and only then imagine how technology could make effective methods easier, less expensive, more motivating, or more instructionally effective. As an analogy, great audio technology can make a concert more pleasant or audible, but the whole experience still depends on great composition and great performances. Perhaps technology in education should be thought of in a similar enabling way, rather than as the core of innovation.

St. Exupéry’s Patagonian pilots crossing the Andes had no “Plan B” if their engines fell out. We do have many alternative ways to put technology to work or to use other methods, if the computer-assisted instruction strategies that have dominated technology since the 1970s keep showing such small or zero effects. The Chilean study and certain exceptions to the overall pattern of research findings in the U.S. suggest appealing “Plans B.”

The technology “engine” is not quite falling out of the education “airplane.” We need not throw in our hand. Instead, it is clear that we need to re-engineer both, to ask not what is the best way to use technology, but what is the best way to engage, excite, and instruct students, and then ask how technology can contribute.

Photo credit: Distributed by Agence France-Presse (NY Times online) [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

References

Araya, R., Arias, E., Bottan, N., & Cristia, J. (2018, August 23). Conecta Ideas: Matemáticas con motivatión social. Paper presented at the conference “Educate with Evidence,” Santiago, Chile.

Baye, A., Lake, C., Inns, A., & Slavin, R. (in press). Effective reading programs for secondary students. Reading Research Quarterly.

Berlinski, S., & Busso, M. (2017). Challenges in educational reform: An experiment on active learning in mathematics. Economics Letters, 156, 172-175.

Cristia, J., Ibarraran, P., Cueto, S., Santiago, A., & Severín, E. (2017). Technology and child development: Evidence from the One Laptop per Child program. American Economic Journal: Applied Economics, 9 (3), 295-320.

DeVries, D. L., & Slavin, R. E. (1978). Teams-Games-Tournament:  Review of ten classroom experiments. Journal of Research and Development in Education, 12, 28-38.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018, March 3). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018, March 3). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Small Studies, Big Problems

Everyone knows that “good things come in small packages.” But in research evaluating practical educational programs, this saying does not apply. Small studies are very susceptible to bias. In fact, among all the factors that can inflate effect sizes in educational experiments, small sample size is among the most powerful. This problem is widely known, and in reviewing large and small studies, most meta-analysts solve the problem by requiring minimum sample sizes and/or weighting effect sizes by their sample sizes. Problem solved.

blog_9-13-18_presents_500x333

For some reason, the What Works Clearinghouse (WWC) has so far paid little attention to sample size. It has not weighted by sample size in computing mean effect sizes, although the WWC is talking about doing this in the future. It has not even set minimums for sample size for its reviews. I know of one accepted study with a total sample size of 12 (6 experimental, 6 control). These procedures greatly inflate WWC effect sizes.

As one indication of the problem, our review of 645 studies of reading, math, and science studies accepted by the Best Evidence Encyclopedia (www.bestevidence.org) found that studies with fewer than 250 subjects had twice the effect sizes of those with more than 250 (effect sizes=+0.30 vs. +0.16). Comparing studies with fewer than 100 students to those with more than 3000, the ratio was 3.5 to 1 (see Cheung & Slavin [2016] at http://www.bestevidence.org/word/methodological_Sept_21_2015.pdf). Several other studies have found the same effect.

Using data from the What Works Clearinghouse reading and math studies, obtained by graduate student Marta Pellegrini (2017), sample size effects were also extraordinary. The mean effect size for sample sizes of 60 or less was +0.37; for samples of 60-250, +0.29; and for samples of more than 250, +0.13. Among all design factors she studied, small sample size made the most difference in outcomes, rivaled only by researcher/developer-made measures. In fact, sample size is more pernicious, because while reviewers can exclude researcher/developer-made measures within a study and focus on independent measures, a study with a small sample has the same problem for all measures. Also, because small-sample studies are relatively inexpensive, there are quite a lot of them, so reviews that fail to attend to sample size can greatly over-estimate overall mean effect sizes.

My colleague Amanda Inns (2018) recently analyzed WWC reading and math studies to find out why small studies produce such inflate outcomes. There are many reasons small-sample studies may produce such large effect sizes. One is that in small studies, researchers can provide extraordinary amounts of assistance or support to the experimental group. This is called “superrealization.” Another is that when studies with small sample sizes find null effects, the studies tend not to be published or made available at all, deemed a “pilot” and forgotten. In contrast, a large study is likely to have been paid for by a grant, which will produce a report no matter what the outcome. There has long been an understanding that published studies produce much higher effect sizes than unpublished studies, and one reason is that small studies are rarely published if their outcomes are not significant.

Whatever the reasons, there is no doubt that small studies greatly overstate effect sizes. In reviewing research, this well-known fact has long led meta-analysts to weight effect sizes by their sample sizes (usually using an inverse variance procedure). Yet as noted earlier, the WWC does not do this, but just averages effect sizes across studies without taking sample size into account.

One example of the problem of ignoring sample size in averaging is provided by Project CRISS. CRISS was evaluated in two studies. One had 231 students. On a staff-developed “free recall” measure, the effect size was +1.07. The other study had 2338 students, and an average effect size on standardized measures of -0.02. Clearly, the much larger study with an independent outcome measure should have swamped the effects of the small study with a researcher-made measure, but this is not what happened. The WWC just averaged the two effect sizes, obtaining a mean of +0.53.

How might the WWC set minimum sample sizes for studies to be included for review? Amanda Inns proposed a minimum of 60 students (at least 30 experimental and 30 control) for studies that analyze at the student level. She suggests a minimum of 12 clusters (6 and 6), such as classes or schools, for studies that analyze at the cluster level.

In educational research evaluating school programs, good things come in large packages. Small studies are fine as pilots, or for descriptive purposes. But when you want to know whether a program works in realistic circumstances, go big or go home, as they say.

The What Works Clearinghouse should exclude very small studies and should use weighting based on sample sizes in computing means. And there is no reason it should not start doing these things now.

References

Inns, A. & Slavin, R. (2018 August). Do small studies add up in the What Works Clearinghouse? Paper presented at the meeting of the American Psychological Association, San Francisco, CA.

Pellegrini, M. (2017, August). How do different standards lead to different conclusions? A comparison between meta-analyses of two research centers. Paper presented at the European Conference on Educational Research (ECER), Copenhagen, Denmark.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

The Curious Case of the Missing Programs

“Let me tell you, my dear Watson, about one of my most curious and vexing cases,” said Holmes. “I call it, ‘The Case of the Missing Programs’. A school superintendent from America sent me a letter.  It appears that whenever she looks in the What Works Clearinghouse to find a program her district wants to use, nine times out of ten there is nothing there!”

Watson was astonished. “But surely there has to be something. Perhaps the missing programs did not meet WWC standards, or did not have positive effects!”

“Not meeting standards or having disappointing outcomes would be something,” responded Holmes, “but the WWC often says nothing at all about a program. Users are apparently confused. They don’t know what to conclude.”

“The missing programs must make the whole WWC less useful and reliable,” mused Watson.

“Just so, my friend,” said Holmes, “and so we must take a trip to America to get to the bottom of this!”

blog_9-6-18_SherlockProf_458x500

While Holmes and Watson are arranging steamship transportation to America, let me fill you in on this very curious case.

In the course of our work on Evidence for ESSA (www.evidenceforessa.org), we are occasionally asked by school district leaders why there is nothing in our website about a given program, text, or software. Whenever this happens, our staff immediately checks to see if there is any evidence we’ve missed. If we are pretty sure that there are no studies of the missing program that meet our standards, we add the program to our website, with a brief indication that there are no qualifying studies. If any studies do meet our standards, we review them as soon as possible and add them as meeting or not meeting ESSA standards.

Sometimes, districts or states send us their entire list of approved texts and software, and we check them all to see that all are included.

From having done this for more than a year, we now have an entry on most of the reading and math programs any district would come up with, though we keep getting more all the time.

All of this seems to us to be obviously essential. If users of Evidence for ESSA look up their favorite programs, or ones they are thinking of adopting, and find that there is no entry, they begin losing confidence in the whole enterprise. They cannot know whether the program they seek was ignored or missed for some reason, or has no evidence of effectiveness, or perhaps has been proven effective but has not been reviewed.

Recently, a large district sent me their list of 98 approved and supplementary texts, software, and other programs in reading and math. They had marked each according to the ratings given by the What Works Clearinghouse and Evidence for ESSA. At the time (a few weeks ago), Evidence for ESSA had listings for 67% of the programs. Today, of course, it has 100%, because we immediately set to work researching and adding in all the programs we’d missed.

What I found astonishing, however, is how few of the district’s programs were mentioned at all in the What Works Clearinghouse. Only 15% of the reading and math programs were in the WWC.

I’ve written previously about how far behind the WWC is in reviewing programs. But the problem with the district list was not just a question of slowness. Many of the programs the WWC missed have been around for some time.

I’m not sure how the WWC decides what to review, but they do not seem to be trying for completeness. I think this is counterproductive. Users of the WWC should expect to be able to find out about programs that meet standards for positive outcomes, those that have an evidence base that meets evidence standards but do not have positive outcomes, those that have evidence not meeting standards, and those that have no evidence at all. Yet it seems clear that the largest category in the WWC is “none of the above.” Most programs a user would be interested in do not appear at all in the WWC. Most often, a lack of a listing means a lack of evidence, but this is not always the case, especially when evidence is recent. One way or another, finding big gaps in any compendium undermines faith in the whole effort. It’s difficult to expect educational leaders to get into the habit of looking for evidence if most of the programs they consider are not listed.

Imagine, for example, that a telephone book was missing a significant fraction of the people who live in a given city. Users would be frustrated about not being able to find their friends, and the gaps would soon undermine confidence in the whole phone book.

****

When Holmes and Watson arrived in the U.S., they spoke with many educators who’d tried to find programs in the WWC, and they heard tales of frustration and impatience. Many former users said they no longer bothered to consult the WWC and had lost faith in evidence in their field. Fortunately, Holmes and Watson got a meeting with U.S. Department of Education officials, who immediately understood the problem and set to work to find the evidence base (or lack of evidence) for every reading and math program in America. Usage of the WWC soared, and support for evidence-based reform in education increased.

Of course, this outcome is fictional. But it need not remain fictional. The problem is real, and the solution is simple. Or as Holmes would say, “Elementary and secondary, my dear Watson!”

Photo credit: By Rumensz [CC0], from Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.