What Works in Elementary Math?

Euclid, the ancient Greek mathematician, is considered the inventor of geometry. His king heard about it, and wanted to learn geometry, but being a king, he was kind of busy. He called in Euclid, and asked him if there was a faster way. “I’m sorry sire,” said Euclid, “but there is no royal road to geometry.”

Skipping forward a couple thousand years, Marta Pellegrini, of the University of Florence in Italy, spent nine months with our group at Johns Hopkins University and led a review of research on effective programs for elementary mathematics  (Pellegrini, Lake, Inns & Slavin, 2018), which was recently released on our Best Evidence Encyclopedia (BEE). What we found was not so different from Euclid’s conclusion, but broader: There’s no royal road to anything in mathematics. Improving mathematics achievement isn’t easy. But it is not impossible.

Our review focused on 78 very high-quality studies (65 used random assignment). The 61 programs evaluated in these studies were divided into eight categories: tutoring, technology, professional development for math content and pedagogy, instructional process programs, whole-school reform, social-emotional approaches, textbooks, and benchmark assessments.

Tutoring had the largest and most reliably positive impacts on math learning. Tutoring included one-to-one and one-to-small group services, and some tutors were certified teachers and some were paraprofessionals (teacher assistants). The successful tutoring models were all well-structured, and tutors received high-quality materials and professional development. Across 13 studies involving face-to-face tutoring, average outcomes were very positive. Surprisingly, tutors who were certified teachers (ES=+0.34) and paraprofessionals (ES=+0.32) obtained very similar student outcomes. Even more surprising, one-to-small group tutoring (ES=+0.32) was at least as effective as one-to-one tutoring (ES=+0.26).
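Throughout these posts, "ES" refers to an effect size: the difference between the experimental and control group means, expressed in standard deviation units. As a reminder of the general form (my gloss on standard practice, not a formula quoted from the review itself):

```latex
ES = \frac{\bar{X}_{\mathrm{experimental}} - \bar{X}_{\mathrm{control}}}{SD}
```

So an ES of +0.32 means the average tutored student scored about a third of a standard deviation higher than the average control student.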

Beyond tutoring, the category with the largest average impacts was instructional process programs: classroom organization and management approaches such as cooperative learning and the Good Behavior Game. The mean effect size for this category was +0.25.


After these two categories, there were only isolated studies with positive outcomes. Fourteen studies of technology approaches had an average effect size of only +0.07. Twelve studies of professional development to improve teachers’ knowledge of math content and pedagogy found an average of only +0.04. One study of a social-emotional program called Positive Action found positive effects, but seven other SEL studies did not, and the mean for this category was +0.03. One study of a whole-school reform model called the Center for Data-Driven Reform in Education (CDDRE), which helps schools do needs assessments and then find, select, and implement proven programs, showed positive outcomes (ES=+0.24), but three other whole-school models found no positive effects. Among 16 studies of math curricula and software, only two, Math in Focus (ES=+0.25) and Math Expressions (ES=+0.11), found significant positive outcomes. On average, benchmark assessment approaches made no difference (ES=0.00).

Taken together, the findings of the 78 studies support a surprising conclusion. Few of the successful approaches had much to do with improving math pedagogy. Most were one-to-one or one-to-small group tutoring approaches that closely resemble tutoring models long used with great success in reading. A classroom management approach, PAX Good Behavior Game, and a social-emotional model, Positive Action, had no particular focus on math, yet both had positive effects on math (and reading). A whole-school reform approach, the Center for Data-Driven Reform in Education (CDDRE), helped schools do needs assessments and select proven programs appropriate to their needs, but CDDRE focused equally on reading and math, and had significantly positive outcomes in both subjects. In contrast, math curricula and professional development specifically designed for mathematics had only two positive examples among 28 programs.

The substantial difference between the outcomes of tutoring and those of technology applications is also interesting. The well-established positive impacts of one-to-one and one-to-small group tutoring, in reading as well as math, are often ascribed to the tutor’s ability to personalize instruction for each student. Computer-assisted instruction (CAI) is also personalized, and has been expected, largely on this basis, to improve student achievement, especially in math (see Cheung & Slavin, 2013). Yet in math, and also in reading, one-to-one and one-to-small group tutoring, whether by certified teachers or paraprofessionals, is far more effective than the average for technology approaches. The comparison of outcomes of personalized CAI and (personalized) tutoring makes it unlikely that personalization is a key explanation for the effectiveness of tutoring. Tutors must contribute something powerful beyond personalization.

I have argued previously that what tutors contribute, in addition to personalization, is a human connection, encouragement, and praise. A tutored child wants to please his or her tutor, not by completing a set of computerized exercises, but by seeing a tutor’s eyes light up and voice respond when the tutee makes progress.

If this is the secret of the effect of tutoring (beyond personalization), perhaps a similar explanation extends to other approaches that happen to improve mathematics performance without using especially innovative approaches to mathematics content or pedagogy. Approaches such as PAX Good Behavior Game and Positive Action, targeted on behavior and social-emotional skills, respectively, focus on children’s motivations, emotions, and behaviors. In the secondary grades, a program called Building Assets, Reducing Risk (BARR) (Corsello & Sharma, 2015) has an equal focus on social-emotional development, not math, but it also has significant positive effects on math (as well as reading). A study in Chile of a program called Conecta Ideas found substantial positive effects in fourth grade math by having students practice together in preparation for bimonthly math “tournaments” in competition with other schools. Both content and pedagogy were the same in experimental and control classes, but the excitement engendered by the tournaments led to substantial impacts (ES=+0.30 on national tests).

We need breakthroughs in mathematics teaching. Perhaps we have been looking in the wrong places, expecting that improved content and pedagogy will be the key to better learning. They will surely be involved, but perhaps it will turn out that math does not live only in students’ heads, but must also live in their hearts.

There may be no royal road to mathematics, but perhaps there is an emotional road. Wouldn’t it be astonishing if math, the most cerebral of subjects, turned out, more than any other subject, to depend as much on heart as on brain?

References

Cheung, A., & Slavin, R. E. (2013). The effectiveness of educational technology applications for enhancing mathematics achievement in K-12 classrooms: A meta-analysis. Educational Research Review, 9, 88-113.

Corsello, M., & Sharma, A. (2015). The Building Assets-Reducing Risks Program: Replication and expansion of an effective strategy to turn around low-achieving schools: i3 development grant final report. Biddeford, ME: Consello Consulting.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018, March 3). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018, March 3). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Photo credit: By Los Angeles Times Photographic Archive, no photographer stated. [CC BY 4.0  (https://creativecommons.org/licenses/by/4.0)], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Succeeding Faster in Education

“If you want to increase your success rate, double your failure rate.” So said Thomas Watson, the founder of IBM. What he meant, of course, is that people and organizations thrive when they try many experiments, even though most experiments fail. Failing twice as often means trying twice as many experiments, leading to twice as many failures—but also, he was saying, many more successes.

[Photo: Thomas Watson]

In education research and innovation circles, many people know this quote, and use it to console colleagues who have done an experiment that did not produce significant positive outcomes. A lot of consolation is necessary, because most high-quality experiments in education do not produce significant positive outcomes. In studies funded by the Institute of Education Sciences (IES), Investing in Innovation (i3), and England’s Education Endowment Foundation (EEF), all of which require very high standards of evidence, fewer than 20% of experiments show significant positive outcomes.

The high rate of failure in educational experiments is often shocking to non-researchers, especially the government agencies, foundations, publishers, and software developers who commission the studies. I was at a conference recently at which a Peruvian researcher presented the devastating results of an experiment in which high-poverty, mostly rural schools in Peru were randomly assigned to receive computers for all of their students, or to continue with usual instruction. The Peruvian Ministry of Education was so confident that the computers would be effective that it had built a huge model of the specific computers used in the experiment and attached it to the Ministry headquarters. When the results showed no positive outcomes (except for the ability to operate computers), the Ministry quietly removed the computer statue from the top of its building.

Improving Success Rates

Much as I believe Watson’s admonition (“fail more”), there is another principle that he was implying, or so I suspect: We have to learn from failure, so we can increase the rate of success. It is not realistic to expect government to continue to invest substantial funding in high-quality educational experiments if the success rate remains below 20%. We have to get smarter, so we can succeed more often. Fortunately, qualitative measures, such as observations, interviews, and questionnaires, are becoming required elements of funded research, making it easier to find out what actually happened and why it went wrong. Was the experimental program faithfully implemented? Were there unexpected responses to the program from teachers or students?

In the course of my work reviewing positive and disappointing outcomes of educational innovations, I’ve noticed some patterns that often predict that a given program is likely or unlikely to be effective in a well-designed evaluation. Some of these are as follows.

  1. Small changes lead to small (or zero) impacts. In every subject and grade level, researchers have evaluated new textbooks, in comparison to existing texts. These almost never show positive effects. The reason is that textbooks are just not that different from each other. Approaches that do show positive effects are usually markedly different from ordinary practices or texts.
  2. Successful programs almost always provide a lot of professional development. The programs that have significant positive effects on learning are ones that markedly improve pedagogy. Changing teachers’ daily instructional practices usually requires initial training followed by on-site coaching by well-trained and capable coaches. Lots of PD does not guarantee success, but minimal PD virtually guarantees failure. Sufficient professional development can be expensive, but education itself is expensive, and adding a modest amount to per-pupil cost for professional development and other requirements of effective implementation is often the best way to substantially enhance outcomes.
  3. Effective programs are usually well-specified, with clear procedures and materials. Rarely do programs work if they are unclear about what teachers are expected to do, and how they will be helped to do it. In the Peruvian study of one-to-one computers, for example, students were given laptop computers at a per-pupil cost of $438. Teachers were expected to figure out how best to use them. In fact, a qualitative study found that the computers were considered so valuable that many teachers locked them up except for specific times when they were to be used. There was no specific instructional software, and no professional development to help teachers create or find what was needed. No wonder “it” didn’t work. Other than the physical computers, there was no “it.”
  4. Technology is not magic. Technology can create opportunities for improvement, but there is little understanding of how to use technology to greatest effect. My colleagues and I have done reviews of research on effects of modern technology on learning. We found near-zero effects of a variety of elementary and secondary reading software (Inns et al., 2018; Baye et al., in press), with a mean effect size of +0.05 in elementary reading and +0.00 in secondary. In math, effects were slightly more positive (ES=+0.09), but still quite small, on average (Pellegrini et al., 2018). Some technology approaches had more promise than others, but it is time that we learned from disappointing as well as promising applications. The widespread belief that technology is the future must eventually be right, but at present we have little reason to believe that technology is transformative, and we don’t know which form of technology is most likely to be transformative.
  5. Tutoring is the most solid approach we have. Reviews of elementary reading for struggling readers (Inns et al., 2018) and secondary struggling readers (Baye et al., in press), as well as elementary math (Pellegrini et al., 2018), find outcomes for various forms of tutoring that are far beyond effects seen for any other type of treatment. Everyone knows this, but thinking about tutoring falls into two camps. One, typified by advocates of Reading Recovery, takes the view that tutoring is so effective for struggling first graders that it should be used no matter what the cost. The other, also perhaps thinking about Reading Recovery, rejects this approach because of its cost. Yet recent research on tutoring methods is finding strategies that are cost-effective and feasible. First, studies in both reading (Inns et al., 2018) and math (Pellegrini et al., 2018) find no difference in outcomes between certified teachers and paraprofessionals using structured one-to-one or one-to-small group tutoring models. Second, although one-to-one tutoring is more effective than one-to-small group, one-to-small group is far more cost-effective, as one trained tutor can work with 4 to 6 students at a time (see the rough cost sketch after this list). Also, recent studies have found that tutoring can be just as effective in the upper elementary and middle grades as in first grade, so this strategy may have broader applicability than it has had in the past. The real challenge for research on tutoring is to develop and evaluate models that increase the cost-effectiveness of this clearly effective family of approaches.
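To make the cost-effectiveness point in item 5 concrete, here is a rough back-of-the-envelope sketch. The salary and caseload figures are hypothetical placeholders, not numbers from any of the studies cited above; only the relationships come from the text (paraprofessionals cost roughly half as much as certified teachers, and a small-group tutor serves 4 to 6 students per session).

```python
# Back-of-the-envelope tutoring cost comparison.
# All dollar figures and caseloads are hypothetical placeholders for illustration.

TEACHER_COST = 60_000              # assumed annual cost of a certified-teacher tutor
PARA_COST = TEACHER_COST / 2       # paraprofessionals cost roughly half as much
DAILY_SLOTS = 8                    # assumed tutoring sessions per day

def cost_per_student(annual_cost, students_per_slot):
    """Annual tutor cost divided by the number of students served across daily sessions."""
    return annual_cost / (DAILY_SLOTS * students_per_slot)

scenarios = [
    ("Certified teacher, one-to-one", TEACHER_COST, 1),
    ("Certified teacher, groups of 5", TEACHER_COST, 5),
    ("Paraprofessional, one-to-one",   PARA_COST,    1),
    ("Paraprofessional, groups of 5",  PARA_COST,    5),
]

for label, annual_cost, group_size in scenarios:
    print(f"{label}: ${cost_per_student(annual_cost, group_size):,.0f} per student per year")
```

Under these assumptions, a paraprofessional tutoring small groups costs roughly a tenth as much per student as a certified teacher tutoring one-to-one, which is the arithmetic behind the cost-effectiveness claim.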

The extraordinary advances in the quality and quantity of research in education, led by investments from IES, i3, and the EEF, have raised expectations for research-based reform. However, the modest percentage of recent studies meeting current rigorous standards of evidence has caused disappointment in some quarters. Instead, all findings, whether immediately successful or not, should be seen as crucial information. Some studies identify programs ready for prime time right now, but the whole body of work can and must inform us about areas worthy of expanded investment, as well as areas in need of serious rethinking and redevelopment. The evidence movement, in the form it exists today, is completing its first decade. It’s still early days. There is much more we can learn and do to develop, evaluate, and disseminate effective strategies, especially for students in great need of proven approaches.

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (in press). Effective reading programs for secondary students. Reading Research Quarterly.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

 Photo credit: IBM [CC BY-SA 3.0  (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

Beyond the Spaghetti Bridge: Why Response to Intervention is Not Enough

I know an engineer at Johns Hopkins University who invented the Spaghetti Bridge Challenge. Teams of students are given dry, uncooked spaghetti and glue, and are challenged to build a bridge over a 500-millimeter gap. The bridge that can support the most weight wins.

[Photo: a spaghetti bridge]

Spaghetti Bridge tournaments are now held all over the world, and they are wonderful for building interest in engineering. But I don’t think any engineer would actually build a real bridge based on a winning spaghetti bridge prototype. Much as spaghetti bridges do resemble the designs of real bridges, there are many more factors a real engineer has to take into account: weight of materials, tensile strength, flexibility (in case of high winds or earthquakes), durability, and so on.

In educational innovation and reform, we have lots of great ideas that resemble spaghetti bridges. That’s because they would probably work great if only their components were ideal. An example like this is Response to Intervention (RTI), or its latest version, Multi-Tiered Systems of Supports (MTSS). Both RTI and MTSS start with a terrific idea: Instead of just testing struggling students to decide whether or not to assign them to special education, provide them with high-quality instruction (Tier 1), supplemented by modest assistance if that is not sufficient (Tier 2), supplemented by intensive instruction if Tier 2 is not sufficient (Tier 3). In law, or at least in theory, struggling readers must have had a chance to succeed in high-quality Tier 1, Tier 2, and Tier 3 instruction before they can be assigned to special education.

The problem is that there is no way to ensure that struggling students truly received high-quality instruction at each tier level. Teachers do their best, but it is difficult to make up effective approaches from scratch. MTSS and RTI are great ideas, but their success depends on the effectiveness of whatever struggling students actually receive as Tier 1, 2, and 3 instruction.

This is where spaghetti bridges come in. Many bridge designs can work in theory (or in spaghetti), but whether or not a bridge really works in the real world depends on how it is made, and with what materials in light of the demands that will be placed on it.

The best way to ensure that all components of RTI or MTSS policy are likely to be effective is to select approaches for each tier that have themselves been proven to work. Fortunately, there is now a great deal of research establishing the effectiveness of programs for struggling students that use whole-school or whole-class methods (Tier 1), one-to-small group tutoring (Tier 2), or one-to-one tutoring (Tier 3). Many of these tutoring models are particularly cost-effective because they successfully provide struggling readers with tutoring from well-qualified paraprofessionals, usually ones with bachelor’s degrees but not teaching certificates. Research on both reading and math tutoring has clearly established that such paraprofessional tutors, using structured models, have tutees who gain at least as much as students tutored by certified teachers. This is important not only because paraprofessionals cost about half as much as teachers, but also because there are chronic teacher shortages in high-poverty areas, such as inner-city and rural locations, so certified teacher tutors may not be available at any cost.

If schools choose proven components for their MTSS/RTI models, and implement them with thought and care, they are sure to see enhanced outcomes for their struggling students. The concept of MTSS/RTI is sound, and the components are proven. How could the outcomes be less than stellar? And in addition to improved achievement for vulnerable learners, hiring many paraprofessionals to serve as tutors in disadvantaged schools could enable schools to attract and identify capable, caring young people with bachelor’s degrees to whom they could offer accelerated certification, enriching the local teaching force.

With a spaghetti bridge, a good design is necessary but not sufficient. The components of that design, its ingredients, and its implementation determine whether the bridge stands or falls in practice. So it is with MTSS and RTI. An approach based on strong evidence of effectiveness is essential to enable these good designs to achieve their goals.

Photo credit: CSUF Photos (CC BY-NC-SA 2.0), via flickr

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Rethinking Technology in Education

Antoine de Saint-Exupéry, in his 1931 classic Night Flight, had a wonderful line about the early airmail service in Patagonia, South America:

“When you are crossing the Andes and your engine falls out, well, there’s nothing to do but throw in your hand.”

[Photo: Antoine de Saint-Exupéry]

I had reason to think about this quote recently, as I was attending a conference in Santiago, Chile, the presumed destination of the doomed pilot. The conference focused on evidence-based reform in education.

Three of the papers described large-scale, randomized evaluations of technology applications in Latin America, funded by the Inter-American Development Bank (IDB). Two of them documented disappointing outcomes of large-scale, traditional uses of technology. One described a totally different application.

One of the studies, reported by Santiago Cueto (Cristia et al., 2017), randomly assigned 318 high-poverty, mostly rural primary schools in Peru to receive sturdy, low-cost, practical computers, or to serve as a control group. Teachers were given great latitude in how to use the computers, but limited professional development in how to use them as pedagogical resources. Worse, the computers had software with limited alignment to the curriculum, and teachers were expected to overcome this limitation. Few did. Outcomes were essentially zero in reading and math.

In a second study (Berlinski & Busso, 2017), the IDB funded a very well-designed evaluation in 85 schools in Costa Rica. Schools were randomly assigned to receive one of five approaches. All used the same content on the same schedule to teach geometry to seventh graders. One group used traditional lectures and questions with no technology. The others used active learning, active learning plus interactive whiteboards, active learning plus a computer lab, or active learning plus one computer per student. “Active learning” emphasized discussions, projects, and practical exercises.

On a paper-and-pencil test covering the content studied by all classes, all four of the experimental groups scored significantly worse than the control group. The two lowest-scoring groups were the computer lab condition and, worst of all, the one-computer-per-student condition.

The third study, in Chile (Araya, Arias, Bottan, & Cristia, 2018), was funded by the IDB and the International Development Research Centre of the Canadian government. It involved a much more innovative and unusual application of technology. Fourth grade classes within 24 schools were randomly assigned to experimental or control conditions. In the experimental group, classes in similar schools were assigned to serve as competitors to each other. Within the math classes, students studied with each other and individually for a bimonthly “tournament,” in which students in each class were individually given questions to answer on the computers. Students were taught cheers and brought to fever pitch in their preparations. The participating classes were compared to the control classes, which studied the same content using ordinary methods. All classes, experimental and control, were studying the national curriculum on the same schedule, and all used computers, so all that differed was the tournaments and the cooperative studying to prepare for the tournaments.

The outcomes were frankly astonishing. The students in the experimental schools scored much higher on national tests than controls, with an effect size of +0.30.

The differences in the outcomes of these three approaches are clear. What might explain them, and what do they tell us about applications of technology in Latin America and anywhere?

In Peru, the computers were distributed as planned and generally functioned, but teachers received little professional development. In fact, teachers were not given specific strategies for using the computers, but were expected to come up with their own uses for them.

The Costa Rica study did provide computer users with specific approaches to math and gave teachers a great deal of associated professional development. Yet the computers may have been seen as replacements for teachers, and they may simply not have been as effective as teachers. Alternatively, despite extensive PD, all four of the experimental approaches were very new to the teachers and may not have been well implemented.

In contrast, in the Chilean study, tournaments and cooperative study were greatly facilitated by the computers, but the computers were not central to program effectiveness. The theory of action emphasized enhanced motivation to engage in cooperative study of math. The computers were only a tool to achieve this goal. The tournament strategy resembles a method from the 1970s called Teams-Games-Tournaments (TGT) (DeVries & Slavin, 1978). TGT was very effective, but was complicated for teachers to use, which is why it was not widely adopted. In Chile, computers helped solve the problems of complexity.

It is important to note that in the United States, technology solutions are also not producing major gains in student achievement. Reviews of research on elementary reading (ES=+0.05; Inns et al., 2018) and secondary reading (ES=-0.01; Baye et al., in press) have reported near-zero effects of technology-assisted approaches. Outcomes in elementary math are only somewhat better, averaging an effect size of +0.09 (Pellegrini et al., 2018).

The findings of these rigorous studies of technology in the U.S. and Latin America lead to the conclusion that there is nothing magic about technology. Applications of technology can work if the underlying approach is sound. Perhaps it is best to consider which non-technology approaches are proven or likely to increase learning, and only then imagine how technology could make effective methods easier, less expensive, more motivating, or more instructionally effective. As an analogy, great audio technology can make a concert more pleasant or audible, but the whole experience still depends on great composition and great performances. Perhaps technology in education should be thought of in a similar enabling way, rather than as the core of innovation.

Saint-Exupéry’s Patagonian pilots crossing the Andes had no “Plan B” if their engines fell out. We do have many alternative ways to put technology to work, or to use other methods, if the computer-assisted instruction strategies that have dominated educational technology since the 1970s keep showing such small or zero effects. The Chilean study, and certain exceptions to the overall pattern of research findings in the U.S., suggest appealing “Plans B.”

The technology “engine” is not quite falling out of the education “airplane.” We need not throw in our hand. Instead, it is clear that we need to re-engineer both, to ask not what is the best way to use technology, but what is the best way to engage, excite, and instruct students, and then ask how technology can contribute.

Photo credit: Distributed by Agence France-Presse (NY Times online) [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

References

Araya, R., Arias, E., Bottan, N., & Cristia, J. (2018, August 23). Conecta Ideas: Matemáticas con motivación social. Paper presented at the conference “Educate with Evidence,” Santiago, Chile.

Baye, A., Lake, C., Inns, A., & Slavin, R. (in press). Effective reading programs for secondary students. Reading Research Quarterly.

Berlinski, S., & Busso, M. (2017). Challenges in educational reform: An experiment on active learning in mathematics. Economics Letters, 156, 172-175.

Cristia, J., Ibarraran, P., Cueto, S., Santiago, A., & Severín, E. (2017). Technology and child development: Evidence from the One Laptop per Child program. American Economic Journal: Applied Economics, 9 (3), 295-320.

DeVries, D. L., & Slavin, R. E. (1978). Teams-Games-Tournament:  Review of ten classroom experiments. Journal of Research and Development in Education, 12, 28-38.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018, March 3). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018, March 3). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Small Studies, Big Problems

Everyone knows that “good things come in small packages.” But in research evaluating practical educational programs, this saying does not apply. Small studies are very susceptible to bias. In fact, of all the factors that can inflate effect sizes in educational experiments, small sample size is among the most powerful. This problem is widely known, and in reviewing large and small studies, most meta-analysts address it by requiring minimum sample sizes and/or weighting effect sizes by their sample sizes. Problem solved.


For some reason, the What Works Clearinghouse (WWC) has so far paid little attention to sample size. It has not weighted by sample size in computing mean effect sizes, although the WWC is talking about doing this in the future. It has not even set minimums for sample size for its reviews. I know of one accepted study with a total sample size of 12 (6 experimental, 6 control). These procedures greatly inflate WWC effect sizes.

As one indication of the problem, our review of 645 studies of reading, math, and science programs accepted by the Best Evidence Encyclopedia (www.bestevidence.org) found that studies with fewer than 250 subjects had almost twice the effect sizes of those with more than 250 (effect sizes = +0.30 vs. +0.16). Comparing studies with fewer than 100 students to those with more than 3,000, the ratio was 3.5 to 1 (see Cheung & Slavin [2016] at http://www.bestevidence.org/word/methodological_Sept_21_2015.pdf). Several other studies have found the same effect.

Using data from the What Works Clearinghouse reading and math studies, obtained by graduate student Marta Pellegrini (2017), sample size effects were also extraordinary. The mean effect size for sample sizes of 60 or less was +0.37; for samples of 60-250, +0.29; and for samples of more than 250, +0.13. Among all design factors she studied, small sample size made the most difference in outcomes, rivaled only by researcher/developer-made measures. In fact, sample size is more pernicious, because while reviewers can exclude researcher/developer-made measures within a study and focus on independent measures, a study with a small sample has the same problem for all measures. Also, because small-sample studies are relatively inexpensive, there are quite a lot of them, so reviews that fail to attend to sample size can greatly over-estimate overall mean effect sizes.

My colleague Amanda Inns (2018) recently analyzed WWC reading and math studies to find out why small studies produce such inflated outcomes. There are many reasons small-sample studies may produce such large effect sizes. One is that in small studies, researchers can provide extraordinary amounts of assistance or support to the experimental group. This is called “superrealization.” Another is that when studies with small sample sizes find null effects, they tend not to be published or made available at all; the study is deemed a “pilot” and forgotten. In contrast, a large study is likely to have been paid for by a grant, which will produce a report no matter what the outcome. It has long been understood that published studies produce much higher effect sizes than unpublished studies, and one reason is that small studies are rarely published if their outcomes are not significant.

Whatever the reasons, there is no doubt that small studies greatly overstate effect sizes. In reviewing research, this well-known fact has long led meta-analysts to weight effect sizes by their sample sizes (usually using an inverse variance procedure). Yet as noted earlier, the WWC does not do this, but just averages effect sizes across studies without taking sample size into account.
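For readers who want to see what inverse-variance weighting involves, here is the standard meta-analytic form (this is general practice, not a formula taken from WWC or BEE documentation). Each study’s effect size d_i is weighted by the inverse of its variance, which shrinks as the sample grows:

```latex
\bar{d} = \frac{\sum_i w_i\, d_i}{\sum_i w_i},
\qquad w_i = \frac{1}{v_i},
\qquad v_i \approx \frac{n_{1i} + n_{2i}}{n_{1i}\, n_{2i}} + \frac{d_i^{2}}{2\,(n_{1i} + n_{2i})}
```

where n_{1i} and n_{2i} are the experimental and control sample sizes in study i. Because the variance is dominated by the 1/n terms, small studies get small weights and large studies get large ones.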

One example of the problem of ignoring sample size in averaging is provided by Project CRISS. CRISS was evaluated in two studies. One had 231 students. On a staff-developed “free recall” measure, the effect size was +1.07. The other study had 2338 students, and an average effect size on standardized measures of -0.02. Clearly, the much larger study with an independent outcome measure should have swamped the effects of the small study with a researcher-made measure, but this is not what happened. The WWC just averaged the two effect sizes, obtaining a mean of +0.53.
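To make the CRISS contrast concrete, here is a small Python sketch that redoes the arithmetic both ways, using only the numbers quoted above; the even split of each study’s total sample into experimental and control groups is my simplifying assumption.

```python
# Project CRISS example: unweighted vs. inverse-variance-weighted mean effect size.
# Effect sizes and total Ns come from the two studies described above; the 50/50
# split into experimental and control groups is an assumption for illustration.

studies = [
    {"es": 1.07,  "n_total": 231},   # small study, researcher-made "free recall" measure
    {"es": -0.02, "n_total": 2338},  # large study, standardized measures
]

def inverse_variance_weight(d, n_total):
    """Weight = 1 / var(d), using the usual variance approximation for a
    standardized mean difference with equal group sizes n1 = n2 = N/2."""
    n1 = n2 = n_total / 2
    variance = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return 1 / variance

unweighted = sum(s["es"] for s in studies) / len(studies)

weights = [inverse_variance_weight(s["es"], s["n_total"]) for s in studies]
weighted = sum(w * s["es"] for w, s in zip(weights, studies)) / sum(weights)

print(f"Unweighted mean ES (WWC-style):    {unweighted:+.2f}")  # about +0.53
print(f"Inverse-variance weighted mean ES: {weighted:+.2f}")    # about +0.07
```

Weighting flips the conclusion from a large apparent effect to essentially no effect, which is the point of the example.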

How might the WWC set minimum sample sizes for studies to be included for review? Amanda Inns has proposed a minimum of 60 students (at least 30 experimental and 30 control) for studies that analyze at the student level, and a minimum of 12 clusters (6 and 6), such as classes or schools, for studies that analyze at the cluster level.

In educational research evaluating school programs, good things come in large packages. Small studies are fine as pilots, or for descriptive purposes. But when you want to know whether a program works in realistic circumstances, go big or go home, as they say.

The What Works Clearinghouse should exclude very small studies and should use weighting based on sample sizes in computing means. And there is no reason it should not start doing these things now.

References

Inns, A., & Slavin, R. (2018, August). Do small studies add up in the What Works Clearinghouse? Paper presented at the meeting of the American Psychological Association, San Francisco, CA.

Pellegrini, M. (2017, August). How do different standards lead to different conclusions? A comparison between meta-analyses of two research centers. Paper presented at the European Conference on Educational Research (ECER), Copenhagen, Denmark.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

The Curious Case of the Missing Programs

“Let me tell you, my dear Watson, about one of my most curious and vexing cases,” said Holmes. “I call it, ‘The Case of the Missing Programs’. A school superintendent from America sent me a letter.  It appears that whenever she looks in the What Works Clearinghouse to find a program her district wants to use, nine times out of ten there is nothing there!”

Watson was astonished. “But surely there has to be something. Perhaps the missing programs did not meet WWC standards, or did not have positive effects!”

“Not meeting standards or having disappointing outcomes would be something,” responded Holmes, “but the WWC often says nothing at all about a program. Users are apparently confused. They don’t know what to conclude.”

“The missing programs must make the whole WWC less useful and reliable,” mused Watson.

“Just so, my friend,” said Holmes, “and so we must take a trip to America to get to the bottom of this!”

[Illustration: Sherlock Holmes]

While Holmes and Watson are arranging steamship transportation to America, let me fill you in on this very curious case.

In the course of our work on Evidence for ESSA (www.evidenceforessa.org), we are occasionally asked by school district leaders why there is nothing in our website about a given program, text, or software. Whenever this happens, our staff immediately checks to see if there is any evidence we’ve missed. If we are pretty sure that there are no studies of the missing program that meet our standards, we add the program to our website, with a brief indication that there are no qualifying studies. If any studies do meet our standards, we review them as soon as possible and add them as meeting or not meeting ESSA standards.

Sometimes, districts or states send us their entire list of approved texts and software, and we check them all to see that all are included.

From having done this for more than a year, we now have an entry on most of the reading and math programs any district would come up with, though we keep getting more all the time.

All of this seems to us to be obviously essential. If users of Evidence for ESSA look up their favorite programs, or ones they are thinking of adopting, and find that there is no entry, they begin losing confidence in the whole enterprise. They cannot know whether the program they seek was ignored or missed for some reason, or has no evidence of effectiveness, or perhaps has been proven effective but has not been reviewed.

Recently, a large district sent me their list of 98 approved and supplementary texts, software, and other programs in reading and math. They had marked each according to the ratings given by the What Works Clearinghouse and Evidence for ESSA. At the time (a few weeks ago), Evidence for ESSA had listings for 67% of the programs. Today, of course, it has 100%, because we immediately set to work researching and adding in all the programs we’d missed.

What I found astonishing, however, is how few of the district’s programs were mentioned at all in the What Works Clearinghouse. Only 15% of the reading and math programs were in the WWC.

I’ve written previously about how far behind the WWC is in reviewing programs. But the problem with the district list was not just a question of slowness. Many of the programs the WWC missed have been around for some time.

I’m not sure how the WWC decides what to review, but they do not seem to be trying for completeness. I think this is counterproductive. Users of the WWC should expect to be able to find out about programs that meet standards for positive outcomes, those that have an evidence base that meets evidence standards but do not have positive outcomes, those that have evidence not meeting standards, and those that have no evidence at all. Yet it seems clear that the largest category in the WWC is “none of the above.” Most programs a user would be interested in do not appear at all in the WWC. Most often, a lack of a listing means a lack of evidence, but this is not always the case, especially when evidence is recent. One way or another, finding big gaps in any compendium undermines faith in the whole effort. It’s difficult to expect educational leaders to get into the habit of looking for evidence if most of the programs they consider are not listed.

Imagine, for example, a telephone book that was missing a significant fraction of the people who live in a given city. Users would be frustrated at not being able to find their friends, and the gaps would soon undermine confidence in the whole phone book.

****

When Holmes and Watson arrived in the U.S., they spoke with many educators who’d tried to find programs in the WWC, and they heard tales of frustration and impatience. Many former users said they no longer bothered to consult the WWC and had lost faith in evidence in their field. Fortunately, Holmes and Watson got a meeting with U.S. Department of Education officials, who immediately understood the problem and set to work to find the evidence base (or lack of evidence) for every reading and math program in America. Usage of the WWC soared, and support for evidence-based reform in education increased.

Of course, this outcome is fictional. But it need not remain fictional. The problem is real, and the solution is simple. Or as Holmes would say, “Elementary and secondary, my dear Watson!”

Photo credit: By Rumensz [CC0], from Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Programs and Practices

One issue I hear about all the time when I speak about evidence-based reform in education relates to the question of programs vs. practices. A program is a specific set of procedures, usually with materials, software, professional development, and other elements, designed to achieve one or more important outcomes, such as improving reading, math, or science achievement. Programs are typically created by non-profit organizations, though they may be disseminated by for-profits. Almost everything in the What Works Clearinghouse (WWC) and Evidence for ESSA is a program.

A practice, on the other hand, is a general principle that a teacher can use. It may not require any particular professional development or materials.  Examples of practices include suggestions to use more feedback, more praise, a faster pace of instruction, more higher-order questions, or more technology.

In general, educators, and especially teachers, love practices, but are not so crazy about programs. Programs have structure, requiring adherence to particular activities and use of particular materials. In contrast, teachers can use practices however they wish. Educational leaders often say, “We don’t do programs.” What they mean is, “We give our teachers generic professional development and then turn them loose to interpret it as they see fit.”

One problem with practices is that because they leave the details up to each teacher, teachers are likely to interpret them in a way that conforms to what they are already doing, and then no change happens. As an example of this, I once attended a speech by the late, great Madeline Hunter, who was extremely popular in the 1970s and ‘80s. She spoke and wrote clearly and excitingly, in a very down-to-earth way. The auditorium where she spoke was stuffed to the rafters with teachers, who hung on her every word.

When her speech was over, I was swept out in a throng of happy teachers. They were all saying to each other, “Madeline Hunter supports absolutely everything I’ve ever believed about teaching!”

I love happy teachers, but I was puzzled by their reaction. If all the teachers were already doing the things Madeline Hunter recommended to the best of their ability, then how did her ideas improve their teaching? In actuality, a few studies of Hunter’s principles found no significant effects on student learning, and even more surprising, they found few differences between the teaching behaviors of teachers trained in Hunter’s methods and those of teachers who had not been. Essentially, one might argue, Madeline Hunter’s principles were popular precisely because they did not require teachers to change very much, and if teachers do not change their teaching, why would we expect their students’ learning to change?

[Photo: lawnmower parts]

Another reason that practices rarely change learning is that they are usually small improvements that teachers are expected to assemble to improve their teaching. However, asking teachers to put together many pieces into major improvements is a bit like giving someone the pieces and parts of a lawnmower and asking them to put them together (see picture above). Some mechanically-minded people could do it, but why bother? Why not start with a whole lawnmower?

In the same way, there are gifted teachers who can assemble principles of effective practice into great instruction, but why make it so difficult? Great teachers who could assemble isolated principles into effective teaching strategies are also sure to be able to take a proven program and implement it very well. Why not start with something known to work and then improve it with effective implementation, rather than starting from scratch?

Another problem with practices is that most are impossible to evaluate. By definition, everyone has their own interpretation of every practice. If practices become specific, with specific guides, supports, and materials, they become programs. So a practice is a practice exactly because it is too poorly specified to be a program. And practices that are difficult to specify clearly are also unlikely to improve student outcomes.

There are exceptions, where practices can be evaluated. For example, eliminating ability grouping or reducing class size or assigning (or not assigning) homework are practices that can be evaluated, and can be specified. But these are exceptions.

The squishiness of most practices is the reason that they rarely appear in the WWC or Evidence for ESSA. A proper evaluation contrasts one treatment (an experimental group) to a control group continuing current practices. The treatment group almost has to be a program, because otherwise it is impossible to tell what is being evaluated. For example, how can an experiment evaluate “feedback” if teachers make up their own definitions of “feedback”? How about higher-order questions? How about praise? Rapid pace? Use of these practices can be measured using observation, but differences between the treatment and control groups may be hard to detect because in each case teachers in the control group may also be using the same practices. What teacher does not provide feedback? What teacher does not praise children? What teacher does not use higher-order questions? Some may use these practices more than others, but the differences are likely to be subtle. And subtle differences rarely produce important outcomes.

The distinction between programs and practices has a lot to do with the practices (not programs) promoted by John Hattie. He wants to identify practices that can help teachers know what works in instruction. That’s a noble goal, but it can rarely be accomplished using real classroom research conducted over realistic periods of time. In order to isolate particular practices for study, researchers often do very brief, artificial lab studies that have nothing to do with classroom practice. For example, some lab studies in Hattie’s own review of feedback contrast teachers giving feedback with teachers giving no feedback. What teacher would do that?

It is worthwhile to use what we know from research, experience, program evaluations, and theory to discuss which practices may be most useful for teachers. But claiming particular effect sizes for such practices is rarely justified. The strongest evidence for practical use in schools will almost always come from experiments evaluating programs. Practices have their place, but exposing teachers to a lot of practices and expecting them to put them together to improve student outcomes is not likely to work.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.