On Reviews of Research in Education

Not so long ago, every middle class home had at least one encyclopedia. Encyclopedias were prominently displayed, a statement to all that this was a house that valued learning. People consulted the encyclopedia to find out about things of interest to them. Those who did not own encyclopedias found them in the local library, where they were heavily used. As a kid, I loved everything about encyclopedias. I loved to read them, but also loved their musty smell, their weight, and their beautiful maps and photos.

There were two important advantages of an encyclopedia. First, it was encyclopedic, so users could be reasonably certain that whatever information they wanted was in there somewhere. Second, it was authoritative. Whatever it said in the encyclopedia was likely to be true, or at least carefully vetted by experts.


In educational research, and all scientific fields, we have our own kinds of encyclopedias. One consists of articles in journals that publish reviews of research. In our field, the Review of Educational Research plays a pre-eminent role in this, but there are many others. Reviews are hugely popular. Invariably, review journals have a much higher citation count than even the most esteemed journals focusing on empirical research. In addition to journals, reviews appear in edited volumes, in online compendia, in technical reports, and other sources. At Johns Hopkins, we produce a bi-weekly newsletter, Best Evidence in Brief (BEiB; https://beibindex.wordpress.com/) that summarizes recent research in education. Two years ago we looked at analytics to find out the favorite articles from BEiB. Although BEiB mostly summarizes individual studies, almost all of its favorite articles were summaries of the findings of recent reviews.

Over time, RER and other review journals become “encyclopedias” of a sort.  However, they are not encyclopedic. No journal tries to ensure that key topics will all be covered over time. Instead, journal reviewers and editors evaluate each review sent to them on its own merits. I’m not criticizing this, but it is the way the system works.

Are reviews in journals authoritative? They are in one sense, because reviews accepted for publication have been carefully evaluated by distinguished experts on the topic at hand. However, review methods vary widely and reviews are written for many purposes. Some are written primarily for theory development, and some are really just essays with citations. In contrast, one category of reviews, meta-analyses, goes to great lengths to locate and systematically include all relevant studies. These are not pure types, and most meta-analyses have at least some focus on theory building and discussion of current policy or research issues, even if their main purpose is to systematically review a well-defined set of studies.

Given the state of the art of research reviews in education, how could we create an “encyclopedia” of evidence from all sources on the effectiveness of programs and practices designed to improve student outcomes? The goal of such an activity would be to provide readers with something both encyclopedic and authoritative.

My colleagues and I created two websites that are intended to serve as a sort of encyclopedia of PK-12 instructional programs. The Best Evidence Encyclopedia (BEE; www.bestevidence.org) consists of meta-analyses written by our staff and students, all of which use similar inclusion criteria and review methods. These are used by a wide variety of readers, especially but not only researchers. The BEE has meta-analyses on elementary and secondary reading, reading for struggling readers, writing programs, programs for English learners, elementary and secondary mathematics, elementary and secondary science, early childhood programs, and other topics, so at least as far as achievement outcomes are concerned, it is reasonably encyclopedic. Our second website is Evidence for ESSA, designed more for educators. It seeks to include every program currently in existence, and therefore is truly encyclopedic in reading and mathematics. Sections on social emotional learning, attendance, and science are in progress.

Are the BEE and Evidence for ESSA authoritative as well as encyclopedic? You’ll have to judge for yourself. One important indicator of authoritativeness for the BEE is that all of the meta-analyses are eventually published, so the reviewers for those journals could be considered to be lending authority.

The What Works Clearinghouse (https://ies.ed.gov/ncee/wwc/) could be considered authoritative, as it is a carefully monitored online publication of the U.S. Department of Education. But is it encyclopedic? Probably not, for two reasons. One is that the WWC has difficulty keeping up with new research. Secondly, the WWC does not list programs that do not have any studies that meet its standards. As a result of both of these, a reader who types in the name of a current program may find nothing at all on it. Is this because the program did not meet WWC standards, or because the WWC has not yet reviewed it? There is no way to tell. Still, the WWC makes important contributions in the areas it has reviewed.

Beyond the websites focused on achievement, the most encyclopedic and authoritative source is Blueprints (www.blueprintsprograms.org). Blueprints focuses on drug and alcohol abuse, violence, bullying, social emotional learning, and other topics not extensively covered in other review sources.

In order to provide readers with easy access to all of the reviews meeting a specified level of quality on a given topic, it would be useful to have a source that briefly describes various reviews, regardless of where they appear. For example, a reader might want to know about all of the meta-analyses that focus on elementary mathematics, or dropout prevention, or attendance. These would include review articles published in scientific journals, technical reports, websites, edited volumes, and so on. To be cited in detail, the reviews should have to meet agreed-upon criteria, including a restriction to experimental-control comparisons, a broad and well-documented search for eligible studies, and documented efforts to include all studies (published or unpublished) that fall within well-specified parameters (e.g., subjects, grade levels, and start and end dates of studies included). Reviews that meet these standards might be highlighted, though others, including less systematic reviews, should be listed as well, as supplementary resources.

Creating such a virtual encyclopedia would be a difficult but straightforward task. At the end, the collection of rigorous reviews would offer readers encyclopedic, authoritative information on the topics of their interest, as well as providing something more important that no paper encyclopedias ever included: contrasting viewpoints from well-informed experts on each topic.

My imagined encyclopedia wouldn’t have the hypnotic musty smell, the impressive heft, or the beautiful maps and photos of the old paper encyclopedias. However, it would give readers access to up-to-date, curated, authoritative, quantitative reviews of key topics in education, with readable and appealing summaries of what was concluded in qualifying reviews.

Also, did I mention that unlike the encyclopedias of old, it would have to be free?

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Hummingbirds and Horses: On Research Reviews

Once upon a time, there was a very famous restaurant, called The Hummingbird.   It was known the world over for its unique specialty: Hummingbird Stew.  It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird?  You’d have to catch dozens of them just for one bowl of stew.

One day, an experienced restaurateur came to The Hummingbird, and asked to speak to the owner. When they were alone, the visitor said, “You have quite an operation here! But I have been in the restaurant business for many years, and I have always wondered how you do it. No one can make money selling Hummingbird Stew! Tell me how you make it work, and I promise on my honor to keep your secret to my grave. Do you…mix just a little bit?”


The Hummingbird’s owner looked around to be sure no one was listening.   “You look honest,” he said. “I will trust you with my secret.  We do mix in a bit of horsemeat.”

“I knew it!” said the visitor. “So tell me, what is the ratio?”

“One to one.”

“Really!” said the visitor. “Even that seems amazingly generous!”

“I think you misunderstand,” said the owner.  “I meant one hummingbird to one horse!”

In education, we write a lot of reviews of research. These are often very widely cited, and can be very influential. Because of the work my colleagues and I do, we have occasion to read a lot of reviews. Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically. Well-done reviews can reveal patterns of findings that can be of great value to both researchers and educators. They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.

However, all too many reviews are deeply flawed.  Frequently, reviews of research make it impossible to check the validity of the findings of the original studies.  As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew.   Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies.  If the individual studies are of good quality, this is wonderfully useful.  But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles.  Few readers are willing to do this.

Lately, I have been reading many recent reviews, all of them published, often in top journals. One published review only used pre-post gains. Presumably, if the reviewers found a study with a control group, they would have ignored the control group data! Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control comparisons, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.

I have also recently seen reviews that mix studies with control groups and studies without them (i.e., pre-post gains), and studies with and without pretests. Without pretests, experimental and control groups may have started at very different points, and these differences just carry over to the posttests. A review that accepts this jumble of experimental designs makes no sense. Treatments evaluated using pre-post designs will almost always look far more effective than those that use experimental-control comparisons.
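A small worked example may make the two problems concrete. The sketch below uses invented test scores (every number is hypothetical, chosen only for illustration) and a simple standardized mean difference; it is not the method of any particular review.

```python
def effect_size(mean_a: float, mean_b: float, sd: float) -> float:
    """Standardized mean difference: (mean_a - mean_b) / sd."""
    return (mean_a - mean_b) / sd

SD = 10.0  # hypothetical standard deviation of the test

# Case 1: pre-post gains vs. an experimental-control comparison.
# Hypothetical scores: treated students gain 8 points, but untreated
# students would have gained 6 points anyway.
treat_pre, treat_post = 50.0, 58.0
control_pre, control_post = 50.0, 56.0

# A pre-post design credits the program with all of the growth.
pre_post_es = effect_size(treat_post, treat_pre, SD)
# An experimental-control design counts only growth beyond the controls.
exp_control_es = effect_size(treat_post, control_post, SD)

# Case 2: no pretest. Suppose the program does nothing at all, but the
# treated group happened to start 3 points ahead (hypothetical).
head_start_pre, head_start_post = 53.0, 59.0
# Posttest-only comparison mistakes the head start for a program effect.
posttest_only_es = effect_size(head_start_post, control_post, SD)
# Comparing gains removes the head start.
gain_difference_es = effect_size(
    head_start_post - head_start_pre, control_post - control_pre, SD
)

print(f"pre-post ES:              {pre_post_es:+.2f}")
print(f"experimental-control ES:  {exp_control_es:+.2f}")
print(f"posttest-only ES:         {posttest_only_es:+.2f}")
print(f"gain-difference ES:       {gain_difference_es:+.2f}")
```

With these made-up numbers, the same program looks four times as effective under the pre-post design, and a program with no effect at all appears to have one when pretests are ignored.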

Many published reviews include results from measures that were made up by program developers.  We have documented that analyses using such measures produce outcomes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies rather than longer ones, and from published studies rather than, for example, technical reports.

The biggest problem is that in many reviews, the designs of the individual studies are never described sufficiently to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out how many of the studies behind the review’s conclusions are rigorous and how many are not. Many years ago, I looked into a widely cited review of research on achievement effects of class size. Study details were lacking, so I had to find and read the original studies. It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis! The studies that reduced class size within the usual range (e.g., comparing reductions from 24 to 12) had very small achievement impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.
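To see how a horse or two can dominate the stew, here is a minimal sketch with invented effect sizes. These are not the values from the class size review described above, which are not reported here; they are hypothetical numbers chosen only to show the arithmetic.

```python
from statistics import mean

# Hypothetical effect sizes, for illustration only.
usual_range_studies = [0.02, 0.05, 0.03, 0.04]  # e.g., class size 24 vs. 12
tutoring_studies = [0.60, 0.70]                 # one-to-one or very small group
tennis_study = [1.50]                           # a single outlying study

all_studies = usual_range_studies + tutoring_studies + tennis_study

mean_all = mean(all_studies)
mean_usual_range = mean(usual_range_studies)

print(f"Mean of all {len(all_studies)} studies:   {mean_all:+.2f}")
print(f"Mean of usual-range studies: {mean_usual_range:+.2f}")
```

In this made-up case, three of the seven studies account for nearly all of the average, which is why a table of study details matters so much to the reader.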

It would be great if all reviews excluded studies that used procedures known to inflate effect sizes, but at bare minimum, reviewers should be routinely required to include tables showing critical details of each study, and then to analyze whether the reported outcomes might be due to studies that used procedures suspected to inflate effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.


Can Computers Teach?

Something’s coming

I don’t know

What it is

But it is

Gonna be great!

-Something’s Coming, West Side Story

For more than 40 years, educational technology has been on the verge of transforming educational outcomes for the better. The song “Something’s Coming,” from West Side Story, captures the feeling. We don’t know how technology is going to solve our problems, but it’s gonna be great!

Technology Counts is an occasional section of Education Week. Usually, it publishes enthusiastic predictions about the wonders around the corner, in line with its many advertisements for technology products of all kinds. So it was a bit of a shock to see the most recent edition, dated April 24. An article entitled, “U.S. Teachers Not Seeing Tech Impact,” by Benjamin Herold, reported a nationally representative survey of 700 teachers. They reported huge purchases of digital devices, software, learning apps, and other technology in the past three years. That’s not news, if you’ve been in schools lately. But if you think technology is doing “a lot” to support classroom innovation, you’re out of step with most of the profession. Only 29% of teachers would agree with you, but 41% say “some,” 26% “a little,” and 4% “none.” Equally modest proportions say that technology has “changed their work as a teacher.” The Technology Counts articles describe most teachers as using technology to help them do what they have always done, rather than to innovate.

There are lots of useful things technology is used for, such as teaching students to use computers, and technology may make some tasks easier for teachers and students. But from their earliest beginnings, everyone hoped that computers would help students learn traditional subjects, such as reading and math. Do they?


The answer is, not so much. The table below shows average effect sizes for technology programs in reading and math, using data from four recent rigorous reviews of research. Three of these have been posted at www.bestevidence.org. The fourth, on reading strategies for all students, will be posted in the next few weeks.

Mean Effect Sizes for Applications of Technology in Reading and Mathematics

                                          Number of Studies   Mean Effect Size
Elementary Reading                        16                  +0.09
Elementary Reading – Struggling Readers   6                   +0.05
Secondary Reading                         23                  +0.08
Elementary Mathematics                    14                  +0.07
Study-Weighted Mean                       59                  +0.08

An effect size of +0.08, which is the average across the four reviews, is not zero. But it is not much. It is certainly not revolutionary. Also, the effects of technology are not improving over time.

As a point of comparison, tutoring by teaching assistants has the following average effect sizes:

                                          Number of Studies   Mean Effect Size
Elementary Reading – Struggling Readers   7                   +0.34
Secondary Reading                         2                   +0.23
Elementary Mathematics                    10                  +0.27
Study-Weighted Mean                       19                  +0.29

Tutoring by teaching assistants is more than 3 ½ times as effective as technology. Yet the cost differences between tutoring and technology, especially for effective one-to-small group tutoring by teaching assistants, are not large.
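For readers who want to check the arithmetic, the sketch below recomputes the two study-weighted means from the rows of the tables above. It assumes that "study-weighted" means each category's mean effect size is weighted by its number of studies; the function and variable names are illustrative.

```python
def study_weighted_mean(rows):
    """Average effect size, weighting each row by its number of studies.

    rows: list of (number_of_studies, mean_effect_size) pairs.
    """
    total_studies = sum(n for n, _ in rows)
    return sum(n * es for n, es in rows) / total_studies

# (number of studies, mean effect size), copied from the tables above.
technology = [(16, 0.09), (6, 0.05), (23, 0.08), (14, 0.07)]
tutoring_by_assistants = [(7, 0.34), (2, 0.23), (10, 0.27)]

tech_mean = study_weighted_mean(technology)
tutor_mean = study_weighted_mean(tutoring_by_assistants)

print(f"Technology: {tech_mean:+.2f} across {sum(n for n, _ in technology)} studies")
print(f"Tutoring:   {tutor_mean:+.2f} across {sum(n for n, _ in tutoring_by_assistants)} studies")
print(f"Ratio:      {tutor_mean / tech_mean:.1f}")
```

Under that assumption, the recomputed means round to the +0.08 and +0.29 shown in the tables, and the ratio between them is above 3.5 whether it is taken from the rounded or the unrounded means.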

Tutoring is not the only effective alternative to technology. Our reviews have identified many types of programs that are more effective than technology.

A valid argument for continuing with use of technology is that eventually, we are bound to come up with more effective technology strategies. It is certainly worthwhile to keep experimenting. But this argument has been made since the early 1970s, and technology is still not ready for prime time, at least as far as teaching reading and math are concerned. I still believe that technology’s day will come, when strategies to get the best from both teachers and technology will reliably be able to improve learning. Until then, let’s use programs and practices already proven to be effective, as we continue to work to improve the outcomes of technology.


Could Proven Programs Eliminate Gaps in Elementary Reading Achievement?

What if every child in America could read at grade level or better? What if the number of students in special education for learning disabilities, or retained in grade, could be cut in half?

What if students who become behavior problems or give up on learning because of nothing more than reading difficulties could instead succeed in reading and no longer be frustrated by failure?

Today these kinds of outcomes are only pipe dreams. Despite decades of effort and billions of dollars directed toward remedial and special education, reading levels have barely increased.  Gaps between middle class and economically disadvantaged students remain wide, as do gaps between ethnic groups. We’ve done so much, you might think, and nothing has really worked at scale.

Yet today we have many solutions to the problems of struggling readers, solutions so effective that if widely and effectively implemented, they could substantially change not only the reading skills, but the life chances of students who are struggling in reading.


How do I know this is possible? The answer is that the evidence is there for all to see.

This week, my colleagues and I released a review of research on programs for struggling readers. The review, written by Amanda Inns, Cynthia Lake, Marta Pellegrini, and myself, uses academic language and rigorous review methods. But you don’t have to be a research expert to understand what we found out. In ten minutes, just reading this blog, you will know what needs to be done to have a powerful impact on struggling readers.

Everyone knows that there are substantial gaps in student reading performance according to social class and race. According to the National Assessment of Educational Progress, or NAEP, here are key gaps in terms of effect sizes at fourth grade:

                                               Gap in Effect Sizes
No Free/Reduced lunch / Free/Reduced lunch     0.56
White / African American                       0.52
White / Hispanic                               0.46

These are big differences. In order to eliminate these gaps, we’d have to provide schools serving disadvantaged and minority students with programs or services sufficient to increase their reading scores by about a half standard deviation. Is this really possible?

Can We Really Eliminate Such Big and Longstanding Gaps?

Yes, we can. And we can do it cost-effectively.

Our review examined thousands of studies of programs intended to improve the reading performance of struggling readers. We found 59 studies of 39 different programs that met very high standards of research quality. 73% of the qualifying studies used random assignment to experimental or control groups, just as the most rigorous medical studies do. We organized the programs into response to intervention (RTI) tiers:

Tier 1 means whole-class programs, not just for struggling readers

Tier 2 means targeted services for students who are struggling to read

Tier 3 means intensive services for students who have serious difficulties.

Our categories were as follows:

Multi-Tier (Tier 1 + tutoring for students who need it)

Tier 1:

  • Whole-class programs

Tier 2:

  • Technology programs
  • One-to-small group tutoring

Tier 3:

  • One-to-one tutoring

We are not advocating for RTI itself, because the data on RTI are unclear. But it is just common sense to use proven programs with all students, then proven remedial approaches with struggling readers, then intensive services for students for whom Tier 2 is not sufficient.

Do We Have Proven Programs Able to Overcome the Gaps?

The table below shows average effect sizes for specific reading approaches. Wherever you see effect sizes that approach or exceed +0.50, you are looking at proven solutions to the gaps, or at least programs that could become a component in a schoolwide plan to ensure the success of all struggling readers.

Programs That Work for Struggling Elementary Readers

Program                                                 Grades Proven   No. of Studies   Mean Effect Size

Multi-Tier Approaches
  Success for All                                       K-5             3                +0.35
  Enhanced Core Reading Instruction                     1               1                +0.24

Tier 1 – Classroom Approaches
  Cooperative Integrated Reading & Composition (CIRC)   2-6             3                +0.11
  PALS                                                  1               1                +0.65

Tier 2 – One-to-Small Group Tutoring
  Read, Write, & Type (T 1-3)                           1               1                +0.42
  Lindamood (T 1-3)                                     1               1                +0.65
  SHIP (T 1-3)                                          K-3             1                +0.39
  Passport to Literacy (TA 1-4/7)                       4               4                +0.15
  Quick Reads (TA 1-2)                                  2-3             2                +0.22

Tier 3 – One-to-One Tutoring
  Reading Recovery (T)                                  1               3                +0.47
  Targeted Reading Intervention (T)                     K-1             2                +0.50
  Early Steps (T)                                       1               1                +0.86
  Lindamood (T)                                         K-2             1                +0.69
  Reading Rescue (T or TA)                              1               1                +0.40
  Sound Partners (TA)                                   K-1             2                +0.43
  SMART (PV)                                            K-1             1                +0.40
  SPARK (PV)                                            K-2             1                +0.51

Key:    T: Certified teacher tutors

TA: Teaching assistant tutors

PV: Paid volunteers (e.g., AmeriCorps members)

1-X: For small group tutoring, the usual group size for tutoring (e.g., 1-2, 1-4)

(For more information on each program, see www.evidenceforessa.org)

The table is a road map to eliminating the achievement gaps that our schools have wrestled with for so long. It only lists programs that succeeded at a high level, relative to others at the same tier levels. See the full report or www.evidenceforessa for information on all programs.

It is important to note that there is little evidence of the effectiveness of tutoring in grades 3-5. Almost all of the evidence is from grades K-2. However, studies done in England in secondary schools have found positive effects of three reading tutoring programs in the English equivalent of U.S. grades 6-7. These findings suggest that when well-designed tutoring programs for grades 3-5 are evaluated, they will also show very positive impacts. See our review on secondary reading programs at www.bestevidence.org for information on these English middle school tutoring studies. On the same website, you can also see a review of research on elementary mathematics programs, which reports that most of the successful studies of tutoring in math took place in grades 2-5, another indicator that reading tutoring is also likely to be effective in these grades.

Some of the individual programs have shown effects large enough to overcome gaps all by themselves if they are well implemented (i.e., ES = +0.50 or more). Others have effect sizes lower than +0.50 but if combined with other programs elsewhere on the list, or if used over longer time periods, are likely to eliminate gaps. For example, one-to-one tutoring by certified teachers is very effective, but very expensive. A school might implement a Tier 1 or multi-tier approach to solve all the easy problems inexpensively, then use cost-effective one-to-small group methods for students with moderate reading problems, and only then use one-to-one tutoring with the small number of students with the greatest needs.

Schools, districts, and states should consider the availability, practicality, and cost of these programs to arrive at a workable solution. They then need to make sure that the programs are implemented well enough and long enough to obtain the outcomes seen in the research, or to improve on them.

But the inescapable conclusion from our review is that the gaps can be closed, using proven models that already exist. That’s big news, news that demands big changes.

Photo credit: Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action


Measuring Social Emotional Skills in Schools: Return of the MOOSES

Throughout the U.S., there is huge interest in improving students’ social emotional skills and related behaviors. This is indeed important as a means of building tomorrow’s society. However, measuring SEL skills is terribly difficult. Not that measuring reading, math, or science learning is easy, but there are at least accepted measures in those areas. In SEL, almost anything goes, and measures cover an enormous range. Some measures might be fine for theoretical research and some would be all right if they were given independently of the teachers who administered the treatment, but SEL measures are inherently squishy.

A few months ago, I wrote a blog on measurement of social emotional skills. In it, I argued that social emotional skills should be measured in pragmatic school research as objectively as possible, especially to avoid measures that merely reflect having students in experimental groups repeating back attitudes or terminology they learned in the program. I expressed the ideal for social emotional measurement in school experiments as MOOSES: Measurable, Observable, Objective, Social Emotional Skills.

Since that time, our group at Johns Hopkins University has received a generous grant from the Gates Foundation to add research on social emotional skills and attendance to our Evidence for ESSA website. This has enabled our group to dig a lot deeper into measures for social emotional learning. In particular, JHU graduate student Sooyeon Byun created a typology of SEL measures arrayed from least to most MOOSE-like. This is as follows.

  1. Cognitive Skills or Low-Level SEL Skills.

Examples include executive functioning tasks such as pencil tapping, the Stroop test, and other measures of cognitive regulation, as well as recognition of emotions. These skills may be of importance as part of theories of action leading to social emotional skills of importance to schools, but they are not goals of obvious importance to educators in themselves.

  2. Attitudes toward SEL (non-behavioral).

These include agreement with statements such as “bullying is wrong,” and statements about why other students engage in certain behaviors (e.g., “He spilled the milk because he was mean.”).

  3. Intention for SEL behaviors (quasi-behavioral).

Scenario-based measures (e.g., what would you do in this situation?).

  4. SEL behaviors based on self-report (semi-behavioral).

Reports of actual behaviors of self, or observations of others, often with frequencies (e.g., “How often have you seen bullying in this school during this school year?” or “How often do you feel anxious or afraid in class in this school?”).

This category was divided according to who is reporting:

4a. Interested party (e.g., report by teachers or parents who implemented the program and may have reason to want to give a positive report)

4b. Disinterested party (e.g., report by students or by teachers or parents who did not administer the treatment)

  5. MOOSES (Measurable, Observable, Objective Social Emotional Skills)
  • Behaviors observed by independent observers, either by researchers, ideally unaware of treatment assignment, or by school officials reporting on behaviors as they always would, not as part of a study (e.g., regular reports of office referrals for various infractions, suspensions, or expulsions).
  • Standardized tests
  • Other school records


Uses for MOOSES

All other things being equal, school researchers and educators should want to know about measures as high as possible on the MOOSES scale. However, all things are never equal, and in practice, some measures lower on the MOOSES scale may be all that exists or ever could exist. For example, it is unlikely that school officials or independent observers could determine students’ anxiety or fear, so self-report (level 4b) may be essential. MOOSES measures (level 5) may be objectively reported by school officials, but limiting attention to such measures may restrict SEL measurement to readily observable behaviors, such as aggression, truancy, and other behaviors of importance to school management, and leave out difficult-to-observe behaviors such as bullying.

Still, we expect to find in our ongoing review of the SEL literature that there will be enough research on outcomes measured at level 3 or above to enable us to downplay levels 1 and 2 for school audiences, and in many cases to downplay reports by interested parties in level 4a, where teachers or parents who implement a program then rate the behavior of the children they served.
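One way to picture how the typology might be applied is as a simple screening rule. The sketch below is only an illustration of the levels described above, not the actual procedure used for Evidence for ESSA; the class name, function name, and the example measures' labels are hypothetical, and the rule ignores the "in many cases" qualification on level 4a.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measure:
    name: str
    level: int                      # 1 (least MOOSE-like) to 5 (MOOSES)
    interested_party: bool = False  # distinguishes 4a (True) from 4b (False)

def emphasized_for_school_audiences(measure: Measure) -> bool:
    """Downplay levels 1 and 2, and level 4 reports by interested parties (4a)."""
    if measure.level <= 2:
        return False
    if measure.level == 4 and measure.interested_party:
        return False
    return True

# Example measures drawn from the descriptions of each level above.
measures = [
    Measure("Stroop test", level=1),
    Measure("Agreement that bullying is wrong", level=2),
    Measure("Scenario-based intentions", level=3),
    Measure("Ratings by teachers who implemented the program", level=4,
            interested_party=True),
    Measure("Student self-report of anxiety", level=4),
    Measure("Office referrals from school records", level=5),
]

kept = [m.name for m in measures if emphasized_for_school_audiences(m)]
print(kept)
```

A real review would involve judgment calls that a rule like this cannot capture, but writing the rule down makes clear which categories are being downplayed and why.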

Social emotional learning is important, and we need measures that reflect its importance, minimizing potential bias and staying as close as possible to independent, meaningful measures of behaviors that are of the greatest importance to educators. In our research team, we have very productive arguments about these measurement issues in the course of reviewing individual articles. I placed a cardboard cutout of a “principal” called “Norm” in our conference room. Whenever things get too theoretical, we consult “Norm” for his advice. For example, “Norm” is not too interested in pencil tapping and Stroop tests, but he sure cares a lot about bullying, aggression, and truancy. Of course, as part of our review we will be discussing our issues and initial decisions with real principals and educators, as well as other experts on SEL.

The growing number of studies of SEL in recent years enables reviewers to set higher standards than would have been feasible even just a few years ago. We still have to maintain a balance in which we can be as rigorous as possible but not end up with too few studies to review. We can all aspire to be MOOSES, but that is not practical for some measures. Instead, it is useful to have a model of the ideal and what approaches the ideal, so we can make sense of the studies that exist today, with all due recognition of when we are accepting measures that are nearly MOOSES but not quite the real Bullwinkle.


A Mathematical Mystery

My colleagues and I wrote a review of research on elementary mathematics (Pellegrini, Lake, Inns, & Slavin, 2018). I’ve written about it before, but I wanted to home in on one extraordinary set of findings.

In the review, there were 12 studies that evaluated programs that focused on providing professional development for elementary teachers of mathematics content and mathematics-specific pedagogy. I was sure that this category would find positive effects on student achievement, but it did not. The most remarkable (and depressing) finding involved the huge year-long Intel study in which 80 teachers received 90 hours of very high-quality in-service during the summer, followed by an additional 13 hours of group discussions of videos of the participants’ class lessons. Teachers using this program were compared to 85 control teachers. After all this, students in the Intel classes scored slightly worse than controls on standardized measures (Garet et al., 2016).

If the Intel study were the only disappointment, one might look for flaws in their approach or their evaluation design or other things specific to that study. But as I noted earlier, all 12 of the studies of this kind failed to find positive effects, and the mean effect size was only +0.04 (n.s.).

Lest anyone jump to the conclusion that nothing works in elementary mathematics, I would point out that this is not the case. The most impactful category was tutoring programs, but that’s a special case. The second most impactful category, however, had many features in common with professional development focused on mathematics content and pedagogy, yet had an average effect size of +0.25. This category consisted of programs focused on classroom management and motivation: cooperative learning, classroom management strategies using group contingencies, and programs focusing on social emotional learning.

So there are successful strategies in elementary mathematics, and they all provided a lot of professional development. Yet programs for mathematics content and pedagogy, all of which also provided a lot of professional development, did not show positive effects in high-quality evaluations.

I have some ideas about what may be going on here, but I advance them cautiously, as I am not certain about them.

The theory of action behind professional development focused on mathematics content and pedagogy assumes that elementary teachers have gaps in their understanding of mathematics content and mathematics-specific pedagogy. But perhaps whatever gaps they have are not so important. Here is one example. Leading mathematics educators today take a very strong view that fractions should never be taught using pizza slices, but only using number lines. The idea is that pizza slices are limited to certain fractional concepts, while number lines are more inclusive of all uses of fractions. I can understand and, in concept, support this distinction. But how much difference does it make? Students who are learning fractions can probably be divided into three pizza slices. One slice represents students who understand fractions very well, however they are presented, and another slice consists of students who have no earthly idea about fractions. The third slice consists of students who could have learned fractions if they were taught with number lines but not pizzas. The relative sizes of these slices vary, but I’d guess the third slice is the smallest. Whatever it is, the number of students whose success depends on pizza slices vs. number lines is unlikely to be large enough to shift the whole group mean very much, and that is what is reported in evaluations of mathematics approaches. For example, if the “already got it” slice is one third of all students, and the “probably won’t get it” slice is also one third, the slice consisting of students who might get the concept one way but not the other is also one third. If the effect size for the middle slice were as high as an improbable +0.20, the average for all students would be less than +0.07, averaging across the whole pizza.
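The pizza arithmetic can be checked with a few lines of Python. This is only a rough sketch: it treats the overall effect as a simple share-weighted average of the slice effects, as the example above does, and it assumes the two outer slices gain nothing at all. The function name and the zero effects for those slices are illustrative assumptions, not results from any study.

```python
# Hypothetical illustration of the "three pizza slices" arithmetic.
# Shares and the +0.20 effect come from the example in the text;
# the zero effects for the other two slices are an assumption.

def pooled_effect_size(slices):
    """Return the share-weighted mean effect size across student slices.

    `slices` is a list of (share_of_students, effect_size) pairs whose
    shares must sum to 1.
    """
    total_share = sum(share for share, _ in slices)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("slice shares must sum to 1")
    return sum(share * es for share, es in slices)


slices = [
    (1 / 3, 0.00),  # "already got it": assumed no gain from the new approach
    (1 / 3, 0.20),  # might get the concept one way but not the other
    (1 / 3, 0.00),  # "probably won't get it": assumed no gain either way
]

overall = pooled_effect_size(slices)
print(f"Overall effect size: +{overall:.3f}")  # prints "Overall effect size: +0.067"
```

Shrinking the middle slice, which I’d guess is closer to reality, pulls the overall figure down further. With shares of 0.4, 0.2, and 0.4, the same +0.20 effect averages out to only +0.04.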


A related possibility concerns teachers’ knowledge. Assume that one slice of teachers already knows a lot of the content before the training. Another slice is not going to learn or use it. The third slice, those who did not know the content before but will use it effectively after training, is the only slice likely to show a benefit, but this benefit will be swamped by the zero effects for the teachers who already knew the content and those who will not learn or use it.

If teachers are standing at the front of the class explaining mathematical concepts, such as proportions, a certain proportion of students are learning the content very well and a certain proportion are bored, terrified, or just not getting it. It’s hard to imagine that the successful students are gaining much from a change of content or pedagogy, and only a small proportion of the unsuccessful students will all of a sudden understand what they did not understand before, just because it is explained better. But imagine that instead of only changing content, the teacher adopts cooperative learning. Now the students are having a lot of fun working with peers. Struggling students have an opportunity to ask for explanations and help in a less threatening environment, and they get a chance to see and ultimately absorb how their more capable teammates approach and solve difficult problems. The already high-achieving students may become even higher achieving, because as every teacher knows, explanation helps the explainer as much as the student receiving the explanation.

The point I am making is that the findings of our mathematics review may reinforce a general lesson we take away from all of our reviews: Subtle treatments produce subtle (i.e., small) impacts. Students quickly establish themselves as high or average or low achievers, after which time it is difficult to fundamentally change their motivations and approaches to learning. Making modest changes in content or pedagogy may not be enough to make much difference for most students. Instead, dramatically changing motivation, providing peer assistance, and making mathematics more fun and rewarding, seems more likely to make a significant change in learning than making subtle changes in content or pedagogy. That is certainly what we have found in systematic reviews of elementary mathematics and elementary and secondary reading.

Whatever the student outcomes are compared to controls, there may be good reason to improve mathematics content and pedagogy. But if we are trying to improve achievement for all students, the whole pizza, we need to use methods that make a more profound impact on all students. And that is true any way you slice it.

References

Garet, M. S., Heppen, J. B., Walters, K., Parkinson, J., Smith, T. M., Song, M., & Borman, G. D. (2016). Focusing on mathematical knowledge: The impact of content-intensive teacher professional development (NCEE 2016-4010). Washington, DC: U.S. Department of Education.

Pellegrini, M., Inns, A., Lake, C., & Slavin, R. E. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

What Works in Elementary Math?

Euclid, the ancient Greek mathematician, is considered the inventor of geometry. His king heard about it, and wanted to learn geometry, but being a king, he was kind of busy. He called in Euclid, and asked him if there was a faster way. “I’m sorry, sire,” said Euclid, “but there is no royal road to geometry.”

Skipping forward a couple thousand years, Marta Pellegrini, of the University of Florence in Italy, spent nine months with our group at Johns Hopkins University and led a review of research on effective programs for elementary mathematics (Pellegrini, Lake, Inns & Slavin, 2018), which was recently released on our Best Evidence Encyclopedia (BEE). What we found was not so different from Euclid’s conclusion, but broader: There’s no royal road to anything in mathematics. Improving mathematics achievement isn’t easy. But it is not impossible.

Our review focused on 78 very high-quality studies (65 used random assignment). The 61 programs they evaluated were divided into eight categories: tutoring, technology, professional development for math content and pedagogy, instructional process programs, whole-school reform, social-emotional approaches, textbooks, and benchmark assessments.

Tutoring had the largest and most reliably positive impacts on math learning. Tutoring included one-to-one and one-to-small group services, and some tutors were certified teachers and some were paraprofessionals (teacher assistants). The successful tutoring models were all well-structured, and tutors received high-quality materials and professional development. Across 13 studies involving face-to-face tutoring, average outcomes were very positive. Surprisingly, tutors who were certified teachers (ES=+0.34) and paraprofessionals (ES=+0.32) obtained very similar student outcomes. Even more surprising, one-to-small group tutoring (ES=+0.32) was as effective as one-to-one (ES=+0.26).

Beyond tutoring, the category with the largest average impacts was instructional process programs: classroom organization and management approaches, such as cooperative learning and the Good Behavior Game. The mean effect size was +0.25.


After these two categories, there were only isolated studies with positive outcomes. 14 studies of technology approaches had an average effect size of only +0.07. 12 studies of professional development to improve teachers’ knowledge of math content and pedagogy found an average of only +0.04. One study of a social-emotional program called Positive Action found positive effects but seven other SEL studies did not, and the mean for this category was +0.03. One study of a whole-school reform model called the Center for Data-Driven Reform in Education (CDDRE), which helps schools do needs assessments, and then find, select, and implement proven programs, showed positive outcomes (ES=+0.24), but three other whole-school models found no positive effects. Among 16 studies of math curricula and software, only two, Math in Focus (ES=+0.25) and Math Expressions (ES=+0.11), found significant positive outcomes. On average, benchmark assessment approaches made no difference (ES=0.00).

Taken together, the findings of the 78 studies support a surprising conclusion. Few of the successful approaches had much to do with improving math pedagogy. Most were one-to-one or one-to-small group tutoring approaches that closely resemble tutoring models long used with great success in reading. A classroom management approach, PAX Good Behavior Game, and a social-emotional model, Positive Action, had no particular focus on math, yet both had positive effects on math (and reading). A whole-school reform approach, the Center for Data-Driven Reform in Education (CDDRE), helped schools do needs assessments and select proven programs appropriate to their needs, but CDDRE focused equally on reading and math, and had significantly positive outcomes in both subjects. In contrast, math curricula and professional development specifically designed for mathematics had only two positive examples among 28 programs.

The substantial difference in outcomes of tutoring and outcomes of technology applications is also interesting. The well-established positive impacts of one-to-one and one-to-small group tutoring, in reading as well as math, are often ascribed to the tutor’s ability to personalize instruction for each student. Computer-assisted instruction is also personalized, and has been expected, largely on this basis, to improve student achievement, especially in math (see Cheung & Slavin, 2013). Yet in math, and also reading, one-to-one and one-to-small group tutoring, by certified teachers and paraprofessionals, is far more effective than the average for technology approaches. The comparison of outcomes of personalized CAI and (personalized) tutoring makes it unlikely that personalization is a key explanation for the effectiveness of tutoring. Tutors must contribute something powerful beyond personalization.

I have argued previously that what tutors contribute, in addition to personalization, is a human connection, encouragement, and praise. A tutored child wants to please his or her tutor, not by completing a set of computerized exercises, but by seeing a tutor’s eyes light up and voice respond when the tutee makes progress.

If this is the secret of the effect of tutoring (beyond personalization), perhaps a similar explanation extends to other approaches that happen to improve mathematics performance without using especially innovative approaches to mathematics content or pedagogy. Approaches such as PAX Good Behavior Game and Positive Action, targeted on behavior and social-emotional skills, respectively, focus on children’s motivations, emotions, and behaviors. In the secondary grades, a program called Building Assets, Reducing Risks (BARR) (Corsello & Sharma, 2015) has a similar focus on social-emotional development, not math, but it also has significant positive effects on math (as well as reading). A study in Chile of a program called Conecta Ideas found substantial positive effects in fourth-grade math by having students practice together in preparation for bimonthly math “tournaments” in competition with other schools. Both content and pedagogy were the same in experimental and control classes, but the excitement engendered by the tournaments led to substantial impacts (ES=+0.30 on national tests).

We need breakthroughs in mathematics teaching. Perhaps we have been looking in the wrong places, expecting that improved content and pedagogy will be the key to better learning. They will surely be involved, but perhaps it will turn out that math does not live only in students’ heads, but must also live in their hearts.

There may be no royal road to mathematics, but perhaps there is an emotional road. Wouldn’t it be astonishing if math, the most cerebral of subjects, turns out more than anything else to depend as much on heart as brain?

References

Cheung, A., & Slavin, R. E. (2013). The effectiveness of educational technology applications for enhancing mathematics achievement in K-12 classrooms: A meta-analysis. Educational Research Review, 9, 88-113.

Corsello, M., & Sharma, A. (2015). The Building Assets-Reducing Risks Program: Replication and expansion of an effective strategy to turn around low-achieving schools: i3 development grant final report. Biddeford, ME: Corsello Consulting.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018, March 3). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018, March 3). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Photo credit: By Los Angeles Times Photographic Archive, no photographer stated. [CC BY 4.0  (https://creativecommons.org/licenses/by/4.0)], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.