How Can You Tell When The Findings of a Meta-Analysis Are Likely to Be Valid?

In Baltimore, Faidley’s, founded in 1886, is a much loved seafood market inside Lexington Market. Faidley’s used to be a real old-fashioned market, with sawdust on the floor and an oyster bar in the center. People lined up behind their favorite oyster shucker. In a longstanding tradition, the oyster shuckers picked oysters out of crushed ice and tapped them with their oyster knives. If they sounded full, they opened them. But if they did not, the shuckers discarded them.

I always noticed that the line was longer behind the shucker who was discarding the most oysters. Why? Because everyone knew that the shucker who was pickier was more likely to come up with a dozen fat, delicious oysters, instead of, say, nine great ones and three…not so great.

I bring this up today to tell you how to pick full, fair meta-analyses on educational programs. No, you can’t tap them with an oyster knife, but otherwise, the process is similar. You want meta-analysts who are picky about what goes into their meta-analyses. Your goal is to make sure that a meta-analysis produces results that truly represent what teachers and schools are likely to see in practice when they thoughtfully implement an innovative program. If instead you pick the meta-analysis with the biggest effect sizes, you will always be disappointed.

As a special service to my readers, I’m going to let you in on a few trade secrets about how to quickly evaluate a meta-analysis in education.

One very easy way to evaluate a meta-analysis is to look at the overall effect size, probably shown in the abstract. If the overall mean effect size is more than about +0.40, you probably don’t have to read any further. Unless the treatment is tutoring or some other treatment that you would expect to make a massive difference in student achievement, it is rare to find a single legitimate study with an effect size that large, much less an average that large. A very large effect size is almost a guarantee that a meta-analysis is full of studies with design features that greatly inflate effect sizes, not studies with outstandingly effective treatments.

Next, go to the Methods section, which will include a section on inclusion (or selection) criteria. It should list the types of studies that were or were not accepted into the review. Some of the criteria will have to do with the focus of the meta-analysis, specifying, for example, “studies of science programs for students in grades 6 to 12.” But your focus is on the criteria that specify how picky the meta-analysis is. As one example of a picky set of criteria, here are the main ones we use in Evidence for ESSA and in every analysis we write (a rough sketch of these screens in code follows the list):

  1. Studies had to use random assignment or matching to assign students to experimental or control groups, with schools and students in each specified in advance.
  2. Students assigned to the experimental group had to be compared to very similar students in a control group receiving business-as-usual instruction. The experimental and control students must be well matched, within a quarter standard deviation at pretest (ES=+0.25), and attrition (loss of subjects) must be no more than 15% higher in one group than the other at the end of the study. Why? It is essential that experimental and control groups start and remain the same in all ways other than the treatment. Controls for initial differences do not work well when the differences are large.
  3. There must be at least 30 experimental and 30 control students. Analyses of combined effect sizes must control for sample sizes. Why? Evidence finds substantial inflation of effect sizes in very small studies.
  4. The treatments must be provided for at least 12 weeks. Why? Evidence finds major inflation of effect sizes in very brief studies, and brief studies do not represent the reality of the classroom.
  5. Outcome measures must be independent of the program developers and researchers. Usually, this means using national tests of achievement, though not necessarily standardized tests. Why? Research has found that tests made by researchers can inflate effect sizes by double, or more, and researcher-made measures do not represent the reality of classroom assessment.
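To make these screens concrete, here is a minimal sketch in Python of how a study record might be checked against criteria 1-5. The Study fields and the example at the bottom are invented for illustration; they are not our actual coding form.

```python
# A rough sketch of the inclusion screens described above.
# The Study fields are illustrative, not an actual coding form.
from dataclasses import dataclass

@dataclass
class Study:
    random_or_matched: bool        # criterion 1: random assignment or matching
    pretest_gap_es: float          # criterion 2: pretest difference, in standard deviations
    differential_attrition: float  # criterion 2: attrition gap between groups (0.15 = 15%)
    n_experimental: int            # criterion 3: students in the experimental group
    n_control: int                 # criterion 3: students in the control group
    duration_weeks: int            # criterion 4: length of the treatment
    independent_measure: bool      # criterion 5: outcome not made by developers or researchers

def meets_inclusion_criteria(s: Study) -> bool:
    return (
        s.random_or_matched                                # 1. randomized or matched design
        and abs(s.pretest_gap_es) <= 0.25                  # 2. well matched at pretest
        and s.differential_attrition <= 0.15               # 2. differential attrition no more than 15%
        and s.n_experimental >= 30 and s.n_control >= 30   # 3. at least 30 students per group
        and s.duration_weeks >= 12                         # 4. at least 12 weeks of treatment
        and s.independent_measure                          # 5. developer-independent outcome measure
    )

# A 10-week study is screened out, no matter how large its effect size.
print(meets_inclusion_criteria(Study(True, 0.10, 0.05, 60, 58, 10, True)))  # False
```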

There may be other details, but these are the most important. Note that these standards have a double focus: each is intended both to minimize bias and to maximize similarity to the conditions faced by schools. What principal or teacher who cares about evidence would be interested in adopting a program evaluated in comparison to a very different control group? Or in a study with few subjects, or a very brief duration? Or in a study that used measures made by the developers or researchers? This set is very similar to what the What Works Clearinghouse (WWC) requires, except #5 (the WWC requires exclusion of “overaligned” measures, but not developer- or researcher-made measures).

If these criteria are all there in the “Inclusion Standards,” chances are you are looking at a top-quality meta-analysis. As a rule, it will have average effect sizes lower than those you’ll see in reviews without some or all of these standards, but the effect sizes you see will probably be close to what you will actually get in student achievement gains if your school implements a given program with fidelity and thoughtfulness.

What I find astonishing is how many meta-analyses do not have standards this high. Among experts, these criteria are not controversial, except for the last one, which shouldn’t be. Yet meta-analyses are often written, and accepted by journals, with much lower standards, thereby producing greatly inflated, unrealistic effect sizes.

As one example, there was a meta-analysis of Direct Instruction programs in reading, mathematics, and language, published in the Review of Educational Research (Stockard et al., 2018). I have great respect for Direct Instruction, which has been doing good work for many years. But this meta-analysis was very disturbing.

The inclusion and exclusion criteria in this meta-analysis did not require experimental-control comparisons, did not require well-matched samples, and did not require any minimum sample size or duration. It was not clear how many of the outcome measures were made by program developers or researchers, rather than being independent of the program.

With these minimal inclusion standards, and a very long time span (back to 1966), it is not surprising that the review found a great many qualifying studies: 528, to be exact. The review also reported extraordinary effect sizes: +0.51 for reading, +0.55 for math, and +0.54 for language. If these effects were all true and meaningful, it would mean that DI is much more effective than one-to-one tutoring, for example.

But don’t get your hopes up. The article included an online appendix that showed the sample sizes, study designs, and outcomes of every study.

First, the authors identified eight experimental designs (plus single-subject designs, which were treated separately). Only two of these would meet anyone’s modern standards of meta-analysis: randomized and matched. The others included pre-post gains (no control group), comparisons to test norms, and other pre-scientific designs.

Sample sizes were often extremely small. Leaving aside single-case experiments, there were dozens of single-digit sample sizes (e.g., six students), often with very large effect sizes. Further, there was no indication of study duration.

What is truly astonishing is that RER accepted this study. RER is the top-rated journal in all of education, based on its citation count. Yet this review, and the Kulik & Fletcher (2016) review I cited in a recent blog, clearly did not meet minimal standards for meta-analyses.

My colleagues and I will be working in the coming months to better understand what has gone wrong with meta-analysis in education, and to propose solutions. Of course, our first step will be to spend a lot of time at oyster bars studying how they set such high standards. Oysters and beer will definitely be involved!

Photo credit: Annette White / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)

References

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research, 86(1), 42-78.

Stockard, J., Wood, T. W., Coughlin, C., & Rasplica Khoury, C. (2018). The effectiveness of Direct Instruction curricula: A meta-analysis of a half century of research. Review of Educational Research, 88(4), 479–507. https://doi.org/10.3102/0034654317751919

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

When Scientific Literacy is a Matter of Life and Death

The Covid-19 crisis has put a spotlight on the importance of science.  More than at any time I can recall (with the possible exception of panic over the Soviet launch of Sputnik), scientists are in the news.  We count on them to find a cure for people with the Covid-19 virus and a vaccine to prevent new cases.  We count on them to predict the progression of the pandemic, and to discover public health strategies to minimize its spread.  We are justifiably proud of the brilliance, dedication, and hard work scientists are exhibiting every day.

Yet the Covid-19 pandemic is also throwing a harsh light on the scientific understanding of the whole population.  Today, scientific literacy can be a matter of life or death.  Although political leaders, advised by science experts, may recommend what we should do to minimize risks to ourselves and our families, people have to make their own judgments about what is safe and what is not.  The graphs in the newspaper showing how new infections and deaths are trending have real meaning.  They should inform what choices people make.  We are bombarded with advice on the Internet, from friends and neighbors, from television, in the news.  Yet these sources are likely to conflict.  Which should we believe?  Is it safe to go for a walk?  To the grocery store?  To church?  To a party?  Is Grandpa safer at home or in assisted living?

Scientific literacy is something we all should have learned in school. I would define scientific literacy as an understanding of scientific method, a basic understanding of how things work in nature and in technology, and an understanding of how scientists generate new knowledge and subject possible treatments, such as medicines, to rigorous tests.  All of these understandings, and many more, are ordinarily useful for understanding the news, for example, but for most people they do not have major personal consequences.  But now they do, and it is terrifying to hear the misconceptions and misinformation people have.  In the current situation, a misconception or a piece of misinformation can kill you, or cause you to make decisions that can lead to the death of a family member.


The importance of scientific literacy in the whole population is now apparent in everyday life.  Yet scientific literacy has not been emphasized in our schools.  Especially in elementary schools, science has taken a back seat, because reading and mathematics are tested every year on state tests, beginning in third grade, but science is not tested in most years.  Many elementary teachers will admit that their own preparation in science was insufficient.  In secondary schools, science classes seem to have been developed to produce scientists, which is of course necessary, but not to produce a population that values and understands scientific information.  And now we are paying the price for this limited focus.

One indicator of our limited focus on science education is the substantial imbalance between the amount of rigorous research in science compared to the amount in mathematics and reading.  I have written reviews of research in each of these areas (see www.bestevidence.org), and it is striking how many fewer experimental studies there are in elementary and secondary science.  Take a look at the What Works Clearinghouse, for another example.  There are many programs in the WWC that focus on reading and mathematics, but science?  Not so many.   Given the obvious importance of science and technology to our economy, you would imagine that investments in research in science education would be a top priority, but judging from the numbers of studies of science programs for elementary and secondary schools, that is certainly not taking place.

The Covid-19 pandemic is giving us a hard lesson in the importance of science for all Americans, not just those preparing to become scientists.  I hope we are learning this lesson, and when the crisis is over, I hope our government and private foundations will greatly increase their investments in research, development, evaluation, and dissemination of proven science approaches for all students.

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

Photo credit: Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action.

Cherry Picking? Or Making Better Trees?

Everyone knows that cherry picking is bad. Bad, bad, bad, bad, bad, bad, bad. Cherry picking means showing non-representative examples to give a false impression of the quality of a product, an argument, or a scientific finding. In educational research, for example, cherry picking might mean a publisher or software developer showing off a school using their product that is getting great results, without saying anything about all the other schools using their product that are getting mediocre results, or worse. Very bad, and very common. The existence of cherry picking is one major reason that educational leaders should always look at valid research involving many schools to evaluate the likely impact of a program. The existence of cherry picking also explains why preregistration of experiments is so important, to make it difficult for developers to do many experiments and then publicize only the ones that got the best results, ignoring the others.

However, something that looks a bit like cherry picking can be entirely appropriate, and is in fact an important way to improve educational programs and outcomes. This is when there are variations in outcomes among various programs of a given type. The average across all programs of that type is unimpressive, but some individual programs have done very well, and have replicated their findings in multiple studies.

As an analogy, let’s move from cherries to apples. The first Delicious apple was grown by a farmer in Iowa in 1880. He happened to notice that fruit from one particular branch of one tree had a beautiful shape and a terrific flavor. The Stark Seed Company was looking for new apple varieties, and they bought his tree. They grafted the branch on an ordinary rootstock, and (as apples are wont to do), every apple on the grafted tree looked and tasted like the ones from that one unusual branch.

Had the farmer been hoping to sell his whole orchard, had he taken potential buyers to see this one tree, and had he offered them apples picked from this particular branch, then that would be gross cherry-picking. However, he knew (and the Stark Seed Company knew) all about grafting, so instead of using his exceptional branch to fool anyone (note that I am resisting the urge to mention “graft and corruption”), the farmer and Stark could replicate that amazing branch. The key here is the word “replicate.” If it were impossible to replicate the amazing branch, the farmer would have had a local curiosity at most, or perhaps just a delicious occasional snack. But with replication, this one branch transformed the eating apple for a century.

Now let’s get back to education. Imagine that there were a category of educational programs that generally had mediocre results in rigorous experiments. There is always variation in educational outcomes, so the developers of each program would know of individual schools using their program and getting fantastic results. This would be useful for marketing, but if the program developers are honest, they would make all studies of their program available, rather than claiming that the unusual super-duper schools represent what an average school that adopts their program is likely to obtain.

However, imagine that there is a program that resembles others in its category in most ways, yet time and again gets results far beyond those obtained by similar programs of the same type. Perhaps there is a “secret sauce,” some specific factor that explains the exceptional outcomes, or perhaps the organization that created and/or disseminates the program is exceptionally capable. Either way, any potential user would be missing something if they selected a program based on the mediocre average achievement outcomes for its category. If the outcomes for one or more programs are outstanding (and assuming costs and implementation characteristics are similar), then the average effects for the category are no longer particularly relevant: an educator who cares about evidence should be looking for the most effective programs, since no one adopts an entire category.

I was thinking about apples and cherries because of our group’s work reviewing research on various tutoring programs (Neitzel et al., 2020). As is typical of reviews, we were computing average effect sizes for achievement impacts of categories. Yet these average impacts were much less than the replicated impacts for particular programs. For example, the mean effect size for one-to-small group tutoring was +0.20. Yet various individual programs had mean effect sizes of +0.31, +0.39, +0.42, +0.43, +0.46, and +0.64. In light of these findings, is the practical impact of small group tutoring truly +0.20, or is it somewhere in the range of +0.31 to +0.64? If educators chose programs based on evidence, they would be looking a lot harder at the programs with the larger impacts, not at the mean of all small-group tutoring approaches.
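To make the contrast concrete, here is a small sketch using the effect sizes just quoted. The program names are placeholders; only the numbers come from our review.

```python
# Category average vs. program-level effect sizes for one-to-small group tutoring.
# Program names are placeholders; the effect sizes are the ones quoted above.
category_mean = 0.20
program_effects = {
    "Program A": 0.31,
    "Program B": 0.39,
    "Program C": 0.42,
    "Program D": 0.43,
    "Program E": 0.46,
    "Program F": 0.64,
}

best_program, best_es = max(program_effects.items(), key=lambda item: item[1])
print(f"Category average:         +{category_mean:.2f}")
print(f"Strongest single program: +{best_es:.2f} ({best_program})")

# An educator adopts a specific program, not a category, so the decision-relevant
# numbers are the program-level effect sizes, not the category mean.
```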

Educational programs cannot be replicated (grafted) as easily as apple trees can. But just as the value to the Stark Seed Company of the Iowa farmer’s orchard could not be determined by averaging ratings of a sampling of all of his apples, the value of a category of educational programs cannot be determined by its average effects on achievement. Rather, the value of the category should depend on the effectiveness of its best, replicated, and replicable examples.

At least, you have to admit it’s a delicious idea!

References

Neitzel, A., Lake, C., Pellegrini, M., & Slavin, R. (2020). A synthesis of quantitative research on programs for struggling readers in elementary schools. Available at www.bestevidence.org. Manuscript submitted for publication.

 

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

 

On Reviews of Research in Education

Not so long ago, every middle class home had at least one encyclopedia. Encyclopedias were prominently displayed, a statement to all that this was a house that valued learning. People consulted the encyclopedia to find out about things of interest to them. Those who did not own encyclopedias found them in the local library, where they were heavily used. As a kid, I loved everything about encyclopedias. I loved to read them, but also loved their musty smell, their weight, and their beautiful maps and photos.

There were two important advantages of an encyclopedia. First, it was encyclopedic, so users could be reasonably certain that whatever information they wanted was in there somewhere. Second, they were authoritative. Whatever it said in the encyclopedia was likely to be true, or at least carefully vetted by experts.


In educational research, and all scientific fields, we have our own kinds of encyclopedias. One consists of articles in journals that publish reviews of research. In our field, the Review of Educational Research plays a pre-eminent role in this, but there are many others. Reviews are hugely popular. Invariably, review journals have a much higher citation count than even the most esteemed journals focusing on empirical research. In addition to journals, reviews appear in edited volumes, in online compendia, in technical reports, and in other sources. At Johns Hopkins, we produce a bi-weekly newsletter, Best Evidence in Brief (BEiB; https://beibindex.wordpress.com/) that summarizes recent research in education. Two years ago we looked at analytics to find out the favorite articles from BEiB. Although BEiB mostly summarizes individual studies, almost all of its favorite articles were summaries of the findings of recent reviews.

Over time, RER and other review journals become “encyclopedias” of a sort.  However, they are not encyclopedic. No journal tries to ensure that key topics will all be covered over time. Instead, journal reviewers and editors evaluate each review sent to them on its own merits. I’m not criticizing this, but it is the way the system works.

Are reviews in journals authoritative? They are in one sense, because reviews accepted for publication have been carefully evaluated by distinguished experts on the topic at hand. However, review methods vary widely and reviews are written for many purposes. Some are written primarily for theory development, and some are really just essays with citations. In contrast, one category of reviews, meta-analyses, goes to great lengths to locate and systematically include all relevant studies. These are not pure types, and most meta-analyses have at least some focus on theory building and discussion of current policy or research issues, even if their main purpose is to systematically review a well-defined set of studies.

Given the state of the art of research reviews in education, how could we create an “encyclopedia” of evidence from all sources on the effectiveness of programs and practices designed to improve student outcomes? The goal of such an activity would be to provide readers with something both encyclopedic and authoritative.

My colleagues and I created two websites that are intended to serve as a sort of encyclopedia of PK-12 instructional programs. The Best Evidence Encyclopedia (BEE; www.bestevidence.org) consists of meta-analyses written by our staff and students, all of which use similar inclusion criteria and review methods. These are used by a wide variety of readers, especially but not only researchers. The BEE has meta-analyses on elementary and secondary reading, reading for struggling readers, writing programs, programs for English learners, elementary and secondary mathematics, elementary and secondary science, early childhood programs, and other topics, so at least as far as achievement outcomes are concerned, it is reasonably encyclopedic. Our second website is Evidence for ESSA, designed more for educators. It seeks to include every program currently in existence, and therefore is truly encyclopedic in reading and mathematics. Sections on social emotional learning, attendance, and science are in progress.

Are the BEE and Evidence for ESSA authoritative as well as encyclopedic? You’ll have to judge for yourself. One important indicator of authoritativeness for the BEE is that all of the meta-analyses are eventually published, so the reviewers for those journals could be considered to be lending authority.

The What Works Clearinghouse (https://ies.ed.gov/ncee/wwc/) could be considered authoritative, as it is a carefully monitored online publication of the U.S. Department of Education. But is it encyclopedic? Probably not, for two reasons. One is that the WWC has difficulty keeping up with new research. The other is that the WWC does not list programs that do not have any studies that meet its standards. As a result of both of these, a reader who types in the name of a current program may find nothing at all on it. Is this because the program did not meet WWC standards, or because the WWC has not yet reviewed it? There is no way to tell. Still, the WWC makes important contributions in the areas it has reviewed.

Beyond the websites focused on achievement, the most encyclopedic and authoritative source is Blueprints (www.blueprintsprograms.org). Blueprints focuses on drug and alcohol abuse, violence, bullying, social emotional learning, and other topics not extensively covered in other review sources.

In order to provide readers with easy access to all of the reviews meeting a specified level of quality on a given topic, it would be useful to have a source that briefly describes various reviews, regardless of where they appear. For example, a reader might want to know about all of the meta-analyses that focus on elementary mathematics, or dropout prevention, or attendance. These would include review articles published in scientific journals, technical reports, websites, edited volumes, and so on. To be cited in detail, the reviews should have to meet agreed-upon criteria, including a restriction to experimental-control comparison, a broad and well-documented search for eligible studies, documented efforts to include all studies (published or unpublished) that fall within well-specified parameters (e.g., subjects, grade levels, and start and end dates of studies included). Reviews that meet these standards might be highlighted, though others, including less systematic reviews, should be listed as well, as supplementary resources.

Creating such a virtual encyclopedia would be a difficult but straightforward task. At the end, the collection of rigorous reviews would offer readers encyclopedic, authoritative information on the topics of their interest, as well as providing something more important that no paper encyclopedias ever included: contrasting viewpoints from well-informed experts on each topic.

My imagined encyclopedia wouldn’t have the hypnotic musty smell, the impressive heft, or the beautiful maps and photos of the old paper encyclopedias. However, it would give readers access to up-to-date, curated, authoritative, quantitative reviews of key topics in education, with readable and appealing summaries of what was concluded in qualifying reviews.

Also, did I mention that unlike the encyclopedias of old, it would have to be free?

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Hummingbirds and Horses: On Research Reviews

Once upon a time, there was a very famous restaurant, called The Hummingbird.   It was known the world over for its unique specialty: Hummingbird Stew.  It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird?  You’d have to catch dozens of them just for one bowl of stew.

One day, an experienced restaurateur came to The Hummingbird, and asked to speak to the owner.  When they were alone, the visitor said, “You have quite an operation here!  But I have been in the restaurant business for many years, and I have always wondered how you do it.  No one can make money selling Hummingbird Stew!  Tell me how you make it work, and I promise on my honor to keep your secret to my grave.  Do you…mix just a little bit?”


The Hummingbird’s owner looked around to be sure no one was listening.   “You look honest,” he said. “I will trust you with my secret.  We do mix in a bit of horsemeat.”

“I knew it!,” said the visitor.  “So tell me, what is the ratio?”

“One to one.”

“Really!,” said the visitor.  “Even that seems amazingly generous!”

“I think you misunderstand,” said the owner.  “I meant one hummingbird to one horse!”

In education, we write a lot of reviews of research.  These are often very widely cited, and can be very influential.  Because of the work my colleagues and I do, we have occasion to read a lot of reviews.  Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically.  Well-done reviews can reveal patterns of findings that can be of great value to both researchers and educators.  They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.

However, all too many reviews are deeply flawed.  Frequently, reviews of research make it impossible to check the validity of the findings of the original studies.  As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew.   Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies.  If the individual studies are of good quality, this is wonderfully useful.  But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles.  Few readers are willing to do this.

I have recently been looking at a lot of reviews, all of them published, often in top journals.  One published review only used pre-post gains.  Presumably, if the reviewers found a study with a control group, they would have ignored the control group data!  Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control comparisons, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.

I have also recently seen reviews that include studies with and without control groups (i.e., pre-post gains), and those with and without pretests.  Without pretests, experimental and control groups may have started at very different points, and these differences just carry over to the posttests.  A review that accepts this jumble of experimental designs makes no sense.  Treatments evaluated using pre-post designs will almost always look far more effective than those that use experimental-control comparisons.

Many published reviews include results from measures that were made up by program developers.  We have documented that analyses using such measures produce outcomes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies rather than longer ones, and from published studies rather than, for example, technical reports.

The biggest problem is that in many reviews, the designs of the individual studies are never described sufficiently to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out whether the review’s conclusions are based on rigorous research and how many are not. Many years ago, I looked into a widely cited review of research on achievement effects of class size.  Study details were lacking, so I had to find and read the original studies.   It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis!   The studies that reduced class size within the usual range (e.g., comparing reductions from 24 to 12) had very small achievement  impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.

It would be great if all reviews excluded studies that used procedures known to inflate effect sizes, but at a bare minimum, reviewers should routinely be required to include tables showing the critical details of each study, and to analyze whether the reported outcomes might be due to studies that used procedures suspected of inflating effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.
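For example, the kind of study-features table I have in mind can be boiled down to a few lines of code. The studies and numbers below are invented purely to show how quickly such a table reveals where a pooled effect size comes from.

```python
# Invented example: each study carries the design features a reader needs in order
# to judge whether the pooled effect size rests on rigorous studies.
studies = [
    {"design": "randomized", "measure": "independent",    "n": 420, "weeks": 28, "es": 0.18},
    {"design": "matched",    "measure": "independent",    "n": 260, "weeks": 16, "es": 0.22},
    {"design": "pre-post",   "measure": "developer-made", "n": 24,  "weeks": 4,  "es": 0.85},
    {"design": "pre-post",   "measure": "developer-made", "n": 18,  "weeks": 6,  "es": 0.70},
]

def mean_es(rows):
    return sum(row["es"] for row in rows) / len(rows)

rigorous = [s for s in studies if s["design"] != "pre-post" and s["measure"] == "independent"]

print(f"All studies:      mean ES = +{mean_es(studies):.2f}")   # the hummingbird-and-horse stew
print(f"Rigorous studies: mean ES = +{mean_es(rigorous):.2f}")  # the hummingbirds alone
```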

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Can Computers Teach?

Something’s coming

I don’t know

What it is

But it is

Gonna be great!

-Something’s Coming, West Side Story

For more than 40 years, educational technology has been on the verge of transforming educational outcomes for the better. The song “Something’s Coming,” from West Side Story, captures the feeling. We don’t know how technology is going to solve our problems, but it’s gonna be great!

Technology Counts is an occasional section of Education Week. Usually, it publishes enthusiastic predictions about the wonders around the corner, in line with its many advertisements for technology products of all kinds. So it was a bit of a shock to see the most recent edition, dated April 24. An article entitled, “U.S. Teachers Not Seeing Tech Impact,” by Benjamin Herold, reported a nationally representative survey of 700 teachers. They reported huge purchases of digital devices, software, learning apps, and other technology in the past three years. That’s not news, if you’ve been in schools lately. But if you think technology is doing “a lot” to support classroom innovation, you’re out of step with most of the profession. Only 29% of teachers would agree with you, but 41% say “some,” 26% “a little,” and 4% “none.” Equally modest proportions say that technology has “changed their work as a teacher.” The Technology Counts articles describe most teachers as using technology to help them do what they have always done, rather than to innovate.

There are lots of useful things technology is used for, such as teaching students to use computers, and technology may make some tasks easier for teachers and students. But from their earliest beginnings, everyone hoped that computers would help students learn traditional subjects, such as reading and math. Do they?


The answer is, not so much. The table below shows average effect sizes for technology programs in reading and math, using data from four recent rigorous reviews of research. Three of these have been posted at www.bestevidence.org. The fourth, on reading strategies for all students, will be posted in the next few weeks.

Mean Effect Sizes for Applications of Technology in Reading and Mathematics
Review                                     Number of Studies   Mean Effect Size
Elementary Reading                                16                 +0.09
Elementary Reading – Struggling Readers            6                 +0.05
Secondary Reading                                 23                 +0.08
Elementary Mathematics                            14                 +0.07
Study-Weighted Mean                               59                 +0.08

An effect size of +0.08, which is the average across the four reviews, is not zero. But it is not much. It is certainly not revolutionary. Also, the effects of technology are not improving over time.

As a point of comparison, here are the average effect sizes for tutoring by teaching assistants (the study-weighted means in both tables are recomputed in the short code sketch that follows the second table):

Review                                     Number of Studies   Mean Effect Size
Elementary Reading – Struggling Readers            7                 +0.34
Secondary Reading                                  2                 +0.23
Elementary Mathematics                            10                 +0.27
Study-Weighted Mean                               19                 +0.29
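For readers who want to check the bottom rows, here is a minimal sketch of how the study-weighted means are computed: each review’s mean effect size is weighted by its number of studies. The numbers are copied from the two tables above.

```python
# (number of studies, mean effect size) for each row of the two tables above.
technology  = [(16, 0.09), (6, 0.05), (23, 0.08), (14, 0.07)]
ta_tutoring = [(7, 0.34), (2, 0.23), (10, 0.27)]

def study_weighted_mean(rows):
    total_studies = sum(n for n, _ in rows)
    return sum(n * es for n, es in rows) / total_studies

print(f"Technology:                  +{study_weighted_mean(technology):.2f}")   # +0.08
print(f"Teaching assistant tutoring: +{study_weighted_mean(ta_tutoring):.2f}")  # +0.29
```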

Tutoring by teaching assistants is more than 3½ times as effective as technology. Yet the cost difference between tutoring and technology, especially for effective one-to-small-group tutoring by teaching assistants, is not great.

Tutoring is not the only effective alternative to technology. Our reviews have identified many types of programs that are more effective than technology.

A valid argument for continuing with use of technology is that eventually, we are bound to come up with more effective technology strategies. It is certainly worthwhile to keep experimenting. But this argument has been made since the early 1970s, and technology is still not ready for prime time, at least as far as teaching reading and math are concerned. I still believe that technology’s day will come, when strategies to get the best from both teachers and technology will reliably be able to improve learning. Until then, let’s use programs and practices already proven to be effective, as we continue to work to improve the outcomes of technology.

 This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Could Proven Programs Eliminate Gaps in Elementary Reading Achievement?

What if every child in America could read at grade level or better? What if the number of students in special education for learning disabilities, or retained in grade, could be cut in half?

What if students who become behavior problems or give up on learning because of nothing more than reading difficulties could instead succeed in reading and no longer be frustrated by failure?

Today these kinds of outcomes are only pipe dreams. Despite decades of effort and billions of dollars directed toward remedial and special education, reading levels have barely increased.  Gaps between middle class and economically disadvantaged students remain wide, as do gaps between ethnic groups. We’ve done so much, you might think, and nothing has really worked at scale.

Yet today we have many solutions to the problems of struggling readers, solutions so effective that if widely and effectively implemented, they could substantially change not only the reading skills, but the life chances of students who are struggling in reading.


How do I know this is possible? The answer is that the evidence is there for all to see.

This week, my colleagues and I released a review of research on programs for struggling readers. The review, written by Amanda Inns, Cynthia Lake, Marta Pellegrini, and myself, uses academic language and rigorous review methods. But you don’t have to be a research expert to understand what we found out. In ten minutes, just reading this blog, you will know what needs to be done to have a powerful impact on struggling readers.

Everyone knows that there are substantial gaps in student reading performance according to social class and race. According to the National Assessment of Educational Progress, or NAEP, here are key gaps in terms of effect sizes at fourth grade:

Gap                                                        Effect Size
No free/reduced-price lunch vs. free/reduced-price lunch      0.56
White vs. African American                                    0.52
White vs. Hispanic                                            0.46

These are big differences. In order to eliminate these gaps, we’d have to provide schools serving disadvantaged and minority students with programs or services sufficient to increase their reading scores by about a half standard deviation. Is this really possible?

Can We Really Eliminate Such Big and Longstanding Gaps?

Yes, we can. And we can do it cost-effectively.

Our review examined thousands of studies of programs intended to improve the reading performance of struggling readers. We found 59 studies of 39 different programs that met very high standards of research quality. Of the qualifying studies, 73% used random assignment to experimental or control groups, just as the most rigorous medical studies do. We organized the programs into response to intervention (RTI) tiers:

Tier 1 means whole-class programs, not just for struggling readers.

Tier 2 means targeted services for students who are struggling to read.

Tier 3 means intensive services for students who have serious difficulties.

Our categories were as follows:

Multi-Tier (Tier 1 + tutoring for students who need it)

Tier 1:

  • Whole-class programs

Tier 2:

  • Technology programs
  • One-to-small group tutoring

Tier 3:

  • One-to-one tutoring

We are not advocating for RTI itself, because the data on RTI are unclear. But it is just common sense to use proven programs with all students, then proven remedial approaches with struggling readers, then intensive services for students for whom Tier 2 is not sufficient.

Do We Have Proven Programs Able to Overcome the Gaps?

The table below shows average effect sizes for specific reading approaches. Wherever you see effect sizes that approach or exceed +0.50, you are looking at proven solutions to the gaps, or at least programs that could become a component in a schoolwide plan to ensure the success of all struggling readers.

Programs That Work for Struggling Elementary Readers

Program                                                  Grades Proven   No. of Studies   Mean Effect Size
Multi-Tier Approaches
      Success for All                                         K-5               3              +0.35
      Enhanced Core Reading Instruction                       1                 1              +0.24
Tier 1 – Classroom Approaches
      Cooperative Integrated Reading & Composition (CIRC)     2-6               3              +0.11
      PALS                                                    1                 1              +0.65
Tier 2 – One-to-Small Group Tutoring
      Read, Write, & Type (T 1-3)                             1                 1              +0.42
      Lindamood (T 1-3)                                       1                 1              +0.65
      SHIP (T 1-3)                                            K-3               1              +0.39
      Passport to Literacy (TA 1-4/7)                         4                 4              +0.15
      Quick Reads (TA 1-2)                                    2-3               2              +0.22
Tier 3 – One-to-One Tutoring
      Reading Recovery (T)                                    1                 3              +0.47
      Targeted Reading Intervention (T)                       K-1               2              +0.50
      Early Steps (T)                                         1                 1              +0.86
      Lindamood (T)                                           K-2               1              +0.69
      Reading Rescue (T or TA)                                1                 1              +0.40
      Sound Partners (TA)                                     K-1               2              +0.43
      SMART (PV)                                              K-1               1              +0.40
      SPARK (PV)                                              K-2               1              +0.51

Key:    T: Certified teacher tutors

TA: Teaching assistant tutors

PV: Paid volunteers (e.g., AmeriCorps members)

1-X: For small group tutoring, the usual group size for tutoring (e.g., 1-2, 1-4)

(For more information on each program, see www.evidenceforessa.org)

The table is a road map to eliminating the achievement gaps that our schools have wrestled with for so long. It only lists programs that succeeded at a high level, relative to others at the same tier levels. See the full report or www.evidenceforessa.org for information on all programs.
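As a rough illustration of how a planning team might screen the table against the half-standard-deviation gaps reported earlier, here is a short sketch. Only a few rows are copied from the table, and the 0.50 threshold is the approximate gap size discussed above, not a hard cutoff.

```python
# A few rows copied from the table above: (program, tier, mean effect size).
programs = [
    ("PALS",                              "Tier 1", 0.65),
    ("Read, Write, & Type (T 1-3)",       "Tier 2", 0.42),
    ("Lindamood (T 1-3)",                 "Tier 2", 0.65),
    ("Quick Reads (TA 1-2)",              "Tier 2", 0.22),
    ("Targeted Reading Intervention (T)", "Tier 3", 0.50),
    ("Early Steps (T)",                   "Tier 3", 0.86),
    ("SPARK (PV)",                        "Tier 3", 0.51),
]

GAP = 0.50  # approximate size of the NAEP gaps, in standard deviations

gap_closers = [row for row in programs if row[2] >= GAP]
for name, tier, es in sorted(gap_closers, key=lambda row: -row[2]):
    print(f"{name:<38} {tier:<8} +{es:.2f}")
```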

It is important to note that there is little evidence of the effectiveness of tutoring in grades 3-5. Almost all of the evidence is from grades K-2. However, studies done in England in secondary schools have found positive effects of three reading tutoring programs in the English equivalent of U.S. grades 6-7. These findings suggest that when well-designed tutoring programs for grades 3-5 are evaluated, they will also show very positive impacts. See our review on secondary reading programs at www.bestevidence.org for information on these English middle school tutoring studies. On the same website, you can also see a review of research on elementary mathematics programs, which reports that most of the successful studies of tutoring in math took place in grades 2-5, another indicator that reading tutoring is also likely to be effective in these grades.

Some of the individual programs have shown effects large enough to overcome gaps all by themselves if they are well implemented (i.e., ES = +0.50 or more). Others have effect sizes lower than +0.50 but if combined with other programs elsewhere on the list, or if used over longer time periods, are likely to eliminate gaps. For example, one-to-one tutoring by certified teachers is very effective, but very expensive. A school might implement a Tier 1 or multi-tier approach to solve all the easy problems inexpensively, then use cost-effective one-to-small group methods for students with moderate reading problems, and only then use one-to-one tutoring with the small number of students with the greatest needs.

Schools, districts, and states should consider the availability, practicality, and cost of these solutions to arrive at a workable plan. They then need to make sure that the programs are implemented well enough and long enough to obtain the outcomes seen in the research, or to improve on them.

But the inescapable conclusion from our review is that the gaps can be closed, using proven models that already exist. That’s big news, news that demands big changes.

Photo credit: Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Measuring Social Emotional Skills in Schools: Return of the MOOSES

Throughout the U. S., there is huge interest in improving students’ social emotional skills and related behaviors. This is indeed important as a means of building tomorrow’s society. However, measuring SEL skills is terribly difficult. Not that measuring reading, math, or science learning is easy, but there are at least accepted measures in those areas. In SEL, almost anything goes, and measures cover an enormous range. Some measures might be fine for theoretical research and some would be all right if they were given independently of the teachers who administered the treatment, but SEL measures are inherently squishy.

A few months ago, I wrote a blog on measurement of social emotional skills. In it, I argued that social emotional skills should be measured in pragmatic school research as objectively as possible, especially to avoid measures that merely reflect having students in experimental groups repeating back attitudes or terminology they learned in the program. I expressed the ideal for social emotional measurement in school experiments as MOOSES: Measurable, Observable, Objective, Social Emotional Skills.

Since that time, our group at Johns Hopkins University has received a generous grant from the Gates Foundation to add research on social emotional skills and attendance to our Evidence for ESSA website. This has enabled our group to dig a lot deeper into measures for social emotional learning. In particular, JHU graduate student Sooyeon Byun created a typology of SEL measures arrayed from least to most MOOSE-like. This is as follows.

  1. Cognitive Skills or Low-Level SEL Skills.

Examples include executive functioning tasks such as pencil tapping, the Stroop test, and other measures of cognitive regulation, as well as recognition of emotions. These skills may be of importance as part of theories of action leading to social emotional skills of importance to schools, but they are not goals of obvious importance to educators in themselves.

  2. Attitudes toward SEL (non-behavioral).

These include agreement with statements such as “bullying is wrong,” and statements about why other students engage in certain behaviors (e.g., “He spilled the milk because he was mean.”).

  3. Intention for SEL behaviors (quasi-behavioral).

Scenario-based measures (e.g., what would you do in this situation?).

  4. SEL behaviors based on self-report (semi-behavioral).

Reports of actual behaviors of self, or observations of others, often with frequencies (e.g., “How often have you seen bullying in this school during this school year?” or “How often do you feel anxious or afraid in class in this school?”).

This category was divided according to who is reporting:

4a. Interested party (e.g., report by teachers or parents who implemented the program and may have reason to want to give a positive report)

4b. Disinterested party (e.g., report by students or by teachers or parents who did not administer the treatment)

  5. MOOSES (Measurable, Observable, Objective Social Emotional Skills)
  • Behaviors observed by independent observers, either researchers (ideally unaware of treatment assignment) or school officials reporting on behaviors as they always would, not as part of a study (e.g., regular reports of office referrals for various infractions, suspensions, or expulsions).
  • Standardized tests
  • Other school records


Uses for MOOSES

All other things being equal, school researchers and educators should want to know about measures as high as possible on the MOOSES scale. However, all things are never equal, and in practice, some measures lower on the MOOSES scale may be all that exists or ever could exist. For example, it is unlikely that school officials or independent observers could determine students’ anxiety or fear, so self-report (level 4b) may be essential. MOOSES measures (level 5) may be objectively reported by school officials, but limiting attention to such measures may limit SEL measurement to readily observable behaviors, such as aggression, truancy, and other behaviors of importance to school management, and not on difficult-to-observe behaviors such as bullying.

Still, we expect to find in our ongoing review of the SEL literature that there will be enough research on outcomes measured at level 3 or above to enable us to downplay levels 1 and 2 for school audiences, and in many cases to downplay reports by interested parties in level 4a, where teachers or parents who implement a program then rate the behavior of the children they served.
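To show how such a screen might be represented in a review database, here is a rough sketch. The level names and the screening rule are simplified paraphrases of the typology and of the decisions described above (with “downplay” rendered as “exclude” for simplicity), not our actual coding scheme.

```python
# A simplified rendering of the MOOSES hierarchy as coding levels, with an illustrative
# screen: keep measures at level 3 or higher, and set aside level 4 reports that come
# from the same teachers or parents who delivered the program (the 4a case above).
from enum import IntEnum

class MoosesLevel(IntEnum):
    COGNITIVE_SKILL = 1   # e.g., pencil tapping, Stroop test
    ATTITUDE = 2          # e.g., agreement that "bullying is wrong"
    INTENTION = 3         # scenario-based "what would you do?" measures
    SELF_REPORT = 4       # reported behaviors of self or others
    MOOSES = 5            # independent observation, school records, standardized tests

def usable_for_school_audiences(level: MoosesLevel, interested_reporter: bool = False) -> bool:
    if level < MoosesLevel.INTENTION:
        return False  # levels 1 and 2 downplayed for school audiences
    if level == MoosesLevel.SELF_REPORT and interested_reporter:
        return False  # level 4a: rated by the people who implemented the program
    return True

print(usable_for_school_audiences(MoosesLevel.ATTITUDE))                                # False
print(usable_for_school_audiences(MoosesLevel.SELF_REPORT, interested_reporter=True))   # False
print(usable_for_school_audiences(MoosesLevel.MOOSES))                                  # True
```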

Social emotional learning is important, and we need measures that reflect their importance, minimizing potential bias and staying as close as possible to independent, meaningful measures of behaviors that are of the greatest importance to educators. In our research team, we have very productive arguments about these measurement issues in the course of reviewing individual articles. I placed a cardboard cutout of a “principal” called “Norm” in our conference room. Whenever things get too theoretical, we consult “Norm” for his advice. For example, “Norm” is not too interested in pencil tapping and Stroop tests, but he sure cares a lot about bullying, aggression, and truancy. Of course, as part of our review we will be discussing our issues and initial decisions with real principals and educators, as well as other experts on SEL.

The growing number of studies of SEL in recent years enables reviewers to set higher standards than would have been feasible even just a few years ago. We still have to maintain a balance in which we can be as rigorous as possible but not end up with too few studies to review.  We can all aspire to be MOOSES, but that is not practical for some measures. Instead, it is useful to have a model of the ideal and what approaches the ideal, so we can make sense of the studies that exist today, with all due recognition of when we are accepting measures that are nearly MOOSES but not quite the real Bullwinkle.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

A Mathematical Mystery

My colleagues and I wrote a review of research on elementary mathematics (Pellegrini, Lake, Inns, & Slavin, 2018). I’ve written about it before, but I wanted to home in on one extraordinary set of findings.

In the review, there were 12 studies that evaluated programs focused on providing elementary teachers with professional development in mathematics content and mathematics-specific pedagogy. I was sure that this category would find positive effects on student achievement, but it did not. The most remarkable (and depressing) finding involved the huge year-long Intel study in which 80 teachers received 90 hours of very high-quality in-service during the summer, followed by an additional 13 hours of group discussions of videos of the participants’ class lessons. Teachers using this program were compared to 85 control teachers. After all this, students in the Intel classes scored slightly worse than controls on standardized measures (Garet et al., 2016).

If the Intel study were the only disappointment, one might look for flaws in their approach or their evaluation design or other things specific to that study. But as I noted earlier, all 12 of the studies of this kind failed to find positive effects, and the mean effect size was only +0.04 (n.s.).

Lest anyone jump to the conclusion that nothing works in elementary mathematics, I would point out that this is not the case. The most impactful category was tutoring programs, so that’s a special case. But the second most impactful category had many features in common with professional development focused on mathematics content and pedagogy, but had an average effect size of +0.25. This category consisted of programs focused on classroom management and motivation: Cooperative learning, classroom management strategies using group contingencies, and programs focusing on social emotional learning.

So there are successful strategies in elementary mathematics, and they all provided a lot of professional development. Yet programs for mathematics content and pedagogy, all of which also provided a lot of professional development, did not show positive effects in high-quality evaluations.

I have some ideas about what may be going on here, but I advance them cautiously, as I am not certain about them.

The theory of action behind professional development focused on mathematics content and pedagogy assumes that elementary teachers have gaps in their understanding of mathematics content and mathematics-specific pedagogy. But perhaps whatever gaps they have are not so important. Here is one example. Leading mathematics educators today take a very strong view that fractions should never be taught using pizza slices, but only using number lines. The idea is that pizza slices are limited to certain fractional concepts, while number lines are more inclusive of all uses of fractions. I can understand and, in concept, support this distinction. But how much difference does it make? Students who are learning fractions can probably be divided into three pizza slices. One slice represents students who understand fractions very well, however they are presented, and another slice consists of students who have no earthly idea about fractions. The third slice consists of students who could have learned fractions if they were taught with number lines but not pizzas. The relative sizes of these slices vary, but I’d guess the third slice is the smallest. Whatever it is, the number of students whose success depends on number lines rather than pizza slices is unlikely to be large enough to shift the whole-group mean very much, and that is what is reported in evaluations of mathematics approaches. For example, if the “already got it” slice is one third of all students, and the “probably won’t get it” slice is also one third, the slice consisting of students who might get the concept one way but not the other is also one third. If the effect size for the middle slice were as high as an improbable +0.20, the average for all students would be less than +0.07, averaging across the whole pizza.
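Here is the arithmetic behind that estimate as a tiny sketch; the one-third shares and the +0.20 figure are the illustrative values from the paragraph above, not data.

```python
# Three "slices" of students, with each slice's share of the class and the assumed
# effect size of switching from pizza slices to number lines for that slice.
slices = [
    (1/3, 0.00),  # already understand fractions: nothing to gain from the change
    (1/3, 0.00),  # probably won't get it either way
    (1/3, 0.20),  # might get it with number lines but not pizzas (a generous estimate)
]

overall_es = sum(share * es for share, es in slices)
print(f"Whole-class effect size: about +{overall_es:.2f}")  # roughly +0.07
```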


A related possibility relates to teachers’ knowledge. Assume that one slice of teachers already knows a lot of the content before the training. Another slice is not going to learn or use it. The third slice, those who did not know the content before but will use it effectively after training, is the only slice likely to show a benefit, but this benefit will be swamped by the zero effects for the teachers who already knew the content and those who will not learn or use it.

If teachers are standing at the front of the class explaining mathematical concepts, such as proportions, a certain proportion of students are learning the content very well and a certain proportion are bored, terrified, or just not getting it. It’s hard to imagine that the successful students are gaining much from a change of content or pedagogy, and only a small proportion of the unsuccessful students will all of a sudden understand what they did not understand before, just because it is explained better. But imagine that instead of only changing content, the teacher adopts cooperative learning. Now the students are having a lot of fun working with peers. Struggling students have an opportunity to ask for explanations and help in a less threatening environment, and they get a chance to see and ultimately absorb how their more capable teammates approach and solve difficult problems. The already high-achieving students may become even higher achieving, because as every teacher knows, explanation helps the explainer as much as the student receiving the explanation.

The point I am making is that the findings of our mathematics review may reinforce a general lesson we take away from all of our reviews: Subtle treatments produce subtle (i.e., small) impacts. Students quickly establish themselves as high, average, or low achievers, after which it is difficult to fundamentally change their motivations and approaches to learning. Making modest changes in content or pedagogy may not be enough to make much difference for most students. Instead, dramatically changing motivation, providing peer assistance, and making mathematics more fun and rewarding seem more likely to produce significant changes in learning than subtle changes in content or pedagogy. That is certainly what we have found in systematic reviews of elementary mathematics and of elementary and secondary reading.

Whatever the student outcomes compared to those of controls, there may be good reasons to improve mathematics content and pedagogy. But if we are trying to improve achievement for all students, the whole pizza, we need to use methods that make a more profound impact on all students. And that is true any way you slice it.

References

Garet, M. S., Heppen, J. B., Walters, K., Parkinson, J., Smith, T. M., Song, M., & Borman, G. D. (2016). Focusing on mathematical knowledge: The impact of content-intensive teacher professional development (NCEE 2016-4010). Washington, DC: U.S. Department of Education.

Pellegrini, M., Lake, C., Inns, A., & Slavin, R. E. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

What Works in Elementary Math?

Euclid, the ancient Greek mathematician, is considered the father of geometry. His king, Ptolemy I, heard about geometry and wanted to learn it, but being a king, he was kind of busy. He called in Euclid and asked him if there was a faster way. “I’m sorry, sire,” said Euclid, “but there is no royal road to geometry.”

Skipping forward a couple thousand years, Marta Pellegrini, of the University of Florence in Italy, spent nine months with our group at Johns Hopkins University and led a review of research on effective programs for elementary mathematics (Pellegrini, Lake, Inns & Slavin, 2018), which was recently released on our Best Evidence Encyclopedia (BEE). What we found was not so different from Euclid’s conclusion, but broader: There’s no royal road to anything in mathematics. Improving mathematics achievement isn’t easy. But it is not impossible.

Our review focused on 78 very high-quality studies (65 of which used random assignment). The 61 programs evaluated in these studies fell into eight categories: tutoring, technology, professional development for math content and pedagogy, instructional process programs, whole-school reform, social-emotional approaches, textbooks, and benchmark assessments.

Tutoring had the largest and most reliably positive impacts on math learning. Tutoring included one-to-one and one-to-small group services; some tutors were certified teachers and some were paraprofessionals (teacher assistants). The successful tutoring models were all well-structured, and tutors received high-quality materials and professional development. Across 13 studies involving face-to-face tutoring, average outcomes were very positive. Surprisingly, tutors who were certified teachers (ES=+0.34) and those who were paraprofessionals (ES=+0.32) obtained very similar student outcomes. Even more surprising, one-to-small group tutoring (ES=+0.32) was at least as effective as one-to-one tutoring (ES=+0.26).

Beyond tutoring, the category with the largest average impacts was instructional process programs: classroom organization and management approaches, such as cooperative learning and the Good Behavior Game. The mean effect size was +0.25.


After these two categories, there were only isolated studies with positive outcomes. Fourteen studies of technology approaches had an average effect size of only +0.07. Twelve studies of professional development to improve teachers’ knowledge of math content and pedagogy found an average of only +0.04. One study of a social-emotional program called Positive Action found positive effects, but seven other SEL studies did not, and the mean for this category was +0.03. One study of a whole-school reform model, the Center for Data-Driven Reform in Education (CDDRE), which helps schools do needs assessments and then find, select, and implement proven programs, showed positive outcomes (ES=+0.24), but three other whole-school models found no positive effects. Among 16 studies of math curricula and software, only two, Math in Focus (ES=+0.25) and Math Expressions (ES=+0.11), found significant positive outcomes. On average, benchmark assessment approaches made no difference (ES=0.00).

Taken together, the findings of the 78 studies support a surprising conclusion. Few of the successful approaches had much to do with improving math pedagogy. Most were one-to-one or one-to-small group tutoring approaches that closely resemble tutoring models long used with great success in reading. A classroom management approach, PAX Good Behavior Game, and a social-emotional model, Positive Action, had no particular focus on math, yet both had positive effects on math (and reading). A whole-school reform approach, the Center for Data-Driven Reform in Education (CDDRE), helped schools do needs assessments and select proven programs appropriate to their needs, but CDDRE focused equally on reading and math, and had significantly positive outcomes in both subjects. In contrast, math curricula and professional development specifically designed for mathematics had only two positive examples among 28 programs.

The substantial difference between the outcomes of tutoring and the outcomes of technology applications is also interesting. The well-established positive impacts of one-to-one and one-to-small group tutoring, in reading as well as math, are often ascribed to the tutor’s ability to personalize instruction for each student. Computer-assisted instruction (CAI) is also personalized, and has been expected, largely on this basis, to improve student achievement, especially in math (see Cheung & Slavin, 2013). Yet in math, and also in reading, one-to-one and one-to-small group tutoring, by certified teachers and by paraprofessionals, is far more effective than the average for technology approaches. The comparison of outcomes of personalized CAI and (personalized) tutoring makes it unlikely that personalization is a key explanation for the effectiveness of tutoring. Tutors must contribute something powerful beyond personalization.

I have argued previously that what tutors contribute, in addition to personalization, is a human connection, encouragement, and praise. A tutored child wants to please his or her tutor, not by completing a set of computerized exercises, but by seeing the tutor’s eyes light up and hearing the tutor’s voice respond when the tutee makes progress.

If this is the secret of the effect of tutoring (beyond personalization), perhaps a similar explanation extends to other approaches that happen to improve mathematics performance without using especially innovative approaches to mathematics content or pedagogy. Approaches such as the PAX Good Behavior Game and Positive Action, targeted at behavior and social-emotional skills, respectively, focus on children’s motivations, emotions, and behaviors. In the secondary grades, a program called Building Assets, Reducing Risk (BARR) (Corsello & Sharma, 2015) has a similar focus on social-emotional development, not math, but it also has significant positive effects on math (as well as reading). A study in Chile of a program called Conecta Ideas found substantial positive effects in fourth grade math by having students practice together in preparation for bimonthly math “tournaments” in competition with other schools. Both content and pedagogy were the same in experimental and control classes, but the excitement engendered by the tournaments led to substantial impacts (ES=+0.30 on national tests).

We need breakthroughs in mathematics teaching. Perhaps we have been looking in the wrong places, expecting that improved content and pedagogy will be the key to better learning. They will surely be involved, but perhaps it will turn out that math does not live only in students’ heads, but must also live in their hearts.

There may be no royal road to mathematics, but perhaps there is an emotional road. Wouldn’t it be astonishing if math, the most cerebral of subjects, turned out, more than any other subject, to depend as much on heart as on brain?

References

Cheung, A., & Slavin, R. E. (2013). The effectiveness of educational technology applications for enhancing mathematics achievement in K-12 classrooms: A meta-analysis. Educational Research Review, 9, 88-113.

Corsello, M., & Sharma, A. (2015). The Building Assets-Reducing Risks Program: Replication and expansion of an effective strategy to turn around low-achieving schools: i3 development grant final report. Biddeford, ME: Consello Consulting.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018, March 3). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Lake, C., Inns, A., & Slavin, R. E. (2018, March 3). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Photo credit: By Los Angeles Times Photographic Archive, no photographer stated. [CC BY 4.0  (https://creativecommons.org/licenses/by/4.0)], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.