In Meta-Analyses, Weak Inclusion Standards Lead to Misleading Conclusions. Here’s Proof.

By Robert Slavin and Amanda Neitzel, Johns Hopkins University

In two recent blogs (here and here), I’ve written about Baltimore’s culinary glories: crabs and oysters. My point was just that in both cases, there is a lot you have to discard to get to what matters. But I was of course just setting the stage for a problem that is deadly serious, at least to anyone concerned with evidence-based reform in education.

Meta-analysis has contributed a great deal to educational research and reform, helping readers find out about the broad state of the evidence on practical approaches to instruction and school and classroom organization. Recent methodological developments in meta-analysis and meta-regression, and promotion of the use of these methods by agencies such as IES and NSF, have expanded awareness and use of modern methods.

Yet across the large number of meta-analyses published over the past five years, even up to the present, the quality is highly uneven. That’s putting it nicely. The problem is that most meta-analyses in education are far too unselective with regard to the methodological quality of the studies they include. Actually, I’ve been ranting about this for many years, and along with colleagues, have published several articles on it (e.g., Cheung & Slavin, 2016; Slavin & Madden, 2011; Wolf et al., 2020). But clearly, my colleagues and I are not making enough of a difference.

My colleague, Amanda Neitzel, and I thought of a simple way we could communicate the enormous difference it makes if a meta-analysis accepts studies that contain design elements known to inflate effect sizes. In this blog, we once again use the Kulik & Fletcher (2016) meta-analysis of research on computerized intelligent tutoring, which I critiqued in my blog a few weeks ago (here). As you may recall, the only methodological inclusion standards used by Kulik & Fletcher required that studies use RCTs or QEDs, and that they have a duration of at least 30 minutes (!!!). However, they included enough information to allow us to determine the effect sizes that would have resulted if they had a) weighted for sample size in computing means, which they did not, and b) excluded studies with various features known to inflate effect size estimates. Here is a table summarizing our findings when we additionally excluded studies containing procedures known to inflate mean effect sizes:

If you follow meta-analyses, this table should be shocking. It starts out with 50 studies and a very large effect size, ES=+0.65. Just weighting the mean for study sample sizes reduces this to +0.56. Eliminating small studies (n<60) cuts the number of studies almost in half (n=27) and cuts the effect size to +0.39. But the largest reductions are due to excluding “local” measures, which on inspection are always measures made by developers or researchers themselves. (The alternative was “standardized measures.”) By itself, excluding local measures (and weighting) cuts the number of included studies to 12, and the effect size to +0.10, which is not significantly different from zero (p=.17). Excluding small studies, brief studies, and “local” measures together only slightly changes the results, because both small and brief studies almost always use “local” (i.e., researcher-made) measures. Excluding all three, and weighting for sample size, leaves this review with only nine studies and an effect size of +0.09, which is not significantly different from zero (p=.21).
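To make the mechanics concrete, here is a minimal sketch in Python of how weighting by sample size and excluding small studies and “local” measures can pull a mean effect size down. The four studies below are entirely hypothetical, invented for illustration; they are not the Kulik & Fletcher data. The weights are plain sample sizes, a simplification of the weighting schemes used in formal meta-analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Study:
    n: int                # total sample size (hypothetical)
    es: float             # reported effect size (hypothetical)
    local_measure: bool   # True if the outcome measure was researcher-made


def unweighted_mean(studies: List[Study]) -> float:
    """Simple average: every study counts equally, regardless of size."""
    return sum(s.es for s in studies) / len(studies)


def weighted_mean(studies: List[Study]) -> float:
    """Average weighted by sample size, so large studies count for more."""
    total_n = sum(s.n for s in studies)
    return sum(s.n * s.es for s in studies) / total_n


# Hypothetical studies: two tiny ones with researcher-made measures and
# large effects, two large ones with independent measures and small effects.
studies = [
    Study(n=20, es=0.90, local_measure=True),
    Study(n=40, es=0.70, local_measure=True),
    Study(n=400, es=0.10, local_measure=False),
    Study(n=600, es=0.05, local_measure=False),
]

# Selective standards: drop small studies (n < 60) and local measures.
selective = [s for s in studies if s.n >= 60 and not s.local_measure]

print(f"Unweighted, all studies:  {unweighted_mean(studies):+.2f}")
print(f"Weighted, all studies:    {weighted_mean(studies):+.2f}")
print(f"Weighted, selective only: {weighted_mean(selective):+.2f}")
```

In this made-up example the small studies with researcher-made measures dominate the unweighted mean, and their influence shrinks or disappears under weighting and exclusion, which is the same pattern the table shows.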

The estimates at the bottom of the chart represent what we call “selective standards.” These are the standards we apply in every meta-analysis we write (see www.bestevidence.org), and in Evidence for ESSA (www.evidenceforessa.org).

It is easy to see why this matters. Selective standards almost always produce much lower estimates of effect sizes than do reviews with much less selective standards, which therefore include studies containing design features that have a strong positive bias on effect sizes. Consider how this affects mean effect sizes in meta-analyses. For example, imagine a study that uses two measures of achievement. One is a measure made by the researcher or developer specifically to be “sensitive” to the program’s outcomes. The other is a test independent of the program, such as GRADE/GMADE or Woodcock, standardized tests but not necessarily state tests. Imagine that the researcher-made measure obtains an effect size of +0.30, while the independent measure has an effect size of +0.10. A less-selective meta-analysis would report a mean effect size of +0.20, a respectable-sounding impact. But a selective meta-analysis would report an effect size of +0.10, a very small impact. Which of these estimates represents an outcome with meaning for practice? Clearly, school leaders should not value the +0.30 or +0.20 estimates, which require use of a test designed to be “sensitive” to the treatment. They should care about the gains on the independent test, which represents what educators are trying to achieve and what they are held accountable for. The information from the researcher-made test may be valuable to the researchers, but it has little or no value to educators or students.
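The arithmetic in this example is simple enough to write out. The sketch below, in Python, assumes the less-selective review averages the two measures with equal weight, which is what the +0.20 figure implies; the function names are hypothetical.

```python
def less_selective_estimate(researcher_made_es: float, independent_es: float) -> float:
    """Average both measures equally, as a less-selective review would."""
    return (researcher_made_es + independent_es) / 2


def selective_estimate(researcher_made_es: float, independent_es: float) -> float:
    """Keep only the independent measure; the researcher-made one is excluded."""
    return independent_es


researcher_made = 0.30  # effect size on the researcher-made measure
independent = 0.10      # effect size on the independent measure

print(f"Less selective: {less_selective_estimate(researcher_made, independent):+.2f}")
print(f"Selective:      {selective_estimate(researcher_made, independent):+.2f}")
```

The same study, with the same students and the same program, yields an estimate twice as large under the less-selective rule.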

The point of this exercise is to illustrate that in meta-analyses, choices of methodological exclusions may entirely determine the outcomes. Had they chosen other exclusions, the Kulik & Fletcher meta-analysis could have reported any effect size from +0.09 (n.s.) to +0.65 (p<.000).

The importance of these exclusions is not merely academic. Think how you’d explain the chart above to your sister the principal:

            Principal Sis: I’m thinking of using one of those intelligent tutoring programs to improve achievement in our math classes. What do you suggest?

            You:  Well, it all depends. I saw a review of this in the top journal in education research. It says that if you include very small studies, very brief studies, and studies in which the researchers made the measures, you could have an effect size of +0.65! That’s like seven additional months of learning!

            Principal Sis:  I like those numbers! But why would I care about small or brief studies, or measures made by researchers? I have 500 kids, we teach all year, and our kids have to pass tests that we don’t get to make up!

            You (sheepishly):  I guess you’re right, Sis. Well, if you just look at the studies with large numbers of students, which continued for more than 12 weeks, and which used independent measures, the effect size was only +0.09, and that wasn’t even statistically significant.

            Principal Sis:  Oh. In that case, what kinds of programs should we use?

From a practical standpoint, study features such as small samples or researcher-made measures add a lot to effect sizes while adding nothing to the value to students or schools of the programs or practices they want to know about. They just add a lot of bias. It’s like trying to convince someone that corn on the cob is a lot more valuable than corn off the cob, because you get so much more quantity (by weight or volume) for the same money with corn on the cob.

Most published meta-analyses only require that studies have control groups, and some do not even require that much. Few exclude researcher- or developer-made measures, or very small or brief studies. The result is that effect sizes in published meta-analyses are very often implausibly large.

Meta-analyses that include studies lacking control groups or studies with small samples, brief durations, pretest differences, or researcher-made measures report overall effect sizes that cannot be fairly compared to other meta-analyses that excluded such studies. If outcomes do not depend on the power of the particular program but rather on the number of potentially biasing features they did or did not exclude, then outcomes of meta-analyses are meaningless.

It is important to note that these two examples are not at all atypical. As we have begun to look systematically at published meta-analyses, most of them fail to exclude or control for key methodological factors known to contribute a great deal of bias. Something very serious has to be done to change this. Also, I’d remind readers that there are lots of programs that do meet strict standards and show positive effects based on reality, not on including biasing factors. At www.evidenceforessa.org, you can see more than 120 reading and math programs that meet selective standards for positive impacts. The problem is that in meta-analyses that include studies containing biasing factors, these truly effective programs are swamped by a blizzard of bias.

In my recent blog (here) I proposed a common set of methodological inclusion criteria that I would think most methodologists would agree to.  If these (or a similar consensus list) were consistently used, we could make more valid comparisons both within and between meta-analyses. But as long as inclusion criteria remain highly variable from meta-analysis to meta-analysis, then all we can do is pick out the few that do use selective standards, and ignore the rest. What a terrible waste.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283-292.

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research, 86(1), 42-78.

Slavin, R. E., & Madden, N. A. (2011). Measures inherent to treatments in program effectiveness reviews. Journal of Research on Educational Effectiveness, 4, 370–380.

Wolf, R., Morrison, J.M., Inns, A., Slavin, R. E., & Risman, K. (2020). Average effect sizes in developer-commissioned and independent evaluations. Journal of Research on Educational Effectiveness. DOI: 10.1080/19345747.2020.1726537

Photo credit: Deeper Learning 4 All, (CC BY-NC 4.0)

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

How to Make Evidence in Education Make a Difference

By Robert Slavin

I have a vision of how education in the U.S. and the world will begin to make solid, irreversible progress in student achievement. In this vision, school leaders will constantly be looking for the most effective programs, proven in rigorous research to accelerate student achievement. This process of informed selection will be aided by government, which will provide special incentive funds to help schools implement proven programs.

In this imagined future, the fact that schools are selecting programs based on good evidence means that publishers, software companies, professional development companies, researchers, and program developers, as well as government, will be engaged in a constant process of creating, evaluating, and disseminating new approaches to every subject and grade level. As in medicine, developers and researchers will be held to strict standards of evidence, but if they develop programs that meet these high standards, they can be confident that their programs will be widely adopted, and will truly make a difference in student learning.

Discovering and disseminating effective classroom programs is not all we have to get right in education. For example, we also need great teachers, principals, and other staff who are well prepared and effectively deployed. A focus on evidence could help at every step of that process, of course, but improving programs and improving staff are not an either-or proposition. We can and must do both. If medicine, for example, focused only on getting the best doctors, nurses, technicians, other staff, but medical research and dissemination of proven therapies were underfunded and little heeded, then we’d have great staff prescribing ineffective or possibly harmful medicines and procedures. In agriculture, we could try to attract farmers who are outstanding in their fields, but that would not have created the agricultural revolution that has largely solved the problem of hunger in most parts of the world. Instead, decades of research created or identified improvements in seeds, stock, fertilizers, veterinary practices, farming methods, and so on, for all of those outstanding farmers to put into practice.

Back to education, my vision of evidence-based reform depends on many actions. Because of the central role government plays in public education, government must take the lead. Some of this will cost money, but it would be a tiny proportion of the roughly $600 billion we spend on K-12 education annually, at all levels (federal, state, and local). Other actions would cost little or nothing, focusing only on standards for how existing funds are used. Key actions to establish evidence of impact as central to educational decisions are as follows:

1. Invest substantially in practical, replicable approaches to improving outcomes for students, especially achievement outcomes.

Rigorous, high-quality evidence of effectiveness for educational programs has been appearing since about 2006 at a faster rate than ever before, due in particular to investments by the Institute of Education Sciences (IES), Investing in Innovation/Education Innovation Research (i3/EIR), and the National Science Foundation (NSF) in the U.S., and the Education Endowment Foundation in England, but also other parts of government and private foundations. All have embraced rigorous evaluations involving random assignment to conditions, appropriate measures independent of developers or researchers, and at the higher funding levels, third-party evaluators. These are very important developments, and they have given the research field, educators, and policy makers excellent reasons for confidence that the findings of such research have direct meaning for practice. One problem is that, as is true in every applied field that embraces rigorous research, most experiments do not find positive impacts. Only about 20% of such experiments do find positive outcomes. The solution to this is to learn from successes and failures, so that our success rate improves over time. We also need to support a much larger enterprise of development of new solutions to enduring problems of education, in all subjects and grade levels, and to continue to support rigorous evaluations of the most promising of these innovations. In other words, we should not be daunted by the fact that most evaluations do not find positive impacts, but instead we need to increase the success rate by learning from our own evidence, and to carry out many more experiments. Even 20% of a very big number is a big number.

2. Improve communications of research findings to researchers, educators, policy makers, and the general public.

Evidence will not make a substantial difference in education until key stakeholders see it as a key to improving students’ success. Improving communications certainly includes making it easy for various audiences to find out which programs and practices are truly effective. But we also need to build excitement about evidence. To do this, government might establish large-scale, widely publicized, certain-to-work demonstrations of the use and outcomes of proven approaches, so that all will see how evidence can lead to meaningful change.

I will be writing more in depth on this topic in future blogs.

3. Set specific standards of evidence, and provide incentive funding for schools to adopt and implement proven practices.

The Every Student Succeeds Act (ESSA) boldly defined “strong,” “moderate,” “promising,” and lower levels of evidence of effectiveness for educational programs, and required use of programs meeting one of these top categories for certain federal funding, especially school improvement funding for low-achieving schools. This certainly increased educators’ interest in evidence, but it is unclear how much this changed practice or outcomes. These standards need to be made more specific. In addition, the standards need to be applied to funding that is clearly discretionary, to help schools adopt new programs, not to add new evidence requirements to traditional funding sources. The ESSA evidence standards have had less impact than hoped for because they mainly apply to school improvement, a longstanding source of federal funding. As a result, many districts and states have fought hard to have the programs they already have declared “effective,” regardless of their actual evidence base. To make evidence popular, it is important to make proven programs available as something extra, a gift to schools and children rather than a hurdle to continuing existing programs. In coming blogs I’ll write further about how government could greatly accelerate and intensify the process of development, evaluation, communication, and dissemination, so that the entire process can begin to make undeniable improvements in particular areas of critical importance, demonstrating how evidence can make a difference for students.

Photo credit: Deeper Learning 4 All/(CC BY-NC 4.0)

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.


How Can You Tell When The Findings of a Meta-Analysis Are Likely to Be Valid?

In Baltimore, Faidley’s, founded in 1886, is a much loved seafood market inside Lexington Market. Faidley’s used to be a real old-fashioned market, with sawdust on the floor and an oyster bar in the center. People lined up behind their favorite oyster shucker. In a longstanding tradition, the oyster shuckers picked oysters out of crushed ice and tapped them with their oyster knives. If they sounded full, they opened them. But if they did not, the shuckers discarded them.

I always noticed that the line was longer behind the shucker who was discarding the most oysters. Why? Because everyone knew that the shucker who was pickier was more likely to come up with a dozen fat, delicious oysters, instead of say, nine great ones and three…not so great.

I bring this up today to tell you how to pick full, fair meta-analyses on educational programs. No, you can’t tap them with an oyster knife, but otherwise, the process is similar. You want meta-analysts who are picky about what goes into their meta-analyses. Your goal is to make sure that a meta-analysis produces results that truly represent what teachers and schools are likely to see in practice when they thoughtfully implement an innovative program. If instead you pick the meta-analysis with the biggest effect sizes, you will always be disappointed.

As a special service to my readers, I’m going to let you in on a few trade secrets about how to quickly evaluate a meta-analysis in education.

One very easy way to evaluate a meta-analysis is to look at the overall effect size, probably shown in the abstract. If the overall mean effect size is more than about +0.40, you probably don’t have to read any further. Unless the treatment is tutoring or some other treatment that you would expect to make a massive difference in student achievement, it is rare to find a single legitimate study with an effect size that large, much less an average that large. A very large effect size is almost a guarantee that a meta-analysis is full of studies with design features that greatly inflate effect sizes, not studies with outstandingly effective treatments.

Next, go to the Methods section, which will have within it a section on inclusion (or selection) criteria. It should list the types of studies that were or were not accepted into the study. Some of the criteria will have to do with the focus of the meta-analysis, specifying, for example, “studies of science programs for students in grades 6 to 12.” But your focus is on the criteria that specify how picky the meta-analysis is. As one example of a picky set of criteria, here are the main ones we use in Evidence for ESSA and in every analysis we write:

  1. Studies had to use random assignment or matching to assign students to experimental or control groups, with schools and students in each specified in advance.
  2. Students assigned to the experimental group had to be compared to very similar students in a control group, which uses business-as-usual. The experimental and control students must be well matched, within a quarter standard deviation at pretest (ES=+0.25), and attrition (loss of subjects) must be no more than 15% higher in one group than the other at the end of the study. Why? It is essential that experimental and control groups start and remain the same in all ways other than the treatment. Controls for initial differences do not work well when the differences are large.
  3. There must be at least 30 experimental and 30 control students. Analyses of combined effect sizes must control for sample sizes. Why? Evidence finds substantial inflation of effect sizes in very small studies.
  4. The treatments must be provided for at least 12 weeks. Why? Evidence finds major inflation of effect sizes in very brief studies, and brief studies do not represent the reality of the classroom.
  5. Outcome measures must be measures independent of the program developers and researchers. Usually, this means using national tests of achievement, though not necessarily standardized tests. Why? Research has found that tests made by researchers can inflate effect sizes by double, or more, and researcher-made measures do not represent the reality of classroom assessment.

There may be other details, but these are the most important. Note that there is a double focus of these standards. Each is intended both to minimize bias and to maximize similarity to the conditions faced by schools. What principal or teacher who cares about evidence would be interested in adopting a program evaluated in comparison to a very different control group? Or in a study with few subjects, or a very brief duration? Or in a study that used measures made by the developers or researchers? This set is very similar to what the What Works Clearinghouse (WWC) requires, except #5 (the WWC requires exclusion of “overaligned” measures, but not developer-/researcher-made measures).

If these criteria are all there in the “Inclusion Standards,” chances are you are looking at a top-quality meta-analysis. As a rule, it will have average effect sizes lower than those you’ll see in reviews without some or all of these standards, but the effect sizes you see will probably be close to what you will actually get in student achievement gains if your school implements a given program with fidelity and thoughtfulness.

What I find astonishing is how many meta-analyses do not have standards this high. Among experts, these criteria are not controversial, except for the last one, which shouldn’t be. Yet meta-analyses are often written, and accepted by journals, with much lower standards, thereby producing greatly inflated, unrealistic effect sizes.

As one example, there was a meta-analysis of Direct Instruction programs in reading, mathematics, and language, published in the Review of Educational Research (Stockard et al., 2018). I have great respect for Direct Instruction, which has been doing good work for many years. But this meta-analysis was very disturbing.

The inclusion and exclusion criteria in this meta-analysis did not require experimental-control comparisons, did not require well-matched samples, and did not require any minimum sample size or duration. It was not clear how many of the outcome measures were made by program developers or researchers, rather than independent of the program.

With these minimal inclusion standards, and a very long time span (back to 1966), it is not surprising that the review found a great many qualifying studies. 528, to be exact. The review also reported extraordinary effect sizes: +0.51 for reading, +0.55 for math, and +0.54 for language. If these effects were all true and meaningful, it would mean that DI is much more effective than one-to-one tutoring, for example.

But don’t get your hopes up. The article included an online appendix that showed the sample sizes, study designs, and outcomes of every study.

First, the authors identified eight experimental designs (plus single-subject designs, which were treated separately). Only two of these would meet anyone’s modern standards of meta-analysis: randomized and matched. The others included pre-post gains (no control group), comparisons to test norms, and other pre-scientific designs.

Sample sizes were often extremely small. Leaving aside single-case experiments, there were dozens of single-digit sample sizes (e.g., six students), often with very large effect sizes. Further, there was no indication of study duration.

What is truly astonishing is that RER accepted this study. RER is the top-rated journal in all of education, based on its citation count. Yet this review, and the Kulik & Fletcher (2016) review I cited in a recent blog, clearly did not meet minimal standards for meta-analyses.

My colleagues and I will be working in the coming months to better understand what has gone wrong with meta-analysis in education, and to propose solutions. Of course, our first step will be to spend a lot of time at oyster bars studying how they set such high standards. Oysters and beer will definitely be involved!

Photo credit: Annette White / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)

References

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research, 86(1), 42-78.

Stockard, J., Wood, T. W., Coughlin, C., & Rasplica Khoury, C. (2018). The effectiveness of Direct Instruction curricula: A meta-analysis of a half century of research. Review of Educational Research, 88(4), 479–507. https://doi.org/10.3102/0034654317751919

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.


After the Pandemic: Can We Welcome Students Back to Better Schools?

I am writing in March, 2020, at what may be the scariest point in the COVID-19 pandemic in the U.S. We are just now beginning to understand the potential catastrophe, and also to begin taking actions most likely to reduce the incidence of the disease.

One of the most important preventive measures is school closure. At this writing, thirty entire states have closed their schools, as have many individual districts, including Los Angeles. It is clear that school closures will go far beyond this, both in the U.S. and elsewhere.

I am not an expert on epidemiology, but I did want to make some observations about how widespread school closure could affect education, and (ever the optimist) how this disaster could provide a basis for major improvements in the long run.

Right now, schools are closing for a few weeks, with an expectation that after spring break, all will be well again, and schools might re-open. From what I read, this is unlikely. The virus will continue to spread until it runs out of vulnerable people. The purpose of school closures is to reduce the rate of transmission. Children themselves tend not to get the disease, for some reason, but they do transmit the disease, mostly at school (and then to adults). Only when there are few new cases to transmit can schools be responsibly re-opened. No one knows for sure, but a recent article in Education Week predicted that schools will probably not re-open this school year (Will, 2020). Kansas is the first state to announce that schools will be closed for the rest of the school year, but others will surely follow.

Will students suffer from school closure? There will be lasting damage if students lose parents, grandparents, and other relatives, of course. Their achievement may take a dip, but a remarkable study reported by Ceci (1991) examined the impact of two or more years of school closures in the Netherlands in World War II, and found an initial loss in IQ scores that quickly rebounded after schools re-opened after the war. From an educational perspective, the long-term impact of closure itself may not be so bad. A colleague, Nancy Karweit (1989), studied achievement in districts with long teacher strikes, and did not find much of a lasting impact.

In fact, there is a way in which wise state and local governments might use an opportunity presented by school closures. If schools closing now stay closed through the end of the school year, that could leave large numbers of teachers and administrators with not much to do (assuming they are not furloughed, which could happen). Imagine that, where feasible, this time were used for school leaders to consider how they could welcome students back to much improved schools, and to provide teachers with (electronic) professional development to implement proven programs. This might involve local, regional, or national conversations focused on what strategies are known to be effective for each of the key objectives of schooling. For example, a national series of conversations could take place on proven strategies for beginning reading, for middle school mathematics, for high school science, and so on. By design, the conversations would be focused not just on opinions, but on rigorous evidence of what works. A focus on improving health and disease prevention would be particularly relevant to the current crisis, along with implementing proven academic solutions.

Particular districts might decide to implement proven programs, and then use school closure to provide time for high-quality professional development on instructional strategies that meet the ESSA evidence standards.

Of course, all of the discussion and professional development would have to be done using electronic communications, for obvious reasons of public health. But might it be possible to make wise use of school closure to improve the outcomes of schooling using professional development in proven strategies? With rapid rollout of existing proven programs and dedicated funding, it certainly seems possible.

States and districts are making a wide variety of decisions about what to do during the time that schools are closed. Many are moving to e-learning, but this may be of little help in areas where many students lack computers or access to the internet at home. In some places, a focus on professional development for next school year may be the best way to make the best of a difficult situation.

There have been many times in the past when disasters have led to lasting improvements in health and education. This could be one of these opportunities, if we seize the moment.

Photo credit: Liam Griesacker

References

Ceci, S. J. (1991). How much does schooling influence general intelligence and its cognitive components? A reassessment of the evidence. Developmental Psychology, 27(5), 703–722. https://doi.org/10.1037/0012-1649.27.5.703

Karweit, N. (1989). Time and learning: A review. In R. E. Slavin (Ed.), School and Classroom Organization. Hillsdale, NJ: Erlbaum.

Will, M. (2020, March 15). School closure for the coronavirus could extend to the end of the school year, some say. Education Week.

 This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Cooperative Learning and Achievement

Once upon a time, two teachers went together to an evening workshop on effective teaching strategies. The speaker was dynamic, her ideas were interesting, and everyone in the large audience enjoyed the speech. Afterwards, the two teachers drove back to the town where they lived. The driver talked excitedly with her friend about all the wonderful ideas they’d heard, raised questions about how to put them into practice, and related them to things she’d read, heard, and experienced before.

After an hour’s drive, however, the driver realized that her friend had been asleep for the whole return trip.

Now here’s my question: who learned the most from the speech? Both the driver and her friend were equally excited by the speech and paid equal attention to it. Yet no one would doubt that the driver learned much more, because after the lecture, she talked all about it, thinking her friend was awake.

Every teacher knows how much they learn about any topic by teaching it, or discussing it with others. Imagine how much more the driver and her friend would have learned from the lecture if they had both been participating fully, sharing ideas, perceptions, agreements, disagreements, and new ideas.

So far, this is all obvious, right? Everyone knows that people learn when they are engaged, when they have opportunities to discuss with others, explain to others, ask questions of others, and receive explanations.

Yet in traditionally organized classes, learning does not often happen like this. Teachers teach, students listen, and if genuine discussion takes place at all, it is between the teacher and a small minority of students who always raise their hands and ask good questions. Even in the most exciting and interactive of classes, many students, often a majority, say little or nothing. They may give an answer if called upon, but “giving an answer” is not at all the same as engagement. Even in classes that are organized in groups and encourage group interaction, some students do most of the participating, while others just watch, at best. Research, especially studies by Noreen Webb (2008), finds that the students who learn the most in group settings are those who give full explanations to others. These are the drivers, returning to my opening story. Those who receive a lot of explanations also learn. Who learns least? Those who neither explain nor receive explanations.

For achievement outcomes, it is not enough to put students into groups and let them talk. Research finds that cooperative learning works best when there are group goals and individual accountability. That is, groups can earn recognition or small privileges (e.g., lining up first for recess) if the average of the team members’ individual scores meets a high standard. The purpose of group goals and individual accountability is to incentivize team members to help and encourage each other to excel, and to avoid having, for example, one student do all the work while the others watch (Chapman, 2001). Students can be silent in groups, as they can be in class, but this is less likely if they are working with others toward a common goal that they can achieve only if all team members succeed.
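To make the idea of a group goal with individual accountability concrete, here is a minimal sketch in Python. It is only an illustration: the function name, the sample scores, and the standard of 80 are hypothetical and are not drawn from any particular program. The point is that the team's reward depends on the average of individual scores, so one strong student cannot carry the team alone.

```python
from statistics import mean

def team_earns_recognition(individual_scores, standard=80.0):
    """Return True if the average of the members' individual scores meets the standard.

    The standard of 80 is a hypothetical threshold chosen for illustration.
    """
    if not individual_scores:
        raise ValueError("a team needs at least one member score")
    return mean(individual_scores) >= standard

# Hypothetical teams: in the first, everyone has learned the material;
# in the second, one student did very well while teammates fell behind.
balanced_team = [85, 82, 78, 90]    # average 83.75
lopsided_team = [100, 70, 65, 60]   # average 73.75

print(team_earns_recognition(balanced_team))
print(team_earns_recognition(lopsided_team))
```

Because every member's score counts equally toward the average, the only reliable way for a team to meet the standard is for teammates to make sure that everyone has learned the material.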


The effectiveness of cooperative learning for enhancing achievement has been known for a long time (see Rohrbeck et al., 2003; Roseth et al., 2008; Slavin, 1995, 2014). Forms of cooperative learning are frequently seen in elementary and secondary schools, but they are far from standard practice. Forms of cooperative learning that use group goals and individual accountability are even more rare.

There are many examples of programs that incorporate cooperative learning and meet the ESSA Strong or Moderate standards in reading, math, SEL, and attendance. You can see descriptions of the programs by visiting www.evidenceforessa.org and clicking on the cooperative learning filter. As you can see, it is remarkable how many of the programs identified as effective for improving student achievement by the What Works Clearinghouse or Evidence for ESSA make use of well-structured cooperative learning, usually with students working in teams or groups of 4-5 students of mixed past performance. In fact, in reading and mathematics, only one-to-one or small-group tutoring is more effective than approaches that make extensive use of cooperative learning.

There are many successful approaches to cooperative learning adapted for different subjects, specific objectives, and age levels (see Slavin, 1995). There is no magic to cooperative learning; outcomes depend on use of proven strategies and high-quality implementation. The successful forms of cooperative learning provide at least a good start for educators seeking ways to make school engaging, exciting, social, and effective for learning. Students not only learn from cooperation in small groups, but they love to do so. They are typically eager to work with their classmates. Why shouldn’t we routinely give them this opportunity?

References

Chapman, E. (2001, April). More on moderations in cooperative learning outcomes. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.

Rohrbeck, C. A., Ginsburg-Block, M. D., Fantuzzo, J. W., & Miller, T. R. (2003). Peer-assisted learning interventions with elementary school students: A meta-analytic review. Journal of Educational Psychology, 95(2), 240–257.

Roseth, C., Johnson, D., & Johnson, R. (2008). Promoting early adolescents’ achievement and peer relationships: The effects of cooperative, competitive, and individualistic goal structures. Psychological Bulletin, 134(2), 223–246.

Slavin, R. E. (1995). Cooperative learning: Theory, research, and practice (2nd ed.). Boston, MA: Allyn & Bacon.

Slavin, R. E. (2014). Make cooperative learning powerful: Five essential strategies to make cooperative learning effective. Educational Leadership, 72(2), 22–26.

Webb, N. M. (2008). Learning in small groups. In T. L. Good (Ed.), 21st century learning (Vol. 1, pp. 203–211). Thousand Oaks, CA: Sage.

Photo courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action.


New Sections on Social Emotional Learning and Attendance in Evidence for ESSA!

We are proud to announce the launch of two new sections of our Evidence for ESSA website (www.evidenceforessa.org): K-12 social-emotional learning and attendance. Funded by a grant from the Bill and Melinda Gates Foundation, the new sections represent our first foray beyond academic achievement.


The social-emotional learning section represents the greatest departure from our prior work. This is due to the nature of SEL, which combines many quite diverse measures. We identified 17 distinct measures, which we grouped in four overarching categories, as follows:

Academic Competence

  • Academic performance
  • Academic engagement

Problem Behaviors

  • Aggression/misconduct
  • Bullying
  • Disruptive behavior
  • Drug/alcohol abuse
  • Sexual/racial harassment or aggression
  • Early/risky sexual behavior

Social Relationships

  • Empathy
  • Interpersonal relationships
  • Pro-social behavior
  • Social skills
  • School climate

Emotional Well-Being

  • Reduction of anxiety/depression
  • Coping skills/stress management
  • Emotional regulation
  • Self-esteem/self-efficacy

Evidence for ESSA reports overall effect sizes and ratings for each of the four categories, as well as the 17 individual measures (which are themselves composed of many measures used by various qualifying studies). So in contrast to reading and math, where programs are rated based on the average of all qualifying reading or math measures, an SEL program could be rated “strong” in one category, “promising” in another, and “no qualifying evidence” or “qualifying studies found no significant positive effects” on others.

Social-Emotional Learning

The SEL review, led by Sooyeon Byun, Amanda Inns, Cynthia Lake, and Liz Kim at Johns Hopkins University, located 24 SEL programs that both met our inclusion standards and had at least one study that met strong, moderate, or promising standards on at least one of the four categories of outcomes.

There is much more evidence at the elementary and middle school levels than at the high school level. Recognizing that some programs had qualifying outcomes at multiple levels, there were 7 programs with positive evidence for pre-K/K, 10 for 1-2, 13 for 3-6, and 9 for middle school. In contrast, there were only 4 programs with positive effects in senior high schools. Fourteen studies took place in urban locations, 5 in suburbs, and 5 in rural districts.

The outcome variables most often showing positive impacts include social skills (12), school climate (10), academic performance (10), pro-social behavior (8), aggression/misconduct (7), disruptive behavior (7), academic engagement (7), interpersonal relationships (7), anxiety/depression (6), bullying (6), and empathy (5). Fifteen of the programs targeted whole classes or schools, and 9 targeted individual students.

Several programs stood out in terms of the size of the impacts. Take the Lead found effect sizes of +0.88 for social relationships and +0.51 for problem behaviors. Check, Connect, and Expect found effect sizes of +0.51 for emotional well-being, +0.29 for problem behaviors, and +0.28 for academic competence. I Can Problem Solve found an effect size of +0.57 on school climate. The Incredible Years Classroom and Parent Training Approach reported effect sizes of +0.57 for emotional regulation, +0.35 for pro-social behavior, and +0.21 for aggression/misconduct. The related Dinosaur School classroom management model reported an effect size of +0.31 for aggression/misconduct. Class-Wide Function-Related Intervention Teams (CW-FIT), an intervention for elementary students with emotional and behavioral disorders, had effect sizes of +0.47 and +0.30 across two studies for academic engagement and +0.38 and +0.21 for disruptive behavior. It also reported effect sizes of +0.37 for interpersonal relationships, +0.28 for social skills, and +0.26 for empathy. Student Success Skills reported effect sizes of +0.30 for problem behaviors, +0.23 for academic competence, and +0.16 for social relationships.

In addition to the 24 highlighted programs, Evidence for ESSA lists 145 programs that were no longer available, had no qualifying studies (e.g., no control group), or had one or more qualifying studies but none that met the ESSA Strong, Moderate, or Promising criteria. These programs can be found by clicking on the “search” bar.

There are many problems inherent to interpreting research on social-emotional skills. One is that some programs may appear more effective than others because they use measures such as self-report, or behavior ratings by the teachers who taught the program. In contrast, studies that used more objective measures, such as independent observations or routinely collected data, may obtain smaller impacts. Also, SEL studies typically measure many outcomes and only a few may have positive impacts.

In the coming months, we will be doing analyses and looking for patterns in the data, and will have more to say about overall generalizations. For now, the new SEL section provides a guide to what we know now about individual programs, but there is much more to learn about this important topic.

Attendance

Our attendance review was led by Chenchen Shi, Cynthia Lake, and Amanda Inns. It located ten attendance programs that met our standards. Only three of these reported on chronic absenteeism, which refers to students missing more than 10% of days. Many more focused on average daily attendance (ADA). Among programs focused on average daily attendance, a Milwaukee elementary school program called SPARK had the largest impact (ES=+0.25). This is not an attendance program per se, but it uses AmeriCorps members to provide tutoring services across the school, as well as involving families. SPARK has been shown to have strong effects on reading, as well as its impressive effects on attendance. Positive Action is another schoolwide approach, in this case focused on SEL. It has been found in two major studies in grades K-8 to improve student reading and math achievement, as well as overall attendance, with a mean effect size of +0.20.
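For readers who want to see how the two attendance measures differ, here is a minimal Python sketch. It assumes the definition given above (chronic absenteeism means missing more than 10% of enrolled days) and, as a simplification, treats average daily attendance as the share of enrolled days that students actually attended. The function names and the three-student roster are hypothetical.

```python
def is_chronically_absent(days_present, days_enrolled, threshold=0.10):
    """True if the student missed more than `threshold` of enrolled days."""
    days_missed = days_enrolled - days_present
    return days_missed / days_enrolled > threshold

def average_daily_attendance(roster):
    """Share of all enrolled student-days that were attended (a simplified ADA rate)."""
    total_present = sum(present for present, _ in roster)
    total_enrolled = sum(enrolled for _, enrolled in roster)
    return total_present / total_enrolled

def chronic_absenteeism_rate(roster):
    """Share of students on the roster who are chronically absent."""
    flagged = sum(is_chronically_absent(p, e) for p, e in roster)
    return flagged / len(roster)

# Hypothetical roster: (days present, days enrolled) for three students.
roster = [(170, 180), (150, 180), (178, 180)]

print(f"ADA rate: {average_daily_attendance(roster):.3f}")
print(f"Chronic absenteeism rate: {chronic_absenteeism_rate(roster):.3f}")
```

The sketch shows why the two measures can tell different stories: a school can have a high overall attendance rate while a meaningful share of its students still miss enough days to count as chronically absent.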

The one program to report data on both ADA and chronic absenteeism is called Attendance and Truancy Intervention and Universal Procedures, or ATI-UP. It reported an effect size in grades K-6 of +0.19 for ADA and +0.08 for chronic absenteeism. Talent Development High School (TDHS) is a ninth-grade intervention program that provides interdisciplinary learning communities and “double dose” English and math classes for students who need them. TDHS reported an effect size of +0.17.

An interesting approach with a modest effect size but a very low cost is now called EveryDay Labs (formerly InClass Today). This program helps schools organize and implement a system to send postcards to parents reminding them of the importance of student attendance. If students start missing school, the postcards include this information as well. The effect size across two studies was a respectable +0.16.

As with SEL, we will be doing further work to draw broader lessons from research on attendance in the coming months. One pattern that seems clear already is that effective attendance improvement models work on building close relationships between at-risk students and concerned adults. None of the effective programs primarily uses punishment to improve attendance, but instead they focus on providing information to parents and students and on making it clear to students that they are welcome in school and missed when they are gone.

Both SEL and attendance are topics of much discussion right now, and we hope these new sections will be useful and timely in helping schools make informed choices about how to improve social-emotional and attendance outcomes for all students.


Do School Districts Really Have Difficulty Meeting ESSA Evidence Standards?

The Center on Education Policy (CEP) recently released a report on how school districts are responding to the Every Student Succeeds Act (ESSA) requirement that schools seeking school improvement grants select programs that meet ESSA’s strong, moderate, or promising standards of evidence. Education Week ran a story on the CEP report.

The report noted that many states, districts, and schools are taking the evidence requirements seriously, and are looking at websites and consulting with researchers to help them identify programs that meet the standards. This is all to the good.

However, the report also notes continuing problems districts and schools are having finding out “what works.” Two particular problems were cited. One was that districts and schools were not equipped to review research to find out what works. The other was that rural districts and schools found few programs proven effective in rural schools.

I find these concerns astounding. The same concerns were expressed when ESSA was first passed, in 2015. But that was almost four years ago. Since 2015, the What Works Clearinghouse has added information to help schools identify programs that meet the top two ESSA evidence categories, strong and moderate. Our own Evidence for ESSA, launched in February, 2017, has up-to-date information on virtually all PK-12 reading and math programs currently in dissemination. Among hundreds of programs examined, 113 meet ESSA standards for strong, moderate, or promising evidence of effectiveness. WWC, Evidence for ESSA, and other sources are available online at no cost. The contents of the entire Evidence for ESSA website were imported into Ohio’s own website on this topic, and dozens of states, perhaps all of them, have informed their districts and schools about these sources.

The idea that districts and schools could not find information on proven programs if they wanted to do so is difficult to believe, especially among schools eligible for school improvement grants. Such schools, and the districts in which they are located, write a lot of grant proposals for federal and state funding. The application forms for school improvement grants always explain the evidence requirements, because that is the law. Someone in every state involved with federal funding knows about the WWC and Evidence for ESSA websites. More than 90,000 unique users have used Evidence for ESSA, and more than 800 more sign on each week.


As to rural schools, it is true that many studies of educational programs have taken place in urban areas. However, 47 of the 113 programs qualified by Evidence for ESSA were validated in at least one rural study, or a study including a large enough rural sample to enable researchers to separately report program impacts for rural students. Also, almost all widely disseminated programs have been used in many rural schools. So rural districts and schools that care about evidence can find programs that have been evaluated in rural locations, or at least that were evaluated in urban or suburban schools but widely disseminated in rural schools.

Also, it is important to note that if a program was successfully evaluated only in urban or suburban schools, the program still meets the ESSA evidence standards. If no studies of a given outcome were done in rural locations, a rural school in need of better outcomes would, in effect, be choosing between a program proven to work somewhere and probably already disseminated in rural schools, and a program not proven to work anywhere. Every school and district has to make the best choices for their kids, but if I were a rural superintendent or principal, I’d read up on proven programs, and then go visit some nearby rural schools using those programs. Wouldn’t you?

I have no reason to suspect that the CEP survey is incorrect. There are many indications that district and school leaders often do feel that the ESSA evidence rules are too difficult to meet. So what is really going on?

My guess is that there are many district and school leaders who do not want to know about evidence on proven programs. For example, they may have longstanding, positive relationships with representatives of publishers or software developers, or they may be comfortable and happy with the materials and services they are already using, evidence-proven or not. If they do not have evidence of effectiveness that would pass muster with WWC or Evidence for ESSA, the publishers and software developers may push hard on state and district officials, put forward dubious claims for evidence (such as studies with no control groups), and do their best to get by in a system that increasingly demands evidence that they lack. In my experience, district and state officials often complain about having inadequate staff to review evidence of effectiveness, but their concern may be less about finding out what works than about defending themselves from publishers, software developers, or current district or school users of programs, who maintain that they have been unfairly rated by WWC, Evidence for ESSA, or other reviews. State and district leaders who stand up to this pressure may have to spend a lot of time reviewing evidence or hearing arguments.

On the plus side, at the same time that publishers and software producers may be seeking recognition for their current products, many are also sponsoring evaluations of some of their products that they feel are most likely to perform well in rigorous evaluations. Some may be creating new programs that resemble programs that have met evidence standards. If the federal ESSA law continues to demand evidence for certain federal funding purposes, or even to expand this requirement to additional parts of federal grant-making, then over time the ESSA law will have its desired effect, rewarding the creation and evaluation of programs that do meet standards by making it easier to disseminate such programs. The difficulties the evidence movement is experiencing are likely to diminish over time as more proven programs appear, and as federal, state, district, and school leaders get comfortable with evidence.

Evidence-based reform was always going to be difficult, because of the amount of change it entails and the stakes involved. But sooner or later, it is the right thing to do, and leaders who insist on evidence will see increasing levels of learning among their students, at minimal cost beyond what they already spend on untested or ineffective approaches. Medicine went through a similar transition in 1962, when the U.S. Congress first required that medicines be rigorously evaluated for effectiveness and safety. At first, many leaders in the medical profession resisted the changes, but after a while, they came to insist on them. The key is political leadership willing to support the evidence requirement strongly and permanently, so that educators and vendors alike will see that the best way forward is to embrace evidence and make it work for kids.

Photo courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action


Send Us Your Evaluations!

In last week’s blog, I wrote about reasons that many educational leaders are wary of the ESSA evidence standards, and the evidence-based reform movement more broadly. Chief among these concerns was a complaint that few educational leaders had the training in education research methods to evaluate the validity of educational evaluations. My response to this was to note that it should not be necessary for educational leaders to read and assess individual evaluations of educational programs, because free, easy-to-interpret review websites, such as the What Works Clearinghouse and Evidence for ESSA, already do such reviews. Our Evidence for ESSA website (www.evidenceforessa.org) lists reading and math programs available for use anywhere in the U.S., and we are constantly on the lookout for any we might have missed. If we have done our job well, you should be able to evaluate the evidence base for any program, in perhaps five minutes.

Other evidence-based fields rely on evidence reviews. Why not education? Your physician may or may not know about medical research, but most rely on websites that summarize the evidence. Farmers may be outstanding in their fields, but they rely on evidence summaries. When you want to know about the safety and reliability of cars you might buy, you consult Consumer Reports. Do you understand exactly how they get their ratings? Neither do I, but I trust their expertise. Why should this not be the same for educational programs?

At Evidence for ESSA, we are aiming to provide information on every program available to you, if you are a school or district leader. At the moment, we cover reading and mathematics, grades pre-k to 12. We want to be sure that if a sales rep or other disseminator offers you a program, you can look it up on Evidence for ESSA and it will be there. If there are no studies of the program that meet our standards, we will say so. If there are qualifying studies that either do or do not have evidence of positive outcomes that meet ESSA evidence standards, we will say so. On our website, there is a white box on the homepage. If you type in the name of any reading or math program, the website should show you what we have been able to find out.

What we do not want to happen is that you type in a program title and find nothing. In our website, “nothing” has no useful meaning. We have worked hard to find every program anyone has heard of, and we have found hundreds. But if you know of any reading or math program that does not appear when you type in its name, please tell us. If you have studies of that program that might meet our inclusion criteria, please send them to us, or citations to them. We know that there are always additional programs entering use, and additional research on existing programs.

Why is this so important to us? The answer is simple: Evidence for ESSA exists because we believe it is essential for the progress of evidence-based reform for educators and policy makers to be confident that they can easily find the evidence on any program, not just the most widely used. Our vision is that someday, it will be routine for educators thinking of adopting educational programs to quickly consult Evidence for ESSA (or other reviews) to find out what has been proven to work, and what has not. I heard about a superintendent who, before meeting with any sales rep, asked them to show her the evidence for the effectiveness of their program on Evidence for ESSA or the What Works Clearinghouse. If they had it, “Come on in,” she’d say. If not, “Maybe later.”

Only when most superintendents and other school officials do this will program publishers and other providers know that it is worth their while to have high-quality evaluations done of each of their programs. Further, they will find it worthwhile to invest in the development of programs likely to work in rigorous evaluations, to provide enough quality professional development to give their programs a chance to succeed, and to insist that schools that adopt their proven programs incorporate the methods, materials, and professional development that their own research has told them are needed for success. Insisting on high-quality PD, for example, adds cost to a program, and providers may worry that demanding sufficient PD will price them out of the market. But if all programs are judged on their proven outcomes, they all will require adequate PD, to be sure that the programs will work when evaluated. That is how evidence will transform educational practice and outcomes.

So our attempt to find and fairly evaluate every program in existence is not due to our being nerds or obsessive-compulsive neurotics (though these may be true, too). Rather, thorough, rigorous review of the whole body of evidence in every subject and grade level, and for attendance, social-emotional learning, and other non-academic outcomes, is part of a plan.

You can help us on this part of our plan. Tell us about anything we have missed, or any mistakes we have made. You will be making an important contribution to the progress of our profession, and to the success of all children.

Send us your evaluations!

Photo credit: George Grantham Bain Collection, Library of Congress [Public domain]


Why Do Some Educators Push Back Against Evidence?

In December, 2015, the U.S. Congress passed the Every Student Succeeds Act, or ESSA. Among many other provisions, ESSA defined levels of evidence supporting educational programs: Strong (at least one randomized experiment with positive outcomes), moderate (at least one quasi-experimental study with positive outcomes), and promising (at least one correlational study with positive outcomes). For various forms of federal funding, schools are required (in school improvement) or encouraged (in seven other funding streams) to use programs falling into one of these top three categories. There is also a fourth category, “demonstrates a rationale,” but this one has few practical consequences.
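The three top tiers described above amount to a simple lookup from study design to evidence level. The following Python sketch illustrates that mapping under stated simplifications: it uses only the one-sentence definitions given in this blog, ignores the additional requirements that review sites apply to individual studies, and returns nothing for programs that have no study with positive outcomes. All names in it are hypothetical.

```python
from enum import Enum

class Design(Enum):
    RANDOMIZED = "randomized experiment"
    QUASI_EXPERIMENTAL = "quasi-experimental study"
    CORRELATIONAL = "correlational study"

# Mapping taken from the tier definitions in the paragraph above.
TIER_BY_DESIGN = {
    Design.RANDOMIZED: "strong",
    Design.QUASI_EXPERIMENTAL: "moderate",
    Design.CORRELATIONAL: "promising",
}
TIER_ORDER = ["strong", "moderate", "promising"]

def essa_tier(studies):
    """Return the highest tier earned by any study with positive outcomes.

    `studies` is an iterable of (Design, has_positive_outcomes) pairs.
    Returns None if no study qualifies (a simplification for illustration).
    """
    earned = {TIER_BY_DESIGN[design] for design, positive in studies if positive}
    for tier in TIER_ORDER:
        if tier in earned:
            return tier
    return None

# Hypothetical program with one null randomized study and one positive
# quasi-experimental study.
print(essa_tier([(Design.RANDOMIZED, False), (Design.QUASI_EXPERIMENTAL, True)]))
```

In this simplified reading, a single positive study of the right design is enough to place a program in a tier, which is why the quality of each qualifying study matters so much.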

Three and a half years later, the ESSA evidence standards are increasing interest in evidence of effectiveness for educational programs, especially among schools applying for school improvement funding and in state departments of education, which are responsible for managing the school improvement grant process. All of this is to the good, in my view.

On the other hand, evidence is not yet transforming educational practice. Even in portions of ESSA that encourage or require use of proven programs among schools seeking federal funding, schools and districts often try to find ways around the evidence requirements rather than truly embracing them. Even when schools do say they used evidence in their proposals, they may have just accepted assurances from publishers or developers stating that their programs meet ESSA standards, even when this is clearly not so.

Why are these children in India pushing back on a car? And why do many educators in our country push back on evidence?

Educators care a great deal about their children’s achievement, and they work hard to ensure their success. Implementing proven, effective programs does not guarantee success, but it greatly increases the chances. So why has evidence of effectiveness played such a limited role in program selection and implementation, even when ESSA, the national education law, defines evidence and requires use of proven programs under certain circumstances?

The Center on Education Policy Report

Not long ago, the Center on Education Policy (CEP) at George Washington University published a report of telephone interviews of state leaders in seven states. The interviews focused on problems states and districts were having with implementation of the ESSA evidence standards. Six themes emerged:

  1. Educational leaders are not comfortable with educational research methods.
  2. State leaders feel overwhelmed serving large numbers of schools qualifying for school improvement.
  3. Districts have to seriously re-evaluate longstanding relationships with vendors of education products.
  4. State and district staff are confused about the prohibition on using Title I school improvement funds on “Tier 4” programs (ones that demonstrate a rationale, but have not been successfully evaluated in a rigorous study).
  5. Some state officials complained that the U.S. Department of Education had not been sufficiently helpful with implementation of ESSA evidence standards.
  6. State leaders had suggestions to make education research more accessible to educators.

What is the Reality?

I’m sure that the concerns expressed by the state and district leaders in the CEP report are sincerely felt. But most of them raise issues that have already been solved at the federal, state, and/or district levels. If these concerns are as widespread as they appear to be, then we have serious problems of communication.

  1. The first theme in the CEP report is one I hear all the time. I find it astonishing, in light of the reality.

No educator needs to be a research expert to find evidence of effectiveness for educational programs. The federal What Works Clearinghouse (https://ies.ed.gov/ncee/wwc/) and our Evidence for ESSA (www.evidenceforessa.org) provide free information on the outcomes of programs, at least in reading and mathematics, that is easy to understand and interpret. Evidence for ESSA provides information on programs that do meet ESSA standards as well as those that do not. We are constantly scouring the literature for studies of replicable programs, and when asked, we review entire state and district lists of adopted programs and textbooks, at no cost. The What Works Clearinghouse is not as up-to-date and has little information on programs lacking positive findings, but it also provides easily interpreted information on what works in education.

In fact, few educational leaders anywhere are evaluating the effectiveness of individual programs by reading research reports one at a time. The What Works Clearinghouse and Evidence for ESSA employ experts who know how to find and evaluate outcomes of valid research and to describe the findings clearly. Why would every state and district re-do this job for themselves? It would be like having every state do its own version of Consumer Reports, or its own reviews of medical treatments. It just makes no sense. We know that more than 80,000 unique readers have used Evidence for ESSA since it launched in 2017. I’m sure even larger numbers have used the What Works Clearinghouse and other reviews. The State of Ohio took our entire Evidence for ESSA website and put it on its own state servers with some other information. Several other states have strongly promoted the site. The bottom line is that educational leaders do not have to be research mavens to know what works, and tens of thousands of them know where to find fair and useful information.

  2. State leaders are overwhelmed. I’m sure this is true, but most state departments of education have long been understaffed. This problem is not unique to ESSA.
  3. Districts have to seriously re-evaluate longstanding relationships with vendors. I suspect that this concern is at the core of the problem on evidence. The fact is that most commercial programs do not have adequate evidence of effectiveness. Either they have no qualifying studies (by far the largest number), or they do have qualifying evidence that is not significantly positive. A vendor with programs that do not meet ESSA standards is not going to be a big fan of evidence, or ESSA. These are often powerful organizations with deep personal relationships with state and district leaders. When state officials adhere to a strict definition of evidence, defined in ESSA, local vendors push back hard. Understaffed state departments are poorly placed to fight with vendors and their friends in district offices, so they may be forced to accept weak or no evidence.
  4. Confusions about Tier 4 evidence. ESSA is clear that to receive certain federal funds schools must use programs with evidence in Tiers 1, 2, or 3, but not 4. The reality is that definitions of Tier 4 are so weak that any program on Earth can meet this standard. What program anywhere does not have a rationale? The problem is that districts, states, and vendors have used confusion about Tier 4 to justify any program they wish. Some states are more sophisticated than others and do not allow this, but the very existence of Tier 4 in ESSA language creates a loophole that any clever sales rep or educator can use, or at least try to get away with.
  5. The U.S. Department of Education is not helpful enough. In reality, USDoE is understaffed and overwhelmed on many fronts. In any case, ESSA puts a lot of emphasis on state autonomy, so the feds feel unwelcome in performing oversight.

The Future of Evidence in Education

Despite the serious problems in implementation of ESSA, I still think it is a giant step forward. Every successful field, such as medicine, agriculture, and technology, has started its own evidence revolution fighting entrenched interests and anxious stakeholders. As late as the 1920s, surgeons refused to wash their hands before operations, despite substantial evidence going back to the 1800s that handwashing was essential. Evidence eventually triumphs, though it often takes many years. Education is just at the beginning of its evidence revolution, and it will take many years to prevail. But I am unaware of any field that embraced evidence, only to retreat in the face of opposition. Evidence eventually prevails because it is focused on improving outcomes for people, and people vote. Sooner or later, evidence will transform the practice of education, as it has in so many other fields.

Photo credit: Roger Price from Hong Kong, Hong Kong [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)]

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Measuring Social Emotional Skills in Schools: Return of the MOOSES

Throughout the U.S., there is huge interest in improving students’ social emotional skills and related behaviors. This is indeed important as a means of building tomorrow’s society. However, measuring SEL skills is terribly difficult. Not that measuring reading, math, or science learning is easy, but there are at least accepted measures in those areas. In SEL, almost anything goes, and measures cover an enormous range. Some measures might be fine for theoretical research and some would be all right if they were given independently of the teachers who administered the treatment, but SEL measures are inherently squishy.

A few months ago, I wrote a blog on measurement of social emotional skills. In it, I argued that social emotional skills should be measured in pragmatic school research as objectively as possible, especially to avoid measures that merely reflect having students in experimental groups repeating back attitudes or terminology they learned in the program. I expressed the ideal for social emotional measurement in school experiments as MOOSES: Measurable, Observable, Objective, Social Emotional Skills.

Since that time, our group at Johns Hopkins University has received a generous grant from the Gates Foundation to add research on social emotional skills and attendance to our Evidence for ESSA website. This has enabled our group to dig a lot deeper into measures for social emotional learning. In particular, JHU graduate student Sooyeon Byun created a typology of SEL measures arrayed from least to most MOOSE-like. This is as follows.

  1. Cognitive Skills or Low-Level SEL Skills.

Examples include executive functioning tasks such as pencil tapping, the Stroop test, and other measures of cognitive regulation, as well as recognition of emotions. These skills may be of importance as part of theories of action leading to social emotional skills of importance to schools, but they are not goals of obvious importance to educators in themselves.

  2. Attitudes toward SEL (non-behavioral).

These include agreement with statements such as “bullying is wrong,” and statements about why other students engage in certain behaviors (e.g., “He spilled the milk because he was mean.”).

  3. Intention for SEL behaviors (quasi-behavioral).

Scenario-based measures (e.g., what would you do in this situation?).

  4. SEL behaviors based on self-report (semi-behavioral).

Reports of actual behaviors of self, or observations of others, often with frequencies (e.g., “How often have you seen bullying in this school during this school year?” or “How often do you feel anxious or afraid in class in this school?”).

This category was divided according to who is reporting:

4a. Interested party (e.g., report by teachers or parents who implemented the program and may have reason to want to give a positive report)

4b. Disinterested party (e.g., report by students or by teachers or parents who did not administer the treatment)

  5. MOOSES (Measurable, Observable, Objective Social Emotional Skills)
  • Behaviors observed by independent observers, either researchers, ideally unaware of treatment assignment, or by school officials reporting on behaviors as they always would, not as part of a study (e.g., regular reports of office referrals for various infractions, suspensions, or expulsions).
  • Standardized tests
  • Other school records


Uses for MOOSES

All other things being equal, school researchers and educators should want to know about measures as high as possible on the MOOSES scale. However, all things are never equal, and in practice, some measures lower on the MOOSES scale may be all that exists or ever could exist. For example, it is unlikely that school officials or independent observers could determine students’ anxiety or fear, so self-report (level 4b) may be essential. MOOSES measures (level 5) may be objectively reported by school officials, but limiting attention to such measures may restrict SEL measurement to readily observable behaviors of importance to school management, such as aggression and truancy, and leave out difficult-to-observe behaviors such as bullying.

Still, we expect to find in our ongoing review of the SEL literature that there will be enough research on outcomes measured at level 3 or above to enable us to downplay levels 1 and 2 for school audiences, and in many cases to downplay reports by interested parties in level 4a, where teachers or parents who implement a program then rate the behavior of the children they served.
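For readers who code studies, the typology and the screening rule just described can be written down as a small lookup. This is only an illustrative sketch, not the procedure our review team actually uses: the names MeasureLevel, DOWNPLAYED, and emphasized_for_schools are hypothetical, and the sketch treats level 4a as always downplayed, although in practice that would be decided case by case.

```python
from enum import Enum


class MeasureLevel(Enum):
    """SEL measure typology, from least to most MOOSE-like (labels are illustrative)."""
    COGNITIVE = "1"            # cognitive or low-level SEL skills
    ATTITUDE = "2"             # attitudes toward SEL (non-behavioral)
    INTENTION = "3"            # intention for SEL behaviors (quasi-behavioral)
    REPORT_INTERESTED = "4a"   # behavior reports by an interested party
    REPORT_DISINTERESTED = "4b"  # behavior reports by a disinterested party
    MOOSES = "5"               # measurable, observable, objective measures


# Assumption: levels 1 and 2 and interested-party reports (4a) are the ones
# downplayed for school audiences; the text says 4a is downplayed "in many cases".
DOWNPLAYED = {
    MeasureLevel.COGNITIVE,
    MeasureLevel.ATTITUDE,
    MeasureLevel.REPORT_INTERESTED,
}


def emphasized_for_schools(level: MeasureLevel) -> bool:
    """Return True if a measure at this level would be emphasized for school audiences."""
    return level not in DOWNPLAYED


if __name__ == "__main__":
    for level in MeasureLevel:
        status = "emphasize" if emphasized_for_schools(level) else "downplay"
        print(f"Level {level.value}: {status}")
```

A reviewer could tag each outcome in a study with one of these levels and then filter on emphasized_for_schools before summarizing results for principals.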

Social emotional learning is important, and we need measures that reflect its importance, minimizing potential bias and staying as close as possible to independent, meaningful measures of behaviors that are of the greatest importance to educators. In our research team, we have very productive arguments about these measurement issues in the course of reviewing individual articles. I placed a cardboard cutout of a “principal” called “Norm” in our conference room. Whenever things get too theoretical, we consult “Norm” for his advice. For example, “Norm” is not too interested in pencil tapping and Stroop tests, but he sure cares a lot about bullying, aggression, and truancy. Of course, as part of our review we will be discussing our issues and initial decisions with real principals and educators, as well as other experts on SEL.

The growing number of studies of SEL in recent years enables reviewers to set higher standards than would have been feasible even just a few years ago. We still have to maintain a balance in which we can be as rigorous as possible but not end up with too few studies to review. We can all aspire to be MOOSES, but that is not practical for some measures. Instead, it is useful to have a model of the ideal and what approaches the ideal, so we can make sense of the studies that exist today, with all due recognition of when we are accepting measures that are nearly MOOSES but not quite the real Bullwinkle.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.