Don’t Just Do Something. Do Something Effective.

I recently visited York, England, where my wife and I worked part-time for about 8 years. York is world famous for its huge cathedral, intact medieval walls, medieval churches, and other medieval sights. But on this trip we had some time for local touring, and chose to visit a more modern place, but one far ghastlier than a ton of dungeons.

The place is the York Cold War Bunker. Built in 1961 and operated until 1991, it was intended to monitor the results of a nuclear attack on Britain. Volunteers, mostly women, were trained to detect the locations, sizes, and radiation levels of nuclear bombs dropped on Britain. This was a command bunker that collected its own data, with a staff of 60, but also monitored dozens of three-man bunkers all over the North of England, all collecting similar data. The idea was that a national network of these bunkers would determine where in the country it was safe to go after a nuclear war. The bunker had air, water, and food for 30 days, after which the volunteers had to leave. And most likely die of radiation poisoning.

[Photo: York Cold War Bunker]

The very interesting docent informed us of one astounding fact. When the bunker network was planned in 1957, the largest nuclear weapons were like those used in Hiroshima and Nagasaki, less than one megaton in yield. By 1961, when the bunkers started operation, the largest bombs were 50-megaton behemoths.

The day the Soviet Union successfully tested its 50-megaton bomb, the bunkers were instantly obsolete. Not only would a single bomb create fatal levels of radiation all over Britain, but it would also likely destroy the telephone and radio systems on which the bunkers depended.

Yet for 30 years, this utterly useless system was maintained, with extensive training, monitoring, and support.

There must have been thousands of military leaders, politicians, scientists, and ordinary readers of Popular Science, who knew full well that the bunkers were useless from the day they opened. The existence of the bunkers was not a secret, and in fact it was publicized. Why were they maintained? And what does this have to do with educational research?

The Cold War Bunkers illustrate an aspect of human nature that is important in understanding all sorts of behavior. When a catastrophe is impending, people find it comforting to do something, even if that something is known (by some at least) to be useless or even counterproductive. The British government could simply not say to its citizens that in case of a nuclear war, everyone was toast. Full stop. Instead, they had to offer hope, however slim. Around the same time the (doomed) bunkers were going into operation in Britain, my entire generation of students was learning to crawl under our desks for protection in case of nuclear attack. I suppose it made some people think that, well, at least something was being done. It scared the bejabbers out of us kids, but no one asked us.

In education, we face many very difficult, often terrifying problems. Every one of them has one or more widespread solutions. But do these solutions work?

Consider DARE (Drug Abuse Resistance Education), a well-researched example of what might be called “do-something-itis.” Research on DARE has never found positive effects on drug or alcohol abuse, and sometimes finds negative effects. In the case of DARE, there are many alternative drug and alcohol prevention programs that have been proven effective. Yet DARE continues, giving concerned educators and parents a comforting sense that something is being done to prevent drug and alcohol abuse among their teenagers.

Another good example of “do-something-itis” is benchmark assessments, in which students take brief versions of their state tests 4-5 times a year, to give teachers and principals early warnings about areas in which students might be lagging or need additional, targeted assistance. This sounds like a simple, obvious strategy to improve test scores. However, in our reviews of research on elementary and secondary reading and elementary mathematics, the effect sizes for benchmark assessments average close to 0.00. Yet I’m sure that schools will still be using benchmark assessments for many years, because with all the importance placed on state tests, educators will always feel better doing something focused on the problem. Of course, they should do something, actually quite a lot, but why not use “somethings” proven to work instead of benchmark assessments proven not to work?

In education, there are many very serious problems, and each one is given a solution that seems to address it. Often, the solutions are unresearched, or researched and found to be ineffective. A unifying attribute of these solutions is that they are simple and easy to understand, so most people are satisfied that at least something is being done. One example is the many states that, to address serious gaps in high school literacy, threaten to retain third graders who are not reading adequately (typically, at “proficient” levels on state tests). Yet in most states, the programs used to improve student reading in grades K-3 are not proven to be effective. Often, the solution provided is a single reading teacher offering one-to-one tutoring to students in K-3. One-to-one tutoring is very effective for the students who get it, but an average U.S. school has 280 students in grades K-3, about half of whom are unlikely to score proficient in third grade. Obviously, one tutor working one-to-one cannot do much for 140 students. Again, there are effective and cost-effective alternatives, such as proven one-to-small-group tutoring by teaching assistants, but few states or schools use proven strategies of this kind.

I could go on, but I’m sure you get the idea. School systems can be seen as a huge network of dedicated people working very hard to accomplish crucial goals. Sort of like Cold War Bunkers. Yet many of their resources, talents, and efforts are underutilized, because most school systems insist on using programs and practices that appear to be doing something to prevent or solve major problems, but that have not been proven to do so.

It is time for our field to focus the efforts and abilities of its talented, hard-working teachers and principals on solutions that are not just doing something, but doing something effective. Every year, rigorous experiments identify more and more programs that are proven to work. This research progressively undermines the argument that, in the face of serious problems, doing something is at least better than doing nothing. In most areas of education, doing nothing is not the relevant option. If we know how to solve these problems, then the alternative to doing something of unknown value is not doing nothing. The cure for do-something-itis is doing something that works.

Photo credit: Nilfanion [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)]

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Measuring Social Emotional Skills in Schools: Return of the MOOSES

Throughout the U.S., there is huge interest in improving students’ social emotional learning (SEL) skills and related behaviors. This is indeed important as a means of building tomorrow’s society. However, measuring SEL skills is terribly difficult. Not that measuring reading, math, or science learning is easy, but there are at least accepted measures in those areas. In SEL, almost anything goes, and measures cover an enormous range. Some might be fine for theoretical research, and some would be all right if they were given independently of the teachers who administered the treatment, but SEL measures are inherently squishy.

A few months ago, I wrote a blog on measurement of social emotional skills. In it, I argued that social emotional skills should be measured in pragmatic school research as objectively as possible, especially to avoid measures that merely reflect having students in experimental groups repeating back attitudes or terminology they learned in the program. I expressed the ideal for social emotional measurement in school experiments as MOOSES: Measurable, Observable, Objective, Social Emotional Skills.

Since that time, our group at Johns Hopkins University has received a generous grant from the Gates Foundation to add research on social emotional skills and attendance to our Evidence for ESSA website. This has enabled our group to dig a lot deeper into measures for social emotional learning. In particular, JHU graduate student Sooyeon Byun created a typology of SEL measures arrayed from least to most MOOSE-like. This is as follows.

  1. Cognitive Skills or Low-Level SEL Skills.

Examples include executive functioning tasks such as pencil tapping, the Stroop test, and other measures of cognitive regulation, as well as recognition of emotions. These skills may be of importance as part of theories of action leading to social emotional skills of importance to schools, but they are not goals of obvious importance to educators in themselves.

  2. Attitudes toward SEL (non-behavioral).

These include agreement with statements such as “bullying is wrong,” and statements about why other students engage in certain behaviors (e.g., “He spilled the milk because he was mean.”).

  3. Intention for SEL behaviors (quasi-behavioral).

Scenario-based measures (e.g., what would you do in this situation?).

  4. SEL behaviors based on self-report (semi-behavioral).

Reports of actual behaviors of self, or observations of others, often with frequencies (e.g., “How often have you seen bullying in this school during this school year?” or “How often do you feel anxious or afraid in class in this school?”).

This category was divided according to who is reporting:

4a. Interested party (e.g., report by teachers or parents who implemented the program and may have reason to want to give a positive report)

4b. Disinterested party (e.g., report by students or by teachers or parents who did not administer the treatment)

  5. MOOSES (Measurable, Observable, Objective Social Emotional Skills).
  • Behaviors observed by independent observers, either researchers, ideally unaware of treatment assignment, or by school officials reporting on behaviors as they always would, not as part of a study (e.g., regular reports of office referrals for various infractions, suspensions, or expulsions).
  • Standardized tests
  • Other school records

[Photo: two moose]

Uses for MOOSES

All other things being equal, school researchers and educators should want to know about measures as high as possible on the MOOSES scale. However, all things are never equal, and in practice, some measures lower on the MOOSES scale may be all that exist or ever could exist. For example, it is unlikely that school officials or independent observers could determine students’ anxiety or fear, so self-report (level 4b) may be essential. MOOSES measures (level 5) may be objectively reported by school officials, but restricting attention to such measures may limit SEL measurement to readily observable behaviors, such as aggression, truancy, and other behaviors of importance to school management, and not to difficult-to-observe behaviors such as bullying.

Still, we expect to find in our ongoing review of the SEL literature that there will be enough research on outcomes measured at level 3 or above to enable us to downplay levels 1 and 2 for school audiences, and in many cases to downplay reports by interested parties in level 4a, where teachers or parents who implement a program then rate the behavior of the children they served.
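To make the typology and that reviewing rule concrete, here is a minimal sketch in Python of how a reviewer might tag each outcome measure with its MOOSES level and then downplay levels 1, 2, and interested-party reports (4a). The measure names and the helper function are hypothetical, invented purely for illustration; they are not part of the Evidence for ESSA review procedures.

```python
from dataclasses import dataclass

# Sooyeon Byun's typology, from least (1) to most (5) MOOSES-like.
LEVELS = {
    1: "Cognitive or low-level SEL skills (e.g., pencil tapping, Stroop)",
    2: "Attitudes toward SEL (non-behavioral)",
    3: "Intentions for SEL behaviors (scenario-based)",
    4: "SEL behaviors based on self-report",
    5: "MOOSES: independently observed behaviors, school records, standardized tests",
}

@dataclass
class Measure:
    name: str
    level: int                         # 1-5, as in LEVELS above
    interested_reporter: bool = False  # True marks level 4a (rated by whoever ran the program)

def usable_for_school_audiences(m: Measure) -> bool:
    """Rule sketched above: downplay levels 1-2 and interested-party reports (4a)."""
    return m.level >= 3 and not m.interested_reporter

# Hypothetical measures from a single study, for illustration only.
measures = [
    Measure("Stroop test", level=1),
    Measure("Teacher rating of aggression (teacher delivered the program)", level=4, interested_reporter=True),
    Measure("Student-reported anxiety", level=4),
    Measure("Office referrals from routine school records", level=5),
]

usable = [m for m in measures if usable_for_school_audiences(m)]
best = max(usable, key=lambda m: m.level)  # most MOOSES-like measure available
for m in usable:
    print(f"keep: {m.name} (level {m.level}: {LEVELS[m.level]})")
print("most MOOSES-like:", best.name)
```

In this sketch, the only design decision is the ordering itself: given several acceptable measures of the same construct, prefer the one highest on the MOOSES scale.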

Social emotional learning is important, and we need measures that reflect its importance, minimizing potential bias and staying as close as possible to independent, meaningful measures of behaviors that are of the greatest importance to educators. In our research team, we have very productive arguments about these measurement issues in the course of reviewing individual articles. I placed a cardboard cutout of a “principal” called “Norm” in our conference room. Whenever things get too theoretical, we consult “Norm” for his advice. For example, “Norm” is not too interested in pencil tapping and Stroop tests, but he sure cares a lot about bullying, aggression, and truancy. Of course, as part of our review we will be discussing our issues and initial decisions with real principals and educators, as well as other experts on SEL.

The growing number of studies of SEL in recent years enables reviewers to set higher standards than would have been feasible even just a few years ago. We still have to maintain a balance in which we can be as rigorous as possible but not end up with too few studies to review. We can all aspire to be MOOSES, but that is not practical for some measures. Instead, it is useful to have a model of the ideal and of what approaches the ideal, so we can make sense of the studies that exist today, with all due recognition of when we are accepting measures that are nearly MOOSES, but not quite the real Bullwinkle.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

A Mathematical Mystery

My colleagues and I wrote a review of research on elementary mathematics (Pellegrini, Inns, Lake, & Slavin, 2018). I’ve written about it before, but I wanted to home in on one extraordinary set of findings.

In the review, there were 12 studies that evaluated programs focused on providing professional development for elementary teachers in mathematics content and mathematics-specific pedagogy. I was sure that this category would show positive effects on student achievement, but it did not. The most remarkable (and depressing) finding involved the huge year-long Intel study, in which 80 teachers received 90 hours of very high-quality in-service during the summer, followed by an additional 13 hours of group discussions of videos of the participants’ class lessons. Teachers using this program were compared to 85 control teachers. After all this, students in the Intel classes scored slightly worse than controls on standardized measures (Garet et al., 2016).

If the Intel study were the only disappointment, one might look for flaws in their approach or their evaluation design or other things specific to that study. But as I noted earlier, all 12 of the studies of this kind failed to find positive effects, and the mean effect size was only +0.04 (n.s.).

Lest anyone jump to the conclusion that nothing works in elementary mathematics, I would point out that this is not the case. The most impactful category was tutoring programs, but that’s a special case. The second most impactful category had many features in common with professional development focused on mathematics content and pedagogy, yet had an average effect size of +0.25. This category consisted of programs focused on classroom management and motivation: cooperative learning, classroom management strategies using group contingencies, and programs focusing on social emotional learning.

So there are successful strategies in elementary mathematics, and they all provided a lot of professional development. Yet programs for mathematics content and pedagogy, all of which also provided a lot of professional development, did not show positive effects in high-quality evaluations.

I have some ideas about what may be going on here, but I advance them cautiously, as I am not certain about them.

The theory of action behind professional development focused on mathematics content and pedagogy assumes that elementary teachers have gaps in their understanding of mathematics content and mathematics-specific pedagogy. But perhaps whatever gaps they have are not so important. Here is one example. Leading mathematics educators today take a very strong view that fractions should never be taught using pizza slices, but only using number lines. The idea is that pizza slices are limited to certain fractional concepts, while number lines are more inclusive of all uses of fractions. I can understand and, in concept, support this distinction. But how much difference does it make? Students who are learning fractions can probably be divided into three pizza slices. One slice represents students who understand fractions very well, however they are presented, and another slice consists of students who have no earthly idea about fractions. The third slice consists of students who could learn fractions if they were taught with number lines but not with pizzas. The relative sizes of these slices vary, but I’d guess the third slice is the smallest. Whatever it is, the number of students whose success depends on pizzas vs. number lines is unlikely to be large enough to shift the whole group mean very much, and that is what is reported in evaluations of mathematics approaches. For example, if the “already got it” slice is one third of all students, and the “probably won’t get it” slice is also one third, the slice consisting of students who might get the concept one way but not the other is also one third. If the effect size for that middle slice were as high as an improbable +0.20, the average for all students would be less than +0.07, averaging across the whole pizza.
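As a quick check of that arithmetic, here is the back-of-the-envelope calculation, assuming the three slices are exactly one third each and that the first two slices gain nothing from the change:

```latex
\[
\mathrm{ES}_{\text{overall}}
  \approx \tfrac{1}{3}(0) + \tfrac{1}{3}(0) + \tfrac{1}{3}(0.20)
  \approx 0.067 < 0.07
\]
```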

[Photo: pizza slices]

A related possibility concerns teachers’ knowledge. Assume that one slice of teachers already knows a lot of the content before the training. Another slice is not going to learn or use it. The third slice, those who did not know the content before but will use it effectively after training, is the only slice likely to show a benefit, but this benefit will be swamped by the zero effects for the teachers who already knew the content and those who will not learn or use it.

If teachers are standing at the front of the class explaining mathematical concepts, such as proportions, a certain proportion of students are learning the content very well and a certain proportion are bored, terrified, or just not getting it. It’s hard to imagine that the successful students are gaining much from a change of content or pedagogy, and only a small proportion of the unsuccessful students will all of a sudden understand what they did not understand before, just because it is explained better. But imagine that instead of only changing content, the teacher adopts cooperative learning. Now the students are having a lot of fun working with peers. Struggling students have an opportunity to ask for explanations and help in a less threatening environment, and they get a chance to see and ultimately absorb how their more capable teammates approach and solve difficult problems. The already high-achieving students may become even higher achieving, because as every teacher knows, explanation helps the explainer as much as the student receiving the explanation.

The point I am making is that the findings of our mathematics review may reinforce a general lesson we take away from all of our reviews: Subtle treatments produce subtle (i.e., small) impacts. Students quickly establish themselves as high or average or low achievers, after which time it is difficult to fundamentally change their motivations and approaches to learning. Making modest changes in content or pedagogy may not be enough to make much difference for most students. Instead, dramatically changing motivation, providing peer assistance, and making mathematics more fun and rewarding, seems more likely to make a significant change in learning than making subtle changes in content or pedagogy. That is certainly what we have found in systematic reviews of elementary mathematics and elementary and secondary reading.

Whatever the student outcomes are compared to controls, there may be good reason to improve mathematics content and pedagogy. But if we are trying to improve achievement for all students, the whole pizza, we need to use methods that make a more profound impact on all students. And that is true any way you slice it.

References

Garet, M. S., Heppen, J. B., Walters, K., Parkinson, J., Smith, T. M., Song, M., & Borman, G. D. (2016). Focusing on mathematical knowledge: The impact of content-intensive teacher professional development (NCEE 2016-4010). Washington, DC: U.S. Department of Education.

Pellegrini, M., Inns, A., Lake, C., & Slavin, R. E. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the Society for Research on Educational Effectiveness, Washington, DC.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

Systems

What came first? The can or the can opener?

The answer to this age-old question is that the modern can and can opener were invented at exactly the same moment. This had to be true because a can without a can opener (yes, they existed) is of very little value, and a can opener without a can is the sound of one hand clapping (i.e., less than worthless).

The can and the can opener are together a system. Between them, they make it possible to preserve, transport, and distribute foods.

[Photo: opening a can]

In educational innovation, we frequently talk as though individual variables are sufficient to improve student achievement. You hear things like “more time is good,” “more technology is good,” and so on. Any of these factors can be effective as part of a system of innovations, or useless or harmful without other aligned components. As one example, consider time. A recent Florida study provided an extra hour each day for reading instruction, 180 hours over the course of a year, at a cost of about $800 per student, or $300,000-$400,000 per school. The effect on reading performance, compared to schools that did not receive additional time, was very small (effect size = +0.09). In contrast, time used for one-to-one or one-to-small-group tutoring by teaching assistants, for example, can have a much larger impact on reading in elementary schools (effect size = +0.29), at about half the cost. As a system, cost-effective tutoring requires a coordinated combination of time, training for teaching assistants, use of proven materials, and monitoring of progress. Separately, each of these factors is nowhere near as effective as all of them taken together in a coordinated system. Each is a can with no can opener, or a can opener with no can: the sound of one hand clapping. Together, they can be very effective.
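A rough comparison of those two numbers makes the point. This is a sketch only: the ~$400 per-student figure simply takes “about half the cost” at face value, and it ignores everything except cost and average effect size.

```latex
% Extra hour of reading instruction:   0.09 ES / $800 per student  ~ 0.011 ES per $100
% Small-group tutoring (assumed cost): 0.29 ES / ~$400 per student ~ 0.073 ES per $100
\[
\frac{0.09}{8} \approx 0.011
\qquad\text{vs.}\qquad
\frac{0.29}{4} \approx 0.073
\quad\text{(effect size per \$100 per student)}
\]
```

On those assumptions, the tutoring system buys roughly six times as much effect per dollar as the extra, unstructured hour.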

The importance of systems explains why programs are so important. Programs invariably combine individual elements in an attempt to improve student outcomes. Not all programs are effective, of course, but those that have been proven to work have hit upon a balanced combination of instructional methods, classroom organization, professional development, technology, and supportive materials that works when implemented together with care and attention. The opposite of a program is a “variable,” such as “time” or “technology,” that educators try to use with few consistent, proven links to other elements.

All successful human enterprises, such as schools, involve many individual variables. Moving these enterprises forward in effectiveness can rarely be done by changing one variable. Instead, we have to design coordinated plans to improve outcomes. A can opener can’t, a can can’t, but together, a can opener and a can can.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.