Compared to What? Getting Control Groups Right

Several years ago, I had a grant from the National Science Foundation to review research on elementary science programs. I therefore got to attend NSF conferences for principal investigators. At one such conference, we were asked to present poster sessions. The group next to mine was showing an experiment in science education that had remarkably large effect sizes. I got to talking with the very friendly researcher, and discovered that the experiment involved a four-week unit on a topic in middle school science. I think it was electricity. Initially, I was very excited, electrified even, but then I asked a few questions about the control group.

“Of course there was a control group,” he said. “They would have taught electricity too. It’s pretty much a required portion of middle school science.”

Then I asked, “When did the control group teach about electricity?”

“We had no way of knowing,” said my new friend.

“So it’s possible that they had a four-week electricity unit before the time when your program was in use?”

“Sure,” he responded.

“Or possibly after?”

“Could have been,” he said. “It would have varied.”

Being the nerdy sort of person I am, I couldn’t just let this go.

“I assume you pretested students at the beginning of your electricity unit and at the end?”

“Of course.”

“But wouldn’t this create the possibility that control classes that received their electricity unit before you began would have already finished the topic, so they would make no more progress in this topic during your experiment?”

“…I guess so.”

“And,” I continued, “students who received their electricity instruction after your experiment would make no progress either because they had no electricity instruction between pre- and posttest?”

I don’t recall how the conversation ended, but the point is, wonderful though my neighbor’s science program might be, the science achievement outcomes of his experiment were, well, meaningless.

In the course of writing many reviews of research, my colleagues and I encounter misuses of control groups all the time, even in articles in respected journals written by well-known researchers. So I thought I’d write a blog on the fundamental issues involved in using control groups properly, and the ways in which control groups are often misused.

The purpose of a control group

The purpose of a control group in any experiment, randomized or matched, is to provide a valid estimate of what the experimental group would have achieved had it not received the experimental treatment, or if the study had not taken place at all. Through random assignment or matching, the experimental and control groups should be essentially equal at pretest on all important variables (e.g., pretest scores, demographics), and the design should ensure that nothing happens in the course of the experiment to upset this initial equality.

How control groups go wrong

Inequality in opportunities to learn tested content. Often, experiments appear to be legitimate (e.g., experimental and control groups are well matched at pretest), but the design contains major bias, because the content being taught in the experimental group is not the same as the content taught in the control group, and the final outcome measure is aligned to what the experimental group was taught but not what the control group was taught. My story at the start of this blog was an example of this. Between pre- and posttest, all students in the experimental group were learning about electricity, but many of those in the control group had already completed electricity or had not received it yet, so they might have been making great progress on other topics, which were not tested, but were unlikely to make much progress on the electricity content that was tested. In this case, the experimental and control groups could be said to be unequal in opportunities to learn electricity. In such a case, it matters little what the exact content or teaching methods were for the experimental program. Teaching a lot more about electricity is sure to add to learning of that topic regardless of how it is taught.
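To see how much damage unequal opportunity to learn can do by itself, here is a minimal simulation sketch in Python. The numbers are invented and do not come from any real study; the sketch simply assumes that studying electricity between pretest and posttest adds about 20 points no matter which program is used, that every experimental class studies electricity during that window, and that only about a third of control classes happen to do so.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300  # students per group (invented)

# Invented assumptions: students start around 40 (out of 100) on an
# electricity test; covering electricity between pre- and posttest
# adds about 20 points, regardless of how it is taught.
pre_exp = rng.normal(40, 10, n)
pre_ctl = rng.normal(40, 10, n)
gain_if_taught = 20

# Every experimental class covers electricity during the study window.
post_exp = pre_exp + gain_if_taught + rng.normal(0, 10, n)

# Only about a third of control classes happen to cover electricity in
# the same window; the rest covered it earlier or will cover it later.
taught_now = rng.random(n) < 1 / 3
post_ctl = pre_ctl + gain_if_taught * taught_now + rng.normal(0, 10, n)

# "Effect size" = difference in gains / pooled SD of posttests.
gain_diff = (post_exp - pre_exp).mean() - (post_ctl - pre_ctl).mean()
pooled_sd = np.sqrt((post_exp.var(ddof=1) + post_ctl.var(ddof=1)) / 2)
print(f"Apparent effect size: {gain_diff / pooled_sd:+.2f}")
```

Under these assumptions the “program” shows a large positive apparent effect size even though it added nothing beyond ordinary electricity instruction; the entire difference comes from unequal opportunity to learn the tested content.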

There are many other circumstances in which opportunities to learn are unequal. Many studies use unusual content, and then use tests partially or completely aligned to this unusual content, but not to what the control group was learning. Another common case is when experimental students learn something using technology, while the control group uses paper and pencil to learn the same content. If the final test is given on the technology used by the experimental group but not by the control group, the potential for bias is obvious.


Unequal opportunity to learn, as a source of bias in experiments, relates to a topic I’ve written a lot about: the use of developer- or researcher-made outcome measures. Such measures may introduce unequal opportunities to learn, because they tend to be more aligned with what the experimental group was learning than with what the control group was learning. However, the problem of unequal opportunities to learn is broader than that of developer/researcher-made measures. For example, the story that began this blog illustrated serious bias, even though the measure could have been an off-the-shelf, valid measure of electricity concepts.

Problems with control groups that arise during the experiment. Many problems with control groups only arise after an experiment is under way, or completed. These involve situations in which some of the students, classes, or schools originally assigned are not counted in the analysis. Usually, these are cases in which, in theory, experimental and control groups have equal opportunities to learn the tested content at the beginning of the experiment, but some number of students assigned to the experimental group do not participate enough to be considered to have truly received the treatment. Typical examples include after-school and summer-school programs. A group of students is randomly assigned to receive after-school services, for example, but perhaps only 60% of the students actually show up, or attend enough days to constitute sufficient participation. The problem is that the researchers know exactly who attended and who did not in the experimental group, but they have no idea which control students would or would not have attended if the control group had had the opportunity. The 40% of students who did not attend can probably be assumed to be less motivated, lower achieving, to have less supportive parents, or to possess other characteristics that, on average, mark students who are less likely to do well than students in general. If the researchers drop these 40% of students, the remaining 60% who did participate are likely (on average) to be more motivated, higher achieving, and so on, so the experimental program may look a lot more effective than it truly is.

This kind of problem comes up quite often in studies of technology programs, because researchers can easily find out how often students in the experimental group actually logged in and did the required work. If they drop students who did not use the technology as prescribed, then the remaining students who did use it as intended are likely to perform better than control students, who are a mix of students who would or would not have used the technology if they’d had the chance. Because the control group contains both more and less motivated students, while the experimental group now contains only the students who were motivated to use the technology, the experimental group may have a huge advantage.

Problems of this kind can be avoided by using intent-to-treat (ITT) methods, in which all students who were pretested remain in the sample and are analyzed whether or not they used the software or attended the after-school program. Both the What Works Clearinghouse and Evidence for ESSA require the use of ITT models in situations of this kind. The problem is that use of ITT analyses almost invariably reduces estimates of effect sizes, but to do otherwise may introduce quite a lot of bias in favor of the experimental group.
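The size of the bias from dropping non-participants is easy to illustrate. Here is a minimal simulation sketch in Python, with invented numbers that are not drawn from any real evaluation. It assumes an after-school program with zero true effect, and that the students who actually attend (about 60% of those assigned) tend to be the more motivated, higher-achieving ones. An ITT analysis correctly finds essentially nothing; an analysis that quietly drops the non-attenders (sometimes called a per-protocol analysis) finds a sizable phantom effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000  # students per group (invented)

# Invented model: "motivation" drives both attendance and achievement;
# the after-school program itself adds NOTHING to the posttest.
motivation_exp = rng.normal(0, 1, n)
motivation_ctl = rng.normal(0, 1, n)
post_exp = 50 + 10 * motivation_exp + rng.normal(0, 10, n)
post_ctl = 50 + 10 * motivation_ctl + rng.normal(0, 10, n)

# Roughly 60% of assigned students attend, and attenders tend to be the
# more motivated ones. Attendance is only observed in the experimental group.
attended = rng.random(n) < 1 / (1 + np.exp(-(0.5 + 2 * motivation_exp)))

sd = np.sqrt((post_exp.var(ddof=1) + post_ctl.var(ddof=1)) / 2)

# Intent-to-treat: keep everyone who was assigned and pretested.
itt = (post_exp.mean() - post_ctl.mean()) / sd

# Per-protocol: drop experimental students who did not attend.
per_protocol = (post_exp[attended].mean() - post_ctl.mean()) / sd

print(f"ITT effect size:          {itt:+.2f}")           # close to zero
print(f"Per-protocol effect size: {per_protocol:+.2f}")  # clearly positive, from selection alone
```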

Experiments without control groups

Of course, there are valid research designs that do not require the use of control groups at all. These include regression discontinuity designs (in which students or schools are assigned to a treatment according to whether they score above or below a cutoff, and outcomes are examined for a sharp jump right at that cutoff) and single-case experimental designs (in which as few as one student/class/school is observed frequently to see what happens when treatment conditions change). However, these designs have their own problems, and single-case designs are rarely used outside of special education.
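For readers who have not seen a regression discontinuity analysis, here is a minimal sketch of the basic logic in Python. The data are invented, the cutoff and program are hypothetical, and real RD studies use far more careful bandwidth and functional-form choices than the simple local lines fitted here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical rule: students scoring below a cutoff on a screening test
# receive a remedial program; students at or above the cutoff do not.
screening = rng.normal(500, 50, n)
cutoff = 475
treated = screening < cutoff

true_effect = 8.0  # points added by the program in this invented world
outcome = 0.5 * screening + true_effect * treated + rng.normal(0, 10, n)

# Fit a separate line on each side of the cutoff (within a bandwidth) and
# estimate the jump between the two fitted lines at the cutoff itself.
bandwidth = 30
left = treated & (screening > cutoff - bandwidth)
right = ~treated & (screening < cutoff + bandwidth)

def fitted_at_cutoff(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    return slope * cutoff + intercept

jump = (fitted_at_cutoff(screening[left], outcome[left])
        - fitted_at_cutoff(screening[right], outcome[right]))
print(f"Estimated effect at the cutoff: {jump:+.1f} (true value {true_effect:+.1f})")
```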

Control groups are essential in most rigorous experimental research in education, and with proper design they can do what they were intended to do with little bias. Education researchers are becoming increasingly sophisticated about fair use of control groups. Next time I go to an NSF conference, for example, I hope I won’t see posters on experiments that compare students who received an experimental treatment to those who did not even receive instruction on the same topic between pretest and posttest.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Getting Schools Excited About Participating in Research

If America’s school leaders are ever going to get excited about evidence, they need to participate in it. It’s not enough to just make school leaders aware of programs and practices. Instead, they need to serve as sites for experiments evaluating programs that they are eager to implement, or at least have friends or peers nearby who are doing so.

The U.S. Department of Education has funded quite a lot of research on attractive programs. A lot of the studies it has funded have not shown positive impacts, but many programs have been found to be effective. Those effective programs could provide a means of engaging many schools in rigorous research, while at the same time serving as examples of how evidence can help schools improve their results.

Here is my proposal. It quite often happens that some part of the U.S. Department of Education wants to expand the use of proven programs on a given topic. For example, imagine that they wanted to expand use of proven reading programs for struggling readers in elementary schools, or proven mathematics programs in Title I middle schools.

Rather than putting out the usual request for proposals, the Department might announce that schools could qualify for funding to implement a proven program, but that in order to participate they would have to agree to take part in an evaluation of that program. They would have to identify two similar schools, from one district or from neighboring districts, that would agree to participate if the proposal were successful. One school in each pair would be assigned at random to use a given program in the first year or two, and the second school could start after the one- or two-year evaluation period was over. Schools would select from a list of proven programs and choose one that seems appropriate to their needs.
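To make the assignment step concrete, here is a minimal sketch in Python. The school names are placeholders, not real applicants; the point is simply that within each applying pair, a coin flip decides which school starts the program now and which serves as the control and starts after the evaluation window.

```python
import random

# Hypothetical applicant pairs of similar schools (placeholder names).
pairs = [
    ("Adams Elementary", "Baker Elementary"),
    ("Cedar Middle", "Dogwood Middle"),
    ("Eastside Elementary", "Franklin Elementary"),
]

random.seed(2020)  # fixed seed so the assignment is reproducible and auditable

for school_a, school_b in pairs:
    starts_now, starts_later = random.sample((school_a, school_b), 2)
    print(f"{starts_now}: program in years 1-2; "
          f"{starts_later}: control, program after the evaluation period")
```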

Many pairs of schools would be funded to use each proven program, so across all schools involved, this would create many large, randomized experiments. Independent evaluation groups would carry out the experiments. Students in participating schools would be pretested at the beginning of the evaluation period (one or two years), and posttested at the end, using tests independent of the developers or researchers.

There are many attractions to this plan. First, large randomized evaluations on promising programs could be carried out nationwide in real schools under normal conditions. Second, since the Department was going to fund expansion of promising programs anyway, the additional cost might be minimal, just the evaluation cost. Third, the experiment would provide a side-by-side comparison of many programs focusing on high-priority topics in very diverse locations. Fourth, the school leaders would have the opportunity to select the program they want, and would be motivated, presumably, to put energy into high-quality implementation. At the end of such a study, we would know a great deal about which programs really work in ordinary circumstances with many types of students and schools. But just as importantly, the many schools that participated would have had a positive experience, implementing a program they believe in and finding out in their own schools what outcomes the program can bring them. Their friends and peers would be envious and eager to get into the next study.

A few sets of studies of this kind could build a constituency of educators that might support the very idea of evidence. And this could transform the evidence movement, providing it with a national, enthusiastic audience for research.

Wouldn’t that be great?

 This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

New Sections on Social Emotional Learning and Attendance in Evidence for ESSA!

We are proud to announce the launch of two new sections of our Evidence for ESSA website (www.evidenceforessa.org): K-12 social-emotional learning and attendance. Funded by a grant from the Bill and Melinda Gates Foundation, the new sections represent our first foray beyond academic achievement.


The social-emotional learning section represents the greatest departure from our prior work. This is due to the nature of SEL, which combines many quite diverse measures. We identified 17 distinct measures, which we grouped into four overarching categories, as follows:

Academic Competence

  • Academic performance
  • Academic engagement

Problem Behaviors

  • Aggression/misconduct
  • Bullying
  • Disruptive behavior
  • Drug/alcohol abuse
  • Sexual/racial harassment or aggression
  • Early/risky sexual behavior

Social Relationships

  • Empathy
  • Interpersonal relationships
  • Pro-social behavior
  • Social skills
  • School climate

Emotional Well-Being

  • Reduction of anxiety/depression
  • Coping skills/stress management
  • Emotional regulation
  • Self-esteem/self-efficacy

Evidence for ESSA reports overall effect sizes and ratings for each of the four categories, as well as for the 17 individual measures (which are themselves composed of many measures used by various qualifying studies). So in contrast to reading and math, where programs are rated based on the average of all qualifying reading or math measures, an SEL program could be rated “strong” in one category, “promising” in another, and “no qualifying evidence” or “qualifying studies found no significant positive effects” in others.
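To make the structure concrete, here is a minimal sketch in Python of the kind of category-level roll-up this implies. The program, its measures, and the effect sizes are all invented, and a simple unweighted mean within each category is assumed purely for illustration; Evidence for ESSA’s actual averaging and rating rules may differ.

```python
from statistics import mean

# Invented measure-level effect sizes for a hypothetical SEL program,
# grouped into the four overarching categories listed above.
effect_sizes = {
    "Academic Competence": {"Academic performance": 0.21, "Academic engagement": 0.15},
    "Problem Behaviors": {"Aggression/misconduct": 0.05, "Bullying": 0.02},
    "Social Relationships": {"Social skills": 0.30, "School climate": 0.26},
    "Emotional Well-Being": {},  # no qualifying studies measured this category
}

for category, measures in effect_sizes.items():
    if not measures:
        print(f"{category}: no qualifying evidence")
    else:
        overall = mean(measures.values())  # unweighted mean, for illustration only
        print(f"{category}: overall effect size {overall:+.2f}")
```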

Social-Emotional Learning

The SEL review, led by Sooyeon Byun, Amanda Inns, Cynthia Lake, and Liz Kim at Johns Hopkins University, located 24 SEL programs that both met our inclusion standards and had at least one study that met strong, moderate, or promising standards on at least one of the four categories of outcomes.

There is much more evidence at the elementary and middle school levels than at the high school level. Recognizing that some programs had qualifying outcomes at multiple grade levels, there were 7 programs with positive evidence for pre-K/K, 10 for grades 1-2, 13 for grades 3-6, and 9 for middle school. In contrast, there were only 4 programs with positive effects in senior high schools. Fourteen studies took place in urban locations, 5 in suburbs, and 5 in rural districts.

The outcome variables most often showing positive impacts include social skills (12), school climate (10), academic performance (10), pro-social behavior (8), aggression/misconduct (7), disruptive behavior (7), academic engagement (7), interpersonal relationships (7), anxiety/depression (6), bullying (6), and empathy (5). Fifteen of the programs targeted whole classes or schools, and 9 targeted individual students.

Several programs stood out in terms of the size of the impacts. Take the Lead found effect sizes of +0.88 for social relationships and +0.51 for problem behaviors. Check, Connect, and Expect found effect sizes of +0.51 for emotional well-being, +0.29 for problem behaviors, and +0.28 for academic competence. I Can Problem Solve found effect sizes of +0.57 on school climate. The Incredible Years Classroom and Parent Training Approach reported effect sizes of +0.57 for emotional regulation, +0.35 for pro-social behavior, and +0.21 for aggression/misconduct. The related Dinosaur School classroom management model reported effect sizes of +0.31 for aggression/misconduct. Class-Wide Function-Related Intervention Teams (CW-FIT), an intervention for elementary students with emotional and behavioral disorders, had effect sizes of +0.47 and +0.30 across two studies for academic engagement and +0.38 and +0.21 for disruptive behavior. It also reported effect sizes of +0.37 for interpersonal relationships, +0.28 for social skills, and +0.26 for empathy. Student Success Skills reported effect sizes of +0.30 for problem behaviors, +0.23 for academic competence, and +0.16 for social relationships.

In addition to the 24 highlighted programs, Evidence for ESSA lists 145 programs that were no longer available, had no qualifying studies (e.g., no control group), or had one or more qualifying studies but none that met the ESSA Strong, Moderate, or Promising criteria. These programs can be found by clicking on the “search” bar.

There are many problems inherent to interpreting research on social-emotional skills. One is that some programs may appear more effective than others because they use measures such as self-report, or behavior ratings by the teachers who taught the program. In contrast, studies that used more objective measures, such as independent observations or routinely collected data, may obtain smaller impacts. Also, SEL studies typically measure many outcomes and only a few may have positive impacts.

In the coming months, we will be doing analyses and looking for patterns in the data, and will have more to say about overall generalizations. For now, the new SEL section provides a guide to what we know now about individual programs, but there is much more to learn about this important topic.

Attendance

Our attendance review was led by Chenchen Shi, Cynthia Lake, and Amanda Inns. It located ten attendance programs that met our standards. Only three of these reported on chronic absenteeism, which refers to students missing more than 10% of school days. Most of the programs focused instead on average daily attendance (ADA). Among programs focused on average daily attendance, a Milwaukee elementary school program called SPARK had the largest impact (ES=+0.25). This is not an attendance program per se, but it uses AmeriCorps members to provide tutoring services across the school, as well as involving families. SPARK has been shown to have strong effects on reading, as well as its impressive effects on attendance. Positive Action is another schoolwide approach, in this case focused on SEL. It has been found in two major studies in grades K-8 to improve student reading and math achievement, as well as overall attendance, with a mean effect size of +0.20.

The one program to report data on both ADA and chronic absenteeism is called Attendance and Truancy Intervention and Universal Procedures, or ATI-UP. It reported an effect size in grades K-6 of +0.19 for ADA and +0.08 for chronic absenteeism. Talent Development High School (TDHS) is a ninth-grade intervention program that provides interdisciplinary learning communities and “double dose” English and math classes for students who need them. TDHS reported an effect size of +0.17.

An interesting approach with a modest effect size but a very low cost is now called EveryDay Labs (formerly InClass Today). This program helps schools organize and implement a system to send postcards to parents reminding them of the importance of student attendance. If students start missing school, the postcards include this information as well. The effect size across two studies was a respectable +0.16.

As with SEL, we will be doing further work to draw broader lessons from research on attendance in the coming months. One pattern that seems clear already is that effective attendance improvement models work on building close relationships between at-risk students and concerned adults. None of the effective programs primarily uses punishment to improve attendance, but instead they focus on providing information to parents and students and on making it clear to students that they are welcome in school and missed when they are gone.

Both SEL and attendance are topics of much discussion right now, and we hope these new sections will be useful and timely in helping schools make informed choices about how to improve social-emotional and attendance outcomes for all students.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.