A Mathematical Mystery

My colleagues and I wrote a review of research on elementary mathematics (Pellegrini, Lake, Inns, & Slavin, 2018). I’ve written about it before, but I wanted to home in on one extraordinary set of findings.

In the review, there were 12 studies that evaluated programs focused on providing professional development for elementary teachers in mathematics content and mathematics-specific pedagogy. I was sure that this category would show positive effects on student achievement, but it did not. The most remarkable (and depressing) finding involved the huge year-long Intel study, in which 80 teachers received 90 hours of very high-quality in-service during the summer, followed by an additional 13 hours of group discussions of videos of the participants’ class lessons. Teachers using this program were compared to 85 control teachers. After all this, students in the Intel classes scored slightly worse than controls on standardized measures (Garet et al., 2016).

If the Intel study were the only disappointment, one might look for flaws in their approach or their evaluation design or other things specific to that study. But as I noted earlier, all 12 of the studies of this kind failed to find positive effects, and the mean effect size was only +0.04 (n.s.).

Lest anyone jump to the conclusion that nothing works in elementary mathematics, I would point out that this is not the case. The most impactful category was tutoring programs, but that’s a special case. The second most impactful category had many features in common with professional development focused on mathematics content and pedagogy, yet had an average effect size of +0.25. This category consisted of programs focused on classroom management and motivation: cooperative learning, classroom management strategies using group contingencies, and programs focusing on social-emotional learning.

So there are successful strategies in elementary mathematics, and they all provided a lot of professional development. Yet programs for mathematics content and pedagogy, all of which also provided a lot of professional development, did not show positive effects in high-quality evaluations.

I have some ideas about what may be going on here, but I advance them cautiously, as I am not certain about them.

The theory of action behind professional development focused on mathematics content and pedagogy assumes that elementary teachers have gaps in their understanding of mathematics content and mathematics-specific pedagogy. But perhaps whatever gaps they have are not so important. Here is one example. Leading mathematics educators today take a very strong view that fractions should never be taught using pizza slices, but only using number lines. The idea is that pizza slices are limited to certain fractional concepts, while number lines are more inclusive of all uses of fractions. I can understand and, in concept, support this distinction. But how much difference does it make? Students who are learning fractions can probably be divided into three pizza slices. One slice represents students who understand fractions very well, however they are presented, and another slice consists of students who have no earthly idea about fractions. The third slice consists of students who could have learned fractions if they were taught with number lines but not pizzas. The relative sizes of these slices vary, but I’d guess the third slice is the smallest. Whatever its size, the number of students whose success depends on pizzas vs. number lines is unlikely to be large enough to shift the whole-group mean very much, and that is what is reported in evaluations of mathematics approaches. For example, if the “already got it” slice is one third of all students, and the “probably won’t get it” slice is also one third, the slice consisting of students who might get the concept one way but not the other is also one third. If the effect size for that third slice were as high as an improbable +0.20, the average for all students would be less than +0.07, averaging across the whole pizza.
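For readers who like to see the arithmetic, here is a minimal sketch in Python of the dilution argument, using the made-up one-third slices from the example above:

```python
# Hypothetical illustration: even a +0.20 effect confined to one third of
# students shrinks to roughly +0.07 when averaged across the whole pizza.
slices = {
    "already got it":                (1/3, 0.00),  # (share of students, effect size)
    "got it one way, not the other": (1/3, 0.20),
    "probably won't get it":         (1/3, 0.00),
}

overall = sum(share * effect for share, effect in slices.values())
print(f"Whole-group mean effect size: {overall:+.3f}")  # prints about +0.067
```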


A related possibility involves teachers’ knowledge. Assume that one slice of teachers already knows a lot of the content before the training. Another slice is not going to learn or use it. The third slice, those who did not know the content before but will use it effectively after training, is the only slice likely to show a benefit, but this benefit will be swamped by the zero effects for the teachers who already knew the content and those who will not learn or use it.

If teachers are standing at the front of the class explaining mathematical concepts, such as proportions, a certain proportion of students are learning the content very well and a certain proportion are bored, terrified, or just not getting it. It’s hard to imagine that the successful students are gaining much from a change of content or pedagogy, and only a small proportion of the unsuccessful students will all of a sudden understand what they did not understand before, just because it is explained better. But imagine that instead of only changing content, the teacher adopts cooperative learning. Now the students are having a lot of fun working with peers. Struggling students have an opportunity to ask for explanations and help in a less threatening environment, and they get a chance to see and ultimately absorb how their more capable teammates approach and solve difficult problems. The already high-achieving students may become even higher achieving because, as every teacher knows, explanation helps the explainer as much as the student receiving the explanation.

The point I am making is that the findings of our mathematics review may reinforce a general lesson we take away from all of our reviews: subtle treatments produce subtle (i.e., small) impacts. Students quickly establish themselves as high, average, or low achievers, after which it is difficult to fundamentally change their motivations and approaches to learning. Making modest changes in content or pedagogy may not be enough to make much difference for most students. Instead, dramatically changing motivation, providing peer assistance, and making mathematics more fun and rewarding seem more likely to produce significant changes in learning than subtle changes in content or pedagogy. That is certainly what we have found in systematic reviews of elementary mathematics and elementary and secondary reading.

Whatever the student outcomes are compared to controls, there may be good reason to improve mathematics content and pedagogy. But if we are trying to improve achievement for all students, the whole pizza, we need to use methods that make a more profound impact on all students. And that is true any way you slice it.

References

Garet, M. S., Heppen, J. B., Walters, K., Parkinson, J., Smith, T. M., Song, M., & Borman, G. D. (2016). Focusing on mathematical knowledge: The impact of content-intensive teacher professional development (NCEE 2016-4010). Washington, DC: U.S. Department of Education.

Pellegrini, M., Lake, C., Inns, A., & Slavin, R. E. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 


Systems

What came first? The can or the can opener?

The answer to this age-old question is that the modern can and can opener were invented at exactly the same moment. This had to be true because a can without a can opener (yes, they existed) is of very little value, and a can opener without a can is the sound of one hand clapping (i.e., less than worthless).

The can and the can opener are together a system. Between them, they make it possible to preserve, transport, and distribute foods.


In educational innovation, we frequently talk as though individual variables are sufficient to improve student achievement. You hear things like “more time = good,” “more technology = good,” and so on. Any of these factors can be effective as part of a system of innovations, or useless or harmful without other aligned components. As one example, consider time. A recent Florida study provided an extra hour each day for reading instruction, 180 hours over the course of a year, at a cost of about $800 per student, or $300,000 to $400,000 per school. The effect on reading performance, compared to schools that did not receive additional time, was very small (effect size = +0.09). In contrast, time used for one-to-one or one-to-small-group tutoring by teaching assistants, for example, can have a much larger impact on reading in elementary schools (effect size = +0.29), at about half the cost. As a system, cost-effective tutoring requires a coordinated combination of time, training for teaching assistants, use of proven materials, and monitoring of progress. Separately, each of these factors is nowhere near as effective as all of them taken together in a coordinated system. Each is a can with no can opener, or a can opener with no can: the sound of one hand clapping. Together, they can be very effective.
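As a rough illustration of the cost-effectiveness point, here is a back-of-the-envelope sketch using the figures cited above; the $400 tutoring cost is simply an assumption standing in for “about half” of $800:

```python
# Back-of-the-envelope comparison: dollars per full effect-size unit.
# The tutoring cost below is assumed to be roughly half of the extended-day cost.
options = {
    "Extra hour of daily reading time":            {"cost": 800, "effect_size": 0.09},
    "One-to-small-group tutoring by TAs":          {"cost": 400, "effect_size": 0.29},
}

for name, o in options.items():
    print(f"{name}: ~${o['cost'] / o['effect_size']:,.0f} per effect-size unit")
# Extra time: ~$8,889 per effect-size unit; tutoring: ~$1,379 per effect-size unit.
```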

The importance of systems explains why programs are so important. Programs invariably combine individual elements to attempt to improve student outcomes. Not all programs are effective, of course, but those that have been proven to work have hit upon a balanced combination of instructional methods, classroom organization, professional development, technology, and supportive materials that, if implemented together with care and attention, produces the proven benefits. The opposite of a program is a “variable,” such as “time” or “technology,” that educators try to use with few consistent, proven links to other elements.

All successful human enterprises, such as schools, involve many individual variables. Moving these enterprises forward in effectiveness can rarely be done by changing one variable. Instead, we have to design coordinated plans to improve outcomes. A can opener can’t, a can can’t, but together, a can opener and a can can.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

How Tutoring Could Benefit Students Who Do Not Need It

If you’ve been following my blogs, or if you know research on tutoring, you know that tutoring is hugely beneficial to the students who receive it. Recent research in both reading and math is finding important impacts of forms of tutoring that are much less expensive, and more scalable, than the one-to-one tutoring by certified teachers that was once dominant. A review of research my colleagues and I did on effective programs for struggling readers found a mean effect size of +0.29 for one-to-small-group tutoring provided by teaching assistants, across six studies of five programs involving grades K-5 (Inns, Lake, Pellegrini, & Slavin, 2018). Looking across the whole tutoring literature, in math as well as reading, positive outcomes of less expensive forms of tutoring are reliable and robust.

My focus today, however, is not on children who receive tutoring. It’s on all the other children. How does tutoring for the one third to one half of students in typical Title I schools who struggle in reading or math benefit the remaining students who are doing fine?

Imagine that Title I elementary schools had an average of three teaching assistants providing one-to-four tutoring in 7 daily sessions. This would enable them to serve 84 students each day, or perhaps 252 over the course of the year. Here is how this could benefit all children.
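The arithmetic behind those numbers is simple multiplication; here it is spelled out, with the three tutoring cycles per year being an assumption consistent with the 252 figure:

```python
# Capacity arithmetic for the tutoring scenario described above.
tutors_per_school  = 3
sessions_per_day   = 7
students_per_group = 4   # one-to-four tutoring
cycles_per_year    = 3   # assumed turnover: each student tutored for about a third of the year

served_per_day  = tutors_per_school * sessions_per_day * students_per_group
served_per_year = served_per_day * cycles_per_year
print(served_per_day, served_per_year)   # 84 students per day, roughly 252 per year
```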


Photo credit: Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action

Eliminating within-class ability grouping.

Teachers justifiably complain about the difficulty of teaching highly diverse classes. Historically, they have dealt with diversity, especially in reading, by assigning students to top, middle, and low ability groups, so that they can provide appropriate levels of instruction for each group. Managing multiple ability groups is very difficult, because two-thirds of the class has to do seatwork (paper or digital) during follow-up time, while the teacher is working with another reading group. The seatwork cannot be challenging, because if it were, students would be asking questions, and the whole purpose of this seatwork is to keep students quiet so the teacher can teach a reading group. As a result, kids do what they do when they are bored and the teacher is occupied. It’s not pretty.

Sufficient high-quality one-to-four reading tutoring could add an effect size of at least +0.29 to the reading performance of every student in the low reading group. The goal would be to move the entire low group to virtual equality with the middle group. Of course, some low achievers might need more tutoring and some less, and a few might need one-to-one tutoring rather than one-to-four. If the low and middle reading groups could be made similar in reading performance, teachers could dispense with within-class grouping entirely, and teach the whole class as one “reading group.” By eliminating seatwork, this would give every reading class three times as much valuable instructional time. This would be likely to benefit learning for students in the (former) middle and high groups directly (due to more high-quality teaching), as well as taking a lot of stress off of the teacher, making the classroom more efficient and pleasant for all.

Improving behavior.

Ask any teacher which students are most likely to act out in his or her class. It’s the low achievers. How could it be otherwise? Low achievers take daily blows to their self-esteem, and need to assert themselves in areas other than academics. One such “Plan B” for low achievers is misbehavior. If all students were succeeding in reading and math, improvements in behavior would be very likely. This would benefit all. I remember that my own very well-behaved daughter frequently came home from school very upset because other students misbehaved and got in trouble for it. Improved behavior due to greater success for low achievers would be beneficial to struggling readers themselves, but also to their classmates.

Improved outcomes in other subjects.

Most struggling students have problems in reading and math, and these are the only subjects in which tutoring is ever provided. Yet students who struggle in reading or math are likely to also have trouble in science, social studies, and other subjects, and these problems are likely to disrupt teaching and learning in those subjects as well. If all could succeed in reading and math, this would surely have an impact on other subjects, for non-struggling as well as struggling students.

Contributing to the teacher pipeline.

In the plan I’ve discussed previously, the teaching assistants providing tutoring would mostly hold bachelor’s degrees but not teaching certificates. These tutors would provide an ideal source of candidates for accelerated certification programs. Tutors who show real potential could be invited to enroll in such programs. The teachers developed in this way would be a benefit to all schools and all students in the district. This aspect would be of particular value in inner-city or rural areas that rely on teachers who grew up nearby and have roots in the area, as these districts usually have trouble attracting and retaining outsiders.

Reducing special education and retention.

A likely outcome of successful tutoring would be to reduce retentions and special education placements. This would be of great benefit not only to the students not retained or placed in special education, but also to the school as a whole, which would save a great deal of money.

Ultimately, I think every teacher, every student, and every parent would love to see every low reading group improve in performance enough to eliminate the need for reading groups. The process to get to this happy state of affairs is straightforward and likely to succeed wherever it is tried. Wouldn’t a whole school and a whole school system full of success be a great thing for all students, not just the low achievers?

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Replication

The holy grail of science is replication. If a finding cannot be repeated, then it did not happen in the first place. There is a reason that the humor journal in the hard sciences is called the Journal of Irreproducible Results. For scientists, results that are irreproducible are inherently laughable, therefore funny. In many hard science experiments, replication is pretty much guaranteed. If you heat an iron bar, it gets longer. If you cross two parents who each carry the same recessive gene, one quarter of their progeny will express the recessive trait (think blue eyes).


In educational research, we care about replication just as much as our colleagues in the lab coats across campus. However, when we’re talking about evaluating instructional programs and practices, replication is a lot harder, because students and schools differ. Positive outcomes obtained in one experiment may or may not replicate in a second trial. Sometimes this is because the first experiment had features known to contribute to bias: small sample sizes, brief study durations, extraordinary amounts of resources or expert time to help the experimental schools or classes, use of measures made by the developers or researchers or otherwise overaligned with the experimental group (but not the control group), or use of matched rather than randomized assignment to conditions can all contribute to successful-appearing outcomes in a first experiment. Second or third experiments are more likely to be larger, longer, and more stringent than the first study, and therefore may not replicate. Even when the first study has none of these problems, it may not replicate because of differences in the samples of schools, teachers, or students, or for other, perhaps unknowable reasons.

A change in the conditions of education may also cause a failure to replicate. Our Success for All whole-school reform model has been found to be effective many times, mostly by third-party evaluators. However, Success for All has always specified a full-time facilitator and at least one tutor for each school. An MDRC i3 evaluation happened to fall in the middle of the recession, and schools, which were struggling to afford classroom teachers, could not afford facilitators or tutors. The results were still positive on some measures, especially for low achievers, but the effect sizes were less than half of what others had found in many studies. Stuff happens.

Replication has taken on more importance recently because the ESSA evidence standards only require a single positive study. To meet the strong, moderate, or promising standards, programs must have at least one “well-designed and well-implemented” study using randomized (strong), matched (moderate), or correlational (promising) designs and finding significantly positive outcomes. Based on the “well-designed and well-implemented” language, our Evidence for ESSA website requires features of experiments similar to those also required by the What Works Clearinghouse (WWC). These requirements make it difficult for studies to be approved, but they remove many of the experimental design features that typically cause first studies to greatly overstate program impacts: small size, brief durations, overinvolved experimenters, and developer-made measures. They also put (less rigorous) matched and correlational studies in lower categories. So one study that meets ESSA or Evidence for ESSA requirements is at least likely to be a very good study. But many researchers have expressed discomfort with the idea that a single study could qualify a program for one of the top ESSA categories, especially if (as sometimes happens) there is one study with a positive outcome and many with null or nonsignificant outcomes.

The pragmatic problem is that if ESSA had required even two studies showing positive outcomes, this would have wiped out a very large proportion of current programs. If research continues to identify effective programs, it should only be a matter of time before ESSA (or its successors) requires more than one study with a positive outcome.

However, in the current circumstance, there is a way researchers and educators might at least estimate the replicability of a given program when it has only a single study with a significant positive outcome. This would involve looking at the findings for entire genres of programs. The logic here is that if a program has only one ESSA-qualifying study, but it closely resembles other programs that also have positive outcomes, that program should be taken a lot more seriously than a program whose positive outcome differs considerably from the outcomes of very similar programs.

As one example, there is much evidence from many studies by many researchers indicating positive effects of one-to-one and one-to-small group tutoring, in reading and mathematics. If a tutoring program has only one study, but this one study has significant positive findings, I’d say thumbs up. I’d say the same about cooperative learning approaches, classroom management strategies using behavioral principles, and many others, where a whole category of programs has had positive outcomes.

In contrast, if a program has a single positive outcome and there are few if any similar approaches that obtained positive outcomes, I’d be much more cautious. An example might be textbooks in mathematics, which rarely make any difference because control groups are also likely to be using textbooks, and textbooks considerably resemble each other. In our recent elementary mathematics review (Pellegrini, Lake, Inns, & Slavin, 2018), only one textbook program available in the U.S. had positive outcomes (out of 16 studies). As another example, there have been several large randomized evaluations of the use of interim (benchmark) assessments. Only one of them found positive outcomes. I’d be very cautious about putting much faith in benchmark assessments based on this single anomalous finding.

Finding results from similar programs is facilitated by the reviews we make available at www.bestevidence.org. These consist of reviews of research organized by categories of programs. Looking at findings from similar programs won’t help with the ESSA law, which often determines its ratings based on the findings of a single study, regardless of other findings on the same program or similar programs. However, for educators and researchers who really want to find out what works, I think checking similar programs is not quite as good as finding direct replication of positive findings on the same program, but it is perhaps, as we like to say, close enough for social science.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Tutoring Works. But Let’s Learn How It Can Work Better and Cheaper

I was once at a meeting of the British Education Research Association, where I had been invited to participate in a debate about evidence-based reform. We were having what journalists often call “a frank exchange of views” in a room packed to the rafters.

At one point in the proceedings, a woman stood up and, in a furious tone of voice, informed all and sundry that (I’m paraphrasing here) “we don’t need to talk about all this (very bad word). Every child should just get Reading Recovery.” She then stomped out.

I don’t know how widely her view was supported in the room or anywhere else in Britain or elsewhere, but what struck me at the time, and what strikes me even more today, is the degree to which Reading Recovery has long defined, and in many ways limited, discussions about tutoring. Personally, I have nothing against Reading Recovery, and I have always admired the commitment Reading Recovery advocates have had to professional development and to research. I’ve also long known that the evidence for Reading Recovery is very impressive, but you’d be amazed if one-to-one tutoring by well-trained teachers did not produce positive outcomes. On the other hand, Reading Recovery insists on one-to-one instruction by certified teachers, with a lot of cost for all that admirable professional development, so it is very expensive. A British study estimated the cost per child at $5,400 (in 2018 dollars). There are roughly one million Year 1 students in the U.K., so if the angry woman had her way, they’d have to come up with the equivalent of $5.4 billion a year. In the U.S., it would be more like $27 billion a year. I’m not one to shy away from very expensive proposals if they also provide extremely effective services and there are no equally effective alternatives. But shouldn’t we be exploring alternatives?
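The national price tags follow directly from the per-child estimate; here is the multiplication, with the U.S. cohort size of roughly five million first graders being the assumption implied by the $27 billion figure:

```python
# Scaling the per-child Reading Recovery cost estimate to whole cohorts.
cost_per_child   = 5_400        # estimated cost per tutored child (2018 dollars)
uk_year1_cohort  = 1_000_000    # roughly one million Year 1 students in the U.K.
us_grade1_cohort = 5_000_000    # assumed U.S. first-grade cohort implied by the $27B figure

print(f"U.K.: ${cost_per_child * uk_year1_cohort / 1e9:.1f} billion per year")   # ~$5.4B
print(f"U.S.: ${cost_per_child * us_grade1_cohort / 1e9:.1f} billion per year")  # ~$27.0B
```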

If you’ve been following my blogs on tutoring, you’ll be aware that, at least at the level of research, the Reading Recovery monopoly on tutoring has been broken in many ways. Reading Recovery has always insisted on certified teachers, but many studies have now shown that well-trained teaching assistants can do just as well, in mathematics as well as reading. Reading Recovery has insisted that tutoring should just be for first graders, but numerous studies have now shown positive outcomes of tutoring through seventh grade, in both reading and mathematics. Reading Recovery has argued that its cost is justified by the long-lasting impacts of first-grade tutoring, but its own research has not documented long-lasting outcomes. Reading Recovery is always one-to-one, of course, but now there are numerous one-to-small-group programs, including a one-to-three adaptation of Reading Recovery itself, that produce very good effects. Reading Recovery has always just been for reading, but there are now more than a dozen studies showing positive effects of tutoring in math, too.


All of this newer evidence opens up new possibilities for tutoring that were unthinkable when Reading Recovery ruled the tutoring roost alone. If tutoring can be effective using teaching assistants and small groups, then it is becoming a practicable solution to a much broader range of learning problems. It also opens up a need for further research and development specific to the affordances and problems of tutoring. For example, tutoring can be done a lot less expensively than $5,400 per child, but it is still expensive. We created and evaluated a one-to-six, computer-assisted tutoring model that produced effect sizes of around +0.40 for $500 per child. Yet I just got a study from the Education Endowment Foundation (EEF) in England evaluating one-to-three math tutoring by college students and recent graduates. The tutors provided only one hour of tutoring per week for 12 weeks, to sixth graders. The effect size was much smaller (ES = +0.19), but the cost was only about $150 per child.
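One way to weigh such trade-offs is cost per unit of effect size. Here is a rough sketch using the figures quoted above; it is only arithmetic, not an endorsement of either model:

```python
# Rough cost-effectiveness arithmetic for the two tutoring models mentioned above.
models = {
    "One-to-six computer-assisted tutoring":                    {"cost": 500, "effect_size": 0.40},
    "One-to-three tutoring, 1 hr/week for 12 weeks (EEF study)": {"cost": 150, "effect_size": 0.19},
}

for name, m in models.items():
    print(f"{name}: ~${m['cost'] / m['effect_size']:,.0f} per effect-size unit")
# ~$1,250 vs. ~$789 per effect-size unit: the cheaper model delivers less impact,
# but not proportionally less.
```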

I am not advocating this particular solution, but isn’t it interesting? The EEF also evaluated another means of making tutoring inexpensive, using online tutors from India and Sri Lanka, and another, using cross-age peer tutors, both in math. Both failed miserably, but isn’t that interesting?

I can imagine a broad range of approaches to tutoring, designed to enhance outcomes, minimize costs, or both. Out of that research might come a diversity of approaches that might be used for different purposes. For example, students in deep trouble, headed for special education, surely need something different from what is needed by students with less serious problems. But what exactly is it that is needed in each situation?

In educational research, reliable positive effects of any intervention are rare enough that we’re usually happy to celebrate anything that works. We might say, “Great, tutoring works! But we knew that.”  However, if tutoring is to become a key part of every school’s strategies to prevent or remediate learning problems, then knowing that “tutoring works” is not enough. What kind of tutoring works for what purposes?  Can we use technology to make tutors more effective? How effective could tutoring be if it is given all year or for multiple years? Alternatively, how effective could we make small amounts of tutoring? What is the optimal group size for small group tutoring?

We’ll never satisfy the angry woman who stormed out of my long-ago symposium at BERA. But for those who can have an open mind about the possibilities, building on the most reliable intervention we have for struggling learners and creating and evaluating effective and cost-effective tutoring approaches seems like a worthwhile endeavor.

Photo Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Government Plays an Essential Role in Diffusion of Innovations

Lately I’ve been hearing a lot of concern in reform circles about whether externally derived evidence can truly change school practices and improve outcomes. Surveys of principals, for example, routinely find that principals rarely consult research in making key decisions, including decisions about adopting materials, software, or professional development intended to improve student outcomes. Instead, principals rely on their friends in similar schools serving similar students. In the whole process, research rarely comes up, and if it does, it is often generic research on how children learn rather than high-quality evaluations of specific programs they might adopt.

Principals and other educational leaders have long been used to making decisions without consulting research. It would be difficult to expect otherwise, because of three conditions that have prevailed roughly from the beginning of time to very recently: a) There was little research of practical value on practical programs; b) The research that did exist was of uncertain quality, and school leaders did not have the time or training to determine studies’ validity; c) There were no resources provided to schools to help them adopt proven programs, so doing so required that they spend their own scarce resources.

Under these conditions, it made sense for principals to ask around among their friends before selecting programs or practices. When no one knows anything about a program’s effectiveness, why not ask your friends, who at least (presumably) have your best interests at heart and know your context? Since conditions a, b, and c have defined the context for evidence use nearly up to the present, it is not surprising that school leaders have built a culture of distrust for anyone outside of their own circle when it comes to choosing programs.

However, all three of conditions a, b, and c have changed substantially in recent years, and they are continuing to change in a positive direction at a rapid rate:

a) High-quality research on practical programs for elementary and secondary schools is growing at an extraordinary rate. As shown in Figure 1, the number of rigorous randomized or quasi-experimental studies in elementary and secondary reading and in elementary math has skyrocketed since about 2003, due mostly to investments by the Institute of Education Sciences (IES) and Investing in Innovation (i3). There has been a similar explosion of evidence in England, due to funding from the Education Endowment Foundation (EEF). Clearly, we know a lot more about which programs work and which do not than we once did.

[Figure 1: Numbers of rigorous randomized and quasi-experimental studies of elementary and secondary reading and elementary mathematics programs, by year]

b) Principals, teachers, and the public can now easily find reliable and accessible information on practical programs on the What Works Clearinghouse (WWC), Evidence for ESSA, and other sites. No one can complain any more that information is inaccessible or incomprehensible.

c) Encouragement and funding are becoming available for schools eager to use proven programs. Most importantly, the federal ESSA law provides school improvement funding for low-achieving schools that agree to implement programs meeting the top three ESSA evidence standards (strong, moderate, or promising). ESSA also provides preference points on applications for certain sources of federal funding if applicants promise to use the money to implement proven programs. Some states have extended the same requirement to eligibility for state funding for schools serving students who are disadvantaged or are ethnic or linguistic minorities. Even schools that do not meet any of these demographic criteria are, in many states, being encouraged to use proven programs.


Photo credit: Jorge Gallo [Public domain], from Wikimedia Commons

I think the current situation is like that which must have existed in, say, 1910, with cars and airplanes. Anyone could see that cars and airplanes were the future. But I’m sure many horse-owners pooh-poohed the whole thing. “Sure there are cars,” they’d say, “but who will build all those paved roads? Sure there are airplanes, but who will build airports?” The answer was government, which could see the benefits to the entire economy of systems of roads and airports to meet the needs of cars and airplanes.

Government cannot solve all problems, but it can create conditions to promote adoption and use of proven innovations. And in education, federal, state, and local governments are moving rapidly to do this. Principals may still prefer to talk to other principals, and that’s fine. But with ever more evidence on ever more programs, and with modest restructuring of funds that governments are already awarding, conditions are coming together to utterly transform the role of evidence in educational practice.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

How Computers Can Help Do Bad Research

“To err is human. But it takes a computer to really (mess) things up.” – Anonymous

Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.

Actually, there are many problems, all sharing the potential for serious bias that arises when computers are used to collect “big data” on computer-based instruction. The problem is that “big data” often contains “big bias.” (Note that I am not accusing computers of being biased in favor of their electronic pals. Computers do not have biases. They do what their operators ask them to do. So far.)

Here is one common problem. Evaluators of computer-based instruction almost always have massive amounts of data available indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.

Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.

This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?

The answer is that students who did use the computers for an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features about those classes or schools that pretests do not capture.

Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or in which some students are taking “credit recovery” classes online to replace classes they have failed.

Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to students in schools without access can be just as bad, if only the computer-using students in the computer-using schools are included. To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot fairly be compared to all students in the non-computer (control) schools. The reason is that in the control schools, we have to assume that, given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.
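A tiny simulation makes the point concrete. All the numbers below are invented: student motivation drives both computer use and achievement growth, the software itself does nothing, and yet a comparison of heavy users with non-users who have identical pretests shows a sizable “effect”:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Invented student data: pretest scores and an unmeasured "motivation" trait.
pretest    = rng.normal(0, 1, n)
motivation = rng.normal(0, 1, n)

# More motivated students are more likely to use the software heavily.
heavy_user = (motivation + rng.normal(0, 1, n)) > 0

# The program has NO true effect: posttest depends only on pretest and motivation.
posttest = 0.6 * pretest + 0.4 * motivation + rng.normal(0, 1, n)

# Pretests are balanced between groups, yet the naive comparison finds an "effect."
gap = posttest[heavy_user].mean() - posttest[~heavy_user].mean()
print(f"Pretest difference:   {pretest[heavy_user].mean() - pretest[~heavy_user].mean():+.2f}")
print(f"Apparent effect size: {gap / posttest.std():+.2f}")   # roughly +0.3 to +0.4
```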

Another evaluation design particularly susceptible to bias involves studies in which, say, schools using a given program are matched (based on pretests, demographics, and so on) with other schools that did not use the program, after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will “cherry-pick” their favorite experimental schools and “crabapple-pick” control schools known to have done poorly.


Solutions to Problems in Evaluating Computer-based Programs.

Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.

Randomly Assigning Schools.

The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some number of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group) and half to a business-as-usual control group. Measure achievement at pre- and posttest, and analyze using HLM or some other multilevel method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance. Randomized experiments produce “intent to treat” (ITT) estimates of program impacts that include all students, whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes. Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.
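For readers who want to see what the headline analysis might look like, here is a minimal sketch of an ITT estimate from a cluster-randomized trial, assuming a student-level data file with school identifiers; the file and column names are invented:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level file from a cluster-randomized trial:
# columns school_id, treatment (assigned at the school level), pretest, posttest.
df = pd.read_csv("trial_data.csv")

# Intent-to-treat: every student in assigned schools is included, whether or not
# they ever touched a computer. A random intercept for each school accounts for
# clustering (a simple stand-in for a fuller HLM specification).
model = smf.mixedlm("posttest ~ treatment + pretest", data=df, groups=df["school_id"])
result = model.fit()
print(result.summary())   # the coefficient on 'treatment' is the ITT impact estimate
```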

High-Quality Matched Studies.

It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.

In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.
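A minimal sketch of how such a matched comparison group could be assembled before the school year begins (nearest-neighbor matching on prior school-level scores and demographics; the file, column names, and matching features are all hypothetical):

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

schools = pd.read_csv("state_school_data.csv")             # one row per school
program = schools[schools["adopting_program"] == 1]
pool    = schools[schools["adopting_program"] == 0]

# Match each adopting school to its closest non-adopter on prior achievement
# and demographics, standardized so no single feature dominates the distance.
features = ["prior_mean_score", "pct_free_lunch", "pct_ell", "enrollment"]
scaler = StandardScaler().fit(schools[features])
nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(pool[features]))
_, idx = nn.kneighbors(scaler.transform(program[features]))
controls = pool.iloc[idx.ravel()]

# The full list of program and matched control schools is fixed at this point,
# before any posttest outcomes exist.
```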

In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.

All students in the selected experimental and control schools in the grades receiving the treatment would be included in the study, producing an ITT estimate. In this design there is not much use for “big data” on how much individual students used the program, but such data could produce a “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound on program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).

This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.

That’s All?

Yup. That’s all. These designs do not make use of the “big data” cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turn-around times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are a source of selection bias, of interest in after-the-fact TOT analyses but not as headline ITT outcomes.

If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.