Would Your School or District Like to Participate in Research?

As research becomes more influential in educational practice, it is increasingly important that studies take place in all kinds of schools. However, this does not happen. In particular, the large-scale quantitative research evaluating practical solutions for schools tends to take place in large urban districts near major research universities, or sometimes in large suburban districts near them. This is not terribly surprising, because in order to meet the highest standards of the What Works Clearinghouse or Evidence for ESSA, a study of a school-level program will need 40 to 50 schools willing to be assigned at random either to use a new program or to serve as a control group.

Naturally, researchers want to work with a small number of districts (to avoid having to deal with many different district-level rules and leaders), so they try to sign up districts in which they might find 40 or 50 schools willing to participate, or perhaps split the sample between two or three districts at most. But there are not many districts with that number of schools. Further, researchers do not want to spend their time or money flying around to visit schools, so they usually try to find schools close to home.

As a result of these dynamics, of course, it is easy to predict where high-quality quantitative research on innovative programs is not going to take place very often. Small districts (even urban ones) can be hard to serve, but the main category of schools left out of big studies is schools in rural districts. This is not only unfair, but it also deprives rural schools of a robust evidence base for practice. Moreover, participating in research can be a good thing for schools and districts anywhere. Typically, schools are paired and assigned at random to treatment or control groups. Treatment groups get the treatment, and control schools usually get some incentive, such as money or an opportunity to use the innovative treatment a year after the experiment is over. So why should some places get all this attention and opportunity, while others complain that they never get to participate and that few programs are evaluated in districts like theirs?

I have a solution to propose for this problem: A “Registry of Districts and Schools Seeking Research Opportunities.” The idea is that district leaders or principals could list information about themselves and the kinds of research they might be willing to host in their schools or districts. Researchers seeking district or school partners for proposals or funded projects could post invitations for participation. In this way, researchers could find out about districts they might never have otherwise considered, and district and school leaders could find out about research opportunities. Sort of like a dating site, but adapted to the interests of researchers and potential research partners (i.e., no photos would be required).

Scientists consulting a registry of volunteer participants.

If this idea interests you, or if you would like to participate, please write to Susan Davis at sdavi168@jh.edu. If you wish, you can share any opinions and ideas about how such a registry might best accomplish its goals. If you represent a district or school and are interested in participating in research, tell us, and I’ll see what I can do.

If I get lots of encouragement, we might create such a directory and operate it on behalf of all districts, schools, and researchers, to benefit students. I’ll look forward to hearing from you!

 This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

Compared to What? Getting Control Groups Right

Several years ago, I had a grant from the National Science Foundation to review research on elementary science programs. I therefore got to attend NSF conferences for principal investigators. At one such conference, we were asked to present poster sessions. The group next to mine was showing an experiment in science education that had remarkably large effect sizes. I got to talking with the very friendly researcher, and discovered that the experiment involved a four-week unit on a topic in middle school science. I think it was electricity. Initially, I was very excited, electrified even, but then I asked a few questions about the control group.

“Of course there was a control group,” he said. “They would have taught electricity too. It’s pretty much a required portion of middle school science.”

Then I asked, “When did the control group teach about electricity?”

“We had no way of knowing,” said my new friend.

“So it’s possible that they had a four-week electricity unit before the time when your program was in use?”

“Sure,” he responded.

“Or possibly after?”

“Could have been,” he said. “It would have varied.”

Being the nerdy sort of person I am, I couldn’t just let this go.

“I assume you pretested students at the beginning of your electricity unit and at the end?”

“Of course.”

“But wouldn’t this create the possibility that control classes that received their electricity unit before you began would have already finished the topic, so they would make no more progress in this topic during your experiment?”

“…I guess so.”

“And,” I continued, “students who received their electricity instruction after your experiment would make no progress either because they had no electricity instruction between pre- and posttest?”

I don’t recall how the conversation ended, but the point is, wonderful though my neighbor’s science program might be, the science achievement outcomes of his experiment were, well, meaningless.

In the course of writing many reviews of research, my colleagues and I encounter misuses of control groups all the time, even in articles in respected journals written by well-known researchers. So I thought I’d write a blog on the fundamental issues involved in using control groups properly, and the ways in which control groups are often misused.

The purpose of a control group

The purpose of a control group in any experiment, randomized or matched, is to provide a valid estimate of what the experimental group would have achieved had it not received the experimental treatment, or if the study had not taken place at all. In a well-designed study, random assignment or matching makes the experimental and control groups essentially equal at pretest on all important variables (e.g., pretest scores, demographics), and nothing happens in the course of the experiment to upset this initial equality.
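
To make this concrete, here is a minimal sketch in Python, using made-up data, of how a control group supplies the counterfactual: assuming the groups are equivalent at pretest, the control group’s posttest mean stands in for what the experimental group would have achieved without the treatment.

```python
# Minimal sketch with simulated (not real) data: the control group's posttest
# mean estimates what the experimental group would have achieved untreated.
import numpy as np

rng = np.random.default_rng(0)
n = 400

pre_exp = rng.normal(500, 50, n)                  # well-matched groups at pretest
pre_ctl = rng.normal(500, 50, n)
post_exp = pre_exp + rng.normal(20, 30, n) + 8    # normal growth plus an 8-point treatment effect
post_ctl = pre_ctl + rng.normal(20, 30, n)        # normal growth only

print("Pretest difference:", round(pre_exp.mean() - pre_ctl.mean(), 1))  # ~0 if equivalence holds
effect = post_exp.mean() - post_ctl.mean()
print("Estimated treatment effect:", round(effect, 1))                   # ~8 points
print("In control-group SD units:", round(effect / post_ctl.std(ddof=1), 2))
```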

How control groups go wrong

Inequality in opportunities to learn tested content. Often, experiments appear to be legitimate (e.g., experimental and control groups are well matched at pretest), but the design contains major bias, because the content being taught in the experimental group is not the same as the content taught in the control group, and the final outcome measure is aligned to what the experimental group was taught but not what the control group was taught. My story at the start of this blog was an example of this. Between pre- and posttest, all students in the experimental group were learning about electricity, but many of those in the control group had already completed electricity or had not received it yet, so they might have been making great progress on other topics, which were not tested, but were unlikely to make much progress on the electricity content that was tested. In this case, the experimental and control groups could be said to be unequal in opportunities to learn electricity. In such a case, it matters little what the exact content or teaching methods were for the experimental program. Teaching a lot more about electricity is sure to add to learning of that topic regardless of how it is taught.

There are many other circumstances in which opportunities to learn are unequal. Many studies use unusual content, and then use tests partially or completely aligned to this unusual content, but not to what the control group was learning. Another common case is where experimental students learn something involving use of technology, but the control group uses paper and pencil to learn the same content. If the final test is given on the technology used by the experimental but not the control group, the potential for bias is obvious.


Unequal opportunities to learn (as a source of bias in experiments) relates to a topic I’ve written a lot about. Use of developer- or researcher-made outcome measures may introduce unequal opportunities to learn, because these measures are more aligned with what the experimental group was learning than what the control group was learning. However, the problem of unequal opportunities to learn is broader than that of developer/researcher-made measures. For example, the story that began this blog illustrated serious bias, but the measure could have been an off-the-shelf, valid measure of electricity concepts.

Problems with control groups that arise during the experiment. Many problems with control groups only arise after an experiment is under way, or completed. These involve situations in which some of the students, classes, or schools that began the experiment are not counted in the analysis. Usually, these are cases in which, in theory, experimental and control groups have equal opportunities to learn the tested content at the beginning of the experiment, but some number of students assigned to the experimental group do not participate enough to be considered to have truly received the treatment. Typical examples include after-school and summer-school programs. A group of students is randomly assigned to receive after-school services, for example, but perhaps only 60% of the students actually show up, or attend enough days to constitute sufficient participation. The problem is that the researchers know exactly who attended and who did not in the experimental group, but they have no idea which control students would or would not have attended if the control group had had the opportunity. The 40% of students who did not attend can probably be assumed to be less motivated, lower achieving, to have less supportive parents, or to possess other characteristics that, on average, identify students who are less likely to do well than students in general. If the researchers drop these 40% of students, the remaining 60% who did participate are likely (on average) to be more motivated, higher achieving, and so on, so the experimental program may look a lot more effective than it truly is.

This kind of problem comes up quite often in studies of technology programs, because researchers can easily find out how often students in the experimental group actually logged in and did the required work. If they drop students who did not use the technology as prescribed, then the remaining students who did use the technology as intended are likely to perform better than control students, who will be a mix of students who would and would not have used the technology if they’d had the chance. Because the control group contains more and less motivated students, while the experimental group contains only the students who were motivated to use the technology, the experimental group may have a huge advantage.

Problems of this kind can be avoided by using intent to treat (ITT) methods, in which all students who were pretested remain in the sample and are analyzed whether or not they used the software or attended the after-school program. Both the What Works Clearinghouse and Evidence for ESSA require use of ITT models in situations of this kind. The problem is that use of ITT analyses almost invariably reduces estimates of effect sizes, but to do otherwise may introduce quite a lot of bias in favor of the experimental groups.
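
The logic of ITT versus dropping non-participants can be illustrated with a small simulation (all numbers hypothetical): when attendance depends on motivation, and motivation also affects achievement, analyzing only the attenders inflates the apparent effect, while the ITT estimate stays honest.

```python
# Hypothetical simulation: ITT vs. an analysis that drops non-attenders.
import numpy as np

rng = np.random.default_rng(1)
n = 1000

def simulate_group(assigned_to_treatment):
    motivation = rng.normal(0, 1, n)
    would_attend = motivation + rng.normal(0, 1, n) > 0   # ~half would show up
    outcome = 0.5 * motivation + rng.normal(0, 1, n)      # motivation matters on its own
    if assigned_to_treatment:
        outcome = outcome + 0.10 * would_attend           # program helps only actual attenders
    return outcome, would_attend

treat_y, treat_attend = simulate_group(True)
ctrl_y, _ = simulate_group(False)                         # we never observe who would have attended

itt = treat_y.mean() - ctrl_y.mean()                      # everyone assigned is analyzed
dropped = treat_y[treat_attend].mean() - ctrl_y.mean()    # non-attenders discarded

print(f"ITT estimate:            {itt:.2f}")              # close to the true average effect (~0.05)
print(f"Attenders-only estimate: {dropped:.2f}")          # far larger, driven by motivation, not the program
```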

Experiments without control groups

Of course, there are valid research designs that do not require use of control groups at all. These include regression discontinuity designs (in which treatment is assigned according to a cutoff on a continuous measure, such as a pretest score, and outcomes just above and below the cutoff are compared to see whether there is a sharp change at the cutoff) and single-case experimental designs (in which as few as one student, class, or school is observed frequently to see what happens when treatment conditions change). However, these designs have their own problems, and single-case designs are rarely used outside of special education.

Control groups are essential in most rigorous experimental research in education, and with proper design they can do what they were intended to do with little bias. Education researchers are becoming increasingly sophisticated about fair use of control groups. Next time I go to an NSF conference, for example, I hope I won’t see posters on experiments that compare students who received an experimental treatment to those who did not even receive instruction on the same topic between pretest and posttest.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Getting Schools Excited About Participating in Research

If America’s school leaders are ever going to get excited about evidence, they need to participate in it. It’s not enough to just make school leaders aware of programs and practices. Instead, they need to serve as sites for experiments evaluating programs that they are eager to implement, or at least have friends or peers nearby who are doing so.

The U.S. Department of Education has funded quite a lot of research on attractive programs. A lot of the studies it has funded have not shown positive impacts, but many programs have been found to be effective. Those effective programs could provide a means of engaging many schools in rigorous research, while at the same time serving as examples of how evidence can help schools improve their results.

Here is my proposal. It quite often happens that some part of the U.S. Department of Education wants to expand the use of proven programs on a given topic. For example, imagine that they wanted to expand use of proven reading programs for struggling readers in elementary schools, or proven mathematics programs in Title I middle schools.

Rather than putting out the usual request for proposals, the Department might announce that schools could qualify for funding to implement a proven program from an approved list, but that in order to receive the funding they would have to agree to participate in an evaluation of the program. They would have to identify two similar schools from a district, or from neighboring districts, that would agree to participate if their proposal were successful. One school in each pair would be assigned at random to use a given program in the first year or two, and the second school could start after the one- or two-year evaluation period was over. Schools would select from a list of proven programs and choose one that seems appropriate to their needs.
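
A sketch of the within-pair random assignment described above might look like this (the school names and pairs are hypothetical):

```python
# Hypothetical sketch: within each matched pair, a coin flip decides which
# school uses the program now and which starts after the evaluation period.
import random

random.seed(42)

matched_pairs = [               # each pair matched on pretest scores, demographics, etc.
    ("School A1", "School A2"),
    ("School B1", "School B2"),
    ("School C1", "School C2"),
]

for first, second in matched_pairs:
    starts_now, starts_later = random.sample([first, second], 2)
    print(f"{starts_now} uses the program now; {starts_later} starts after the evaluation period")
```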

Many pairs of schools would be funded to use each proven program, so across all schools involved, this would create many large, randomized experiments. Independent evaluation groups would carry out the experiments. Students in participating schools would be pretested at the beginning of the evaluation period (one or two years) and posttested at the end, using tests independent of the developers or researchers.

There are many attractions to this plan. First, large randomized evaluations on promising programs could be carried out nationwide in real schools under normal conditions. Second, since the Department was going to fund expansion of promising programs anyway, the additional cost might be minimal, just the evaluation cost. Third, the experiment would provide a side-by-side comparison of many programs focusing on high-priority topics in very diverse locations. Fourth, the school leaders would have the opportunity to select the program they want, and would be motivated, presumably, to put energy into high-quality implementation. At the end of such a study, we would know a great deal about which programs really work in ordinary circumstances with many types of students and schools. But just as importantly, the many schools that participated would have had a positive experience, implementing a program they believe in and finding out in their own schools what outcomes the program can bring them. Their friends and peers would be envious and eager to get into the next study.

A few sets of studies of this kind could build a constituency of educators that might support the very idea of evidence. And this could transform the evidence movement, providing it with a national, enthusiastic audience for research.

Wouldn’t that be great?

 This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Queasy about Quasi-Experiments? How Rigorous Quasi-Experiments Can Minimize Bias

I once had a statistics professor who loved to start discussions of experimental design with the following:

“First, pick your favorite random number.”

Obviously, if you pick a favorite random number, it isn’t random. I was recalling this bit of absurdity recently when discussing with colleagues the relative value of randomized experiments (RCTs) and matched studies, or quasi-experimental designs (QEDs). In randomized experiments, students, teachers, classes, or schools are assigned at random to experimental or control conditions. In quasi-experiments, a group of students, teachers, classes, or schools is identified as the experimental group, and then comparable students, teachers, classes, or schools (usually in the same districts) are located and matched on key variables, such as prior test scores, percent free lunch, ethnicity, and perhaps other factors. The ESSA evidence standards, the What Works Clearinghouse, Evidence for ESSA, and most methodologists favor randomized experiments over QEDs, but there are situations in which RCTs are not feasible. In a recent “Straight Talk on Evidence,” Jon Baron discussed how QEDs can approach the usefulness of RCTs. In this blog, I build on Baron’s article and go further into strategies for getting the best, most unbiased results possible from QEDs.

Randomized and quasi-experimental studies are very similar in most ways. Both almost always compare experimental and control schools that were very similar on key performance and demographic factors. Both use the same statistics, and require the same number of students or clusters for adequate power. Both apply the same logic, that the control group mean represents a good approximation of what the experimental group would have achieved, on average, if the experiment had never taken place.

However, there is one big difference between randomized and quasi-experiments. In a well-designed randomized experiment, the experimental and control groups can be assumed to be equal not only on observed variables, such as pretests and socio-economic status, but also on unobserved variables. The unobserved variables we worry most about have to do with selection bias. How did it happen (in a quasi-experiment) that the experimental group chose to use the experimental treatment, or was assigned to the experimental treatment? If a set of schools decided to use the experimental treatment on their own, then these schools might be composed of teachers or principals who are more oriented toward innovation, for example. Or if the experimental treatment is difficult, the teachers who would choose it might be more hard-working. If it is expensive, then perhaps the experimental schools have more money. Any of these factors could bias the study toward finding positive effects, because schools that have teachers who are motivated or hard-working, in schools with more resources, might perform better than control schools with or without the experimental treatment.


Because of this problem of selection bias, studies that use quasi-experimental designs generally have larger effect sizes than do randomized experiments. Cheung & Slavin (2016) studied the effects of methodological features of studies on effect sizes. They obtained effect sizes from 645 studies of elementary and secondary reading, mathematics, and science, as well as early childhood programs. These studies had already passed a screening in which they would have been excluded if they had serious design flaws. The results were as follows:

                          No. of studies    Mean effect size
Quasi-experiments              449               +0.23
Randomized experiments         196               +0.16

Clearly, mean effect sizes were larger in the quasi-experiments, suggesting the possibility that there was bias. Compared to factors such as sample size and use of developer- or researcher-made measures, the amount of effect size inflation in quasi-experiments was modest, and some meta-analyses comparing randomized and quasi-experimental studies have found no difference at all.
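
For readers unfamiliar with the metric, effect sizes like +0.23 and +0.16 are standardized mean differences: the experimental-control difference at posttest divided by the pooled standard deviation. A small sketch, with illustrative numbers rather than data from Cheung & Slavin (2016):

```python
# Illustrative sketch of a standardized mean difference (Cohen's d style).
import numpy as np

def effect_size(treatment, control):
    """Posttest mean difference divided by the pooled standard deviation."""
    t, c = np.asarray(treatment), np.asarray(control)
    nt, nc = len(t), len(c)
    pooled_sd = np.sqrt(((nt - 1) * t.var(ddof=1) + (nc - 1) * c.var(ddof=1)) / (nt + nc - 2))
    return (t.mean() - c.mean()) / pooled_sd

rng = np.random.default_rng(2)
treated = rng.normal(52.3, 10, 500)              # made-up posttest scores
control = rng.normal(50.0, 10, 500)
print(round(effect_size(treated, control), 2))   # a 2.3-point gap on an SD of 10 is roughly +0.23
```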

Relative Advantages of Randomized and Quasi-Experiments

Because of the problems of selection bias, randomized experiments are preferred to quasi-experiments, all other factors being equal. However, there are times when quasi-experiments may be necessary for practical reasons. For example, it can be easier to recruit and serve schools in a quasi-experiment, and it can be less expensive. A randomized experiment requires that schools be recruited with the promise that they will receive an exciting program. Yet half of them will instead be in a control group, and to keep them willing to sign up, they may be given a lot of money, or an opportunity to receive the program later on. In a quasi-experiment, the experimental schools all get the treatment they want, and control schools just have to agree to be tested.  A quasi-experiment allows schools in a given district to work together, instead of insisting that experimental and control schools both exist in each district. This better simulates the reality schools are likely to face when a program goes into dissemination. If the problems of selection bias can be minimized, quasi-experiments have many attractions.

An ideal design for quasi-experiments would obtain the same unbiased outcomes as a randomized evaluation of the same treatment might do. The purpose of this blog is to discuss ways to minimize bias in quasi-experiments.

In practice, there are several distinct forms of quasi-experiments. Some have considerable likelihood of bias; others have much less potential for bias. In general, the quasi-experiments to avoid are post-hoc, or after-the-fact, designs, in which the experimental and control groups are determined after the treatment has already been delivered. Quasi-experiments with much less likelihood of bias are pre-specified designs, in which experimental and control schools, classrooms, or students are identified and registered in advance. In the following sections, I will discuss these very different types of quasi-experiments.

Post-Hoc Designs

Post-hoc designs generally identify schools, teachers, classes, or students who participated in a given treatment, and then find matches for each in routinely collected data, such as district or school standardized test scores, attendance, or retention rates. The routinely collected data (such as state test scores or attendance) serve as pre- and posttests drawn from school records, so it may be that neither the experimental nor the control schools’ staffs are even aware that the experiment happened.

Post-hoc designs sound valid; the experimental and control groups were well matched at pretest, so if the experimental group gained more than the control group, that indicates an effective treatment, right?

Not so fast. There is much potential for bias in this design. First, the experimental schools are almost invariably those that actually implemented the treatment. Any schools that dropped out or (even worse) any that were deemed not to have implemented the treatment enough have disappeared from the study. This means that the surviving schools were different in some important way from those that dropped out. For example, imagine that in a study of computer-assisted instruction, schools were dropped if fewer than 50% of students used the software as much as the developers thought they should. The schools that dropped out must have had characteristics that made them unable to implement the program sufficiently. For example, they might have been deficient in teachers’ motivation, organization, skill with technology, or leadership, all factors that might also impact achievement with or without the computers. The experimental group is only keeping the “best” schools, but the control schools will represent the full range, from best to worst. That’s bias. Similarly, if individual students are included in the experimental group only if they actually used the experimental treatment a certain amount, that introduces bias, because the students who did not use the treatment may be less motivated, have lower attendance, or have other deficits.

As another example, developers or researchers may select experimental schools that they know did exceptionally well with the treatment. Then they may find control schools that match on pretest. The problem is that there could be unmeasured characteristics of the experimental schools that could cause these schools to get good results even without the treatment. This introduces serious bias. This is a particular problem if researchers pick experimental or control schools from a large database. The schools will be matched at pretest, but since the researchers may have many potential control schools to choose among, they may use selection rules that, while they maintain initial equality, introduce bias. The readers of the study might never be able to find out if this happened.

Pre-Specified Designs

The best way to minimize bias in quasi-experiments is to identify experimental and control schools in advance (as contrasted with post hoc), before the treatment is applied. After experimental and control schools, classes, or students are identified and matched on pretest scores and other factors, the names of the schools, teachers, and possibly students on each list should be registered with the Registry of Efficacy and Effectiveness Studies. This way, all schools (and all students) involved in the study are counted in intent-to-treat (ITT) analyses, just as is expected in randomized studies. The total effect of the treatment is based on this list, even if some schools or students dropped out along the way. An ITT analysis reflects the reality of program effects, because it is rare that all schools or students actually use educational treatments. Such studies also usually report effects of treatment on the treated (TOT), focusing on schools and students who actually implemented the treatment, but such analyses are of only minor interest, as they are known to reflect bias in favor of the treatment group.
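
As a concrete (and entirely hypothetical) illustration of pre-specification, the matching step can be as simple as pairing each program school with the closest available non-program school on prior test scores, then registering the full list before the program begins:

```python
# Hypothetical sketch: match each program school to the nearest unused control
# school on prior-year test scores; the resulting list is what gets registered.
program_schools = {"P1": 48.2, "P2": 51.7, "P3": 55.0}     # school -> prior mean score
candidate_controls = {"C1": 47.9, "C2": 50.1, "C3": 52.0, "C4": 55.4, "C5": 60.3}

matches = {}
available = dict(candidate_controls)
for school, score in sorted(program_schools.items(), key=lambda kv: kv[1]):
    best = min(available, key=lambda c: abs(available[c] - score))
    matches[school] = best
    del available[best]          # each control school is used at most once

print(matches)   # register this list (e.g., with REES) before treatment starts; analyze everyone on it (ITT)
```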

Because most government funders in effect require use of random assignment, the number of quasi-experiments is rapidly diminishing. All things being equal, randomized studies should be preferred. However, quasi-experiments may better fit the practical realities of a given treatment or population, and as such, I hope there can be a place for rigorous quasi-experiments. We need not be so queasy about quasi-experiments if they are designed to minimize bias.

References

Baron, J. (2019, December 12). Why most non-RCT program evaluation findings are unreliable (and a way to improve them). Washington, DC: Arnold Ventures.

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

 This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Nobel Experiments

The world of evidence-based policy just got some terrific news. Abhijit Banerjee and Esther Duflo, of MIT, and Michael Kremer of Harvard, were recently awarded the Nobel Prize in economics.

This award honors extraordinary people doing extraordinary work to alleviate poverty in developing countries. I heard Esther Duflo speak at the Society for Research on Educational Effectiveness, and saw her amazing TED Talk on the research that won the Nobel (delivered before they knew this was going to happen). I strongly suggest you view her speech at https://www.ted.com/talks/esther_duflo_social_experiments_to_fight_poverty?language=en

But the importance of this award goes far beyond its recognition of the scholars who received it. It celebrates the same movement toward evidence-based policy represented by the Institute of Education Sciences, the Education Innovation and Research program, the Arnold Foundation, and others in the U.S., the Education Endowment Foundation in the U.K., and this blog. It also celebrates the work of researchers in education, psychology, and sociology, as well as economics, who are committed to using rigorous research to advance human progress. The Nobel awardees represent the international development wing of this movement, largely funded by the World Bank, the Inter-American Development Bank, and other international aid organizations.

In her TED Talk, Esther Duflo explains the grand strategy she and her colleagues pursue. They take major societal problems in developing countries, break them down into solvable parts, and then use randomized experiments to test solutions to those parts. Along with Dr. Banerjee (her husband) and Michael Kremer, she first did a study that found that ensuring that students in India had textbooks made no difference in learning. They then successfully tested a plan to provide inexpensive tutors and, later, computers, to help struggling readers in India (Banerjee, Cole, Duflo, & Linden, 2007). One fascinating series of studies tested the cost-effectiveness of various educational treatments in developing countries. The winner? Curing children of intestinal worms. Based on this and other research, the Carter Center embarked on a campaign that has virtually eradicated Guinea worm worldwide.


Dr. Duflo and her colleagues later tested variations in programs to provide malaria-inhibiting bed nets in developing countries in which malaria is the number one killer of children, especially those under five years old. Were outcomes best if bed nets (retail cost about $3) were free, or only discounted to varying degrees? Many economists and policy makers worried that people who paid nothing for bed nets would not value them, or might use them for other purposes. But the randomized study found that, without question, free bed nets were more often obtained and used than discounted ones, potentially saving thousands of children’s lives.

For those of us who work in evidence-based education, the types of experiments being done by the Nobel laureates are entirely familiar, even though they have practical aspects quite different from the ones we encounter when we work in the U.S. or the U.K., for example. However, we are far from a majority among researchers in our own countries, and we face major struggles to continue to insist on randomized experiments as the criterion of effectiveness. I’m sure people working in international development face equal challenges. This is why this Nobel Prize in economics means a lot to all of us. People pay a lot of attention to Nobel Prizes, and there is no Nobel in educational research. A Nobel shared by economists whose main contribution is the use of randomized experiments to answer questions of great practical and policy importance, including studies in education itself, may be the closest we will ever get to Nobel recognition for the principle espoused by many in applied research in psychology, sociology, and education, as it is by many economists.

Nobel Prizes are often used to send a message, to support important new developments in research as well as to recognize deserving researchers who are leaders in an area. This was clearly the case with this award. The Nobel announcement makes it clear how the work of the laureates has transformed their field, to the point that “their experimental research methodologies entirely dominate developmental economics.” I hope this event will add further credibility and awareness to the idea that rigorous evidence is a key lever for change that matters in the lives of people.

 

Reference

Banerjee, A., Cole, S., Duflo, E., & Linden, L. (2007). Remedying education: Evidence from two randomized experiments in India. The Quarterly Journal of Economics, 122 (3), 1235-1264.

 

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Hummingbirds and Horses: On Research Reviews

Once upon a time, there was a very famous restaurant, called The Hummingbird.   It was known the world over for its unique specialty: Hummingbird Stew.  It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird?  You’d have to catch dozens of them just for one bowl of stew.

One day, an experienced restaurateur came to The Hummingbird, and asked to speak to the owner. When they were alone, the visitor said, “You have quite an operation here! But I have been in the restaurant business for many years, and I have always wondered how you do it. No one can make money selling Hummingbird Stew! Tell me how you make it work, and I promise on my honor to keep your secret to my grave. Do you…mix just a little bit?”

blog_8-8-19_hummingbird_500x359

The Hummingbird’s owner looked around to be sure no one was listening.   “You look honest,” he said. “I will trust you with my secret.  We do mix in a bit of horsemeat.”

“I knew it!” said the visitor. “So tell me, what is the ratio?”

“One to one.”

“Really!” said the visitor. “Even that seems amazingly generous!”

“I think you misunderstand,” said the owner.  “I meant one hummingbird to one horse!”

In education, we write a lot of reviews of research. These are often very widely cited and can be very influential. Because of the work my colleagues and I do, we have occasion to read a lot of reviews. Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically. Well-done reviews can reveal patterns of findings that are of great value to both researchers and educators. They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.

However, all too many reviews are deeply flawed.  Frequently, reviews of research make it impossible to check the validity of the findings of the original studies.  As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew.   Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies.  If the individual studies are of good quality, this is wonderfully useful.  But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles.  Few readers are willing to do this.

Recently, I have been looking at a lot of recent reviews, all of them published, often in top journals.  One published review only used pre-post gains.  Presumably, if the reviewers found a study with a control group, they would have ignored the control group data!  Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.

I have also recently seen reviews that include studies with and without control groups (i.e., pre-post gains), and those with and without pretests.  Without pretests, experimental and control groups may have started at very different points, and these differences just carry over to the posttests.  Accepting this jumble of experimental designs, a review makes no sense.  Treatments evaluated using pre-post designs will almost always look far more effective than those that use experimental-control comparisons.

Many published reviews include results from measures that were made up by program developers.  We have documented that analyses using such measures produce outcomes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies rather than longer ones, and from published studies rather than, for example, technical reports.

The biggest problem is that in many reviews, the designs of the individual studies are never described sufficiently to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out how many of the included studies were rigorous and how many were not. Many years ago, I looked into a widely cited review of research on achievement effects of class size. Study details were lacking, so I had to find and read the original studies. It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small-group tutoring, and even more to a single study of tennis! The studies that reduced class size within the usual range (e.g., comparing classes of 24 to classes of 12) had very small achievement impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.
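
A toy calculation, with invented effect sizes rather than figures from that review, shows how much a horse or two can move the average:

```python
# Invented effect sizes, for illustration only: a few studies of very different
# treatments, averaged in without scrutiny, dominate the review's mean.
class_size_studies = [0.05, 0.02, 0.08, 0.04, 0.06, 0.03]   # class size reduced within the usual range
tutoring_and_tennis = [1.20, 0.90]                          # one-to-one tutoring and the tennis study

honest_mean = sum(class_size_studies) / len(class_size_studies)
mixed = class_size_studies + tutoring_and_tennis
inflated_mean = sum(mixed) / len(mixed)

print(f"Class-size studies only:          {honest_mean:+.2f}")    # about +0.05
print(f"With tutoring/tennis averaged in: {inflated_mean:+.2f}")   # about +0.30
```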

It would be great if all reviews excluded studies that used procedures known to inflate effect sizes, but at a bare minimum, reviewers should routinely be required to include tables showing critical design details, and to analyze whether the reported outcomes might be due to studies that used procedures suspected of inflating effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

How Computers Can Help Do Bad Research

“To err is human. But it takes a computer to really (mess) things up.” – Anonymous

Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.

Actually, there are many problems, all sharing the potential for serious bias created when computers are used to collect “big data” on computer-based instruction. (Note that I am not accusing computers of being biased in favor of their electronic pals! Computers do not have biases; they do what their operators ask them to do. So far. The problem is that “big data” often contains “big bias.”)

Here is one common problem.  Evaluators of computer-based instruction almost always have available massive amounts of data indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.

Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.

This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?

The answer is that students who did use the computers an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features about those classes or schools that pretests do not capture.

Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or some students may be taking ‘credit recovery’ classes online to replace classes they have failed.

Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to schools without access to computers can be just as bad, if only the computer-using students in the computer-using schools are included.  To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot be compared to the whole non-computer (control) schools. The reason is that in the control schools, we have to assume that given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.

Another evaluation design particularly susceptible to bias is one in which, say, schools using a program are matched (based on pretests, demographics, and so on) with other schools that did not use the program, after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will “cherry-pick” their favorite experimental schools and “crabapple-pick” control schools known to have done poorly.


Solutions to Problems in Evaluating Computer-based Programs.

Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.

Randomly Assigning Schools.

The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some number of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group) and half to a business-as-usual control group. Measure achievement at pre- and posttest, and analyze using HLM or some other multilevel method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance. Randomized experiments produce “intent to treat” (ITT) estimates of program impacts that include all students, whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes. Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.
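
For readers who want to see what “taking clusters into account” looks like in practice, here is a minimal sketch using simulated data and Python’s statsmodels (a stand-in for HLM software, not the only way to do it): a mixed model with a random intercept for each school, in which the coefficient on treated is the ITT estimate.

```python
# Simulated sketch of a cluster-aware ITT analysis: posttest regressed on
# treatment and pretest, with a random intercept for each school.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for school in range(50):                       # ~50 schools, half assigned to the program
    treated = int(school < 25)
    school_effect = rng.normal(0, 5)           # between-school variation (the clustering)
    for _ in range(60):                        # students per school
        pretest = rng.normal(500, 50)
        posttest = pretest + school_effect + 6 * treated + rng.normal(0, 30)
        rows.append({"school": school, "treated": treated,
                     "pretest": pretest, "posttest": posttest})

df = pd.DataFrame(rows)
model = smf.mixedlm("posttest ~ treated + pretest", df, groups=df["school"]).fit()
print(model.params["treated"])                 # ITT estimate of the program effect (true value here: 6)
```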

High-Quality Matched Studies.

It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.

In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.

In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.

All students in the selected experimental and control schools in the grades receiving the treatment would be included in the study, producing an ITT estimate. In this design there is not much interest in “big data” on how much individual students used the program, but such data would produce a “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound of program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).

This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.

That’s All?

Yup. That’s all. These designs do not make use of the “big data” cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turn-around times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are potential sources of bias, of interest in after-the-fact TOT analyses but not as headline ITT outcomes.

If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.

Preschool is Not Magic. Here’s What Is.

If there is one thing that everyone knows about policy-relevant research in education, it is this: Participation in high-quality preschool programs (at age 4) has substantial and lasting effects on students’ academic and life success, especially for students from disadvantaged homes. The main basis for this belief is the findings of the famous Perry Preschool program, which randomly assigned 128 disadvantaged youngsters in Ypsilanti, Michigan, to receive intensive preschool services or not to receive these services. The Perry Preschool study found positive effects at the end of preschool, and long-term positive impacts on outcomes such as high school graduation, dependence on welfare, arrest rates, and employment (Schweinhart, Barnes, & Weikart, 1993).


But prepare to be disappointed.

Recently, a new study has reported a very depressing set of outcomes. Lipsey, Farran, & Durkin (2018) published a large, randomized study evaluating Tennessee’s statewide preschool program. A total of 2,990 four-year-olds were randomly assigned to participate in preschool, or not. As in virtually all preschool studies, children who were randomly assigned to preschool scored much better than those who were assigned to the control group. But these advantages diminished in kindergarten, and by first grade, no positive effects could be detected. By third grade, the control group actually scored significantly higher than the former preschool students in math and science, and non-significantly higher in reading!

Jon Baron of the Laura and John Arnold Foundation wrote an insightful commentary on this study, noting that when such a large, well-done, long-term, randomized study is reported, we have to take the results seriously, even if they disagree with our most cherished beliefs. At the end of Baron’s brief summary was a commentary by Dale Farran and Mark Lipsey, two of the study’s authors, telling the story of the hostile reception to their paper in the early childhood research community and the difficulties they had getting this exemplary experiment published.

Clearly, the Tennessee study was a major disappointment. How could preschool have no lasting effects for disadvantaged children?

Having participated in several research reviews on this topic (e.g., Chambers, Cheung, & Slavin, 2016), as well as some studies of my own, I have several observations to make.

Although this may have been the first large, randomized evaluation of a state-funded preschool program in the U.S., there have been many related studies that have had the same results. These include a large, randomized study of 5000 children assigned to Head Start or not (Puma et al., 2010), which also found positive outcomes at the end of the pre-K year, but only scattered lasting effects after pre-K. Very similar outcomes (positive pre-k outcomes with little or no lasting impact) have been found in a randomized evaluation of a national program called Sure Start in England (Melhuish, Belsky, & Leyland, 2010), and one in Australia (Claessens & Garrett, 2014).

Ironically, the Perry Preschool study itself failed to find lasting impacts, until students were in high school. That is, its outcomes were similar to those of the Tennessee, Head Start, Sure Start, and Australian studies, for the first 12 years of the study. So I suppose it is possible that someday, the participants in the Tennessee study will show a major benefit of having attended preschool. However, this seems highly doubtful.

It is important to note that some large studies of preschool attendance do find positive and lasting effects. However, these are invariably matched, non-experimental studies of children who happened to attend preschool, compared to others who did not. The problem with such studies is that it is essentially impossible to statistically control for all the factors that would lead parents to enroll their child in preschool, or not to do so. So lasting effects of preschool may just be lasting effects of having the good fortune to be born into the sort of family that would enroll its children in preschool.

What Should We Do if Preschool is Not Magic?

Let’s accept for the moment the hard (and likely) reality that one year of preschool is not magic, and is unlikely to have lasting effects of the kind reported by the Perry Preschool study (and by no other randomized studies). Do we give up?

No.  I would argue that rather than considering preschool magic-or-nothing, we should think of it the same way we think about any other grade in school. That is, a successful school experience should not be one terrific year, but fourteen years (pre-k to 12) of great instruction using proven programs and practices.

First comes the preschool year itself, or the two-year period including pre-k and kindergarten. There are many programs that have been shown in randomized studies to be successful over that time span, in comparison to control groups of children who are also in school (see Chambers, Cheung, & Slavin, 2016). Then comes reading instruction in grades K-1, where randomized studies have also validated many whole-class, small-group, and one-to-one tutoring methods (Inns et al., 2018). And so on. There are programs proven to be effective in randomized experiments, at least for reading and math, at every grade level, pre-k to 12.

The time has long passed since all we had in our magic hat was preschool. We now have quite a lot. If we improve our schools one grade at a time and one subject at a time, we can see accumulating gains, ones that do not require waiting for miracles. And then we can work steadily toward improving what we can offer children every year, in every subject, in every type of school.

No one ever built a cathedral by waving a wand. Instead, magnificent cathedrals are built one stone at a time. In the same way, we can build a solid structure of learning using proven programs every year.

References

Baron, J. (2018). Large randomized controlled trial finds state pre-k program has adverse effects on academic achievement. Straight Talk on Evidence. Retrieved from http://www.straighttalkonevidence.org/2018/07/16/large-randomized-controlled-trial-finds-state-pre-k-program-has-adverse-effects-on-academic-achievement/

Chambers, B., Cheung, A., & Slavin, R. (2016). Literacy and language outcomes of balanced and developmental-constructivist approaches to early childhood education: A systematic review. Educational Research Review 18, 88-111.

Claessens, A., & Garrett, R. (2014). The role of early childhood settings for 4-5 year old children in early academic skills and later achievement in Australia. Early Childhood Research Quarterly, 29, (4), 550-561.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Lipsey, M. W., Farran, D. C., & Durkin, K. (2018). Effects of the Tennessee Prekindergarten Program on children’s achievement and behavior through third grade. Early Childhood Research Quarterly. https://doi.org/10.1016/j.ecresq.2018.03.005

Melhuish, E., Belsky, J., & Leyland, R. (2010). The impact of Sure Start local programmes on five year olds and their families. London: Jessica Kingsley.

Puma, M., Bell, S., Cook, R., & Heid, C. (2010). Head Start impact study: Final report.  Washington, DC: U.S. Department of Health and Human Services.

Schweinhart, L. J., Barnes, H. V., & Weikart, D. P. (1993). Significant benefits: The High/Scope Perry Preschool study through age 27 (Monographs of the High/Scope Educational Research Foundation No. 10) Ypsilanti, MI: High/Scope Press.

 

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Little Sleepers: Long-Term Effects of Preschool

In education research, a “sleeper effect” is not a way to get all of your preschoolers to take naps. Instead, it is an outcome of a program that appears not immediately after the end of the program, but some time afterwards, usually a year or more. For example, the mother of all sleeper effects was the Perry Preschool study, which found positive outcomes at the end of preschool but no differences throughout elementary school. Then positive follow-up outcomes began to show up on a variety of important measures in high school and beyond.

Sleeper effects are very rare in education research. To see why, consider a study of a math program for third graders that found no differences between program and control students at the end of third grade, but then a large and significant difference popped up in fourth grade or later. Long-term effects of effective programs are often seen, but how can there be long-term effects if there are no short-term effects on the way? Sleeper effects are so rare that many early childhood researchers have serious doubts about the validity of the long-term Perry Preschool findings.

I was thinking about sleeper effects recently because we have just added preschool studies to our Evidence for ESSA website. In reviewing the key studies, I was once again reading an extraordinary 2009 study by Mark Lipsey and Dale Farran.

The study randomly assigned Head Start classes in rural Tennessee to one of three conditions. Some were assigned to use a program called Bright Beginnings, which had a strong pre-literacy focus. Some were assigned to use Creative Curriculum, a popular constructivist/developmental curriculum with little emphasis on literacy. The remainder were assigned to a control group, in which teachers used whatever methods they ordinarily used.

Note that this design is different from that of the usual preschool studies frequently reported in the newspaper, which compare preschool to no preschool. In this study, all students were in preschool. What differed was only how they were taught.

The results immediately after the preschool program were not astonishing. Bright Beginnings students scored best on literacy and language measures (average effect size = +0.21 for literacy, +0.11 for language), though the differences were not significant at the school level. There were no differences at all between Creative Curriculum and control schools.

Where the outcomes became interesting was in the later years. Ordinarily in education research, outcomes measured after the treatments have finished diminish over time. In the Bright Beginnings/Creative Curriculum study, the outcomes were measured again when students were in third grade, four years after they left preschool. Most students could be located because the outcome measure was the Tennessee standardized test, so scores were available as long as students were still in Tennessee schools.

On third grade reading, former Bright Beginnings students scored significantly better than former controls, and the difference was substantial (effect size = +0.27).
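For readers less familiar with effect sizes: an effect size here is a standardized mean difference, the treatment-control difference in means divided by the pooled standard deviation. The Python sketch below is only a toy illustration with simulated, hypothetical data, not the study’s actual analysis.

```python
import numpy as np

def effect_size(treatment_scores, control_scores):
    """Standardized mean difference: (treatment mean - control mean) / pooled SD."""
    t = np.asarray(treatment_scores, dtype=float)
    c = np.asarray(control_scores, dtype=float)
    pooled_var = (((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1))
                  / (len(t) + len(c) - 2))
    return (t.mean() - c.mean()) / np.sqrt(pooled_var)

# Simulated scores in which the true difference is 0.27 standard deviations
# (purely hypothetical, not data from the Tennessee study):
rng = np.random.default_rng(42)
former_bright_beginnings = rng.normal(0.27, 1.0, 1000)
former_controls = rng.normal(0.00, 1.0, 1000)
print(round(effect_size(former_bright_beginnings, former_controls), 2))  # close to 0.27
```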

In a review of early childhood programs at www.bestevidence.org, our team found that across 16 programs emphasizing literacy as well as language, effect sizes for literacy did not diminish by the end of kindergarten, and effect sizes for language nearly doubled (from +0.08 in preschool to +0.15 in kindergarten).

If sleeper effects (or at least maintenance on follow-up) are so rare in education research, why did they appear in these studies of preschool? There are several possibilities.

The most likely explanation is that it is difficult to measure outcomes among four-year-olds. They can be squirrely and inconsistent. If a pre-kindergarten program had a true and substantial impact on children’s literacy or language, measures at the end of preschool may not detect it as well as measures given a year or more later, because kindergartners and kindergarten skills are easier to measure.
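One way to see how this could work is the classic attenuation relationship from measurement theory: error in a test inflates the spread of observed scores, so the observed effect size is roughly the true effect size times the square root of the test’s reliability. The numbers below are hypothetical, chosen only to illustrate why a noisier preschool measure could understate an effect that a more reliable later test picks up.

```python
import math

true_effect = 0.25               # hypothetical true impact, in true-score SD units
reliability_prek_measure = 0.60  # hypothetical reliability of an assessment of four-year-olds
reliability_grade3_test = 0.90   # hypothetical reliability of a grade 3 standardized test

# Measurement error inflates the observed standard deviation, so the
# standardized difference shrinks by about sqrt(reliability).
print(round(true_effect * math.sqrt(reliability_prek_measure), 2))  # about 0.19 at end of pre-K
print(round(true_effect * math.sqrt(reliability_grade3_test), 2))   # about 0.24 in third grade
```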

Whatever the reason, the evidence suggests that effects of particular preschool approaches may show up later than the end of preschool. This observation, and specifically the Bright Beginnings evaluation, may indicate that in the long run it matters a great deal how students are taught in preschool. Until we find replicable models of preschool, or pre-k to 3 interventions, that have long-term effects on reading and other outcomes, we cannot sleep. Our little sleepers are counting on us to ensure them a positive future.

This blog is sponsored by the Laura and John Arnold Foundation

Pilot Studies: On the Path to Solid Evidence

This week, the Education Technology Industry Network (ETIN), a division of the Software & Information Industry Association (SIIA), released an updated guide to research methods, authored by a team at Empirical Education Inc. The guide is primarily intended to help software companies understand what is required for studies to meet current standards of evidence.

In government and among methodologists and well-funded researchers, there is general agreement about the kind of evidence needed to establish the effectiveness of an education program intended for broad dissemination. To meet its top rating (“meets standards without reservations”) the What Works Clearinghouse (WWC) requires an experiment in which schools, classes, or students are assigned at random to experimental or control groups, and it has a second category (“meets standards with reservations”) for matched studies.

These WWC categories more or less correspond to the Every Student Succeeds Act (ESSA) evidence standards (“strong” and “moderate” evidence of effectiveness, respectively), and ESSA adds a third category, “promising,” for correlational studies.

Our own Evidence for ESSA website follows the ESSA guidelines, of course. The SIIA guidelines explain all of this.

Despite the overall consensus about the top levels of evidence, the problem is that doing studies that meet these requirements is expensive and time-consuming. Software developers, especially small ones with limited capital, often do not have the resources or the patience to do such studies. Any organization that has developed something new may not want to invest substantial resources in a large-scale evaluation until it has some indication that the program is likely to show well in a larger, longer, and better-designed study. Fortunately, there is a path to high-quality evaluations, and it starts with pilot studies.

The SIIA Guide usefully discusses this problem, but I want to add some further thoughts on what to do when you can’t afford a large randomized study.

1. Design useful pilot studies. Evaluators need to make a clear distinction between full-scale evaluations, intended to meet WWC or ESSA standards, and pilot studies (the SIIA Guidelines call these “formative studies”), which are meant for internal use, both to assess the strengths and weaknesses of the program and to give an early indication of whether the program is ready for full-scale evaluation. The pilot study should be a miniature version of the large study, but whatever its findings, they should not be used in publicity. Results of pilot studies are important, but by definition a pilot study is not ready for prime time.

An early pilot study may be just a qualitative study, in which developers and others might observe classes, interview teachers, and examine computer-generated data on a limited scale. The problem in pilot studies is at the next level, when developers want an early indication of effects on achievement, but are not ready for a study likely to meet WWC or ESSA standards.

2. Worry about bias, not power. Small, inexpensive studies pose two types of problems. One is the possibility of bias, discussed in the next section. The other is lack of power: not having a large enough sample to show that a potentially meaningful program impact is statistically significant, that is, unlikely to have happened by chance. To understand this, imagine that your favorite baseball team adopts a new strategy. After the first ten games, the team is doing better than it did last year in comparison to other teams, but this could have happened by chance. After 100 games? Now the results are getting interesting. If 10 teams all adopt the strategy next year and they all improve on average? Now you’re headed toward proof.

During the pilot process, evaluators might compare multiple classes or multiple schools, perhaps assigned at random to experimental and control groups. There may not be enough classes or schools for statistical significance yet, but if the mini-study avoids bias, the results will at least be in the ballpark (so to speak).
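To make the power issue concrete, here is a rough calculation using the statsmodels library. It assumes students are individually randomized and ignores the clustering that a real class- or school-randomized study must account for, so treat it as an illustration of the general point rather than a planning tool.

```python
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

# Students per group needed to detect a modest effect (d = 0.20)
# with 80% power at the conventional alpha of 0.05:
n_needed = power_calc.solve_power(effect_size=0.20, alpha=0.05, power=0.80)
print(round(n_needed))  # several hundred per group

# Power of a small pilot (30 students per group) to detect the same effect:
pilot_power = power_calc.power(effect_size=0.20, nobs1=30, alpha=0.05)
print(round(pilot_power, 2))  # well below 0.5; a real effect would usually look "not significant"
```

In other words, a pilot with a handful of classes can give a ballpark estimate of the likely effect size, but it usually cannot establish statistical significance on its own.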

3. Avoid bias. A small experiment can be fine as a pilot study, but every effort should be made to avoid bias. Otherwise, the pilot study will give a result far more positive than the full-scale study will, defeating the purpose of doing a pilot.

Examples of common sources of bias in smaller studies are as follows.

a. Use of measures made by developers or researchers. These measures typically produce greatly inflated impacts.

b. Implementation of gold-plated versions of the program. In small pilot studies, evaluators often implement versions of the program that could never be replicated. Examples include providing additional staff time that could not be repeated at scale.

c. Inclusion of highly motivated teachers or students in the experimental group but not the control group. For example, matched studies of technology often exclude teachers who did not implement “enough” of the program. The problem is that the full-scale experiment (and real life) will include all kinds of teachers, so excluding teachers who could not or did not want to engage with the technology overstates the likely impact at scale in ordinary schools. Even worse, excluding students who did not use the technology enough may bias the study toward more capable students (see the simulation sketched just after this list).

4. Learn from pilots. Evaluators, developers, and disseminators should learn as much as possible from pilots. Observations, interviews, focus groups, and other informal means should be used to understand what is working and what is not, so that when the program is evaluated at scale, it is at its best.
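As noted in item c above, here is a purely hypothetical simulation showing how that kind of exclusion inflates results: even when a program has a true effect of zero, dropping the less-engaged teachers from the treatment group, but not from the control group, produces an apparently positive finding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Purely hypothetical simulation: the program has a true effect of zero.
# Teacher "engagement" is correlated with how well a class would score anyway.
n_teachers = 2000
engagement = rng.normal(0.0, 1.0, n_teachers)
class_means = 0.4 * engagement + rng.normal(0.0, 1.0, n_teachers)  # class outcomes, in SD units

treatment = rng.permutation(n_teachers) < n_teachers // 2          # random assignment

# Unbiased comparison: all treatment classes vs. all control classes.
fair_estimate = class_means[treatment].mean() - class_means[~treatment].mean()

# Biased comparison: keep only the more engaged treatment teachers, but all controls.
kept = treatment & (engagement > 0)
biased_estimate = class_means[kept].mean() - class_means[~treatment].mean()

print(round(fair_estimate, 2))    # close to zero, as it should be
print(round(biased_estimate, 2))  # clearly positive, even though the true effect is zero
```

Random assignment only protects the comparison if the same kinds of teachers and students stay in both groups all the way through the analysis.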

 

***

As evidence becomes more and more important, publishers and software developers will increasingly be called upon to prove that their products are effective. However, no program should have its first evaluation be a 50-school randomized experiment. Such studies are indeed the “gold standard,” but jumping from a two-class pilot to a 50-school experiment is a way to guarantee failure. Software developers and publishers should follow a path that leads to a top-tier evaluation, and learn along the way how to ensure that their programs and evaluations will produce positive outcomes for students at the end of the process.

 

This blog is sponsored by the Laura and John Arnold Foundation