Hummingbirds and Horses: On Research Reviews

Once upon a time, there was a very famous restaurant, called The Hummingbird.   It was known the world over for its unique specialty: Hummingbird Stew.  It was expensive, but customers were amazed that it wasn’t more expensive. How much meat could be on a tiny hummingbird?  You’d have to catch dozens of them just for one bowl of stew.

One day, an experienced restaurateur came to The Hummingbird and asked to speak to the owner.  When they were alone, the visitor said, “You have quite an operation here!  But I have been in the restaurant business for many years, and I have always wondered how you do it.  No one can make money selling Hummingbird Stew!  Tell me how you make it work, and I promise on my honor to keep your secret to my grave.  Do you…mix just a little bit?”


The Hummingbird’s owner looked around to be sure no one was listening.   “You look honest,” he said. “I will trust you with my secret.  We do mix in a bit of horsemeat.”

“I knew it!” said the visitor.  “So tell me, what is the ratio?”

“One to one.”

“Really!” said the visitor.  “Even that seems amazingly generous!”

“I think you misunderstand,” said the owner.  “I meant one hummingbird to one horse!”

In education, we write a lot of reviews of research.  These are often very widely cited, and can be very influential.  Because of the work my colleagues and I do, we have occasion to read a lot of reviews.  Some of them go to great pains to use rigorous, consistent methods, to minimize bias, to establish clear inclusion guidelines, and to follow them systematically.  Well-done reviews can reveal patterns of findings that can be of great value to both researchers and educators.  They can serve as a form of scientific inquiry in themselves, and can make it easy for readers to understand and verify the review’s findings.

However, all too many reviews are deeply flawed.  Frequently, reviews of research make it impossible to check the validity of the findings of the original studies.  As was going on at The Hummingbird, it is all too easy to mix unequal ingredients in an appealing-looking stew.   Today, most reviews use quantitative syntheses, such as meta-analyses, which apply mathematical procedures to synthesize findings of many studies.  If the individual studies are of good quality, this is wonderfully useful.  But if they are not, readers often have no easy way to tell, without looking up and carefully reading many of the key articles.  Few readers are willing to do this.

I have recently been looking at a lot of reviews, all of them published, often in top journals.  One published review only used pre-post gains.  Presumably, if the reviewers found a study with a control group, they would have ignored the control group data!  Not surprisingly, pre-post gains produce effect sizes far larger than experimental-control comparisons do, because pre-post analyses ascribe to the programs being evaluated all of the gains that students would have made without any particular treatment.

I have also recently seen reviews that include studies with and without control groups (i.e., pre-post gains), and those with and without pretests.  Without pretests, experimental and control groups may have started at very different points, and these differences just carry over to the posttests.  A review that accepts this jumble of experimental designs makes no sense.  Treatments evaluated using pre-post designs will almost always look far more effective than those that use experimental-control comparisons.
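
To see why the design matters so much, here is a minimal sketch with invented numbers: a pre-post analysis credits the program with the normal growth students would have made anyway, while an experimental-control comparison subtracts that growth out.

```python
# Hypothetical numbers, for illustration only.

sd = 15.0               # standard deviation of the test
normal_growth = 6.0     # points a typical student gains in a year with no special program
program_boost = 1.5     # additional points actually attributable to the program

# Pre-post design: the program is credited with ALL of the gain.
es_pre_post = (normal_growth + program_boost) / sd

# Experimental-control design: the control group shows the normal growth too,
# so only the program's own contribution remains.
es_exp_control = program_boost / sd

print(f"Pre-post effect size:             {es_pre_post:.2f}")     # 0.50
print(f"Experimental-control effect size: {es_exp_control:.2f}")  # 0.10
```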

Many published reviews include results from measures that were made up by program developers.  We have documented that analyses using such measures produce effect sizes that are two, three, or sometimes four times those involving independent measures, even within the very same studies (see Cheung & Slavin, 2016). We have also found far larger effect sizes from small studies than from large studies, from very brief studies than from longer ones, and from published studies than from, for example, technical reports.

The biggest problem is that in many reviews, the designs of the individual studies are never described sufficiently to know how much of the (purported) stew is hummingbirds and how much is horsemeat, so to speak. As noted earlier, readers often have to obtain and analyze each cited study to find out how many of the studies behind a review’s conclusions are rigorous and how many are not. Many years ago, I looked into a widely cited review of research on achievement effects of class size.  Study details were lacking, so I had to find and read the original studies.  It turned out that the entire substantial effect of reducing class size was due to studies of one-to-one or very small group tutoring, and even more to a single study of tennis!  The studies that reduced class size within the usual range (e.g., reducing classes from 24 to 12 students) had very small achievement impacts, but averaging in studies of tennis and one-to-one tutoring made the class size effect appear to be huge. Funny how averaging in a horse or two can make a lot of hummingbirds look impressive.
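
The arithmetic of that class size review is easy to reproduce. The effect sizes below are invented, not the actual studies, but they show how averaging a horse or two (tutoring and tennis) into a flock of hummingbirds (ordinary class size reductions) turns a near-zero mean into an apparently large one.

```python
# Invented effect sizes, for illustration only.

ordinary_class_size = [0.05, 0.02, 0.08, 0.00, 0.04, 0.03, 0.06, 0.01]  # e.g., reducing 24 to 12
horses = [0.80, 1.10]   # one-to-one tutoring and the tennis study

mean_hummingbirds = sum(ordinary_class_size) / len(ordinary_class_size)
mean_with_horses = sum(ordinary_class_size + horses) / (len(ordinary_class_size) + len(horses))

print(f"Mean ES, ordinary class-size studies only: {mean_hummingbirds:.2f}")  # ~0.04
print(f"Mean ES with the horses averaged in:       {mean_with_horses:.2f}")   # ~0.22
```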

It would be great if all reviews excluded studies that used procedures known to inflate effect sizes. At a bare minimum, reviewers should routinely be required to include tables showing the critical design details of each study, and outcomes should then be analyzed to see whether the reported effects might be due to studies that used procedures suspected of inflating effects. Then readers could easily find out how much of that lovely-looking hummingbird stew is really made from hummingbirds, and how much it owes to a horse or two.

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Educational Policies vs. Educational Programs: Evidence from France

Ask any parent what their kids say when asked what they did in school today. Invariably, they respond, “Nuffin,” or some equivalent. My four-year-old granddaughter always says, “I played with my fwends.” All well and good.

However, in educational policy, policy makers often give the very same answer when asked, “What did the schools not using the (insert latest policy darling) do?”

“Nuffin’.” Or they say, “Whatever they usually do.” There’s nothing wrong with the latter answer if it’s true. But given the many programs now known to improve student achievement (see www.evidenceforessa.org), why don’t evaluators compare outcomes of new policy initiatives to those of proven educational programs known to improve the same outcomes the policy innovation is supposed to improve, perhaps at far lower cost per student? The evaluations should also compare to “business as usual,” but adding proven programs to evaluations of large policy innovations would help avoid declaring policy innovations successful when they are in fact only slightly more effective than “business as usual,” and much less effective or less cost-effective than alternative proven approaches. For example, when evaluating charter schools, why not routinely compare them to whole-school reform models that have similar objectives? When evaluating extending the school day or school year to help high-poverty schools, why not compare these innovations to using the same amount of additional money to hire tutors using proven tutoring models to help struggling students? In evaluating policies in which students are held back if they do not read at grade level by third grade, why not compare these approaches to intensive phonics instruction and tutoring in grades K-3, which are known to greatly improve student reading achievement?

[Photo: There is nuffin like a good fwend.]

As one example of research comparing a policy intervention to a promising educational intervention, I recently saw a very interesting pair of studies from France. Ecalle, Gomes, Auphan, Cros, & Magnan (2019) compared two interventions applied in special priority areas with high poverty levels. Both interventions focused on reading in first grade.

One of the interventions involved halving class size, from approximately 24 students to 12. The other provided intensive reading instruction in small groups (4-6 children) to students who were struggling in reading, as well as less intensive interventions to larger groups (10-12 students). Low achievers got two 30-minute interventions each day for a year, while the higher-performing readers got one 30-minute intervention each day. In both cases, the focus of instruction was on phonics. In all cases, the additional interventions were provided by the students’ usual teachers.

The students in small classes were compared to students in ordinary-sized classes, while the students in the educational intervention were compared to students in same-sized classes who did not get the group interventions. Similar measures and analyses were used in both comparisons.

The results were nearly identical for the class size policy and the educational intervention. Halving class size had effect sizes of +0.14 for word reading and +0.22 for spelling. Results for the educational intervention were +0.13 for word reading, +0.12 for spelling, +0.14 for a group test of reading comprehension, +0.32 for an individual test of comprehension, and +0.19 for fluency.

These studies are less than perfect in experimental design, but they are nevertheless interesting. Most importantly, the class size policy required an additional teacher for each class of 24. Using Maryland annual teacher salaries and benefits ($84,000), that means the cost in our state would be about $3500 per student. The educational intervention required one day of training and some materials. There was virtually no difference in outcomes, but the differences in cost were staggering.
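
For readers who want the arithmetic, here is a rough sketch. The $84,000 salary-and-benefits figure is the one cited above; the per-student cost of the training-and-materials intervention is an assumed placeholder, since the post gives no exact figure.

```python
# Cost arithmetic behind the comparison above. The teacher cost is the Maryland
# estimate cited in the text; the intervention cost is an assumed placeholder.

teacher_cost = 84_000        # annual salary plus benefits
class_size = 24              # halving a class of 24 requires one additional teacher
cost_per_student_halving = teacher_cost / class_size
print(f"Halving class size: about ${cost_per_student_halving:,.0f} per student per year")  # ~$3,500

# Rough cost-effectiveness: dollars per student per +0.01 of effect size (word reading)
intervention_cost = 50                      # assumed: one day of training plus some materials
es_class_size, es_intervention = 0.14, 0.13
print(f"Class size reduction: ${cost_per_student_halving / (es_class_size * 100):,.0f} per +0.01 ES")
print(f"Reading intervention: ${intervention_cost / (es_intervention * 100):,.0f} per +0.01 ES")
```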

The class size policy was mandated by the Ministry of Education. The educational intervention was offered to schools and provided by a university and a non-profit. As is so often the case, the policy intervention was simplistic, easy to describe in the newspaper, and minimally effective. The class size policy reminds me of a Florida program that extended the school schedule by an hour every day in high-poverty schools, mainly to provide more time for reading instruction. The cost per child was about $800 per year. The outcomes were minimal (ES=+0.05).

After many years of watching what schools do and reviewing research on outcomes of innovations, I find it depressing that policies mandated on a substantial scale are so often found to be ineffective. They are usually far more expensive than much more effective, rigorously evaluated programs that are, however, a bit more difficult to describe, and rarely arouse great debate in the political arena. It’s not that anyone is opposed to the educational intervention, but it is a lot easier to carry a placard saying “Reduce Class Size Now!” than to carry one saying “Provide Intensive Phonics in Small Groups with More Supplemental Teaching for the Lowest Achievers Now!” The latter just does not fit on a placard, and though easy to understand if explained, it does not lend itself to easy communication. Actually, there are much more effective first grade interventions than the one evaluated in France (see www.evidenceforessa.org). At a cost much less than $3500 per student, several one-to-one tutoring programs using well-trained teaching assistants as tutors would have been able to produce an effect size of more than +0.50 for all first graders on average. This would even fit on a placard: “Tutoring Now!”

I am all in favor of trying out policy innovations. But when parents of kids in a proven-program comparison group are asked what they did in school today, they shouldn’t say “nuffin’”. They should say, “My tooter taught me to read. And I played with my fwends.”

References

Ecalle, J., Gomes, C., Auphan, P., Cros, L., & Magnan, A. (2019). Effects of policy and educational interventions intended to reduce difficulties in literacy skills in grade 1. Studies in Educational Evaluation, 61, 12-20.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Charter Schools? Smarter Schools? Why Not Both?

I recently saw an editorial in the May 29 Washington Post, entitled “Denying Poor Children a Chance,” a pro-charter school opinion piece that makes dire predictions about the damage to poor and minority students that would follow if charter expansion were to be limited.  In education, it is common to see evidence-free opinions for and against charter schools, so I was glad to see actual data in the Post editorial.   In my view, if charter schools could routinely and substantially improve student outcomes, especially for disadvantaged students, I’d be a big fan.  My response to charter schools is the same as my response to everything else in education: Show me the evidence.

The Washington Post editorial cited a widely known 2015 Stanford CREDO study comparing urban charter schools to matched traditional public schools (TPS) in the same districts.  Evidence always attracts my attention, so I decided to look into this and other large, multi-district studies. Despite the Post’s enthusiasm for the data, the average effect size was only +0.055 for math and +0.04 for reading.  By anyone’s standards, these are very, very small outcomes.  Outcomes for poor, urban, African American students were somewhat higher, at +0.08 for math and +0.06 for reading, but on the other hand, average effect sizes for White students were negative, averaging -0.05 for math and -0.02 for reading.  Outcomes were also negative for Native American students: -0.10 for math, zero for reading.  With effect sizes so low, these small differences are probably just different flavors of zero.  A CREDO (2013) study of charter schools in 27 states, including non-urban as well as urban schools, found average effect sizes of +0.01 for math and -0.01 for reading. How much smaller can you get?

In fact, the CREDO studies have been widely criticized for using techniques that inflate test scores in charter schools.  They compare students in charter schools to students in traditional public schools, matching on pretests and ethnicity.  This ignores the obvious fact that students in charter schools chose to go there, or their parents chose for them to go.  There is every reason to believe that students who choose to attend charter schools are, on average, higher-achieving, more highly motivated, and better behaved than students who stay in traditional public schools.  Gleason et al. (2010) found that students who applied to charter schools started off 16 percentage points higher in reading and 13 percentage points higher in math than others in the same schools who did not apply.  Applicants were more likely to be White and less likely to be African American or Hispanic, and they were less likely to qualify for free lunch.  Self-selection is a particular problem in studies of students who choose or are sent to “no-excuses” charters, such as KIPP or Success Academies, because the students or their parents know students will be held to very high standards of behavior and accomplishment, and may be encouraged to leave the school if they do not meet those standards (this is not a criticism of KIPP or Success Academies, but when such charter systems use lotteries to select students, the students who show up for the lotteries were at least motivated to participate in a lottery to attend a very demanding school).

Well-designed studies of charter schools usually focus on schools that use lotteries to select students, and then they compare the students who were successful in the lottery to those who were not so lucky.  This eliminates the self-selection problem, as students were selected by a random process.  The CREDO studies do not do this, and this may be why their studies report higher (though still very small) effect sizes than those reported by syntheses of studies of students who all applied to charters, but may have been “lotteried in” or “lotteried out” at random.  A very rigorous WWC synthesis of such studies by Gleason et al. (2010) found that middle school students who were lotteried into charter schools in 32 states performed non-significantly worse than those lotteried out, in math (ES=-0.06) and in reading (ES=-0.08).  A 2015 update of the WWC study found very similar, slightly negative outcomes in reading and math.

It is important to note that “no-excuses” charter schools, mentioned earlier, have had more positive outcomes than other charters.  A recent review of lottery studies by Cheng et al. (2017) found effect sizes of +0.25 for math and +0.17 for reading.  However, such “no-excuses” charters are a tiny percentage of all charters nationwide.


Other meta-analyses of studies of achievement outcomes of charter schools also exist, but none found effect sizes as high as the CREDO urban study.  The means of +0.055 for math and +0.04 for reading represent upper bounds for effects of urban charter schools.

Charter Schools or Smarter Schools?

So far, studies of the achievement effects of charters have focused on how charters compare to traditional public schools.  However, this should not be the only question.  “Charters” and “non-charters” do not exhaust the range of possibilities.

What if we instead ask this question: Among the range of programs available, which are most likely to be most effective at scale?

To illustrate the importance of this question, consider a study in England, which evaluated a program called Engaging Parents Through Mobile Phones.  The program involves texting parents on cell phones to alert them to upcoming tests, inform them about whether students are completing their homework, and tell them what students were being taught in school.  A randomized evaluation (Miller et al., 2016) found effect sizes of +0.06 for math and +0.03 for reading, remarkably similar to the urban charter school effects reported by CREDO (2015).  The cost of the mobile phone program was £6 per student per year, or about $7.80.  If you like the outcomes of charter schools, might you prefer to get the same outcomes for $7.80 per child per year, without all the political, legal, and financial stresses of charter schools?

The point here is that rather than arguing about the size of small charter effects, one could consider charters a “treatment” and compare them to other proven approaches.  In our Evidence for ESSA website, we list 112 reading and math programs that meet ESSA standards for “Strong,” “Moderate,” or “Promising” evidence of effectiveness.  Of these, 107 had effect sizes larger than those CREDO (2015) reports for urban charter schools.  In both math and reading, there are many programs with average effect sizes of +0.20, +0.30, up to more than +0.60.  If applied as they were in the research, the best of these programs could, for example, entirely overcome Black-White and Hispanic-White achievement gaps in one or two years.

A few charter school networks have their own proven educational approaches, but the many charters that do not have proven programs should be looking for them.  Most proven programs work just as well in charter schools as they do in traditional public schools, so there is no reason existing charter schools should not proactively seek proven programs to increase their outcomes.  For new charters, wouldn’t it make sense for chartering agencies to encourage charter applicants to systematically search for and propose to adopt programs that have strong evidence of effectiveness?  Many charter schools already use proven programs.  In fact, there are several that specifically became charters to enable them to adopt or maintain our Success for All whole-school reform program.

There is no reason for any conflict between charter schools and smarter schools.  The goal of every school, regardless of its governance, should be to help students achieve their full potential, and every leader of a charter or non-charter school would agree with this. Whatever we think about governance, all schools, traditional or charter, should get smarter, using proven programs of all sorts to improve student outcomes.

References

Cheng, A., Hitt, C., Kisida, B., & Mills, J. N. (2017). “No excuses” charter schools: A meta-analysis of the experimental evidence on student achievement. Journal of School Choice, 11 (2), 209-238.

Clark, M. A., Gleason, P. M., Tuttle, C. C., & Silverberg, M. K. (2015). Do charter schools improve student achievement? Educational Evaluation and Policy Analysis, 37(4), 419-436.

Gleason, P. M., Clark, M. A., Tuttle, C. C., & Dwoyer, E. (2010). The evaluation of charter school impacts. Washington, DC: What Works Clearinghouse.

Miller, S., Davison, J., Yohanis, J., Sloan, S., Gildea, A., & Thurston, A. (2016). Texting parents: Evaluation report and executive summary. London: Education Endowment Foundation.

Washington Post: Denying poor children a chance. [Editorial]. (May 29, 2019). The Washington Post, A16.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

What Works in Teaching Writing?

“I’ve learned that people will forget what you said, people will forget what you did, but people will never forget how you made them feel. The idea is to write it so that people hear it and it slides through the brain and goes straight to the heart.”   -Maya Angelou

It’s not hard to make an argument that creative writing is the noblest of all school subjects. To test this, try replacing the word “write” in this beautiful quotation from Maya Angelou with “read” or “compute.” Students must be proficient in reading and mathematics and other subjects, of course, but in what other subject must learners study how to reach the emotions of their readers?

[Photo: Maya Angelou]

Good writing is the mark of an educated person. Perhaps especially in the age of electronic communications, we know most of the people we know largely through their writing. Job applications depend on the ability of the applicant to make themselves interesting to someone they’ve never seen. Every subject–science, history, reading, and many more–requires its own exacting types of writing.

Given the obvious importance of writing in people’s lives, you’d naturally expect that writing would occupy a central place in instruction. But you’d be wrong. Before secondary school, writing plays third fiddle to the other two of the 3Rs, reading and ‘rithmetic, and in secondary school, writing is just one among many components of English. College professors, employers, and ordinary people complain incessantly about the poor writing skills of today’s youth. The fact is that writing is not attended to as much as it should be, and the results are apparent to all.

Not surprisingly, the inadequate focus on writing in U.S. schools extends to an inadequate focus on research on this topic as well. My colleagues and I recently carried out a review of research on secondary reading programs. We found 69 studies that met rigorous inclusion criteria (Baye, Lake, Inns, & Slavin, in press). Recently, our group completed a review of secondary writing using similar inclusion standards, under funding from the Education Endowment Foundation in England (Slavin, Lake, Inns, Baye, Dachet, & Haslam, 2019). Yet we found only 14 qualifying studies, of which 11 were in secondary schools (we searched down to third grade).

To be fair, our inclusion standards were pretty tough. We required that studies compare experimental groups to randomized or matched control groups on measures independent of the experimental treatment. Tests could not have been made up by teachers or researchers, and they could not be scored by the teachers who taught the classes. Experimental and control groups had to be well-matched at pretest and have nearly equal attrition (loss of subjects over time). Studies had to have a duration of at least 12 weeks. Studies could include students with IEPs, but they could not be in self-contained, special education settings.

We divided the studies into three categories. One was studies of writing process models, in which students worked together to plan, draft, revise, and edit compositions in many genres. A very similar category was cooperative learning models, most of which also used a plan-draft-revise-edit cycle, but placed a strong emphasis on use of cooperative learning teams. A third category was programs that balanced writing with reading instruction.

Remarkably, the average effect sizes of each of the three categories were virtually identical, with a mean effect size of +0.18. There was significant variation within categories, however. In the writing process category, the interesting story concerned a widely used U.S. program, Self-Regulated Strategy Development (SRSD), evaluated in two qualifying studies in England. In one, the program was implemented in rural West Yorkshire and had huge impacts on struggling writers, the students for whom SRSD was designed. The effect size was +0.74. However, in a much larger study in urban Leeds and Lancashire, outcomes were not so positive (ES = +0.01), although effects were largest for struggling writers. There were many studies of SRSD in the U.S., but none of them qualified, due to lack of a control group, brief durations, measures made up by researchers, or placement in self-contained special education settings.

Three programs that emphasize cooperative learning had notably positive impacts. These were Writing Wings (ES = +0.13), Student Team Writing (ES = +0.38), and Expert 21 (ES = +0.58).

Among programs emphasizing reading and writing, two had a strong focus on English learners: Pathway (ES = +0.32) and ALIAS (ES = +0.18). Another two approaches had an explicit focus on preparing students for freshman English: College Ready Writers Program (ES = +0.18) and Expository Reading and Writing Course (ES = +0.13).

Looking across all categories, there were several factors common to successful programs that stood out:

  • Cooperative Learning. Cooperative learning usually aids learning in all subjects, but it makes particular sense in writing, as a writing team gives students opportunities to give and receive feedback on their compositions, facilitating their efforts to gain insight into how their peers think about writing, and giving them a sympathetic and ready audience for their writing.
  • Writing Process. Teaching students step-by-step procedures to work with others to plan, draft, revise, and edit compositions in various genres appears to be very beneficial. The first steps focus on helping students get their ideas down on paper without worrying about mechanics, while the later stages help students progressively improve the structure, organization, grammar, and punctuation of their compositions. These steps help students reluctant to write at all to take risks at the outset, confident that they will have help from peers and teachers to progressively improve their writing.
  • Motivation and Joy in Self-Expression. In the above quote, Maya Angelou talks about the importance in writing of “sliding through the brain to get to the heart.” But to the writer, this process must work the other way, too. Good writing starts in the heart, with an urge to say something of importance. The brain shapes writing to make it readable, but writing must start with a message that the writer cares about. This principle is demonstrated most obviously in writing process and cooperative learning models, where every effort is made to motivate students to find exciting and interesting topics to share with their peers. In programs balancing reading and writing, reading is used to help students have something important to write.
  • Extensive Professional Development. Learning to teach writing well is not easy. Teachers need opportunities to learn new strategies and to apply them in their own writing. All of the successful writing programs we identified in our review provided extensive, motivating, and cooperative professional development, often designed as much to help teachers catch the spirit of writing as to follow a set of procedures.

Our review of writing research found that there is considerable consensus in how to teach writing. There were more commonalities than differences across the categories. Effects were generally positive, however, because control teachers were not using these consensus strategies, or were not doing so with the skills imparted by the professional development characteristic of all of the successful approaches.

We cannot expect writing instruction to routinely produce Maya Angelous or Mark Twains. Great writers add genius to technique. However, we can create legions of good writers, and our students will surely benefit.

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (in press). Effective reading programs for secondary students. Reading Research Quarterly.

Slavin, R. E., Lake, C., Inns, A., Baye, A., Dachet, D., & Haslam, J. (2019). A quantitative synthesis of research on writing approaches in Key Stage 2 and secondary schools. London: Education Endowment Foundation.

Photo credit: Kyle Tsui from Washington, DC, USA [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)]

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

A Mathematical Mystery

My colleagues and I wrote a review of research on elementary mathematics (Pellegrini, Lake, Inns, & Slavin, 2018). I’ve written about it before, but I wanted to home in on one extraordinary set of findings.

In the review, there were 12 studies that evaluated programs focused on providing elementary teachers with professional development in mathematics content and mathematics-specific pedagogy. I was sure that this category would show positive effects on student achievement, but it did not. The most remarkable (and depressing) finding involved the huge year-long Intel study, in which 80 teachers received 90 hours of very high-quality in-service during the summer, followed by an additional 13 hours of group discussions of videos of the participants’ class lessons. Teachers using this program were compared to 85 control teachers. After all this, students in the Intel classes scored slightly worse than controls on standardized measures (Garet et al., 2016).

If the Intel study were the only disappointment, one might look for flaws in their approach or their evaluation design or other things specific to that study. But as I noted earlier, all 12 of the studies of this kind failed to find positive effects, and the mean effect size was only +0.04 (n.s.).

Lest anyone jump to the conclusion that nothing works in elementary mathematics, I would point out that this is not the case. The most impactful category was tutoring programs, but that’s a special case. The second most impactful category shared many features with professional development focused on mathematics content and pedagogy, yet had an average effect size of +0.25. This category consisted of programs focused on classroom management and motivation: cooperative learning, classroom management strategies using group contingencies, and programs focusing on social-emotional learning.

So there are successful strategies in elementary mathematics, and they all provided a lot of professional development. Yet programs for mathematics content and pedagogy, all of which also provided a lot of professional development, did not show positive effects in high-quality evaluations.

I have some ideas about what may be going on here, but I advance them cautiously, as I am not certain about them.

The theory of action behind professional development focused on mathematics content and pedagogy assumes that elementary teachers have gaps in their understanding of mathematics content and mathematics-specific pedagogy. But perhaps whatever gaps they have are not so important. Here is one example. Leading mathematics educators today take a very strong view that fractions should never be taught using pizza slices, but only using number lines. The idea is that pizza slices are limited to certain fractional concepts, while number lines are more inclusive of all uses of fractions. I can understand and, in concept, support this distinction. But how much difference does it make? Students who are learning fractions can probably be divided into three pizza slices. One slice represents students who understand fractions very well, however they are presented, and another slice consists of students who have no earthly idea about fractions. The third slice consists of students who could have learned fractions if they were taught with number lines but not pizzas. The relative sizes of these slices vary, but I’d guess the third slice is the smallest. Whatever it is, the number of students whose success depends on pizzas vs. number lines is unlikely to be large enough to shift the whole-group mean very much, and that is what is reported in evaluations of mathematics approaches. For example, if the “already got it” slice is one third of all students, and the “probably won’t get it” slice is also one third, the slice consisting of students who might get the concept one way but not the other is also one third. If the effect size for the middle slice were as high as an improbable +0.20, the average for all students would be less than +0.07, averaging across the whole pizza.
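
The slice arithmetic in that example works out as follows, using the same assumed thirds and the same improbably generous +0.20 for the middle slice.

```python
# The "whole pizza" average, using the assumptions in the paragraph above.

already_got_it = 1 / 3     # learn fractions well however they are taught
wont_get_it = 1 / 3        # learn them neither way
could_benefit = 1 / 3      # the only slice the change can help

es_middle_slice = 0.20     # improbably generous effect for that slice alone
es_whole_group = could_benefit * es_middle_slice + (already_got_it + wont_get_it) * 0.0

print(f"Average effect size across the whole pizza: {es_whole_group:.3f}")  # about 0.067, i.e., < +0.07
```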


A related possibility relates to teachers’ knowledge. Assume that one slice of teachers already knows a lot of the content before the training. Another slice is not going to learn or use it. The third slice, those who did not know the content before but will use it effectively after training, is the only slice likely to show a benefit, but this benefit will be swamped by the zero effects for the teachers who already knew the content and those who will not learn or use it.

If teachers are standing at the front of the class explaining mathematical concepts, such as proportions, a certain proportion of students are learning the content very well and a certain proportion are bored, terrified, or just not getting it. It’s hard to imagine that the successful students are gaining much from a change of content or pedagogy, and only a small proportion of the unsuccessful students will all of a sudden understand what they did not understand before, just because it is explained better. But imagine that instead of only changing content, the teacher adopts cooperative learning. Now the students are having a lot of fun working with peers. Struggling students have an opportunity to ask for explanations and help in a less threatening environment, and they get a chance to see and ultimately absorb how their more capable teammates approach and solve difficult problems. The already high-achieving students may become even higher achieving, because as every teacher knows, explanation helps the explainer as much as the student receiving the explanation.

The point I am making is that the findings of our mathematics review may reinforce a general lesson we take away from all of our reviews: Subtle treatments produce subtle (i.e., small) impacts. Students quickly establish themselves as high or average or low achievers, after which time it is difficult to fundamentally change their motivations and approaches to learning. Making modest changes in content or pedagogy may not be enough to make much difference for most students. Instead, dramatically changing motivation, providing peer assistance, and making mathematics more fun and rewarding, seems more likely to make a significant change in learning than making subtle changes in content or pedagogy. That is certainly what we have found in systematic reviews of elementary mathematics and elementary and secondary reading.

Whatever the student outcomes are compared to controls, there may be good reason to improve mathematics content and pedagogy. But if we are trying to improve achievement for all students, the whole pizza, we need to use methods that make a more profound impact on all students. And that is true any way you slice it.

References

Garet, M. S., Heppen, J. B., Walters, K., Parkinson, J., Smith, T. M., Song, M., & Borman, G. D. (2016). Focusing on mathematical knowledge: The impact of content-intensive teacher professional development (NCEE 2016-4010). Washington, DC: U.S. Department of Education.

Pellegrini, M., Lake, C., Inns, A., & Slavin, R. E. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

Replication

The holy grail of science is replication. If a finding cannot be repeated, then it did not happen in the first place. There is a reason that the humor journal in the hard sciences is called the Journal of Irreproducible Results. For scientists, results that are irreproducible are inherently laughable, therefore funny. In many hard science experiments, replication is pretty much guaranteed. If you heat an iron bar, it gets longer. If you cross parents with the same recessive gene, one quarter of their progeny will express the recessive trait (think blue eyes).


In educational research, we care about replication just as much as our colleagues in the lab coats across campus. However, when we’re talking about evaluating instructional programs and practices, replication is a lot harder, because students and schools differ. Positive outcomes obtained in one experiment may or may not replicate in a second trial. Sometimes this is because the first experiment had features known to contribute to bias: small sample sizes, brief study durations, extraordinary amounts of resources or expert time to help the experimental schools or classes, measures made by the developers or researchers or otherwise overaligned with the experimental group (but not the control group), or use of matched rather than randomized assignment to conditions. All of these can contribute to successful-appearing outcomes in a first experiment. Second or third experiments are more likely to be larger, longer, and more stringent than the first study, and therefore may not replicate it. Even when the first study has none of these problems, it may not replicate because of differences in the samples of schools, teachers, or students, or for other, perhaps unknowable reasons. A change in the conditions of education may also cause a failure to replicate.

Our Success for All whole-school reform model has been found to be effective many times, mostly by third-party evaluators. However, Success for All has always specified a full-time facilitator and at least one tutor for each school. An MDRC i3 evaluation happened to fall in the middle of the recession, and schools, which were struggling to afford classroom teachers, could not afford facilitators or tutors. The results were still positive on some measures, especially for low achievers, but the effect sizes were less than half of what others had found in many studies. Stuff happens.

Replication has taken on more importance recently because the ESSA evidence standards only require a single positive study. To meet the strong, moderate, or promising standards, programs must have at least one “well-designed and well-implemented” study using a randomized (strong), matched (moderate), or correlational (promising) design and finding significantly positive outcomes. Based on the “well-designed and well-implemented” language, our Evidence for ESSA website requires features of experiments similar to those also required by the What Works Clearinghouse (WWC). These requirements make it difficult to be approved, but they remove many of the experimental design features that typically cause first studies to greatly overstate program impacts: small samples, brief durations, overinvolved experimenters, and developer-made measures. They put (less rigorous) matched and correlational studies in lower categories. So one study that meets ESSA or Evidence for ESSA requirements is at least likely to be a very good study. But many researchers have expressed discomfort with the idea that a single study could qualify a program for one of the top ESSA categories, especially if (as sometimes happens) there is one study with positive outcomes and many with zero or at least nonsignificant outcomes.

The pragmatic problem is that if ESSA had required even two studies showing positive outcomes, this would have wiped out a very large proportion of current programs. If research continues to identify effective programs, it should only be a matter of time before ESSA (or its successors) requires more than one study with positive outcomes.

However, in the current circumstance, there is a way researchers and educators might at least estimate the replicability of given programs when they have only a single study with significant positive outcomes. This would involve looking at the findings for entire genres of programs. The logic here is that if a program has only one ESSA-qualifying study, but it closely resembles other programs that also have positive outcomes, that program should be taken a lot more seriously than a program whose single positive outcome differs considerably from the outcomes of very similar programs.

As one example, there is much evidence from many studies by many researchers indicating positive effects of one-to-one and one-to-small group tutoring, in reading and mathematics. If a tutoring program has only one study, but this one study has significant positive findings, I’d say thumbs up. I’d say the same about cooperative learning approaches, classroom management strategies using behavioral principles, and many others, where a whole category of programs has had positive outcomes.

In contrast, if a program has a single positive outcome and there are few if any similar approaches that obtained positive outcomes, I’d be much more cautious. An example might be textbooks in mathematics, which rarely make any difference because control groups are also likely to be using textbooks, and textbooks considerably resemble each other. In our recent elementary mathematics review (Pellegrini, Lake, Inns, & Slavin, 2018), only one textbook program available in the U.S. had positive outcomes (out of 16 studies). As another example, there have been several large randomized evaluations of the use of interim assessments. Only one of them found positive outcomes. I’d be very cautious about putting much faith in benchmark assessments based on this single anomalous finding.

Looking for findings from similar studies is made easier by the reviews we make available at www.bestevidence.org, which are organized by categories of programs. Checking similar programs won’t help with the ESSA law itself, which often determines its ratings based on the findings of a single study, regardless of other findings on the same program or similar programs. However, for educators and researchers who really want to find out what works, I think checking similar programs is not quite as good as finding direct replication of positive findings on the same programs, but perhaps, as we like to say, close enough for social science.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

How Computers Can Help Do Bad Research

“To err is human. But it takes a computer to really (mess) things up.” – Anonymous

Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.

Actually, there are many problems, all sharing the potential for serious bias created when computers are used to collect “big data” on computer-based instruction. (Note that I am not accusing computers of being biased in favor of their electronic pals! The problem is that “big data” often contains “big bias.” Computers do not have biases. They do what their operators ask them to do. So far.)

Here is one common problem.  Evaluators of computer-based instruction almost always have available massive amounts of data indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.

Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.

This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?

The answer is that students who did use the computers for an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features about those classes or schools that pretests do not capture.

Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or some students may be taking ‘credit recovery’ classes online to replace classes they have failed.

Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to schools without access to computers can be just as bad, if only the computer-using students in the computer-using schools are included.  To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot be compared to the whole non-computer (control) schools. The reason is that in the control schools, we have to assume that given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.
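
A small simulation makes the point concrete. In the sketch below (all values invented), the program’s true effect is set to zero, and computer use is driven by an unmeasured trait such as motivation that also drives achievement. Comparing only the “users” to the whole control group manufactures a positive effect out of nothing, while comparing all treatment students to all control students (the intent-to-treat comparison discussed below) correctly finds none.

```python
# Invented data: the true program effect is ZERO, yet selecting "users" creates
# an apparent positive effect because usage tracks unmeasured motivation.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

motivation = rng.normal(0, 1, n)          # unmeasured; pretests do not capture it
pretest = rng.normal(0, 1, n)
treated = rng.random(n) < 0.5             # half the students are in program schools
used = treated & (motivation > 0)         # only the more motivated actually log on

true_effect = 0.0
posttest = 0.5 * pretest + 0.5 * motivation + true_effect * used + rng.normal(0, 1, n)

def effect_size(a, b):
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

print("Users vs. all control students (biased):   ",
      round(effect_size(posttest[used], posttest[~treated]), 2))   # clearly > 0
print("All treated vs. all control (ITT, unbiased):",
      round(effect_size(posttest[treated], posttest[~treated]), 2))  # ~ 0
```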

Another evaluation design particularly susceptible to bias is one in which, say, schools using a program are matched (based on pretests, demographics, and so on) with other schools that did not use the program, after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will “cherry-pick” their favorite experimental schools and “crabapple-pick” control schools known to have done poorly.


Solutions to Problems in Evaluating Computer-based Programs.

Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.

Randomly Assigning Schools.

The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some number of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group), and half to a business-as-usual control group. Measure achievement at pre- and posttest, and analyze using HLM or some other multi-level method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance.  Randomized experiments produce “intent-to-treat” (ITT) estimates of program impacts that include all students, whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes.  Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.
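
As a sketch of what such an analysis might look like, the code below simulates schools randomly assigned to condition and fits a mixed model with a school-level random intercept (here using statsmodels); the coefficient on the treatment indicator is the ITT estimate. All data are invented.

```python
# Invented data; a sketch of a cluster-randomized ITT analysis.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_schools, students_per_school = 50, 60

rows = []
for school in range(n_schools):
    treatment = int(rng.random() < 0.5)     # assignment is made at the school level
    school_effect = rng.normal(0, 0.3)      # school-level variation (the cluster effect)
    for _ in range(students_per_school):
        pretest = rng.normal(0, 1)
        posttest = 0.6 * pretest + 0.15 * treatment + school_effect + rng.normal(0, 1)
        rows.append(dict(school=school, treatment=treatment, pretest=pretest, posttest=posttest))

df = pd.DataFrame(rows)

# A random intercept for school respects the clustering; every student in the
# assigned grades is included, whether or not they ever touched a computer (ITT).
model = smf.mixedlm("posttest ~ treatment + pretest", df, groups=df["school"]).fit()
print(model.summary())   # the `treatment` coefficient is the ITT estimate
```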

High-Quality Matched Studies.

It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.

In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.

In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.

All students in the selected experimental and control schools in the grades receiving the treatment would be included in the study, producing an ITT estimate. In this design there is not much interest in “big data” on how much individual students used the program, but such data would produce a “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound of program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).

This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.

That’s All?

Yup.  That’s all.  These designs do not make use of the “big data” cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turn-around times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are considered sources of bias, of interest in after-the-fact TOT analyses but not as headline ITT outcomes.

If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.