What Works in Teaching Writing?

“I’ve learned that people will forget what you said, people will forget what you did, but people will never forget how you made them feel. The idea is to write it so that people hear it and it slides through the brain and goes straight to the heart.”   -Maya Angelou

It’s not hard to make an argument that creative writing is the noblest of all school subjects. To test this, try replacing the word “write” in this beautiful quotation from Maya Angelou with “read” or “compute.” Students must be proficient in reading and mathematics and other subjects, of course, but in what other subject must learners study how to reach the emotions of their readers?

[Photo: Maya Angelou]

Good writing is the mark of an educated person. Perhaps especially in the age of electronic communications, we know most of the people we know largely through their writing. Job applications depend on the ability of the applicant to make themselves interesting to someone they’ve never seen. Every subject–science, history, reading, and many more–requires its own exacting types of writing.

Given the obvious importance of writing in people’s lives, you’d naturally expect that writing would occupy a central place in instruction. But you’d be wrong. Before secondary school, writing plays third fiddle to the other two of the 3Rs, reading and ‘rithmetic, and in secondary school, writing is just one among many components of English. College professors, employers, and ordinary people complain incessantly about the poor writing skills of today’s youth. The fact is that writing is not attended to as much as it should be, and the results are apparent to all.

Not surprisingly, the inadequate focus on writing in U.S. schools extends to an inadequate focus on research on this topic as well. My colleagues and I recently carried out a review of research on secondary reading programs. We found 69 studies that met rigorous inclusion criteria (Baye, Lake, Inns, & Slavin, in press). Recently, our group completed a review of secondary writing using similar inclusion standards, under funding from the Education Endowment Foundation in England (Slavin, Lake, Inns, Baye, Dachet, & Haslam, 2019). Yet we found only 14 qualifying studies, of which 11 were in secondary schools (we searched down to third grade).

To be fair, our inclusion standards were pretty tough. We required that studies compare experimental groups to randomized or matched control groups on measures independent of the experimental treatment. Tests could not have been made up by teachers or researchers, and they could not be scored by the teachers who taught the classes. Experimental and control groups had to be well-matched at pretest and have nearly equal attrition (loss of subjects over time). Studies had to have a duration of at least 12 weeks. Studies could include students with IEPs, but they could not be in self-contained, special education settings.

We divided the studies into three categories. One was studies of writing process models, in which students worked together to plan, draft, revise, and edit compositions in many genres. A very similar category was cooperative learning models, most of which also used a plan-draft-revise-edit cycle, but placed a strong emphasis on use of cooperative learning teams. A third category was programs that balanced writing with reading instruction.

Remarkably, the average effect sizes of each of the three categories were virtually identical, with a mean effect size of +0.18. There was significant variation within categories, however. In the writing process category, the interesting story concerned a widely used U.S. program, Self-Regulated Strategy Development (SRSD), evaluated in two qualifying studies in England. In one, the program was implemented in rural West Yorkshire and had huge impacts on struggling writers, the students for whom SRSD was designed. The effect size was +0.74. However, in a much larger study in urban Leeds and Lancashire, outcomes were not so positive (ES = +0.01), although effects were largest for struggling writers. There were many studies of SRSD in the U.S., but none of them qualified, due to lack of control groups, brief durations, researcher-made measures, or placement in all-special-education classrooms.

Three programs that emphasize cooperative learning had notably positive impacts. These were Writing Wings (ES = +0.13), Student Team Writing (ES = +0.38), and Expert 21 (ES = +0.58).

Among programs emphasizing reading and writing, two had a strong focus on English learners: Pathway (ES = +0.32) and ALIAS (ES = +0.18). Another two approaches had an explicit focus on preparing students for freshman English: College Ready Writers Program (ES = +0.18) and Expository Reading and Writing Course (ES = +0.13).

Looking across all categories, there were several factors common to successful programs that stood out:

  • Cooperative Learning. Cooperative learning usually aids learning in all subjects, but it makes particular sense in writing: a writing team gives students opportunities to give and receive feedback on their compositions, helps them gain insight into how their peers think about writing, and provides a sympathetic and ready audience for their work.
  • Writing Process. Teaching students step-by-step procedures to work with others to plan, draft, revise, and edit compositions in various genres appears to be very beneficial. The first steps focus on helping students get their ideas down on paper without worrying about mechanics, while the later stages help students progressively improve the structure, organization, grammar, and punctuation of their compositions. These steps help students reluctant to write at all to take risks at the outset, confident that they will have help from peers and teachers to progressively improve their writing.
  • Motivation and Joy in Self-Expression. In the above quote, Maya Angelou talks about the importance in writing of “sliding through the brain to get to the heart.” But to the writer, this process must work the other way, too. Good writing starts in the heart, with an urge to say something of importance. The brain shapes writing to make it readable, but writing must start with a message that the writer cares about. This principle is demonstrated most obviously in writing process and cooperative learning models, where every effort is made to motivate students to find exciting and interesting topics to share with their peers. In programs balancing reading and writing, reading is used to help students have something important to write.
  • Extensive Professional Development. Learning to teach writing well is not easy. Teachers need opportunities to learn new strategies and to apply them in their own writing. All of the successful writing programs we identified in our review provided extensive, motivating, and cooperative professional development, often designed as much to help teachers catch the spirit of writing as to follow a set of procedures.

Our review of writing research found considerable consensus about how to teach writing; there were more commonalities than differences across the categories. Even so, effects were generally positive, because control teachers were not using these consensus strategies, or were not using them with the skill imparted by the professional development characteristic of all of the successful approaches.

We cannot expect writing instruction to routinely produce Maya Angelous or Mark Twains. Great writers add genius to technique. However, we can create legions of good writers, and our students will surely benefit.

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (in press). Effective reading programs for secondary students. Reading Research Quarterly.

Slavin, R. E., Lake, C., Inns, A., Baye, A., Dachet, D., & Haslam, J. (2019). A quantitative synthesis of research on writing approaches in Key Stage 2 and secondary schools. London: Education Endowment Foundation.

Photo credit: Kyle Tsui from Washington, DC, USA [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)]

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


A Mathematical Mystery

My colleagues and I wrote a review of research on elementary mathematics (Pellegrini, Lake, Inns, & Slavin, 2018). I’ve written about it before, but I wanted to home in on one extraordinary set of findings.

In the review, there were 12 studies that evaluated programs focused on providing elementary teachers with professional development in mathematics content and mathematics-specific pedagogy. I was sure that this category would show positive effects on student achievement, but it did not. The most remarkable (and depressing) finding involved the huge year-long Intel study, in which 80 teachers received 90 hours of very high-quality in-service during the summer, followed by an additional 13 hours of group discussions of videos of the participants’ class lessons. Teachers using this program were compared to 85 control teachers. After all this, students in the Intel classes scored slightly worse than controls on standardized measures (Garet et al., 2016).

If the Intel study were the only disappointment, one might look for flaws in their approach or their evaluation design or other things specific to that study. But as I noted earlier, all 12 of the studies of this kind failed to find positive effects, and the mean effect size was only +0.04 (n.s.).

Lest anyone jump to the conclusion that nothing works in elementary mathematics, I would point out that this is not the case. The most impactful category was tutoring programs, but tutoring is a special case. The second most impactful category had many features in common with professional development focused on mathematics content and pedagogy, yet it had an average effect size of +0.25. This category consisted of programs focused on classroom management and motivation: cooperative learning, classroom management strategies using group contingencies, and programs focusing on social-emotional learning.

So there are successful strategies in elementary mathematics, and they all provided a lot of professional development. Yet programs for mathematics content and pedagogy, all of which also provided a lot of professional development, did not show positive effects in high-quality evaluations.

I have some ideas about what may be going on here, but I advance them cautiously, as I am not certain about them.

The theory of action behind professional development focused on mathematics content and pedagogy assumes that elementary teachers have gaps in their understanding of mathematics content and mathematics-specific pedagogy. But perhaps whatever gaps they have are not so important. Here is one example. Leading mathematics educators today take a very strong view that fractions should never be taught using pizza slices, but only using number lines. The idea is that pizza slices are limited to certain fractional concepts, while number lines are more inclusive of all uses of fractions. I can understand and, in concept, support this distinction. But how much difference does it make? Students who are learning fractions can probably be divided into three pizza slices. One slice represents students who understand fractions very well, however they are presented, and another slice consists of students who have no earthly idea about fractions. The third slice consists of students who could have learned fractions if they were taught with number lines but not pizzas. The relative sizes of these slices vary, but I’d guess the third slice is the smallest. Whatever it is, the number of students whose success depends on pizzas vs. number lines is unlikely to be large enough to shift the whole-group mean very much, and that is what is reported in evaluations of mathematics approaches. For example, if the “already got it” slice is one third of all students, and the “probably won’t get it” slice is also one third, the slice consisting of students who might get the concept one way but not the other is also one third. If the effect size for the middle slice were as high as an improbable +0.20, the average for all students would be less than +0.07, averaging across the whole pizza.
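To make the dilution arithmetic concrete, here is a minimal sketch in Python. The one-third splits and the +0.20 effect are the illustrative assumptions from the example above, not data from any study:

```python
# Illustrative only: an effect confined to one "slice" of students is diluted
# when averaged across the whole group, which is what program evaluations report.

def whole_group_effect(slice_shares, slice_effects):
    """Weighted mean effect size across slices of the student population."""
    assert abs(sum(slice_shares) - 1.0) < 1e-9, "shares should sum to 1"
    return sum(share * es for share, es in zip(slice_shares, slice_effects))

# Assumed slices: already understand fractions, unlikely to get them either way,
# and the middle slice that benefits from number lines rather than pizzas.
shares = [1 / 3, 1 / 3, 1 / 3]
effects = [0.00, 0.00, 0.20]  # +0.20 for the middle slice is already optimistic

print(round(whole_group_effect(shares, effects), 3))  # 0.067, i.e., less than +0.07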


A related possibility concerns teachers’ knowledge. Assume that one slice of teachers already knows a lot of the content before the training. Another slice is not going to learn or use it. The third slice, those who did not know the content before but will use it effectively after training, is the only slice likely to show a benefit, but this benefit will be swamped by the zero effects for the teachers who already knew the content and those who will not learn or use it.

If teachers are standing at the front of the class explaining mathematical concepts, such as proportions, a certain proportion of students are learning the content very well and a certain proportion are bored, terrified, or just not getting it. It’s hard to imagine that the successful students are gaining much from a change of content or pedagogy, and only a small proportion of the unsuccessful students will all of a sudden understand what they did not understand before, just because it is explained better. But imagine that instead of only changing content, the teacher adopts cooperative learning. Now the students are having a lot of fun working with peers. Struggling students have an opportunity to ask for explanations and help in a less threatening environment, and they get a chance to see and ultimately absorb how their more capable teammates approach and solve difficult problems. The already high-achieving students may become even higher achieving, because as every teacher knows, explanation helps the explainer as much as the student receiving the explanation.

The point I am making is that the findings of our mathematics review may reinforce a general lesson we take away from all of our reviews: Subtle treatments produce subtle (i.e., small) impacts. Students quickly establish themselves as high or average or low achievers, after which time it is difficult to fundamentally change their motivations and approaches to learning. Making modest changes in content or pedagogy may not be enough to make much difference for most students. Instead, dramatically changing motivation, providing peer assistance, and making mathematics more fun and rewarding seem more likely to produce a significant change in learning than making subtle changes in content or pedagogy. That is certainly what we have found in systematic reviews of elementary mathematics and elementary and secondary reading.

Whatever the outcomes compared to controls, there may be good reasons to improve mathematics content and pedagogy. But if we are trying to improve achievement for all students, the whole pizza, we need to use methods that make a more profound impact on all students. And that is true any way you slice it.

References

Garet, M. S., Heppen, J. B., Walters, K., Parkinson, J., Smith, T. M., Song, M., & Borman, G. D. (2016). Focusing on mathematical knowledge: The impact of content-intensive teacher professional development (NCEE 2016-4010). Washington, DC: U.S. Department of Education.

Pellegrini, M., Inns, A., Lake, C., & Slavin, R. E. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Replication

The holy grail of science is replication. If a finding cannot be repeated, then it did not happen in the first place. There is a reason that the humor journal in the hard sciences is called the Journal of Irreproducible Results. For scientists, results that cannot be reproduced are inherently laughable, which is what makes the name funny. In many hard science experiments, replication is pretty much guaranteed. If you heat an iron bar, it gets longer. If you cross two parents who each carry the same recessive gene, one quarter of their progeny will express the recessive trait (think blue eyes).


In educational research, we care about replication just as much as our colleagues in the lab coats across campus. However, when we’re talking about evaluating instructional programs and practices, replication is a lot harder, because students and schools differ. Positive outcomes obtained in one experiment may or may not replicate in a second trial. Sometimes this is because the first experiment had features known to inflate outcomes: small sample sizes, brief study durations, extraordinary amounts of resources or expert time devoted to the experimental schools or classes, measures made by the developers or researchers or otherwise overaligned with the experimental group (but not the control group), or matched rather than randomized assignment to conditions. Second or third experiments are likely to be larger, longer, and more stringent than the first study, and therefore may not replicate it. Even when the first study has none of these problems, it may fail to replicate because of differences in the samples of schools, teachers, or students, or for other, perhaps unknowable reasons.

A change in the conditions of education may also cause a failure to replicate. Our Success for All whole-school reform model has been found to be effective many times, mostly by third-party evaluators. However, Success for All has always specified a full-time facilitator and at least one tutor for each school. An MDRC i3 evaluation happened to fall in the middle of the recession, and schools, which were struggling to afford classroom teachers, could not afford facilitators or tutors. The results were still positive on some measures, especially for low achievers, but the effect sizes were less than half of what others had found in many studies. Stuff happens.

Replication has taken on more importance recently because the ESSA evidence standards require only a single positive study. To meet the strong, moderate, or promising standards, programs must have at least one “well-designed and well-implemented” study using a randomized (strong), matched (moderate), or correlational (promising) design and finding significantly positive outcomes. Based on the “well-designed and well-implemented” language, our Evidence for ESSA website requires features of experiments similar to those also required by the What Works Clearinghouse (WWC). These requirements make it difficult to be approved, but they remove many of the experimental design features that typically cause first studies to greatly overstate program impacts: small samples, brief durations, overinvolved experimenters, and developer-made measures. They also put (less rigorous) matched and correlational studies in lower categories. So one study that meets ESSA or Evidence for ESSA requirements is at least likely to be a very good study. But many researchers have expressed discomfort with the idea that a single study could qualify a program for one of the top ESSA categories, especially if (as sometimes happens) there is one study with a positive outcome and many with zero or at least nonsignificant outcomes.
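As a rough paraphrase of the tiers just described (not the statutory language, and simplified to the single-study case), the mapping from study design to ESSA category might be sketched like this; the function name and labels are mine:

```python
def essa_tier(design: str, significant_positive: bool,
              well_designed_and_implemented: bool = True):
    """Simplified sketch: one qualifying study with significantly positive outcomes
    can place a program in an ESSA tier, with the tier set by the study design."""
    if not (significant_positive and well_designed_and_implemented):
        return None  # no qualifying evidence from this study
    return {
        "randomized": "strong",
        "matched": "moderate",
        "correlational": "promising",
    }.get(design)

print(essa_tier("randomized", True))     # strong
print(essa_tier("matched", True))        # moderate
print(essa_tier("correlational", True))  # promising
```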

The pragmatic problem is that if ESSA had required even two studies showing positive outcomes, this would wipe out a very large proportion of current programs. If research continues to identify effective programs, it should only be a matter of time before ESSA (or its successors) requires more than one study with positive outcomes.

However, in the current circumstance, there is a way researchers and educators might at least estimate the replicability of a given program when it has only a single study with a significant positive outcome. This would involve looking at the findings for entire genres of programs. The logic here is that if a program has only one ESSA-qualifying study, but it closely resembles other programs that also have positive outcomes, that program should be taken a lot more seriously than a program whose positive outcome differs considerably from the outcomes of very similar programs.

As one example, there is much evidence from many studies by many researchers indicating positive effects of one-to-one and one-to-small group tutoring, in reading and mathematics. If a tutoring program has only one study, but this one study has significant positive findings, I’d say thumbs up. I’d say the same about cooperative learning approaches, classroom management strategies using behavioral principles, and many others, where a whole category of programs has had positive outcomes.

In contrast, if a program has a single positive outcome and there are few if any similar approaches that obtained positive outcomes, I’d be much more cautious. An example might be textbooks in mathematics, which rarely make any difference because control groups are also likely to be using textbooks, and textbooks considerably resemble each other. In our recent elementary mathematics review (Pellegrini, Lake, Inns, & Slavin, 2018), only one textbook program available in the U.S. had positive outcomes (out of 16 studies). As another example, there have been several large randomized evaluations of the use of interim assessments. Only one of them found positive outcomes. I’d be very cautious about putting much faith in benchmark assessments based on this single anomalous finding.

Looking for findings from similar programs is made easier by the reviews we make available at www.bestevidence.org, which organize the research by categories of programs. This approach won’t help with the ESSA law itself, which often determines its ratings based on the findings of a single study, regardless of other findings on the same program or similar programs. However, for educators and researchers who really want to find out what works, checking similar programs, while not quite as good as finding direct replications of positive findings on the same program, is perhaps, as we like to say, close enough for social science.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

How Computers Can Help Do Bad Research

“To err is human. But it takes a computer to really (mess) things up.” – Anonymous

Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.

Actually, there are many problems, all sharing the serious potential for bias created when computers are used to collect “big data” on computer-based instruction. (Note that I am not accusing computers of being biased in favor of their electronic pals! Computers do not have biases. They do what their operators ask them to do. So far.) The problem is that “big data” often contains “big bias.”

Here is one common problem.  Evaluators of computer-based instruction almost always have available massive amounts of data indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.

Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.

This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?

The answer is that students who did use the computers for an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features of those classes or schools that pretests do not capture.

Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or in which some students are taking “credit recovery” classes online to replace classes they have failed.

Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to schools without access to computers can be just as bad, if only the computer-using students in the computer-using schools are included.  To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot be compared to the whole non-computer (control) schools. The reason is that in the control schools, we have to assume that given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.
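To see how large this self-selection bias can be, here is a small simulation (all numbers invented) in which the software has a true effect of exactly zero, yet students who choose to use it are more motivated in a way the pretest cannot detect. Comparing users to pretest-matched non-users still shows a sizable “benefit”:

```python
import random

random.seed(1)

def simulate_student():
    """One student under a program whose true effect is exactly zero."""
    pretest = random.gauss(0, 1)
    motivation = random.gauss(0, 1)   # unobserved, and independent of the pretest
    uses_computers = motivation > 0   # more motivated students actually log on
    posttest = pretest + 0.5 * motivation + random.gauss(0, 1)  # no program effect added
    return pretest, uses_computers, posttest

def mean(xs):
    return sum(xs) / len(xs)

students = [simulate_student() for _ in range(100_000)]
users = [s for s in students if s[1]]
non_users = [s for s in students if not s[1]]

# Pretests look "matched" (gap near zero), yet the posttest comparison favors
# the users even though the program did nothing at all.
print("pretest gap: ", round(mean([s[0] for s in users]) - mean([s[0] for s in non_users]), 2))
print("posttest gap:", round(mean([s[2] for s in users]) - mean([s[2] for s in non_users]), 2))
```

The posttest gap here is entirely an artifact of who chose to log on, which is exactly the bias the designs discussed below are meant to rule out.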

Another evaluation design particularly susceptible to bias is one in which, say, schools using a program are matched (based on pretests, demographics, and so on) with other schools that did not use the program, after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will “cherry-pick” their favorite experimental schools and “crabapple-pick” control schools known to have done poorly.


Solutions to Problems in Evaluating Computer-based Programs.

Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.

Randomly Assigning Schools.

The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some number of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group) and half to a business-as-usual control group. Measure achievement at pre- and post-test, and analyze using HLM or some other multi-level method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance. Randomized experiments produce “intent to treat” (ITT) estimates of program impacts that include all students, whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes. Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.
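For readers who want to see what that analysis might look like in practice, here is a hedged sketch using Python and statsmodels’ mixed-effects model. The data file and column names (school, treatment, pretest, posttest) are assumptions for illustration, not from any particular study:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per student, with a school identifier, a school-level
# treatment indicator (1 = computer-based program, 0 = business as usual),
# and pre/post achievement scores. File and column names are hypothetical.
df = pd.read_csv("cluster_randomized_trial.csv")

# Intent-to-treat analysis: every student in a treatment school counts as treated,
# whether or not they ever touched a computer. Random intercepts for schools
# account for the clustering created by school-level random assignment.
model = smf.mixedlm("posttest ~ treatment + pretest", data=df, groups=df["school"])
result = model.fit()

print(result.summary())  # the coefficient on `treatment` is the ITT impact estimate
```

Dividing that treatment coefficient by the unadjusted student-level standard deviation of the posttest gives an effect size roughly comparable to those discussed in this post.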

High-Quality Matched Studies.

It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.

In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.

In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.

All students in the selected experimental and control schools in the grades receiving the treatment would be included in the study, producing an ITT estimate. In this design there is not much interest in “big data” on how much individual students used the program, but such data could produce a “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound on program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).
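Here is a minimal sketch of how the ITT and TOT comparisons differ in this design, assuming the pre-registered school list and a per-student usage flag are available. All file and column names are illustrative, and the sketch ignores covariate adjustment and clustering for brevity:

```python
import pandas as pd

# Hypothetical columns: condition ("program" or "control", fixed by the pre-registered
# school list), used_program (True only for students who met the developer's usage
# criterion), and state_test (the independent outcome measure).
df = pd.read_csv("matched_study_students.csv")

control_mean = df.loc[df.condition == "control", "state_test"].mean()

# Intent-to-treat: every student in every pre-specified program school, regardless of usage.
itt_gap = df.loc[df.condition == "program", "state_test"].mean() - control_mean

# Treatment-on-the-treated (descriptive only): just the students who actually used it.
tot_gap = (df.loc[(df.condition == "program") & df.used_program, "state_test"].mean()
           - control_mean)

print(f"ITT gap: {itt_gap:.3f}   TOT gap (upper bound, not the headline): {tot_gap:.3f}")
```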

This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.

That’s All?

Yup. That’s all. These designs do not make use of the “big data” cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turn-around times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are considered sources of bias, of interest in after-the-fact TOT analyses but not as headline ITT outcomes.

If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.

Succeeding Faster in Education

“If you want to increase your success rate, double your failure rate.” So said Thomas Watson, the founder of IBM. What he meant, of course, is that people and organizations thrive when they try many experiments, even though most experiments fail. Failing twice as often means trying twice as many experiments, leading to twice as many failures—but also, he was saying, many more successes.

[Photo: Thomas Watson]

In education research and innovation circles, many people know this quote, and use it to console colleagues who have done an experiment that did not produce significant positive outcomes. A lot of consolation is necessary, because most high-quality experiments in education do not produce significant positive outcomes. In studies funded by the Institute of Education Sciences (IES), Investing in Innovation (i3), and England’s Education Endowment Foundation (EEF), all of which require very high standards of evidence, fewer than 20% of experiments show significant positive outcomes.

The high rate of failure in educational experiments is often shocking to non-researchers, especially the government agencies, foundations, publishers, and software developers who commission the studies. I was at a conference recently in which a Peruvian researcher presented the devastating results of an experiment in which high-poverty, mostly rural schools in Peru were randomly assigned to receive computers for all of their students, or to continue with usual instruction. The Peruvian Ministry of Education was so confident that the computers would be effective that they had built a huge model of the specific computers used in the experiment and attached it to the Ministry headquarters. When the results showed no positive outcomes (except for the ability to operate computers), the Ministry quietly removed the computer statue from the top of their building.

Improving Success Rates

Much as I believe Watson’s admonition (“fail more”), there is another principle that he was implying, or so I suspect: We have to learn from failure, so we can increase the rate of success. It is not realistic to expect government to continue to invest substantial funding in high-quality educational experiments if the success rate remains below 20%. We have to get smarter, so we can succeed more often. Fortunately, qualitative measures, such as observations, interviews, and questionnaires, are becoming required elements of funded research, making it easier to find out what actually happened and what went wrong. Was the experimental program faithfully implemented? Were there unexpected responses to the program by teachers or students?

In the course of my work reviewing positive and disappointing outcomes of educational innovations, I’ve noticed some patterns that often predict that a given program is likely or unlikely to be effective in a well-designed evaluation. Some of these are as follows.

  1. Small changes lead to small (or zero) impacts. In every subject and grade level, researchers have evaluated new textbooks, in comparison to existing texts. These almost never show positive effects. The reason is that textbooks are just not that different from each other. Approaches that do show positive effects are usually markedly different from ordinary practices or texts.
  2. Successful programs almost always provide a lot of professional development. The programs that have significant positive effects on learning are ones that markedly improve pedagogy. Changing teachers’ daily instructional practices usually requires initial training followed by on-site coaching by well-trained and capable coaches. Lots of PD does not guarantee success, but minimal PD virtually guarantees failure. Sufficient professional development can be expensive, but education itself is expensive, and adding a modest amount to per-pupil cost for professional development and other requirements of effective implementation is often the best way to substantially enhance outcomes.
  3. Effective programs are usually well-specified, with clear procedures and materials. Rarely do programs work if they are unclear about what teachers are expected to do, and helped to do it. In the Peruvian study of one-to-one computers, for example, students were given tablet computers at a per-pupil cost of $438. Teachers were expected to figure out how best to use them. In fact, a qualitative study found that the computers were considered so valuable that many teachers locked them up except for specific times when they were to be used. They lacked specific instructional software or professional development to create the needed software. No wonder “it” didn’t work. Other than the physical computers, there was no “it.”
  4. Technology is not magic. Technology can create opportunities for improvement, but there is little understanding of how to use technology to greatest effect. My colleagues and I have done reviews of research on effects of modern technology on learning. We found near-zero effects of a variety of elementary and secondary reading software (Inns et al., 2018; Baye et al., in press), with a mean effect size of +0.05 in elementary reading and +0.00 in secondary. In math, effects were slightly more positive (ES=+0.09), but still quite small, on average (Pellegrini et al., 2018). Some technology approaches had more promise than others, but it is time that we learned from disappointing as well as promising applications. The widespread belief that technology is the future must eventually be right, but at present we have little reason to believe that technology is transformative, and we don’t know which form of technology is most likely to be transformative.
  5. Tutoring is the most solid approach we have. Reviews of elementary reading for struggling readers (Inns et al., 2018) and secondary struggling readers (Baye et al., in press), as well as elementary math (Pellegrini et al., 2018), find outcomes for various forms of tutoring that are far beyond effects seen for any other type of treatment. Everyone knows this, but thinking about tutoring falls into two camps. One, typified by advocates of Reading Recovery, takes the view that tutoring is so effective for struggling first graders that it should be used no matter what the cost. The other, also perhaps thinking about Reading Recovery, rejects this approach because of its cost. Yet recent research on tutoring methods is finding strategies that are cost-effective and feasible. First, studies in both reading (Inns et al., 2018) and math (Pellegrini et al., 2018) find no difference in outcomes between certified teachers and paraprofessionals using structured one-to-one or one-to-small group tutoring models. Second, although one-to-one tutoring is more effective than one-to-small group, one-to-small group is far more cost-effective, as one trained tutor can work with 4 to 6 students at a time. Also, recent studies have found that tutoring can be just as effective in the upper elementary and middle grades as in first grade, so this strategy may have broader applicability than it has in the past. The real challenge for research on tutoring is to develop and evaluate models that increase cost-effectiveness of this clearly effective family of approaches.

The extraordinary advances in the quality and quantity of research in education, led by investments from IES, i3, and the EEF, have raised expectations for research-based reform. However, the modest percentage of recent studies meeting current rigorous standards of evidence has caused disappointment in some quarters. Instead, all findings, whether immediately successful or not, should be seen as crucial information. Some studies identify programs ready for prime time right now, but the whole body of work can and must inform us about areas worthy of expanded investment, as well as areas in need of serious rethinking and redevelopment. The evidence movement, in the form it exists today, is completing its first decade. It’s still early days. There is much more we can learn and do to develop, evaluate, and disseminate effective strategies, especially for students in great need of proven approaches.

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (in press). Effective reading programs for secondary students. Reading Research Quarterly.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

 Photo credit: IBM [CC BY-SA 3.0  (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Rethinking Technology in Education

Antoine de Saint-Exupéry, in his 1931 classic Night Flight, had a wonderful line about early airmail service in Patagonia, South America:

“When you are crossing the Andes and your engine falls out, well, there’s nothing to do but throw in your hand.”

[Photo: Antoine de Saint-Exupéry]

I had reason to think about this quote recently, as I was attending a conference in Santiago, Chile, the presumed destination of the doomed pilot. The conference focused on evidence-based reform in education.

Three of the papers described large scale, randomized evaluations of technology applications in Latin America, funded by the Inter-American Development Bank (IDB). Two of them documented disappointing outcomes of large-scale, traditional uses of technology. One described a totally different application.

One of the studies, reported by Santiago Cueto (Cristia et al., 2017), randomly assigned 318 high-poverty, mostly rural primary schools in Peru to receive sturdy, low-cost, practical computers, or to serve as a control group. Teachers were given great latitude in how to use the computers, but limited professional development in how to use them as pedagogical resources. Worse, the computers had software with limited alignment to the curriculum, and teachers were expected to overcome this limitation. Few did. Outcomes were essentially zero in reading and math.

In another study (Berlinski & Busso, 2017), the IDB funded a very well-designed study in 85 schools in Costa Rica. Schools were randomly assigned to receive one of five approaches. All used the same content on the same schedule to teach geometry to seventh graders. One group used traditional lectures and questions with no technology. The others used active learning, active learning plus interactive whiteboards, active learning plus a computer lab, or active learning plus one computer per student. “Active learning” emphasized discussions, projects, and practical exercises.

On a paper-and-pencil test covering the content studied by all classes, all four of the experimental groups scored significantly worse than the control group. The lowest performance was seen in the computer lab condition, and, worst of all, the one computer per child condition.

The third study, in Chile (Araya, Arias, Bottan, & Cristia, 2018), was funded by the IDB and the International Development Research Center of the Canadian government. It involved a much more innovative and unusual application of technology. Fourth grade classes within 24 schools were randomly assigned to experimental or control conditions. In the experimental group, classes in similar schools were assigned to serve as competitors to each other. Within the math classes, students studied with each other and individually for a bi-monthly “tournament,” in which students in each class were individually given questions to answer on the computers. Students were taught cheers and brought to fever pitch in their preparations. The participating classes were compared to the control classes, which studied the same content using ordinary methods. All classes, experimental and control, were studying the national curriculum on the same schedule, and all used computers, so all that differed was the tournaments and the cooperative studying to prepare for the tournaments.

The outcomes were frankly astonishing. The students in the experimental schools scored much higher on national tests than controls, with an effect size of +0.30.

The differences in the outcomes of these three approaches are clear. What might explain them, and what do they tell us about applications of technology in Latin America and anywhere?

In Peru, the computers were distributed as planned and generally functioned, but teachers received little professional development. In fact, teachers were not given specific strategies for using the computers, but were expected to come up with their own uses for them.

The Costa Rica study did provide computer users with specific approaches to math and gave teachers much associated professional development. Yet the computers may have been seen as replacements for teachers, and the computers may just not have been as effective as teachers. Alternatively, despite extensive PD, all four of the experimental approaches were very new to the teachers and may have not been well implemented.

In contrast, in the Chilean study, tournaments and cooperative study were greatly facilitated by the computers, but the computers were not central to program effectiveness. The theory of action emphasized enhanced motivation to engage in cooperative study of math. The computers were only a tool to achieve this goal. The tournament strategy resembles a method from the 1970s called Teams-Games-Tournaments (TGT) (DeVries & Slavin, 1978). TGT was very effective, but was complicated for teachers to use, which is why it was not widely adopted. In Chile, computers helped solve the problems of complexity.

It is important to note that in the United States, technology solutions are also not producing major gains in student achievement. Reviews of research on elementary reading (ES = +0.05; Inns et al., 2018) and secondary reading (ES = -0.01; Baye et al., in press) have reported near-zero effects of technology-assisted approaches. Outcomes in elementary math are only somewhat better, averaging an effect size of +0.09 (Pellegrini et al., 2018).

The findings of these rigorous studies of technology in the U.S. and Latin America lead to a conclusion that there is nothing magic about technology. Applications of technology can work if the underlying approach is sound. Perhaps it is best to consider which non-technology approaches are proven or likely to increase learning, and only then imagine how technology could make effective methods easier, less expensive, more motivating, or more instructionally effective. As an analogy, great audio technology can make a concert more pleasant or audible, but the whole experience still depends on great composition and great performances. Perhaps technology in education should be thought of in a similar enabling way, rather than as the core of innovation.

Saint-Exupéry’s Patagonian pilots crossing the Andes had no “Plan B” if their engines fell out. We do have many alternative ways to put technology to work or to use other methods, if the computer-assisted instruction strategies that have dominated technology since the 1970s keep showing such small or zero effects. The Chilean study and certain exceptions to the overall pattern of research findings in the U.S. suggest appealing “Plans B.”

The technology “engine” is not quite falling out of the education “airplane.” We need not throw in our hand. Instead, it is clear that we need to re-engineer both, to ask not what is the best way to use technology, but what is the best way to engage, excite, and instruct students, and then ask how technology can contribute.

Photo credit: Distributed by Agence France-Presse (NY Times online) [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

References

Araya, R., Arias, E., Bottan, N., & Cristia, J. (2018, August 23). Conecta Ideas: Matemáticas con motivación social. Paper presented at the conference “Educate with Evidence,” Santiago, Chile.

Baye, A., Lake, C., Inns, A., & Slavin, R. (in press). Effective reading programs for secondary students. Reading Research Quarterly.

Berlinski, S., & Busso, M. (2017). Challenges in educational reform: An experiment on active learning in mathematics. Economics Letters, 156, 172-175.

Cristia, J., Ibarraran, P., Cueto, S., Santiago, A., & Severín, E. (2017). Technology and child development: Evidence from the One Laptop per Child program. American Economic Journal: Applied Economics, 9 (3), 295-320.

DeVries, D. L., & Slavin, R. E. (1978). Teams-Games-Tournament:  Review of ten classroom experiments. Journal of Research and Development in Education, 12, 28-38.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018, March 3). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Pellegrini, M., Inns, A., & Slavin, R. (2018, March 3). Effective programs in elementary mathematics: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Small Studies, Big Problems

Everyone knows that “good things come in small packages.” But in research evaluating practical educational programs, this saying does not apply. Small studies are very susceptible to bias. In fact, among all the factors that can inflate effect sizes in educational experiments, small sample size is among the most powerful. This problem is widely known, and in reviewing large and small studies, most meta-analysts solve the problem by requiring minimum sample sizes and/or weighting effect sizes by their sample sizes. Problem solved.


For some reason, the What Works Clearinghouse (WWC) has so far paid little attention to sample size. It has not weighted by sample size in computing mean effect sizes, although the WWC is talking about doing this in the future. It has not even set minimums for sample size for its reviews. I know of one accepted study with a total sample size of 12 (6 experimental, 6 control). These procedures greatly inflate WWC effect sizes.

As one indication of the problem, our review of 645 reading, math, and science studies accepted by the Best Evidence Encyclopedia (www.bestevidence.org) found that studies with fewer than 250 subjects had twice the effect sizes of those with more than 250 (effect sizes = +0.30 vs. +0.16). Comparing studies with fewer than 100 students to those with more than 3,000, the ratio was 3.5 to 1 (see Cheung & Slavin [2016] at http://www.bestevidence.org/word/methodological_Sept_21_2015.pdf). Several other studies have found the same effect.

Using data from the What Works Clearinghouse reading and math studies, obtained by graduate student Marta Pellegrini (2017), sample size effects were also extraordinary. The mean effect size for sample sizes of 60 or less was +0.37; for samples of 60-250, +0.29; and for samples of more than 250, +0.13. Among all design factors she studied, small sample size made the most difference in outcomes, rivaled only by researcher/developer-made measures. In fact, sample size is more pernicious, because while reviewers can exclude researcher/developer-made measures within a study and focus on independent measures, a study with a small sample has the same problem for all measures. Also, because small-sample studies are relatively inexpensive, there are quite a lot of them, so reviews that fail to attend to sample size can greatly over-estimate overall mean effect sizes.

My colleague Amanda Inns (2018) recently analyzed WWC reading and math studies to find out why small studies produce such inflated outcomes. There are many reasons small-sample studies may produce such large effect sizes. One is that in small studies, researchers can provide extraordinary amounts of assistance or support to the experimental group. This is called “superrealization.” Another is that when studies with small sample sizes find null effects, the studies tend not to be published or made available at all, deemed a “pilot” and forgotten. In contrast, a large study is likely to have been paid for by a grant, which will produce a report no matter what the outcome. There has long been an understanding that published studies produce much higher effect sizes than unpublished studies, and one reason is that small studies are rarely published if their outcomes are not significant.

Whatever the reasons, there is no doubt that small studies greatly overstate effect sizes. In reviewing research, this well-known fact has long led meta-analysts to weight effect sizes by their sample sizes (usually using an inverse variance procedure). Yet as noted earlier, the WWC does not do this, but just averages effect sizes across studies without taking sample size into account.

One example of the problem of ignoring sample size in averaging is provided by Project CRISS. CRISS was evaluated in two studies. One had 231 students. On a staff-developed “free recall” measure, the effect size was +1.07. The other study had 2338 students, and an average effect size on standardized measures of -0.02. Clearly, the much larger study with an independent outcome measure should have swamped the effects of the small study with a researcher-made measure, but this is not what happened. The WWC just averaged the two effect sizes, obtaining a mean of +0.53.
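To see how much weighting matters, here is the arithmetic for the two CRISS studies described above, comparing the simple unweighted average with a sample-size-weighted one (meta-analysts more often weight by inverse variance, but sample size makes the same point here):

```python
# Effect sizes and sample sizes for the two Project CRISS studies described above.
effect_sizes = [1.07, -0.02]
sample_sizes = [231, 2338]

unweighted = sum(effect_sizes) / len(effect_sizes)
weighted = sum(es * n for es, n in zip(effect_sizes, sample_sizes)) / sum(sample_sizes)

print(f"Unweighted mean ES:      {unweighted:+.2f}")  # about +0.53, the averaging the WWC used
print(f"Sample-weighted mean ES: {weighted:+.2f}")    # about +0.08, dominated by the large study
```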

How might the WWC set minimum sample sizes for studies to be included for review? Amanda Inns proposed a minimum of 60 students (at least 30 experimental and 30 control) for studies that analyze at the student level. She suggests a minimum of 12 clusters (6 and 6), such as classes or schools, for studies that analyze at the cluster level.

In educational research evaluating school programs, good things come in large packages. Small studies are fine as pilots, or for descriptive purposes. But when you want to know whether a program works in realistic circumstances, go big or go home, as they say.

The What Works Clearinghouse should exclude very small studies and should use weighting based on sample sizes in computing means. And there is no reason it should not start doing these things now.

References

Inns, A., & Slavin, R. (2018, August). Do small studies add up in the What Works Clearinghouse? Paper presented at the meeting of the American Psychological Association, San Francisco, CA.

Pellegrini, M. (2017, August). How do different standards lead to different conclusions? A comparison between meta-analyses of two research centers. Paper presented at the European Conference on Educational Research (ECER), Copenhagen, Denmark.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.