Preschool is Not Magic. Here’s What Is.

If there is one thing that everyone knows about policy-relevant research in education, it is this: Participation in high-quality preschool programs (at age 4) has substantial and lasting effects on students' academic and life success, especially for students from disadvantaged homes. The main basis for this belief is the findings of the famous Perry Preschool program, which randomly assigned 123 disadvantaged youngsters in Ypsilanti, Michigan, to receive intensive preschool services or not to receive these services. The Perry Preschool study found positive effects at the end of preschool, and long-term benefits on outcomes such as high school graduation, welfare dependence, arrests, and employment (Schweinhart, Barnes, & Weikart, 1993).

But prepare to be disappointed.

Recently, a new study reported a very depressing set of outcomes. Lipsey, Farran, & Durkin (2018) published a large, randomized evaluation of Tennessee's statewide preschool program, in which 2,990 four-year-olds were randomly assigned either to participate in preschool or not. As in virtually all preschool studies, children who were randomly assigned to preschool scored much better at the end of the preschool year than those who were assigned to the control group. But these effects diminished in kindergarten, and by first grade no positive effects could be detected. By third grade, the control group actually scored significantly higher than the former preschool students in math and science, and non-significantly higher in reading!

Jon Baron of the Laura and John Arnold Foundation wrote an insightful commentary on this study, noting that when such a large, well-done, long-term, randomized study is reported, we have to take the results seriously, even if they disagree with our most cherished beliefs. At the end of Baron's brief summary was a commentary by Dale Farran and Mark Lipsey, two of the study's authors, telling the story of the hostile reception their paper received in the early childhood research community and the difficulties they had getting this exemplary experiment published.

Clearly, the Tennessee study was a major disappointment. How could preschool have no lasting effects for disadvantaged children?

Having participated in several research reviews on this topic (e.g., Chambers, Cheung, & Slavin, 2016), as well as some studies of my own, I have several observations to make.

Although this may have been the first large, randomized evaluation of a state-funded preschool program in the U.S., many related studies have had the same results. These include a large, randomized study of 5,000 children assigned to Head Start or not (Puma et al., 2010), which also found positive outcomes at the end of the pre-K year, but only scattered lasting effects afterwards. A very similar pattern (positive effects at the end of the program, with little or no lasting impact) was found in an evaluation of a national program called Sure Start in England (Melhuish, Belsky, & Leyland, 2010), and in one in Australia (Claessens & Garrett, 2014).

Ironically, the Perry Preschool study itself failed to find lasting impacts, until students were in high school. That is, its outcomes were similar to those of the Tennessee, Head Start, Sure Start, and Australian studies, for the first 12 years of the study. So I suppose it is possible that someday, the participants in the Tennessee study will show a major benefit of having attended preschool. However, this seems highly doubtful.

It is important to note that some large studies of preschool attendance do find positive and lasting effects. However, these are invariably matched, non-experimental studies of children who happened to attend preschool, compared to others who did not. The problem with such studies is that it is essentially impossible to statistically control for all the factors that would lead parents to enroll their child in preschool, or not to do so. So lasting effects of preschool may just be lasting effects of having the good fortune to be born into the sort of family that would enroll its children in preschool.

What Should We Do if Preschool is Not Magic?

Let's accept for the moment the hard (and likely) reality that one year of preschool is not magic, and is unlikely to have lasting effects of the kind reported by the Perry Preschool study (and by no other randomized study). Do we give up?

No.  I would argue that rather than considering preschool magic-or-nothing, we should think of it the same way we think about any other grade in school. That is, a successful school experience should not be one terrific year, but fourteen years (pre-k to 12) of great instruction using proven programs and practices.

First comes the preschool year itself, or the two year period including pre-k and kindergarten. There are many programs that have been shown in randomized studies to be successful over that time span, in comparison to control groups of children who are also in school (see Chambers, Cheung, & Slavin, 2016). Then comes reading instruction in grades K-1, where randomized studies have also validated many whole-class, small group, and one-to-one tutoring methods (Inns et al., 2018). And so on. There are programs proven to be effective in randomized experiments, at least for reading and math, for every grade level, pre-k to 12.

The time has long passed since all we had in our magic hat was preschool. We now have quite a lot. If we improve our schools one grade at a time and one subject at a time, we can see accumulating gains, ones that do not require waiting for miracles. And then we can work steadily toward improving what we can offer children every year, in every subject, in every type of school.

No one ever built a cathedral by waving a wand. Instead, magnificent cathedrals are built one stone at a time. In the same way, we can build a solid structure of learning using proven programs every year.

References

Baron, J. (2018). Large randomized controlled trial finds state pre-k program has adverse effects on academic achievement. Straight Talk on Evidence. Retrieved from http://www.straighttalkonevidence.org/2018/07/16/large-randomized-controlled-trial-finds-state-pre-k-program-has-adverse-effects-on-academic-achievement/

Chambers, B., Cheung, A., & Slavin, R. (2016). Literacy and language outcomes of balanced and developmental-constructivist approaches to early childhood education: A systematic review. Educational Research Review, 18, 88-111.

Claessens, A., & Garrett, R. (2014). The role of early childhood settings for 4-5 year old children in early academic skills and later achievement in Australia. Early Childhood Research Quarterly, 29(4), 550-561.

Inns, A., Lake, C., Pellegrini, M., & Slavin, R. (2018). Effective programs for struggling readers: A best-evidence synthesis. Paper presented at the annual meeting of the Society for Research on Educational Effectiveness, Washington, DC.

Lipsey, M. W., Farran, D. C., & Durkin, K. (2018). Effects of the Tennessee Prekindergarten Program on children's achievement and behavior through third grade. Early Childhood Research Quarterly. https://doi.org/10.1016/j.ecresq.2018.03.005

Melhuish, E., Belsky, J., & Leyland, R. (2010). The impact of Sure Start local programmes on five year olds and their families. London: Jessica Kingsley.

Puma, M., Bell, S., Cook, R., & Heid, C. (2010). Head Start impact study: Final report.  Washington, DC: U.S. Department of Health and Human Services.

Schweinhart, L. J., Barnes, H. V., & Weikart, D. P. (1993). Significant benefits: The High/Scope Perry Preschool study through age 27 (Monographs of the High/Scope Educational Research Foundation No. 10). Ypsilanti, MI: High/Scope Press.

 

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Little Sleepers: Long-Term Effects of Preschool

In education research, a “sleeper effect” is not a way to get all of your preschoolers to take naps. Instead, it is an outcome of a program that appears not immediately after the end of the program, but some time afterwards, usually a year or more. For example, the mother of all sleeper effects was the Perry Preschool study, which found positive outcomes at the end of preschool but no differences throughout elementary school. Then positive follow-up outcomes began to show up on a variety of important measures in high school and beyond.

Sleeper effects are very rare in education research. To see why, imagine a study of a math program for third graders that found no differences between program and control students at the end of third grade, but in which a large and significant difference popped up in fourth grade or later. Long-term effects of effective programs are often seen, but how can there be long-term effects if there are no short-term effects along the way? Sleeper effects are so rare that many early childhood researchers have serious doubts about the validity of the long-term Perry Preschool findings.

I was thinking about sleeper effects recently because we have just added preschool studies to our Evidence for ESSA website. In reviewing the key studies, I was once again reading an extraordinary 2009 study by Mark Lipsey and Dale Farran.

The study randomly assigned Head Start classes in rural Tennessee to one of three conditions. Some were assigned to use a program called Bright Beginnings, which had a strong pre-literacy focus. Some were assigned to use Creative Curriculum, a popular constructivist/developmental curriculum with little emphasis on literacy. The remainder were assigned to a control group, in which teachers used whatever methods they ordinarily used.

Note that this design is different from the usual preschool studies frequently reported in the newspaper, which compare preschool to no preschool. In this study, all students were in preschool. What differed is only how they were taught.

The results immediately after the preschool program were not astonishing. Bright Beginnings students scored best on literacy and language measures (average effect size = +0.21 for literacy, +0.11 for language), though the differences were not significant at the school level. There were no differences at all between Creative Curriculum and control schools.

Where the outcomes became interesting was in the later years. Ordinarily in education research, outcomes measured after the treatments have finished diminish over time. In the Bright Beginnings/Creative Curriculum study, the outcomes were measured again when students were in third grade, four years after they left preschool. Most students could be located because the test was the Tennessee standardized test, so scores could be found as long as students were still in Tennessee schools.

On third grade reading, former Bright Beginnings students now scored better than former controls, and the difference was statistically significant and substantial (effect size = +0.27).

In a review of early childhood programs at www.bestevidence.org, our team found that across 16 programs emphasizing literacy as well as language, effect sizes did not diminish in literacy at the end of kindergarten, and they actually doubled on language measures (from +0.08 in preschool to +0.15 in kindergarten).

If sleeper effects (or at least maintenance on follow-up) are so rare in education research, why did they appear in these studies of preschool? There are several possibilities.

The most likely explanation is that it is difficult to measure outcomes among four-year-olds. They can be squirrelly and inconsistent. If a pre-kindergarten program has a true and substantial impact on children's literacy or language, measures at the end of preschool may not detect it as well as measures a year later, because kindergartners, and kindergarten skills, are easier to measure.

Whatever the reason, the evidence suggests that effects of particular preschool approaches may show up later than the end of preschool. This observation, and specifically the Bright Beginnings evaluation, may indicate that in the long run it matters a great deal how students are taught in preschool. Until we find replicable models of preschool, or pre-K to 3 interventions, that have long-term effects on reading and other outcomes, we cannot sleep. Our little sleepers are counting on us to ensure a positive future for them.

This blog is sponsored by the Laura and John Arnold Foundation

Pilot Studies: On the Path to Solid Evidence

This week, the Education Technology Industry Network (ETIN), a division of the Software & Information Industry Association (SIIA), released an updated guide to research methods, authored by a team at Empirical Education Inc. The guide is primarily intended to help software companies understand what is required for studies to meet current standards of evidence.

In government and among methodologists and well-funded researchers, there is general agreement about the kind of evidence needed to establish the effectiveness of an education program intended for broad dissemination. To meet its top rating ("meets standards without reservations"), the What Works Clearinghouse (WWC) requires an experiment in which schools, classes, or students are assigned at random to experimental or control groups; it has a second category ("meets standards with reservations") for matched studies.

These WWC categories more or less correspond to the Every Student Succeeds Act (ESSA) evidence standards (“strong” and “moderate” evidence of effectiveness, respectively), and ESSA adds a third category, “promising,” for correlational studies.

Our own Evidence for ESSA website follows the ESSA guidelines, of course. The SIIA guidelines explain all of this.

Despite the overall consensus about the top levels of evidence, the problem is that doing studies that meet these requirements is expensive and time-consuming. Software developers, especially small ones with limited capital, often do not have the resources or the patience to do such studies. An organization that has developed something new may not want to invest substantial resources in large-scale evaluations until it has some indication that the program is likely to show well in a larger, longer, and better-designed evaluation. There is, however, a path to high-quality evaluations, starting with pilot studies.

The SIIA Guide usefully discusses this problem, but I want to add some further thoughts on what to do when you can’t afford a large randomized study.

1. Design useful pilot studies. Evaluators need to make a clear distinction between full-scale evaluations, intended to meet WWC or ESSA standards, and pilot studies (the SIIA Guidelines call these “formative studies”), which are just meant for internal use, both to assess the strengths or weaknesses of the program and to give an early indicator of whether or not a program is ready for full-scale evaluation. The pilot study should be a miniature version of the large study. But whatever its findings, it should not be used in publicity. Results of pilot studies are important, but by definition a pilot study is not ready for prime time.

An early pilot study may be just a qualitative study, in which developers and others might observe classes, interview teachers, and examine computer-generated data on a limited scale. The problem in pilot studies is at the next level, when developers want an early indication of effects on achievement, but are not ready for a study likely to meet WWC or ESSA standards.

2. Worry about bias, not power. Small, inexpensive studies pose two types of problems. One is the possibility of bias, discussed in the next section. The other is lack of power, mostly meaning having a large enough sample to determine that a potentially meaningful program impact is statistically significant, or unlikely to have happened by chance. To understand this, imagine that your favorite baseball team adopts a new strategy. After the first ten games, the team is doing better than it did last year, in comparison to other teams, but this could have happened by chance. After 100 games? Now the results are getting interesting. If 10 teams all adopt the strategy next year and they all see improvements on average? Now you’re headed toward proof.

During the pilot process, evaluators might compare multiple classes or multiple schools, perhaps assigned at random to experimental and control groups. There may not be enough classes or schools for statistical significance yet, but if the mini-study avoids bias, the results will at least be in the ballpark (so to speak). A brief simulation after this list illustrates how easily chance alone can produce impressive-looking effects in very small samples.

3. Avoid bias. A small experiment can be fine as a pilot study, but every effort should be made to avoid bias. Otherwise, the pilot study will give a result far more positive than the full-scale study will, defeating the purpose of doing a pilot.

Examples of common sources of bias in smaller studies are as follows.

a. Use of measures made by developers or researchers. These measures typically produce greatly inflated impacts.

b. Implementation of gold-plated versions of the program. In small pilot studies, evaluators often implement versions of the program that could never be replicated. Examples include providing additional staff time that could not be repeated at scale.

c. Inclusion of highly motivated teachers or students in the experimental group, which gets the program, but not in the control group. For example, matched studies of technology often exclude teachers who did not implement "enough" of the program. The problem is that the full-scale experiment (and real life) includes all kinds of teachers, so excluding teachers who could not or did not want to engage with technology overstates the likely impact at scale in ordinary schools. Even worse, excluding students who did not use the technology enough may bias the study toward more capable students.

4. Learn from pilots. Evaluators, developers, and disseminators should learn as much as possible from pilots. Observations, interviews, focus groups, and other informal means should be used to understand what is working and what is not, so that when the program is evaluated at scale, it is at its best.
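
To make concrete the point in item 2 about chance and sample size, here is a minimal simulation sketch in Python. All of the numbers are illustrative assumptions of my own: a program with zero true effect, classes of 25 students, outcomes on a standardized (mean 0, SD 1) scale, and clustering within classes ignored for simplicity. The sketch estimates how often a pilot of a given size would show an apparent effect size of +0.20 or better purely by chance.

```python
import numpy as np

rng = np.random.default_rng(42)

def apparent_effect(n_classes_per_group, class_size=25, true_effect=0.0):
    """Simulate one pilot: program and control students drawn from the same
    population (true effect = 0 by default); return the apparent effect size
    (difference in means divided by the pooled SD)."""
    n = n_classes_per_group * class_size
    program = rng.normal(true_effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    pooled_sd = np.sqrt((program.var(ddof=1) + control.var(ddof=1)) / 2)
    return (program.mean() - control.mean()) / pooled_sd

def chance_of_big_effect(n_classes_per_group, trials=10_000):
    """Share of simulated null pilots showing an apparent effect of +0.20 or more."""
    effects = np.array([apparent_effect(n_classes_per_group) for _ in range(trials)])
    return np.mean(effects >= 0.20)

for classes in (2, 5, 20):
    print(f"{classes:>2} classes per group: "
          f"{chance_of_big_effect(classes):.1%} chance of a spurious d >= +0.20")
```

The exact percentages depend on the assumed class size, and they would be larger still if clustering within classes were modeled, but the direction is the point: very small pilots routinely produce effects that evaporate at scale, which is why controlling bias, rather than chasing statistical significance, is the priority at the pilot stage.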

 

***

As evidence becomes more and more important, publishers and software developers will increasingly be called upon to prove that their products are effective. However, no program should have its first evaluation be a 50-school randomized experiment. Such studies are indeed the “gold standard,” but jumping from a two-class pilot to a 50-school experiment is a way to guarantee failure. Software developers and publishers should follow a path that leads to a top-tier evaluation, and learn along the way how to ensure that their programs and evaluations will produce positive outcomes for students at the end of the process.

 

This blog is sponsored by the Laura and John Arnold Foundation

Why Rigorous Studies Get Smaller Effect Sizes

When I was a kid, I was a big fan of the hapless Washington Senators. They were awful. Year after year, they were dead last in the American League. They were the sort of team that builds diehard fans not despite but because of their hopelessness. Every once in a while, kids I knew would snap under the pressure and start rooting for the Baltimore Orioles. We shunned them forever, right up to this day.

With the Senators, any reason for hope was prized, and we were all very excited when some hotshot batter was brought up from the minor leagues. But they almost always got whammed, were sent back down or traded, and were never heard from again. I'm sure this happens with every team. In fact, I just saw an actual study comparing batting averages for batters in their last year in the minors to their first year in the majors. The difference was dramatic. In the majors, the very same batters had much lower averages. The impact was equivalent to an effect size of -0.70. That's huge. I'd call this effect the Curse of the Major Leagues.

Why am I carrying on about baseball? I think it provides an analogy to explain why large, randomized experiments in education have characteristically lower effect sizes than experiments that are smaller, quasi-experimental, or (especially) both.

In baseball, batting averages decline because the competition is tougher. The pitchers are faster, the fielders are better, and maybe the minor league parks are smaller, I don’t know. In education, large randomized experiments are tougher competition, too. Randomized experiments are tougher because the experimenter doesn’t get the benefit of self-selection by the schools or teachers choosing the program. In a randomized experiment everyone has to start fresh at the beginning of the study, so the experimenter does not get the benefit of working with teachers who may already be experienced in the experimental program.

In larger studies, the experimenter has more difficulty controlling every variable to ensure high-quality implementation. Large studies are more likely to use standardized tests rather than researcher-made tests. If these are state tests used for accountability, the control group can be assumed to be trying just as much as the experimental group to improve students’ scores on the objectives taught on those tests.

What these problems mean is that when a program is evaluated in a large randomized study, and the results are significantly positive, this is cause for real celebration because the program had to overcome much tougher competition. The successful program is far more likely to work in realistic settings at serious scale because it has been tested under more life-like conditions. Other experimental designs are also valuable, of course, if only because they act like the minor leagues, nurturing promising prospects and then sending the best to the majors where their mettle will really be tested. In a way, this is exactly the tiered evidence strategy used in Investing in Innovation (i3) and in the Institute for Education Sciences (IES) Goal 2-3-4 progression. In both cases, smaller grants are made available for development projects, which are nurtured and, if they show promise, may be funded at a higher level and sent to the majors (validation, scale-up) for rigorous, large-scale evaluation.

The Curse of the Major Leagues was really just the product of a system for fairly and efficiently bringing the best players into the major leagues. The same idea is the brightest hope we have for offering schools throughout the U.S. the very best instructional programs on a meaningful scale. After all those years rooting for the Washington Senators, I’m delighted to see something really powerful coming from our actual Senators in Washington. And I don’t mean baseball!

Seeking Jewels, Not Boulders: Learning to Value Small, Well-Justified Effect Sizes

One of the most popular exhibits in the Smithsonian Museum of Natural History is the Hope Diamond, one of the largest and most valuable in the world. It’s always fun to see all the kids flow past it saying how wonderful it would be to own the Hope Diamond, how beautiful it is, and how such a small thing could make you rich and powerful.

The diamonds are at the end of the Hall of Minerals, which is crammed full of exotic minerals from all over the world. These are beautiful, rare, and amazing in themselves, yet most kids rush past them to get to the diamonds. But no one, ever, evaluates the minerals against one another according to their size. No one ever says, “you can have your Hope Diamond, but I’d rather have this giant malachite or feldspar.” Just getting into the Smithsonian, kids go by boulders on the National Mall far larger than anything in the Hall of Minerals, perhaps climbing on them but otherwise ignoring them completely.

Yet in educational research, we often focus on the size of study effects without considering their value. In a recent blog, I presented data from a paper with my colleague Alan Cheung analyzing effect sizes from 611 studies evaluating reading, math, and science programs, K-12, that met the inclusion standards of our Best Evidence Encyclopedia. One major finding was that in randomized evaluations with sample sizes of 250 students (10 classes) or more, the average effect size across 87 studies was only +0.11. Smaller randomized studies had effect sizes averaging +0.22, large matched quasi-experiments +0.17, and small quasi-experiments, +0.32. In this blog, I want to say more about how these findings should make us think differently about effect sizes as we increasingly apply evidence to policy and practice.

Large randomized experiments (RCTs) with significant positive outcomes are the diamonds of educational research: rare, often flawed, but incredibly valuable. The reason they are so valuable is that such studies are the closest indication of what will happen when a given program goes out into the real world of replication. Randomization removes the possibility that self-selection may account for program effects. The larger the sample size, the less likely it is that the experimenter or developer could closely monitor each class and mentor each teacher beyond what would be feasible in real-life scale up. Most large-scale RCTs use clustering, which usually means that the treatment and randomization take place at the level of the whole school. A cluster randomized experiment at the school level might require recruiting 40 to 50 schools, perhaps serving 20,000 to 25,000 students. Yet such studies might nevertheless be too “small” to detect an effect size of, say, 0.15, because it is the number of clusters, not the number of students, that matters most!
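
To see why the number of clusters dominates the arithmetic, here is a rough sketch of my own (not a calculation from any of the studies discussed here) using the standard approximation for the minimum detectable effect size (MDES) of a two-arm cluster randomized trial: roughly 2.8 times the square root of [(ICC + (1 - ICC)/n) / (P(1 - P)J)], where n is students per school, J is the number of schools, P is the fraction of schools assigned to treatment, and 2.8 is the multiplier for 80% power at a two-tailed p = .05. The intraclass correlation of 0.15 and the absence of covariates are assumptions made purely for illustration; pretest covariates would lower all of these numbers.

```python
import math

def mdes_cluster_rct(n_schools, students_per_school, icc=0.15,
                     prop_treated=0.5, multiplier=2.8):
    """Approximate minimum detectable effect size for a two-arm cluster RCT
    (80% power, two-tailed p = .05, no covariates); purely illustrative."""
    cluster_variance = icc + (1 - icc) / students_per_school
    allocation = prop_treated * (1 - prop_treated) * n_schools
    return multiplier * math.sqrt(cluster_variance / allocation)

# Adding students barely helps; adding schools helps a lot.
for schools, students in [(50, 50), (50, 500), (100, 50)]:
    print(f"{schools:>3} schools x {students:>3} students per school: "
          f"MDES ~ {mdes_cluster_rct(schools, students):.2f}")
```

Under these assumptions, multiplying the number of students per school by ten barely moves the MDES (from about 0.32 to 0.31), while doubling the number of schools brings it down to roughly 0.23, still not small enough to reliably detect an effect of +0.15 without covariates.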

The problem is that we have been looking for much larger effect sizes, and all too often not finding them. Traditionally, researchers recruit enough schools or classes to reliably detect an effect size as small as +0.20. This means that many studies report effect sizes that turn out to be larger than average for large RCTs, but are not statistically significant (at p<.05), because they are less than +0.20. If researchers did recruit samples of schools large enough to detect an effect size of +0.15, this would greatly increase the costs of such studies. Large RCTs are already very expensive, so substantially increasing sample sizes could end up requiring resources far beyond what educational research is likely to see any time in the near future or greatly reducing the number of studies that are funded.

These issues have taken on greater importance recently due to the passage of the Every Student Succeeds Act, or ESSA, which encourages use of programs that meet strong, moderate, or promising levels of evidence. The “strong” category requires that a program have at least one randomized experiment that found a significant positive effect. Such programs are rare.

If educational researchers were mineralogists, we’d be pretty good at finding really big diamonds, but the little, million-dollar diamonds, not so much. This makes no sense in diamonds, and no sense in educational research.

So what do we do? I’m glad you asked. Here are several ways we could proceed to increase the number of programs successfully evaluated in RCTs.

1. For cluster randomized experiments at the school level, something has to give. I'd suggest that for such studies, the p value should be increased to .10 or even .20. A p value of .05 is a long-established convention; it means that a difference as large as the one observed would arise by chance only one time in 20 if the program actually had no effect. Yet one chance in 10 (p = .10) may be sufficient in studies likely to have tens of thousands of students.

2. For studies in the past, as well as in the future, replication should count the same as a large sample. For example, imagine that two studies of Program X each have 30 schools. Each gets a respectable effect size of +0.20, which would not be significant in either case. Put the two studies together, however, and voilà! The combined sample of 60 schools would be significant, even at p = .05 (a rough calculation after this list shows how the arithmetic works out).

3. Government or foundation funders might fund evaluations in stages. The first stage might involve a cluster randomized experiment of, say, 20 schools, which is very unlikely to produce a significant difference. But if the effect size were perhaps 0.20 or more, the funders might fund a second stage of 30 schools. The two samples together, 50 schools, would be enough to detect a small but important effect.
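
As a rough check on the arithmetic behind points 2 and 3, here is a sketch using purely illustrative assumptions of my own: a true effect size of +0.20, 50 students per school, an intraclass correlation of 0.10, an even split between conditions, and no covariates. It compares a single 30-school study with the pooled 60-school sample.

```python
import math

def z_for_cluster_rct(effect_size, schools_per_arm,
                      students_per_school=50, icc=0.10):
    """Approximate z statistic for a two-arm cluster RCT, treating school
    means as the units of analysis (no covariates); purely illustrative."""
    var_per_arm = (icc + (1 - icc) / students_per_school) / schools_per_arm
    standard_error = math.sqrt(2 * var_per_arm)
    return effect_size / standard_error

for label, per_arm in [("One 30-school study", 15),
                       ("Two studies pooled (60 schools)", 30)]:
    z = z_for_cluster_rct(0.20, per_arm)
    p_two_tailed = math.erfc(z / math.sqrt(2))  # two-tailed p from a normal z
    print(f"{label}: z = {z:.2f}, p = {p_two_tailed:.3f}")
```

With these assumptions, each 30-school study falls just short of significance, while the pooled 60 schools clear p = .05 comfortably. The exact figures shift with the intraclass correlation and with covariate adjustment, but the logic of pooling, and of staged funding, is the same.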

One might well ask why we should be interested in programs that produce effect sizes of only 0.15. Aren't such effects too small to matter?

The answer is that they are not. Over time, I hope we will learn how to routinely produce better outcomes. Already, we know that much larger impacts are found in studies of certain approaches emphasizing professional development (e.g., cooperative learning, metacognitive skills) and certain forms of technology. I hope and expect that over time, more studies will evaluate programs using methods like those that have been proven to work, and fewer will evaluate those that do not, thereby raising the average effects we find. But even as they are, small but reliable effect sizes are making meaningful differences in the lives of children, and will make much more meaningful differences as we learn from efforts at the Institute of Education Sciences (IES) and the Investing in Innovation (i3)/Education Innovation and Research (EIR) programs.

Small effect sizes from large randomized experiments are the Hope Diamonds of our profession. They also are the best hope for evidence-based improvements for all students.

Is All That Glitters Gold-Standard?

In the world of experimental design, studies in which students, classes, or schools are assigned at random to experimental or control treatments (randomized controlled trials, or RCTs) are often referred to as meeting the "gold standard." Programs with at least one randomized study with a statistically significant positive outcome on an important measure qualify as having "strong evidence of effectiveness" under the definitions in the Every Student Succeeds Act (ESSA). RCTs virtually eliminate selection bias in experiments. That is, readers don't have to worry that the teachers using an experimental program might have already been better or more motivated than those who were in the control group. Yet even RCTs can have such serious flaws as to call their outcomes into question.

A recent article by distinguished researchers Alan Ginsburg and Marshall Smith severely calls into question every single elementary and secondary math study accepted by the What Works Clearinghouse (WWC) as “meeting standards without reservations,” which in practice requires a randomized experiment. If they were right, then the whole concept of gold-standard randomized evaluations would go out the window, because the same concerns would apply to all subjects, not just math.

Fortunately, Ginsburg & Smith are mostly wrong. They identify, and then discard, 27 studies accepted by the WWC. In my view, they are right about five. They raise some useful issues about the rest, but not damning ones.

The one area in which I fully agree with Ginsburg & Smith (G&S henceforth) relates to studies that use measures made by the researchers. In a recent paper with Alan Cheung and an earlier one with Nancy Madden, I reported that use of researcher-made tests resulted in greatly overstated effect sizes. Neither WWC nor ESSA should accept such measures.

From this point on, however, G&S are overly critical. First, they reject all studies in which the developer was one of the report authors. However, the U.S. Department of Education has been requiring third-party evaluations in its larger grants for more than a decade. This is true of IES, i3, and NSF (scale-up) grants, for example, and of England's Education Endowment Foundation (EEF). A developer may be listed as an author, but it has been a long time since a developer could get his or her thumb on the scale in federally funded research. Even studies funded by publishers almost universally use third-party evaluators.

G&S complain that 25 of 27 studies evaluated programs in their first year, compromising fidelity. This is indeed a problem, but it can only affect outcomes in a negative direction. Programs showing positive outcomes in their first year may be particularly powerful.

G&S express concern that half of studies did not state what curriculum the control group was using. This would be nice to know, but does not invalidate a study.

G&S complain that in many cases the amount of instructional time for the experimental group was greater than that for the control group. This could be a problem, but given the findings of research on allocated time, it is unlikely that time alone makes much of a difference in math learning. It may be more sensible to see extra time as a question of cost-effectiveness. Did 30 extra minutes of math per day implementing Program X justify the costs of Program X, including the cost of adding the time? Future studies might evaluate the value added of 30 extra minutes doing ordinary instruction, but does anyone expect this to be a large impact?

Finally, G&S complain that most curricula used in WWC-accepted RCTs are outdated. This could be a serious concern, especially as Common Core and other college- and career-ready standards are adopted in most states. However, recall that at the time these RCTs were done, the experimental and control groups were subject to the same standards, so if the experimental group did better, the program is worth considering as an innovation. The reality is that any program in active dissemination must update its content to meet new standards. A program proven effective before Common Core and then updated to align with Common Core standards is not proven for certain to improve Common Core outcomes, for example, but it is a very good bet. A school or district considering adopting a given proven program might well check that it meets current standards, but it would be self-defeating and unnecessary to demand that every program re-prove its effectiveness every time standards change.

Randomized experiments in education are not perfect (neither are randomized experiments in medicine or other fields). However, they provide the best evidence science knows how to produce on the effectiveness of innovations. It is entirely legitimate to raise issues about RCTs, as Ginsburg & Smith do, but rejecting what we do know until perfection is achieved would cut off the best avenue we have for progress toward solid, scientifically defensible reform in our schools.

Educationists and Economists

I used to work part time in England, and I’ve traveled around the world a good bit speaking about evidence-based reform in education and related topics. One of the things I find striking in country after country is that at the higher levels, education is not run by educators. It is run by economists.

In the U.S., this is also true, though it’s somewhat less obvious. The main committees in Congress that deal with education are the House Education and the Workforce Committee and the Senate Health, Education, Labor, and Pensions (HELP) Committee. Did you notice the words “workforce” and “labor”? That’s economists. Further, politicians listen to economists, because they consider them tough-minded, data-driven, and fact-friendly. Economists see education as contributing to the quality of the workforce, now and in the future, and this makes them influential with politicians.

A lot of the policy prescriptions that get widely discussed and implemented broadly are the sorts of things economists love to dream up. For example, they are partial to market incentives, new forms of governance, rewards and punishments, and social impact bonds. Individual economists, and the politicians who listen to them, take diverse positions on these policies, but the point is that economists rather than educators often set the terms of the debates on both sides. As one example, educators have been talking about long-term impacts of quality preschool for 30 years, but when Nobel Prize-winning economist James Heckman took up the call, preschool became a top priority of the Obama Administration.

I have nothing against economists. Some of my best friends are economists. But here is why I am bringing them up.

Evidence-based reform is creating a link between educationists and economists, and thereby to the politicians who listen to them, that did not exist before. Evidence-based reform speaks the language that economists insist on: randomized evaluations of replicable programs and practices. When an educator develops a program, successfully evaluates it at scale, and shows it can be replicated, this gives economists a tangible tool they can show will make a difference in policy. Other research designs are simply not as respected or accepted. But an economist with a proven program in hand has a measurable, powerful means to affect policy and help politicians make wise use of resources.

If we want educational innovation and research to matter to public policy, we have to speak truth to power, in the language of power. And that language is increasingly the language of rigorous evidence. If we keep speaking it, our friends the economists will finally take evidence from educational research seriously, and that is how policy will change to improve outcomes for children on a grand scale.