Little Sleepers: Long-Term Effects of Preschool

In education research, a “sleeper effect” is not a way to get all of your preschoolers to take naps. Instead, it is an outcome of a program that appears not immediately after the end of the program, but some time afterwards, usually a year or more. For example, the mother of all sleeper effects was the Perry Preschool study, which found positive outcomes at the end of preschool but no differences throughout elementary school. Then positive follow-up outcomes began to show up on a variety of important measures in high school and beyond.

Sleeper effects are very rare in education research. To see why, imagine a study of a math program for third graders that found no differences between program and control students at the end of third grade, only for a large and significant difference to pop up in fourth grade or later. Long-term effects of effective programs are often seen, but how can there be long-term effects if there are no short-term effects on the way? Sleeper effects are so rare that many early childhood researchers have serious doubts about the validity of the long-term Perry Preschool findings.

I have been thinking about sleeper effects because we recently added preschool studies to our Evidence for ESSA website. In reviewing the key studies, I was once again reading an extraordinary 2009 study by Mark Lipsey and Dale Farran.

The study randomly assigned Head Start classes in rural Tennessee to one of three conditions. Some were assigned to use a program called Bright Beginnings, which had a strong pre-literacy focus. Some were assigned to use Creative Curriculum, a popular constructive/developmental curriculum with little emphasis on literacy. The remainder were assigned to a control group, in which teachers used whatever methods they ordinarily used.

Note that this design is different from the usual preschool studies frequently reported in the newspaper, which compare preschool to no preschool. In this study, all students were in preschool. What differed is only how they were taught.

The results immediately after the preschool program were not astonishing. Bright Beginnings students scored best on literacy and language measures (average effect size = +0.21 for literacy, +0.11 for language), though the differences were not significant at the school level. There were no differences at all between Creative Curriculum and control schools.

Where the outcomes became interesting was in the later years. Ordinarily in education research, outcomes measured after the treatments have finished diminish over time. In the Bright Beginnings/Creative Curriculum study the outcomes were measured again when students were in third grade, four years after they left preschool. Most students could be located because the test was the Tennessee standardized test, so scores could be found as long as students were still in Tennessee schools.

On third grade reading, former Bright Beginnings students now scored better than former controls, and the difference was statistically significant and substantial (effect size = +0.27).

In a review of early childhood programs at, our team found that across 16 programs emphasizing literacy as well as language, effect sizes did not diminish in literacy at the end of kindergarten, and they actually doubled on language measures (from +0.08 in preschool to +0.15 in kindergarten).

If sleeper effects (or at least maintenance on follow-up) are so rare in education research, why did they appear in these studies of preschool? There are several possibilities.

The most likely explanation is that it is difficult to measure outcomes among four-year-olds. They can be squirrelly and inconsistent. If a pre-kindergarten program had a true and substantial impact on children’s literacy or language, measures at the end of preschool may not detect it as well as measures a year later, because kindergartners and kindergarten skills are easier to measure.

Whatever the reason, the evidence suggests that effects of particular preschool approaches may show up later than the end of preschool. This observation, and specifically the Bright Beginnings evaluation, may indicate that in the long run it matters a great deal how students are taught in preschool. Until we find replicable models of preschool, or pre-k to 3 interventions, that have long-term effects on reading and other outcomes, we cannot sleep. Our little sleepers are counting on us to ensure them a positive future.

This blog is sponsored by the Laura and John Arnold Foundation


Getting Past the Dudalakas (And the Yeahbuts)

Phyllis Hunter, a gifted educator, writer, and speaker on the teaching of reading, often speaks about the biggest impediments to education improvement, which she calls the dudalakas. These are excuses for why change is impossible.  Examples are:

Dudalaka         Better students

Dudalaka         Money

Dudalaka         Policy support

Dudalaka         Parent support

Dudalaka         Union support

Dudalaka         Time

Dudalaka is just shorthand for “Due to the lack of.” It’s a close cousin of “yeahbut,” another reflexive response to ideas for improving education practices or policy.

Of course, there are real constraints that teachers and education leaders face that genuinely restrict what they can do. The problem with dudalakas and yeahbuts is not that the objections are wrong, but that they are so often thrown up as a reason not to even think about solutions.

I often participate in dudalaka conversations. Here is a composite. I’m speaking with a principal of an elementary school, who is expressing concern about the large number of students in his school who are struggling in reading. Many of these students are headed for special education. “Could you provide them with tutors?” I ask. “Yes, they get tutors, but we use a small group method that emphasizes oral reading (not the phonics skills that the students are actually lacking) (i.e., yeahbut).”

“Could you change the tutoring to focus on the skills you know students need?”

“Yeahbut our education leadership requires we use this system (dudalaka political support). Besides, we have so many failing students (dudalaka better students) so we have to work with small groups of students (dudalaka tutors).”

“Could you hire and train paraprofessionals or recruit qualified volunteers to provide personalized tutoring?”

“Yeahbut we’d love to, but we can’t afford them (dudalaka money). Besides, we don’t have time for tutoring (dudalaka time).”

“But you have plenty of time in your afternoon schedule.”

“Yeahbut in the afternoon, children are tired. (Dudalaka better students).”

This conversation is not, of course, a rational discussion of strategies for solving a serious problem. It is instead an attempt by the principal to find excuses to justify his school’s continuing to do what it is doing now. Dudalakas and yeahbuts are merely ways of passing blame to other people (school leaders, teachers, children, parents, unions, and so on) and to shortages of money, time, and other resources that hold back change. Again, these excuses may or may not be valid in a particular situation, but there is a difference between rejecting potential solutions out of hand (using dudalakas and yeahbuts) and identifying and then carefully and creatively considering potential solutions. Not every solution will be possible or workable, but if the problem is important, some solution must be found. No matter what.

An average American elementary school with 500 students has an annual budget of approximately $6,000,000 ($12,000 per student). Principals and teachers, superintendents, and state superintendents think their hands are tied by limited resources (dudalaka money). But creativity and commitment to core goals can overcome funding limitations if school and district leaders are willing to use resources differently or activate underutilized resources, or ideally, find a way to obtain more funding.

The people who start off with the very human self-protective dudalakas and yeahbuts may, with time, experience, and encouragement, become huge advocates for change. It’s only natural to start with dudalakas and yeahbuts. What is important is that we don’t end with them.

We know that our children are capable of succeeding at much higher rates than they do today. Yet too many are failing, dudalaka quality implementation of proven programs. Let’s clear away the other dudalakas and yeahbuts, and get down to this one.


The Sweet Land of Carrots: Promoting Evidence with Incentives

Results for America (RFA) released a report in July analyzing the first 17 Every Student Succeeds Act (ESSA) plans submitted by states. RFA was particularly interested in the degree to which evidence of effectiveness was represented in the plans, and the news is generally good. All states discussed evidence (it’s in the law), but many went much further, proposing to award competitive funding to districts to the degree that they propose to adopt programs proven to be effective according to the ESSA evidence standards. This was particularly true of school improvement grants, where the ESSA law requires evidence, but many state plans extended this principle beyond school improvement into other areas.

As an incurable optimist, I think this all looks very good. If state leaders are clear about what qualifies as “proven” under ESSA, and clear about how proper supports are also needed (e.g., needs assessments, high-quality implementation), then this creates an environment in which evidence will, at long last, play an important role in education policy. This was always the intent of the ESSA evidence standards, which were designed to make it easy for states and districts to identify proven programs so that they could incentivize and assist schools in using such programs.

The focus on encouragement, incentives, and high-quality implementation is a hallmark of the evidence elements of ESSA. To greatly oversimplify, ESSA moves education policy from the frightening land of sticks to the sweet land of carrots. Even though ESSA specifies that schools performing in the lowest 5% of their states must select proven programs, schools still have a wide range of choices that meet ESSA evidence standards. Beyond school improvement, Title II, Striving Readers, and other federal programs already provide funds to schools promising to adopt proven programs, or at least provide competitive preference to applicants promising to implement qualifying programs. Instead of the top-down, over-specific mandates of NCLB, ESSA provides incentives to use proven programs, but leaves it up to schools to pick the ones that are most appropriate to their needs.

There’s an old (and surely apocryphal) story about two approaches to introducing innovations. After the potato was introduced to Europe from the New World, the aristocracy realized that potatoes were great peasant food, rich in calories, easy to grow, and capable of thriving in otherwise non-arable land. The problem was, the peasants didn’t want to have anything to do with potatoes.

Catherine the Great of Russia approached the problem by capturing a few peasants, tying them up, and force-feeding them potatoes. “See?” said her ministers. “They ate potatoes and didn’t die.”

Louis XIV of France had a better idea. His ministers planted a large garden with potatoes, just outside of Paris, and posted a very sleepy guard over it. The wily peasants watched the whole process, and when the guard was asleep, they dug up the potatoes, ate them with great pleasure, and told all their friends how great they were. The word spread like wildfire, and soon peasants all over France were planting and eating potatoes.

The potato story is not precisely carrots and sticks, but it contains the core message. No matter how beneficial an innovation may be, there is always a risk and/or a cost in being the first on your block to adopt it. That risk/cost can be overcome if the innovation is super cool, or if early innovators gain status (as in Louis XIV’s potato strategy). Alternatively, or in addition, providing incentives to prime the pump, to get early adopters out promoting innovations to their friends, is a key part of a strategy to spread proven innovations.

What isn’t part of any effective dissemination plan is sticks. If people feel they must adopt particular policies from above, they are likely to be resentful, and to reason that if the government has to force you to do something, there must be something wrong with it. The moment the government stops monitoring compliance or policies change, the old innovations are dropped like, well, hot potatoes. That was the Catherine the Great strategy. The ESSA rules for school improvement do require that schools use proven programs but this is very different from being told which specific programs they must use, since they have a lot of proven programs to choose from. If schools still can choose which program to implement, then those who do make the choice will put all their energy into high-quality implementation. This is why, in our Success for All program, we require a vote of 80% of school staff in favor of program adoption.

My more cynical friends tell me that once again, I’m being overly optimistic. States, districts, and schools will pretend to adopt proven programs to get their money, they say, but won’t actually implement anything, or will do so leaving out key components, such as adequate professional development. I’m realistic enough to know that this will in fact happen in some places. Enthusiastic and informed federal, state, and district leadership will help avoid this problem, but it cannot be avoided entirely.

However, America is a very big country. If just a few states, for example, wholeheartedly adopted pro-evidence policies and provided technical assistance in selecting, implementing, evaluating, and continuously improving proven programs, they would surely have a substantial impact on their students. And other states would start to notice. Pretty soon, proven programs would be spreading like French fries.

I hope the age of the stick is over, and the age of the sweet carrot has arrived. ESSA has contributed to this possibility, but visionary state and district leaders will have to embrace the idea that helping and incentivizing schools to use proven programs is the best way to rapidly expand their use. And expanding well-implemented proven programs is the way to improve student achievement on a state or national scale. The innovations will be adopted, thoughtfully implemented, and sustained for the right reason – because they work for kids.



Maximizing the Promise of “Promising” in ESSA

As anyone who reads my blogs is aware, I’m a big fan of the ESSA evidence standards. Yet there are many questions about the specific meaning of the definitions of strength of evidence for given programs. “Strong” is pretty clear: at least one study that used a randomized design and found a significant positive effect. “Moderate” requires at least one study that used a quasi-experimental design and found significant positive effects. There are important technical questions with these, but the basic intention is clear.

Not so with the third category, “promising.” It sounds clear enough: At least one correlational study with significantly positive effects, controlling for pretests or other variables. Yet what does this mean in practice?

The biggest problem is that correlation does not imply causation. Imagine, for example, that a study found a significant correlation between the numbers of iPads in schools and student achievement. Does this imply that more iPads cause more learning? Or could wealthier schools happen to have more iPads (and children in wealthy families have many opportunities to learn that have nothing to do with their schools buying more iPads)? The ESSA definitions do require controlling for other variables, but correlational studies lend themselves to error when they try to control for big differences.
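
To make the iPad example concrete, here is a minimal Python sketch using simulated (not real) data. It assumes a made-up world in which family wealth drives both iPad counts and achievement, and iPads themselves have no effect at all. Every name and number in it (`wealth`, `ipads`, `achievement`, the number of schools, the noise levels) is an illustrative assumption, not something taken from any study.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for a third variable z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)   # fixed seed so the sketch is repeatable
N_SCHOOLS = 5000         # assumed number of simulated schools

# Assumption: wealth influences both variables; iPads do NOT affect achievement.
wealth = [rng.gauss(0, 1) for _ in range(N_SCHOOLS)]
ipads = [w + rng.gauss(0, 1) for w in wealth]
achievement = [w + rng.gauss(0, 1) for w in wealth]

raw = pearson(ipads, achievement)
controlled = partial_corr(
    raw, pearson(ipads, wealth), pearson(achievement, wealth)
)
print(f"raw correlation of iPads and achievement: {raw:.2f}")
print(f"same correlation controlling for wealth:  {controlled:.2f}")
```

In this idealized setup the raw correlation is clearly positive even though iPads do nothing, and it shrinks toward zero once wealth is controlled. That works here only because wealth is measured perfectly. Real correlational studies seldom have such a clean measure of everything that differs between schools, which is where the errors creep in.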

Another problem is that a correlational study may not specify how much of a given resource is needed to show an effect. In the case of the iPad study, did positive effects depend on one iPad per class, or thirty (one per student)? It’s not at all clear.

Despite these problems, the law clearly defines “promising” as requiring correlational studies, and as law-abiding citizens, we must obey. But the “promising” category also allows for some additional categories of studies that can fill some important gaps that otherwise lurk in the ESSA evidence standards.

The most important category involves studies in which schools or teachers (not individual students) were randomly assigned to experimental or control groups. Current statistical norms require that such studies use multilevel analyses, such as Hierarchical Linear Modeling (HLM). In essence, these are analyses at the cluster level (school or teacher), not the student level. The What Works Clearinghouse (WWC) requires use of statistics like HLM in clustered designs.

The problem is that it takes a lot of schools or teachers to have enough power to find significant effects. As a result, many otherwise excellent studies fail to find significant differences, and are not counted as meeting any standard in the WWC.

The Technical Working Group (TWG) that set the standards for our Evidence for ESSA website suggested a solution to this problem. Cluster randomized studies that fail to find significant effects are re-analyzed at the student level. If the student-level outcome is significantly positive, the program is rated as “promising” under ESSA. Note that all experiments are also correlational studies (just using a variable with only two possible values, experimental or control), and experiments in education almost always control for pretests and other factors, so our procedure meets the ESSA evidence standards’ definition for “promising.”

Another situation in which “promising” is used for “just-missed” experiments is in the case of quasi-experiments. Like randomized experiments, these should be analyzed at the cluster level if treatment was at the school or classroom level. So if a quasi-experiment did not find significantly positive outcomes at the cluster level but did find significant positive effects at the student level, we include it as “promising.”

These procedures are important for the ESSA standards, but they are also useful for programs that are not able to recruit a large enough sample of schools or teachers to do randomized or quasi-experimental studies. For example, imagine that a researcher evaluating a school-wide math program for tenth graders could only afford to recruit and serve 10 schools. She might deliberately use a design in which the 10 schools are randomly assigned to use the innovative math program (n=5) or serve as a control group (n=5). A cluster randomized experiment with only 10 clusters is extremely unlikely to find a significant positive effect at the school level, but with perhaps 1000 students per condition, would be very likely to find a significant effect at the student level, if the program is in fact effective. In this circumstance, the program could be rated, using our standard, as “promising,” an outcome true to the ordinary meaning of the word: not proven, but worth further investigation and investment.
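
The gap between school-level and student-level power in the 10-school example can be roughed out with a short Python sketch. It is only a back-of-the-envelope approximation, not the procedure used by Evidence for ESSA or the WWC. The true effect size (+0.20), the intraclass correlation (0.15), and the 200 students per school are all assumptions chosen for illustration, and the helper names (`approx_power`, `se_student_level`, `se_cluster_level`) are hypothetical. It uses a normal approximation, which is if anything generous to a test based on only 10 schools.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def approx_power(effect_size, standard_error, alpha=0.05):
    """Approximate power of a two-sided z-test (upper tail only)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(effect_size / standard_error - z_crit)

def se_student_level(students_per_arm):
    # Outcomes standardized to SD = 1; students treated as independent.
    return sqrt(2 / students_per_arm)

def se_cluster_level(schools_per_arm, students_per_school, icc):
    # Variance of one school's mean: between-school share plus sampling error.
    var_school_mean = icc + (1 - icc) / students_per_school
    return sqrt(2 * var_school_mean / schools_per_arm)

EFFECT_SIZE = 0.20          # assumed true effect, in SD units
ICC = 0.15                  # assumed share of variance between schools
SCHOOLS_PER_ARM = 5         # from the example: 10 schools, 5 per condition
STUDENTS_PER_SCHOOL = 200   # assumed, giving 1000 students per condition

power_students = approx_power(
    EFFECT_SIZE, se_student_level(SCHOOLS_PER_ARM * STUDENTS_PER_SCHOOL)
)
power_schools = approx_power(
    EFFECT_SIZE, se_cluster_level(SCHOOLS_PER_ARM, STUDENTS_PER_SCHOOL, ICC)
)
print(f"approximate power, student-level analysis: {power_students:.2f}")
print(f"approximate power, school-level analysis:  {power_schools:.2f}")
```

Under these assumed inputs, the school-level analysis has only a small chance of detecting the effect, while the student-level analysis is very likely to detect it. The student-level figure treats students as independent of their schools, which is one reason to regard a student-level result as less certain than a school-level one.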

Using the “promising” category in this way may encourage smaller-scale, less well funded researchers to get into the evidence game, albeit at a lower rung on the ladder. But it is not good policy to set such high standards that few programs will qualify. Defining “promising” as we have in Evidence for ESSA does not promise anyone the Promised Land, but it broadens the number of programs schools may select, knowing that they must give up a degree of certainty in exchange for a broader selection of programs.