Evidence for ESSA Celebrates its First Anniversary

On February 28, 2017, we launched Evidence for ESSA (www.evidenceforessa.org), our website providing the evidence supporting educational programs according to the standards laid out in the Every Student Succeeds Act (ESSA), signed into law in December 2015.

Evidence for ESSA began earlier, of course. It really began one day in September, 2016, when I heard leaders of the Institute of Education Sciences (IES) and the What Works Clearinghouse (WWC) announce that the WWC would not be changed to align with the ESSA evidence standards. I realized that no one else was going to create scientifically valid, rapid, and easy-to-use websites providing educators with actionable information on programs meeting ESSA standards. We could do it because our group at Johns Hopkins University, and partners all over the world, had been working for many years creating and updating another website, the Best Evidence Encyclopedia (BEE; www.bestevidence.org). BEE reviews were not primarily designed for practitioners, and they did not align with ESSA standards, but at least we were not starting from scratch.

We assembled a group of large membership organizations to advise us and to help us reach thoughtful superintendents, principals, Title I directors, and others who would be users of the final product. They gave us invaluable advice along the way. We also assembled a technical working group (TWG) of distinguished researchers to advise us on key decisions in establishing our website.

It is interesting to note that we have not been able to obtain adequate funding to support Evidence for ESSA. Instead, it is mostly being written by volunteers and graduate students, all of whom are motivated only by a passion for evidence to improve the education of students.

A year after launch, Evidence for ESSA has been used by more than 36,000 unique users, and I hear that it is very useful in helping states and districts meet the ESSA evidence standards.

We get a lot of positive feedback, as well as complaints and concerns, to which we try to respond rapidly. Feedback has been important in changing some of our policies and correcting some errors, and we are glad to get it.

At this moment we are thoroughly up-to-date on reading and math programs for grades pre-kindergarten to 12, and we are working on science, writing, social-emotional outcomes, and summer school. We are also continuing to update our more academic BEE reviews, which draw from our work on Evidence for ESSA.

In my view, the evidence revolution in education is truly a revolution. If the ESSA evidence standards ultimately prevail, education will at long last join fields such as medicine and agriculture in a dynamic of practice to development to evaluation to dissemination to better practice, in an ascending spiral that leads to constantly improving practices and outcomes.

In a previous revolution, Thomas Jefferson wrote that if it were left to him to decide whether we should have “a government without newspapers, or newspapers without a government,” he would not hesitate to prefer the latter. In our evidence revolution in education, Evidence for ESSA, the WWC, and other evidence sources are our “newspapers,” providing the information that people of good will can use to make wise and informed decisions.

Evidence for ESSA is the work of many dedicated and joyful hands trying to provide our profession with the information it needs to improve student outcomes. The joy in it comes from watching teachers, principals, and superintendents discover new, attainable ways to serve their children.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.


Getting the Best Mileage from Proven Programs

Wouldn’t you love to have a car that gets 200 miles to the gallon? Or one that can go hundreds of miles on a battery charge? Or one that can accelerate from zero to sixty twice as fast as any on the road?

Such cars exist, but you can’t have them. They are experimental vehicles or race cars that can only be used on a track or in a lab. They may be made of exotic materials, or may not carry passengers or groceries, or may be dangerous on real roads.

In working on our Evidence for ESSA website (www.evidenceforessa.org), we see a lot of studies that are like these experimental cars. For example, there are studies of programs in which the researcher or her graduate students actually did the teaching, or in which students used innovative technology with one adult helper for every machine or every few machines. Such studies are fine for theory building or as pilots, but we do not accept them for Evidence for ESSA, because they could never be replicated in real schools.

However, there is a much more common situation to which we pay very close attention. These are studies in which, for example, teachers receive a great deal of training and coaching, but an amount that seems replicable, in principle. For example, we would reject a study in which the experimenters taught the program, but not one in which they taught ordinary teachers how to use the program.

In such studies, the problem comes in dissemination. If the studies validating a program provided a lot of professional development, we accept the program only if the disseminator offers a similar level of professional development, and its estimates of cost and personnel take that level of support into account. Our website states clearly that these services must be provided at a level similar to what was provided in the research if the positive outcomes seen in the research are to be obtained.

The problem is that disseminators often offer schools a form of the program that was never evaluated, to keep costs low. They know that schools don’t like to spend a lot on professional development, and they are concerned that if they require the needed levels of PD or other services or materials, schools won’t buy their program. At the extreme end of this, there are programs that were successfully evaluated using extensive professional development, and then put their teacher’s manual on the web for schools to use for free.

A recent study of a program called Mathalicious illustrated the situation. Mathalicious is an on-line math course for middle school. An evaluation found that teachers randomly assigned to just get a license, with minimal training, did not obtain significant positive impacts, compared to a control group. Those who received extensive on-line training, however, did see a significant improvement in math scores, compared to controls.

When we write our program descriptions, we compare the implementation details in the research to what is said or required on the program’s website. If these do not match, within reason, we try to make clear which elements were necessary for the program’s success.

Going back to the car analogy, our procedures eliminate those amazing cars that can only operate on special tracks, but we accept cars that can run on streets, carry children and groceries, and generally do what cars are expected to do. But if outstanding cars require frequent recharging, or premium gasoline, or have other important requirements, we’ll say so, in consultation with the disseminator.

In our view, evidence in education is not for academics, it’s for kids. If there is no evidence that a program as disseminated benefits kids, we don’t want to mislead educators who are trying to use evidence to benefit their children.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

…But It Was The Very Best Butter! How Tests Can Be Reliable, Valid, and Worthless

I was recently having a conversation with a very well-informed, statistically savvy, and experienced researcher, who was upset that we do not accept researcher- or developer-made measures for our Evidence for ESSA website (www.evidenceforessa.org). “But what if a test is reliable and valid?” she said. “Why shouldn’t it qualify?”

I inwardly sighed. I get this question a lot. So I thought I’d write a blog on the topic, so at least the people who read it, and perhaps their friends and colleagues, will know the answer.

Before I get into the psychometric stuff, I should say in plain English what is going on here, and why it matters. Evidence for ESSA excludes researcher- and developer-made measures because they enormously overstate effect sizes. Marta Pellegrini, at the University of Florence in Italy, recently analyzed data from every reading and math study accepted for review by the What Works Clearinghouse (WWC). She compared outcomes on tests made up by researchers or developers to those that were independent. The average effect sizes across hundreds of studies were +0.52 for researcher/developer-made measures, and +0.19 for independent measures. Almost three to one. We have also made similar comparisons within the very same studies, and the differences in effect sizes averaged 0.48 in reading and 0.45 in math.

Wow.

How could there be such a huge difference? The answer is that researchers’ and developers’ tests often focus on what they knew would be taught in the experimental group but not the control group. A vocabulary experiment might use a test that contains the specific words emphasized in the program. A science experiment might use a test that emphasizes the specific concepts taught in the experimental units but not in the control group. A program using technology might test students on a computer, which the control group did not experience. Researchers and developers may give tests that use response formats like those used in the experimental materials, but not those used in control classes.
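
To put rough numbers on this mechanism, here is a minimal sketch in Python (with simulated scores, purely illustrative and not drawn from any of the studies described above) showing how the same intervention can yield a much larger standardized effect size (Cohen’s d) on a test aligned with the experimental curriculum than on an independent test.

```python
import numpy as np

def cohens_d(treatment, control):
    """Standardized mean difference: (mean_t - mean_c) / pooled SD."""
    nt, nc = len(treatment), len(control)
    pooled_var = ((nt - 1) * np.var(treatment, ddof=1) +
                  (nc - 1) * np.var(control, ddof=1)) / (nt + nc - 2)
    return (np.mean(treatment) - np.mean(control)) / np.sqrt(pooled_var)

rng = np.random.default_rng(42)

# Simulated scores (SD = 15). The treatment adds a modest amount of general skill,
# plus a large extra boost on items the experimental group was specifically taught.
control_independent = rng.normal(100, 15, 500)
treated_independent = rng.normal(103, 15, 500)   # general gain only
control_aligned     = rng.normal(100, 15, 500)
treated_aligned     = rng.normal(108, 15, 500)   # general gain + taught-content advantage

print(round(cohens_d(treated_independent, control_independent), 2))  # roughly +0.2
print(round(cohens_d(treated_aligned, control_aligned), 2))          # roughly +0.5
```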

Very often, researchers or developers have a strong opinion about what students should be learning in their subject, and they make a test that represents to them what all students should know, in an ideal world. However, if only the experimental group experienced content aligned with that curricular philosophy, then they have a huge unfair advantage over the control group.

So how can it be that using even the most reliable and valid tests doesn’t solve this problem?

In Alice in Wonderland, the March Hare tries to fix the Mad Hatter’s watch by opening it and putting butter in the works. This does not help at all, and the March Hare protests, “But it was the very best butter!”

The point of the “very best butter” conversation in Alice in Wonderland is that something can be excellent for one purpose (e.g., spreading on bread), but worse than useless for another (e.g., fixing watches).

Returning to assessment, a test made by a researcher or developer might be ideal for determining whether students are making progress in the intended curriculum, but worthless for comparing experimental to control students.

Reliability (the ability of a test to give the same answer each time it is given) has nothing at all to do with the situation. Validity comes into play where the rubber hits the road (or the butter hits the watch).

Validity can mean many things. As reported in test manuals, it usually just means that a test’s scores correlate with scores on other tests intended to measure the same thing (convergent validity), or possibly that it correlates better with measures it should relate to than with measures it should not, as when a reading test correlates better with other reading tests than with math tests (discriminant validity). However, no test manual ever addresses validity for use as an outcome measure in an experiment. For a test to be valid for that use, it must measure content being pursued equally in experimental and control classes, not content biased toward the experimental curriculum.
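
As a rough illustration of what test-manual validity evidence typically looks like (simulated data, not from any actual test), convergent and discriminant validity come down to a pattern of correlations like the one sketched below. Note that nothing in this pattern says anything about whether the test is fair to the control group in an experiment.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300

# Simulated latent abilities for 300 students (illustrative only).
reading_ability = rng.normal(0, 1, n)
math_ability = 0.5 * reading_ability + rng.normal(0, 0.9, n)   # related, but distinct

# Observed test scores = ability + measurement error.
new_reading_test = reading_ability + rng.normal(0, 0.5, n)
established_reading_test = reading_ability + rng.normal(0, 0.5, n)
math_test = math_ability + rng.normal(0, 0.5, n)

# Convergent validity: high correlation with another reading test.
print(round(np.corrcoef(new_reading_test, established_reading_test)[0, 1], 2))
# Discriminant validity: lower correlation with a math test.
print(round(np.corrcoef(new_reading_test, math_test)[0, 1], 2))
```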

A test that reports very high reliability and validity in its manual or research report may be admirable for many purposes. But like “the very best butter” for fixing watches, a researcher- or developer-made measure is worse than worthless for evaluating experimental programs, no matter how reliable and valid it may be.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Half a Worm: Why Education Policy Needs High Evidence Standards

There is a very old joke that goes like this:

What’s the second-worst thing to find in your apple?  A worm.

What’s the worst?  Half a worm.

The ESSA evidence standards provide clearer definitions of “strong,” “moderate,” and “promising” levels of evidence than have ever existed in law or regulation. Yet they still leave room for interpretation.  The problem is that if you define evidence-based too narrowly, too few programs will qualify.  But if you define evidence-based too broadly, it loses its meaning.

We’ve already experienced what happens with a too-permissive definition of evidence.  In No Child Left Behind, “scientifically-based research” was famously mentioned 110 times.  The impact of this, however, was minimal, as everyone soon realized that the term “scientifically-based” could be applied to just about anything.

Today, we are in a much better position than we were in 2002 to insist on relatively strict evidence of effectiveness, both because we have better agreement about what constitutes evidence of effectiveness and because we have a far greater number of programs that would meet a high standard.  The ESSA definitions are a good consensus example.  Essentially, they define programs with “strong evidence of effectiveness” as those with at least one randomized study showing positive impacts using rigorous methods, and “moderate evidence of effectiveness” as those with at least one quasi-experimental study.  “Promising” is less well-defined, but requires at least one correlational study with a positive outcome.

Where the half-a-worm concept comes in, however, is when the definition of “evidence-based” is stretched further still. For example, ESSA also recognizes programs supported only by a “strong theory.” To me, that goes too far and begins to water down the concept. What program in all of education could not justify a “strong theory of action”?

Further, even in the top categories, there are important questions about what qualifies. In school-level studies, should we insist on school-level analyses (i.e., HLM)? Every methodologist would say yes, as I do, but this is not specified. Should we accept researcher-made measures? I say no, based on a great deal of evidence indicating that such measures inflate effects.

Fortunately, due to investments made by IES, i3, and other funders, the number of programs that meet strict standards has grown rapidly. Our Evidence for ESSA website (www.evidenceforessa.org) has so far identified 101 PK-12 reading and math programs, using strict standards consistent with the ESSA definitions. More than 60% of these meet the “strong” standard. There are enough proven programs in every subject and grade level to give educators real choices, and we add more each week.

This large number of programs meeting strict evidence standards means that insisting on rigorous evaluations, within reason, does not mean that we end up with too few programs to choose among. We can have our apple pie and eat it, too.

I’d love to see federal programs of all kinds encouraging the use of programs with rigorous evidence of effectiveness. But I’d rather see a few programs that meet a strict definition of “proven” than a lot of programs that meet only a loose one. Twenty good apples are much better than applesauce of dubious origins!

This blog is sponsored by the Laura and John Arnold Foundation

Where Will the Capacity for School-by-School Reform Come From?

In recent months, I’ve had a number of conversations with state and district leaders about implementing the ESSA evidence standards. To its credit, ESSA diminishes federal micromanaging and gives more autonomy to states and locals, but now that the states and locals are in charge, how are they going to achieve greater success? One state department leader described his situation under ESSA as being like that of a dog who has been chasing cars for years and then finally catches one. Now what?

ESSA encourages states and local districts to help schools adopt and effectively implement proven programs. For school improvement, portions of Title II, and Striving Readers, ESSA requires use of proven programs. Initially, state and district folks were worried about how to identify proven programs, though things are progressing on that front (see, for example, www.evidenceforessa.org). But now I’m hearing a lot more concern about capacity to help all those individual schools do needs assessments, select proven programs aligned with their needs, and implement them with thought, care, and knowledgeable application of implementation science.

I’ve been in several meetings where state and local folks ask federal folks how they are supposed to implement ESSA. “Regional educational labs will help you!” they suggest. With all due respect to my friends in the RELs, this is going to be a heavy lift. There are ten of them, in a country with about 52,000 Title I schoolwide projects. So each REL is responsible for, on average, five states, 1,400 districts, and 5,200 high-poverty schools. For this reason, RELs have long been primarily expected to work with state departments. There are just not enough of them to serve many individual districts, much less schools.

State departments of education and districts can help schools select and implement proven programs. For example, they can disseminate information on proven programs, make sure that recommended programs have adequate capacity, and perhaps hold effective methods “fairs” to introduce people in their state to program providers. But states and districts rarely have capacity to implement proven programs themselves. It’s very hard to build state and local capacity to support specific proven programs. For example, when state or district funding declines, the first departments to be cut back or eliminated are often those responsible for professional development. For this reason, few state departments or districts have large, experienced professional development staffs. Further, constant changes in state and local superintendents, boards, and funding levels make it difficult to build up professional development capacity over a period of years.

Because of these problems, schools have often been left to make up their own approaches to school reform. This happened on a wide scale in the NCLB School Improvement Grants (SIG) program, where federal mandates specified particular structural changes but left the essentials (teaching, curriculum, and professional development) up to the locals. The MDRC evaluation of SIG schools found that they made no better gains than similar non-SIG schools.

Yet there is substantial underutilized capacity available to help schools across the U.S. to adopt proven programs. This capacity resides in the many organizations (both non-profit and for-profit) that originally created the proven programs, provided the professional development that caused them to meet the “proven” standard, and likely built infrastructure to ensure quality, sustainability, and growth potential.

The organizations that created proven programs have obvious advantages (their programs are known to work), but they also have several less obvious ones. One is that organizations built to support a specific program have a dedicated focus on that program. They build expertise on every aspect of the program. As they grow, they hire capable coaches, usually ones who have already shown their skills in implementing or leading the program at the building level. Unlike states and districts, which often live in constant turmoil, reform organizations and for-profit professional development organizations are likely to have stable leadership over time. In fact, for a high-poverty school engaged with a program provider, that provider and its leadership may be the only partner stable enough to help it improve its core teaching over many years.

State and district leaders play major roles in accountability, management, quality assurance, and personnel, among many other issues. With respect to implementation of proven programs, they have to set up conditions in which schools can make informed choices, monitor the performance of provider organizations, evaluate outcomes, and ensure that schools have the resources and supports they need. But truly reforming hundreds of schools in need of proven programs one at a time is not realistic for most states and districts, at least not without help. It makes a lot more sense to seek capacity in organizations designed to provide targeted professional development services on proven programs, and then coordinate with these providers to ensure benefits for students.

This blog is sponsored by the Laura and John Arnold Foundation

Little Sleepers: Long-Term Effects of Preschool

In education research, a “sleeper effect” is not a way to get all of your preschoolers to take naps. Instead, it is an outcome of a program that appears not immediately after the end of the program, but some time afterwards, usually a year or more. For example, the mother of all sleeper effects was the Perry Preschool study, which found positive outcomes at the end of preschool but no differences throughout elementary school. Then positive follow-up outcomes began to show up on a variety of important measures in high school and beyond.

Sleeper effects are very rare in education research. To see why, imagine a study of a math program for third graders that found no differences between program and control students at the end of third grade, but then showed a large and significant difference in fourth grade or later. Long-term effects of effective programs are often seen, but how can there be long-term effects if there were no short-term effects along the way? Sleeper effects are so rare that many early childhood researchers have serious doubts about the validity of the long-term Perry Preschool findings.

I was thinking about sleeper effects recently because we have just added preschool studies to our Evidence for ESSA website. In reviewing the key studies, I found myself once again reading an extraordinary 2009 study by Mark Lipsey and Dale Farran.

The study randomly assigned Head Start classes in rural Tennessee to one of three conditions. Some were assigned to use a program called Bright Beginnings, which had a strong pre-literacy focus. Some were assigned to use Creative Curriculum, a popular constructivist/developmental curriculum with little emphasis on literacy. The remainder were assigned to a control group, in which teachers used whatever methods they ordinarily used.

Note that this design is different from that of the usual preschool studies frequently reported in the press, which compare preschool to no preschool. In this study, all students were in preschool. The only difference was how they were taught.

The results immediately after the preschool program were not astonishing. Bright Beginnings students scored best on literacy and language measures (average effect size = +0.21 for literacy, +0.11 for language), though the differences were not significant at the school level. There were no differences at all between Creative Curriculum and control schools.

Where the outcomes became interesting was in the later years. Ordinarily in education research, effects diminish over time after the treatment ends. In the Bright Beginnings/Creative Curriculum study, the outcomes were measured again when students were in third grade, four years after they left preschool. Most students could be located because the outcome was the Tennessee standardized test, so scores could be found as long as students were still in Tennessee schools.

On third grade reading, former Bright Beginnings students scored better than former controls, and the difference was both statistically significant and substantial (effect size = +0.27).

In a review of early childhood programs at www.bestevidence.org, our team found that across 16 programs emphasizing literacy as well as language, effect sizes did not diminish in literacy at the end of kindergarten, and they actually doubled on language measures (from +0.08 in preschool to +0.15 in kindergarten).

If sleeper effects (or at least maintenance on follow-up) are so rare in education research, why did they appear in these studies of preschool? There are several possibilities.

The most likely explanation is that it is difficult to measure outcomes among four-year-olds. They can be squirrelly and inconsistent. If a pre-kindergarten program had a true and substantial impact on children’s literacy or language, measures at the end of preschool may not detect it as well as measures a year later, because kindergartners and kindergarten skills are easier to measure.

Whatever the reason, the evidence suggests that effects of particular preschool approaches may show up later than the end of preschool. This observation, and specifically the Bright Beginnings evaluation, may indicate that in the long run it matters a great deal how students are taught in preschool. Until we find replicable models of preschool, or pre-k to 3 interventions, that have long-term effects on reading and other outcomes, we cannot sleep. Our little sleepers are counting on us to ensure them a positive future.

This blog is sponsored by the Laura and John Arnold Foundation

Maximizing the Promise of “Promising” in ESSA

As anyone who reads my blogs is aware, I’m a big fan of the ESSA evidence standards. Yet there are many questions about the specific meaning of the definitions of strength of evidence for given programs. “Strong” is pretty clear: at least one study that used a randomized design and found a significant positive effect. “Moderate” requires at least one study that used a quasi-experimental design and found significant positive effects. There are important technical questions with these, but the basic intention is clear.

Not so with the third category, “promising.” It sounds clear enough: At least one correlational study with significantly positive effects, controlling for pretests or other variables. Yet what does this mean in practice?

The biggest problem is that correlation does not imply causation. Imagine, for example, that a study found a significant correlation between the numbers of iPads in schools and student achievement. Does this imply that more iPads cause more learning? Or could wealthier schools happen to have more iPads (and children in wealthy families have many opportunities to learn that have nothing to do with their schools buying more iPads)? The ESSA definitions do require controlling for other variables, but correlational studies lend themselves to error when they try to control for big differences.

Another problem is that a correlational study may not specify how much of a given resource is needed to show an effect. In the case of the iPad study, did positive effects depend on one iPad per class, or thirty (one per student)? It’s not at all clear.

Despite these problems, the law clearly defines “promising” as requiring correlational studies, and as law-abiding citizens, we must obey. But the “promising” category also allows for some additional categories of studies that can fill some important gaps that otherwise lurk in the ESSA evidence standards.

The most important category involves studies in which schools or teachers (not individual students) were randomly assigned to experimental or control groups. Current statistical norms require that such studies use multilevel analyses, such as Hierarchical Linear Modeling (HLM). In essence, these are analyses at the cluster level (school or teacher), not the student level. The What Works Clearinghouse (WWC) requires use of statistics like HLM in clustered designs.

The problem is that it takes a lot of schools or teachers to have enough power to find significant effects. As a result, many otherwise excellent studies fail to find significant differences, and are not counted as meeting any standard in the WWC.

The Technical Working Group (TWG) that set the standards for our Evidence for ESSA website suggested a solution to this problem. Cluster randomized studies that fail to find significant effects are re-analyzed at the student level. If the student-level outcome is significantly positive, the program is rated as “promising” under ESSA. Note that all experiments are also correlational studies (just using a variable with only two possible values, experimental or control), and experiments in education almost always control for pretests and other factors, so our procedure meets the ESSA evidence standards’ definition for “promising.”
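
To make the two analyses concrete, here is a minimal sketch of a small cluster-randomized trial analyzed both ways, using simulated data (the Python statsmodels package is assumed; none of the numbers come from a real study): a school-level mixed model of the kind the WWC expects, and the student-level re-analysis we consider when deciding whether a study can qualify as “promising.”

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a small cluster-randomized trial: 10 schools, 100 students each (illustrative only).
rng = np.random.default_rng(3)
rows = []
for school in range(10):
    treat = int(school < 5)                 # 5 treatment schools, 5 control schools
    school_effect = rng.normal(0, 0.2)      # between-school variation
    for _ in range(100):
        rows.append({"school": school,
                     "treat": treat,
                     "score": 0.15 * treat + school_effect + rng.normal(0, 1)})
df = pd.DataFrame(rows)

# Cluster-aware analysis: random intercept for school (analogous to HLM).
cluster_fit = smf.mixedlm("score ~ treat", df, groups=df["school"]).fit()
print(round(cluster_fit.pvalues["treat"], 3))   # often non-significant with only 10 clusters

# Student-level analysis, ignoring clustering (the basis for a "promising" rating).
student_fit = smf.ols("score ~ treat", df).fit()
print(round(student_fit.pvalues["treat"], 3))   # far more likely to be significant
```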

Another situation in which “promising” is used for “just-missed” experiments is in the case of quasi-experiments. Like randomized experiments, these should be analyzed at the cluster level if treatment was at the school or classroom level. So if a quasi-experiment did not find significantly positive outcomes at the cluster level but did find significant positive effects at the student level, we include it as “promising.”

These procedures are important for the ESSA standards, but they are also useful for programs that are not able to recruit a large enough sample of schools or teachers to do randomized or quasi-experimental studies. For example, imagine that a researcher evaluating a school-wide math program for tenth graders could only afford to recruit and serve 10 schools. She might deliberately use a design in which the 10 schools are randomly assigned to use the innovative math program (n=5) or serve as a control group (n=5). A cluster randomized experiment with only 10 clusters is extremely unlikely to find a significant positive effect at the school level, but with perhaps 1000 students per condition, would be very likely to find a significant effect at the student level, if the program is in fact effective. In this circumstance, the program could be rated, using our standard, as “promising,” an outcome true to the ordinary meaning of the word: not proven, but worth further investigation and investment.
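
The statistical reason is the design effect of clustering. Here is a back-of-envelope worked example for the hypothetical 10-school trial just described, assuming an intraclass correlation of 0.20 (a plausible but purely illustrative value, not taken from any particular study):

```python
# Back-of-envelope design effect for the hypothetical 10-school trial described above.
students_per_condition = 1000
schools_per_condition = 5
cluster_size = students_per_condition // schools_per_condition    # 200 students per school
icc = 0.20                                                        # assumed intraclass correlation

design_effect = 1 + (cluster_size - 1) * icc                      # 1 + 199 * 0.20 = 40.8
effective_n_per_condition = students_per_condition / design_effect

print(round(design_effect, 1), round(effective_n_per_condition, 1))   # about 40.8 and 24.5
```

With an effective sample of only a couple dozen students per condition, even a genuinely effective program is unlikely to reach statistical significance at the cluster level, which is exactly the situation the “promising” category is meant to recognize.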

Using the “promising” category in this way may encourage smaller-scale, less well-funded researchers to get into the evidence game, albeit at a lower rung on the ladder. After all, it is not good policy to set standards so high that few programs can qualify. Defining “promising” as we have in Evidence for ESSA does not promise anyone the Promised Land, but it broadens the range of programs schools may select from, knowing that they give up a degree of certainty in exchange for a wider selection.