The Good, the Bad, and the (Un)Promising

The ESSA evidence standards are finally beginning to matter. States are starting the process that will lead them to make school improvement awards to their lowest-achieving schools. The ESSA law is clear that for schools to qualify for these awards, they must agree to implement programs that meet the strong, moderate, or promising levels of the ESSA evidence standards. This is very exciting for those who believe in the power of proven programs to transform schools and benefit children. It is good news for kids, for teachers, and for our profession.

But inevitably, there is bad news with the good. If evidence is to be a standard for government funding, there are bound to be people who disseminate programs lacking high-quality evidence who will seek to bend the definitions to declare themselves “proven.” And there are also bound to be schools and districts that want to keep using what they have always used, or to keep choosing programs based on factors other than evidence, while doing the minimum the law requires.

The battleground is the ESSA “promising” criterion. “Strong” programs are pretty well defined as having significant positive evidence from high-quality randomized studies. “Moderate” programs are pretty well defined as having significant positive evidence from high-quality matched studies. Both “strong” and “moderate” are clearly defined in Evidence for ESSA (www.evidenceforessa.org), and, with a bit of translation, by the What Works Clearinghouse, both of which list specific programs that meet or do not meet these standards.

“Promising,” on the other hand, is kind of . . . squishy. The ESSA evidence standards do define programs meeting “promising” as ones that have statistically significant effects in “well-designed and well-implemented” correlational studies, with controls for inputs (e.g., pretests). This sounds good, but it is hard to nail down in practice. I’m seeing and hearing about a category of studies that perfectly illustrates the problem. Imagine that a developer commissions a study of a form of software. A set of schools and their 1000 students are assigned to use the software, while control schools and their 1000 students do not have access to the software but continue with business as usual.

Computers routinely produce “trace data” that automatically tells researchers all sorts of things about how much students used the software, what they did with it, how successful they were, and so on.

The problem is that typically, large numbers of students given software do not use it. They may never even hit a key, or they may use the software so little that the researchers rule the software use to be effectively zero. So in a not unusual situation, let’s assume that in the treatment group, the one that got the software, only 500 of the 1000 students actually used the software at an adequate level.

Now here’s the rub. Almost always, the 500 students will out-perform the 1000 controls, even after controlling for pretests. Yet this would be likely to happen even if the software were completely ineffective.

To understand this, think about the 500 students who did use the software and the 500 who did not. The users are probably more conscientious, hard-working, and well-organized. The 500 non-users are more likely to be absent a lot, to fool around in class, or to use their technology to play computer games or go on (non-school-related) social media rather than to do math or science, for example. Even if the pretest scores in the user and non-user groups were identical, they are not identical students, because their behavior with the software is not equal.

I once visited a secondary school in England that was a specially-funded model for universal use of technology. Along with colleagues, I went into several classes. The teachers were teaching their hearts out, making constant use of the technology that all students had on their desks. The students were well-behaved, but just a few dominated the discussion. Maybe the others were just a bit shy, we thought. From the front of each class, this looked like the classroom of the future.

But then, we filed to the back of each class, where we could see over students’ shoulders. And we immediately saw what was going on. Maybe 60 or 70 percent of the students were actually on social media unrelated to the content, paying no attention to the teacher or instructional software!


Now imagine that a study compared the 30-40% of students who were actually using the computers to students with similar pretests in other schools who had no computers at all. Again, the users would look terrific, but this is not a fair comparison, because all the goof-offs and laggards in the computer school had selected themselves out of the study while goof-offs and laggards in the control group were still included.

Rigorous researchers use a method called intent-to-treat, which in this case would include every student, whether or not they used the software or played non-educational computer games. “Not fair!” responds the software developer, because intent-to-treat includes a lot of students who never touched a key except to use social media. No sophisticated researcher accepts such an argument, however, because including only users gives the experimental group a big advantage.
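The difference between the two analyses can be sketched with a small simulation. This is only an illustration under stated assumptions, not data from any real study: the software is given a true effect of exactly zero, a hypothetical "conscientiousness" trait drives both software use and pretest-to-posttest gains, and gain scores stand in for controlling for pretests. The function name and all numeric settings are made up for the example.

```python
import random
import statistics


def simulate_software_study(n_per_group=1000, n_users=500, seed=0):
    """Simulate a study of software whose true effect is exactly zero."""
    rng = random.Random(seed)

    def gain(conscientiousness):
        # Assumption: gains depend on the student, never on the software.
        return 5.0 + 3.0 * conscientiousness + rng.gauss(0.0, 3.0)

    # Both groups are drawn from the same population of students.
    treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]

    # Assumption: the most conscientious treated students are the users.
    treated.sort(reverse=True)
    treated_gains = [gain(c) for c in treated]
    control_gains = [gain(c) for c in controls]
    user_gains = treated_gains[:n_users]

    control_mean = statistics.mean(control_gains)
    return {
        # Users only versus all controls: the biased comparison.
        "users_only": statistics.mean(user_gains) - control_mean,
        # Everyone assigned to the software versus all controls.
        "intent_to_treat": statistics.mean(treated_gains) - control_mean,
    }


if __name__ == "__main__":
    for name, estimate in simulate_software_study().items():
        print(f"{name}: estimated advantage in gain = {estimate:.2f}")
```

Under these assumptions the users-only comparison shows a sizable apparent advantage for software that does nothing, while the intent-to-treat comparison stays close to zero.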

Here’s what is happening at the policy level. Software developers are using data from studies that only include the students who made adequate use of the software. They are then claiming that such studies are correlational and meet the “promising” standard of ESSA.

Those who make this argument are correct in saying that such studies are correlational. But these studies are very, very, very bad, because they are biased toward the treatment. The ESSA standards specify well-designed and well-implemented studies, and these studies may be correlational, but they are not well-designed or well-implemented. Software developers and other vendors are very concerned about the ESSA evidence standards, and some may use the “promising” category as a loophole. Evidence for ESSA does not accept such studies, even as promising, and the What Works Clearinghouse does not even have any category that corresponds to “promising.” Yet vendors are flooding state departments of education and districts with studies they claim to meet the ESSA standards, though in the lowest category.

Recently, I heard something that could be a solution to this problem. Apparently, some states are announcing that for school improvement grants, and any other purpose that has financial consequences, they will only accept programs with “strong” and “moderate” evidence. They have the right to do this; the federal law says school improvement grants must support programs that at least meet the “promising” standard, but it does not say states cannot set a higher minimum standard.

One might argue that ignoring “promising” studies is going too far. In Evidence for ESSA (www.evidenceforessa.org), we accept studies as “promising” if they have weaknesses that do not lead to bias, such as clustered studies that were significant at the student but not the cluster level. But the danger posed by studies claiming to fit “promising” using biased designs is too great. Until the feds fix the definition of “promising” to exclude bias, the states may have to solve it for themselves.

I hope there will be further development of the “promising” standard to focus it on lower-quality but unbiased evidence, but as things are now, perhaps it is best for states themselves to declare that “promising” is no longer promising.

Eventually, evidence will prevail in education, as it has in many other fields, but on the way to that glorious future, we are going to have to make some adjustments. Requiring that “promising” be truly promising would be a good place to begin.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 


What if a Sears Catalogue Married Consumer Reports?

When I was in high school, I had a summer job delivering Sears catalogues. I borrowed my mother’s old Chevy station wagon and headed out fully laden into the wilds of the Maryland suburbs of Washington.

I immediately learned something surprising. I thought of a Sears catalogue as a big book of advertisements. But the people to whom I was delivering them often saw it as a book of dreams. They were excited to get their catalogues. When a neighborhood saw me coming, I became a minor celebrity.

Thinking back on those days, I was thinking about our Evidence for ESSA website (www.evidenceforessa.org). I realized that what I wanted it to be was a way to communicate to educators the wonderful array of programs they could use to improve outcomes for their children. Sort of like a Sears catalogue for education. However, it provides something that a Sears catalogue does not: Evidence about the effectiveness of each catalogue entry. Imagine a Sears catalogue that was married to Consumer Reports. Where a traditional Sears catalogue describes a kitchen gadget, “It slices and dices, with no muss, no fuss!”, the marriage with Consumer Reports would instead say, “Effective at slicing and dicing, but lots of muss. Also fuss.”

If this marriage took place, it might take some of the fun out of the Sears catalogue (making it a book of realities rather than a book of dreams), but it would give confidence to buyers, and help them make wise choices. And with proper wordsmithing, it could still communicate both enthusiasm, when warranted, and truth. But even more, it could have a huge impact on the producers of consumer goods, because they would know that their products would need to be rigorously tested and found to be able to back up their claims.

In enhancing the impact of research on the practice of education, we have two problems that have to be solved. Just like the “Book of Dreams,” we have to help educators know the wonderful array of programs available to them, programs they may never have heard of. And beyond the particular programs, we need to build excitement about the opportunity to select among proven programs.

In education, we make choices not for ourselves, but on behalf of our children. Responsible educators want to choose programs and practices that improve the achievement of their students. Something like a marriage of the Sears catalogue and Consumer Reports is necessary to address educators’ dreams and their need for information on program outcomes. Users should be both excited and informed. Information usually does not excite. Excitement usually does not inform. We need a way to do both.

In Evidence for ESSA, we have tried to give educators a sense that there are many solutions to enduring instructional problems (excitement), and descriptions of programs, outcomes, costs, staffing requirements, professional development, and effects for particular subgroups, for example (information).

In contrast to Sears catalogues, Evidence for ESSA is light (Sears catalogues were huge, and ultimately broke the springs on my mother’s station wagon). In contrast to Consumer Reports, Evidence for ESSA is free.  Every marriage has its problems, but our hope is that we can capture the excitement and the information from the marriage of these two approaches.


Picture source: Nationaal Archief, the Netherlands

 

Evidence for ESSA Celebrates its First Anniversary

On February 28, 2017, we launched Evidence for ESSA (www.evidenceforessa.org), our website providing the evidence to support educational programs according to the standards laid out in the Every Student Succeeds Act, signed into law in December 2015.

Evidence for ESSA began earlier, of course. It really began one day in September 2016, when I heard leaders of the Institute of Education Sciences (IES) and the What Works Clearinghouse (WWC) announce that the WWC would not be changed to align with the ESSA evidence standards. I realized that no one else was going to create scientifically valid, rapid, and easy-to-use websites providing educators with actionable information on programs meeting ESSA standards. We could do it because our group at Johns Hopkins University, and partners all over the world, had been working for many years creating and updating another website, the Best Evidence Encyclopedia (BEE; www.bestevidence.org). BEE reviews were not primarily designed for practitioners and they did not align with ESSA standards, but at least we were not starting from scratch.

We assembled a group of large membership organizations to advise us and to help us reach thoughtful superintendents, principals, Title I directors, and others who would be users of the final product. They gave us invaluable advice along the way. We also assembled a technical working group (TWG) of distinguished researchers to advise us on key decisions in establishing our website.

It is interesting to note that we have not been able to obtain adequate funding to support Evidence for ESSA. Instead, it is mostly being written by volunteers and graduate students, all of whom are motivated only by a passion for evidence to improve the education of students.

A year after launch, Evidence for ESSA has been used by more than 36,000 unique users, and I hear that it is very useful in helping states and districts meet the ESSA evidence standards.

We get a lot of positive feedback, as well as complaints and concerns, to which we try to respond rapidly. Feedback has been important in changing some of our policies and correcting some errors and we are glad to get it.

At this moment we are thoroughly up-to-date on reading and math programs for grades pre-kindergarten to 12, and we are working on science, writing, social-emotional outcomes, and summer school. We are also continuing to update our more academic BEE reviews, which draw from our work on Evidence for ESSA.

In my view, the evidence revolution in education is truly a revolution. If the ESSA evidence standards ultimately prevail, education will at long last join fields such as medicine and agriculture in a dynamic of practice to development to evaluation to dissemination to better practice, in an ascending spiral that leads to constantly improving practices and outcomes.

In a previous revolution, Thomas Jefferson said, “If I had to choose between government without newspapers and newspapers without government, I’d take the newspapers.” In our evidence revolution in education, Evidence for ESSA, the WWC, and other evidence sources are our “newspapers,” providing the information that people of good will can use to make wise and informed decisions.

Evidence for ESSA is the work of many dedicated and joyful hands trying to provide our profession with the information it needs to improve student outcomes. The joy in it is the joy in seeing teachers, principals, and superintendents see new, attainable ways to serve their children.


Getting the Best Mileage from Proven Programs

Wouldn’t you love to have a car that gets 200 miles to the gallon? Or one that can go hundreds of miles on a battery charge? Or one that can accelerate from zero to sixty twice as fast as any on the road?

Such cars exist, but you can’t have them. They are experimental vehicles or race cars that can only be used on a track or in a lab. They may be made of exotic materials, or may not carry passengers or groceries, or may be dangerous on real roads.

In working on our Evidence for ESSA website (www.evidenceforessa.org), we see a lot of studies that are like these experimental cars. For example, there are studies of programs in which the researcher or her graduate students actually did the teaching, or in which students used innovative technology with one adult helper for every machine or every few machines. Such studies are fine for theory building or as pilots, but we do not accept them for Evidence for ESSA, because they could never be replicated in real schools.

However, there is a much more common situation to which we pay very close attention. These are studies in which, for example, teachers receive a great deal of training and coaching, but an amount that seems replicable, in principle. For example, we would reject a study in which the experimenters taught the program, but not one in which they taught ordinary teachers how to use the program.

In such studies, the problem comes in dissemination. If studies validating a program provided a lot of professional development, we would accept it only if the disseminator provides a similar level of professional development, and their estimates of cost and personnel take this level of professional development into account. We put on our website clear expectations that these services be provided at a level similar to what was provided in the research, if the positive outcomes seen in the research are to be obtained.

The problem is that disseminators often offer schools a form of the program that was never evaluated, to keep costs low. They know that schools don’t like to spend a lot on professional development, and they are concerned that if they require the needed levels of PD or other services or materials, schools won’t buy their program. At the extreme end of this, there are programs that were successfully evaluated using extensive professional development, whose disseminators then simply put the teacher’s manual on the web for schools to use for free.

A recent study of a program called Mathalicious illustrated the situation. Mathalicious is an on-line math course for middle school. An evaluation found that teachers randomly assigned to just get a license, with minimal training, did not obtain significant positive impacts, compared to a control group. Those who received extensive on-line training, however, did see a significant improvement in math scores, compared to controls.

When we write our program descriptions, we compare program implementation details in the research to what is said or required on the program’s website. If these do not match, within reason, we try to make clear which key elements were necessary for success.

Going back to the car analogy, our procedures eliminate those amazing cars that can only operate on special tracks, but we accept cars that can run on streets, carry children and groceries, and generally do what cars are expected to do. But if outstanding cars require frequent recharging, or premium gasoline, or have other important requirements, we’ll say so, in consultation with the disseminator.

In our view, evidence in education is not for academics, it’s for kids. If there is no evidence that a program as disseminated benefits kids, we don’t want to mislead educators who are trying to use evidence to benefit their children.


…But It Was The Very Best Butter! How Tests Can Be Reliable, Valid, and Worthless

I was recently having a conversation with a very well-informed, statistically savvy, and experienced researcher, who was upset that we do not accept researcher- or developer-made measures for our Evidence for ESSA website (www.evidenceforessa.org). “But what if a test is reliable and valid,” she said, “Why shouldn’t it qualify?”

I inwardly sighed. I get this question a lot. So I thought I’d write a blog on the topic, so at least the people who read it, and perhaps their friends and colleagues, will know the answer.

Before I get into the psychometric stuff, I should say in plain English what is going on here, and why it matters. Evidence for ESSA excludes researcher- and developer-made measures because they enormously overstate effect sizes. Marta Pellegrini, at the University of Florence in Italy, recently analyzed data from every reading and math study accepted for review by the What Works Clearinghouse (WWC). She compared outcomes on tests made up by researchers or developers to those that were independent. The average effect sizes across hundreds of studies were +0.52 for researcher/developer-made measures, and +0.19 for independent measures. Almost three to one. We have also made similar comparisons within the very same studies, and the differences in effect sizes averaged 0.48 in reading and 0.45 in math.

Wow.

How could there be such a huge difference? The answer is that researchers’ and developers’ tests often focus on what they knew would be taught in the experimental group but not the control group. A vocabulary experiment might use a test that contains the specific words emphasized in the program. A science experiment might use a test that emphasizes the specific concepts taught in the experimental units but not in the control group. A program using technology might test students on a computer, which the control group did not experience. Researchers and developers may give tests that use response formats like those used in the experimental materials, but not those used in control classes.

Very often, researchers or developers have a strong opinion about what students should be learning in their subject, and they make a test that represents to them what all students should know, in an ideal world. However, if only the experimental group experienced content aligned with that curricular philosophy, then they have a huge unfair advantage over the control group.
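The mechanism can be sketched in a few lines of simulation. Everything here is a hypothetical illustration, not a reanalysis of the WWC studies mentioned above: the program is assumed to have a modest true effect of 0.2 on general skill, and the developer-made test is assumed to give experimental students an extra 0.4 points simply because it covers content only they were taught. The function names and numbers are invented for the example.

```python
import random
import statistics


def cohens_d(experimental, control):
    """Standardized mean difference using the pooled SD (equal group sizes)."""
    pooled_var = (statistics.variance(experimental) + statistics.variance(control)) / 2
    return (statistics.mean(experimental) - statistics.mean(control)) / pooled_var ** 0.5


def simulate_measures(n_per_group=2000, true_effect=0.2, alignment_bonus=0.4, seed=0):
    """Return (effect size on independent test, effect size on developer-made test)."""
    rng = random.Random(seed)
    exp_skill = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    ctl_skill = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]

    def score(skill, bonus=0.0):
        # A test score is general skill plus measurement noise, plus any bonus
        # from having been taught the exact content that the test covers.
        return skill + bonus + rng.gauss(0.0, 0.5)

    # Independent test: content pursued equally in both groups, so no bonus.
    independent_d = cohens_d([score(s) for s in exp_skill],
                             [score(s) for s in ctl_skill])
    # Developer-made test: only the experimental group was taught its content.
    developer_d = cohens_d([score(s, alignment_bonus) for s in exp_skill],
                           [score(s) for s in ctl_skill])
    return independent_d, developer_d


if __name__ == "__main__":
    independent_d, developer_d = simulate_measures()
    print(f"independent measure: d = {independent_d:+.2f}")
    print(f"developer-made measure: d = {developer_d:+.2f}")
```

With these made-up settings, the same students and the same program produce a much larger effect size on the aligned test, even though both tests could be perfectly reliable.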

So how can it be that using even the most reliable and valid tests doesn’t solve this problem?

In Alice in Wonderland, the Mad Hatter tries to fix the White Rabbit’s watch by opening it and putting butter in the works. This does not help at all, and the Mad Hatter remarks, “But it was the very best butter!”

The point of the “very best butter” conversation in Alice in Wonderland is that something can be excellent for one purpose (e.g., spreading on bread), but worse than useless for another (e.g., fixing watches).

Returning to assessment, a test made by a researcher or developer might be ideal for determining whether students are making progress in the intended curriculum, but worthless for comparing experimental to control students.

Reliability (the ability of a test to give the same answer each time it is given) has nothing at all to do with the situation. Validity comes into play where the rubber hits the road (or the butter hits the watch).

Validity can mean many things. As reported in test manuals, it usually just means that a test’s scores correlate with other scores on tests intended to measure the same thing (convergent validity), or possibly that it correlates better with things it should correlate than with things it shouldn’t, as when a reading test correlates better with other reading tests than with math tests (discriminant validity). However, no test manual ever addresses validity for use as an outcome measure in an experiment. For a test to be valid for that use, it must measure content being pursued equally in experimental and control classes, not biased toward the experimental curriculum.

Any test that reports very high reliability and validity in its test manual or research report may be admirable for many purposes, but like “the very best butter” for fixing watches, a researcher- or developer-made measure is worse than worthless for evaluating experimental programs, no matter how high it is in reliability and validity.


Half a Worm: Why Education Policy Needs High Evidence Standards

There is a very old joke that goes like this:

What’s the second-worst thing to find in your apple?  A worm.

What’s the worst?  Half a worm.

The ESSA evidence standards provide clearer definitions of “strong,” “moderate,” and “promising” levels of evidence than have ever existed in law or regulation. Yet they still leave room for interpretation.  The problem is that if you define evidence-based too narrowly, too few programs will qualify.  But if you define evidence-based too broadly, it loses its meaning.

We’ve already experienced what happens with a too-permissive definition of evidence.  In No Child Left Behind, “scientifically-based research” was famously mentioned 110 times.  The impact of this, however, was minimal, as everyone soon realized that the term “scientifically-based” could be applied to just about anything.

Today, we are in a much better position than we were in 2002 to insist on relatively strict evidence of effectiveness, both because we have better agreement about what constitutes evidence of effectiveness and because we have a far greater number of programs that would meet a high standard.  The ESSA definitions are a good consensus example.  Essentially, they define programs with “strong evidence of effectiveness” as those with at least one randomized study showing positive impacts using rigorous methods, and “moderate evidence of effectiveness” as those with at least one quasi-experimental study.  “Promising” is less well-defined, but requires at least one correlational study with a positive outcome.

Where the half-a-worm concept comes in, however, is that we should not use a broader definition of “evidence-based”.  For example, ESSA has a definition of “strong theory.”  To me, that is going too far, and begins to water down the concept.  What program in all of education cannot justify a “strong theory of action”?

Further, even in the top categories, there are important questions about what qualifies. In school-level studies, should we insist on school-level analyses (i.e., HLM)? Every methodologist would say yes, as I do, but this is not specified. Should we accept researcher-made measures? I say no, based on a great deal of evidence indicating that such measures inflate effects.
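To see why the level of analysis matters, here is a rough sketch with no true program effect at all. It is not HLM; it uses the simpler stand-in of comparing one mean per school, and every setting (10 schools per arm, 100 students per school, the size of school-to-school differences) is an assumption chosen for illustration.

```python
import random
import statistics


def standard_errors(n_schools_per_arm=10, students_per_school=100,
                    school_sd=1.0, student_sd=1.0, seed=0):
    """Return (student-level SE, school-level SE) for a treatment-control difference."""
    rng = random.Random(seed)

    def make_arm():
        schools = []
        for _ in range(n_schools_per_arm):
            # Students in the same school share a common school effect.
            school_effect = rng.gauss(0.0, school_sd)
            schools.append([school_effect + rng.gauss(0.0, student_sd)
                            for _ in range(students_per_school)])
        return schools

    treatment, control = make_arm(), make_arm()

    def se_of_difference(a, b):
        return (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5

    # Naive analysis: treat every student as an independent observation.
    student_se = se_of_difference([x for school in treatment for x in school],
                                  [x for school in control for x in school])
    # Cluster-level analysis: one mean per school.
    school_se = se_of_difference([statistics.mean(school) for school in treatment],
                                 [statistics.mean(school) for school in control])
    return student_se, school_se


if __name__ == "__main__":
    student_se, school_se = standard_errors()
    print(f"student-level SE: {student_se:.3f}")
    print(f"school-level SE:  {school_se:.3f}")
```

Because the student-level standard error is far too small when students are clustered in schools, chance differences between a handful of schools can look highly significant in a student-level analysis and not significant at all at the school level.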

Fortunately, due to investments made by IES, i3, and other funders, the number of programs that meet strict standards has grown rapidly. Our Evidence for ESSA website (www.evidenceforessa.org) has so far identified 101 PK-12 reading and math programs, using strict standards consistent with ESSA definitions. Among these, more than 60% meet the “strong” standard. There are enough proven programs in every subject and grade level to give educators choices among proven programs. And we add more each week.

This large number of programs meeting strict evidence standards means that insisting on rigorous evaluations, within reason, does not mean that we end up with too few programs to choose among. We can have our apple pie and eat it, too.

I’d love to see federal programs of all kinds encouraging use of programs with rigorous evidence of effectiveness. But I’d rather see a few programs that meet a strict definition of “proven” than to see a lot of programs that only meet a loose definition. Twenty good apples are much better than applesauce of dubious origins!

This blog is sponsored by the Laura and John Arnold Foundation

Where Will the Capacity for School-by-School Reform Come From?

In recent months, I’ve had a number of conversations with state and district leaders about implementing the ESSA evidence standards. To its credit, ESSA diminishes federal micromanaging, and gives more autonomy to states and locals, but now that the states and locals are in charge, how are they going to achieve greater success? One state department leader described his situation in ESSA as being like that of a dog who’s been chasing cars for years, and then finally catches one. Now what?

ESSA encourages states and local districts to help schools adopt and effectively implement proven programs. For school improvement, portions of Title II, and Striving Readers, ESSA requires use of proven programs. Initially, state and district folks were worried about how to identify proven programs, though things are progressing on that front (see, for example, www.evidenceforessa.org). But now I’m hearing a lot more concern about capacity to help all those individual schools do needs assessments, select proven programs aligned with their needs, and implement them with thought, care, and knowledgeable application of implementation science.

I’ve been in several meetings where state and local folks ask federal folks how they are supposed to implement ESSA. “Regional educational labs will help you!” they suggest. With all due respect to my friends in the RELs, this is going to be a heavy lift. There are ten of them, in a country with about 52,000 Title I schoolwide projects. So each REL is responsible for, on average, five states, 1,400 districts, and 5,200 high-poverty schools. For this reason, RELs have long been primarily expected to work with state departments. There are just not enough of them to serve many individual districts, much less schools.

State departments of education and districts can help schools select and implement proven programs. For example, they can disseminate information on proven programs, make sure that recommended programs have adequate capacity, and perhaps hold effective methods “fairs” to introduce people in their state to program providers. But states and districts rarely have capacity to implement proven programs themselves. It’s very hard to build state and local capacity to support specific proven programs. For example, when the frequent downturns in state or district funding come, the first departments to be cut back or eliminated often involve professional development. For this reason, few state departments or districts have large, experienced professional development staffs. Further, constant changes in state and local superintendents, boards, and funding levels make it difficult to build up professional development capacity over a period of years.

Because of these problems, schools have often been left to make up their own approaches to school reform. This happened on a wide scale in the NCLB School Improvement Grants (SIG) program, where federal mandates required very specific structural changes but left the essentials (teaching, curriculum, and professional development) up to the locals. The MDRC evaluation of SIG schools found that they made no better gains than similar, non-SIG schools.

Yet there is substantial underutilized capacity available to help schools across the U.S. to adopt proven programs. This capacity resides in the many organizations (both non-profit and for-profit) that originally created the proven programs, provided the professional development that caused them to meet the “proven” standard, and likely built infrastructure to ensure quality, sustainability, and growth potential.

The organizations that created proven programs have obvious advantages (their programs are known to work), but they also have several less obvious advantages. One is that organizations built to support a specific program have a dedicated focus on that program. They build expertise on every aspect of the program. As they grow, they hire capable coaches, usually ones who have already shown their skills in implementing or leading the program at the building level. Unlike states and districts that often live in constant turmoil, reform organizations or for-profit professional development organizations are likely to have stable leadership over time. In fact, for a high-poverty school engaged with a program provider, that provider and its leadership may be the only partner stable enough to help the school with its core teaching for many years.

State and district leaders play major roles in accountability, management, quality assurance, and personnel, among many other issues. With respect to implementation of proven programs, they have to set up conditions in which schools can make informed choices, monitor the performance of provider organizations, evaluate outcomes, and ensure that schools have the resources and supports they need. But truly reforming hundreds of schools in need of proven programs one at a time is not realistic for most states and districts, at least not without help. It makes a lot more sense to seek capacity in organizations designed to provide targeted professional development services on proven programs, and then coordinate with these providers to ensure benefits for students.
