The Good, the Bad, and the (Un)Promising

The ESSA evidence standards are finally beginning to matter. States are starting the process that will lead them to make school improvement awards to their lowest-achieving schools. The ESSA law is clear that for schools to qualify for these awards, they must agree to implement programs that meet the strong, moderate, or promising levels of the ESSA evidence standards. This is very exciting for those who believe in the power of proven programs to transform schools and benefit children. It is good news for kids, for teachers, and for our profession.

But inevitably, there is bad news with the good. If evidence is to be a standard for government funding, there are bound to be people who disseminate programs lacking high-quality evidence who will seek to bend the definitions to declare themselves “proven.” And there are also bound to be schools and districts that want to keep using what they have always used, or to keep choosing programs based on factors other than evidence, while doing the minimum the law requires.

The battleground is the ESSA “promising” criterion. “Strong” programs are pretty well defined as having significant positive evidence from high-quality randomized studies. “Moderate” programs are pretty well defined as having significant positive evidence from high-quality matched studies. Both “strong” and “moderate” are clearly defined in Evidence for ESSA (www.evidenceforessa.org), and, with a bit of translation, by the What Works Clearinghouse, both of which list specific programs that meet or do not meet these standards.

“Promising,” on the other hand, is kind of . . . squishy. The ESSA evidence standards do define programs meeting “promising” as ones that have statistically significant effects in “well-designed and well-implemented” correlational studies, with controls for inputs (e.g., pretests). This sounds good, but it is hard to nail down in practice. I’m seeing and hearing about a category of studies that perfectly illustrates the problem. Imagine that a developer commissions a study of a piece of educational software. A set of schools and their 1000 students are assigned to use the software, while control schools and their 1000 students do not have access to the software but continue with business as usual.

Computers routinely produce “trace data” that automatically tells researchers all sorts of things about how much students used the software, what they did with it, how successful they were, and so on.

The problem is that typically, large numbers of students given software do not use it. They may never even hit a key, or they may use the software so little that the researchers rule the software use to be effectively zero. So in a not unusual situation, let’s assume that in the treatment group, the one that got the software, only 500 of the 1000 students actually used the software at an adequate level.

Now here’s the rub. Almost always, the 500 users will outperform the 1000 controls, even after controlling for pretests. Yet this would be likely to happen even if the software were completely ineffective.

To understand this, think about the 500 students who did use the software and the 500 who did not. The users are probably more conscientious, hard-working, and well-organized. The non-users are more likely to be absent a lot, to fool around in class, or to use their technology to play games and browse (non-school-related) social media rather than to do math or science. Even if the pretest scores of the users and non-users were identical, these are not identical students, because their behavior with the software is not equal.

I once visited a secondary school in England that was a specially-funded model for universal use of technology. Along with colleagues, I went into several classes. The teachers were teaching their hearts out, making constant use of the technology that all students had on their desks. The students were well-behaved, but just a few dominated the discussion. Maybe the others were just a bit shy, we thought. From the front of each class, this looked like the classroom of the future.

But then, we filed to the back of each class, where we could see over students’ shoulders. And we immediately saw what was going on. Maybe 60 or 70 percent of the students were actually on social media unrelated to the content, paying no attention to the teacher or instructional software!


Now imagine that a study compared the 30-40% of students who were actually using the computers to students with similar pretests in other schools who had no computers at all. Again, the users would look terrific, but this is not a fair comparison, because all the goof-offs and laggards in the computer school had selected themselves out of the study while goof-offs and laggards in the control group were still included.

Rigorous researchers use a method called intent-to-treat, which in this case would include every student, whether or not they used the software or played non-educational computer games. “Not fair!” responds the software developer, because intent-to-treat includes a lot of students who never touched a key except to use social media. No sophisticated researcher accepts such an argument, however, because including only users gives the experimental group a big advantage.
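To see how large this bias can be, here is a minimal simulation sketch. All numbers are invented for illustration and are not drawn from any real study or program. Even though the simulated software has a true effect of exactly zero, the users-only comparison shows a healthy positive effect size, while the intent-to-treat comparison correctly shows essentially nothing:

```python
# Minimal simulation sketch: invented numbers, no real program or study.
# The software's true effect is zero, yet the "users only" comparison looks positive.
import numpy as np

rng = np.random.default_rng(42)
n = 1000  # students per group, as in the example above

# A hidden "conscientiousness" trait drives both achievement growth and
# whether a treatment-group student actually uses the software.
consc_t = rng.normal(0, 1, n)   # treatment group
consc_c = rng.normal(0, 1, n)   # control group

pre_t = rng.normal(500, 50, n)
pre_c = rng.normal(500, 50, n)

# Growth depends on conscientiousness only; the software adds NOTHING.
post_t = pre_t + 20 + 10 * consc_t + rng.normal(0, 20, n)
post_c = pre_c + 20 + 10 * consc_c + rng.normal(0, 20, n)

# "Control for pretest" the simple way: analyze gains.
gain_t, gain_c = post_t - pre_t, post_c - pre_c

# Roughly the more conscientious half of the treatment group actually use the software.
users = consc_t > 0

sd = gain_c.std(ddof=1)
print(f"Users-only effect size:      {(gain_t[users].mean() - gain_c.mean()) / sd:+.2f}")
print(f"Intent-to-treat effect size: {(gain_t.mean() - gain_c.mean()) / sd:+.2f}")
```

The users-only number looks like a real program effect, but it is produced entirely by who chose to use the software, not by what the software did.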

Here’s what is happening at the policy level. Software developers are using data from studies that only include the students who made adequate use of the software. They are then claiming that such studies are correlational and meet the “promising” standard of ESSA.

Those who make this argument are correct in saying that such studies are correlational. But these studies are very, very, very bad, because they are biased toward the treatment. The ESSA standards specify well-designed and well-implemented studies, and these studies may be correlational, but they are not well-designed or well-implemented. Software developers and other vendors are very concerned about the ESSA evidence standards, and some may use the “promising” category as a loophole. Evidence for ESSA does not accept such studies, even as promising, and the What Works Clearinghouse does not even have a category that corresponds to “promising.” Yet vendors are flooding state departments of education and districts with studies that they claim meet the ESSA standards, albeit in the lowest category.

Recently, I heard something that could be a solution to this problem. Apparently, some states are announcing that for school improvement grants, and any other purpose that has financial consequences, they will only accept programs with “strong” and “moderate” evidence. They have the right to do this; the federal law says school improvement grants must support programs that at least meet the “promising” standard, but it does not say states cannot set a higher minimum standard.

One might argue that ignoring “promising” studies is going too far. In Evidence for ESSA (www.evidenceforessa.org), we accept studies as “promising” if they have weaknesses that do not lead to bias, such as clustered studies that were significant at the student but not the cluster level. But the danger posed by studies claiming to fit “promising” using biased designs is too great. Until the feds fix the definition of “promising” to exclude bias, the states may have to solve it for themselves.

I hope there will be further development of the “promising” standard to focus it on lower-quality but unbiased evidence, but as things are now, perhaps it is best for states themselves to declare that “promising” is no longer promising.

Eventually, evidence will prevail in education, as it has in many other fields, but on the way to that glorious future, we are going to have to make some adjustments. Requiring that “promising” be truly promising would be a good place to begin.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 


Effect Sizes and the 10-Foot Man

If you ever go into the Ripley’s Believe It or Not Museum in Baltimore, you will be greeted at the entrance by a statue of the tallest man who ever lived, Robert Pershing Wadlow, a gentle giant at 8 feet, 11 inches in his stocking feet. Kids and adults love to get their pictures taken standing by him, to provide a bit of perspective.


I bring up Mr. Wadlow to explain a phrase I use whenever my colleagues come up with an effect size of more than 1.00. “That’s a 10-foot man,” I say. What I mean, of course, is that while it is not impossible that there could be a 10-foot man someday, it is extremely unlikely, because there has never been a man that tall in all of history. If someone reports seeing one, they are probably mistaken.

In the case of effect sizes, you will never, or almost never, see an effect size of more than +1.00, assuming the following reasonable conditions:

  1. The effect size compares experimental and control groups (i.e., it is not pre-post).
  2. The experimental and control group started at the same level, or they started at similar levels and researchers statistically controlled for pretest differences.
  3. The measures involved were independent of the researcher and the treatment, not made by the developers or researchers. The test was not given by the teachers to their own students.
  4. The treatment was provided by ordinary teachers, not by researchers, and could in principle be replicated widely in ordinary schools. The experiment had a duration of at least 12 weeks.
  5. There were at least 30 students and 2 teachers in each treatment group (experimental and control).

If these conditions are met, the chances of finding effect sizes of more than +1.00 are about the same as the chances of finding a 10-foot man. That is, zero.
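To make condition 2 concrete, here is a sketch of one common way such an effect size is computed: the experimental-control difference in posttest means, adjusted for pretest, divided by the unadjusted pooled posttest standard deviation. This illustrates the general convention only; the function and the data below are invented for the example, and individual studies and reviews differ in the details.

```python
# Sketch of a conventional pretest-adjusted effect size (illustrative only).
import numpy as np

def adjusted_effect_size(pre_t, post_t, pre_c, post_c):
    """ANCOVA-style effect size: pretest-adjusted treatment-control difference
    in posttest means, divided by the unadjusted pooled posttest SD."""
    pre = np.concatenate([pre_t, pre_c])
    post = np.concatenate([post_t, post_c])
    treat = np.concatenate([np.ones(len(pre_t)), np.zeros(len(pre_c))])

    # Regress posttest on a treatment indicator plus pretest; the treatment
    # coefficient is the adjusted mean difference.
    X = np.column_stack([np.ones_like(pre), treat, pre])
    coefs, *_ = np.linalg.lstsq(X, post, rcond=None)
    adjusted_diff = coefs[1]

    pooled_sd = np.sqrt((post_t.var(ddof=1) + post_c.var(ddof=1)) / 2)
    return adjusted_diff / pooled_sd

# Example with invented data: a modest real advantage for the program group.
rng = np.random.default_rng(1)
pre_t, pre_c = rng.normal(500, 50, 200), rng.normal(500, 50, 200)
post_t = pre_t + 25 + rng.normal(0, 20, 200)
post_c = pre_c + 20 + rng.normal(0, 20, 200)
print(f"{adjusted_effect_size(pre_t, post_t, pre_c, post_c):+.2f}")
```

Computed this way, under the conditions above, realistic programs produce effect sizes that are a small fraction of +1.00.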

I was thinking about the 10-foot man when I was recently asked by a reporter about the “two sigma effect” claimed by Benjamin Bloom and much discussed in the 1970s and 1980s. Bloom’s students did a series of experiments in which students were taught about a topic none of them knew anything about, usually principles of sailing. After a short period, students were tested. Those who did not achieve at least 80% (defined as “mastery”) on the tests were tutored by University of Chicago graduate students long enough to ensure that every tutored student reached mastery. The purpose of this demonstration was to make a claim that every student could learn whatever we wanted to teach them, and the only variable was instructional time, as some students need more time to learn than others. In a system in which enough time could be given to all, “ability” would disappear as a factor in outcomes. Also, in comparison to control groups who were not taught about sailing at all, the effect size was often more than 2.0, or two sigma. That’s why this principle was called the “two sigma effect.” Doesn’t the two sigma effect violate my 10-foot man principle?

No, it does not. The two sigma studies used experimenter-made tests of content taught to the experimental but not the control groups. They used University of Chicago graduate students to provide far more tutoring (as a percentage of initial instruction) than any school could ever provide. The studies were very brief, and the sample sizes were small. The two sigma experiments were designed to prove a point, not to evaluate a feasible educational method.

A more recent example of the 10-foot man principle is found in Visible Learning, the currently fashionable book by John Hattie claiming huge effect sizes for all sorts of educational treatments. Hattie asks the reader to ignore any educational treatment with an effect size of less than +0.40, and reports many whole categories of teaching methods with average effect sizes of more than +1.00. How can this be?

The answer is that such effect sizes, like two sigma, do not incorporate the conditions I laid out. Instead, Hattie throws into his reviews entire meta-analyses, which may include pre-post studies, studies using researcher-made measures, studies with tiny samples, and so on. For practicing educators, such effect sizes are useless. An educator knows that all children grow from pretest to posttest. They would not (and should not) accept measures made by researchers. The largest known effect sizes that do meet the above conditions come from studies of one-to-one tutoring, with effect sizes up to +0.86. Still not +1.00. And what could be more effective than the best one-to-one tutoring?

It’s fun to visit Mr. Wadlow at the museum, and to imagine what an even taller man could do on a basketball team, for example. But if you see a 10-foot man at Ripley’s Believe It or Not, or anywhere else, here’s my suggestion: don’t believe it. And if you visit a museum of famous effect sizes that displays a whopper of +1.00, don’t believe that, either. It doesn’t matter how big effect sizes are if they are not valid.

A 10-foot man would be a curiosity. An effect size of +1.00 is a distraction. Our work on evidence is too important to spend our time looking for 10-foot men, or effect sizes of +1.00, that don’t exist.

Photo credit: [Public domain], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

What if a Sears Catalogue Married Consumer Reports?

When I was in high school, I had a summer job delivering Sears catalogues. I borrowed my mother’s old Chevy station wagon and headed out fully laden into the wilds of the Maryland suburbs of Washington.

I immediately learned something surprising. I thought of a Sears catalogue as a big book of advertisements. But the people to whom I was delivering them often saw it as a book of dreams. They were excited to get their catalogues. When a neighborhood saw me coming, I became a minor celebrity.

Thinking back on those days, I realized what I want our Evidence for ESSA website (www.evidenceforessa.org) to be: a way to communicate to educators the wonderful array of programs they could use to improve outcomes for their children. Sort of like a Sears catalogue for education. However, it provides something that a Sears catalogue does not: evidence about the effectiveness of each catalogue entry. Imagine a Sears catalogue that was married to Consumer Reports. Where a traditional Sears catalogue describes a kitchen gadget, “It slices and dices, with no muss, no fuss!”, the marriage with Consumer Reports would instead say, “Effective at slicing and dicing, but lots of muss. Also fuss.”

If this marriage took place, it might take some of the fun out of the Sears catalogue (making it a book of realities rather than a book of dreams), but it would give buyers confidence and help them make wise choices. And with proper wordsmithing, it could still communicate both enthusiasm, when warranted, and truth. Even more, it could have a huge impact on the producers of consumer goods, because they would know that their products would have to be rigorously tested and shown to back up their claims.

In enhancing the impact of research on the practice of education, we have two problems to solve. Just like the “Book of Dreams,” we have to help educators know the wonderful array of programs available to them, programs they may never have heard of. And beyond the particular programs, we need to build excitement about the opportunity to select among proven programs.

In education, we make choices not for ourselves, but on behalf of our children. Responsible educators want to choose programs and practices that improve the achievement of their students. Something like a marriage of the Sears catalogue and Consumer Reports is necessary to address educators’ dreams and their need for information on program outcomes. Users should be both excited and informed. Information usually does not excite. Excitement usually does not inform. We need a way to do both.

In Evidence for ESSA, we have tried to give educators a sense that there are many solutions to enduring instructional problems (excitement), and descriptions of programs, outcomes, costs, staffing requirements, professional development, and effects for particular subgroups, for example (information).

In contrast to Sears catalogues, Evidence for ESSA is light (Sears catalogues were huge, and ultimately broke the springs on my mother’s station wagon). In contrast to Consumer Reports, Evidence for ESSA is free. Every marriage has its problems, but our hope is that we can capture the excitement and the information from the marriage of these two approaches.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Picture source: Nationaal Archief, the Netherlands

 

Getting the Best Mileage from Proven Programs

Wouldn’t you love to have a car that gets 200 miles to the gallon? Or one that can go hundreds of miles on a battery charge? Or one that can accelerate from zero to sixty twice as fast as any on the road?

Such cars exist, but you can’t have them. They are experimental vehicles or race cars that can only be used on a track or in a lab. They may be made of exotic materials, or may not carry passengers or groceries, or may be dangerous on real roads.

In working on our Evidence for ESSA website (www.evidenceforessa.org), we see a lot of studies that are like these experimental cars. For example, there are studies of programs in which the researcher or her graduate students actually did the teaching, or in which students used innovative technology with one adult helper for every machine or every few machines. Such studies are fine for theory building or as pilots, but we do not accept them for Evidence for ESSA, because they could never be replicated in real schools.

However, there is a much more common situation to which we pay very close attention. These are studies in which, for example, teachers receive a great deal of training and coaching, but an amount that seems replicable, in principle. For example, we would reject a study in which the experimenters taught the program, but not one in which they taught ordinary teachers how to use the program.

In such studies, the problem comes in dissemination. If the studies validating a program provided a lot of professional development, we accept the program only if the disseminator provides a similar level of professional development, and if its estimates of cost and personnel take that professional development into account. Our website states clear expectations that these services be provided at a level similar to what was provided in the research, if the positive outcomes seen in the research are to be obtained.

The problem is that disseminators often offer schools a form of the program that was never evaluated, in order to keep costs low. They know that schools don’t like to spend a lot on professional development, and they are concerned that if they require the needed levels of PD or other services or materials, schools won’t buy their program. At the extreme end of this, there are programs that were successfully evaluated using extensive professional development, whose developers then put the teacher’s manual on the web for schools to use for free.

A recent study of a program called Mathalicious illustrated the situation. Mathalicious is an on-line math course for middle school. An evaluation found that teachers randomly assigned to just get a license, with minimal training, did not obtain significant positive impacts, compared to a control group. Those who received extensive on-line training, however, did see a significant improvement in math scores, compared to controls.

When we write our program descriptions, we compare program implementation details in the research to what is said or required on the program’s website. If these do not match, within reason, we try to make it clear what were the key elements necessary for success.

Going back to the car analogy, our procedures eliminate those amazing cars that can only operate on special tracks, but we accept cars that can run on streets, carry children and groceries, and generally do what cars are expected to do. But if outstanding cars require frequent recharging, or premium gasoline, or have other important requirements, we’ll say so, in consultation with the disseminator.

In our view, evidence in education is not for academics, it’s for kids. If there is no evidence that a program as disseminated benefits kids, we don’t want to mislead educators who are trying to use evidence to benefit their children.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Evidence-Based Does Not Equal Evidence-Proven


As I speak to educational leaders about using evidence to help them improve outcomes for students, there are two words I hear all the time that give me the fantods (as Mark Twain would say):

Evidence-based

I like the first word, “evidence,” just fine, but the second word, “based,” sort of negates the first one. The ESSA evidence standards require programs that are evidence-proven, not just evidence-based, for various purposes.

“Evidence-proven” means that a given program, practice, or policy has been put to the test. Ideally, students, teachers, or schools have been assigned at random to use the experimental program or to remain in a control group. The program is provided to the experimental group for a significant period of time, at least a semester, and then final performance on tests that are fair to both groups is compared, using appropriate statistics.
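“Appropriate statistics” includes analyzing at the level that was randomized. As a minimal sketch of what that can mean when whole schools are assigned, the comparison below is made on school means rather than individual students; real evaluations typically use multilevel models, and all the numbers here are invented for illustration.

```python
# Minimal sketch: when schools are randomized, compare school means so the
# analysis matches the unit of assignment. All numbers are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical posttest means for 20 program schools and 20 control schools
program_schools = rng.normal(505, 15, 20)
control_schools = rng.normal(500, 15, 20)

t, p = stats.ttest_ind(program_schools, control_schools)
print(f"t = {t:.2f}, p = {p:.3f}")  # significance is judged at the school level
```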

If your doctor gives you medicine, it is evidence-proven. It isn’t just the same color or flavor as something proven, and it isn’t just generally in line with what research suggests might be a good idea. Instead, it has been found to be effective, compared to current standards of care, in rigorous studies.

“Evidence-based,” on the other hand, is one of those wiggle words that educators love to use to indicate that they are up-to-date and know what’s expected, but don’t actually intend to do anything different from what they are doing now.

Evidence-based is today’s equivalent of “based on scientifically-based research” in No Child Left Behind. It sure sounded good, but what educational program or practice can’t be said to be “based on” some scientific principle?

In a recent Brookings article, Mark Dynarski wrote about state ESSA plans and the conversations he has heard among educators. He says that the plans are loaded with the words “evidence-based,” but with little indication of what specific proven programs states plan to implement, or how they plan to identify, disseminate, implement, and evaluate them.

I hope the ESSA evidence standards give leaders in even a few states the knowledge and the courage to insist on evidence-proven programs, especially in very low-achieving “school improvement” schools that desperately need the very best approaches. I remain optimistic that ESSA can be used to expand evidence-proven practices. But will it in fact have this impact? That remains to be proven.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Why the What Works Clearinghouse Matters

In 1962, the most important breakthrough in modern medicine took place. It was not a drug, not a device, not a procedure. It did not immediately save a single life, or cure a single person of disease. But it profoundly changed medicine worldwide, and led to the rapid progress in all of medicine that we have come to take for granted.

This medical miracle was a law, passed in the U.S. Congress, called the Kefauver-Harris Drug Act. It required that drugs sold in the U.S. be proven safe and effective, in high-quality randomized experiments. This law was introduced by Senator Estes Kefauver of Tennessee, largely in response to the thalidomide disaster, when a widely used drug was found to produce disastrous birth defects.

From the moment the Act was passed, medical research changed utterly. The number of randomized experiments shot up. There are still errors and debates and misplaced enthusiasm, but the progress that has been made in every area of medicine is undeniable. Today, it is unthinkable in medicine that any drug would be widely sold if it had not been proven to work. Even though Kefauver-Harris itself applies only to the U.S., all advanced countries now have similar laws requiring rigorous evidence of the safety and effectiveness of medicines.

One of the ways the Kefauver-Harris Act made its impact was through reviews and publications of research on the evidence supporting the safety and efficacy of medicines. It’s no good having a law requiring strong evidence if only experts know what the evidence is. Many federal programs have sprung up over the years to review the evidence of what works and communicate it to front-line practitioners.

In education, we are belatedly going through our own evidence revolution. Since 2002, the function of communicating the findings of rigorous research in education has mostly been fulfilled by the What Works Clearinghouse (WWC), a website maintained by the U.S. Department of Education’s Institute of Education Sciences (IES). The existence of the WWC has been enormously beneficial. In addition to reviewing the evidence base for educational programs, the WWC’s standards set norms for research. No funder and no researcher wants to invest resources in a study they know the WWC will not accept.

In 2015, education finally had what may be its own Kefauver-Harris moment. This was the passage by the U.S. Congress of the Every Student Succeeds Act (ESSA), which contains specific definitions of strong, moderate, and promising levels of evidence. For certain purposes, especially for school improvement funding for very low-achieving schools, schools must use programs that meet ESSA evidence standards. For others, schools or districts can receive bonus points on grant applications if they use proven programs.

ESSA raises the stakes for evidence in education, and therefore should have raised the stakes for the WWC. If the government itself now requires or incentivizes the use of proven programs, then shouldn’t the government provide information on what individual programs meet those standards?

Yet several months after ESSA was passed, IES announced that the WWC would not be revised to align itself with ESSA evidence standards. This puts educators, and the government itself, in a bind. What if ESSA and WWC conflict? The ESSA standards are in law, so they must prevail over the WWC. Yet the WWC has a website, and ESSA does not. If WWC standards and ESSA standards were identical, or nearly so, this would not be a problem. But in fact they are very far apart.

Anticipating this situation, my colleagues and I at Johns Hopkins University created a new website, www.evidenceforessa.org. It launched in February, 2017, including elementary and secondary reading and math. We are now adding other subjects and grade levels.

In creating our website, we draw from the WWC every day, and in particular use a new Individual Study Database (ISD) that contains information on all of the evaluations the WWC has ever accepted.

The ISD is a useful tool for us, and it has also made it relatively easy to ask and answer questions about the WWC itself. The answers are troubling. We’ve found that almost half of the WWC outcomes rated “positive” or “potentially positive” are not even statistically significant. We have found that measures made by researchers or developers produce effect sizes more than three times as large as those from independent measures, yet they are fully accepted by the WWC.

As reported in a recent blog, we’ve discovered that the WWC is very, very slow to add new studies to its main “Find What Works” site. The WWC science topic is not seeking or accepting new studies (“This area is currently inactive and not conducting reviews”). Character education, dropout prevention, and English Language Learners are also inactive. How does this make any sense?

Over the next couple of months, starting in January, I will be releasing a series of blogs sharing what we have been finding out about the WWC. My hope in this is that we can help create a dialogue that will lead the WWC to reconsider many of its core policies and practices. I’m doing this not to compete or conflict with the WWC, but to improve it. If evidence is to have a major role in education policy, government has to help educators and policy makers make good choices. That is what the WWC should be doing, and I still believe it is possible.

The WWC matters, or should matter, because it expresses government’s commitment to evidence, and evidence-based reform. But it can only be a force for good if it is right, timely, accessible, comprehensible, and aligned with other government initiatives. I hope my upcoming blogs will be read in the spirit in which they were written, with hopes of helping the WWC do a better job of communicating evidence to educators eager to help young people succeed in our schools.

 

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Title I: A 20% Solution

Here’s an idea that would cost nothing and profoundly shift education funding and the interest of educators and policy makers toward evidence-proven programs. Simply put, the idea is to require that schools receiving Title I funds use 20% of the total on programs that meet at least a moderate standard of evidence. Two thin dimes on the dollar could make a huge difference in all of education.

In terms of federal education policy, Title I is the big kahuna. At $15 billion per year, it is the largest federal investment in elementary and secondary education, and it has been very politically popular on both sides of the aisle since the Johnson administration in 1965, when the Elementary and Secondary Education Act (ESEA) was first passed. Title I has been so popular because it goes to every congressional district, and provides much-needed funding by formula to help schools serving children living in poverty. Since the reauthorization of ESEA as the Every Student Succeeds Act in 2015, Title I remains the largest expenditure.

In ESSA and other federal legislation, there are two kinds of funding. One is formula funding, like Title I, where money usually goes to states and is then distributed to districts and schools. The formula may adjust for levels of poverty and other factors, but every eligible school gets its share. The other kind of funding is called competitive, or discretionary, funding. Schools, districts, and other entities have to apply for competitive funding, and no one is guaranteed a share. In many cases, federal funds are first given to states, and then schools or districts apply to their state departments of education to get a portion of those funds, but the state has to follow federal rules in awarding them.

Getting proven programs into widespread use can be relatively easy in competitive grants. Competitive grants are usually evaluated on a 100-point scale, with all sorts of “competitive preference points” for certain categories of applicants, such as for rural locations, inclusion of English language learners or children of military families, and so on. These preferences add perhaps one to five points to a proposal’s score, giving such applicants a leg up but not a sure thing. In the same way, I and others have proposed adding competitive preference points in competitive proposals for applicants who propose to adopt programs that meet established standards of evidence. For example, Title II SEED grants for professional development now require that applicants propose to use programs found to be effective in at least one rigorous study, and give five points if the programs have been proven effective in at least two rigorous studies. Schools qualifying for school improvement funding under ESSA are now required to select programs that meet ESSA evidence standards.

Adding competitive preference points for using proven programs in competitive grants is entirely sensible and pain-free. It costs nothing, and it does not require applicants to use any particular program. In fact, applicants can forego the preference points entirely and hope to win without them. Awarding preference points for proven programs is an excellent way to nudge the field toward evidence-based reform without top-down mandates or micromanagement. The federal government states a preference for proven programs, which will at least raise their profile among grant writers, but no school or district has to do anything different.

The much more difficult problem is how to get proven programs into formula funding (such as Title I). The great majority of federal funds are awarded by formula, so restricting evidence-based reform to competitive grants is only nibbling at the edges of practice. One solution to this would be to allocate incentive grants to districts if they agree to use formula funds to adopt and implement proven programs.

However, incentives cost money. Instead, imagine that districts and schools get their Title I formula funds, as they have since 1965, but that Congress requires districts to use at least 20% of their Title I, Part A funding to adopt and implement programs that meet a modest standard of evidence, similar to the “moderate” level in ESSA (which requires at least one quasi-experimental study with positive effects). The adopted program could be anything that meets other Title I requirements—reading, math, tutoring, technology—except that the program has to have evidence of effectiveness. The funds could pay for necessary staffing, professional development, materials, software, hardware, and so on. Obviously, schools could devote more than 20% if they chose to do so.

There are several key advantages to this 20% solution. First, of course, children would immediately benefit from receiving programs with at least moderate evidence of effectiveness. Second, the process would instantly make leaders of the roughly 55,000 Title I schools intensely interested in evidence. Third, the process could gradually shift discussion about Title I away from its historical focus on “how much?” to an additional focus on “for what purpose?” Publishers, software developers, academics, philanthropy, and government itself would perceive the importance of evidence, and would commission or carry out far more high-quality studies to meet the new standards. Over time, the standards of evidence might increase.

All of this would happen at no additional cost, and with a minimum of micromanagement. There are now many programs that would meet the “moderate” standards of evidence in reading, math, tutoring, whole-school reform, and other approaches, so schools would have a wide choice. No Child Left Behind required that low-performing schools devote 20% of their Title I funding to after-school tutoring programs and student transfer policies that research later showed to make little or no difference in outcomes. Why not spend the same on programs that are proven to work in advance, instead of once again rolling the dice with the educational futures of at-risk children?
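As a rough back-of-the-envelope illustration, using only the figures already cited in this post (about $15 billion per year in Title I funding and roughly 55,000 Title I schools), and ignoring how the formula actually distributes money from school to school:

```latex
0.20 \times \$15\ \text{billion} \approx \$3\ \text{billion per year}
\qquad \Longrightarrow \qquad
\frac{\$3\ \text{billion}}{\approx 55{,}000\ \text{Title I schools}} \approx \$55{,}000\ \text{per school per year}
```

Actual amounts would vary with each school’s Title I allocation; the point is simply the order of magnitude of funding that would be redirected toward proven programs.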

Twenty percent of Title I is a lot of money, but if it can make 100% of Title I more impactful, it is more than worthwhile.