Pilot Studies: On the Path to Solid Evidence

This week, the Education Technology Industry Network (ETIN), a division of the Software & Information Industry Association (SIIA), released an updated guide to research methods, authored by a team at Empirical Education Inc. The guide is primarily intended to help software companies understand what is required for studies to meet current standards of evidence.

In government and among methodologists and well-funded researchers, there is general agreement about the kind of evidence needed to establish the effectiveness of an education program intended for broad dissemination. To meet its top rating (“meets standards without reservations”) the What Works Clearinghouse (WWC) requires an experiment in which schools, classes, or students are assigned at random to experimental or control groups, and it has a second category (“meets standards with reservations”) for matched studies.

These WWC categories more or less correspond to the Every Student Succeeds Act (ESSA) evidence standards (“strong” and “moderate” evidence of effectiveness, respectively), and ESSA adds a third category, “promising,” for correlational studies.

Our own Evidence for ESSA website follows the ESSA guidelines, of course. The SIIA guidelines explain all of this.

Despite the overall consensus about the top levels of evidence, the problem is that doing studies that meet these requirements is expensive and time-consuming. Software developers, especially small ones with limited capital, often do not have the resources or the patience to do such studies. Any organization that has developed something new may not want to invest substantial resources into large-scale evaluations until it has some indication that the program is likely to show well in a larger, longer, and better-designed evaluation. There is a path to high-quality evaluations, starting with pilot studies.

The SIIA Guide usefully discusses this problem, but I want to add some further thoughts on what to do when you can’t afford a large randomized study.

1. Design useful pilot studies. Evaluators need to make a clear distinction between full-scale evaluations, intended to meet WWC or ESSA standards, and pilot studies (the SIIA Guidelines call these “formative studies”), which are just meant for internal use, both to assess the strengths or weaknesses of the program and to give an early indicator of whether or not a program is ready for full-scale evaluation. The pilot study should be a miniature version of the large study. But whatever its findings, it should not be used in publicity. Results of pilot studies are important, but by definition a pilot study is not ready for prime time.

An early pilot study may be just a qualitative study, in which developers and others might observe classes, interview teachers, and examine computer-generated data on a limited scale. The problem in pilot studies is at the next level, when developers want an early indication of effects on achievement, but are not ready for a study likely to meet WWC or ESSA standards.

2. Worry about bias, not power. Small, inexpensive studies pose two types of problems. One is the possibility of bias, discussed in the next section. The other is lack of power, which mostly means not having a large enough sample to show that a potentially meaningful program impact is statistically significant, or unlikely to have happened by chance. To understand this, imagine that your favorite baseball team adopts a new strategy. After the first ten games, the team is doing better than it did last year, in comparison to other teams, but this could have happened by chance. After 100 games? Now the results are getting interesting. If 10 teams all adopt the strategy next year and they all see improvements on average? Now you’re headed toward proof.

During the pilot process, evaluators might compare multiple classes or multiple schools, perhaps assigned at random to experimental and control groups. There may not be enough classes or schools for statistical significance yet, but if the mini-study avoids bias, the results will at least be in the ballpark (so to speak).
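The difference between bias and power can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: outcomes are standardized test scores, students are randomized individually (clustering by class or school is ignored), the true effect of +0.20 standard deviations is made up, and significance is judged with a rough |z| > 1.96 rule. The function name `simulate_pilot` and all of its parameters are hypothetical.

```python
import random
import statistics


def simulate_pilot(n_per_group, true_effect=0.20, n_trials=1000, seed=0):
    """Simulate many randomized studies of the same size.

    Returns (average estimated effect, share of studies reaching |z| > 1.96).
    Assumes individual random assignment and normally distributed outcomes.
    """
    rng = random.Random(seed)
    estimates = []
    significant = 0
    for _ in range(n_trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        # Standard error of a difference between two independent means.
        se = (statistics.variance(treated) / n_per_group
              + statistics.variance(control) / n_per_group) ** 0.5
        estimates.append(diff)
        if abs(diff / se) > 1.96:
            significant += 1
    return statistics.fmean(estimates), significant / n_trials


if __name__ == "__main__":
    for n in (20, 400):
        avg_effect, power = simulate_pilot(n)
        print(f"n per group={n}: average estimate={avg_effect:.2f}, "
              f"share significant={power:.2f}")
```

Under these assumptions, the small unbiased studies should average out close to the true effect even though few of them reach significance on their own. Any single small study can still land well off the mark, which is why a pilot gives an indication rather than proof.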

3. Avoid bias. A small experiment can be fine as a pilot study, but every effort should be made to avoid bias. Otherwise, the pilot study will give a result far more positive than the full-scale study will, defeating the purpose of doing a pilot.

Examples of common sources of biases in smaller studies are as follows.

a. Use of measures made by developers or researchers. These measures typically produce greatly inflated impacts.

b. Implementation of gold-plated versions of the program. In small pilot studies, evaluations often implement versions of the program that could never be replicated. Examples include providing additional staff time that could not be repeated at scale.

c. Inclusion of highly motivated teachers or students in the experimental group, which gets the program, but not the control group. For example, matched studies of technology often exclude teachers who did not implement “enough” of the program. The problem is that the full-scale experiment (and real life) include all kinds of teachers, so excluding teachers who could not or did not want to engage with technology overstates the likely impact at scale in ordinary schools. Even worse, excluding students who did not use the technology enough may bias the study toward more capable students.

4. Learn from pilots. Evaluators, developers, and disseminators should learn as much as possible from pilots. Observations, interviews, focus groups, and other informal means should be used to understand what is working and what is not, so when the program is evaluated at scale, it is at its best.
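The bias described in item c under “Avoid bias” (dropping experimental teachers who did not use the program “enough”) can also be illustrated with a simulation. This is a minimal sketch with invented numbers, not an analysis of any real study: it assumes each teacher has a latent “motivation” that raises class outcomes with or without the program, that only teachers with above-average motivation use the program “enough,” and that the true effect is +0.10 standard deviations. The function name `compare_analyses` and its parameters are hypothetical, and the sample is deliberately very large so that what remains is bias rather than chance.

```python
import random
import statistics


def compare_analyses(n_teachers=20000, true_effect=0.10, seed=1):
    """Compare an all-teachers estimate with a 'high users only' estimate.

    Returns (estimate using all treated teachers,
             estimate using only treated teachers who used the program enough).
    """
    rng = random.Random(seed)
    control, treated_all, treated_high_use = [], [], []
    for _ in range(n_teachers):
        motivation = rng.gauss(0.0, 1.0)
        # Motivation helps outcomes whether or not the teacher has the program.
        outcome = 0.5 * motivation + rng.gauss(0.0, 1.0)
        if rng.random() < 0.5:  # random assignment to the program
            outcome += true_effect
            treated_all.append(outcome)
            # Assumed rule: only motivated teachers implement "enough".
            if motivation > 0:
                treated_high_use.append(outcome)
        else:
            control.append(outcome)
    control_mean = statistics.fmean(control)
    all_teachers = statistics.fmean(treated_all) - control_mean
    high_use_only = statistics.fmean(treated_high_use) - control_mean
    return all_teachers, high_use_only


if __name__ == "__main__":
    all_teachers, high_use_only = compare_analyses()
    print(f"all treated teachers vs. control: {all_teachers:.2f}")
    print(f"high users only vs. control:      {high_use_only:.2f}")
```

In this setup the control group still contains every kind of teacher, so the “high users only” comparison credits the program with differences that really come from who was kept in the analysis.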



As evidence becomes more and more important, publishers and software developers will increasingly be called upon to prove that their products are effective. However, no program should have its first evaluation be a 50-school randomized experiment. Such studies are indeed the “gold standard,” but jumping from a two-class pilot to a 50-school experiment is a way to guarantee failure. Software developers and publishers should follow a path that leads to a top-tier evaluation, and learn along the way how to ensure that their programs and evaluations will produce positive outcomes for students at the end of the process.


This blog is sponsored by the Laura and John Arnold Foundation


Headstands and Evidence: On Over-Teachable Measures

Working on Evidence for ESSA has been a wonderful educational experience for those of us who are doing it. One problem we have encountered is that the Every Student Succeeds Act (ESSA) evidence standards allow evidence supporting a given program to be considered “strong,” “moderate,” or “promising” based on positive effects in a single well-conducted study. While these standards are a major step forward in general, this definition allows the awful possibility that programs could be validated based on flawed studies, and specifically based on outcomes on measures that are not valid for this purpose. For this reason, we’ve excluded measures that are too closely aligned with the experimental program, such as measures made by developers or researchers.

In the course of doing the research reviews behind Evidence for ESSA, we’ve realized that there is a special category of measures that is also problematic. These are measures that are not made by researchers or developers, but are easy to teach to. A good example from the old days is verbal analogies on the Scholastic Aptitude Test (SAT) (e.g., Big: Small::Wet: _?_). (Answer: dry)

These are no longer used on the SAT, presumably because the format is unusual and can be taught, giving an advantage to students who are coached on the SAT rather than to ones who actually know content that is useful in post-secondary education.

One key example of an over-teachable measure is a test given in one minute, such as DIBELS. Such tests are popular because they are inexpensive and brief, and may be useful as benchmark tests. But as measures for highly consequential purposes, such as anything relating to student or teacher accountability, or program evaluation for the ESSA evidence standards, one-minute tests are not appropriate, even if they correlate well with longer, well-validated measures. The problem is that such measures are overly teachable. In the case of DIBELS, for example, students can be drilled to read (ignoring punctuation or meaning) to correctly pronounce as many words as possible in a minute.

Another good example is elision tests in early reading (how would you say “bird” without the /b/?). Elision tests focus on an unusual situation that children are unlikely to have seen unless their teacher is specifically preparing them for an elision test.

Using over-teachable measures is a bit like holding teachers or schools or programs accountable for their children’s ability to do headstands. For PE, measures of time taken to run a 100-yard dash, ability to lift weights, or percent of successful free throws in basketball would be legitimate, because these are normal parts of a PE program, and they assess general strength, speed, and muscular control. But headstands are relatively unusual as a focus in PE, and are easily taught and practiced. A PE program should not be able to meet ESSA evidence standards based on students’ ability to do headstands because this is not a crucial skill and because the control group would not be likely to spend much time on it.

Over-teachable measures have an odd but interesting aspect. When they are first introduced, there is nothing wrong with them. They may be reliable, valid, correlated with other measures, and show other solid psychometric properties. But the day the very same measure is used to evaluate students, teachers, schools, or programs, the measure’s over-teachability may come into play, and its validity for this particular purpose may no longer be acceptable.

There’s nothing wrong with headstands per se. They have long been a part of gymnastics. But when the ability to do headstands becomes a major component of overall PE evaluation, it turns the whole concept… well, it turns the whole concept of evaluation on its head. The same can be said about other over-teachable measures.


Publishers and Evidence

High above the Avenue of the Americas on prime real estate in midtown Manhattan towers the 51-story McGraw-Hill building. When in New York, I always find time to go look at that building and reflect on the quixotic quest I and my colleagues are on to get educational decision makers to choose rigorously evaluated, proven programs. These programs are often made by small non-profit organizations and universities, like the ones I work in. Looking up at that mighty building in New York, I always wonder, are we fooling ourselves? Who are we to take on some of the most powerful companies in the world?

Education publishing is dominated by three giant, multi-billion dollar publishers, plus another three or four even bigger technology companies. These behemoths are not worried about us, not one bit. Instead, they are worried about each other.

From my experience, there are very good people who work in publishing and technology companies, people who genuinely hope that their products will improve learning for students. They would love to create innovative programs, evaluate them rigorously, and disseminate those found to be effective. However, big as they are, the major publishers face severe constraints in offering proven programs. Because they are in ferocious competition with each other, publishers cannot easily invest in expensive development and evaluation, or insist on extensive professional development, a crucial element of virtually all programs that have been shown to improve student achievement. Doing so would raise their costs, making them vulnerable to lower-cost competitors.

In recent years, many big publishers and technology companies have begun to commission third-party evaluations of their major textbooks, software, and other products. If the evaluations show positive outcomes, they can use this information in their marketing, and having rigorous evidence showing positive impacts helps protect them from the possibility that government might begin to favor programs, software, or other products with proven outcomes in rigorous research. This is exactly what did happen with the enactment of the ESSA evidence standards, though the impact of these standards has not yet been strongly felt.

However, publishers and technology companies cannot get too far out ahead of their market. If superintendents, central office leaders, and others who select textbooks and technology get on board the evidence train, then publishers will greatly expand their efforts in research and development. If the market continues to place little value on evidence, so will the big publishers.

In contrast to commercial publishers and technology companies, non-profit organizations play a disproportionate role in the evidence movement. They are often funded by government or philanthropies to create and evaluate innovations, as big commercial companies almost never are. Non-profits have the freedom to experiment, and to disseminate what works. However, non-profits, universities, and tiny for-profit start-ups are small, under-capitalized, and have little capacity or experience in marketing. Their main, and perhaps only, competitive advantage is that they have evidence of effectiveness. If no one cares about evidence, our programs will not last long.

One problem publishers face is that evaluations of traditional textbooks usually do not show any achievement benefits compared to control groups. The reason is that one publisher’s textbook is just not that different from another’s, which is what the control group is using. Publishers rarely provide much professional development, which makes it difficult for them to introduce anything truly innovative. The half-day August in-service that comes with most textbooks is barely enough to get teachers familiar with the features in the most traditional book. The same is true of technology approaches, which also rarely make much difference in student outcomes, perhaps because they typically provide little professional development beyond what is necessary to run the software.

The strategy emphasized by government and philanthropy for many years has been to fund innovators to create and evaluate programs. Those that succeed are then encouraged or funded to “scale up” their proven programs. Some are able to grow to impressive scale, but never so much as to worry big companies. An occasional David can surprise an occasional Goliath, but in the long run, the big guys win, and they’ll keep winning until someone changes the rules. To oversimplify a bit, what we have are massive publishers and technology companies with few proven innovations, and small non-profits with proven programs but little money or marketing expertise. This is not a recipe for progress.

The solution lies with government. National, state, and/or local governments have to adopt policies that favor the use of programs and software that have been proven in rigorous experiments to be effective in improving student achievement. At the federal level, the ESSA evidence standards are showing the way, and if they truly catch hold, this may be enough. But imagine if a few large states or even big districts started announcing that they were henceforth going to require evidence of effectiveness when they adopt programs and software. The effect could be electric.

For non-profits, such policies could greatly expand access to schools, and perhaps to funding. But most non-profits are so small that it would take them years to scale up substantially while maintaining quality and effectiveness.

For publishers and technology companies, the effect could be even more dramatic. If effectiveness begins to matter, even if just in a few key places, then it becomes worthwhile for them to create, partner with, or acquire effective innovations that provide sufficient professional development. In states and districts with pro-evidence policies, publishers would not have to worry about being undercut by competitors, because all vendors would have to meet evidence standards.

Publishers have tried to acquire proven programs in the past, but this usually comes to smash, because they tend to strip out the professional development and other elements that made the program work in the first place. However, in a pro-evidence environment, publishers would be motivated to maintain the quality, effectiveness, and “brand” of any programs they acquire.

In medicine, most research on practical medications is funded by drug companies and carefully monitored and certified by government. Could such a thing happen in education?

Publishers and technology companies have the capital and expertise to take effective programs to scale. Partnering with creators of proven programs, or creating and evaluating their own, big companies can make a real difference, as long as government ensures that the programs they are disseminating are in fact of the same quality and effectiveness as the versions that were found to be effective.

Publishers and technology companies are a key part of the education landscape. They need to be welcomed into evidence-based reform, and incentivized to engage in innovation and evaluation. Otherwise, educational innovation will remain a marginal activity, benefitting thousands of students when millions are in need.
