Experiments: Why Bigger is Better

You see it in newspapers and magazines every day. “Miracle cure for xxx!” screams the headline. “Education program doubles student gains!”

Then the article underneath the headline describes a small study, say with 50 to 100 subjects or even fewer. The study authors are quoted saying how important this is and why they were surprised and pleased by the findings. If the authors want to express appropriate scientific caution, they then say something like, “Of course we’ll want to see this replicated on a larger scale…” But this caveat appears near the end of the article when most readers have already gone on to the crossword puzzle.

Chances are, you’ll never hear about this treatment again. If anyone does repeat the evaluation using a significantly larger sample size, the outcomes are almost certain to be far smaller than those in the small study.

My colleague, Alan Cheung, and I have recently completed a large review of research on methodological features that influence the reported outcomes of experiments in education. We examined 645 studies of elementary and secondary math, reading, and science programs that met a stringent set of inclusion standards. The outcomes were astonishing: the smaller the study, the larger the effect size. The average effect size in studies with sample sizes of 100 or fewer was +0.38, while the average effect size for studies with at least 3,000 subjects was only +0.11.
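For readers who want to see concretely what an effect size is, here is a minimal sketch computing Cohen's d, the standardized mean difference commonly reported in studies like these. The scores and group sizes below are invented for illustration; they are not from our review.

```python
import statistics

def cohens_d(treatment, control):
    """Effect size as the standardized mean difference (Cohen's d):
    the difference in group means divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    m1, m2 = statistics.mean(treatment), statistics.mean(control)
    v1, v2 = statistics.variance(treatment), statistics.variance(control)
    # Pooled SD weights each group's variance by its degrees of freedom.
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

# Hypothetical test scores: the treatment group is slightly ahead of control.
treatment = [72, 75, 78, 80, 83, 85, 88, 90]
control = [70, 72, 74, 77, 79, 81, 84, 86]
print(f"d = {cohens_d(treatment, control):+.2f}")
```

An effect size of +0.38 means the average treated student outscored the average control student by about four tenths of a standard deviation; +0.11 is barely a tenth.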

Why were these differences so large? One reason is that in small studies, researchers are often able to provide a great deal of oversight to make sure the treatment is implemented perfectly, far better than would be possible in a large study. This is called the “super-realization effect.”

Second, small studies are far more likely to use outcome measures made up by the researchers, and these tend to produce much higher effect sizes.

Third, when small studies find negative or zero effects, they tend to disappear, both because journals reject them and because researchers shelve them (the “file-drawer effect”). In contrast, a large study costs a lot of money and probably was done with a grant, so a report is more certain to be available regardless of the outcome.
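The file-drawer mechanism can be made vivid with a small simulation. This is my own sketch, not part of the review, and all of the numbers in it are illustrative: suppose a treatment has a small true effect, many studies of it are run, and only statistically significant results see print. The average published effect from small studies is then inflated well above the truth, while large studies report something close to it.

```python
import random
import statistics

def simulate_published_effects(n_per_group, true_effect=0.11,
                               n_studies=2000, z_crit=1.96, seed=42):
    """Simulate the file-drawer effect: run many two-group studies of a
    treatment with a small true effect, 'publish' only those whose
    z-statistic clears the significance bar, and return the mean
    published effect size."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        # Outcomes are standardized (SD = 1 per group), so the observed
        # effect size is the mean difference, with standard error sqrt(2/n).
        se = (2.0 / n_per_group) ** 0.5
        observed = rng.gauss(true_effect, se)
        if observed / se > z_crit:  # one-sided significance filter
            published.append(observed)
    return statistics.mean(published) if published else float("nan")

small = simulate_published_effects(n_per_group=50)    # ~100 subjects total
large = simulate_published_effects(n_per_group=1500)  # ~3,000 subjects total
print(f"mean published effect, small studies: {small:+.2f}")
print(f"mean published effect, large studies: {large:+.2f}")
```

Small studies have large standard errors, so only wildly overestimated effects reach significance and get published; large studies detect even the modest true effect, so nearly all of them get reported and their average stays honest.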

Could the small study outcomes be the true outcomes, while the large study outcome is so small because it is difficult to ensure quality implementation at a large scale? Perhaps. But the smaller effect size is probably a lot closer to what would be seen in real life. Education is an applied field, and no one should be terribly interested in treatments that only work with small numbers of students.

Small studies are fine in a process of development, possibly showing that a given program has potential. However, before we can say that a program is ready for broad dissemination, we need to see it repeated in multiple experiments or in studies with many students. Big experiments are expensive and difficult, of course, but if we’re serious about evidence-based reform, we’d better go big or go home!

It’s the Only Gum My Mom Lets Me Chew

For many years, there was a series of ads for Trident Sugarless Gum that always followed the same pattern. First there were statements about all the wonderful things about the gum, including, “4 out of 5 dentists…” This part was completely boring, perhaps deliberately so. But at the end, there would always be a very cute kid with a sheepish expression who’d say, “Besides, it’s the only gum my mom lets me chew.”

The ad was clearly directed to the parents, not the kids, and I think it was brilliant. What it was trying to do, I’d assume, is to play on parents’ sense of responsibility. Every parent knows that sugared gum is bad for kids’ teeth. The ad subtly said, “You care about your kids and their health. Take a stand to defend your child from the evils of Juicy Fruit.”

Evidence-based reform in education needs to occupy a similar place in the culture of education. Someday, teachers need to expect each other to use proven programs, and to take it as a point of pride that they know about what works and put that knowledge to work in the classroom every day. Teachers care about their kids and their profession, and they should therefore see the value of using programs known to work.

Government can play a role in establishing such a norm. For example, government agencies can provide preference points on competitive grants to schools that commit to using proven programs. They can establish criteria for levels of evidence required for a program to be considered proven, and disseminate information about those programs. They can support developers and researchers in creating, evaluating, and disseminating proven programs, as Investing in Innovation currently does. All of these strategies, and more, could help educators learn about and use proven programs to accomplish their goals, and this in turn could build a sense of professionalism, an optimism in the profession that solutions are readily available.

By putting a child’s face and parents’ love and care in front of the statistics, Trident made a place for sugarless gum in the marketplace. In the same way, proven programs have to become desirable to educators for all the right reasons — not just the effect sizes, but kids who are successful and excited about learning.

Early Childhood Education in the Balance

Back in the day, a kindergarten was a garden for children, a place where children could play, sing, paint, and pretend. Letters, numbers, and anything that smacked of formal schooling were minimized. Instead, kindergarten was intended to facilitate the transition from home to school, in a home-like setting.

Today, of course, kindergarten is less of a garden and more of a hothouse. At least in public schools, it’s a rare kindergarten that does not have a strong focus on letters and numbers. A child exposed only to the play-oriented children’s garden of old would arrive in first grade at a serious disadvantage. In most kindergarten classes there is still plenty of play, singing, and make-believe, but also a lot of literacy and numeracy.

Debate in early childhood education has largely shifted from the kindergarten to the pre-kindergarten. For a long time, programs for four-year-olds have resembled kindergartens of the past. Children are painting, playing with blocks, dressing up for make-believe, using sand and water tables, singing, and listening to stories.

In most states, pre-K is not available to all, and many children who attend pre-K do so as part of the federal Head Start program. A lot of attention has been paid to the question, “Does Head Start work?” For decades, the evidence that it does has depended on longitudinal studies of the Perry Preschool, the Abecedarian Project, and other small, lavishly funded experimental approaches. However, evaluations of run-of-the-mill Head Start programs find a consistent and depressing pattern. Immediately after their Head Start experience, young children perform somewhat better on cognitive measures than do similar children who did not receive pre-K services at all, but within a year or two these differences fade away.

Seeing these outcomes, early childhood researchers began in the 1990s to experiment with ways to make Head Start and other early childhood approaches more effective. Numerous studies compared outcomes for children who were all in preschool, but who received different programs. In the mid-2000s, a large federal project called the Preschool Curriculum Evaluation Research (PCER) initiative evaluated a large number of programs using consistent, rigorous methods. This study added substantially to the number and quality of studies of preschool models of all kinds.

My colleagues Bette Chambers, Alan Cheung, and I have just completed a review of research on studies that compared alternative approaches to pre-K. We found 32 studies of 22 programs that met our standards. These studies were of exceptional quality; 30 of the studies involved random assignment to conditions. We mainly compared programs with elements focused on literacy (which we called “balanced” approaches) to those that did not have such elements (“developmental” approaches). The outcomes were striking. At the end of pre-K, children in the balanced programs performed better, on average, on both literacy and language measures. The literacy outcomes were not too surprising, because the balanced programs had a stronger emphasis on literacy. However, at the end of kindergarten, the children who had been in the balanced groups still performed at a higher level on both literacy and language measures.

Our review supports the idea that young children can benefit from literacy experiences, to learn letters and sounds, while they continue to play, pretend, draw, and sing. Keeping literacy out of the mix does not benefit children immediately or one year later.

I’d be the last person to want to take the garden out of kindergarten or preschool. Pre-K can still be fun, social, and interactive. But adding in a focus on literacy helps children arrive in first grade ready to succeed in reading. How can that be a bad thing?

Theory Is Not Enough

“The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.” — Stephen Hawking

Readers of this blog are aware that I am enthusiastic about the EDGAR definitions of “strong” and “moderate” evidence of effectiveness for educational programs. However, EDGAR also has two categories below “moderate.” These are called “promising” and “based on strong theory.” I never mention these, but it is time to do so, because they come up in current policy debates.

Advocates for evidence in education want federal policies to seek to increase the use of proven programs, and I have frequently written in support of encouraging schools to use these programs when they exist and building them when they don’t. But some would expand the definition of “evidence-based” to make it too broad. In particular, in areas in which very few programs meet at least the “moderate” level of EDGAR, they want to allow “promising” or “based on strong theory” as alternatives. “Promising” means that a program has correlational evidence supporting it. With the right definition, this may not be so bad. However, “based on strong theory” just means that there is some explanation for why the program might work.

In a moment I’ll share my thoughts about why encouraging use of programs that only have “strong theory” is a bad idea, but first I’d like to tell you a story about why this discussion is far from theoretical for me personally.

When I was in college, I had a summer job at a school for children with intellectual disabilities in Washington, DC. Another college sophomore and I were in charge of the lowest-performing class. We overlapped by a week with the outgoing teacher, who showed us in detail what she did with the children every day. This consisted of two activities. One involved getting children to close their eyes, smell small jars of substances such as garlic, cinnamon, and cloves, and say what they were. The other involved having children string beads in alternating patterns. These activities came directly from a theory called Psychomotor Patterning, or Doman-Delacato, which was extremely popular in the 1960s and early 1970s. In addition to sensory stimulation, advocates of Doman-Delacato had children with intellectual disabilities crawl and do other stylized exercises, on the theory that these children had skipped developmental steps and could catch up to their peers by going back and repeating those steps.

In our school, my partner and I started off dutifully continuing what the teacher had shown us, but after a few days we looked at each other and said, “This is stupid.” We knew that our kids, aged perhaps 11-15, had two potential futures. If they were lucky, they could stay at home and perhaps get a job in a sheltered workshop. Otherwise, they were likely to end up in the state hospital, a terrible fate. We decided to drop the patterning and teach our kids to tie their shoes, to sweep, and to take care of themselves. We began to take them on walks and to a local McDonald's to teach them how to behave in public.

One of our children was a sweet, beautiful girl named Sarah, about 12 years old. Sarah was extremely proud of her long, blond hair, which she would stroke and say, “Sarah’s so pretty,” which I’m sure she’d heard countless times.

I was working especially hard with Sarah, and she learned quickly. I taught her to sweep, for example, starting with balled-up paper and moving to smaller and smaller things to sweep up.

One day, Sarah was gone. We heard that her parents had taken her to the state hospital.

For some reason, the parents brought Sarah back for a visit about a month later. Her beautiful hair was gone, as was the sparkle that had once been in her eyes. She stared at the floor.

A few years later, in another school, I saw teachers working with teenagers with Down syndrome, having them crawl around the classroom every day. Like Sarah, these kids had two potential futures. This school had a sheltered workshop housed in it, and if they could qualify to work there, their futures were bright. Instead, they were wasting their best chances crawling like babies.

“Based on strong theory” may sound academic or policy-wonky to many, but to me, it means that it is okay to subject children to treatments with no conclusive evidence of effectiveness when better treatments exist or could exist. In particular, “based on strong theory” all too often just means “what’s fashionable right now.” Doman-Delacato Psychomotor Patterning was a hugely popular fad because it gave parents and teachers hope that intellectual disabilities could be reversible. When I was a special education teacher, “based on strong theory” meant that my kids had received years of useless sensory stimulation instead of learning anything potentially useful. Perhaps Sarah was going to end up in the state hospital no matter what my school did. But I cannot set aside my memory of her when I hear people say that “strong theory” might be enough when actual evidence is lacking.

From a policy perspective, it would be useful to have federal and state funding support programs with strong or moderate evidence of effectiveness. In areas lacking such evidence-proven programs, government might fund research and development, perhaps, but should not encourage use of programs that are only supported by “strong theory.” Allowing weak categories into the discussion waters down the entire idea of evidence-based reform, as all programs could probably meet such definitions. Even worse, encouraging use of programs based on strong theory could lead schools to use the current fad. And if you doubt this, ask Dr. Doman and Dr. Delacato.