Developer- and Researcher-Made Measures

What if people could make their own yardsticks, and all of a sudden people who did so gained two inches overnight, while people who used ordinary yardsticks did not change height? What if runners counted off time as they ran (one Mississippi, two Mississippi…), and then it so happened that these runners reduced their time in the 100-yard dash by 20%? What if archers could draw their own targets freehand and those who did got more bullseyes?

All of these examples are silly, you say. Of course people who make their own measures will do better on the measures they themselves create. Even the most honest and sincere people, trying to be fair, may give themselves the benefit of the doubt in such situations.

In educational research, it is frequently the case that researchers or developers make up their own measures of achievement or other outcomes. Numerous reviews of research (e.g., Baye et al., 2019; Cheung & Slavin, 2016; de Boer et al., 2014; Wolf et al., 2019) have found that studies using measures made by developers or researchers obtain effect sizes that may be two or three times as large as those obtained on measures independent of the developers or researchers. In fact, some studies (e.g., Wolf et al., 2019; Slavin & Madden, 2011) have compared outcomes on researcher/developer-made measures and independent measures within the same studies. In almost every study with both kinds of measures, the researcher/developer measures show much higher effect sizes.
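To make concrete what a two- or three-fold difference looks like, recall that an effect size is simply the difference between the treatment and control means divided by the pooled standard deviation. The numbers in the sketch below are purely illustrative, not taken from any of the studies cited above; they just show how a single study can report a modest effect on an independent test and a much larger one on a developer-made test of the same skill.

```python
# Illustrative only: hypothetical score means and SDs, not data from the cited reviews.

def effect_size(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / pooled_var ** 0.5

# The same hypothetical study, scored on two different outcome measures:
d_developer = effect_size(mean_t=54, mean_c=48, sd_t=14, sd_c=14, n_t=150, n_c=150)
d_independent = effect_size(mean_t=51, mean_c=49, sd_t=14, sd_c=14, n_t=150, n_c=150)

print(round(d_developer, 2))    # +0.43 on the developer-made test
print(round(d_independent, 2))  # +0.14 on the independent test
```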

I think anyone can see that researcher/developer measures tend to overstate effects, and the reasons why they would do so are readily apparent (though I will discuss them in a moment). I and other researchers have been writing about this problem in journals and other outlets for years. Yet journals still accept these measures, most authors of meta-analyses still average them into their findings, and life goes on.

I’ve written about this problem in several blogs in this series. In this one I hope to share observations about the persistence of this practice.

How Do Researchers Justify Use of Researcher/Developer-Made Measures?

Very few researchers in education are dishonest, and I do not believe that researchers set out to hoodwink readers by using measures they made up. Instead, researchers who make up their own measures or use developer-made measures express reasonable-sounding rationales for making their own measures. Some common rationales are discussed below.

  1. Perhaps the most common rationale for using researcher/developer-made measures is that the alternative is to use standardized tests, which are felt to be too insensitive to any experimental treatment. Often researchers will use both a “distal” (i.e., standardized) measure and a “proximal” (i.e., researcher/developer-made) measure. For example, studies of vocabulary-development programs that focus on specific words will often create a test consisting primarily or entirely of these focal words. They may also use a broad-range standardized test of vocabulary. Typically, such studies find positive effects on the words taught in the experimental group, but not on vocabulary in general. However, the students in the control group did not focus on the focal words, so it is unlikely they would improve on them as much as students who spent considerable time with them, regardless of the teaching method. Control students may be making impressive gains on vocabulary, mostly on words other than those emphasized in the experimental group.
  2. Many researchers make up their own tests to reflect their beliefs about how children should learn. For example, a researcher might believe that students should learn algebra in third grade. Because there are no third grade algebra tests, the researcher might make one. If others complain that of course the students taught algebra in third grade will do better on a test of the algebra they learned (but that the control group never saw), the researcher may give excellent reasons why algebra should be taught to third graders, and if the control group didn’t get that content, well, they should have.
  3. Often, researchers say they used their own measures because there were no appropriate tests available focusing on whatever they taught. However, many tests of all kinds are available, either from specialized publishers or from measures developed by other researchers. A researcher who cannot find anything appropriate is probably studying something so esoteric that no control group will ever have seen it.
  4. Sometimes, researchers studying technology applications will give the final test on the computer. This may, of course, give a huge advantage to the experimental group, which may have been using the specific computers and formats emphasized in the test. The control group may have much less experience with computers, or with the particular computer formats used in the experimental group. The researcher might argue that it would not be fair to teach on computers but test on paper. Yet every student knows how to write with a pencil, but not every student has extensive experience with the computers used for the test.


A Potential Solution to the Problem of Researcher/Developer Measures

Researcher/developer-made measures clearly inflate effect sizes considerably. Further, research in education, an applied field, should use measures like those for which schools and teachers are held accountable. No principal or teacher gets to make up his or her own test to use for accountability, and neither should researchers or developers have that privilege.

However, arguments for the use of researcher- and developer-made measures are not entirely foolish, as long as these measures are only used as supplements to independent measures. For example, in a vocabulary study, there may be a reason researchers want to know the effect of a program on the hundred words it emphasizes. This is at least a minimum expectation for such a treatment. If a vocabulary intervention that focused on only 100 words all year did not improve knowledge of those words, that would be an indication of trouble. Similarly, there may be good reasons to try out treatments based on unique theories of action and to test them using measures also aligned with that theory of action.

The problem comes in how such results are reported, and especially how they are treated in meta-analyses or other quantitative syntheses. My suggestions are as follows:

  1. Results from researcher/developer-made measures should be reported in articles on the program being evaluated, but not emphasized or averaged with independent measures. Analyses of researcher/developer-made measures may provide information, but not a fair or meaningful evaluation of the program impact. Reports of effect sizes from researcher/developer measures should be treated as implementation measures, not outcomes. The outcomes emphasized should only be those from independent measures.
  2. In meta-analyses and other quantitative syntheses, only independent measures should be used in calculations (a sketch of this rule appears just after this list). Results from researcher/developer measures may be reported in program descriptions, but never averaged in with the independent measures.
  3. Studies whose only achievement measures are made by researchers or developers should not be included in quantitative reviews.
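To illustrate suggestion 2, here is a minimal sketch of how a synthesis might handle both kinds of measures. The study entries, field names, and the pooled_effect_size function are my own illustrative assumptions, not part of any standard meta-analysis package: researcher/developer results are carried along for description, but the pooled estimate is computed from independent measures only, using ordinary inverse-variance (fixed-effect) weights.

```python
# Minimal sketch, assuming hypothetical study records with a "measure_source" tag.

def pooled_effect_size(studies):
    """Inverse-variance weighted mean effect size, using independent measures only."""
    included = [s for s in studies if s["measure_source"] == "independent"]
    weights = [1.0 / s["variance"] for s in included]
    weighted_sum = sum(w * s["effect_size"] for w, s in zip(weights, included))
    return weighted_sum / sum(weights)

studies = [
    {"study": "A", "measure_source": "independent", "effect_size": 0.15, "variance": 0.010},
    {"study": "A", "measure_source": "developer",   "effect_size": 0.45, "variance": 0.010},
    {"study": "B", "measure_source": "independent", "effect_size": 0.20, "variance": 0.020},
]

# The developer-made 0.45 can still be reported descriptively, but it is never averaged in.
print(round(pooled_effect_size(studies), 2))  # 0.17
```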

Fields in which research plays a central and respected role in policy and practice always pay close attention to the validity and fairness of measures. If educational research is ever to achieve a similar status, it must relegate measures made by researchers or developers to a supporting role, and stop treating such data the same way it treats data from independent, valid measures.

References

Baye, A., Lake, C., Inns, A., & Slavin, R. (2019). Effective reading programs for secondary students. Reading Research Quarterly, 54(2), 133–166.

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292.

de Boer, H., Donker, A. S., & van der Werf, M. P. C. (2014). Effects of the attributes of educational interventions on students’ academic performance: A meta-analysis. Review of Educational Research, 84(4), 509–545. https://doi.org/10.3102/0034654314540006

Slavin, R. E., & Madden, N. A. (2011). Measures inherent to treatments in program effectiveness reviews. Journal of Research on Educational Effectiveness, 4(4), 370–380.

Wolf, R., Morrison, J., Inns, A., Slavin, R., & Risman, K. (2019). Differences in average effect sizes in developer-commissioned and independent studies. Manuscript submitted for publication.

Photo Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

 

On High School Graduation Rates: Want to Buy My Bridge?

Francis Scott Key Bridge (Baltimore). Photo by Artondra Hall [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons.

I happen to own the Francis Scott Key Bridge in Baltimore, pictured here. It’s lovely in itself, has beautiful views of downtown and the outer harbor, and rakes in more than $11 million in tolls each year. But I’m willing to sell it to you, cheap!

If you believe that I own a bridge in Baltimore, then let me try out an even more fantastic idea on you. Since 1992, the achievement of America’s 12th graders on NAEP reading and math tests has been unchanged. Yet high school graduation rates have been soaring. From 2006 to 2016, the U.S. graduation rate increased from 73% to 84%, an all-time record. Does this sound plausible to you?

Recently, the Washington Post (https://www.washingtonpost.com/local/education/fbi-us-education-department-investigating-ballou-graduation-scandal/2018/02/02/b307e57c-07ab-11e8-b48c-b07fea957bd5_story.html?utm_term=.84c1176bb8ff) reported a scandal about graduation rates at Ballou High School in Washington, DC, a high-poverty school not known (in the past) for its graduation rates. In 2017, 100% of Ballou students graduated, and 100% were accepted into college. An investigation by radio station WAMU, however, found that a large proportion of the graduating seniors had very poor attendance, poor achievement, and other problems. In fact, the Post reported that one third of all graduating seniors in DC did not meet district graduation standards. Ballou’s principal and the DC Director of Secondary Schools resigned, and there are ongoing investigations. The FBI has recently gotten involved.

In response to these stories, teachers across America wrote to express their views. Almost without exception, the teachers said that the situation in their districts is similar to that in DC. They said they are pressured, even threatened, to promote and then graduate every student possible. Students who fail courses are often offered “credit recovery” programs to obtain their needed credits, and these were found in an investigation by the Los Angeles Times to have extremely low standards (https://robertslavinsblog.wordpress.com/2017/08/17/the-high-school-graduation-miracle/). Failing students may also be allowed to do projects or otherwise show their knowledge in alternative ways, but these are derided as “Mickey Mouse.” And then there are students like some of those at Ballou, who did not even bother to show up for credit recovery or Mickey Mouse, but were graduated anyway.

The point is, it’s not just Ballou. It’s not just DC. In high-poverty districts coast to coast, standards for graduation have declined. My colleague, Bob Balfanz, coined the term “dropout factories” many years ago to describe high schools, almost always serving high-poverty areas, that produced a high proportion of all dropouts nationwide. In response, our education system got right to work on what it does best: Change the numbers to make the problem appear to go away. The FBI might make an example of DC, but if DC is in fact doing what many high-poverty districts are doing throughout the country, is it fair to punish it disproportionately? It’s not up to me to judge the legalities or ethics involved, but clearly, the problem is much, much bigger.

Some people have argued with me on this issue. “Where’s the harm,” they ask, “in letting students graduate? So many of these students encounter serious barriers to educational success. Why not give them a break?”

I will admit to a sympathy for giving high school students who just barely miss standards legitimate opportunities to graduate, such as taking appropriately demanding makeup courses. But what is happening in DC and elsewhere is very far from this reasonable compromise with reality.

I have done some research in inner-city high schools. In just about every class, there are students who are actively engaged in lessons, and others who would become actively engaged if their teachers used proven programs (in my case it was cooperative learning). But even with the best programs, there were kids in the back of the class with headphones on, who were totally disengaged, no matter what the teacher did. And those were the ones who actually showed up at all.

The kids who were engaged, or became engaged because of excellent instruction, should have a path to graduation, one way or another. The rest should have every opportunity, encouragement, and assistance to reach this goal. Some will choose to take advantage, some will not, but that must be their choice, with appropriate consequences.

Making graduation too easy not only undermines the motivations of students (and teachers). It also undermines the motivation of the entire system to introduce and effectively implement effective programs, from preschool to 12th grade. If educators can keep doing what they’ve always done, knowing that numbers will be fiddled with at the end to make everything come out all right, then the whole system can and will lose a major institutional incentive for improvement.

The high dropout rate of inner-city schools is indeed a crisis. It needs to be treated as such: not a crisis of numbers, but a crisis encountered by hundreds of thousands of vulnerable, valuable students. Loosening standards and then declaring success, which every educator knows to be false, corrupts the system, undermining confidence in the numbers even when they are legitimate. It fosters cynicism that nothing can be done.

Is it too much to expect that we can create and implement effective strategies that would enable virtually all students to succeed on appropriate standards in elementary, middle, and high school, so that virtually all can meet rigorous requirements and walk across a stage, head held high, knowing that they truly attained what a high school diploma is supposed to certify?

If you agree that high school graduation standards have gone off the rails, it is not enough to demand tougher standards. You also have to advocate and work for the application of proven approaches that make deserved and meaningful graduation accessible to all.

On the other hand, if you think the graduation rate has legitimately skyrocketed in the absence of any corresponding improvement in reading or math achievement, please contact me at www.buy-my-bridge.com. It really is a lovely bridge.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

Does Research-Based Reform Require Standardized Tests?

Whenever I speak about evidence-based reform, someone always asks this question: “Won’t all of these randomized evaluations just reinforce teaching to the (very bad word) standardized reading and math tests?” My wife Nancy and I were recently out at our alma mater, Reed College. I gave a speech on evidence-based reform, and of course I got this question.

Just to save time in future talks I’m going to answer this question here once and for all. And the answer is:

No! In fact, evidence-based reform could be America’s escape route from excessive use of accountability and sanctions to guide education policy.

Here’s how this could happen. I’d be the first to admit that today, most studies use standardized tests as their outcome measures. However, this need not be the case, and there are in fact exceptions.

Experiments compare learning gains made by students in a given treatment group to those in a control group. Any assessment can be used as the posttest as long as it meets two key standards:

a) It measures content taught equally in both groups, and
b) It was not made up by the developer or researchers.

What this means is that authentic performance measures, measures aligned with today’s standards, measures that require creativity and non-routine problem solving, or measures of subjects (such as social studies) that are taught in school but not tested for accountability, can all be valid in experiments, as long as experimental and control students had exposure to the content, skills, or performances being assessed.

As a practical example, imagine a year-long evaluation of an innovative science program for seventh graders. The new program emphasizes the use of open-ended group laboratory projects and explanations of processes observed in the lab. The control group uses a traditional science text, demonstrations, and more traditional lab work.

Experimenters might evaluate the innovative approach using a multifaceted measure (made by someone other than the researchers) covering all of seventh grade science, including open-ended as well as multiple choice tests. In addition, students might be asked to set up, carry out, and explain a lab exercise, involving content that was not presented in either program.

If this independent measure showed positive effects on all types of measures, great. If it showed advantages on open-ended and lab assessments but no differences on multiple choice tests, this could be important, too.

The point here is that measures used in program evaluations need not be limited to standardized measures, but can instead help break us free from standardized measures. One reason it’s hard to get away from multiple choice tests is that they are cheap and easy to administer and score, which means they are much easier to use when you have to test a whole state. It would be very difficult to have every student in a state set up a lab experiment, for example. But program evaluations need not be bound by this practical restriction, because they only involve a few dozen schools, at most. Even if the state does not test writing, for example, program evaluations can. Even if the state does not test social studies, program evaluations can.

As evidence-based reform becomes more important, it may become more and more possible to justify the teaching of approaches proven to be effective on valid measures, even if the subject is not currently assessed for accountability and even if the tests are not accountability measures. States may never administer history tests, but they may encourage or incentivize use of proven approaches to teaching history. Perhaps proven approaches to music, art, or physical education could be identified and then supported and encouraged.

Managing schools by accountability alone leaves us trying to enforce minimums using limited assessments. This may be necessary to be sure that all schools are meeting minimums and making good progress on basic skills, but we need to go beyond standardized tests to consider all of the needs shared by all children. Evidence that goes beyond what states can test well may be just the ticket out of the standardized test trap.

Accountability for the Top 95 Percent


Perhaps the most controversial issue in education policy is test-based accountability. Since the 1980s, most states have had tests in reading and math (at least), and have used average school test scores for purposes ranging from praising or embarrassing school staffs to providing financial incentives or closing down low-scoring schools. Test-based accountability became national with NCLB, which required annual testing from grades 3-8, and prescribed sanctions for low-achieving schools. The Obama administration added to this an emphasis on using student test scores as part of teacher evaluations.

The entire test-based accountability movement has paid little attention to evidence. In fact, in 2011, the National Research Council reviewed research on high-stakes accountability and found few benefits.

There’s nothing wrong with testing students and identifying schools in which students appear to be making good or poor progress in comparison to other schools serving students with similar backgrounds, as long as this is just used as information to identify areas of need. What is damaging about accountability is the use of test scores for draconian consequences, such as firing principals and closing schools. The problem is that terror is just not a very good strategy for professional development. Teachers and principals afraid of punishment are more likely to use questionable strategies to raise their scores—teaching the test, reducing time on non-tested subjects, trying to attract higher-achieving kids or get rid of lower performers, not to mention out-and-out cheating. Neither terror nor the hope of rewards does much to fundamentally improve day to day teaching because the vast majority of teachers are already doing their best. There are bad apples, and they need to be rooted out. But you can’t improve the overall learning of America’s children unless you improve daily teaching practices for the top 95% of teachers, the ones who come to work every day, do their best, care about their kids, and go home dead tired.

Improving outcomes for the students of the top 95% requires top-quality, attractive, engaging professional development to help teachers use proven programs and practices. Because people are more likely to take seriously professional development they’ve chosen, teachers should have choices (as a school or department, primarily) of which proven programs they want to adopt and implement.

The toughest accountability should be reserved for the programs themselves, and the organizations that provide them. Teachers and principals should have confidence that if they do adopt a given program and implement it with fidelity and intelligence, it will work. This is best demonstrated in large experiments in which teachers in many schools use innovative programs, and outcomes are compared with those in similar schools that did not use the programs. Teachers and principals should also know that they will get enough training and coaching to make the program work.

Offering a broad range of proven programs would give local schools and districts expanded opportunities to make wise choices for their children. Just as evidence in agriculture informs but does not force choices by farmers, evidence in education should enable school leaders to advance children’s learning in a system of choice, not compulsion.

If schools had choices among many proven programs, in all different subjects (tested as well as untested), the landscape of accountability would change. Instead of threatening teachers and principals, government could provide help for schools to adopt programs they want and need. Offering proven programs provides a means of improving outcomes even in untested areas, such as science, social studies, and foreign language. As time goes on, more and better programs with convincing evaluation evidence would appear, because developers and funders would perceive the need for them.

Moving to a focus on evidence-based reform will not solve all of the contentious issues about accountability, but it could help us focus the reform conversation on how to move forward the top 95% of teachers and schools—the ones who teach 95% of our kids—and how to put accountability in proper proportion.

Test-Based Accountability, Inspectors, and Evidence


Recently, Marc Tucker of the National Center on Education and the Economy wrote a thoughtful critique of education policy in the U.S., questioning the heavy reliance on test-based accountability and suggesting that the U.S. adopt a system like those used in most of our peer countries, with less frequent and less consequential testing and a reliance on inspectors who visit all schools, especially those lagging on national measures. In the New York Times, Joe Nocera heaped praise on Tucker’s analysis.

Personally, I agree with Tucker’s (and Nocera’s) enthusiasm for an assessment and accountability system that uses testing in a more thoughtful, less draconian way. There is certainly little evidence to support test-based accountability with substantial consequences for schools and teachers as it is being used today. I’d also be glad to see U.S. schools try the kinds of independent school inspectors used in most of our peer countries.

However, as I’ve noted in earlier blogs, it’s fun to consider what other countries do, but there are too many factors involved to infer that adopting the policies of other countries will work here. I work part-time in England, which uses exactly the policies Tucker espouses. Its accountability measures are used only at the end of primary school (6th grade) and secondary school (11th). An independent and respected corps of inspectors visits schools, making more frequent visits when schools’ scores are low or declining. All well and good, but England’s PISA scores are nearly identical to ours. England’s gentler accountability policies make teaching and school leadership less unpleasant than it is here, and the hysteria and pressure-induced cheating often seen here are unknown in England, so their policies may be better in many ways. But the different U.S. policies are not the main cause of the modest rankings of U.S. students.

As another example, consider Canada. Provinces vary, but no Canadian school system tests as often as we do. However, they do not have inspectors, and Canada always scores well above both the U.S. and England. So, should we emulate nearby Canada rather than faraway Finland or Shanghai? Perhaps, but before getting too excited about our neighbor to the north, it’s important to note that the U.S. states nearest to Canada, such as Massachusetts, Minnesota, and Washington, are among the highest U.S. achievers. Is Canada’s success due to policies or demography?

My point is just that while international comparisons might suggest policies or practices worth piloting and evaluating in the U.S., the main focus should be on evaluations of policies, practices, and programs within the U.S. If inspectors, for example, seem like a good idea, let’s try them in U.S. schools, randomly assigning some schools to receive inspectors and some not.

We can all make our best guesses about what might work to improve U.S. schools, but let’s put our guesses to the test.