What if people could make their own yardsticks, and all of a sudden people who did so gained two inches overnight, while people who used ordinary yardsticks did not change height? What if runners counted off time as they ran (one Mississippi, two Mississippi…), and then it so happened that these runners reduced their time in the 100-yard dash by 20%? What if archers could draw their own targets freehand and those who did got more bullseyes?
All of these examples are silly, you say. Of course people who make their own measures will do better on the measures they themselves create. Even the most honest and sincere people, trying to be fair, may give themselves the benefit of the doubt in such situations.
In educational research, it is frequently the case that researchers or developers make up their own measures of achievement or other outcomes. Numerous reviews of research (e.g., Baye et al., 2019; Cheung & Slavin, 2016; deBoer et al., 2014; Wolf et al., 2019) have found that studies that use measures made by developers or researchers obtain effect sizes that may be two or three times as large as measures independent of the developers or researchers. In fact, some studies (e.g., Wolf et al., 2019; Slavin & Madden, 2011) have compared outcomes on researcher/developer-made measures and independent measures within the same studies. In almost every study with both kinds of measures, the researcher/developer measures show much higher effect sizes.
I think anyone can see that researcher/developer measures tend to overstate effects, and the reasons why they would do so are readily apparent (though I will discuss them in a moment). I and other researchers have been writing about this problem in journals and other outlets for years. Yet journals still accept these measures, most authors of meta-analyses still average them into their findings, and life goes on.
I’ve written about this problem in several blogs in this series. In this one I hope to share observations about the persistence of this practice.
How Do Researchers Justify Use of Researcher/Developer-Made Measures?
Very few researchers in education are dishonest, and I do not believe that researchers set out to hoodwink readers by using measures they made up. Instead, researchers who make up their own measures or use developer-made measures express reasonable-sounding rationales for making their own measures. Some common rationales are discussed below.
- Perhaps the most common rationale for using researcher/developer-made measures is that the alternative is to use standardized tests, which are felt to be too insensitive to any experimental treatment. Often researchers will use both a “distal” (i.e., standardized) measure and a “proximal” (i.e., researcher/developer-made) measure. For example, studies of vocabulary-development programs that focus on specific words will often create a test consisting primarily or entirely of these focal words. They may also use a broad-range standardized test of vocabulary. Typically, such studies find positive effects on the words taught in the experimental group, but not on vocabulary in general. However, the students in the control group did not focus on the focal words, so it is unlikely they would improve on them as much as students who spent considerable time with them, regardless of the teaching method. Control students may be making impressive gains on vocabulary, mostly on words other than those emphasized in the experimental group.
- Many researchers make up their own tests to reflect their beliefs about how children should learn. For example, a researcher might believe that students should learn algebra in third grade. Because there are no third grade algebra tests, the researcher might make one. If others complain that of course the students taught algebra in third grade will do better on a test of the algebra they learned (but that the control group never saw), the researcher may give excellent reasons why algebra should be taught to third graders, and if the control group didn’t get that content, well, they should
- Often, researchers say they used their own measures because there were no appropriate tests available focusing on whatever they taught. However, there are many tests of all kinds available either from specialized publishers or from measures made by other researchers. A researcher who cannot find anything appropriate is perhaps studying something so esoteric that it will not have ever been seen by any control group.
- Sometimes, researchers studying technology applications will give the final test on the computer. This may, of course, give a huge advantage to the experimental group, which may have been using the specific computers and formats emphasized in the test. The control group may have much less experience with computers, or with the particular computer formats used in the experimental group. The researcher might argue that it would not be fair to teach on computers but test on paper. Yet every student knows how to write with a pencil, but not every student has extensive experience with the computers used for the test.
A Potential Solution to the Problem of Researcher/Developer Measures
Researcher/developer-made measures clearly inflate effect sizes considerably. Further, research in education, an applied field, should use measures like those for which schools and teachers are held accountable. No principal or teacher gets to make up his or her own test to use for accountability, and neither should researchers or developers have that privilege.
However, arguments for the use of researcher- and developer-made measures are not entirely foolish, as long as these measures are only used as supplements to independent measures. For example, in a vocabulary study, there may be a reason researchers want to know the effect of a program on the hundred words it emphasizes. This is at least a minimum expectation for such a treatment. If a vocabulary intervention that focused on only 100 words all year did not improve knowledge of those words, that would be an indication of trouble. Similarly, there may be good reasons to try out treatments based on unique theories of action and to test them using measures also aligned with that theory of action.
The problem comes in how such results are reported, and especially how they are treated in meta-analyses or other quantitative syntheses. My suggestions are as follows:
- Results from researcher/developer-made measures should be reported in articles on the program being evaluated, but not emphasized or averaged with independent measures. Analyses of researcher/developer-made measures may provide information, but not a fair or meaningful evaluation of the program impact. Reports of effect sizes from researcher/developer measures should be treated as implementation measures, not outcomes. The outcomes emphasized should only be those from independent measures.
- In meta-analyses and other quantitative syntheses, only independent measures should be used in calculations. Results from researcher/developer measures may be reported in program descriptions, but never averaged in with the independent measures.
- Studies whose only achievement measures are made by researchers or developers should not be included in quantitative reviews.
Fields in which research plays a central and respected role in policy and practice always pay close attention to the validity and fairness of measures. If educational research is ever to achieve a similar status, it must relegate measures made by researchers or developers to a supporting role, and stop treating such data the same way it treats data from independent, valid measures.
Baye, A., Lake, C., Inns, A., & Slavin, R. (2019). Effective reading programs for secondary students. Reading Research Quarterly, 54 (2), 133-166.
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.
de Boer, H., Donker, A.S., & van der Werf, M.P.C. (2014). Effects of the attributes of educational interventions on students’ academic performance: A meta- analysis. Review of Educational Research, 84(4), 509–545. https://doi.org/10.3102/0034654314540006
Slavin, R.E., & Madden, N.A. (2011). Measures inherent to treatments in program effectiveness reviews. Journal of Research on Educational Effectiveness, 4 (4), 370-380.
Wolf, R., Morrison, J., Inns, A., Slavin, R., & Risman, K. (2019). Differences in average effect sizes in developer-commissioned and independent studies. Manuscript submitted for publication.
Photo Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.