What if people could make their own yardsticks, and all of a sudden people who did so gained two inches overnight, while people who used ordinary yardsticks did not change height? What if runners counted off time as they ran (one Mississippi, two Mississippi…), and then it so happened that these runners reduced their time in the 100-yard dash by 20%? What if archers could draw their own targets freehand and those who did got more bullseyes?
All of these examples are silly, you say. Of course people who make their own measures will do better on the measures they themselves create. Even the most honest and sincere people, trying to be fair, may give themselves the benefit of the doubt in such situations.
In educational research, it is frequently the case that researchers or developers make up their own measures of achievement or other outcomes. Numerous reviews of research (e.g., Baye et al., 2019; Cheung & Slavin, 2016; de Boer et al., 2014; Wolf et al., 2019) have found that studies that use measures made by developers or researchers obtain effect sizes that may be two or three times as large as those obtained on measures independent of the developers or researchers. In fact, some studies (e.g., Wolf et al., 2019; Slavin & Madden, 2011) have compared outcomes on researcher/developer-made measures and independent measures within the same studies. In almost every study with both kinds of measures, the researcher/developer measures show much higher effect sizes.
I think anyone can see that researcher/developer measures tend to overstate effects, and the reasons why they would do so are readily apparent (though I will discuss them in a moment). I and other researchers have been writing about this problem in journals and other outlets for years. Yet journals still accept these measures, most authors of meta-analyses still average them into their findings, and life goes on.
I’ve written about this problem in several blogs in this series. In this one I hope to share observations about the persistence of this practice.
How Do Researchers Justify Use of Researcher/Developer-Made Measures?
Very few researchers in education are dishonest, and I do not believe that researchers set out to hoodwink readers by using measures they made up. Instead, researchers who make up their own measures or use developer-made measures express reasonable-sounding rationales for making their own measures. Some common rationales are discussed below.
- Perhaps the most common rationale for using researcher/developer-made measures is that the alternative is to use standardized tests, which are felt to be too insensitive to any experimental treatment. Often researchers will use both a “distal” (i.e., standardized) measure and a “proximal” (i.e., researcher/developer-made) measure. For example, studies of vocabulary-development programs that focus on specific words will often create a test consisting primarily or entirely of these focal words. They may also use a broad-range standardized test of vocabulary. Typically, such studies find positive effects on the words taught in the experimental group, but not on vocabulary in general. However, the students in the control group did not focus on the focal words, so it is unlikely they would improve on them as much as students who spent considerable time with them, regardless of the teaching method. Control students may be making impressive gains on vocabulary, mostly on words other than those emphasized in the experimental group.
- Many researchers make up their own tests to reflect their beliefs about how children should learn. For example, a researcher might believe that students should learn algebra in third grade. Because there are no third grade algebra tests, the researcher might make one. If others complain that of course the students taught algebra in third grade will do better on a test of the algebra they learned (but that the control group never saw), the researcher may give excellent reasons why algebra should be taught to third graders, and if the control group didn’t get that content, well, they should have.
- Often, researchers say they used their own measures because there were no appropriate tests available focusing on whatever they taught. However, there are many tests of all kinds available either from specialized publishers or from measures made by other researchers. A researcher who cannot find anything appropriate is perhaps studying something so esoteric that it will not have ever been seen by any control group.
- Sometimes, researchers studying technology applications will give the final test on the computer. This may, of course, give a huge advantage to the experimental group, which may have been using the specific computers and formats emphasized in the test. The control group may have much less experience with computers, or with the particular computer formats used in the experimental group. The researcher might argue that it would not be fair to teach on computers but test on paper. Yet every student knows how to write with a pencil, but not every student has extensive experience with the computers used for the test.
A Potential Solution to the Problem of Researcher/Developer Measures
Researcher/developer-made measures clearly inflate effect sizes considerably. Further, research in education, an applied field, should use measures like those for which schools and teachers are held accountable. No principal or teacher gets to make up his or her own test to use for accountability, and neither should researchers or developers have that privilege.
However, arguments for the use of researcher- and developer-made measures are not entirely foolish, as long as these measures are only used as supplements to independent measures. For example, in a vocabulary study, there may be a reason researchers want to know the effect of a program on the hundred words it emphasizes. This is at least a minimum expectation for such a treatment. If a vocabulary intervention that focused on only 100 words all year did not improve knowledge of those words, that would be an indication of trouble. Similarly, there may be good reasons to try out treatments based on unique theories of action and to test them using measures also aligned with that theory of action.
The problem comes in how such results are reported, and especially how they are treated in meta-analyses or other quantitative syntheses. My suggestions are as follows:
- Results from researcher/developer-made measures should be reported in articles on the program being evaluated, but not emphasized or averaged with independent measures. Analyses of researcher/developer-made measures may provide information, but not a fair or meaningful evaluation of the program impact. Reports of effect sizes from researcher/developer measures should be treated as implementation measures, not outcomes. The outcomes emphasized should only be those from independent measures.
- In meta-analyses and other quantitative syntheses, only independent measures should be used in calculations. Results from researcher/developer measures may be reported in program descriptions, but never averaged in with the independent measures.
- Studies whose only achievement measures are made by researchers or developers should not be included in quantitative reviews.
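To make these three suggestions concrete, here is a minimal sketch in Python of how a reviewer might apply them. Everything in it is an assumption made for illustration: the `Outcome` record, the `synthesize` function, and the effect sizes are hypothetical, and the simple unweighted mean stands in for whatever weighting scheme a real meta-analysis would use.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Outcome:
    """One effect size from one measure in one study (hypothetical record)."""
    study: str
    measure: str
    independent: bool   # False if made by the researcher or developer
    effect_size: float


def synthesize(outcomes):
    """Pool independent measures only.

    Returns (pooled mean, supplementary outcomes, excluded studies).
    Researcher/developer-made measures are set aside for descriptive
    reporting, and studies with no independent measure are excluded.
    """
    by_study = defaultdict(list)
    for outcome in outcomes:
        by_study[outcome.study].append(outcome)

    study_means = {}
    supplementary = []
    excluded = []
    for study, rows in by_study.items():
        independent = [o.effect_size for o in rows if o.independent]
        if not independent:
            excluded.append(study)      # only researcher/developer measures
            continue
        study_means[study] = mean(independent)
        supplementary.extend(o for o in rows if not o.independent)

    # Unweighted mean across studies: a simplification for illustration.
    pooled = mean(list(study_means.values())) if study_means else None
    return pooled, supplementary, excluded


# Made-up numbers, for illustration only.
outcomes = [
    Outcome("Study A", "standardized vocabulary test", True, 0.10),
    Outcome("Study A", "focal-word test", False, 0.60),
    Outcome("Study B", "state reading test", True, 0.20),
    Outcome("Study C", "developer-made test", False, 0.80),
]

pooled, supplementary, excluded = synthesize(outcomes)
print(round(pooled, 2))                      # pooled from Studies A and B only
print([o.measure for o in supplementary])    # reported, never averaged in
print(excluded)                              # no independent measure at all
```

In this sketch, Study A's focal-word result is still available to describe the program, but it never enters the average, and Study C drops out of the quantitative review altogether.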
Fields in which research plays a central and respected role in policy and practice always pay close attention to the validity and fairness of measures. If educational research is ever to achieve a similar status, it must relegate measures made by researchers or developers to a supporting role, and stop treating such data the same way it treats data from independent, valid measures.
Baye, A., Lake, C., Inns, A., & Slavin, R. (2019). Effective reading programs for secondary students. Reading Research Quarterly, 54 (2), 133-166.
Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.
de Boer, H., Donker, A.S., & van der Werf, M.P.C. (2014). Effects of the attributes of educational interventions on students’ academic performance: A meta-analysis. Review of Educational Research, 84 (4), 509-545. https://doi.org/10.3102/0034654314540006
Slavin, R.E., & Madden, N.A. (2011). Measures inherent to treatments in program effectiveness reviews. Journal of Research on Educational Effectiveness, 4 (4), 370-380.
Wolf, R., Morrison, J., Inns, A., Slavin, R., & Risman, K. (2019). Differences in average effect sizes in developer-commissioned and independent studies. Manuscript submitted for publication.
Photo Courtesy of Allison Shelley/The Verbatim Agency for American Education: Images of Teachers and Students in Action
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.
2 thoughts on “Developer- and Researcher-Made Measures”
Thanks, Robert. Another excellent analysis of the problems of comparing effect sizes across different studies without knowing what sorts of tests are used.
Prof Adrian Simpson (2018, p. 5) also shows problems with different standardised tests for the SAME maths intervention:
“The effect size… for the PIM test was 0.33 and for the SENT-R-B test was 1.11.”
I found this a very strange post in a normally very interesting blog. At its heart appears to be the common, but grossly misleading misconception that effect size measures the impact of an intervention. It doesn’t. The same intervention can result in very different effect sizes, depending – among many different factors – on the choice of comparison treatment, sample and, yes, the measure.
But the key issue at the heart of this blog post’s concerns is not whether a measure is designed by the researcher or designed independently, but how sensitive the measure is to the difference in treatments for the sample. As George points out above, two standardised tests for the same intervention (on the same sample with the same comparison treatment) can have very different effect sizes.
For example, an intervention which (only) impacts on ability to add fractions compared to the alternative treatment will have a bigger effect size when measured with a test of fraction addition than a more general test of fractions, which will have a larger effect size than a test of more general mathematics which will have a larger effect size than a general achievement test covering many subjects; even if all of these were standardised tests.
Simply restricting measures to standardised ones will not solve the problems of combining more or less sensitive measures in meta-analysis since independent evaluators can use measures which are more or less sensitive to the difference in treatments. If they are doing a fractions intervention they can use a fractions standardised test or they can use a more general mathematics standardised test and end up with very different effect sizes.
Throughout the post, there are a number of conflations of rather different things. Researcher-made tests are not the same thing as proximal tests and standardised tests are not the same thing as distal tests (albeit that there may be a relationship between being researcher-made and being more sensitive to the difference in treatments on the sample). Most oddly, given the shared authorship, it is stated that Slavin and Madden (2011) “compared outcomes on researcher/developer-made measures and independent measures within the same studies” – they didn’t: they compared effect sizes from what they termed ‘treatment inherent’ and ‘treatment independent’ tests. Treatment inherent is not the same as researcher-made and treatment independent is not the same as standardised. See p. 374 of the Slavin & Madden article, which outlines the authors’ personal rules for deciding whether a test is treatment inherent or not and which explicitly allows for researcher-made tests to be treatment independent and standardised tests to be treatment inherent. Moreover, treatment inherent is not a binary concept: a measure can be more or less sensitive to the difference in treatments on the sample, splitting a continuum is not always wise, and the decision rules for the split may be far from universally agreed.
My main objection is the sly winking at the purported invalidity of researcher designed measures in the very final sentence.
If the post is suggesting that effect sizes from researcher-made tests are not valid measures of the impact of an intervention, it is right: but that is because effect size is not a valid measure of the impact of an intervention, regardless of the underlying test or the status of the people who developed it. Effect size is a measure of the clarity of the study: an effect size can be large if an irrelevant intervention is studied with great precision (a passive comparison, a homogeneous sample and a sensitive measure [even if it is standardised!]), while a highly impactful intervention may have a small effect size if the study is imprecise and noisy (with a relatively effective comparison treatment, a heterogeneous sample and a measure which is insensitive to the difference in treatments [even if researcher-made!]).
What we need to stop is the pretence that we can identify better or worse interventions on the basis of a metric that does not measure effectiveness.