Benchmark Assessments: Weighing the Pig More Often?

There is an old saying about educational assessment: “If you want to fatten a pig, it doesn’t help to weigh it more often.”

To be fair, it may actually help to weigh pigs more often, so the farmer knows whether they are gaining weight at the expected rate and can do something in time if they are not.

It is surely correct that weighing pigs does no good in itself, but it may serve a diagnostic purpose. What matters is not the weighing, but rather what the farmer or veterinarian does based on the information provided by the weighing.

[Image: a pig on a scale]

This blog is not, however, about porcine policy, but educational policy. In schools, districts, and even whole states, most American children take “benchmark assessments” roughly three to six times a year. These assessments are intended to tell teachers, principals, and other school leaders how students are doing, especially in reading and math. Ideally, benchmark assessments are closely aligned with state accountability tests, making it possible for school leaders to predict how whole grade levels are likely to do on the state tests early enough in the year to enable them to provide additional assistance in areas of need. The information might be as detailed as “fourth graders need help in fractions” or “English learners need help in vocabulary.”

Benchmark assessments are only useful if they improve scores on state accountability tests. Other types of intervention may be beneficial even if they do not make any difference in state test scores, but it is hard to see why benchmark assessments would be valuable if they do not in fact have any impact on state tests, or other standardized tests.

So here is the bad news: Research finds that benchmark assessments do not make any difference in achievement.

High-quality, large-scale randomized evaluations of benchmark assessments are relatively easy to do, and many have in fact been done. Use of benchmark assessments has been evaluated in elementary reading and math (see www.bestevidence.org). Here is a summary of the findings.

                      Number of Studies    Mean Effect Size
Elementary Reading            6                 -0.02
Elementary Math               4                  0.00
Study-weighted mean          10                 -0.01
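One way to read the bottom row, as a quick check of the arithmetic, is to assume that "study-weighted" means each of the ten studies counts equally and that the rows above are per-study averages; that assumption is mine, not stated in the table.

```python
# Hypothetical reconstruction of the "study-weighted mean" row, assuming each study
# counts equally and the reading and math rows are per-study averages.
reading_es = [-0.02] * 6   # 6 elementary reading studies
math_es = [0.00] * 4       # 4 elementary math studies
print(round(sum(reading_es + math_es) / len(reading_es + math_es), 2))  # -0.01, matching the table
```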

In a rational world, these findings would put an end to benchmark assessments, at least as they are used now. The average effects are not just small; they are zero. Yet these assessments use up a lot of student time and district money.

In our accountability-obsessed educational culture, how could use of benchmark assessments make no difference at all on the only measure they are intended to improve? I would suggest several possibilities.

The first, and perhaps most likely, is that teachers and schools do not do much with the information from benchmark assessments. If you are trying to lose weight, you probably weigh yourself every day. But if you make no systematic effort to change your diet or increase your exercise, all those weighings are of little value. In education, the situation is much worse than in weight reduction, because each teacher is responsible for 20-30 students. Results of benchmark assessments are different for each student, so a school staff that learns that its fourth graders need improvement in fractions finds it difficult to act on this information. Some fourth graders in every school are excelling in fractions, some just need a little help, and some are struggling because they missed the prerequisite skills. “Teach more fractions” is a likely solution only for some of that middle group, yet differentiating instruction for all students is difficult to do well.

Another problem is that it takes time to score and return benchmark assessments, so by the time a team of teachers decides how to respond to benchmark information, the situation has moved on.

Third, benchmark assessments may add little because teachers and principals already know a lot more about their students than any test can tell them. Imagine a principal receiving the information that her English learners need help in vocabulary. I’m going to guess that she already knows that. But more than that, she and her teachers know which English learners need what kind of vocabulary, and they have other measures and means of finding out. Teachers already give a lot of brief, targeted curriculum-linked assessments, and they always have. Further, wise teachers stroll around and listen in on students working in cooperative groups, or look at their tests or seatwork or progress on computer curriculum, to get a sophisticated understanding of why some students are having trouble, and ideas for what to do about it. For example, it is possible that English learners are lacking school-specific vocabulary, such as that related to science or social studies, and this observation may suggest solutions (e.g., teach more science and social studies). But what if some English learners are afraid or unwilling to express themselves in class, but sit quietly and never volunteer answers? A completely different set of solutions might be appropriate in this case, such as using cooperative learning or tutoring strategies to give students safe spaces in which to use the vocabulary they have, and gain motivation and opportunities to learn and use more.

Benchmark assessments fall into the enormous category of educational solutions that are simple, compelling, and wrong. Yes, teachers need to know what students are learning and what is needed to improve it, but they have available many more tools that are far more sensitive, useful, timely, and tied to actions teachers can take.

Eliminating benchmark assessments would save schools a lot of money. Perhaps that money could be redirected to professional development to help teachers use approaches actually proven to work. I know, that’s crazy talk. But perhaps if we looked at what students are actually doing and learning in class, we could stop weighing pigs and start improving teaching for all children.

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.

How Computers Can Help Do Bad Research

“To err is human. But it takes a computer to really (mess) things up.” – Anonymous

Everyone knows the wonders of technology, but they also know how technology can make things worse. Today, I’m going to let my inner nerd run free (sorry!) and write a bit about how computers can be misused in educational program evaluation.

Actually, there are many problems, all involving the serious bias that can arise when computers are used to collect "big data" on computer-based instruction. (Note that I am not accusing computers of being biased in favor of their electronic pals! Computers do not have biases. They do what their operators ask them to do. So far.) The problem is that "big data" often contains "big bias."

Here is one common problem.  Evaluators of computer-based instruction almost always have available massive amounts of data indicating how much students used the computers or software. Invariably, some students use the computers a lot more than others do. Some may never even log on.

Using these data, evaluators often identify a sample of students, classes, or schools that met a given criterion of use. They then locate students, classes, or schools not using the computers to serve as a control group, matching on achievement tests and perhaps other factors.

This sounds fair. Why should a study of computer-based instruction have to include in the experimental group students who rarely touched the computers?

The answer is that students who did use the computers an adequate amount of time are not at all the same as students who had the same opportunity but did not use them, even if they all had the same pretests, on average. The reason may be that students who used the computers were more motivated or skilled than other students in ways the pretests do not detect (and therefore cannot control for). Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious. Sometimes whole classes or schools use computers more than others do, and this may indicate other positive features about those classes or schools that pretests do not capture.

Sometimes a high frequency of computer use indicates negative factors, in which case evaluations that only include the students who used the computers at least a certain amount of time may show (meaningless) negative effects. Such cases include situations in which computers are used to provide remediation for students who need it, or in which students are taking ‘credit recovery’ classes online to replace classes they have failed.

Evaluations in which students who used computers are compared to students who had opportunities to use computers but did not do so have the greatest potential for bias. However, comparisons of students in schools with access to computers to schools without access to computers can be just as bad, if only the computer-using students in the computer-using schools are included.  To understand this, imagine that in a computer-using school, only half of the students actually use computers as much as the developers recommend. The half that did use the computers cannot be compared to the whole non-computer (control) schools. The reason is that in the control schools, we have to assume that given a chance to use computers, half of their students would also do so and half would not. We just don’t know which particular students would and would not have used the computers.
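To get a feel for how large this bias can be, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: a latent "motivation" factor that drives both computer use and achievement, and a program whose true effect is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # students per group (illustrative)

def simulate_students(n):
    """Students with a latent 'motivation' factor that drives achievement."""
    motivation = rng.normal(0, 1, n)
    posttest = 0.5 * motivation + rng.normal(0, 1, n)  # the program's true effect is zero
    return motivation, posttest

# Treatment schools: suppose only students above median motivation use the computers
# enough to meet the developer's usage criterion (an assumed selection mechanism).
t_motivation, t_posttest = simulate_students(n)
users = t_motivation > np.median(t_motivation)

# Control schools: an identical population, but with no usage data to select on.
_, c_posttest = simulate_students(n)

sd = c_posttest.std()
biased = (t_posttest[users].mean() - c_posttest.mean()) / sd
itt = (t_posttest.mean() - c_posttest.mean()) / sd
print(f"users only vs. whole control schools: {biased:+.2f}")  # roughly +0.36: pure selection bias
print(f"intent to treat (all students):       {itt:+.2f}")     # roughly  0.00: the true effect
```

Even with identical populations and a program that does nothing, comparing only the computer-users to whole control schools manufactures a sizable "effect."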

Another evaluation design particularly susceptible to bias is one in which schools using a program are matched (based on pretests, demographics, and so on) with other schools that did not use the program, after outcomes are already known (or knowable). Clearly, such studies allow for the possibility that evaluators will "cherry-pick" their favorite experimental schools and "crabapple-pick" control schools known to have done poorly.

[Image: an evil-looking computer]

Solutions to Problems in Evaluating Computer-Based Programs

Fortunately, there are practical solutions to the problems inherent to evaluating computer-based programs.

Randomly Assigning Schools

The best solution by far is the one any sophisticated quantitative methodologist would suggest: identify some number of schools, or grades within schools, and randomly assign half to receive the computer-based program (the experimental group) and half to a business-as-usual control group. Measure achievement at pre- and post-test, and analyze using HLM or some other multi-level method that takes clusters (schools, in this case) into account. The problem is that this can be expensive, as you’ll usually need a sample of about 50 schools and expert assistance. Randomized experiments produce “intent to treat” (ITT) estimates of program impacts that include all students whether or not they ever touched a computer. They can also produce non-experimental estimates of “effects of treatment on the treated” (TOT), but these are not accepted as the main outcomes. Only ITT estimates from randomized studies meet the “strong” standards of ESSA, the What Works Clearinghouse, and Evidence for ESSA.
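For readers who want to see what such an analysis looks like in practice, here is a minimal sketch of a two-level ITT model. The file name and column names (posttest, pretest, treatment, school_id) are assumptions made for illustration, and a real analysis would of course follow a preregistered plan.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per student, with the school-level treatment indicator
# applied to every student in the school, whether or not they ever used the computers (ITT).
df = pd.read_csv("student_outcomes.csv")  # hypothetical file: posttest, pretest, treatment, school_id

# Two-level model: students nested within schools, with a random intercept per school.
# (A fuller analysis might also nest schools within districts or add demographic covariates.)
model = smf.mixedlm("posttest ~ treatment + pretest", data=df, groups=df["school_id"])
result = model.fit()
print(result.summary())  # the 'treatment' coefficient is the ITT estimate
```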

High-Quality Matched Studies

It is possible to simulate random assignment by matching schools in advance based on pretests and demographic factors. In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT estimates, with the exception of randomizing at the start.

In this “moderate” or quasi-experimental category there is one research design that may allow evaluators to do relatively inexpensive, relatively fast evaluations. Imagine that a program developer has sold their program to some number of schools, all about to start the following fall. Assume the evaluators have access to state test data for those and other schools. Before the fall, the evaluators could identify schools not using the program as a matched control group. These schools would have to have similar prior test scores, demographics, and other features.
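Here is a rough sketch of how that matched control group might be identified before the fall, using school-level data. The file name, column names, and matching variables are all assumptions for illustration; a real study would also handle ties and matching without replacement.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical school-level data: prior state test means and demographics.
schools = pd.read_csv("schools.csv")  # columns assumed: school_id, adopter, prior_score, pct_frl, pct_ell
features = ["prior_score", "pct_frl", "pct_ell"]

# Standardize so no single variable dominates the distance metric.
z = (schools[features] - schools[features].mean()) / schools[features].std()

adopters = schools[schools["adopter"] == 1]
candidates = schools[schools["adopter"] == 0]

# For each adopting school, find the closest non-adopting school before the program starts.
nn = NearestNeighbors(n_neighbors=1).fit(z.loc[candidates.index])
_, idx = nn.kneighbors(z.loc[adopters.index])
matched_controls = candidates.iloc[idx.ravel()]

# The combined list of adopting and matched control schools is then fixed
# before any program data or posttests exist.
```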

In order for this design to be free from bias, the developer or evaluator must specify the entire list of experimental and control schools before the program starts. They must agree that this list is the list they will use at posttest to determine outcomes, no matter what happens. The list, and the study design, should be submitted to the Registry of Efficacy and Effectiveness Studies (REES), recently launched by the Society for Research on Educational Effectiveness (SREE). This way there is no chance of cherry-picking or crabapple-picking, as the schools in the analysis are the ones specified in advance.

All students in the selected experimental and control schools, in the grades receiving the treatment, would be included in the study, producing an ITT estimate. This design makes little use of “big data” on how much individual students used the program, but such data could produce a “treatment-on-the-treated” (TOT) estimate that should at least provide an upper bound on program impact (i.e., if you don’t find a positive effect even on your TOT estimate, you’re really in trouble).

This design is inexpensive both because existing data are used and because the experimental schools, not the evaluators, pay for the program implementation.

That’s All?

Yup. That’s all. These designs do not make use of the “big data” cheaply assembled by designers and evaluators of computer-based programs. Again, the problem is that “big data” leads to “big bias.” Perhaps someone will come up with practical designs that require far fewer schools, faster turnaround times, and creative use of computerized usage data, but I do not see this coming. The problem is that in any kind of experiment, things that take place after random or matched assignment (such as participation or non-participation in the experimental treatment) are sources of bias: they are of interest in after-the-fact TOT analyses, but not as headline ITT outcomes.

If evidence-based reform is to take hold we cannot compromise our standards. We must be especially on the alert for bias. The exciting “cost-effective” research designs being talked about these days for evaluations of computer-based programs do not meet this standard.

What’s the Evidence that Evidence Works?

I recently gave a couple of speeches on evidence-based reform in education in Barcelona. As I was preparing for them, one of the organizers asked me an interesting question: “What is your evidence that evidence works?”

At one level, this is a trivial question. If schools select proven programs and practices aligned with their needs and implement them with fidelity and intelligence, with levels of resources similar to those used in the original successful research, then of course they’ll work, right? And if a school district adopts proven programs, encourages and funds them, and monitors their implementation and outcomes, then of course the appropriate use of all these programs is sure to enhance achievement district-wide, right?

Although logic suggests that a policy of encouraging and funding proven programs is sure to increase achievement on a broad scale, I like to be held to a higher standard: evidence. And it so happens that I have some evidence on this very topic. This evidence came from a large-scale evaluation of an ambitious national effort to increase use of proven and promising schoolwide programs in elementary and middle schools, carried out by a research center funded by the Institute of Education Sciences (IES) called the Center for Data-Driven Reform in Education, or CDDRE (see Slavin, Cheung, Holmes, Madden, & Chamberlain, 2013). The name of the program the experimental schools used was Raising the Bar.

How Raising the Bar Raised the Bar

The idea behind Raising the Bar was to help schools analyze their own needs and strengths, and then select whole-school reform models likely to help them meet their achievement goals. CDDRE consultants provided about 30 days of on-site professional development to each district over a 2-year period. The PD focused on reviewing data, using benchmark assessments effectively, conducting school walk-throughs with district leaders to see the degree to which schools were already using the programs they claimed to be using, and then exposing district and school leaders to information and data on schoolwide programs available to them from several providers. If a district selected a program to implement, district and school leaders received PD on ensuring effective implementation, and principals and teachers received PD on the programs they chose.

[Image: a pole vaulter]

Evaluating Raising the Bar

In the study of Raising the Bar we recruited a total of 397 elementary and 225 middle schools in 59 districts in 7 states (AL, AZ, IN, MS, OH, TN). All schools were Title I schools in rural and mid-sized urban districts. Overall, 30% of students were African-American, 20% were Hispanic, and 47% were White. Across three cohorts, starting in 2005, 2006, or 2007, schools were randomly assigned to either use Raising the Bar, or to continue with what they were doing. The study ended in 2009, so schools could have been in the Raising the Bar group for two, three, or four years.

Did We Raise the Bar?

State test scores were obtained from all schools and transformed to z-scores so they could be combined across states. The analyses focused on grades 5 and 8, as these were the only grades tested in some states at the time. Hierarchical linear modeling, with schools nested within districts, was used for analysis.
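As an illustration of the first step, here is a minimal sketch of a within-state z-score transformation. The file and column names are assumptions, not the study's actual data layout.

```python
import pandas as pd

# Hypothetical student-level file with raw state test scores from several states.
scores = pd.read_csv("state_tests.csv")  # columns assumed: state, grade, year, raw_score

# Convert each score to a z-score within its own state, grade, and year,
# so results can be pooled across states on a common scale.
grp = scores.groupby(["state", "grade", "year"])["raw_score"]
scores["z_score"] = (scores["raw_score"] - grp.transform("mean")) / grp.transform("std")
```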

For fifth grade reading, outcomes were very good, with significant individual-level effect sizes of +0.10 in Year 3 and +0.19 in Year 4. In middle school reading, the effect size reached +0.10 by Year 4.

Effects were also very good in fifth grade math, with significant effects of +0.10 in Year 3 and +0.13 in Year 4. Effect sizes in middle school math were also significant in Year 4 (ES=+0.12).

Note that these effects are for all schools, whether they adopted a program or not. Non-experimental analyses found that by Year 4, elementary schools that had chosen and implemented a reading program (33% of schools by Year 3, 42% by Year 4) scored better than matched controls in reading. Schools that chose a reading program usually chose our Success for All reading program, but some chose other models. Even Raising the Bar schools that did not adopt reading or math programs scored higher on average (though not always significantly higher) than control schools.

How Much Did We Raise the Bar?

The CDDRE project was exceptional because of its size and scope. The 622 schools, in 59 districts in 7 states, were collectively equivalent to a medium-sized state. So if anyone asks what evidence-based reform could do to help an entire state, this study provides one estimate. The student-level outcome in elementary reading, an effect size of +0.19, applied to NAEP scores, would be enough to move 43 states to the scores now only attained by the top 10. If applied successfully to schools serving mostly African American and Hispanic students or to students receiving free- or reduced-price lunches regardless of ethnicity, it would reduce the achievement gap between these and White or middle-class students by about 38%. All in four years, at very modest cost.
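For readers curious about where the 38% figure comes from, it is consistent with an achievement gap on the order of half a standard deviation; the gap value below is an assumption used only to reproduce that arithmetic, since actual NAEP gaps vary by subject and grade.

```python
effect_size = 0.19   # elementary reading effect size after four years (from the study)
assumed_gap = 0.50   # assumed achievement gap in standard deviations (illustrative)
print(f"{effect_size / assumed_gap:.0%} of the gap")  # 38%
```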

Actually, implementing something like Raising the Bar could be done much more easily and effectively today than it could in 2005-2009. First, there are a lot more proven programs to choose from than there were then. Second, the Every Student Succeeds Act (ESSA) now defines strong, moderate, and promising levels of evidence, and restricts school improvement grants to schools that choose such programs. The reason only 42% of Raising the Bar schools selected a program is that they had to pay for it, and many could not afford to do so. Today, there are resources to help with this.

The evidence is both logical and clear: Evidence works.

Reference

Slavin, R. E., Cheung, A., Holmes, G., Madden, N. A., & Chamberlain, A. (2013). Effects of a data-driven district reform model on state assessment outcomes. American Educational Research Journal, 50 (2), 371-396.

Photo by Sebastian Mary/Gio JL [CC BY-SA 2.0  (https://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons

This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.