Handling Outbreaks after COVID-19 Re-openings: The Case of Germany

By guest blogger Nathan Storey*

As schools across the U.S. are beginning to reopen in hybrid or full formats, unanticipated outbreaks of COVID are bound to occur. To help schools prepare, we have been writing about strategies schools and districts in other countries have used to combat outbreaks.

In this week’s case study, I examine how Germany has responded to outbreaks and managed school reopening nationwide.

Germany

Over one month since reopening after the summer holiday, German schools are largely still open. Critics and health experts worried in the early weeks as cases in the country appeared to increase (Morris & Weber-Steinhaus, 2020), but schools have been able to continue to operate. Now students sit in classes without masks, and children are allowed to move and interact freely on the playground.

Immediately following the reopening, 31 outbreak clusters (150 cases) were identified in the first week of schooling, and 41 schools in Berlin (out of 825 schools in the region) experienced COVID-19 cases during the first two weeks of schooling, requiring quarantines, testing, and temporary closures. Similar issues occurred across the country as schools reopened in other states. Mecklenburg-Western Pomerania, the first state to reopen, saw 800-plus students from Goethe Gymnasium in Ludwigslust sent home for quarantine after a faculty member tested positive. One hundred primary school students in Rostock district were quarantined for two weeks when a fellow student tested positive. Yet now one month later, German schools remain open. How is this possible?

Germany has focused its outbreak responses on individual student and class-level quarantines instead of shutting down entire schools. Due to active and widespread testing nationwide in the early stages of the outbreak, the country was able to get control of community-level positivity rates, paving the way for schools to reopen both in the spring, and again after summer break. Rates rose in August, but tracking enabled authorities to trace the cases to people returning from summer vacation, not from schools. At schools, outbreaks have generally been limited to one teacher or one student, who have contracted the virus from family or community members, not from within the school.

When these outbreaks occur, schools close for a day to await test results, but reopen quickly once affected individuals have tested negative and can return to class. At Sophie-Charlotte High School in Berlin, three days after reopening, the school received word from the parents of two girls that the students had tested positive. The school in turn informed the local health authority, and 191 students and teachers were asked to quarantine at home. Everyone was tested, and two days later they received their results. Before the week was up, school was back in session. By one estimate, due to the efficient testing and individual or class quarantines, fewer than 600 Berlin students have had to stay home for a day (out of more than 366,000 students) (Bennhold, 2020).

So far, there has been one more serious outbreak at Heinrich Hertz School in Hamburg, where a cluster of 26 students and three teachers have all received positive diagnoses, potentially infected by one of the teachers. The school moved to quarantine grades six and eight, and mask wearing rules were more strictly followed. The school and local health authorities are continuing to study the potential transmission patterns to locate the origin of the cluster.

Testing in Germany is effective because it is extensive, yet targeted to those with direct contact with infections. At Heinz-Berggruen school in Berlin, a sixth grader was found to be infected after being tested even though she had no symptoms. Someone in her family had tested positive. Tracing of the family member’s contacts determined that the infection stemmed from international travel, and Heinz-Berggruen remained open, with just the infected student quarantined for two weeks. At Goethe Gymnasium in Ludwigslust, mentioned earlier, the infected teacher was sent home, and all 55 teachers were subsequently tested. The school was able to reopen less than a week later.

Some challenges have arisen. As in the US, German states are responsible for their own COVID-19 prevention measures and must make plans for the case of outbreaks. One city councilor in the Neukölln district of Berlin revealed there was confusion among parents and schools about children’s symptoms and response plans. As a result, children whose only symptoms are runny noses, for instance, have been sent home, and worries are increasing as to how effectively schools and districts will differentiate COVID-19 from flu in the winter.

The German case provides some optimism that schools can manage outbreaks and reopen successfully through careful planning and organization. Testing, contact tracing, and communication are vital, as is lowering of community positivity rates. Cases may be rising in Germany again (Loxton, 2020), but with these strategies and new national COVID management rules in place, the country is in an excellent position to address the challenge.

*Nathan Storey is a graduate student at the Johns Hopkins University School of Education.

References

Barton, T., & Parekh, A. (2020, August 11). Reopening schools: Lessons from abroad. https://doi.org/10.26099/yr9j-3620

(2020, June 12). As Europe reopens schools, relief combines with risk. The New York Times. https://www.nytimes.com/2020/05/10/world/europe/reopen-schools-germany.html

Bennhold, K. (2020, August 26). Germany faces a ‘roller coaster’ as schools reopen amid Coronavirus—The New York Times. https://www.nytimes.com/2020/08/26/world/europe/germany-schools-virus-reopening.html?smid=em-share

Holcombe, M. (2020, October 5). New York City to close schools in some areas as Northeast sees rise in new cases. CNN. https://www.cnn.com/2020/10/05/health/us-coronavirus-monday/index.html

Loxton, R. (2020, October 15). What you need to know about Germany’s new coronavirus measures for autumn. The Local. https://www.thelocal.de/20201015/what-you-need-to-know-about-germanys-new-coronavirus-measures-for-autumn-and-winter

Medical Xpress. (2020, August 7). Germany closes two schools in new virus blow. https://medicalxpress.com/news/2020-08-germany-schools-virus.html

Morris, L., & Weber-Steinhaus, F. (2020, September 11). Schools have seen no coronavirus outbreaks since reopening a month ago in Germany—The Washington Post. https://www.washingtonpost.com/world/europe/covid-schools-germany/2020/09/10/309648a4-eedf-11ea-bd08-1b10132b458f_story.html

Noryskiewicz, A. (2020, August 25). Coronavirus data 2 weeks into Germany’s school year “reassures” expert. https://www.cbsnews.com/news/coronavirus-school-germany-no-outbreaks/

The Associated Press (2020, August 27). Europe is going back to school despite recent virus surge—Education Week. AP. http://www.edweek.org/ew/articles/2020/08/27/europe-is-going-back-to-school_ap.html?cmp=eml-enl-eu-news2&M=59665135&U=&UUID=4397669ca555af41d7b271f2dafac508

This blog was developed with support from Arnold Ventures. The views expressed here do not necessarily reflect those of Arnold Ventures.

Note: If you would like to subscribe to Robert Slavin’s weekly blogs, just send your email address to thebee@bestevidence.org

How Can You Tell When The Findings of a Meta-Analysis Are Likely to Be Valid?

In Baltimore, Faidley’s, founded in 1886, is a much loved seafood market inside Lexington Market. Faidley’s used to be a real old-fashioned market, with sawdust on the floor and an oyster bar in the center. People lined up behind their favorite oyster shucker. In a longstanding tradition, the oyster shuckers picked oysters out of crushed ice and tapped them with their oyster knives. If they sounded full, they opened them. But if they did not, the shuckers discarded them.

I always noticed that the line was longer behind the shucker who was discarding the most oysters. Why? Because everyone knew that the shucker who was pickier was more likely to come up with a dozen fat, delicious oysters, instead of say, nine great ones and three…not so great.

I bring this up today to tell you how to pick full, fair meta-analyses on educational programs. No, you can’t tap them with an oyster knife, but otherwise, the process is similar. You want meta-analysts who are picky about what goes into their meta-analyses. Your goal is to make sure that a meta-analysis produces results that truly represent what teachers and schools are likely to see in practice when they thoughtfully implement an innovative program. If instead you pick the meta-analysis with the biggest effect sizes, you will always be disappointed.

As a special service to my readers, I’m going to let you in on a few trade secrets about how to quickly evaluate a meta-analysis in education.

One very easy way to evaluate a meta-analysis is to look at the overall effect size, probably shown in the abstract. If the overall mean effect size is more than about +0.40, you probably don’t have to read any further. Unless the treatment is tutoring or some other treatment that you would expect to make a massive difference in student achievement, it is rare to find a single legitimate study with an effect size that large, much less an average that large. A very large effect size is almost a guarantee that a meta-analysis is full of studies with design features that greatly inflate effect sizes, not studies with outstandingly effective treatments.

Next, go to the Methods section, which will have within it a section on inclusion (or selection) criteria. It should list the types of studies that were or were not accepted into the meta-analysis. Some of the criteria will have to do with the focus of the meta-analysis, specifying, for example, “studies of science programs for students in grades 6 to 12.” But your focus is on the criteria that specify how picky the meta-analysis is. As one example of a picky set of criteria, here are the main ones we use in Evidence for ESSA and in every analysis we write:

  1. Studies had to use random assignment or matching to assign students to experimental or control groups, with schools and students in each specified in advance.
  2. Students assigned to the experimental group had to be compared to very similar students in a control group, which uses business-as-usual. The experimental and control students must be well matched, within a quarter standard deviation at pretest (ES=+0.25), and attrition (loss of subjects) must be no more than 15% higher in one group than the other at the end of the study. Why? It is essential that experimental and control groups start and remain the same in all ways other than the treatment. Controls for initial differences do not work well when the differences are large.
  3. There must be at least 30 experimental and 30 control students. Analyses of combined effect sizes must control for sample sizes. Why? Evidence finds substantial inflation of effect sizes in very small studies.
  4. The treatments must be provided for at least 12 weeks. Why? Evidence finds major inflation of effect sizes in very brief studies, and brief studies do not represent the reality of the classroom.
  5. Outcome measures must be measures independent of the program developers and researchers. Usually, this means using national tests of achievement, though not necessarily standardized tests. Why? Research has found that tests made by researchers can inflate effect sizes by double, or more, and research-made measures do not represent the reality of classroom assessment.
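For readers who like to see rules written down as a procedure, here is a minimal sketch of how the five criteria above might be applied to a list of studies. It is only an illustration, not the actual Evidence for ESSA procedure. The `Study` record, its field names, and the function `exclusion_reasons` are all hypothetical, and I have assumed that the 15% attrition limit means 15 percentage points. The requirement that schools and students be specified in advance is not encoded here.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Hypothetical record of one study; all field names are illustrative."""
    name: str
    design: str                # e.g. "randomized", "matched", "pre-post"
    n_experimental: int        # students in the experimental group
    n_control: int             # students in the control group
    pretest_gap_sd: float      # pretest difference between groups, in SD units
    attrition_gap_pct: float   # attrition difference between groups (assumed percentage points)
    duration_weeks: float      # how long the treatment was provided
    independent_measure: bool  # True if the outcome measure was not made by developers/researchers


def exclusion_reasons(study: Study) -> list[str]:
    """Return the criteria a study fails; an empty list means it qualifies."""
    reasons = []
    if study.design not in {"randomized", "matched"}:
        reasons.append("design")
    if abs(study.pretest_gap_sd) > 0.25:
        reasons.append("pretest difference")
    if study.attrition_gap_pct > 15:
        reasons.append("differential attrition")
    if study.n_experimental < 30 or study.n_control < 30:
        reasons.append("sample size")
    if study.duration_weeks < 12:
        reasons.append("duration")
    if not study.independent_measure:
        reasons.append("non-independent measure")
    return reasons


# Two made-up studies, purely for illustration.
picky_ok = Study("Study A", "randomized", 120, 115, 0.10, 4.0, 30, True)
not_ok = Study("Study B", "pre-post", 6, 0, 0.0, 0.0, 0.2, False)

print(picky_ok.name, exclusion_reasons(picky_ok))
print(not_ok.name, exclusion_reasons(not_ok))
```

Listing the reasons for exclusion, rather than returning a simple yes or no, makes it easy to see which standard is doing the most work in a given review.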

There may be other details, but these are the most important. Note that there is a double focus of these standards. Each is intended both to minimize bias and to maximize similarity to the conditions faced by schools. What principal or teacher who cares about evidence would be interested in adopting a program evaluated in comparison to a very different control group? Or in a study with few subjects, or a very brief duration? Or in a study that used measures made by the developers or researchers? This set is very similar to what the What Works Clearinghouse (WWC) requires, except #5 (the WWC requires exclusion of “overaligned” measures, but not developer-/researcher-made measures).

If these criteria are all there in the “Inclusion Standards,” chances are you are looking at a top-quality meta-analysis. As a rule, it will have average effect sizes lower than those you’ll see in reviews without some or all of these standards, but the effect sizes you see will probably be close to what you will actually get in student achievement gains if your school implements a given program with fidelity and thoughtfulness.

What I find astonishing is how many meta-analyses do not have standards this high. Among experts, these criteria are not controversial, except for the last one, which shouldn’t be. Yet meta-analyses are often written, and accepted by journals, with much lower standards, thereby producing greatly inflated, unrealistic effect sizes.

As one example, there was a meta-analysis of Direct Instruction programs in reading, mathematics, and language, published in the Review of Educational Research (Stockard et al., 2018). I have great respect for Direct Instruction, which has been doing good work for many years. But this meta-analysis was very disturbing.

The inclusion and exclusion criteria in this meta-analysis did not require experimental-control comparisons, did not require well-matched samples, and did not require any minimum sample size or duration. It was not clear how many of the outcome measures were made by program developers or researchers, rather than independent of the program.

With these minimal inclusion standards, and a very long time span (back to 1966), it is not surprising that the review found a great many qualifying studies. 528, to be exact. The review also reported extraordinary effect sizes: +0.51 for reading, +0.55 for math, and +0.54 for language. If these effects were all true and meaningful, it would mean that DI is much more effective than one-to-one tutoring, for example.

But don’t get your hopes up. The article included an online appendix that showed the sample sizes, study designs, and outcomes of every study.

First, the authors identified eight experimental designs (plus single-subject designs, which were treated separately). Only two of these would meet anyone’s modern standards of meta-analysis: randomized and matched. The others included pre-post gains (no control group), comparisons to test norms, and other pre-scientific designs.

Sample sizes were often extremely small. Leaving aside single-case experiments, there were dozens of single-digit sample sizes (e.g., six students), often with very large effect sizes. Further, there was no indication of study duration.

What is truly astonishing is that RER accepted this study. RER is the top-rated journal in all of education, based on its citation count. Yet this review, and the Kulik & Fletcher (2016) review I cited in a recent blog, clearly did not meet minimal standards for meta-analyses.

My colleagues and I will be working in the coming months to better understand what has gone wrong with meta-analysis in education, and to propose solutions. Of course, our first step will be to spend a lot of time at oyster bars studying how they set such high standards. Oysters and beer will definitely be involved!

Photo credit: Annette White / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)

References

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research, 86(1), 42-78.

Stockard, J., Wood, T. W., Coughlin, C., & Rasplica Khoury, C. (2018). The effectiveness of Direct Instruction curricula: A meta-analysis of a half century of research. Review of Educational Research, 88(4), 479–507. https://doi.org/10.3102/0034654317751919

Meta-Analysis or Muddle-Analysis?

One of the best things about living in Baltimore is eating steamed hard shell crabs every summer.  They are cooked in a very spicy mix of spices, and with Maryland corn and Maryland beer, these define the very peak of existence for Marylanders.  (To be precise, the true culture of the crab also extends into Virginia, but does not really exist more than 20 miles inland from the bay).  

As every crab eater knows, a steamed crab comes with a lot of inedible shell and other inner furniture.  So you get perhaps an ounce of delicious meat for every pound of whole crab. Here is a bit of crab math.  Let’s say you have ten pounds of whole crabs, and I have 20 ounces of delicious crabmeat.  Who gets more to eat?  Obviously I do, because your ten pounds of crabs will only yield 10 ounces of meat. 

How Baltimoreans learn about meta-analysis.

All Baltimoreans instinctively understand this from birth.  So why is this same principle not understood by so many meta-analysts?

I recently ran across a meta-analysis of research on intelligent tutoring programs by Kulik & Fletcher (2016),  published in the Review of Educational Research (RER). The meta-analysis reported an overall effect size of +0.66! Considering that the single largest effect size of one-to-one tutoring in mathematics was “only” +0.31 (Torgerson et al., 2013), it is just plain implausible that the average effect size for a computer-assisted instruction intervention is twice as large. Consider that a meta-analysis our group did on elementary mathematics programs found a mean effect size of +0.19 for all digital programs, across 38 rigorous studies (Slavin & Lake, 2008). So how did Kulik & Fletcher come up with +0.66?

The answer is clear. The authors excluded very few studies except for those of less than 30 minutes’ duration. The studies they included used methods known to greatly inflate effect sizes, but they did not exclude or control for them. To the authors’ credit, they then carefully documented the effects of some key methodological factors. For example, they found that “local” measures (presumably made by researchers) had a mean effect size of +0.73, while standardized measures had an effect size of +0.13, replicating findings of many other reviews (e.g., Cheung & Slavin, 2016). They found that studies with sample sizes less than 80 had an effect size of +0.78, while those with samples of more than 250 had an effect size of +0.30. Brief studies had higher effect sizes than longer ones, as many other reviews have found. All of this is nice to know, but even knowing it all, Kulik & Fletcher failed to control for any of it, not even to weight by sample size. So, for example, the implausible mean effect size of +0.66 includes a study with a sample size of 33, a duration of 80 minutes, and an effect size of +1.17, on a “local” test. Another had 48 students, a duration of 50 minutes, and an effect size of +0.95. Now, if you believe that 80 minutes on a computer is three times as effective for math achievement as months of one-to-one tutoring by a teacher, then I have a lovely bridge in Baltimore I’d like to sell you.
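To see why failing to weight by sample size matters so much, here is a rough sketch comparing a simple average of effect sizes with an inverse-variance weighted average. The first two studies use the sample sizes and effect sizes mentioned above. The third is a hypothetical large study I made up for illustration. The variance formula is the usual large-sample approximation for a standardized mean difference, and I have assumed equal-sized experimental and control groups, which the original review does not state.

```python
def d_variance(d: float, n_total: int) -> float:
    """Approximate sampling variance of a standardized mean difference.

    Assumes two equal-sized groups (an assumption made for this sketch).
    """
    n1 = n2 = n_total / 2
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def unweighted_mean(studies):
    """Simple average of effect sizes, ignoring sample size."""
    return sum(d for d, _ in studies) / len(studies)


def inverse_variance_mean(studies):
    """Average of effect sizes weighted by 1 / variance, so large studies count more."""
    weights = [1 / d_variance(d, n) for d, n in studies]
    weighted_sum = sum(w * d for w, (d, _) in zip(weights, studies))
    return weighted_sum / sum(weights)


# (effect size, total sample size)
studies = [
    (1.17, 33),    # from the text: 80 minutes, "local" test
    (0.95, 48),    # from the text: 50 minutes
    (0.10, 1200),  # HYPOTHETICAL large study, invented for illustration
]

print(f"Unweighted mean:       {unweighted_mean(studies):+.2f}")
print(f"Inverse-variance mean: {inverse_variance_mean(studies):+.2f}")
```

With these three studies, the simple average comes out at about +0.74, while the weighted average is about +0.15, because the one large study carries far more information than the two tiny ones combined.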

I’ve long been aware of these problems with meta-analyses that neither exclude nor control for characteristics of studies known to greatly inflate effect sizes. This was precisely the flaw for which I criticized John Hattie’s equally implausible reviews. But what I did not know until recently was just how widespread this is.

I was working on a proposal to do a meta-analysis of research on technology applications in mathematics. A colleague located every meta-analysis published on this topic since 2013. She found 20 of them. After looking at the remarkable outcomes on a few, I computed a median effect size across all twenty. It was +0.44. That is, to put it mildly, implausible. Looking further, I discovered that only one of the reviews adjusted for sample size (inverse variances). Its mean effect size was +0.05. Every one of the other 19 meta-analyses, all in respectable journals, did not control for methodological features or exclude studies based on them, and reported effect sizes up to +1.02 and +1.05.

Meta-analyses are important, because they are widely read and widely cited, in comparison to individual studies. Yet until meta-analyses start consistently excluding, or at least controlling for, studies with factors known to inflate mean effect sizes, they will have little if any meaning for practice. As things stand now, the overall mean impacts reported by meta-analyses in education depend on how stringent the inclusion standards were, not how effective the interventions truly were.

This is a serious problem for evidence-based reform. Our field knows how to solve it, but all too many meta-analysts do not do so. This needs to change. We see meta-analyses claiming huge impacts, and then wonder why these effects do not transfer to practice. In fact, these big effect sizes do not transfer because they are due to methodological artifacts, not to actual impacts teachers are likely to obtain in real schools with real students.

Ten pounds (160 ounces) of crabs only appear to be more than 20 ounces of crabmeat, because the crabs contain a lot you need to discard. The same is true of meta-analyses. Using small samples, brief durations, and researcher-made measures in evaluations inflates effect sizes without adding anything to the actual impact of treatments for students. Our job as meta-analysts is to strip away the bias the best we can, and get to the actual impact. Then we can make comparisons and generalizations that make sense, and move forward understanding of what really works in education.

In our research group, when we deal with thorny issues of meta-analysis, I often ask my colleagues to consider that they had a sister who is a principal.  “What would you say to her,” I ask, “if she asked what really works, all BS aside?  Would you suggest a program that was very effective in a 30-minute study?  One that has only been evaluated with 20 students?  One that has only been shown to be effective if the researcher gets to make the measure?  Principals are sharp, and appropriately skeptical.  Your sister would never accept such evidence.  Especially if she’s experienced with Baltimore crabs.”

References

Cheung, A., & Slavin, R. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45 (5), 283-292.

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research, 86(1), 42-78.

Slavin, R., & Lake, C. (2008). Effective programs in elementary mathematics: A best-evidence synthesis. Review of Educational Research, 78 (3), 427-515.

Torgerson, C. J., Wiggins, A., Torgerson, D., Ainsworth, H., & Hewitt, C. (2013). Every Child Counts: Testing policy effectiveness using a randomised controlled trial, designed, conducted and reported to CONSORT standards. Research In Mathematics Education, 15(2), 141–153. doi:10.1080/14794802.2013.797746.

Photo credit: Kathleen Tyler Conklin/(CC BY 2.0)

How Much Have Students Lost in The COVID-19 Shutdowns?

Everyone knows that school closures due to the COVID-19 pandemic are having a serious negative impact on student achievement, and that this impact is sure to be larger for disadvantaged students than for others. However, how large will the impact turn out to be? This is not a grim parlor game for statisticians, but could have real meaning for policy and practice. If the losses turn out to be modest, comparable to the “summer slide” we are used to (but which may not exist), then one might argue that when schools open, they might continue where they left off, and students might eventually make up their losses, as they do with summer slide. If, on the other hand, losses are very large, then we need to take emergency action.

Some researchers have used data from summer losses and from other existing data on, for example, teacher strikes, to estimate COVID losses (e.g., Kuhfeld et al., 2020). But now we have concrete evidence, from a country similar to the U.S. in most ways.

A colleague came across a study that has, I believe, the first actual data on this question. It is a recent study from Belgium (Maldonado & DeWitte, 2020) that assessed COVID-19 losses among Dutch-speaking students in that country.

The news is very bad.

The researchers obtained end-of-year test scores from all sixth graders who attend publicly-funded Catholic schools, which are attended by most students in Dutch-speaking Belgium. Sixth grade is the final year of primary school, and while schools were mostly closed from March to June due to COVID, the sixth graders were brought back to their schools in late May to prepare for and take their end-of-primary tests. Before returning, the sixth graders had missed about 30% of the days in their school year. They were offered on-line teaching at home, as in the U.S.

The researchers compared the June test scores to those of students in the same schools in previous years, before COVID. After adjustments for other factors, the 2020 students scored lower than earlier cohorts, with effect sizes of -0.19 in mathematics and -0.29 in Dutch (reading, writing, language). Schools serving many disadvantaged students had significantly larger losses in both subjects; inequality within the schools increased by 17% in mathematics and 20% in Dutch, and inequality between schools increased by 7% in math and 18% in Dutch.

There is every reason to expect that the situation in the U.S. will be much worse than that in Belgium. Most importantly, although Belgium had one of the worst COVID-19 death rates in the world, it has largely conquered the disease by now (fall), and its schools are all open. In contrast, most U.S. schools are closed or partially closed this fall. Students are usually offered remote instruction, but many disadvantaged students lack access to technology and supervision, and even students who do have equipment and supervision do not seem to be learning much, according to anecdotal reports.

In many U.S. schools that have opened fully or partially, outbreaks of the disease are disrupting schooling, and many parents are refusing to send their children to school. Although this varies greatly by regions of the U.S., the average American student is likely to have missed several more effective months of in-person schooling by the time schools return to normal operation.

But even if average losses turn out to be no worse than those seen in Belgium, the consequences are terrifying, for Belgium as well as for the U.S. and other COVID-afflicted countries.

Effect sizes of -0.19 and -0.29 are very large. From the Belgian data on inequality, we might estimate that for disadvantaged students (those in the lowest 25% of socioeconomic status), losses could have been -0.29 in mathematics and -0.39 in Dutch. What do we have in our armamentarium that is strong enough to overcome losses this large?

In a recent blog, I compared average effect sizes from studies of various solutions currently being proposed to remedy students’ losses from COVID shutdowns: Extended school days, after-school programs, summer school, and tutoring. Only tutoring, both one-to-one and one-to-small group, in reading and mathematics, had an effect size larger than +0.10. In fact, there are several one-to-one and one-to-small group tutoring models with effect sizes of +0.40 or more, and averages are around +0.30. Research in both reading and mathematics has shown that well-trained teaching assistants using structured tutoring materials or software can obtain outcomes as good as those obtained by certified teachers as tutors. On the basis of these data, I’ve been writing about a “Marshall Plan” to hire thousands of tutors in every state to provide tutoring to students scoring far below grade level in reading and math, beginning with elementary reading (where the evidence is strongest).
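As a back-of-the-envelope illustration of what these numbers imply, the sketch below sets the Belgian losses against the approximate gains cited above. It assumes that a loss and a gain in effect-size units can simply be added, which is a simplification I am making for illustration, not a finding from any of the studies. The labels and the `remaining_gap` helper are my own.

```python
# Losses reported or estimated in the text (effect sizes, in standard deviation units).
LOSSES = {
    "mathematics, all students": -0.19,
    "Dutch, all students": -0.29,
    "mathematics, disadvantaged (estimate)": -0.29,
    "Dutch, disadvantaged (estimate)": -0.39,
}

# Approximate gains cited in the text.
GAINS = {
    "tutoring (average)": 0.30,
    "tutoring (stronger models)": 0.40,
    "other approaches (upper bound)": 0.10,
}


def remaining_gap(loss: float, gain: float) -> float:
    """Net effect size if loss and gain simply add (an assumption of this sketch)."""
    return round(loss + gain, 2)


for loss_label, loss in LOSSES.items():
    for gain_label, gain in GAINS.items():
        net = remaining_gap(loss, gain)
        print(f"{loss_label:40s} + {gain_label:32s} -> {net:+.2f}")
```

Under that additivity assumption, average tutoring effects roughly cancel the average Belgian losses, while approaches with effect sizes of +0.10 or less leave most of the loss in place, especially for disadvantaged students.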

I’ve also written about national programs in the Netherlands and in England to provide tutoring to struggling students. Clearly, we need a program of this kind in the U.S. And if our scores are like the Belgian scores, we need it as quickly as possible. Students who have fallen far below grade level cannot be left to struggle without timely and effective assistance, powerful enough to bring them at least to where they would have been without the COVID school closures. Otherwise, these students are likely to lose motivation, and to suffer lasting damage. An entire generation of students, harmed through no fault of their own, cannot be allowed to sink into failure and despair.

References

Kuhfeld, M., Soland, J., Tarasawa, B., Johnson, A., Ruzek, E., & Liu, J. (2020). Projecting the potential impacts of COVID-19 school closures on academic achievement. (EdWorkingPaper: 20-226). Retrieved from Annenberg Institute at Brown University: https://doi.org/10.26300/cdrv-yw05

Maldonado, J. E., & DeWitte, K. (2020). The effect of school closures on standardized student test outcomes. Leuven, Belgium: University of Leuven.
