Saturday, November 14, 2020

The Effect of Simpson's Paradox On Interpreting COVID-19 Case Growth (Florida Data, Thru 11/12/20)

Now that we're deep in the "third wave" of confirmed coronavirus cases in the U.S., the upward trend is no longer geographically confined. That said, the spread is currently highest in the Midwest - some regions, like the Northeast and Northwest, aren't seeing case growth as high.

This leads to a hypothesis around Simpson's Paradox, a phenomenon in statistics "in which a trend appears in several different groups of data but disappears or reverses when these groups are combined". This certainly occurred early in the summer, when it appeared cases had plateaued nationally (graphs from the COVID Tracking Project, 6/16/20):

But the Northeast was coming down off of an extreme high, while cases were accelerating in the South and West:


So depending upon how you cut the data, you would draw different conclusions (this is the crux of Simpson's Paradox).
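
To make that concrete, here is a toy sketch with invented numbers (not the COVID Tracking Project data): one region falls, another rises, and the combined total looks like a plateau.

```python
# Hypothetical weekly case counts, invented purely for illustration.
northeast = [900, 700, 500, 300]        # coming down off an extreme high
south_and_west = [100, 300, 500, 700]   # accelerating

# Summing the regions hides both trends behind a flat combined line.
national = [ne + sw for ne, sw in zip(northeast, south_and_west)]
print(national)  # [1000, 1000, 1000, 1000]
```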

So is the opposite occurring now? Are cases exploding in places that haven't had high case loads yet, while locations that were hit hard earlier aren't seeing stark increases?

To test this, I want data at the county level. To start, I'm only going to use Florida data (thru 11/12/20), since I'm most familiar with it and it has 67 counties, as opposed to 3,141 nationwide. The theory is that over the summer, South Florida (Miami-Dade, Broward, and Palm Beach counties) was driving the outbreak (~44% of cases at the peak), whereas now other cities and/or rural counties are seeing more growth. So while Florida is encountering a "second wave", it might be mainly occurring in places that weren't hit hard yet:

[7DayMA] = 7 Day Moving Average

The 7 day moving average is my focus, since this smooths out day-of-week reporting variance. Looking at this by county, starting April 1, doesn't really support my theory, as Dade/Broward/Palm Beach are the three highest lines in the past ~2 weeks:

County legend sorted by total cases
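
For reference, a trailing 7 day moving average can be computed along these lines. This is a minimal sketch in plain Python, not the code behind the charts; the function name `moving_average` and the sample counts are made up for illustration.

```python
def moving_average(daily_counts, window=7):
    """Trailing moving average; None until a full window of days is available."""
    averages = []
    running_total = 0
    for i, count in enumerate(daily_counts):
        running_total += count
        if i >= window:
            # Drop the day that just fell out of the window.
            running_total -= daily_counts[i - window]
        averages.append(running_total / window if i >= window - 1 else None)
    return averages

# Hypothetical daily new cases for one county (a reporting spike on day 8).
daily = [70, 140, 70, 140, 70, 140, 70, 700]
ma = moving_average(daily)
print(ma[6], ma[7])  # 100.0 190.0
```

The first six entries are None because a full week of data isn't available yet.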

But South Florida has ~30% of the state's population, so I adjusted for the number of residents in each county. Doing so is... not very conclusive:

County legend sorted by total cases
Capped at 500 cases / 100,000 residents; Lafayette County (2nd-smallest in FL) had a very bad August
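
The population adjustment and the chart cap can be sketched as follows. The populations below are round, made-up figures (a very small county and a very large one), and `per_100k` / `capped_rate` are hypothetical helper names, not anything from the original workbook.

```python
def per_100k(cases, population):
    """Cases per 100,000 residents."""
    return cases * 100_000 / population

def capped_rate(cases, population, cap=500):
    """Per-100k rate, clipped at `cap` so one outlier doesn't flatten the chart."""
    return min(per_100k(cases, population), cap)

# Hypothetical inputs: a 7 day average of 50 cases in a county of 8,000
# residents vs. 2,700 cases in a county of 2,700,000 residents.
small_county = per_100k(50, 8_000)         # 625.0
large_county = per_100k(2_700, 2_700_000)  # 100.0
print(capped_rate(50, 8_000))              # 500
```

This is why a tiny county can top a per-capita chart with a case count that would barely register in a large one.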

So what about when each county hit its current peak for cases (defined as the single highest 7 day moving average)? In other words, when did each county set its record-high day? For example, the peak in Lafayette County (see above) was in August, the peak in Miami-Dade was in July, etc.:
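
Finding each county's peak date amounts to taking the date of the maximum 7 day moving average. A minimal sketch, with invented values and a hypothetical `peak_date` helper:

```python
def peak_date(series):
    """series: list of (date, seven_day_ma) pairs. Returns the date of the
    highest value (the earliest such date if there is a tie)."""
    return max(series, key=lambda pair: pair[1])[0]

# Hypothetical (ISO date, 7 day moving average) pairs for two counties.
county_a = [("2020-07-15", 40.0), ("2020-08-10", 55.0), ("2020-11-10", 30.0)]
county_b = [("2020-07-15", 90.0), ("2020-08-10", 60.0), ("2020-11-10", 75.0)]

peaks = {"County A": peak_date(county_a), "County B": peak_date(county_b)}
print(peaks)  # {'County A': '2020-08-10', 'County B': '2020-07-15'}
```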

Or what about the running count of records, where a county sets a new record any time a day is higher than every previous day? We would expect to see a lot of records being set in every month if new waves of cases are happening in new places:
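
One way to count records is to walk each county's series and note every day that beats the running maximum, then tally those days by month. This is a sketch under that assumption; the data is invented and `record_dates` is a hypothetical name.

```python
from collections import Counter

def record_dates(series):
    """Return the dates on which the value exceeded every earlier value."""
    records = []
    best = float("-inf")
    for date, value in series:
        if value > best:
            best = value
            records.append(date)
    return records

# Hypothetical (ISO date, 7 day moving average) pairs for one county.
county = [
    ("2020-06-29", 10.0),
    ("2020-06-30", 12.0),
    ("2020-07-01", 11.0),
    ("2020-07-02", 15.0),
    ("2020-08-01", 14.0),
    ("2020-11-10", 16.0),
]
dates = record_dates(county)
records_per_month = Counter(d[:7] for d in dates)  # key is "YYYY-MM"
print(dict(records_per_month))  # {'2020-06': 2, '2020-07': 1, '2020-11': 1}
```

Note that the first day in a series always counts as a record under this definition, so the earliest month is inflated by one per county.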


This doesn't exhibit Simpson's Paradox - cases might not be rising as rapidly as they were at first, but they aren't disproportionately rising in different places either. To illustrate: 5/31 was the last day until 7/20 that the state saw a week-over-week (WoW) decline in the 7 day moving average. The average weekly increase in cases over this 49 day period was 51%. Up until 11/12 (the end of the current dataset), the last day the state saw a WoW decline was 10/9. Over this 34 day period, the average weekly increase has been 17%. So the rate of increase is certainly slower right now than it was during the summer peak.
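
The WoW comparison can be sketched like this. I'm assuming here that "average weekly increase" means the mean of the daily WoW changes since the last decline; the 51% and 17% figures above may have been computed differently. The series is invented and the function names are hypothetical.

```python
def wow_change(ma, lag=7):
    """Fractional change versus `lag` days earlier; None for the first `lag` days.
    Assumes the moving average is never zero."""
    return [None if i < lag else ma[i] / ma[i - lag] - 1 for i in range(len(ma))]

def average_growth_since_last_decline(ma, lag=7):
    """Mean WoW change over the days after the most recent WoW decline."""
    changes = wow_change(ma, lag)
    declines = [i for i, c in enumerate(changes) if c is not None and c < 0]
    start = declines[-1] + 1 if declines else lag
    streak = changes[start:]
    return sum(streak) / len(streak) if streak else None

# Hypothetical 7 day moving average: flat week, a 10% dip, then a 50% jump.
ma = [100.0] * 7 + [90.0] * 7 + [135.0] * 7
avg = average_growth_since_last_decline(ma)
print(avg)  # 0.5, i.e. a 50% average weekly increase
```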

But this still has not shown Simpson's Paradox, which was the goal of this exercise. So I tried to look at where the new cases were coming from, as a percent of the overall total:

County legend sorted by total cases
Some days do not add to 100% due to cases from "Unknown" county

This data is obviously very granular and messy, but the three bottom bars (blue: Dade, orange: Broward, gray: Palm Beach; the South Florida counties) did have a higher percentage of the total back in April (60%) than at the July peak (44%) or in September (29%). But this share is rising again: 31% in October, and 37% in November to date.
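
The share calculation itself is simple: a group's new cases divided by the statewide total, with "Unknown" left in the denominator (which is why the known counties don't always sum to 100%). A sketch with invented counts and a hypothetical `group_share` helper:

```python
def group_share(cases_by_county, group):
    """Share of total new cases that came from the counties in `group`."""
    total = sum(cases_by_county.values())
    return sum(cases_by_county.get(county, 0) for county in group) / total

SOUTH_FLORIDA = ("Miami-Dade", "Broward", "Palm Beach")

# Hypothetical new cases for a single day.
one_day = {
    "Miami-Dade": 300,
    "Broward": 150,
    "Palm Beach": 50,
    "Hillsborough": 200,
    "Orange": 250,
    "Unknown": 50,
}
share = group_share(one_day, SOUTH_FLORIDA)
print(f"{share:.0%}")  # 50%
```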

So I grouped the state into metropolitan areas, as defined by the U.S. Census Bureau's metropolitan statistical areas (MSAs):
  • S Florida: Miami-Dade, Broward, Palm Beach
  • Tampa: Hillsborough, Pinellas, Pasco, Hernando
  • Orlando: Orange, Seminole, Osceola, Lake 
  • Jax: Duval, St. Johns, Clay, Nassau, Baker
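
The grouping above can be expressed as a simple county-to-metro lookup, with everything else falling into a catch-all bucket. This is a minimal sketch; the "Rest of state" label and the sample counts are my own placeholders.

```python
METROS = {
    "S Florida": ["Miami-Dade", "Broward", "Palm Beach"],
    "Tampa": ["Hillsborough", "Pinellas", "Pasco", "Hernando"],
    "Orlando": ["Orange", "Seminole", "Osceola", "Lake"],
    "Jax": ["Duval", "St. Johns", "Clay", "Nassau", "Baker"],
}
# Invert the mapping so each county points at its metro.
COUNTY_TO_METRO = {
    county: metro for metro, counties in METROS.items() for county in counties
}

def cases_by_metro(cases_by_county):
    """Sum county-level cases into metros; unlisted counties go to 'Rest of state'."""
    totals = {}
    for county, cases in cases_by_county.items():
        metro = COUNTY_TO_METRO.get(county, "Rest of state")
        totals[metro] = totals.get(metro, 0) + cases
    return totals

# Hypothetical new cases for a single day.
one_day = {"Miami-Dade": 300, "Broward": 150, "Duval": 100, "Clay": 20,
           "Lafayette": 5, "Alachua": 40}
by_metro = cases_by_metro(one_day)
print(by_metro)  # {'S Florida': 450, 'Jax': 120, 'Rest of state': 45}
```
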
Metro legend sorted by total cases

To me, this is the graph that provides the strongest repudiation of the hypothesis yet. All four major metro areas appear to rise and fall roughly in step with the statewide case growth, including over the past month or so. Recall that 10/9 was the last day the entire state saw a WoW decline - on that date, the rest of the state (outside South Florida/Tampa/Orlando/Jacksonville) accounted for 39% of new cases. On 11/12, the last date in the dataset, that figure was 34% - reflecting that cases appear to be growing again in places that already peaked over the summer, which rejects the Simpson's Paradox hypothesis.
