Friday, November 27, 2020

The Effect of Simpson's Paradox On Interpreting COVID-19 Case Growth (United States Data, Thru 11/25/20)

A quick re-primer on Simpson's Paradox, from my previous post, which concluded we weren't seeing it in Florida's COVID-19 data:

... A hypothesis around Simpson's Paradox, which is a phenomenon in statistics "in which a trend appears in several different groups of data but disappears or reverses when these groups are combined". This certainly occurred early on over the summer, when it appeared cases had plateaued (graphs from the COVID Tracking Project, 6/16/20):

But the Northeast was coming down off of an extreme high, while cases were accelerating in the South and West:


So depending upon how you cut the data, you would draw different conclusions (this is the crux of Simpson's Paradox).
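To make the aggregation effect concrete, here is a minimal sketch in Python. The numbers are invented purely for illustration (they are not the COVID Tracking Project figures), and `northeast`, `south_west`, and `trend` are hypothetical names.

```python
# Hypothetical daily case counts, invented for illustration only.
northeast = [1000, 900, 800, 700, 600]   # coming down off a high
south_west = [200, 300, 400, 500, 600]   # accelerating


def trend(series):
    """Change from the first value to the last value of a series."""
    return series[-1] - series[0]


# Combine the two groups day by day into a "national" series.
national = [a + b for a, b in zip(northeast, south_west)]

print("Northeast trend: ", trend(northeast))
print("South/West trend:", trend(south_west))
print("National trend:  ", trend(national))
```

In this toy case the combined series is flat even though one group is falling and the other is rising, which is the "plateau" pattern described above.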

So is the opposite occurring now? Are places that haven't had high case loads yet exploding now, while locations that were hit hard earlier aren't seeing stark increases?

Now that county-level data is available via the CDC, I was able to pull the data down from Johns Hopkins and run analyses similar to those in my previous Florida post. Population data is from Census estimates as of 2019. Note that I am using raw case counts, and not estimating true infections like Youyang Gu does. With 3,143 county equivalents in the U.S., the conclusions should track closely for both raw case counts and true infection estimates. Data runs through 11/25 so as not to be skewed by day-to-day Thanksgiving variance.

Overall, it appears the United States has had 3 waves thus far:


The rate of growth during the first wave was the highest because cases were just starting out - once cases reached 10,000 per day, there were 17 consecutive days when a new record was set. During that time period, the week-over-week (WoW) average growth was 290%. The better comparison is the second wave (28 consecutive records) vs third wave (31 consecutive records, active streak). These two average rates of WoW growth are almost the same: second wave = 27%, third wave = 26%.
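For readers who want to reproduce these two metrics, here is a rough sketch. The exact definitions behind the figures above are not spelled out, so this assumes WoW growth compares each day with the value 7 days earlier, and that a streak counts consecutive days that each exceed every prior day. The function names are hypothetical.

```python
def week_over_week_growth(daily, lag=7):
    """Percent change of each value versus the value `lag` days earlier.

    Days whose comparison value is zero are skipped to avoid dividing by zero.
    """
    return [
        (daily[i] - daily[i - lag]) * 100 / daily[i - lag]
        for i in range(lag, len(daily))
        if daily[i - lag] > 0
    ]


def longest_record_streak(daily):
    """Longest run of consecutive days that each set a new all-time high.

    The first day is treated as the baseline, not as a record.
    """
    if not daily:
        return 0
    high = daily[0]
    streak = best = 0
    for value in daily[1:]:
        if value > high:
            high = value
            streak += 1
            best = max(best, streak)
        else:
            streak = 0
    return best


# Hypothetical series, for illustration only.
growth = week_over_week_growth([100] * 7 + [150])
streak = longest_record_streak([10, 12, 15, 14, 16, 17, 18, 18])
```

Averaging the output of `week_over_week_growth` over the days of a wave would give a figure comparable in spirit to the 27% and 26% above, under the stated assumptions.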

But where these record case numbers are coming from is how we can determine whether Simpson's Paradox applies. I started by looking at when records were set across the more than 3,000 counties: when did each county set its current peak for cases (defined as its single highest 7-day moving average)? In other words, when was its record-high day?



The vast majority of counties have set their all-time record this month (November); even with a slight uptick in the share of the population whose county set its record in April or July, two-thirds of the country lives in a county that set its highest case count this month.

This is likely not a function of testing variance either, per Youyang Gu's estimates.
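The peak-date tabulation described above can be sketched roughly as follows. This is not the code behind the charts; it assumes a trailing 7-day average, dates as 'YYYY-MM-DD' strings, and hypothetical function names and inputs.

```python
from collections import defaultdict


def moving_average(values, window=7):
    """Trailing moving average; the first result covers days 0..window-1."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def peak_date(dates, daily_cases, window=7):
    """Date on which the trailing average is highest (earliest date on ties)."""
    averages = moving_average(daily_cases, window)
    if not averages:
        return None
    best = max(range(len(averages)), key=lambda i: averages[i])
    return dates[best + window - 1]


def population_share_by_peak_month(counties):
    """Percent of total population living in counties that peaked each month.

    `counties` is an iterable of (peak_date, population) pairs.
    """
    totals = defaultdict(int)
    for peak, population in counties:
        totals[peak[:7]] += population  # 'YYYY-MM' prefix identifies the month
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {month: pop * 100 / grand_total for month, pop in totals.items()}


# Hypothetical county: a single one-week burst of cases in mid-November.
dates = [f"2020-11-{day:02d}" for day in range(1, 22)]
cases = [0] * 7 + [7] * 7 + [0] * 7
peak = peak_date(dates, cases)

# Hypothetical (peak_date, population) pairs for two counties.
shares = population_share_by_peak_month([(peak, 200), ("2020-07-20", 100)])
```

Weighting by population rather than counting counties matters here because county populations vary enormously, so the two views can tell different stories.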

A better gauge - what about the rolling number of records, i.e. when a day is higher in a county than any previous day, that sets a new record? We would expect to see a lot of records being set in every month if new waves of cases are happening in new places:

This is more so the case! This does provide some evidence that Simpson's Paradox is at play. But it doesn't fully compare the relative peaks (second and third wave) to each other - that first record chart (current records) provides evidence that this current wave is worse than any before it for most of the country.
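A rolling record count of this kind could be computed along these lines. This is a sketch under assumptions: dates are 'YYYY-MM-DD' strings, the first day is a baseline rather than a record, and `record_days` and `records_by_month` are hypothetical names.

```python
from collections import Counter


def record_days(dates, values):
    """Dates on which the value exceeds every earlier value in the series."""
    records = []
    high = None
    for date, value in zip(dates, values):
        if high is None:
            high = value  # first day is the baseline, not a record
        elif value > high:
            high = value
            records.append(date)
    return records


def records_by_month(county_series):
    """Count record days per month across many counties.

    `county_series` is an iterable of (dates, values) pairs, one per county.
    """
    counts = Counter()
    for dates, values in county_series:
        counts.update(date[:7] for date in record_days(dates, values))
    return counts


# Hypothetical single-county example, for illustration only.
example_dates = ["2020-10-30", "2020-10-31", "2020-11-01", "2020-11-02"]
example_values = [5, 6, 4, 9]
monthly = records_by_month([(example_dates, example_values)])
```

Unlike the current-peak view, this counts every record a county ever set, so earlier waves still show up even where a later wave has since surpassed them.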

I can't compare > 3,000 counties in one chart, so I have to do some geographical grouping, because Simpson's Paradox can still be exhibited via different subsets of data. To do so, I'm using the Census Bureau's region definitions:


Cases are rising everywhere, whether you look at 4 large regions or 9 smaller sub-regions:



As always, I need to normalize for population. And that tells an even worse story - all 4 major regions and all 9 sub-regions have hit their highest growth in this wave, with the Midwest far and away the worst region currently, or ever:
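As a small aside on the normalization step, here is a sketch of a per-capita rate. The rate base (cases per 100,000 residents) is an assumption, and the region names and numbers are placeholders, not real data.

```python
def per_100k(cases, population):
    """Cases per 100,000 residents."""
    return cases * 100_000 / population


# Placeholder (cases, population) pairs, invented for illustration.
regions = {
    "Region A": (5_000, 10_000_000),
    "Region B": (3_000, 2_000_000),
}

rates = {name: per_100k(cases, pop) for name, (cases, pop) in regions.items()}
```

In this made-up example Region A has more raw cases, but Region B has three times the per-capita rate, which is why the normalized view can change the ranking.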



Finally, let's look at which regions have been driving the growth over time:


Outside of the Northeast, cases are fairly evenly distributed right now, with the Midwest having the largest share.


The same is true with smaller sub-regions - East North Central (Midwest) currently has double the share (23%) of any other sub-region. 
Everywhere outside of the Northeast is getting hit hard, with the Midwest reaching heights not yet seen during the pandemic, regardless of how you slice the data. 
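The share-of-cases view above boils down to a simple calculation, sketched here with placeholder region names and invented counts (not the figures behind the charts). `case_shares` is a hypothetical helper.

```python
def case_shares(cases_by_region):
    """Each region's percent share of the total case count."""
    total = sum(cases_by_region.values())
    if total == 0:
        return {region: 0.0 for region in cases_by_region}
    return {
        region: cases * 100 / total
        for region, cases in cases_by_region.items()
    }


# Placeholder daily case counts, invented for illustration.
example = {"Region A": 10, "Region B": 40, "Region C": 30, "Region D": 20}
shares_today = case_shares(example)
```

Running this for each day and stacking the results over time would give a share chart of the kind described above.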

To answer my original hypothesis, it does appear that Simpson's Paradox may be playing a role now to some extent, and had a larger effect earlier in the pandemic. The biggest takeaway is that grouping by region is a necessary way to communicate trends - case growth is much worse in some places than others. But the current record trend is truly exhibited almost everywhere (exception: Northeast), driven by the Midwest.
