Friday, November 27, 2020

The Effect of Simpson's Paradox On Interpreting COVID-19 Case Growth (United States Data, Thru 11/25/20)

A quick re-primer on Simpson's Paradox, from my previous post concluding we weren't seeing it in Florida re: COVID-19

... A hypothesis around Simpson's Paradox, which is a phenomenon in statistics "in which a trend appears in several different groups of data but disappears or reverses when these groups are combined". This certainly occurred early on over the summer, when it appeared cases had plateaued (graphs from the COVID Tracking Project, 6/16/20):

But the Northeast was coming down off of an extreme high, while cases were accelerating in the South and West:


So depending upon how you cut the data, you would draw different conclusions (this is the crux of Simpson's Paradox).

So is the opposite occurring now? Places that haven't had high case loads yet are exploding now, whereas locations that were hit hard earlier aren't seeing stark increases?

Now that this county-level data is available via the CDC, I was able to pull the data down from Johns Hopkins and conduct analyses similar to those in my previous Florida post. Population data is from Census estimates as of 2019. Note that I am using raw case counts, not estimating true infections like Youyang Gu does. With 3,143 county equivalents in the U.S., the conclusions should track closely for both raw case counts and true infection estimates. Data runs through 11/25 so as not to be skewed by day-to-day Thanksgiving variance.

Overall, it appears the United States has had 3 waves thus far:


The rate of growth during the first wave was the highest, since cases were just starting out: once cases reached 10,000 per day, there were 17 consecutive days on which a new record was set, and over that stretch the week-over-week (WoW) average growth was 290%. The better comparison is the second wave (28 consecutive records) vs. the third wave (31 consecutive records, an active streak). These two average rates of WoW growth are almost the same: second wave = 27%, third wave = 26%.
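For reference, here is a minimal sketch of how those record streaks and week-over-week growth could be computed from a daily national case series. The `date`/`new_cases` column names are my own assumptions, not the actual Johns Hopkins schema, and the WoW metric here (each day vs. 7 days prior) is one reasonable way to measure it.

```python
import pandas as pd

# us: one row per date with a "new_cases" column (daily national total);
# column names are assumptions for illustration, not the Johns Hopkins schema.
def record_streaks_and_wow(us: pd.DataFrame) -> pd.DataFrame:
    us = us.sort_values("date").reset_index(drop=True)

    # A "record" day is any day higher than every previous day
    is_record = us["new_cases"] > us["new_cases"].cummax().shift(1, fill_value=0)

    # Running length of each streak of consecutive record days
    streak_id = (~is_record).cumsum()
    us["record_streak"] = is_record.groupby(streak_id).cumsum()

    # Week-over-week growth of daily new cases (0.26 means +26%)
    us["wow_growth"] = us["new_cases"] / us["new_cases"].shift(7) - 1
    return us
```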

But where these record case numbers are coming from is how we can determine whether Simpson's Paradox applies. I started by looking at when records were set across the > 3,000 counties: when each county set its current peak for cases (defined as the single highest 7 day moving average); in other words, when did each county set its record highest day?
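A minimal sketch of that peak-date calculation, assuming a long-format DataFrame of daily new cases per county; the `fips`/`date`/`new_cases` column names are mine, and `date` is assumed to already be a datetime column:

```python
import pandas as pd

# counties: long-format DataFrame with ["fips", "date", "new_cases"] columns
# (assumed names); "date" is a datetime column.
def peak_month_per_county(counties: pd.DataFrame) -> pd.Series:
    counties = counties.sort_values(["fips", "date"])

    # 7-day moving average of new cases within each county
    counties["ma7"] = (
        counties.groupby("fips")["new_cases"]
        .transform(lambda s: s.rolling(7).mean())
    )

    # Date of each county's single highest 7-day moving average,
    # bucketed by calendar month to match the chart
    peak_rows = counties.loc[counties.groupby("fips")["ma7"].idxmax()]
    return peak_rows.set_index("fips")["date"].dt.to_period("M")
```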



The vast majority of counties have set their all-time record this month (November); even with a slight uptick in the population whose counties peaked in April or July, two-thirds of the country lives in a county that set its highest case count this month.

This is likely not a function of testing variance either, per Youyang Gu's estimates.

A better gauge: what about the rolling number of records, i.e., any day on which a county's count is higher than on every previous day sets a new record? We would expect to see a lot of records being set in every month if new waves of cases are happening in new places:

This is much more the case here! It does provide some evidence that Simpson's Paradox is at play. But it doesn't fully compare the relative peaks (second and third wave) to each other - that first record chart (current records) provides evidence that this current wave is worse than any before it for most of the country.
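For reference, a sketch of how those rolling records can be tallied per month, using the same assumed county-level columns as above:

```python
import pandas as pd

# counties: long-format DataFrame with ["fips", "date", "new_cases"] as above.
def records_set_per_month(counties: pd.DataFrame) -> pd.Series:
    counties = counties.sort_values(["fips", "date"])

    # Within each county, a day sets a record when it exceeds every previous day
    prior_max = (
        counties.groupby("fips")["new_cases"]
        .transform(lambda s: s.cummax().shift(1, fill_value=0))
    )
    is_record = counties["new_cases"] > prior_max

    # Number of county-level records set nationwide in each calendar month
    return is_record.groupby(counties["date"].dt.to_period("M")).sum()
```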

I can't compare > 3,000 counties in one chart, so I have to do some geographical grouping, because Simpson's Paradox can still be exhibited via different subsets of data. To do so, I'm using the Census Bureau's region definitions:


Cases are rising everywhere, whether you look at 4 large regions or 9 smaller sub-regions:



As always, I need to normalize for population. And that tells an even worse story - all 4 major regions and all 9 sub-regions have hit their highest growth in this wave, with the Midwest far and away the worst region currently, or ever:
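A minimal sketch of that per-capita regional rollup, assuming a county-to-region lookup built from the Census definitions and a population column from the 2019 estimates (column and variable names are my own):

```python
import pandas as pd

# counties: long-format daily cases with ["fips", "date", "new_cases"] as above.
# lookup:   one row per county with ["fips", "population", "region"], where
#           "region" is the Census region or division (an assumed lookup table).
def regional_cases_per_100k(counties: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    daily = (
        counties.merge(lookup, on="fips")
        .groupby(["region", "date"])
        .agg(new_cases=("new_cases", "sum"), population=("population", "sum"))
        .reset_index()
        .sort_values(["region", "date"])
    )

    # Daily new cases per 100,000 residents, smoothed with a 7-day moving average
    daily["per_100k"] = daily["new_cases"] / daily["population"] * 100_000
    daily["per_100k_ma7"] = (
        daily.groupby("region")["per_100k"]
        .transform(lambda s: s.rolling(7).mean())
    )
    return daily
```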



Finally, let's look at which regions have been driving the growth over time:


Outside of the Northeast, cases are fairly evenly distributed right now, with the Midwest having the largest share.


The same is true with the smaller sub-regions - East North Central (in the Midwest) currently has a 23% share, double that of any other sub-region.
Everywhere outside of the Northeast is getting hit hard, with the Midwest reaching heights not yet seen during the pandemic, regardless of how you slice the data. 

To return to my original hypothesis, it does appear that Simpson's Paradox may be playing a role now to some extent, and that it had a larger effect earlier in the pandemic. The biggest takeaway is that grouping by region is a necessary way to communicate trends - case growth is much worse in some places than others. But the current record trend truly is exhibited almost everywhere (exception: the Northeast), driven by the Midwest.

Tuesday, November 17, 2020

"What are the odds?" I Am Exposed to a COVID-19 Positive Individual

Note: This post is on the likelihood of being exposed to an infectious COVID-19 positive individual over a certain number of "close contact" community interactions, not the likelihood of being infected by a COVID-19 positive individual over a certain number of interactions

It's actually very straightforward to calculate the "probability of being exposed to someone who is infectious with COVID-19" using publicly available data, based on the number of in-person, "close contact" interactions, as defined by the CDC:

Close contact is defined as being within 6 feet for at least a period of 10 minutes to 30 minutes or more depending upon the exposure. In healthcare settings, this may be defined as exposures of greater than a few minutes or more. Data are insufficient to precisely define the duration of exposure that constitutes prolonged exposure and thus a close contact.

I will illustrate this using Florida data, which I've analyzed before, with Miami-Dade as the region in my example. Of course, this comes with some assumptions around independence and homogeneity of the population you're interacting with. Here's the data you need and how it fits together (a code sketch of the full calculation follows the steps):

First the number of "true infections" needs to be estimated. Youyang Gu provides this formula to estimate that:
  • # of true infections = # of confirmed positives * (16 * sqrt(% positive) + 2.5)
    • 10,396 * (16 * sqrt(8.37%) + 2.5) = 74,110
    • This implies roughly 86% of actual cases are not confirmed, or ~6 in 7 (in this example)
From here, presumably the confirmed positive cases are quarantining as directed, leaving the unknown cases as potential interactions:
  • # of unknown positive cases = # of true infections - # of confirmed positive cases
    • 74,110 - 10,396 = 63,714
Next, calculate the percent of the population you're interacting with (in this case Miami-Dade County) that are unknowingly actively positive and likely infectious:
  • % of population unknowingly actively positive and likely infectious = # of unknown positives / population
    • 63,714 / 2,715,940 = 2.345%
An additional rough estimate for this would be to use the # of confirmed daily cases per 100,000 people like so:
  • % of population (rougher estimate) unknowingly actively positive and likely infectious = (# of confirmed daily cases per 100,000 people * 5) / 10,000
    • (46 * 5) / 10,000 = 2.3%
Finally, the probability that none of the people you had "close contact" with were unknowingly, actively infectious is:
  • (1 - % population unknowingly positive) ^ # of interactions
    • (1 - 2.345%) ^ 10 = 78.9%
  • Conversely, 1 or more positive interactions = 1 - above
    • 1 - 78.9% = 21.1%
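
Putting the steps together, a minimal sketch of the whole calculation, with the Miami-Dade numbers from above plugged in (the function and variable names are my own):

```python
from math import sqrt

def prob_exposed(confirmed_positives: float,
                 pct_positive: float,
                 population: int,
                 interactions: int) -> float:
    """Probability that at least one of `interactions` close contacts is
    unknowingly, actively infectious, using Youyang Gu's true-infection formula."""
    # Estimated true infections from confirmed positives and test positivity
    true_infections = confirmed_positives * (16 * sqrt(pct_positive) + 2.5)

    # Confirmed cases are presumed to be quarantining; the rest are "unknown"
    unknown_positives = true_infections - confirmed_positives

    # Share of the local population that is unknowingly, actively infectious
    p_infectious = unknown_positives / population

    # P(no infectious contact) = (1 - p)^n; the complement is "1 or more"
    return 1 - (1 - p_infectious) ** interactions

# Miami-Dade example from the post: 10,396 confirmed, 8.37% positivity, 2,715,940 residents
print(prob_exposed(10_396, 0.0837, 2_715_940, interactions=10))  # ~0.211, i.e. 21.1%
```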

Saturday, November 14, 2020

The Effect of Simpson's Paradox On Interpreting COVID-19 Case Growth (Florida Data, Thru 11/12/20)

As we're deep in the "third wave" of confirmed coronavirus cases in the U.S., the upward trend is no longer geographically confined. That being said, the highest spread right now is in the Midwest - some places, like the Northeast and Northwest, aren't currently seeing case growth that's nearly as high.

Which leads to a hypothesis around Simpson's Paradox, which is a phenomenon in statistics "in which a trend appears in several different groups of data but disappears or reverses when these groups are combined". This certainly occurred early on over the summer, when it appeared cases had plateaued (graphs from the COVID Tracking Project, 6/16/20):

But the Northeast was coming down off of an extreme high, while cases were accelerating in the South and West:


So depending upon how you cut the data, you would draw different conclusions (this is the crux of Simpson's Paradox).

So is the opposite occurring now? Places that haven't had high case loads yet are exploding now, whereas locations that were hit hard earlier aren't seeing stark increases?

To test this, I want to get data down to the county level - and to start, I'm only going to use Florida data (thru 11/12/20), since I'm most familiar with it and there are 67 counties, as opposed to 3,141 nationwide. The theory is that over the summer, South Florida (Miami-Dade, Broward, and Palm Beach counties) was driving the outbreak (~44% of cases at the peak), whereas now other cities and/or rural counties are seeing more growth. So while Florida is encountering a "second wave", it might be occurring mainly in places that weren't hit hard yet:

[7DayMA] = 7 Day Moving Average

The 7 day moving average is my focus, since this smooths out day-of-week reporting variance. Looking at this by county, starting April 1, doesn't really support my theory, as Dade/Broward/Palm Beach are the three highest lines in the past ~2 weeks:

County legend sorted by total cases

But South Florida has ~30% of the state's population, so I adjusted for the number of residents in each county. Doing so is... not very conclusive:

County legend sorted by total cases
Capped at 500 cases / 100,000 residents; Lafayette County (2nd-smallest in FL) had a very bad August

So what about when each county set its current peak for cases (defined as the single highest 7 day moving average); in other words, when did it set its record highest day? For example, the peak in Lafayette County (see above) was in August, the peak in Miami-Dade was in July, etc.:

Or what about the rolling number of records, i.e., any day on which a county's count is higher than on every previous day sets a new record? We would expect to see a lot of records being set in every month if new waves of cases are happening in new places:


This doesn't exhibit Simpson's Paradox - cases might not be rising as rapidly as they were at first, but they aren't disproportionately rising in different places either. To illustrate: 5/31 was the last day until 7/20 that the state saw a week-over-week (WoW) decline in the 7 day moving average. The average weekly increase in cases over this 49-day period was 51%. Up until 11/12 (the end of the current dataset), the last day the state saw a WoW decline was 10/9. Over this 34-day period, the average weekly increase has been 17%. So the rate of increase is certainly slower right now than it was during the summer peak.

But this still has not shown Simpson's Paradox, which was the goal of this exercise. So I tried to look at where the new cases were coming from, as a percent of the overall total:

County legend sorted by total cases
Some days do not add to 100% due to cases from "Unknown" county

This data is obviously very granular and messy, but the three bottom bars (blue: Dade, orange: Broward, gray: Palm Beach; the South Florida counties) did have a higher percentage of the total back in April (60%) vs. the July peak (44%) vs. September (29%). But that share is rising again: 31% in October and 37% in November to date.

So I grouped the state into metropolitan areas, as defined by the U.S. Census Bureau statistical areas (MSAs); a code sketch of this grouping follows the chart below:
  • S Florida: Miami-Dade, Broward, Palm Beach
  • Tampa: Hillsborough, Pinellas, Pasco, Hernando
  • Orlando: Orange, Seminole, Osceola, Lake 
  • Jax: Duval, St. Johns, Clay, Nassau, Baker
Metro legend sorted by total cases
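
A sketch of that metro grouping and the share-of-total view, with the county-to-MSA mapping spelled out from the list above (the column names are assumptions):

```python
import pandas as pd

# County -> metro mapping per the Census MSA definitions listed above;
# any county not listed falls into "Rest of Florida".
MSA = {
    "Miami-Dade": "S Florida", "Broward": "S Florida", "Palm Beach": "S Florida",
    "Hillsborough": "Tampa", "Pinellas": "Tampa", "Pasco": "Tampa", "Hernando": "Tampa",
    "Orange": "Orlando", "Seminole": "Orlando", "Osceola": "Orlando", "Lake": "Orlando",
    "Duval": "Jax", "St. Johns": "Jax", "Clay": "Jax", "Nassau": "Jax", "Baker": "Jax",
}

# fl: long-format DataFrame with ["county", "date", "new_cases"] (assumed names)
def metro_share_of_daily_cases(fl: pd.DataFrame) -> pd.DataFrame:
    fl = fl.assign(metro=fl["county"].map(MSA).fillna("Rest of Florida"))
    daily = fl.groupby(["date", "metro"])["new_cases"].sum().unstack(fill_value=0)

    # Each metro's share of the statewide daily total
    return daily.div(daily.sum(axis=1), axis=0)
```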

To me, this is the graph that most strongly repudiates my hypothesis. All four major metro areas appear to rise and fall roughly in step with statewide case growth, including over the past ~1 month. Recall that 10/9 was the last day the entire state saw a WoW decline - on that date, the rest of the state (outside South Florida/Tampa/Orlando/Jacksonville) accounted for 39% of new cases. On 11/12, the last date in the dataset, that figure was 34% - reflecting that cases appear to be growing again in places that already peaked over the summer, a rejection of Simpson's Paradox.

Sunday, November 8, 2020

"What are the odds?" No NFL Team Has Played In the Super Bowl At Home

Every year, the Super Bowl is played at a different neutral site, determined years in advance. There have been 54 Super Bowls to date in 15 different cities, and yet no team has played in the game at their home venue. Just how unlikely is this, over the past 5+ decades?

I assumed each team had an equal chance of making the Super Bowl in a given season, and then went through the history of the league to determine the number of teams in each year. There are some wrinkles to account for over time; ultimately, the chance that a given year's Super Bowl includes a host-venue team is:

# of possible home teams * (2 / # of teams in the league), where 2 is the number of teams that make the Super Bowl each year

This can differ in a given year if either the hosting venue has more than one tenant (such as MetLife Stadium), or the game was played at a site that wasn't home to an NFL team (such as the Rose Bowl).

1 minus the above calculation gives the chance that the Super Bowl does not include a home tenant, and multiplying these yearly probabilities together gives the chance it hasn't happened yet: 3.3%
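
A minimal sketch of that running product, with a handful of seasons from the table below hard-coded to show the shape of the calculation (the full history follows the same pattern, ending at 3.3%):

```python
# (season, # of NFL teams that call the host venue home, # of teams in the league);
# a venue with no NFL tenant (e.g. the Rose Bowl) contributes zero possible home teams.
SEASONS = [
    (1966, 1, 24),  # Los Angeles Memorial Coliseum
    (1973, 0, 26),  # Rice Stadium, Houston (no NFL tenant)
    (2013, 2, 32),  # MetLife Stadium (Giants and Jets)
    (2019, 1, 32),  # Miami Gardens
]

def running_no_home_odds(seasons):
    """Year-by-year chance that no host-venue tenant has yet reached the Super Bowl,
    assuming every team is equally likely to make it in a given season."""
    running = 1.0
    for season, home_teams, league_size in seasons:
        p_no_home = 1 - home_teams * (2 / league_size)  # 2 teams make the Super Bowl
        running *= p_no_home
        print(f"{season}: no-home chance = {p_no_home:.1%}, running = {running:.1%}")
    return running

running_no_home_odds(SEASONS)
```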

Season | Number | # Playoff Teams | # Teams | SB Location | Home? | # Teams Poss | No Home | Running Odds
1966 | 1 | 2 | 24 | Los Angeles, California | 1 | 1 | 91.7% | 91.7%
1967 | 2 | 2 | 25 | Miami, Florida | 1 | 1 | 92.0% | 84.3%
1968 | 3 | 2 | 26 | Miami, Florida | 1 | 1 | 92.3% | 77.8%
1969 | 4 | 2 | 26 | New Orleans, Louisiana | 1 | 1 | 92.3% | 71.9%
1970 | 5 | 8 | 26 | Miami, Florida | 1 | 1 | 92.3% | 66.3%
1971 | 6 | 8 | 26 | New Orleans, Louisiana | 1 | 1 | 92.3% | 61.2%
1972 | 7 | 8 | 26 | Los Angeles, California | 1 | 1 | 92.3% | 56.5%
1973 | 8 | 8 | 26 | Houston, Texas | 0 | 0 | 100.0% | 56.5%
1974 | 9 | 8 | 26 | New Orleans, Louisiana | 1 | 1 | 92.3% | 52.2%
1975 | 10 | 8 | 26 | Miami, Florida | 1 | 1 | 92.3% | 48.2%
1976 | 11 | 8 | 28 | Pasadena, California | 0 | 0 | 100.0% | 48.2%
1977 | 12 | 8 | 28 | New Orleans, Louisiana | 1 | 1 | 92.9% | 44.7%
1978 | 13 | 10 | 28 | Miami, Florida | 1 | 1 | 92.9% | 41.5%
1979 | 14 | 10 | 28 | Pasadena, California | 0 | 0 | 100.0% | 41.5%
1980 | 15 | 10 | 28 | New Orleans, Louisiana | 1 | 1 | 92.9% | 38.6%
1981 | 16 | 10 | 28 | Pontiac, Michigan | 1 | 1 | 92.9% | 35.8%
1982 | 17 | 16 | 28 | Pasadena, California | 0 | 0 | 100.0% | 35.8%
1983 | 18 | 10 | 28 | Tampa, Florida | 1 | 1 | 92.9% | 33.2%
1984* | 19 | 10 | 28 | Stanford, California | 0 | 0 | 100.0% | 33.2%
1985 | 20 | 10 | 28 | New Orleans, Louisiana | 1 | 1 | 92.9% | 30.9%
1986 | 21 | 10 | 28 | Pasadena, California | 0 | 0 | 100.0% | 30.9%
1987 | 22 | 10 | 28 | San Diego, California | 1 | 1 | 92.9% | 28.7%
1988 | 23 | 10 | 28 | Miami, Florida | 1 | 1 | 92.9% | 26.6%
1989 | 24 | 10 | 28 | New Orleans, Louisiana | 1 | 1 | 92.9% | 24.7%
1990 | 25 | 12 | 28 | Tampa, Florida | 1 | 1 | 92.9% | 23.0%
1991 | 26 | 12 | 28 | Minneapolis, Minnesota | 1 | 1 | 92.9% | 21.3%
1992 | 27 | 12 | 28 | Pasadena, California | 0 | 0 | 100.0% | 21.3%
1993 | 28 | 12 | 28 | Atlanta, Georgia | 1 | 1 | 92.9% | 19.8%
1994 | 29 | 12 | 28 | Miami, Florida | 1 | 1 | 92.9% | 18.4%
1995 | 30 | 12 | 30 | Tempe, Arizona | 1 | 1 | 93.3% | 17.2%
1996 | 31 | 12 | 30 | New Orleans, Louisiana | 1 | 1 | 93.3% | 16.0%
1997 | 32 | 12 | 30 | San Diego, California | 1 | 1 | 93.3% | 14.9%
1998 | 33 | 12 | 30 | Miami, Florida | 1 | 1 | 93.3% | 13.9%
1999 | 34 | 12 | 31 | Atlanta, Georgia | 1 | 1 | 93.5% | 13.0%
2000 | 35 | 12 | 31 | Tampa, Florida | 1 | 1 | 93.5% | 12.2%
2001 | 36 | 12 | 31 | New Orleans, Louisiana | 1 | 1 | 93.5% | 11.4%
2002 | 37 | 12 | 32 | San Diego, California | 1 | 1 | 93.8% | 10.7%
2003 | 38 | 12 | 32 | Houston, Texas | 1 | 1 | 93.8% | 10.0%
2004 | 39 | 12 | 32 | Jacksonville, Florida | 1 | 1 | 93.8% | 9.4%
2005 | 40 | 12 | 32 | Detroit, Michigan | 1 | 1 | 93.8% | 8.8%
2006 | 41 | 12 | 32 | Miami Gardens, Florida | 1 | 1 | 93.8% | 8.3%
2007 | 42 | 12 | 32 | Glendale, Arizona | 1 | 1 | 93.8% | 7.8%
2008 | 43 | 12 | 32 | Tampa, Florida | 1 | 1 | 93.8% | 7.3%
2009 | 44 | 12 | 32 | Miami Gardens, Florida | 1 | 1 | 93.8% | 6.8%
2010 | 45 | 12 | 32 | Arlington, Texas | 1 | 1 | 93.8% | 6.4%
2011 | 46 | 12 | 32 | Indianapolis, Indiana | 1 | 1 | 93.8% | 6.0%
2012 | 47 | 12 | 32 | New Orleans, Louisiana | 1 | 1 | 93.8% | 5.6%
2013 | 48 | 12 | 32 | East Rutherford, New Jersey | 1 | 2 | 87.5% | 4.9%
2014 | 49 | 12 | 32 | Glendale, Arizona | 1 | 1 | 93.8% | 4.6%
2015 | 50 | 12 | 32 | Santa Clara, California | 1 | 1 | 93.8% | 4.3%
2016 | 51 | 12 | 32 | Houston, Texas | 1 | 1 | 93.8% | 4.0%
2017 | 52 | 12 | 32 | Minneapolis, Minnesota | 1 | 1 | 93.8% | 3.8%
2018 | 53 | 12 | 32 | Atlanta, Georgia | 1 | 1 | 93.8% | 3.6%
2019 | 54 | 12 | 32 | Miami Gardens, Florida | 1 | 1 | 93.8% | 3.3%
2020 | 55 | 14 | 32 | Tampa, Florida | 1 | 1 | 93.8% | 3.1%
2021 | 56 | 14 | 32 | Inglewood, California | 1 | 2 | 87.5% | 2.7%
2022 | 57 | 14 | 32 | Glendale, Arizona | 1 | 1 | 93.8% | 2.6%
2023 | 58 | 14 | 32 | TBD | 1 | 1 | 93.8% | 2.4%
2024 | 59 | 14 | 32 | New Orleans, Louisiana | 1 | 1 | 93.8% | 2.3%

There is a reason the 1984 season is marked with an asterisk: the San Francisco 49ers won that Super Bowl, but the venue was Stanford Stadium - which was NOT their home stadium (that was Candlestick Park). The venue would be considered semi-home, and if the site had been The Stick, this whole exercise would be moot.

This year the chances are pretty good, with the Tampa Bay Buccaneers currently tied for the best record in the NFC and the Super Bowl being held in Tampa. Which brings us to another wrinkle in recent history - Tom Brady.

The New England Patriots have had an incredible run over the past two decades, making 9 Super Bowls. But Foxborough, Massachusetts isn't in the rotation to host the game. So what if it were? What if Gillette Stadium had replaced its AFC East counterpart, Dolphin Stadium?

The Patriots didn't make any of the 3 games played in South Florida. But as I previously determined, their "generic" odds of reaching the game were ~25% in any given year over the past 19 seasons.

If I apply this to the imaginary Foxborough years (Miami years in real life), the odds that no team would have played in the Super Bowl at home get cut roughly in half, to 1.7%.

Season | Number | # Playoff Teams | # Teams | SB Location | Home? | # Teams Poss | No Home | Running Odds | NEP?
2001 | 36 | 12 | 31 | New Orleans, Louisiana | 1 | 1 | 93.5% | 11.4% | 1
2002 | 37 | 12 | 32 | San Diego, California | 1 | 1 | 93.8% | 10.7% | 0
2003 | 38 | 12 | 32 | Houston, Texas | 1 | 1 | 93.8% | 10.0% | 1
2004 | 39 | 12 | 32 | Jacksonville, Florida | 1 | 1 | 93.8% | 9.4% | 1
2005 | 40 | 12 | 32 | Detroit, Michigan | 1 | 1 | 93.8% | 8.8% | 0
2006 | 41 | 12 | 32 | Foxborough, Massachusetts | 1 | 1 | 74.9% | 6.6% | 0
2007 | 42 | 12 | 32 | Glendale, Arizona | 1 | 1 | 93.8% | 6.2% | 1
2008 | 43 | 12 | 32 | Tampa, Florida | 1 | 1 | 93.8% | 5.8% | 0
2009 | 44 | 12 | 32 | Foxborough, Massachusetts | 1 | 1 | 74.9% | 4.3% | 0
2010 | 45 | 12 | 32 | Arlington, Texas | 1 | 1 | 93.8% | 4.1% | 0
2011 | 46 | 12 | 32 | Indianapolis, Indiana | 1 | 1 | 93.8% | 3.8% | 1
2012 | 47 | 12 | 32 | New Orleans, Louisiana | 1 | 1 | 93.8% | 3.6% | 0
2013 | 48 | 12 | 32 | East Rutherford, New Jersey | 1 | 2 | 87.5% | 3.1% | 0
2014 | 49 | 12 | 32 | Glendale, Arizona | 1 | 1 | 93.8% | 2.9% | 1
2015 | 50 | 12 | 32 | Santa Clara, California | 1 | 1 | 93.8% | 2.8% | 0
2016 | 51 | 12 | 32 | Houston, Texas | 1 | 1 | 93.8% | 2.6% | 1
2017 | 52 | 12 | 32 | Minneapolis, Minnesota | 1 | 1 | 93.8% | 2.4% | 1
2018 | 53 | 12 | 32 | Atlanta, Georgia | 1 | 1 | 93.8% | 2.3% | 1
2019 | 54 | 12 | 32 | Foxborough, Massachusetts | 1 | 1 | 74.9% | 1.7% | 0