Probabilis: 2014

Monday, December 29, 2014

NFL Playoff Picks 2014-15

The NFL Playoffs start this weekend, and the top seeds (NE, DEN, SEA, and GB) are appropriate: they represent 4 of the top 5 teams in Pyth (the lone missing team being Kansas City, which disappointingly didn't make the playoffs), and all four are heavily favored.

Saturday, December 27, 2014

Creating an NCAAF Model as a Function of Stadium Size

My ~~friend~~ ~~roommate~~ colleague (who's also my friend and roommate), "Stan Hooper", suggested that a decent model for predicting college football game outcomes would be rating teams based on the sizes of their stadiums.

I took the size of each team's stadium and predicted the winner of each game simply by which team had the larger capacity. I factored in home-field advantage by adding the average stadium capacity to the home team's stadium size. (In terms of points, home-field advantage is 3.5, which is approximately the average margin of victory by home teams; this season it is 3.78.)
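Here is a minimal sketch of that rule in Python. The team names and capacities are made up for illustration, the function name `predict_winner` is hypothetical, and giving ties to the home team is my own assumption.

```python
# Hypothetical capacities; not real stadium data.
capacity = {"Big U": 100_000, "Mid State": 60_000, "Small Tech": 30_000}

# Home-field bonus: the average capacity across all teams.
avg_capacity = sum(capacity.values()) / len(capacity)

def predict_winner(home, away):
    """Pick the team with the larger capacity after the home bonus."""
    home_score = capacity[home] + avg_capacity
    away_score = capacity[away]
    # Assumption: a tie goes to the home team.
    return home if home_score >= away_score else away

print(predict_winner("Small Tech", "Mid State"))  # home bonus flips the pick
print(predict_winner("Small Tech", "Big U"))      # bonus is not enough here
```

Because the bonus is the league-wide average capacity, a home team only loses the pick when the visitor's stadium is larger by more than an average stadium.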

On the season, the Home-Field Model predicted 60.96% of straight-up winners. For comparison, my MDS Model predicted 71.71% correctly, and Vegas tabbed 75.11% correctly. Honestly, I'm just relieved that a model created in 30 seconds by importing a Wikipedia page into a spreadsheet didn't beat the one I worked on for an entire semester.

Monday, December 22, 2014

Kentucky is So Good They Broke My Model

The Matrix model uses a series of n x n matrices involving only wins and losses to calculate a "true" win percentage that factors in the strength of each team's opponents. There is a matrix for wins and a matrix for losses. These are inverted, factoring in the total number of games each team has played, and then multiplied by an n x 1 vector of each team's net wins (note: n is the number of teams; in NCAAB, 351):

net wins = wins - losses

Each game is simply denoted by a "1". Therefore, the outputs of the win and loss matrices are bounded between -1 and 1. If a team has a winning "adjusted" record, its rating is above 0; otherwise it's negative. I then standardize this rating to be analogous to win percentage, bounded between 0 and 1:

i = initial Matrix rating; -1 <= i <= 1

i + 1 = i'

i' / 2 = f

f = final Matrix rating; 0 <= f <= 1
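To make the mechanics concrete, here is a rough sketch in Python. It is not my actual spreadsheet: it assumes a single Massey-style least-squares system (games-played matrix times ratings equals net wins) rather than separate win and loss matrices, it replaces the last equation with a sum-to-zero constraint so the system can be solved, and the three-team schedule is invented. The names `least_squares_ratings` and `rescale` are hypothetical.

```python
from fractions import Fraction

def least_squares_ratings(teams, games):
    """games is a list of (winner, loser); each game counts as a single 1."""
    n = len(teams)
    idx = {t: k for k, t in enumerate(teams)}
    M = [[Fraction(0)] * n for _ in range(n)]
    net_wins = [Fraction(0)] * n
    for winner, loser in games:
        w, l = idx[winner], idx[loser]
        M[w][w] += 1          # diagonal: games played
        M[l][l] += 1
        M[w][l] -= 1          # off-diagonal: minus games between the pair
        M[l][w] -= 1
        net_wins[w] += 1      # net wins = wins - losses
        net_wins[l] -= 1
    # Assumption: force ratings to sum to zero, since M alone is singular.
    M[-1] = [Fraction(1)] * n
    net_wins[-1] = Fraction(0)
    # Gauss-Jordan elimination on the augmented matrix.
    # (A disconnected schedule has no pivot and raises StopIteration.)
    A = [row + [rhs] for row, rhs in zip(M, net_wins)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return {t: A[idx[t]][n] for t in teams}

def rescale(i):
    """f = (i + 1) / 2, mapping an initial rating onto a win-percentage scale."""
    return (i + 1) / 2

# Invented round robin: A beat B, A beat C, B beat C.
initial = least_squares_ratings(["A", "B", "C"], [("A", "B"), ("A", "C"), ("B", "C")])
final = {team: rescale(i) for team, i in initial.items()}
print(initial)
print(final)
```

In this toy schedule the initial ratings are 2/3, 0, and -2/3, which rescale to 5/6, 1/2, and 1/6.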

Kentucky has a rating of 1.005. As seen above, the final rating has an upper bound of 1. So the issue must be within the win and loss matrices themselves.

Kentucky's loss matrix rating is 1.000; this checks out, since they're undefeated and thus have 0 losses. The issue then is pinpointed to their win matrix: their rating is 1.010.

In their win matrix, they have 12 wins, and 12 total games (Note: only Division 1 games are included, but in this case all of Kentucky's opponents have been Division 1). This checks out too. Their strength of schedule is very high: 0.645 (0.500 is average). This is the only explanation I can offer as to why Kentucky's rating is so high: they've played a very tough slate of teams and beaten them all. Even so, it shouldn't be above 1.

A full rundown of the Matrix method can be found here on page 31, "Least Squares Ratings", written by Kenneth Massey.

Thursday, November 27, 2014

Fluctuations in UNC's Bowl Chances Throughout the Season

Since I keep a weekly archive of my NCAAF ratings, I was able to retroactively determine how likely UNC was to make a bowl after each of their games. Obviously each win and loss greatly impacts the total probability, since a concrete "W" raises the pregame win probability, whatever it was, to a 1 in the books (and conversely an "L" drops it to a 0).

The following chart is based on simulating the rest of the season 10,000 times after each of UNC's games, using each week's archived ratings. Besides the explicit influence of each win and loss as described above, two notable changes occurred as follows:

The 70-41 loss to ECU had a HUGE negative impact. One reason was the lopsided score itself (with only 3 games of data at that point), and another was that this was the week I transitioned fully from my pregame rankings to the Composite ratings, which factored in only this year's data (after Week 4).
The 50-43 loss to Notre Dame greatly improved the likelihood of 6 wins (from 13.46% to 32.03%) because it was a close loss to a good team (we performed better than expected). However, a coinciding factor was that this was also the week I transitioned to fully using the Pyth ratings to predict games.
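The simulation step can be sketched roughly as follows. This is a simplified stand-in, not my archived code: the current win total and per-game win probabilities below are placeholders, and the function name `simulate_bowl_odds` and the fixed seed are assumptions made so the example is repeatable.

```python
import random

def simulate_bowl_odds(current_wins, win_probs, wins_needed=6,
                       trials=10_000, seed=0):
    """Estimate P(final wins >= wins_needed) by simulating the remaining games."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    hits = 0
    for _ in range(trials):
        # Each remaining game is an independent coin flip with its own odds.
        wins = current_wins + sum(rng.random() < p for p in win_probs)
        if wins >= wins_needed:
            hits += 1
    return hits / trials

# Placeholder inputs: 3 wins so far, four games left.
estimate = simulate_bowl_odds(3, [0.6, 0.5, 0.4, 0.3])
print(f"{estimate:.2%}")
```

Once a game is played, it leaves `win_probs` and moves into `current_wins`, which is why each result shifts the chart so sharply.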

Wednesday, November 19, 2014

Is FSU In the Top 4, Much Less #1?

Some controversy has developed in the past few weeks when the playoff selection committee ranked Oregon and Alabama (both with one loss) ahead of Florida State, the only Power 5 team without a loss.

However, the only thing that matters is that FSU makes it into the top 4 for a spot in the playoff. FSU is #1 in the AP Poll and #1 in the Coaches Poll, but #3 in the CFP rankings. I checked where the Seminoles rank in more objective rating systems to see whether they're underrated at #3, or whether they're overrated and shouldn't even be in the top 4 at all.

In my MDS Composite model (which aggregates wins/losses, strength of schedule, and margin of victory to estimate a complete picture of each team), FSU checks in at #11. PredictionMachine, which simulates every team against one another 50,000 times to generate rankings, ranks Florida State at #12. Kenneth Massey maintains composite rankings of a multitude of different statistical/computer models (114 to be exact), and FSU still doesn't crack the top 4: they're ranked #6. Of those 114 models, Florida State is ranked #4 or better in 51: less than half (44.74%). They're also ranked #1 in only 18: less than 1 in every 6 (15.79%).

The only one of my MDS components in which they rank #1 is the Matrix model, which takes into account only wins and losses and the strength of opponents. It was designed to be analogous to the BCS computers, and the implementation of the playoff selection committee was supposed to eliminate these flawed BCS systems. Ultimately, Seminole fans should consider their team lucky to still be included in the playoff at all. The committee has left Marshall completely out of the Top 25 due to their weak schedule; Florida State's isn't all that impressive either (FSU's SOS ranks #66 in FBS per my metric, while Marshall's ranks #118; for comparison, Alabama's is #17 and Oregon's is #53).

Finally, should Florida State (perhaps unfairly) make the playoff, don't expect them to win it all. Simulations of the current playoff matchups give the following results, with FSU repeating as champions about 1/10th of the time:

	Win %	National Champion
#1 Alabama	66.23%	46.19%
#4 Miss St	33.77%	18.32%

#2 Oregon	62.56%	24.82%
#3 Florida St	37.44%	10.66%
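One way to read the table: dividing each team's title probability by its semifinal win probability gives its implied chance of winning the final, given that it gets there. The short Python sketch below does only that arithmetic on the numbers above; the variable names are mine.

```python
# Values copied from the table above.
semifinal_win = {"Alabama": 0.6623, "Miss St": 0.3377,
                 "Oregon": 0.6256, "Florida St": 0.3744}
champion = {"Alabama": 0.4619, "Miss St": 0.1832,
            "Oregon": 0.2482, "Florida St": 0.1066}

# P(win final | reach final) = P(champion) / P(win semifinal)
final_given_reached = {team: champion[team] / semifinal_win[team]
                       for team in champion}

for team, prob in final_given_reached.items():
    print(f"{team}: {prob:.1%}")
```

By this measure, Florida State would win the title game only a bit more than a quarter of the time even after getting past Oregon.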

Monday, November 10, 2014

Downfall of Parsing in Google Sheets

With the start of college basketball rapidly approaching, I set up my model for the new season. In doing so I checked for errors (probably the most intensive and frustrating part of writing code), and found that ESPN's schedules had slightly changed the names of certain colleges. Every "State" is changed to "St", with a few exceptions, including Ohio State, NC State, and Iowa State. Those alterations and the rest are as follows:
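As a rough sketch of how the "State" to "St" rule could be handled outside of Sheets, here is a small Python helper. The function name `to_schedule_name` is hypothetical, and the exception set contains only the three schools named above, so it is knowingly incomplete.

```python
# Partial exception list: only the schools named in the post.
KEEP_FULL_STATE = {"Ohio State", "NC State", "Iowa State"}

def to_schedule_name(name):
    """Convert a school name to the abbreviated form used in the schedules."""
    if name in KEEP_FULL_STATE:
        return name
    # Replace the whole word "State" only, so other words are untouched.
    words = ["St" if word == "State" else word for word in name.split()]
    return " ".join(words)

print(to_schedule_name("Florida State"))  # Florida St
print(to_schedule_name("Ohio State"))     # Ohio State
```

Keeping the exceptions in one explicit set makes the mismatches easy to audit when the schedule source changes its naming again.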

Sunday, November 2, 2014

UNC's Chances of Going to a Bowl

Bowl odds:
6-6 or better: 44.48%
Win out: 8.62%; Lose out: 14.17%
Expected record: 5.39-6.61
UNC's rank: #58

Our remaining opponents:
vs Pitt (#46): 50.36%, +0.12
at Duke (#40): 28.49%, -7.67
vs NC State (#57): 60.08%, +3.45
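With only three games left, these odds can be checked exactly by enumerating all eight outcomes. The Python sketch below does that with the win probabilities listed above. The current total of 4 wins is inferred from the expected record (5.39 expected wins minus the 1.39 expected from the remaining games), and the function name `record_distribution` is mine.

```python
from itertools import product

current_wins = 4  # inferred from the expected record, not stated directly
remaining = {"vs Pitt": 0.5036, "at Duke": 0.2849, "vs NC State": 0.6008}

def record_distribution(current_wins, win_probs):
    """Return {final win total: probability} over every win/loss combination."""
    dist = {}
    for results in product((True, False), repeat=len(win_probs)):
        prob = 1.0
        for won, p in zip(results, win_probs):
            prob *= p if won else 1 - p
        total = current_wins + sum(results)
        dist[total] = dist.get(total, 0.0) + prob
    return dist

dist = record_distribution(current_wins, list(remaining.values()))
bowl_odds = sum(p for wins, p in dist.items() if wins >= 6)
win_out = dist[current_wins + len(remaining)]
lose_out = dist[current_wins]
expected_wins = sum(wins * p for wins, p in dist.items())

print(f"6-6 or better: {bowl_odds:.2%}")
print(f"Win out: {win_out:.2%}; Lose out: {lose_out:.2%}")
print(f"Expected wins: {expected_wins:.2f}")
```

This treats the three games as independent, which is the same assumption the simulations make.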

Monday, October 27, 2014

MLS Cup Predictions

The MLS Cup bracket is set, and so are my picks!

Not many surprises here, except that my Pyth component has LA as far and away the best team in the league, with a predicted forward-looking win percentage of 0.740 (compared with their actual win percentage of 0.647). Combine that with LA's reputation for excelling in the playoffs, and expect Landon Donovan to go out on top.

Also, by far the best matchup is LA vs RSL (#1 vs #2 in my model).

Monday, October 20, 2014

World Series Preview with Pitching Matchups

I have to default to one of my (new) favorite quotes, "You can't predict sports!" Because according to my calculations, this World Series matchup had a 0.11% chance of occurring at the start of the playoffs: each team was the least likely to win their respective pennant. But, here we are.

First, I start with the likely pitching matchups for each game, which allows me to determine which team has the advantage based on starting pitching alone (a negative number indicates KC is favored):

Dif	Game	Fav Pitch
0.15	G1	SF
-0.35	G2	KC
0.09	G3	SF
0.09	G4	SF
0.15	G5	SF
-0.35	G6	KC
0.09	G7	SF

Now I can factor in home-field advantage:

Home Dif	Game	Fav Pitch
-0.15	G1	KC
-0.65	G2	KC
0.39	G3	SF
0.39	G4	SF
0.45	G5	SF
-0.65	G6	KC
-0.21	G7	KC
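Comparing the two tables, every game shifts by 0.30 toward the home team, so the adjustment can be sketched in Python as below. The 0.30 figure and the set of Kansas City home games are inferred from the tables rather than stated anywhere, and the names are mine.

```python
# Starting-pitching differentials from the first table (positive favors SF).
pitching_dif = {"G1": 0.15, "G2": -0.35, "G3": 0.09, "G4": 0.09,
                "G5": 0.15, "G6": -0.35, "G7": 0.09}

KC_HOME_GAMES = {"G1", "G2", "G6", "G7"}  # inferred from the sign of the shift
HOME_FIELD = 0.30                         # inferred: second table minus first

def home_adjusted(game):
    """Shift the line toward whichever team is at home."""
    shift = -HOME_FIELD if game in KC_HOME_GAMES else HOME_FIELD
    return round(pitching_dif[game] + shift, 2)

home_dif = {game: home_adjusted(game) for game in pitching_dif}
for game, dif in home_dif.items():
    print(game, f"{dif:+.2f}", "KC" if dif < 0 else "SF")
```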

Not surprisingly, each team is favored in their respective home parks. Now I can do the full predictions involving the predicted winner based on their overall performance throughout the season:

Final Line	Game	Favorite	Win %
-0.34	G1	KC	54.53%
-0.85	G2	KC	61.13%
0.20	G3	SF	52.64%
0.19	G4	SF	52.55%
0.26	G5	SF	53.43%
-0.85	G6	KC	61.13%
-0.40	G7	KC	55.32%

Once again, it's clear home-field matters a LOT. The final totals have Kansas City winning it all 57.36% of the time, and San Francisco winning the title the other 42.64%.

Here's the breakdown of all possible outcomes:

Games	KC	SF	Total
4	7.49%	4.89%	12.38%
5	12.87%	11.95%	24.82%
6	19.74%	11.86%	31.60%
7	17.26%	13.94%	31.20%
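These outcome probabilities follow from the per-game win percentages, treating each game as independent. Here is a sketch in Python that walks through the series game by game and tracks the probability of every running score; the function name `series_outcomes` is mine, and because the inputs are the rounded percentages from the table, results may differ from mine in the last decimal.

```python
# Kansas City's win probability in each game, from the Win % column
# (for games where SF is the favorite, KC's chance is 1 minus SF's).
kc_win_prob = [0.5453, 0.6113, 1 - 0.5264, 1 - 0.5255,
               1 - 0.5343, 0.6113, 0.5532]

def series_outcomes(kc_probs, wins_needed=4):
    """Return {(team, games played): probability} for a best-of-seven."""
    outcomes = {}
    states = {(0, 0): 1.0}  # (KC wins, SF wins) -> probability
    for game_number, p in enumerate(kc_probs, start=1):
        next_states = {}
        for (kc, sf), prob in states.items():
            for kc_won, step in ((True, p), (False, 1 - p)):
                new_kc, new_sf = kc + kc_won, sf + (not kc_won)
                mass = prob * step
                if new_kc == wins_needed:
                    key = ("KC", game_number)
                    outcomes[key] = outcomes.get(key, 0.0) + mass
                elif new_sf == wins_needed:
                    key = ("SF", game_number)
                    outcomes[key] = outcomes.get(key, 0.0) + mass
                else:
                    state = (new_kc, new_sf)
                    next_states[state] = next_states.get(state, 0.0) + mass
        states = next_states  # only undecided series continue
    return outcomes

outcomes = series_outcomes(kc_win_prob)
kc_total = sum(p for (team, _), p in outcomes.items() if team == "KC")
print(f"KC wins the series: {kc_total:.2%}")
print(f"KC in 6: {outcomes[('KC', 6)]:.2%}")
```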

The pick: Kansas City in 6

Monday, September 29, 2014

MLB Playoff Bracket and Inconsistency (Thanks to the AL)

This year's MLB postseason is full of inconsistency in how things could play out (per the Pyth component of the MDS Model):

Oakland is the best team but has the 4th best chance of winning it all
Across the board the AL is stronger, and the AL teams combine to win the World Series 58.78% of the time AND have home-field advantage... but Washington is the single most likely team to win it all because they only have to play an AL team once: in the World Series
The Los Angeles Angels are better than the Baltimore Orioles AND have home-field but are less likely to make it to the World Series because both Kansas City and Oakland are stronger than Detroit
Pittsburgh has to get out of the one-game Wild Card playoff, yet is more likely to win the World Series than St. Louis, the team they lost the NL Central to
Using the straight rankings (and factoring in home-field) predicts a different bracket than the straight probabilities

The Matrix Model Reflecting Actual Win %

I've been curious all season as to how closely the Matrix Model (which only takes into account wins and losses) would reflect the actual win percentages of teams at the end of the MLB regular season. With 162 games and all teams connected with very low degrees of separation, I assumed that the residuals between the output of the Matrix and each team's actual win percentage would be very small.

In the above chart, the teams are sorted by win percentage. This indicates that (overall) the Matrix suppresses the "true" win percentages of good teams and raises those of bad teams, meaning the model is biased towards .500. The largest residual (in absolute value) was 0.010, and the sum of all residuals is 0.001, indicating that they are small and centered around 0 as designed.

The aim of the Matrix Model is to adjust each team's wins and losses by the strength of their opponents, so I checked to see if this was the case by looking at the residuals of the teams with the highest and lowest strengths of schedule (basically to check if the model is consistent with itself).

SOS Rank	Team	Residual
1	NYY	-0.002
2	MIN	-0.007
	...
29	SF	0.009
30	LAD	0.010

Note: A negative residual indicates that the Matrix raised the team's win percentage, while a positive residual means the model lowered it. The above table indicates that a stronger SOS correlates with a higher Matrix rating, which means the model is consistent.