Probabilis: 2015

Sunday, December 6, 2015

College Football Playoff and Bowl Picks

The College Football Playoff and all 39 other bowls are now set (there were so many this year that some 5-7 teams made it), and with that I can use my Pyth ratings from the MDS Model to project the winners of each game. A couple of notes about this year's slate:

As mentioned above, three 5-7 teams will play in a bowl game this year: Nebraska, Minnesota, and San Jose State.
There is a 41st bowl for 2 FCS teams: the winners of the MEAC and SWAC will face off in the first bowl of the season in Atlanta. These 2 conferences voluntarily removed themselves from FCS playoff consideration in order to participate in this bowl.
With the 5-7 debacle, 2 MWC teams actually will end up playing each other in a bowl game: Nevada vs Colorado State. And the MWC commissioner is not happy about it.

But first, here are my probabilities for each team in the playoff:

Seed	Team	Final	Champion
1	Clemson	36.76%	14.38%
4	Oklahoma	63.24%	32.92%
2	Alabama	69.10%	40.59%
3	Michigan State	30.90%	12.12%

Alabama is the #1 team in the country in my model, and we look headed for a rematch of the 2014 Sugar Bowl in the NCG: Alabama vs Oklahoma. And 2/5ths of the time, Tricky Nicky will take home yet another title for Alabama.

Before we take a look at the rest of the bowls, let's look back at my "8 Completely Arbitrary 2015 NCAAF Bold Predictions". I definitely went chalk (not bold) with pretty much every single one, so we should see a high percentage of correct "predictions".

1. North Carolina will be this year's fraud ACC team

Check! 1/1

2. Georgia will win the SEC

Not even close. 1/2

3. Alabama will miss the CFP

As I just illustrated above, they'll likely win the damn thing. 1/3

4. Al Golden will be fired after Miami goes 6-6 again

Well, I got this half-way correct about half-way through the season. Golden got fired, but Miami went 8-4. 1.5/4

5. The ACC will be the conference left out of the CFP

Try #1 overall seed. 1.5/5

6. Tennessee will beat Florida

This definitely should've happened, but of course, it didn't. 1.5/6

7. Ezekiel Elliott will win the Heisman Trophy

Not likely. 1.5/7

8. Ohio St will repeat as National Champions

Literally cannot happen. 1.5/8

Well, that should entirely have been expected. 1.5/8 for a phenomenal percent correct of 19%-ish.

Now for the rest of the bowl picks, which are backed by math and statistical reasoning, and not just me pontificating (only FBS included):

Preseason NCAAB Ratings

College basketball is BACK! And this year I've tried to do what I do for college football, where I adjust my end-of-season Composite ratings by metrics such as player turnover and recruiting classes. As KenPom has illustrated, the upcoming season is best projected by the past few seasons for a given program. As he describes it, "In the most general sense, the main ingredient in the system is inertia. If a team has been good in the recent past, it’s likely to be rated well in the preseason."

With that in mind, I actually don't take into account recruiting for college basketball; instead, my main adjustment is based on the percentage of returning possession-minutes (that link is to an ESPN Insider article, so here's a quick rundown as described in the article): "Basically it's the percentage of minutes that a player is on the floor, multiplied by the percentage of possessions he used last season (as seen at KenPom.com)."

This ultimately gives a weighted representation of how much experience is returning from last year's team for each school, which I then use to adjust my Composite ratings. Here are my projected Top 25:

1	Virginia	0.915
2	Kansas	0.882
3	Villanova	0.879
4	North Carolina	0.877
5	Gonzaga	0.868
6	Utah	0.850
7	Wisconsin	0.847
8	Baylor	0.844
9	Iowa State	0.842
10	Wichita State	0.838
11	Cincinnati	0.833
12	Arizona	0.833
13	Oklahoma	0.830
14	Kentucky	0.828
15	VCU	0.825
16	SMU	0.824
17	Duke	0.814
18	Texas	0.800
19	Michigan State	0.798
20	Notre Dame	0.797
21	Davidson	0.785
22	Northern Iowa	0.784
23	Tulsa	0.780
24	Xavier	0.774
25	Indiana	0.773
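As a rough illustration of the returning possession-minutes idea described above, here is a minimal Python sketch. The roster, the field names (`min_pct`, `poss_pct`, `returning`), and the function name are all made up for illustration; how the resulting share then adjusts a Composite rating isn't specified in this post, so that step is left out.

```python
def returning_possession_minutes(players):
    """Share of last season's possession-minutes that returns this season.

    Each player is a dict with (hypothetical) keys:
      min_pct   - fraction of team minutes the player was on the floor
      poss_pct  - fraction of possessions the player used while on the floor
      returning - True if the player is back this season
    """
    total = sum(p["min_pct"] * p["poss_pct"] for p in players)
    back = sum(p["min_pct"] * p["poss_pct"] for p in players if p["returning"])
    return back / total

# Hypothetical four-player roster, not real data.
roster = [
    {"min_pct": 0.80, "poss_pct": 0.25, "returning": True},
    {"min_pct": 0.70, "poss_pct": 0.20, "returning": False},
    {"min_pct": 0.60, "poss_pct": 0.20, "returning": True},
    {"min_pct": 0.50, "poss_pct": 0.18, "returning": False},
]

share = returning_possession_minutes(roster)
print(f"Returning possession-minutes: {share:.1%}")
```

Weighting minutes by usage means losing a high-usage starter costs more than losing a role player with the same minutes.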

Simulated World Series Preview 2015: KCR vs NYM

Last year, when all I had was a statistical model (that accounted for starting pitching) for baseball, I picked the Kansas City Royals to win the World Series in 6 games. Thanks to a historic performance by Madison Bumgarner, that didn't pan out. Neither did most of my other picks - neither of my World Series teams made it out of their respective first rounds.

This year I drew up a play-by-play simulator, and the results have been considerably better: I've hit every single pick in the NL, including a sizable outright upset in the Mets over the Dodgers. Now that the probable starting pitchers are set for both teams, I can run my simulations again using the most recent postseason rosters.

The Mets win it all 56.61% of the time, with the Royals taking home the trophy the other 43.39%. Last year these probabilities were eerily similar, but in favor of the Royals... and I had them in 6 games. This year, I've got the Mets... in 6. So expect Kansas City to take it in 7, because "you can't predict sports!"

Games	KC	NYM	Total
4	5.15%	7.59%	12.74%
5	10.70%	14.77%	25.47%
6	12.30%	18.85%	31.15%
7	15.24%	15.40%	30.64%
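To get a feel for how a series-length table like this is shaped, here is a simplified sketch that assumes a single, constant per-game win probability. That is an assumption for illustration only: the simulator plays each game separately, so the starters and the home park change the odds from game to game, and this closed form won't reproduce the numbers above exactly. The value 0.53 is a placeholder, not a model output.

```python
from math import comb

def series_distribution(p, wins_needed=4):
    """Return {(team, games): probability} for a best-of-(2*wins_needed - 1) series.

    p is the assumed constant per-game win probability for team "A".
    """
    dist = {}
    for games in range(wins_needed, 2 * wins_needed):
        # The series winner takes the last game and exactly
        # wins_needed - 1 of the games before it.
        ways = comb(games - 1, wins_needed - 1)
        losses = games - wins_needed
        dist[("A", games)] = ways * p**wins_needed * (1 - p) ** losses
        dist[("B", games)] = ways * (1 - p) ** wins_needed * p**losses
    return dist

dist = series_distribution(0.53)  # illustrative per-game probability
series_win_a = sum(prob for (team, _), prob in dist.items() if team == "A")

for (team, games), prob in sorted(dist.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(f"{team} in {games}: {prob:.2%}")
print(f"A wins the series: {series_win_a:.2%}")
```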

The pick: New York Mets in 6

The "Cespedes for the rest of us" may just bring a title to the side of New York less accustomed to winning.

Oh my god. @Mets I can't believe I can actually say this, but is it true that there is now "A Cespedis for the rest of us"? #mets #T7L
— Jerry Seinfeld (@JerrySeinfeld) July 31, 2015

Tuesday, October 6, 2015

Play-by-Play Simulator for MLB

In some of my recent posts, I've mentioned writing a play-by-play baseball simulator. Just as was the case for NCAAB and NBA, I finally got it done just in time for the playoffs.

This has allowed me to run 10,000 simulations on this year's MLB playoffs and estimate the probability of each team advancing to each round and ultimately winning the World Series. But first, some notes on the data inputs and methodology used:

There are quite a few complexities in baseball that made this project very challenging. Just determining what happens after the ball is put in play took over 450 lines of code, and printing each player's simulated stats after each simulation greatly increased the computational time it takes to run everything. The end result took months to create in Python, and currently contains over 2,400 lines of code.

So why go to all this trouble when I already have a model (that takes into account starting pitching) for baseball? The complexities of a simulator result in much more accurate predictions, and, in my opinion, simulation is the single-best way to predict future events in complex systems such as sports (more on that in another post at some point).

Advantages

Once lineups are announced, I can simulate things at a very granular level, based exactly on which players are available. There's no guess work based on a team's overall body of work (like there is with almost any mathematical model).
As implied, this allows me to account for player injuries/suspensions.
I account for home-field based on home/away splits for each team.
I account for ballpark effects based on where the game is being played.
I account for the tricky situation involving AL/NL rules regarding a DH in the World Series based on which team is the home team.
I account for pitching matchups, including relievers and closers.

(Current) Disadvantages

I know this is a big one, but I currently don't account for left/right-hand splits for either batters or pitchers. Parsing those splits from Baseball-Reference proved to be extremely difficult, and I hope to have that implemented by next season.
I don't factor in defensive throwing. Whether a runner advances to the next base is determined entirely by their traits on the basepaths.
I don't account for weather (Prediction Machine actually does).
There isn't much complex math that goes into it (yet); it simply "plays" the game exactly as outcomes could occur within the rules of the game.

I gathered data from the following places:

Probable starting pitchers from MLB.com (when available)
Likely starters/rosters and stats from Baseball-Reference
Projected batting orders using historicals from Baseball Press
Ordering of home-field (for example, the 2-3-2 format in the World Series) from Wikipedia

Let's finally get to the results. After playing each game 10,000 times, here are the probabilities for each team in this year's playoffs:

AL seed	Team	ALDS	ALCS	WS	Champ
1	KCR	100.00%	45.31%	20.11%	9.65%
4	NYY	45.35%	24.35%	11.66%	6.03%
5	HOU	54.65%	30.34%	15.49%	8.38%
2	TOR	100.00%	61.66%	35.46%	20.48%
3	TEX	100.00%	38.34%	17.27%	8.01%

NL seed	Team	NLDS	NLCS	WS	Champ
1	STL	100.00%	43.80%	18.80%	7.21%
4	PIT	42.67%	22.82%	10.52%	4.36%
5	CHC	57.33%	33.39%	18.38%	9.57%
2	LAD	100.00%	36.59%	15.07%	6.38%
3	NYM	100.00%	63.41%	37.23%	20.19%

This leads to two divergent brackets. The first, if you pick strictly based on the most likely team to reach each round:

Houston's Offensive Output was Extremely Unlikely, But Not Impossible

On Friday, the Houston Astros held form as the second Wild Card team by beating the Arizona Diamondbacks by a score of 21-5. A score that high is very rare; in fact, throughout the entire 2015 season, a team scoring 21+ happened a total of 3 times (including Friday in Arizona). Out of 4858 possibilities, that's 0.0618% of the time.

So how likely was it that it would happen for Houston on Friday? I just so happened to have finished my play-by-play baseball simulator, and can look back at each result over the 10,000 simulations. Houston scored 21 or more runs 5 times: 0.05% of the time, which is fairly in line with the historical trend over this past season. It was extremely unlikely, but it could happen - and it did. "You can never be 100% certain!"

Friday, October 2, 2015

Hurricane Joaquin Could Decide if Houston Gets to the ALDS

The Houston Astros (HOU) are currently hanging on for dear life to the second wild card spot in the AL, up 1 game on both the Angels (LAA) and the Twins (MIN) with 3 games to play. However, they're also 3 games back of the Rangers (TEX) with 3 to play, meaning there are numerous different scenarios that could play out:

There is only ONE scenario in which HOU wins the AL West and makes it to the ALDS directly, without having to play in the Wild Card game: if they sweep the Diamondbacks (ARI) (on the road) 3-0, TEX gets swept 0-3 by LAA (at home), and then HOU beats TEX in the one game playoff for the AL West title. Baseball Prospectus gives a 0.3% chance that this happens. And if they lose this game, HOU still gets the second wild card spot.
If a three-way tie occurs between HOU, LAA, and MIN for the second wild card spot, things get even more complicated:

"Teams would be designated A, B, C based on head-to-head record. That favors the Angels: Angels: 14-12, Astros: 13-12, Twins: 5-8. So the Angels would pick their designation: They could play at home Monday and Tuesday or on the road Tuesday against the winner of the Astros-Twins game. They presumably would pick to play just one game. So the Astros would host the Twins on Monday and the winner hosts the Angels on Tuesday."

If HOU matches or exceeds LAA and MIN in each team's last 3 games, HOU holds on to the second wild card and would play the New York Yankees (NYY) in New York. This is the most likely scenario, since LAA has to play TEX, who are fighting to win the division, and MIN is playing the Royals (KC), who are fighting the Blue Jays (TOR) for home-field throughout the playoffs.
If HOU ties one of LAA or MIN (but not both), they will play a one game playoff to get into the one game playoff against NYY.

In any case, Baseball Prospectus gives HOU a 70% chance to grab the second wild card spot between all of the above scenarios. So for simplicity's sake, let's assume that happens. So HOU will play at NYY in the one game Wild Card game. This is where it gets interesting...

Due to all of the complex battling described above, HOU is trotting out Dallas Keuchel (their best starter) tonight against ARI. It's generally agreed he's a front-runner for the AL Cy Young, and with a win tonight at ARI, he would get to 20 wins on the year. HOU definitely wants him pitching the do-or-die Wild Card game in New York. But here's the issue: Keuchel has to start tonight, and the AL Wild Card game is currently slated for Tuesday, which means he would only get three days' rest between the two games. And he has never started on three days' rest in his career.

Here's where Hurricane Joaquin comes in: as noted in this article, a rain out due to the hurricane would push the AL Wild Card game to Wednesday, which would allow Keuchel to pitch in the Bronx. And this makes a HUGE difference.

I've recently finished my play-by-play baseball simulator (more on that in another post), which allows me to simulate the game in New York with and without Keuchel starting. It seems very likely that Masahiro Tanaka will be the starter for the Yankees, so HOU's starting pitcher is the variable I will change, running a simulation on each scenario 10,000 times.

With Keuchel as the starter, HOU beats NYY 54.61% of the time by an average score of 4.30 to 3.66. With Collin McHugh (their second-best starter) on the mound, HOU beats NYY 47.37% of the time by an average score of 4.29 to 4.31.

Whether Keuchel starts determines whether HOU is the favorite in this game, which in turn will likely be dictated by Hurricane Joaquin's effect on New York City. If you're a Houston fan and they manage to make it to the Wild Card game, you want a rain out on Tuesday.

Thursday, September 24, 2015

Kicking the Extra Point vs Going for Two Under the New Rules

Through the first 2 weeks of the NFL season, much has been made of the new rule change that moves the extra point attempt back to the 15-yard line, and for good reason - the success rate on XP is down from 99.3% for all of last season to 94.19% so far this season. The Steelers appear to be going for two more often, but is this roughly 5-percentage-point decrease in accuracy enough to make the two-point conversion the better strategy?

It comes down to the expected value of each strategy. Over all of last season, teams successfully scored the two-point conversion 47.46% of the time. Thus, the expected value of going for two was 0.949, nearly identical to the current expected value of the extra point, 0.942. However, through a very small sample size of 15, teams have successfully converted for two 53.33% of the time in the first 2 weeks of 2015, which implies an expected value of 1.067. 0.125 points is a large difference, but it remains to be seen if this higher success rate will continue. It's more likely regression to the mean will occur, which leaves the decision between the two options basically a coin flip. It probably will be more dependent on the current game state.
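The expected-value comparison above is just success rate times points, which can be checked in a few lines. The rates are the ones quoted in this post; the only assumption is that 53.33% on 15 attempts means 8 conversions. The break-even line at the end is simple algebra: going for two matches the kick when its success rate is half the extra point's.

```python
# Success rates quoted in the post.
xp_rate_2015 = 0.9419      # extra points, first 2 weeks of 2015
two_pt_rate_2014 = 0.4746  # two-point conversions, all of 2014
two_pt_rate_2015 = 8 / 15  # 53.33% on 15 attempts (assumed to be 8 of 15)

def expected_points(success_rate, points):
    """Expected points from one attempt worth `points` on success, 0 otherwise."""
    return success_rate * points

ev_xp = expected_points(xp_rate_2015, 1)
ev_two_2014 = expected_points(two_pt_rate_2014, 2)
ev_two_2015 = expected_points(two_pt_rate_2015, 2)

# Two-point success rate at which both options have the same expected value.
break_even_two_pt_rate = xp_rate_2015 / 2

print(f"EV extra point (2015):          {ev_xp:.3f}")
print(f"EV two-point (2014 rate):       {ev_two_2014:.3f}")
print(f"EV two-point (2015, 15 tries):  {ev_two_2015:.3f}")
print(f"Break-even two-point rate:      {break_even_two_pt_rate:.1%}")
```

At last season's 47.46% conversion rate, teams sit just above that break-even point, which is why the decision looks like a coin flip.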

Wednesday, September 23, 2015

The Most (In)efficient Way to Complete the Taco Bell $20 Challenge

The Taco Bell $20 Challenge is "a game in which a participant or participants must consume twenty US dollars worth of food items from Taco Bell restaurant in one sitting." The goal then should be to get to $20 in the most inefficient way possible: in other words, maximize cost while minimizing the amount of food.

First, a couple of caveats in order to maintain the spirit of the challenge:

The $20 has to be food only, no drinks
I removed all dessert items. Otherwise it would be pretty easy: order 20 orders of Cinnamon Twists and/or Cinnabon Delights

I used the weight of each item (in grams) as a proxy for its respective size (since we ultimately care about volume), and then aimed to minimize total weight subject to the constraint of spending at least $20. I set this up as a linear program and solved it with the simplex method, which gave the following optimal order:

Data sources: Prices - Fast Food Menu Prices, Weights - Calorie Count

For contrast, choosing 20 Cinnabon Delights and/or Cinnamon Twists (both are the same price and weight) would be a clearly superior strategy:

But this is banned under my caveat. What if we make things a bit more interesting, and don't allow more than 1 of the same item?

This results in a 19% increase in weight, but adds a bit of variety to the order.
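For anyone who wants to try the unrestricted version of this problem, here is a small sketch of the same objective: minimize total grams while spending at least $20, with repeats allowed. It swaps in a dynamic program over cents instead of the simplex method, purely so it runs on the standard library and returns whole-number quantities. The menu below is entirely hypothetical (placeholder names, prices, and weights), not the data used for the orders above.

```python
def lightest_order(menu, target_cents=2000):
    """Minimize total weight subject to total price >= target_cents.

    menu maps item name -> (price in cents, weight in grams).
    Items may be ordered any number of times.
    """
    inf = float("inf")
    # best[c] = least weight of any order costing at least c cents
    best = [0.0] + [inf] * target_cents
    pick = [None] * (target_cents + 1)
    for c in range(1, target_cents + 1):
        for name, (price, grams) in menu.items():
            candidate = best[max(0, c - price)] + grams
            if candidate < best[c]:
                best[c], pick[c] = candidate, name
    # Walk back through the choices to recover item counts.
    order, c = {}, target_cents
    while c > 0:
        name = pick[c]
        order[name] = order.get(name, 0) + 1
        c = max(0, c - menu[name][0])
    return best[target_cents], order

# Hypothetical menu: (price in cents, weight in grams). Not real data.
menu = {
    "item_a": (500, 200),
    "item_b": (300, 100),
    "item_c": (100, 50),
}

total_grams, order = lightest_order(menu)
total_cents = sum(menu[name][0] * qty for name, qty in order.items())
print(order, f"{total_grams:.0f} g", f"${total_cents / 100:.2f}")
```

The one-of-each variant would need a 0/1 version of the same recurrence, which this sketch does not cover.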

Tuesday, September 15, 2015

"What are the odds?" Down 60 Points, How Likely Was It I Would Come Back in Fantasy Football?

Fantasy football is the greatest/worst, depending upon which side of the final score you're on. In Week 1, I was getting beat down by my opponent, Colin, who had also been talking smack all week. And he had every reason to: at one point, ESPN's projections had him winning by 60 points. But at the end of the week, this happened:

Just how likely was it I would end up winning the week? Yahoo's fantasy football gives win probabilities in addition to projected points each week. In this upcoming week, I'm favored by 4.75 over my opponent, with a win probability of 55%. By dividing this projected margin of victory by a constant and then treating it as normally distributed, I can estimate the given probability (this is how I make predictions for any given league in real sports). Using the given Yahoo numbers, I estimate that constant to be roughly 35, which now allows me to estimate Colin's win probability given that he was projected to win by 60. And that estimated win probability was 95.68%.
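The calculation described above, sketched in Python: divide the projected margin by the constant and pass it through the standard normal CDF. The constant 35 and the two margins come from this post; the function and variable names are just illustrative.

```python
from statistics import NormalDist

SCALE = 35  # constant estimated in the post from Yahoo's margin and win probability

def win_probability(projected_margin, scale=SCALE):
    """Win probability for the favorite, treating margin / scale as standard normal."""
    return NormalDist().cdf(projected_margin / scale)

p_next_week = win_probability(4.75)  # favored by 4.75 this week
p_colin_week1 = win_probability(60)  # Colin projected to win by 60

print(f"Favored by 4.75: {p_next_week:.2%}")
print(f"Favored by 60:   {p_colin_week1:.2%}")
```

The first line lands near Yahoo's 55%, which is the sanity check on the constant, and the second reproduces the roughly 95.7% figure for Colin.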

"What are the odds?" The Chances the SJ Giants Come Back to Beat Visalia

Going into the bottom of the 5th inning last night, the San Jose Giants trailed the Visalia Rawhide 4-0, and also were down 2-0 in the best-of-5 series. I was asked "What's the win probability for the game at that point?"

The best way to determine this would be to simulate the rest of the game/series play-by-play, taking into account each player on each team and their respective performance throughout the season. However, I don't quite have my play-by-play simulator ready for baseball (although I'm very close!), so I adapted the simulator I wrote a while back that simply simulates how many runs are scored in each half-inning. As before, this data was gathered from Baseball-Reference, which shows how frequently 0 runs, 1 run, 2 runs, etc. are scored in a half-inning (in MLB in 2015; I'm using this as a proxy for MiLB).

The Giants had to overcome a 4 run deficit in 5 innings. I ran 10,000 simulations, and they only come back to tie the game (or take the lead) by the end of the 9th 2.55% of the time. To ultimately answer the first question, they only win the game 1.78% of the time.
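A half-inning simulator of this kind can be sketched in a few lines. The run frequencies below are illustrative placeholders, not the Baseball-Reference numbers, so the output will not match the 2.55% and 1.78% figures above. The sketch also assumes every half-inning is drawn independently from the same distribution and that extra innings are played one full inning at a time until someone leads.

```python
import random

# Illustrative frequencies of runs scored in a half-inning (NOT real data).
RUN_DIST = {0: 0.73, 1: 0.15, 2: 0.07, 3: 0.03, 4: 0.015, 5: 0.005}
RUNS, WEIGHTS = zip(*RUN_DIST.items())

def half_inning(rng):
    """Draw the number of runs scored in one half-inning."""
    return rng.choices(RUNS, weights=WEIGHTS)[0]

def simulate_comeback(deficit=4, home_halves=5, away_halves=4,
                      n_sims=10_000, seed=0):
    """Home team trails by `deficit` entering the bottom of the 5th.

    Returns (share of sims tied or ahead after 9, share of sims won).
    """
    rng = random.Random(seed)
    tied_or_ahead = wins = 0
    for _ in range(n_sims):
        home = sum(half_inning(rng) for _ in range(home_halves))
        away = deficit + sum(half_inning(rng) for _ in range(away_halves))
        if home >= away:
            tied_or_ahead += 1
        while home == away:  # extra innings
            away += half_inning(rng)
            home += half_inning(rng)
        if home > away:
            wins += 1
    return tied_or_ahead / n_sims, wins / n_sims

tie_rate, win_rate = simulate_comeback()
print(f"Tied or ahead after 9: {tie_rate:.2%}, win: {win_rate:.2%}")
```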

Then there was a followup: "Down 4-0 heading to the bottom of the 5th in a best-of-5 series in which they were down 2 games to none. What was the series win probability at that point?"

We know their odds of winning Game 3 were down to 1.78%, so now we have to determine the odds they subsequently win Game 4 AND 5. Game 4 is at home, while a possible Game 5 would be on the road. Using both teams' second-half records (both finished at .600) in Log5 and factoring in home-field (0.24 runs in MLB) gives San Jose 52.27% to win Game 4 and 47.73% to win Game 5. So the odds of winning both Game 4 and 5 are 24.95%. Finding their series win probability headed into the bottom of the 5th last night is then 1.78% * 24.95% = 0.44%, or about 1 in 225. Even so, that's twice as likely as the Cowboys' comeback over the (New York) Giants on Sunday night.

Given that they did come back and win Game 3, they're now at 24.95%, or about 1 in 4.
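The series arithmetic above can be laid out step by step. The sketch below uses the standard Log5 formula, which gives exactly 50% for two .600 teams on a neutral field. How the 0.24-run home-field edge is converted into a probability isn't spelled out in this post, so the 52.27% and 47.73% figures are taken as given rather than recomputed.

```python
def log5(a, b):
    """Probability that a team with win pct `a` beats a team with win pct `b`."""
    return (a - a * b) / (a + b - 2 * a * b)

neutral = log5(0.600, 0.600)  # equal records -> 0.5 before home-field

# Figures quoted in the post.
p_game3 = 0.0178  # win Game 3 from 4-0 down entering the bottom of the 5th
p_game4 = 0.5227  # win Game 4 at home
p_game5 = 0.4773  # win Game 5 on the road

p_games_4_and_5 = p_game4 * p_game5
p_series = p_game3 * p_games_4_and_5

print(f"Neutral-field Log5:   {neutral:.2%}")
print(f"Win Games 4 and 5:    {p_games_4_and_5:.2%}")
print(f"Win series from here: {p_series:.2%} (about 1 in {1 / p_series:.0f})")
```

Multiplying the three game probabilities treats the games as independent, which is the same assumption the post makes.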

Sunday, September 13, 2015

Is Running the Ball Really "Safer"?

Conventional wisdom dictates that rushing the football is a "safer" option than passing due to a lower risk of a turnover. However, does this increased security justify the decreased expected yardage compared to passing?

To start, I looked at the average yards gained per attempt for rushing vs passing, based on the 2014 NFL season.

Type	Rushing	Passing
Yards	57002	121247
# Attempts	13688	17879
Yds/Attempt	4.16	6.78

Data from Football-Reference

Note that this is over all passing attempts, not just completions. Even taking into account incomplete passes (which gain 0 yards) and negative sack yardage (which, in the NFL, is deducted from passing yards), passing the ball clearly gives larger expected yardage, by 2.62 yards per play. However, the threat of an interception means rushing is "safer"... right?

Fumbles occurred on 4.89% of rushing attempts, while interceptions occurred on 2.52% of passing attempts. However, the offense can recover a fumble, and does so 37.9% of the time on rushing plays, with the defense recovering the other 62.1%. Factoring this into the above figure means 3.04% of rushing attempts result in a turnover.

Type	Rushing	Passing
Attempts	13688	17879
INT/FUM	670	450
Turnovers	416	450
TO Rate	3.04%	2.52%

So in the end, rushing is MORE risky turnover-wise than passing (or at least it was in the 2014 season).
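Both tables can be reproduced from the raw 2014 totals quoted above. The sketch below uses only numbers from this post; the one modeling step is scaling fumbles by the 62.1% defensive recovery rate to get turnovers.

```python
# 2014 NFL totals quoted in the post.
rush_yards, rush_attempts = 57002, 13688
pass_yards, pass_attempts = 121247, 17879
rush_fumbles, interceptions = 670, 450
defense_recovery_rate = 0.621  # share of rushing fumbles lost to the defense

rush_ypa = rush_yards / rush_attempts
pass_ypa = pass_yards / pass_attempts

# Only fumbles recovered by the defense count as turnovers.
rush_turnovers = rush_fumbles * defense_recovery_rate
rush_to_rate = rush_turnovers / rush_attempts
pass_to_rate = interceptions / pass_attempts

print(f"Yards/attempt  rush {rush_ypa:.2f}  pass {pass_ypa:.2f}")
print(f"Turnover rate  rush {rush_to_rate:.2%}  pass {pass_to_rate:.2%}")
```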

This doesn't mean teams should pass the ball 100% of the time: there's still game theory to be considered. However, the above results counter conventional wisdom, and imply that teams are both leaving yards on the table AND are risking more turnovers by running the ball too often.

Saturday, September 5, 2015

Going 1st vs Going 2nd in Overtime in NCAAF

The conventional wisdom in college football is that once a game goes to overtime, the ideal strategy is to go second. That way you know exactly what you need to score to extend the game or win it outright, similar to batting in the bottom of the inning in baseball.

I sought to find out if past results support this idea. I looked at every game that went to overtime in the past two seasons, and then took the final overtime period to determine which team went first or second (so in a 3OT game, for example, the team that went second in that third OT was the team that ultimately went second to end the game). Over 67 games, the team that went second won 55.22% of the time. Testing this against the hypothesis of 50% (it doesn't matter whether you go first or second) gave a corresponding p-value of 0.197. That is well short of conventional significance thresholds, so this isn't definitive, but it does give some slight evidence that going second is the correct strategy.
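One way to run a test like this is a one-sided z-test for a proportion, sketched below. This is an assumption about the method: the exact test behind the 0.197 figure isn't stated, and the normal approximation here lands close to that value rather than matching it digit for digit. The count of 37 wins is inferred from 55.22% of 67 games.

```python
from math import sqrt
from statistics import NormalDist

games = 67
second_wins = 37  # 55.22% of 67, inferred from the post
p0 = 0.5          # null hypothesis: going first or second doesn't matter

p_hat = second_wins / games
standard_error = sqrt(p0 * (1 - p0) / games)
z = (p_hat - p0) / standard_error

# One-sided p-value: chance of a result at least this favorable to
# the second team if the true win rate were 50%.
p_value = 1 - NormalDist().cdf(z)

print(f"Observed: {p_hat:.2%}, z = {z:.3f}, one-sided p = {p_value:.3f}")
```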
