Monday, October 26, 2015

Simulated World Series Preview 2015: KCR vs NYM

Last year, when all I had was a statistical model (that accounted for starting pitching) for baseball, I picked the Kansas City Royals to win the World Series in 6 games. Thanks to a historic performance by Madison Bumgarner, that didn't pan out. Neither did most of my other picks - neither of my World Series teams made it out of their respective first rounds. 

This year I drew up a play-by-play simulator, and the results have been considerably better: I've hit every single pick in the NL, including a sizable outright upset in the Mets over the Dodgers. Now that the probable starting pitchers are set for both teams, I can run my simulations again using the most recent postseason rosters.

The Mets win it all 56.61% of the time, with the Royals taking home the trophy the other 43.39%. Last year these probabilities were eerily similar, but in favor of the Royals... and I had them in 6 games. This year, I've got the Mets... in 6. So expect Kansas City to take it in 7, because "you can't predict sports!"

KC in 4 | NYM in 4 | In 4
KC in 5 | NYM in 5 | In 5
KC in 6 | NYM in 6 | In 6
KC in 7 | NYM in 7 | In 7
The pick: New York Mets in 6
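
For anyone curious how series-length probabilities like "Mets in 6" fall out of repeated simulation, here is a minimal sketch. It is not the play-by-play simulator itself: it assumes each game is an independent coin flip with a fixed home and road win probability for one team, and the function name and both probability inputs are hypothetical placeholders.

```python
import random

def simulate_series(p_home_a, p_away_a, n_sims=10_000, seed=0):
    """Estimate best-of-seven outcome frequencies.

    Team A is assumed to hold home-field advantage (hosts games 1, 2, 6, 7).
    p_home_a / p_away_a are hypothetical per-game win probabilities for
    team A at home and on the road; they are NOT taken from the post.
    """
    rng = random.Random(seed)
    a_hosts = [True, True, False, False, False, True, True]  # 2-3-2 format
    counts = {}
    for _ in range(n_sims):
        wins_a = wins_b = 0
        for at_home in a_hosts:
            p = p_home_a if at_home else p_away_a
            if rng.random() < p:
                wins_a += 1
            else:
                wins_b += 1
            if wins_a == 4 or wins_b == 4:
                break  # series is over as soon as one side has 4 wins
        key = ("A" if wins_a == 4 else "B", wins_a + wins_b)
        counts[key] = counts.get(key, 0) + 1
    # Map (winner, number of games) -> share of simulations
    return {k: v / n_sims for k, v in sorted(counts.items())}

if __name__ == "__main__":
    for (winner, games), share in simulate_series(0.52, 0.44).items():
        print(f"{winner} in {games}: {share:.2%}")
```

Summing the shares for one team gives its overall chance of winning the series, and the single largest share is the "pick".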

The "Cespedes for the rest of us" may just bring a title to the side of New York less accustomed to winning.

Tuesday, October 6, 2015

Play-by-Play Simulator for MLB

In some of my recent posts, I've mentioned writing a play-by-play baseball simulator. Just as was the case for NCAAB and NBA, I finally got it done just in time for the playoffs. 

This has allowed me to run 10,000 simulations on this year's MLB playoffs and estimate the probability of each team advancing to each round and ultimately winning the World Series. But first, some notes on the data inputs and methodology used:

There are quite a few complexities in baseball that made this project very challenging. Just determining what happens after the ball is put in play took over 450 lines of code, and printing each player's simulated stats after each simulation greatly increased the computational time it takes to run everything. The end result took months to create in Python, and currently contains over 2,400 lines of code.

So why go to all this trouble when I already have a model (that takes into account starting pitching) for baseball? The complexities of a simulator result in much more accurate predictions, and, in my opinion, simulation is the single best way to predict future events in complex systems such as sports (more on that in another post at some point).


Advantages
  1. Once lineups are announced, I can simulate things at a very granular level, based exactly on which players are available. There's no guesswork based on a team's overall body of work (like there is with almost any mathematical model).
  2. As implied, this allows me to account for player injuries/suspensions.
  3. I account for home-field based on home/away splits for each team.
  4. I account for ballpark effects based on where the game is being played.
  5. I account for the tricky situation involving AL/NL rules regarding a DH in the World Series based on which team is the home team.
  6. I account for pitching matchups, including relievers and closers.
(Current) Disadvantages
  1. I know this is a big one, but I currently don't account for left/right-hand splits for either batters or pitchers. Parsing those splits from Baseball-Reference proved to be extremely difficult, and I hope to have that implemented by next season.
  2. I don't factor in defensive throwing. Whether a runner advances to the next base is determined entirely by their traits on the basepaths.
  3. I don't account for weather (Prediction Machine actually does).
  4. There isn't much complex math that goes into it (yet); it simply "plays" the game exactly as outcomes could occur within the rules of the game.
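
To give a flavor of what "playing" the game means, here is a toy sketch of a single half-inning. It is far simpler than the real simulator: the outcome probabilities are made-up league-ish numbers rather than player stats, every runner advances exactly as many bases as the batter, and there are no steals, errors, or double plays. All names here are hypothetical.

```python
import random

# Hypothetical per-plate-appearance outcome probabilities (not real data).
OUTCOME_PROBS = {
    "out": 0.68, "walk": 0.08, "single": 0.15,
    "double": 0.05, "triple": 0.005, "home_run": 0.035,
}
BASES_GAINED = {"single": 1, "double": 2, "triple": 3, "home_run": 4}

def simulate_half_inning(rng, probs=OUTCOME_PROBS):
    """Play plate appearances until three outs; return runs scored."""
    outs, runs = 0, 0
    bases = [False, False, False]  # first, second, third
    outcomes, weights = list(probs), list(probs.values())
    while outs < 3:
        outcome = rng.choices(outcomes, weights=weights)[0]
        if outcome == "out":
            outs += 1
        elif outcome == "walk":
            # Only forced runners move up on a walk.
            if all(bases):
                runs += 1
            elif bases[0] and bases[1]:
                bases[2] = True
            elif bases[0]:
                bases[1] = True
            bases[0] = True
        else:
            n = BASES_GAINED[outcome]
            new_bases = [False, False, False]
            # Simplification: each runner advances exactly n bases.
            for i, occupied in enumerate(bases):
                if occupied:
                    if i + n >= 3:
                        runs += 1
                    else:
                        new_bases[i + n] = True
            if n == 4:
                runs += 1  # the batter scores on a home run
            else:
                new_bases[n - 1] = True
            bases = new_bases
    return runs

if __name__ == "__main__":
    rng = random.Random(2015)
    sims = [simulate_half_inning(rng) for _ in range(10_000)]
    print(f"average runs per half-inning: {sum(sims) / len(sims):.3f}")
```

A full game is just this loop repeated for both teams over nine or more innings, with each batter and pitcher supplying their own probabilities.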
I gathered data from the following places: 

  • Probable starting pitchers from (when available)
  • Likely starters/rosters and stats from Baseball-Reference
  • Projected batting orders using historicals from Baseball Press
  • Ordering of home-field (for example, the 2-3-2 format in the World Series) from Wikipedia
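
The home-field ordering and the DH rule mentioned earlier can be sketched in a few lines. This assumes the 2015-era rule that the DH is used only in the AL team's park; the function name and arguments are hypothetical and not taken from the simulator.

```python
def world_series_schedule(al_team, nl_team, team_with_home_field):
    """Return (game_number, home_team, dh_in_use) for a 2-3-2 series.

    Assumes the DH is used only when the AL team is the home team.
    """
    if team_with_home_field not in (al_team, nl_team):
        raise ValueError("team_with_home_field must be one of the two teams")
    other = nl_team if team_with_home_field == al_team else al_team
    # Home-field team hosts games 1, 2, 6, 7; the other team hosts 3, 4, 5.
    hosts = [team_with_home_field] * 2 + [other] * 3 + [team_with_home_field] * 2
    return [(game, host, host == al_team)
            for game, host in enumerate(hosts, start=1)]

if __name__ == "__main__":
    # Example call with the AL team holding home-field.
    for game, host, dh in world_series_schedule("KCR", "NYM", "KCR"):
        print(f"Game {game}: at {host}, DH {'on' if dh else 'off'}")
```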
Let's finally get to the results. After playing each game 10,000 times, here are the probabilities for each team in this year's playoffs:


This leads to two divergent brackets. The first, if you pick strictly based on the most likely team to reach each round:

Monday, October 5, 2015

Houston's Offensive Output was Extremely Unlikely, But Not Impossible

On Friday, the Houston Astros held form as the second Wild Card team by beating the Arizona Diamondbacks by a score of 21-5. A score that high is very rare; in fact, throughout the entire 2015 season, a team scoring 21+ happened a total of 3 times (including Friday in Arizona). Out of 4858 possibilities, that's 0.0618% of the time.

So how likely was it that it would happen for Houston on Friday? I just so happened to have finished my play-by-play baseball simulator, and can look back at each result over the 10,000 simulations. Houston scored 21 or more runs 5 times: 0.05% of the time, which is fairly in line with the historical trend over this past season. It was extremely unlikely, but it could happen - and it did. "You can never be 100% certain!"
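
The two rates above can be checked side by side with a quick calculation. The standard error here uses the usual binomial approximation, which is rough for counts this small, so treat it as a sanity check rather than a formal test; the helper name is hypothetical.

```python
import math

def rate_with_error(hits, trials):
    """Return the observed rate and its approximate binomial standard error."""
    p = hits / trials
    return p, math.sqrt(p * (1 - p) / trials)

# Counts taken from the post: 3 of 4858 team-games, 5 of 10,000 simulations.
season_rate, season_se = rate_with_error(3, 4858)
sim_rate, sim_se = rate_with_error(5, 10_000)

gap = abs(season_rate - sim_rate)
combined_se = math.hypot(season_se, sim_se)

print(f"season:    {season_rate:.4%} +/- {season_se:.4%}")
print(f"simulated: {sim_rate:.4%} +/- {sim_se:.4%}")
print(f"gap: {gap:.4%} vs combined standard error: {combined_se:.4%}")
```

The gap between the two rates is smaller than the combined standard error, which is what "fairly in line" looks like with so few events on either side.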

Friday, October 2, 2015

Hurricane Joaquin Could Decide if Houston Gets to the ALDS

The Houston Astros (HOU) are currently hanging on for dear life to the second wild card spot in the AL, up 1 game on both the Angels (LAA) and the Twins (MIN) with 3 games to play. However, they're also 3 games back of the Rangers (TEX) with 3 to play, meaning there are numerous scenarios that could play out:

  1. There is only ONE scenario in which HOU wins the AL West and makes it to the ALDS directly, without having to play in the Wild Card game: if they sweep the Diamondbacks (ARI) (on the road) 3-0, TEX gets swept 0-3 by LAA (at home), and then HOU beats TEX in the one-game playoff for the AL West title. Baseball Prospectus gives a 0.3% chance that this happens. And if they lose this game, HOU still gets the second wild card spot.
  2. If a three-way tie occurs between HOU, LAA, and MIN for the second wild card spot, things get even more complicated:  
    • "Teams would be designated A, B, C based on head-to-head record. That favors the Angels: Angels: 14-12, Astros: 13-12, Twins: 5-8. So the Angels would pick their designation: They could play at home Monday and Tuesday or on the road Tuesday against the winner of the Astros-Twins game. They presumably would pick to play just one game. So the Astros would host the Twins on Monday and the winner hosts the Angels on Tuesday."
  3. If HOU matches or exceeds LAA and MIN in each team's last 3 games, HOU holds on to the second wild card and would play the New York Yankees (NYY) in New York. This is the most likely scenario, since LAA has to play TEX, who are fighting to win the division, and MIN is playing the Royals (KC), who are fighting the Blue Jays (TOR) for home-field throughout the playoffs.
  4. If HOU ties one of LAA or MIN (but not both), they will play a one-game playoff to get into the one-game playoff against NYY.
In any case, Baseball Prospectus gives HOU a 70% chance to grab the second wild card spot between all of the above scenarios. So for simplicity's sake, let's assume that happens and HOU plays at NYY in the one-game Wild Card playoff. This is where it gets interesting...

Due to all of the complex battling described above, HOU is trotting out Dallas Keuchel (their best starter) tonight against ARI. It's generally agreed he's a front-runner for the AL Cy Young, and with a win tonight at ARI, he would get to 20 wins on the year. HOU definitely wants him pitching the do-or-die Wild Card game in New York. But here's the issue: Keuchel has to start tonight, and the AL Wild Card game is currently slated for Tuesday, which means he would only get three days' rest between the two games. And he has never started on three days' rest in his career.

Here's where Hurricane Joaquin comes in: as noted in this article, a rain out due to the hurricane would push the AL Wild Card game to Wednesday, which would allow Keuchel to pitch in the Bronx. And this makes a HUGE difference.

I've recently finished my play-by-play baseball simulator (more on that in another post), which allows me to simulate the game in New York with and without Keuchel starting. It seems very likely that Masahiro Tanaka will be the starter for the Yankees, so HOU's starting pitcher is the variable I will change, running a simulation on each scenario 10,000 times. 

With Keuchel as the starter, HOU beats NYY 54.61% of the time, with an average score of 4.30 to 3.66. With Collin McHugh (their second-best starter) on the mound, HOU beats NYY only 47.37% of the time, with an average score of 4.29 to 4.31.
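
A fair question is whether a 54.61% vs 47.37% split could just be simulation noise. Here is a rough check, assuming the 10,000 runs in each scenario are independent and using the standard error of a proportion; it measures only Monte Carlo noise, not whether the simulator's inputs are right. The helper name is hypothetical.

```python
import math

N_SIMS = 10_000  # simulations per scenario, as stated in the post

def win_prob_se(p, n=N_SIMS):
    """Standard error of a win probability estimated from n simulations."""
    return math.sqrt(p * (1 - p) / n)

keuchel, mchugh = 0.5461, 0.4737  # HOU win shares from the two scenarios

diff = keuchel - mchugh
se_diff = math.hypot(win_prob_se(keuchel), win_prob_se(mchugh))

print(f"Keuchel: {keuchel:.2%} +/- {win_prob_se(keuchel):.2%}")
print(f"McHugh:  {mchugh:.2%} +/- {win_prob_se(mchugh):.2%}")
print(f"difference: {diff:.2%}, about {diff / se_diff:.1f} standard errors")
```

Each estimate is good to roughly half a percentage point, so the gap between the two starters is about ten standard errors wide, far more than run-to-run randomness would explain.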

Whether Keuchel starts determines whether HOU is the favorite in this game, which in turn will likely be dictated by Hurricane Joaquin's effect on New York City. If you're a Houston fan and they manage to make it to the Wild Card game, you want a rain out on Tuesday.