Tuesday, October 6, 2015

Play-by-Play Simulator for MLB

In some of my recent posts, I've mentioned writing a play-by-play baseball simulator. As was the case for NCAAB and NBA, I finished it just in time for the playoffs.

This has allowed me to run 10,000 simulations on this year's MLB playoffs and estimate the probability of each team advancing to each round and ultimately winning the World Series. But first, some notes on the data inputs and methodology used:

There are quite a few complexities in baseball that made this project very challenging. Just determining what happens after the ball is put in play took over 450 lines of code, and printing each player's simulated stats after each simulation greatly increased the computational time it takes to run everything. The end result took months to create in Python, and currently contains over 2,400 lines of code.

So why go to all this trouble when I already have a model for baseball (one that takes starting pitching into account)? The added detail of a simulator results in much more accurate predictions, and, in my opinion, simulation is the single best way to predict future events in complex systems such as sports (more on that in another post at some point).

Advantages

  1. Once lineups are announced, I can simulate things at a very granular level, based exactly on which players are available. There's no guesswork based on a team's overall body of work (like there is with almost any mathematical model).
  2. As implied, this allows me to account for player injuries/suspensions.
  3. I account for home-field advantage based on home/away splits for each team.
  4. I account for ballpark effects based on where the game is being played.
  5. I account for the tricky situation involving AL/NL rules regarding a DH in the World Series based on which team is the home team.
  6. I account for pitching matchups, including relievers and closers.
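
To make the DH point concrete, here is a minimal sketch of how the home team and DH rule for each World Series game could be derived. It is an illustration, not the simulator's actual code: the function name, the `HOME_PATTERN_2_3_2` constant, and the tuple layout are all assumptions. It relies on the 2-3-2 home-field format and on the 2015 rule that the DH is used only in the AL team's park.

```python
# Illustrative sketch only; names and structure are assumptions,
# not taken from the simulator described in this post.

# 2-3-2 format: "H" = the team with home-field advantage hosts, "A" = it travels.
HOME_PATTERN_2_3_2 = "HHAAAHH"


def world_series_schedule(al_team, nl_team, home_field_league="AL"):
    """Return a list of (game_number, home_team, dh_in_use) for up to 7 games."""
    if home_field_league == "AL":
        favored, other = al_team, nl_team
    else:
        favored, other = nl_team, al_team

    schedule = []
    for game_number, slot in enumerate(HOME_PATTERN_2_3_2, start=1):
        home_team = favored if slot == "H" else other
        # 2015 rule: the DH is in effect only when the AL club is the home team.
        dh_in_use = home_team == al_team
        schedule.append((game_number, home_team, dh_in_use))
    return schedule


for game in world_series_schedule("TOR", "NYM"):
    print(game)
```

Each simulated game can then look up its own entry to decide whether the pitcher bats, instead of hard-coding the rule per league.
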

(Current) Disadvantages

  1. I know this is a big one, but I currently don't account for left/right-hand splits for either batters or pitchers. Parsing those splits from Baseball-Reference proved to be extremely difficult, and I hope to have that implemented by next season.
  2. I don't factor in defensive throwing. Whether a runner advances to the next base is determined entirely by their traits on the basepaths.
  3. I don't account for weather (Prediction Machine actually does).
  4. There isn't much complex math that goes into it (yet); it simply "plays" the game, with each outcome occurring as it could within the rules of the game.
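
As a rough idea of what "playing" the game means, the sketch below draws the result of a single plate appearance at random from a table of outcome probabilities, then repeats that many times. This is a generic Monte Carlo illustration, not the simulator's code: the outcome categories, the probabilities (made up, not taken from Baseball-Reference), and the function names are all assumptions.

```python
import random

# Hypothetical per-plate-appearance probabilities for one batter.
# The numbers are invented for illustration and sum to 1.
OUTCOME_PROBS = {
    "out": 0.68,
    "walk": 0.08,
    "single": 0.15,
    "double": 0.05,
    "triple": 0.005,
    "home_run": 0.035,
}


def simulate_plate_appearance(probs, rng):
    """Draw one outcome at random, weighted by its probability."""
    outcomes = list(probs)
    weights = [probs[outcome] for outcome in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def estimate_frequencies(probs, n_sims=10_000, seed=0):
    """Repeat the draw n_sims times and return the observed frequencies."""
    rng = random.Random(seed)
    counts = {outcome: 0 for outcome in probs}
    for _ in range(n_sims):
        counts[simulate_plate_appearance(probs, rng)] += 1
    return {outcome: count / n_sims for outcome, count in counts.items()}


print(estimate_frequencies(OUTCOME_PROBS))
```

A full simulator chains draws like this together with base-running, substitutions, and scoring rules, which is where most of the code goes.
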

I gathered data from the following places:

  • Probable starting pitchers from MLB.com (when available)
  • Likely starters/rosters and stats from Baseball-Reference
  • Projected batting orders using historicals from Baseball Press
  • Ordering of home-field (for example, the 2-3-2 format in the World Series) from Wikipedia

Let's finally get to the results. After playing each game 10,000 times, here are the probabilities of each team reaching each round in this year's playoffs:


Seed  AL    ALDS      ALCS     WS       Champ
1     KCR   100.00%   45.31%   20.11%    9.65%
4     NYY    45.35%   24.35%   11.66%    6.03%
5     HOU    54.65%   30.34%   15.49%    8.38%
2     TOR   100.00%   61.66%   35.46%   20.48%
3     TEX   100.00%   38.34%   17.27%    8.01%

Seed  NL    NLDS      NLCS     WS       Champ
1     STL   100.00%   43.80%   18.80%    7.21%
4     PIT    42.67%   22.82%   10.52%    4.36%
5     CHC    57.33%   33.39%   18.38%    9.57%
2     LAD   100.00%   36.59%   15.07%    6.38%
3     NYM   100.00%   63.41%   37.23%   20.19%
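
Since these percentages come from 10,000 simulated postseasons, each one carries some sampling noise. The sketch below estimates that noise using the standard error of a proportion. It assumes each simulated postseason is an independent run, which is my assumption about the setup rather than something stated above, and the function name is made up for illustration.

```python
import math


def monte_carlo_standard_error(p_hat, n_sims):
    """Standard error of a proportion estimated from n_sims independent runs."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_sims)


n_sims = 10_000
# Championship probabilities taken from the table above.
for team, p_hat in [("TOR", 0.2048), ("NYM", 0.2019), ("PIT", 0.0436)]:
    se = monte_carlo_standard_error(p_hat, n_sims)
    # 1.96 standard errors gives an approximate 95% interval.
    print(f"{team}: {p_hat:.2%} +/- {1.96 * se:.2%}")
```

Under that assumption, the noise on a championship probability near 20% is well under one percentage point, so gaps of a few tenths of a point between teams shouldn't be over-read.
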

This leads to two divergent brackets. The first, if you pick strictly based on the most likely team to reach each round:


And the second, if you pick each series based on the likely matchups:


The only difference is in the #1 seed vs. Wild Card (WC) matchup in each league. HOU would be favored over KCR in the AL, and CHC over STL in the NL, but only if they first make it out of the one-game Wild Card round.
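
The reasoning can be checked directly from the table. A Wild Card team's chance of winning its Division Series, given that it gets there, is its LCS probability divided by its Division Series probability. A small sketch (the function name is just for illustration):

```python
def conditional_advance_probability(p_reach_next_round, p_reach_this_round):
    """P(reach next round | reached this round), from two cumulative odds."""
    return p_reach_next_round / p_reach_this_round


# Inputs are the ALDS/ALCS and NLDS/NLCS columns from the table above.
hou = conditional_advance_probability(0.3034, 0.5465)
chc = conditional_advance_probability(0.3339, 0.5733)

print(f"HOU wins the ALDS given it wins the Wild Card game: {hou:.1%}")
print(f"CHC wins the NLDS given it wins the Wild Card game: {chc:.1%}")
```

That works out to about 55.5% for HOU and 58.2% for CHC, both above 50%, even though KCR (45.31%) and STL (43.80%) are each more likely than their potential Wild Card opponents to reach the LCS overall.
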

Some notes:
  1. The AL has home-field advantage in the World Series, which helps the league win it all 52.54% of the time.
  2. The most likely matchup, TOR vs NYM, would be a doozy: over the 10,000 sims, TOR would beat NYM 51.06% of the time.
  3. The potential team-that-got-hacked vs team-that-hacked-them (HOU vs STL) series occurs 2.91% of the time.
  4. A NYY-NYM Subway Series happens 4.34% of the time.

In most cases, my simulations line up with those of Prediction Machine, except in the case of the Los Angeles Dodgers. One possible explanation is the way I determine the probable starting pitchers: I'm assuming managers will follow the pattern of last year's World Series and use a 4-pitcher rotation, meaning each team's best starter would start Games 1 and 5. Prediction Machine may assume Don Mattingly opts to start Kershaw and/or Greinke on shorter rest. However, even if Kershaw and Greinke each get two starts a series, look at the rest of the Dodgers' pitching staff. Their bullpen is fairly weak, and the drop-off from Kershaw/Greinke to the rest of their starters is steep. Contrast that with the New York Mets, who boast a 5-deep rotation of Syndergaard, deGrom, Harvey, Colon, and Niese.
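
The 4-pitcher rotation assumption amounts to cycling through the starters by game number. A minimal sketch, with placeholder pitcher labels and a made-up function name (not the simulator's code):

```python
def probable_starter(rotation, game_number):
    """Pick the starter for a game by cycling through the rotation in order."""
    return rotation[(game_number - 1) % len(rotation)]


# Placeholder labels; SP1 is the team's best starter.
rotation = ["SP1", "SP2", "SP3", "SP4"]

for game_number in range(1, 8):
    print(game_number, probable_starter(rotation, game_number))
```

With four starters, SP1 gets Games 1 and 5 and SP2 gets Games 2 and 6. Modeling a start on shorter rest would mean shortening the list for that series, which is the kind of difference that could explain the gap on the Dodgers.
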
