Tuesday, July 29, 2014

Reconfiguring the MLB Standings by Removing Cluster Luck

As detailed in Grantland's most recent weekly MLB roundup, "cluster luck" has a big impact on how many runs a team scores/gives up, which then translates into wins and losses. Cluster luck is calculated by Grantland contributor Ed Feng, and described as follows:
Cluster luck stems from the idea that teams have little to no control over when hits occur, either the ones batters try to collect or the ones pitchers try to prevent. If a starting pitcher allows nine singles, but scatters them to avoid allowing a run, he’s lucky; and if a team manages eight hits in a game, but gets seven in one inning, it’s also lucky.
So, I determined the luck-independent MLB standings by first stripping each team's "cluster luck" out of its runs scored and runs allowed, and then calculating its Pythagorean expected record from those adjusted totals, using an exponent of 1.83, which is the most accurate value and the one Baseball-Reference uses.
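The Pythagorean step is straightforward to sketch in code. This is a minimal version assuming the standard Baseball-Reference formulation with the 1.83 exponent; the run totals you'd feed in are the luck-adjusted ones, and the function names here are just illustrative:

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=1.83):
    """Expected win percentage from (luck-adjusted) runs scored and allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

def expected_record(runs_scored, runs_allowed, games_played):
    """Convert the expected win percentage into a W-L record."""
    win_pct = pythagorean_win_pct(runs_scored, runs_allowed)
    wins = round(win_pct * games_played)
    return wins, games_played - wins
```

A team whose adjusted runs scored and allowed are equal comes out at exactly .500, and the higher the exponent, the more a given run differential is assumed to translate into wins.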

Here is how the standings would look if luck were completely even across the board:

[Standings tables: AL East, AL Central, AL West, NL East, NL Central, NL West; columns: Wins, Losses, Win %, GB]

Sunday, July 27, 2014

Preseason NCAAF Ratings/Rankings

Since player turnover is very high from season to season at the college level, I wanted to create a set of preseason NCAAF ratings that wasn't simply the end-of-season ratings from 2013, but instead factored in each team's returning starters and other roster changes. To do so, I used last season's final Composite ratings from the MDS Model as a base, and then also brought in ESPN's Preseason FPI and the S&P+ projections, both of which take player changes on each team into account.

I simply averaged the FPI and S&P+, and then calculated how many standard deviations above or below average each team rated. Then, I compared these figures with my MDS Composite ratings (which take into account both a forward-looking predictive component and a past-performance-only retrodictive component), and adjusted the base rating up or down. Once the season starts, this "preseason" rating will be faded out as the season progresses, carrying less and less weight with each ensuing week.
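The procedure above can be sketched roughly as follows. This is only my reading of the described steps, not the model's actual code; the `adjustment_scale` and `fade_weeks` parameters are hypothetical knobs I've introduced for illustration:

```python
from statistics import mean, stdev

def standardized(values):
    """Express each rating as standard deviations above/below the group mean."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def preseason_ratings(mds_base, fpi, sp_plus, adjustment_scale=1.0):
    """Nudge each team's MDS base rating toward the projection consensus.

    The consensus is the average of FPI and S&P+, standardized; each base
    rating moves up or down by how far the consensus sits from it in z-space.
    """
    consensus = standardized([(f + s) / 2 for f, s in zip(fpi, sp_plus)])
    base_z = standardized(mds_base)
    return [b + adjustment_scale * (c - z)
            for b, c, z in zip(mds_base, consensus, base_z)]

def in_season_blend(preseason, current, week, fade_weeks=10):
    """Fade the preseason rating out linearly as the season progresses."""
    w = max(0.0, 1 - week / fade_weeks)
    return w * preseason + (1 - w) * current
```

With a linear fade like this, the preseason rating carries full weight in week 0 and none after `fade_weeks`; a slower, nonlinear fade would work just as well within the same structure.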

Sunday, July 13, 2014

Including Starting Pitchers in the MLB Model

A major flaw in the MLB version of the MDS Model is that it does not account for starting pitching matchups, which greatly affect each team's chances of winning. For example, the Chicago White Sox are a middling team, but when Chris Sale is on the mound, their odds of winning are much better than their record indicates. The inverse of this applies too; a good team with a bad starting pitcher will have worse chances than they would with an average (or above average) pitcher.

To account for pitching, I utilize a combination of the techniques proposed by FiveThirtyEight and Kenneth Massey: predict the outcome based on the two opposing teams' ratings separate from the two starters, then factor in the pitching.
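In code, that two-stage idea might look like the sketch below. The logistic link, the rating scale, and the home-field constant are all assumptions of mine, not the model's actual parameters; the point is only the structure, where team ratings exclude the starters and each starter's adjustment is layered on afterward:

```python
def win_probability(rating_diff):
    """Map a rating difference to a win probability via an assumed logistic link."""
    return 1 / (1 + 10 ** (-rating_diff / 4))

def game_probability(home_team, away_team,
                     home_pitcher_adj, away_pitcher_adj, hfa=0.2):
    """Home team's win probability from team ratings plus starter adjustments.

    Each pitcher adjustment is positive when the day's starter is better
    than the team's average starter, negative when worse.
    """
    diff = (home_team + home_pitcher_adj + hfa) - (away_team + away_pitcher_adj)
    return win_probability(diff)
```

The appeal of this structure is that a White Sox-type case falls out naturally: a middling team rating plus a large positive starter adjustment on Sale's days yields a much better win probability than the team's record alone would suggest.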

Thursday, July 3, 2014

Isolating the Advantage of Batting Second, Independent of Home-Field

Home-field advantage exists in all sports, and in MLB it's around 4% (home teams win 54% of all games). However, baseball is unusual in that the home offense always gets the last chance to bat (unless it is already ahead, in which case the game ends). My question is whether batting second is an advantage apart from home-field itself, but MLB never plays neutral-site games. So, I turned to the College World Series, where all games are played at a neutral site, to try to answer this question.

For the College World Series, I collected data going back 25 years, through 1989. In those games, the team batting second won 55.53% of the time. However, the higher-seeded team is the one that gets to bat in the bottom half of the inning, so some of that edge is just team quality. Going back through 1995 (the first year teams were nationally seeded), the higher-seeded team won 54.03% of the time. The difference between these two figures should therefore be the advantage gained purely from batting second: 1.50%.

Extending this to MLB (which assumes the batting-second advantage is the same at the professional level) means that roughly 2.5% of the 4% home-field advantage comes from the home field itself, and the other 1.5% from batting in the bottom of the inning.
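The decomposition is just two subtractions; spelled out with the figures from above:

```python
# College World Series figures from the post (all games neutral-site)
bat_second_win_pct = 55.53   # team batting second, 1989 onward
higher_seed_win_pct = 54.03  # higher-seeded team, 1995 onward

# Subtracting the seeding (team-quality) effect leaves the batting-second edge
batting_second_edge = bat_second_win_pct - higher_seed_win_pct  # ~1.50 points

# MLB home teams win about 54% of games, i.e. a 4-point total advantage
total_hfa = 54.0 - 50.0

# Whatever batting second doesn't explain is attributed to the field itself
field_edge = total_hfa - batting_second_edge  # ~2.5 points
```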

Wednesday, July 2, 2014

Are the Cubs Really Bad Because of Day Games?

Last night David Ortiz (of the Boston Red Sox) cited the high number of day games the Cubs play as the reason their performance is consistently poor. I sought to check whether their results in day games actually back this theory.

I went back 6 years to 2009, the year after they last made the playoffs. They actually played better in day games: a 0.452 win percentage in day games versus 0.429 at night. A hypothesis test of whether day games are worse for the Cubs than night games produced a p-value of 0.2389, nowhere near significance. The data offer no real support for Ortiz's theory, although we can't definitively conclude he's wrong either.
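For anyone who wants to run this kind of check themselves, here is a minimal one-sided two-proportion z-test using only the standard library. The game counts below are hypothetical placeholders, since the post only quotes win percentages, so don't expect this to reproduce the 0.2389 figure exactly:

```python
import math

def one_sided_p_value(wins_a, games_a, wins_b, games_b):
    """One-sided two-proportion z-test.

    H1: group A's true win rate is lower than group B's
    (here, A = day games, B = night games).
    Returns P(Z <= observed z) under the pooled null.
    """
    p_a, p_b = wins_a / games_a, wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    z = (p_a - p_b) / se
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z
```

A small p-value would mean the day-game record is worse than could plausibly be chalked up to chance; a large one, as in the Cubs' case, means the sample simply can't distinguish the two.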

Gauging the MDS Model as a Leading Indicator of a Team's Win %

Warning: Bill Simmons-esque bias to follow.

My favorite baseball team is the Tampa Bay Rays, and when I do my weekly updates for the MDS Model, I've noticed the Pyth and Composite ratings (which are designed to be predictive) are consistently higher than the team's current win percentage. My hope is that these ratings are leading indicators of the team's actual win percentage; i.e., that the higher ratings foreshadow a subsequent improvement in the team's record.

As depicted by the above chart, the two percentages seem to move in lockstep, rather than one following the other. The correlation between these two is very strong: 0.895. I then checked the correlation between the MDS Win % and the next week's actual win percentage. This correlation was also positive, but much weaker: 0.296. So, it seems my hypothesis is just wishful thinking.