Monday, August 18, 2014

Parsing Data in Google Sheets

I figured I would give a rundown how I parse data on Google Sheets (which can be translated to use in Microsoft Excel).

To start, I import data from an external site; it has to be in a table or .csv format. In this example, I'm importing from ESPN's schedule for the Tampa Bay Rays.


=IMPORTHTML("http://espn.go.com/mlb/team/schedule/_/name/tb/half/1/tampa-bay-rays","table",1)
This import statement will then expand outside of the first cell it's written in, and fill the rest of the page. For the first game of the season, it gives the following:
MARCHOPPONENTRESULTW-LWINLOSSSAVEATT
Mon, Mar 31- vs
-
- Toronto
- W
- 9-2
1-0Price (1-0)Dickey (0-1)31,042

All I want is the opponent, the result (win/loss), and the score. First, the opponent, from the following:
OPPONENT
- vs
-
- Toronto

What I want is "Toronto". However, ESPN calls them "Toronto", while other sites like CBS Sports and Yahoo might refer to them as "Toronto Blue Jays" or simply "Blue Jays". So, I need a universal name for "Toronto". I use the typical acronyms used for MLB teams; in this case, "TOR".

So, I have to parse the string of - vs -  - Toronto into "TOR". To do this, I use a VLOOKUP statement that converts "Toronto" into "TOR" from a library (called 'Team Lookup') containing every MLB team and their possible name variations.
=IF(MID(B5,3,1)="@",VLOOKUP(TRIM(MID(B5,16,50)),'Team Lookup'!$A$1:$B$32,2,FALSE),VLOOKUP(TRIM(MID(B5,17,50)),'Team Lookup'!$A$1:$B$32,2,FALSE))
TRIM simply removes any extra spaces from the beginning and/or end of the string, and MID separates out a substring from the - vs -  - Toronto string. The result of this code is "TOR".

Next, the result: a win or loss. This draws from the following:
RESULT
- W
- 9-2
All I need is one character: "W" or "L":
=MID(C5,3,1)
Now for the score from the above cell: the Rays' score and the opponent's score.
Unfortunately, the Rays' score is not always first: the winner's is. So I have to create two cases, one for if the Rays win, and one if they lose, using a simple IF statement:
=IF(K5="W",IF(MID(C5,11,1)="-",MID(C5,10,1),MID(C5,10,2)),IF(MID(C5,11,1)="-",MID(C5,12,2),MID(C5,13,2)))
In this example, this results in "9". I now need the opponent's score, which also has two cases and uses an IF statement based on the result:
=IF(K5="L",IF(MID(C5,11,1)="-",MID(C5,10,1),MID(C5,10,2)),IF(MID(C5,11,1)="-",MID(C5,12,2),MID(C5,13,2)))
This results in "2". I also only have to worry about cases of 1 or 2 digits in the score, since baseball scores have a lower limit of 0 and never hit triple digits. 

Finally, I want to determine which team won the game. I can once again draw on the result cell:
=IF(K5="W",$A$1,J5)
If the result is "W", then the winner is simply the team in question (TB, aka $A$1). Else, it's the opponent in this given game.

This all results in the following: 
OpponentResultScoreOpp ScoreWinner
TORW92TB

This result can then be manipulated mathematically by the MDS Model for each game.
Thankfully, this code can be duplicated for every game (162 in MLB) for every team (30), which is extremely useful in the case of a sport/league like NCAAB, where there are 351 schools with ~13,000 total games in any one season.

No comments:

Post a Comment