Lahman database using dplyr

Introduction

The Lahman database is a publically accessible database of baseball statistics. This database contains multiple tables of data from a large time series, which makes it great to run experiments against.

dplyr is an R package which allows programmatic access to this data using concepts very similar to T-SQL.

Initializations

Once the packages are installed, they need to be loaded into the environemnt. This is accomplished with two lines of code.

library(Lahman)
library(dplyr)

Filtering Columns and Rows

Now to find any New York Yankee with over 50 home runs from any year, run the following:

Batting %>% 
  select(playerID, yearID, teamID, HR) %>%
  filter(teamID == 'NYA' & HR > 50) %>%
  arrange(desc(HR))
##    playerID yearID teamID HR
## 1 marisro01   1961    NYA 61
## 2  ruthba01   1927    NYA 60
## 3  ruthba01   1921    NYA 59
## 4  ruthba01   1920    NYA 54
## 5  ruthba01   1928    NYA 54
## 6 mantlmi01   1961    NYA 54
## 7 rodrial01   2007    NYA 54
## 8 mantlmi01   1956    NYA 52

Examples from 09/06/2017 Class

The first example asks for only players of the New York Yankees that have hit 40 HR or more. I’ve added code to order the results from most HR to lowest to get a quick view of the biggest HR hitters.

Batting %>%
  select(yearID, teamID, playerID, HR) %>%
  filter(teamID == 'NYA' & HR >= 40) %>%
  arrange(desc(HR))
##    yearID teamID  playerID HR
## 1    1961    NYA marisro01 61
## 2    1927    NYA  ruthba01 60
## 3    1921    NYA  ruthba01 59
## 4    1920    NYA  ruthba01 54
## 5    1928    NYA  ruthba01 54
## 6    1961    NYA mantlmi01 54
## 7    2007    NYA rodrial01 54
## 8    1956    NYA mantlmi01 52
## 9    1930    NYA  ruthba01 49
## 10   1934    NYA gehrilo01 49
## 11   1936    NYA gehrilo01 49
## 12   2005    NYA rodrial01 48
## 13   1926    NYA  ruthba01 47
## 14   1927    NYA gehrilo01 47
## 15   1924    NYA  ruthba01 46
## 16   1929    NYA  ruthba01 46
## 17   1931    NYA gehrilo01 46
## 18   1931    NYA  ruthba01 46
## 19   1937    NYA dimagjo01 46
## 20   1997    NYA martiti02 44
## 21   2012    NYA grandcu01 43
## 22   1958    NYA mantlmi01 42
## 23   1923    NYA  ruthba01 41
## 24   1930    NYA gehrilo01 41
## 25   1932    NYA  ruthba01 41
## 26   1980    NYA jacksre01 41
## 27   2002    NYA giambja01 41
## 28   2003    NYA giambja01 41
## 29   2011    NYA grandcu01 41
## 30   1960    NYA mantlmi01 40

Next up, we want to see if these big HR hitters can do so without striking out so much. Ordering by the lest amount of strikouts to most helps us figure out the best HR to SO ratio.

Batting %>%
  select(yearID, teamID, playerID, HR, SO) %>%
  filter(HR > 40 & SO < 60) %>%
  arrange(SO)
##    yearID teamID  playerID HR SO
## 1    1934    NYA gehrilo01 49 31
## 2    1954    CIN kluszte01 49 35
## 3    1937    NYA dimagjo01 46 37
## 4    1929    NY1   ottme01 42 38
## 5    1955    CIN kluszte01 47 40
## 6    2004    SFN bondsba01 45 41
## 7    1947    NY1  mizejo01 51 42
## 8    1936    NYA gehrilo01 49 46
## 9    1969    ATL aaronha01 44 47
## 10   2002    SFN bondsba01 46 47
## 11   1949    BOS willite01 43 48
## 12   1953    CLE rosenal01 43 48
## 13   1940    SLN  mizejo01 43 49
## 14   1922    SLN hornsro01 42 50
## 15   2006    SLN pujolal01 49 50
## 16   1931    NYA  ruthba01 46 51
## 17   2004    SLN pujolal01 46 52
## 18   1993    CHA thomafr04 41 54
## 19   1957    WS1 sievero01 42 55
## 20   1931    NYA gehrilo01 46 56
## 21   1923    PHI willicy01 41 57
## 22   1951    PIT kinerra01 42 57
## 23   1954    NY1  mayswi01 41 57
## 24   1936    CLE troskha01 42 58
## 25   1953    BRO camparo01 41 58
## 26   1957    ML1 aaronha01 44 58
## 27   1971    ATL aaronha01 47 58
## 28   2003    SFN bondsba01 45 58

Switching teams now to the Philadelpia Phillies, we want to only look at players in the 1970s that have hit more than 30 HR. There are two ways of filtering the yearID field, and both are shown below. Only the first filtering method was discussed in class.

Batting %>%
  select(yearID, teamID, playerID, HR) %>%
  filter(teamID == 'PHI' & yearID >= 1970 & yearID <= 1979 & HR > 30) %>%
  arrange(desc(HR))
##   yearID teamID  playerID HR
## 1   1979    PHI schmimi01 45
## 2   1977    PHI luzingr01 39
## 3   1975    PHI schmimi01 38
## 4   1976    PHI schmimi01 38
## 5   1977    PHI schmimi01 38
## 6   1974    PHI schmimi01 36
## 7   1978    PHI luzingr01 35
## 8   1971    PHI johnsde01 34
## 9   1975    PHI luzingr01 34

Having to enter multiple conditions in the filter function can become tedious and hard to read. What we can do is use the ‘between’ function to make the code a bit clearer. The ‘between’ function takes the field and the first and last values in the range and filters the results.

Batting %>%
  select(yearID, teamID, playerID, HR) %>%
  filter(teamID == 'PHI' & between(yearID, 1970, 1979) & HR > 30) %>%
  arrange(desc(HR))
##   yearID teamID  playerID HR
## 1   1979    PHI schmimi01 45
## 2   1977    PHI luzingr01 39
## 3   1975    PHI schmimi01 38
## 4   1976    PHI schmimi01 38
## 5   1977    PHI schmimi01 38
## 6   1974    PHI schmimi01 36
## 7   1978    PHI luzingr01 35
## 8   1971    PHI johnsde01 34
## 9   1975    PHI luzingr01 34

Now we want to start ordering our results into a more sensible method. We use the ‘arrange’ function for this. The desc() function orders values from highest to lowest.

Batting %>%
  select(yearID, teamID, playerID, HR) %>%
  filter(HR > 50) %>%
  arrange(desc(HR))
##    yearID teamID  playerID HR
## 1    2001    SFN bondsba01 73
## 2    1998    SLN mcgwima01 70
## 3    1998    CHN  sosasa01 66
## 4    1999    SLN mcgwima01 65
## 5    2001    CHN  sosasa01 64
## 6    1999    CHN  sosasa01 63
## 7    1961    NYA marisro01 61
## 8    1927    NYA  ruthba01 60
## 9    1921    NYA  ruthba01 59
## 10   1932    PHA  foxxji01 58
## 11   1938    DET greenha01 58
## 12   2006    PHI howarry01 58
## 13   2001    ARI gonzalu01 57
## 14   2002    TEX rodrial01 57
## 15   1930    CHN wilsoha01 56
## 16   1997    SEA griffke02 56
## 17   1998    SEA griffke02 56
## 18   1920    NYA  ruthba01 54
## 19   1928    NYA  ruthba01 54
## 20   1949    PIT kinerra01 54
## 21   1961    NYA mantlmi01 54
## 22   2006    BOS ortizda01 54
## 23   2007    NYA rodrial01 54
## 24   2010    TOR bautijo02 54
## 25   2013    BAL davisch02 53
## 26   1956    NYA mantlmi01 52
## 27   1965    SFN  mayswi01 52
## 28   1977    CIN fostege01 52
## 29   1996    OAK mcgwima01 52
## 30   2001    TEX rodrial01 52
## 31   2002    CLE thomeji01 52
## 32   1947    PIT kinerra01 51
## 33   1947    NY1  mizejo01 51
## 34   1955    NY1  mayswi01 51
## 35   1990    DET fieldce01 51
## 36   2005    ATL jonesan01 51

A more complex example will find players that have at least 400 at-bats (AB), but very few strikeouts (SO less than 10). Here we can order the results by multiple columns. The first column (SO) is ordered lowest to highest, and the second column (desc(AB)) orders the values from highest to lowest. This allows us to view the least amount of strikouts with the most at-bats (in case of ties with SO).

Batting %>%
  select(yearID, teamID, playerID, AB, SO) %>%
  filter(AB >= 400 & SO < 10) %>%
  arrange(SO, desc(AB))
##    yearID teamID  playerID  AB SO
## 1    1932    NYA seweljo01 503  3
## 2    1894    NY1 doyleja01 422  3
## 3    1925    CLE seweljo01 608  4
## 4    1929    CLE seweljo01 578  4
## 5    1933    NYA seweljo01 524  4
## 6    1922    CHN holloch01 592  5
## 7    1893    NY1  wardjo01 588  5
## 8    1922    CLE mcinnst01 537  5
## 9    1936    PIT wanerll01 414  5
## 10   1894    BLN keelewi01 590  6
## 11   1924    BSN mcinnst01 581  6
## 12   1926    CLE seweljo01 578  6
## 13   1895    SLN quinnjo02 543  6
## 14   1894    NY1  wardjo01 540  6
## 15   1889    BSN broutda01 485  6
## 16   1875    BS1 wrighge01 408  6
## 17   1927    CLE seweljo01 569  7
## 18   1893    SLN quinnjo02 547  7
## 19   1929    PIT traynpi01 540  7
## 20   1894    PHI crossla01 529  7
## 21   1889    NY1  wardjo01 479  7
## 22   1956    NY1 muelldo01 453  7
## 23   1896    CIN vaughfa01 433  7
## 24   1927    PHA cochrmi01 432  7
## 25   1893    PHI crossla01 415  7
## 26   1894    BLN brodist01 573  8
## 27   1947    PHI verbaem01 540  8
## 28   1895    PHI crossla01 535  8
## 29   1927    WS1 speaktr01 523  8
## 30   1929    PHA cochrmi01 514  8
## 31   1890    NY1 glassja01 512  8
## 32   1923    CHA collied01 505  8
## 33   1933    PIT wanerll01 500  8
## 34   1893    PIT donovpa01 499  8
## 35   1931    NYA seweljo01 484  8
## 36   1887    IN3 glassja01 483  8
## 37   1895    PIT bierblo01 466  8
## 38   1893    CHN dungasa01 465  8
## 39   1885    NY1 connoro01 455  8
## 40   1893    LS3 pinknge01 446  8
## 41   1895    CIN   hoydu01 429  8
## 42   1925    CHA collied01 425  8
## 43   1921    CIN roushed01 418  8
## 44   1894    SLN quinnjo02 405  8
## 45   1945    BSN holmeto01 636  9
## 46   1929    WS1  ricesa01 616  9
## 47   1928    CLE seweljo01 588  9
## 48   1921    BOS mcinnst01 584  9
## 49   1926    CLE summaho01 581  9
## 50   1896    CL4 mckeaed01 571  9
## 51   1893    WAS   hoydu01 564  9
## 52   1948    CLE boudrlo01 560  9
## 53   1896    BLN keelewi01 544  9
## 54   1894    PIT bierblo01 525  9
## 55   1894    BLN broutda01 525  9
## 56   1931    NY1 leachfr01 515  9
## 57   1952    CLE mitchda01 511  9
## 58   1887    DTN broutda01 500  9
## 59   1894    SLN milledo01 481  9
## 60   1926    BSN  highan01 476  9
## 61   1921    SLA severha01 472  9
## 62   1918    CLE speaktr01 471  9
## 63   1921    CHN  dealch01 422  9
## 64   1945    PHA busched01 416  9
## 65   1883    CHN ansonca01 413  9

Grouping and aggregating data

A simple grouping of data can be done by using the group_by function. Below is an example of getting the best HR hitters with their career HR totals. Note that using the select function is not required, since it will pull the columns being used (the one being grouped and the one that is grouped by).

The summarize() function tells dplyr what to do with the grouping. Other functions are available when grouping data as well.

#Find the career HR totals for all players, limited to players with career total of 600 or more
# order from highest to lowest
Batting %>%
  group_by(playerID) %>%
  summarize(career_HR = sum(HR)) %>%
  filter(career_HR >= 600) %>%
  arrange(desc(career_HR))
## # A tibble: 8 x 2
##    playerID career_HR
##       <chr>     <int>
## 1 bondsba01       762
## 2 aaronha01       755
## 3  ruthba01       714
## 4 rodrial01       696
## 5  mayswi01       660
## 6 griffke02       630
## 7 thomeji01       612
## 8  sosasa01       609

Another way to aggregate data is to take the average. Here we take the average season HR totals for players. The mean() function is used to calculate the average of the values.

#What players have the highest average per season HR totals?
Batting %>%
  group_by(playerID) %>%
  summarize(season_avg = mean(HR)) %>%
  filter(season_avg >= 30) %>%
  arrange(desc(season_avg))
## # A tibble: 13 x 2
##     playerID season_avg
##        <chr>      <dbl>
##  1 pujolal01   36.93750
##  2 bondsba01   34.63636
##  3 mcgwima01   34.29412
##  4 kinerra01   33.54545
##  5 aaronha01   32.82609
##  6 bryankr01   32.50000
##  7  ruthba01   32.45455
##  8  sosasa01   32.05263
##  9 cabremi01   31.85714
## 10 belleal01   31.75000
## 11 rodrial01   31.63636
## 12 schmimi01   30.44444
## 13 abreujo02   30.33333

Additionally you can add multiple aggreated values as new columns. Here we find the top players for HR season, along with their lowest and highest season totals

#multiple column groupings
Batting %>%
  group_by(playerID) %>%
  summarize(season_avg = mean(HR), min_HR = min(HR), max_HR = max(HR)) %>%
  filter(season_avg >= 30) %>%
  arrange(desc(season_avg))
## # A tibble: 13 x 4
##     playerID season_avg min_HR max_HR
##        <chr>      <dbl>  <dbl>  <dbl>
##  1 pujolal01   36.93750     17     49
##  2 bondsba01   34.63636      5     73
##  3 mcgwima01   34.29412      3     70
##  4 kinerra01   33.54545      7     54
##  5 aaronha01   32.82609     10     47
##  6 bryankr01   32.50000     26     39
##  7  ruthba01   32.45455      0     60
##  8  sosasa01   32.05263      1     66
##  9 cabremi01   31.85714     12     44
## 10 belleal01   31.75000      1     50
## 11 rodrial01   31.63636      0     57
## 12 schmimi01   30.44444      1     48
## 13 abreujo02   30.33333     25     36

In some cases, it’s important to filter out data you don’t want first, before you do any grouping and aggregating. In this example, we want players that hit over 50 HR in a season (listed only once), but only from 1970 and later. Here we only want to show the player’s name and not the HR total.

#What players from 1970 to today have hit more than 50 HR at least once? Don't list a player more than once.
Batting %>%
  filter(yearID >= 1970) %>%
  group_by(playerID) %>%
  summarize(season_hr = max(HR)) %>%
  filter(season_hr > 50) %>%
  arrange(desc(season_hr)) %>%
  select(playerID)
## # A tibble: 14 x 1
##     playerID
##        <chr>
##  1 bondsba01
##  2 mcgwima01
##  3  sosasa01
##  4 howarry01
##  5 gonzalu01
##  6 rodrial01
##  7 griffke02
##  8 bautijo02
##  9 ortizda01
## 10 davisch02
## 11 fostege01
## 12 thomeji01
## 13 fieldce01
## 14 jonesan01