Home Runs in 1000 PA

2018/05/27

I recently saw a headline saying Aaron Judge now has the record for quickest to 70 home runs from start of a career. This stat is impressive, but also skewed because Judge started in the majors at age 24, which is fairly late for a superstar. A related question that’s interesting to ask is, which players had streaks of 1000 PA or so with 70 HR at the youngest ages?

data

For this I will use the retrosplits repo from Chadwick Bureau. (https://github.com/chadwickbureau/retrosplits)

unlike the retrosheet events data that needs to be processed by the chadwick command line tools before analyzing, these are already in a “tidy” format. So they’re easy to load with a CSV reader, e.g.

library(readr)
daybyday_original_df = read_csv(
  sprintf("%s/retrosplits/daybyday_playing2017.csv", 
  PATH_TO_CHADWICK_BUREAU))

The column names by default have ‘.’s in them which conflicts with the postgres syntax for specifying database.table so it’s convenient to replace the dots with underscores.

nms = names(daybyday_original_df)
nms = str_replace(nms, "\\.", "_")
names(daybyday_original_df) = nms

We can repeat for the other years to get the full data set loaded.

game_source

One caveat with the retrosplits data is in many instances it includes the same game multiple times with a different source, i.e. event vs boxscore. So you first need to choose the “primary” source to avoid over counting. I did this by choosing event based when available, followed by “ded” (not sure what it means), and finally falling back to boxscore based.

fix_daybyday = function(daybyday_original_df) {

  gm_evt = daybyday_original_df %>% 
    filter(game_source == "evt") %>% 
    group_by(game_key) %>% summarise() %>% ungroup()
  
  gm_ded = daybyday_original_df %>% 
    filter(game_source == "ded") %>% 
    group_by(game_key) %>% summarise() %>% ungroup() %>%
    anti_join(gm_evt, by="game_key")
  
  gm_box = daybyday_original_df %>% 
    filter(game_source == "box") %>% 
    group_by(game_key) %>% summarise() %>% ungroup() %>%
    anti_join(gm_evt, by="game_key") %>% 
    anti_join(gm_ded, by="game_key")

  df_evt = daybyday_original_df %>% 
    inner_join(gm_evt, by="game_key") %>% filter(game_source == "evt")
  df_ded = daybyday_original_df %>% 
    inner_join(gm_ded, by="game_key") %>% filter(game_source == "ded")
  df_box = daybyday_original_df %>% 
    inner_join(gm_box, by="game_key") %>% filter(game_source == "box")

  dplyr::bind_rows(df_evt, df_ded, df_box)
}

Then, starting from the full and unedited data,

daybyday_primary_df = fix_daybyday(daybyday_original_df)

We can validate the process by checking career home runs,

## the wrong way - double counting
daybyday_original_df %>% group_by(person_key, season_phase) %>% summarise(hr=sum(B_HR)) %>% arrange(-hr)
# A tibble: 25,341 x 3
# Groups:   person_key [16,579]
   person_key season_phase    hr
   <chr>      <chr>        <int>
 1 ruthb101   R             1119
 2 foxxj101   R              993
 3 ott-m101   R              935
 4 gehrl101   R              912
 5 aaroh101   R              775
 6 willt103   R              766
 7 bondb001   R              762
 8 rodra001   R              696
 9 maysw101   R              682
10 dimaj101   R              674
# ... with 25,331 more rows
## the right way - single counting
daybyday_primary_df %>% 
  group_by(person_key, season_phase) %>% 
  summarise(hr=sum(B_HR)) %>% 
  arrange(-hr)

# A tibble: 25,341 x 3
# Groups:   person_key [16,579]
   person_key season_phase    hr
   <chr>      <chr>        <int>
 1 bondb001   R              762
 2 aaroh101   R              755
 3 ruthb101   R              714
 4 rodra001   R              696
 5 maysw101   R              660
 6 grifk002   R              630
 7 pujoa001   R              614
 8 thomj002   R              612
 9 sosas001   R              609
10 robif103   R              586
# ... with 25,331 more rows

filter the data

I’m interested in players that had 70 HR in 1000 PA or so before age 27 or so. First I’ll filter the day-by-day data so there’s a smaller set to search through.

First, only players with at least 70 HR in total

df_filtered = daybyday_primary_df %>% 
  group_by(person_key) %>% 
  summarise(hr=sum(B_HR)) %>% 
  filter(hr>=70) %>% ungroup()

Now join this list of players to the Master table from Lahman to get date-of-birth, and also grab the player’s full name while we’re at it.

df_dob = df_filtered %>% 
  left_join(Lahman::Master, by=c("person_key" = "retroID")) %>% 
  mutate(nameFull = paste(nameFirst, nameLast)) %>% 
  select(person_key, birthDate, nameFull)

Append an age column to the day-by-day data,

df_ages = daybyday_primary_df %>% 
  inner_join(df_dob, by="person_key") %>% 
  mutate(age_in_days=as.integer(game_date - birthDate))

Now we can filter the list even further to those players that hit at least 70 home runs before age 27 or so.

df_lt27 = df_ages %>%
  filter(age_in_days < 10000) %>% 
  group_by(person_key) %>% 
  summarise(hr=sum(B_HR)) %>% 
  filter(hr>=70) %>% ungroup() %>% 
  inner_join(df_ages, by="person_key") %>% 
  filter(age_in_days < 10000) %>% 
  group_by(person_key) %>% 
  arrange(age_in_days) %>% 
  mutate(shr=cumsum(B_HR), 
           spa=cumsum(ifelse(is.na(B_PA), B_AB+B_BB, B_PA))) %>% 
  select(person_key, nameFull, age_in_days, shr, spa)

The following function will take the df_lt27 data frame, compute running plate-appearance and home run totals up to age 10000 days, and fill in the blanks using the zoo package. It also computes the lagged difference in career home runs over 1000 PA increments, or in other words the cumulative number of home runs in sliding windows of 1000 PA.

library(zoo)


fill_in = function(dfX, pk) {
  tmp = dfX[dfX$person_key == pk,]

  first_row = min(tmp$spa)
  last_row = max(tmp$spa)
  spa_seq = seq(first_row, last_row, 1)
  spa_df = data.frame(spa=spa_seq)
  mg_df = spa_df %>% left_join(tmp, by="spa")
  fill_df = zoo::na.locf(mg_df) %>% 
    mutate(age_in_days = as.integer(age_in_days),
           shr=as.integer(shr),
           spa=as.integer(spa)) %>% 
    select(person_key, nameFull, age_in_days, spa, shr)
  
  # the lagged differences over 1000 PA
  lag_diff = diff(fill_df$shr, 1000)
  fill_df$lag_diff = NA
  fill_df[1001:nrow(fill_df),]$lag_diff = lag_diff
  fill_df[1001:nrow(fill_df),]
}

To illustrate, here’s the result for Babe Ruth.

tt = fill_in(df_lt27, "ruthb101")
tt %>% arrange(-lag_diff) %>% head(4)

  person_key  nameFull age_in_days  spa shr lag_diff
1   ruthb101 Babe Ruth        9682 2384 146       95
2   ruthb101 Babe Ruth        9682 2385 146       95
3   ruthb101 Babe Ruth        9682 2386 146       95
4   ruthb101 Babe Ruth        9681 2379 145       94
tt %>% ggplot(aes(x=age_in_days, y = lag_diff)) + 
  geom_line() + theme_minimal(base_size=16)

Now collect it for all persons.

persons = unique(df_lt27$person_key)

all_persons = dplyr::bind_rows(
  lapply(persons, function(pk) {
    fill_in(df_lt27, pk)
  })
  )

How many total rows?

nrow(all_persons)
[1] 1172276

Which players had the greatest 1000 PA stretches?

all_persons %>% 
  group_by(person_key, nameFull) %>% 
  summarise(m=max(lag_diff, na.rm=TRUE)) %>% 
  arrange(-m)

# A tibble: 387 x 3
# Groups:   person_key [387]
   person_key nameFull             m
   <chr>      <chr>            <dbl>
 1 ruthb101   Babe Ruth           95
 2 gonzj002   Juan Gonzalez       82
 3 rodra001   Alex Rodriguez      82
 4 foxxj101   Jimmie Foxx         81
 5 howar001   Ryan Howard         80
 6 killh102   Harmon Killebrew    79
 7 kiner101   Ralph Kiner         79
 8 mantm101   Mickey Mantle       79
 9 grifk002   Ken Griffey         78
10 marir101   Roger Maris         78
# ... with 377 more rows

Although it’s busy, make a graph of all players,

all_persons %>% 
  filter(!is.na(lag_diff)) %>% 
  ggplot(aes(x=age_in_days/365.25, y=lag_diff, group=person_key)) +
  geom_line(size=0.25) + 
  theme_minimal(base_size = 16) + 
  labs(x="Age", y="HR in 1000 PA stretches")

The few features that stand out are, the player with about 60 home runs at age 21 or so,

all_persons %>% 
  filter(age_in_days < 365.25 * 22, lag_diff> 60) %>% 
  arrange(age_in_days) %>% head(3)

  person_key nameFull age_in_days  spa shr lag_diff
1   ott-m101  Mel Ott        7772 1605  75       61
2   ott-m101  Mel Ott        7772 1606  75       61
3   ott-m101  Mel Ott        7772 1607  75       61

the player who had about 70 at age 23 or so,

all_persons %>% 
  filter(age_in_days < 365.25 * 23.5, lag_diff> 72) %>% 
  arrange(age_in_days) %>% head(3) 

  person_key   nameFull age_in_days  spa shr lag_diff
1   hornb001 Bob Horner        8389 1103  77       73
2   hornb001 Bob Horner        8389 1104  77       73
3   hornb001 Bob Horner        8389 1105  77       73

the player with 80 or so around age 24,

all_persons %>% filter(age_in_days < 365.25 * 24.5, lag_diff> 80) %>% arrange(age_in_days)  %>% head(3)

  person_key      nameFull age_in_days  spa shr lag_diff
1   gonzj002 Juan Gonzalez        8726 1937 118       81
2   gonzj002 Juan Gonzalez        8726 1938 118       81
3   gonzj002 Juan Gonzalez        8726 1939 118       81

and of course the one with more than 90 at age 26, who we already know,

all_persons %>% filter(age_in_days < 365.25 * 27.5, lag_diff> 90) %>% arrange(-lag_diff, age_in_days)  %>% head(3)

  person_key  nameFull age_in_days  spa shr lag_diff
1   ruthb101 Babe Ruth        9682 2384 146       95
2   ruthb101 Babe Ruth        9682 2385 146       95
3   ruthb101 Babe Ruth        9682 2386 146       95

What exactly is the list of players with 70 or more?

all_persons %>% 
  group_by(person_key, nameFull) %>% 
  summarise(m=max(lag_diff, na.rm=TRUE)) %>% 
  ungroup() %>% arrange(-m) %>% 
  filter(m>=70) %>% 
  select(nameFull, m) %>% as.data.frame()
  
            nameFull  m
1          Babe Ruth 95
2      Juan Gonzalez 82
3     Alex Rodriguez 82
4        Jimmie Foxx 81
5        Ryan Howard 80
6   Harmon Killebrew 79
7        Ralph Kiner 79
8      Mickey Mantle 79
9        Ken Griffey 78
10       Roger Maris 78
11    Rocky Colavito 76
12     Manny Ramirez 76
13      Jose Canseco 74
14        Bob Horner 73
15    Willie McCovey 73
16      Frank Thomas 73
17 Vladimir Guerrero 72
18     Eddie Mathews 72
19     Albert Pujols 72
20       Chris Davis 71
21    Carlos Delgado 70
22 Giancarlo Stanton 70
23         Rudy York 70

Pretty good company for Judge!