Catcher Defense with Linear Mixed Effect Models

2018/09/01

In baseball, measuring catcher defense is notoriously tricky. This is because it includes things like pitch framing and game calling that aren’t captured in traditional stat sheets. One approach is a “with-or-without-you” analysis - in other words, looking at the difference in outcome depending on the catcher, holding as many other contributing factors constant as possible. One prominent example of this is Tom Tango’s analysis of passed balls and wild pitches - events where the responsibility can reasonably be assumed to be limited to the pitcher and catcher.

A related approach is to use generalized linear mixed effect models. Notably, this is the approach taken by Baseball Prospectus in their measurements of catcher framing.

The analysis I present here lies somewhere in between these two - but is somewhat orthogonal as well. It’s different from the Baseball Prospectus approach because it controls for many fewer factors, and it’s orthogonal to both because it measures an outcome that has many more causal factors beyond pitcher and catcher identity.

My analysis

My analysis looks at the number of runs scored by the opponent, and tries to assign credit to the catcher. There are lots of factors besides the catcher that contribute to runs scored, for example,

- the pitcher
- the opposing offense
- the park
- the rest of the defense

This makes assigning credit to the catcher inherently noisier than something more constrained like wild pitches, passed balls, or called strikes. Nevertheless, I’m going to fit the model and see what happens - as I’ve written before, I don’t believe in letting the perfect be the enemy of the good.

The present analysis is an alternative version of a topic I’ve looked at before, which is estimating catcher defense back to 1911 using the same linear mixed effect modeling framework. For the work linked above I used only final game scores. What I do here is to count only the runs scored through the end of the 5th inning. The logic behind that choice is that it mitigates the complications of knowing which relief pitchers were brought in - it’s mostly true that the starting pitcher will still be there in the 5th. In the current analysis I use retrosheet play-by-play data from 1961 through 2017.
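
To make that concrete, here is a minimal sketch of the aggregation, assuming a play-by-play data frame pbp with hypothetical column names game_id, bat_home_id, inning, and runs_on_play (the actual retrosheet field names differ):

library(dplyr)

# hypothetical play-by-play frame: one row per event, with the number of
# runs that scored on the play
runs_through_5 = pbp %>%
  filter(inning <= 5) %>%
  group_by(game_id, bat_home_id) %>%
  summarise(off_score = sum(runs_on_play)) %>%
  ungroup()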

linear mixed effect models

Linear mixed effect models are appropriate when there are groups within the data, and there are coefficients that vary across groups. The canonical example I think of is schools - if you model outcomes (test scores, or graduation rates, say) as a function of spending on after-school programs or something similar, you would expect a positive relationship between outcomes and spending. However, each school starts from its own baseline. So if you threw them all into the same linear regression model, you’d get a model that doesn’t match reality. In the linear mixed effect model framework you can fit a slope for outcome vs spending that is constant across groups (the fixed effect) and an intercept that is different for each group (the random effect). In the baseball context, the catchers, pitchers, park, offense and defense are all modeled as random effects.
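
As a concrete illustration of that setup in lme4 syntax (with hypothetical names for the schools example), a model with a common slope for spending and a school-specific intercept would look something like this:

library(lme4)

# fixed effect: one slope for spending shared by all schools
# random effect: a separate baseline intercept for each school
school_mod = lmer(outcome ~ spending + (1 | school_id), data = schools)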

The linear mixed effect model itself is essentially a regression with an L2 penalty - a ridge regression. One main difference is that in a ridge regression you specify the penalty strength, and the same value applies to all variables. In the linear mixed effect model, the penalty is applied only to the random effects, the strength of the penalty varies across grouping factors, and the strength values aren’t specified but are estimated by maximum likelihood from the profiled likelihood - the likelihood integrated over the random effect values. Technically what the model gives as output is the conditional mode of each random effect. In many cases this is a good approximation to the mean of the posterior probability distribution, but in general it has a somewhat squishy interpretation.
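
To make the ridge connection concrete, here is a toy sketch (simulated data, hypothetical names): for a random intercept term, the shrinkage applied to the group intercepts behaves roughly like a ridge penalty of sigma_residual^2 / sigma_group^2, with both variances estimated from the data rather than specified up front.

library(lme4)

# simulate 20 groups of 30 observations, with a common slope and
# group-specific intercepts
set.seed(1)
toy = data.frame(g = factor(rep(1:20, each = 30)), x = rnorm(600))
toy$y = 2 * toy$x + rnorm(20, sd = 0.5)[toy$g] + rnorm(600)

toy_mod = lmer(y ~ x + (1 | g), data = toy)
vc = as.data.frame(VarCorr(toy_mod))
sigma_g   = vc$sdcor[vc$grp == "g"]
sigma_res = vc$sdcor[vc$grp == "Residual"]

# implied ridge-like penalty strength on the group intercepts
sigma_res^2 / sigma_g^2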

A great interactive visualization of mixed effects models is available here. A good technical discussion is available in the lme4 “Computational Methods” vignette.

The data

For this analysis I pulled event data from my local copy of the retrosheet play-by-play data. I limited the seasons to 1961 through 2017. For each game I took the starting catcher ID, the starting pitcher ID, the home team ID (i.e. the park), the season year, and the score through the end of the 5th inning. The data are available as a gist here.

library(dplyr)
# load the data 
df = readr::read_csv("https://gist.githubusercontent.com/bdilday/6f85ed94c2fe51a623a1ec7ddd0e8151/raw/c1d7254163b5eb777025028734cf0b8f53087270/catcher_defense_data.csv")
#> Parsed with column specification:
#> cols(
#>   game_id = col_character(),
#>   year_id = col_integer(),
#>   bat_home_id = col_integer(),
#>   off_score = col_integer(),
#>   def_pitcher = col_character(),
#>   def_catcher = col_character(),
#>   off_team = col_character(),
#>   def_team = col_character(),
#>   park_id = col_character(),
#>   park = col_character()
#> )
df$year_id = factor(df$year_id)
head(df)
#> # A tibble: 6 x 10
#>   game_id   year_id bat_home_id off_score def_pitcher def_catcher off_team
#>   <chr>     <fct>         <int>     <int> <chr>       <chr>       <chr>   
#> 1 WS219610… 1961              0         3 danib102_1… retzk101_1… MIN_1961
#> 2 WS219610… 1961              0         2 mcclj104_1… retzk101_1… MIN_1961
#> 3 WS219610… 1961              0         1 danib102_1… retzk101_1… KC1_1961
#> 4 WS219610… 1961              0         1 hobae101_1… retzk101_1… BOS_1961
#> 5 WS219610… 1961              0         0 burnp102_1… retzk101_1… BOS_1961
#> 6 WS219610… 1961              0         2 gablg102_1… dalep101_1… CHA_1961
#> # ... with 3 more variables: def_team <chr>, park_id <chr>, park <chr>

Note that each pitcher, catcher, team, etc. has to have the season appended so that each player-season (or team-season) is treated as a distinct entity.
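
The data in the gist already have the seasons appended, but as a minimal sketch (assuming a hypothetical raw_df whose IDs do not yet include the season), the keys could be built like this:

library(dplyr)

# append the season so that, e.g., a catcher's 1996 and 1997 seasons
# become distinct levels
raw_df = raw_df %>%
  mutate(def_pitcher = paste(def_pitcher, year_id, sep = "_"),
         def_catcher = paste(def_catcher, year_id, sep = "_"),
         off_team    = paste(off_team,    year_id, sep = "_"),
         def_team    = paste(def_team,    year_id, sep = "_"))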

The model I use is this,

library(lme4)
#> Loading required package: Matrix
#> Loading required package: methods
model_catcher_def = function(df) {
  # runs allowed through the 5th inning, with season as a fixed effect and
  # park, offense, defense, pitcher, and catcher as random intercepts
  lmer(off_score ~ 
         1 + year_id + 
         (1|park_id) + (1|off_team) + 
         (1|def_team) + (1|def_pitcher) + 
         (1|def_catcher), data=df)
}

model analysis

Here I fit the model

lmer_mod = model_catcher_def(df)

Here’s the summary of the standard deviations of the random effects. One interpretation of these is that the higher the value, the more important that factor is in determining the outcome. So we can see that pitcher is most important, followed by park, offensive team, defensive team, and finally catcher - it would certainly be surprising if we had found that catcher was the most important factor!

summary(lmer_mod)$varcor
#>  Groups      Name        Std.Dev.
#>  def_pitcher (Intercept) 0.324146
#>  def_catcher (Intercept) 0.099711
#>  def_team    (Intercept) 0.182797
#>  off_team    (Intercept) 0.197680
#>  park_id     (Intercept) 0.237530
#>  Residual                2.306134

In order to extract the random effect values, I define some helpers

# parse random effects to a data frame
ranef_to_df = function(lmer_mod, ranef_nm) {
  rr = ranef(lmer_mod)
  data.frame(k=rownames(rr[ranef_nm][[1]]),
             value=rr[ranef_nm][[1]][,1],
             ranef_nm = ranef_nm,
             stringsAsFactors = FALSE)
}

# loop over all random effects and parse to a single data frame
parse_ranefs = function(lmer_mod) {
  rfs = names(ranef(lmer_mod))
  lapply(rfs, function(rf) {
    ranef_to_df(lmer_mod, rf)
  }) %>% dplyr::bind_rows()
}

Here I get all the random effects

ranef_summary_df = parse_ranefs(lmer_mod)

The columns are:

- k: the key, i.e. the player-season or team-season ID
- value: the estimated random effect value (in runs per game)
- ranef_nm: which random effect grouping the key belongs to

Let’s see which keys had the highest (and lowest) values, for each random effect.

First, the team variables

ranef_summary_df %>% 
  group_by(ranef_nm) %>% 
  filter(value==min(value) | value == max(value)) %>% 
  ungroup() %>% filter(ranef_nm %in% 
                         c("def_team", "off_team", "park_id"))
#> # A tibble: 6 x 3
#>   k         value ranef_nm
#>   <chr>     <dbl> <chr>   
#> 1 ATL_1995 -0.341 def_team
#> 2 DET_1996  0.458 def_team
#> 3 HOU_1964 -0.401 off_team
#> 4 TOR_2015  0.464 off_team
#> 5 COL_1996  0.947 park_id 
#> 6 SDN_1998 -0.470 park_id

So the model tells us:

- the best team defense (fewest runs allowed) was the 1995 Braves, and the worst the 1996 Tigers
- the worst offense was 1964 Houston, and the best the 2015 Blue Jays
- the park that inflated scoring the most was Coors Field in 1996, and the one that suppressed it the most was San Diego’s park in 1998

The “defense” value here is based on runs scored, so there’s an interdependence between pitchers and defensive play - it’s probably more correct to think of it as the best team run prevention.

The 2015 Blue Jays as the best offense since 1961 seems surprising. I can double-check by looking at the z-score of runs scored, using the Lahman data.

Lahman::Teams %>% 
  group_by(yearID) %>% 
  mutate(m=mean(R), s=sd(R), z=(R-m)/s) %>% 
  select(yearID, teamID, R, m, s, z) %>% ungroup() %>% 
  arrange(-z) %>% head(10) %>% as.data.frame() 
#>    yearID teamID    R        m         s        z
#> 1    2015    TOR  891 688.2333  58.76175 3.450657
#> 2    1915    DET  778 592.2083  62.16736 2.988573
#> 3    2007    NYA  968 777.4000  69.05450 2.760139
#> 4    1976    CIN  857 645.5000  78.12420 2.707228
#> 5    1953    BRO  955 714.1250  91.90058 2.621039
#> 6    2016    BOS  878 724.8000  59.78259 2.562619
#> 7    1982    ML4  891 696.5385  76.69432 2.535540
#> 8    2006    NYA  930 786.6333  57.18782 2.506944
#> 9    2005    BOS  910 744.1667  66.35541 2.499168
#> 10   1950    BOS 1027 750.8125 110.94096 2.489500

Well, there you have it! 2015 Blue Jays have the highest z-score for runs scored in baseball history!

Seems plausible.

Now for the player-based estimates. First I define a lookup table to match the records and get actual names instead of the esoteric retrosheet IDs.

pl_lkup = Lahman::Master %>% 
  mutate(nameFull = paste(nameFirst, nameLast)) %>%
  dplyr::select(retroID, nameFull)

Get the player-based min and max random effects

player_summary = ranef_summary_df %>% 
  group_by(ranef_nm) %>% 
  filter(value==min(value) | value == max(value)) %>% 
  filter(ranef_nm %in% c("def_pitcher", "def_catcher"))

# add a new column to strip the year from the name to merge
# with the player lookup
player_summary$retroID = sapply(
  stringr::str_split(player_summary$k, "_"), 
  function(s) {s[[1]]})

player_summary %>% 
  merge(pl_lkup, by="retroID") %>% 
  dplyr::select(nameFull, k, value, ranef_nm)
#>          nameFull             k      value    ranef_nm
#> 1      Mike Heath heatm001_1985  0.1272693 def_catcher
#> 2 Charles Johnson johnc002_1996 -0.1073388 def_catcher
#> 3   Randy Johnson johnr005_2004 -0.6903628 def_pitcher
#> 4 Todd Van Poppel vanpt001_1996  0.5976675 def_pitcher

So we have:

- best catcher season (lowest value): 1996 Charles Johnson; worst: 1985 Mike Heath
- best pitcher season (lowest value): 2004 Randy Johnson; worst: 1996 Todd Van Poppel

Passes the sniff test.

I recall Charles Johnson being considered a great defensive catcher. As for Mike Heath, I have no idea whether that’s plausible or not.

yearly and career total values

The random effect values give us run values on a per-game basis. Here I define a function to aggregate over a season, and over a career, to define a total-value counting stat.

ranef_rankings = function(ranef_summary_df, ranef_nm_) {
  # lookup table from retrosheet ID to full name
  pl_lkup = Lahman::Master %>% 
    mutate(nameFull = paste(nameFirst, nameLast)) %>%
    dplyr::select(retroID, nameFull)

  # attach the per-game data (the global df) to each random effect value,
  # so each game contributes one row for its pitcher or catcher
  aa = ranef_summary_df %>%
    filter(ranef_nm == ranef_nm_) %>%
    merge(df, by.x="k", by.y=ranef_nm_)

  # split the key, e.g. "johnr005_2004", into player ID and season
  aa$player_id = sapply(stringr::str_split(aa$k, "_"), 
                        function(s) {s[[1]]})
  aa$season = sapply(stringr::str_split(aa$k, "_"), 
                     function(s) {s[[2]]})

  # career aggregates: mean (per game) and sum (counting stat) of the values
  career = aa %>%
    group_by(player_id) %>%
    summarise(mean_value=mean(value), 
              sum_value=sum(value)) %>% arrange(mean_value) %>%
    mutate(mean_rank = row_number()) %>% ungroup() %>%
    merge(pl_lkup, by.x="player_id", by.y="retroID")

  # yearly aggregates; note that after summarise the result is still grouped
  # by player_id, so mean_rank ranks a player's seasons against each other
  yearly = aa %>%
    group_by(player_id, season) %>%
    summarise(mean_value=mean(value), 
              sum_value=sum(value)) %>% arrange(mean_value) %>%
    mutate(mean_rank = row_number()) %>% ungroup() %>%
    merge(pl_lkup, by.x="player_id", by.y="retroID")

  list(yearly = yearly, career = career)
}

pitchers

Although the point here was to look at catchers, I’ll first apply this to pitchers as a sanity check.

pitcher_rankings = ranef_rankings(ranef_summary_df, "def_pitcher")

The top ten runs per game by pitcher season

pitcher_rankings$yearly %>% 
  arrange(mean_value) %>% 
  head(10) %>% 
  as.data.frame()
#>    player_id season mean_value sum_value mean_rank       nameFull
#> 1   johnr005   2004 -0.6903628 -24.16270         1  Randy Johnson
#> 2   martp001   2000 -0.6803334 -19.72967         1 Pedro Martinez
#> 3   schic002   2004 -0.6311972 -20.19831         1 Curt Schilling
#> 4   johnr005   1995 -0.6275477 -18.82643         2  Randy Johnson
#> 5   maddg002   1995 -0.6234225 -17.45583         1    Greg Maddux
#> 6   santj003   2006 -0.6000581 -20.40197         1  Johan Santana
#> 7   johnr005   2001 -0.5937011 -20.18584         3  Randy Johnson
#> 8   clemr001   1997 -0.5897730 -20.05228         1  Roger Clemens
#> 9   appik001   1993 -0.5600134 -19.04046         1   Kevin Appier
#> 10  martp001   2003 -0.5593239 -16.22039         2 Pedro Martinez

Seems plausible. I’d like to see Pedro 2001, and I’m not sure I like seeing Kevin Appier on there, but anyway moving on…

The top ten runs per season

pitcher_rankings$yearly %>% 
  arrange(sum_value) %>% 
  head(10) %>% 
  as.data.frame()
#>    player_id season mean_value sum_value mean_rank       nameFull
#> 1   johnr005   2004 -0.6903628 -24.16270         1  Randy Johnson
#> 2   carls001   1980 -0.5438898 -20.66781         1  Steve Carlton
#> 3   santj003   2006 -0.6000581 -20.40197         1  Johan Santana
#> 4   marij101   1963 -0.5057191 -20.22877         2  Juan Marichal
#> 5   schic002   2004 -0.6311972 -20.19831         1 Curt Schilling
#> 6   johnr005   2001 -0.5937011 -20.18584         3  Randy Johnson
#> 7   clemr001   1997 -0.5897730 -20.05228         1  Roger Clemens
#> 8   martp001   2000 -0.6803334 -19.72967         1 Pedro Martinez
#> 9   marij101   1966 -0.5452669 -19.62961         1  Juan Marichal
#> 10  koufs101   1963 -0.4859726 -19.43891         1   Sandy Koufax

The top ten mean per game over the career

pitcher_rankings$career %>% 
  arrange(mean_value) %>% 
  head(10) %>% 
  as.data.frame()
#>    player_id mean_value  sum_value mean_rank        nameFull
#> 1   kersc001 -0.3617644 -104.91167         1 Clayton Kershaw
#> 2   martp001 -0.3222799 -131.81246         2  Pedro Martinez
#> 3   webbb001 -0.3018462  -59.76554         3    Brandon Webb
#> 4   salec001 -0.2989121  -53.80418         4      Chris Sale
#> 5   clemr001 -0.2977737 -210.52597         5   Roger Clemens
#> 6   santj003 -0.2959499  -84.04978         6   Johan Santana
#> 7   koufs101 -0.2847562  -60.08356         7    Sandy Koufax
#> 8   schic002 -0.2841804 -123.90265         8  Curt Schilling
#> 9   johnr005 -0.2605288 -157.09886         9   Randy Johnson
#> 10  hallr001 -0.2538522  -99.00237        10    Roy Halladay

The top ten total over the career

pitcher_rankings$career %>% 
  arrange(sum_value) %>% 
  head(10) %>% 
  as.data.frame()
#>    player_id mean_value sum_value mean_rank       nameFull
#> 1   clemr001 -0.2977737 -210.5260         5  Roger Clemens
#> 2   johnr005 -0.2605288 -157.0989         9  Randy Johnson
#> 3   maddg002 -0.1970610 -145.8252        39    Greg Maddux
#> 4   martp001 -0.3222799 -131.8125         2 Pedro Martinez
#> 5   seavt001 -0.2006751 -129.8368        37     Tom Seaver
#> 6   schic002 -0.2841804 -123.9027         8 Curt Schilling
#> 7   mussm001 -0.2168544 -116.2340        28   Mike Mussina
#> 8   palmj001 -0.2120193 -110.4621        32     Jim Palmer
#> 9   blylb001 -0.1596118 -109.3341        85  Bert Blyleven
#> 10  glavt001 -0.1577112 -107.5590        88    Tom Glavine

catchers

And finally, the best defensive catchers according to this methodology

catcher_rankings = ranef_rankings(ranef_summary_df, "def_catcher")

The top ten runs per game by catcher season

catcher_rankings$yearly %>% 
  arrange(mean_value) %>% 
  head(10) %>% 
  as.data.frame()
#>    player_id season  mean_value  sum_value mean_rank        nameFull
#> 1   johnc002   1996 -0.10733880 -12.129284         1 Charles Johnson
#> 2   essij001   1980 -0.09486632  -6.261177         1      Jim Essian
#> 3   hernr002   2002 -0.09423886 -11.779858         1 Ramon Hernandez
#> 4   ausmb001   2005 -0.09126695 -10.769500         1     Brad Ausmus
#> 5   pagnt001   1996 -0.08716409  -9.413722         1    Tom Pagnozzi
#> 6   lodup001   2003 -0.08639865 -10.367838         1    Paul Lo Duca
#> 7   kendj001   2007 -0.08443976 -10.977169         1   Jason Kendall
#> 8   lopej001   1994 -0.08405470  -6.051938         1      Javy Lopez
#> 9   cartg001   1979 -0.07873492 -10.629214         1     Gary Carter
#> 10  varij001   2001 -0.07814173  -3.672661         1   Jason Varitek

The top ten runs per season

catcher_rankings$yearly %>% 
  arrange(sum_value) %>% 
  head(10) %>% 
  as.data.frame()
#>    player_id season  mean_value  sum_value mean_rank        nameFull
#> 1   johnc002   1996 -0.10733880 -12.129284         1 Charles Johnson
#> 2   hernr002   2002 -0.09423886 -11.779858         1 Ramon Hernandez
#> 3   kendj001   2007 -0.08443976 -10.977169         1   Jason Kendall
#> 4   ausmb001   2005 -0.09126695 -10.769500         1     Brad Ausmus
#> 5   cartg001   1979 -0.07873492 -10.629214         1     Gary Carter
#> 6   lodup001   2003 -0.08639865 -10.367838         1    Paul Lo Duca
#> 7   piazm001   1996 -0.06918476  -9.893421         1     Mike Piazza
#> 8   cartg001   1982 -0.06477695  -9.781320         2     Gary Carter
#> 9   moliy001   2013 -0.07435972  -9.518044         1   Yadier Molina
#> 10  pagnt001   1996 -0.08716409  -9.413722         1    Tom Pagnozzi

The top ten mean per game over the career

catcher_rankings$career %>% 
  arrange(mean_value) %>% 
  head(10) %>% 
  as.data.frame()
#>    player_id  mean_value  sum_value mean_rank      nameFull
#> 1   sweer101 -0.03734670  -7.207914         1    Rick Sweet
#> 2   cartg001 -0.03618157 -70.698796         2   Gary Carter
#> 3   maill001 -0.03547138  -2.802239         3    Luke Maile
#> 4   skagd101 -0.03333700  -5.700627         4   Dave Skaggs
#> 5   moliy001 -0.03205300 -52.983609         5 Yadier Molina
#> 6   ausmb001 -0.03072853 -54.266592         6   Brad Ausmus
#> 7   moora001 -0.03063716  -2.236513         7    Adam Moore
#> 8   josec002 -0.03035304  -8.468497         8  Caleb Joseph
#> 9   johnr009 -0.03012177  -6.446059         9   Rob Johnson
#> 10  rodgb102 -0.03002153 -24.377485        10  Buck Rodgers

The top ten total over the career

catcher_rankings$career %>% 
  arrange(sum_value) %>% 
  head(10) %>% 
  as.data.frame()
#>    player_id  mean_value sum_value mean_rank         nameFull
#> 1   cartg001 -0.03618157 -70.69880         2      Gary Carter
#> 2   ausmb001 -0.03072853 -54.26659         6      Brad Ausmus
#> 3   moliy001 -0.03205300 -52.98361         5    Yadier Molina
#> 4   piazm001 -0.02755544 -44.14381        16      Mike Piazza
#> 5   penat001 -0.02413506 -42.93628        24        Tony Pena
#> 6   varij001 -0.02997632 -41.12751        11    Jason Varitek
#> 7   piera001 -0.02034347 -37.26924        42 A. J. Pierzynski
#> 8   martr004 -0.02379236 -32.92862        25   Russell Martin
#> 9   fiskc001 -0.01528992 -32.06297        76     Carlton Fisk
#> 10  bencj101 -0.01703744 -27.70288        62     Johnny Bench

Conclusion

I’ve presented a way of using linear mixed effects models to measure overall catcher defense, encompassing framing, game calling, etc. As a consequence of the mixed effect model, we also get pitcher estimates that can be used as a sanity check. There is a tight coupling of pitcher with team defense, and there’s anecdotal evidence that the model has not adequately distinguished the independent effects of those two factors - a similar, but probably lesser, effect exists for the catcher numbers. So these results should be thought of as, ahem, ballpark estimates.

There are a number of ways the study could be improved on in a follow up - for example, accounting for relief pitchers more directly, or better separating pitching from team defense.