danmalter.github.io - Personal GitHub site for Danny Malter

Using Markov Chains to Predict Pitches

Quick Introduction to Markov Chains

Markov chains are mathematical systems that hop from one “state” to another, where the probability of the next state depends only on the current state. In this demonstration, I will look at how Markov chains can be used to estimate the probability of a specific type of pitch being thrown given the type of the previous pitch. The chain restarts with each batter, meaning that the last pitch to one batter is not used to predict the first pitch to the next batter. All data is from the 2015 season and comes from MLB Gameday.
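To make the idea concrete, here is a minimal sketch in base R of how a first-order transition matrix can be estimated from a pitch sequence that resets with each batter. The toy data frame and its column names (`at_bat`, `pitch_type`) are made up for illustration; the analysis below uses `statetable.msm` from the msm package on real data instead.

```r
# Toy data (hypothetical): pitch types in the order thrown, with an at-bat ID.
pitches <- data.frame(
  at_bat     = c(1, 1, 1, 2, 2, 3, 3, 3),
  pitch_type = c("FF", "SL", "FF", "FF", "FF", "SL", "FF", "SL"),
  stringsAsFactors = FALSE
)

states <- sort(unique(pitches$pitch_type))

# Pair each pitch with the one before it, but only within the same at-bat,
# so the last pitch to one batter never "predicts" the first pitch to the next.
n <- nrow(pitches)
same_at_bat <- pitches$at_bat[-1] == pitches$at_bat[-n]
from <- factor(pitches$pitch_type[-n][same_at_bat], levels = states)
to   <- factor(pitches$pitch_type[-1][same_at_bat], levels = states)

# Count the transitions, then divide each row by its total to get probabilities.
counts <- table(from = from, to = to)
transition <- prop.table(counts, margin = 1)
print(round(transition, 3))
```

Each row of `transition` sums to 1 and reads as "given this previous pitch, how often did each pitch come next".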

library(pitchRx)
library(RSQLite)
library(dplyr)
library(knitr)
library(rmarkdown)
library(msm)
library(data.table)
library(pander)

setwd("~/pitchRx")
#db2015_All <- src_sqlite("MLB2015_All.sqlite3", create = TRUE)
#scrape(start = "2015-04-05", end = "2015-10-09", connect = db2015_All$con)
db2015_All <- src_sqlite("MLB2015_All.sqlite3", create = FALSE)

# Join the location and names table into a new que table.
locations <- select(tbl(db2015_All, "pitch"), 
                    pitch_type, px, pz, des, num, gameday_link, inning)
names <- select(tbl(db2015_All, "atbat"), pitcher_name, batter_name, 
                num, b_height, gameday_link, event, stand)
que <- inner_join(locations, filter(names, pitcher_name == "Jake Arrieta"),
                  by = c("num", "gameday_link"))
pitchfx <- as.data.frame(collect(que))
pitchfx <- data.table(pitchfx[ do.call(order, pitchfx[ , c("gameday_link","inning", "num") ] ), ])
pitchfx[, batter_num:=as.numeric(factor(num)), by=gameday_link]
pitchfx <- as.data.frame(pitchfx)
pitchfx$batter_num <- ifelse(pitchfx$batter_num %% 9 == 0, 9, (pitchfx$batter_num %% 9))
pitchfx$batter_num <- as.factor(pitchfx$batter_num)
pitchfx$pitch_type <- as.factor(pitchfx$pitch_type)

# table(pitchfx$pitch_type)
pitchfx$pitch_type_full <- factor(pitchfx$pitch_type,
                                  levels=c("FF", "SI", "CH", "CU", "SL", "IN"),
                                  labels=c("4-seam FB","Sinker","Changeup", 
                                           "Curveball", "Slider", "Int. Ball"))

pitcher <- as.data.frame(pitchfx[c(1,5:7,13:14)])
pitcher$uniqueID <- paste(pitcher$num, pitcher$gameday_link, pitcher$inning, sep='')

Jake Arrieta

In this example, we will start off by looking at the overall pitch proportions from Jake Arrieta. The table below shows the distribution of Arrieta’s pitch choices. In 2015, Arrieta threw a four-seam fastball on 18% of his pitches, a sinker on 33%, a slider on 29%, and so on. This is good information, but it does not give the batter much predictive power. We know that Arrieta is most likely to throw a sinker, but I don’t think any batter will go up to the plate sitting on that pitch given only this information.

Jake Arrieta - Overall Pitch Proportions

pitcher.table <- table(pitcher$pitch_type_full)
prop <- prop.table(pitcher.table)
pitch.prop <- round(prop,3)
pandoc.table(pitch.prop)

4-seam FB	Sinker	Changeup	Curveball	Slider	Int. Ball
0.182	0.328	0.046	0.154	0.287	0.002

Next, we will look at the multi-class transition matrix for Jake Arrieta. The table below gives the probability of each pitch type given the previous pitch. For example, when Arrieta threw a four-seam fastball on the previous pitch, he followed it with another four-seam fastball 23.2% of the time. Knowing the previous pitch is more information than we had before. However, the Markov chain does not add much predictive power here, so maybe Arrieta just really is that good. Let’s find an example where Markov chains can significantly help a batter.

Jake Arrieta - Multi-class Markov Chain

## Multi-class ##

# uniqueID combines the at-bat number, game, and inning, so transitions are
# never counted across at-bats or from the end of one game into the next.
pitcher.matrix <- statetable.msm(pitch_type_full, uniqueID, data=pitcher)
transition.matrix <- round(t(t(pitcher.matrix) / rep(rowSums(pitcher.matrix), each=ncol(pitcher.matrix))),3)
transition.matrix

     to
 from           4-seam FB Sinker Changeup Curveball Slider  Int. Ball
    4-seam FB   0.232     0.234    0.042     0.164    0.327   0.000
    Sinker      0.131     0.370    0.038     0.137    0.325   0.000
    Changeup    0.172     0.328    0.090     0.172    0.238   0.000
    Curveball   0.202     0.303    0.048     0.098    0.348   0.000
    Slider      0.175     0.298    0.044     0.173    0.310   0.000
    Int. Ball   0.000     0.000    0.000     0.000    0.000   1.000
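The row normalization used above is compact but hard to read. As a sketch on made-up counts (the matrix and the state labels A, B, and C below are hypothetical, not Arrieta’s data), the following shows that it gives the same result as base R’s `prop.table` with `margin = 1`, which divides every row by its own total.

```r
# Hypothetical transition counts: rows = previous pitch, columns = next pitch.
counts <- matrix(c(30, 10, 20,
                    5, 25, 10,
                   15, 15, 30),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(from = c("A", "B", "C"),
                                 to   = c("A", "B", "C")))

# Normalization in the style used in this post.
post_style <- t(t(counts) / rep(rowSums(counts), each = ncol(counts)))

# Shorter base R equivalent: margin = 1 divides each row by its row sum.
short_style <- prop.table(counts, margin = 1)

print(round(post_style, 3))
print(round(short_style, 3))
```

Either form yields rows that sum to 1, so each row can be read as a probability distribution over the next pitch.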

Jake Arrieta - First Pitch of an At-Bat

Example of pitch proportions only for the first pitch of an at-bat.

first.pitch <- pitcher %>% 
  group_by(num, gameday_link) %>% 
  filter(row_number() <= 1) 

first.pitch.table <- table(first.pitch$pitch_type_full)
prop.first.pitch <- round(prop.table(first.pitch.table),3)
pandoc.table(prop.first.pitch)

4-seam FB	Sinker	Changeup	Curveball	Slider	Int. Ball
0.204	0.375	0.053	0.175	0.192	0.001

Chris Sale

In this example, let’s take a look at White Sox starting pitcher Chris Sale.

que <- inner_join(locations, filter(names, pitcher_name == "Chris Sale"),
                  by = c("num", "gameday_link"))
pitchfx <- as.data.frame(collect(que))
pitchfx <- data.table(pitchfx[ do.call(order, pitchfx[ , c("gameday_link","inning", "num") ] ), ])
pitchfx[, batter_num:=as.numeric(factor(num)), by=gameday_link]
pitchfx <- as.data.frame(pitchfx)
pitchfx$batter_num <- ifelse(pitchfx$batter_num %% 9 == 0, 9, (pitchfx$batter_num %% 9))
pitchfx$batter_num <- as.factor(pitchfx$batter_num)
pitchfx$pitch_type <- as.factor(pitchfx$pitch_type)

# FF, FA, and FS counts are very small and possibly misclassified, so fold them into FT.
pitchfx$pitch_type[pitchfx$pitch_type == 'FF'] <- 'FT'
pitchfx$pitch_type[pitchfx$pitch_type == 'FA'] <- 'FT'
pitchfx$pitch_type[pitchfx$pitch_type == 'FS'] <- 'FT'
pitchfx$pitch_type <- droplevels(pitchfx$pitch_type) # drop levels FF, FA, FS

# table(pitchfx$pitch_type)
pitchfx$pitch_type_full <- factor(pitchfx$pitch_type,
                                  levels=c("FT", "CH", "SL"),
                                  labels=c("2-seam FB", "Changeup", "Slider"))

pitcher <- as.data.frame(pitchfx[c(1,5:7,13:14)])
pitcher$uniqueID <- paste(pitcher$num, pitcher$gameday_link, pitcher$inning, sep='')

Chris Sale - Overall Pitch Proportions

Chris Sale’s 2015 pitch proportions are shown below. His two main pitches, a two-seam fastball and a changeup, account for roughly 53% and 28% of his pitches, respectively.

2-seam FB	Changeup	Slider
0.528	0.277	0.195

When analyzing Sale, Markov chains give a bit more insight into predicting his next pitch than they did for Arrieta. Even though a two-seam fastball is Sale’s most thrown pitch, when he threw a changeup on the previous pitch, he was roughly 9 to 12 percentage points more likely to come back with another changeup (37.0%) than if he had previously thrown a fastball (25.4%) or a slider (27.8%). This type of information shows the value of Markov chains because it is simply missed when only looking at overall pitch proportions. It is still not necessarily enough to confidently sit on one pitch or another, but it is enough to give the batter an edge against one of baseball’s most dominant pitchers.

Chris Sale - Multi-class Markov Chain

## Multi-class ##

# uniqueID combines the at-bat number, game, and inning, so transitions are
# never counted across at-bats or from the end of one game into the next.
pitcher.matrix <- statetable.msm(pitch_type_full, uniqueID, data=pitcher)
transition.matrix <- round(t(t(pitcher.matrix) / rep(rowSums(pitcher.matrix), each=ncol(pitcher.matrix))),3)
transition.matrix

     to
 from           2-seam FB Changeup Slider
    2-seam FB     0.556    0.254    0.190
    Changeup      0.467    0.370    0.164
    Slider        0.507    0.278    0.215
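One way to see how much the previous pitch adds is to subtract Sale’s overall proportions from each row of the transition matrix. The sketch below simply retypes the rounded numbers reported in this post, so small rounding differences are expected; the name `lift` is my own label for the difference, not a standard term.

```r
# Rounded values copied from the tables in this post.
overall <- c("2-seam FB" = 0.528, "Changeup" = 0.277, "Slider" = 0.195)

transition <- matrix(c(0.556, 0.254, 0.190,
                       0.467, 0.370, 0.164,
                       0.507, 0.278, 0.215),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(from = names(overall),
                                     to   = names(overall)))

# Subtract the overall rate of each "to" pitch from every row.
# Positive values mean the pitch is more likely than usual after that "from" pitch.
lift <- sweep(transition, 2, overall, "-")
print(round(100 * lift, 1))  # in percentage points
```

The largest positive entry is changeup after changeup, which matches the observation that Sale tends to double up on that pitch.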

Joe Kelly

Finally, we’ll look at one more example from a starting pitcher where Markov chains give more information to a batter than the overall pitch proportions. Below, let’s take a look at Red Sox starting pitcher Joe Kelly.

que <- inner_join(locations, filter(names, pitcher_name == "Joe Kelly"),
                  by = c("num", "gameday_link"))
pitchfx <- as.data.frame(collect(que))
pitchfx <- data.table(pitchfx[ do.call(order, pitchfx[ , c("gameday_link","inning", "num") ] ), ])
pitchfx[, batter_num:=as.numeric(factor(num)), by=gameday_link]
pitchfx <- as.data.frame(pitchfx)
pitchfx$batter_num <- ifelse(pitchfx$batter_num %% 9 == 0, 9, (pitchfx$batter_num %% 9))
pitchfx$batter_num <- as.factor(pitchfx$batter_num)
pitchfx$pitch_type <- as.factor(pitchfx$pitch_type)

# table(pitchfx$pitch_type)
pitchfx$pitch_type_full <- factor(pitchfx$pitch_type,
                                  levels=c("FF", "FT", "CH", "CU", "SL"),
                                  labels=c("4-seam FB","2-seam FB","Changeup", 
                                           "Curveball", "Slider"))

pitcher <- as.data.frame(pitchfx[c(1,5:7,13:14)])
pitcher$uniqueID <- paste(pitcher$num, pitcher$gameday_link, pitcher$inning, sep='')

Joe Kelly - Overall Pitch Proportions

In 2015, Joe Kelly threw a four-seam fastball on roughly 32% of his pitches, a two-seam fastball on 34%, and so on.

pitcher.table <- table(pitcher$pitch_type_full)
prop <- prop.table(pitcher.table)
pitch.prop <- round(prop,3)
pandoc.table(pitch.prop, emphasize.strong.cells = which(pitch.prop == 0.315, arr.ind = TRUE))

4-seam FB	2-seam FB	Changeup	Curveball	Slider
0.315	0.34	0.11	0.092	0.143

When looking at the transition matrix for Joe Kelly, we find much more information than we did for Jake Arrieta. In this case, when Kelly threw a four-seam fastball on the previous pitch, there is a 48% chance he’ll throw a four-seam fastball on the next pitch, a large jump from the 32% overall probability of a four-seam fastball. Additionally, when Kelly threw a two-seam fastball on the previous pitch, he is most likely to come back with that pitch again (51%, compared with 34% overall).

Joe Kelly - Multi-class Markov Chain

## Multi-class ##

# uniqueID combines the at-bat number, game, and inning, so transitions are
# never counted across at-bats or from the end of one game into the next.
pitcher.matrix <- statetable.msm(pitch_type_full, uniqueID, data=pitcher)
transition.matrix <- round(t(t(pitcher.matrix) / rep(rowSums(pitcher.matrix), each=ncol(pitcher.matrix))),3)
transition.matrix

     to
 from           4-seam FB 2-seam FB Changeup Curveball Slider
    4-seam FB     0.482     0.221    0.095     0.072    0.130
    2-seam FB     0.191     0.511    0.084     0.087    0.128
    Changeup      0.337     0.194    0.260     0.102    0.107
    Curveball     0.255     0.370    0.109     0.170    0.097
    Slider        0.264     0.247    0.060     0.077    0.353
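From a batter’s point of view, the most useful summary of this matrix may be the single most likely next pitch after each previous pitch. The sketch below retypes the rounded values from the table above and picks the largest entry in each row; the variable names are my own.

```r
pitches <- c("4-seam FB", "2-seam FB", "Changeup", "Curveball", "Slider")

# Rounded values copied from Joe Kelly's transition matrix above.
transition <- matrix(c(0.482, 0.221, 0.095, 0.072, 0.130,
                       0.191, 0.511, 0.084, 0.087, 0.128,
                       0.337, 0.194, 0.260, 0.102, 0.107,
                       0.255, 0.370, 0.109, 0.170, 0.097,
                       0.264, 0.247, 0.060, 0.077, 0.353),
                     nrow = 5, byrow = TRUE,
                     dimnames = list(from = pitches, to = pitches))

# For each previous pitch (row), find the column with the highest probability.
best_index  <- unname(apply(transition, 1, which.max))
most_likely <- data.frame(previous    = pitches,
                          next_pitch  = pitches[best_index],
                          probability = transition[cbind(seq_along(pitches), best_index)])
print(most_likely)
```

After a four-seam fastball, two-seam fastball, or slider, the most likely next pitch is the same pitch again, while after a changeup or curveball Kelly most often goes back to a fastball.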

Summary

When I use the word “significant” in this post, I mean it informally: I did not test for statistical significance, but I did use a full season’s worth of pitches for each pitcher and felt that it was a decent amount of data for a fair representation. Overall, Markov chains are easy to use in R thanks to packages like msm and markovchain. Further, Markov chains can help a batter gain insights that cannot be found on sites like FanGraphs or MLB.com. A potential further analysis could improve the accuracy of the Markov model by using not only pitch type but pitch location too. For example, if Sale threw a fastball in the bottom third of the zone on the previous pitch, then he comes back with a high fastball x% of the time on the next pitch.
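As a rough sketch of the location idea, the state could be defined as pitch type combined with a vertical zone, so that "low fastball" and "high fastball" become separate states. Everything below is illustrative: the toy pitches are made up, the cut points of 2.0 and 3.0 feet on `pz` (pitch height at the plate) are arbitrary assumptions rather than a real strike zone definition, and all six pitches are treated as one at-bat.

```r
# Hypothetical pitches within a single at-bat: type plus height at the plate (feet).
toy <- data.frame(
  pitch_type = c("FF", "FF", "SL", "FF", "CH", "FF"),
  pz         = c(1.8, 3.2, 1.6, 2.5, 1.9, 3.4),
  stringsAsFactors = FALSE
)

# Bin the height into three bands. The break points are illustrative only.
toy$zone <- cut(toy$pz, breaks = c(-Inf, 2.0, 3.0, Inf),
                labels = c("low", "mid", "high"))

# Composite state: pitch type and zone together, e.g. "FF_low".
toy$state <- paste(toy$pitch_type, toy$zone, sep = "_")

# Count transitions between consecutive composite states.
n <- nrow(toy)
counts <- table(from = toy$state[-n], to = toy$state[-1])
print(counts)
```

The trade-off is that the number of states grows quickly, so each cell of the transition matrix is estimated from far fewer pitches than in the pitch-type-only model.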