Data Science Manager - Accenture
M.S. in Predictive Analytics - DePaul University
Markov chains are mathematical systems that hop from one “state” to another, where the probability of the next state depends only on the current state. In this demonstration, I look at how Markov chains can help estimate the probability of a specific pitch type being thrown given the type of the previous pitch. States restart after each batter, meaning that the last pitch to one batter is not used to predict the first pitch to the next batter. All data is from the 2015 season and comes from MLB Gameday.
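To make the idea concrete before touching the real data, here is a minimal sketch in base R. The `pitches` data frame, its `at_bat` column, and the pitch sequences are made up for illustration and are not from MLB Gameday. It counts pitch-to-pitch transitions within each at-bat and turns the counts into conditional probabilities.

```r
# Toy data (hypothetical): two at-bats, rows already in pitch order.
pitches <- data.frame(
  at_bat     = c(1, 1, 1, 1, 2, 2, 2),
  pitch_type = c("FF", "SL", "FF", "FF", "SL", "SL", "FF"),
  stringsAsFactors = FALSE
)

# Previous pitch for every row, shifted down by one.
prev_type <- c(NA, head(pitches$pitch_type, -1))

# The first pitch of each at-bat has no previous pitch, so the chain
# restarts and no transition crosses from one batter to the next.
prev_type[!duplicated(pitches$at_bat)] <- NA

# Transition counts: rows = previous pitch, columns = next pitch.
# table() drops the NA rows by default.
counts <- table(from = prev_type, to = pitches$pitch_type)

# Divide each row by its total to get P(next pitch | previous pitch).
transition <- prop.table(counts, margin = 1)
print(round(transition, 3))
```

Each row of `transition` sums to 1, which is what lets us read a row as "given this previous pitch, here is the distribution of the next one."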
library(pitchRx)
library(RSQLite)
library(dplyr)
library(knitr)
library(rmarkdown)
library(msm)
library(data.table)
library(pander)
setwd("~/pitchRx")
#db2015_All <- src_sqlite("MLB2015_All.sqlite3", create = TRUE)
#scrape(start = "2015-04-05", end = "2015-10-09", connect = db2015_All$con)
db2015_All <- src_sqlite("MLB2015_All.sqlite3", create = FALSE)
# Join the location and names table into a new que table.
locations <- select(tbl(db2015_All, "pitch"),
pitch_type, px, pz, des, num, gameday_link, inning)
names <- select(tbl(db2015_All, "atbat"), pitcher_name, batter_name,
num, b_height, gameday_link, event, stand)
que <- inner_join(locations, filter(names, pitcher_name == "Jake Arrieta"),
by = c("num", "gameday_link"))
pitchfx <- as.data.frame(collect(que))
pitchfx <- data.table(pitchfx[ do.call(order, pitchfx[ , c("gameday_link","inning", "num") ] ), ])
pitchfx[, batter_num:=as.numeric(factor(num)), by=gameday_link]
pitchfx <- as.data.frame(pitchfx)
pitchfx$batter_num <- ifelse(pitchfx$batter_num %% 9 == 0, 9, (pitchfx$batter_num %% 9))
pitchfx$batter_num <- as.factor(pitchfx$batter_num)
pitchfx$pitch_type <- as.factor(pitchfx$pitch_type)
# table(pitchfx$pitch_type)
pitchfx$pitch_type_full <- factor(pitchfx$pitch_type,
levels=c("FF", "SI", "CH", "CU", "SL", "IN"),
labels=c("4-seam FB","Sinker","Changeup",
"Curveball", "Slider", "Int. Ball"))
pitcher <- as.data.frame(pitchfx[c(1,5:7,13:14)])
pitcher$uniqueID <- paste(pitcher$num, pitcher$gameday_link, pitcher$inning, sep='')
In this example, we will start by looking at the overall pitch proportions for Jake Arrieta. The table below shows the distribution of Arrieta’s pitch choices. In 2015, Arrieta threw a four-seam fastball 18% of the time, a sinker 33%, a slider 29%, etc. This is good information, but it does not give the batter much predictive power. We know that Arrieta is most likely to throw a sinker, but I don’t think any batter will go up to the plate sitting on that pitch given only this information.
pitcher.table <- table(pitcher$pitch_type_full)
prop <- prop.table(pitcher.table)
pitch.prop <- round(prop,3)
pandoc.table(pitch.prop)
4-seam FB | Sinker | Changeup | Curveball | Slider | Int. Ball |
---|---|---|---|---|---|
0.182 | 0.328 | 0.046 | 0.154 | 0.287 | 0.002 |
Next, we will look at the multi-class transition matrix for Jake Arrieta. The table below gives the probability of each pitch type given the type of the previous pitch. The transition matrix shows that when Arrieta threw a four-seam fastball on the previous pitch, he threw another four-seam fastball on the next pitch 23.2% of the time, compared with 18.2% overall. Knowing the previous pitch gives us more to work with than the overall proportions alone. However, not much information was gained through the use of Markov chains here, so maybe Arrieta just really is that good. Let’s find an example where Markov chains can significantly help a batter.
## Multi-class ##
# uniqueID combines at-bat number, game link, and inning, so transitions never cross at-bats or games.
pitcher.matrix <- statetable.msm(pitch_type_full, uniqueID, data=pitcher)
transition.matrix <- round(t(t(pitcher.matrix) / rep(rowSums(pitcher.matrix), each=ncol(pitcher.matrix))),3)
transition.matrix
to
from 4-seam FB Sinker Changeup Curveball Slider Int. Ball
4-seam FB 0.232 0.234 0.042 0.164 0.327 0.000
Sinker 0.131 0.370 0.038 0.137 0.325 0.000
Changeup 0.172 0.328 0.090 0.172 0.238 0.000
Curveball 0.202 0.303 0.048 0.098 0.348 0.000
Slider 0.175 0.298 0.044 0.173 0.310 0.000
Int. Ball 0.000 0.000 0.000 0.000 0.000 1.000
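The row normalization used above can be hard to read at a glance. As a small sketch on made-up counts (the `counts` matrix below is hypothetical, not Arrieta's data), the same result comes out of base R's `prop.table()` with `margin = 1`, which divides each row by its row total.

```r
# Hypothetical transition counts: rows = previous pitch, columns = next pitch.
counts <- matrix(
  c(12,  8,  5,
     6, 20,  4,
     3,  7, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(from = c("FB", "CH", "SL"), to = c("FB", "CH", "SL"))
)

# The expression used above: transpose, divide by row totals, transpose back.
original <- t(t(counts) / rep(rowSums(counts), each = ncol(counts)))

# Shorthand: margin = 1 means "normalize within each row".
shorthand <- prop.table(counts, margin = 1)

# Both approaches agree, and every row sums to 1.
stopifnot(isTRUE(all.equal(original, shorthand)))
stopifnot(isTRUE(all.equal(unname(rowSums(shorthand)), c(1, 1, 1))))

round(shorthand, 3)
```

One caveat applies to both versions: a previous-pitch type that never has a following pitch has a row total of zero, and its row would come out as `NaN`.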
Below are Arrieta’s pitch proportions for only the first pitch of each at-bat.
first.pitch <- pitcher %>%
group_by(num, gameday_link) %>%
filter(row_number() <= 1)
first.pitch.table <- table(first.pitch$pitch_type_full)
prop.first.pitch <- round(prop.table(first.pitch.table),3)
pandoc.table(prop.first.pitch)
4-seam FB | Sinker | Changeup | Curveball | Slider | Int. Ball |
---|---|---|---|---|---|
0.204 | 0.375 | 0.053 | 0.175 | 0.192 | 0.001 |
In this example, let’s take a look at White Sox starting pitcher, Chris Sale.
que <- inner_join(locations, filter(names, pitcher_name == "Chris Sale"),
by = c("num", "gameday_link"))
pitchfx <- as.data.frame(collect(que))
pitchfx <- data.table(pitchfx[ do.call(order, pitchfx[ , c("gameday_link","inning", "num") ] ), ])
pitchfx[, batter_num:=as.numeric(factor(num)), by=gameday_link]
pitchfx <- as.data.frame(pitchfx)
pitchfx$batter_num <- ifelse(pitchfx$batter_num %% 9 == 0, 9, (pitchfx$batter_num %% 9))
pitchfx$batter_num <- as.factor(pitchfx$batter_num)
pitchfx$pitch_type <- as.factor(pitchfx$pitch_type)
# FF, FA, and FS counts are very small and possibly misclassified. Recode them as FT.
pitchfx$pitch_type[pitchfx$pitch_type == 'FF'] <- 'FT'
pitchfx$pitch_type[pitchfx$pitch_type == 'FA'] <- 'FT'
pitchfx$pitch_type[pitchfx$pitch_type == 'FS'] <- 'FT'
pitchfx$pitch_type <- droplevels(pitchfx$pitch_type) # drop levels FF, FA, FS
# table(pitchfx$pitch_type)
pitchfx$pitch_type_full <- factor(pitchfx$pitch_type,
levels=c("FT", "CH", "SL"),
labels=c("2-seam FB", "Changeup", "Slider"))
pitcher <- as.data.frame(pitchfx[c(1,5:7,13:14)])
pitcher$uniqueID <- paste(pitcher$num, pitcher$gameday_link, pitcher$inning, sep='')
Chris Sale’s 2015 pitch proportions are shown below. His two main pitches, a two-seam fastball and a changeup, account for roughly 53% and 28% of his pitches, respectively.
2-seam FB | Changeup | Slider |
---|---|---|
0.528 | 0.277 | 0.195 |
When analyzing Sale, Markov chains give a bit more insight into predicting his next pitch than they did for Arrieta. Even though a two-seam fastball is Sale’s most thrown pitch, when he threw a changeup on the previous pitch, he was roughly 9 to 12 percentage points more likely to come back with another changeup than if he had previously thrown a fastball or slider (37.0% versus 25.4% and 27.8%). This type of information shows the importance of Markov chains because it is simply missed when only looking at overall pitch proportions. It is still not necessarily enough information to confidently assume one pitch or another, but it is enough to give the batter an edge against one of baseball’s most dominant pitchers.
## Multi-class ##
# uniqueID combines at-bat number, game link, and inning, so transitions never cross at-bats or games.
pitcher.matrix <- statetable.msm(pitch_type_full, uniqueID, data=pitcher)
transition.matrix <- round(t(t(pitcher.matrix) / rep(rowSums(pitcher.matrix), each=ncol(pitcher.matrix))),3)
transition.matrix
to
from 2-seam FB Changeup Slider
2-seam FB 0.556 0.254 0.190
Changeup 0.467 0.370 0.164
Slider 0.507 0.278 0.215
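One way to see where the previous pitch matters is to subtract the overall proportions from each row of the transition matrix. The sketch below copies Sale's numbers from the two tables above; the object names `overall`, `transition`, and `shift` are my own.

```r
# Chris Sale's overall 2015 pitch proportions (from the table above).
overall <- c("2-seam FB" = 0.528, "Changeup" = 0.277, "Slider" = 0.195)

# Chris Sale's transition matrix (from the output above).
transition <- matrix(
  c(0.556, 0.254, 0.190,
    0.467, 0.370, 0.164,
    0.507, 0.278, 0.215),
  nrow = 3, byrow = TRUE,
  dimnames = list(from = names(overall), to = names(overall))
)

# Subtract each pitch's overall proportion from its column, so every cell
# becomes "conditional probability minus overall probability".
shift <- sweep(transition, 2, overall, FUN = "-")

# Express the differences in percentage points.
round(shift * 100, 1)
```

Positive cells are pitches that become more likely after a given previous pitch; the changeup-after-changeup cell stands out as the largest positive shift.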
Finally, we’ll look at one more example from a starting pitcher where Markov chains give more information to a batter than the overall pitch proportions. Below, let’s take a look at Red Sox starting pitcher, Joe Kelly.
que <- inner_join(locations, filter(names, pitcher_name == "Joe Kelly"),
by = c("num", "gameday_link"))
pitchfx <- as.data.frame(collect(que))
pitchfx <- data.table(pitchfx[ do.call(order, pitchfx[ , c("gameday_link","inning", "num") ] ), ])
pitchfx[, batter_num:=as.numeric(factor(num)), by=gameday_link]
pitchfx <- as.data.frame(pitchfx)
pitchfx$batter_num <- ifelse(pitchfx$batter_num %% 9 == 0, 9, (pitchfx$batter_num %% 9))
pitchfx$batter_num <- as.factor(pitchfx$batter_num)
pitchfx$pitch_type <- as.factor(pitchfx$pitch_type)
# table(pitchfx$pitch_type)
pitchfx$pitch_type_full <- factor(pitchfx$pitch_type,
levels=c("FF", "FT", "CH", "CU", "SL"),
labels=c("4-seam FB","2-seam FB","Changeup",
"Curveball", "Slider"))
pitcher <- as.data.frame(pitchfx[c(1,5:7,13:14)])
pitcher$uniqueID <- paste(pitcher$num, pitcher$gameday_link, pitcher$inning, sep='')
In 2015, Joe Kelly threw a four-seam fastball on 32% of his pitches, a two-seam fastball on 34%, etc.
pitcher.table <- table(pitcher$pitch_type_full)
prop <- prop.table(pitcher.table)
pitch.prop <- round(prop,3)
pandoc.table(pitch.prop, emphasize.strong.cells = which(pitch.prop == 0.315, arr.ind = TRUE))
4-seam FB | 2-seam FB | Changeup | Curveball | Slider |
---|---|---|---|---|
0.315 | 0.34 | 0.11 | 0.092 | 0.143 |
When looking at the transition matrix for Joe Kelly, we find much more information than we did for Jake Arrieta. In this case, when Kelly threw a four-seam fastball on the previous pitch, there is a 48% chance he’ll throw a four-seam fastball on the next pitch, a significant jump from the 32% overall probability of a four-seam fastball. Additionally, when Kelly threw a two-seam fastball on the previous pitch, he is most likely to come back with that pitch again (51%, versus 34% overall).
## Multi-class ##
# uniqueID combines at-bat number, game link, and inning, so transitions never cross at-bats or games.
pitcher.matrix <- statetable.msm(pitch_type_full, uniqueID, data=pitcher)
transition.matrix <- round(t(t(pitcher.matrix) / rep(rowSums(pitcher.matrix), each=ncol(pitcher.matrix))),3)
transition.matrix
to
from 4-seam FB 2-seam FB Changeup Curveball Slider
4-seam FB 0.482 0.221 0.095 0.072 0.130
2-seam FB 0.191 0.511 0.084 0.087 0.128
Changeup 0.337 0.194 0.260 0.102 0.107
Curveball 0.255 0.370 0.109 0.170 0.097
Slider 0.264 0.247 0.060 0.077 0.353
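To read the matrix the way a batter would, we can pull out the single most likely next pitch for each previous pitch. This is a small sketch that re-enters Kelly's printed matrix by hand; the names `kelly` and `most_likely` are my own.

```r
pitch_names <- c("4-seam FB", "2-seam FB", "Changeup", "Curveball", "Slider")

# Joe Kelly's transition matrix, copied from the output above.
kelly <- matrix(
  c(0.482, 0.221, 0.095, 0.072, 0.130,
    0.191, 0.511, 0.084, 0.087, 0.128,
    0.337, 0.194, 0.260, 0.102, 0.107,
    0.255, 0.370, 0.109, 0.170, 0.097,
    0.264, 0.247, 0.060, 0.077, 0.353),
  nrow = 5, byrow = TRUE,
  dimnames = list(from = pitch_names, to = pitch_names)
)

# For each row (previous pitch), find the column with the highest probability.
most_likely <- data.frame(
  previous    = rownames(kelly),
  next_pitch  = colnames(kelly)[max.col(kelly, ties.method = "first")],
  probability = apply(kelly, 1, max),
  row.names   = NULL,
  stringsAsFactors = FALSE
)
most_likely
```

Both fastballs and the slider are most often followed by the same pitch, while the changeup and curveball are most often followed by a fastball.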
When I use the word significant, it should be noted that I did not test for statistical significance. I did, however, use a full season’s worth of pitches for each pitcher, which I felt was a decent amount of data for a fair representation. Overall, Markov chains are easy to use in R thanks to packages like msm and markovchain. Further, Markov chains can help a batter gain insights that cannot be found on sites like FanGraphs or MLB.com. A potential further analysis could enhance the accuracy of the Markov model by using not only pitch type but pitch location too. For example, if Sale threw a fastball in the bottom third of the zone on the previous pitch, then he comes back with a high fastball x% of the time on the next pitch.
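As a rough sketch of that location idea, the state can be defined as pitch type plus a height band instead of pitch type alone. Everything in the example below is hypothetical: the pitches are made up, and the cut points of 2.2 and 3.0 feet on `pz` (pitch height at the plate) are illustrative assumptions, not an official strike zone.

```r
# Hypothetical pitches, rows already in pitch order within each at-bat.
pitches <- data.frame(
  at_bat     = c(1, 1, 1, 2, 2, 2, 2),
  pitch_type = c("FB", "FB", "SL", "FB", "FB", "SL", "FB"),
  pz         = c(1.8, 3.4, 1.6, 1.9, 3.3, 2.5, 3.6),
  stringsAsFactors = FALSE
)

# Bin pitch height into three bands (cut points are assumptions).
pitches$height <- cut(pitches$pz,
                      breaks = c(-Inf, 2.2, 3.0, Inf),
                      labels = c("low", "mid", "high"))

# Compound state: pitch type plus height band, e.g. "FB low".
pitches$state <- paste(pitches$pitch_type, pitches$height)

# Previous state, restarting at the first pitch of each at-bat.
prev_state <- c(NA, head(pitches$state, -1))
prev_state[!duplicated(pitches$at_bat)] <- NA

# Transition counts and row-normalized probabilities between compound states.
counts <- table(from = prev_state, to = pitches$state)
transition <- prop.table(counts, margin = 1)
round(transition, 2)
```

The trade-off is that every extra dimension multiplies the number of states, so each cell of the matrix is estimated from fewer pitches. With real data it would be worth checking the raw counts before trusting any single probability.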