Data Science Manager - Accenture
M.S. in Predictive Analytics - DePaul University
Major League Baseball currently uses high-definition cameras to capture everything from the release point of each pitch to its horizontal movement and even its spin rate. Rather than having a human manually record the type of each pitch thrown, an algorithm classifies it automatically. The goal of this analysis is to use MLB’s classifications to train a neural network of my own whose predictions match those of Major League Baseball.
The training dataset consists of all of Jake Arrieta’s pitches from the 2016 regular season, and the model is validated against the 98 pitches that Arrieta threw in Game 2 of the 2016 World Series.
# Load in the required packages
library(nnet)
library(caret)
library(pitchRx)
library(dplyr)
library(plyr)
library(RSQLite)
library(devtools)
The following script was used to scrape data from the 2016 season and filter for Jake Arrieta’s pitches. A similar script was run for only the data from 10/26/16, the date of Game 2 of the 2016 World Series.
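# Connect to an existing SQLite database (use create = TRUE to create the file first)
db2016 <- src_sqlite("MLB2016_All.sqlite3", create = FALSE)
# Scrape PITCHf/x data for the 2016 regular season into the database
scrape(start = "2016-04-01", end = "2016-10-01", connect = db2016$con)
# Pitch-level measurements
locations <- select(tbl(db2016, "pitch"),
                    pitch_type, start_speed, end_speed, num, gameday_link,
                    pfx_x, pfx_z, vx0, vy0, vz0, ax, ay, az, break_y,
                    break_angle, break_length, spin_dir, spin_rate)
# At-bat-level information, used to identify the pitcher
names <- select(tbl(db2016, "atbat"), pitcher_name, batter_name,
                num, gameday_link, stand)
# Keep only Arrieta's at-bats and join them to the pitch data
que <- inner_join(locations, filter(names, pitcher_name == "Jake Arrieta"),
                  by = c("num", "gameday_link"))
pitchfx <- collect(que, n = Inf)
pitchfx <- as.data.frame(pitchfx)
# Drop the identifier columns so that only pitch_type and the features remain
arrieta <- subset(pitchfx, select = -c(pitcher_name, batter_name, stand, num, gameday_link))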
# import function for plotting nnet
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
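# Read in 2016 season data and Game 2 of the World Series
# (stringsAsFactors = TRUE keeps pitch_type a factor in R >= 4.0, as nnet and caret expect)
arrieta <- read.csv('~/arrieta.csv', stringsAsFactors = TRUE)
validation <- read.csv('~/world_series.csv', stringsAsFactors = TRUE)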
# Split the regular season data into a new training and testing set.
intrain <- createDataPartition(arrieta$pitch_type, p=0.7, list=FALSE)
myTrain <- arrieta[intrain,]
myTest <- arrieta[-intrain,]
head(myTrain)
Example of the dataset. A glossary of the PITCHf/x fields can be seen here.
pitch_type start_speed end_speed pfx_x pfx_z vx0 vy0 vz0 ax ay az
2 SL 91.2 85.3 4.30 4.92 6.892 -133.311 -7.881 7.829 25.485 -23.139
3 SI 94.6 87.4 -9.14 5.68 9.580 -138.194 -7.213 -17.751 29.649 -21.066
4 FF 92.6 84.7 -7.68 9.66 12.007 -135.195 -4.506 -14.149 30.639 -14.305
5 SI 92.5 84.2 -8.14 7.79 11.627 -134.928 -7.360 -14.827 32.771 -17.919
6 SI 94.0 85.1 -9.72 8.95 12.134 -137.136 -5.906 -18.203 35.086 -15.348
9 FF 94.1 87.3 -6.69 9.64 10.836 -136.968 -12.245 -12.816 27.952 -13.644
break_y break_angle break_length spin_dir spin_rate
2 23.9 -20.1 5.8 139.091 1301.037
3 23.8 35.0 5.7 237.962 2198.319
4 23.8 36.8 4.2 218.372 2445.865
5 23.7 31.6 5.2 226.126 2211.533
6 23.7 42.0 4.9 227.250 2622.399
9 23.8 33.7 4.1 214.668 2386.429
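For a factor outcome such as pitch_type, createDataPartition draws the 70% sample within each class, so rare pitches like the changeup stay represented in both subsets. A minimal base-R sketch of that idea follows; the class counts are made up for illustration and are not Arrieta's actual totals.

```r
# Toy labels standing in for arrieta$pitch_type (hypothetical counts)
set.seed(1)
pitch_type <- factor(rep(c("CH", "CU", "FF", "SI", "SL"),
                         times = c(20, 40, 60, 120, 60)))

# Sample 70% of the row indices within each pitch type
idx_by_class <- split(seq_along(pitch_type), pitch_type)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

# Class proportions should be nearly identical in the two subsets
round(prop.table(table(pitch_type[train_idx])), 3)
round(prop.table(table(pitch_type[-train_idx])), 3)
```

Because the sampling is random, calling set.seed() before the split is what makes the resulting training and testing sets reproducible.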
This is a simple feed-forward neural network with a single hidden layer. More complicated methods were tested using Keras and Theano, but they proved unnecessary for this type of classification.
set.seed(353)
nn_mod1 <- nnet(pitch_type ~ ., data = myTrain,
                size = 10, rang = 0.1, decay = 5e-04,
                maxit = 1000)
test.pred <- predict(nn_mod1, newdata = myTest, type = "class")
table(test.pred, myTest$pitch_type)
test.pred CH CU FF SI SL
CH 41 0 0 0 0
CU 2 113 0 1 2
FF 1 0 189 2 1
SI 0 0 2 411 0
SL 0 1 1 0 165
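The table above can be summarized as an overall accuracy plus per-class recall and precision. The sketch below re-enters the reported counts by hand (rows are predictions, columns are MLB's labels) instead of using the original data, which is not reproduced here.

```r
classes <- c("CH", "CU", "FF", "SI", "SL")

# Counts copied from the test-set table above
cm <- matrix(c(41,   0,   0,   0,   0,
                2, 113,   0,   1,   2,
                1,   0, 189,   2,   1,
                0,   0,   2, 411,   0,
                0,   1,   1,   0, 165),
             nrow = 5, byrow = TRUE,
             dimnames = list(predicted = classes, actual = classes))

accuracy  <- sum(diag(cm)) / sum(cm)   # 919 of 932 pitches, about 0.986
recall    <- diag(cm) / colSums(cm)    # share of each actual pitch type recovered
precision <- diag(cm) / rowSums(cm)    # share of each predicted type that is right

round(accuracy, 3)
round(rbind(recall, precision), 3)
```

The changeup is the weakest class by recall (41 of 44), which is unsurprising given that it is also the rarest pitch in the test set.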
# plot the model
plot.nnet(nn_mod1)
This image shows the layout of the neural network with 15 input nodes (features), 10 hidden nodes and 5 output nodes (pitch types).
# structure (15 inputs, 10 hidden nodes in a single hidden layer, 5 outputs)
nn_mod1$n
# summary of model
summary(nn_mod1)
This summary lists the weight and bias values on the connections between nodes. The table can be read as follows: the bias value to the first hidden node is 0.0, the weight from the first input node to the first hidden node is -0.44, the weight from the second input node to the first hidden node is -0.38, and so on.
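To make the role of those weights and biases concrete, here is a rough sketch of a forward pass through a 15-10-5 network. It assumes logistic hidden units and a softmax output layer, which is how nnet handles a factor response with more than two levels. The weights are random placeholders, not the fitted values from nn_mod1, and the names W1, W2 and forward are illustrative only.

```r
n_in <- 15; n_hidden <- 10; n_out <- 5

# One bias plus one weight per incoming node, for every hidden and output node
n_weights <- (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

set.seed(42)
# Random placeholder weights in [-0.1, 0.1] (NOT the fitted values)
W1 <- matrix(runif((n_in + 1) * n_hidden, -0.1, 0.1), nrow = n_hidden)
W2 <- matrix(runif((n_hidden + 1) * n_out, -0.1, 0.1), nrow = n_out)

sigmoid <- function(z) 1 / (1 + exp(-z))
softmax <- function(z) exp(z - max(z)) / sum(exp(z - max(z)))

forward <- function(x) {
  h <- sigmoid(W1 %*% c(1, x))         # prepend 1 so column 1 acts as the bias
  as.vector(softmax(W2 %*% c(1, h)))   # one probability per pitch type
}

x <- rnorm(n_in)     # a made-up feature vector for one pitch
probs <- forward(x)
n_weights            # 215
sum(probs)           # probabilities sum to 1
```

The predicted pitch type is simply the class with the largest probability, which is what predict(..., type = "class") returns.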
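# Validate on new data from Game 2 of the 2016 World Series
val.pred <- predict(nn_mod1, newdata = validation, type = "class")
# Index of any misclassified pitch (here, pitch number 13)
which(val.pred != as.character(validation$pitch_type))
table(val.pred, validation$pitch_type)
# confusionMatrix() expects two factors with identical levels
val.ref <- factor(validation$pitch_type)
confusionMatrix(factor(val.pred, levels = levels(val.ref)), val.ref)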
The confusion matrix shows that the model misclassified one pitch out of the 98 thrown by Arrieta in Game 2 of the World Series. The thirteenth pitch he threw was a sinker and my model classified the pitch as a four-seam fastball. We can take a look at the actual video footage to see why the model may have been wrong. As it turns out, this particular pitch was not thrown well by Arrieta.
Reference
Prediction CH CU FF SI SL
CH 1 0 0 0 0
CU 0 15 0 0 0
FF 0 0 49 1 0
SI 0 0 0 11 0
SL 0 0 0 0 21
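With only 98 validation pitches, a single miss moves the accuracy by about one percentage point, so it helps to look at an interval as well as the point estimate. caret's confusionMatrix reports a 95% interval based on an exact binomial test; the sketch below computes the same kind of interval directly from the counts in the table above.

```r
correct <- 97   # pitches on the diagonal of the validation table
total   <- 98   # all pitches Arrieta threw in Game 2

accuracy <- correct / total

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- as.numeric(binom.test(correct, total)$conf.int)

round(accuracy, 4)
round(ci, 4)
```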
First, let’s take a look at a four-seam fastball and a sinker thrown by Arrieta to show how similar these pitches look and how difficult it would be for a human to tell them apart. The image below shows one of each pitch thrown in the first inning of Game 2, both in nearly the same location and at almost identical velocities.
Image from Baseball Savant
Play the videos of each pitch and see for yourself if you would have been able to identify the difference.
Pitch 1: Sinker
Video from MLB.com
Pitch 4: Four-Seam Fastball
Video from MLB.com
Pitch 13 in the video below was marked by MLB as a sinker and misclassified by my model as a four-seam fastball. Arrieta missed his spot, throwing the ball far outside the strike zone, so one might assume that he did not throw this pitch well. There appears to be a very small downward cut on the pitch, which would identify it as a sinker, but it is very difficult to detect with the naked eye. The table below shows how subtle the differences can be by looking at the means of several measurements grouped by pitch type. In this game, Arrieta’s sinker averaged a somewhat larger break angle and break length than his four-seam fastball, while the mean spin rates of the two pitches were nearly identical; these are differences that only a computer can pick up.
# Detach plyr so that summarise() resolves to dplyr's grouped version
detach(package:plyr)
mean.group <- validation %>%
  group_by(pitch_type) %>%
  summarise(mean_break_angle = mean(break_angle),
            mean_break_length = mean(break_length),
            mean_spin_rate = mean(spin_rate),
            mean_spin_dir = mean(spin_dir))
as.data.frame(mean.group)
pitch_type mean_break_angle mean_break_length mean_spin_rate mean_spin_dir
1 CH 26.60000 5.800000 1957.302 228.09300
2 CU -16.50667 13.326667 1976.298 50.75767
3 FF 26.34082 4.057143 2162.419 209.88673
4 SI 31.75000 5.125000 2132.878 226.50800
5 SL -18.04286 7.695238 1120.785 107.67705
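To put the four-seam fastball and sinker rows side by side, the sketch below copies those two rows from the table above and computes how far the sinker's averages sit from the fastball's, in absolute and percentage terms. The values are typed in by hand from the printed output; nothing is recomputed from the raw pitch data.

```r
# FF and SI rows copied from the grouped means above
means <- data.frame(
  pitch_type        = c("FF", "SI"),
  mean_break_angle  = c(26.34082, 31.75000),
  mean_break_length = c(4.057143, 5.125000),
  mean_spin_rate    = c(2162.419, 2132.878),
  mean_spin_dir     = c(209.88673, 226.50800)
)

ff <- unlist(means[means$pitch_type == "FF", -1])
si <- unlist(means[means$pitch_type == "SI", -1])

abs_diff <- si - ff                          # sinker minus four-seam
pct_diff <- round(100 * (si - ff) / ff, 1)   # relative to the four-seam mean

rbind(abs_diff = round(abs_diff, 2), pct_diff)
```

On these numbers the break measurements differ by roughly 20 to 26 percent, while the mean spin rate differs by under 2 percent.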
Pitch 13: Sinker
Video from MLB.com
Overall, it is clear that you do not need a very complicated model to predict pitch types. This model correctly classified 97 of the 98 validation pitches (about 99% accuracy), and even then, some pitches cannot be cleanly categorized as one type or another for various reasons.