Have expected stats gone too far?
This article is part of a series on my pitch evaluation model, PitchingBot. An overview of the model can be found here.
To see the current xxxFIP leaderboards, click here.
ERA, FIP, xFIP, and beyond...
There are a wide range of pitching stats available to the discerning baseball fan. From Wins and ERA, to DRA- and xBACON, there's something for all tastes. In this post I'll introduce a new stat, xxxFIP, which is definitely NSFW (Not Safe For Wise decision making).
Before diving into the details of xxxFIP, here's an overview of its predecessors and what they are trying to measure.
Earned Run Average (ERA) is the grandfather of pitching statistics. It has existed since the early 1900s and is one of the most widely accepted measures of pitcher quality. It is a relatively simple statistic: the number of earned runs given up per nine innings pitched. An earned run is a run which the pitcher gives up that wasn't the fault of a defensive error. A low ERA is associated with good pitchers. Put simply, giving up fewer runs helps you win games.
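As a quick illustration of the definition above, here is a minimal sketch of the ERA calculation. The function name and the example numbers are made up for illustration.

```python
def era(earned_runs: int, innings_pitched: float) -> float:
    """Earned runs allowed per nine innings pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9.0 * earned_runs / innings_pitched


# Hypothetical pitcher: 60 earned runs over 180 innings.
# Box scores write partial innings as .1 and .2 (thirds of an inning),
# so convert those to true fractions before calling this function.
example_era = era(60, 180.0)
print(example_era)  # 3.0
```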
There are plenty of caveats when using ERA to predict future performance. ERA can heavily depend on batted ball luck, performance with runners on base, and the quality of the defense behind the pitcher. In small sample sizes, one bad outing can drastically skew a pitcher's ERA which makes it an unreliable statistic for measuring a pitcher's underlying ability to prevent runs.
This leads us towards Fielding Independent Pitching (FIP). In the early 2000s it was realized that pitchers do not have much control over whether balls in play fall for hits or not. FIP removes outcomes based on balls in play and judges pitchers on the events they have the most control over: strikeouts, walks, hit-by-pitches and home runs. These events are added in a weighted sum to produce a scale similar to ERA. Removing unpredictable batted ball outcomes produces a metric which can predict future ERA better than ERA itself!
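The weighted sum can be sketched as follows, using the weights commonly published for FIP (13 for home runs, 3 for walks and hit-by-pitches, -2 for strikeouts). The additive constant is recalculated each season so that league FIP matches league ERA; the default of 3.10 here is only an illustrative assumption, as are the function name and the example numbers.

```python
def fip(hr: float, bb: float, hbp: float, k: float,
        innings_pitched: float, fip_constant: float = 3.10) -> float:
    """Fielding Independent Pitching from the commonly published weights.

    fip_constant is an assumed placeholder; the real value changes by season.
    """
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    weighted_events = 13 * hr + 3 * (bb + hbp) - 2 * k
    return weighted_events / innings_pitched + fip_constant


# Hypothetical season: 20 HR, 50 BB, 5 HBP, 200 K in 180 innings.
example_fip = fip(hr=20, bb=50, hbp=5, k=200, innings_pitched=180.0)
print(round(example_fip, 2))
```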
Home runs are an important, but volatile, component of FIP. A statistic which attempts to address this is xFIP. Instead of using actual home runs in the FIP calculation, xFIP replaces them with an expected number of home runs: the number of flyballs hit against the pitcher multiplied by the league-wide home run per flyball rate. The idea is that pitchers have more control over their flyball rate than their home run rate, so using a league average home run per flyball rate reduces the variance in home run rate, which is mainly caused by the hitter. For a full breakdown see this article.
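In code, the only change from FIP is the home run term. This is a sketch under the same assumptions as the usual FIP formula: the 13 / 3 / -2 weights are the commonly published ones, while the constant and the league home run per flyball rate shown here are illustrative placeholders rather than real league values.

```python
def xfip(fb: float, bb: float, hbp: float, k: float,
         innings_pitched: float,
         lg_hr_per_fb: float = 0.10,
         fip_constant: float = 3.10) -> float:
    """xFIP: FIP with home runs replaced by flyballs * league HR/FB rate.

    lg_hr_per_fb and fip_constant are assumed placeholders, not real values.
    """
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    expected_hr = fb * lg_hr_per_fb
    weighted_events = 13 * expected_hr + 3 * (bb + hbp) - 2 * k
    return weighted_events / innings_pitched + fip_constant


# Hypothetical season: 180 flyballs, 50 BB, 5 HBP, 200 K in 180 innings.
example_xfip = xfip(fb=180, bb=50, hbp=5, k=200, innings_pitched=180.0)
print(round(example_xfip, 2))
```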
xxxFIP
Now we enter uncharted territory. There have been attempts to produce better estimates of home run rate through Statcast data and by breaking down flyballs in the infield vs the outfield, but I intend to add estimates of strikeouts and walks as well.
Hence the titular statistic: xxxFIP. The three x's stand for xK, xBB, and xHR, each of which is generated by my machine-learning-based pitch prediction model, PitchingBot. I'll go through how these predictions are made in the next section.
After creating xxxFIP I did a search online for "xxxFIP", just to make sure I wasn't accidentally plagiarizing anyone. I found four results: three are jokes on various social media platforms, and one is an archived FanGraphs chat from 2014. In it, Neil Weinberg makes an offhand comment:
"Fun fact, I built something called xxxFIP but haven’t tested it’s predictive nature. It uses those xK, xBB, etc stats."
Damn! Not only did someone else get there first, but they chose the same pun name as me!
There's no other subsequent reference to xxxFIP, so I'm keeping the name for my own version. It's too good to pass up.
The Three X's
PitchingBot is a machine learning model which attempts to predict pitch outcomes by using pitch characteristics and contextual information. The inputs used by PitchingBot are:
- Pitch Type
- Pitch location as it crosses the plate
- Vertical and horizontal movement
- Velocity
- Spin rate
- Pitcher arm slot (release point x and z)
- Pitcher handedness
- Batter handedness
- Count (balls and strikes)
I've tried several different prediction models, trying to predict run values or specific events, and also selectively excluding some inputs. For xxxFIP I'm using the model which uses all the input data and tries to predict the probability of the following events on each pitch:
- Swing
- Swinging strike
- Called strike
- Ball
- Foul ball
- Ball in play
- Contact
- Groundball
- Line drive
- Flyball
More details on the training and testing of PitchingBot can be found in this post.
To find the predicted number of strikeouts I simply add up the probabilities of called and swinging strikes for all pitches thrown in two-strike counts. The same is done for walks, summing the probability of a ball being called for every pitch thrown in a three-ball count.
To get the predicted rate of hit-by-pitches and flyballs, the probabilities of these events are added together over all the pitches thrown by a pitcher, with no need to filter by count.
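The summation described above can be sketched in a few lines. The per-pitch field names (`p_called_strike`, `p_swinging_strike`, `p_ball`, `p_hit_by_pitch`, `p_flyball`) are hypothetical stand-ins for PitchingBot's outputs, and the three example pitches are invented.

```python
from typing import Dict, List

Pitch = Dict[str, float]  # count fields plus per-pitch event probabilities


def expected_counts(pitches: List[Pitch]) -> Dict[str, float]:
    """Sum per-pitch probabilities into expected K, BB, HBP and FB totals."""
    # Strikeouts: called or swinging strike on a two-strike pitch.
    xk = sum(p["p_called_strike"] + p["p_swinging_strike"]
             for p in pitches if p["strikes"] == 2)
    # Walks: called ball on a three-ball pitch.
    xbb = sum(p["p_ball"] for p in pitches if p["balls"] == 3)
    # Hit-by-pitches and flyballs: every pitch counts, whatever the count.
    xhbp = sum(p["p_hit_by_pitch"] for p in pitches)
    xfb = sum(p["p_flyball"] for p in pitches)
    return {"xK": xk, "xBB": xbb, "xHBP": xhbp, "xFB": xfb}


# Three invented pitches with hypothetical model outputs.
example_pitches = [
    {"balls": 0, "strikes": 2, "p_called_strike": 0.10, "p_swinging_strike": 0.20,
     "p_ball": 0.30, "p_hit_by_pitch": 0.01, "p_flyball": 0.05},
    {"balls": 3, "strikes": 2, "p_called_strike": 0.15, "p_swinging_strike": 0.10,
     "p_ball": 0.25, "p_hit_by_pitch": 0.00, "p_flyball": 0.08},
    {"balls": 1, "strikes": 0, "p_called_strike": 0.30, "p_swinging_strike": 0.05,
     "p_ball": 0.40, "p_hit_by_pitch": 0.01, "p_flyball": 0.04},
]
example_totals = expected_counts(example_pitches)
print(example_totals)
```

Note that a full-count pitch (three balls, two strikes) contributes to both xK and xBB, which follows directly from the filtering rule.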
With these predicted statistics, I can plug them into the formula for xFIP and produce the new statistic xxxFIP. The denominator in the xFIP calculation is usually innings pitched, but here I have used the number of batters faced divided by 4.3 (the average number of batters faced in an inning). This allows for more reliable calculation of the statistic, as an inning pitched is of variable length depending on how many runners reach base. The correlations between xxxFIP and other statistics are not strongly changed when converting between batters faced and innings pitched in the denominator.
Note that none of this requires any knowledge about what happens to the pitches after they have been thrown. The same pitch could be thrown to Mike Trout or Jeff Mathis and it would have the same xxxFIP value. This means that xxxFIP is an attempt to measure batter independent pitching, which comes with benefits and drawbacks.
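Putting the pieces together, a rough sketch of the calculation might look like this. It assumes the commonly published FIP weights (13 / 3 / -2); the league home run per flyball rate and the FIP constant are illustrative placeholders, and the exact constants used for the leaderboards are not stated in this post.

```python
def xxxfip(xk: float, xbb: float, xhbp: float, xfb: float,
           batters_faced: float,
           lg_hr_per_fb: float = 0.10,
           fip_constant: float = 3.10,
           bf_per_inning: float = 4.3) -> float:
    """xFIP formula fed with expected K, BB, HBP and FB totals.

    Innings are approximated as batters_faced / bf_per_inning.
    lg_hr_per_fb and fip_constant are assumed placeholders.
    """
    if batters_faced <= 0:
        raise ValueError("batters_faced must be positive")
    approx_innings = batters_faced / bf_per_inning
    x_hr = xfb * lg_hr_per_fb
    weighted_events = 13 * x_hr + 3 * (xbb + xhbp) - 2 * xk
    return weighted_events / approx_innings + fip_constant


# Hypothetical pitcher: 430 batters faced is roughly 100 innings.
example_xxxfip = xxxfip(xk=100, xbb=30, xhbp=3, xfb=100, batters_faced=430)
print(round(example_xxxfip, 2))  # 3.39
```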
Pros and Cons
Pros
A major benefit to using a batter independent metric is that it automatically adjusts for opponent quality. A pitcher who faces a disproportionate number of home run sluggers will not have his xxxFIP penalized to the same extent as his FIP. This is useful as it allows us to be more confident in the reliability of xxxFIP over small sample sizes.
Another benefit is that a team does not need to view their pitcher in a real game situation against major league batters to measure xxxFIP. A bullpen session that simulates changes in count based on the pitches thrown could do the job just as well. PitchingBot produces probabilities which would allow this to be performed with good accuracy, as shown in the Figures below:
Calculating xxxFIP by using data which doesn't rely on real game experience could be an excellent tool to assist with scouting players and measuring development but I'm not sure how much value it would add over traditional scouting and pitch quality metrics.
Cons
There are drawbacks to xxxFIP which I should mention before moving on. The inputs that PitchingBot uses do not span the full range of factors which affect pitcher quality. Sequencing of pitches, deceptive deliveries, tunnelling effects, spin mirroring and more, can make a pitcher's arsenal greater than the sum of its parts. At this stage, PitchingBot can only measure the parts.
In addition, xxxFIP is still somewhat subject to random variation. xK and xBB depend on the counts which a pitcher finds himself in, so xxxFIP is not completely immune to the results of pitches and the idiosyncratic swing decisions made by batters.
Evaluating xxxFIP
We can compare the rates of xK, xBB and xFB to the actual rates and see if there are any large discrepancies. In each case only pitchers who faced at least 400 batters in a season were used for the comparison.
In each case the expected rates have a reasonable correlation to the actual rates. However, there is an offset: the actual rates are higher than the predicted rates.
Strikeout rates are around 4% higher than expected, walk rates are 12% higher than expected and flyball rates are 22% higher than expected. I assumed there would be some offset between the predicted rates and the actual rates. PitchingBot can only make predictions on pitches with complete tracking data, so pitches with incomplete data have to be thrown away, which slightly reduces the expected counts of Ks, BBs and FBs.
However, I did not expect this difference to vary between Ks, BBs and FBs. Perhaps pitches which are hit into play are more likely to have incomplete tracking data, and balls in the dirt could have the same problem. Alternatively, PitchingBot might be making poorer predictions on some events compared to others. I'll look to investigate the origin of this difference in another post.
Correlations and Use as an ERA Predictor
A useful test is to see how xxxFIP correlates to xFIP, FIP, and ERA. This will tell us whether we should consider testing it as an ERA predictor, or throw it straight in the garbage. The following table shows the R^2 measure between xxxFIP and the other stats for pitcher-seasons from 2015-2020 with a varying cut off for minimum batters faced. R^2 can vary between 0 and 1; 0 means that xxxFIP contains no information about the other statistic, while 1 means that xxxFIP is perfectly correlated with the statistic.
Minimum batters faced | xxxFIP-xFIP R^2 | xxxFIP-FIP R^2 | xxxFIP-ERA R^2 |
---|---|---|---|
10 | 0.49 | 0.27 | 0.10 |
50 | 0.52 | 0.30 | 0.17 |
100 | 0.55 | 0.37 | 0.22 |
400 | 0.65 | 0.50 | 0.25 |
We can compare this to how xFIP correlates to FIP and ERA:
Minimum batters faced | xFIP-FIP R^2 | xFIP-ERA R^2 |
---|---|---|
10 | 0.55 | 0.40 |
50 | 0.60 | 0.35 |
100 | 0.64 | 0.36 |
400 | 0.75 | 0.44 |
Clearly xxxFIP does not correlate as well as xFIP with ERA. It's easier to get a good correlation based on what did happen, rather than what may have happened. However, there is some correlation, even after a very small number of batters faced.
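For readers who want to reproduce this kind of comparison, here is a minimal sketch of an R^2 calculation with a minimum batters faced cut-off. It assumes R^2 means the squared Pearson correlation between the two statistics (equivalent to the R^2 of a simple linear fit); the record layout and the numbers are invented.

```python
from typing import Dict, List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


def r_squared_with_cutoff(rows: List[Dict[str, float]], stat_a: str,
                          stat_b: str, min_bf: int) -> float:
    """R^2 between two stats over pitcher-seasons with at least min_bf batters."""
    kept = [r for r in rows if r["bf"] >= min_bf]
    return r_squared([r[stat_a] for r in kept], [r[stat_b] for r in kept])


# Invented pitcher-seasons, purely for illustration.
example_rows = [
    {"bf": 450, "xxxfip": 3.1, "era": 3.4},
    {"bf": 520, "xxxfip": 4.2, "era": 4.6},
    {"bf": 610, "xxxfip": 3.7, "era": 3.5},
    {"bf": 35, "xxxfip": 5.0, "era": 1.8},
]
print(r_squared_with_cutoff(example_rows, "xxxfip", "era", min_bf=400))
```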
The next test is to see how xxxFIP, xFIP, FIP, and ERA correlate with ERA for the pitcher next year. The idea behind this is that expected stats stabilize more quickly, and therefore will contain predictive ability for a statistic dominated by noise such as ERA.
Next Year ERA R^2:
Statistic | Min. 10 batters | Min. 100 batters | Min. 400 batters |
---|---|---|---|
xxxFIP | 0.01 | 0.08 | 0.13 |
xFIP | 0.02 | 0.09 | 0.21 |
FIP | 0.02 | 0.06 | 0.15 |
ERA | 0.01 | 0.03 | 0.10 |
In small sample sizes almost nothing can be said about next year's ERA. For a minimum of 100 batters faced, xxxFIP does a better job of predicting next year's ERA than FIP, but xFIP is better still. Finally, for large samples of at least 400 batters faced, xxxFIP is better than ERA, but falls short of FIP and xFIP.
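A comparison like this needs each pitcher-season matched with the same pitcher's following season. The sketch below shows one way to build those pairs. The field names are hypothetical, and applying the batters faced cut-off only to the first season of each pair is an assumption, not something stated in this post.

```python
from typing import Dict, List, Tuple


def consecutive_season_pairs(rows: List[Dict], stat: str, target: str,
                             min_bf: int) -> List[Tuple[float, float]]:
    """Pair `stat` in season N with `target` in season N+1 for each pitcher.

    Assumption: the min_bf cut-off applies to season N only.
    """
    by_key = {(r["pitcher"], r["season"]): r for r in rows}
    pairs = []
    for (pitcher, season), row in by_key.items():
        following = by_key.get((pitcher, season + 1))
        if following is None or row["bf"] < min_bf:
            continue
        pairs.append((row[stat], following[target]))
    return pairs


# Invented data for three pitchers.
example_seasons = [
    {"pitcher": "A", "season": 2019, "bf": 500, "xxxfip": 3.2, "era": 3.5},
    {"pitcher": "A", "season": 2020, "bf": 250, "xxxfip": 3.4, "era": 4.1},
    {"pitcher": "B", "season": 2019, "bf": 40, "xxxfip": 4.8, "era": 6.0},
    {"pitcher": "B", "season": 2020, "bf": 300, "xxxfip": 4.5, "era": 4.9},
    {"pitcher": "C", "season": 2020, "bf": 600, "xxxfip": 2.9, "era": 3.0},
]
print(consecutive_season_pairs(example_seasons, "xxxfip", "era", min_bf=100))
# [(3.2, 4.1)]
```

Setting `stat` and `target` to the same column gives the pairs for a year-on-year comparison of a single statistic.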
The following table shows the R^2 values when comparing each statistic on a yearly basis for each pitcher.
Year-on-Year R^2:
Statistic | Min. 10 batters | Min. 100 batters | Min. 400 batters |
---|---|---|---|
xxxFIP | 0.17 | 0.32 | 0.45 |
xFIP | 0.08 | 0.22 | 0.42 |
FIP | 0.04 | 0.11 | 0.23 |
ERA | 0.01 | 0.03 | 0.10 |
xxxFIP shows the greatest correlation year-on-year, especially in small samples. This means that it is a more stable measure of pitcher quality than the other metrics.
Reliability
The reliability of a statistic is a useful and often overlooked concept. It has been discussed extensively in these articles on FanGraphs: 1 2 3. The reliability of a statistic over a number of plate appearances can tell us how much its value is affected by the player's true talent level vs noise. I would highly recommend reading the linked articles for a better understanding of reliability and sample size for different statistics.
Reliability goes up with a larger sample size, but this can vary significantly depending on the statistic being measured. A pitcher's fastball velocity or arm slot is very reliable: after observing only a few plate appearances you know almost everything there is to know about these metrics for a pitcher. On the other hand, a pitcher's BABIP allowed is very unreliable: even after a full season's worth of pitches a player's BABIP can vary significantly from their true talent level.
Reliability for a statistic is useful for making predictions. The more reliable it is, the less we have to regress to the mean when projecting changes in the statistic in the future. In addition, a statistic which is reliable over small samples is more informative in situations where only small samples are available. It would be ludicrous if a scout were to judge a pitcher based on his BABIP after watching one outing.
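The link between reliability and regression to the mean can be made concrete with the standard shrinkage estimate, where the observed value is pulled toward the league mean by a factor of (1 - reliability). This is a generic sketch, not a projection system; the numbers are invented.

```python
def regress_to_mean(observed: float, league_mean: float,
                    reliability: float) -> float:
    """Shrink an observed value toward the league mean.

    reliability = 1 keeps the observation as-is; reliability = 0
    returns the league mean.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return league_mean + reliability * (observed - league_mean)


# The same observed 2.50 against an assumed league mean of 4.00,
# under a reliable and an unreliable statistic.
print(round(regress_to_mean(2.50, 4.00, reliability=0.8), 2))  # 2.8
print(round(regress_to_mean(2.50, 4.00, reliability=0.2), 2))  # 3.7
```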
To test xxxFIP's reliability I shall be using Cronbach's Alpha. This is explained in detail in this article which was also linked above. Without going into much detail, higher alpha means greater reliability. The values of alpha for ERA, FIP, and xFIP were taken from this article.
For comparison with other metrics, the reliability of xxxFIP is around the same as K%.
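For reference, Cronbach's alpha for a table with one row per player and one column per "item" is k / (k - 1) * (1 - sum of item variances / variance of row totals), where k is the number of items. The sketch below implements that textbook formula. How the xxxFIP data is split into items (for example, by sampled batters faced) is not specified here, so the input layout is an assumption and the scores are invented.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a players-by-items table of scores.

    scores[i][j] is the value of item j for player i.
    """
    n_players = len(scores)
    n_items = len(scores[0]) if scores else 0
    if n_players < 2 or n_items < 2:
        raise ValueError("need at least two players and two items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every player needs the same number of items")

    # Variance of each item across players, and of each player's total.
    item_variances = [variance([row[j] for row in scores])
                      for j in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Invented scores: four players, three items each.
example_scores = [
    [3.1, 3.4, 3.0],
    [4.5, 4.2, 4.8],
    [3.9, 3.6, 4.1],
    [2.7, 3.0, 2.6],
]
print(cronbach_alpha(example_scores))
```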
xxxFIP Leaders
Since we are early in the 2021 season, this is the perfect time to start looking at high reliability metrics such as xxxFIP.
At the time of writing (04/08/21) there are 122 pitchers with at least 20 batters faced. Those with the sexiest xxxFIPs are:
Player Name | xxxFIP |
---|---|
Zack Wheeler | 1.74 |
Tyler Glasnow | 2.43 |
Corbin Burnes | 2.44 |
Joe Musgrove | 2.62 |
Alex Cobb | 2.78 |
And those with some of the least attractive xxxFIPs include:
Player Name | xxxFIP |
---|---|
Daniel Ponce de Leon | 5.34 |
Shohei Ohtani | 5.30 |
Chad Kuhl | 5.13 |
Carlos Rodon | 4.95 |
Jorge Lopez | 4.80 |
I've put the current xxxFIP leaderboards online here and will keep them updated throughout the season.
Summary
Using predicted pitch outcomes I've created a metric, xxxFIP, which attempts to predict ERA by isolating the quality of the pitches that a pitcher throws, along with the count that they are thrown in. This metric is more reliable than xFIP and FIP, and has the potential to be calculated without needing pitches to be thrown against real batters.
There are limitations to xxxFIP. Firstly, it only measures individual pitch quality, which ignores other important factors such as pitch sequencing and the relationship between the pitches in a pitcher's arsenal. Secondly, xxxFIP has lower accuracy than FIP and xFIP when predicting ERA on full season sample sizes. Finally, xxxFIP is produced by a machine learning model which requires a vast quantity of detailed pitch tracking data, meaning that the predictions can lack explainability and there is limited scope to apply xxxFIP in leagues beyond MLB.
Considering that this is a stat which I created purely because I thought the name would be funny, it turns out to be surprisingly successful and I'll be following the xxxFIP leaderboards closely this season.