Missing balls filled by NN solution

EricFeczko

Member
SoSH Member
Apr 26, 2014
4,851
From the THT via fangraphs:
https://www.fangraphs.com/tht/43416-2/

In 2015 and 2016, Statcast had a well-publicized problem with missing data, but now in 2017 every batted ball has exit velocity and launch angle information. So that means the problem was solved, right? Wrong. The problem persists. Around 11.6 percent of batted balls on Baseball Savant have their exit velocity and launch angle decided by an algorithm, not a TrackMan measurement. The consequences of this algorithm may color your understanding of Statcast and which techniques you will want to avoid when analyzing the league or its players.
I'm not sure if others have read this yet, but this seems important for anyone interested in Statcast data. Approximately 11 percent of batted balls had "missing data" that was imputed using MLB's "No Nulls Solution". The No Nulls Solution appears to be a method for imputing exit velocity and launch angle from stringer information:

When TrackMan data is missing we lose out on:

  • Exit velocity
  • Launch angle
  • Batted ball distance
  • Batted ball spin
It is important to understand that batted ball distance is lost. If you had batted ball distance, you might be able to reverse engineer an exit velocity using the batted ball type and fielding location data. Alas, we are left without the distance data.

We are left with:

  • Batted ball type
  • Batted ball result
  • Which players fielded the ball
  • Rough approximation of where the ball was fielded (hc_x, hc_y)
The first three items on this list are called the stringer information. The hc_x and hc_y coordinates are very rough estimates, and are not especially reliable (although they are better than nothing when you’re left with no other choice).

Last year, Jeff Zimmerman developed a method in which he found the average launch angle and exit velocity for batted balls fielded by each position. So, for example, a pop-up to the second baseman versus a fly ball to the right fielder, etc. In this way he used all three aspects of the stringer information to estimate the batted ball quality.
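For anyone who wants to play with the group-average idea described above, here is a minimal sketch. It is not Zimmerman's actual code: the field names (`bb_type`, `result`, `fielder_pos`, `exit_velo`, `launch_angle`) and the numbers are made up for illustration, and it simply averages the tracked balls that share the same stringer information.

```python
from collections import defaultdict
from statistics import mean


def fit_group_means(tracked_balls):
    """Average (exit velocity, launch angle) for each stringer-information group.

    Field names are hypothetical, not the real Baseball Savant column names.
    """
    groups = defaultdict(list)
    for ball in tracked_balls:
        key = (ball["bb_type"], ball["result"], ball["fielder_pos"])
        groups[key].append((ball["exit_velo"], ball["launch_angle"]))
    return {
        key: (mean(ev for ev, _ in vals), mean(la for _, la in vals))
        for key, vals in groups.items()
    }


def estimate(ball, group_means):
    """Return the group average for an untracked ball, or None for an unseen group."""
    key = (ball["bb_type"], ball["result"], ball["fielder_pos"])
    return group_means.get(key)


# Made-up tracked balls, for illustration only.
tracked = [
    {"bb_type": "popup", "result": "out", "fielder_pos": 4,
     "exit_velo": 70.0, "launch_angle": 65.0},
    {"bb_type": "popup", "result": "out", "fielder_pos": 4,
     "exit_velo": 74.0, "launch_angle": 69.0},
    {"bb_type": "fly_ball", "result": "out", "fielder_pos": 9,
     "exit_velo": 92.0, "launch_angle": 30.0},
]

group_means = fit_group_means(tracked)
untracked = {"bb_type": "popup", "result": "out", "fielder_pos": 4}
popup_estimate = estimate(untracked, group_means)
unseen_estimate = estimate(
    {"bb_type": "line_drive", "result": "single", "fielder_pos": 7}, group_means
)
```

Every untracked ball in a group gets the same pair of values, which is why estimates like this are more useful for averages than for individual batted balls.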

I was working on a system of grouping balls using the hc_x and hc_y coordinates along with batted ball type. Before I had a chance to finish this project, MLB announced it would be filling in the missing data on its own. Since then, MLB has retroactively filled in data for 2015 and 2016 and provided data for the 2017 season. Once MLB implemented its solution, accompanied by Tom Tango’s article, I shifted my focus to other aspects of the game. The missing data problem went out of sight, out of mind. But I believe the missing data is still an issue that needs to be addressed.
Unfortunately, batted ball profiles shift over time, while the No Nulls solution appears to stay constant. That mismatch may introduce bias into the imputed estimates.

When people are using Statcast data for visualization or analysis, it may be a good idea to evaluate whether the data are reliable; it appears that averages should be OK, but individual data points may be unreliable and may need to be thrown out.
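One possible way to screen individual data points, sketched below, rests on an assumption of mine that the quoted text does not confirm: that imputed balls show up as the exact same (exit velocity, launch angle) pair repeated far more often than real measurements would. The field names, the threshold `min_repeats`, and the numbers are all hypothetical.

```python
from collections import Counter


def flag_repeated_pairs(balls, min_repeats=3):
    """Flag balls whose exact (exit_velo, launch_angle) pair occurs at least min_repeats times.

    Heuristic only: assumes imputed values repeat exactly; the threshold is arbitrary.
    """
    pair_counts = Counter((b["exit_velo"], b["launch_angle"]) for b in balls)
    return [
        pair_counts[(b["exit_velo"], b["launch_angle"])] >= min_repeats
        for b in balls
    ]


# Made-up batted balls, for illustration only.
balls = [
    {"exit_velo": 80.0, "launch_angle": -21.0},
    {"exit_velo": 80.0, "launch_angle": -21.0},
    {"exit_velo": 80.0, "launch_angle": -21.0},
    {"exit_velo": 101.3, "launch_angle": 24.1},
    {"exit_velo": 95.6, "launch_angle": 11.8},
]

suspect = flag_repeated_pairs(balls)
measured_only = [b for b, s in zip(balls, suspect) if not s]
```

On a full season the threshold would need to be much higher than 3, since genuine measurements can also collide at one-decimal precision.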

When you are looking at major league average launch angle and exit velocity, you should use the No Nulls solution put forward by MLB. You should understand that these numbers are estimates, and even as estimates they appear to have a minor flaw. The league-average launch angle probably will not be off by much, but exit velocity could be off by as much as 1 mph.

The league averages aren’t a big concern, though. Nor are the player averages. Rather, you must be careful when you examine the league-wide results when bucketing balls based upon their launch angle, exit velocity, or batted ball type, particularly balls hit between -30 and -20 degrees or above 60 degrees. Bucketing these batted balls will subject you to the double whammy of their being artificially inflated in both frequency and exit velocity due to the No Nulls solution.
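To see how much a bucket is affected, one rough check is to compare each launch-angle bucket's share of batted balls with and without the imputed rows. The sketch below assumes you already have a hypothetical `imputed` flag on each ball (Baseball Savant does not necessarily provide one under that name), uses 10-degree buckets, and runs on made-up numbers.

```python
from collections import Counter


def angle_bucket(launch_angle, width=10):
    """Lower edge of the width-degree bucket containing launch_angle."""
    return int(launch_angle // width) * width


def bucket_shares(balls):
    """Fraction of batted balls that fall in each launch-angle bucket."""
    counts = Counter(angle_bucket(b["launch_angle"]) for b in balls)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


# Made-up batted balls; "imputed" is a hypothetical flag.
balls = [
    {"launch_angle": -25.0, "imputed": True},
    {"launch_angle": -25.0, "imputed": True},
    {"launch_angle": -22.0, "imputed": False},
    {"launch_angle": 12.0, "imputed": False},
    {"launch_angle": 15.0, "imputed": False},
    {"launch_angle": 28.0, "imputed": False},
]

all_shares = bucket_shares(balls)
measured_shares = bucket_shares([b for b in balls if not b["imputed"]])
```

In this toy data the -30 to -20 bucket holds half of all balls but only a quarter of the measured ones, which is the kind of frequency inflation the article warns about. Dropping imputed rows has its own bias, since the missing balls are not a random sample, so treat the comparison as a diagnostic rather than a correction.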
Thoughts and discussions?

EDIT: Mods, if there is a thread for this, feel free to merge.
 

charlieoscar

Member
Sep 28, 2014
1,339
Now all you need to do is convince the masses that what they take as gospel isn't necessarily so.

As an aside to this, when PITCHf/x was first introduced, they had someone sitting in front of a screen showing the batter at the plate, and he would mark what he thought was the bottom of the strike zone on the screen. This was fed into a database so that in future at-bats the bottom line could be added automatically. This did not allow for operator error or for batters who changed their stance depending on the pitcher.