A PitchFX Primer for SoSH
Over the last few weeks, I've started doing some analysis with PitchFX data. After a few posts, I was contacted by a few people (namely URI) and asked to write a guide to put on the mainboard that could "explain it all to a 5 year old". I'm not sure I can manage that, but I'll try. I'm going to use the Wiki for now, since it is easier to edit and update; I'll transfer the stuff over to the mainboard if that's preferred.
About This Guide
This guide is not written for the physicist, the statistician, or the hardcore sabermetrician. This guide is written for everyday fans who like baseball and like statistics. Because it's posted on the Wiki, you are free to update, correct, change, or amend anything, but I would appreciate it if you would follow a few rules:
No complex formulas or equations should be posted without real-life examples or equations that people can use in Excel. Remember that we are writing for an audience of baseball fans, not an audience of math majors.
Any complex stuff should have pictures, with labeled axes and a caption. If you have something important to say, saying it with a good figure is 10000 times better than saying it with text.
Not everything I (or anyone else) will add is correct. But don't change stuff that you're only "sure" about. If the guide loses its ability to provide quality information, then it stops being useful. If you have questions, use the talk page.
Baseball is fun. It should stay fun and remain fun. This guide is written for people who like baseball, not written for people who like reading dictionaries.
There are a number of other good PitchFX guides on the web. Some of them are far more detailed than this one, but geared at a different audience.
A list of PitchFX Guides:
- Alan Nathan: http://webusers.npl.illinois.edu/~a-nathan/pob/pitchtracker.html
- Mike Fast: http://fastballs.wordpress.com/2007/08/02/glossary-of-the-gameday-pitch-fields/
- Friar Watch: http://www.friarwatch.com/2007/04/23/how-to-use-mlb-gameday-data/
- Joe Sheehan: http://baseballanalysts.com/archives/2007/03/digging_through.php
Before this page is edited into Oblivion, some thanks and proper respect should be paid to (in totally random order): EricVan, SoxScout, Mike Fast, Josh Kalk, Alan Nathan, Alex (Sprowl), Frisbetarian, and anyone else on SoSH who got me interested in this stuff.
The Basics: Starting at the Data
How do we get from a pitcher's performance to the beautiful (read: mind-boggling) charts? I find it easiest to explain this by actually walking through an example, so we are going to be using Clay Buchholz's 4/11/08 start against the Yankees.
The raw data for this game is available here [] if you want to follow along at home with Excel.
Alternatively, you can use my PitchFX Tool [] to generate the graphs for yourself.
If you need a glossary to follow along with what the different parameters mean, I strongly encourage you to use Nathan's guide, mentioned above. Each parameter is clearly broken down into what it represents on the field.
Simple First Steps: Pitch Speed
Clay is a good guy to choose for our first analysis because everyone that follows the Red Sox knows that he's got devastating offspeed stuff. So, our first basic job can be to identify some offspeed pitches.
You can take your data, sort it by the timestamp variable, and then graph the pitch speed variable in a line & scatter plot:
The plot above is pretty simple. It just shows the speed (on the Y axis) of each pitch that Clay threw all night, in the order he threw them. It's pretty obvious that Clay uses his offspeed stuff quite a bit. Over the course of the game, you might see a very slight downward trend in velocity, suggesting that maybe he's getting tired. But it really isn't much.
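If you'd rather put a number on that "very slight downward trend" than squint at the graph, here is a rough sketch in Python. The pitch rows are invented for illustration (they are not Clay's actual data), and the 88 mph cutoff used to pick out the fastballs is just my assumption so that the offspeed pitches don't swamp the trend.

```python
# Hypothetical rows: (timestamp, pitch speed in mph). Invented values,
# not taken from the Buchholz start.
pitches = [
    (1005, 78.4),
    (1001, 93.1),
    (1003, 92.6),
    (1002, 79.0),
    (1004, 92.2),
    (1006, 91.8),
]

def speeds_in_order(rows):
    """Return pitch speeds sorted by timestamp (the order they were thrown)."""
    return [speed for _, speed in sorted(rows)]

def trend_per_pitch(speeds):
    """Least-squares slope of speed against pitch number (mph per pitch)."""
    n = len(speeds)
    mean_x = (n - 1) / 2
    mean_y = sum(speeds) / n
    num = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(speeds))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

ordered = speeds_in_order(pitches)

# Only fit the harder pitches; the 88 mph cutoff is an assumption.
fastballs = [s for s in ordered if s > 88.0]
slope = trend_per_pitch(fastballs)

print(ordered)
print(round(slope, 3))
```

A negative slope means the fastball is losing a little speed as the night goes on. With real data you would want a lot more than four pitches before reading anything into it.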
There are other, trickier things that we can do with speed, but we will leave those for later.
Location, Location, Location
Let's take everyone's most obvious question next: where did the pitches actually end up?
On its own, that plot isn't all that fancy. It just shows the location for each of those pitches. But what were they? What happened to them? What was the result? Luckily, PitchFX saves that information for us:
We can also get a more fine-grained location analysis:
So, basically: we can know where pitches are, what they were called, and what the batter did with them.
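To make "where it was, what it was called" concrete, here is a small Python sketch that checks each pitch against the strike zone and tallies the results. The field names (px, pz, sz_top, sz_bot, des) follow the Gameday fields as I understand them from the glossaries linked above; the sample rows are made up, and treating the zone as exactly the 17-inch width of the plate (ignoring the width of the ball) is a simplifying assumption.

```python
PLATE_HALF_WIDTH_FT = 17 / 12 / 2  # home plate is 17 inches wide

def in_strike_zone(px, pz, sz_top, sz_bot):
    """True if the centre of the ball crossed the plate inside the zone.

    px: horizontal location at the plate in feet (0 = middle of the plate)
    pz: height at the plate in feet
    sz_top, sz_bot: top and bottom of this batter's zone in feet
    """
    return abs(px) <= PLATE_HALF_WIDTH_FT and sz_bot <= pz <= sz_top

def tally(pitches):
    """Count pitches by (in/out of zone, what happened)."""
    counts = {}
    for p in pitches:
        inside = in_strike_zone(p["px"], p["pz"], p["sz_top"], p["sz_bot"])
        key = ("in zone" if inside else "out of zone", p["des"])
        counts[key] = counts.get(key, 0) + 1
    return counts

# Invented sample rows, for illustration only.
sample = [
    {"px": 0.1, "pz": 2.5, "sz_top": 3.4, "sz_bot": 1.6, "des": "Called Strike"},
    {"px": 1.2, "pz": 2.0, "sz_top": 3.4, "sz_bot": 1.6, "des": "Ball"},
    {"px": -0.3, "pz": 1.0, "sz_top": 3.4, "sz_bot": 1.6, "des": "Swinging Strike"},
    {"px": 0.5, "pz": 3.0, "sz_top": 3.4, "sz_bot": 1.6, "des": "Called Strike"},
]

counts = tally(sample)
for key, n in sorted(counts.items()):
    print(key, n)
```

The same tally is easy to do in Excel with a helper column for "in zone" and a pivot table on the result.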
Another common graph for PitchFXers is "release point" - where did the ball start from?
For Buchholz, who doesn't really use different arm angles or anything too fancy, here it is:
This is just showing where the ball is coming from. It's pretty easy to see that it's probably not such a diagnostic feature. He doesn't seem to be tipping any pitches by dropping his arm or changing the angle, but we will come back to this after we run pitch classification on the data.
It's All About the Break
So, you're looking at the above graphs, and you're thinking to yourself, "Man, this shit is fantastic. I can..." and then you stop, because you have this sinking feeling in your heart: I have no idea what any of those pitches really were.
So, that's the big game in PitchFX. We think we have some good idea of what Buchholz throws, but we can't tell that just by looking at where the pitch ended up. I mean, some curves are high, some are low. Can't tell a curve by its final location. Some fastballs are low, and some are high (if you're Kyle Farnsworth, some are at people's heads). The moral of the story here is that basically, location is worth nothing in telling us what the guy actually threw.
If you stop and think about it for a minute, what really separates pitches is only a small subset of variables. This section (about Break) will talk about 2 of them: how far a ball breaks in the horizontal and vertical axes. This is meaningful because it is part of what pitchers try to do to throw different pitches.
First, we should start off with a reference guide. If you're a righty (and if you're Clay, that's true), this is a very rough estimate of how pitches break down by Horizontal and Vertical break:
This image shows the different pitch types, if the imaginary X axis here was "Horizontal Break" (from left to right) and the imaginary Y axis here was "Vertical Break" (top to bottom). It is Idealized Unattainable Data. You will never see this. Unless you are lucky.
Here is Clay's data from that start we've been talking about:
Unless you are a Vulcan mind reader or tell fortunes in your spare time, you can't really divine too much from that graph. But some things are easy to see. For example, his curveball is easy to pick out. Compare the chart for Clay and the chart for Idealized Unattainable Data, and you can see that there's a cluster of pitches in the bottom right that break down and away from right-handed hitters, and there's a circle on the idealized chart that says that ought to be his curve. The rest is kind of blech, but at least we've identified one pitch. Let's move on to try to figure out what the rest of that mess is.
As an aside, this is where most people (I think) start to get frustrated. The graph no longer represents the plate, but rather some variables pulled out of each pitch. There's no speed represented on the graph. There's no obvious delineation between the pitch types. It's hard to visualize this stuff.
But consider what pitches do. In order to be a successful pitcher, you have to move the ball around in the zone. Throwing darts only pays the bills if you're Mike Timlin, and that's only because Francona will never fucking quit you. This way of visualizing the data looks just at the movement of the pitches.
Last, because everyone wants to know: the vertical break in the positive direction does not mean the ball is going up. The baseball never rises. This is the first day of physics. You can't throw a baseball in such a way that it goes up. If you can, you should head straight to my lab and we will be rich together. The positive vertical break represents pitches that fall less than they would due to gravity because of their spin. So, they aren't going up, just not going down quite as fast. This, coupled with poor speed perception, probably accounts for the feeling that the ball is rising.
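If you want to see why a "rising" fastball still falls, here is a back-of-the-envelope calculation in Python. The 0.4 second flight time and the 10 inches of positive vertical break are round numbers I picked for illustration, and the sketch assumes the break is measured over the whole flight, which is a simplification.

```python
G_FT_PER_S2 = 32.174  # gravitational acceleration in ft/s^2

def gravity_drop_inches(flight_time_s):
    """How far gravity alone pulls the ball down over the flight, in inches."""
    return 0.5 * G_FT_PER_S2 * flight_time_s ** 2 * 12

def net_drop_inches(flight_time_s, vertical_break_in):
    """Drop below the ball's initial line of flight once spin is included.

    A positive vertical break cancels part of the gravity drop; it does
    not turn the drop into a rise unless it exceeds the gravity drop.
    """
    return gravity_drop_inches(flight_time_s) - vertical_break_in

drop = gravity_drop_inches(0.4)   # gravity only
net = net_drop_inches(0.4, 10.0)  # gravity minus 10 inches of "rise"

print(round(drop, 2), round(net, 2))
```

With these numbers gravity pulls the ball down about 31 inches, so a fastball with 10 inches of positive vertical break still ends up about 21 inches below where it was initially headed. It falls less, but it falls.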
Here is a slightly different way of visualizing the data: by plotting each of the types of break against speed.
This just shows horizontal break depending on how fast the ball was going. You can see the fastball is clearly separate but the other two pitches are kind of a mess, and we have no idea which pitches might be the few sliders that Varitek remembered to call.
The vertical break graph does a *much* better job for Buchholz - because most of his pitches are differentiated in how much they break in the vertical axis. From this, we can identify 3 pitches easily. But for different pitchers, it is difficult to know whether to sort by vertical or horizontal break. Often it works best to derive both, as the PitchFX tool does, and then use your eyes to evaluate which produces the clearest clusters.
Spin is fucking brilliant. Spin direction attempts to calculate the direction that the pitcher spun the ball in order to get it to move the way it did. If we know what kind of spin the guy put on the ball and how hard he threw it, we can get a good idea of what pitch he was actually throwing. Pitches should cluster very nicely if we look at the data this way.
Spin is actually a complicated formula. Shamelessly stolen from Mike Fast's webpage, here it is:
Spin in Degrees is:
arctan((az + 32.174) / ax) × 180/π
then add 270 degrees if ax<0 or 90 degrees if ax>0.
This is easy to do in Excel. The formula is:
[code] = ATAN(([az]+32.174)/[ax]) * 180/PI() [/code]
and then do the correction above.
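For anyone working outside Excel, here is the same calculation sketched in Python, with the 270/90 correction folded in. The function name is mine, and returning None when ax is exactly zero is my own assumption, since the formula divides by ax and the rule above doesn't cover that case.

```python
import math

GRAVITY_FT_PER_S2 = 32.174  # same constant as in the Excel formula

def spin_direction_deg(ax, az):
    """Spin direction in degrees from the ax and az acceleration fields.

    Mirrors the spreadsheet steps: arctangent of (az + 32.174) / ax,
    converted to degrees, then +270 if ax < 0 or +90 if ax > 0.
    """
    if ax == 0:
        return None  # undefined in the formula; this handling is an assumption
    angle = math.degrees(math.atan((az + GRAVITY_FT_PER_S2) / ax))
    return angle + 270.0 if ax < 0 else angle + 90.0

# Invented acceleration values, for illustration only.
print(spin_direction_deg(-10.0, -20.0))
print(spin_direction_deg(10.0, -20.0))
```

With these made-up inputs the first pitch comes out at roughly 219 degrees and the second at roughly 141 degrees.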
Let's skip the math and just move on to the results for a few minutes.
Wow! Compare that to the slop we got when we looked at Horizontal and Vertical break, and no guessing as to which might be more important! Three nice, easily identifiable clusters - maybe a few outliers here or there. The top right pitch is the fastball - it's the fastest of the three clusters. The one right below it is likely his changeup - very similar action on the ball thrown with much lower velocity. And the bottom left one, far away from his fastball again, is that nice, hard, biting curve, with a very different spin on it than his fastball is thrown with.
Pitch Classification for the Rest of Us
Pitch Classification is a nasty problem. It can be approached in a bunch of different ways, none of which are wholly adequate. Part of the problem is that pitchers often do not throw what they expected to throw: there is variability in the way they throw individual pitches, despite trying to throw with the same mechanics each time. A second problem is that all of the "obvious" ways to look at what pitchers throw don't really reveal much - we saw that above when we tried to look at location or speed.
This section will examine three approaches to pitch classification: "eyeballing it" based on estimates about the dimensions you're investigating, cluster analysis based on knowledge of the pitcher's repertoire, and neural network approaches.
(author note: this section is likely to be expanded over the next few weeks)
Based on the picture above, what does Clay throw?
Let's see. Based on the idealized break graph above, Clay throws a 4 Seamer, a Cutter, a Slider or one really badly thrown ball, and a curve. Wait, what? Clay Buchholz throws a cutter? And what happened to his changeup?
Then we take the same data and plot it in terms of spin:
Hm. Now he's got a pitch that's right underneath his fastball in terms of spin, and it's thrown at about the right speed for a changeup. But what happened to his cutter?
The bottom line is that this method is only partially effective. We know that Clay Buchholz throws a changeup - so we find a cluster of dots and we make our best guess. But we're not really sure which clusters really go together. Often, we judge by which cluster matches the anecdotal information concerning what a pitcher throws. If the anecdotes are wrong, or intentionally misleading (for example, it may be to a pitcher's advantage to be credited with six, rather than three, pitches), then the observer is left chasing a ghost.
How can we tell which clusters are really distinct? One way is to do real statistics.
"Real Statistics" is kind of a misnomer for what's really exploratory analysis. For the most part, the best way to tell which clusters are really distinct clusters is to do a... Cluster analysis (clever name, right?).
You'll need a real statistics program to do the rest of this stuff. I recommend SPSS, which, if you're a University student, is likely available at all your computer labs in a folder you've probably never opened. If you need access to a statistical package, you can get a trial version of JMP (another stats package) online. Cluster analysis is usually found in a "cluster and classify" section of your statistics software.
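If you don't have SPSS or JMP handy, here is a bare-bones sketch of one common kind of cluster analysis (k-means) in plain Python, just to show the idea. This is not what those packages do under the hood, the (speed, spin angle) values are invented, and both the choice of three clusters and the way the starting centres are picked are my assumptions.

```python
import math

def standardize(points):
    """Rescale each column to mean 0, sd 1 so mph and degrees count equally."""
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, sds)) for p in points]

def kmeans(points, k, iterations=50):
    """Return a cluster label (0..k-1) for each point."""
    # Deterministic start (an assumption of this sketch): take the first
    # point, then keep adding the point farthest from the chosen centres.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centres)))
    labels = []
    for _ in range(iterations):
        # Assign every pitch to its nearest centre.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centres[j]))
                      for p in points]
        if new_labels == labels:
            break  # nothing moved, so we are done
        labels = new_labels
        # Move each centre to the average of its pitches.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# Invented (speed in mph, spin direction in degrees) rows.
pitches = [
    (93.0, 200), (92.5, 205), (93.5, 198),  # fastball-like
    (80.0, 210), (79.5, 215), (81.0, 208),  # changeup-like
    (77.0, 30), (76.5, 35), (78.0, 25),     # curve-like
]

labels = kmeans(standardize(pitches), k=3)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which pitches end up sharing a number. You still have to tell the program how many pitches you think the guy throws, which is exactly where the knowledge of the pitcher's repertoire comes in.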
Neural Network Approaches
Part of the problem faced by a piece of software like Gameday is that it needs to classify pitches on a moment-by-moment basis and also needs to do it for every pitch for every pitcher on every team in every game.
A neural network is a classification system that finds some function to categorize a number of inputs.
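To give a feel for what "finding some function" means, here is about the smallest possible example in Python: a single artificial neuron (a perceptron) that learns to separate fastballs from offspeed pitches using speed and vertical break. This is a toy of my own, not how Gameday does it; the training pitches are invented and the scaling constants are arbitrary choices.

```python
def scale(speed_mph, vertical_break_in):
    """Put both inputs on a similar scale (the constants are arbitrary)."""
    return ((speed_mph - 85.0) / 10.0, vertical_break_in / 10.0)

def train_perceptron(samples, epochs=100):
    """samples: list of ((x1, x2), label), label 1 = fastball, 0 = offspeed."""
    w1 = w2 = bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), label in samples:
            guess = 1 if w1 * x1 + w2 * x2 + bias > 0 else 0
            error = label - guess
            if error:
                # Nudge the weights toward the right answer.
                w1 += error * x1
                w2 += error * x2
                bias += error
                mistakes += 1
        if mistakes == 0:
            break  # every training pitch is classified correctly
    return w1, w2, bias

def classify(weights, speed_mph, vertical_break_in):
    w1, w2, bias = weights
    x1, x2 = scale(speed_mph, vertical_break_in)
    return "fastball" if w1 * x1 + w2 * x2 + bias > 0 else "offspeed"

# Invented training pitches: (speed in mph, vertical break in inches, label).
raw = [
    (93, 9, 1), (92, 10, 1), (94, 8, 1),
    (78, -6, 0), (80, 3, 0), (77, -8, 0),
]
training = [(scale(speed, brk), label) for speed, brk, label in raw]
weights = train_perceptron(training)

print(classify(weights, 91, 9))
print(classify(weights, 79, -5))
```

The "function" it finds is just a weighted sum of the inputs with a cutoff. A real network stacks many of these together and handles more than two pitch types, but the learn-from-labeled-examples idea is the same.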