You often hear a lot about Spring Training and early-season performances, and people often wonder aloud whether these performances are “fact or fiction.” Baseball has a (sometimes excruciatingly) long season, and pundits need some talking points during April, when the races don’t yet matter. As a result, players get discussed in terms of whether they can maintain their level of performance over the course of the season, based on two or three weeks’ worth of good or bad play.
The truth of the matter is that the answer to the “fact or fiction” question is always fiction.
This question gets to the heart of an important concept in sabermetrics, one that shows up in anything and everything baseball. The concept in question is called regression to the mean. Regression to the mean is actually a statistical phenomenon, described here in short by Wikipedia:
In statistics, regression toward the mean refers to the phenomenon that a variable that is extreme on its first measurement will tend to be closer to the center of the distribution on a later measurement. To avoid making wrong inferences, the possibility of regression toward the mean must be considered when designing experiments and interpreting experimental, survey, and other empirical data in the physical, life, behavioral and social sciences.
Essentially, whenever you observe an extreme result, the next observation is more likely than not to be closer to the mean (or average) of the population distribution. This means that, as a rule, everything and everybody regresses.
Believe it or not, all baseball fans regress players to the mean in their minds; you are using regression to the mean even if you don’t know it. Let’s illustrate this concept with an example.
Here are four players and their batting averages so far during the 2010 season:
Player A: .325
Player B: .324
Player C: .324
Player D: .323
These batting averages (BA) were all compiled in about 80-90 plate appearances (PA). Now, just knowing this information, what do you think these players will hit in their next 90 PA?
That’s a pretty daunting thing to try to predict. However, let me help you: over the last five years, the National League average (these are all NL players) has been around .260, and the AL average has been around .267. Call the major league average, without pitchers, around .263. Now what do you think?
Well, you do have some information about the hitters, but you have much more information about the population distribution of those hitters. We don’t have any way of differentiating between them, so we assume they are all in the same population (major league hitters, for example). Based on this, we can use regression and say that we’d expect these hitters to hit around .265 in their next 90 PA (there is a rigorous way to do the calculations, but I’m just guessing as an illustration).
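For the curious, here is a minimal sketch of what such a calculation might look like. It treats regression as adding a fixed number of league-average plate appearances to each hitter's sample. The function name regress_to_mean and the ballast of 1,000 PA are my own assumptions for illustration, not estimated values, and the code uses PA as the denominator of batting average, which is a simplification (batting average is really computed over at-bats).

```python
def regress_to_mean(observed_avg, pa, population_mean, ballast_pa):
    """Shrink an observed rate toward the population mean.

    Equivalent to padding the observed sample with `ballast_pa`
    plate appearances of exactly league-average performance.
    """
    return (observed_avg * pa + population_mean * ballast_pa) / (pa + ballast_pa)


LEAGUE_MEAN = 0.263  # major league average without pitchers, from the article
BALLAST_PA = 1000    # ASSUMPTION: illustrative value only, not an estimate
SAMPLE_PA = 90       # roughly the sample size quoted in the article

early_season = {"A": 0.325, "B": 0.324, "C": 0.324, "D": 0.323}

estimates = {
    name: regress_to_mean(avg, SAMPLE_PA, LEAGUE_MEAN, BALLAST_PA)
    for name, avg in early_season.items()
}

for name, est in estimates.items():
    print(f"Player {name}: observed {early_season[name]:.3f}, regressed {est:.3f}")
```

A larger ballast pulls the estimate harder toward .263, and a smaller one trusts the 90 PA more. Whatever value you pick, the estimate lands between the league mean and the observed average, much nearer the mean for a sample this small.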
What if I gave you more information? Here are those players’ career BA, alongside their career PA.
Player A: .267, 2379 PA
Player B: .275, 1566 PA
Player C: .323, 113 PA
Player D: .260, 603 PA
You can see that these players have slightly varying batting averages and widely varying playing time. That playing time gives us more of an idea of their true talent than what we knew from their first 90 PA. We know the most about Player A, which means we would regress his numbers the least. We know the least about Player C, which means the league mean would make up most of our guess about his next 90 PA. The other two guys are in between. Armed with this much information, we can now make better guesses. Player C is probably still around .264-.265, Player A closer to .266, for example.
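To see how sample size changes the amount of regression, here is a rough sketch that applies the same "pad the sample with league-average PA" idea to the career lines above. The 1,000 PA ballast is again an assumption I picked for illustration, so the exact numbers it prints should not be expected to match my eyeballed guesses. The thing to look at is how much weight each player's own record receives.

```python
LEAGUE_MEAN = 0.263  # from the article
BALLAST_PA = 1000    # ASSUMPTION: illustrative value only, not an estimate

# Career batting average and career PA, as listed in the article.
career = {
    "A": (0.267, 2379),
    "B": (0.275, 1566),
    "C": (0.323, 113),
    "D": (0.260, 603),
}

weights = {}    # share of the estimate that comes from the player's own record
estimates = {}  # regressed batting average

for name, (avg, pa) in career.items():
    weight = pa / (pa + BALLAST_PA)
    weights[name] = weight
    # Weighted blend of the player's own record and the league mean.
    estimates[name] = weight * avg + (1 - weight) * LEAGUE_MEAN

for name in career:
    print(
        f"Player {name}: own-record weight {weights[name]:.2f}, "
        f"regressed estimate {estimates[name]:.3f}"
    )
```

Under this assumption Player A's long track record carries most of his estimate, while Player C's 113 PA carry very little, so his estimate sits close to the league mean despite the .323 career line.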
Now, what if we included all of each player’s history, along with other descriptive statistics that would help us determine batting average, like Batting Average on Balls in Play (BABIP), batted ball rates, and home run rates? What if we also weighted recent years more heavily to reflect age and changes in true talent that career numbers can’t take into account? Well, then we would have a projection system. The ZiPS projection system does just that with its in-season projections, available on FanGraphs. It combines historical data about a player with regression to a mean tailored to that player’s skill to determine what we would expect from that player going forward. Here are the ZiPS batting average projections for those players:
Player A (Jayson Werth): .280
Player B (Ryan Doumit): .283
Player C (David Freese): .273
Player D (Colby Rasmus): .266
Taking all the historical data, weighting it properly, and regressing to the mean yielded these guesses. Note that each of them is still closer to the .263-ish league average than to the player’s 90 PA batting average from this season so far.
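If you want to check that claim yourself, a few lines of arithmetic will do it. This sketch only uses numbers already listed in this article (the early-season averages, the ZiPS projections, and the roughly .263 league average); the variable names are my own.

```python
LEAGUE_MEAN = 0.263  # approximate league average from the article

# (early-season BA, ZiPS projected BA), both as listed in the article.
players = {
    "Jayson Werth": (0.325, 0.280),
    "Ryan Doumit": (0.324, 0.283),
    "David Freese": (0.324, 0.273),
    "Colby Rasmus": (0.323, 0.266),
}

closer_to_mean = {}
for name, (early, projected) in players.items():
    gap_to_mean = abs(projected - LEAGUE_MEAN)
    gap_to_early = abs(projected - early)
    closer_to_mean[name] = gap_to_mean < gap_to_early
    print(
        f"{name}: {gap_to_mean:.3f} from league mean, "
        f"{gap_to_early:.3f} from early-season BA"
    )
```

Every projection comes out nearer to the league average than to the hot start.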
Does this mean that players cannot improve? Of course not. Does this mean it is absolutely true that a player will always perform closer to the mean the next time out? Definitely not. Baseball is a game of chance and skill, and that chance (or luck or random variation, pick your term) will always push some players up and others down. Furthermore, remember that regression to the mean implies that you know what mean to regress to. Players can improve and change the mean to which they must be regressed. I don’t think anyone here thinks we should regress Albert Pujols’ power numbers to the same mean as David Eckstein’s. It just means that our best guess for the future is still going to be closer to that mean than to anywhere else. Keep that in mind the next time you are waiting for a player with questionable peripherals to “finally break out.” Regression to the mean is a powerful entity in the world of baseball.