I don't like to speak for others, but I think Iayork and Crystalline are referring to the use of CIs in a way that many are not familiar with.
Confidence intervals are used in many different ways. Simply put, a confidence interval defines the range that represents our confidence that the value of a metric is true: we are confident that X is between Y and Z. For example, say we are interested in the career OBP of MLB players. We can measure career OBP for a large sample of players and use the distribution of career OBPs to construct confidence intervals for the mean. A 95% confidence interval reflects the range of OBP where we are 95% confident that the mean OBP is between ___ and ___. Confidence intervals here describe the population of MLB players. If, in a small sample, a given player falls outside this range, and if the metric is distributed normally across the population, it is likely that the player's future samples will fall within this range, i.e., closer to the mean. This is known as regression to the mean.
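To make the population case concrete, here is a minimal sketch in Python using made-up OBP numbers (the 0.320 average and 0.030 spread are illustrative assumptions, not real league data) and the standard normal-approximation interval for a mean:

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical sample of career OBPs for 500 MLB players
# (drawn from a normal distribution for illustration; real
# data would come from an actual database of players).
obps = [random.gauss(0.320, 0.030) for _ in range(500)]

mean = statistics.mean(obps)
sem = statistics.stdev(obps) / math.sqrt(len(obps))  # standard error of the mean

# 95% CI for the population mean OBP, normal approximation (z = 1.96)
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
print(f"Mean OBP: {mean:.3f}, 95% CI: ({lower:.3f}, {upper:.3f})")
```

Note how narrow the interval is with 500 players: a CI for the *mean* tightens as the sample grows, which is a different animal from the spread of individual players around that mean.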
You can also construct confidence intervals for an individual player. For example, if we want to know whether an individual player is likely to post a given OBP in a season, we can use the distribution of his season OBPs to construct confidence intervals for that player. Here, a 95% confidence interval reflects the range of OBP within which we are 95% confident that any full-season OBP for the given player will fall. In other words, confidence intervals here describe the individual MLB player. It is important to know whether the sample size used to calculate OBP is reliable; unreliable sample sizes reduce our ability to interpret these confidence intervals. To answer that question, one can perform (and people have performed) reliability analyses to determine the minimum number of plate appearances needed to calculate the metric.
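A resampling sketch of the individual-player case, assuming a hypothetical player whose true rate of reaching base is 0.350 over 600-PA seasons (both numbers invented for illustration). We simulate many full seasons and take the middle 95% of the resulting season OBPs:

```python
import random

random.seed(2)

# Hypothetical player: true on-base rate 0.350, seasons of 600 PA.
TRUE_RATE, PA_PER_SEASON, N_RESAMPLES = 0.350, 600, 10_000

# Simulate many full seasons and record each season's observed OBP.
season_obps = sorted(
    sum(random.random() < TRUE_RATE for _ in range(PA_PER_SEASON)) / PA_PER_SEASON
    for _ in range(N_RESAMPLES)
)

# 95% interval: the 2.5th and 97.5th percentiles of the resampled seasons.
lower = season_obps[int(0.025 * N_RESAMPLES)]
upper = season_obps[int(0.975 * N_RESAMPLES)]
print(f"95% of this player's full seasons fall between {lower:.3f} and {upper:.3f}")
```

Even with a full 600-PA season, the interval spans several dozen points of OBP, which is why season-to-season swings for one player are much larger than the uncertainty in a league-wide mean.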
What Iayork and Crystalline are referring to is constructing confidence intervals for the metric itself. WAR is not a measured metric like OBP. With OBP, you count the number of plate appearances and the number of non-outs and take the ratio. With WAR, you are using a regression model (presumably, or some other modelling procedure if not regression) to infer the relationship between events on the field and runs, and between runs and wins. All models contain some degree of error, and you can use this error to construct confidence intervals for the statistic itself. Here, a 95% confidence interval means that for a given WAR we are 95% confident that the true number of wins is between ___ and ___. In other words, confidence intervals here reflect the likelihood that the metric means what we think it means. If this range is really big (e.g., a full win), then our ability to interpret what the metric means is limited. As an aside, this sounds like Eric Van's biggest problem; he assumed that his regression models were perfect instead of accounting for the error present in them.
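A toy version of the model-error case, with everything invented for illustration: fake team seasons where wins are roughly linear in run differential, an ordinary least-squares fit, and the residual standard error used to put a rough interval around a prediction. (Real WAR models are far more elaborate; this only shows where model-based uncertainty comes from.)

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical team seasons: wins are roughly linear in run differential,
# plus noise the model cannot explain (the source of model error).
run_diff = [random.uniform(-150, 150) for _ in range(120)]
wins = [81 + 0.1 * rd + random.gauss(0, 4) for rd in run_diff]

# Ordinary least-squares fit of wins on run differential.
mx, my = statistics.mean(run_diff), statistics.mean(wins)
slope = sum((x - mx) * (y - my) for x, y in zip(run_diff, wins)) / sum(
    (x - mx) ** 2 for x in run_diff
)
intercept = my - slope * mx

# Residual standard error: how far real outcomes scatter around the model.
residuals = [y - (intercept + slope * x) for x, y in zip(run_diff, wins)]
rse = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))

# Rough 95% interval around a prediction (ignores parameter uncertainty).
pred = intercept + slope * 50
print(f"Predicted wins at +50 run diff: {pred:.1f} +/- {1.96 * rse:.1f}")
```

The point is the last line: even a well-fit model carries a residual error, and honest reporting of a model-derived metric attaches that error to every number it produces.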
There are different methods for constructing confidence intervals, depending on whether you want to measure a population (e.g., using standard deviations on normally distributed data), an individual (e.g., using resampling approaches), or a metric derived from a model (e.g., via regression error). These methods are not always specific to one type of question (e.g., resampling can be used to assess either an individual or a population).
Savin Hillbilly said:
I just went to the
Wikipedia page for "Confidence interval." Then I went to the
Talk page for "Confidence interval." All I can say after that little adventure is that if the bolded is true, there sure are a lot of stupid people in the world, apparently including some statisticians.
Can you recommend an internet source that explains this concept for non-statisticians in language that is both reasonably comprehensible and technically accurate? 'Cause it sure ain't Wikipedia. I feel like I have a rough intuitive understanding of it, but that feeling could be wildly wrong, and if we're going to start using CI's around here routinely I'd like to know what the hell they mean.
I've given an introductory workshop class on working with distributions. The material is really big, though; if you want me to send it to you, just PM me.
EDIT: It is important to separate the two primary issues of whether WAR/UZR are reliable and whether they are valid. These questions are not dependent on one another: you can have a reliable metric that is invalid, and vice versa.