Bias of an estimator
In statistics, the bias (or bias function) of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased. In statistics, "bias" is an objective property of an estimator, and while not a desired property, it is not pejorative, unlike the ordinary English use of the term "bias".
Bias can also be measured with respect to the median, rather than the mean (expected value), in which case one distinguishes median-unbiasedness from the usual mean-unbiasedness property. Bias is related to consistency in that consistent estimators are convergent and asymptotically unbiased (hence converge to the correct value as the number of data points grows arbitrarily large), though individual estimators in a consistent sequence may be biased (so long as the bias converges to zero); see bias versus consistency.
All else being equal, an unbiased estimator is preferable to a biased estimator, but in practice all else is not equal, and biased estimators are frequently used, generally with small bias. When a biased estimator is used, bounds of the bias are calculated. A biased estimator may be used for various reasons: because an unbiased estimator does not exist without further assumptions about a population or is difficult to compute (as in unbiased estimation of standard deviation); because an estimator is median-unbiased but not mean-unbiased (or the reverse); because a biased estimator gives a lower value of some loss function (particularly mean squared error) compared with unbiased estimators (notably in shrinkage estimators); or because in some cases being unbiased is too strong a condition, and the only unbiased estimators are not useful. Further, mean-unbiasedness is not preserved under non-linear transformations, though median-unbiasedness is (see § Effect of transformations); for example, the sample variance is an unbiased estimator for the population variance, but its square root, the sample standard deviation, is a biased estimator for the population standard deviation. These are all illustrated below.
Definition
Suppose we have a statistical model, parameterized by a real number θ, giving rise to a probability distribution for observed data, P_{θ}(x) = P(x | θ), and a statistic θ̂ which serves as an estimator of θ based on any observed data x. That is, we assume that our data follow some unknown distribution P(x | θ) (where θ is a fixed constant that is part of this distribution, but is unknown), and then we construct some estimator θ̂ that maps observed data to values that we hope are close to θ. The bias of θ̂ relative to θ is defined as

Bias(θ̂, θ) = Bias_{θ}[θ̂] = E_{x|θ}[θ̂] − θ = E_{x|θ}[θ̂ − θ],
where E_{x|θ} denotes expected value over the distribution P(x | θ), i.e. averaging over all possible observations x. The second equation follows since θ is measurable with respect to the conditional distribution P(x | θ).
An estimator is said to be unbiased if its bias is equal to zero for all values of parameter θ.
In a simulation experiment concerning the properties of an estimator, the bias of the estimator may be assessed using the mean signed difference.
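For instance, a minimal Monte Carlo sketch in Python with NumPy (an illustrative choice; the sample size, replication count and normal population below are arbitrary assumptions) estimates the bias of the uncorrected sample variance as the mean signed difference between the estimates and the true parameter:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 10, 100_000        # sample size and number of simulated samples
    sigma2 = 1.0                 # true population variance (assumed known here)

    # Draw many samples and compute the uncorrected sample variance of each.
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    estimates = samples.var(axis=1, ddof=0)    # divides by n (uncorrected)

    # Mean signed difference between estimate and truth approximates the bias.
    print("estimated bias:", (estimates - sigma2).mean())  # theory: -sigma2/n = -0.1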
Examples
Sample variance
The sample variance of a random variable demonstrates two aspects of estimator bias: first, the naive estimator is biased, which can be corrected by a scale factor; second, the unbiased estimator is not optimal in terms of mean squared error (MSE), which can be minimized by using a different scale factor, resulting in a biased estimator with lower MSE than the unbiased estimator. Concretely, the naive estimator sums the squared deviations and divides by n, which is biased. Dividing instead by n − 1 yields an unbiased estimator. Conversely, MSE can be minimized by dividing by a different number (depending on the distribution), but this results in a biased estimator. This number is always larger than n − 1, so this is known as a shrinkage estimator, as it "shrinks" the unbiased estimator towards zero; for the normal distribution the optimal value is n + 1.
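As a concrete illustration, the following sketch (Python with NumPy; n = 10 and the normal population are arbitrary assumptions) compares the three scale factors. Under the theory above, dividing by n is biased low, dividing by n − 1 is unbiased, and dividing by n + 1 gives the smallest MSE for normal data:

    import numpy as np

    rng = np.random.default_rng(1)
    n, reps, sigma2 = 10, 200_000, 1.0
    x = rng.normal(0.0, 1.0, size=(reps, n))

    # Sum of squared deviations from the sample mean, for each simulated sample.
    ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

    for divisor in (n, n - 1, n + 1):
        est = ss / divisor
        bias = est.mean() - sigma2
        mse = ((est - sigma2) ** 2).mean()
        print(f"divide by {divisor:2d}: bias ~ {bias:+.4f}, MSE ~ {mse:.4f}")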
Suppose X_{1}, ..., X_{n} are independent and identically distributed (i.i.d.) random variables with expectation μ and variance σ^{2}. If the sample mean and uncorrected sample variance are defined as

X̄ = (1/n) Σ_{i=1}^{n} X_{i},   S^{2} = (1/n) Σ_{i=1}^{n} (X_{i} − X̄)^{2},
then S^{2} is a biased estimator of σ^{2}, because

E[S^{2}] = E[(1/n) Σ_{i=1}^{n} (X_{i} − X̄)^{2}]
= E[(1/n) Σ_{i=1}^{n} ((X_{i} − μ) − (X̄ − μ))^{2}]
= E[(1/n) Σ_{i=1}^{n} (X_{i} − μ)^{2} − (2/n)(X̄ − μ) Σ_{i=1}^{n} (X_{i} − μ) + (X̄ − μ)^{2}].

To continue, we note that by subtracting μ from both sides of X̄ = (1/n) Σ_{i=1}^{n} X_{i}, we get

X̄ − μ = (1/n) Σ_{i=1}^{n} X_{i} − μ = (1/n) Σ_{i=1}^{n} (X_{i} − μ).

Meaning, (by cross-multiplication) n · (X̄ − μ) = Σ_{i=1}^{n} (X_{i} − μ). Then, the previous becomes:

E[S^{2}] = E[(1/n) Σ_{i=1}^{n} (X_{i} − μ)^{2} − (2/n)(X̄ − μ) · n(X̄ − μ) + (X̄ − μ)^{2}]
= E[(1/n) Σ_{i=1}^{n} (X_{i} − μ)^{2} − (X̄ − μ)^{2}]
= σ^{2} − E[(X̄ − μ)^{2}]
= (1 − 1/n) σ^{2} = ((n − 1)/n) σ^{2} < σ^{2}.
In other words, the expected value of the uncorrected sample variance does not equal the population variance σ^{2}, unless multiplied by a normalization factor. The sample mean, on the other hand, is an unbiased^{[1]} estimator of the population mean μ.
Note that the usual definition of sample variance is s^{2} = (1/(n − 1)) Σ_{i=1}^{n} (X_{i} − X̄)^{2}, and this is an unbiased estimator of the population variance.
This can be seen by noting the following formula, which follows from the Bienaymé formula, for the term in the inequality for the expectation of the uncorrected sample variance above:

E[(X̄ − μ)^{2}] = (1/n) σ^{2}.
Algebraically speaking, E[s^{2}] is unbiased because:

E[s^{2}] = E[(1/(n − 1)) Σ_{i=1}^{n} (X_{i} − X̄)^{2}]
= (1/(n − 1)) E[Σ_{i=1}^{n} ((X_{i} − μ) − (X̄ − μ))^{2}]
= (1/(n − 1)) E[Σ_{i=1}^{n} (X_{i} − μ)^{2} − 2(X̄ − μ) Σ_{i=1}^{n} (X_{i} − μ) + n(X̄ − μ)^{2}].

To continue, like above, we note that when μ is subtracted from both sides of X̄ = (1/n) Σ_{i=1}^{n} X_{i}, we get

X̄ − μ = (1/n) Σ_{i=1}^{n} (X_{i} − μ),

and cross-multiplying yields n · (X̄ − μ) = Σ_{i=1}^{n} (X_{i} − μ). Then we have:

E[s^{2}] = (1/(n − 1)) E[Σ_{i=1}^{n} (X_{i} − μ)^{2} − 2n(X̄ − μ)^{2} + n(X̄ − μ)^{2}]
= (1/(n − 1)) E[Σ_{i=1}^{n} (X_{i} − μ)^{2} − n(X̄ − μ)^{2}]
= (1/(n − 1)) (n σ^{2} − n · (σ^{2}/n))
= (1/(n − 1)) (n − 1) σ^{2}
= σ^{2}.
Thus E[s^{2}] = σ^{2}, and therefore s^{2} is an unbiased estimator of the population variance, σ^{2}. The ratio between the biased (uncorrected) and unbiased estimates of the variance is known as Bessel's correction.
The reason that an uncorrected sample variance, S^{2}, is biased stems from the fact that the sample mean is an ordinary least squares (OLS) estimator for μ: X̄ is the number that makes the sum Σ_{i=1}^{n} (X_{i} − X̄)^{2} as small as possible. That is, when any other number is plugged into this sum, the sum can only increase. In particular, the choice μ ≠ X̄ gives

(1/n) Σ_{i=1}^{n} (X_{i} − X̄)^{2} < (1/n) Σ_{i=1}^{n} (X_{i} − μ)^{2},

and then

E[S^{2}] = E[(1/n) Σ_{i=1}^{n} (X_{i} − X̄)^{2}] < E[(1/n) Σ_{i=1}^{n} (X_{i} − μ)^{2}] = σ^{2}.
The above discussion can be understood in geometric terms: the vector C = (X_{1} − μ, ..., X_{n} − μ) can be decomposed into the "mean part" and "variance part" by projecting to the direction of u = (1, ..., 1) and to that direction's orthogonal complement hyperplane. One gets A = (X̄ − μ, ..., X̄ − μ) for the part along u and B = (X_{1} − X̄, ..., X_{n} − X̄) for the complementary part. Since this is an orthogonal decomposition, the Pythagorean theorem says ‖C‖^{2} = ‖A‖^{2} + ‖B‖^{2}, and taking expectations we get n σ^{2} = n E[(X̄ − μ)^{2}] + n E[S^{2}], as above (but times n). If the distribution of C is rotationally symmetric, as in the case when the X_{i} are sampled from a Gaussian, then on average, the dimension along u contributes to ‖C‖^{2} equally as the n − 1 directions perpendicular to u, so that E[(X̄ − μ)^{2}] = σ^{2}/n and E[S^{2}] = ((n − 1)/n) σ^{2}. This is in fact true in general, as explained above.
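The decomposition is easy to check numerically; a small sketch (Python with NumPy; the sample size and population values are arbitrary assumptions) verifies the orthogonality and the Pythagorean identity for a single sample:

    import numpy as np

    rng = np.random.default_rng(2)
    n, mu = 5, 3.0
    x = rng.normal(mu, 1.0, size=n)
    u = np.ones(n) / np.sqrt(n)     # unit vector along the direction (1, ..., 1)

    c = x - mu                      # full deviation vector
    a = (c @ u) * u                 # "mean part": projection onto u
    b = c - a                       # "variance part": orthogonal complement

    print(np.isclose(a @ b, 0.0))                          # orthogonality
    print(np.isclose(c @ c, a @ a + b @ b))                # Pythagorean theorem
    print(np.isclose(a @ a, n * (x.mean() - mu) ** 2))     # part along u
    print(np.isclose(b @ b, ((x - x.mean()) ** 2).sum()))  # n times S^2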
Estimating a Poisson probability
A far more extreme case of a biased estimator being better than any unbiased estimator arises from the Poisson distribution.^{[2]}^{[3]} Suppose that X has a Poisson distribution with expectation λ. Suppose it is desired to estimate

P(X = 0)^{2} = e^{−2λ}
with a sample of size 1. (For example, when incoming calls at a telephone switchboard are modeled as a Poisson process, and λ is the average number of calls per minute, then e^{−2λ} is the probability that no calls arrive in the next two minutes.)
Since the expectation of an unbiased estimator δ(X) is equal to the estimand, i.e.

E[δ(X)] = Σ_{x=0}^{∞} δ(x) λ^{x} e^{−λ} / x! = e^{−2λ},

the only function of the data constituting an unbiased estimator is

δ(X) = (−1)^{X}.
To see this, note that when e^{−λ} is factored out of the above expression for the expectation, the sum that is left must be a Taylor series expansion of e^{−λ} as well, which forces δ(x) = (−1)^{x} and yields e^{−λ} e^{−λ} = e^{−2λ} (see Characterizations of the exponential function).
If the observed value of X is 100, then the estimate is 1, although the true value of the quantity being estimated is very likely to be near 0, which is the opposite extreme. And, if X is observed to be 101, then the estimate is even more absurd: It is −1, although the quantity being estimated must be positive.
The (biased) maximum likelihood estimator

e^{−2X}

is far better than this unbiased estimator. Not only is its value always positive, but it is also more accurate in the sense that its mean squared error

e^{−4λ} − 2e^{λ(1/e^{2} − 3)} + e^{λ(1/e^{4} − 1)}

is smaller; compare the unbiased estimator's MSE of

1 − e^{−4λ}.

The MSEs are functions of the true value λ. The bias of the maximum-likelihood estimator is:

e^{λ(1/e^{2} − 1)} − e^{−2λ}.
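A quick Monte Carlo check (Python with NumPy; λ = 2 is an arbitrary assumption) reproduces the comparison, alongside the closed-form MSEs above:

    import numpy as np

    rng = np.random.default_rng(3)
    lam, reps = 2.0, 500_000
    target = np.exp(-2 * lam)          # the quantity e^{-2λ} being estimated

    x = rng.poisson(lam, size=reps)    # many samples, each of size 1
    unbiased = (-1.0) ** x             # the only unbiased estimator, (-1)^X
    mle = np.exp(-2.0 * x)             # the biased maximum-likelihood estimator

    for name, est in (("unbiased", unbiased), ("MLE", mle)):
        print(f"{name:8s}: bias ~ {est.mean() - target:+.4f}, "
              f"MSE ~ {((est - target) ** 2).mean():.4f}")

    # Closed-form MSEs for comparison:
    print("unbiased, exact:", 1 - np.exp(-4 * lam))
    print("MLE, exact:     ", np.exp(-4 * lam)
          - 2 * np.exp(lam * (np.exp(-2.0) - 3))
          + np.exp(lam * (np.exp(-4.0) - 1)))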
Maximum of a discrete uniform distribution
The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 through n are placed in a box and one is selected at random, giving a value X. If n is unknown, then the maximum-likelihood estimator of n is X, even though the expectation of X given n is only (n + 1)/2; we can be certain only that n is at least X and is probably more. In this case, the natural unbiased estimator is 2X − 1.
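A short simulation sketch (Python with NumPy; n = 50 is an arbitrary assumption) illustrates both the downward bias of the maximum-likelihood estimator X and the unbiasedness of 2X − 1:

    import numpy as np

    rng = np.random.default_rng(4)
    n, reps = 50, 200_000                  # true number of tickets; replications

    x = rng.integers(1, n + 1, size=reps)  # one ticket drawn uniformly from 1..n

    print("mean of MLE X: ", x.mean())            # ~ (n + 1)/2 = 25.5, biased low
    print("mean of 2X - 1:", (2 * x - 1).mean())  # ~ n = 50, unbiased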
Median-unbiased estimators
The theory of median-unbiased estimators was revived by George W. Brown in 1947:^{[4]}
An estimate of a one-dimensional parameter θ will be said to be median-unbiased, if, for fixed θ, the median of the distribution of the estimate is at the value θ; i.e., the estimate underestimates just as often as it overestimates. This requirement seems for most purposes to accomplish as much as the mean-unbiased requirement and has the additional property that it is invariant under one-to-one transformation.
Further properties of median-unbiased estimators have been noted by Lehmann, Birnbaum, van der Vaart and Pfanzagl.^{[citation needed]} In particular, median-unbiased estimators exist in cases where mean-unbiased and maximum-likelihood estimators do not exist. They are invariant under one-to-one transformations.
There are methods of constructing median-unbiased estimators for probability distributions that have monotone likelihood functions, such as one-parameter exponential families, to ensure that they are optimal (in a sense analogous to the minimum-variance property considered for mean-unbiased estimators).^{[5]}^{[6]} One such procedure is an analogue of the Rao–Blackwell procedure for mean-unbiased estimators: the procedure holds for a smaller class of probability distributions than does the Rao–Blackwell procedure for mean-unbiased estimation but for a larger class of loss functions.^{[7]}
Bias with respect to other loss functions
Any minimum-variance mean-unbiased estimator minimizes the risk (expected loss) with respect to the squared-error loss function (among mean-unbiased estimators), as observed by Gauss.^{[8]} A minimum-average-absolute-deviation median-unbiased estimator minimizes the risk with respect to the absolute loss function (among median-unbiased estimators), as observed by Laplace.^{[9]}^{[10]} Other loss functions are used in statistics, particularly in robust statistics.^{[11]}^{[12]}
Effect of transformations
As stated above, for univariate parameters, median-unbiased estimators remain median-unbiased under transformations that preserve order (or reverse order).
Note that, when a transformation is applied to a mean-unbiased estimator, the result need not be a mean-unbiased estimator of its corresponding population statistic. By Jensen's inequality, a convex function as transformation will introduce positive bias, while a concave function will introduce negative bias, and a function of mixed convexity may introduce bias in either direction, depending on the specific function and distribution. That is, for a non-linear function f and a mean-unbiased estimator U of a parameter p, the composite estimator f(U) need not be a mean-unbiased estimator of f(p). For example, the square root of the unbiased estimator of the population variance is not a mean-unbiased estimator of the population standard deviation: the square root of the unbiased sample variance, the corrected sample standard deviation, is biased. The bias depends both on the sampling distribution of the estimator and on the transform, and can be quite involved to calculate; see unbiased estimation of standard deviation for a discussion in this case.
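A simulation sketch (Python with NumPy; normal data with σ = 1 and n = 10 are arbitrary assumptions) makes the direction of the bias visible: the square root is concave, so by Jensen's inequality the corrected sample standard deviation underestimates σ on average:

    import numpy as np

    rng = np.random.default_rng(5)
    n, reps, sigma = 10, 200_000, 1.0
    x = rng.normal(0.0, sigma, size=(reps, n))

    s2 = x.var(axis=1, ddof=1)   # corrected sample variance: unbiased for sigma^2
    s = np.sqrt(s2)              # corrected sample standard deviation

    print("mean of s^2:", s2.mean())  # ~ 1.0, unbiased
    print("mean of s:  ", s.mean())   # < 1.0, negatively biased for sigma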
Bias, variance and mean squared error
While bias quantifies the average difference to be expected between an estimator and an underlying parameter, an estimator based on a finite sample can additionally be expected to differ from the parameter due to the randomness in the sample.
One measure which is used to try to reflect both types of difference is the mean square error,

MSE(θ̂) = E[(θ̂ − θ)^{2}].
This can be shown to be equal to the square of the bias, plus the variance:

MSE(θ̂) = (E[θ̂] − θ)^{2} + E[(θ̂ − E[θ̂])^{2}] = (Bias(θ̂, θ))^{2} + Var(θ̂).
When the parameter is a vector, an analogous decomposition applies:^{[13]}

MSE(θ̂) = trace(Cov(θ̂)) + ‖Bias(θ̂, θ)‖^{2},

where trace(Cov(θ̂)) is the trace of the covariance matrix of the estimator and ‖Bias(θ̂, θ)‖^{2} is the squared norm of the bias vector.
An estimator that minimises the bias will not necessarily minimise the mean square error.
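The decomposition can be verified by simulation; a sketch (Python with NumPy; the uncorrected sample variance on normal data is an arbitrary example) compares the directly estimated MSE with bias squared plus variance:

    import numpy as np

    rng = np.random.default_rng(6)
    n, reps, sigma2 = 10, 300_000, 1.0
    x = rng.normal(0.0, 1.0, size=(reps, n))
    est = x.var(axis=1, ddof=0)          # uncorrected sample variance

    mse = ((est - sigma2) ** 2).mean()   # direct Monte Carlo MSE
    bias = est.mean() - sigma2
    var = est.var()
    print(f"MSE ~ {mse:.5f}, bias^2 + variance ~ {bias ** 2 + var:.5f}")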
Example: Estimation of population variance
For example,^{[14]} suppose an estimator of the form

T^{2} = c Σ_{i=1}^{n} (X_{i} − X̄)^{2} = c n S^{2}

is sought for the population variance as above, but this time to minimise the MSE:

MSE = E[(T^{2} − σ^{2})^{2}].
If the variables X_{1}, ..., X_{n} follow a normal distribution, then nS^{2}/σ^{2} has a chi-squared distribution with n − 1 degrees of freedom, giving:

E[n S^{2}] = (n − 1) σ^{2},   Var(n S^{2}) = 2(n − 1) σ^{4},

and so

MSE = (c(n − 1) − 1)^{2} σ^{4} + 2 c^{2} (n − 1) σ^{4}.
With a little algebra it can be confirmed that it is c = 1/(n + 1) which minimises this combined loss function, rather than c = 1/(n − 1) which minimises just the bias term.
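The algebra can be double-checked numerically; a sketch (Python with NumPy; n = 10 is an arbitrary assumption) evaluates the closed-form MSE above on a grid of c values:

    import numpy as np

    n = 10
    c = np.linspace(0.01, 0.2, 2001)
    # MSE(c) in units of sigma^4, from the chi-squared moments above:
    mse = (c * (n - 1) - 1) ** 2 + 2 * c ** 2 * (n - 1)

    print("numerical minimiser:", c[mse.argmin()])  # ~ 1/(n + 1)
    print("1/(n + 1) =", 1 / (n + 1), "  1/(n - 1) =", 1 / (n - 1))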
More generally it is only in restricted classes of problems that there will be an estimator that minimises the MSE independently of the parameter values.
However, it is very common for there to be a perceived bias–variance tradeoff, such that a small increase in bias can be traded for a larger decrease in variance, resulting in a more desirable estimator overall.
Bayesian view
Most Bayesians are rather unconcerned about unbiasedness (at least in the formal sampling-theory sense above) of their estimates. For example, Gelman et al. (1995) write: "From a Bayesian perspective, the principle of unbiasedness is reasonable in the limit of large samples, but otherwise it is potentially misleading."^{[15]}
Fundamentally, the difference between the Bayesian approach and the sampling-theory approach above is that in the sampling-theory approach the parameter is taken as fixed, and then probability distributions of a statistic are considered, based on the predicted sampling distribution of the data. For a Bayesian, however, it is the data which are known, and fixed, and it is the unknown parameter for which an attempt is made to construct a probability distribution, using Bayes' theorem:

p(θ | D) ∝ p(θ) · p(D | θ).
Here the second term, the likelihood of the data given the unknown parameter value θ, depends just on the data obtained and the modelling of the data generation process. However, a Bayesian calculation also includes the first term, the prior probability for θ, which takes account of everything the analyst may know or suspect about θ before the data come in. This information plays no part in the sampling-theory approach; indeed any attempt to include it would be considered "bias" away from what was pointed to purely by the data. To the extent that Bayesian calculations include prior information, it is therefore essentially inevitable that their results will not be "unbiased" in sampling-theory terms.
But the results of a Bayesian approach can differ from the sampling theory approach even if the Bayesian tries to adopt an "uninformative" prior.
For example, consider again the estimation of an unknown population variance σ^{2} of a Normal distribution with unknown mean, where it is desired to optimise c in the expected loss function

ExpectedLoss = E[(c n S^{2} − σ^{2})^{2}] = E[σ^{4} (c n S^{2}/σ^{2} − 1)^{2}].
A standard choice of uninformative prior for this problem is the Jeffreys prior, p(σ^{2}) ∝ 1/σ^{2}, which is equivalent to adopting a rescaling-invariant flat prior for ln(σ^{2}).
One consequence of adopting this prior is that S^{2}/σ^{2} remains a pivotal quantity, i.e. the probability distribution of S^{2}/σ^{2} depends only on S^{2}/σ^{2}, independent of the value of S^{2} or σ^{2}:

p(S^{2}/σ^{2} | S^{2}) = p(S^{2}/σ^{2} | σ^{2}) = g(S^{2}/σ^{2}).
However, while

E_{p(S^{2} | σ^{2})}[σ^{4} (c n S^{2}/σ^{2} − 1)^{2}] = σ^{4} E_{p(S^{2} | σ^{2})}[(c n S^{2}/σ^{2} − 1)^{2}],

in contrast

E_{p(σ^{2} | S^{2})}[σ^{4} (c n S^{2}/σ^{2} − 1)^{2}] ≠ σ^{4} E_{p(σ^{2} | S^{2})}[(c n S^{2}/σ^{2} − 1)^{2}].

When the expectation is taken over the probability distribution of σ^{2} given S^{2}, as it is in the Bayesian case, rather than of S^{2} given σ^{2}, one can no longer take σ^{4} as a constant and factor it out. The consequence of this is that, compared to the sampling-theory calculation, the Bayesian calculation puts more weight on larger values of σ^{2}, properly taking into account (as the sampling-theory calculation cannot) that under this squared-loss function the consequence of underestimating large values of σ^{2} is more costly in squared-loss terms than that of overestimating small values of σ^{2}.
The worked-out Bayesian calculation gives a scaled inverse chi-squared distribution with n − 1 degrees of freedom for the posterior probability distribution of σ^{2}. The expected loss is minimised when c n S^{2} = E[σ^{2} | S^{2}], the posterior mean of σ^{2}; this occurs when c = 1/(n − 3).
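A Monte Carlo sketch of that posterior calculation (Python with NumPy; n = 10 and the observed value of nS² are arbitrary assumptions) recovers c = 1/(n − 3), using the fact that the expected squared loss is minimised when cnS² equals the posterior mean of σ²:

    import numpy as np

    rng = np.random.default_rng(7)
    n, draws = 10, 1_000_000
    nS2 = 1.0                    # observed value of n*S^2 (the result scales with it)

    # A posteriori, n*S^2 / sigma^2 is chi-squared with n - 1 degrees of freedom,
    # so posterior draws of sigma^2 are obtained by division:
    sigma2 = nS2 / rng.chisquare(n - 1, size=draws)

    # The expected loss E[(c*n*S^2 - sigma^2)^2 | S^2] is minimised when c*n*S^2
    # equals the posterior mean of sigma^2:
    c_opt = sigma2.mean() / nS2
    print("Monte Carlo c:", c_opt, "  theory, 1/(n - 3):", 1 / (n - 3))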
Even with an uninformative prior, therefore, a Bayesian calculation may not give the same expected-loss-minimising result as the corresponding sampling-theory calculation.
See also
 Omitted-variable bias
 Consistent estimator
 Estimation theory
 Expected loss
 Expected value
 Loss function
 Median
 Statistical decision theory
 Optimism bias
Notes
 ^ Richard Arnold Johnson; Dean W. Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall. ISBN 9780131877153. Retrieved 10 August 2012.
 ^ J. P. Romano and A. F. Siegel (1986) Counterexamples in Probability and Statistics, Wadsworth & Brooks / Cole, Monterey, California, USA, p. 168
 ^ Hardy, M. (1 March 2003). "An Illuminating Counterexample". American Mathematical Monthly. 110 (3): 234–238. arXiv:math/0206006. doi:10.2307/3647938. ISSN 0002-9890. JSTOR 3647938.
 ^ Brown (1947), page 583
 ^ Pfanzagl, Johann. "On optimal median unbiased estimators in the presence of nuisance parameters." The Annals of Statistics (1979): 187–193.
 ^ Brown, L. D.; Cohen, Arthur; Strawderman, W. E. A Complete Class Theorem for Strict Monotone Likelihood Ratio With Applications. Ann. Statist. 4 (1976), no. 4, 712–722. doi:10.1214/aos/1176343543. http://projecteuclid.org/euclid.aos/1176343543.
 ^ Page 713: Brown, L. D.; Cohen, Arthur; Strawderman, W. E. A Complete Class Theorem for Strict Monotone Likelihood Ratio With Applications. Ann. Statist. 4 (1976), no. 4, 712–722. doi:10.1214/aos/1176343543. http://projecteuclid.org/euclid.aos/1176343543.
 ^ Dodge, Yadolah, ed. (1987). Statistical data analysis based on the L1norm and related methods: Papers from the First International Conference held at Neuchâtel, August 31–September 4, 1987. Amsterdam: NorthHolland Publishing Co.
 ^ Dodge, Yadolah, ed. (1987). Statistical data analysis based on the L1norm and related methods: Papers from the First International Conference held at Neuchâtel, August 31–September 4, 1987. Amsterdam: NorthHolland Publishing Co.
 ^ Jaynes, E.T. (2007). Probability theory : the logic of science (5. print. ed.). Cambridge [u.a.]: Cambridge Univ. Press. p. 172. ISBN 9780521592710.
 ^ Dodge, Yadolah, ed. (1987). Statistical data analysis based on the L1norm and related methods: Papers from the First International Conference held at Neuchâtel, August 31–September 4, 1987. Amsterdam: NorthHolland Publishing Co.
 ^ Chapter 3: Robust and Non-Robust Models in Statistics by Lev B. Klebanov, Svetlozar T. Rachev and Frank J. Fabozzi, Nova Scientific Publishers, Inc. New York, 2009.
 ^ Taboga, Marco (2010). "Lectures on probability theory and mathematical statistics".
 ^ Morris H. DeGroot (1986), Probability and Statistics (2nd edition), Addison-Wesley. ISBN 020111366X. Pp. 414–5. But compare it with, for example, the discussion in Casella and Berger (2001), Statistical Inference (2nd edition), Duxbury. ISBN 0534243126. P. 332.
 ^ A. Gelman et al. (1995), Bayesian Data Analysis, Chapman and Hall. ISBN 0412039915. p. 108.
References
 Brown, George W. "On Small-Sample Estimation." The Annals of Mathematical Statistics, vol. 18, no. 4 (Dec., 1947), pp. 582–585. JSTOR 2236236.
 Lehmann, E. L. "A General Concept of Unbiasedness" The Annals of Mathematical Statistics, vol. 22, no. 4 (Dec., 1951), pp. 587–592. JSTOR 2236928.
 Allan Birnbaum, 1961. "A Unified Theory of Estimation, I", The Annals of Mathematical Statistics, vol. 32, no. 1 (Mar., 1961), pp. 112–135.
 Van der Vaart, H. R., 1961. "Some Extensions of the Idea of Bias" The Annals of Mathematical Statistics, vol. 32, no. 2 (June 1961), pp. 436–447.
 Pfanzagl, Johann. 1994. Parametric Statistical Theory. Walter de Gruyter.
 Stuart, Alan; Ord, Keith; Arnold, Steven [F.] (2010). Classical Inference and the Linear Model. Kendall's Advanced Theory of Statistics. 2A. Wiley. ISBN 0470689242.
 Voinov, Vassily [G.]; Nikulin, Mikhail [S.] (1993). Unbiased estimators and their applications. 1: Univariate case. Dordrecht: Kluwer Academic Publishers. ISBN 0792323823.
 Voinov, Vassily [G.]; Nikulin, Mikhail [S.] (1996). Unbiased estimators and their applications. 2: Multivariate case. Dordrecht: Kluwer Academic Publishers. ISBN 0792339398.
 Klebanov, Lev [B.]; Rachev, Svetlozar [T.]; Fabozzi, Frank [J.] (2009). Robust and Non-Robust Models in Statistics. New York: Nova Scientific Publishers. ISBN 9781607417682.
External links
 Hazewinkel, Michiel, ed. (2001) [1994], "Unbiased estimator", Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 9781556080104 ^{[clarification needed]}