Assumption of a random error term in a regression

Here's the general idea - someone who has a better background than I do in statistics could probably give a better explanation. So you have this linear regression model: $$Y = \alpha + \beta X + \epsilon $$ where $\epsilon$ follows a normal distribution with mean $0$.

What exactly does random mean? My background in statistics is very basic, but I understand that a random variable is defined as a mapping from a sample space to the real numbers. That definition makes sense, but the assumption of a zero mean is what trips me up. How can we just assume this?

Personally, I've always taken the idea that $\epsilon$ follows a normal distribution with mean $0$ as an axiom of sorts for the linear regression model. My understanding is that it's simply a convenient assumption we would like the linear regression model to satisfy, and it lends itself to some nice properties. Remember:

Essentially, all models are wrong, but some are useful.

which is attributed to George E.P. Box.

Why would we want such an axiom? Well... on average, it would be nice to have zero error.

In my honest opinion (this is based on the little measure-theoretic probability I have studied), it would be best to approach this idea of "randomness" intuitively, as you would in an undergraduate probability course.

The idea behind anything random is that you will never know its value in advance. So, in an undergraduate probability class, what you do is assign probabilities to the values your quantity of interest can take by building a probabilistic model. Your model, 99% of the time, won't be perfect, but that doesn't stop anyone from trying.

The normal distribution with mean 0 is just an example of a probabilistic model that statisticians feel is a suitable model for the error term. It isn't perfect, but it's suitable for most purposes. I worked with a professor whose focus is on assuming a skew-normal error term, which complicates things, but is usually more realistic, since, in reality, not everything looks like a bell curve.
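To make that last point a bit more concrete, here is a minimal simulation sketch (my own addition, assuming NumPy and SciPy are available; the skewness parameter and sample size are arbitrary choices for illustration). It draws errors from a normal and from a re-centred skew-normal distribution and compares their means and skewness:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Normal errors: symmetric around 0 by construction.
normal_errors = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Skew-normal errors: shape parameter a > 0 gives a right-skewed distribution.
# The raw skew-normal does NOT have mean 0, so it is re-centred below.
a = 4.0  # arbitrary skewness, chosen only for illustration
skew_errors = stats.skewnorm.rvs(a, loc=0.0, scale=1.0, size=10_000, random_state=0)
skew_errors -= stats.skewnorm.mean(a)  # shift so the error term still has mean 0

print(f"normal errors:      mean {normal_errors.mean():+.3f}, skewness {stats.skew(normal_errors):+.3f}")
print(f"skew-normal errors: mean {skew_errors.mean():+.3f}, skewness {stats.skew(skew_errors):+.3f}")
```

Both error terms have mean (approximately) $0$, but only one of them looks like a bell curve; that is the kind of modelling choice described above.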

My two cents. Hopefully I've helped somewhat.


Basically, the errors represent everything that the model does not take into account. And why is that? Because it would be extremely unlikely for a model to predict a variable perfectly, as it is impossible to control every possible condition that may interfere with the response variable. The errors may also include reading or measurement inaccuracies. Considering the regression line of best fit, the errors correspond to the vertical distances from each point to that line.

The Central Limit Theorem is behind the assumption of the errors following a normal distribution. It states that the distribution of the (suitably scaled) sum of a large number of independent random variables tends towards a normal distribution. And in practice, the majority of observable errors do appear to be distributed roughly that way, which helps us extrapolate to the unobservable errors.
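As a rough illustration (my own sketch, not part of the original argument; the uniform shocks and the sample sizes are arbitrary), summing many small independent non-normal shocks already produces something close to a bell curve:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "error" is the sum of many small, independent, non-normal shocks
# (here: uniform on [-0.5, 0.5]). By the CLT, the sums look roughly normal.
n_shocks, n_errors = 100, 10_000
shocks = rng.uniform(-0.5, 0.5, size=(n_errors, n_shocks))
errors = shocks.sum(axis=1)

# For a normal distribution, about 68% of values fall within one standard
# deviation of the mean; check how close the simulated errors come to that.
within_one_sd = np.mean(np.abs(errors - errors.mean()) < errors.std())
print(f"mean of summed errors: {errors.mean():+.3f}")     # close to 0
print(f"share within one std dev: {within_one_sd:.3f}")   # close to 0.68
```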

Another assumption is that each data point has its own independently drawn error, i.e., the errors are independent of one another, which helps us treat them as occurring randomly.

And because the errors occur randomly, each data point is expected to have an equal probability of falling above or below the line of best fit created by the regression (a positive error when the observed value is higher than the one predicted by the line, a negative error when it is lower), meaning that if you summed up every error you would get a value very close to zero.
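Here is a minimal sketch of that last point (my own addition, assuming NumPy; the simulated data and coefficients are made up for illustration). With an intercept in the model, an ordinary least-squares fit has residuals that sum to essentially zero and split roughly evenly above and below the line:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: a "true" line plus random noise.
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)

# Ordinary least-squares fit of y = alpha + beta * x.
beta_hat, alpha_hat = np.polyfit(x, y, deg=1)
residuals = y - (alpha_hat + beta_hat * x)  # signed vertical distances to the fitted line

print(f"sum of residuals: {residuals.sum():.10f}")        # essentially 0
print(f"points above the line: {(residuals > 0).sum()}")  # roughly half
print(f"points below the line: {(residuals < 0).sum()}")
```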

Hope this helps.


The assumption of mean $0$ is a normalization that must be made because you already have a constant term in the regression. It relates to the issue of identification: you, as the researcher, cannot tell the difference between the constant term in the regression and the mean of the error term.

Proof: Suppose that $\epsilon$ does not have mean $0$.

Let $\bar{\epsilon}$ denote the mean of $\epsilon$. Then I can re-write your model as

$Y = (\alpha + \bar{\epsilon}) + \beta X + (\epsilon - \bar{\epsilon})$.

Let $\tilde{\alpha} = \alpha + \bar{\epsilon}$ and $\tilde{\epsilon} = \epsilon - \bar{\epsilon}$.

Then $Y = \tilde{\alpha} + \beta X + \tilde{\epsilon}$.

This model is identical to yours, except that it now has a mean-zero error term (by construction, $E[\tilde{\epsilon}] = E[\epsilon] - \bar{\epsilon} = 0$) and the new constant combines the old constant and the mean of the original error term.
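To see the identification point numerically, here is a small sketch (my own addition, assuming NumPy; the true coefficients and the error mean of $3$ are arbitrary). When the error term has a non-zero mean, a least-squares fit simply absorbs that mean into the estimated intercept:

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, beta = 1.0, 2.0
x = rng.uniform(0, 5, size=1000)

# Error term with a non-zero mean (mean 3, chosen arbitrarily).
eps = rng.normal(loc=3.0, scale=1.0, size=1000)
y = alpha + beta * x + eps

# Least-squares fit of y = a + b * x: the intercept absorbs the error mean.
b_hat, a_hat = np.polyfit(x, y, deg=1)
print(f"estimated slope:     {b_hat:.2f}")  # close to beta = 2
print(f"estimated intercept: {a_hat:.2f}")  # close to alpha + 3 = 4, not alpha = 1
```

From the data alone there is no way to recover $\alpha = 1$ and a mean-$3$ error separately; only their sum $\tilde{\alpha} = 4$ is identified.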