An Explanation of the Kalman Filter

Let's start from what a Kalman filter is: It's a method of predicting the future state of a system based on the previous ones.

To understand what it does, take a look at the following data. If you were given the data in blue, it may be reasonable to predict that the green dot should follow, by simply extrapolating the linear trend from the few previous samples. However, how confident would you be predicting the dark red point on the right using that method? How confident would you be about predicting the green point, if you were given the red series instead of the blue?

$\quad\quad\quad$ (figure: a blue series and a noisier red series of samples, with a nearby green point and a distant dark red point to be predicted)

From this simple example, we can learn three important principles:

  1. It's not good enough to give a prediction - you also want to know the confidence level.
  2. Predicting far ahead into the future is less reliable than nearer predictions.
  3. The reliability of your data (the noise) influences the reliability of your predictions.

$$\color{red} {\longleftarrow \backsim *\ *\ *\sim\longrightarrow}$$

Now, let's try and use the above to model our prediction.

The first thing we need is a state. The state is a description of all the parameters we will need to describe the current system and perform the prediction. For the example above, we'll use two numbers: The current vertical position ($y$), and our best estimate of the current slope (let's call it $m$). Thus, the state is in general a vector, commonly denoted $\bf{x}$, and you can of course include many more parameters in it if you wish to model more complex systems.

The next thing we need is a model: The model describes how we think the system behaves. In an ordinary Kalman filter, the model is always a linear function of the state. In our simple case, our model is: $$y(t) = y(t-1) + m(t-1)$$ $$m(t) = m(t-1)$$ Expressed as a matrix, this is: $${\bf{x}}_{t} = \left(\begin{array}{c} y(t)\\ m(t)\\ \end{array} \right) = \left(\begin{array}{cc} 1 & 1\\0 & 1\\ \end{array} \right)\cdot \left(\begin{array}{c} y(t-1)\\ m(t-1)\\ \end{array} \right) \equiv F {\bf{x}}_{t-1}$$

Of course, our model isn't perfect (else we wouldn't need a Kalman Filter!), so we add an additional term to the state equation - the process noise ${\bf{v}}_t$, which is assumed to be normally distributed. Although we don't know the actual value of the noise, we assume we can estimate how "large" the noise is, as we shall presently see. All this gives us the state equation, which is simply:

$${\bf{x}}_{t} = F {\bf{x}}_{t-1} + {\bf{v}}_{t-1}$$ The third and final part we are missing is the measurement. When we get new data, our parameters should change slightly to refine our current model, and the next predictions. What is important to understand is that one does not have to measure exactly the same parameters as those in the state. For instance, a Kalman filter describing the motion of a car may want to predict the car's acceleration, velocity and position, but only measure, say, the wheel angle and rotational velocity. In our example, we only "measure" the vertical position of the new points, not the slope. That is: $$\text{measurement} = \left(\begin{array}{cc} 1 & 0 \end{array} \right) \cdot \left(\begin{array}{c} y(t) \\ m(t) \end{array} \right) $$ In the more general case, we may have more than one measurement, so the measurement is a vector, denoted by $\bf{z}$. Also, the measurements themselves are noisy, so the general measurement equation takes the form: $${\bf{z}}_t = H {\bf{x}}_t +{\bf{w}}_t$$ where $\bf{w}$ is the measurement noise, and $H$ is in general a matrix whose width is the number of state variables and whose height is the number of measurement variables.
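To make this concrete, here is a minimal MATLAB sketch of the two matrices for this example (the initial position and slope are illustrative assumptions, not values from the text):

F = [1 1; 0 1];  % state transition: y(t) = y(t-1) + m(t-1), m(t) = m(t-1)
H = [1 0];       % measurement matrix: we observe only the vertical position y
x = [0; 0.5];    % assumed initial state: position 0, slope 0.5
x = F*x;         % one noiseless prediction step: x is now [0.5; 0.5]
z = H*x          % the measurement we would then expect to see: 0.5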

$$\color{orange} {\longleftarrow \backsim *\ *\ *\sim\longrightarrow}$$

Now that we have understood what goes into modeling the system, we can start with the prediction stage, the heart of the Kalman Filter.

Let's start by assuming our model is perfect, with no noise. How will we predict what our state will be at time $t+1$? Simple! It's just our state equation: $${\bf{\hat x}}_{t+1} = F {\bf{x}}_{t}$$ What do we expect to measure? Simply the measurement equation: $${\bf{\hat z}}_{t+1} = H {\bf{\hat x}}_{t+1}$$ Now, what do we actually measure? Probably something a bit different: $$\bf{y} \equiv {\bf{z}}_{t+1} - {\bf{\hat z}}_{t+1} \neq 0 $$ The difference $\bf{y}$ (also called the innovation) represents how wrong our current estimation is - if everything were perfect, the difference would be zero! To incorporate this into our model, we add the innovation to our state equation, multiplied by a matrix factor that tells us how much the state should change based on this difference between the expected and actual measurements: $${\bf{\hat x}}_{t+1} = F {\bf{x}}_{t} + W \bf{y}$$ The matrix $W$ is known as the Kalman gain, and its determination is where things get messy, but understanding why the prediction takes this form is the really important part. Before we get to the formula for $W$, we should think about what it should look like:

  • If the measurement noise is large, perhaps the error is only an artifact of the noise, and not a "true" innovation. Thus, if the measurement noise is large, $W$ should be small.
  • If the process noise is large, i.e., we expect the state to change quickly, we should take the innovation more seriously, since it's plausible the state has actually changed.
  • Adding these two together, we expect: $$W \sim \frac{\text{Process Noise}}{\text{Measurement Noise}} $$
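As a quick sanity check of this intuition, consider the scalar case with $H = 1$, where (as derived below) the gain reduces to $W = P/(P+R)$, with $P$ the prediction variance (which grows with the process noise) and $R$ the measurement noise variance: the illustrative values $P = 9$, $R = 1$ give $W = 0.9$ (trust the measurement), while $P = 1$, $R = 9$ give $W = 0.1$ (trust the model).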

$$\color{green} {\longleftarrow \backsim *\ *\ *\sim\longrightarrow}$$

One way to evaluate the uncertainty of a value is to look at its variance. The first variance we care about is the variance of our prediction of the state:

$$P_t = Cov({\bf{\hat x}}_t)$$

Just like with ${\bf{x}}_t$, we can derive $P_t$ from its previous state:

$$P_{t+1} = Cov({\bf{\hat x}}_{t+1}) = \\ Cov(F {\bf{x}}_t) = \\ F Cov({\bf{x}}_t) F^\top = \\ F P_t F^\top$$

However, this assumes our process model is perfect and there's nothing we couldn't predict. But normally there are many unknowns that might be influencing our state (maybe there's wind, friction, etc.). We incorporate this as a covariance matrix $Q$ of the process noise ${\bf{v}}_t$, and the prediction covariance becomes:

$$P_{t+1} = F P_t F^\top + Q$$

The last source of noise in our system is the measurement. Following the same logic we obtain a covariance matrix for ${\bf{\hat z}}_{t+1}$.

$$S_{t+1} = Cov({\bf{\hat z}}_{t+1}) = \\ Cov(H {\bf{\hat x}}_{t+1}) = \\ H Cov({\bf{\hat x}}_{t+1}) H^\top = \\ H P_{t+1} H^\top$$

As before, remember that we said that our measurements ${\bf{z}}_t$ have normally distributed noise ${\bf w}_t$. Let's say that the covariance matrix which describes this noise is called $R$. Adding it to the measurement covariance matrix gives:

$$S_{t+1} = H P_{t+1} H^\top + R$$

Finally we can obtain $W$ by looking at how two normally distributed states are combined (predicted and measured):

$$W = P_{t+1} H^{\top} S_{t+1}^{-1}$$
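Putting the pieces together, here is a minimal MATLAB sketch of one full predict-and-update cycle for the line example above. The values of $Q$, $R$, the initial state and the measurement are illustrative assumptions, and the final covariance update is the standard step that the continuation below would derive:

% one Kalman predict/update cycle for the line example (illustrative values)
F = [1 1; 0 1]; H = [1 0];
Q = 0.01*eye(2);            % assumed process noise covariance
R = 4;                      % assumed measurement noise variance
x = [0; 0.5]; P = eye(2);   % current state estimate and its covariance
x = F*x;                    % predict the state...
P = F*P*F' + Q;             % ...and its covariance
z = 1.2;                    % a hypothetical new measurement
S = H*P*H' + R;             % innovation covariance (scalar here)
W = P*H'/S;                 % Kalman gain
x = x + W*(z - H*x);        % correct the state with the innovation
P = P - W*H*P;              % standard covariance update (not derived in the text above)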

To be continued...


I'll provide a simple derivation of the KF directly and concisely from a probabilistic perspective. This approach is the best way to first understand a KF. I personally find non-probabilistic derivations (e.g. the best linear filter in terms of some arbitrary least squares cost function) unintuitive and not meaningful. If you're already familiar with the material, you can skip the first few headings (which I personally would have found helpful when learning KFs) and go directly to the KF derivation, or come back to them later.

Variance and Covariance

Variance is a measure of how spread out the density of a random variable is. It is defined as: \begin{equation} \text{Var}(x) = \mathbb{E}\left[{ ( x - \mathbb{E}\left[{x}\right] )^2 }\right]. \end{equation} The covariance is an unnormalized measure of how much two variables correlate with each other. The covariance of two random variables $x$ and $y$ is \begin{equation} \sigma(x,y) = \mathbb{E}\left[{ ( x - \mathbb{E}\left[{x}\right] )( y - \mathbb{E}\left[{y}\right] ) }\right]. \end{equation} The Equation above can be rewritten as \begin{align} \sigma(x,y) &= \mathbb{E}\left[{ ( x - \mathbb{E}\left[{x}\right] )( y - \mathbb{E}\left[{y}\right] ) }\right] \\ &= \mathbb{E}\left[{ xy - x \mathbb{E}\left[{y}\right] - y\mathbb{E}\left[{x}\right] + \mathbb{E}\left[{x}\right]\mathbb{E}\left[{y}\right] }\right] \\ &= \mathbb{E}\left[{xy}\right] - \mathbb{E}\left[{x}\right]\mathbb{E}\left[{y}\right] - \mathbb{E}\left[{y}\right]\mathbb{E}\left[{x}\right] + \mathbb{E}\left[{x}\right]\mathbb{E}\left[{y}\right] \\ &= \mathbb{E}\left[{xy}\right] - \mathbb{E}\left[{x}\right]\mathbb{E}\left[{y}\right]. \end{align} Note that $\text{Var}(x) = \sigma(x,x)$. Because $\mathbb{E}\left[{xy}\right] = \mathbb{E}\left[{x}\right]\mathbb{E}\left[{y}\right]$ for independent random variables, the covariance of independent random variables is zero. The covariance matrix for vector-valued random variables is defined as: \begin{align} \sigma(\boldsymbol{x},\boldsymbol{y}) &= \mathbb{E}\left[{ ( \boldsymbol{x} - \mathbb{E}\left[{\boldsymbol{x}}\right] ) {( \boldsymbol{y} - \mathbb{E}\left[{\boldsymbol{y}}\right] )}^T }\right] \\ &= \mathbb{E}\left[{\boldsymbol{x}{\boldsymbol{y}}^T}\right] - \mathbb{E}\left[{\boldsymbol{x}}\right]{\mathbb{E}\left[{\boldsymbol{y}}\right]}^T. \end{align}

Mean and Covariance of a Linear Combination of Random Variables

Let $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x}_1 + \boldsymbol{B}\boldsymbol{x}_2$ where $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ are multivariate random variables. The expected value of $\boldsymbol{y}$ is \begin{align} \mathbb{E}\left[{\boldsymbol{y}}\right] &= \mathbb{E}\left[{ \boldsymbol{A}\boldsymbol{x}_1 + \boldsymbol{B}\boldsymbol{x}_2 }\right] \\ &= \boldsymbol{A}\mathbb{E}\left[{\boldsymbol{x}_1}\right] + \boldsymbol{B}\mathbb{E}\left[{\boldsymbol{x}_2}\right]. \end{align} The covariance matrix for $\boldsymbol{y}$ is \begin{align} \sigma(\boldsymbol{y},\boldsymbol{y}) &= \mathbb{E}\left[{ ( \boldsymbol{y} - \mathbb{E}\left[{\boldsymbol{y}}\right] ) {( \boldsymbol{y} - \mathbb{E}\left[{\boldsymbol{y}}\right] )}^T }\right] \\ &= \mathbb{E}\left[{ ( \boldsymbol{A}(\boldsymbol{x}_1 -\mathbb{E}\left[{ \boldsymbol{x}_1}\right]) + \boldsymbol{B}(\boldsymbol{x}_2 -\mathbb{E}\left[{ \boldsymbol{x}_2}\right]) ) {( \boldsymbol{A}(\boldsymbol{x}_1 -\mathbb{E}\left[{ \boldsymbol{x}_1}\right]) + \boldsymbol{B}(\boldsymbol{x}_2 -\mathbb{E}\left[{ \boldsymbol{x}_2}\right]) )}^T }\right] \\ &= \boldsymbol{A} \sigma(\boldsymbol{x}_1,\boldsymbol{x}_1){\boldsymbol{A}}^T + \boldsymbol{B} \sigma(\boldsymbol{x}_2,\boldsymbol{x}_2){\boldsymbol{B}}^T +\boldsymbol{A} \sigma(\boldsymbol{x}_1,\boldsymbol{x}_2){\boldsymbol{B}}^T + \boldsymbol{B} \sigma(\boldsymbol{x}_2,\boldsymbol{x}_1){\boldsymbol{A}}^T. \end{align} If $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ are independent, then \begin{equation} \sigma(\boldsymbol{y},\boldsymbol{y}) = \boldsymbol{A} \sigma(\boldsymbol{x}_1,\boldsymbol{x}_1){\boldsymbol{A}}^T + \boldsymbol{B} \sigma(\boldsymbol{x}_2,\boldsymbol{x}_2){\boldsymbol{B}}^T. \end{equation}
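As a quick numerical sanity check of the last identity, here is a sketch using MATLAB's mvnrnd (from the same Statistics Toolbox as the normrnd calls later in this thread); all matrices are arbitrary illustrative choices:

n = 1e5;
A = [1 2; 0 1]; B = [0.5 0; 0 2];        % arbitrary matrices
S1 = [2 0.3; 0.3 1]; S2 = [1 0; 0 0.5];  % covariances of x1 and x2
x1 = mvnrnd([0 0], S1, n)';              % n independent samples of each variable
x2 = mvnrnd([0 0], S2, n)';
y = A*x1 + B*x2;
cov(y')                                  % the sample covariance of y...
A*S1*A' + B*S2*B'                        % ...should be close to this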

Conditional Density of Multivariate Normal Random Variables

Let $\boldsymbol{x}$, $\boldsymbol{y}$ be jointly normal with means $\mathbb{E}\left[{\boldsymbol{x}}\right]=\boldsymbol{\mu}_{\boldsymbol{x}}$, $\mathbb{E}\left[{\boldsymbol{y}}\right] = \boldsymbol{\mu}_{\boldsymbol{y}}$ and covariance \begin{equation} \left[ \begin{array}{cc} \boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_{yy} \end{array} \right]. \end{equation}

Let \begin{equation} \boldsymbol{z} = \boldsymbol{x} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{y}. \end{equation}

$\boldsymbol{z}$ and $\boldsymbol{y}$ are independent because $\boldsymbol{z}$ and $\boldsymbol{y}$ are jointly normal and \begin{align} \sigma( \boldsymbol{z}, \boldsymbol{y} ) &= \sigma( \boldsymbol{x} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{y}, \boldsymbol{y} ) \\ \sigma( \boldsymbol{z}, \boldsymbol{y} ) &= \mathbb{E}\left[{ \left(\boldsymbol{x} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{y} \right) {\boldsymbol{y}}^T }\right] - \mathbb{E}\left[{\boldsymbol{x} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{y} }\right] {\mathbb{E}\left[{\boldsymbol{y}}\right]}^T \\ \sigma( \boldsymbol{z}, \boldsymbol{y} ) &= \mathbb{E}\left[{ \boldsymbol{x}{\boldsymbol{y}}^T }\right] - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \mathbb{E}\left[{ \boldsymbol{y} {\boldsymbol{y}}^T}\right]- \mathbb{E}\left[{\boldsymbol{x}}\right]{\mathbb{E}\left[{\boldsymbol{y}}\right]}^T +\boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \mathbb{E}\left[{\boldsymbol{y}}\right] {\mathbb{E}\left[{\boldsymbol{y}}\right]}^T \\ \sigma( \boldsymbol{z}, \boldsymbol{y} ) &= \boldsymbol{\Sigma}_{xy} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yy} = \boldsymbol{0}\\ \end{align}

Let $\boldsymbol{t}\sim \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{t}},\boldsymbol{T})$. Note that because $\boldsymbol{z}$ and $\boldsymbol{y}$ are independent, $\mathbb{E}\left[{\boldsymbol{z}|\boldsymbol{y}=\boldsymbol{t}}\right] = \mathbb{E}\left[{\boldsymbol{z}}\right] = \boldsymbol{\mu}_{\boldsymbol{x}} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\mu}_{\boldsymbol{y}}$. The conditional expectation for $\boldsymbol{x}$ given $\boldsymbol{y}$ is \begin{align} \mathbb{E}\left[{\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}}\right] &= \mathbb{E}\left[{\boldsymbol{z}+\boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{y}|\boldsymbol{y}=\boldsymbol{t}}\right] \\ \mathbb{E}\left[{\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}}\right] &= \mathbb{E}\left[{\boldsymbol{z}|\boldsymbol{y}=\boldsymbol{t}}\right]+\boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \mathbb{E}\left[{\boldsymbol{y}|\boldsymbol{y}=\boldsymbol{t}}\right] \\ \mathbb{E}\left[{\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}}\right] &= \boldsymbol{\mu}_{\boldsymbol{x}} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\mu}_{\boldsymbol{y}}+\boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\mu}_{\boldsymbol{t}} \\ \mathbb{E}\left[{\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}}\right] &= \boldsymbol{\mu}_{\boldsymbol{x}} + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1}( \boldsymbol{\mu}_{\boldsymbol{t}} - \boldsymbol{\mu}_{\boldsymbol{y}} ) \\ \end{align}

The conditional covariance is \begin{align} \text{Var}(\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}) &= \text{Var}(\boldsymbol{z}+\boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{y}|\boldsymbol{y}=\boldsymbol{t}) \\ \text{Var}(\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}) &= \text{Var}(\boldsymbol{z}|\boldsymbol{y}=\boldsymbol{t}) + \text{Var}(\boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{y}|\boldsymbol{y}=\boldsymbol{t}) \\ \text{Var}(\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}) &= \text{Var}(\boldsymbol{z}) + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{T} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} \\ \text{Var}(\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}) &= \text{Var}(\boldsymbol{x} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{y}) + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{T} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} \\ \text{Var}(\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}) &= \boldsymbol{\Sigma}_{xx} + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yy} {(\boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1})}^T - \boldsymbol{\Sigma}_{xy} {(\boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1})}^T - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{T} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} \\ \text{Var}(\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}) &= \boldsymbol{\Sigma}_{xx} + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\Sigma}_{yx} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{T} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} \\ \text{Var}(\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}) &= \boldsymbol{\Sigma}_{xx} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{T} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} \\ \text{Var}(\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}) &= \boldsymbol{\Sigma}_{xx} + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \left(\boldsymbol{T} - \boldsymbol{\Sigma}_{yy} \right) \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} \\ \end{align}

In summary, \begin{align} \mathbb{E}\left[{\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}}\right] &= \boldsymbol{\mu}_{\boldsymbol{x}} + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1}( \boldsymbol{\mu}_{\boldsymbol{t}} - \boldsymbol{\mu}_{\boldsymbol{y}} ) \\ \text{Var}(\boldsymbol{x}|\boldsymbol{y}=\boldsymbol{t}) &= \boldsymbol{\Sigma}_{xx} + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1}\left(\boldsymbol{T} - \boldsymbol{\Sigma}_{yy}\right) \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} \\ \mathbb{E}\left[{\boldsymbol{y}|\boldsymbol{x}=\boldsymbol{s}}\right] &= \boldsymbol{\mu}_{\boldsymbol{y}} + \boldsymbol{\Sigma}_{yx} \boldsymbol{\Sigma}_{xx}^{-1}( \boldsymbol{\mu}_{\boldsymbol{s}} - \boldsymbol{\mu}_{\boldsymbol{x}} ) \\ \text{Var}(\boldsymbol{y}|\boldsymbol{x}=\boldsymbol{s}) &= \boldsymbol{\Sigma}_{yy} + \boldsymbol{\Sigma}_{yx} \boldsymbol{\Sigma}_{xx}^{-1}\left(\boldsymbol{S} - \boldsymbol{\Sigma}_{xx}\right) \boldsymbol{\Sigma}_{xx}^{-1} \boldsymbol{\Sigma}_{xy}\end{align}
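As an illustration, here is a sketch that conditions a bivariate normal on an exact observation (i.e. $\boldsymbol{T} = \boldsymbol{0}$); the numbers are arbitrary:

mu_x = 1; mu_y = 2;                    % illustrative means
Sxx = 4; Sxy = 1.2; Syy = 2;           % illustrative covariance blocks (scalars here)
t = 3;                                 % an exact observation of y, so T = 0
condMean = mu_x + Sxy/Syy*(t - mu_y)   % E[x|y=t]   -> 1.6
condVar  = Sxx - Sxy/Syy*Sxy           % Var(x|y=t) -> 3.28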

System Description

Let $\boldsymbol{x}_t \sim \mathcal{N}(\boldsymbol{\hat{x}}_t,\boldsymbol{P}_t)$ be the state vector that needs to be estimated. The state transition equation is \begin{equation} \boldsymbol{x}_{t+1} = \boldsymbol{A}_t\boldsymbol{x}_t + \boldsymbol{B}_t\boldsymbol{u}_t + \boldsymbol{q}_t, \end{equation} where $\boldsymbol{A}_t$ is a state transition matrix, $\boldsymbol{u}_t \sim \mathcal{N}(\boldsymbol{\hat{u}}_t,\boldsymbol{U}_t)$ is an input vector, and $\boldsymbol{q}_t \sim \mathcal{N}(\boldsymbol{0},\boldsymbol{Q}_t)$ is process noise. The measurement equation is \begin{equation} \boldsymbol{y}_{t} = \boldsymbol{C}_t\boldsymbol{x}_t + \boldsymbol{D}_t \boldsymbol{u}_t + \boldsymbol{r}_t, \end{equation} where $\boldsymbol{y}_{t}$ is the measurement, $\boldsymbol{C}_t$ is the output matrix, $\boldsymbol{D}_t$ is the feed-forward matrix, and $\boldsymbol{r}_t \sim \mathcal{N}(\boldsymbol{0},\boldsymbol{R}_t)$ is measurement noise. It is assumed that $\boldsymbol{q}_t$, $\boldsymbol{u}_t$, $\boldsymbol{x}_t$, and $\boldsymbol{r}_t$ are uncorrelated.

State Propagation Update

The mean and covariance of the a priori state estimate (the a priori state estimate is the state estimate before any measurements are taken into account) follow directly from the mean and covariance of a linear combination of random variables, \begin{align} \boldsymbol{\hat{x}}_{t} &= \boldsymbol{A}_{t-1} \boldsymbol{\hat{x}}_{t-1} + \boldsymbol{B}_{t-1} \boldsymbol{\hat{u}}_{t-1} \\ \boldsymbol{P}_t &= \boldsymbol{A}_{t-1} \boldsymbol{P}_{t-1} { \boldsymbol{A}_{t-1} }^T + \boldsymbol{B}_{t-1} \boldsymbol{U}_{t-1} {\boldsymbol{B}_{t-1}}^T + \boldsymbol{Q}_{t-1} \end{align}
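In code, the propagation step is one line each for the mean and the covariance; a sketch with illustrative matrices:

A = [1 1; 0 1]; B = [0.5; 1];   % illustrative system matrices
xhat = [0; 1]; P = eye(2);      % current state estimate and its covariance
uhat = 0.2; U = 0.01;           % mean and covariance of the input
Q = 0.1*eye(2);                 % process noise covariance
xhat = A*xhat + B*uhat;         % a priori state estimate
P = A*P*A' + B*U*B' + Q;        % a priori covariance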

Measurement Propagation Update

Let the joint distribution between $\boldsymbol{x}_t$ and $\boldsymbol{y}_t$ be \begin{equation} \left[ \begin{array}{c} \boldsymbol{x}_t \\ \boldsymbol{y}_t \end{array} \right] \sim \mathcal{N} \left( \left[ \begin{array}{c} \boldsymbol{\hat{x}}_t \\ \boldsymbol{\hat{y}}_t \end{array} \right], \left[ \begin{array}{cc} \boldsymbol{\Sigma}_{xx,t} & \boldsymbol{\Sigma}_{xy,t} \\ \boldsymbol{\Sigma}_{yx,t} & \boldsymbol{\Sigma}_{yy,t} \end{array} \right] \right) \end{equation}

The expected measurement $\boldsymbol{y}_t$ is \begin{equation} \boldsymbol{\hat{y}}_{t} = \boldsymbol{C}_t\boldsymbol{\hat{x}}_t + \boldsymbol{D}_t \boldsymbol{\hat{u}}_t. \end{equation}

The covariance sub-matrices are \begin{align} \boldsymbol{\Sigma}_{xx,t} &= \boldsymbol{P}_t \\ \boldsymbol{\Sigma}_{xy,t} &= \mathbb{E}\left[{\boldsymbol{x}_t {\boldsymbol{y}_t}^T}\right] - \mathbb{E}\left[{\boldsymbol{x}_t}\right] {\mathbb{E}\left[{\boldsymbol{y}_t}\right]}^T \\ &= \mathbb{E}\left[{\boldsymbol{x}_t {\left( \boldsymbol{C}_t\boldsymbol{x}_t + \boldsymbol{D}_t \boldsymbol{u}_t + \boldsymbol{r}_t \right)}^T}\right] - \mathbb{E}\left[{\boldsymbol{x}_t}\right] {\mathbb{E}\left[{\boldsymbol{C}_t\boldsymbol{x}_t + \boldsymbol{D}_t \boldsymbol{u}_t + \boldsymbol{r}_t}\right]}^T \\ &= \boldsymbol{P}_t { \boldsymbol{C}_t }^T \\ \boldsymbol{\Sigma}_{yx,t} &= {\boldsymbol{\Sigma}_{xy,t}}^T = \boldsymbol{C}_t \boldsymbol{P}_t \\ \boldsymbol{\Sigma}_{yy,t} &= \boldsymbol{C}_t\boldsymbol{P}_t{\boldsymbol{C}_t}^T + \boldsymbol{D}_t \boldsymbol{U}_t {\boldsymbol{D}_t}^T + \boldsymbol{R}_t \end{align}

Given a measurement $\boldsymbol{m}$, the conditional distribution for $\boldsymbol{x}_t$ given $\boldsymbol{y}_t$ is a normal distribution with the following properties: \begin{align} \boldsymbol{\hat{x}}_{t|\boldsymbol{y}_t=\boldsymbol{m}} &= \mathbb{E}\left[{\boldsymbol{x}_t|\boldsymbol{y}_t=\boldsymbol{m}}\right] = \boldsymbol{\hat{x}_t} + \boldsymbol{\Sigma}_{xy,t} \boldsymbol{\Sigma}_{yy,t}^{-1}( \boldsymbol{m} - {\hat{\boldsymbol{y}}_t} ) \\ \boldsymbol{P}_{t|\boldsymbol{y}_t=\boldsymbol{m}} &= \text{Var}(\boldsymbol{x}_t|\boldsymbol{y}_t=\boldsymbol{m}) = \boldsymbol{\Sigma}_{xx,t} - \boldsymbol{\Sigma}_{xy,t} \boldsymbol{\Sigma}_{yy,t}^{-1} \boldsymbol{\Sigma}_{yx,t} \\ \end{align} Substituting, \begin{align} \boldsymbol{\hat{x}}_{t|\boldsymbol{y}_t=\boldsymbol{m}} &= \boldsymbol{\hat{x}_t} + \boldsymbol{P}_t { \boldsymbol{C}_t }^T \left( \boldsymbol{C}_t\boldsymbol{P}_t{\boldsymbol{C}_t}^T + \boldsymbol{D}_t \boldsymbol{U}_t {\boldsymbol{D}_t}^T + \boldsymbol{R}_t\right)^{-1}( \boldsymbol{m} - \boldsymbol{C}_t\boldsymbol{\hat{x}}_t - \boldsymbol{D}_t \boldsymbol{\hat{u}}_t ) \\ \boldsymbol{P}_{t|\boldsymbol{y}_t=\boldsymbol{m}} &= \boldsymbol{P}_t - \boldsymbol{P}_t { \boldsymbol{C}_t }^T \left( \boldsymbol{C}_t\boldsymbol{P}_t{\boldsymbol{C}_t}^T + \boldsymbol{D}_t \boldsymbol{U}_t {\boldsymbol{D}_t}^T + \boldsymbol{R}_t\right)^{-1} \boldsymbol{C}_t \boldsymbol{P}_t \\ \end{align}
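Continuing the propagation sketch above (same xhat and P), the measurement update is equally short; C, R and the measurement m are again illustrative assumptions, and D is taken as zero for brevity:

C = [1 0]; R = 4;                    % output matrix and measurement noise variance (D = 0)
m = 0.8;                             % a hypothetical measurement
Sxy = P*C';                          % cross covariance between x and y
Syy = C*P*C' + R;                    % innovation covariance; D*U*D' vanishes since D = 0
xhat = xhat + Sxy/Syy*(m - C*xhat);  % a posteriori state estimate
P = P - Sxy/Syy*Sxy';                % a posteriori covariance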

Using a similar approach, it's also straightforward to derive the RTS smoother: write out the joint distribution between $x_t$ and $x_{t+1}$, then the conditional distribution of $x_t$ given $x_{t+1}$, and the result follows.


I dare to post the link to my explanation: http://david.wf/kalmanfilter/

Kalman Filter

(figure: noisy measurements and the Kalman-filtered estimate, from the example described below)
There are a lot of different articles on the Kalman filter, but it is difficult to find one which explains where all the filtering formulas come from. I think that without understanding this, the subject remains completely opaque. Here I will try to explain everything in a simple way. The Kalman filter is a very powerful tool for filtering different kinds of data. The main idea is that one should use information about the physical process. For example, if you are filtering data from a car's speedometer, then its inertia gives you the right to treat a big speed deviation as a measurement error. The Kalman filter is also interesting because, in some sense, it is the best filter. We will discuss precisely what that means. In the end I will show how it is possible to simplify the formulas.

Preliminaries

First, let's recall some definitions and facts from probability theory.

Random variable

When one says that a random variable $\xi$ is given, it means that it may take random values. Different values come with different probabilities. For example, if someone rolls a die then the set of values is discrete $\{1,2,3,4,5,6\}$. When you deal with the speed of a moving particle then obviously you should work with a continuous set of values. The values which come out after each experiment (measurement) we denote by $x_1, x_2,...$, but sometimes we use the same letter as for the random variable $\xi$. In the case of a continuous set of values, a random variable is characterized by its probability density function $\rho(x)$. This function gives the probability that the random variable falls into a small neighbourhood $dx$ of the point $x$. As we can see in the picture, this probability is equal to the area $\rho(x)dx$ of the hatched rectangle below the graph.
(figure: a probability density ρ(x); the hatched rectangle over a small interval dx has area ρ(x)dx)

Quite often in our life, random variables have a Gaussian distribution, with probability density $\rho(x)\sim e^{-\frac{(x-\mu)^2}{2\sigma^2}}$.
(figure: the Gaussian density, a bell-shaped curve centered at μ with characteristic width σ)

We can see that the bell-shaped function $\rho(x)$ is centered at the point $\mu$ and its characteristic width is around $\sigma$. Since we are talking about the Gaussian distribution, it would be a sin not to mention where it comes from. Just as the numbers $e$ and $\pi$ are firmly entrenched in mathematics and can be found in the most unexpected places, the Gaussian distribution has deep roots in the theory of probability. The following remarkable statement partly explains the presence of the Gaussian distribution in a lot of processes:
Let a random variable $\xi$ have an arbitrary distribution (in fact there are some restrictions on arbitrariness, but they are not restrictive at all). Let's perform $n$ experiments and calculate the sum $\xi_1+...+\xi_n$ of the observed values. Let's make a lot of experiments. It is clear that every time we will get a different value of the sum. In other words, this sum is a random variable with its own distribution law. It turns out that for sufficiently large $n$, the distribution law of this sum tends to a Gaussian distribution (by the way, the characteristic width of the bell grows like $\sqrt n$). Read more in Wikipedia: Central limit theorem. In real life there are a lot of values which are a sum of a large number of independent and identically distributed random variables, so these values have a Gaussian distribution.

Mean Value

By definition, the mean value of a random variable is the value we get in the limit if we perform more and more experiments and calculate the mean of the observed values. The mean value is denoted in different ways: mathematicians denote it by $E\xi$ (expectation), physicists denote it by $\overline{\xi}$ or $\langle\xi\rangle$. We will denote it as mathematicians do. For instance, the mean value of the Gaussian distribution $\rho(x)\sim e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ is equal to $\mu$.

Variance

For the Gaussian distribution, we clearly see that the random variable tends to fall within a certain region around its mean value $\mu$. Let us enjoy the Gaussian distribution once again:
(figure: the Gaussian density once again)

In the picture, one may see that the characteristic width of the region where values mostly fall is $\sigma$. How can we estimate this width for an arbitrary random variable? We can draw a graph of its probability density function and just visually evaluate the characteristic range. However, it would be better to choose a precise algebraic way for this evaluation. We may find the mean length of the deviation from the mean value: $E|\xi-E\xi|$. This value is a good estimation of the characteristic deviation of $\xi$. However, we know very well how problematic it is to use absolute values in formulas, so this formula is rarely used in practice. A simpler approach (simple from the calculation's point of view) is to calculate $E(\xi-E\xi)^2$. This value is called the variance and is denoted by $\sigma_\xi^2$. The square root of the variance is a good estimation of the random variable's characteristic deviation. It's called the standard deviation. For instance, one can compute that for the Gaussian distribution $\rho(x)\sim e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ the variance is equal to $\sigma^2$, so the standard deviation is $\sigma$. This result really corresponds to our geometrical intuition. In fact, a small cheat is hidden here: in the definition of the Gaussian distribution you see the number $2$ in the denominator of the expression $-\frac{(x-\mu)^2}{2\sigma^2}$. This $2$ stands there on purpose, so that the standard deviation $\sigma_\xi$ equals exactly $\sigma$. So the formula of the Gaussian distribution is written in a way that keeps in mind that one would compute its standard deviation.
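These statements are easy to check numerically; a small sketch with arbitrary parameters:

samples = normrnd(5, 2, 1, 1e6);  % a Gaussian random variable with mu = 5, sigma = 2
mean(samples)                     % close to the mean value 5
var(samples)                      % close to the variance sigma^2 = 4
std(samples)                      % close to the standard deviation sigma = 2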

Independent random variables

Random variables may or may not depend on each other. Imagine that you are throwing a needle on the floor and measuring the coordinates of both of its ends. These two coordinates are random variables, but they depend on each other, since the distance between them must always equal the length of the needle. Random variables are independent from each other if the outcomes of the first don't depend on the outcomes of the second. For two independent variables $\xi_1$ and $\xi_2$ the mean of their product is equal to the product of their means: $E(\xi_1\cdot\xi_2) = E\xi_1\cdot E\xi_2$

Proof
For instance, having blue eyes and finishing school with higher honors are independent random variables. Let's say that $20\% = 0.2$ of people have blue eyes and $5\%=0.05$ of people have higher honors. So there are $0.2\cdot 0.05 = 0.01 = 1\%$ of people with both blue eyes and higher honors. This example helps us to understand the following. For two independent random variables $\xi_1$ and $\xi_2$ which are given by their probability densities $\rho_1(x)$ and $\rho_2(y)$, the density of probability $\rho(x,y)$ (that the first variable falls at $x$ and the second at $y$) can be found by the formula $$\rho(x,y) = \rho_1(x)\cdot\rho_2(y)$$ It means that $$ \begin{array}{l} \displaystyle E(\xi_1\cdot\xi_2)=\int xy\rho(x,y)dxdy=\int xy\rho_1(x)\rho_2(y)dxdy=\\ \displaystyle \int x\rho_1(x)dx\int y\rho_2(y)dy=E\xi_1\cdot E\xi_2 \end{array} $$ As you see, the proof is done for random variables which have a continuous spectrum of values and are given by their probability density function. The proof is similar in the general case.
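This, too, is easy to check numerically; a sketch with two arbitrary independent Gaussian variables:

xi1 = normrnd(2, 1, 1, 1e6);  % two independent random variables
xi2 = normrnd(3, 1, 1, 1e6);
mean(xi1.*xi2)                % close to...
mean(xi1)*mean(xi2)           % ...E(xi1)*E(xi2) = 6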

Kalman filter

Problem statement

Let us denote by $x_k$ the value which we intend to measure and then filter. It can be a coordinate, velocity, acceleration, humidity, temperature, pressure, etc. Let us start with a simple example, which will lead us to the formulation of the general problem. Imagine that you have a radio-controlled toy car which can run only forward and backward. Knowing its mass, shape, and other parameters of the system, we have computed how the controlling joystick acts on the car's velocity $v_k$.

Then the coordinate of the car would obey the following formula $$x_{k+1}=x_k+v_kdt$$ In real life we can't have a precise formula for the coordinate, since small disturbances such as wind, bumps, and stones on the road act on the car, so the real speed of the car will differ from the calculated one. So we add a random variable $\xi_k$ to the right-hand side of the last equation: $$x_{k+1}=x_k+v_kdt+\xi_k$$

We also have a GPS sensor on the car which tries to measure the coordinate of the car $x_k$. Of course there is an error in this measurement, which is a random variable $\eta_k$. So from the sensor we get erroneous data: $$z_k=x_k+\eta_k$$

Our aim is to find a good estimate of the true coordinate $x_k$, knowing the erroneous sensor data $z_k$. We will denote this good estimate by $x^{opt}_k$. In general the coordinate $x_k$ may stand for any value (temperature, humidity, ...) and the control we denote by $u_k$ (in the example with the car, $u_k = v_k\cdot dt$). The equations for the coordinate and the sensor measurements are the following:
$$\begin{cases} x_{k+1} = x_k + u_k + \xi_k \\ z_k = x_k + \eta_k \end{cases} \quad\quad (1)$$
Let us discuss what we know in these equations.

  • $u_k$ is a known value which controls the evolution of the system. We know it from the model of the system.
  • The random variable $\xi$ represents the error in the model of the system, and the random variable $\eta$ is the sensor's error. Their distribution laws don't depend on time (on the iteration index $k$).
  • The means of the errors are equal to zero: $E\eta_k = E\xi_k = 0$.
  • We might not know the distribution laws of the random variables, but we do know their variances $\sigma_\xi^2$ and $\sigma_\eta^2$. Note that the variances don't depend on time (on $k$), since the corresponding distribution laws don't either.
  • We suppose that all random errors are independent from each other: the error at time $k$ doesn't depend on the error at time $k'$.

Note that the filtering problem is not a smoothing problem. Our aim is not to smooth the sensor data; we just want to get a value which is as close as possible to the real coordinate $x_k$.

Kalman algorithm

We will use induction. Imagine that at step $k$ we have already found the filtered value $x^{opt}_k$, which is a good estimate of the real coordinate $x_k$. Recall that we know the equation which controls the real coordinate: $$x_{k+1} = x_k + u_k + \xi_k$$

Therefore, before getting the sensor's value, we might state that it would show a value close to $x^{opt}_k+u_k$. Unfortunately, so far we can't say anything more precise. But at step $k+1$ we will have an imprecise reading from the sensor, $z_{k+1}$. The idea of Kalman is the following. To get the best estimate of the real coordinate $x_{k+1}$, we should find a middle ground between the reading of the imprecise sensor $z_{k+1}$ and our prediction $x^{opt}_k+u_k$ of what we expected to see on the sensor. We will give a weight $K$ to the sensor's value and $(1-K)$ to the predicted value: $$x^{opt}_{k+1} = K\cdot z_{k+1} + (1-K)\cdot(x_k^{opt}+u_k)$$ The coefficient $K$ is called the Kalman coefficient. It depends on the iteration index, and strictly speaking we should rather write $K_{k+1}$. But to keep the formulas in a nice shape we omit the index of $K$. We should choose the Kalman coefficient in such a way that the estimated coordinate $x^{opt}_{k+1}$ would be as close as possible to the real coordinate $x_{k+1}$. For instance, if we know that our sensor is super precise, then we would trust its reading and give it a big weight ($K$ close to one). If, conversely, the sensor is not precise at all, then we would rely on our theoretically predicted value $x^{opt}_k+u_k$. In the general situation, we should minimize the error of our estimate: $$e_{k+1} = x_{k+1}-x^{opt}_{k+1}$$ We use equations (1) to rewrite the equation for the error: $$e_{k+1} = (1-K)(e_k+\xi_k) - K\eta_{k+1}$$

Proof $$ \begin{array}{l} { e_{k+1}=x_{k+1}-x^{opt}_{k+1}=x_{k+1}-Kz_{k+1}-(1-K)(x^{opt}_k+u_k)=\\ =x_k+u_k+\xi_k-K(x_k+u_k+\xi_k+\eta_{k+1})-(1-K)(x^{opt}_k+u_k)=\\=(1-K)(x_k-x_k^{opt}+\xi_k)-K\eta_{k+1}=(1-K)(e_k+\xi_k)-K\eta_{k+1} } \end{array} $$

Now it is time to discuss what the expression "to minimize the error" actually means. We know that the error is a random variable, so each time it takes different values. Actually, there is no unique answer to that question. It was similar in the case of the variance of a random variable, when we were trying to estimate the characteristic width of its probability density function. So we choose a simple criterion. We minimize the mean of the square: $$E(e^2_{k+1})\rightarrow min$$ Let us rewrite the last expression: $$E(e^2_{k+1})=(1-K)^2(Ee_k^2+\sigma^2_\xi)+K^2\sigma^2_\eta$$

Key to the proof: From the fact that all random variables in the equation for $e_{k+1}$ don't depend on each other and that the mean values $E\eta_{k+1}=E\xi_k=0$, it follows that all cross terms in $E(e^2_{k+1})$ become zeros: $$E(\xi_k\eta_{k+1})=E(e_k\xi_k)=E(e_k\eta_{k+1})=0.$$ Indeed, for instance, $E(e_k\xi_k) = E(e_k)E(\xi_k)=0.$
Also note that the formulas for the variances look much simpler: $\sigma^2_\eta = E\eta^2_{k+1}$ and $\sigma^2_\xi = E\xi^2_k$ (since $E\eta_{k+1}=E\xi_k=0$).

The last expression takes its minimal value when its derivative is zero, i.e. when: $$\displaystyle K_{k+1} = \frac{Ee^2_k + \sigma^2_\xi}{Ee^2_k+\sigma^2_\xi+\sigma^2_\eta}$$ Here we write the Kalman coefficient with its subscript to emphasize the fact that it does depend on the iteration step. We substitute into the equation for the mean square error $E(e^2_{k+1})$ the Kalman coefficient $K_{k+1}$ which minimizes its value: $$\displaystyle E(e^2_{k+1}) = \frac{\sigma^2_\eta(Ee^2_k + \sigma^2_\xi)}{Ee^2_k+\sigma^2_\xi+\sigma^2_\eta}$$ So we have solved our problem. We got the iterative formula for computing the Kalman coefficient. All formulas in one place:
$$\boxed{\quad K_{k+1} = \frac{Ee^2_k + \sigma^2_\xi}{Ee^2_k+\sigma^2_\xi+\sigma^2_\eta}, \quad\quad x^{opt}_{k+1} = K_{k+1}\cdot z_{k+1} + (1-K_{k+1})\cdot(x^{opt}_k+u_k), \quad\quad E(e^2_{k+1}) = \frac{\sigma^2_\eta(Ee^2_k + \sigma^2_\xi)}{Ee^2_k+\sigma^2_\xi+\sigma^2_\eta} \quad}$$

Example

On the plot from the beginning of this article, there are filtered data from a fictional GPS sensor mounted on a fictional car which moves with constant acceleration $a$. $$x_{t+1} = x_t+ at\cdot dt+ \xi_t$$ Look at the filtered results once again: (figure: true coordinate, noisy GPS measurements, and the Kalman-filtered estimate)

The code in MATLAB:
clear all;
N = 100;       % number of samples
a = 0.1;       % acceleration
sigmaPsi = 1;  % standard deviation of the model noise xi
sigmaEta = 50; % standard deviation of the sensor noise eta
k = 1:N;
% simulate the true trajectory x and the noisy measurements z
x(1) = 0;
z(1) = x(1) + normrnd(0, sigmaEta);
for t = 1:(N-1)
  x(t+1) = x(t) + a*t + normrnd(0, sigmaPsi);
  z(t+1) = x(t+1) + normrnd(0, sigmaEta);
end
% kalman filter
xOpt(1) = z(1);
eOpt(1) = sigmaEta; % eOpt(t) is the square root of the mean squared error E(e_t^2).
                    % It's not a random variable.
for t = 1:(N-1)
  eOpt(t+1) = sqrt((sigmaEta^2)*(eOpt(t)^2 + sigmaPsi^2)/(sigmaEta^2 + eOpt(t)^2 + sigmaPsi^2));
  K(t+1) = (eOpt(t+1))^2/sigmaEta^2;  % Kalman coefficient K_{t+1} = E(e_{t+1}^2)/sigma_eta^2
  xOpt(t+1) = (xOpt(t) + a*t)*(1 - K(t+1)) + K(t+1)*z(t+1);
end
plot(k, xOpt, k, z, k, x) % filtered estimate, raw measurements, true coordinate

Analysis

If one looks at how the Kalman coefficient $K_k$ changes with the iteration $k$, one can see that it stabilizes to a certain value $K_{stab}$. For instance, if the root mean square errors of the sensor and the model relate to each other as ten to one, then the plot of the Kalman coefficient against the iteration step would be the following:
(figure: the Kalman coefficient $K_k$ as a function of the iteration step, stabilizing to $K_{stab}$)

In the next example we will discuss how this fact can simplify our life.

Second example

In practice it often happens that we know almost nothing about the physical model of what we are filtering. Imagine you have decided to filter measurements from your favourite accelerometer. You don't know in advance how the accelerometer will be moved. The only thing you might know is the variance of the sensor's error $\sigma^2_\eta$. In this difficult problem we might put all the unknown information from the physical model into the random variable $\xi_k$: $$x_{k+1} = x_k + \xi_k$$ Strictly speaking, this kind of system doesn't satisfy the conditions that we imposed on the random variable $\xi_k$, since it now holds the physics of the motion, which is unknown to us. We can't say that at different moments of time the errors are independent from each other and that their means are equal to zero. In other words, for this kind of situation the Kalman theory doesn't apply. But anyway, we can try to use the machinery of Kalman theory just by choosing some reasonable values for $\sigma_\xi^2$ and $\sigma_\eta^2$ to get a nice graph of filtered data. But there is a much simpler way. We saw that with increasing step $k$ the Kalman coefficient always stabilizes to a certain value $K_{stab}$. So instead of guessing the values of the coefficients $\sigma^2_\xi$ and $\sigma^2_\eta$ and computing the Kalman coefficient $K_k$ by difficult formulas, we can assume that this coefficient is constant and select just this constant. This assumption doesn't affect the filtering results much: firstly, the Kalman machinery is not exactly applicable to our problem anyway, and secondly, the Kalman coefficient quickly stabilizes to a constant. In the end everything becomes very simple. We need almost no formulas from Kalman theory; we just need to select a reasonable value $K_{stab}$ and insert it into the iterative formula $$x^{opt}_{k+1} = K_{stab}\cdot z_{k+1}+(1-K_{stab})\cdot x_k^{opt}$$ On the next graph you can see measurements from an imaginary sensor filtered in two different ways. The first method is the honest one, with all the formulas from Kalman theory. The second method is the simplified one.
(figure: measurements filtered by the exact Kalman formulas and by the constant-$K_{stab}$ simplification; the two curves almost coincide)

We see that there is not a big difference between these two methods. There is a small variation at the beginning, while the Kalman coefficient has not yet stabilized.
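For comparison, the simplified filter takes a single line inside the loop; a sketch reusing the measurements z from the script above purely for illustration, with an assumed constant gain:

Kstab = 0.15;       % an assumed stabilized gain, chosen by hand
xSimple(1) = z(1);
for t = 1:(N-1)
  xSimple(t+1) = Kstab*z(t+1) + (1-Kstab)*xSimple(t);
end
plot(k, xOpt, k, xSimple, k, z)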

Discussion

We have seen that the main idea of the Kalman filter is to choose the coefficient $K$ in such a way that the filtered value $$x^{opt}_{k+1}= Kz_{k+1}+(1-K)(x^{opt}_k+u_k)$$ on average would be as close as possible to the real coordinate $x_{k+1}$. We see that the filtered value $x^{opt}_{k+1}$ is a linear function of the sensor's measurement $z_{k+1}$ and the previous filtered value $x^{opt}_k$. But the previous filtered value $x^{opt}_k$ is itself a linear function of the sensor's measurement $z_k$ and the pre-previous filtered value $x^{opt}_{k-1}$. And so on until the end of the chain. So the filtered value depends linearly on all previous sensor readings: $$x^{opt}_{k+1}= \lambda + \lambda_0z_0 + \ldots + \lambda_{k+1}z_{k+1}$$ That is the reason the Kalman filter is called a linear filter. It is possible to prove that the Kalman filter is the best of all linear filters, in the sense that it minimizes the mean square of the error.

Multidimensional case

It is possible to generalize all of the Kalman theory to the multidimensional case. The formulas there look a bit more elaborate, but the idea of their derivation is the same as in one dimension. For instance, in this nice video you can see an example.

Literature

The original article written by Kalman can be downloaded here: http://www.cs.unc.edu/~welch/kalman/media/pdf/Kalman1960.pdf