In the field of state estimation (which includes the Kalman Filter), an estimator is simply a tool or a rule—often a mathematical formula—that uses incomplete, noisy data (measurements) to calculate a "best guess" for an unknown quantity (the true state).
The need for estimators arises because in the real world, we rarely know the true state of a system (e.g., a robot's exact location, the precise temperature of an engine) due to two unavoidable problems:
Noise: All sensor readings contain errors or random fluctuations.
Incompleteness: We often only measure parts of the system or measure them infrequently.
The goal of an estimator is to process these noisy measurements and produce a single value that is as close as possible to the true, unobservable value.
A statistic is a random variable $S$ that is a function of the data $\mathcal{D}$, i.e. $S=f(\mathcal{D})$.
An estimator is a statistic intended to approximate a parameter governing the distribution of the data $\mathcal{D}$.
Notation: $\hat \theta$ is an estimator of $\theta$
Let's imagine you are trying to find the true, stable position $\mu_{\text{true}}$ of a stationary object (e.g., a buoy bobbing slightly in calm water). Each measurement of this position is a continuous random variable that we assume follows a 1D Normal (Gaussian) distribution centered on $\mu_{\text{true}}$.
Because of sensor noise, every measurement $z_i$ will be slightly different from $\mu_{\text{true}}$. You have collected the following four measurements:
$$Z = \{-2.1, -1.9, -2.4, -2.0\}$$
How can we get the best estimate for the true position ($\mu_{true}$)?
Here, each $z_i$ is a noisy measurement of the unknown true position $\mu_{\text{true}}$.
The simplest and most common estimator for the true underlying mean ($\mu$) of a set of noisy measurements is the Sample Mean ($\bar{z}$).
This choice is not arbitrary; for a Gaussian (Normal) distribution, the sample mean is mathematically equivalent to the Maximum Likelihood Estimator (MLE), see below.
Maximum Likelihood Estimator (MLE) Note: The MLE is a powerful estimation technique that asks: "Given the data we observed, what value of the unknown parameter ($\mu$) makes this data most probable?" When the measurement noise is assumed to be Gaussian (which is standard for the Kalman Filter), the true mean that maximizes the probability of observing the specific data collected is precisely the Sample Mean.
The rule (or formula) for this estimator is to sum all the measurements and divide by the number of measurements taken ($N$):
$$\text{Estimator: } \hat x = \bar{z} = \frac{1}{N} \sum_{i=1}^{N} z_i$$
where the hat over the $x$ indicates an estimator.
Using the four measurements from our buoy example ($N=4$):
Sum the Measurements: $$\sum z_i = (-2.1) + (-1.9) + (-2.4) + (-2.0) = -8.4$$
Divide by $N$:
$$\bar{z} = \frac{-8.4}{4} = -2.1$$
Our estimator concludes that the best estimate $\hat x$ for the buoy's true position $\mu_{\text{true}}$ is $-2.1$. This simple example illustrates the core function of all point estimators: they are rules that process noisy input data to produce the single most likely value for an unknown state.
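The worked example above can be reproduced in a few lines; this is a minimal sketch using the four measurements from the text.

```python
# Sample-mean estimator for the buoy measurements from the example.
measurements = [-2.1, -1.9, -2.4, -2.0]

def sample_mean(z):
    """Point estimator: average of the noisy measurements."""
    return sum(z) / len(z)

x_hat = sample_mean(measurements)
print(x_hat)  # ≈ -2.1
```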
To estimate the state of a robot or the parameters of a model, we use Bayes' Rule to relate our observations to our hypotheses:
The relationship is defined as:
$$ p(\theta | \mathcal D) = \frac{p(\mathcal D | \theta) p(\theta)}{ p(\mathcal D)} $$
The components of this equation are the posterior $p(\theta \mid \mathcal D)$, the likelihood $p(\mathcal D \mid \theta)$, the prior $p(\theta)$, and the evidence $p(\mathcal D)$.
In prose, this is often summarized as:
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
The Likelihood Function
The likelihood function is viewed as a function of the parameter $\theta$, given a fixed set of observed data $\mathcal{D}$:
$$ \mathcal L (\theta) = p(\mathcal D | \theta) $$
"Never say 'the likelihood of the data'. Always say 'the likelihood of the parameters'. The likelihood function is not a probability distribution." (from D. MacKay: Information Theory, Inference and Learning Algorithms, Page 29, Cambride 2003, http://www.inference.phy.cam.ac.uk/itila/book.html)
Key Distinction: While $p(\mathcal{D} \mid \theta)$ is a probability distribution over the data (summing to 1 over $\mathcal{D}$), $\mathcal{L}(\theta)$ is a function over the parameter space and does not necessarily integrate to 1.
The Maximum Likelihood Estimator (MLE) is the value of $\theta$ that makes the observed data most probable:
$$\hat{\theta}_{ML} := \arg\max_{\theta} \mathcal{L}(\theta)$$
In practice, particularly in robotics and deep learning, we often minimize the Negative Log-Likelihood (NLL). Because the logarithm is a monotonic function, the location of the maximum remains the same, but products transform into sums, which are numerically more stable and easier to differentiate:
$$\hat{\theta}_{ML} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\min_{\theta} \left( -\log \mathcal{L}(\theta) \right)$$
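As a quick numerical illustration of this equivalence, the sketch below grid-searches a toy Gaussian model (assumed data and $\sigma = 1$) and checks that maximizing $\mathcal{L}(\theta)$ and minimizing the NLL select the same parameter value.

```python
import math

# Toy check: the maximizer of the likelihood and the minimizer of the
# negative log-likelihood coincide (Gaussian model with sigma = 1).
data = [1.2, 0.8, 1.1]                      # assumed toy measurements
thetas = [i / 100 for i in range(1, 300)]   # candidate parameter values

def likelihood(theta):
    # Product of individual Gaussian densities.
    return math.prod(
        math.exp(-(z - theta) ** 2 / 2) / math.sqrt(2 * math.pi) for z in data
    )

def nll(theta):
    # Sum of per-sample negative log-densities (products became sums).
    return sum(
        (z - theta) ** 2 / 2 + 0.5 * math.log(2 * math.pi) for z in data
    )

best_ml = max(thetas, key=likelihood)
best_nll = min(thetas, key=nll)
assert best_ml == best_nll  # same location of the optimum
```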
Table for the technical terms:
| Term | Notation | Meaning in Robotics |
|---|---|---|
| Likelihood | $p(\mathcal{D} \mid \theta)$ | How well does the state $\theta$ explain the sensor data? |
| Prior | $p(\theta)$ | What did we believe about the state before the measurement? |
| Posterior | $p(\theta \mid \mathcal{D})$ | Our updated belief after incorporating new data. |
| MLE | $\hat{\theta}_{ML}$ | The "best guess" for the state based purely on the data. |
Context: Imagine you are testing a robot's gripper success rate by performing $n$ trials. In each trial, the gripper either succeeds or fails (a binary outcome, like a coin toss). The probability of success is $\theta$.
Given: In $n$ independent trials, we observe exactly $k$ successes. The probability of this observation is given by the Binomial Distribution, which serves as our Likelihood Function $\mathcal L(\theta)$:
$$\mathcal L(\theta) = P(k | n, \theta) = \binom{n}{k} \theta^k (1-\theta)^{n-k}$$
We want to find the Maximum Likelihood Estimator (MLE), denoted as $\hat{\theta}_{ML}$, that maximizes $\mathcal L(\theta)$. To simplify the math, we minimize the negative log-likelihood:
$$\hat{\theta}_{ML} = \arg\min_\theta \left[ - \log \mathcal L(\theta) \right] = \arg\min_\theta \left[ - \log \left( \binom{n}{k} \theta^k (1-\theta)^{n-k} \right) \right]$$
Using logarithm rules to simplify:
$$- \log \mathcal L(\theta) = -\log \binom{n}{k} - k \log \theta - (n-k) \log (1-\theta)$$
Necessary condition for a minimum: We take the derivative with respect to $\theta$ and set it to zero (note that the constant term $\binom{n}{k}$ disappears):
$$0 = \frac{d}{d\theta} \left[ -k \log \theta - (n-k) \log (1-\theta) \right]$$
$$0 = -\frac{k}{\theta} + \frac{n-k}{1-\theta}$$
Solving for $\theta$, multiply both sides by $\theta(1-\theta)$:
$$k(1-\theta) = (n-k) \theta$$
$$k - k\theta = n\theta - k\theta$$
$$k = n\theta$$
$$\hat{\theta}_{ML} = \frac{k}{n}$$
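The closed-form result $\hat{\theta}_{ML} = k/n$ can be sanity-checked numerically; the trial counts below ($n = 20$, $k = 14$) are hypothetical.

```python
import math

# Numerical check of theta_hat = k/n for the gripper example.
n, k = 20, 14  # hypothetical trial counts

def neg_log_likelihood(theta):
    # -log L(theta) for the binomial model.
    return -(math.log(math.comb(n, k))
             + k * math.log(theta)
             + (n - k) * math.log(1 - theta))

thetas = [i / 1000 for i in range(1, 1000)]  # grid over (0, 1)
theta_hat = min(thetas, key=neg_log_likelihood)
print(theta_hat)  # 0.7, i.e. k / n
```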
The Maximum Likelihood Estimator (MLE) is a foundational concept in estimation theory. It provides a formal, statistical justification for selecting the Sample Mean as the "best guess" for the true position ($\mu$).
The Core Idea: Maximizing Probability
The MLE seeks the value of the unknown parameter ($\mu$) that maximizes the likelihood function, which is defined as the joint probability of observing all your collected data ($Z$), given a specific value of $\mu$.
$$\hat{\mu}_{\text{MLE}} = \underset{\mu}{\operatorname{argmax}} \left[ P(Z \mid \mu) \right]$$
MLE for Gaussian Noise
In our 1D example, we make the standard assumption that the measurement noise ($v_i$) is independent and normally distributed (Gaussian):
$$z_i = \mu_{\text{true}} + v_i, \quad \text{where } v_i \sim \mathcal{N}(0, \sigma^2)$$
Since the measurements are independent, the joint likelihood function $P(Z \mid \mu)$ is the product of the individual probability density functions (PDFs) for each measurement $z_i$:
$$P(Z \mid \mu) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(z_i - \mu)^2}{2\sigma^2} \right)$$
To find the $\mu$ that maximizes this product, we typically maximize the log-likelihood function (which simplifies the math by replacing multiplication with summation):
$$\mathcal{L}(\mu) = \sum_{i=1}^N \ln\left[ \frac{1}{\sqrt{2\pi\sigma^2}} \right] - \sum_{i=1}^N \frac{(z_i - \mu)^2}{2\sigma^2}$$
Note that maximizing $\mathcal{L}(\mu)$ is equivalent to minimizing the sum of the squared errors—the term $\sum_{i=1}^N (z_i - \mu)^2$.
The Equivalence to Sample Mean
To find the maximum, we take the derivative of the log-likelihood function $\mathcal{L}(\mu)$ with respect to $\mu$ and set it to zero.
$$ \frac{\partial \mathcal{L}(\mu)}{\partial \mu} = \sum_{i=1}^N \frac{(z_i - \mu)}{\sigma^2} = \frac{1}{\sigma^2} \left( \sum_{i=1}^N z_i - N \mu\right) $$
Solving $\frac{\partial \mathcal{L}(\mu)}{\partial \mu} = 0$ yields the result:
$$\hat{\mu}_{\text{MLE}} = \frac{1}{N} \sum_{i=1}^{N} z_i = \bar{z}$$
Conclusion:
For a set of independent, Gaussian-distributed measurements, the Sample Mean ($\bar{z}$) is the single value that makes the observed data set most probable, thus providing the mathematically optimal estimate under the Maximum Likelihood criterion.
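The least-squares view from the derivation can be verified directly: a minimal sketch that grid-searches the sum of squared errors over the buoy data and confirms the minimizer is the sample mean.

```python
# Check that the sample mean minimizes the sum of squared errors,
# which is equivalent to the Gaussian MLE derived above.
measurements = [-2.1, -1.9, -2.4, -2.0]

def sse(mu):
    """Sum of squared errors between the data and a candidate mean."""
    return sum((z - mu) ** 2 for z in measurements)

candidates = [i / 1000 for i in range(-3000, 0)]  # grid from -3.0 to 0
mu_hat = min(candidates, key=sse)
mean = sum(measurements) / len(measurements)
assert abs(mu_hat - mean) < 1e-3  # minimizer coincides with z_bar
```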
Two crucial properties determine the quality of any estimator: Bias and Variance. The goal of powerful filters like the Kalman Filter is to find an estimate that is both unbiased and has minimal variance.
Bias measures whether the estimator, on average, correctly hits the true value.
Definition: An estimator $\hat{\theta}$ is unbiased if the expected value of the estimate is equal to the true, unknown parameter $\theta$. The bias is the difference between these two values:
$$\text{Bias}(\hat \theta) := \mathbb{E}[\hat \theta] - \theta$$
The Expectation ($\mathbb{E}[\hat \theta]$) is taken over all possible data sets (measurements) that could have been collected.
An estimator is unbiased if the bias is zero:
$$\text{Bias}(\hat \theta) = 0 \quad \text{or equivalently} \quad \mathbb{E}[\hat \theta] = \theta$$
The sample mean estimator ($\bar{z}$) is an unbiased estimator for the true population mean ($\mu_{\text{true}}$).
The Population Mean ($\mu_{\text{true}}$) represents the true, fixed, underlying value of the parameter we are trying to estimate. It is the theoretical mean of the entire population or process being measured (e.g., the exact center of the buoy's stable position). It is a constant value that is unknown to us.
Meaning: If you were to hypothetically repeat the four-measurement experiment an infinite number of times, the average of all your calculated $\bar{z}$ values would exactly equal $\mu_{\text{true}}$. The estimator doesn't systematically over- or under-estimate the true position.
Result: The sample mean is a good choice for position estimation because its long-run average is accurate.
Variance measures the spread or uncertainty of the estimator's results. It tells you how much the estimator's result varies if you repeat the experiment (take new measurements). It is a measure of the estimator's precision or consistency.
Definition: The variance of an estimator $\hat{\theta}$ is the expected squared difference between the estimate and its own mean (expected value $\mathbb{E}[\hat{\theta}]$):
$$\text{Var}(\hat \theta) := \mathbb{E}[(\hat \theta - \mathbb{E}[\hat \theta])^2]$$
Note the equivalent form: $$\text{Var}(\hat \theta) = \mathbb{E}[\hat \theta^2] - (\mathbb{E}[\hat \theta])^2$$
Proof of equivalence:
$$\begin{aligned} \text{Var}(\hat \theta) &= \mathbb{E}[(\hat \theta - \mathbb{E}[\hat \theta])^2] \\ &= \mathbb{E}[\hat \theta^2 - 2 \hat \theta \mathbb{E}[\hat \theta] + (\mathbb{E}[\hat \theta])^2] \\ &= \mathbb{E}[\hat \theta^2] - \mathbb{E}[2 \hat \theta \mathbb{E}[\hat \theta]] + \mathbb{E}[(\mathbb{E}[\hat \theta])^2] \\ &= \mathbb{E}[\hat \theta^2] - 2 \mathbb{E}[\hat \theta] \mathbb{E}[\hat \theta] + (\mathbb{E}[\hat \theta])^2 \\ &= \mathbb{E}[\hat \theta^2] - 2 (\mathbb{E}[\hat \theta])^2 + (\mathbb{E}[\hat \theta])^2 \\ &= \mathbb{E}[\hat \theta^2] - (\mathbb{E}[\hat \theta])^2 \end{aligned}$$
Since your individual measurements are noisy (modeled by a Gaussian with variance $\sigma^2_{\text{sensor}}$), the estimated mean $\bar{z}$ also has a variance.
The variance of the sample mean ($\sigma^2_{\bar{z}}$) is given by the formula:
$$\sigma^2_{\bar{z}} = \frac{\sigma^2_{\text{sensor}}}{N}$$
where $\sigma^2_{\text{sensor}}$ is the variance of a single measurement and $N$ is the number of measurements.
Derivation of $\text{Var}(\bar{z}) = \frac{\sigma^2_{\text{sensor}}}{N}$
Assumption: The measurements $z_i$ are independent, identically distributed random variables, each with the same true mean ($\mu$) and variance ($\sigma^2_{\text{sensor}}$).
Start with the definition of the Sample Mean estimator ($\bar{z}$):
$$\bar{z} = \frac{1}{N} \sum_{i=1}^{N} z_i$$
Apply the Variance Operator to the estimator $\bar{z}$:
$$\text{Var}(\bar{z}) = \text{Var} \left( \frac{1}{N} \sum_{i=1}^{N} z_i \right)$$
Use the property $\text{Var}(c X) = c^2 \text{Var}(X)$: The constant $\frac{1}{N}$ comes out squared:
$$\text{Var}(\bar{z}) = \left( \frac{1}{N} \right)^2 \text{Var} \left( \sum_{i=1}^{N} z_i \right)$$
Use the property that the Variance of a sum of independent random variables is the sum of their variances. Since the measurements $z_i$ are independent (e.g., noise at one time step does not affect the next), we have:
$$\text{Var} \left( \sum_{i=1}^{N} z_i \right) = \sum_{i=1}^{N} \text{Var}(z_i)$$
Substitute this back into the equation:
$$\text{Var}(\bar{z}) = \frac{1}{N^2} \sum_{i=1}^{N} \text{Var}(z_i)$$
Use the property $\text{Var}(z_i) = \sigma^2_{\text{sensor}}$ for all measurements. Since all measurements come from the same sensor, the variance of each $z_i$ is $\sigma^2_{\text{sensor}}$. The summation therefore adds $\sigma^2_{\text{sensor}}$ a total of $N$ times:
$$\text{Var}(\bar{z}) = \frac{1}{N^2} \left( N \cdot \sigma^2_{\text{sensor}} \right) = \frac{\sigma^2_{\text{sensor}}}{N}$$
This relationship demonstrates a fundamental principle in estimation: taking more measurements reduces the estimate's uncertainty.
The Kalman Filter leverages this principle recursively: every time it incorporates a new measurement ($z_t$), it performs an operation that is mathematically equivalent to reducing the variance of the overall state estimate.
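The $\sigma^2_{\text{sensor}}/N$ relationship is easy to confirm by simulation; this sketch assumes hypothetical values $\mu_{\text{true}} = -2.1$, $\sigma = 0.2$, $N = 4$ and repeats the four-measurement experiment many times.

```python
import random
import statistics

# Monte Carlo check of Var(z_bar) = sigma^2 / N: simulate many
# repetitions of an N-measurement experiment and compare variances.
random.seed(0)
mu_true, sigma, N = -2.1, 0.2, 4   # assumed values for the simulation
trials = 20_000

means = [
    statistics.fmean(random.gauss(mu_true, sigma) for _ in range(N))
    for _ in range(trials)
]
empirical = statistics.pvariance(means)
predicted = sigma ** 2 / N
print(empirical, predicted)  # both ≈ 0.01
```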
The primary goal of statistical estimation is not just to find the best guess ($\hat{\mu}$), but also to rigorously quantify the confidence in that guess. In practice, our final measurement result must include an uncertainty term.
The best estimate for the true state ($\mu$) is reported as the Sample Mean ($\bar{z}$), and the corresponding uncertainty is calculated using the square root of the estimated variance of the mean.
Calculating the Final Uncertainty
$$\text{Uncertainty} = \text{SEM} = \sqrt{\hat{\sigma}^2_{\bar{z}}} = \sqrt{\frac{\hat{\sigma}^2_{\text{unbiased}}}{N}}$$
The Practical Measurement Report
In scientific and engineering fields, the result of combining $N$ noisy measurements is reported in the standard format: $$\text{Measurement} = (\text{Best Estimate} \pm \text{Uncertainty}) \text{ units}$$
For the buoy example:
$$\text{Buoy Position} = (\bar{z} \pm \text{SEM}) \text{ meters}$$
Important: "In scientific reporting and state estimation, the goal is not merely to calculate a single 'best guess' (the mean). It is mandatory to provide a measure of uncertainty (such as the Standard Error of the Mean). Without an uncertainty value, the estimate lacks context, making it impossible for others to judge its precision, compare it against theoretical models, or use it effectively in subsequent calculations."
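Putting the reporting format into practice, here is a minimal sketch for the buoy data, using the unbiased sample variance (Bessel's correction) in the SEM.

```python
import math

# Report "best estimate ± uncertainty" for the buoy measurements.
z = [-2.1, -1.9, -2.4, -2.0]
N = len(z)
z_bar = sum(z) / N
# Unbiased sample variance (divide by N - 1, Bessel's correction).
var_unbiased = sum((zi - z_bar) ** 2 for zi in z) / (N - 1)
# Standard Error of the Mean: sqrt of the estimated variance of z_bar.
sem = math.sqrt(var_unbiased / N)
print(f"Buoy Position = ({z_bar:.2f} ± {sem:.2f}) meters")
```

For this data set it prints `Buoy Position = (-2.10 ± 0.11) meters`.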
Error versus uncertainty
It is important not to confuse the terms error and uncertainty: the error is the (generally unknowable) difference between a measured or estimated value and the true value, whereas the uncertainty is our quantified statement of how large that error is likely to be (e.g., the SEM).
When the true sensor variance ($\sigma^2_{\text{sensor}}$) is unknown, we must estimate it from the $N$ collected measurements ($Z$). This requires choosing between a biased and an unbiased estimator.
1. The Biased Estimator (MLE for Variance)
The Maximum Likelihood Estimator (MLE) for variance, often denoted $\hat{\sigma}^2_{\text{MLE}}$, is calculated by dividing the sum of squared errors by the number of samples, $N$:
$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})^2$$
This estimator systematically underestimates the true variance, because the deviations are measured from the sample mean $\bar{z}$, which is itself fitted to the data, rather than from the true mean $\mu$.
2. The Unbiased Estimator (Sample Variance)
To correct for this underestimation, the unbiased estimator for variance, $\hat{\sigma}^2_{\text{unbiased}}$, divides the sum of squared errors by $N-1$ instead of $N$. This correction is known as Bessel's Correction.
$$\hat{\sigma}^2_{\text{unbiased}} = \frac{1}{N-1} \sum_{i=1}^{N} (z_i - \bar{z})^2$$
Conclusion for Estimation
In estimation theory and practical filtering (like the initial tuning of the Kalman Filter), the unbiased sample variance ($\hat{\sigma}^2_{\text{unbiased}}$ with $N-1$) is almost always preferred for determining the sensor's measurement uncertainty ($\sigma^2_{\text{sensor}}$), as it provides a more statistically robust value for the sensor's reliability.
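The underestimation by the $1/N$ estimator can be seen empirically; this sketch assumes a true variance of $1.0$ and hypothetical small samples of $N = 4$.

```python
import random
import statistics

# Empirically compare the biased (1/N) and unbiased (1/(N-1))
# variance estimators on many repeated small samples.
random.seed(1)
N, trials = 4, 50_000  # assumed sample size and repetition count

biased, unbiased = [], []
for _ in range(trials):
    z = [random.gauss(0.0, 1.0) for _ in range(N)]  # true variance 1.0
    m = sum(z) / N
    ss = sum((zi - m) ** 2 for zi in z)             # sum of squared errors
    biased.append(ss / N)
    unbiased.append(ss / (N - 1))

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)
print(mean_biased)    # ≈ 0.75: underestimates by the factor (N-1)/N
print(mean_unbiased)  # ≈ 1.0:  unbiased
```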
The Mean Squared Error (MSE) is the single, combined metric used to judge an estimator's overall quality, balancing the trade-off between bias and variance.
Minimum Mean Square Error (MMSE): The ultimate goal of many optimal estimators is to minimize the MSE.
Definition
The MSE of an estimator $\hat{\theta}$ is the expected squared difference between the estimate and the true, unknown parameter $\theta$:
$$\text{MSE}(\hat \theta) := \mathbb{E}[(\hat \theta - \theta)^2]$$
It measures the average magnitude of the error produced by the estimator. Note that, unlike Bias, the MSE will always be non-negative.
The Bias-Variance Decomposition
A crucial identity in estimation theory is that the MSE can always be decomposed into the sum of the estimator's variance and the square of its bias. This identity makes the MSE the perfect metric for balancing these two properties.
$$\begin{aligned} \text{MSE}(\hat \theta) &:= \mathbb{E}[(\hat \theta - \theta)^2] \\ &= \mathbb{E}[\theta^2] - 2 \mathbb{E}[\hat \theta \theta] + \mathbb{E}[\hat \theta^2] \\ &= \theta^2 - 2 \theta \mathbb{E}[\hat \theta ] + \mathbb{E}[\hat \theta^2] + \mathbb{E}[\hat \theta]^2 - \mathbb{E}[\hat \theta]^2 \\ &= \left(\theta^2 - 2 \theta \mathbb{E}[\hat \theta ]+ \mathbb{E}[\hat \theta]^2 \right) + \left( \mathbb{E}[\hat \theta^2] - \mathbb{E}[\hat \theta]^2 \right) \\ &= \left( \mathbb{E}[\hat \theta] - \theta \right)^2 + \text{Var}(\hat \theta) \\ &= \left(\text{Bias}(\hat \theta)\right)^2 + \text{Var}(\hat \theta) \end{aligned}$$
Conclusion
This decomposition is the formal expression of the Bias-Variance Tradeoff: to minimize the total error (MSE), a good estimator must strive to minimize both the variance and the bias. The optimal estimator (MMSE) in the Kalman Filter framework is the one that achieves the lowest possible sum of these two components.
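The decomposition can be checked numerically on a deliberately biased estimator; the sketch below uses a hypothetical "shrinkage" estimator $\hat{\theta} = 0.8\,\bar{z}$ with assumed values $\theta = 2.0$, $\sigma = 1.0$, $N = 5$.

```python
import random
import statistics

# Monte Carlo check of MSE = Bias^2 + Var for a deliberately biased
# shrinkage estimator theta_hat = 0.8 * z_bar.
random.seed(2)
theta_true, sigma, N, trials = 2.0, 1.0, 5, 100_000  # assumed setup

estimates = []
for _ in range(trials):
    z_bar = statistics.fmean(random.gauss(theta_true, sigma) for _ in range(N))
    estimates.append(0.8 * z_bar)  # shrinkage introduces bias

mse = statistics.fmean((e - theta_true) ** 2 for e in estimates)
bias = statistics.fmean(estimates) - theta_true   # ≈ 0.8*2.0 - 2.0 = -0.4
variance = statistics.pvariance(estimates)
assert abs(mse - (bias ** 2 + variance)) < 1e-6   # decomposition holds
```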
While the Maximum Likelihood Estimator (MLE) only considers the data collected, the Maximum A Posteriori (MAP) Estimator incorporates prior knowledge or a previous belief about the unknown parameter $\theta$.
The MAP estimator seeks to find the value of $\theta$ that maximizes the posterior probability, i.e. the probability of the parameter $\theta$ being true, given the data ($\mathcal{D}$) we just observed:
$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\operatorname{argmax}} \left[ P(\theta \mid \mathcal{D}) \right]$$
This is governed by Bayes' Theorem:
$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) P(\theta)}{P(\mathcal{D})}$$
The numerator combines two critical components: the Likelihood $P(\mathcal{D} \mid \theta)$, which measures how well $\theta$ explains the data, and the Prior $P(\theta)$, which encodes our belief about $\theta$ before seeing the data.
The denominator is $P(\mathcal{D})$, also known as the Evidence or Marginal Likelihood. It is irrelevant for finding the MAP estimate because it has no dependency on $\theta$.
Since $P(\mathcal{D})$ is a constant factor that simply ensures the posterior distribution integrates to one, it does not change the location of the peak of the posterior curve.
Therefore, the posterior is proportional to the numerator:
$$P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta)$$
To find the maximum of the posterior (the MAP estimate), we only need to maximize the numerator:
$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\operatorname{argmax}} \left[ P(\mathcal{D} \mid \theta) P(\theta) \right]$$
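A minimal grid-search sketch of this idea, with assumed numbers: a Gaussian prior ($\mu_0 = 0$, $\sigma_0 = 1$) and a Gaussian likelihood for a single measurement $z = 2$ with $\sigma = 1$. The evidence $P(\mathcal D)$ is omitted, since it does not depend on $\theta$.

```python
import math

# MAP by maximizing likelihood * prior (unnormalized posterior).
z, sigma = 2.0, 1.0        # assumed measurement and sensor noise
mu0, sigma0 = 0.0, 1.0     # assumed Gaussian prior

def unnormalized_posterior(theta):
    likelihood = math.exp(-(z - theta) ** 2 / (2 * sigma ** 2))
    prior = math.exp(-(theta - mu0) ** 2 / (2 * sigma0 ** 2))
    return likelihood * prior  # evidence P(D) dropped: constant in theta

thetas = [i / 1000 for i in range(-1000, 3001)]
theta_map = max(thetas, key=unnormalized_posterior)
print(theta_map)  # 1.0: pulled from the measurement 2.0 toward the prior 0.0
```

Note how the prior "pulls" the estimate away from the raw measurement; with equal variances the MAP lands exactly halfway between prior mean and measurement.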
The MAP estimate ($\hat{\theta}_{\text{MAP}}$) provides only the peak (the mode) of the posterior probability. The Full Bayesian Estimator is the entire Posterior Probability Distribution $P(\theta \mid \mathcal{D})$ itself.
The full posterior is superior to any single point estimate because it retains all the information about our belief in $\theta$, including the central tendency and the uncertainty.
The most useful scenario for engineering and filtering is when the Prior and the Likelihood are both modeled as Gaussian distributions.
In estimation theory, when the prior and likelihood belong to the same family of distributions (called conjugate priors), the resulting Posterior also belongs to that family. Since the Gaussian distribution is self-conjugate, the fusion of two Gaussians always results in a new Gaussian.
The Full Bayesian Estimate calculates both the mean and the variance of this resulting posterior Gaussian. The new mean $\mu_{\text{post}}$ represents the most probable value, and the new variance $\sigma_{\text{post}}^2$ represents the reduced uncertainty.
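The fusion of two 1D Gaussians described above has a simple closed form; a sketch (the weighting factor `k` plays the role of the scalar Kalman gain):

```python
# Closed-form fusion of two 1D Gaussians (prior x likelihood):
# the resulting posterior is again Gaussian.
def fuse_gaussians(mu_prior, var_prior, mu_meas, var_meas):
    """Return (mu_post, var_post) of the product of two Gaussians."""
    k = var_prior / (var_prior + var_meas)   # weighting (Kalman gain)
    mu_post = mu_prior + k * (mu_meas - mu_prior)
    var_post = (1 - k) * var_prior           # uncertainty shrinks
    return mu_post, var_post

mu_post, var_post = fuse_gaussians(0.0, 1.0, 2.0, 1.0)
print(mu_post, var_post)  # 1.0 0.5
```

Note that the posterior variance (0.5) is smaller than either input variance (1.0): incorporating a measurement always reduces uncertainty.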
The Full Bayesian Connection to the Kalman Filter
The Kalman Filter (KF) is precisely the recursive algorithm that calculates the parameters of this Full Bayesian Posterior Distribution at every time step.
| Bayesian Component | Kalman Filter Result | Parameter of the Posterior Gaussian |
|---|---|---|
| Best Guess | $\hat{\mathbf{x}}_k^+$ (Updated Estimate) | The Mean ($\mu$) of the Posterior. |
| Uncertainty | $\mathbf{P}_k^+$ (Updated Covariance) | The Covariance ($\Sigma$, or the variance $\sigma^2$ in the scalar case) of the Posterior. |
The mathematical operations within the KF are the specialized, closed-form equations for multiplying two Gaussian probability density functions (the Prior and the Likelihood) and finding the resulting Gaussian's parameters.
Since the Mean, Mode (MAP estimate), and Median of a Gaussian are all identical, the Kalman Filter's single output ($\hat{\mathbf{x}}_k^+$) simultaneously achieves three forms of optimality: it is the MMSE estimate (the posterior mean), the MAP estimate (the posterior mode), and the minimizer of expected absolute error (the posterior median).