%matplotlib inline
import numpy as np
from ipynb.fs.defs.plot_Bayes_Filter import plot_markov_chain, plot_HMM, plot_locations, plot_updates
The challenge in fields like robotics and control is to continuously determine the true state of a system (e.g., the exact location of a robot, $X_t$) despite two major sources of uncertainty: noise in the system's own motion (captured by the Motion Model) and noise in its sensor readings (captured by the Measurement Model).
The Bayes Filter is the general probabilistic algorithm designed to solve this challenge. It provides the mathematical framework for recursively fusing uncertain information from the Motion Model and the Measurement Model to produce the best possible estimate of the system's state at any given time.
The Recursive Cycle
The Bayes Filter operates in a continuous, two-step cycle: a Prediction step, which propagates the belief forward through the motion model, and an Update step, which corrects that belief with the latest measurement.
A Markov Chain is a mathematical system that experiences transitions from one state to another on a state space, with the property that the probability of the next state depends only on the current state and not on the sequence of events that preceded it. This "memoryless" property is its defining characteristic.
Imagine a sequence of states, $X_1, X_2, X_3, \dots$, where $X_t$ represents the state of the system at time $t$. The chain describes how the system moves from one state to the next over time.
The assumption states that the future state of the process is conditionally independent of the past states, given the present state.
Mathematically, this is expressed as:
$$ P(X_t \mid X_1, \dots ,X_{t-1}) = P(X_t \mid X_{t-1}) $$
This equation means that to predict the state at time $t$, we only need to know the state at time $t-1$. Knowing $X_{t-2}, X_{t-3}$, and so on provides no additional information, because the influence of those earlier states is entirely captured by $X_{t-1}$.
A second equation formalizes the statistical independence that results from the Markov Assumption:
$$\forall t \in \mathbb N^{+}, \forall t' \in \mathbb N^{+}: (t-t' \geq 2) \Rightarrow X_t \perp X_{t'} \mid X_{t-1}$$
This means that any state $X_t$ is statistically independent ($\perp$) of any earlier state $X_{t'}$ (where $t'$ is at least two time steps before $t$, i.e., $t-t' \geq 2$), once you are given the state $X_{t-1}$ that lies between them.
For example, $X_3$ is independent of $X_1$, given $X_2$. The history $X_1$ only influences $X_2$, and it is only $X_2$ that influences $X_3$.
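This conditional independence can be checked numerically. The sketch below builds a hypothetical two-state chain (the transition matrix `T` and initial distribution `pi0` are made up for illustration) and verifies that $P(X_3 \mid X_1, X_2)$ does not depend on $X_1$:

```python
import numpy as np

# Hypothetical 2-state chain: T[j, i] = P(X_t = i | X_{t-1} = j)
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])  # assumed initial distribution P(X_1)

# Joint distribution P(X_1, X_2, X_3) via the chain-rule factorization
joint = np.einsum('a,ab,bc->abc', pi0, T, T)

# Condition on (X_1, X_2) by normalizing over the X_3 axis;
# the result should be identical for both values of X_1
cond = joint / joint.sum(axis=2, keepdims=True)
print(np.allclose(cond[0], cond[1]))  # X_3 independent of X_1 given X_2
```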
This property simplifies the joint probability distribution of the entire sequence of states. Using the chain rule of probability and applying the Markov Assumption:
$$P(X_1, \dots ,X_T) = P(X_1) \cdot P(X_2 \mid X_1) \cdot P(X_3 \mid X_1, X_2) \cdots P(X_T \mid X_1, \dots ,X_{T-1})$$
Becomes:
$$P(X_1, \dots ,X_T) = P(X_1) \prod_{t=2}^{T} P(X_t \mid X_{t-1})$$
This simplification is what makes Markov Chains, and extensions like the Hidden Markov Model (HMM) and Kalman Filter, computationally tractable and widely applicable in sequence modeling.
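As a quick illustration of the factorized form, the sketch below scores one particular state sequence under a hypothetical two-state chain (the transition matrix `T` and initial distribution `p_init` are invented for the example):

```python
import numpy as np

# Hypothetical 2-state chain: T[j, i] = P(X_t = i | X_{t-1} = j)
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
p_init = np.array([0.8, 0.2])  # assumed P(X_1)

def sequence_probability(states):
    """P(X_1=states[0], ..., X_T=states[-1]) via the Markov factorization."""
    p = p_init[states[0]]
    for prev, curr in zip(states[:-1], states[1:]):
        p *= T[prev, curr]
    return p

p_seq = sequence_probability([0, 0, 1, 1])  # 0.8 * 0.9 * 0.1 * 0.7
```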
plot_markov_chain()
In this example, we consider the problem of tracking a robot's location on a discrete grid, a classic application of a Markov Chain.
The motion equation describes how the probability distribution over the robot's location evolves over one time step, based on the Markov Assumption ($P(X_t \mid X_1, \dots, X_{t-1}) = P(X_t \mid X_{t-1})$).
To find the probability that the robot is at cell $i$ at time $t$, $P(X_t=i)$, we must sum up the probabilities of all possible previous locations $j$ at $t-1$ that could have led to state $i$ at time $t$. This is done using the law of total probability:
$$P(X_t=i) = \sum_{j} P(X_{t-1}=j) \cdot P(X_t=i \mid X_{t-1}=j) $$
This process, where a probability distribution is propagated forward by summing over all possibilities from the previous step, is the prediction step of Bayesian filtering (also called temporal prediction).
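The same total-probability sum can be written as a vector-matrix product. The sketch below uses a small, made-up 4-cell transition matrix `A`, where `A[j, i]` $= P(X_t = i \mid X_{t-1} = j)$:

```python
import numpy as np

# Illustrative 4-cell grid; each row of A sums to 1
A = np.array([[0.7, 0.3, 0.0, 0.0],
              [0.1, 0.7, 0.2, 0.0],
              [0.0, 0.1, 0.7, 0.2],
              [0.0, 0.0, 0.3, 0.7]])
belief = np.array([0.0, 1.0, 0.0, 0.0])  # robot known to be in cell 1

# P(X_t = i) = sum_j P(X_{t-1} = j) * P(X_t = i | X_{t-1} = j)
predicted = belief @ A
```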
In many localization problems, especially on a uniform grid, the transition probabilities exhibit translation invariance (or stationarity). This means the probability of moving by a certain amount is independent of the absolute starting position.
Example of Translation Invariance
The probability of moving one cell to the left is the same whether you start at cell 3 and move to 2, or start at cell 2 and move to 1.
$$P(X_t=2 \mid X_{t-1}=3) = P(X_t=1 \mid X_{t-1}=2) = P(\text{move left by } 1) $$
If we define a function $g$ for the relative motion probability:
$$g(k) \equiv P(X_t - X_{t-1} = k)$$ where $k = i - j$ is the displacement.
Then the transition probability can be written in terms of this relative difference:
$$P(X_t=i \mid X_{t-1}=j) = g(i-j)$$
Substituting this into the motion equation, and switching to a functional notation where $f_t(i) = P(X_t=i)$: $$f_t(i) = \sum_{j} f_{t-1}(j) \cdot g(i-j)$$
This equation is exactly the definition of a convolution $(f_{t-1} * g)(i)$:$$(f_{t-1} * g)(i) = \sum_{j} f_{t-1}(j) \cdot g(i-j)$$
Conclusion: The operation of predicting the robot's new location probability by incorporating the motion model is mathematically equivalent to convolving the previous location probability distribution ($f_{t-1}$) with the motion uncertainty distribution ($g$).
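This equivalence is easy to verify numerically: building the banded transition matrix `A[j, i] = g(i - j)` explicitly and multiplying gives the same result as `np.convolve`. The kernel and grid size here mirror the code example that follows; with `'same'`-mode zero padding, both forms let probability mass fall off the grid edges:

```python
import numpy as np

g = np.array([0.1, 0.7, 0.2])  # [g(-1), g(0), g(+1)]: move left, stay, move right
n = 10
f = np.zeros(n)
f[4] = 1.0                     # robot known to be in cell 4

# Banded transition matrix A[j, i] = g(i - j), zero outside the band
A = np.zeros((n, n))
for j in range(n):
    for k, gk in zip((-1, 0, 1), g):
        if 0 <= j + k < n:
            A[j, j + k] = gk

pred_matrix = f @ A                         # law of total probability
pred_conv = np.convolve(f, g, mode='same')  # convolution form
```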
The "Blurry" Effect
So if we know the robot's location at time $t=1$ and observe no variables afterwards (no measurements), our knowledge of the robot's location becomes more "blurry" over time.
Convolution, in this context, always spreads the probability mass: each step redistributes a cell's probability among its neighbors according to the kernel $g$, so sharp peaks flatten and the distribution widens.
Without any observations (measurements) to correct the belief, the uncertainty (or entropy) of the robot's location continuously increases because the cumulative error from the motion model grows. This highlights the crucial need for the measurement update step in the full Kalman Filter/Bayesian Filter framework.
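One way to see this quantitatively is to track the Shannon entropy of the belief as it is repeatedly convolved. The `entropy` helper below is defined ad hoc for this sketch; the kernel matches the example that follows:

```python
import numpy as np

tp = np.array([0.1, 0.7, 0.2])  # motion kernel: move left, stay, move right

def entropy(dist):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    nz = dist[dist > 0]
    return -(nz * np.log2(nz)).sum()

p = np.zeros(10)
p[4] = 1.0                      # perfectly known start: entropy 0
entropies = [entropy(p)]
for _ in range(4):
    p = np.convolve(p, tp, mode='same')
    entropies.append(entropy(p))
# entropies grows step after step: uncertainty only increases
```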
This example illustrates how the initial probability distribution of the robot's location, $P(X_t)$, spreads out over time due to the uncertainty in the motion model.
# 1. Initialize Probability Matrix (5 time steps x 10 cells)
p = np.zeros([5, 10])
# 2. Set Initial State at t=0
p[0, 4] = 1 # The robot is located at time t=0 in cell with grid index 4
# 3. Define Transition Probabilities (tp) as the motion kernel g(k)
# tp = [g(-1), g(0), g(+1)] = [Move Left (0.1), Stay (0.7), Move Right (0.2)]
# i.e. tp[0] = P(X_t = i | X_{t-1} = i+1), tp[2] = P(X_t = i | X_{t-1} = i-1)
tp = np.array([0.1, 0.7, 0.2]) # convolution kernel
# Set NumPy print options for clean output
np.set_printoptions(precision=3, suppress=True)
# 4. Perform Prediction (Convolution) over 4 time steps
for t in range(4):
    # Convolve the current distribution p[t] with the motion kernel tp.
    # 'same' mode keeps the output length at 10 (mass leaving the grid is lost).
    p[t+1] = np.convolve(p[t], tp, mode='same')
print("Calculated Probability Distributions P(X_t=i) over 5 timesteps:")
print(p)
# 5. Plotting the Results
plot_locations(p)
Note: This procedure also works effectively even if the distribution is multimodal.
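For instance, a bimodal belief (robot equally likely to be in cell 2 or cell 7; the values are invented for the sketch) convolves just as cleanly, producing two spreading peaks:

```python
import numpy as np

tp = np.array([0.1, 0.7, 0.2])         # same motion kernel as above
p0 = np.zeros(10)
p0[2] = 0.5                            # robot in cell 2 ...
p0[7] = 0.5                            # ... or cell 7, with equal probability
p1 = np.convolve(p0, tp, mode='same')  # one prediction step: both peaks spread
```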
A Hidden Markov Model (HMM) is an extension of a Markov Chain where the states ($X_t$) are not directly observable; they are hidden. Instead, we receive noisy observations or measurements ($Z_t$) that depend on the true hidden state. The relationship between these variables is defined by two crucial sets of conditional independence assumptions.
plot_HMM()
The Two Core Conditional Independence Assumptions
The graph structure ($\mathcal{G}$) of the HMM is encoded by these two independence assumptions:
1. The Observation/Measurement Model (Sensor Independence)
This assumption defines the relationship between the hidden state ($X_t$) and the measurement ($Z_t$).
$$P(Z_t \mid Z_1, \dots, Z_{t-1}, X_1, \dots, X_{t}) = P(Z_t \mid X_t)$$
Explanation:
In the robot example: The reading from your sensor ($Z_t$) only depends on where the robot truly is right now ($X_t$). It doesn't matter how the robot got there ($X_1, \dots, X_{t-1}$) or what measurements were taken previously ($Z_1, \dots, Z_{t-1}$).
2. The Transition/Motion Model (Markov Assumption)
This assumption defines the evolution of the hidden state itself, retaining the fundamental property of the Markov Chain.
$$P(X_t \mid Z_1, \dots, Z_{t-1}, X_1, \dots, X_{t-1}) = P(X_t \mid X_{t-1})$$
Explanation:
In the robot example: The robot's motion is purely autonomous. Where it moves next ($X_t$) depends only on where it was one step ago ($X_{t-1}$), and is not influenced by any past sensor readings ($Z_1, \dots, Z_{t-1}$). The motion occurs independently of the observations we make.
These two assumptions collectively decouple the full problem into the manageable, recursive Prediction (motion) and Update (measurement) steps used in the Bayesian Filter and, consequently, the Kalman Filter.
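Putting the two steps together gives the full recursive filter. The sketch below runs a discrete Bayes filter on the 10-cell grid, reusing the motion kernel from the earlier example; the sensor likelihood (repeatedly pointing at cell 5) is invented for illustration:

```python
import numpy as np

tp = np.array([0.1, 0.7, 0.2])  # motion model g: move left, stay, move right

def bayes_filter_step(belief, likelihood):
    """One predict + update cycle of the discrete Bayes filter."""
    prior = np.convolve(belief, tp, mode='same')  # predict (motion)
    posterior = likelihood * prior                # update (measurement)
    return posterior / posterior.sum()            # normalize by P(Z_t)

belief = np.full(10, 0.1)       # start with no location knowledge

# Hypothetical likelihood P(Z_t | X_t = i): the sensor suggests cell 5
likelihood = np.array([0.02] * 4 + [0.1, 0.7, 0.1] + [0.02] * 3)
for _ in range(3):
    belief = bayes_filter_step(belief, likelihood)
# belief is now sharply concentrated around cell 5
```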
The Prediction Step (motion/convolution) calculates the prior belief, $P(X_t)$, but it always increases uncertainty. To correct this, we use a sensor measurement, $Z_t$, and Bayes' Rule to compute the updated belief, $P(X_t \mid Z_t)$, called the Posterior.
The core update equation is: $$P(X_t=i \mid Z_t) = \frac{P(Z_t \mid X_t=i) \cdot P(X_{t}=i)}{P(Z_t)}$$
Components of the Update:
| Term | Role in the Equation | Interpretation |
|---|---|---|
| $P(X_t=i)$ | Prior (from Prediction) | The predicted probability distribution before seeing the measurement. |
| $P(Z_t \mid X_t=i)$ | Likelihood (Measurement Model) | The sensor's probability: How likely is it to get measurement $Z_t$ if the robot is truly at cell $i$? |
| $P(Z_t)$ | Normalizer/Evidence | A scaling factor ensuring the Posterior sums to 1. It is calculated by marginalizing the joint probability over all states $i$: $$P(Z_t) = \sum_{i} P(Z_t \mid X_t=i) P(X_{t}=i)$$ |
| $P(X_t=i \mid Z_t)$ | Posterior | The final, updated, and corrected probability distribution after incorporating the observation. |
Code Demonstration: Fusing Prior and Measurement
This code takes the prior distribution $P(X_4)$ (the result of the motion prediction at $t=4$ from the earlier example), introduces a new sensor likelihood, and calculates the posterior distribution.
# --- 1. Define the Prior (P(X_t)) ---
# Assuming p is the array from the prediction code above, p[4] holds the
# predicted distribution P(X_4) at t=4 from the earlier convolution example.
# Here we hard-code representative values so this cell runs on its own:
prior = np.array([0.003, 0.021, 0.088, 0.231, 0.315, 0.231, 0.088, 0.021, 0.003, 0.000])
# --- 2. Define the Likelihood (P(Z_t | X_t)) ---
# This is the probability of the measurement Z_t, given the robot is at cell i.
# The measurement indicates a value centered around cell 5.
p_measurement = np.array([0., 0., 0., 0.05, 0.18, 0.54, 0.18, 0.05, 0., 0.])
# --- 3. Compute the Posterior (P(X_t | Z_t)) ---
# STEP 1: Numerator (un-normalized posterior) = Likelihood * Prior
unnormalized_posterior = p_measurement * prior
# STEP 2: Normalizer P(Z_t) = sum(unnormalized_posterior)
normalizer = unnormalized_posterior.sum()
# STEP 3: Normalize to get the Posterior
posterior = unnormalized_posterior / normalizer
# print("Normalizer P(Z_t):", normalizer)
# print("Posterior P(X_t | Z_t):\n", posterior)
# --- 4. Plotting the Results ---
plot_updates(prior, p_measurement, posterior)
Interpretation
The crucial effect is that the Posterior distribution is narrower than the Prior. The observation $Z_t$ essentially "pulls" the predicted belief towards the location indicated by the sensor (cell 5 in the likelihood) and reduces the overall uncertainty. This is the core function of the measurement update in any state estimation filter.
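This narrowing can be made concrete by comparing spreads. The sketch below recomputes the posterior from the arrays above and compares the standard deviations of prior and posterior (the `std` helper is ad hoc for this example):

```python
import numpy as np

cells = np.arange(10)
prior = np.array([0.003, 0.021, 0.088, 0.231, 0.315,
                  0.231, 0.088, 0.021, 0.003, 0.000])
likelihood = np.array([0., 0., 0., 0.05, 0.18, 0.54, 0.18, 0.05, 0., 0.])
posterior = likelihood * prior
posterior /= posterior.sum()

def std(dist):
    """Standard deviation of cell index under a (possibly unnormalized) dist."""
    mean = (cells * dist).sum() / dist.sum()
    return np.sqrt((dist * (cells - mean) ** 2).sum() / dist.sum())

# The posterior is tighter than the prior: the measurement reduced uncertainty
print(std(prior), std(posterior))
```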