
Primer on Probabilities¶

Core Concepts in Probability¶

Outcome (Elementary Event)¶

An outcome is a single possible result of a random experiment. It is the most basic element in the sample space.

  • Experiment: Rolling a standard six-sided die.
  • Outcome: A specific number that lands face-up, e.g., $3$.
  • Set of all Outcomes (Sample Space $\Omega$): $\{1, 2, 3, 4, 5, 6\}$

Event¶

An event is any subset of the sample space $\Omega$. It can be a single outcome or a collection of outcomes.

Examples for the experiment "rolling a standard six-sided die":

  • Example Event $A$ (Single Outcome): Rolling a 3. $A = \{3\}$.
  • Example Event $B$ (Collection of Outcomes): Rolling an even number. $B = \{2, 4, 6\}$.
  • Example Event $C$ (Collection of Outcomes): Rolling a number greater than 4. $C = \{5, 6\}$.

Random Variable ($X$)¶

A random variable is a function that assigns a numerical value to each outcome in the sample space. Random variables are usually denoted by capital letters (e.g., $X, Y, H_t$).

While the outcome is the physical result (e.g., the number 3), the random variable maps that result to a quantity of interest.

Example: Dice Roll Random Variables

Let's define two different random variables, $X$ and $Y$, based on the roll of a die:

Random Variable $X$: Even or Odd

  • Description: $X$ maps the outcome to the value $0$ (for even) or $1$ (for odd).
  • Outcomes $\rightarrow$ Value of $X$:
    • Outcomes $\{1, 3, 5\} \rightarrow X = 1$ (Odd)
    • Outcomes $\{2, 4, 6\} \rightarrow X = 0$ (Even)
  • Possible Values for $X$: $\text{Val}(X) = \{0, 1\}$.
  • Probability Distribution $P(X)$:
    • $P(X=1) = P(\{1, 3, 5\}) = 3/6 = 0.5$
    • $P(X=0) = P(\{2, 4, 6\}) = 3/6 = 0.5$

Random Variable $Y$: At Most 3 or Greater than 3

  • Description: $Y$ maps the outcome to the value $0$ (for $\le 3$) or $1$ (for $> 3$).
  • Outcomes $\rightarrow$ Value of $Y$:
    • Outcomes $\{1, 2, 3\} \rightarrow Y = 0$ (Smaller or equal to 3)
    • Outcomes $\{4, 5, 6\} \rightarrow Y = 1$ (Greater than 3)
  • Possible Values for $Y$: $\text{Val}(Y) = \{0, 1\}$.
  • Probability Distribution $P(Y)$:
    • $P(Y=0) = P(\{1, 2, 3\}) = 3/6 = 0.5$
    • $P(Y=1) = P(\{4, 5, 6\}) = 3/6 = 0.5$

The key takeaway is that the random variable provides a structured, numerical framework for analyzing the underlying random process, allowing us to define probabilities like $P(X=1)$ or $P(Y=0)$.
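The two random variables above can be reproduced in a few lines of Python. This is a minimal sketch: the helper `distribution` and the names `X`, `Y`, `omega` are ours, chosen to mirror the notation in the text.

```python
from fractions import Fraction
from collections import Counter

# Sample space of a fair six-sided die; each outcome has probability 1/6.
omega = [1, 2, 3, 4, 5, 6]

# Random variables are just functions from outcomes to numbers,
# using the 0/1 encodings defined above.
X = lambda w: 1 if w % 2 == 1 else 0   # 1 = odd, 0 = even
Y = lambda w: 1 if w > 3 else 0        # 1 = greater than 3, 0 = at most 3

def distribution(rv):
    """Tabulate P(rv = v) by counting the outcomes that map to each value."""
    counts = Counter(rv(w) for w in omega)
    return {v: Fraction(c, len(omega)) for v, c in counts.items()}

print(distribution(X))  # both values get probability 1/2
print(distribution(Y))  # both values get probability 1/2
```

Because the die is fair, the probability of a value of $X$ is just the number of outcomes mapping to it divided by $|\Omega| = 6$, which is exactly what `distribution` computes.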

Discrete Probability Distributions¶

A discrete probability distribution defines the likelihood of all possible outcomes for a discrete random variable. It is essentially a complete model of our belief regarding the state of the system at a given time.

Detailed Definition¶

A discrete probability distribution is a function that maps every unique, countable value of a random variable, e.g. $X$, to its corresponding probability.

Since the possible outcomes are finite or countably infinite (like the number of grid cells), the distribution can be represented explicitly:

  • List/Table: Tabulating each outcome and its probability (as shown below).
  • Histogram: A graphical representation where the height of each bar corresponds to the probability of that outcome.

The distribution $P(X)$ provides a full picture of uncertainty—it doesn't just give the single most likely state, but quantifies how likely every possible state is.

Concrete Example: Robot Probability on a 1D-Grid¶

Let's assume a small grid with $N=10$ cells (indices $0$ to $9$).

  • Random Variable: $H_t$ (The true location of the robot at time $t$).
  • Outcomes/Sample Space: $\text{Val}(H_t) = \{0, 1, 2, \dots, 9\}$ is the space of all possible values of $H_t$
  • Distribution: $P(H_t)$ is the probability distribution over these cells.

At a specific time, say $t=5$, our current knowledge (belief) about the robot's location is represented by the following distribution, $P(H_5)$:

| Grid Cell Index $i$ | Probability $P(H_5 = i)$ |
|---|---|
| 0, 1, 2 | 0.00 |
| 3 | 0.20 |
| 4 | 0.50 |
| 5 | 0.25 |
| 6 | 0.05 |
| 7, 8, 9 | 0.00 |

Interpretation: Quantifying Belief

  • The total probability sums to $1.00$ ($0.20 + 0.50 + 0.25 + 0.05 = 1.00$).
  • Highest Confidence (Mode): $P(H_5=4) = 0.50$. We are most confident (50%) that the robot is in cell 4. This location is the mode of the distribution.
  • Spreading Belief: The mass is spread across cells 3, 4, 5, and 6, quantifying the uncertainty in the robot's exact position. If the probability were concentrated entirely in one cell (e.g., $P(H_5=4)=1.0$), we would have perfect certainty.
  • Zero Belief: $P(H_5=7) = 0.00$. We are certain the robot is not in cell 7.

This distribution represents the entirety of our knowledge about the robot's state at time $t=5$ (before any further motion or measurements occur).
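The belief table above can be stored as a simple mapping from cell index to probability. The sketch below (variable names are ours) checks the normalization property and extracts the mode:

```python
# Belief over the 10 grid cells at t = 5, matching the table above.
belief = {i: 0.0 for i in range(10)}
belief.update({3: 0.20, 4: 0.50, 5: 0.25, 6: 0.05})

# Normalization: all probability mass must sum to one.
assert abs(sum(belief.values()) - 1.0) < 1e-12

# The mode is the cell with the highest probability.
mode = max(belief, key=belief.get)
print(mode, belief[mode])  # 4 0.5
```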

Key Properties of Probability Distributions

  • Non-negativity: For any value $h \in \text{Val}(H_t)$, $P(H_t=h) \ge 0$.

  • Normalization: The sum of all probabilities must equal one:

    $$\sum_{h \in \text{Val}(H_t)} P(H_t=h) = 1$$

Note on Notation¶

  • $P(X)$ refers to the entire probability distribution. It isn't a single number; it is the set of all possible outcomes and their associated probabilities.
  • $P(x)$—often written more clearly as $P(X=x)$—is the probability that the random variable $X$ happens to result in the specific value $x$. This is a single number between 0 and 1.

Note on the "Abuse of Notation" in Probability¶

In probability theory and filter design, we frequently use a shorthand known as "abuse of notation" to keep our equations readable.

In formal mathematics, a function should be defined by its name or symbol, e.g., for $f(x)$, the symbol is $f$, while the argument $x$ is just a placeholder. However, in this field, we identify the specific distribution by its argument:

  • $P(X)$ refers to the probability distribution of the Random Variable $X$.
  • $P(Y)$ refers to the probability distribution of the Random Variable $Y$.
  • $P(H_t)$ refers to the probability distribution of the Random Variable $H_t$.

Why is this an "abuse"?¶

In standard calculus, if $f(x) = x^2$, then $f(y)$ must be $y^2$. However, in probability, $P(H_t)$ might be a distribution representing a robot's position at time $t$, while $P(Y)$ could be a Bernoulli distribution representing a sensor's binary state. Even though we use the same letter $P$, the underlying mathematical functions are completely different. The same applies to $P(x)$ vs. $P(h_t)$.

Rule of Thumb: Whenever you see $P(\cdot)$ (or $p(\cdot)$), look at the variable inside the parentheses. That variable tells you which physical "part" of the system the distribution belongs to.

Joint Probability Distribution $P(X, Y)$¶

The joint probability distribution of two or more random variables, $X$ and $Y$, specifies the probability of every possible combination of their values occurring simultaneously.

If we know $P(X, Y)$, we know the chance of $X=x$ and $Y=y$ happening at the same time. This is written as $P(X=x, Y=y)$.

Example: Dice Roll Joint Probability

Let's use our two random variables from the previous example based on the roll of a standard six-sided die ($\Omega = \{1, 2, 3, 4, 5, 6\}$):

  • Random Variable $X$ (Even/Odd): $X=1$ (Odd: $\{1, 3, 5\}$), $X=0$ (Even: $\{2, 4, 6\}$)
  • Random Variable $Y$ (Small/Large): $Y=0$ ($\le 3$: $\{1, 2, 3\}$), $Y=1$ ($> 3$: $\{4, 5, 6\}$)

We can construct the joint distribution $P(X, Y)$ by counting the outcomes for each combination:

| $X$ (Even/Odd) | $Y$ ($\le 3$ / $> 3$) | Outcomes | Count | Joint Probability $P(X=x, Y=y)$ |
|---|---|---|---|---|
| $X=1$ (Odd) | $Y=0$ ($\le 3$) | $\{1, 3\}$ | 2 | $2/6 \approx 0.333$ |
| $X=1$ (Odd) | $Y=1$ ($> 3$) | $\{5\}$ | 1 | $1/6 \approx 0.167$ |
| $X=0$ (Even) | $Y=0$ ($\le 3$) | $\{2\}$ | 1 | $1/6 \approx 0.167$ |
| $X=0$ (Even) | $Y=1$ ($> 3$) | $\{4, 6\}$ | 2 | $2/6 \approx 0.333$ |
| TOTAL | | | 6 | $1.000$ |
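Constructing the joint distribution is just counting outcomes per $(x, y)$ pair and normalizing. A minimal sketch (variable names are ours):

```python
from fractions import Fraction
from collections import Counter

omega = [1, 2, 3, 4, 5, 6]
X = lambda w: w % 2        # 1 = odd, 0 = even
Y = lambda w: int(w > 3)   # 1 = greater than 3, 0 = at most 3

# Count outcomes for each (x, y) combination, then divide by |omega|.
counts = Counter((X(w), Y(w)) for w in omega)
P_xy = {k: Fraction(c, len(omega)) for k, c in counts.items()}

print(P_xy[(1, 0)])  # 1/3  (outcomes {1, 3})
print(P_xy[(1, 1)])  # 1/6  (outcome {5})
```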

Independence¶

Two random variables, $A$ and $B$, are independent if the occurrence of one does not change the probability distribution of the other.

The Mathematical Definition

In formal notation, $A$ and $B$ are independent if and only if their joint probability is equal to the product of their individual probabilities:

$$P(A, B) = P(A)P(B)$$

Think of this as "Information Silence." If you know the value of $A$, you gain zero new information about $B$.

Notation

If two random variables are independent, we use the notation:

$$A \perp B$$

The symbol $\perp$ signifies that $A$ and $B$ are "perpendicular" or orthogonal in an informational sense—they do not overlap or influence one another.

Note: Testing for Independence (Dice Example)¶

We can use the Joint Probability table from the previous section to check if $X$ (Even/Odd) and $Y$ (Small/Large) are independent.

If they were independent, the following would hold true for every row:

$$P(x, y) = P(x)P(y)$$

Let's check the first row ($X=1, Y=0$):

  1. Joint Probability $P(X=1, Y=0)$: From the table, this is $0.333$.
  2. Individual Probability $P(X=1)$: There are 3 odd numbers $\{1, 3, 5\}$, so $P(X=1) = 3/6 = \mathbf{0.5}$.
  3. Individual Probability $P(Y=0)$: There are 3 small numbers $\{1, 2, 3\}$, so $P(Y=0) = 3/6 = \mathbf{0.5}$.

The Test: $$0.5 \times 0.5 = 0.25$$

Conclusion: Since $0.25 \neq 0.333$, these variables are not independent.

Why does this make sense?

In this specific die roll, knowing the number is "Small" ($Y=0$) increases the likelihood that it is "Odd" ($X=1$), because the set $\{1, 2, 3\}$ is two-thirds odd, whereas the full die is only half odd. Because one variable "hints" at the other, they are mathematically dependent.
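The independence test can be run over every cell of the joint table at once. The sketch below reuses the probabilities from the dice example (names are ours):

```python
from fractions import Fraction

# Joint and marginal probabilities from the dice example.
P_joint = {(1, 0): Fraction(2, 6), (1, 1): Fraction(1, 6),
           (0, 0): Fraction(1, 6), (0, 1): Fraction(2, 6)}
P_X = {0: Fraction(1, 2), 1: Fraction(1, 2)}
P_Y = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# X and Y are independent iff P(x, y) == P(x) * P(y) for EVERY cell.
independent = all(P_joint[(x, y)] == P_X[x] * P_Y[y]
                  for x in P_X for y in P_Y)
print(independent)  # False: 2/6 != 1/4 in the first cell
```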

Marginalization¶

A key property of the joint distribution is that you can calculate the individual (unconditional) probability distributions, $P(X)$ and $P(Y)$, from it. These are called marginal distributions because they are found by summing over the "margins" of the joint table.

Calculating $P(X)$ (Marginalizing Out $Y$)

To find the probability of $X=1$ (Odd), we sum the probabilities of all outcomes where $X=1$:

$$P(X=1) = \sum_{y \in \text{Val}(Y)} P(X=1, Y=y)$$

$$P(X=1) = P(X=1, Y=0) + P(X=1, Y=1) = \frac{2}{6} + \frac{1}{6} = \frac{3}{6} = 0.5$$

This matches our earlier finding that the probability of rolling an odd number is $0.5$.
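Marginalization is a one-line sum over the joint table. A minimal sketch (names are ours):

```python
from fractions import Fraction

P_joint = {(1, 0): Fraction(2, 6), (1, 1): Fraction(1, 6),
           (0, 0): Fraction(1, 6), (0, 1): Fraction(2, 6)}

# Marginalize out Y: sum the joint probability over all values of Y.
P_X1 = sum(p for (x, y), p in P_joint.items() if x == 1)
print(P_X1)  # 1/2
```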

Conditional Probability and Conditional Independence¶

Conditional Probability¶

The conditional probability of $X$ given $Y$ is defined as the joint probability of $X$ and $Y$ divided by the marginal probability of $Y$:

$$P(X \mid Y) = \frac{P(X, Y)}{P(Y)} $$

Where:

  • $P(X, Y)$ is the Joint Probability (e.g., $P(\text{Odd and } \le 3)$).
  • $P(Y)$ is the Marginal Probability (e.g., $P(\le 3)$).

Example Calculation:

Let's calculate the probability of the roll being Odd ($X=1$), given that the roll was smaller or equal to 3 ($Y=0$).

  1. Find the Joint Probability $P(X=1, Y=0)$:
    • Outcomes where $X=1$ (Odd) AND $Y=0$ ($\le 3$) are $\{1, 3\}$.
    • From the table: $P(X=1, Y=0) = 2/6$.
  2. Find the Marginal Probability $P(Y=0)$:
    • Outcomes where $Y=0$ ($\le 3$) are $\{1, 2, 3\}$.
    • $P(Y=0) = 3/6 = 0.5$. (This is the total probability for the condition being met).
  3. Calculate the Conditional Probability $P(X=1 \mid Y=0)$:

$$P(X=1 \mid Y=0) = \frac{P(X=1, Y=0)}{P(Y=0)} = \frac{2/6}{3/6} = \frac{2}{3} \approx 0.667 $$

Interpretation

  • Initial Belief (Unconditional $P(X=1)$): Before knowing anything about $Y$, the probability of rolling an odd number was $P(X=1) = 3/6 = 0.5$.
  • Updated Belief (Conditional $P(X=1 \mid Y=0)$): Once we are given the information that the roll was $\le 3$ ($Y=0$), our focus narrows to the set $\{1, 2, 3\}$. Within this new, restricted sample space, two out of the three outcomes are odd, so the probability increases to $2/3 \approx 0.667$.

Note: This process of using observed information ($Y=0$) to update the probability of an unknown state ($X=1$) is the fundamental idea behind the Measurement Update step (Bayes' Rule) in localization and filtering.
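The three-step calculation above translates directly into code: marginalize to get the denominator, then divide. A minimal sketch (names are ours):

```python
from fractions import Fraction

P_joint = {(1, 0): Fraction(2, 6), (1, 1): Fraction(1, 6),
           (0, 0): Fraction(1, 6), (0, 1): Fraction(2, 6)}

# Step 2: marginal probability of the condition Y = 0.
P_Y0 = sum(p for (x, y), p in P_joint.items() if y == 0)

# Step 3: conditional probability P(X=1 | Y=0) = P(X=1, Y=0) / P(Y=0).
P_X1_given_Y0 = P_joint[(1, 0)] / P_Y0
print(P_X1_given_Y0)  # 2/3
```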

The Alternative Definition of Independence¶

A more intuitive way to think about independence is through the lens of conditional probability.

If $X \perp Y$, then knowing the value of $Y$ does not change our belief about $X$. Mathematically, this is written as:

$$P(X \mid Y) = P(X)$$

Why this is equivalent:

If you take the standard independence formula $P(X, Y) = P(X)P(Y)$ and plug it into the conditional probability formula:

$$P(X \mid Y) = \frac{P(X, Y)}{P(Y)} = \frac{P(X)P(Y)}{P(Y)} = P(X)$$

Conditional Independence¶

Two random variables, $A$ and $B$, are conditionally independent given a third variable $C$ if knowing the value of $C$ makes $A$ and $B$ statistically unrelated.

In simpler terms, if $C$ is known, information about $B$ provides no additional information about $A$.

Formal Definition

$A$ and $B$ are conditionally independent given $C$ if the following holds:

$$P(A \mid B, C) = P(A \mid C)$$

This is equivalent to:

$$P(A, B \mid C) = P(A \mid C) P(B \mid C)$$

Notation for conditional independence:

$$A \perp B \mid C$$

Application to Markov Chains (The Markov Assumption)

This concept is the bedrock of sequential estimation filters. In the context of your robot localization:

  • $A$ = The Future state ($H_t$)
  • $B$ = The Past states ($H_1, \dots, H_{t-2}$)
  • $C$ = The Present state ($H_{t-1}$)

The Markov Assumption states that the future location of the robot ($H_t$) is conditionally independent of all history ($H_1, \dots, H_{t-2}$) given only the immediate previous location ($H_{t-1}$):

$$P(H_t \mid H_1, \dots ,H_{t-1}) = P(H_t \mid H_{t-1})$$

Why it's Critical for Filtering

If we had to track the entire history of the robot's movement ($H_1, H_2, \dots$), the calculations would quickly become unmanageable (the complexity would grow exponentially with time).

By assuming conditional independence, the system gains the memoryless property: we only need to store and update the belief from the last time step, $P(H_{t-1})$, to predict the next state, $P(H_t)$. This dramatically simplifies the prediction step and makes recursive filters, like the Kalman Filter, computationally practical.

Directed Graphical Models (Bayesian Networks)¶

A Directed Graphical Model (or Bayesian Network) uses nodes to represent random variables and directed edges (arrows) to represent probabilistic dependencies (causation or correlation). The structure of the graph is defined such that it satisfies the Markov Property: given its parents, a node is independent of all its non-descendants.

Encoding Conditional Independence

The core rule for reading independence from a directed graph is based on tracing paths between nodes and observing the state of conditioning nodes (nodes whose values are known).

For the Markov Chain:

  • The nodes are $H_{t-2}$, $H_{t-1}$, and $H_t$.
  • The edges are $H_{t-2} \to H_{t-1}$ and $H_{t-1} \to H_t$.

The structure $H_{t-2} \to H_{t-1} \to H_t$ forms a serial connection (or chain). In a serial connection, the start node ($H_{t-2}$) and the end node ($H_t$) are independent once the middle node ($H_{t-1}$) is observed (or known).

  • Rule: The conditioning node ($H_{t-1}$) blocks the flow of influence between $H_{t-2}$ and $H_t$.
  • Result: This perfectly encodes the Markov Assumption: $H_{t-2} \perp H_t \mid H_{t-1}$.

This graphical structure directly shows why the future state $H_t$ is conditionally independent of the past state $H_{t-2}$ given the present state $H_{t-1}$.

The Chain Rule of Probability¶

General Definition

For two random variables, $A$ and $B$, the joint probability $P(A, B)$ can be written in two symmetric ways, derived directly from the definition of conditional probability $P(A \mid B) = P(A, B) / P(B)$:

$$P(A, B) = P(A) \cdot P(B \mid A)$$

or

$$P(A, B) = P(B) \cdot P(A \mid B)$$

For a Sequence of Variables

When dealing with a sequence of $T$ random variables—like the robot's location over time, $H_1, H_2, \dots, H_T$—the Chain Rule can be expanded to express the full joint probability:

$$P(H_1, H_2, \dots ,H_T) = P(H_1) \cdot P(H_2 \mid H_1) \cdot P(H_3 \mid H_1, H_2) \cdots P(H_T \mid H_1, \dots ,H_{T-1})$$

Application to Robot Localization

This general formula applies to any sequence of random variables. However, when we apply the Markov Assumption (Conditional Independence) to the Chain Rule, the formula simplifies dramatically.

Recall the Markov Assumption: $P(H_t \mid H_1, \dots, H_{t-1}) = P(H_t \mid H_{t-1})$.

Applying this assumption to the expanded Chain Rule above simplifies the terms:

$$P(H_1, H_2, \dots ,H_T) = P(H_1) \cdot P(H_2 \mid H_1) \cdot P(H_3 \mid H_2) \cdots P(H_T \mid H_{T-1})$$

This can be written compactly using product notation:

$$P(H_1, \dots ,H_T) = P(H_1) \prod_{t=2}^{T} P(H_t \mid H_{t-1})$$

Conclusion for Filtering

This simplified form, derived from the Chain Rule and the Markov Assumption, is what makes recursive state estimation possible. It means that the long-term dependency of the system's state is encapsulated entirely by a sequence of short-term, one-step transitions $P(H_t \mid H_{t-1})$. This directly translates to your Prediction Step (convolution), where you only need the previous belief to compute the next.
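Under the Markov assumption, the probability of any full trajectory is just the initial probability times a product of one-step transition probabilities. The sketch below uses a hypothetical 3-cell transition model (the numbers and the name `trajectory_prob` are ours, not from the text):

```python
# Toy Markov chain on 3 grid cells (hypothetical transition model):
# T[i][j] = P(H_t = j | H_{t-1} = i); the robot stays put with
# probability 0.5 or moves one cell to the right (wrapping) with 0.5.
T = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
P_H1 = [1.0, 0.0, 0.0]  # the robot starts in cell 0 with certainty

def trajectory_prob(path):
    """Joint probability P(H_1=h_1, ..., H_T=h_T) via the simplified Chain Rule."""
    p = P_H1[path[0]]
    for prev, nxt in zip(path, path[1:]):
        p *= T[prev][nxt]  # only one-step transitions are needed
    return p

print(trajectory_prob([0, 1, 2]))  # 0.25 = 1.0 * 0.5 * 0.5
```

Note that the full history never enters the computation: each factor depends only on the previous state, which is exactly the memoryless property described above.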

Bayes' Rule¶

Bayes' Rule is the mathematical foundation for updating a belief (prior) with new evidence (measurement) to achieve a refined belief (posterior). It provides a way to calculate $P(A \mid B)$ when we only know the reverse conditional, $P(B \mid A)$.

The Rule

For two events (or random variables) $A$ and $B$:

$$P(A \mid B)=\frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Components Explained:

| Term | Name | Meaning |
|---|---|---|
| $P(A \mid B)$ | Posterior Probability | Our goal. The probability of the hypothesis $A$ being true, after observing the evidence $B$. |
| $P(B \mid A)$ | Likelihood | The probability of observing the evidence $B$, if the hypothesis $A$ were true. This is the Measurement Model in filtering. |
| $P(A)$ | Prior Probability | Our existing belief in the hypothesis $A$ before observing the evidence $B$. This comes from the Prediction Step. |
| $P(B)$ | Evidence / Normalizer | The total probability of observing the evidence $B$ across all possible hypotheses $A$. It ensures the final posterior distribution sums to one. |

The Core Idea: Reversing Logic

Bayes' Rule allows us to "invert" a model. It is often easy to know the Likelihood—e.g., "If the robot is at the door, how likely is the sensor to see a wall?"—but we actually need the Posterior: "If the sensor sees a wall, how likely is it that the robot is at the door?"

Calculating the "Evidence"

The denominator is often calculated via the law of total probability:

$$P(B) = \sum_{a \in \text{Val}(A)} P(B \mid A=a) \cdot P(A=a)$$
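The full update—likelihood times prior, normalized by the evidence—fits in a few lines. The sketch below uses hypothetical numbers for the door/wall example from the text (the prior and likelihood values are ours, chosen for illustration):

```python
from fractions import Fraction

# Hypothetical example: hypothesis A in {door, wall},
# evidence B = "sensor reports a door".
prior = {'door': Fraction(1, 3), 'wall': Fraction(2, 3)}          # P(A)
likelihood = {'door': Fraction(6, 10), 'wall': Fraction(2, 10)}   # P(B | A)

# Evidence: total probability of B, summed over all hypotheses.
P_B = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' Rule: posterior = likelihood * prior / evidence.
posterior = {a: likelihood[a] * prior[a] / P_B for a in prior}
print(posterior['door'])  # 3/5
```

Even though the prior favored "wall", the sensor reading shifts most of the belief to "door"—exactly the inversion of logic described above.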
