
Primer on Probabilities¶

Core Concepts in Probability¶

Outcome (Elementary Event)¶

An outcome is a single possible result of a random experiment. It is the most basic element in the sample space.

  • Experiment: Rolling a standard six-sided die.
  • Outcome: A specific number that lands face-up, e.g., $3$.
  • Set of all Outcomes (Sample Space $\Omega$): $\{1, 2, 3, 4, 5, 6\}$

Event¶

An event is any subset of the sample space $\Omega$. It can be a single outcome or a collection of outcomes.

Examples for the experiment "rolling a standard six-sided die":

  • Example Event $A$ (Single Outcome): Rolling a 3. $A = \{3\}$.
  • Example Event $B$ (Collection of Outcomes): Rolling an even number. $B = \{2, 4, 6\}$.
  • Example Event $C$ (Collection of Outcomes): Rolling a number greater than 4. $C = \{5, 6\}$.

Random Variable ($X$)¶

A random variable is a function that assigns a numerical value to each outcome in the sample space. Random variables are usually denoted by capital letters (e.g., $X, Y, H_t$).

While the outcome is the physical result (e.g., the number 3), the random variable maps that result to a quantity of interest.

Example: Dice Roll Random Variables

Let's define two different random variables, $X$ and $Y$, based on the roll of a die:

Random Variable $X$: Even or Odd

  • Description: $X$ maps the outcome to the value $0$ (for even) or $1$ (for odd).
  • Outcomes $\rightarrow$ Value of $X$:
    • Outcomes $\{1, 3, 5\} \rightarrow X = 1$ (Odd)
    • Outcomes $\{2, 4, 6\} \rightarrow X = 0$ (Even)
  • Possible Values for $X$: $\text{Val}(X) = \{0, 1\}$.
  • Probability Distribution $P(X)$:
    • $P(X=1) = P(\{1, 3, 5\}) = 3/6 = 0.5$
    • $P(X=0) = P(\{2, 4, 6\}) = 3/6 = 0.5$

Random Variable $Y$: At Most 3 or Greater than 3

  • Description: $Y$ maps the outcome to the value $0$ (for $\le 3$) or $1$ (for $> 3$).
  • Outcomes $\rightarrow$ Value of $Y$:
    • Outcomes $\{1, 2, 3\} \rightarrow Y = 0$ (Smaller or equal to 3)
    • Outcomes $\{4, 5, 6\} \rightarrow Y = 1$ (Greater than 3)
  • Possible Values for $Y$: $\text{Val}(Y) = \{0, 1\}$.
  • Probability Distribution $P(Y)$:
    • $P(Y=0) = P(\{1, 2, 3\}) = 3/6 = 0.5$
    • $P(Y=1) = P(\{4, 5, 6\}) = 3/6 = 0.5$

The key takeaway is that the random variable provides a structured, numerical framework for analyzing the underlying random process, allowing us to define probabilities like $P(X=1)$ or $P(Y=0)$.
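The two random variables above can be reproduced in a few lines of Python. This is a minimal sketch: the helper `distribution` and the names `X`, `Y`, `omega` are ours, chosen to mirror the notation in the text.

```python
from fractions import Fraction
from collections import Counter

# Sample space of a fair six-sided die; each outcome has probability 1/6.
omega = [1, 2, 3, 4, 5, 6]

# Random variables are just functions from outcomes to numbers,
# using the 0/1 encodings defined above.
X = lambda w: 1 if w % 2 == 1 else 0   # 1 = odd, 0 = even
Y = lambda w: 1 if w > 3 else 0        # 1 = greater than 3, 0 = at most 3

def distribution(rv):
    """Tabulate P(rv = v) by counting the outcomes that map to each value."""
    counts = Counter(rv(w) for w in omega)
    return {v: Fraction(c, len(omega)) for v, c in counts.items()}

print(distribution(X))  # both values get probability 1/2
print(distribution(Y))  # both values get probability 1/2
```

Because the die is fair, the probability of a value of $X$ is just the number of outcomes mapping to it divided by $|\Omega| = 6$, which is exactly what `distribution` computes.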

Discrete Probability Distributions¶

A discrete probability distribution defines the likelihood of all possible outcomes for a discrete random variable. It is essentially a complete model of our belief regarding the state of the system at a given time.

Detailed Definition¶

A discrete probability distribution is a function that maps every unique, countable value of a random variable, e.g. $X$, to its corresponding probability.

Since the possible outcomes are finite or countably infinite (like the number of grid cells), the distribution can be represented explicitly:

  • List/Table: Tabulating each outcome and its probability (as shown below).
  • Histogram: A graphical representation where the height of each bar corresponds to the probability of that outcome.

The distribution $P(X)$ provides a full picture of uncertainty—it doesn't just give the single most likely state, but quantifies how likely every possible state is.

Concrete Example: Robot Probability on a 1D-Grid¶

Let's assume a small grid with $N=10$ cells (indices $0$ to $9$).

  • Random Variable: $H_t$ (The true location of the robot at time $t$).
  • Outcomes/Sample Space: $\text{Val}(H_t) = \{0, 1, 2, \dots, 9\}$ is the space of all possible values of $H_t$
  • Distribution: $P(H_t)$ is the probability distribution over these cells.

At a specific time, say $t=5$, our current knowledge (belief) about the robot's location is represented by the following distribution, $P(H_5)$:

| Grid Cell Index $i$ | Probability $P(H_5 = i)$ |
|---|---|
| 0, 1, 2 | 0.00 |
| 3 | 0.20 |
| 4 | 0.50 |
| 5 | 0.25 |
| 6 | 0.05 |
| 7, 8, 9 | 0.00 |

Interpretation: Quantifying Belief

  • The total probability sums to $1.00$ ($0.20 + 0.50 + 0.25 + 0.05 = 1.00$).
  • Highest Confidence (Mode): $P(H_5=4) = 0.50$. We are most confident (50%) that the robot is in cell 4. This location is the mode of the distribution.
  • Spreading Belief: The mass is spread across cells 3, 4, 5, and 6, quantifying the uncertainty in the robot's exact position. If the probability were concentrated entirely in one cell (e.g., $P(H_5=4)=1.0$), we would have perfect certainty.
  • Zero Belief: $P(H_5=7) = 0.00$. We are certain the robot is not in cell 7.

This distribution represents the entirety of our knowledge about the robot's state at time $t=5$ (before any further motion or measurements occur).
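The belief table above can be stored as a simple mapping from cell index to probability. The sketch below (variable names are ours) checks the normalization property and extracts the mode:

```python
# Belief over the 10 grid cells at t = 5, matching the table above.
belief = {i: 0.0 for i in range(10)}
belief.update({3: 0.20, 4: 0.50, 5: 0.25, 6: 0.05})

# Normalization: all probability mass must sum to one.
assert abs(sum(belief.values()) - 1.0) < 1e-12

# The mode is the cell with the highest probability.
mode = max(belief, key=belief.get)
print(mode, belief[mode])  # 4 0.5
```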

Key Properties of Probability Distributions

  • Non-negativity: For any value $h \in \text{Val}(H_t)$, $P(H_t=h) \ge 0$.

  • Normalization: The sum of all probabilities must equal one:

    $$\sum_{h \in \text{Val}(H_t)} P(H_t=h) = 1$$

Note on Notation¶

  • $P(X)$ refers to the entire probability distribution. It isn't a single number; it is the set of all possible outcomes and their associated probabilities.
  • $P(x)$—often written more clearly as $P(X=x)$—is the probability that the random variable $X$ happens to result in the specific value $x$. This is a single number between 0 and 1.

Note on the "Abuse of Notation" in Probability¶

In probability theory and filter design, we frequently use a shorthand known as "abuse of notation" to keep our equations readable.

In formal mathematics, a function should be defined by its name or symbol, e.g., for $f(x)$, the symbol is $f$, while the argument $x$ is just a placeholder. However, in this field, we identify the specific distribution by its argument:

  • $P(X)$ refers to the probability distribution of the Random Variable $X$.
  • $P(Y)$ refers to the probability distribution of the Random Variable $Y$.
  • $P(H_t)$ refers to the probability distribution of the Random Variable $H_t$.

Why is this an "abuse"?¶

In standard calculus, if $f(x) = x^2$, then $f(y)$ must be $y^2$. However, in probability, $P(H_t)$ might be a distribution representing a robot's position at time $t$, while $P(Y)$ could be a Bernoulli distribution representing a sensor's binary state. Even though we use the same letter $P$, the underlying mathematical functions are completely different. The same applies to $P(x)$ vs. $P(h_t)$.

Rule of Thumb: Whenever you see $P(\cdot)$ (or $p(\cdot)$), look at the variable inside the parentheses. That variable tells you which physical "part" of the system the distribution belongs to.

Joint Probability Distribution $P(X, Y)$¶

The joint probability distribution of two or more random variables, $X$ and $Y$, specifies the probability of every possible combination of their values occurring simultaneously.

If we know $P(X, Y)$, we know the chance of $X=x$ and $Y=y$ happening at the same time. This is written as $P(X=x, Y=y)$.

Example: Dice Roll Joint Probability

Let's use our two random variables from the previous example based on the roll of a standard six-sided die ($\Omega = \{1, 2, 3, 4, 5, 6\}$):

  • Random Variable $X$ (Even/Odd): $X=1$ (Odd: $\{1, 3, 5\}$), $X=0$ (Even: $\{2, 4, 6\}$)
  • Random Variable $Y$ (Small/Large): $Y=0$ ($\le 3$: $\{1, 2, 3\}$), $Y=1$ ($> 3$: $\{4, 5, 6\}$)

We can construct the joint distribution $P(X, Y)$ by counting the outcomes for each combination:

| $X$ (Even/Odd) | $Y$ ($\le 3$ / $> 3$) | Outcomes | Count | Joint Probability $P(X=x, Y=y)$ |
|---|---|---|---|---|
| $X=1$ (Odd) | $Y=0$ ($\le 3$) | $\{1, 3\}$ | 2 | $2/6 \approx 0.333$ |
| $X=1$ (Odd) | $Y=1$ ($> 3$) | $\{5\}$ | 1 | $1/6 \approx 0.167$ |
| $X=0$ (Even) | $Y=0$ ($\le 3$) | $\{2\}$ | 1 | $1/6 \approx 0.167$ |
| $X=0$ (Even) | $Y=1$ ($> 3$) | $\{4, 6\}$ | 2 | $2/6 \approx 0.333$ |
| TOTAL | | | 6 | $1.000$ |
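Constructing the joint distribution is just counting outcomes per $(x, y)$ pair and normalizing. A minimal sketch (variable names are ours):

```python
from fractions import Fraction
from collections import Counter

omega = [1, 2, 3, 4, 5, 6]
X = lambda w: w % 2        # 1 = odd, 0 = even
Y = lambda w: int(w > 3)   # 1 = greater than 3, 0 = at most 3

# Count outcomes for each (x, y) combination, then divide by |omega|.
counts = Counter((X(w), Y(w)) for w in omega)
P_xy = {k: Fraction(c, len(omega)) for k, c in counts.items()}

print(P_xy[(1, 0)])  # 1/3  (outcomes {1, 3})
print(P_xy[(1, 1)])  # 1/6  (outcome {5})
```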

Independence¶

Two random variables, $A$ and $B$, are independent if the occurrence of one does not change the probability distribution of the other.

The Mathematical Definition

In formal notation, $A$ and $B$ are independent if and only if their joint probability is equal to the product of their individual probabilities:

$$P(A, B) = P(A)P(B)$$

Think of this as "Information Silence." If you know the value of $A$, you gain zero new information about $B$.

Notation

If two random variables are independent, we use the notation:

$$A \perp B$$

The symbol $\perp$ signifies that $A$ and $B$ are "perpendicular" or orthogonal in an informational sense—they do not overlap or influence one another.

Note: Testing for Independence (Dice Example)¶

We can use the Joint Probability table from the previous section to check if $X$ (Even/Odd) and $Y$ (Small/Large) are independent.

If they were independent, the following would hold true for every row:

$$P(x, y) = P(x)P(y)$$

Let's check the first row ($X=1, Y=0$):

  1. Joint Probability $P(X=1, Y=0)$: From the table, this is $0.333$.
  2. Individual Probability $P(X=1)$: There are 3 odd numbers $\{1, 3, 5\}$, so $P(X=1) = 3/6 = \mathbf{0.5}$.
  3. Individual Probability $P(Y=0)$: There are 3 small numbers $\{1, 2, 3\}$, so $P(Y=0) = 3/6 = \mathbf{0.5}$.

The Test: $$0.5 \times 0.5 = 0.25$$

Conclusion: Since $0.25 \neq 0.333$, these variables are not independent.

Why does this make sense?

In this specific die roll, knowing the number is "Small" ($Y=0$) increases the likelihood that it is "Odd" ($X=1$), because the set $\{1, 2, 3\}$ is two-thirds odd, whereas the full die is only half odd. Because one variable "hints" at the other, they are mathematically dependent.
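The independence test can be run over every cell of the joint table at once. The sketch below reuses the probabilities from the dice example (names are ours):

```python
from fractions import Fraction

# Joint and marginal probabilities from the dice example.
P_joint = {(1, 0): Fraction(2, 6), (1, 1): Fraction(1, 6),
           (0, 0): Fraction(1, 6), (0, 1): Fraction(2, 6)}
P_X = {0: Fraction(1, 2), 1: Fraction(1, 2)}
P_Y = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# X and Y are independent iff P(x, y) == P(x) * P(y) for EVERY cell.
independent = all(P_joint[(x, y)] == P_X[x] * P_Y[y]
                  for x in P_X for y in P_Y)
print(independent)  # False: 2/6 != 1/4 in the first cell
```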

Marginalization¶

A key property of the joint distribution is that you can calculate the individual (unconditional) probability distributions, $P(X)$ and $P(Y)$, from it. These are called marginal distributions because they are found by summing over the "margins" of the joint table.

Calculating $P(X)$ (Marginalizing Out $Y$)

To find the probability of $X=1$ (Odd), we sum the probabilities of all outcomes where $X=1$:

$$P(X=1) = \sum_{y \in \text{Val}(Y)} P(X=1, Y=y)$$

$$P(X=1) = P(X=1, Y=0) + P(X=1, Y=1) = \frac{2}{6} + \frac{1}{6} = \frac{3}{6} = 0.5$$

This matches our earlier finding that the probability of rolling an odd number is $0.5$.
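Marginalization is a one-line sum over the joint table. A minimal sketch (names are ours):

```python
from fractions import Fraction

P_joint = {(1, 0): Fraction(2, 6), (1, 1): Fraction(1, 6),
           (0, 0): Fraction(1, 6), (0, 1): Fraction(2, 6)}

# Marginalize out Y: sum the joint probability over all values of Y.
P_X1 = sum(p for (x, y), p in P_joint.items() if x == 1)
print(P_X1)  # 1/2
```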

Conditional Probability and Conditional Independence¶

Conditional Probability¶

The conditional probability of $X$ given $Y$ is defined as the joint probability of $X$ and $Y$ divided by the marginal probability of $Y$:

$$P(X \mid Y) = \frac{P(X, Y)}{P(Y)} $$

Where:

  • $P(X, Y)$ is the Joint Probability (e.g., $P(\text{Odd and } \le 3)$).
  • $P(Y)$ is the Marginal Probability (e.g., $P(\le 3)$).

Example Calculation:

Let's calculate the probability of the roll being Odd ($X=1$), given that the roll was smaller or equal to 3 ($Y=0$).

  1. Find the Joint Probability $P(X=1, Y=0)$:
    • Outcomes where $X=1$ (Odd) AND $Y=0$ ($\le 3$) are $\{1, 3\}$.
    • From the table: $P(X=1, Y=0) = 2/6$.
  2. Find the Marginal Probability $P(Y=0)$:
    • Outcomes where $Y=0$ ($\le 3$) are $\{1, 2, 3\}$.
    • $P(Y=0) = 3/6 = 0.5$. (This is the total probability for the condition being met).
  3. Calculate the Conditional Probability $P(X=1 \mid Y=0)$:

$$P(X=1 \mid Y=0) = \frac{P(X=1, Y=0)}{P(Y=0)} = \frac{2/6}{3/6} = \frac{2}{3} \approx 0.667 $$

Interpretation

  • Initial Belief (Unconditional $P(X=1)$): Before knowing anything about $Y$, the probability of rolling an odd number was $P(X=1) = 3/6 = 0.5$.
  • Updated Belief (Conditional $P(X=1 \mid Y=0)$): Once we are given the information that the roll was $\le 3$ ($Y=0$), our focus narrows to the set $\{1, 2, 3\}$. Within this new, restricted sample space, two out of the three outcomes are odd, so the probability increases to $2/3 \approx 0.667$.

Note: This process of using observed information ($Y=0$) to update the probability of an unknown state ($X=1$) is the fundamental idea behind the Measurement Update step (Bayes' Rule) in localization and filtering.
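The three-step calculation above translates directly into code: marginalize to get the denominator, then divide. A minimal sketch (names are ours):

```python
from fractions import Fraction

P_joint = {(1, 0): Fraction(2, 6), (1, 1): Fraction(1, 6),
           (0, 0): Fraction(1, 6), (0, 1): Fraction(2, 6)}

# Step 2: marginal probability of the condition Y = 0.
P_Y0 = sum(p for (x, y), p in P_joint.items() if y == 0)

# Step 3: conditional probability P(X=1 | Y=0) = P(X=1, Y=0) / P(Y=0).
P_X1_given_Y0 = P_joint[(1, 0)] / P_Y0
print(P_X1_given_Y0)  # 2/3
```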

The Alternative Definition of Independence¶

A more intuitive way to think about independence is through the lens of conditional probability.

If $X \perp Y$, then knowing the value of $Y$ does not change our belief about $X$. Mathematically, this is written as:

$$P(X \mid Y) = P(X)$$

Why this is equivalent:

If you take the standard independence formula $P(X, Y) = P(X)P(Y)$ and plug it into the conditional probability formula:

$$P(X \mid Y) = \frac{P(X, Y)}{P(Y)} = \frac{P(X)P(Y)}{P(Y)} = P(X)$$

Conditional Independence¶

Two random variables, $A$ and $B$, are conditionally independent given a third variable $C$ if knowing the value of $C$ makes $A$ and $B$ statistically unrelated.

In simpler terms, if $C$ is known, information about $B$ provides no additional information about $A$.

Formal Definition

$A$ and $B$ are conditionally independent given $C$ if the following holds:

$$P(A \mid B, C) = P(A \mid C)$$

This is equivalent to:

$$P(A, B \mid C) = P(A \mid C) P(B \mid C)$$

Notation for conditional independence:

$$A \perp B \mid C$$

Application to Markov Chains (The Markov Assumption)

This concept is the bedrock of sequential estimation filters. In the context of your robot localization:

  • $A$ = The Future state ($H_t$)
  • $B$ = The Past states ($H_1, \dots, H_{t-2}$)
  • $C$ = The Present state ($H_{t-1}$)

The Markov Assumption states that the future location of the robot ($H_t$) is conditionally independent of all history ($H_1, \dots, H_{t-2}$) given only the immediate previous location ($H_{t-1}$):

$$P(H_t \mid H_1, \dots ,H_{t-1}) = P(H_t \mid H_{t-1})$$

Why it's Critical for Filtering

If we had to track the entire history of the robot's movement ($H_1, H_2, \dots$), the calculations would quickly become unmanageable (the complexity would grow exponentially with time).

By assuming conditional independence, the system gains the memoryless property: we only need to store and update the belief from the last time step, $P(H_{t-1})$, to predict the next state, $P(H_t)$. This dramatically simplifies the prediction step and makes recursive filters, like the Kalman Filter, computationally practical.

Directed Graphical Models (Bayesian Networks)¶

A Directed Graphical Model (or Bayesian Network) uses nodes to represent random variables and directed edges (arrows) to represent probabilistic dependencies (causation or correlation). The structure of the graph is defined such that it satisfies the Markov Property: given its parents, a node is independent of all its non-descendants.

Encoding Conditional Independence

The core rule for reading independence from a directed graph is based on tracing paths between nodes and observing the state of conditioning nodes (nodes whose values are known).

For the Markov Chain:

  • The nodes are $H_{t-2}$, $H_{t-1}$, and $H_t$.
  • The edges are $H_{t-2} \to H_{t-1}$ and $H_{t-1} \to H_t$.

The structure $H_{t-2} \to H_{t-1} \to H_t$ forms a serial connection (or chain). In a serial connection, the start node ($H_{t-2}$) and the end node ($H_t$) are independent once the middle node ($H_{t-1}$) is observed (or known).

  • Rule: The conditioning node ($H_{t-1}$) blocks the flow of influence between $H_{t-2}$ and $H_t$.
  • Result: This perfectly encodes the Markov Assumption: $H_{t-2} \perp H_t \mid H_{t-1}$.

This graphical structure directly shows why the future state $H_t$ is conditionally independent of the past state $H_{t-2}$ given the present state $H_{t-1}$.

The Chain Rule of Probability¶

General Definition

For two random variables, $A$ and $B$, the joint probability $P(A, B)$ can be written in two symmetric ways, derived directly from the definition of conditional probability $P(A \mid B) = P(A, B) / P(B)$:

$$P(A, B) = P(A) \cdot P(B \mid A)$$

or

$$P(A, B) = P(B) \cdot P(A \mid B)$$

For a Sequence of Variables

When dealing with a sequence of $T$ random variables—like the robot's location over time, $H_1, H_2, \dots, H_T$—the Chain Rule can be expanded to express the full joint probability:

$$P(H_1, H_2, \dots ,H_T) = P(H_1) \cdot P(H_2 \mid H_1) \cdot P(H_3 \mid H_1, H_2) \cdots P(H_T \mid H_1, \dots ,H_{T-1})$$

Application to Robot Localization

This general formula applies to any sequence of random variables. However, when we apply the Markov Assumption (Conditional Independence) to the Chain Rule, the formula simplifies dramatically.

Recall the Markov Assumption: $P(H_t \mid H_1, \dots, H_{t-1}) = P(H_t \mid H_{t-1})$.

Applying this assumption to the expanded Chain Rule above simplifies the terms:

$$P(H_1, H_2, \dots ,H_T) = P(H_1) \cdot P(H_2 \mid H_1) \cdot P(H_3 \mid H_2) \cdots P(H_T \mid H_{T-1})$$

This can be written compactly using product notation:

$$P(H_1, \dots ,H_T) = P(H_1) \prod_{t=2}^{T} P(H_t \mid H_{t-1})$$

Conclusion for Filtering

This simplified form, derived from the Chain Rule and the Markov Assumption, is what makes recursive state estimation possible. It means that the long-term dependency of the system's state is encapsulated entirely by a sequence of short-term, one-step transitions $P(H_t \mid H_{t-1})$. This directly translates to your Prediction Step (convolution), where you only need the previous belief to compute the next.
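Under the Markov assumption, the probability of any full trajectory is just the initial probability times a product of one-step transition probabilities. The sketch below uses a hypothetical 3-cell transition model (the numbers and the name `trajectory_prob` are ours, not from the text):

```python
# Toy Markov chain on 3 grid cells (hypothetical transition model):
# T[i][j] = P(H_t = j | H_{t-1} = i); the robot stays put with
# probability 0.5 or moves one cell to the right (wrapping) with 0.5.
T = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
P_H1 = [1.0, 0.0, 0.0]  # the robot starts in cell 0 with certainty

def trajectory_prob(path):
    """Joint probability P(H_1=h_1, ..., H_T=h_T) via the simplified Chain Rule."""
    p = P_H1[path[0]]
    for prev, nxt in zip(path, path[1:]):
        p *= T[prev][nxt]  # only one-step transitions are needed
    return p

print(trajectory_prob([0, 1, 2]))  # 0.25 = 1.0 * 0.5 * 0.5
```

Note that the full history never enters the computation: each factor depends only on the previous state, which is exactly the memoryless property described above.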

Bayes' Rule¶

Bayes' Rule is the mathematical foundation for updating a belief (prior) with new evidence (measurement) to achieve a refined belief (posterior). It provides a way to calculate $P(A \mid B)$ when we only know the reverse conditional, $P(B \mid A)$.

The Rule

For two events (or random variables) $A$ and $B$:

$$P(A \mid B)=\frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Components Explained:

| Term | Name | Meaning |
|---|---|---|
| $P(A \mid B)$ | Posterior Probability | Our goal. The probability of the hypothesis $A$ being true, after observing the evidence $B$. |
| $P(B \mid A)$ | Likelihood | The probability of observing the evidence $B$, if the hypothesis $A$ were true. This is the Measurement Model in filtering. |
| $P(A)$ | Prior Probability | Our existing belief in the hypothesis $A$ before observing the evidence $B$. This comes from the Prediction Step. |
| $P(B)$ | Evidence / Normalizer | The total probability of observing the evidence $B$ across all possible hypotheses $A$. It ensures the final posterior distribution sums to one. |

The Core Idea: Reversing Logic

Bayes' Rule allows us to "invert" a model. It is often easy to know the Likelihood—e.g., "If the robot is at the door, how likely is the sensor to see a wall?"—but we actually need the Posterior: "If the sensor sees a wall, how likely is it that the robot is at the door?"

Calculating the "Evidence"

The denominator is often calculated via the law of total probability:

$$P(B) = \sum_{a \in \text{Val}(A)} P(B \mid A=a) \cdot P(A=a)$$
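The full update—likelihood times prior, normalized by the evidence—fits in a few lines. The sketch below uses hypothetical numbers for the door/wall example from the text (the prior and likelihood values are ours, chosen for illustration):

```python
from fractions import Fraction

# Hypothetical example: hypothesis A in {door, wall},
# evidence B = "sensor reports a door".
prior = {'door': Fraction(1, 3), 'wall': Fraction(2, 3)}          # P(A)
likelihood = {'door': Fraction(6, 10), 'wall': Fraction(2, 10)}   # P(B | A)

# Evidence: total probability of B, summed over all hypotheses.
P_B = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' Rule: posterior = likelihood * prior / evidence.
posterior = {a: likelihood[a] * prior[a] / P_B for a in prior}
print(posterior['door'])  # 3/5
```

Even though the prior favored "wall", the sensor reading shifts most of the belief to "door"—exactly the inversion of logic described above.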
