maximum-likelihood-learning slides

Parameter Learning by Maximum Likelihood

Examples:

  • $n$ i.i.d. Bernoulli trials - Binomial distribution
  • Simple Bayesian Network with two binary random variables

Bayes Rule for Data and Model Parameters

  • $\mathcal D$: Data
  • $\theta$: Parameter of the model
$$ p(\theta | \mathcal D) = \frac{p(\mathcal D | \theta) p(\theta)}{ p(\mathcal D)} $$
  • numerator:
    • first term: "likelihood"
    • second term: "prior"
  • denominator: "marginal likelihood" (or "evidence")

$$ \text{posterior} = \frac{\text{likelihood} \cdot \text{prior}}{\text{evidence}} $$
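A minimal numerical sketch of this formula (all numbers are illustrative: a uniform prior over two candidate values of $\theta$ and a single observed Bernoulli trial $x=1$):

# Bayes rule with a discrete prior over two candidate parameter values
# (illustrative numbers, not taken from the examples below)
thetas = [0.3, 0.7]                                        # candidate values of theta
prior = [0.5, 0.5]                                         # p(theta)
x = 1                                                      # one observed Bernoulli outcome
likelihood = [t if x == 1 else 1 - t for t in thetas]      # p(D | theta)
evidence = sum(l * p for l, p in zip(likelihood, prior))   # p(D)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
print(posterior)                                           # approx [0.3, 0.7]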

Likelihood

The likelihood function is considered as a function of the parameters $\theta$:

$$ \mathcal L (\theta) = p(\mathcal D | \theta) $$

The likelihood function is not a probability function!

"Never say 'the likelihood of the data'. Alway say 'the likelihood of the parameters'. The likelihood function is not a probability distribution." (from D. MacKay: Information Theory, Inference and Learning Algorithms, Page 29, Cambride 2003, http://www.inference.phy.cam.ac.uk/itila/book.html)

Maximum Likelihood

$$ \arg\max_\theta \mathcal L(\theta) $$

Often the negative log-likelihood is minimized:

$$ \arg\max_\theta \mathcal L(\theta) = \arg\min_\theta \left( - \log \mathcal L(\theta) \right) $$

Example: Binomial Distribution (e.g. tossing a thumbtack)

$$ \arg\min_\theta \left( - \log \mathcal L(\theta) \right) = \arg\min_\theta \left( - \log \left( {n \choose k} \theta^k (1-\theta)^{n-k} \right) \right) $$

Since ${n \choose k}$ does not depend on $\theta$ and the logarithm is monotone, it suffices to find the extremum of $\theta^k (1-\theta)^{n-k}$. A necessary condition for the extremum:

$$ 0 = \frac{d}{d\theta} \left( \theta^k (1-\theta)^{n-k} \right) = k \theta^{k-1} (1-\theta)^{n-k} - (n-k) \theta^{k} (1-\theta)^{n-k-1} $$

Dividing by $\theta^{k-1} (1-\theta)^{n-k-1}$ gives

$$ k(1-\theta) = (n-k) \theta $$

$$ \theta_{ML} = \frac{k}{n} $$
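A quick numerical sanity check of this result (example values $n=10$, $k=3$, not tied to the data below), minimizing the negative log-likelihood on a grid:

# The grid value minimizing the negative log-likelihood matches the
# closed-form MLE k/n.
import numpy as np
from scipy.stats import binom
n, k = 10, 3
grid = np.linspace(0.001, 0.999, 999)
neg_log_lik = -binom.logpmf(k, n, grid)
print(grid[np.argmin(neg_log_lik)], k / float(n))   # both approx 0.3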

Example: Thumbtack using maximum likelihood estimation.

In [2]:
from IPython.display import Image
Image(filename='./thumbtack.jpg', width=200, height=200) 
Out[2]:
In [3]:
# number of iid flips
n = 14
from scipy.stats import bernoulli
bernoulli.rvs(0.3,size=n)
Out[3]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

The sufficient statistics for $\mathcal D$ are $k$ (the number of positive outcomes) and $n$ (the total number of outcomes).
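For example (the assignment to a variable `sample` is an addition here, since the sampling cell above does not store its draw):

# Sufficient statistics of the Bernoulli sample: k (number of ones) and n.
sample = bernoulli.rvs(0.3, size=n)
k = sample.sum()
print(k, n)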

In [4]:
import numpy as np

def maximum_likelihood_estimate(theta, n_total):
    # draw n_total iid Bernoulli samples and return them together with
    # the running MLE k/n after each trial
    r = bernoulli.rvs(theta, size=n_total)
    mle = [np.sum(r[:i]) / float(i) for i in range(1, len(r) + 1)]
    return r, mle

theta = 0.3

r, mle = maximum_likelihood_estimate(theta, n_total=10)
print("\nMaximum likelihood estimate for the parameter theta up to the k-th trial:")
mle
Maximum likelihood estimate for the parameter theta up to the k-th trial:
Out[4]:
[0.0,
 0.0,
 0.0,
 0.25,
 0.40000000000000002,
 0.33333333333333331,
 0.42857142857142855,
 0.375,
 0.33333333333333331,
 0.29999999999999999]

Law of large numbers: the estimate $k/n$ converges to the true parameter $\theta$ as the number of trials grows.

In [6]:
plot_estimates(theta)
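The helper `plot_estimates` is defined in a cell not shown in this excerpt; a minimal sketch of what it presumably does (plot the running MLE against the true parameter) could look like this:

# Hypothetical sketch of plot_estimates (the actual definition is not shown):
# plot the running MLE k/n against the true parameter theta.
import matplotlib.pyplot as plt

def plot_estimates(theta, n_total=500):
    r, mle = maximum_likelihood_estimate(theta, n_total=n_total)
    plt.plot(range(1, n_total + 1), mle, label="running MLE k/n")
    plt.axhline(theta, color="red", linestyle="--", label="true theta")
    plt.xlabel("number of trials")
    plt.ylabel("estimate")
    plt.legend()
    plt.show()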

MLE for Bayesian Networks

Example graph with two binary random variables $X$ and $Y$:

In [8]:
draw_XY(3.)
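The helper `draw_XY` is also defined outside this excerpt; a minimal sketch (assuming it draws the two-node graph $X \to Y$, with the argument scaling the node size) could be:

# Hypothetical sketch of draw_XY (the actual definition is not shown):
# draw the directed graph X -> Y; `scale` controls the node size.
import networkx as nx
import matplotlib.pyplot as plt

def draw_XY(scale=3.):
    G = nx.DiGraph()
    G.add_edge("X", "Y")
    pos = {"X": (0, 0), "Y": (1, 0)}
    nx.draw_networkx(G, pos, node_size=scale * 1000, node_color="lightgrey")
    plt.axis("off")
    plt.show()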

$\mathcal D$ consists of $n$ observations with the following joint frequencies:

  • freq(X=0, Y=0) = $f$
  • freq(X=0, Y=1) = $g$
  • freq(X=1, Y=0) = $h$
  • freq(X=1, Y=1) = $i$

with

  • total number of observations: $n = f + g + h + i$
  • number of $(X=1)$-observations: $k = h + i$
  • number of $(X=0)$-observations: $l = f+g = n-k$
$$ P(X, Y) = P(X) P(Y | X) $$

We have three free parameters for three Binomial distributions:

  • $\theta_X$: Probability of $X=1$
  • $\theta_{0Y}$: Probability of $Y=1$ given $X=0$
  • $\theta_{1Y}$: Probability of $Y=1$ given $X=1$

Jointly written as the parameter vector $\vec \theta = (\theta_X, \theta_{0Y},\theta_{1Y})$ or as the set $\theta = \{\theta_X, \theta_{0Y},\theta_{1Y}\}$.

The likelihood of parameter vector $\vec \theta = (\theta_X, \theta_{0Y},\theta_{1Y})$ given $\mathcal D$ is:

$$ \mathcal L(\vec \theta) = p( \mathcal D| \vec \theta) = P(\mathcal D_X | \theta_X) P(\mathcal D_{Y|X} | \theta_{0Y}, \theta_{1Y}) $$

Omitting the constant factor (the binomial coefficients, which are fixed given the data):

$$ \mathcal L(\vec \theta) \propto \left( \theta_X^{k} (1-\theta_X)^{n-k} \right) \left( \theta_{1Y}^{i} (1-\theta_{1Y})^{k-i} \right) \left( \theta_{0Y}^{g} (1-\theta_{0Y})^{l-g} \right) $$

The likelihood decomposes into a product of terms; this is called decomposability.

Correspondingly, for the log-likelihood:

$$ \begin{align} \log \mathcal L(\vec \theta) \propto & { } \left( k \log \theta_X + (n-k) \log (1-\theta_X) \right) + \\ & { } \left( i \log \theta_{1Y} + (k-i) \log (1-\theta_{1Y}) \right) + \\ & { } \left( g \log \theta_{0Y} + (l-g) \log (1-\theta_{0Y}) \right) \end{align} $$

So we have three independent terms $$ \begin{align} \log \mathcal L(\vec \theta) \propto & { }\log \mathcal L(\theta_X) + \\ & { } \log \mathcal L(\theta_{1Y}) + \\ & { } \log \mathcal L(\theta_{0Y}) \end{align} $$

So searching for the argmax of $\log \mathcal L(\vec \theta)$ can be done by finding the argmax of each term independently.

Log-Likelihood of the first term:

$$ \log \mathcal L (\theta_{X}) \propto k \log \theta_{X} + (n-k) \log (1-\theta_{X}) $$

We already know how to compute (see above) the MLE (maximum likelihood estimator) of $\theta_{X}$:

$$ \theta_{X}^{ML} = \frac{k}{n} $$

(Local) Log-Likelihood (second term):

$$ \log \mathcal L(\theta_{1Y})\propto i \log \theta_{1Y} + (k-i) \log (1-\theta_{1Y}) $$


MLE (maximum likelihood estimator):

$$ \theta_{1Y}^{ML} = \frac{i}{k} $$

(Local) Log-Likelihood (third term):

$$ \log \mathcal L(\theta_{0Y}) \propto g \log \theta_{0Y} + (l-g) \log (1-\theta_{0Y}) $$


MLE (maximum likelihood estimator):

$$ \theta_{0Y}^{ML} = \frac{g}{l} $$

So for (local) Binomial distributions (and analogously for Multinomials), the MLEs are obtained simply as ratios of frequencies.
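A minimal sketch in code (the counts f, g, h, i below are illustrative placeholders, not observed data):

# MLEs for the two-node network X -> Y from the joint frequency table.
f, g, h, i = 40, 10, 20, 30          # counts for (X,Y) = (0,0), (0,1), (1,0), (1,1)
n = f + g + h + i                    # total number of observations
k = h + i                            # number of X = 1 observations
l = f + g                            # number of X = 0 observations
theta_X = k / float(n)               # MLE of P(X = 1)           -> 0.5
theta_1Y = i / float(k)              # MLE of P(Y = 1 | X = 1)   -> 0.6
theta_0Y = g / float(l)              # MLE of P(Y = 1 | X = 0)   -> 0.2
print(theta_X, theta_0Y, theta_1Y)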

Global decomposition:

  • the likelihood function decomposes into a product of independent terms, one for each CPD (conditional probability distribution)