Maximum Likelihood Estimation

Goal

We are given a dataset $\mathcal{D}$, which contains feature vectors $\mathbf{x}_i$ and class labels $y_i$. Denote $\mathcal{D}_j$ as the set of features of class $j$. We assume the following:

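  1. That $\mathbf{x} \mid y = j \sim \mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$. That is, given a class label, the distribution of features belonging to that class forms a Gaussian with mean $\boldsymbol{\mu}_j$ and covariance $\boldsymbol{\Sigma}_j$.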
  2. The samples are independent and identically distributed (i.i.d.) according to this assumed Gaussian distribution.

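The problem that MLE seeks to solve is to find the set of parameters $\boldsymbol{\theta}$ under which the observed data are most probable. We denote

$$\boldsymbol{\theta} = \{\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}_{j=1}^{c}$$

which includes the means and covariances for every one of the $c$ classes. The likelihood of $\boldsymbol{\theta}$ is

$$L(\boldsymbol{\theta}) = p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^{n} p(\mathbf{x}_i \mid y_i, \boldsymbol{\theta})$$

where $n$ is the number of datapoints, and the MLE of $\boldsymbol{\theta}$, $\hat{\boldsymbol{\theta}}$, is

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta})$$

In practice, we use the log-likelihood for simpler computation:

$$\ell(\boldsymbol{\theta}) = \ln L(\boldsymbol{\theta}) = \sum_{i=1}^{n} \ln p(\mathbf{x}_i \mid y_i, \boldsymbol{\theta})$$

since the logarithm is strictly increasing, so maximizing the log-likelihood is equivalent to maximizing the likelihood. In words, the likelihood tells us the probability (density) of generating our dataset if each datapoint was drawn independently from the distribution defined by $\boldsymbol{\theta}$. The $\boldsymbol{\theta}$ that maximizes this probability is taken as our estimate of the distribution from which $\mathcal{D}$ was drawn.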
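To illustrate why the log-likelihood is the practical choice, here is a minimal sketch comparing the raw product of densities with the sum of log-densities. It assumes a one-dimensional standard Gaussian and synthetic data for simplicity; the variable names (`log_density`, `likelihood`, `log_likelihood`) are illustrative and not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)  # synthetic i.i.d. samples (assumed setup)

mu, sigma = 0.0, 1.0  # parameters at which the likelihood is evaluated

# Per-sample log-density of a 1-D Gaussian N(mu, sigma^2).
log_density = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Raw likelihood: a product of 2000 densities, each below 0.4,
# which underflows to zero in float64.
likelihood = np.prod(np.exp(log_density))

# Log-likelihood: the same quantity in log-space, which stays finite.
log_likelihood = np.sum(log_density)

print(likelihood)
print(log_likelihood)
```

The product underflows to zero, so it can no longer distinguish between candidate parameters, while the sum of logs remains a usable objective.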

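We can attempt to find $\hat{\boldsymbol{\theta}}$ by setting the gradient of the log-likelihood $\ell(\boldsymbol{\theta})$ to $\mathbf{0}$ and verifying that the solution is a maximum. However, this only identifies a stationary point that is a local maximum; it doesn’t guarantee a global maximum.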

Example: Unknown $\boldsymbol{\mu}$

Let’s assume that each element in our dataset $\mathcal{D}$ is drawn from a multivariate Gaussian with known covariance $\boldsymbol{\Sigma}$ but unknown mean $\boldsymbol{\mu}$. What is the MLE of $\boldsymbol{\mu}$?

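To find the MLE of $\boldsymbol{\mu}$, we maximize the likelihood function. For a multivariate Gaussian distribution:

$$p(\mathbf{x} \mid \boldsymbol{\mu}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$

where $d$ is the dimension of $\mathbf{x}$.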
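The density above can be evaluated numerically in log-space. The following is a minimal sketch using NumPy; the function name `gaussian_log_density` is hypothetical, and the choice of `slogdet` and `solve` (rather than forming an explicit determinant and inverse) is an assumption made for numerical stability, not something prescribed by the text.

```python
import numpy as np

def gaussian_log_density(x, mu, sigma):
    """Log of the multivariate Gaussian density N(x; mu, sigma)."""
    d = mu.shape[0]
    diff = x - mu
    # log|sigma|, computed without forming the determinant directly.
    _, logdet = np.linalg.slogdet(sigma)
    # Quadratic form (x - mu)^T sigma^{-1} (x - mu) via a linear solve.
    quad = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

# Example: a 2-D standard Gaussian evaluated at its mean.
value = gaussian_log_density(np.zeros(2), np.zeros(2), np.eye(2))
print(value)
```

At the mean of a two-dimensional standard Gaussian the quadratic term and the log-determinant both vanish, so the result reduces to $-\ln(2\pi)$.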

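Since we assumed that samples are independent, the likelihood of the dataset is the product of the likelihoods of each of the $n$ samples $\mathbf{x}_i$. This becomes a sum in log-space:

$$\ell(\boldsymbol{\mu}) = \sum_{i=1}^{n} \ln p(\mathbf{x}_i \mid \boldsymbol{\mu}) = -\frac{nd}{2} \ln(2\pi) - \frac{n}{2} \ln |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu})$$

Taking the gradient and setting it to zero:

$$\nabla_{\boldsymbol{\mu}} \ell(\boldsymbol{\mu}) = \sum_{i=1}^{n} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) = \mathbf{0}$$

Derivation of gradient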

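Consider the quadratic form, where $\mathbf{x} \in \mathbb{R}^d$, $\mathbf{A} \in \mathbb{R}^{d \times d}$:

$$\mathbf{x}^\top \mathbf{A} \mathbf{x} = \sum_{i=1}^{d} \sum_{j=1}^{d} A_{ij} x_i x_j$$

Computing the gradient:

$$\frac{\partial}{\partial x_k} \left( \mathbf{x}^\top \mathbf{A} \mathbf{x} \right) = \sum_{j=1}^{d} A_{kj} x_j + \sum_{i=1}^{d} A_{ik} x_i$$

Where the first term comes from $i = k$ and the second from $j = k$. We notice that:

$$\sum_{j=1}^{d} A_{kj} x_j = (\mathbf{A} \mathbf{x})_k, \qquad \sum_{i=1}^{d} A_{ik} x_i = (\mathbf{A}^\top \mathbf{x})_k$$

so $\nabla_{\mathbf{x}} \left( \mathbf{x}^\top \mathbf{A} \mathbf{x} \right) = (\mathbf{A} + \mathbf{A}^\top) \mathbf{x}$.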
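The quadratic-form gradient $(\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ can be sanity-checked numerically. The sketch below compares it against central finite differences for a random, deliberately non-symmetric matrix; the dimension, step size `h`, and variable names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))  # deliberately non-symmetric
x = rng.normal(size=d)

def f(v):
    """Quadratic form v^T A v."""
    return v @ A @ v

# Closed-form gradient from the derivation.
analytic = (A + A.T) @ x

# Central finite differences, one coordinate at a time.
h = 1e-6
numeric = np.zeros(d)
for k in range(d):
    e_k = np.zeros(d)
    e_k[k] = 1.0
    numeric[k] = (f(x + h * e_k) - f(x - h * e_k)) / (2 * h)

gradients_match = np.allclose(numeric, analytic, atol=1e-5)
print(gradients_match)
```

Using a non-symmetric `A` matters here: with a symmetric matrix the check could not tell $(\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ apart from $2\mathbf{A}\mathbf{x}$.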

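In our case, we are differentiating with respect to $\boldsymbol{\mu}$, which enters each term as $\mathbf{x}_i - \boldsymbol{\mu}$ and so brings a negative sign when substituting. Using the fact that $\boldsymbol{\Sigma}$ is symmetric (as it is a covariance matrix), so that $\boldsymbol{\Sigma}^{-1}$ is symmetric as well, and the above result, which reduces to $2\mathbf{A}\mathbf{x}$ for symmetric $\mathbf{A}$:

$$\nabla_{\boldsymbol{\mu}} \left[ -\frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right] = \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu})$$

Summing over the samples and setting the gradient to zero gives $\boldsymbol{\Sigma}^{-1} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu}) = \mathbf{0}$. Multiplying by $\boldsymbol{\Sigma}$ on both sides:

$$\sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu}) = \mathbf{0}$$

which implies:

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i$$

which is the sample mean! This result makes intuitive sense: the most likely center of the Gaussian is the average of the observed points.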
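As a numerical check of this result, the sketch below draws synthetic data from a Gaussian with a known covariance, computes the sample mean, and confirms that the gradient of the log-likelihood vanishes there and that a nearby mean scores lower. The true mean, covariance, sample size, and perturbation are arbitrary values assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])               # assumed for the simulation
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])    # known covariance
X = rng.multivariate_normal(true_mu, sigma, size=500)

# MLE of the mean: the sample mean.
mu_hat = X.mean(axis=0)

sigma_inv = np.linalg.inv(sigma)

def log_likelihood(mu):
    """Log-likelihood of mu, dropping terms that do not depend on mu."""
    diff = X - mu
    return -0.5 * np.einsum("ni,ij,nj->", diff, sigma_inv, diff)

# Gradient of the log-likelihood at mu_hat: Sigma^{-1} * sum_i (x_i - mu_hat).
gradient_at_mu_hat = sigma_inv @ (X - mu_hat).sum(axis=0)

# A nearby candidate mean for comparison.
perturbed = mu_hat + np.array([0.1, -0.1])

print(np.linalg.norm(gradient_at_mu_hat))
print(log_likelihood(mu_hat) > log_likelihood(perturbed))
```

Because the log-likelihood is a concave quadratic in $\boldsymbol{\mu}$ when $\boldsymbol{\Sigma}$ is positive definite, the stationary point found here is the global maximum in this particular example.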