Goal
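We are given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, which contains feature vectors $\mathbf{x}_i \in \mathbb{R}^d$ and class labels $y_i \in \{1, \dots, C\}$. Denote $\mathcal{D}_c$ as the set of features of class $c$. We assume the following:
- That $p(\mathbf{x} \mid y = c) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$. That is, given a class label $c$, the features belonging to that class follow a Gaussian distribution with mean $\boldsymbol{\mu}_c$ and covariance $\boldsymbol{\Sigma}_c$.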
- The samples are independent and identically distributed (i.i.d.) according to this assumed Gaussian distribution.
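The problem that MLE seeks to solve is to find the set of parameters $\boldsymbol{\theta}$ under which the observed data are most probable. We denote

$$\boldsymbol{\theta} = \{\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c\}_{c=1}^{C},$$

which includes the means and covariances for every class. The likelihood of $\boldsymbol{\theta}$ is

$$\mathcal{L}(\boldsymbol{\theta}) = p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^{n} p(\mathbf{x}_i \mid y_i, \boldsymbol{\theta}),$$

and the MLE of $\boldsymbol{\theta}$, $\hat{\boldsymbol{\theta}}$, is

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}).$$

In practice, we use the log-likelihood for simpler computation:

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \log \mathcal{L}(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{n} \log p(\mathbf{x}_i \mid y_i, \boldsymbol{\theta}),$$

since the logarithm is strictly increasing, so maximizing the log-likelihood is equivalent to maximizing the likelihood. In words, the likelihood tells us the probability (density) of generating our dataset if each datapoint was drawn independently from the distribution defined by $\boldsymbol{\theta}$. The $\boldsymbol{\theta}$ that maximizes it gives our best estimate, under the Gaussian assumption, of the distribution from which $\mathcal{D}$ was drawn.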
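The practical reason for working in log-space can be seen in a small experiment. The sketch below is an illustration only: it assumes a univariate standard Gaussian for simplicity, and the function names, sample size, and seed are hypothetical choices, not part of the derivation. It compares the raw product of densities with the sum of log-densities.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(data, mu, sigma):
    """Product of per-sample densities (prone to floating-point underflow)."""
    result = 1.0
    for x in data:
        result *= gaussian_pdf(x, mu, sigma)
    return result

def log_likelihood(data, mu, sigma):
    """Sum of per-sample log-densities (numerically stable)."""
    const = -math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return sum(const - 0.5 * ((x - mu) / sigma) ** 2 for x in data)

# Assumed setup: 2000 i.i.d. draws from N(0, 1) with a fixed seed.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]

raw = likelihood(data, 0.0, 1.0)        # underflows to 0.0 in double precision
log_l = log_likelihood(data, 0.0, 1.0)  # remains a finite negative number

print("likelihood:    ", raw)
print("log-likelihood:", log_l)
```

Every density here is below 1, so the product of 2000 of them falls under the smallest representable double, while the sum of logs stays well within range.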
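We can attempt to find $\hat{\boldsymbol{\theta}}$ by setting the gradient of $\log \mathcal{L}(\boldsymbol{\theta})$ to $\mathbf{0}$ and verifying that the solution is a maximum. However, this only guarantees a local maximum; it is a global maximum when the log-likelihood is concave in $\boldsymbol{\theta}$.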
Example: Unknown $\boldsymbol{\mu}$
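Let’s assume that each element in our dataset $\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^{n}$ is drawn from a multivariate Gaussian with known covariance $\boldsymbol{\Sigma}$ but unknown mean $\boldsymbol{\mu}$. What is the MLE of $\boldsymbol{\mu}$?
To find the MLE of $\boldsymbol{\mu}$, we maximize the likelihood function. For a multivariate Gaussian distribution:

$$p(\mathbf{x} \mid \boldsymbol{\mu}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right),$$

where $d$ is the dimension of $\mathbf{x}$.
Since we assumed that samples are independent, the likelihood of the dataset is the product of the likelihoods of each $\mathbf{x}_i$. This becomes a sum in log-space:

$$\log \mathcal{L}(\boldsymbol{\mu}) = \sum_{i=1}^{n} \log p(\mathbf{x}_i \mid \boldsymbol{\mu}) = -\frac{nd}{2} \log(2\pi) - \frac{n}{2} \log |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}).$$

Taking the gradient with respect to $\boldsymbol{\mu}$ and setting it to zero:

$$\nabla_{\boldsymbol{\mu}} \log \mathcal{L}(\boldsymbol{\mu}) = -\frac{1}{2} \sum_{i=1}^{n} \nabla_{\boldsymbol{\mu}} \left[ (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right] = \mathbf{0}.$$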
Derivation of gradient
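Consider the quadratic form, where $\mathbf{z} \in \mathbb{R}^d$, $A \in \mathbb{R}^{d \times d}$:

$$f(\mathbf{z}) = \mathbf{z}^\top A \mathbf{z} = \sum_{j=1}^{d} \sum_{k=1}^{d} A_{jk} z_j z_k.$$

Computing the gradient:

$$\frac{\partial f}{\partial z_l} = \sum_{k=1}^{d} A_{lk} z_k + \sum_{j=1}^{d} A_{jl} z_j.$$

Where the first term comes from $j = l$ and the second from $k = l$. We notice that:

$$\sum_{k=1}^{d} A_{lk} z_k = (A \mathbf{z})_l, \qquad \sum_{j=1}^{d} A_{jl} z_j = (A^\top \mathbf{z})_l,$$

so,

$$\nabla_{\mathbf{z}} f(\mathbf{z}) = (A + A^\top) \mathbf{z}.$$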
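The identity $\nabla_{\mathbf{z}} (\mathbf{z}^\top A \mathbf{z}) = (A + A^\top) \mathbf{z}$ can be checked numerically. The following is a minimal sketch in plain Python; the matrix, the vector, and the helper names are hypothetical choices made for illustration. It compares the closed-form gradient against central finite differences.

```python
def quad_form(A, z):
    """f(z) = z^T A z for a square matrix A (nested lists) and a vector z."""
    n = len(z)
    return sum(A[j][k] * z[j] * z[k] for j in range(n) for k in range(n))

def analytic_grad(A, z):
    """Closed-form gradient (A + A^T) z."""
    n = len(z)
    return [sum((A[l][k] + A[k][l]) * z[k] for k in range(n)) for l in range(n)]

def numeric_grad(f, z, h=1e-5):
    """Central finite differences, perturbing one coordinate at a time."""
    grad = []
    for l in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[l] += h
        z_minus[l] -= h
        grad.append((f(z_plus) - f(z_minus)) / (2.0 * h))
    return grad

# A is deliberately non-symmetric, so (A + A^T) z differs from 2 A z.
A = [[2.0, -1.0, 0.5],
     [3.0, 1.0, 0.0],
     [0.0, 4.0, 1.5]]
z = [0.7, -1.2, 2.0]

g_exact = analytic_grad(A, z)
g_approx = numeric_grad(lambda v: quad_form(A, v), z)
max_error = max(abs(a - b) for a, b in zip(g_exact, g_approx))

print("analytic gradient:", g_exact)
print("numeric gradient: ", g_approx)
print("max abs error:    ", max_error)
```

When $A$ is symmetric, the same formula reduces to $2A\mathbf{z}$, which is the case that matters for a covariance matrix.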
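In our case, $\mathbf{z} = \mathbf{x}_i - \boldsymbol{\mu}$ and $A = \boldsymbol{\Sigma}^{-1}$, and we are differentiating with respect to $\boldsymbol{\mu}$, which brings a negative sign when substituting. Using the fact that $\boldsymbol{\Sigma}$ is symmetric (as it is a covariance matrix), so that $\boldsymbol{\Sigma}^{-1}$ is symmetric too, and the above result:

$$\nabla_{\boldsymbol{\mu}} \log \mathcal{L}(\boldsymbol{\mu}) = -\frac{1}{2} \sum_{i=1}^{n} \left( -2 \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right) = \sum_{i=1}^{n} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) = \mathbf{0}.$$

Multiplying by $\boldsymbol{\Sigma}$ on both sides:

$$\sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu}) = \mathbf{0},$$

which implies:

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i,$$

which is the sample mean! This result matches intuition: the best estimate of the center of a Gaussian is the average of the samples drawn from it. Because the Hessian of the log-likelihood, $-n \boldsymbol{\Sigma}^{-1}$, is negative definite, this stationary point is the global maximum.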
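As a sanity check on the closed-form result, the sketch below simulates a 2D Gaussian with a known covariance and confirms that the sample mean scores higher than nearby candidate means. The covariance, true mean, sample size, and perturbation grid are all hypothetical values picked for illustration, and the code uses plain Python lists to stay dependency-free.

```python
import math
import random

def inv_2x2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def log_likelihood(data, mu, S_inv):
    """Gaussian log-likelihood in mu, dropping terms that do not depend on mu."""
    total = 0.0
    for x in data:
        r = [x[0] - mu[0], x[1] - mu[1]]
        total -= 0.5 * sum(r[j] * S_inv[j][k] * r[k]
                           for j in range(2) for k in range(2))
    return total

def sample_mean(data):
    """Coordinate-wise average of a list of 2D points."""
    n = len(data)
    return [sum(x[0] for x in data) / n, sum(x[1] for x in data) / n]

# Assumed setup: known covariance; the true mean is used only to simulate data.
Sigma = [[2.0, 0.6], [0.6, 1.0]]
true_mu = [1.0, -2.0]

# Draw x = true_mu + L e with e ~ N(0, I), where L is the Cholesky factor of Sigma.
l11 = math.sqrt(Sigma[0][0])
l21 = Sigma[1][0] / l11
l22 = math.sqrt(Sigma[1][1] - l21 ** 2)
rng = random.Random(0)
data = []
for _ in range(500):
    e0, e1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    data.append([true_mu[0] + l11 * e0, true_mu[1] + l21 * e0 + l22 * e1])

S_inv = inv_2x2(Sigma)
mu_hat = sample_mean(data)
best = log_likelihood(data, mu_hat, S_inv)

# Log-likelihood drop when moving away from the sample mean on a small grid.
offsets = [(dx, dy) for dx in (-0.1, 0.0, 0.1) for dy in (-0.1, 0.0, 0.1)
           if (dx, dy) != (0.0, 0.0)]
gaps = [best - log_likelihood(data, [mu_hat[0] + dx, mu_hat[1] + dy], S_inv)
        for dx, dy in offsets]

print("sample mean:", mu_hat)
print("smallest drop in log-likelihood:", min(gaps))
```

Every entry of `gaps` is positive, which is what the negative definite Hessian predicts: moving the candidate mean in any direction away from the sample mean lowers the log-likelihood.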