Bayesian Parameter Estimation (BPE) is fundamentally different from MLE or MAP. Whereas the latter two solve for an optimal set of parameters for the model, BPE treats $\theta$ as a random variable with a distribution $p(\theta)$.
Setup
We are given a dataset $\mathcal{D} = \{x_1, \dots, x_n\}$, which contains i.i.d. features $x_k \in \mathbb{R}^d$. Given a new feature vector $x$, we want to classify it into some class $\omega_i$. One way to do this is by Bayes’ decision rule. That is, we choose class $\omega_1$ over class $\omega_2$ if

$$p(x \mid \mathcal{D}_1)\, P(\omega_1) > p(x \mid \mathcal{D}_2)\, P(\omega_2),$$

where $\mathcal{D}_1$ only contains features belonging to class $\omega_1$, and vice versa. We can’t directly evaluate $p(x \mid \mathcal{D}_i)$ without assuming further structure on the underlying distribution.
So, let’s assume that the distribution is fully described by a model $p(x \mid \theta)$, parameterized only by $\theta$, a random variable. This distribution tells us how likely we are to observe $x$ if it belonged to class $\omega_i$. From now, I omit the subscript $i$ on $\mathcal{D}_i$ for brevity. Then we observe that

$$p(x \mid \mathcal{D}) = \int p(x, \theta \mid \mathcal{D})\, d\theta = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta,$$

where the last equality holds because $x$ is conditionally independent of $\mathcal{D}$ given $\theta$.

This is much more manageable. We can compute $p(x \mid \theta)$ by plugging $\theta$ into our assumed model. $p(\theta \mid \mathcal{D})$ can also be computed, since

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta')\, p(\theta')\, d\theta'}, \qquad p(\mathcal{D} \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

by the i.i.d. assumption.
To summarize, we have devised a method that gives us a likelihood for $x$, averaged over all possible parameters $\theta$, weighted by the prior and the likelihood of $\theta$ given the class-conditional data $\mathcal{D}$.
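Before specializing to the Gaussian case, here is a minimal numerical sketch of this recipe in one dimension. It approximates $p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$ on a grid; the model (a unit-variance Gaussian with unknown mean), the prior, and the data values are all illustrative assumptions on my part, not anything fixed by the method itself.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional data D and a grid over the unknown parameter.
D = np.array([1.8, 2.1, 2.4, 1.9])
thetas = np.linspace(-5.0, 5.0, 2001)
dtheta = thetas[1] - thetas[0]

# p(theta): a broad Gaussian prior; p(D | theta): product of i.i.d. likelihoods.
prior = norm.pdf(thetas, loc=0.0, scale=2.0)
lik = np.prod(norm.pdf(D[:, None], loc=thetas, scale=1.0), axis=0)

# p(theta | D) via Bayes' theorem, normalized on the grid.
posterior = prior * lik
posterior /= posterior.sum() * dtheta

def predictive(x):
    """p(x | D) ~= sum_j p(x | theta_j) p(theta_j | D) * dtheta."""
    return np.sum(norm.pdf(x, loc=thetas, scale=1.0) * posterior) * dtheta

print(predictive(2.0))
```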
Gaussian Case
In the case that our model is a Gaussian whose mean $\mu$ is itself a random variable with a Gaussian distribution, and whose covariance $\Sigma$ is known, BPE is quite easy to compute. In this case, our parameter set $\theta$ consists of just $\mu$.
We assume the following:

- $p(x \mid \mu) = \mathcal{N}(\mu, \Sigma)$. That is, our model is valid for each class.
- $p(\mu) = \mathcal{N}(\mu_0, \Sigma_0)$. Here, $\mu_0, \Sigma_0$ are our “best guess” for the shape of each class-conditional distribution, before seeing the data.
Keeping in mind that our goal is to compute $p(x \mid \mathcal{D})$, we first need to find $p(\mu \mid \mathcal{D})$. From Bayes’ theorem:

$$p(\mu \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mu)\, p(\mu)}{p(\mathcal{D})} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu).$$

Plugging in the Gaussian formulas:

$$p(\mu \mid \mathcal{D}) = \alpha' \exp\!\left[-\frac{1}{2}\left(\sum_{k=1}^{n} (x_k - \mu)^\top \Sigma^{-1} (x_k - \mu) + (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0)\right)\right],$$

where $\alpha$ and $\alpha'$ are normalizing constants that do not depend on $\mu$.
Derivation
We notice that the exponent is quadratic in $\mu$. This means $p(\mu \mid \mathcal{D})$ must also be a Gaussian! Let’s put it in standard form. We handle the first and second terms in the exponent separately, dropping anything constant in $\mu$. First term:

$$\sum_{k=1}^{n} (x_k - \mu)^\top \Sigma^{-1} (x_k - \mu) = \mu^\top \left(n \Sigma^{-1}\right) \mu - 2\, \mu^\top \Sigma^{-1} \sum_{k=1}^{n} x_k + \text{const} = \mu^\top \left(n \Sigma^{-1}\right) \mu - 2\, \mu^\top \left(n \Sigma^{-1} \hat{\mu}_n\right) + \text{const},$$

where $\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$ is the sample mean. Second term:

$$(\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) = \mu^\top \Sigma_0^{-1} \mu - 2\, \mu^\top \Sigma_0^{-1} \mu_0 + \text{const}.$$

Grouping them back together:

$$p(\mu \mid \mathcal{D}) = \alpha'' \exp\!\left[-\frac{1}{2}\left(\mu^\top \left(n \Sigma^{-1} + \Sigma_0^{-1}\right) \mu - 2\, \mu^\top \left(n \Sigma^{-1} \hat{\mu}_n + \Sigma_0^{-1} \mu_0\right)\right)\right],$$

which simplifies to

$$p(\mu \mid \mathcal{D}) = \beta \exp\!\left[-\frac{1}{2} (\mu - \mu_n)^\top \Sigma_n^{-1} (\mu - \mu_n)\right],$$

where

$$\Sigma_n^{-1} = n \Sigma^{-1} + \Sigma_0^{-1}, \qquad \mu_n = \Sigma_n \left(n \Sigma^{-1} \hat{\mu}_n + \Sigma_0^{-1} \mu_0\right),$$

which can be found by equating the quadratic and linear terms in $\mu$.
Therefore, $p(\mu \mid \mathcal{D}) = \mathcal{N}(\mu_n, \Sigma_n)$.
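For concreteness, here is a small numpy sketch of this closed-form update; the function name and argument layout are my own, and the input is assumed to be an $n \times d$ array of class-conditional samples.

```python
import numpy as np

def posterior_params(X, mu0, Sigma0, Sigma):
    """Closed-form posterior p(mu | D) = N(mu_n, Sigma_n) for known Sigma."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                    # sample mean of the class data
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma0_inv = np.linalg.inv(Sigma0)
    # Sigma_n^{-1} = n Sigma^{-1} + Sigma_0^{-1}
    Sigma_n = np.linalg.inv(n * Sigma_inv + Sigma0_inv)
    # mu_n = Sigma_n (n Sigma^{-1} mu_hat + Sigma_0^{-1} mu_0)
    mu_n = Sigma_n @ (n * Sigma_inv @ mu_hat + Sigma0_inv @ mu0)
    return mu_n, Sigma_n
```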
To complete the exercise, we need to find $p(x \mid \mathcal{D})$. Since $p(x \mid \mu) = \mathcal{N}(\mu, \Sigma)$, we can express $x = \mu + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Sigma)$ is independent of $\mu$. It is evident that $x$ is the sum of two independent Gaussian random variables, so it is itself Gaussian with mean $\mu_n$ and covariance $\Sigma_n + \Sigma$. Then $p(x \mid \mathcal{D}) = \mathcal{N}(\mu_n, \Sigma_n + \Sigma)$.
So it turns out that, with this method, we don’t need to evaluate an integral at all!
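Continuing the sketch above, the predictive density then reduces to a single Gaussian evaluation; this hypothetical predictive_density helper reuses posterior_params, so it carries the same assumptions.

```python
from scipy.stats import multivariate_normal

def predictive_density(x, X, mu0, Sigma0, Sigma):
    """p(x | D) = N(x; mu_n, Sigma_n + Sigma): no integration required."""
    mu_n, Sigma_n = posterior_params(X, mu0, Sigma0, Sigma)
    return multivariate_normal.pdf(x, mean=mu_n, cov=Sigma_n + Sigma)
```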
In Summary
- $p(\mu) = \mathcal{N}(\mu_0, \Sigma_0)$, where $\mu_0, \Sigma_0$ are “guessed”
- $p(\mu \mid \mathcal{D}) = \mathcal{N}(\mu_n, \Sigma_n)$, where $\mu_n, \Sigma_n$ are the class-conditional statistics computed from $\mathcal{D}$
- $p(x \mid \mathcal{D}) = \mathcal{N}(\mu_n, \Sigma_n + \Sigma)$. This function is used for the Bayes Decision Rule
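To tie everything together, here is a hypothetical end-to-end example of the Bayes decision rule built on the two sketches above; the synthetic data, the prior $\mu_0, \Sigma_0$, and the class probabilities $P(\omega_i)$ are all made-up values chosen only so the example runs.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.eye(2)                            # known covariance, shared by both classes
mu0, Sigma0 = np.zeros(2), 10.0 * np.eye(2)  # broad prior "best guess"

# Synthetic class-conditional datasets D_1 and D_2.
D1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(20, 2))
D2 = rng.normal(loc=[-2.0, -1.0], scale=1.0, size=(20, 2))
P1, P2 = 0.5, 0.5                            # class priors P(w1), P(w2)

# Choose class 1 over class 2 iff p(x | D1) P(w1) > p(x | D2) P(w2).
x = np.array([1.5, 1.0])
s1 = predictive_density(x, D1, mu0, Sigma0, Sigma) * P1
s2 = predictive_density(x, D2, mu0, Sigma0, Sigma) * P2
print("class 1" if s1 > s2 else "class 2")
```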