I find entropy to be extremely fascinating. But matching the formula to its “intuitive” explanations, related to prefix-free codes and information content, is not obvious. Here, I want to go over a couple of ways to independently arrive at the idea.
Properties of Information
Suppose we want to define a function $I(p)$, which represents the information content of an event. Abstracting away the specifics of the event, one measure we could use to compare one event to another is their probability of occurrence. So, $I$ could be a mapping from a probability $p \in (0, 1]$ to $[0, \infty)$. Given this framing, the following requirements are sensible:
- $I(1) = 0$. If an event definitely occurs, it’s not very interesting and gives us little information.
- $I$ should be continuous and monotonically decreasing on $(0, 1]$. A more common event is less informative.
- Two independent events with probabilities $p$ and $q$ should have combined information $I(p) + I(q)$.
The last requirement is the most telling. By definition, the probability of two independent events occurring is $pq$. So

$$I(pq) = I(p) + I(q).$$
Since the function must be continuous, this only holds for

$$I(p) = c \log p.$$

(Substituting $p = b^u$ and $q = b^v$ turns the requirement into Cauchy’s functional equation $f(u + v) = f(u) + f(v)$, whose only continuous solutions are linear in $u$.)
If we want $I$ to be monotonically decreasing, the derivative $c \cdot \frac{d}{dp} \log p$ must be negative. Since $\frac{d}{dp} \log p$ is positive, $c$ must be negative. Letting $c = -a$ for some $a > 0$,

$$I(p) = -a \log p.$$
Since $\log_b p = \frac{\ln p}{\ln b}$, where the denominator is a constant, we can think of $a$ as encoding the base of the logarithm. For convenience, we let $a$ be $1$, and let $\log$ denote the base-2 logarithm. This gives

$$I(p) = -\log p = \log \frac{1}{p}.$$
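As a quick numerical sanity check, here is a minimal sketch (plain Python; the helper name is just for illustration) of the additivity that forced the logarithm on us:

```python
import math

def information(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

p, q = 0.5, 0.25
# The information of the joint (independent) event is the sum of the individual informations.
assert math.isclose(information(p * q), information(p) + information(q))
print(information(p), information(q), information(p * q))  # 1.0 2.0 3.0
```

Halving an event’s probability adds exactly one bit of information.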
Entropy is simply the expected value of $I$ over a distribution $p$:

$$H(X) = \mathbb{E}_{x \sim p}\left[\log \frac{1}{p(x)}\right] = -\sum_x p(x) \log p(x).$$
We also assume that $0 \log 0 = 0$, motivated by continuity (since $\lim_{p \to 0^+} p \log p = 0$).
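As a small sketch of the definition (the distributions below are arbitrary examples):

```python
import numpy as np

def entropy(probs):
    """Entropy, in bits, of a discrete distribution, with the convention 0 log 0 = 0."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]  # drop zero-probability outcomes (0 log 0 = 0)
    return np.sum(nonzero * np.log2(1.0 / nonzero))

print(entropy([0.5, 0.5]))         # 1.0 bit: a fair coin
print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
print(entropy([1.0, 0.0]))         # 0.0 bits: a deterministic outcome tells us nothing
```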
For example, consider the Bernoulli random variable $X$, which takes the value $1$ with probability $p$, and $0$ with probability $1 - p$. If we plot its entropy

$$H(X) = -p \log p - (1 - p) \log (1 - p),$$
Plotting code

```python
import numpy as np
import plotly.graph_objects as go

# Entropy of a Bernoulli variable with success probability p
def bernoulli_entropy(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Values of p from 0.01 to 0.99 (avoiding log(0) at the endpoints)
p_values = np.linspace(0.01, 0.99, 100)
entropy_values = bernoulli_entropy(p_values)

# Create the plot
fig = go.Figure()

# Add the entropy trace
fig.add_trace(go.Scatter(x=p_values, y=entropy_values, mode='lines', name='Entropy', line=dict(color='red')))

# Update layout for dark mode
fig.update_layout(
    title='Entropy of a Bernoulli Random Variable',
    xaxis_title='p',
    yaxis_title='Entropy',
    template='plotly_dark'
)

# Save the plot to an HTML file
fig.write_html("bernoulli_entropy_plot.html")
```
we see that it is maximized when the distribution is uniform ($p = 0.5$), and minimized when it is nearly deterministic ($p$ close to $0$ or $1$).
Prefix-free codes
Suppose we have a set of symbols $\mathcal{X}$ that we want to transmit over a binary channel. We construct the channel such that we can send either a $0$ or a $1$ at a time. We want to find an optimal encoding scheme for $\mathcal{X}$, with one requirement: it is prefix-free.
Let’s define an encoding function $C : \mathcal{X} \to \{0, 1\}^*$, which maps each symbol $x$ to a binary string of length $\ell(x)$. We say an encoding is prefix-free if no codeword is a prefix of another. For example, $\{0, 01, 11\}$ is not prefix-free because $0$ is a prefix of $01$. However, $\{0, 10, 11\}$ is.
A prefix-free code is uniquely decodable without additional delimiters between symbols, which is a desirable property.
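Here is a quick sketch of both ideas: checking the prefix-free property, and the delimiter-free decoding it buys us (the codebooks are the toy examples from above):

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

def decode(bits, codebook):
    """Greedily decode a bit string using a prefix-free codebook (symbol -> codeword)."""
    inverse = {code: symbol for symbol, code in codebook.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:  # prefix-freeness guarantees the first match is correct
            symbols.append(inverse[buffer])
            buffer = ""
    return symbols

print(is_prefix_free({"0", "01", "11"}))  # False: "0" is a prefix of "01"
print(is_prefix_free({"0", "10", "11"}))  # True

codebook = {"a": "0", "b": "10", "c": "11"}
print(decode("0100110", codebook))        # ['a', 'b', 'a', 'c', 'a']
```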
We also notice that a binary prefix code is uniquely defined by a binary tree, where the root-to-symbol path determines the codeword, and symbols are always leaves. Convince yourself that any construction like this results in a prefix code.
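As a sketch of this correspondence (the nested-tuple tree representation here is just one convenient choice), we can read off each codeword by following the $0$/$1$ branches from the root down to a leaf:

```python
def codewords_from_tree(tree, path=""):
    """Map each leaf symbol of a binary tree to its root-to-leaf path of 0s (left) and 1s (right).

    A tree is either a symbol (a leaf) or a pair (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        return {tree: path}
    left, right = tree
    return {**codewords_from_tree(left, path + "0"),
            **codewords_from_tree(right, path + "1")}

# 'a' sits at depth 1; 'b' and 'c' share the subtree under the right branch.
print(codewords_from_tree(("a", ("b", "c"))))  # {'a': '0', 'b': '10', 'c': '11'}
```

Because symbols only appear at leaves, no root-to-leaf path can pass through another symbol, which is exactly the prefix-free property.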
We will now show that the expected codeword length $L$ of an optimal prefix code over $\mathcal{X}$ is bounded by

$$H(X) \le L < H(X) + 1,$$

where $X$ is a random variable that takes on values from the set $\mathcal{X}$ with probabilities $p(x)$. Most importantly, we see that the entropy of $X$ is a lower bound for how much a distribution can be compressed, or equivalently, how much information it contains.
Kraft’s Inequality
Suppose that $\ell_i$ is the length of the $i$-th codeword. If the code is prefix-free:

$$\sum_i 2^{-\ell_i} \le 1.$$
Proof:
Let $\ell_{\max}$ be the length of the longest codeword. We notice that:
- There are at most $2^{\ell_{\max}}$ nodes at level $\ell_{\max}$.
- For any codeword of length $\ell_i$, there are $2^{\ell_{\max} - \ell_i}$ descendants at level $\ell_{\max}$.
- The sets of descendants of each codeword are disjoint (since one codeword is never a descendant of another).
These imply

$$\sum_i 2^{\ell_{\max} - \ell_i} \le 2^{\ell_{\max}},$$

and dividing both sides by $2^{\ell_{\max}}$ gives the inequality.
Why $\le$ instead of equality? Because it is possible that a node at level $\ell_{\max}$ is not a descendant of any codeword (consider the tree of the code $\{0, 10\}$, where the leaf $11$ is unused)!
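A quick numeric check of the inequality on the two toy codes from this section:

```python
def kraft_sum(codewords):
    """Kraft sum of a set of binary codewords."""
    return sum(2 ** -len(c) for c in codewords)

print(kraft_sum({"0", "10", "11"}))  # 1.0: every leaf is used, so equality holds
print(kraft_sum({"0", "10"}))        # 0.75: the leaf below "11" is unused, so the sum is < 1
```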
A Lower Bound for L
Now, let’s consider the expected codeword length

$$L = \sum_x p(x)\, \ell(x).$$
We will show that entropy is a lower bound for $L$, or $L \ge H(X)$.
Proof:

$$L - H(X) = \sum_x p(x)\, \ell(x) + \sum_x p(x) \log p(x) = -\sum_x p(x) \log 2^{-\ell(x)} + \sum_x p(x) \log p(x).$$

Let $c = \sum_x 2^{-\ell(x)}$ and define the distribution $r(x) = \frac{2^{-\ell(x)}}{c}$. Then

$$L - H(X) = \sum_x p(x) \log \frac{p(x)}{r(x)} - \log c = D(p \,\|\, r) + \log \frac{1}{c} \ge 0,$$

where the final inequality is due to 1) KL divergence being non-negative and 2) $\log \frac{1}{c} \ge 0$, due to Kraft’s inequality. One thing to note is that if $\ell(x) = \log \frac{1}{p(x)}$ for every $x$, then $L = H(X)$, the theoretical minimum. The reason we cannot always achieve this is that $\log \frac{1}{p(x)}$ need not be an integer, which a codeword length obviously must be.
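A numeric sketch of the identity used in the proof, $L - H(X) = D(p \,\|\, r) + \log \frac{1}{c}$, on an arbitrary example (the distribution and code lengths below are purely illustrative):

```python
import numpy as np

# An illustrative distribution and the lengths of a matching prefix code, e.g. {"0", "110", "111"}.
p = np.array([0.7, 0.2, 0.1])
lengths = np.array([1, 3, 3])

L = np.sum(p * lengths)          # expected codeword length
H = -np.sum(p * np.log2(p))      # entropy
c = np.sum(2.0 ** -lengths)      # Kraft sum
r = 2.0 ** -lengths / c          # the auxiliary distribution from the proof
kl = np.sum(p * np.log2(p / r))  # D(p || r)

# L - H(X) = D(p || r) + log2(1 / c), and both terms on the right are non-negative.
assert np.isclose(L - H, kl + np.log2(1 / c))
print(L, H, L - H)               # 1.6, ~1.157, gap ~0.443
```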
An Upper Bound for L
Notice that it is possible to construct a prefix code with lengths

$$\ell(x) = \left\lceil \log \frac{1}{p(x)} \right\rceil,$$

since they satisfy Kraft’s Inequality:

$$\sum_x 2^{-\left\lceil \log \frac{1}{p(x)} \right\rceil} \le \sum_x 2^{-\log \frac{1}{p(x)}} = \sum_x p(x) = 1.$$
By the definition of the ceiling function,

$$\ell(x) = \left\lceil \log \frac{1}{p(x)} \right\rceil < \log \frac{1}{p(x)} + 1.$$
Taking the expectation over $p(x)$, we get

$$L = \sum_x p(x)\, \ell(x) < \sum_x p(x) \left( \log \frac{1}{p(x)} + 1 \right) = H(X) + 1.$$
This shows that $H(X) + 1$ is a good upper bound for $L$!
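A numeric sketch of this construction on an arbitrary example distribution, checking both Kraft’s inequality and the two-sided bound:

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])  # an illustrative distribution
lengths = np.ceil(-np.log2(p))      # Shannon code lengths: ceil(log2(1/p))

H = -np.sum(p * np.log2(p))
L = np.sum(p * lengths)

print(np.sum(2.0 ** -lengths))      # 0.6875 <= 1, so a prefix code with these lengths exists
print(H, L)                         # H ≈ 1.846 <= L = 2.4 < H + 1
assert H <= L < H + 1
```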
In summary, entropy is a lower bound on, and a reasonable estimate of, the average number of bits it takes to encode samples from a distribution with a prefix code.
References
- Alon Orlitsky’s ECE 255A Lectures, UCSD
- Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory. Wiley-Interscience.