I find entropy to be extremely fascinating. But matching the formula to its “intuitive” explanations related to prefix-free codes and information content is elusive, at least for me. Here, I want to go over a few ways to independently arrive at the idea.
Properties of Information
Suppose we want to define a function $I(p)$, which represents the information content of an event. Abstracting away the specifics of the event, one measure we could use to compare one event to another is their probabilities of occurring. So, $I$ could be a mapping from a probability $p \in [0, 1]$ to $[0, \infty)$. Given this framing, the following requirements are sensible:
- $I(1) = 0$. If an event definitely occurs, it’s not very interesting and gives us little information.
- $I$ should be monotonically decreasing on $(0, 1]$. A more common event is less informative.
- Two independent events with probabilities $p$ and $q$ should have information $I(pq) = I(p) + I(q)$.
The last requirement is the most telling. The probability of two independent events occurring is $pq$. So

$$I(pq) = I(p) + I(q),$$

which only holds for

$$I(p) = c \log p.$$

If we want $I$ to be monotonically decreasing, $c$ must be negative. Since $p \le 1$, $\log p \le 0$. Letting $c = -k$ with $k > 0$,

$$I(p) = -k \log p = k \log \frac{1}{p}.$$

Since $k \log \frac{1}{p}$ is just $\log \frac{1}{p}$ taken in a different base, we can think of $k$ as encoding the base of the logarithm. For convenience, we let $k$ be 1 and use base 2, so that $I(p) = \log \frac{1}{p} = -\log p$ is measured in bits.
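As a quick sanity check, here’s a tiny Python sketch (the function name and the example probabilities are just mine for illustration) spot-checking the three requirements for $I(p) = -\log_2 p$:

```python
import math

def I(p: float) -> float:
    """Information content in bits: I(p) = -log2(p)."""
    return -math.log2(p)

print(I(1.0) == 0)                                    # True: a certain event carries no information
print(I(0.5) > I(0.9))                                # True: rarer events are more informative
print(math.isclose(I(0.5 * 0.25), I(0.5) + I(0.25)))  # True: additive over independent events
```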
Entropy is simply the expected value of $I$, over a distribution $p$:

$$H(p) = \mathbb{E}_{x \sim p}\left[\log \frac{1}{p(x)}\right] = -\sum_x p(x) \log p(x).$$

We also assume that $0 \log 0 = 0$, motivated by continuity.
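Entropy itself is then a one-liner; here’s a sketch using the $0 \log 0 = 0$ convention (the example distributions are arbitrary):

```python
import math

def entropy(dist: list[float]) -> float:
    """Shannon entropy in bits, using the convention 0 * log 0 = 0."""
    return sum(-p * math.log2(p) for p in dist if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin flip
print(entropy([0.9, 0.1]))   # ~0.469 bits: a biased coin is less surprising on average
print(entropy([1.0, 0.0]))   # 0.0 bits: a certain outcome carries no information
```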
Prefix-free codes
Suppose we have a set of symbols $\mathcal{X}$ that we want to transmit over a binary channel. We construct the channel such that we can send either a $0$ or a $1$ at a time. We want to find an optimal encoding scheme for $\mathcal{X}$, with one requirement: it is prefix-free.
Let’s define an encoding function $C : \mathcal{X} \to \{0, 1\}^*$, which maps each symbol $x$ to a binary string of length $\ell(x)$. We say an encoding is prefix-free if no codeword is a prefix of another. For example, a code like $\{0, 01, 11\}$ is not prefix-free because $0$ is a prefix of $01$; a code like $\{0, 10, 11\}$ is.
A prefix-free code is uniquely decodable without additional delimiters between symbols, which is a desirable property.
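Checking the prefix-free property directly is straightforward; here’s a small sketch (the helper name and the example codes are illustrative):

```python
def is_prefix_free(codewords: list[str]) -> bool:
    """True if no codeword is a prefix of a different codeword."""
    return not any(
        a != b and b.startswith(a)
        for a in codewords
        for b in codewords
    )

print(is_prefix_free(["0", "01", "11"]))  # False: "0" is a prefix of "01"
print(is_prefix_free(["0", "10", "11"]))  # True
```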
Also note that a binary prefix code is uniquely defined by a binary tree:
where the root-to-symbol path determines the codeword (say, a $0$ for every left branch and a $1$ for every right branch), and symbols are always leaves. Convince yourself that any construction like this results in a prefix code.
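To make the correspondence concrete, here’s a sketch that builds the tree for a prefix code and decodes a concatenated bit string by walking root-to-leaf, with no delimiters (the code and symbols are arbitrary examples):

```python
def build_tree(code: dict[str, str]) -> dict:
    """Build the binary tree of a prefix code as nested dicts; codewords end at leaves."""
    root: dict = {}
    for symbol, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node["symbol"] = symbol  # mark the leaf with its symbol
    return root

def decode(bits: str, code: dict[str, str]) -> list[str]:
    """Decode a concatenated bit string by walking root-to-leaf repeatedly."""
    root = build_tree(code)
    symbols, node = [], root
    for bit in bits:
        node = node[bit]
        if "symbol" in node:        # reached a leaf: emit the symbol, restart at the root
            symbols.append(node["symbol"])
            node = root
    return symbols

code = {"a": "0", "b": "10", "c": "11"}
print(decode("0100110", code))      # ['a', 'b', 'a', 'c', 'a']
```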
We will now show that the expected codeword length $L$ of any prefix code over a distribution $p$ is bounded below by the entropy $H(p)$, and that a prefix code with $L < H(p) + 1$ always exists.
Kraft’s Inequality
Suppose that $\ell_i$ is the length of the $i$th codeword. If the code is prefix-free:

$$\sum_i 2^{-\ell_i} \le 1.$$
Proof:
Let $\ell_{\max}$ be the length of the longest codeword. We notice that:
- There are at most $2^{\ell_{\max}}$ nodes at level $\ell_{\max}$ of the tree.
- For any codeword of length $\ell_i$, there are $2^{\ell_{\max} - \ell_i}$ descendants at level $\ell_{\max}$.
- The sets of descendants of each codeword are disjoint (since one codeword is never a descendant of another).
These imply

$$\sum_i 2^{\ell_{\max} - \ell_i} \le 2^{\ell_{\max}},$$

and dividing both sides by $2^{\ell_{\max}}$ gives the inequality.
Why $\le$ instead of equality? Because it is possible that a node at level $\ell_{\max}$ is not a descendant of any codeword (consider the tree of a code such as $\{0, 10\}$, where the node $11$ is unused)!
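Here’s a quick numeric illustration of Kraft’s inequality (the two codes are examples; the second leaves the node $11$ unused, which is exactly when the inequality is strict):

```python
def kraft_sum(codewords: list[str]) -> float:
    """Sum of 2^(-length) over the codewords."""
    return sum(2.0 ** -len(w) for w in codewords)

print(kraft_sum(["0", "10", "11"]))  # 1.0: every node at the deepest level is covered
print(kraft_sum(["0", "10"]))        # 0.75: node 11 is unused, so the inequality is strict
```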
Lower bound for L
Now, let’s consider the expected codeword length

$$L = \mathbb{E}_{x \sim p}[\ell(x)] = \sum_i p_i \ell_i,$$

writing $p_i$ for the probability of the $i$th symbol. We will show that entropy is a lower bound for $L$, or

$$L \ge H(p).$$
Proof:

$$
\begin{aligned}
L - H(p) &= \sum_i p_i \ell_i + \sum_i p_i \log p_i \\
&= -\sum_i p_i \log 2^{-\ell_i} + \sum_i p_i \log p_i \\
&= \sum_i p_i \log \frac{p_i}{2^{-\ell_i}}.
\end{aligned}
$$

Let $c = \sum_j 2^{-\ell_j}$ and define the distribution $q_i = \frac{2^{-\ell_i}}{c}$. Then

$$
L - H(p) = \sum_i p_i \log \frac{p_i}{c\, q_i} = D_{\mathrm{KL}}(p \,\|\, q) - \log c \ge 0,
$$

where the final inequality is due to 1) KL divergence being non-negative and 2) $\log c \le 0$ due to Kraft’s inequality. One thing to note is that if $\ell_i = \log \frac{1}{p_i}$, then $L = H(p)$. The reason we cannot always achieve this is because $\log \frac{1}{p_i}$ may not be an integer.
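As a small numeric check of the lower bound (the three-symbol distribution and the code $\{0, 10, 11\}$ are arbitrary choices of mine):

```python
import math

def entropy(dist: list[float]) -> float:
    return sum(-p * math.log2(p) for p in dist if p > 0)

# Example: three symbols with probabilities p and the prefix code {0, 10, 11}.
p = [0.5, 0.3, 0.2]
lengths = [1, 2, 2]

L = sum(pi * li for pi, li in zip(p, lengths))
print(entropy(p))  # ~1.485 bits
print(L)           # 1.5 >= H(p), as the bound says
```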
An Upper Bound for L
Notice that it is possible to construct a prefix code with lengths

$$\ell_i = \left\lceil \log \frac{1}{p_i} \right\rceil,$$

since they satisfy Kraft’s Inequality (and lengths satisfying the inequality can always be arranged into a binary tree, and hence a prefix code):

$$\sum_i 2^{-\lceil \log \frac{1}{p_i} \rceil} \le \sum_i 2^{-\log \frac{1}{p_i}} = \sum_i p_i = 1.$$

By the definition of the ceiling function,

$$\ell_i = \left\lceil \log \frac{1}{p_i} \right\rceil < \log \frac{1}{p_i} + 1.$$

Taking the expectation over $p$, we get

$$L = \sum_i p_i \ell_i < \sum_i p_i \log \frac{1}{p_i} + 1 = H(p) + 1.$$

This shows that $H(p) + 1$ is an upper bound for $L$!
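Putting the two bounds together, here’s a sketch that computes the lengths $\ell_i = \lceil \log \frac{1}{p_i} \rceil$ for an arbitrary example distribution and checks both Kraft’s inequality and $H(p) \le L < H(p) + 1$:

```python
import math

def entropy(dist: list[float]) -> float:
    return sum(-p * math.log2(p) for p in dist if p > 0)

p = [0.4, 0.3, 0.2, 0.1]                           # arbitrary example distribution
lengths = [math.ceil(-math.log2(pi)) for pi in p]  # [2, 2, 3, 4]

assert sum(2.0 ** -l for l in lengths) <= 1        # Kraft's inequality holds

L = sum(pi * li for pi, li in zip(p, lengths))
H = entropy(p)
print(H, L, H + 1)                                 # ~1.846 <= 2.4 < ~2.846
```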
In summary, entropy is a lower bound, and a reasonable estimate, for the average number of bits it takes to encode symbols drawn from a distribution using a prefix code.