Information theory provides a mathematical framework for quantifying information. Developed by Claude Shannon in the 1940s, it established the limits of data compression and communication, founding an entire field that now touches everything from telecommunications to machine learning. Understanding information theory helps you reason about uncertainty, measure relationships between variables, and design better algorithms.
The Concept of Information
Information is not simply data—it is meaningful data that reduces uncertainty. When you learn something you did not know, you gain information. When you confirm what you already suspected, you gain little information. Information theory captures this intuition mathematically.
Shannon’s fundamental insight was that information should be measured in terms of uncertainty resolved. A message that could have been many things tells you more than a predictable message. A rare event contains more information than a common one.
This insight leads to a precise mathematical definition. If an event has probability p of occurring, its information content is -log₂(p) bits. The logarithm base 2 makes information additive for independent events: the information in two independent events is the sum of their individual information.
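As a quick illustration, this definition translates directly into a few lines of code (a minimal sketch; the function name is ours):

```python
import math

def information_content(p: float) -> float:
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

# A fair coin flip (p = 0.5) carries exactly 1 bit.
print(information_content(0.5))   # 1.0

# Two independent fair flips: probabilities multiply, information adds.
print(information_content(0.5 * 0.5))   # 2.0

# A rare event (p = 0.01) carries far more information than a common one.
print(round(information_content(0.01), 2))
```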
Why Logarithms?
The logarithmic relationship between probability and information emerges from desirable properties. Information should be additive for independent events—if two events are independent, learning about both should add their information contents. Up to a constant factor, the logarithm is the only continuous function that turns products of probabilities into sums in this way.
The base of the logarithm determines the unit. Base 2 gives bits, the familiar binary units. Base e gives nats. Base 10 gives decimal digits. The choice matters only when reporting numbers; the mathematical relationships are the same.
Logarithms also make the math work out nicely for probabilities. Multiplying probabilities becomes adding log-probabilities, which is computationally simpler. Many algorithms work in the log domain for this reason.
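A small sketch of this log-domain trick, assuming an arbitrary list of probabilities:

```python
import math

probs = [0.5, 0.25, 0.1]

# The direct product of probabilities can underflow for long sequences.
product = math.prod(probs)

# Working in the log domain turns the product into a numerically safer sum.
log_sum = sum(math.log2(p) for p in probs)

# The two routes agree: log of the product equals the sum of the logs.
assert abs(math.log2(product) - log_sum) < 1e-12
print(product, log_sum)
```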
Entropy: Measuring Uncertainty
Entropy measures the average uncertainty or information content of a random variable. It quantifies how much you learn, on average, from observing an outcome.
Definition and Intuition
For a discrete random variable X with probability distribution p(x), entropy is H(X) = -Σ p(x) log₂ p(x). The sum is over all possible outcomes. Entropy is measured in bits.
Consider a fair coin flip. Each outcome (heads or tails) has probability 0.5. The entropy is -[0.5 log₂(0.5) + 0.5 log₂(0.5)] = 1 bit. This makes sense—one bit distinguishes between two equally likely outcomes.
Now consider a biased coin that comes up heads 99% of the time. The entropy is approximately 0.08 bits—you learn almost nothing from each flip because the outcome is nearly predictable. Maximum entropy occurs when all outcomes are equally likely; minimum entropy (zero) occurs when the outcome is certain.
Entropy need not be a whole number of bits. With three equally likely outcomes, entropy is log₂(3) ≈ 1.58 bits; with four equally likely outcomes it is 2 bits, and with eight it is 3 bits. In general, entropy is at most log₂(n) for n outcomes, achieved when all are equally likely.
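The entropy formula translates directly into code; the examples above can be checked in a few lines (a sketch, with a helper name of our choosing):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # fair coin: 1.0 bit
print(round(entropy([0.99, 0.01]), 2))  # biased coin: 0.08 bits
print(entropy([0.25] * 4))              # four equal outcomes: 2.0 bits
print(round(entropy([1/3] * 3), 2))     # three equal outcomes: 1.58 bits
```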
Properties of Entropy
Entropy has several important properties. It is non-negative: H(X) ≥ 0. It is zero if and only if one outcome has probability 1. It is maximized when all outcomes are equally likely.
Entropy is subadditive: the joint entropy H(X, Y) is at most the sum of the individual entropies H(X) + H(Y). This makes intuitive sense—two variables observed together cannot be more uncertain than the sum of their separate uncertainties, because any information they share is counted only once.
Conditional entropy H(Y|X) measures the remaining uncertainty in Y after learning X. It is always less than or equal to H(Y)—knowing something cannot increase uncertainty. This is the foundation for understanding information gain.
Joint and Conditional Entropy
Real problems involve multiple random variables. Joint and conditional entropy extend single-variable concepts to multivariate settings.
Joint Entropy
Joint entropy H(X, Y) measures uncertainty in the joint distribution of X and Y. It is computed over all combinations of values: H(X, Y) = -ΣΣ p(x, y) log₂ p(x, y).
Joint entropy relates to individual entropies through the chain rule: H(X, Y) = H(X) + H(Y|X). The uncertainty in both variables equals the uncertainty in X plus the remaining uncertainty in Y given X.
This relationship mirrors the intuition that learning about X tells you something, and then learning about Y given X adds more information. The total is the sum, but the second term is at most H(Y), and strictly smaller whenever X carries information about Y.
Conditional Entropy
Conditional entropy H(Y|X) is the average entropy of Y given a particular value of X, averaged over all X values. It measures how much uncertainty remains in Y after learning X.
If X and Y are independent, then H(Y|X) = H(Y)—learning X tells you nothing about Y. If X completely determines Y, then H(Y|X) = 0—learning X removes all uncertainty about Y.
The chain rule in both directions shows these relationships: H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). This symmetry reflects that the total information in two variables is the same regardless of order.
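The chain rule can be checked numerically on a small, made-up joint distribution (both the distribution and the helper names below are illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

H_XY = entropy(joint.values())

# Marginals obtained by summing out the other variable.
p_x = [joint[0, 0] + joint[0, 1], joint[1, 0] + joint[1, 1]]
p_y = [joint[0, 0] + joint[1, 0], joint[0, 1] + joint[1, 1]]
H_X, H_Y = entropy(p_x), entropy(p_y)

# Conditional entropies via the chain rule: H(Y|X) = H(X,Y) - H(X).
H_Y_given_X = H_XY - H_X
H_X_given_Y = H_XY - H_Y

# Both orderings of the chain rule recover the same joint entropy.
assert abs((H_X + H_Y_given_X) - (H_Y + H_X_given_Y)) < 1e-12
print(round(H_XY, 4), round(H_Y_given_X, 4))
```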
Mutual Information
Mutual information measures the dependence between two variables. It quantifies how much information one variable provides about another.
Definition
Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y). Equivalently, it is H(Y) - H(Y|X), the reduction in Y’s uncertainty from knowing X.
When X and Y are independent, mutual information is zero—knowing one tells you nothing about the other. When X completely determines Y, mutual information equals H(Y)—knowing X reveals Y completely.
Mutual information is symmetric: I(X; Y) = I(Y; X). It is also non-negative and bounded by the smaller entropy: 0 ≤ I(X; Y) ≤ min(H(X), H(Y)).
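A sketch that computes mutual information from a joint distribution and checks the two boundary cases described above (the distributions and helper names are illustrative):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint dict {(x, y): p}."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = [sum(p for (x, _), p in joint.items() if x == xv) for xv in xs]
    p_y = [sum(p for (_, y), p in joint.items() if y == yv) for yv in ys]
    return entropy(p_x) + entropy(p_y) - entropy(joint.values())

# Independent variables: knowing one says nothing about the other.
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(mutual_information(indep))   # 0.0

# X completely determines Y: mutual information equals H(Y).
determ = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(determ))  # 1.0
```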
Applications in Machine Learning
Mutual information appears throughout machine learning. Feature selection often uses mutual information to identify relevant features—features with high mutual information with the target carry predictive power.
In clustering, mutual information between cluster assignments and true labels measures clustering quality. Normalized mutual information corrects for the tendency of raw mutual information to increase with cluster count.
In decision trees, information gain is precisely mutual information between the target and each potential split. The algorithm chooses splits that maximize information gain, reducing uncertainty about the target most.
Kullback-Leibler Divergence
KL divergence measures how one probability distribution differs from another. It appears in many machine learning contexts, from variational inference to loss functions.
Definition and Properties
For distributions P and Q, KL(P || Q) = Σ P(x) log₂(P(x)/Q(x)). It is zero when P = Q and positive otherwise. Unlike distance metrics, KL divergence is asymmetric: KL(P || Q) ≠ KL(Q || P) in general.
KL divergence is not a true distance because it violates the triangle inequality and is asymmetric. However, it serves as a measure of “direction” of distribution difference. KL(P || Q) asks: if we use Q to encode samples from P, how many extra bits do we need?
The non-negativity of KL divergence follows from Gibbs inequality: Σ P(x) log₂(P(x)/Q(x)) ≥ 0, with equality only when P = Q. This fundamental result underlies many proofs in information theory.
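The definition translates directly into code; a small sketch, assuming Q is nonzero wherever P is:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(round(kl_divergence(p, q), 4))  # positive: Q poorly encodes P
print(kl_divergence(p, p))            # 0.0: identical distributions
print(round(kl_divergence(q, p), 4))  # a different value: KL is asymmetric
```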
KL Divergence in Machine Learning
In variational autoencoders, KL divergence measures how closely the learned latent distribution matches a prior (usually Gaussian). The loss function balances reconstruction quality with latent space regularization.
In natural language processing, cross-entropy loss equals the entropy of the true distribution plus the KL divergence from it to the model's predictions: H(P, Q) = H(P) + KL(P || Q). When the true distribution is one-hot (a single correct answer), cross-entropy simplifies to the negative log-likelihood of that answer.
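The relationship between cross-entropy and KL divergence, and the one-hot special case, can be verified numerically (the distributions below are made up for illustration):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # illustrative "true" distribution
q = [0.6, 0.3, 0.1]   # illustrative model predictions

# Cross-entropy decomposes as entropy plus KL divergence.
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12

# One-hot target: cross-entropy reduces to the negative log-likelihood
# of the correct class.
one_hot = [1.0, 0.0, 0.0]
assert abs(cross_entropy(one_hot, q) - (-math.log2(q[0]))) < 1e-12
print("identities hold")
```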
In Bayesian inference, KL divergence appears in variational inference, where we approximate an intractable posterior with a simpler distribution. Minimizing KL divergence finds the best approximation within the chosen family.
Compression and the Entropy Bound
Shannon’s source coding theorem establishes the fundamental limit on lossless compression: no compression scheme can represent data using fewer bits than the entropy, on average.
The Compression Limit
For a source producing symbols with probabilities p₁, p₂, …, pₙ, the average code length cannot be less than the entropy H. This is not merely a theoretical limit—it can be approached arbitrarily closely using codes such as Huffman coding (over blocks of symbols) or arithmetic coding.
This result is profound: entropy exactly quantifies the irreducible information content. Data with high entropy (random-looking) cannot be compressed much. Data with low entropy (predictable) can be compressed dramatically. The boundary is sharp—no compression beyond entropy is possible.
Practical compressors approach but may not achieve the entropy bound. Arithmetic coding comes within a few bits of optimal for an entire message but is more complex to implement than simple prefix codes. LZ-based algorithms like gzip achieve good compression for many real-world files but are not optimal for all sources.
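To see the bound in action, a toy Huffman coder can be compared against the entropy of its source. This is a sketch; with dyadic probabilities (powers of 1/2), Huffman is exactly optimal, so the two numbers coincide:

```python
import heapq
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Code lengths from a Huffman tree built over symbol probabilities."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]  # (prob, tiebreak, symbols)
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merge adds one bit to these symbols
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, min(s1 + s2), s1 + s2))
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]  # dyadic source
lengths = huffman_lengths(probs)
avg_len = sum(p, l := None) if False else sum(p * l for p, l in zip(probs, lengths))
print(avg_len, entropy(probs))  # average code length equals entropy here
```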
Implications for Machine Learning
The entropy bound has implications for machine learning. If we can perfectly model the data distribution, we can compress it to the entropy. This connects learning to compression: good models enable good compression.
In generative modeling, the goal is to learn the data distribution. Once learned, we can sample from it to generate new data. The connection to compression suggests that better generative models should enable better compression of the data they model.
The information bottleneck theory uses information theory to understand deep learning. It suggests that the best representations capture information about the target while minimizing information about the input—finding the optimal trade-off between compression and relevance.
Practical Applications
Information theory concepts appear in many practical applications beyond compression.
Feature Selection and Relevance
Mutual information provides a principled way to select features. Features with high mutual information with the target are predictive. However, features may be redundant—two highly correlated features might each have high mutual information but add little beyond each other.
Conditional mutual information accounts for other features. Selecting features that maximize conditional mutual information avoids redundancy. This leads to algorithms that build sets of non-redundant, informative features.
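A toy illustration of the redundancy problem, using a simple plug-in mutual information estimate over made-up discrete samples (the feature names and data are hypothetical):

```python
import math
from collections import Counter

def mi_from_samples(xs, ys):
    """Plug-in mutual information estimate (bits) from paired discrete samples."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x, p_y = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), c in p_xy.items()
    )

label = [0, 0, 1, 1, 0, 1, 0, 1]
f1    = [0, 0, 1, 1, 0, 1, 0, 1]   # perfectly informative feature
f2    = list(f1)                   # redundant copy of f1
f3    = [0, 0, 0, 0, 1, 1, 1, 1]   # independent of the label in this sample

for name, feat in [("f1", f1), ("f2", f2), ("f3", f3)]:
    print(name, round(mi_from_samples(feat, label), 3))
```

Note that f1 and f2 each score the maximum on their own, yet selecting both adds nothing: exactly the redundancy that conditional mutual information is designed to catch.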
Anomaly Detection
Entropy can signal abnormality. Normal data often follows predictable patterns and therefore has low entropy, while anomalous behavior is less predictable—normal transactions follow patterns while anomalous ones do not. A sudden rise in the entropy of a monitored stream can flag novel patterns.
Mutual information between features and labels can identify relevant features for anomaly detection. High mutual information indicates strong predictive signal. Low mutual information suggests features that might be dropped.
Reinforcement Learning
In reinforcement learning, entropy regularization encourages exploration. Adding entropy to the objective rewards policies that maintain uncertainty over actions. This prevents premature convergence to suboptimal deterministic policies.
Information gain in state transitions measures learning progress. Curiosity-driven exploration uses information gain as a reward signal, encouraging the agent to visit states where it learns something new.
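A minimal sketch of an entropy bonus, with a hypothetical coefficient `beta` (natural log, so the bonus is in nats, as is common in RL implementations):

```python
import math

def policy_entropy(probs):
    """Entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def regularized_objective(expected_return, probs, beta=0.01):
    """Entropy-regularized objective: expected return plus an exploration
    bonus. beta is a hypothetical temperature coefficient."""
    return expected_return + beta * policy_entropy(probs)

uniform = [0.25] * 4                 # maximally exploratory policy
greedy  = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic policy

# With equal expected return, the entropy bonus favors the uniform policy,
# delaying premature convergence to a deterministic one.
print(regularized_objective(1.0, uniform) > regularized_objective(1.0, greedy))
```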
Conclusion
Information theory provides a unifying framework for understanding uncertainty, compression, and learning. Entropy quantifies uncertainty precisely. Mutual information measures dependence between variables. KL divergence compares distributions.
These concepts connect to practical machine learning. Feature selection uses mutual information. Generative models use KL divergence. Compression limits inform our expectations for learning. Entropy regularization encourages exploration.
Understanding information theory gives you tools to reason about data at a fundamental level. It helps you understand why certain algorithms work, how much data you truly need, and what limits exist on what you can learn. These insights are valuable across the full spectrum of machine learning applications.