KL Divergence – The complete guide

Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure of how one probability distribution diverges from another. It is commonly used in information theory and statistics to quantify the difference between two probability distributions.

Given two probability distributions, P and Q, defined over the same event space, the KL divergence from P to Q is denoted as KL(P || Q) and is defined as:

KL(P || Q) = Σ P(x) log [P(x) / Q(x)]

In this equation, x represents each event in the event space, P(x) is the probability of event x according to distribution P, and Q(x) is the probability of event x according to distribution Q.
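The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `kl_divergence`, the `base` parameter, and the example distributions are assumptions introduced here. Distributions are passed as equal-length lists of probabilities over the same events.

```python
import math

def kl_divergence(p, q, base=math.e):
    """KL(P || Q) for discrete distributions given as equal-length sequences.

    Illustrative helper (not from any library). Terms with P(x) = 0 are
    skipped, and the result is infinite if Q(x) = 0 where P(x) > 0.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same events")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # convention: 0 * log(0 / q) = 0
        if qx == 0:
            return math.inf  # P puts mass where Q has none
        total += px * math.log(px / qx, base)
    return total

# Example distributions over three events (made up for illustration).
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# 0.5*log2(2) + 0.25*log2(1) + 0.25*log2(0.5) = 0.5 + 0 - 0.25
print(kl_divergence(p, q, base=2))  # approximately 0.25 bits
```

Returning infinity for the unsupported case is one possible design choice; some implementations raise an error or apply smoothing instead.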

Some important properties of KL divergence are as follows:

  1. KL divergence is non-negative: KL(P || Q) ≥ 0, and it is equal to zero if and only if P and Q are identical.
  2. KL divergence is not symmetric: In general, KL(P || Q) ≠ KL(Q || P). Therefore, it is important to consider the order of the distributions when using KL divergence.
  3. KL divergence is unbounded: There is no upper limit to the value of KL divergence.
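These three properties are easy to check numerically. The sketch below uses made-up two-event distributions and a simplified helper that assumes Q(x) > 0 wherever P(x) > 0; the names `forward`, `reverse`, and `growth` are illustrative only.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

forward = kl_divergence(p, q)  # KL(P || Q)
reverse = kl_divergence(q, p)  # KL(Q || P)
print(forward, reverse)        # both non-negative, but not equal

# Unboundedness: as Q assigns less and less mass to an event that P
# considers likely, KL(P || Q) keeps growing.
uniform = [0.5, 0.5]
growth = [kl_divergence(uniform, [1 - eps, eps]) for eps in (1e-2, 1e-4, 1e-8)]
print(growth)
```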

Some common applications of KL Divergence are:

Information theory: KL divergence quantifies the information lost when one probability distribution is used to approximate another.

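Machine learning: KL divergence appears in the training objectives of generative models. Variational autoencoders include a KL term that keeps the approximate posterior close to the prior, and the original generative adversarial network (GAN) objective is related to the Jensen-Shannon divergence, which is built from KL terms.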

Natural language processing: Topic models such as Latent Dirichlet Allocation (LDA) are commonly fit with variational inference, which minimizes a KL divergence between an approximate posterior and the true posterior. KL divergence is also used to compare the word distributions of topics or documents.

Reinforcement learning: KL divergence measures how far an updated policy has moved from the previous one. Trust Region Policy Optimization (TRPO) constrains this divergence at each update, and Proximal Policy Optimization (PPO) limits it through a clipped objective or a KL penalty.

Overall, KL divergence provides a useful tool for comparing and quantifying the differences between probability distributions.

Here are some additional aspects and concepts related to KL divergence:

  1. KL divergence is not a true distance metric because it is not symmetric and does not satisfy the triangle inequality.
  2. It is calculated as the expected value, taken under P, of the log ratio log P(x) - log Q(x).
  3. The unit depends on the base of the logarithm: base 2 gives a value in bits, while the natural logarithm (common in machine learning) gives a value in nats. One nat equals 1 / ln 2 ≈ 1.443 bits.
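The expectation view and the choice of units can be illustrated together. In this sketch, the distributions, the helper `log_ratio`, and the sample size are assumptions chosen for illustration. It computes the exact value in bits and in nats, then estimates the same quantity by averaging the log ratio over samples drawn from P.

```python
import math
import random

# Example distributions over three labelled events (made up for illustration).
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

def log_ratio(x, base):
    """Log of P(x) / Q(x) in the given base."""
    return math.log(p[x] / q[x], base)

# Exact expectation under P, in two different units.
exact_bits = sum(p[x] * log_ratio(x, 2) for x in p)
exact_nats = sum(p[x] * log_ratio(x, math.e) for x in p)

# Monte Carlo estimate: draw samples from P and average the log ratio.
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(list(p), weights=list(p.values()), k=50_000)
estimate_bits = sum(log_ratio(x, 2) for x in samples) / len(samples)

print(exact_bits, exact_nats, estimate_bits)
```

The sample-based estimate is what one falls back on when the sum over all events is too large to enumerate but sampling from P is cheap.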


KL divergence measures the extra amount of information needed to encode data from one distribution when using a code optimized for another distribution. It can be thought of as a measure of the inefficiency or suboptimality of using Q to represent the data that truly follows P.
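The coding interpretation can be checked with a small sketch. It assumes an idealized code that spends exactly -log2(prob) bits on an event of probability prob; the distributions and the helper `expected_code_length` are made up for illustration.

```python
import math

# Data follows P, but the code may have been designed for Q.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

def expected_code_length(true_dist, code_dist):
    """Average bits per event when events follow true_dist but each event x
    is encoded with an idealized length of -log2(code_dist[x]) bits."""
    return sum(t * -math.log2(c) for t, c in zip(true_dist, code_dist) if t > 0)

optimal = expected_code_length(p, p)      # code matched to P
mismatched = expected_code_length(p, q)   # code matched to Q instead
extra_bits = mismatched - optimal

kl_bits = sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

print(optimal, mismatched, extra_bits, kl_bits)
```

The extra cost per event of using the mismatched code equals KL(P || Q) in bits, which is the inefficiency described above.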

  1. Relationship to cross-entropy: KL divergence is closely related to cross-entropy, which is a measure of the average number of bits needed to encode events from one distribution when using an optimal code for another distribution. KL divergence is equal to the cross-entropy between P and Q minus the entropy of P.
  2. Minimization and optimization: Minimizing KL divergence is often used as an optimization objective in various machine learning tasks. In generative models, minimizing KL divergence between the model’s distribution and the true data distribution helps the model to learn the underlying data distribution. In reinforcement learning, KL divergence can be used as a constraint to ensure that policy updates are not too drastic, maintaining the stability of the learning process.
  3. Variational inference: KL divergence plays a crucial role in variational inference, a technique for approximating intractable posterior distributions. The usual objective minimizes KL(q || p), where q is the approximating distribution and p is the true posterior. Because p is intractable, this is done indirectly by maximizing the evidence lower bound (ELBO), which differs from the negative divergence only by a constant.
  4. Discrete and continuous distributions: KL divergence is defined for both discrete and continuous distributions, provided P and Q are defined over the same space. For continuous distributions, the sum is replaced by an integral over the probability densities: KL(P || Q) = ∫ p(x) log [p(x) / q(x)] dx. It is not defined between distributions over different spaces. Related measures such as the Jensen-Shannon divergence are built from KL divergence and apply to discrete and continuous distributions alike.
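For some continuous families the integral has a closed form. As an illustration of the continuous case, the sketch below uses the standard closed form for two univariate Gaussians and compares it with a simple midpoint-rule approximation of the integral. The function names, integration range, and step count are assumptions chosen for this example.

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats."""
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

def log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def kl_numeric(mu1, sigma1, mu2, sigma2, lo=-12.0, hi=12.0, steps=20_000):
    """Midpoint-rule approximation of the integral of p(x) log[p(x)/q(x)]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_p = log_pdf(x, mu1, sigma1)
        total += math.exp(log_p) * (log_p - log_pdf(x, mu2, sigma2)) * dx
    return total

print(kl_gaussian(0.0, 1.0, 1.0, 2.0))
print(kl_numeric(0.0, 1.0, 1.0, 2.0))
```

The fixed integration range is only adequate when both densities are negligible outside it, which holds for the parameters used here.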

Properties of KL divergence:

  1. Additivity: KL divergence is additive over independent variables. If both distributions factorize, that is, P(X, Y) = P(X)P(Y) and Q(X, Y) = Q(X)Q(Y), then KL(P(X, Y) || Q(X, Y)) = KL(P(X) || Q(X)) + KL(P(Y) || Q(Y)).
  2. Invariance: KL divergence is invariant under invertible transformations of the underlying variable. If the same one-to-one change of variables (differentiable, in the continuous case) is applied to both P and Q, the divergence between them does not change. This is not true of differential entropy, which does depend on the parameterization.
  3. Non-negativity from Jensen’s inequality: Jensen’s inequality states that for any convex function f, the expected value of f(X) is greater than or equal to f of the expected value of X. Applying it to the convex function -log gives KL(P || Q) = E_P[-log(Q(x) / P(x))] ≥ -log E_P[Q(x) / P(x)] ≥ -log 1 = 0 for any distributions P and Q.
  4. Symmetrized variants: The simplest symmetrization is the sum KL(P || Q) + KL(Q || P), known as the Jeffreys divergence. The Jensen-Shannon divergence (JSD) is a symmetrized and smoothed variant, defined as (KL(P || M) + KL(Q || M)) / 2, where M is the average distribution given by M = (P + Q) / 2. The JSD is symmetric and bounded. With base-2 logarithms it ranges from 0 (when P and Q are identical) to 1 (when P and Q have disjoint supports); with natural logarithms the upper bound is ln 2.
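Two of these properties, additivity and the bounds of the Jensen-Shannon divergence, can be verified on small examples. The distributions and function names below are assumptions made for illustration, and base-2 logarithms are used so that the JSD upper bound is 1.

```python
import math

def kl_bits(p, q):
    """KL(P || Q) in bits; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, using the mixture M = (P + Q) / 2."""
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return (kl_bits(p, m) + kl_bits(q, m)) / 2

# Additivity: both joint distributions factorize over X and Y.
px, py = [0.7, 0.3], [0.2, 0.8]
qx, qy = [0.5, 0.5], [0.6, 0.4]
joint_p = [a * b for a in px for b in py]
joint_q = [a * b for a in qx for b in qy]
joint_kl = kl_bits(joint_p, joint_q)
sum_of_marginals = kl_bits(px, qx) + kl_bits(py, qy)
print(joint_kl, sum_of_marginals)

# JSD: symmetric, 0 for identical inputs, 1 bit for disjoint supports.
print(js_divergence(px, qx), js_divergence(qx, px))
print(js_divergence([1.0, 0.0], [0.0, 1.0]))
```

Because the mixture M is positive wherever P or Q is, the JSD stays finite even when the plain KL divergence would be infinite.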

Limitations and considerations:

  1. Support mismatch: KL(P || Q) is infinite whenever Q(x) = 0 for some event x with P(x) > 0, so it is uninformative for distributions with disjoint or only partly overlapping supports. Events with P(x) = 0 contribute zero to the sum by convention.
  2. Sensitivity to the reference distribution: The choice of Q affects the magnitude of the KL divergence, and different reference distributions can yield different results.
  3. No notion of distance between events: KL divergence compares only the probabilities that the two distributions assign to each event. It does not account for how similar or far apart the events themselves are.
  4. Estimation from data: When probabilities are estimated from observed counts, events that were never observed receive probability zero. These zeros must be handled, for example by smoothing, before the divergence is computed.
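One common way to handle zero probabilities in estimated distributions is additive (Laplace) smoothing. The sketch below is a minimal illustration under assumed inputs: the counts are made up, and the helper `smoothed_distribution` and its parameter `alpha` are hypothetical names rather than part of any library.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats; infinite if Q(x) = 0 where P(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total

def smoothed_distribution(counts, alpha=1.0):
    """Additive smoothing: add alpha to every count, then normalize."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# Made-up observation counts over three events.
counts_p = [8, 2, 0]
counts_q = [5, 0, 5]

raw_p = [c / sum(counts_p) for c in counts_p]
raw_q = [c / sum(counts_q) for c in counts_q]
print(kl_divergence(raw_p, raw_q))  # infinite: Q has a zero where P does not

smooth_p = smoothed_distribution(counts_p)
smooth_q = smoothed_distribution(counts_q)
smoothed_kl = kl_divergence(smooth_p, smooth_q)
print(smoothed_kl)  # finite
```

The result depends on the smoothing strength, so the value of `alpha` should be treated as a modelling choice rather than a neutral default.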

Alternative divergence measures:

Besides KL divergence, there are other divergence measures used to compare probability distributions, such as total variation distance, Hellinger distance, and Bhattacharyya distance. Each measure has its own properties and applicability depending on the context. These divergence measures can be used in different scenarios based on specific requirements, such as robustness to outliers, computational efficiency, or the need for a symmetric measure.
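For discrete distributions, the three measures named above can be sketched as follows, using their standard definitions. The function names and example distributions are assumptions made for illustration.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Overlap between P and Q: sum over events of sqrt(P(x) * Q(x))."""
    return sum(math.sqrt(px * qx) for px, qx in zip(p, q))

def total_variation(p, q):
    """Half the L1 distance between P and Q; ranges from 0 to 1."""
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

def hellinger(p, q):
    """Hellinger distance; ranges from 0 to 1."""
    squared = 0.5 * sum((math.sqrt(px) - math.sqrt(qx)) ** 2 for px, qx in zip(p, q))
    return math.sqrt(squared)

def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient; infinite if no overlap."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0 else -math.log(bc)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(total_variation(p, q))
print(hellinger(p, q))
print(bhattacharyya_distance(p, q))
```

Unlike KL divergence, all three are symmetric in P and Q, and the first two remain finite and bounded even when the supports do not overlap.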

Applications of KL divergence:

  1. Image processing: KL divergence is used in image compression algorithms to quantify the difference between the original image and the compressed image.
  2. Clustering: KL divergence is utilized in clustering algorithms to measure the similarity between clusters or to evaluate the quality of clustering.
  3. Bayesian inference: KL divergence is employed in Bayesian methods to assess the discrepancy between the prior distribution and the posterior distribution, which quantifies how much information the observed data contributed.
  4. Information retrieval: KL divergence is used in ranking algorithms to measure the relevance of search results based on user queries and document collections.