Reflections on the Bregman Divergence
Reading arXiv:2008.01883v2, I was thinking a little about the Bregman divergence. It is a measure of distance sometimes used in probability theory that I am not that well acquainted with, so I thought I would share some shallow reflections on the topic. Mostly it is definitions and simple algebra, nothing deep, but who knows - maybe someone will find this useful.
Definition
Taken from Wikipedia.
Given a closed convex set \(\Omega\), a real-valued, strictly convex, continuously differentiable function \(F: \Omega \to \mathbb R\), and an inner product \(\langle \cdot ,\, \cdot \rangle\), we define the Bregman divergence as
\[D_F(p,q) = F(p) - F(q) - \left\langle \nabla F(q),\, p-q\right\rangle\]This is a peculiar formula, as it is essentially a second-and-higher-order difference. Consider the special case when \(\Omega=\mathbb R\) and \(F\) is real analytic (so it equals its Taylor series around \(q\)); then
\[\begin{aligned} D_F(p,q) &= F(p) - F(q) - \left\langle \nabla F(q),\, p-q\right\rangle \\ &= F\left(q+(p-q)\right) - F(q) - F'(q)(p-q) \\ &= \sum_{k=0}^{\infty}\frac{F^{(k)}(q)}{k!}(p-q)^k - F(q) - F'(q)(p-q) \\ &= \sum_{k=2}^{\infty}\frac{F^{(k)}(q)}{k!}(p-q)^k \\ &=O((p-q)^2) \end{aligned}\]using big-O notation, and we see that one way to think of a Bregman divergence is to ask “How different are the function values of \(F\) between \(p\) and \(q\), if we only consider the nonlinearities of \(F\)?”
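As a quick numerical sanity check of this second-order behaviour, here is a minimal sketch (my own, assuming numpy; the choice \(F = \exp\) is just a convenient smooth, strictly convex function). The ratio \(D_F(p,q)/(p-q)^2\) should approach \(F''(q)/2\) as \(p \to q\).

```python
import numpy as np

def bregman_1d(F, dF, p, q):
    """Scalar Bregman divergence D_F(p, q) = F(p) - F(q) - F'(q)(p - q)."""
    return F(p) - F(q) - dF(q) * (p - q)

F = dF = np.exp  # F = exp: every derivative is exp itself

q = 0.3
for h in (1e-1, 1e-2, 1e-3):
    p = q + h
    print(h, bregman_1d(F, dF, p, q) / (p - q) ** 2)
# The printed ratios approach F''(q)/2 = exp(0.3)/2 ≈ 0.675, as the Taylor argument predicts.
```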
Also - since \(F\) is convex and continuously differentiable, \(F(p) \geq F(q) + \left\langle \nabla F(q),\, p-q\right\rangle\), so \(D_F(p,q) \geq 0\). The divergence \(D_F(p,q)\) is also linear in \(F\) and convex in \(p\).
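These properties are easy to poke at numerically. Below is a minimal sketch (my own, assuming numpy; the helper name bregman and the strictly convex test function \(F(x)=\sum_i e^{x_i}\) are my own choices for illustration) checking nonnegativity, \(D_F(p,p)=0\), and linearity in \(F\).

```python
import numpy as np

def bregman(F, grad_F, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    return F(p) - F(q) - grad_F(q) @ (p - q)

# A strictly convex test function: F(x) = sum_i exp(x_i), with gradient exp(x).
F, grad_F = lambda x: np.sum(np.exp(x)), lambda x: np.exp(x)
# A second convex function for the linearity check: the squared norm.
G, grad_G = lambda x: x @ x, lambda x: 2 * x

rng = np.random.default_rng(0)
p, q = rng.normal(size=3), rng.normal(size=3)

assert bregman(F, grad_F, p, q) >= 0            # nonnegativity
assert np.isclose(bregman(F, grad_F, p, p), 0)  # zero when p = q

# Linearity in F: D_{aF + bG}(p, q) = a D_F(p, q) + b D_G(p, q)
a, b = 2.0, 3.0
lhs = bregman(lambda x: a * F(x) + b * G(x),
              lambda x: a * grad_F(x) + b * grad_G(x), p, q)
rhs = a * bregman(F, grad_F, p, q) + b * bregman(G, grad_G, p, q)
assert np.isclose(lhs, rhs)
```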
Example: Euclidean distance
Let \(\Omega = \mathbb R^n\) and \(F(x) = \|x\|^2\). Then we can compute
\[\begin{aligned} D_F(p,q) &= F(p) - F(q) - \left\langle \nabla F(q),\, p-q\right\rangle \\ &= \|p\|^2- \|q\|^2 - 2q^T(p-q) \\ &= \|p\|^2 - 2q^Tp + \|q\|^2\\ &= \|p-q\|^2 \end{aligned}\]So in this case, we get the squared Euclidean distance back. Notice also that if we add a linear term, \(F(x) = \|x\|^2 +a^Tx+b\), we still get the squared Euclidean distance in the end. Linear and constant terms will cancel!
One can say that the squared Euclidean distance is the nonlinear difference in squared norm.
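A small numerical check of the cancellation claim (again a sketch of my own, assuming numpy; the vector \(a\) and constant \(b\) are arbitrary):

```python
import numpy as np

def bregman(F, grad_F, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    return F(p) - F(q) - grad_F(q) @ (p - q)

rng = np.random.default_rng(1)
p, q = rng.normal(size=4), rng.normal(size=4)
a, b = rng.normal(size=4), 5.0  # arbitrary linear and constant terms

# F(x) = ||x||^2             ->  D_F(p, q) = ||p - q||^2
F1, g1 = lambda x: x @ x, lambda x: 2 * x
# F(x) = ||x||^2 + a^T x + b ->  same divergence; the added terms cancel
F2, g2 = lambda x: x @ x + a @ x + b, lambda x: 2 * x + a

assert np.isclose(bregman(F1, g1, p, q), np.sum((p - q) ** 2))
assert np.isclose(bregman(F2, g2, p, q), np.sum((p - q) ** 2))
```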
Example: Generalized KL Divergence
Let \(\Omega\) be the space of nonnegative functions \(p : \mathbb Z \to \mathbb R_{\geq 0}\) such that \(\sum_{i\in \mathbb Z} p(i) < \infty\). Define the negative entropy \(F(p) = -H(p) = \sum_{i} p(i)\log (p(i))\).
Since \(F\) is a functional of \(p\), we must apply a suitable definition of the nabla operator and inner product. We use the functional (directional) derivative to compute \(\left\langle \nabla F(q),\, p-q\right\rangle = \lim_{\epsilon \to 0} \frac{F(q+\epsilon(p-q))-F(q)}{\epsilon}\). I’ll use the more compact notation \(p_i=p(i)\) and \(q_i=q(i)\).
\[\begin{aligned} &\lim_{\epsilon \to 0} \frac{F(q+\epsilon(p-q))-F(q)}{\epsilon} = \\ &= \lim_{\epsilon \to 0} \frac{1}{\epsilon} \sum_{i} \left(q_i+\epsilon(p_i-q_i)\right)\log \left(q_i+\epsilon(p_i-q_i)\right) - q_i\log q_i \\ &= \lim_{\epsilon \to 0} \frac{1}{\epsilon} \sum_{i} q_i\log q_i+\epsilon(p_i-q_i)\log q_i + q_i\frac{\epsilon(p_i-q_i)}{q_i} + O(\epsilon^2) - q_i\log q_i \\ &= \lim_{\epsilon \to 0} \sum_{i} (p_i-q_i)\log q_i + (p_i-q_i) + O(\epsilon) \\ &= \sum_{i} (p_i-q_i)\log q_i + p_i-q_i \\ &= -F(q) + \sum_{i} p_i\log q_i + p_i-q_i \end{aligned}\]Plugging this into the definition of the Bregman divergence, we can massage it into
\[\begin{aligned} D_F(p,q) &= F(p)-F(q)-\left( -F(q) + \sum_{i} p_i\log q_i + p_i-q_i \right) \\ &= \sum_i p_i\log p_i - p_i\log q_i - p_i+q_i \\ &= \sum_i p_i\log \left(\frac{p_i}{q_i}\right) - p_i+q_i \end{aligned}\]We identify the definition of the KL divergence \(D_{KL}(p\|q)= \sum_i p_i\log \left(\frac{p_i}{q_i}\right)\), so that
\[D_F(p,q) = D_{KL}(p\|q) + \sum_{i} (q_i - p_i)\]As a special case, when \(p\) and \(q\) are probability mass functions, i.e. \(\sum_ip(i) = \sum_iq(i) = 1\), then \(D_F(p,q) = D_{KL}(p\|q)\). One can thus say that the KL divergence between two distributions is the nonlinear difference between their negative entropies.
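To close the loop, here is a numerical sketch of the last identity (my own code, assuming numpy; the vectors are kept strictly positive so the logarithms are well defined):

```python
import numpy as np

def bregman(F, grad_F, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    return F(p) - F(q) - grad_F(q) @ (p - q)

# Negative entropy F(p) = sum_i p_i log p_i, with gradient log(q_i) + 1.
neg_entropy      = lambda p: np.sum(p * np.log(p))
grad_neg_entropy = lambda q: np.log(q) + 1

def generalized_kl(p, q):
    """Generalized KL divergence: sum_i p_i log(p_i/q_i) - p_i + q_i."""
    return np.sum(p * np.log(p / q) - p + q)

rng = np.random.default_rng(2)
p, q = rng.random(5) + 0.1, rng.random(5) + 0.1  # positive, not normalized

assert np.isclose(bregman(neg_entropy, grad_neg_entropy, p, q), generalized_kl(p, q))

# For probability mass functions the extra sum vanishes and we recover plain KL.
p, q = p / p.sum(), q / q.sum()
assert np.isclose(bregman(neg_entropy, grad_neg_entropy, p, q), np.sum(p * np.log(p / q)))
```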