diff --git a/entropy.tex b/entropy.tex
index e990fff..4f4ca91 100644
--- a/entropy.tex
+++ b/entropy.tex
@@ -2,11 +2,13 @@
 \usepackage[utf8x]{inputenc}
 \usepackage[margin=1in]{geometry} % Adjust margins
 \usepackage{caption}
+\usepackage{wrapfig}
 \usepackage{subcaption}
 \usepackage{parskip} % dont indent after paragraphs, figures
 \usepackage{xcolor}
 %\usepackage{csquotes} % Recommended for biblatex
 \usepackage{tikz}
+\usepackage{pgfplots}
 \usepackage{float}
 \usepackage{amsmath}
 \PassOptionsToPackage{hyphens}{url}
@@ -55,17 +57,18 @@ $k_B$ refers to the Boltzmann constant, which he himself did not determine.
 \textit{Claude Shannon} adapted the concept of entropy to information theory.
 In an era of advancing communication technologies, the question he addressed was of increasing importance:
 How can messages be encoded and transmitted efficiently?
-As a measure, Shannon's formula uses the \textit{Bit}, quantifying the efficiency of codes
-and media for transmission and storage.
-According to his axioms, a measure for information has to comply with the following criteria:
+He proposed three axioms, plus a fourth technical smoothness condition, that a measure of
+information $I(p)$ has to satisfy:
 \begin{enumerate}
     \item $I(1) = 0$: events that always occur do not communicate information.
-    \item $I(p)$ is monotonically decreasing in p: an increase in the probability of an event
+    \item $I(p)$ is monotonically decreasing in $p$, i.e.\ $I'(p) \leq 0$: an increase in the probability of an event
     decreases the information from an observed event, and vice versa.
     \item $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$: the information learned from independent events is the sum of the information learned from each event.
     \item $I(p)$ is a twice continuously differentiable function of p.
 \end{enumerate}
+
+As a measure, Shannon's formula uses the \textit{Bit}, quantifying the efficiency of codes
+and media for transmission and storage.
 In information theory, entropy can be understood as the expected information of a message.
 \begin{equation}
     H = E(I) = - \sum_i p_i \log_2(p_i)
 \end{equation}
@@ -78,6 +81,60 @@ that tomorrows message will be the same. When some day we get the message 'Vanco
 not only semantically (because it announces the eruption of a volcano) but statistically
 because it was very unlikely given the transmission history.
 
+However, the uncertainty (entropy) of this source is still relatively low.
+We attach high surprise only to the unlikely message of an eruption; the far more likely
+routine message carries little information, because we already expected it before it arrived.
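+
+To make this concrete, suppose (the probabilities are chosen purely for illustration) that the
+station reports ``no eruption'' with probability $0.99$ and ``eruption'' with probability $0.01$.
+The rare message then carries $-\log_2(0.01) \approx 6.64$ bits, the routine one only
+$-\log_2(0.99) \approx 0.01$ bits, and the expected information of the source is
+\begin{equation*}
+    H = -0.99 \log_2(0.99) - 0.01 \log_2(0.01) \approx 0.08 \text{ bits},
+\end{equation*}
+far below the $1$ bit of a fair coin toss.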
+
+Putting the axioms and our intuitive understanding of information and uncertainty together,
+we arrive at the curves shown in \autoref{fig:graph-information} and \autoref{fig:graph-entropy}.
+
+\begin{figure}[H]
+\begin{minipage}{.5\textwidth}
+\begin{tikzpicture}
+    \begin{axis}[
+        domain=0.001:1,
+        samples=100,
+        axis lines=middle,
+        xlabel={$p$},
+        ylabel={Information},
+        xmin=0, xmax=1,
+        ymin=0, ymax=6.1,
+        grid=both,
+        width=8cm,
+        height=6cm,
+        every axis x label/.style={at={(current axis.right of origin)}, anchor=west},
+        every axis y label/.style={at={(current axis.above origin)}, anchor=south},
+    ]
+    \addplot[thick, blue] {-log2(x)};
+    \end{axis}
+\end{tikzpicture}
+\caption{Information contained in a message depending on its probability $p$}
+\label{fig:graph-information}
+\end{minipage}%
+\begin{minipage}{.5\textwidth}
+\begin{tikzpicture}
+    \begin{axis}[
+        domain=0.001:0.999,
+        samples=100,
+        axis lines=middle,
+        xlabel={$p$},
+        ylabel={Entropy},
+        xmin=0, xmax=1,
+        ymin=0, ymax=1.1,
+        grid=both,
+        width=8cm,
+        height=6cm,
+        every axis x label/.style={at={(current axis.right of origin)}, anchor=west},
+        every axis y label/.style={at={(current axis.above origin)}, anchor=south},
+    ]
+    \addplot[thick, blue] {-x * log2(x) - (1-x) * log2(1-x)};
+    \end{axis}
+\end{tikzpicture}
+\caption{Entropy of an event source with two possible events, depending on their probabilities $(p, 1-p)$}
+\label{fig:graph-entropy}
+\end{minipage}
+\end{figure}
+
 The base 2 is chosen for the logarithm as our computers rely on a system of the same base,
 but theoretically arbitrary bases can be used as they are proportional according to
 $\log_a b = \frac{\log_c b}{\log_c a} $.
@@ -136,8 +193,23 @@ However, drawbacks include overfitting and poor robustness, where minimal altera
 can lead to a change in tree structure.
 
 \subsection{Cross-Entropy}
-Kullback-Leibler = $H(p,q) - H(p)$
-as a cost function in machine learning
+When dealing with two distributions, the \textit{cross-entropy} between a true distribution $p$
+and an estimated distribution $q$ is defined as:
+\begin{equation}
+    H(p, q) = -\sum_x p(x) \log_2 q(x)
+\end{equation}
+The \textit{Kullback–Leibler divergence} measures how much information is lost when $q$
+is used to approximate $p$:
+\begin{equation}
+    D_{KL}(p \| q) = H(p, q) - H(p)
+\end{equation}
+Both quantities appear in machine learning as loss (cost) functions: cross-entropy as the
+standard loss in classification problems, and the KL divergence in probabilistic models such as
+Variational Autoencoders (VAEs).
+In classification, the true label and the predicted class probabilities take the roles of the
+true distribution $p$ and the estimated distribution $q$, respectively.
+For a single supervised training example, the label is assumed to be the one-hot unit vector
+$e_i$ of the true class $i$; only this one term of the sum survives, so the cross-entropy loss
+reduces to $-\log_2(q_i)$, as illustrated in the short numerical sketch below.
+
 \subsection{Coding}
 %Coding of a source of an information and communication channel
 % https://www.youtube.com/watch?v=ErfnhcEV1O8
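+
+As a small numerical sketch of this collapse (the predicted probabilities are made up for
+illustration): for a three-class example with true class $i = 1$ and prediction
+$q = (0.7, 0.2, 0.1)$, the one-hot label is $p = e_1 = (1, 0, 0)$ and the loss is
+\begin{equation*}
+    H(e_1, q) = -(1 \cdot \log_2 0.7 + 0 \cdot \log_2 0.2 + 0 \cdot \log_2 0.1) = -\log_2 0.7 \approx 0.51 \text{ bits}.
+\end{equation*}
+A confident correct prediction ($q_1 \to 1$) drives the loss towards $0$, while a confidently
+wrong one makes it arbitrarily large.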