commit 358274979f
parent 3eb0f229fd
Author: eneller
Date:   2025-10-30 14:51:12 +01:00


@@ -2,11 +2,13 @@
\usepackage[utf8x]{inputenc}
\usepackage[margin=1in]{geometry} % Adjust margins
\usepackage{caption}
\usepackage{wrapfig}
\usepackage{subcaption}
\usepackage{parskip} % don't indent after paragraphs, figures
\usepackage{xcolor}
%\usepackage{csquotes} % Recommended for biblatex
\usepackage{tikz}
\usepackage{pgfplots}
\usepackage{float}
\usepackage{amsmath}
\PassOptionsToPackage{hyphens}{url}
@@ -55,17 +57,18 @@ $k_B$ refers to the Boltzmann constant, which he himself did not determine.
\textit{Claude Shannon} adapted the concept of entropy to information theory.
In an era of advancing communication technologies, the question he addressed was of increasing importance:
How can messages be encoded and transmitted efficiently?
He proposed three axioms (plus a fourth regularity condition) that a measure of information $I(p)$ has to comply with:
\begin{enumerate}
\item $I(1) = 0$: events that always occur do not communicate information.
\item $I'(p) \leq 0$: $I$ is monotonically decreasing in $p$; an increase in the probability of an event
decreases the information from an observed event, and vice versa.
\item $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$: the information learned from independent events
is the sum of the information learned from each event.
\item $I(p)$ is a twice continuously differentiable function of $p$.
\end{enumerate}
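The measure $I(p) = -\log_2 p$ used below indeed satisfies all four requirements:
\begin{align*}
I(1) &= -\log_2 1 = 0 \\
I'(p) &= -\frac{1}{p \ln 2} \leq 0 \quad \text{for } p \in (0, 1] \\
I(p_1 \cdot p_2) &= -\log_2 (p_1 p_2) = -\log_2 p_1 - \log_2 p_2 = I(p_1) + I(p_2)
\end{align*}
and $-\log_2 p$ is smooth, hence twice continuously differentiable, on $(0, 1]$.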
As a unit of measure, Shannon's formula uses the \textit{bit}, quantifying the efficiency of codes
and media for transmission and storage.
In information theory, entropy can be understood as the expected information of a message.
\begin{equation}
H = E(I) = - \sum_i p_i \log_2(p_i)
@@ -78,6 +81,60 @@ that tomorrows message will be the same. When some day we get the message 'Vanco
not only semantically (because it announces the eruption of a volcano) but statistically because it was very unlikely
given the transmission history.
However, uncertainty (entropy) in this situation would be relatively low.
Because we attach high surprise only to the unlikely message of an eruption, the significantly more likely message
carries less information - we already expected it before it arrived.
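For illustration, suppose the eruption message arrives with a (hypothetical) probability of $0.001$ and the routine message with $0.999$. Then
\begin{align*}
I(0.001) &= -\log_2 0.001 \approx 9.97 \text{ bits}, & I(0.999) &= -\log_2 0.999 \approx 0.0014 \text{ bits}, \\
H &\approx 0.001 \cdot 9.97 + 0.999 \cdot 0.0014 \approx 0.011 \text{ bits},
\end{align*}
so the rare message is highly informative, while the expected information (entropy) of the source stays close to zero.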
Putting the axioms and our intuitive understanding of information and uncertainty together,
\autoref{fig:graph-information} and \autoref{fig:graph-entropy} visualize how information and entropy behave as functions of the event probabilities.
\begin{figure}[H]
\begin{minipage}{.5\textwidth}
\begin{tikzpicture}
\begin{axis}[
domain=0.001:1, % start just above 0: log2(0) is undefined
samples=100,
axis lines=middle,
xlabel={$p$},
ylabel={Information},
xmin=0, xmax=1,
ymin=0, ymax=6.1,
grid=both,
width=8cm,
height=6cm,
every axis x label/.style={at={(current axis.right of origin)}, anchor=west},
every axis y label/.style={at={(current axis.above origin)}, anchor=south},
]
\addplot[thick, blue] {-log2(x)};
\end{axis}
\end{tikzpicture}
\caption{Information contained in a message depending on its probability $p$}
\label{fig:graph-information}
\end{minipage}% suppress the inter-minipage space so both plots fit side by side
\begin{minipage}{.5\textwidth}
\begin{tikzpicture}
\begin{axis}[
domain=0.001:0.999, % avoid log2(0) at both endpoints
samples=100,
axis lines=middle,
xlabel={$p$},
ylabel={Entropy},
xmin=0, xmax=1,
ymin=0, ymax=1.1,
grid=both,
width=8cm,
height=6cm,
every axis x label/.style={at={(current axis.right of origin)}, anchor=west},
every axis y label/.style={at={(current axis.above origin)}, anchor=south},
]
\addplot[thick, blue] {-x * log2(x) - (1-x) * log2(1-x)};
\end{axis}
\end{tikzpicture}
\caption{Entropy of an event source with two possible events, depending on their probabilities $(p, 1-p)$}
\label{fig:graph-entropy}
\end{minipage}
\end{figure}
The base 2 is chosen for the logarithm because our computers operate in the same base, but in principle
arbitrary bases can be used, as logarithms to different bases are proportional according to $\log_a b = \frac{\log_c b}{\log_c a}$.
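Switching to the natural logarithm, for example, merely rescales the entropy from bits to \textit{nats}:
\begin{equation*}
H_{\mathrm{nats}} = -\sum_i p_i \ln p_i = \ln 2 \cdot \left(-\sum_i p_i \log_2 p_i\right) = \ln 2 \cdot H_{\mathrm{bits}} \approx 0.693 \, H_{\mathrm{bits}}
\end{equation*}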
@@ -136,8 +193,23 @@ However, drawbacks include overfitting and poor robustness, where minimal altera
can lead to a change in tree structure.
\subsection{Cross-Entropy}
When dealing with two distributions, the \textit{cross-entropy} between a true distribution $p$
and an estimated distribution $q$ is defined as:
\begin{equation}
H(p, q) = -\sum_x p(x) \log_2 q(x)
\end{equation}
The \textit{Kullback-Leibler divergence} measures how much information is lost when $q$
is used to approximate $p$:
\begin{equation}
D_{KL}(p \| q) = H(p, q) - H(p)
\end{equation}
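Substituting the two definitions gives the more familiar form
\begin{equation*}
D_{KL}(p \| q) = -\sum_x p(x) \log_2 q(x) + \sum_x p(x) \log_2 p(x) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)},
\end{equation*}
which is non-negative and zero exactly when $p = q$ (Gibbs' inequality), so the cross-entropy $H(p, q)$ is minimized when the estimate matches the true distribution.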
In machine learning, these terms appear in many loss functions: the cross-entropy is the standard loss in classification problems,
and the Kullback-Leibler divergence appears in probabilistic models such as Variational Autoencoders (VAEs).
In classification, the true and predicted label distributions take the roles of $p$ and $q$, respectively.
For a supervised training example, the cross-entropy loss reduces to $-\log_2 q_i$, because the
true label vector is assumed to be the one-hot unit vector $e_i$.
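As a (made-up) numerical example, consider a three-class problem with true label $p = e_2 = (0, 1, 0)$ and model output $q = (0.2, 0.7, 0.1)$:
\begin{equation*}
H(p, q) = -\log_2 0.7 \approx 0.51 \text{ bits}
\end{equation*}
A more confident correct prediction such as $q = (0.05, 0.9, 0.05)$ lowers the loss to $-\log_2 0.9 \approx 0.15$ bits, and the loss approaches zero as $q_2 \to 1$.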
\subsection{Coding}
%Coding of a source of an information and communication channel
% https://www.youtube.com/watch?v=ErfnhcEV1O8