commit 358274979f
parent 3eb0f229fd
Author: eneller
Date:   2025-10-30 14:51:12 +01:00


@@ -2,11 +2,13 @@
\usepackage[utf8x]{inputenc}
\usepackage[margin=1in]{geometry} % Adjust margins
\usepackage{caption}
\usepackage{wrapfig}
\usepackage{subcaption}
\usepackage{parskip} % don't indent after paragraphs, figures
\usepackage{xcolor}
%\usepackage{csquotes} % Recommended for biblatex
\usepackage{tikz}
\usepackage{pgfplots}
\usepackage{float}
\usepackage{amsmath}
\PassOptionsToPackage{hyphens}{url}
@@ -55,17 +57,18 @@ $k_B$ refers to the Boltzmann constant, which he himself did not determine.
\textit{Claude Shannon} adapted the concept of entropy to information theory.
In an era of advancing communication technologies, the question he addressed was of increasing importance:
How can messages be encoded and transmitted efficiently?
He proposed three axioms (plus a fourth regularity condition) that a measure of information $I(p)$ has to comply with:
\begin{enumerate}
\item $I(1) = 0$: events that always occur do not communicate information.
\item $I'(p) \leq 0$: $I$ is monotonically decreasing in $p$; an increase in the probability of an event
decreases the information from an observed event, and vice versa.
\item $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$: the information learned from independent events
is the sum of the information learned from each event.
\item $I(p)$ is a twice continuously differentiable function of $p$.
\end{enumerate}
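The measure $I(p) = -\log_2 p$ used below indeed satisfies all four requirements:
\begin{align*}
I(1) &= -\log_2 1 = 0 \\
I'(p) &= -\frac{1}{p \ln 2} \leq 0 \quad \text{for } p \in (0, 1] \\
I(p_1 \cdot p_2) &= -\log_2 (p_1 p_2) = -\log_2 p_1 - \log_2 p_2 = I(p_1) + I(p_2)
\end{align*}
and $-\log_2 p$ is smooth, hence twice continuously differentiable, on $(0, 1]$.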
As a unit of measure, Shannon's formula uses the \textit{bit}, quantifying the efficiency of codes
and media for transmission and storage.
In information theory, entropy can be understood as the expected information of a message.
\begin{equation}
H = E(I) = - \sum_i p_i \log_2(p_i)
@@ -78,6 +81,60 @@ that tomorrows message will be the same. When some day we get the message 'Vanco
not only semantically (because it announces the eruption of a volcano) but statistically because it was very unlikely
given the transmission history.
However, uncertainty (entropy) in this situation would be relatively low.
Because we attach high surprise only to the unlikely message of an eruption, the significantly more likely message
carries less information - we already expected it before it arrived.
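For illustration, suppose the eruption message arrives with a (hypothetical) probability of $0.001$ and the routine message with $0.999$. Then
\begin{align*}
I(0.001) &= -\log_2 0.001 \approx 9.97 \text{ bits}, & I(0.999) &= -\log_2 0.999 \approx 0.0014 \text{ bits}, \\
H &\approx 0.001 \cdot 9.97 + 0.999 \cdot 0.0014 \approx 0.011 \text{ bits},
\end{align*}
so the rare message is highly informative, while the expected information (entropy) of the source stays close to zero.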
Putting the axioms and our intuitive understanding of information and uncertainty together,
\autoref{fig:graph-information} and \autoref{fig:graph-entropy} visualize how information and entropy behave as functions of the event probabilities.
\begin{figure}[H]
\begin{minipage}{.5\textwidth}
\begin{tikzpicture}
\begin{axis}[
domain=0.001:1, % start just above 0: log2(0) is undefined
samples=100,
axis lines=middle,
xlabel={$p$},
ylabel={Information},
xmin=0, xmax=1,
ymin=0, ymax=6.1,
grid=both,
width=8cm,
height=6cm,
every axis x label/.style={at={(current axis.right of origin)}, anchor=west},
every axis y label/.style={at={(current axis.above origin)}, anchor=south},
]
\addplot[thick, blue] {-log2(x)};
\end{axis}
\end{tikzpicture}
\caption{Information contained in a message depending on its probability $p$}
\label{fig:graph-information}
\end{minipage}% suppress the inter-minipage space so both plots fit side by side
\begin{minipage}{.5\textwidth}
\begin{tikzpicture}
\begin{axis}[
domain=0.001:0.999, % avoid log2(0) at both endpoints
samples=100,
axis lines=middle,
xlabel={$p$},
ylabel={Entropy},
xmin=0, xmax=1,
ymin=0, ymax=1.1,
grid=both,
width=8cm,
height=6cm,
every axis x label/.style={at={(current axis.right of origin)}, anchor=west},
every axis y label/.style={at={(current axis.above origin)}, anchor=south},
]
\addplot[thick, blue] {-x * log2(x) - (1-x) * log2(1-x)};
\end{axis}
\end{tikzpicture}
\caption{Entropy of an event source with two possible events, depending on their probabilities $(p, 1-p)$}
\label{fig:graph-entropy}
\end{minipage}
\end{figure}
The base 2 is chosen for the logarithm because our computers operate in the same base, but in principle
arbitrary bases can be used, as logarithms to different bases are proportional according to $\log_a b = \frac{\log_c b}{\log_c a}$.
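Switching to the natural logarithm, for example, merely rescales the entropy from bits to \textit{nats}:
\begin{equation*}
H_{\mathrm{nats}} = -\sum_i p_i \ln p_i = \ln 2 \cdot \left(-\sum_i p_i \log_2 p_i\right) = \ln 2 \cdot H_{\mathrm{bits}} \approx 0.693 \, H_{\mathrm{bits}}
\end{equation*}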
@@ -136,8 +193,23 @@ However, drawbacks include overfitting and poor robustness, where minimal altera
can lead to a change in tree structure.
\subsection{Cross-Entropy}
When dealing with two distributions, the \textit{cross-entropy} between a true distribution $p$
and an estimated distribution $q$ is defined as:
\begin{equation}
H(p, q) = -\sum_x p(x) \log_2 q(x)
\end{equation}
The \textit{Kullback-Leibler divergence} measures how much information is lost when $q$
is used to approximate $p$:
\begin{equation}
D_{KL}(p \| q) = H(p, q) - H(p)
\end{equation}
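Substituting the two definitions gives the more familiar form
\begin{equation*}
D_{KL}(p \| q) = -\sum_x p(x) \log_2 q(x) + \sum_x p(x) \log_2 p(x) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)},
\end{equation*}
which is non-negative and zero exactly when $p = q$ (Gibbs' inequality), so the cross-entropy $H(p, q)$ is minimized when the estimate matches the true distribution.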
In machine learning, these terms appear in many loss functions: the cross-entropy is the standard loss in classification problems,
and the Kullback-Leibler divergence appears in probabilistic models such as Variational Autoencoders (VAEs).
In classification, the true and predicted label distributions take the roles of $p$ and $q$, respectively.
For a supervised training example, the cross-entropy loss reduces to $-\log_2 q_i$, because the
true label vector is assumed to be the one-hot unit vector $e_i$.
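As a (made-up) numerical example, consider a three-class problem with true label $p = e_2 = (0, 1, 0)$ and model output $q = (0.2, 0.7, 0.1)$:
\begin{equation*}
H(p, q) = -\log_2 0.7 \approx 0.51 \text{ bits}
\end{equation*}
A more confident correct prediction such as $q = (0.05, 0.9, 0.05)$ lowers the loss to $-\log_2 0.9 \approx 0.15$ bits, and the loss approaches zero as $q_2 \to 1$.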
\subsection{Coding}
%Coding of a source of an information and communication channel
% https://www.youtube.com/watch?v=ErfnhcEV1O8