update

entropy.tex
@@ -9,6 +9,7 @@
%\usepackage{csquotes} % Recommended for biblatex
\usepackage{tikz}
\usepackage{pgfplots}
\usetikzlibrary{positioning}
\usepackage{float}
\usepackage{amsmath}
\PassOptionsToPackage{hyphens}{url}
@@ -51,7 +52,7 @@ S = -k_B \sum_i p_i \ln(p_i)
It gives statistical meaning to the macroscopic phenomenon of entropy in classical thermodynamics
by defining the entropy $S$ of a macrostate in terms of
the probabilities $p_i$ of all its constituent microstates.
$k_B$ refers to the Boltzmann constant, which Boltzmann himself did not determine but which is part of today's SI system.
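For the special case of $W$ equally likely microstates, each $p_i = \tfrac{1}{W}$ and the sum reduces to the well-known form
\begin{equation*}
S = -k_B \sum_{i=1}^{W} \tfrac{1}{W} \ln\tfrac{1}{W} = k_B \ln W .
\end{equation*}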

\section{Shannon's axioms}
\textit{Claude Shannon} adapted the concept of entropy to information theory.
@@ -72,6 +73,7 @@ and media for transmission and storage.
In information theory, entropy can be understood as the expected information of a message.
\begin{equation}
H = E(I) = - \sum_i p_i \log_2(p_i)
\label{eq:entropy-information}
\end{equation}
This identifies the information of a single message as $I = \log_2(1/p_i) = -\log_2(p_i)$, implying that an unexpected message (low probability) carries
more information than one with higher probability.
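For instance, a message drawn from two equally likely alternatives ($p_i = \tfrac{1}{2}$) carries $I = -\log_2 \tfrac{1}{2} = 1$ bit of information, whereas a message with probability $p_i = \tfrac{1}{8}$ carries $3$ bits.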
@@ -86,7 +88,9 @@ Because we attach high surprise only to the unlikely message of an eruption, the
carries less information - we already expected it before it arrived.

Putting the axioms and our intuitive understanding of information and uncertainty together,
we can see the logarithmic decay of information transported by a message as its probability increases in \autoref{fig:graph-information},
as well as the entropy of a source with two events given by evaluating \autoref{eq:entropy-information} for probabilities $p$ and $1-p$, resulting in
$-p \log_2(p) - (1-p) \log_2(1-p)$.

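This binary entropy reaches its maximum at $p = \tfrac{1}{2}$, where
\begin{equation*}
H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit},
\end{equation*}
the point of greatest uncertainty, and it vanishes at $p = 0$ and $p = 1$, where the outcome is certain.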
\begin{figure}[H]
\begin{minipage}{.5\textwidth}
@@ -96,7 +100,7 @@ Putting the axioms and our intuitive understanding of information and uncertaint
samples=100,
axis lines=middle,
xlabel={$p$},
ylabel={Information [bits]},
xmin=0, xmax=1,
ymin=0, ymax=6.1,
grid=both,
@@ -118,9 +122,10 @@ Putting the axioms and our intuitive understanding of information and uncertaint
samples=100,
axis lines=middle,
xlabel={$p$},
ylabel={Entropy [bits]},
xmin=0, xmax=1,
ymin=0, ymax=1.1,
xtick={0,0.25,0.5,0.75,1},
grid=both,
width=8cm,
height=6cm,
@@ -211,6 +216,35 @@ In a supervised training example, the cross entropy loss degenerates to $-\log(p
true label vector is assumed to be the unit vector $e_i$ (one-hot).

\subsection{Coding}
The concept of entropy also plays a crucial role in the design and evaluation of codes used for data compression and transmission.
In this context, \textit{coding} refers to the representation of symbols or messages
from a source using a finite set of codewords.
Each codeword is typically composed of a sequence of bits,
and the design goal is to minimize the average length of these codewords while maintaining unique decodability.

According to Shannon's source coding theorem, the theoretical lower bound for the average codeword length of a source
is given by its entropy $H$.
In other words, no lossless coding scheme can achieve an average length smaller than the source entropy when expressed in bits.
Codes that approach this bound are called \textit{efficient} or \textit{entropy-optimal}.
A familiar example of such a scheme is \textit{Huffman coding},
which assigns shorter codewords to more probable symbols and longer ones to less probable symbols,
resulting in a prefix-free code with minimal expected length.
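As a small illustration with conveniently chosen probabilities, consider a source emitting the symbols $A$, $B$, $C$, $D$ with probabilities $\tfrac{1}{2}$, $\tfrac{1}{4}$, $\tfrac{1}{8}$, $\tfrac{1}{8}$.
Its entropy is
\begin{equation*}
H = \tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + 2 \cdot \tfrac{1}{8}\log_2 8 = 1.75 \text{ bits,}
\end{equation*}
and the prefix-free code $A \mapsto 0$, $B \mapsto 10$, $C \mapsto 110$, $D \mapsto 111$ (one possible Huffman code for this source) attains this bound exactly, with an average length of $\tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{8} \cdot 3 + \tfrac{1}{8} \cdot 3 = 1.75$ bits per symbol.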

Beyond compression, coding is essential for reliable communication over imperfect channels.
In real-world systems, transmitted bits are often corrupted by noise, requiring mechanisms to detect and correct errors.
One simple but powerful concept to quantify the robustness of a code is the \textit{Hamming distance}.
The Hamming distance between two codewords is defined as the number of bit positions in which they differ.
For example, the codewords $10110$ and $11100$ have a Hamming distance of 2.

A code with a minimum Hamming distance $d_{min}$ can detect up to $d_{min}-1$ errors
and correct up to $\lfloor (d_{min}-1)/2 \rfloor$ errors.
This insight forms the basis of error-correcting codes such as Hamming codes,
which add redundant bits to data in a structured way that enables the receiver to both identify and correct single-bit errors.
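For instance, the classic Hamming$(7,4)$ code protects four data bits with three parity bits and has a minimum distance of $d_{min} = 3$, so it can detect up to two bit errors and correct any single-bit error.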

Thus, the efficiency and reliability of communication systems are governed by a trade-off:
higher redundancy (lower efficiency) provides greater error correction capability,
while minimal redundancy maximizes data throughput but reduces error resilience.

%Coding of a source of an information and communication channel
% https://www.youtube.com/watch?v=ErfnhcEV1O8
% relation to hamming distance and efficient codes
@@ -218,25 +252,11 @@ true label vector is assumed to be the unit vector $e_i$ (one-hot).
\subsection{Noisy communication channels}
The noisy channel coding theorem was stated by \textit{Claude Shannon} in 1948, but the first rigorous proof was
provided in 1954 by Amiel Feinstein.
One of the important issues Shannon tackled with his \textit{A Mathematical Theory of Communication}
was the lack of means for transporting discrete data through a noisy channel more efficiently than
the telegraph - or, in other words, how to communicate reliably over an unreliable channel.
Until then, error correction had been limited to very basic techniques.

\begin{figure}[H]
\begin{tikzpicture}
\def\boxw{2.5cm}
@@ -260,6 +280,39 @@ arbitrarily small.
\caption{Model of a noisy communication channel}
\label{fig:noisy-channel}
\end{figure}

At first, analogue connections such as the early telephone lines bypassed the issue altogether and relied
on the communicating parties' ability to filter human voices from the noise that was inevitably transmitted
along with the intended signal.
After some development, the telegraph in its final form used Morse code, a series of long and short clicks that,
together with letter and word gaps, encodes text messages.
Even though the long-short coding might appear similar to today's binary coding, the means of error correction were lacking.
For a long time, it relied on simply repeating the message multiple times, which is highly inefficient.
The destination would then have to determine the most likely intended message by performing a majority vote.
One might also propose simply increasing the transmission power, thereby decreasing the error rate of the associated channel.
However, the noisy channel coding theorem provides us with a more elegant solution.
It is of foundational importance to information theory, stating that for a noisy channel with capacity $C$
and any transmission rate $R < C$, there exist codes for which the error rate at the receiver can be made
arbitrarily small.

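To see how wasteful plain repetition is, note that sending each bit three times reduces the rate to $R = \tfrac{1}{3}$, and a bit is still decoded incorrectly whenever at least two of its three copies are flipped, which happens with probability $3p^2(1-p) + p^3$ for a bit-flip probability $p$.
Driving this error towards zero by repeating more often also drives the rate towards zero, whereas the theorem promises arbitrarily small error at a fixed rate below $C$.
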
\paragraph{Channel capacity and mutual information}
For any discrete memoryless channel, we can describe its behavior with a conditional probability distribution
$p(y|x)$, the probability that symbol $y$ is received given that symbol $x$ was sent.
The \textit{mutual information} between the transmitted and received signals measures how much information, on average, passes through the channel:
\begin{equation}
I(X;Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)p(y)} = H(Y) - H(Y|X)
\end{equation}
The \textit{channel capacity} $C$ is then defined as the maximum achievable mutual information across all possible input distributions:
\begin{equation}
C = \max_{p(x)} I(X;Y)
\end{equation}
It represents the highest rate (in bits per symbol) at which information can be transmitted with arbitrarily small error,
given optimal encoding and decoding schemes.

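As a simple sanity check, for a noiseless binary channel the output always equals the input, so $H(Y|X) = 0$; a uniform input distribution then maximizes $H(Y)$ at one bit, giving $C = 1$ bit per symbol.
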
\paragraph{Binary symmetric channel (BSC)}
The binary symmetric channel is one example of such a discrete memoryless channel, where each transmitted bit
has a probability $p$ of being flipped during transmission and a probability $(1-p)$ of being received correctly.

\begin{figure}[H]
\begin{tikzpicture}
@@ -284,10 +337,24 @@ arbitrarily small.
\draw[->] (D0) -- (D);
\draw[->] (D1) -- (D);
\end{tikzpicture}
\caption{Binary symmetric channel with crossover probability $p$}
\label{fig:binary-channel}
\end{figure}

The capacity of the binary symmetric channel is given by:
\begin{equation}
C = 1 - H_2(p)
\end{equation}
where $H_2(p) = -p \log_2(p) - (1-p)\log_2(1-p)$ is the binary entropy function.
As $p$ increases, uncertainty grows and channel capacity declines.
When $p = 0.5$, output bits are completely random and no information can be transmitted ($C = 0$).
As \autoref{fig:graph-entropy} already suggests, a channel with error rate $p > 0.5$ is equivalent to one with error rate $1-p < 0.5$, since the receiver can simply invert every received bit, though this case is not relevant in practice.

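For an illustrative crossover probability of $p = 0.1$, the binary entropy is $H_2(0.1) \approx 0.469$, so $C \approx 0.53$ bits per channel use; reliable communication is therefore possible at any rate below roughly half a bit of information per transmitted bit.
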
Shannon's theorem is not constructive, as it does not provide an explicit method for constructing such efficient codes,
but it guarantees their existence.
In practice, structured codes such as Hamming and Reed–Solomon codes are employed to approach channel capacity.

\printbibliography

\end{document}