update

entropy.tex
@@ -9,6 +9,7 @@
%\usepackage{csquotes} % Recommended for biblatex
\usepackage{tikz}
\usepackage{pgfplots}
\usetikzlibrary{positioning}
\usepackage{float}
\usepackage{amsmath}
\PassOptionsToPackage{hyphens}{url}
@@ -51,7 +52,7 @@ S = -k_B \sum_i p_i \ln(p_i)
It gives statistical meaning to the macroscopic phenomenon of entropy in classical thermodynamics
by defining the entropy $S$ of a macrostate in terms of
the probabilities $p_i$ of all its constituent microstates.
$k_B$ refers to the Boltzmann constant, which Boltzmann himself did not determine but which is part of today's SI system.
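For the special case of $W$ equally likely microstates, each $p_i = \tfrac{1}{W}$ and the sum reduces to the well-known form
\begin{equation*}
S = -k_B \sum_{i=1}^{W} \tfrac{1}{W} \ln\tfrac{1}{W} = k_B \ln W .
\end{equation*}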

\section{Shannon's axioms}
\textit{Claude Shannon} adapted the concept of entropy to information theory.
@@ -72,6 +73,7 @@ and media for transmission and storage.
In information theory, entropy can be understood as the expected information of a message.
\begin{equation}
H = E(I) = - \sum_i p_i \log_2(p_i)
\label{eq:entropy-information}
\end{equation}
This identifies the information of a single message as $I = \log_2(1/p_i) = -\log_2(p_i)$, implying that an unexpected message (low probability) carries
more information than one with higher probability.
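For instance, a message drawn from two equally likely alternatives ($p_i = \tfrac{1}{2}$) carries $I = -\log_2 \tfrac{1}{2} = 1$ bit of information, whereas a message with probability $p_i = \tfrac{1}{8}$ carries $3$ bits.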
@@ -86,7 +88,9 @@ Because we attach high surprise only to the unlikely message of an eruption, the
carries less information - we already expected it before it arrived.

Putting the axioms and our intuitive understanding of information and uncertainty together,
we can see the logarithmic decay of information transported by a message as its probability increases in \autoref{fig:graph-information},
as well as the entropy of a source with two events given by evaluating \autoref{eq:entropy-information} for probabilities $p$ and $1-p$, resulting in
$-p \log_2(p) - (1-p) \log_2(1-p)$.

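This binary entropy reaches its maximum at $p = \tfrac{1}{2}$, where
\begin{equation*}
H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit},
\end{equation*}
the point of greatest uncertainty, and it vanishes at $p = 0$ and $p = 1$, where the outcome is certain.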
\begin{figure}[H]
\begin{minipage}{.5\textwidth}
@@ -96,7 +100,7 @@ Putting the axioms and our intuitive understanding of information and uncertaint
samples=100,
axis lines=middle,
xlabel={$p$},
ylabel={Information [bits]},
xmin=0, xmax=1,
ymin=0, ymax=6.1,
grid=both,
@@ -118,9 +122,10 @@ Putting the axioms and our intuitive understanding of information and uncertaint
samples=100,
axis lines=middle,
xlabel={$p$},
ylabel={Entropy [bits]},
xmin=0, xmax=1,
ymin=0, ymax=1.1,
xtick={0,0.25,0.5,0.75,1},
grid=both,
width=8cm,
height=6cm,
@@ -211,6 +216,35 @@ In a supervised training example, the cross entropy loss degenerates to $-\log(p
true label vector is assumed to be the unit vector $e_i$ (one-hot).

\subsection{Coding}
The concept of entropy also plays a crucial role in the design and evaluation of codes used for data compression and transmission.
In this context, \textit{coding} refers to the representation of symbols or messages
from a source using a finite set of codewords.
Each codeword is typically composed of a sequence of bits,
and the design goal is to minimize the average length of these codewords while maintaining unique decodability.

According to Shannon's source coding theorem, the theoretical lower bound for the average codeword length of a source
is given by its entropy $H$.
In other words, no lossless coding scheme can achieve an average length smaller than the source entropy when expressed in bits.
Codes that approach this bound are called \textit{efficient} or \textit{entropy-optimal}.
A familiar example of such a scheme is \textit{Huffman coding},
which assigns shorter codewords to more probable symbols and longer ones to less probable symbols,
resulting in a prefix-free code with minimal expected length.
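As a small illustration with conveniently chosen probabilities, consider a source emitting the symbols $A$, $B$, $C$, $D$ with probabilities $\tfrac{1}{2}$, $\tfrac{1}{4}$, $\tfrac{1}{8}$, $\tfrac{1}{8}$.
Its entropy is
\begin{equation*}
H = \tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + 2 \cdot \tfrac{1}{8}\log_2 8 = 1.75 \text{ bits,}
\end{equation*}
and the prefix-free code $A \mapsto 0$, $B \mapsto 10$, $C \mapsto 110$, $D \mapsto 111$ (one possible Huffman code for this source) attains this bound exactly, with an average length of $\tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{8} \cdot 3 + \tfrac{1}{8} \cdot 3 = 1.75$ bits per symbol.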

Beyond compression, coding is essential for reliable communication over imperfect channels.
In real-world systems, transmitted bits are often corrupted by noise, requiring mechanisms to detect and correct errors.
One simple but powerful concept to quantify the robustness of a code is the \textit{Hamming distance}.
The Hamming distance between two codewords is defined as the number of bit positions in which they differ.
For example, the codewords $10110$ and $11100$ have a Hamming distance of 2.

A code with a minimum Hamming distance $d_{min}$ can detect up to $d_{min}-1$ errors
and correct up to $\lfloor (d_{min}-1)/2 \rfloor$ errors.
This insight forms the basis of error-correcting codes such as Hamming codes,
which add redundant bits to data in a structured way that enables the receiver to both identify and correct single-bit errors.
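For instance, the classic Hamming$(7,4)$ code protects four data bits with three parity bits and has a minimum distance of $d_{min} = 3$, so it can detect up to two bit errors and correct any single-bit error.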

Thus, the efficiency and reliability of communication systems are governed by a trade-off:
higher redundancy (lower efficiency) provides greater error correction capability,
while minimal redundancy maximizes data throughput but reduces error resilience.

%Coding of a source of an information and communication channel
% https://www.youtube.com/watch?v=ErfnhcEV1O8
% relation to hamming distance and efficient codes
@@ -218,25 +252,11 @@ true label vector is assumed to be the unit vector $e_i$ (one-hot).
\subsection{Noisy communication channels}
The noisy channel coding theorem was stated by \textit{Claude Shannon} in 1948, but the first rigorous proof was
provided in 1954 by Amiel Feinstein.
One of the important issues Shannon tackled with his \textit{A Mathematical Theory of Communication}
was the lack of means for transporting discrete data through a noisy channel more efficiently than
the telegraph - or, in other words, how to communicate reliably over an unreliable channel.
Until then, error correction had been limited to very basic techniques.

\begin{figure}[H]
\begin{tikzpicture}
\def\boxw{2.5cm}
@@ -260,6 +280,39 @@ arbitrarily small.
\caption{Model of a noisy communication channel}
\label{fig:noisy-channel}
\end{figure}

At first, analogue connections such as the early telephone lines bypassed the issue altogether and relied
on the communicating parties' ability to filter human voices from the noise that was inevitably transmitted
along with the intended signal.
After some development, the telegraph in its final form used Morse code, a series of long and short clicks that,
together with letter and word gaps, encodes text messages.
Even though the long-short coding might appear similar to today's binary coding, the means of error correction were lacking.
For a long time, it relied on simply repeating the message multiple times, which is highly inefficient.
The destination would then have to determine the most likely intended message by performing a majority vote.
One might also propose simply increasing the transmission power, thereby decreasing the error rate of the associated channel.
However, the noisy channel coding theorem provides us with a more elegant solution.
It is of foundational importance to information theory, stating that for a noisy channel with capacity $C$
and any transmission rate $R < C$, there exist codes for which the error rate at the receiver can be made
arbitrarily small.

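To see how wasteful plain repetition is, note that sending each bit three times reduces the rate to $R = \tfrac{1}{3}$, and a bit is still decoded incorrectly whenever at least two of its three copies are flipped, which happens with probability $3p^2(1-p) + p^3$ for a bit-flip probability $p$.
Driving this error towards zero by repeating more often also drives the rate towards zero, whereas the theorem promises arbitrarily small error at a fixed rate below $C$.
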
\paragraph{Channel capacity and mutual information}
For any discrete memoryless channel, we can describe its behavior with a conditional probability distribution
$p(y|x)$, the probability that symbol $y$ is received given that symbol $x$ was sent.
The \textit{mutual information} between the transmitted and received signals measures how much information, on average, passes through the channel:
\begin{equation}
I(X;Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)p(y)} = H(Y) - H(Y|X)
\end{equation}
The \textit{channel capacity} $C$ is then defined as the maximum achievable mutual information across all possible input distributions:
\begin{equation}
C = \max_{p(x)} I(X;Y)
\end{equation}
It represents the highest rate (in bits per symbol) at which information can be transmitted with arbitrarily small error,
given optimal encoding and decoding schemes.

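As a simple sanity check, for a noiseless binary channel the output always equals the input, so $H(Y|X) = 0$; a uniform input distribution then maximizes $H(Y)$ at one bit, giving $C = 1$ bit per symbol.
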
\paragraph{Binary symmetric channel (BSC)}
The binary symmetric channel is one example of such a discrete memoryless channel, where each transmitted bit
has a probability $p$ of being flipped during transmission and a probability $(1-p)$ of being received correctly.

\begin{figure}[H]
\begin{tikzpicture}
@@ -284,10 +337,24 @@ arbitrarily small.
\draw[->] (D0) -- (D);
\draw[->] (D1) -- (D);
\end{tikzpicture}
\caption{Binary symmetric channel with crossover probability $p$}
\label{fig:binary-channel}
\end{figure}

The capacity of the binary symmetric channel is given by:
\begin{equation}
C = 1 - H_2(p)
\end{equation}
where $H_2(p) = -p \log_2(p) - (1-p)\log_2(1-p)$ is the binary entropy function.
As $p$ increases, uncertainty grows and channel capacity declines.
When $p = 0.5$, output bits are completely random and no information can be transmitted ($C = 0$).
As \autoref{fig:graph-entropy} already suggests, a channel with error rate $p > 0.5$ is equivalent to one with error rate $1-p < 0.5$, since the receiver can simply invert every received bit, though this case is not relevant in practice.

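For an illustrative crossover probability of $p = 0.1$, the binary entropy is $H_2(0.1) \approx 0.469$, so $C \approx 0.53$ bits per channel use; reliable communication is therefore possible at any rate below roughly half a bit of information per transmitted bit.
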
Shannon's theorem is not constructive, as it does not provide an explicit method for constructing such efficient codes,
but it guarantees their existence.
In practice, structured codes such as Hamming and Reed–Solomon codes are employed to approach channel capacity.

\printbibliography

\end{document}