diff --git a/entropy.tex b/entropy.tex
index 4f4ca91..7ef7fce 100644
--- a/entropy.tex
+++ b/entropy.tex
@@ -9,6 +9,7 @@
 %\usepackage{csquotes} % Recommended for biblatex
 \usepackage{tikz}
 \usepackage{pgfplots}
+\usetikzlibrary{positioning}
 \usepackage{float}
 \usepackage{amsmath}
 \PassOptionsToPackage{hyphens}{url}
@@ -51,7 +52,7 @@ S = -k_B \sum_i p_i \ln(p_i)
 It gives statistical meaning to the macroscopic phenomenon of classical thermodynamics
 by defining the entropy $S$ of a macrostate in terms of the probabilities $p_i$
 of all its constituent microstates.
-$k_B$ refers to the Boltzmann constant, which he himself did not determine.
+$k_B$ refers to the Boltzmann constant, which he himself did not determine, but which is now part of today's SI system.
 
 \section{Shannon's axioms}
 \textit{Claude Shannon} adapted the concept of entropy to information theory.
@@ -72,6 +73,7 @@ and media for transmission and storage.
 In information theory, entropy can be understood as the expected information of a message.
 \begin{equation}
 H = E(I) = - \sum_i p_i \log_2(p_i)
+    \label{eq:entropy-information}
 \end{equation}
 This leaves $I = \log_2(1/p_i) = -\log_2(p_i)$, implying that an unexpected message (low probability)
 carries more information than one with higher probability.
@@ -86,7 +88,9 @@ Because we attach high surprise only to the unlikely message of an eruption, the
 carries less information - we already expected it before it arrived.
 
 Putting the axioms and our intuitive understanding of information and uncertainty together,
-\autoref{fig:graph-entropy}
+we can see in \autoref{fig:graph-information} how the information carried by a message decays logarithmically as its probability increases,
+as well as the entropy of a two-event source, obtained by evaluating \autoref{eq:entropy-information} for two events with probabilities $p$ and $1-p$,
+resulting in $-p \log_2(p) - (1-p) \log_2(1-p)$.
 
 \begin{figure}[H]
 \begin{minipage}{.5\textwidth}
@@ -96,7 +100,7 @@ Putting the axioms and our intuitive understanding of information and uncertaint
       samples=100,
       axis lines=middle,
       xlabel={$p$},
-      ylabel={Information},
+      ylabel={Information [bits]},
       xmin=0, xmax=1,
       ymin=0, ymax=6.1,
       grid=both,
       width=8cm, height=6cm,
@@ -118,9 +122,10 @@
       samples=100,
       axis lines=middle,
       xlabel={$p$},
-      ylabel={Entropy},
+      ylabel={Entropy [bits]},
       xmin=0, xmax=1,
       ymin=0, ymax=1.1,
+      xtick={0,0.25,0.5,0.75,1},
       grid=both,
       width=8cm, height=6cm,
@@ -211,6 +216,35 @@ In a supervised training example, the cross entropy loss degenerates to $-\log(p
 true label vector is assumed to be the unit vector $e_i$ (one-hot).
 
 \subsection{Coding}
+The concept of entropy also plays a crucial role in the design and evaluation of codes used for data compression and transmission.
+In this context, \textit{coding} refers to the representation of symbols or messages
+from a source using a finite set of codewords.
+Each codeword is typically composed of a sequence of bits,
+and the design goal is to minimize the average length of these codewords while maintaining unique decodability.
+
+According to Shannon's source coding theorem, the theoretical lower bound for the average codeword length of a source
+is given by its entropy $H$.
+In other words, no lossless coding scheme can achieve an average length smaller than the source entropy when expressed in bits.
+Codes that approach this bound are called \textit{efficient} or \textit{entropy-optimal}.
+A familiar example of such a scheme is \textit{Huffman coding},
+which assigns shorter codewords to more probable symbols and longer ones to less probable symbols,
+resulting in a prefix-free code with minimal expected length.
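+
+To make this bound concrete, consider a small example source (the numbers here are chosen purely for illustration)
+with four symbols emitted with probabilities $\tfrac{1}{2}$, $\tfrac{1}{4}$, $\tfrac{1}{8}$ and $\tfrac{1}{8}$.
+Its entropy is
+\begin{equation*}
+    H = \tfrac{1}{2} \log_2 2 + \tfrac{1}{4} \log_2 4 + \tfrac{1}{8} \log_2 8 + \tfrac{1}{8} \log_2 8 = 1.75 \text{ bits},
+\end{equation*}
+and a prefix-free code such as $\{0,\, 10,\, 110,\, 111\}$ (the codeword lengths Huffman's algorithm would assign here)
+has an average codeword length of $\tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{8} \cdot 3 + \tfrac{1}{8} \cdot 3 = 1.75$ bits,
+so it meets the entropy bound exactly; for probabilities that are not of the form $2^{-k}$ the bound can generally only be approached.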
+
+Beyond compression, coding is essential for reliable communication over imperfect channels.
+In real-world systems, transmitted bits are often corrupted by noise, requiring mechanisms to detect and correct errors.
+One simple but powerful concept to quantify the robustness of a code is the \textit{Hamming distance}.
+The Hamming distance between two codewords is defined as the number of bit positions in which they differ.
+For example, the codewords $10110$ and $11100$ have a Hamming distance of 2.
+
+A code with a minimum Hamming distance $d_{\min}$ can detect up to $d_{\min}-1$ errors
+and correct up to $\lfloor (d_{\min}-1)/2 \rfloor$ errors.
+This insight forms the basis of error-correcting codes such as Hamming codes,
+which add redundant bits to data in a structured way that enables the receiver to both identify and correct single-bit errors.
+
+Thus, the efficiency and reliability of communication systems are governed by a trade-off:
+higher redundancy (lower efficiency) provides greater error correction capability,
+while minimal redundancy maximizes data throughput but reduces error resilience.
+
 %Coding of a source of an information and communication channel
 % https://www.youtube.com/watch?v=ErfnhcEV1O8
 % relation to hamming distance and efficient codes
@@ -218,25 +252,11 @@ true label vector is assumed to be the unit vector $e_i$ (one-hot).
 \subsection{Noisy communication channels}
 The noisy channel coding theorem was stated by \textit{Claude Shannon} in 1948,
 but the first rigorous proof was provided in 1954 by Amiel Feinstein.
-One of the important issues Shannon wanted to tackle with his 'Mathematical theory of commmunication'
+One of the important issues Shannon tackled in his `Mathematical Theory of Communication'
 was the lack of means for transporting discrete data through a noisy channel that were more efficient than
-the telegram.
+the telegraph -- or, in other words, how to communicate reliably over an unreliable channel.
 The means of error correction until then had been very basic.
-First, analogue connections like the first telephone lines, bypassed the issue altogether and relied
-on the communicating parties and their brains' ability to filter human voices from the noise that was inevitably transmitted
-along with the intended signal.
-After some development, the telegraph in its final form used morse code, a series of long and short clicks, that,
-together with letter and word gaps, would encode text messages.
-Even though the long-short coding might appear similar to todays binary coding, the means of error correction were lacking.
-For a long time, it relied on simply repeating the message multiple times, which is highly inefficient.
-The destination would then have to determine the most likely intended message by performing a majority vote.
-One might also propose simply increasing transmitting power, thereby decreasing the error rate of the associated channel.
-However, the noisy channel coding theorem provides us with a more elegant solution.
-It is of foundational importance to information theory, stating that given a noisy channel with capacity $C$
-and information transmitted at rate $R$, there exists an $R < C$
 \draw[->] (D0) -- (D);
 \draw[->] (D1) -- (D);
 \end{tikzpicture}
-\caption{Model of a binary symmetric channel}
+\caption{Binary symmetric channel with crossover probability $p$}
 \label{fig:binary-channel}
 \end{figure}
+The capacity of the binary symmetric channel is given by:
+\begin{equation}
+    C = 1 - H_2(p)
+\end{equation}
+where $H_2(p) = -p \log_2(p) - (1-p)\log_2(1-p)$ is the binary entropy function.
+As $p$ increases, uncertainty grows and channel capacity declines.
+When $p = 0.5$, the output bits are completely random and no information can be transmitted ($C = 0$).
+As \autoref{fig:graph-entropy} already suggests, a channel with error rate $p > 0.5$ is equivalent to one with error rate $1-p < 0.5$,
+since the receiver can simply invert every received bit; such channels are therefore of little practical relevance.
+
+Shannon's theorem is not constructive: it guarantees the existence of such codes,
+but it does not provide an explicit method for constructing them.
+In practice, structured codes such as Hamming and Reed--Solomon codes are employed to move closer to channel capacity.
+
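+The short Python sketch below is purely illustrative (it is not taken from any particular coding library); it computes
+the Hamming distance between two equal-length codewords and derives the detection and correction capabilities of a code
+from its minimum distance, as introduced in the coding section above:
+\begin{verbatim}
+def hamming_distance(a: str, b: str) -> int:
+    """Number of bit positions in which two equal-length codewords differ."""
+    assert len(a) == len(b)
+    return sum(x != y for x, y in zip(a, b))
+
+def code_capabilities(codewords):
+    """Detectable and correctable errors implied by the minimum pairwise distance."""
+    d_min = min(hamming_distance(a, b)
+                for i, a in enumerate(codewords)
+                for b in codewords[i + 1:])
+    return d_min - 1, (d_min - 1) // 2
+
+print(hamming_distance("10110", "11100"))   # 2, as in the example above
+print(code_capabilities(["000", "111"]))    # (2, 1): 3-bit repetition code, d_min = 3
+\end{verbatim}
+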
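+The capacity formula above can also be checked numerically; the following sketch (just an illustration, with
+arbitrarily chosen crossover probabilities) evaluates $C = 1 - H_2(p)$ for a binary symmetric channel:
+\begin{verbatim}
+from math import log2
+
+def binary_entropy(p: float) -> float:
+    """H_2(p) in bits; p = 0 and p = 1 carry no uncertainty."""
+    if p in (0.0, 1.0):
+        return 0.0
+    return -p * log2(p) - (1 - p) * log2(1 - p)
+
+def bsc_capacity(p: float) -> float:
+    """Capacity of a binary symmetric channel with crossover probability p."""
+    return 1.0 - binary_entropy(p)
+
+for p in (0.0, 0.11, 0.5):
+    print(f"p = {p:.2f}: C = {bsc_capacity(p):.3f} bits per channel use")
+# prints C = 1.000, C = 0.500 and C = 0.000, respectively
+\end{verbatim}
+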
 \printbibliography
 \end{document}
\ No newline at end of file