This commit is contained in:
eneller
2025-11-26 23:27:47 +01:00
parent e015e816bd
commit e10e311f0f
2 changed files with 53 additions and 20 deletions


@@ -37,34 +37,67 @@ Not only does it enable the storage of large amounts of information needed for r
like DNA sequencing and analysis, it also plays a vital role in keeping stored data accessible by
facilitating cataloging, search and retrieval.
The concept of entropy introduced in the previous entry is closely related to the design of efficient codes for compression.
\begin{figure}[H]
\begin{minipage}{0.5\textwidth}
\begin{equation}
H = E(I) = - \sum_i p_i \log_2(p_i)
\label{eq:entropy-information}
\end{equation}
\end{minipage}%
\begin{minipage}{0.5\textwidth}
\begin{equation}
E(L) = \sum_i p_i l_i
\label{eq:expected-codelength}
\end{equation}
\end{minipage}
\end{figure}
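As a brief numerical illustration (the source and its probabilities are assumed here purely for demonstration), consider a source
emitting four symbols with probabilities $\frac{1}{2}$, $\frac{1}{4}$, $\frac{1}{8}$ and $\frac{1}{8}$.
Since $-\log_2(\frac{1}{2}) = 1$, $-\log_2(\frac{1}{4}) = 2$ and $-\log_2(\frac{1}{8}) = 3$, \autoref{eq:entropy-information} yields
\[
H = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = 1.75\;\mathrm{bits}.
\]
This assumed source is reused in the examples below.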
In coding theory, the events of an information source are to be encoded in a manner that minimizes the number of bits
needed to store the information provided by the source.
The process of encoding can thus be described by a function $C$ mapping a source alphabet $X$ to a set of code words $Y$.
The source symbols are denoted $x_i$ and occur with probabilities $p_i$; the code words are denoted $y_j$, and $l_i$ is the length of the code word assigned to $x_i$.
\begin{equation}
    C: X \rightarrow Y \qquad X=\{x_1,x_2,\dots,x_n\} \qquad Y=\{y_1,y_2,\dots,y_m\}
\label{eq:formal-code}
\end{equation}
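As one possible, purely illustrative code for the assumed four-symbol source from above, $C$ could map
\[
C(x_1) = 0, \qquad C(x_2) = 10, \qquad C(x_3) = 110, \qquad C(x_4) = 111,
\]
so that the code word lengths are $l_1 = 1$, $l_2 = 2$ and $l_3 = l_4 = 3$.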
The understanding of entropy as the expected information $E(I)$ of a message provides the intuition that,
for a source with a given entropy (in bits), no code can achieve a lower average word length $E(L)$ (in bits)
than this entropy without losing information.
\begin{equation}
    H = E(I) = - \sum_i p_i \log_2(p_i) \quad \leq \quad E(L) = \sum_i p_i l_i
    \label{eq:source-coding-theorem}
\end{equation}
This is the content of Shannon's source coding theorem,
introduced in \citeyear{shannon1948mathematical}.
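Continuing the illustrative example, the assumed four-symbol source and the example code above give, by \autoref{eq:expected-codelength},
\[
E(L) = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = 1.75\;\mathrm{bits} = H,
\]
so this particular code meets the lower bound of the source coding theorem exactly and no lossless code for this source can do better.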
In his paper, \citeauthor{shannon1948mathematical} proposed two principal ideas to minimize the average length of a code.
The first is to use short codes for symbols with higher probability.
This is an intuitive approach as more frequent symbols have a higher impact on average code length.
The second idea is to encode events that frequently occur together as one unit, enlarging the set of code words $Y$
and thereby allowing for greater flexibility in code design~\cite{enwiki:shannon-source-coding}.
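As a brief illustration of the second idea (the binary source and the pair code are assumed here purely for demonstration,
with consecutive symbols taken to be independent), consider a source emitting $a$ with probability $0.9$ and $b$ with probability $0.1$,
so that $H \approx 0.47$ bits per symbol. Encoding single symbols requires at least one bit per symbol, far above the entropy.
Encoding pairs of symbols instead, for example with the code $aa \mapsto 0$, $ab \mapsto 10$, $ba \mapsto 110$, $bb \mapsto 111$, gives
\[
E(L) = 0.81\cdot 1 + 0.09\cdot 2 + 0.09\cdot 3 + 0.01\cdot 3 = 1.29\;\mathrm{bits}
\]
per pair, i.e. roughly $0.65$ bits per source symbol, already much closer to the entropy.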
Codes can be classified by several properties. A code in which all code words have equal length is called a \textit{block code}.
While easy to construct, block codes are not well suited for the goal of minimizing the average word length
from \autoref{eq:expected-codelength}, because the source symbols are generally not uniformly distributed,
i.e. $p_i \neq \frac{1}{n}$.
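For the assumed four-symbol source from above, for instance, a block code has to spend $\lceil\log_2 4\rceil = 2$ bits on every symbol
regardless of its probability, so
\[
E(L) = \left(\frac{1}{2}+\frac{1}{4}+\frac{1}{8}+\frac{1}{8}\right)\cdot 2 = 2\;\mathrm{bits} > 1.75\;\mathrm{bits} = H,
\]
whereas the variable-length example code already attains the entropy.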
In order to send (or store, for that matter) multiple code words in succession, a code $Y$ has to be uniquely decodable.
When receiving the sequence 0010 encoded with the nonsingular code $Y_2$ from \autoref{tab:code-properties},
it is not clear to the recipient which source symbols make up the intended message.
For this sequence there are three possible decodings:
$s_0 s_3 s_0$, $s_0 s_0 s_1$ or $s_2 s_1$ could all be the intended message, making the code useless in practice.
\begin{table}[H]
\centering
\begin{tabular}{c l l l}
        Source Symbol & Prefix Code $Y_0$ & Suffix Code $Y_1$ & Nonsingular Code $Y_2$ \\
\hline
$s_0$ & 0 & 0 & 0 \\
$s_1$ & 10 & 01 & 10 \\
$s_2$ & 110 & 011 & 00 \\
$s_3$ & 1110 & 0111 & 01 \\
\end{tabular}
    \caption{Examples of codes with different properties}
\label{tab:code-properties}
\end{table}
Another property of a code, particularly important for transmission but less so for storage, is
being prefix-free.
A prefix code (also called a prefix-free code) can be decoded by the receiver as soon as each code word has been
received completely, because no code word $y_j$ is the prefix of another valid code word.
As shown in \autoref{tab:code-properties}, $Y_0$ is a prefix code, in this case more specifically a \textit{comma code},
because each code word ends in a 0 that separates it from the next code word.
$Y_1$, in contrast, is called a \textit{capital code} (the leading 0 marks the beginning of each word, much like a capital letter)
and is not a prefix code: every code word other than the longest one is a prefix of the longer code words
lower in the table. As a result, the receiver cannot decode each word instantaneously but has to wait for the leading 0
of the next code word.
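As a small decoding illustration (the transmitted message is chosen arbitrarily), suppose $s_2 s_1 s_0$ is sent using the codes from \autoref{tab:code-properties}:
\[
s_2 s_1 s_0 \;\stackrel{Y_0}{\longrightarrow}\; 110\,|\,10\,|\,0
\qquad\qquad
s_2 s_1 s_0 \;\stackrel{Y_1}{\longrightarrow}\; 011\,|\,01\,|\,0 .
\]
With the prefix code $Y_0$ the receiver can output $s_2$ the moment the trailing 0 of $110$ arrives, $s_1$ after the next trailing 0, and so on.
With the capital code $Y_1$, after receiving $011$ the decoder cannot yet tell whether this is the complete code word for $s_2$ or the beginning
of $0111$ for $s_3$, and must wait for the leading 0 of the following code word.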
Further, a code is said to be \textit{efficient} if it achieves the smallest possible average word length, i.e. if
$E(L)$ matches the entropy $H$ of the source.
\section{Kraft-McMillan inequality}