update
@@ -1,5 +1,5 @@
$latex = 'latex %O --shell-escape %S';
$pdflatex = 'pdflatex %O --shell-escape %S';
$pdf_mode = 1;
$clean_ext = "lol nav snm loa bbl*";
$bibtex_use = 2;

@@ -37,34 +37,67 @@ Not only does it enable the storage of large amounts of information needed for r
like DNA sequencing and analysis, it also plays a vital role in keeping stored data accessible by
facilitating cataloging, search and retrieval.
The concept of entropy introduced in the previous entry is closely related to the design of efficient codes for compression.
\begin{figure}[H]
\begin{minipage}{0.5\textwidth}
\begin{equation}
H = E(I) = - \sum_i p_i \log_2(p_i)
\label{eq:entropy-information}
\end{equation}
\end{minipage}
\begin{minipage}{0.5\textwidth}
\begin{equation}
E(L) = \sum_i p_i l_i
\label{eq:expected-codelength}
\end{equation}
\end{minipage}
\end{figure}
In coding theory, the events of an information source are to be encoded in a manner that minimizes the number of bits needed to store
the information provided by the source.
The process of encoding can thus be described by a function $C$ mapping a source alphabet $X$ to a code alphabet $Y$.
Symbols in the alphabets are denoted $x_i$ and $y_j$ respectively, and the source symbols $x_i$ occur with probabilities $p_i$.
\begin{equation}
C: X \rightarrow Y \qquad X=\{x_1,x_2,\dots,x_n\} \qquad Y=\{y_1,y_2,\dots,y_m\}
\label{eq:formal-code}
\end{equation}
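
As a toy instance (the specific mapping is assumed here purely for illustration), a source alphabet of four symbols could be encoded by
\begin{equation}
C(x_1) = 0, \quad C(x_2) = 10, \quad C(x_3) = 110, \quad C(x_4) = 111,
\end{equation}
so that the code alphabet consists of the four binary code words $Y = \{0, 10, 110, 111\}$.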

The understanding of entropy as the expected information $E(I)$ of a message provides an intuition that,
given a source with a given entropy (in bits), no coding can have a lower average word length (in bits)
than this entropy without losing information.
Here, $l_i$ denotes the length of the code word assigned to symbol $x_i$.
\begin{equation}
H = E(I) = - \sum_i p_i \log_2(p_i) \quad \leq \quad E(L) = \sum_i p_i l_i
\label{eq:source-coding-theorem}
\end{equation}
This is the content of Shannon's source coding theorem,
introduced in \citeyear{shannon1948mathematical}.
In his paper, \citeauthor{shannon1948mathematical} proposed two principal ideas to minimize the average length of a code.
The first is to use short codes for symbols with higher probability.
This is an intuitive approach, as more frequent symbols have a higher impact on the average code length.
The second idea is to jointly encode events that frequently occur together, artificially increasing
the size of the code alphabet $Y$ to allow for greater flexibility in code design~\cite{enwiki:shannon-source-coding}.
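
As a brief numerical illustration (the probabilities are assumed here for the sake of the example), consider a source with four symbols occurring with probabilities $0.5$, $0.25$, $0.125$ and $0.125$. Its entropy is
\begin{equation}
H = -\left(0.5\log_2 0.5 + 0.25\log_2 0.25 + 2 \cdot 0.125\log_2 0.125\right) = 1.75
\end{equation}
bits, while a fixed two-bit encoding of the four symbols has an average word length of $E(L) = 2$ bits, consistent with the bound in \autoref{eq:source-coding-theorem}.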

Codes can have several properties. A code where all codewords have equal lengths is called a \textit{block code}.
While easy to construct, block codes are not well suited for our goal of minimizing the average word length
as specified in \autoref{eq:source-coding-theorem}, because the source symbols are generally not equally distributed,
i.e. it is generally not the case that $p_i = \frac{1}{n}$.

In order to send (or store, for that matter) multiple code words in succession, a code $Y$ has to be uniquely decodable.
When receiving the sequence 0010 using the nonsingular code $Y_2$ from \autoref{tab:code-properties},
it is not clear to the recipient which source symbols make up the intended message.
For this sequence, there are a total of three possibilities to decode the received code:
$s_0 s_3 s_0$, $s_0 s_0 s_1$ or $s_2 s_1$ could all be the intended message, making the code useless.

\begin{table}[H]
\centering
\begin{tabular}{c l l l}
Source Alphabet $X$ & Prefix Code $Y_0$ & Suffix Code $Y_1$ & Nonsingular Code $Y_2$ \\
\hline
$s_0$ & 0 & 0 & 0 \\
$s_1$ & 10 & 01 & 10 \\
$s_2$ & 110 & 011 & 00 \\
$s_3$ & 1110 & 0111 & 01 \\
\end{tabular}
\caption{Examples of different properties of codes}
\label{tab:code-properties}
\end{table}
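For comparison (this worked reading is added for illustration), the same sequence 0010 is decoded unambiguously under the prefix code $Y_0$ from \autoref{tab:code-properties}:
the only possible parsing is $0\,|\,0\,|\,10$, i.e. $s_0 s_0 s_1$, since at every position exactly one code word of $Y_0$ matches the remaining bits.
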
Another interesting property of a code, important specifically for transmission but less so for storage, is
being prefix-free.
A prefix code (which is said to be prefix-free) can be decoded by the receiver as soon as each code word has been received,
because no code word $y_j$ is the prefix of another valid code word.
As shown in \autoref{tab:code-properties}, $Y_0$ is a prefix code, in this case more specifically called a \textit{comma code}
because each code word is separated from the next by a trailing 0.
$Y_1$, in contrast, is called a \textit{capital code} (its leading 0 marks the beginning of each word, much like a capital letter) and is not a prefix code.
In the case of the capital code, in fact, every word other than the longest possible code word is a prefix of the longer words
lower in the table. As a result, the receiver cannot instantaneously decode each word but rather has to wait for the leading 0
of the next codeword.
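To make this concrete (the particular bit stream is chosen here for illustration), consider the received stream 110100.
Under the prefix code $Y_0$ the receiver can output $s_2$ immediately after the third bit, $s_1$ after the fifth and $s_0$ after the sixth bit.
The same message encoded with the capital code $Y_1$ reads 011010; after the first three bits the receiver cannot yet tell whether it has received the complete word $s_2 = 011$ or the beginning of $s_3 = 0111$, and only the following 0 resolves this.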

Further, a code is said to be \textit{efficient} if it has the smallest possible average word length, i.e. matches
the entropy of the source alphabet.
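For example (again using the probabilities $0.5$, $0.25$, $0.125$ and $0.125$ assumed above), a prefix code with word lengths 1, 2, 3 and 3, such as $\{0, 10, 110, 111\}$, achieves
\begin{equation}
E(L) = 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = 1.75
\end{equation}
bits, which equals the entropy $H$ computed above, so this code is efficient.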

\section{Kraft-McMillan inequality}