\documentclass{article}

%%% basic layout
\usepackage[utf8]{inputenc}
\usepackage[margin=1in]{geometry} % adjust margins
\usepackage{caption}
\PassOptionsToPackage{hyphens}{url} % allow breaking urls (must precede hyperref)
\usepackage{hyperref}
\usepackage{float}
\usepackage{wrapfig}
\usepackage{subcaption}
\usepackage{parskip} % no paragraph indentation; space between paragraphs
\usepackage{xcolor}

%%% algorithms
\usepackage{algorithm}
\usepackage{algpseudocodex}

%%% graphs and plots
\usepackage{tikz}
\usepackage{pgfplots}
\usetikzlibrary{positioning}
%\usegdlibrary{trees}

%%% math
\usepackage{amsmath}

%%% citations
\usepackage[style=ieee, backend=biber, maxnames=1, minnames=1]{biblatex}
%\usepackage{csquotes} % recommended for biblatex
\addbibresource{compression.bib}

\title{Compression}
\author{Erik Neller}
\date{\today}

\begin{document}
\maketitle

\section{Introduction}
As the volume of data around the world grows exponentially, compression is only gaining in importance across disciplines. Not only does it enable the storage of the large amounts of information needed for research in scientific domains such as DNA sequencing and analysis, it also plays a vital role in keeping stored data accessible by facilitating cataloging, search and retrieval.

The concept of entropy introduced in the previous entry is closely related to the design of efficient codes for compression. In coding theory, the events of an information source are to be encoded in a manner that minimizes the number of bits needed to store the information provided by the source. The process of encoding can thus be described by a function $C$ mapping a source alphabet $X$ to a code $Y$. The source symbols are denoted $x_i$ and occur with probabilities $p_i$; the code words are denoted $y_j$.
\begin{equation}
    C: X \rightarrow Y \qquad X=\{x_1,x_2,\dots,x_n\} \qquad Y=\{y_1,y_2,\dots,y_m\}
    \label{eq:formal-code}
\end{equation}
The understanding of entropy as the expected information $E(I)$ of a message provides the intuition that, given a source with a certain entropy (in bits), no code can have a lower average word length (in bits) than this entropy without losing information. Writing $l_j$ for the length of code word $y_j$,
\begin{equation}
    H = E(I) = - \sum_i p_i \log_2(p_i) \quad \leq \quad E(L) = \sum_j p_j l_j .
    \label{eq:entropy-information}
\end{equation}
This is the content of Shannon's source coding theorem, introduced in \citeyear{shannon1948mathematical}. In his paper, \citeauthor{shannon1948mathematical} proposed two principal ideas to minimize the average length of a code. The first is to use short code words for symbols with higher probability; this is intuitive, as more frequent symbols have a greater impact on the average code length. The second is to jointly encode events that frequently occur together, artificially increasing the size of the source alphabet to allow for greater flexibility in code design.~\cite{enwiki:shannon-source-coding}
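As a brief illustration of the bound in \autoref{eq:entropy-information}, consider a hypothetical source with four symbols occurring with probabilities $\frac{1}{2}$, $\frac{1}{4}$, $\frac{1}{8}$ and $\frac{1}{8}$ (values chosen purely for the example). Its entropy is
\begin{equation*}
    H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + 2 \cdot \tfrac{1}{8}\log_2\tfrac{1}{8}\right) = 1.75 \text{ bits.}
\end{equation*}
Assigning the code words 0, 10, 110 and 111 in order of decreasing probability gives an average word length of
\begin{equation*}
    E(L) = \tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{8} \cdot 3 + \tfrac{1}{8} \cdot 3 = 1.75 \text{ bits,}
\end{equation*}
so the bound is met with equality, whereas a fixed-length code for four symbols would require 2 bits per symbol.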
Codes can have several properties. A code in which all code words have equal length is called a \textit{block code}. While easy to construct, block codes are not well suited to our goal of minimizing the average word length as specified in \autoref{eq:entropy-information}, because the source symbols are generally not uniformly distributed, i.e.\ $p_i \neq \frac{1}{n}$ in general.

In order to send (or store, for that matter) multiple code words in succession, a code $Y$ has to be \textit{uniquely decodable}. When receiving 0010 in succession using the nonsingular code $Y_2$ from \autoref{tab:code-properties}, it is not clear to the recipient which source symbols make up the intended message. For this sequence there are three possible decodings: $s_0 s_3 s_0$, $s_0 s_0 s_1$ or $s_2 s_1$ could all be the intended message, making the code useless.

\begin{table}[H]
    \centering
    \begin{tabular}{c l l l}
        Source symbol & Prefix code $Y_0$ & Suffix code $Y_1$ & Nonsingular code $Y_2$ \\
        \hline
        $s_0$ & 0    & 0    & 0  \\
        $s_1$ & 10   & 01   & 10 \\
        $s_2$ & 110  & 011  & 00 \\
        $s_3$ & 1110 & 0111 & 01 \\
    \end{tabular}
    \caption{Examples of codes with different properties}
    \label{tab:code-properties}
\end{table}

Another property of a code, important for transmission but less so for storage, is being prefix-free. A prefix code (also said to be prefix-free) can be decoded by the receiver as soon as each code word has been received, because no code word $y_j$ is a prefix of another valid code word. As shown in \autoref{tab:code-properties}, $Y_0$ is a prefix code, in this case more specifically a \textit{comma code}, because each code word is separated from the next by a trailing 0. $Y_1$, in contrast, is called a \textit{capital code} (it ``capitalizes'' the beginning of each word) and is not a prefix code: every code word except the longest one is a prefix of the longer words below it in the table. As a result, the receiver cannot decode each word instantaneously but has to wait for the leading 0 of the next code word. Further, a code is said to be \textit{efficient} if its average word length matches the entropy of the source, the smallest value permitted by \autoref{eq:entropy-information}.

\section{Kraft-McMillan inequality}
The Kraft-McMillan inequality characterizes which code word lengths a uniquely decodable code can have. For a binary code with word lengths $l_1, l_2, \dots, l_n$, unique decodability implies
\begin{equation}
    \sum_{j=1}^{n} 2^{-l_j} \leq 1,
    \label{eq:kraft-mcmillan}
\end{equation}
and conversely, for any lengths satisfying this inequality there exists a prefix code with exactly these lengths. The prefix code $Y_0$ from \autoref{tab:code-properties} satisfies the inequality with $2^{-1}+2^{-2}+2^{-3}+2^{-4}=\frac{15}{16}$, while the nonsingular code $Y_2$ violates it with $2^{-1}+3 \cdot 2^{-2}=\frac{5}{4}$, consistent with the fact that $Y_2$ is not uniquely decodable.

\section{Shannon-Fano}
Shannon-Fano coding is one of the earliest methods for constructing prefix codes. The symbols are sorted by probability and recursively split into two groups whose total probabilities are as close to equal as possible; at each split, one group has a 0 appended to its code words and the other a 1, so that more frequent symbols end up with shorter code words (see \autoref{alg:shannon-fano}). While intuitive, Shannon-Fano coding does not always achieve optimal compression, paving the way for more advanced techniques like Huffman coding.

\begin{algorithm}
    \begin{algorithmic}[1]
        \Procedure{ShannonFano}{$S$} \Comment{$S$: symbols sorted by decreasing probability}
            \If{$|S| > 1$}
                \State split $S$ into $S_1$ and $S_2$ with total probabilities as equal as possible
                \State append 0 to the code word of every symbol in $S_1$
                \State append 1 to the code word of every symbol in $S_2$
                \State \Call{ShannonFano}{$S_1$}
                \State \Call{ShannonFano}{$S_2$}
            \EndIf
        \EndProcedure
    \end{algorithmic}
    \caption{Shannon-Fano compression algorithm}
    \label{alg:shannon-fano}
\end{algorithm}

\section{Huffman Coding}
Huffman coding is an optimal prefix coding algorithm that minimizes the expected code word length for a given set of symbol probabilities. It builds a binary tree by repeatedly merging the two least probable symbols, so that the most frequent symbols are assigned the shortest code words. For a discrete memoryless source the resulting average word length comes within one bit of the entropy, and no symbol-by-symbol prefix code can do better. Its efficiency and simplicity have made it a cornerstone of lossless data compression.

\section{Arithmetic Coding}
Arithmetic coding is a modern compression technique that encodes an entire message as a single interval within the range $[0, 1)$. By iteratively refining this interval based on the probabilities of the symbols in the message, arithmetic coding can achieve compression rates that approach the entropy of the source. Its ability to spend a non-integer number of bits per symbol makes it particularly powerful for applications requiring high compression efficiency.
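To make the interval refinement concrete, consider a hypothetical alphabet $\{a, b, c\}$ with assumed probabilities $p_a = 0.5$, $p_b = 0.25$ and $p_c = 0.25$, which partition $[0,1)$ into $a \mapsto [0, 0.5)$, $b \mapsto [0.5, 0.75)$ and $c \mapsto [0.75, 1)$. Encoding the message $ba$ first narrows the interval to $[0.5, 0.75)$ for $b$ and then to its first half, $[0.5, 0.625)$, for $a$. Any number inside the final interval, e.g.\ $0.5 = 0.1_2$, identifies the message once its length is known, so the two symbols are represented by a single bit plus the length information.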
\section{LZW Algorithm}
The Lempel-Ziv-Welch (LZW) algorithm is a dictionary-based compression method that dynamically builds a dictionary of recurring patterns in the data. Unlike entropy-based methods, LZW does not require prior knowledge of symbol probabilities, making it highly adaptable and efficient for a wide range of applications, including image and text compression.~\cite{dewiki:lzw}

\printbibliography

\end{document}