diff --git a/compression.bib b/compression.bib
index 51d337a..850649c 100644
--- a/compression.bib
+++ b/compression.bib
@@ -50,6 +50,13 @@
 url = "https://en.wikipedia.org/w/index.php?title=Kraft%E2%80%93McMillan_inequality&oldid=1313803157",
 note = "[Online; accessed 26-November-2025]"
 }
+@misc{ enwiki:partition,
+ author = "{Wikipedia contributors}",
+ title = "Partition problem --- {Wikipedia}{,} The Free Encyclopedia",
+ year = "2025",
+ url = "https://en.wikipedia.org/w/index.php?title=Partition_problem&oldid=1320732818",
+ note = "[Online; accessed 30-November-2025]"
+}
 @misc{ dewiki:shannon-fano,
 author = "Wikipedia",
 title = "Shannon-Fano-Kodierung --- Wikipedia{,} die freie Enzyklopädie",
diff --git a/compression.tex b/compression.tex
index 4ba08ac..ee1b10f 100644
--- a/compression.tex
+++ b/compression.tex
@@ -120,23 +120,96 @@ It is a top-down method that divides symbols into equal groups based on their pr
 recursively partitioning them to assign shorter codewords to more frequent events.
 
 \begin{algorithm}
-\begin{algorithmic}
-    \State first line
-\end{algorithmic}
-\label{alg:shannon-fano}
 \caption{Shannon-Fano compression}
+\label{alg:shannon-fano}
+\begin{algorithmic}
+    \Procedure{ShannonFano}{symbols, probabilities} \Comment{symbols sorted by decreasing probability}
+        \If{length(symbols) $= 1$}
+            \State \Return the empty codeword for the single symbol
+        \EndIf
+        \State $\text{target} \gets \frac{1}{2} \sum_{i} \text{probabilities}[i]$
+        \State $\text{current\_sum} \gets 0$
+        \State $\text{split\_index} \gets 0$
+        \For{$i \gets 1$ \textbf{to} length(symbols) $- 1$}
+            \If{$|\text{current\_sum} + \text{probabilities}[i] - \text{target}| \leq |\text{current\_sum} - \text{target}|$}
+                \State $\text{current\_sum} \gets \text{current\_sum} + \text{probabilities}[i]$
+                \State $\text{split\_index} \gets i$
+            \Else
+                \State \textbf{break}
+            \EndIf
+        \EndFor
+        \State $\text{left\_group} \gets \text{symbols}[1 : \text{split\_index}]$
+        \State $\text{right\_group} \gets \text{symbols}[\text{split\_index} + 1 : \text{length(symbols)}]$
+        \State Prepend ``0'' to every codeword from \Call{ShannonFano}{$\text{left\_group}, \text{probabilities}[1 : \text{split\_index}]$}
+        \State Prepend ``1'' to every codeword from \Call{ShannonFano}{$\text{right\_group}, \text{probabilities}[\text{split\_index} + 1 : \text{length(symbols)}]$}
+        \State \Return all codewords
+    \EndProcedure
+\end{algorithmic}
 \end{algorithm}
 
+While Shannon-Fano coding always produces a prefix-free code whose average codeword length is close to the entropy,
+the resulting code is not guaranteed to be optimal. In practice, it often produces codewords that are only slightly longer than necessary.
+A further weakness is the non-trivial partitioning step, which is closely related to the partition problem \cite{enwiki:partition};
+in practice, however, a greedy split over the sorted probabilities works sufficiently well.
+Because of these limitations, neither of the two methods that historically share the name Shannon-Fano is widely used today;
+Huffman coding, described in the next section, is preferred.
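+
+To make the recursive split concrete, the following minimal Python sketch mirrors Algorithm~\ref{alg:shannon-fano};
+it assumes the input is already sorted by decreasing probability, and all names are illustrative rather than taken from any library.
+\begin{verbatim}
+def shannon_fano(symbols, probs):
+    """Return a dict symbol -> codeword; input sorted by decreasing probability."""
+    if len(symbols) == 1:
+        return {symbols[0]: ""}        # prefixes are attached by the callers
+    target = sum(probs) / 2            # aim for half of the group's probability
+    current, split = 0.0, 1            # at least one symbol goes to the left
+    for i, p in enumerate(probs[:-1], start=1):  # keep >= 1 symbol on the right
+        if abs(current + p - target) <= abs(current - target):
+            current, split = current + p, i
+        else:
+            break
+    codes = {}
+    for prefix, lo, hi in (("0", 0, split), ("1", split, len(symbols))):
+        for sym, code in shannon_fano(symbols[lo:hi], probs[lo:hi]).items():
+            codes[sym] = prefix + code
+    return codes
+
+# For probabilities 0.4, 0.3, 0.2, 0.1 this prints
+# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}.
+print(shannon_fano(list("abcd"), [0.4, 0.3, 0.2, 0.1]))
+\end{verbatim}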
+
 \section{Huffman Coding}
+\label{sec:huffman}
 Huffman coding is an optimal prefix coding algorithm that minimizes the expected codeword length
-for a given set of symbol probabilities.
-By constructing a binary tree where the most frequent symbols are assigned the shortest codewords,
-Huffman coding achieves the theoretical limit of entropy for discrete memoryless sources.
-Its efficiency and simplicity have made it a cornerstone of lossless data compression.
+for a given set of symbol probabilities. Developed by David Huffman in 1952, it guarantees optimality
+by constructing a binary tree in which the most frequent symbols receive the shortest codewords.
+For a discrete memoryless source, its average codeword length comes within one bit of the entropy,
+making it one of the most important compression techniques in information theory.
+
+Unlike Shannon-Fano, which uses a top-down approach, Huffman coding employs a bottom-up strategy.
+The algorithm builds the code tree by iteratively combining the two symbols with the lowest probabilities
+into a new internal node. This greedy approach ensures that the resulting tree minimizes the weighted path length,
+where the weight of each symbol is its probability.
+
+\begin{algorithm}
+\caption{Huffman coding algorithm}
+\label{alg:huffman}
+\begin{algorithmic}
+    \Procedure{Huffman}{symbols, probabilities}
+        \State Create a leaf node for each symbol, weighted by its probability, and add it to a priority queue
+        \While{the priority queue contains more than one node}
+            \State Extract the two nodes $\text{left}$ and $\text{right}$ with the smallest probabilities
+            \State Create a new internal node with probability $p(\text{left}) + p(\text{right})$
+            \State Set $\text{left}$ as the left child and $\text{right}$ as the right child
+            \State Add the new internal node to the priority queue
+        \EndWhile
+        \State $\text{root} \gets$ the remaining node in the priority queue
+        \State Traverse the tree and assign codewords: ``0'' for left edges, ``1'' for right edges
+        \State \Return codewords
+    \EndProcedure
+\end{algorithmic}
+\end{algorithm}
+
+The optimality of Huffman coding can be proven by an exchange argument.
+The key insight is that there is always an optimal prefix code in which the two least frequent symbols are siblings at maximum depth.
+These two symbols can therefore be merged into a single meta-symbol without affecting optimality,
+which yields a recursive structure guaranteeing that Huffman's method produces an optimal code.
+
+The average codeword length $L_{\text{Huffman}}$ produced by Huffman coding satisfies the bounds
+\begin{equation}
+    H(X) \leq L_{\text{Huffman}} < H(X) + 1,
+    \label{eq:huffman-bounds}
+\end{equation}
+where $H(X)$ is the entropy of the source. Huffman coding is therefore guaranteed to be within one bit
+of the theoretical optimum. In the special case where all symbol probabilities are powers of $\frac{1}{2}$,
+the lower bound is met with equality and $L_{\text{Huffman}} = H(X)$.
+
+The computational complexity of Huffman coding is $O(n \log n)$, where $n$ is the number of distinct symbols.
+A priority queue implemented as a binary heap achieves this bound, making Huffman coding
+efficient even for large alphabets. Its widespread use in compression formats such as DEFLATE, JPEG, and MP3
+testifies to its practical importance.
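+
+As an illustration, the following minimal Python sketch builds the code tree of Algorithm~\ref{alg:huffman}
+with the binary heap from the standard library; the function name and data layout are illustrative only.
+\begin{verbatim}
+import heapq
+from itertools import count
+
+def huffman(probabilities):
+    """Return a dict symbol -> codeword for a dict of symbol probabilities."""
+    # A heap entry is (probability, tiebreaker, tree); a tree is either a
+    # symbol (leaf) or a (left, right) pair (internal node).
+    tiebreak = count()   # avoids comparing trees when probabilities are equal
+    heap = [(p, next(tiebreak), sym) for sym, p in probabilities.items()]
+    heapq.heapify(heap)
+    while len(heap) > 1:
+        p1, _, left = heapq.heappop(heap)    # the two least probable subtrees
+        p2, _, right = heapq.heappop(heap)
+        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
+    codes = {}
+    def walk(tree, prefix):          # "0" for left edges, "1" for right edges
+        if isinstance(tree, tuple):
+            walk(tree[0], prefix + "0")
+            walk(tree[1], prefix + "1")
+        else:
+            codes[tree] = prefix or "0"      # degenerate one-symbol alphabet
+    walk(heap[0][2], "")
+    return codes
+
+# Dyadic probabilities meet the entropy bound with equality: the codeword
+# lengths below are 1, 2, 3 and 3 bits.
+print(huffman({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
+\end{verbatim}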
+
+However, Huffman coding has limitations. First, it requires knowledge of the probability distribution
+of the symbols before encoding, necessitating a preprocessing pass or the transmission of a frequency table.
+Second, it assigns an integer number of bits to each symbol, which can be suboptimal
+when symbol probabilities do not align well with powers of two.
+Finally, coding one symbol at a time is a restriction that is rarely necessary, since codewords are
+usually emitted in long sequences; arithmetic coding, described next, removes this restriction.
 
 \section{Arithmetic Coding}
 Arithmetic coding is a modern compression technique that encodes an entire message as a single interval
-within the range $[0, 1)$.
+within the range $[0, 1)$, in contrast to the symbol-by-symbol codewords produced by Huffman coding.
 By iteratively refining this interval based on the probabilities of the symbols in the message,
 arithmetic coding can achieve compression rates that approach the entropy of the source.
 Its ability to handle non-integer bit lengths makes it particularly powerful
@@ -144,11 +217,100 @@ for applications requiring high compression efficiency.
 
 \section{LZW Algorithm}
 The Lempel-Ziv-Welch (LZW) algorithm is a dictionary-based compression method that dynamically builds a dictionary
-of recurring patterns in the data.
-Unlike entropy-based methods, LZW does not require prior knowledge of symbol probabilities,
-making it highly adaptable and efficient for a wide range of applications, including image and text compression.
-Because the dictionary does not have to be transmitted explicitly, LZW is also useful for streaming data.
-\cite{dewiki:lzw}
+of recurring patterns in the data as compression proceeds. Unlike entropy-based methods such as Huffman or arithmetic coding,
+LZW does not require prior knowledge of symbol probabilities, making it highly adaptable and efficient
+for a wide range of applications, including image and text compression.
+It builds on the LZ78 scheme published by Abraham Lempel and Jacob Ziv in 1978 and was refined by Terry Welch in 1984 \cite{dewiki:lzw}.
+
+The fundamental insight of LZW is that many data sources contain repeating patterns that can be exploited
+by replacing longer sequences with shorter codes. Rather than assigning variable-length codes to individual symbols
+based on their frequency, LZW identifies recurring substrings and assigns them fixed-length codes.
+As the algorithm processes the data, it dynamically constructs a dictionary that maps these patterns to codes,
+without requiring the dictionary to be transmitted with the compressed data.
+
+\begin{algorithm}
+\caption{LZW compression algorithm}
+\label{alg:lzw}
+\begin{algorithmic}
+    \Procedure{LZWCompress}{data}
+        \State Initialize the dictionary with all single characters
+        \State $\text{code} \gets$ next available code (typically 256 for a byte alphabet)
+        \State $w \gets$ first symbol of data
+        \State $\text{output} \gets [\,]$
+        \For{each symbol $c$ in the remaining data}
+            \If{$w + c$ exists in the dictionary}
+                \State $w \gets w + c$
+            \Else
+                \State append $\text{dictionary}[w]$ to output
+                \If{$\text{code} < \text{max\_code}$} \Comment{dictionary not yet full}
+                    \State add $w + c$ to the dictionary with code $\text{code}$
+                    \State $\text{code} \gets \text{code} + 1$
+                \EndIf
+                \State $w \gets c$
+            \EndIf
+        \EndFor
+        \State append $\text{dictionary}[w]$ to output
+        \State \Return output
+    \EndProcedure
+\end{algorithmic}
+\end{algorithm}
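+
+The following minimal Python sketch implements the compression loop of Algorithm~\ref{alg:lzw} for byte-oriented
+input; the dictionary-size cap and all names are illustrative.
+\begin{verbatim}
+def lzw_compress(data: bytes, max_code: int = 4096) -> list:
+    """Return the list of LZW codes for data; codes 0-255 are the single bytes."""
+    dictionary = {bytes([b]): b for b in range(256)}  # initial single-byte entries
+    code = 256                                        # next code to assign
+    w = data[:1]                                      # longest match so far
+    output = []
+    for b in data[1:]:
+        wc = w + bytes([b])
+        if wc in dictionary:
+            w = wc                           # keep extending the current match
+        else:
+            output.append(dictionary[w])     # emit the code of the longest match
+            if code < max_code:              # illustrative 12-bit dictionary limit
+                dictionary[wc] = code
+                code += 1
+            w = bytes([b])
+    if w:
+        output.append(dictionary[w])
+    return output
+
+# Repetition compresses well: the 9 input bytes below become 5 codes.
+print(lzw_compress(b"ABABABABA"))   # [65, 66, 256, 258, 257]
+\end{verbatim}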
+
+The decompression process is equally elegant. The decompressor initializes an identical dictionary
+and reconstructs the original data by decoding the transmitted codes. Crucially, the decompressor
+rebuilds the dictionary entries on the fly as it processes the compressed data,
+recovering the exact sequence of dictionary updates that occurred during compression.
+This is what allows the dictionary to remain implicit rather than being transmitted explicitly.
+
+\begin{algorithm}
+\caption{LZW decompression algorithm}
+\label{alg:lzw-decompress}
+\begin{algorithmic}
+    \Procedure{LZWDecompress}{codes}
+        \State Initialize the dictionary with all single characters
+        \State $\text{code} \gets$ next available code
+        \State $w \gets \text{dictionary}[\text{codes}[0]]$
+        \State $\text{output} \gets w$
+        \For{each code $c$ in $\text{codes}[1:]$}
+            \If{$c$ exists in the dictionary}
+                \State $k \gets \text{dictionary}[c]$
+            \Else
+                \State $k \gets w + w[0]$ \Comment{special case: $c$ is the entry about to be created}
+            \EndIf
+            \State append $k$ to output
+            \If{$\text{code} < \text{max\_code}$} \Comment{mirror the compressor's dictionary limit}
+                \State add $w + k[0]$ to the dictionary with code $\text{code}$
+                \State $\text{code} \gets \text{code} + 1$
+            \EndIf
+            \State $w \gets k$
+        \EndFor
+        \State \Return output
+    \EndProcedure
+\end{algorithmic}
+\end{algorithm}
+
+LZW's advantages make it particularly valuable for certain applications. First, it requires no statistical modeling
+of the input data, making it applicable to diverse data types without prior analysis.
+Second, the dictionary is built incrementally and implicitly, eliminating transmission overhead.
+Third, it can achieve significant compression on data with repeating patterns, such as text, images, and structured data.
+Fourth, the algorithm is relatively simple to implement and computationally efficient, running in $O(n)$ time
+(assuming constant-time dictionary lookups), where $n$ is the length of the input.
+
+However, LZW has notable limitations. Its compression effectiveness depends strongly on the structure and repetitiveness
+of the input data; on truly random data with no repeating patterns, LZW can even increase the file size.
+Additionally, the fixed code width (typically 12 or 16 bits, allowing $2^{12} = 4096$ or $2^{16} = 65536$ entries)
+limits its ability to adapt to arbitrarily large vocabularies of patterns.
+When the dictionary becomes full, most implementations simply stop adding new entries, which can reduce compression efficiency.
+
+LZW has seen widespread practical deployment in compression standards and applications.
+The GIF image format uses LZW compression, as does the TIFF image format in some variants,
+and the V.42bis modem compression standard incorporates LZW-like techniques.
+Related dictionary coders such as LZSS, LZMA, and Deflate (used in ZIP and gzip)
+are based on the sliding-window LZ77 approach rather than on LZW, adding refinements such as
+length-distance encoding and, in Deflate's case, Huffman coding of the output to achieve better compression ratios.
+
+The relationship between dictionary-based methods like LZW and entropy-based methods like Huffman coding
+is complementary rather than competitive. LZW excels at capturing structure and repetition,
+while entropy-based methods optimize the encoding of individual symbols based on their probabilities.
+This has led to hybrid approaches that combine both techniques, such as the Deflate algorithm,
+which uses LZSS (a variant of LZ77) followed by Huffman coding of the output.
 
 \printbibliography