update
@@ -50,6 +50,13 @@
  url = "https://en.wikipedia.org/w/index.php?title=Kraft%E2%80%93McMillan_inequality&oldid=1313803157",
  note = "[Online; accessed 26-November-2025]"
}
@misc{ enwiki:partition,
  author = "{Wikipedia contributors}",
  title = "Partition problem --- {Wikipedia}{,} The Free Encyclopedia",
  year = "2025",
  url = "https://en.wikipedia.org/w/index.php?title=Partition_problem&oldid=1320732818",
  note = "[Online; accessed 30-November-2025]"
}
@misc{ dewiki:shannon-fano,
  author = "Wikipedia",
  title = "Shannon-Fano-Kodierung --- Wikipedia{,} die freie Enzyklopädie",
compression.tex
@@ -120,23 +120,96 @@ It is a top-down method that divides the symbols into two groups of approximately equal total probability,
recursively partitioning them to assign shorter codewords to more frequent symbols.

\begin{algorithm}
\caption{Shannon-Fano compression}
\label{alg:shannon-fano}
\begin{algorithmic}
\Procedure{ShannonFano}{symbols, probabilities}
  \If{length(symbols) $= 1$}
    \State \Return codeword for single symbol
  \EndIf
  \State Sort symbols by non-increasing probability
  \State $\text{half} \gets \frac{1}{2} \sum_i \text{probabilities}[i]$
  \State $\text{current\_sum} \gets 0$
  \State $\text{split\_index} \gets 0$
  \For{$i \gets 1$ \textbf{to} length(symbols)}
    \If{$|\text{current\_sum} + \text{probabilities}[i] - \text{half}| < |\text{current\_sum} - \text{half}|$}
      \State $\text{current\_sum} \gets \text{current\_sum} + \text{probabilities}[i]$
      \State $\text{split\_index} \gets i$
    \Else
      \State \textbf{break}
    \EndIf
  \EndFor
  \State $\text{left\_group} \gets \text{symbols}[1 : \text{split\_index}]$
  \State $\text{right\_group} \gets \text{symbols}[\text{split\_index} + 1 : \text{length(symbols)}]$
  \State Assign prefix ``0'' to codes from ShannonFano($\text{left\_group}, \ldots$)
  \State Assign prefix ``1'' to codes from ShannonFano($\text{right\_group}, \ldots$)
\EndProcedure
\end{algorithmic}
\end{algorithm}

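To complement Algorithm~\ref{alg:shannon-fano}, the following is a minimal Python sketch of the same recursive partitioning; the function name \texttt{shannon\_fano}, the nested helper \texttt{build}, and the example probabilities are illustrative choices rather than part of the algorithm's specification.

\begin{verbatim}
def shannon_fano(symbols, probs):
    """Return a dict mapping each symbol to a prefix-free codeword (bit string)."""
    # Work on (symbol, probability) pairs sorted by decreasing probability.
    pairs = sorted(zip(symbols, probs), key=lambda sp: sp[1], reverse=True)

    def build(group, prefix, codes):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"  # a lone symbol still needs one bit
            return
        half = sum(p for _, p in group) / 2.0
        current, split = 0.0, 0
        for i, (_, p) in enumerate(group, start=1):
            # Greedily grow the left group while it moves the sum closer to half.
            if abs(current + p - half) < abs(current - half):
                current, split = current + p, i
            else:
                break
        split = max(1, min(split, len(group) - 1))  # keep both halves non-empty
        build(group[:split], prefix + "0", codes)
        build(group[split:], prefix + "1", codes)

    codes = {}
    build(pairs, "", codes)
    return codes

# Illustrative alphabet and probabilities (not taken from the text):
print(shannon_fano("ABCD", [0.4, 0.3, 0.2, 0.1]))
# e.g. {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
\end{verbatim}
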
While Shannon-Fano coding is guaranteed to produce a prefix-free code with an average codeword length close to the entropy,
it is not guaranteed to be optimal. In practice, it often generates codewords that are only slightly longer than necessary.
A further weakness is the non-trivial partitioning step \cite{enwiki:partition}, although in practice it can be solved reasonably efficiently with the greedy heuristic above.
Due to these limitations, neither of the two historically somewhat ambiguously named Shannon-Fano variants is widely used today;
Huffman coding, described in the next section, is preferred instead.

\section{Huffman Coding}
\label{sec:huffman}
Huffman coding is an optimal prefix coding algorithm that minimizes the expected codeword length
for a given set of symbol probabilities. Developed by David Huffman in 1952, it guarantees optimality
by constructing a binary tree in which the most frequent symbols are assigned the shortest codewords.
Huffman coding approaches the entropy limit for discrete memoryless sources,
making it one of the most important compression techniques in information theory.

Unlike Shannon-Fano, which uses a top-down approach, Huffman coding employs a bottom-up strategy.
The algorithm builds the code tree by iteratively combining the two symbols with the lowest probabilities
into a new internal node. This greedy approach ensures that the resulting tree minimizes the weighted path length,
where the weight of each symbol is its probability.

\begin{algorithm}
\caption{Huffman coding algorithm}
\label{alg:huffman}
\begin{algorithmic}
\Procedure{Huffman}{symbols, probabilities}
  \State Create a leaf node for each symbol and add it to a priority queue
  \While{priority queue contains more than one node}
    \State Extract two nodes with minimum frequency: $\text{left}$ and $\text{right}$
    \State Create a new internal node with frequency $\text{freq(left)} + \text{freq(right)}$
    \State Set $\text{left}$ as the left child and $\text{right}$ as the right child
    \State Add the new internal node to the priority queue
  \EndWhile
  \State $\text{root} \gets$ remaining node in priority queue
  \State Traverse tree and assign codewords: ``0'' for left edges, ``1'' for right edges
  \State \Return codewords
\EndProcedure
\end{algorithmic}
\end{algorithm}

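As a concrete counterpart to Algorithm~\ref{alg:huffman}, here is a minimal Python sketch that uses the standard library's binary heap as the priority queue; the function name \texttt{huffman\_codes}, the tie-breaking counter, and the example probabilities are implementation choices, not prescribed by the pseudocode.

\begin{verbatim}
import heapq

def huffman_codes(symbols, probs):
    """Return a dict mapping each symbol to its Huffman codeword."""
    # Each heap entry: (probability, tie-breaker, node).
    # A node is either a leaf symbol (str) or a (left, right) pair.
    heap = [(p, i, s) for i, (s, p) in enumerate(zip(symbols, probs))]
    heapq.heapify(heap)
    counter = len(heap)  # tie-breaker so tuples never compare nodes directly

    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # two least probable nodes
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (left, right)))
        counter += 1

    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):           # leaf: record accumulated codeword
            codes[node] = prefix or "0"
        else:
            walk(node[0], prefix + "0")     # "0" for left edges
            walk(node[1], prefix + "1")     # "1" for right edges

    walk(heap[0][2], "")
    return codes

# Illustrative probabilities (not taken from the text):
print(huffman_codes("ABCD", [0.4, 0.3, 0.2, 0.1]))
\end{verbatim}
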
The optimality of Huffman coding can be proven by an exchange argument.
The key insight is that there is always an optimal prefix code in which the two least frequent symbols have codewords of maximal length that differ only in their last bit.
These two symbols can therefore be merged into a single meta-symbol without affecting optimality,
which yields a recursive structure guaranteeing that Huffman's method produces an optimal code.

The average codeword length $L_{\text{Huffman}}$ produced by Huffman coding satisfies the following bounds:
\begin{equation}
H(X) \leq L_{\text{Huffman}} < H(X) + 1
\label{eq:huffman-bounds}
\end{equation}
where $H(X)$ is the entropy of the source. This means Huffman coding is guaranteed to be within one bit
of the theoretical optimum. In the special case where all symbol probabilities are powers of $\frac{1}{2}$,
Huffman coding achieves $L_{\text{Huffman}} = H(X)$, i.e., perfect compression in the entropy sense.

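As a small worked example of the bound in Equation~(\ref{eq:huffman-bounds}), with a distribution chosen here purely for illustration: a source with probabilities $(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{8})$ receives Huffman codeword lengths $(1, 2, 3, 3)$, so that
\[
L_{\text{Huffman}} = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = 1.75 = H(X),
\]
and the lower bound is attained exactly.
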
The computational complexity of Huffman coding is $O(n \log n)$, where $n$ is the number of distinct symbols.
A priority queue implementation using a binary heap achieves this bound, making Huffman coding
efficient even for large alphabets. Its widespread use in compression formats such as DEFLATE, JPEG, and MP3
testifies to its practical importance.

However, Huffman coding has limitations. First, it requires knowledge of the probability distribution
of the symbols before encoding, necessitating a preprocessing pass or transmission of frequency tables.
Second, it assigns an integer number of bits to each symbol, which can be suboptimal
when symbol probabilities do not align well with powers of two.
Symbol-by-symbol coding thus imposes a constraint that is often unnecessary, since codewords are typically packed into long sequences;
this leaves room for further optimization, as provided by arithmetic coding.

\section{Arithmetic Coding}
Arithmetic coding is a modern compression technique that encodes an entire message as a single interval
within the range $[0, 1)$, as opposed to the symbol-by-symbol coding used by Huffman.
By iteratively refining this interval based on the probabilities of the symbols in the message,
arithmetic coding can achieve compression rates that approach the entropy of the source.
Its ability to handle non-integer bit lengths makes it particularly powerful
@@ -144,11 +217,100 @@ for applications requiring high compression efficiency.

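To make the interval refinement concrete, consider a two-symbol alphabet with probabilities chosen here purely for illustration, $P(a) = 0.6$ and $P(b) = 0.4$, where $a$ is assigned the lower part and $b$ the upper part of the current interval. Encoding the message $ab$ starts from $[0, 1)$, narrows to $[0, 0.6)$ after reading $a$, and then to
\[
[\,0 + 0.6 \cdot 0.6,\; 0 + 0.6 \cdot 1\,) = [\,0.36,\; 0.6\,)
\]
after reading $b$. Any number in $[0.36, 0.6)$ identifies the message; the interval width $0.24 = P(a)\,P(b)$ equals the probability of the message, which is why the code length approaches $-\log_2 P(\text{message})$ bits.
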
\section{LZW Algorithm}
The Lempel-Ziv-Welch (LZW) algorithm is a dictionary-based compression method that dynamically builds a dictionary
of recurring patterns in the data as compression proceeds. Unlike entropy-based methods such as Huffman or arithmetic coding,
LZW does not require prior knowledge of symbol probabilities, making it highly adaptable and efficient
for a wide range of applications, including image and text compression.
Because the dictionary does not have to be transmitted explicitly, LZW is also useful for streaming data~\cite{dewiki:lzw}.
The algorithm was developed by Abraham Lempel and Jacob Ziv, with refinements by Terry Welch in 1984.

The fundamental insight of LZW is that many data sources contain repeating patterns that can be exploited
by replacing longer sequences with shorter codes. Rather than assigning variable-length codes to individual symbols
based on their frequency, LZW identifies recurring substrings and assigns them fixed-length codes.
As the algorithm processes the data, it dynamically constructs a dictionary that maps these patterns to codes,
without requiring the dictionary to be transmitted with the compressed data.

\begin{algorithm}
\caption{LZW compression algorithm}
\label{alg:lzw}
\begin{algorithmic}
\Procedure{LZWCompress}{data}
  \State Initialize dictionary with all single characters
  \State $\text{code} \gets$ next available code (typically 256 for byte alphabet)
  \State $w \gets$ first symbol from data
  \State $\text{output} \gets [\,]$
  \For{each symbol $c$ in remaining data}
    \If{$w + c$ exists in dictionary}
      \State $w \gets w + c$
    \Else
      \State append $\text{code}(w)$ to output
      \If{code $<$ max\_code}
        \State Add $w + c$ to dictionary with code $\text{code}$
        \State $\text{code} \gets \text{code} + 1$
      \EndIf
      \State $w \gets c$
    \EndIf
  \EndFor
  \State append $\text{code}(w)$ to output
  \State \Return output
\EndProcedure
\end{algorithmic}
\end{algorithm}

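The pseudocode translates almost directly into Python. The following sketch of Algorithm~\ref{alg:lzw} works on strings of 8-bit characters; the function name \texttt{lzw\_compress}, the 4096-entry limit, and the sample input are illustrative choices.

\begin{verbatim}
def lzw_compress(data, max_code=4096):
    """Compress a string into a list of integer codes (minimal LZW sketch)."""
    # Start with one dictionary entry per possible single character.
    dictionary = {chr(i): i for i in range(256)}
    code = 256                     # next code to hand out
    w = data[0]
    output = []

    for c in data[1:]:
        if w + c in dictionary:
            w = w + c              # grow the current match
        else:
            output.append(dictionary[w])
            if code < max_code:    # stop adding once the dictionary is full
                dictionary[w + c] = code
                code += 1
            w = c

    output.append(dictionary[w])   # flush the final match
    return output

# Illustrative input with obvious repetition:
print(lzw_compress("TOBEORNOTTOBEORTOBEORNOT"))
\end{verbatim}
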
The decompression process is equally elegant. The decompressor initializes an identical dictionary
and reconstructs the original data by decoding the transmitted codes. Crucially, the decompressor
can reconstruct the dictionary entries on-the-fly as it processes the compressed data,
recovering the exact sequence of dictionary updates that occurred during compression.
This property is what allows the dictionary to remain implicit rather than explicitly transmitted.

\begin{algorithm}
\caption{LZW decompression algorithm}
\label{alg:lzw-decompress}
\begin{algorithmic}
\Procedure{LZWDecompress}{codes}
  \State Initialize dictionary with all single characters
  \State $\text{code} \gets$ next available code
  \State $w \gets \text{decode}(\text{codes}[0])$
  \State $\text{output} \gets w$
  \For{each code $c$ in $\text{codes}[1:]$}
    \If{$c$ exists in dictionary}
      \State $k \gets \text{decode}(c)$
    \Else
      \State $k \gets w + w[0]$ \Comment{special case: code not yet in dictionary}
    \EndIf
    \State append $k$ to output
    \State Add $w + k[0]$ to dictionary with code $\text{code}$
    \State $\text{code} \gets \text{code} + 1$
    \State $w \gets k$
  \EndFor
  \State \Return output
\EndProcedure
\end{algorithmic}
\end{algorithm}

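A matching Python sketch of Algorithm~\ref{alg:lzw-decompress}, including the special case of a code that refers to the dictionary entry currently being built; as before, the name \texttt{lzw\_decompress}, the 4096-entry limit, and the example codes are illustrative.

\begin{verbatim}
def lzw_decompress(codes, max_code=4096):
    """Turn a list of LZW integer codes back into a string."""
    dictionary = {i: chr(i) for i in range(256)}
    code = 256
    w = dictionary[codes[0]]
    output = [w]

    for c in codes[1:]:
        if c in dictionary:
            k = dictionary[c]
        else:                      # special case: code defined in this very step
            k = w + w[0]
        output.append(k)
        if code < max_code:
            dictionary[code] = w + k[0]   # mirror the compressor's update
            code += 1
        w = k

    return "".join(output)

# Codes that lzw_compress would emit for "ABABABA"; 258 triggers the special case.
print(lzw_decompress([65, 66, 256, 258]))
\end{verbatim}

Fed the output of the \texttt{lzw\_compress} sketch above, this routine reproduces the original string unchanged.
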
LZW's advantages make it particularly valuable for certain applications. First, it requires no statistical modeling
of the input data, making it applicable to diverse data types without prior analysis.
Second, the dictionary is built incrementally and implicitly, eliminating transmission overhead.
Third, it can achieve significant compression on data with repeating patterns, such as text, images, and structured data.
Fourth, the algorithm is relatively simple to implement and computationally efficient, with time complexity $O(n)$,
where $n$ is the length of the input.

However, LZW has notable limitations. Its compression effectiveness is highly dependent on the structure and repetitiveness
of the input data. On truly random data with no repeating patterns, LZW can even increase the file size.
Additionally, the fixed size of the dictionary (typically 12 or 16 bits, allowing $2^{12} = 4096$ or $2^{16} = 65536$ entries)
limits its ability to adapt to arbitrarily large vocabularies of patterns.
When the dictionary becomes full, most implementations stop adding new entries, potentially reducing compression efficiency.

LZW has seen widespread practical deployment in compression standards and applications.
The GIF image format uses LZW compression, as does the TIFF image format in some variants.
The V.42bis modem compression standard incorporates LZW-like techniques.
Related dictionary coders such as LZSS, LZMA, and Deflate (used in ZIP and gzip)
build on the same Lempel-Ziv ideas with additional refinements like literal-length-distance encoding
and Huffman coding post-processing to achieve better compression ratios.

The relationship between dictionary-based methods like LZW and entropy-based methods like Huffman
is complementary rather than competitive. LZW excels at capturing structure and repetition,
while entropy-based methods optimize symbol encoding based on probability distributions.
This has led to hybrid approaches that combine both techniques, such as the Deflate algorithm,
which uses LZSS (a variant of LZ77) followed by Huffman coding of the output.

\printbibliography