\documentclass{article}

%%% basic layouting
\usepackage[utf8]{inputenc} % utf8x is deprecated; plain utf8 suffices
\usepackage[margin=1in]{geometry} % Adjust margins
\usepackage{caption}
\PassOptionsToPackage{hyphens}{url} % allow breaking urls; must precede hyperref
\usepackage{hyperref}
\usepackage{float}
\usepackage{wrapfig}
\usepackage{subcaption}
\usepackage{parskip} % no paragraph indentation; inter-paragraph space instead
\usepackage{xcolor}

%%% algorithms
\usepackage{algorithm}
\usepackage{algpseudocodex}

%%% graphs and plots
\usepackage{tikz}
\usepackage{pgfplots}
\usetikzlibrary{positioning}
%\usegdlibrary{trees}

%%% math
\usepackage{amsmath}

%%% citations
\usepackage[style=ieee, backend=biber, maxnames=1, minnames=1]{biblatex}
%\usepackage{csquotes} % Recommended for biblatex
\addbibresource{compression.bib}

\title{Compression}
\author{Erik Neller}
\date{\today}

\begin{document}

\maketitle

\section{Introduction}
As the volume of data worldwide grows exponentially, compression is becoming ever more important across disciplines.
Not only does it enable the storage of the large amounts of information needed for research in scientific domains
such as DNA sequencing and analysis, it also plays a vital role in keeping stored data accessible by
facilitating cataloging, search, and retrieval.
The concept of entropy is closely related to the design of efficient codes:
the entropy $H$ of a source emitting symbols with probabilities $p_i$ is the expected information content of a symbol,
\begin{equation}
H = E(I) = - \sum_i p_i \log_2(p_i)
\label{eq:entropy-information}
\end{equation}
Understanding entropy as the expected information $E(I)$ of a message provides the intuition that,
for a source of given entropy (in bits), no code can have a lower average word length (in bits)
\begin{equation}
E(l) = \sum_i p_i l_i
\end{equation}
than this entropy without losing information.
This is the content of Shannon's source coding theorem,
introduced in \citeyear{shannon1948mathematical} \cite{enwiki:shannon-source-coding}.
In his paper, \citeauthor{shannon1948mathematical} proposed two principal ideas for minimizing the average length of a code.
The first is to use shorter codes for symbols with higher probability.
This is intuitive, as more frequent symbols have a greater impact on the average code length.
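
For concreteness, consider a small example (constructed here purely for illustration):
a source with three symbols of probabilities $1/2$, $1/4$, and $1/4$ has entropy
\begin{equation*}
H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}
  = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5 \text{ bits},
\end{equation*}
and the prefix code $\{0, 10, 11\}$, which assigns the shortest codeword to the most probable symbol,
achieves exactly $E(l) = \tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{4} \cdot 2 = 1.5$ bits,
so no better code exists for this source.
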
\section{Kraft--McMillan inequality}
% https://de.wikipedia.org/wiki/Kraft-Ungleichung
% https://en.wikipedia.org/wiki/Kraft%E2%80%93McMillan_inequality
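
In its standard form, the inequality states that a binary prefix code
with codeword lengths $l_1, \dots, l_n$ exists if and only if
\begin{equation}
\sum_{i=1}^{n} 2^{-l_i} \leq 1.
\label{eq:kraft}
\end{equation}
McMillan extended this necessity to every uniquely decodable code,
so restricting attention to prefix codes costs nothing in terms of achievable codeword lengths.
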
\section{Shannon--Fano}
Shannon--Fano coding is one of the earliest methods for constructing prefix codes.
It divides symbols into groups based on their probabilities, recursively partitioning them to assign shorter codewords
to more frequent symbols (Algorithm~\ref{alg:shannon-fano}).
While intuitive, Shannon--Fano coding does not always produce an optimal code,
which paved the way for more advanced techniques such as Huffman coding.

\begin{algorithm}
\begin{algorithmic}[1]
% a sketch of the standard recursive partitioning (placeholder body filled in)
\Procedure{ShannonFano}{$S$} \Comment{$S$: symbols sorted by decreasing probability}
    \If{$|S| > 1$}
        \State split $S$ into $S_1$ and $S_2$ so their total probabilities are as equal as possible
        \State append bit $0$ to every codeword in $S_1$ and bit $1$ to every codeword in $S_2$
        \State \Call{ShannonFano}{$S_1$}
        \State \Call{ShannonFano}{$S_2$}
    \EndIf
\EndProcedure
\end{algorithmic}
\caption{Shannon--Fano compression algorithm}
\label{alg:shannon-fano}
\end{algorithm}

\section{Huffman Coding}
Huffman coding is an optimal prefix code construction: among all symbol-by-symbol prefix codes,
it minimizes the expected codeword length for a given set of symbol probabilities.
By constructing a binary tree in which the most frequent symbols receive the shortest codewords,
Huffman coding comes within one bit of the entropy of a discrete memoryless source,
matching it exactly when all symbol probabilities are powers of $1/2$.
Its efficiency and simplicity have made it a cornerstone of lossless data compression.

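The tree construction can be sketched as follows; this is the textbook greedy procedure,
included here for illustration in the style of Algorithm~\ref{alg:shannon-fano}.

\begin{algorithm}
\begin{algorithmic}[1]
% illustrative sketch of the standard greedy construction
\Procedure{Huffman}{$S$} \Comment{$S$: symbols with probabilities $p_i$}
    \State create a leaf node for each symbol and insert it into a min-priority queue keyed by probability
    \While{the queue contains more than one node}
        \State remove the two nodes $x$ and $y$ with the smallest probabilities
        \State create a node $z$ with probability $p_x + p_y$, left child $x$ (bit $0$), and right child $y$ (bit $1$)
        \State insert $z$ into the queue
    \EndWhile
    \State \Return the remaining node as the root; codewords are read off the root-to-leaf paths
\EndProcedure
\end{algorithmic}
\caption{Huffman tree construction (illustrative sketch)}
\label{alg:huffman-sketch}
\end{algorithm}
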
\section{Arithmetic Coding}
Arithmetic coding is a compression technique that encodes an entire message as a single subinterval
of the range $[0, 1)$.
By iteratively narrowing this interval according to the probabilities of the symbols in the message,
arithmetic coding can achieve compression rates that approach the entropy of the source.
Its ability to spend, in effect, a non-integer number of bits per symbol makes it particularly powerful
for applications requiring high compression efficiency.
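
As a toy illustration (example constructed here), take symbols $a$, $b$, $c$ with probabilities $1/2$, $1/4$, $1/4$
and subintervals $[0, 0.5)$, $[0.5, 0.75)$, and $[0.75, 1)$.
Encoding the message $ba$ first narrows $[0, 1)$ to $b$'s interval $[0.5, 0.75)$,
then to the first half of that interval, $[0.5, 0.625)$, for the symbol $a$.
Any number in the final interval, such as $0.5 = 0.1_2$, identifies the message once its length is known.
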
\section{LZW Algorithm}
The Lempel--Ziv--Welch (LZW) algorithm is a dictionary-based compression method that dynamically builds a dictionary
of recurring patterns in the data.
Unlike entropy-based methods, LZW does not require prior knowledge of symbol probabilities,
making it highly adaptable and efficient for a wide range of applications,
including image and text compression \cite{dewiki:lzw}.
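
The encoding step can be sketched as follows (an illustrative outline in the style of the algorithms above,
not a definitive implementation).

\begin{algorithm}
\begin{algorithmic}[1]
% illustrative sketch of LZW encoding
\Procedure{LZWEncode}{input string $w$}
    \State initialize the dictionary with all single-character strings
    \State $s \gets$ the empty string
    \For{each character $c$ of $w$}
        \If{the concatenation $sc$ is in the dictionary}
            \State $s \gets sc$
        \Else
            \State output the dictionary index of $s$
            \State add $sc$ to the dictionary as a new entry
            \State $s \gets c$
        \EndIf
    \EndFor
    \State output the dictionary index of $s$
\EndProcedure
\end{algorithmic}
\caption{LZW encoding (illustrative sketch)}
\label{alg:lzw-sketch}
\end{algorithm}
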
\printbibliography

\end{document}