begin entropy examples
entropy.tex

@@ -26,8 +26,9 @@ Originating in classical thermodynamics,
over time it has been applied in different sciences such as chemistry and information theory.
%As the informal concept of entropy gains popularity, its specific meaning can feel far-fetched and ambiguous.
The name `entropy' was coined by the German physicist \textit{Rudolf Clausius} in 1865
while postulating the second law of thermodynamics, one of the three (or four)
laws of thermodynamics, which are based on universal observations regarding heat and energy conversion.

Specifically, the second law states that not all thermal energy can be converted into work in a cyclic process.
In other words, the entropy of an isolated system cannot decrease,
as such systems always tend toward a state of thermodynamic equilibrium, where entropy is highest for a given internal energy.
@@ -42,15 +43,134 @@ that is still in use in information theory today.
\begin{equation}
S = -k_B \sum_i p_i \ln(p_i)
\end{equation}
It gives statistical meaning to the macroscopic phenomenon of classical thermodynamics
by defining the entropy $S$ of a macrostate as
a function of the probabilities $p_i$ of all its constituent microstates.
$k_B$ refers to the Boltzmann constant, whose numerical value Boltzmann himself did not determine.
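As a quick sanity check of the formula: for a macrostate consisting of $W$ equally probable microstates, i.e.\ $p_i = \frac{1}{W}$, it reduces to the well-known special case
\begin{equation}
S = -k_B \sum_{i=1}^{W} \frac{1}{W} \ln\frac{1}{W} = k_B \ln W.
\end{equation}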

\section{Shannon's axioms}
\textit{Claude Shannon} adapted the concept of entropy to information theory.
In an era of advancing communication technologies, the question he addressed was of increasing importance:
How can messages be encoded and transmitted efficiently?
As a measure, Shannon's formula uses the \textit{bit}, quantifying the efficiency of codes
and media for transmission and storage.
According to his axioms, a measure of information $I(p)$ for an event with probability $p$ has to satisfy the following criteria:
\begin{enumerate}
\item $I(p)$ is monotonically decreasing in $p$: an increase in the probability of an event
decreases the information gained from observing it, and vice versa.
\item $I(1) = 0$: events that always occur do not communicate information.
\item $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$: the information learned from independent events
is the sum of the information learned from each event.
\item $I(p)$ is a twice continuously differentiable function of $p$.
\end{enumerate}
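A function of the form $I(p) = -\log_b(p)$ with base $b > 1$ satisfies all four criteria; in particular, the logarithm turns the product in the third axiom into a sum:
\begin{equation}
I(p_1 \cdot p_2) = -\log_b(p_1 \cdot p_2) = -\log_b(p_1) - \log_b(p_2) = I(p_1) + I(p_2).
\end{equation}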
In information theory, entropy can be understood as the expected information of a message.
\begin{equation}
H = E(I) = - \sum_i p_i \log_2(p_i)
\end{equation}
This leaves $I(p_i) = \log_2(1/p_i) = -\log_2(p_i)$, implying that an unexpected message (low probability) carries
more information than one with higher probability.
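For example, a fair coin and a heavily biased coin illustrate the difference:
\begin{equation}
H\left(\frac{1}{2},\frac{1}{2}\right) = 1 \textrm{ bit}, \qquad
H(0.9, 0.1) = -0.9\log_2(0.9) - 0.1\log_2(0.1) \approx 0.47 \textrm{ bits},
\end{equation}
so the almost predictable biased coin communicates far less information per toss on average.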
Intuitively, we can imagine David A. Johnston, a volcanologist, reporting day after day that there is no
activity on Mount St. Helens. After a while, we grow to expect this message because it is statistically very likely
that tomorrow's message will be the same. When one day we get the message `Vancouver! Vancouver! This is it!', it carries a lot of information,
not only semantically (because it announces the eruption of a volcano) but also statistically, because it was very unlikely
given the transmission history.

The base 2 is chosen for the logarithm because our computers rely on a number system of the same base, but in principle
arbitrary bases can be used, as they are proportional to one another according to $\log_a b = \frac{\log_c b}{\log_c a}$.

Further, the $\log_2$ can be understood intuitively for an event source with $2^n$ possible outcomes:
using standard binary coding, we can easily see that a message has to contain $\log_2(2^n) = n$ bits
in order to be able to encode all possible outcomes.
For alphabet sizes $a \neq 2^n$, such as $a = 10$, we can instead group $k$ outcomes into one block and pick $n$ with $a^k \leq 2^n$,
which defines a message size that can encode the outcomes of $k$ event sources with $a$ outcomes each,
leaving the required bits per event source at approximately $\log_2(a^k) \div k = \log_2(a)$.
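For example, with $a = 10$ we can pack $k = 3$ decimal digits into one block:
\begin{equation}
10^{3} = 1000 \leq 2^{10} = 1024 \quad\Rightarrow\quad \frac{10 \textrm{ bits}}{3 \textrm{ digits}} \approx 3.33 \approx \log_2(10) \approx 3.32,
\end{equation}
and the overhead shrinks further as the blocks get longer.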

%- conditional entropy
%- redundancy
%- source entropy
\section{Applications}
\subsection{Decision Trees}
A decision tree is a supervised learning approach commonly used in data mining.
The goal is to create an algorithm, i.e.\ a series of questions to pose to new data (input variables),
in order to predict the target variable, a class label.
Graphically, each question can be visualized as a node in a tree, splitting the dataset into two or more groups.
This procedure is applied to the source set and then to its resulting subsets in a process called \textit{recursive partitioning}.
Once a leaf is reached, the class of the input has been determined.

In order to build the shallowest possible trees, we want to use input variables that minimize uncertainty.
While other measures for choosing the best split, such as the Gini coefficient, exist,
entropy is a popular measure used in decision trees.

Using what we have learned about information and entropy, we naturally want to ask the questions with the
highest information content, i.e.\ split on the input variables that reduce entropy the most.

What we need to compute is therefore the entropy of the classifier minus the expected entropy
after splitting on the attribute,
as explained in~\autoref{ex:decisiontree}.
\begin{figure}[H]
\centering
\begin{minipage}{.3\textwidth}
\begin{tabular}{c|c|c}
 & hot & cold \\
\hline
rain & 4 & 5 \\
\hline
no rain & 3 & 2 \\
\end{tabular}
\end{minipage}
\begin{minipage}{.6\textwidth}
The entropy for rain is thus $H_{prior} = H(\frac{9}{14},\frac{5}{14})$;
after splitting on the attribute temperature it becomes $H_{hot} = H(\frac{4}{7}, \frac{3}{7})$
and $H_{cold} = H(\frac{5}{7}, \frac{2}{7})$.
The expected entropy after splitting on temperature is therefore $p_{hot} \cdot H_{hot} + p_{cold} \cdot H_{cold}$.
The \textbf{information gain} is then computed as the entropy for rain minus this expected entropy.
Since $H_{prior}$ is constant in this computation, one can equivalently just minimize the expected entropy $E[H]$ after the split.
\end{minipage}
\caption{Example: information gain when building a decision tree}
\label{ex:decisiontree}
\end{figure}
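The numbers in \autoref{ex:decisiontree} can be checked with a few lines of plain Python (a minimal sketch; the helper \texttt{entropy} is our own):
\begin{verbatim}
import math

def entropy(*probs):
    # H(p_1, ..., p_n) = -sum_i p_i * log2(p_i); terms with p_i = 0 contribute 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_prior = entropy(9/14, 5/14)            # entropy of the class label (rain / no rain)
h_hot   = entropy(4/7, 3/7)              # entropy within the "hot" branch
h_cold  = entropy(5/7, 2/7)              # entropy within the "cold" branch
h_split = 7/14 * h_hot + 7/14 * h_cold   # expected entropy after splitting on temperature
print(h_prior - h_split)                 # information gain, roughly 0.02 bits
\end{verbatim}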

Advantages: low computational cost and an easily interpretable structure.
Disadvantages: overfitting and low robustness: even small changes in the training data can lead to a different tree.

Decision trees can be extended to a \textbf{random forest} (bagging) to increase robustness and reduce overfitting.
This, however, comes at the cost of higher computational effort and lower interpretability.
To do so, the training data can be randomly grouped and the resulting trees used for a majority vote.
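If \texttt{scikit-learn} is available, both variants can be sketched along these lines (an illustration only; the toy data mirrors the table above and is not part of the original text):
\begin{verbatim}
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Toy data mirroring the table: feature = temperature (1 = hot, 0 = cold), label = rain
X = [[1]] * 7 + [[0]] * 7
y = [1, 1, 1, 1, 0, 0, 0,    # hot:  4 rain, 3 no rain
     1, 1, 1, 1, 1, 0, 0]    # cold: 5 rain, 2 no rain

tree   = DecisionTreeClassifier(criterion="entropy").fit(X, y)
forest = RandomForestClassifier(n_estimators=100, criterion="entropy").fit(X, y)
print(tree.predict([[1]]), forest.predict([[1]]))
\end{verbatim}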
\subsection{Cross-Entropy}
The Kullback-Leibler divergence is $D_{KL}(p \,\|\, q) = H(p,q) - H(p)$;
the cross-entropy $H(p,q)$ is widely used as a cost function in machine learning.
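For reference, with $p$ the true distribution and $q$ the model distribution, the cross-entropy is defined as
\begin{equation}
H(p,q) = - \sum_i p_i \log_2(q_i),
\end{equation}
which satisfies $H(p,q) \geq H(p)$ with equality exactly when $q = p$, so minimizing the cross-entropy as a loss drives the model distribution toward the data distribution.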
\subsection{Coding}
%Coding of a source of an information and communication channel
% https://www.youtube.com/watch?v=ErfnhcEV1O8
% relation to hamming distance and efficient codes

\subsection{Noisy communication channels}
Given a model of a communication channel subject to noise, as shown in \autoref{fig:noisy-channel}:

\begin{figure}[H]
\begin{tikzpicture}
\def\boxw{2.5cm}
\def\n{5}
\pgfmathsetmacro{\gap}{(\textwidth - \n*\boxw)/(\n-1)}
% Draw the boxes
\node (A) at (0, 0) [draw, text width=\boxw, align=center] {Information Source};
\node (B) at (\boxw + \gap, 0) [draw, text width=\boxw, align=center] {Transmitter};
\node (C) at ({2*(\boxw + \gap)}, 0) [draw, text width=\boxw, align=center] {Channel};
\node (N) at ({2*(\boxw + \gap)}, -1) [draw, text width=\boxw, align=center] {Noise};
\node (D) at ({3*(\boxw + \gap)}, 0) [draw, text width=\boxw, align=center] {Receiver};
\node (E) at ({4*(\boxw + \gap)}, 0) [draw, text width=\boxw, align=center] {Destination};

% Draw arrows between the boxes
\draw[->] (A) -- (B);
\draw[->] (B) -- (C);
\draw[->] (C) -- (D);
\draw[->] (D) -- (E);
\draw[->] (N) -- (C);
\end{tikzpicture}
\caption{Model of a noisy communication channel}
\label{fig:noisy-channel}
\end{figure}

\end{document}