From d1b45ee1ec467c12f342d1b0909a207a84ef70bd Mon Sep 17 00:00:00 2001
From: eneller
Date: Wed, 29 Oct 2025 17:21:03 +0100
Subject: [PATCH] begin entropy examples

---
 entropy.tex | 132 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 126 insertions(+), 6 deletions(-)

diff --git a/entropy.tex b/entropy.tex
index 900fb07..2586608 100644
--- a/entropy.tex
+++ b/entropy.tex
@@ -26,8 +26,9 @@
 Originating in classical thermodynamics, over time it has been applied in different sciences such as chemistry and information theory.
 %As the informal concept of entropy gains popularity, its specific meaning can feel far-fetched and ambiguous.
 The name 'entropy' was first coined by german physicist \textit{Rudolf Clausius} in 1865
-while postulating the second law of thermodynamics.
-The laws of thermodynamics are based on universal observation regarding heat and energy conversion.
+while postulating the second law of thermodynamics, one of the three (or, counting the zeroth law, four)
+laws of thermodynamics, which are based on universal observations regarding heat and energy conversion.
+
 Specifically, the second law states that not all thermal energy can be converted into work in a cyclic process.
 Or, in other words, that the entropy of an isolated system cannot decrease,
 as they always tend toward a state of thermodynamic equilibrium where entropy is highest for a given internal energy.
@@ -42,15 +43,134 @@
 that is still in use in information theory today.
 \begin{equation}
     S = -k_B \sum_i p_i \ln(p_i)
 \end{equation}
-It gives statical meaning to the macroscopical phenomenon by defining the entropy $S$ of a macrostate as
+It gives statistical meaning to the macroscopic phenomenon of classical thermodynamics
+by defining the entropy $S$ of a macrostate as
 the result of probabilities $p_i$ of all its constituting micro states.
+$k_B$ refers to the Boltzmann constant, whose numerical value Boltzmann himself never determined.
+
+\section{Shannon's axioms}
+\textit{Claude Shannon} adapted the concept of entropy to information theory.
+In an era of advancing communication technologies, the question he addressed was of increasing importance:
+How can messages be encoded and transmitted efficiently?
+As its unit of measure, Shannon's formula uses the \textit{Bit}, quantifying the efficiency of codes
+and media for transmission and storage.
+According to his axioms, a measure for information has to comply with the following criteria:
+\begin{enumerate}
+    \item $I(p)$ is monotonically decreasing in $p$: an increase in the probability of an event
+        decreases the information gained from observing it, and vice versa.
+    \item $I(1) = 0$: events that always occur do not communicate information.
+    \item $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$: the information learned from independent events
+        is the sum of the information learned from each event.
+    \item $I(p)$ is a twice continuously differentiable function of $p$.
+\end{enumerate}
+In information theory, entropy can be understood as the expected information of a message.
+\begin{equation}
+    H = E(I) = - \sum_i p_i \log_2(p_i)
+\end{equation}
+This leaves $I(p_i) = \log_2(1/p_i) = -\log_2(p_i)$, implying that an unexpected message (low probability) carries
+more information than one with a higher probability.
+Intuitively, we can imagine David A. Johnston, a volcanologist, reporting day after day that there is no
+activity on Mount St. Helens. After a while, we grow to expect this message because it is statistically very likely
+that tomorrow's message will be the same. When one day we get the message 'Vancouver! This is it!',
+it carries a lot of information: not only semantically (because it announces the eruption of a volcano),
+but also statistically, because it was very unlikely given the transmission history.
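+
+As a small numerical illustration (the probabilities here are invented for the example), suppose the daily report
+says 'no activity' with probability $p_1 = 0.99$ and announces an eruption with probability $p_2 = 0.01$.
+The two messages then carry $I(p_1) = -\log_2(0.99) \approx 0.014$ Bits and $I(p_2) = -\log_2(0.01) \approx 6.644$ Bits
+respectively, while the expected information of one day's report is only
+\begin{equation}
+    H = -0.99 \log_2(0.99) - 0.01 \log_2(0.01) \approx 0.081 \; \mbox{Bits}.
+\end{equation}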
+
+The base 2 is chosen for the logarithm because our computers rely on a binary system, but theoretically
+arbitrary bases can be used, as logarithms to different bases are proportional according to $\log_a b = \frac{\log_c b}{\log_c a}$.
+
+Further, the $\log_2$ can be understood intuitively for an event source with $2^n$ possible outcomes:
+using standard binary coding, we can easily see that a message has to contain $\log_2(2^n) = n$ Bits
+in order to be able to encode all possible outcomes.
+For a number of outcomes $a$ that is not a power of two, such as $a = 10$, there exist integers $k$ and $n$
+with $a^k \approx 2^n$, which defines a message size of $n$ Bits that can encode the outcomes of $k$ event sources
+with $a$ outcomes each, leaving the required Bits per event source at $n \div k \approx \log_2(a^k) \div k = \log_2(a)$.
 
-\section{Definitions across disciplines}
 %- bedingte Entropie
 %- Redundanz
 %- Quellentropie
-\section{Shannon Axioms}
-\section{Coding of a source of an information and communication channel}
+\section{Applications}
+\subsection{Decision Trees}
+A decision tree is a supervised learning approach commonly used in data mining.
+The goal is to create an algorithm, i.e.\ a series of questions to pose to new data (input variables),
+in order to predict the target variable, a class label.
+Graphically, each question can be visualized as a node in a tree, splitting the dataset into two or more groups.
+This questioning is applied to the source set and then to its resulting subsets, a process called \textit{recursive partitioning}.
+Once a leaf is reached, the class of the input has been successfully determined.
+In order to build the shallowest possible trees, we want to use input variables that minimize uncertainty.
+While other measures for choosing the best split exist, such as the Gini impurity,
+entropy is a popular measure used in decision trees.
+
+Using what we have learned about information and entropy, we of course want to ask the questions with the
+highest information content, i.e.\ to split on the input variable that reduces the uncertainty about the class label the most.
+
+What has to be computed is therefore the entropy of the classifier minus the expected entropy
+after splitting on the attribute,
+as explained in~\autoref{ex:decisiontree} and evaluated numerically below.
+\begin{figure}[H]
+\centering
+\begin{minipage}{.3\textwidth}
+\begin{tabular}{c|c|c}
+     & hot & cold \\
+    \hline
+    rain & 4 & 5 \\
+    \hline
+    no rain & 3 & 2 \\
+\end{tabular}
+\end{minipage}
+\begin{minipage}{.6\textwidth}
+The entropy for rain is therefore $H_{prior} = H(\frac{9}{14},\frac{5}{14})$;
+after splitting on the attribute temperature it amounts to $H_{warm}= H(\frac{4}{7}, \frac{3}{7})$
+and $H_{cold}= H(\frac{5}{7}, \frac{2}{7})$.
+The expected entropy (expectation value) after splitting on temperature is thus $p_{warm} \cdot H_{warm} + p_{cold} \cdot H_{cold}$.
+The \textbf{information gain} is then computed as the entropy for rain minus this expected entropy.
+Since $H_{prior}$ is constant in this computation, one can also simply minimize $E[H]$ after the split.
+\end{minipage}
+\caption{Example of information gain for building a decision tree}
+\label{ex:decisiontree}
+\end{figure}
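+
+Plugging the numbers from the table into the entropy formula (rounded to three decimal places) gives
+$H_{prior} = H(\frac{9}{14},\frac{5}{14}) \approx 0.940$, $H_{warm} = H(\frac{4}{7},\frac{3}{7}) \approx 0.985$
+and $H_{cold} = H(\frac{5}{7},\frac{2}{7}) \approx 0.863$ Bits, so the expected entropy after splitting on temperature is
+\begin{equation}
+    \frac{7}{14} \cdot 0.985 + \frac{7}{14} \cdot 0.863 \approx 0.924,
+\end{equation}
+leaving an information gain of only $0.940 - 0.924 \approx 0.016$ Bits: in this example, knowing the temperature
+tells us very little about whether it rains.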
+
+The advantages of decision trees are their low computational cost and their comprehensible structure.
+Their disadvantages are a tendency to overfit and low robustness: even small changes in the training data
+can lead to a different tree.
+
+Decision trees can be extended to a \textbf{Random Forest} (bagging) to increase robustness and reduce overfitting.
+This, however, results in higher computational effort and lower interpretability.
+For this purpose, the training data can be randomly partitioned into groups and the resulting trees used for a majority vote.
+\subsection{Cross-Entropy}
+The Kullback-Leibler divergence can be written as $D_{KL}(p \| q) = H(p,q) - H(p)$,
+relating it to the cross-entropy $H(p,q)$, which is widely used as a cost function in machine learning.
+\subsection{Coding}
+%Coding of a source of an information and communication channel
+% https://www.youtube.com/watch?v=ErfnhcEV1O8
+% relation to hamming distance and efficient codes
+
+\subsection{Noisy communication channels}
+\autoref{fig:noisy-channel} shows the model of a noisy communication channel: a noise source disturbs
+the signal on its way from the transmitter through the channel to the receiver.
+
+\begin{figure}[H]
+\begin{tikzpicture}
+    \def\boxw{2.5cm}
+    \def\n{5}
+    \pgfmathsetmacro{\gap}{(\textwidth - \n*\boxw)/(\n-1)}
+    % Draw the boxes
+    \node (A) at (0, 0) [draw, text width=\boxw, align=center] {Information Source};
+    \node (B) at (\boxw + \gap, 0) [draw, text width=\boxw, align=center] {Transmitter};
+    \node (C) at ({2*(\boxw + \gap)}, 0) [draw, text width=\boxw, align=center] {Channel};
+    \node (N) at ({2*(\boxw + \gap)}, -1) [draw, text width=\boxw, align=center] {Noise};
+    \node (D) at ({3*(\boxw + \gap)}, 0) [draw, text width=\boxw, align=center] {Receiver};
+    \node (E) at ({4*(\boxw + \gap)}, 0) [draw, text width=\boxw, align=center] {Destination};
+
+    % Draw arrows between the boxes
+    \draw[->] (A) -- (B);
+    \draw[->] (B) -- (C);
+    \draw[->] (C) -- (D);
+    \draw[->] (D) -- (E);
+    \draw[->] (N) -- (C);
+\end{tikzpicture}
+\caption{Model of a noisy communication channel}
+\label{fig:noisy-channel}
+\end{figure}
 \end{document}
\ No newline at end of file