begin entropy examples
entropy.tex

@@ -26,8 +26,9 @@ Originating in classical thermodynamics,
over time it has been applied in different sciences such as chemistry and information theory.
%As the informal concept of entropy gains popularity, its specific meaning can feel far-fetched and ambiguous.
The name `entropy' was coined by the German physicist \textit{Rudolf Clausius} in 1865
while postulating the second law of thermodynamics, one of the three (or four)
laws of thermodynamics, which are based on universal observations regarding heat and energy conversion.

Specifically, the second law states that not all thermal energy can be converted into work in a cyclic process.
In other words, the entropy of an isolated system cannot decrease,
as such systems always tend toward a state of thermodynamic equilibrium, where entropy is highest for a given internal energy.
@@ -42,15 +43,134 @@ that is still in use in information theory today.
\begin{equation}
S = -k_B \sum_i p_i \ln(p_i)
\end{equation}
It gives statistical meaning to the macroscopic phenomenon of classical thermodynamics
by defining the entropy $S$ of a macrostate as
a function of the probabilities $p_i$ of all its constituent microstates.
$k_B$ refers to the Boltzmann constant, whose numerical value Boltzmann himself did not determine.
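As a quick sanity check of the formula: for a macrostate consisting of $W$ equally probable microstates, i.e.\ $p_i = \frac{1}{W}$, it reduces to the well-known special case
\begin{equation}
S = -k_B \sum_{i=1}^{W} \frac{1}{W} \ln\frac{1}{W} = k_B \ln W.
\end{equation}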

\section{Shannon's axioms}
\textit{Claude Shannon} adapted the concept of entropy to information theory.
In an era of advancing communication technologies, the question he addressed was of increasing importance:
How can messages be encoded and transmitted efficiently?
As a measure, Shannon's formula uses the \textit{bit}, quantifying the efficiency of codes
and media for transmission and storage.
According to his axioms, a measure of information $I(p)$ for an event with probability $p$ has to satisfy the following criteria:
\begin{enumerate}
\item $I(p)$ is monotonically decreasing in $p$: an increase in the probability of an event
decreases the information gained from observing it, and vice versa.
\item $I(1) = 0$: events that always occur do not communicate information.
\item $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$: the information learned from independent events
is the sum of the information learned from each event.
\item $I(p)$ is a twice continuously differentiable function of $p$.
\end{enumerate}
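A function of the form $I(p) = -\log_b(p)$ with base $b > 1$ satisfies all four criteria; in particular, the logarithm turns the product in the third axiom into a sum:
\begin{equation}
I(p_1 \cdot p_2) = -\log_b(p_1 \cdot p_2) = -\log_b(p_1) - \log_b(p_2) = I(p_1) + I(p_2).
\end{equation}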
In information theory, entropy can be understood as the expected information of a message.
\begin{equation}
H = E(I) = - \sum_i p_i \log_2(p_i)
\end{equation}
This leaves $I(p_i) = \log_2(1/p_i) = -\log_2(p_i)$, implying that an unexpected message (low probability) carries
more information than one with higher probability.
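For example, a fair coin and a heavily biased coin illustrate the difference:
\begin{equation}
H\left(\frac{1}{2},\frac{1}{2}\right) = 1 \textrm{ bit}, \qquad
H(0.9, 0.1) = -0.9\log_2(0.9) - 0.1\log_2(0.1) \approx 0.47 \textrm{ bits},
\end{equation}
so the almost predictable biased coin communicates far less information per toss on average.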
Intuitively, we can imagine David A. Johnston, a volcanologist, reporting day after day that there is no
activity on Mount St. Helens. After a while, we grow to expect this message because it is statistically very likely
that tomorrow's message will be the same. When one day we get the message `Vancouver! Vancouver! This is it!', it carries a lot of information,
not only semantically (because it announces the eruption of a volcano) but also statistically, because it was very unlikely
given the transmission history.

The base 2 is chosen for the logarithm because our computers rely on a number system of the same base, but in principle
arbitrary bases can be used, as they are proportional to one another according to $\log_a b = \frac{\log_c b}{\log_c a}$.

Further, the $\log_2$ can be understood intuitively for an event source with $2^n$ possible outcomes:
using standard binary coding, we can easily see that a message has to contain $\log_2(2^n) = n$ bits
in order to be able to encode all possible outcomes.
For alphabet sizes $a \neq 2^n$, such as $a = 10$, we can instead group $k$ outcomes into one block and pick $n$ with $a^k \leq 2^n$,
which defines a message size that can encode the outcomes of $k$ event sources with $a$ outcomes each,
leaving the required bits per event source at approximately $\log_2(a^k) \div k = \log_2(a)$.
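For example, with $a = 10$ we can pack $k = 3$ decimal digits into one block:
\begin{equation}
10^{3} = 1000 \leq 2^{10} = 1024 \quad\Rightarrow\quad \frac{10 \textrm{ bits}}{3 \textrm{ digits}} \approx 3.33 \approx \log_2(10) \approx 3.32,
\end{equation}
and the overhead shrinks further as the blocks get longer.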

%- conditional entropy
%- redundancy
%- source entropy
\section{Applications}
\subsection{Decision Trees}
A decision tree is a supervised learning approach commonly used in data mining.
The goal is to create an algorithm, i.e.\ a series of questions to pose to new data (input variables),
in order to predict the target variable, a class label.
Graphically, each question can be visualized as a node in a tree, splitting the dataset into two or more groups.
This procedure is applied to the source set and then to its resulting subsets in a process called \textit{recursive partitioning}.
Once a leaf is reached, the class of the input has been determined.

In order to build the shallowest possible trees, we want to use input variables that minimize uncertainty.
While other measures for choosing the best split, such as the Gini coefficient, exist,
entropy is a popular measure used in decision trees.

Using what we have learned about information and entropy, we naturally want to ask the questions with the
highest information content, i.e.\ split on the input variables that reduce entropy the most.

What we need to compute is therefore the entropy of the classifier minus the expected entropy
after splitting on the attribute,
as explained in~\autoref{ex:decisiontree}.
\begin{figure}[H]
\centering
\begin{minipage}{.3\textwidth}
\begin{tabular}{c|c|c}
 & hot & cold \\
\hline
rain & 4 & 5 \\
\hline
no rain & 3 & 2 \\
\end{tabular}
\end{minipage}
\begin{minipage}{.6\textwidth}
The entropy for rain is thus $H_{prior} = H(\frac{9}{14},\frac{5}{14})$;
after splitting on the attribute temperature it becomes $H_{hot} = H(\frac{4}{7}, \frac{3}{7})$
and $H_{cold} = H(\frac{5}{7}, \frac{2}{7})$.
The expected entropy after splitting on temperature is therefore $p_{hot} \cdot H_{hot} + p_{cold} \cdot H_{cold}$.
The \textbf{information gain} is then computed as the entropy for rain minus this expected entropy.
Since $H_{prior}$ is constant in this computation, one can equivalently just minimize the expected entropy $E[H]$ after the split.
\end{minipage}
\caption{Example: information gain when building a decision tree}
\label{ex:decisiontree}
\end{figure}
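The numbers in \autoref{ex:decisiontree} can be checked with a few lines of plain Python (a minimal sketch; the helper \texttt{entropy} is our own):
\begin{verbatim}
import math

def entropy(*probs):
    # H(p_1, ..., p_n) = -sum_i p_i * log2(p_i); terms with p_i = 0 contribute 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_prior = entropy(9/14, 5/14)            # entropy of the class label (rain / no rain)
h_hot   = entropy(4/7, 3/7)              # entropy within the "hot" branch
h_cold  = entropy(5/7, 2/7)              # entropy within the "cold" branch
h_split = 7/14 * h_hot + 7/14 * h_cold   # expected entropy after splitting on temperature
print(h_prior - h_split)                 # information gain, roughly 0.02 bits
\end{verbatim}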

Advantages: low computational cost and an easily interpretable structure.
Disadvantages: overfitting and low robustness: even small changes in the training data can lead to a different tree.

Decision trees can be extended to a \textbf{random forest} (bagging) to increase robustness and reduce overfitting.
This, however, comes at the cost of higher computational effort and lower interpretability.
To do so, the training data can be randomly grouped and the resulting trees used for a majority vote.
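If \texttt{scikit-learn} is available, both variants can be sketched along these lines (an illustration only; the toy data mirrors the table above and is not part of the original text):
\begin{verbatim}
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Toy data mirroring the table: feature = temperature (1 = hot, 0 = cold), label = rain
X = [[1]] * 7 + [[0]] * 7
y = [1, 1, 1, 1, 0, 0, 0,    # hot:  4 rain, 3 no rain
     1, 1, 1, 1, 1, 0, 0]    # cold: 5 rain, 2 no rain

tree   = DecisionTreeClassifier(criterion="entropy").fit(X, y)
forest = RandomForestClassifier(n_estimators=100, criterion="entropy").fit(X, y)
print(tree.predict([[1]]), forest.predict([[1]]))
\end{verbatim}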
\subsection{Cross-Entropy}
The Kullback-Leibler divergence is $D_{KL}(p \,\|\, q) = H(p,q) - H(p)$;
the cross-entropy $H(p,q)$ is widely used as a cost function in machine learning.
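For reference, with $p$ the true distribution and $q$ the model distribution, the cross-entropy is defined as
\begin{equation}
H(p,q) = - \sum_i p_i \log_2(q_i),
\end{equation}
which satisfies $H(p,q) \geq H(p)$ with equality exactly when $q = p$, so minimizing the cross-entropy as a loss drives the model distribution toward the data distribution.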
\subsection{Coding}
%Coding of a source of an information and communication channel
% https://www.youtube.com/watch?v=ErfnhcEV1O8
% relation to hamming distance and efficient codes

\subsection{Noisy communication channels}
Given a model of a communication channel subject to noise, as shown in \autoref{fig:noisy-channel}:

\begin{figure}[H]
\begin{tikzpicture}
\def\boxw{2.5cm}
\def\n{5}
\pgfmathsetmacro{\gap}{(\textwidth - \n*\boxw)/(\n-1)}
% Draw the boxes
\node (A) at (0, 0) [draw, text width=\boxw, align=center] {Information Source};
\node (B) at (\boxw + \gap, 0) [draw, text width=\boxw, align=center] {Transmitter};
\node (C) at ({2*(\boxw + \gap)}, 0) [draw, text width=\boxw, align=center] {Channel};
\node (N) at ({2*(\boxw + \gap)}, -1) [draw, text width=\boxw, align=center] {Noise};
\node (D) at ({3*(\boxw + \gap)}, 0) [draw, text width=\boxw, align=center] {Receiver};
\node (E) at ({4*(\boxw + \gap)}, 0) [draw, text width=\boxw, align=center] {Destination};

% Draw arrows between the boxes
\draw[->] (A) -- (B);
\draw[->] (B) -- (C);
\draw[->] (C) -- (D);
\draw[->] (D) -- (E);
\draw[->] (N) -- (C);
\end{tikzpicture}
\caption{Model of a noisy communication channel}
\label{fig:noisy-channel}
\end{figure}

\end{document}