diff --git a/entropy.tex b/entropy.tex
index e990fff..4f4ca91 100644
--- a/entropy.tex
+++ b/entropy.tex
@@ -2,11 +2,13 @@
 \usepackage[utf8x]{inputenc}
 \usepackage[margin=1in]{geometry} % Adjust margins
 \usepackage{caption}
+\usepackage{wrapfig}
 \usepackage{subcaption}
 \usepackage{parskip} % dont indent after paragraphs, figures
 \usepackage{xcolor}
 %\usepackage{csquotes} % Recommended for biblatex
 \usepackage{tikz}
+\usepackage{pgfplots}
 \usepackage{float}
 \usepackage{amsmath}
 \PassOptionsToPackage{hyphens}{url}
@@ -55,17 +57,18 @@ $k_B$ refers to the Boltzmann constant, which he himself did not determine.
 \textit{Claude Shannon} adapted the concept of entropy to information theory.
 In an era of advancing communication technologies, the question he addressed was of increasing importance:
 How can messages be encoded and transmitted efficiently?
-As a measure, Shannon's formula uses the \textit{Bit}, quantifying the efficiency of codes
-and media for transmission and storage.
-According to his axioms, a measure for information has to comply with the following criteria:
+He proposed three axioms, plus a fourth technical smoothness condition, that a measure of
+information $I(p)$ has to satisfy:
 \begin{enumerate}
     \item $I(1) = 0$: events that always occur do not communicate information.
-    \item $I(p)$ is monotonically decreasing in p: an increase in the probability of an event
+    \item $I(p)$ is monotonically decreasing in $p$, i.e.\ $I'(p) \leq 0$: an increase in the probability of an event
     decreases the information from an observed event, and vice versa.
     \item $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$: the information learned from independent events is the sum of the information learned from each event.
     \item $I(p)$ is a twice continuously differentiable function of p.
 \end{enumerate}
+
+As a measure, Shannon's formula uses the \textit{Bit}, quantifying the efficiency of codes
+and media for transmission and storage.
 In information theory, entropy can be understood as the expected information of a message.
 \begin{equation}
     H = E(I) = - \sum_i p_i \log_2(p_i)
 \end{equation}
@@ -78,6 +81,60 @@ that tomorrows message will be the same. When some day we get the message 'Vanco
 not only semantically (because it announces the eruption of a volcano) but statistically
 because it was very unlikely given the transmission history.
 
+However, the uncertainty (entropy) of this source is still relatively low.
+We attach high surprise only to the unlikely message of an eruption; the far more likely
+routine message carries little information, because we already expected it before it arrived.
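+
+To make this concrete, suppose (the probabilities are chosen purely for illustration) that the
+station reports ``no eruption'' with probability $0.99$ and ``eruption'' with probability $0.01$.
+The rare message then carries $-\log_2(0.01) \approx 6.64$ bits, the routine one only
+$-\log_2(0.99) \approx 0.01$ bits, and the expected information of the source is
+\begin{equation*}
+    H = -0.99 \log_2(0.99) - 0.01 \log_2(0.01) \approx 0.08 \text{ bits},
+\end{equation*}
+far below the $1$ bit of a fair coin toss.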
+
+Putting the axioms and our intuitive understanding of information and uncertainty together,
+we arrive at the curves shown in \autoref{fig:graph-information} and \autoref{fig:graph-entropy}.
+
+\begin{figure}[H]
+\begin{minipage}{.5\textwidth}
+\begin{tikzpicture}
+    \begin{axis}[
+        domain=0.001:1,
+        samples=100,
+        axis lines=middle,
+        xlabel={$p$},
+        ylabel={Information},
+        xmin=0, xmax=1,
+        ymin=0, ymax=6.1,
+        grid=both,
+        width=8cm,
+        height=6cm,
+        every axis x label/.style={at={(current axis.right of origin)}, anchor=west},
+        every axis y label/.style={at={(current axis.above origin)}, anchor=south},
+    ]
+    \addplot[thick, blue] {-log2(x)};
+    \end{axis}
+\end{tikzpicture}
+\caption{Information contained in a message depending on its probability $p$}
+\label{fig:graph-information}
+\end{minipage}%
+\begin{minipage}{.5\textwidth}
+\begin{tikzpicture}
+    \begin{axis}[
+        domain=0.001:0.999,
+        samples=100,
+        axis lines=middle,
+        xlabel={$p$},
+        ylabel={Entropy},
+        xmin=0, xmax=1,
+        ymin=0, ymax=1.1,
+        grid=both,
+        width=8cm,
+        height=6cm,
+        every axis x label/.style={at={(current axis.right of origin)}, anchor=west},
+        every axis y label/.style={at={(current axis.above origin)}, anchor=south},
+    ]
+    \addplot[thick, blue] {-x * log2(x) - (1-x) * log2(1-x)};
+    \end{axis}
+\end{tikzpicture}
+\caption{Entropy of an event source with two possible events, depending on their probabilities $(p, 1-p)$}
+\label{fig:graph-entropy}
+\end{minipage}
+\end{figure}
+
 The base 2 is chosen for the logarithm as our computers rely on a system of the same base,
 but theoretically arbitrary bases can be used as they are proportional according to
 $\log_a b = \frac{\log_c b}{\log_c a} $.
@@ -136,8 +193,23 @@ However, drawbacks include overfitting and poor robustness, where minimal altera
 can lead to a change in tree structure.
 
 \subsection{Cross-Entropy}
-Kullback-Leibler = $H(p,q) - H(p)$
-as a cost function in machine learning
+When dealing with two distributions, the \textit{cross-entropy} between a true distribution $p$
+and an estimated distribution $q$ is defined as:
+\begin{equation}
+    H(p, q) = -\sum_x p(x) \log_2 q(x)
+\end{equation}
+The \textit{Kullback–Leibler divergence} measures how much information is lost when $q$
+is used to approximate $p$:
+\begin{equation}
+    D_{KL}(p \| q) = H(p, q) - H(p)
+\end{equation}
+Both quantities appear in machine learning as loss (cost) functions: cross-entropy as the
+standard loss in classification problems, and the KL divergence in probabilistic models such as
+Variational Autoencoders (VAEs).
+In classification, the true label and the predicted class probabilities take the roles of the
+true distribution $p$ and the estimated distribution $q$, respectively.
+For a single supervised training example, the label is assumed to be the one-hot unit vector
+$e_i$ of the true class $i$; only this one term of the sum survives, so the cross-entropy loss
+reduces to $-\log_2(q_i)$, as illustrated in the short numerical sketch below.
+
 \subsection{Coding}
 %Coding of a source of an information and communication channel
 % https://www.youtube.com/watch?v=ErfnhcEV1O8
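+
+As a small numerical sketch of this collapse (the predicted probabilities are made up for
+illustration): for a three-class example with true class $i = 1$ and prediction
+$q = (0.7, 0.2, 0.1)$, the one-hot label is $p = e_1 = (1, 0, 0)$ and the loss is
+\begin{equation*}
+    H(e_1, q) = -(1 \cdot \log_2 0.7 + 0 \cdot \log_2 0.2 + 0 \cdot \log_2 0.1) = -\log_2 0.7 \approx 0.51 \text{ bits}.
+\end{equation*}
+A confident correct prediction ($q_1 \to 1$) drives the loss towards $0$, while a confidently
+wrong one makes it arbitrarily large.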