Fast Recap:

Recap: Mixture of Gaussian Components: $\underline{Θ_{i}} = \underline{μ_{i}}$

A GMM (Gaussian Mixture Model) is a mixture density of the form:

$$p(\underline{x} \mid \underline{Θ}) = \sum_{j = 1}^{c} P(ω_{j}) \, N(\underline{μ_{j}}, Σ_{j})$$

With the following component densities:

$$p(\underline{x} \mid ω_{j}, \underline{Θ_{j}}) = \frac{1}{(2π)^{\frac{d}{2}} \cdot |Σ_{j}|^{\frac{1}{2}}} \cdot \exp\left\{ -\frac{1}{2} (\underline{x} - \underline{μ_{j}})^{t} \cdot Σ_{j}^{-1} \cdot (\underline{x} - \underline{μ_{j}}) \right\}$$

The gradient of its logarithm with respect to (w.r.t.) the parameters $\underline{Θ_{j}} = \underline{μ_{j}}$ is:

$$\nabla_{\underline{μ_{j}}} \log p(\underline{x} \mid ω_{j}, \underline{μ_{j}}) = Σ_{j}^{-1} (\underline{x} - \underline{μ_{j}})$$
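The gradient formula above can be sanity-checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the test values of `x`, `mu` and `sigma` are illustrative and not part of the lecture.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """Log of the d-dimensional Gaussian density N(x; mu, sigma)."""
    d = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)  # (x - mu)^t Sigma^{-1} (x - mu)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def grad_mu_analytic(x, mu, sigma):
    """Closed form from the notes: Sigma^{-1} (x - mu)."""
    return np.linalg.solve(sigma, x - mu)

def grad_mu_numeric(x, mu, sigma, eps=1e-6):
    """Central finite differences of log_gaussian with respect to mu."""
    grad = np.zeros_like(mu)
    for i in range(mu.shape[0]):
        step = np.zeros_like(mu)
        step[i] = eps
        grad[i] = (log_gaussian(x, mu + step, sigma)
                   - log_gaussian(x, mu - step, sigma)) / (2.0 * eps)
    return grad

# Illustrative values (assumptions, not taken from the lecture).
x = np.array([1.0, 2.0])
mu = np.array([0.5, -1.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])

analytic = grad_mu_analytic(x, mu, sigma)
numeric = grad_mu_numeric(x, mu, sigma)
```

If the formula is right, `analytic` and `numeric` should agree up to finite-difference error.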

In the previous lecture we saw that the estimate must satisfy the following condition:

$$\sum_{k = 1}^{n} P(ω_{i} \mid \underline{x_{k}}, \underline{\hat{Θ}}) \cdot \nabla_{\underline{\hat{Θ}_{i}}} \log p(\underline{x_{k}} \mid ω_{i}, \underline{\hat{Θ}_{i}}) = 0 \quad \forall i = 1, \dots, c$$

So, since $\underline{\hat{Θ}_{j}} = \underline{\hat{μ}_{j}}$, we have:

$$\underline{\hat{μ}_{j}} = \frac{\sum_{k = 1}^{n} P(ω_{j} \mid \underline{x_{k}}, \underline{\hat{μ}}) \, \underline{x_{k}}}{\sum_{k = 1}^{n} P(ω_{j} \mid \underline{x_{k}}, \underline{\hat{μ}})}$$

But there is a problem:

$$\begin{cases} \underline{\hat{μ}_{j}} = \dfrac{\sum_{k = 1}^{n} P(ω_{j} \mid \underline{x_{k}}, \underline{\hat{μ}}) \, \underline{x_{k}}}{\sum_{k = 1}^{n} P(ω_{j} \mid \underline{x_{k}}, \underline{\hat{μ}})} \\[2ex] P(ω_{j} \mid \underline{x_{k}}, \underline{\hat{μ}}) = \dfrac{p(\underline{x_{k}} \mid ω_{j}, \underline{\hat{μ}}) \, P(ω_{j})}{\sum_{i = 1}^{c} p(\underline{x_{k}} \mid ω_{i}, \underline{\hat{μ}}) \, P(ω_{i})} \end{cases}$$

The formulation is circular: to calculate $\underline{\hat{μ}_{j}}$ we need the posteriors $P(ω_{j} \mid \underline{x_{k}}, \underline{\hat{μ}})$, which themselves depend on $\underline{\hat{μ}}$. We thus resort to the following iterative algorithm:

$$\begin{cases} \underline{\hat{μ}_{j}}(0) = \text{initial "arbitrary" estimate} \\[1ex] \underline{\hat{μ}_{j}}(t + 1) = \dfrac{\sum_{k = 1}^{n} P(ω_{j} \mid \underline{x_{k}}, \underline{\hat{μ}}(t)) \, \underline{x_{k}}}{\sum_{k = 1}^{n} P(ω_{j} \mid \underline{x_{k}}, \underline{\hat{μ}}(t))} \end{cases}$$

We iterate from $t = 0$ up to $t = T$, where $T$ is chosen arbitrarily.
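The iteration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the covariances $Σ_{j}$ and priors $P(ω_{j})$ are taken as known and fixed (only the means are estimated, as in the recap), NumPy is available, and the names `gaussian_pdf` and `estimate_means` as well as the synthetic data are hypothetical.

```python
import numpy as np

def gaussian_pdf(X, mu, sigma):
    """Evaluate N(x; mu, sigma) for every row x of X."""
    d = X.shape[1]
    diff = X - mu
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

def estimate_means(X, mu0, sigmas, priors, T=50):
    """Apply the mean update T times; covariances and priors stay fixed."""
    mus = np.array(mu0, dtype=float)
    c = len(priors)
    for _ in range(T):
        # Posterior P(w_j | x_k, mu(t)) via Bayes' rule, shape (n, c).
        lik = np.column_stack(
            [gaussian_pdf(X, mus[j], sigmas[j]) * priors[j] for j in range(c)]
        )
        post = lik / lik.sum(axis=1, keepdims=True)
        # Posterior-weighted average of the samples, one row per component.
        mus = (post.T @ X) / post.sum(axis=0)[:, None]
    return mus

# Synthetic 1-D data from two well-separated Gaussians (an assumption for the demo).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, size=(200, 1)),
               rng.normal(3.0, 1.0, size=(200, 1))])
mus = estimate_means(X, mu0=[[-1.0], [1.0]],
                     sigmas=[np.eye(1), np.eye(1)], priors=[0.5, 0.5])
```

Starting from the rough guesses $-1$ and $1$, the two estimated means should move towards the centres of the two groups of samples.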

Unsupervised Non-Parametric Estimation: Clustering

Update equation for $\underline{\hat{μ}_{j}} (t + 1)$: the posterior $P (ω_{j} ∣ \underline{x_{k}}, \underline{\hat{μ}} (t))$ that appears in it is set as follows.

Given $c$ classes ( $ω_{j}$ ), each with a vector of parameters (in this case just the estimated mean $\underline{\hat{μ}_{j}} (t)$ ), the probability that $\underline{x_{k}}$ belongs to class $ω_{j}$ is $1$ if and only if the estimated mean $\underline{\hat{μ}_{j}} (t)$ is the closest to $\underline{x_{k}}$, according to the Euclidean distance, among all the means ( $\underline{\hat{μ}_{i}} (t)$ for $i \neq j$ ), and $0$ otherwise.

⇒ So $P (ω_{j} ∣ \underline{x_{k}}, \underline{\hat{μ}} (t))$ is just $1$ or $0$, which simplifies the calculation.
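To make the simplification explicit, here is a short worked step. The symbols $X_{j}(t)$ (the set of samples whose closest mean at iteration $t$ is $\underline{\hat{μ}_{j}}(t)$) and $n_{j}(t) = |X_{j}(t)|$ are introduced here only for illustration, and the step assumes $X_{j}(t)$ is not empty. With a posterior that is $1$ on $X_{j}(t)$ and $0$ elsewhere, the iterative update from the recap reduces to an arithmetic mean:

$$\underline{\hat{μ}_{j}}(t + 1) = \frac{\sum_{k = 1}^{n} P(ω_{j} \mid \underline{x_{k}}, \underline{\hat{μ}}(t)) \, \underline{x_{k}}}{\sum_{k = 1}^{n} P(ω_{j} \mid \underline{x_{k}}, \underline{\hat{μ}}(t))} = \frac{\sum_{\underline{x_{k}} \in X_{j}(t)} 1 \cdot \underline{x_{k}}}{\sum_{\underline{x_{k}} \in X_{j}(t)} 1} = \frac{1}{n_{j}(t)} \sum_{\underline{x_{k}} \in X_{j}(t)} \underline{x_{k}}$$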

k-Means Clustering Algorithm:

1. Fix initial arbitrary (e.g. random) values: $\underline{\hat{μ}_{1}} (0), \dots, \underline{\hat{μ}_{c}} (0)$ .
2. Assign each $\underline{x_{k}}$ (for $k = 1, \dots, n$ ) to its closest mean $\underline{\hat{μ}_{j}} (t)$ .
3. Recompute the means, obtaining $\underline{\hat{μ}_{j}} (t + 1)$ for $j = 1, \dots, c$, by applying the previous update equation (the arithmetic mean of the patterns $\underline{x_{k}}$ in cluster $ω_{j}$ ).
4. If $\exists j \in \{1, \dots, c\} : \underline{\hat{μ}_{j}} (t + 1) \neq \underline{\hat{μ}_{j}} (t)$ then iterate from point 2. This means: if during the last iteration at least one value of $\underline{\hat{μ}_{j}}$ changed, repeat.
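The four steps above can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `k_means`, the `max_iter` safety cap, the choice of random samples as initial means, and the handling of empty clusters (the old mean is kept) are assumptions of this sketch, not prescriptions from the lecture.

```python
import numpy as np

def k_means(X, c, max_iter=100, seed=0):
    """Return (means, labels) after running the four k-means steps."""
    rng = np.random.default_rng(seed)
    # Step 1: arbitrary initial means, here c distinct samples picked at random.
    mus = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each x_k to its closest mean (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: each mean becomes the arithmetic mean of its cluster.
        new_mus = mus.copy()
        for j in range(c):
            members = X[labels == j]
            if len(members) > 0:  # keep the old mean if a cluster is empty
                new_mus[j] = members.mean(axis=0)
        # Step 4: stop as soon as no mean changed, otherwise repeat from step 2.
        if np.array_equal(new_mus, mus):
            break
        mus = new_mus
    return mus, labels

# Two small, well-separated groups of points (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
mus, labels = k_means(X, c=2)
```

On this toy data the first three points should end up in one cluster and the last three in the other, with each mean sitting at the centre of its group.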

The Problem of Data Description

$\underline{\hat{μ}}$ and $\underline{\hat{Σ}}$ are sufficient statistics (they are a complete description of our data) if and only if our data $\underline{x}$ is drawn from a Gaussian Distribution.

Otherwise they would yield a wrong data description.

To solve this problem we can use a GMM (Gaussian Mixture Model): by relying on an unbounded number of Gaussian components we can describe any continuous and bounded PDF. This raises two problems:

Using an “unbounded” number of Gaussian PDFs brings complexity issues.
Also, if we mix too many Gaussian PDFs, the model overfits the training data and does not generalize well.

Alternatively, we can use a non-parametric technique for estimating $p (\underline{x})$ , for example Parzen Window or $k_{n}$-Nearest Neighbors, but we need to pay attention to the amount of data used:

If $n = ∣ τ ∣$ is small (few data), then the estimate $p_{n} (\underline{x})$ is not reliable.
If $n = ∣ τ ∣$ is big, then $p_{n} (\underline{x})$ is over-complex: being memory-based, it requires $τ$ to be kept in memory and involved in every calculation, so it is not a real data description.
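The memory-based nature of these estimators is easy to see in code. Below is a rough sketch of a Parzen-Window estimate, assuming NumPy and a Gaussian window of width `h`; the window shape, the value of `h` and the synthetic data are assumptions of this example.

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Parzen-window estimate p_n(x) using a Gaussian window of width h."""
    n, d = samples.shape
    # Every stored sample contributes one kernel evaluation per query point.
    sq_dist = np.sum((samples - x) ** 2, axis=1)
    kernel = np.exp(-0.5 * sq_dist / h**2) / ((2.0 * np.pi) ** (d / 2) * h**d)
    return kernel.mean()

rng = np.random.default_rng(0)
tau = rng.normal(0.0, 1.0, size=(2000, 1))  # the whole training set stays in memory
p_at_zero = parzen_estimate(np.array([0.0]), tau, h=0.3)
```

Each call to `parzen_estimate` touches all $n$ samples in `tau`, which is why the estimate grows in cost with the data instead of summarizing it.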

⇒ ==Our best shot is to use Clustering algorithms.==

Similarity Measures

Instead of defining $k$ (the number of clusters) at the beginning, we can define $d_{0}$ , the maximum distance at which an element can still be considered inside a cluster. If, within a single cluster, the “within-cluster distance” between 2 elements is greater than $d_{0}$ , then we create another cluster and re-separate the data.

Hierarchical Clustering:

Hierarchical Clustering has 2 families:

Top-Down: Divisive Clustering (we start with 1 big cluster and split it until we have at least $c$ clusters)
Bottom-Up: Agglomerative Clustering, which we will see later (we start with no clusters and agglomerate data into new and previously created clusters until we have at least $c$ clusters)

Agglomerative Clustering algorithm:

Here is an example of how the algorithm works:

We start by creating a cluster with data $10$ and $13$ , since they are the closest.
Then we create a new cluster with $7$ and $12$ .
We simultaneously agglomerate $14$ and $19$ into their own cluster, and add $4$ to the cluster containing $10$ and $13$ .
$\dots$

Between-Clusters Distance Measures:

The most popular measures of distance between 2 clusters are:

The formulas for obtaining these distances are reported here ( $X_{i}$ and $X_{j}$ are the two clusters we take into consideration):
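The original formulas are not reproduced in these notes. As a hedged illustration, the sketch below assumes the four measures commonly used for this purpose: minimum pairwise distance, maximum pairwise distance, average pairwise distance, and distance between the cluster means. It also assumes NumPy, Euclidean distance, and an agglomerative loop that starts from one cluster per sample; the names `pairwise`, `LINKAGES` and `agglomerative` are hypothetical.

```python
import numpy as np

def pairwise(X, A, B):
    """Euclidean distances between the samples indexed by A and those indexed by B."""
    return np.linalg.norm(X[A][:, None, :] - X[B][None, :, :], axis=2)

# Assumed between-cluster distances; A and B are lists of sample indices.
LINKAGES = {
    "min": lambda X, A, B: pairwise(X, A, B).min(),
    "max": lambda X, A, B: pairwise(X, A, B).max(),
    "avg": lambda X, A, B: pairwise(X, A, B).mean(),
    "mean": lambda X, A, B: np.linalg.norm(X[A].mean(axis=0) - X[B].mean(axis=0)),
}

def agglomerative(X, c, linkage="min"):
    """Merge the two closest clusters until only c clusters remain."""
    dist = LINKAGES[linkage]
    clusters = [[k] for k in range(len(X))]  # assumption: start from singletons
    while len(clusters) > c:
        pairs = [(dist(X, clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)          # closest pair of clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]               # j > i, so index i is unaffected
    return clusters

# Illustrative 1-D data: three points near 1 and two points near 8.5.
X = np.array([[0.0], [1.0], [1.5], [8.0], [9.0]])
clusters = agglomerative(X, c=2, linkage="min")
```

The choice of measure changes which pair is merged at each step, so different entries of `LINKAGES` can produce different hierarchies on less clear-cut data than this toy example.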

Utility of Clustering Algorithms:

Finding the geometric/probabilistic properties of the data (e.g. center of mass and variance).
Describing the data in a concise fashion.
Partitioning the data into $c$ sub-samples (useful for divide-and-conquer algorithms).
Finding a good initialization (the centroids) for complex models such as GMMs with Maximum Likelihood, RBFs, $\dots$
Discretization of continuous features (e.g. we can use the centroids and the number of data in each cluster to create a histogram of the data).
Replacing the original big data set with one single cluster of it, which is useful when working with complex algorithms like k-NN or Parzen Window.
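As an illustration of the discretization use case, the sketch below maps each sample to the index of its nearest centroid and counts the samples per cluster to obtain a histogram. It assumes NumPy; the function name `quantize`, the data and the centroid values are hypothetical and stand in for the output of a clustering run.

```python
import numpy as np

def quantize(X, centroids):
    """Index of the nearest centroid for each sample (the discrete feature value)."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

X = np.array([[0.1], [0.2], [0.9], [1.1], [1.0], [5.2]])
centroids = np.array([[0.15], [1.0], [5.2]])  # hypothetical result of a clustering run
codes = quantize(X, centroids)
# Histogram: number of samples assigned to each centroid.
counts = np.bincount(codes, minlength=len(centroids))
```

Here `codes` is the discretized feature and `counts` gives the bar heights of the histogram, one bar per centroid.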

AI - Lecture 19
