University AI - Complete Mind Map (Old Version)

Turing Machine

Tape

Alphabet

Clock

Head

Internal States

Transition Rules

AI Discussion

The Imitation Game

Weak and Strong AI

Classification Problem

Event

Extract Features

Classify

Class

Types of Features

Numeric

Symbolic

Qualitative

Gaussian Distributions

Definition of Gaussian PDF, Multivariate Gaussian PDF, Mean and Variance

Mahalanobis distance

$(\underline{x} - \underline{μ})^{T} Σ^{- 1} (\underline{x} - \underline{μ})$
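
As a quick worked check (the numbers are chosen here purely for illustration and are not from the original notes): with $\underline{μ} = (0, 0)^{T}$ , $Σ = diag (4, 1)$ and $\underline{x} = (2, 1)^{T}$ , the expression evaluates to

$(\underline{x} - \underline{μ})^{T} Σ^{- 1} (\underline{x} - \underline{μ}) = \frac{2^{2}}{4} + \frac{1^{2}}{1} = 2$

With $Σ = I$ the same expression reduces to the squared Euclidean distance $∥ \underline{x} - \underline{μ} ∥^{2}$ , so the formula above is the squared Mahalanobis distance.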

Bayes Theorem

Decision Boundary or Decision Rule

Bayes Decision Rule

$\max_{i \in N} p (\underline{x}, ω_{i})$ , where $p (\underline{x}, ω_{i}) = p (\underline{x} ∣ ω_{i}) \cdot P (ω_{i})$

Bayes Decision Rule with Discriminant Functions

Maximum Likelihood Decision

$g_{i} (\underline{x}) = \log (p (\underline{x} ∣ ω_{i})) + \log (P (ω_{i}))$
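
A minimal sketch of how the discriminant functions can be used for classification, assuming univariate Gaussian class-conditional PDFs. The class names, means, variances and priors below are hypothetical values chosen only for illustration.

```python
import math

def log_gaussian_pdf(x, mean, var):
    """Log of the univariate Gaussian PDF N(x; mean, var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def discriminant(x, mean, var, prior):
    """g_i(x) = log p(x | w_i) + log P(w_i)."""
    return log_gaussian_pdf(x, mean, var) + math.log(prior)

def classify(x, classes):
    """Return the label whose discriminant g_i(x) is largest."""
    return max(classes, key=lambda label: discriminant(x, *classes[label]))

# Hypothetical two-class problem: label -> (mean, variance, prior).
classes = {"w1": (0.0, 1.0, 0.5), "w2": (4.0, 1.0, 0.5)}
label_near_w1 = classify(0.5, classes)
label_near_w2 = classify(3.5, classes)
```

With equal priors the decision depends only on the likelihoods; when the likelihoods tie, the prior term decides.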

First $X$ Principal Components

Likelihood

Likelihood of $Θ$ given $Y$

$p (Y ∣ \underline{Θ}) = \prod_{k = 1}^{n} p (y_{k} ∣ \underline{Θ})$

(ML) Maximum Likelihood Estimate

$\underline{\hat{Θ}} = \arg \max_{\underline{Θ}} p (Y ∣ \underline{Θ})$

If we want to find the ML Estimate we need to find $\underline{\hat{Θ}}$ such that: $\nabla_{\underline{\hat{Θ}}} {p (Y ∣ \underline{\hat{Θ}})} = 0$

Log-Likelihood

Instead of $p (Y ∣ \underline{Θ})$ itself we use $\log {p (Y ∣ \underline{Θ})}$ : the logarithm is monotonically increasing, so both have the same maximizer, and the product over the data becomes a sum.
To find the ML Estimate we need to find $\underline{\hat{Θ}}$ such that: $\nabla_{\underline{\hat{Θ}}} {\log {p (Y ∣ \underline{\hat{Θ}})}} = 0$

Gaussian ML Estimate:

If the PDF we want to estimate is Gaussian, then the ML Estimate is the Gaussian PDF having mean equal to the sample mean and variance equal to the sample variance.
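
A minimal sketch of the Gaussian ML estimate for univariate data. It assumes that "sample variance" means the version that divides by $n$ (which is what maximizing the likelihood yields), not the unbiased version that divides by $n - 1$ . The sample values are made up for illustration.

```python
def gaussian_ml_estimate(samples):
    """Return the ML estimates (mean, variance) of a univariate Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    # The ML variance divides by n (not n - 1).
    var = sum((y - mean) ** 2 for y in samples) / n
    return mean, var

# Hypothetical sample Y.
mean_hat, var_hat = gaussian_ml_estimate([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```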

Validation of Classifiers

Normal Method with 60/20/20 Training Validation Test Sets

“Leave One Out” Method

Used if the data sample is small and it is difficult/expensive to add data.
We iterate over the data: at each iteration the training set consists of $n - 1$ data and the validation set of the single remaining datum; the error is computed by averaging the errors over all the validation sets.
At each iteration $\hat{\underline{Θ}}$ changes, because the model is re-estimated on a different training set.

“Many-Fold Crossvalidation” Method

Alternative to the normal $Y, V, τ$ method, based on the same idea as the “Leave One Out” method.
Scales much better than the “Leave One Out” method.
Same as the “Leave One Out” method, but instead of using $1$ datum for the validation set we use $m$ data, where of course $m < n$ .
Like the LOO method, at each iteration $\hat{\underline{Θ}}$ changes.
Each new validation set of size $m$ (created at each iteration) contains only data never used in a previous validation set.
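
The two validation schemes above can be sketched with one routine, where $m = 1$ gives “Leave One Out”. This is only an illustrative sketch: `train` and `error` are hypothetical callables supplied by the user, and the toy model (the training mean) is an assumption made to keep the example small.

```python
def fold_indices(n, m):
    """Split indices 0..n-1 into consecutive validation folds of size m.
    The last fold may be smaller when m does not divide n."""
    return [list(range(start, min(start + m, n))) for start in range(0, n, m)]

def cross_validate(data, m, train, error):
    """Average validation error over all folds.
    train(training_data) -> model; error(model, item) -> float."""
    total, count = 0.0, 0
    for fold in fold_indices(len(data), m):
        held_out = set(fold)
        training = [d for i, d in enumerate(data) if i not in held_out]
        model = train(training)  # the estimate changes at each iteration
        for i in fold:
            total += error(model, data[i])
            count += 1
    return total / count

# Toy usage: the "model" is the training mean, the error is the squared deviation.
data = [1.0, 2.0, 3.0, 4.0]
mean_model = lambda xs: sum(xs) / len(xs)
sq_err = lambda model, x: (model - x) ** 2
loo_error = cross_validate(data, 1, mean_model, sq_err)       # m = 1: Leave One Out
two_fold_error = cross_validate(data, 2, mean_model, sq_err)  # m = 2
```

Each datum appears in exactly one validation fold, which matches the requirement that validation sets never share data.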

Supervised Learning

Non-Parametric Estimates

Relative Frequency Estimate

If $k$ out of $n$ data are in $R$ , we can estimate $P$ via the relative frequency:

$P ≃ \frac{k}{n}$

Easily Estimate a PDF:

~Ex.: we are searching for the pdf of males in a specific university, we use the region $R_{n}$ to estimate $p (\underline{x_{0}})$ from $n$ data.

The estimate of $p (\underline{x_{0}})$ out of $n$ data is defined as: $p_{n} (\underline{x_{0}})$ .

The estimate is computed from the following quantities:

Let $V_{n}$ be the volume of $R_{n}$ .
Let $k_{n}$ be the number of data (out of $n$ ) that fall in $R_{n}$ .

⇒ Then $p_{n} (\underline{x_{0}})$ , the estimate of $p (\underline{x_{0}})$ , is equal to: $p_{n} (\underline{x_{0}}) = \frac{k_{n}}{n V_{n}}$

Asymptotic Necessary and Sufficient Conditions

We want to ensure $p_{n} (\underline{x_{0}}) \to p (\underline{x_{0}})$ , so:

$\lim_{n \to \infty} V_{n} = 0$
$\lim_{n \to \infty} k_{n} = \infty$
$\lim_{n \to \infty} \frac{k_{n}}{n} = 0$ (to guarantee convergence)

Solutions:

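PARZEN WINDOW: Fix the volume (~Ex.: $V_{n} = \frac{1}{\sqrt{n}}$ , which shrinks to $0$ while still letting $k_{n}$ grow) and determine $k_{n}$ consequently.
$k_{n}$ -NEAREST NEIGHBOUR: Fix $k_{n}$ (~Ex.: $k_{n} = \sqrt{n}$ , so that $k_{n} \to \infty$ while $\frac{k_{n}}{n} \to 0$ ), and determine $V_{n}$ consequently, in such a way that exactly $k_{n}$ patterns fall in $R_{n}$ .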

K-NN

Parzen Window

Kernels

Before applying the Parzen Window or K-NN method, we apply a filter to the data using a kernel.

Hypercube Kernel

The simplest kernel: it results in a sharp separation of the data into those inside and those outside the region $R$ .
The kernel formula is:

$φ (\underline{u}) = \begin{cases} 1 & \text{if } ∣ u_{j} ∣ \leq \frac{1}{2} \ \forall j = 1, 2, \dots, d \\ 0 & \text{otherwise} \end{cases}$

Gaussian Kernel

Most useful if the pdf to be estimated is Gaussian

$φ (\underline{u}) = N (\underline{u}; \underline{0}, 1)$

Dirac’s Kernel

$δ_{n} (\underline{x}) = \frac{1}{V_{n}} \cdot φ (\frac{\underline{x}}{h_{n}})$

Where $φ$ is the hypercube kernel function.
$V_{n} = (h_{n})^{d}$ .
If $h_{n}$ is big, the function $δ_{n} (\underline{x} - \underline{x_{i}})$ is a very smooth function.
If $h_{n}$ is really small, $h_{n} \to 0$ , then $δ_{n} (\underline{x} - \underline{x_{i}})$ approaches a Dirac’s Delta centered in $\underline{x_{i}}$ .
This is a useful example to understand how a different kernel can impact the quality of the estimate $p_{n} (\underline{x})$ .
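
A minimal sketch of a Parzen Window estimate with the hypercube kernel. It assumes the usual form $p_{n} (\underline{x}) = \frac{1}{n} \sum_{i} δ_{n} (\underline{x} - \underline{x_{i}})$ , which agrees with $\frac{k_{n}}{n V_{n}}$ when the kernel is the hypercube. The data points are hypothetical.

```python
def hypercube_kernel(u):
    """phi(u) = 1 if every |u_j| <= 1/2, else 0."""
    return 1.0 if all(abs(uj) <= 0.5 for uj in u) else 0.0

def parzen_estimate(x, data, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i) / h), with V_n = h^d."""
    d = len(x)
    volume = h ** d
    total = sum(
        hypercube_kernel([(xj - xij) / h for xj, xij in zip(x, xi)]) for xi in data
    )
    return total / (len(data) * volume)

# Four hypothetical 1-D samples; two of them fall in the window of width 1 around 0.
data = [(-0.2,), (0.1,), (2.0,), (3.0,)]
p_hat = parzen_estimate((0.0,), data, h=1.0)  # k_n / (n * V_n) = 2 / (4 * 1)
```

Shrinking `h` narrows the window, so the estimate becomes spikier around the individual samples; enlarging it smooths the estimate.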

Estimated Probability

Given a region $R$ centered in $\underline{x_{0}}$ (the pattern whose class probability we want to estimate).
Let’s say we have $k$ (labeled) data points inside $R$ .
Let’s also say that $k_{i}$ of the data points in $R$ are labeled as belonging to class $ω_{i}$ .
Then the estimated probability $\hat{P} (ω_{i} ∣ \underline{x_{0}})$ is equal to:

$\hat{P} (ω_{i} ∣ \underline{x_{0}}) = \frac{k_{i}}{k}$

Nearest Neighbour Rule:

The pattern to be classified, $x_{0}$ , is assigned the class of the nearest labeled datum $x_{i}$ .
Let’s say $x_{i}$ belongs to class $ω_{i}$ , then we classify $x_{0}$ with class $ω_{i}$ .
It’s the same as doing a 1-NN classification.

$k_{n}$ -NN ( $k_{n}$ -Nearest Neighbour)

Same as K-NN, but instead of deciding on an arbitrary $K$ at the start, we choose a $k_{n}$ that depends on the number of data $n$ .
~Ex.:

$k_{n} = α \cdot \sqrt{n}$

$α$ is a hyperparameter that can be chosen arbitrarily, for example $α = 1$ .
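
A minimal sketch of the estimate $\hat{P} (ω_{i} ∣ \underline{x_{0}}) = \frac{k_{i}}{k}$ and of a $k_{n}$ -NN classifier built on it. It assumes $k_{n} = α \sqrt{n}$ rounded to an integer (a common choice that satisfies the asymptotic conditions) and the Euclidean distance; the labeled points are hypothetical.

```python
import math
from collections import Counter

def knn_posteriors(x0, labeled_data, k):
    """Estimate P(w_i | x0) = k_i / k from the k nearest labeled points."""
    nearest = sorted(labeled_data, key=lambda item: math.dist(x0, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}

def kn_nn_classify(x0, labeled_data, alpha=1.0):
    """Classify x0 with k_n = round(alpha * sqrt(n)) neighbours (at least 1)."""
    k = max(1, round(alpha * math.sqrt(len(labeled_data))))
    posteriors = knn_posteriors(x0, labeled_data, k)
    return max(posteriors, key=posteriors.get)

# Hypothetical labeled data: (pattern, class label). Here n = 4, so k_n = 2.
data = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((5.0, 5.0), "b"), ((5.0, 6.0), "b")]
label_low = kn_nn_classify((0.2, 0.3), data)
label_high = kn_nn_classify((4.5, 5.0), data)
```

Calling `knn_posteriors` with `k=1` gives the Nearest Neighbour Rule.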

ANN

Architecture

A directed or undirected graph.
Each node (or vertex) is called a neuron and each arc (or edge) is called a synaptic connection.
For each node we can define an activation function,
while for each arc we can define its weight.
An architecture is completely defined by: Vertices, Edges, Input Units, Output Units, Hidden Units, Weights and Activation Functions.
Typical activation function: the sigmoid function:

$f (a) = \frac{1}{1 + e^{- a}}, f (a) \in (0, 1)$

Dynamics

Defines “How the signal propagates”.
Normal dynamics: the input data propagate from the input layer to the output layer, passing through all the hidden layers. At each layer, every neuron multiplies the signals on its incoming arcs by their respective weights, sums the results, and applies its activation function to the sum. This process repeats layer by layer up to the output layer.
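
The propagation described above can be sketched as follows. The network shape, the weights and the bias terms are hypothetical choices made for illustration (the bias of each neuron is an assumption of this sketch).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, layers):
    """Propagate x through the layers; each layer is (weights, biases, activation).
    weights[j][k] is the weight of the connection from unit k to unit j."""
    signal = list(x)
    for weights, biases, activation in layers:
        signal = [
            activation(sum(w * s for w, s in zip(row, signal)) + b)
            for row, b in zip(weights, biases)
        ]
    return signal

identity = lambda a: a
# Hypothetical 2-2-1 network with hand-picked weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], sigmoid),  # hidden layer
    ([[2.0, -2.0]], [0.0], identity),                   # linear output layer
]
y = forward([1.0, 1.0], layers)
```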

Training & Generalization

The ANN learns from the examples contained in the training set.
We can distinguish between a variety of learning types:
- Supervised: all training data is labeled
- Unsupervised: none of the training data is labeled
- Semi-Supervised: some of the training data is labeled
- Reinforcement: a reinforcement signal (either a penalty or a reward) is given every now and then.
Learning is not enough if the model does not generalize: it must also provide sufficiently good results when given new data.

MLP (Multilayer Perceptron)

Fully Connected layered architecture with feed-forward propagation of input signals.

Activation Functions are usually sigmoid and/or linear
Dynamics: simultaneous propagation of the signal with no delays, from the input layer to the output layer, traversing an arbitrary number of hidden layers (even $0$ hidden layers is acceptable)
Learning: supervised, via the gradient method (backpropagation)

SP (Simple Perceptron)

1-Layer ANN (no hidden layer, just input and output layer)

Different Learning Methods

Batch Mode:

$C (τ, w) = C (W) = \frac{1}{2} \sum_{j = 1}^{n} \sum_{i = 1}^{m} (\hat{y}_{i} - y_{i})^{2}$

For each epoch the model sees all the data in the training set, then calculates the cost summing all the errors.

On-line Mode:

$C (W) = \frac{1}{2} \sum_{i = 1}^{m} (\hat{y}_{i} - y_{i})^{2}$

At each update the cost function considers only one datum from the data set, summing the errors over its $m$ outputs.
Useful if we have a constant influx of new data and we want/need to update the model constantly, for example data taken from a website.

Delta Rule

Once we have calculated the cost $C (τ, w)$ during training, we need to update the weights.
Using the gradient descent method:

$w^{'} = w + Δ w$ where: $Δ w = - η \frac{\partial C}{\partial w}$

Instead of calculating the derivative of $C$ from scratch every time, we can apply the “delta rule”:

$Δ w_{jk} = η δ_{j} o_{k}$

Where:

$δ_{j} = \begin{cases} (\hat{y}_{j} - y_{j}) f_{j}^{'} (a_{j}) & \text{if } j \in L_{l} \\ (\sum_{i \in L_{k + 1}} w_{ij} δ_{i}) \cdot f_{j}^{'} (a_{j}) & \text{if } j \in L_{k} \end{cases}$ where: $k = l - 1, \dots, 0$
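
A minimal sketch of the delta rule for an MLP with one hidden layer, on-line mode, assuming sigmoid activations everywhere and no bias terms (both are simplifications chosen for this example). Following the notation of the cost function, `target` plays the role of $\hat{y}$ and `output` the role of $y$ . All numeric values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_hidden, w_out):
    """Return hidden and output activations of a 1-hidden-layer sigmoid MLP."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, output

def cost(x, target, w_hidden, w_out):
    """On-line cost: half the sum of squared errors over the outputs."""
    _, output = forward(x, w_hidden, w_out)
    return 0.5 * sum((t - y) ** 2 for t, y in zip(target, output))

def delta_rule_updates(x, target, w_hidden, w_out, eta):
    """Return (dw_hidden, dw_out) with dw[j][k] = eta * delta_j * o_k."""
    hidden, output = forward(x, w_hidden, w_out)
    # Output layer: delta_j = (target_j - y_j) * f'(a_j); for the sigmoid f' = y * (1 - y).
    delta_out = [(t - y) * y * (1.0 - y) for t, y in zip(target, output)]
    # Hidden layer: delta_j = (sum_i w_ij * delta_i) * f'(a_j), i ranging over the next layer.
    delta_hidden = [
        sum(w_out[i][j] * delta_out[i] for i in range(len(w_out))) * h * (1.0 - h)
        for j, h in enumerate(hidden)
    ]
    dw_out = [[eta * d * h for h in hidden] for d in delta_out]
    dw_hidden = [[eta * d * xi for xi in x] for d in delta_hidden]
    return dw_hidden, dw_out

# Hypothetical 2-2-1 network and a single training pair.
x, target, eta = [0.5, -1.0], [1.0], 0.1
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [[0.7, -0.5]]
dw_hidden, dw_out = delta_rule_updates(x, target, w_hidden, w_out, eta)
```

The updates should coincide with $- η \frac{\partial C}{\partial w}$ , which can be checked numerically with finite differences on `cost`.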

Practical Insight for training ANN

Randomly initialize weights in a small 0-centered interval

Normalize the Inputs

Regularization: reduce the number of dimensions of the data

Weight-Decay: numerically smaller weights are usually better, so add a penalty on the size of the weights (typically the sum of their squares) to the cost function

Use more flexible activation functions

Add a momentum or inertia term

$Δ w (t + 1) = - η \frac{\partial C}{\partial w (t)} + ρ Δ w (t)$ Where: $ρ \in (0, 1)$ is the momentum rate, i.e. how much of the old $Δ w$ is carried over into the new one.
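
A minimal sketch of gradient descent with the momentum term, applied to a single scalar weight. The quadratic cost and the values of $η$ and $ρ$ are hypothetical choices made so that the behaviour is easy to verify.

```python
def momentum_descent(grad, w0, eta, rho, steps):
    """Iterate dw(t+1) = -eta * grad(w(t)) + rho * dw(t); w(t+1) = w(t) + dw(t+1)."""
    w, dw = w0, 0.0
    for _ in range(steps):
        dw = -eta * grad(w) + rho * dw
        w = w + dw
    return w

# Toy cost C(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
grad = lambda w: 2.0 * (w - 3.0)
w_final = momentum_descent(grad, w0=0.0, eta=0.1, rho=0.5, steps=200)
```

With `rho=0.0` the routine reduces to plain gradient descent.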

Main Supervised Learning Tasks

A supervised MLP learns a transformation $ϕ : R^{d} \to R^{m}$

Function approximation: $\underline{y} = ϕ (\underline{x})$ .
Regression (linear or non-linear) : $\underline{y} = ϕ (\underline{x}) + \underline{ε}$ , where $\underline{ε}$ is multivariate Gaussian noise.
Pattern classification: $\underline{y} = (g_{1} (\underline{x}), \dots, g_{c} (\underline{x}))$ .

Universality of MLP:

A non-linear MLP is very flexible.
According to the theorem of Universality of MLPs, an MLP with one hidden layer of sigmoid units and a linear output is a universal approximator.

Mixtures of Experts

A neural module can also be called an Expert.
A neural module or expert realizes a function.
Then a Gather (a gating module) assigns a credit to each expert (how reliable that neural module is), resulting in the final equation:

$\underline{y} = \sum_{i = 1}^{k} α_{i} ϕ_{i} (\underline{x})$

Where: $α_{i} \in [0, 1]$ is the credit assigned by the Gather to the $i$ -th expert, and $ϕ_{i}$ is the function realized by that expert.
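
A minimal sketch of how the credits combine the experts, for scalar inputs and outputs. The two experts and both credit functions are hypothetical: in a real mixture the experts would be trained neural modules and the credits would come from the Gather.

```python
def mixture_output(x, experts, credits):
    """y = sum_i alpha_i(x) * phi_i(x), for scalar-valued experts."""
    alphas = credits(x)
    return sum(a * expert(x) for a, expert in zip(alphas, experts))

# Two hypothetical experts, each a simple function of a scalar input.
experts = [lambda x: 2.0 * x, lambda x: x + 10.0]

# Hard gating: credit 1 to the expert responsible for the region, 0 to the other.
hard_credits = lambda x: [1.0, 0.0] if x < 0 else [0.0, 1.0]
# Soft gating: fixed credits that sum to 1 (a trained gating module would depend on x).
soft_credits = lambda x: [0.25, 0.75]

y_hard = mixture_output(-1.0, experts, hard_credits)  # only the first expert: 2 * (-1)
y_soft = mixture_output(2.0, experts, soft_credits)   # 0.25 * 4 + 0.75 * 12
```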

Divide and Conquer

The feature space may be partitioned, with each region given to a different expert. With this approach the Gather usually gives a credit of $1$ to only one expert at a time (the one that knows about the current region), and all the other credits are equal to $0$ .

Overlapping regions

Each expert expresses a “likelihood” of being competent over any input $\underline{x}$ . The Gather assigns the credits according to a probability distribution $α_{i} = P (ω_{i} ∣ \underline{x})$ , under the condition $\sum α_{i} = 1$ , so that $\underline{y} = \sum P (ω_{i} ∣ \underline{x}) ϕ_{i} (\underline{x})$ ; the condition is imposed during both training and test.

Training the whole Mixture of Experts

Instead of training each expert separately, we can train the whole model including the Gather, which can learn automatically the values of all credits ( $α_{i}$ )
The Expert and the Gather are trained in parallel.

Autoencoder

An ANN whose training set is defined as $τ = \{(x_{i}, x_{i})\}$ , i.e. the target of each input is the input itself.

Uses of an Autoencoder

Let $R^{d}$ be the original feature space and $τ = \{(\underline{x}, \underline{\hat{y}}) ∣ \underline{x} \in R^{d}, \underline{\hat{y}} \in R^{m}\}$ ; our goal is to train an ANN to realize the function $ϕ : R^{d} \to R^{m}$ .
From $τ$ we define $τ^{'}$ , the training set for our autoencoder: $τ^{'} = \{(\underline{x}, \underline{x})\}$ , and then train the autoencoder.
We remove just the output layer from the autoencoder and obtain a new function $ψ : R^{d} \to R^{k}$ with $k < d$ ; applying this function to each input $\underline{x}$ we obtain a new set $τ^{''} = \{(\underline{x}, \underline{z}) ∣ \underline{z} = ψ (\underline{x}) \in R^{k}\}$ .
We train a new MLP via backpropagation on $τ^{'''} = \{(\underline{z}, \underline{\hat{y}})\}$ and obtain the function $\hat{ϕ} : R^{k} \to R^{m}$ .
We mount the two MLPs (truncated autoencoder and new MLP) on top of each other and obtain the function $ϕ = \hat{ϕ} \circ ψ : R^{d} \to R^{m}$ .
If necessary, we can fine-tune the complete MLP via backpropagation on the original data set $τ$ .
We can iterate this process, stacking even more autoencoders at the beginning of the whole MLP.

Using an ANN as a Non-parametric Estimator

We can use an MLP as a non-parametric estimator for pattern recognition in 2 ways:

Use MLPs as discriminant function: Train them via backpropagation on a set labeled with $0/1$ outputs.
Probabilistic interpretation of the MLPs outputs.

Since the output of an MLP can be seen as:

$y_{i} (\underline{x}) ≃ P (ω_{i} ∣ \underline{x}) = \frac{p (\underline{x} ∣ ω_{i}) P (ω_{i})}{p (\underline{x})}$

We can write:

$\frac{y_{i} (\underline{x})}{P (ω_{i})} ≃ \frac{p (\underline{x} ∣ ω_{i})}{p (\underline{x})}$

Which is known as the scaled likelihood.

$p (\underline{x})$ is unknown, but can be estimated.
Also, estimates of $p (\underline{x})$ are more robust than estimates of $p (\underline{x} ∣ ω_{i})$ : we need to estimate only one PDF instead of $c$ class-conditional PDFs ( $c$ : number of classes, $ω_{1}, \dots, ω_{c}$ ), and, by the same logic, if we estimate only $p (\underline{x})$ we have $c$ times more data for it.
Also, if $P (ω_{i})$ changes over time (let’s say it assumes the new value $P^{'} (ω_{i})$ ), we can reuse the same MLP without re-training, using the following formula:

$P^{'} (ω_{i} ∣ \underline{x}) = \frac{p (\underline{x} ∣ ω_{i})}{p (\underline{x})} P^{'} (ω_{i}) ≃ \frac{y_{i} (\underline{x})}{P (ω_{i})} P^{'} (ω_{i})$
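
A minimal sketch of re-using the MLP outputs under new priors. The final renormalisation is an extra step assumed here (it is not in the formula above) so that the adjusted values sum to $1$ ; the outputs and priors are hypothetical numbers.

```python
def rescale_posteriors(outputs, old_priors, new_priors):
    """Apply y_i(x) / P(w_i) * P'(w_i) to each MLP output, then renormalise.
    The renormalisation is an assumption of this sketch."""
    scaled = [
        y / p_old * p_new
        for y, p_old, p_new in zip(outputs, old_priors, new_priors)
    ]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical MLP outputs for one pattern, trained with equal priors.
outputs = [0.8, 0.2]
adjusted = rescale_posteriors(outputs, old_priors=[0.5, 0.5], new_priors=[0.1, 0.9])
```

In this example the second class becomes the more probable one once its prior grows from $0.5$ to $0.9$ , even though the MLP output for it is smaller.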

Theorem: Lippmann, Richard

If we reach the global minimum using $0/1$ targets and the right MLP architecture, we are guaranteed that the MLP obtained this way is the optimal Bayesian Classifier, as long as the class-posteriors are continuous.
In practice, using backpropagation on real world data we will never find the global minimum.

RBF (Radial Basis Function) Networks: A generalized linear discriminant

All weights between the input layer and the first hidden layer are equal to $1$ .
The RB Function (Radial Basis Function), or kernel, is defined as:

$φ_{k} (\underline{x}) = e^{- \frac{∥ \underline{x} - \underline{μ_{k}} ∥^{2}}{2 σ_{k}^{2}}}$
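
A minimal sketch of the kernel and of a single-output RBF network, assuming the usual Gaussian form with a squared norm in the exponent. The centers, widths, weights and bias are hypothetical values.

```python
import math

def rbf(x, mu, sigma):
    """Gaussian radial basis function exp(-||x - mu||^2 / (2 * sigma^2))."""
    squared_distance = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def rbf_network(x, centers, sigmas, weights, bias):
    """Single-output RBF network: a weighted sum of the kernels plus a bias."""
    return bias + sum(w * rbf(x, mu, s) for w, mu, s in zip(weights, centers, sigmas))

at_center = rbf([1.0, 2.0], [1.0, 2.0], sigma=0.5)       # distance 0 gives 1.0
one_sigma_away = rbf([1.0, 0.0], [0.0, 0.0], sigma=1.0)  # exp(-1/2)
```

The kernel depends on $\underline{x}$ only through its distance from the center $\underline{μ_{k}}$ , which is what makes it "radial".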

Learning is supervised, and we usually consider 2 approaches:

Via gradient descent over $C (w)$ , we learn all the parameters: $w_{ij}$ , $b_{i}$ , $\underline{μ_{k}}$ and $σ_{k}$ .
$\underline{μ_{k}}$ and $σ_{k}$ are estimated statistically; then the other parameters $w_{ij}$ and $b_{i}$ are estimated via linear algebra methods (such as matrix inversion), or via gradient descent as in the previous approach.

Like MLPs, RBF Networks are “universal” approximators.

NOTE: With RBF Networks we can apply gradient-ASCENT over ML (Maximum Likelihood) method in order to estimate PDFs.

The ML method only works if the weights between the last hidden layer and the output layer sum up to $1$ .

This can’t be done in MLPs because the constraint $\int p (x) d x = 1$ is violated, since MLPs realize mixtures of activation functions that are not inherently pdfs.

ML Estimate:

~Ex.: We assume that the pdf we want to estimate is a linear combination of 5 Gaussian PDFs, then we search for $\underline{\hat{Θ}} = \{(\underline{μ_{1}}, σ_{1}^{2}), \dots, (\underline{μ_{5}}, σ_{5}^{2})\}$ such that our assumed model is as close as possible to the real pdf (we minimize the cost).

$\underline{\hat{Θ}}$ has to satisfy:

$\sum_{k = 1}^{n} P (ω_{i} ∣ \underline{x_{k}}, \underline{\hat{Θ}}) \cdot \nabla_{\underline{\hat{Θ}_{i}}} \log {p (\underline{x_{k}} ∣ ω_{i}, \underline{\hat{Θ}_{i}})} = 0 \quad \forall i = 1, \dots, c$

GMM (Gaussian Mixture Model)

$p (\underline{x} ∣ \underline{Θ}) = \sum_{j = 1}^{c} P (ω_{j}) N (\underline{x}; \underline{μ_{j}}, Σ_{j})$

$\begin{cases} \underline{\hat{μ}_{j}} (0) = \text{initial "arbitrary" estimate} \\ \underline{\hat{μ}_{j}} (t + 1) = \frac{\sum_{k = 1}^{n} P (ω_{j} ∣ \underline{x_{k}}, \underline{\hat{μ}} (t)) \, \underline{x_{k}}}{\sum_{k = 1}^{n} P (ω_{j} ∣ \underline{x_{k}}, \underline{\hat{μ}} (t))} \end{cases}$
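
A minimal sketch of the iterative update of the means for a univariate GMM. To keep it short it assumes that the variances and the priors $P (ω_{j})$ are known and fixed, so only the means are re-estimated; the posteriors $P (ω_{j} ∣ x_{k})$ are obtained with Bayes' theorem. The data and the starting values are hypothetical.

```python
import math

def gaussian(x, mean, var):
    """Univariate Gaussian PDF N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def update_means(data, means, variances, priors):
    """One iteration: mu_j(t+1) = sum_k P(w_j | x_k) x_k / sum_k P(w_j | x_k)."""
    # Posterior P(w_j | x_k) under the current means, via Bayes' theorem.
    posteriors = []
    for x in data:
        joint = [p * gaussian(x, m, v) for p, m, v in zip(priors, means, variances)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    new_means = []
    for j in range(len(means)):
        weight = sum(post[j] for post in posteriors)
        new_means.append(sum(post[j] * x for post, x in zip(posteriors, data)) / weight)
    return new_means

# Hypothetical data drawn around -2 and +2; arbitrary initial means.
data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
means = [-1.0, 1.0]
for _ in range(20):
    means = update_means(data, means, variances=[1.0, 1.0], priors=[0.5, 0.5])
```

After a few iterations the two means settle close to the centers of the two groups of data.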

k-Means Clustering Algorithm

1. Fix initial arbitrary (~ex.: random) values: $\underline{\hat{μ}_{1}} (0), \dots, \underline{\hat{μ}_{c}} (0)$ .
2. Assign each $\underline{x_{k}}$ (for $k = 1, \dots, n$ ) to its closest mean $\underline{\hat{μ}_{j}} (t)$ .
3. Re-calculate $\underline{\hat{μ}_{j}} (t)$ for $j = 1, \dots, c$ applying the previous equation for updating $\underline{\hat{μ}_{j}} (t)$ (arithmetic mean of the patterns $\underline{x_{k}}$ in cluster $ω_{j}$ ).
4. If $\exists j \in \{1, \dots, c\} : \underline{\hat{μ}_{j}} (t) \neq \underline{\hat{μ}_{j}} (t - 1)$ then iterate from point 2. This means: if during the last iteration at least one value of $\underline{\hat{μ}_{j}}$ changed, repeat.
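
The algorithm above can be sketched as follows, assuming the Euclidean distance. Keeping the old mean when a cluster ends up empty and capping the number of iterations are choices of this sketch, not part of the algorithm as stated. The data points are hypothetical.

```python
import math

def k_means(data, initial_means, max_iterations=100):
    """Plain k-means on tuples of floats; returns (means, assignments)."""
    means = [tuple(m) for m in initial_means]
    assignments = []
    for _ in range(max_iterations):
        # Assignment: each pattern goes to its closest mean.
        assignments = [
            min(range(len(means)), key=lambda j: math.dist(x, means[j])) for x in data
        ]
        # Update: each mean becomes the arithmetic mean of its cluster.
        new_means = []
        for j, old_mean in enumerate(means):
            members = [x for x, a in zip(data, assignments) if a == j]
            if members:
                new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_means.append(old_mean)  # assumption: keep the mean of an empty cluster
        # Stopping test: no mean changed during the last iteration.
        if new_means == means:
            break
        means = new_means
    return means, assignments

# Hypothetical 2-D data forming two well-separated groups.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
means, assignments = k_means(data, initial_means=[(1.0, 1.0), (9.0, 1.0)])
```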

The Problem of Data Description

$\underline{\hat{μ}}$ and $\underline{\hat{Σ}}$ are sufficient statistics (they are a complete description of our data) if and only if our data $\underline{x}$ are drawn from a Gaussian Distribution.

Otherwise they would yield a wrong data description.

To solve this problem we can use a GMM (Gaussian Mixture Model): relying on an unbounded number of Gaussian Distributions we can describe any continuous and bounded PDF. This raises two problems:

Using an “unbounded” number of Gaussian PDFs brings complexity issues.
Also, if we mix too many Gaussian PDFs the model overfits the training data and doesn’t generalize well.

Alternatively we can use a non-parametric technique for estimating $p (\underline{x})$ , for example the Parzen-Window or $k_{n}$ -Nearest Neighbors, but we need to pay attention to the number of data used:

If $n = ∣ τ ∣$ is small (few data), then the estimate $p_{n} (\underline{x})$ is not reliable.
If $n = ∣ τ ∣$ is big, then $p_{n} (\underline{x})$ is over-complex: being memory-based, it requires $τ$ to be kept in memory and involved in the calculations, so it would not be a real data description.

⇒ ==Our best shot is to use Clustering algorithms.==

Similarity Measures

Within Cluster Distance
Between Cluster Distance

Hierarchical Clustering

Divisive Clustering

Top-Down: Divisive Clustering (we start with 1 big cluster and keep splitting until we have at least $c$ clusters)

Agglomerative Clustering

Bottom-Up: Agglomerative Clustering, which we will see later (we start with no clusters and agglomerate data into new and previously created clusters until we have $c$ clusters)

Differences of K-NN Classifier and K-Mean Clustering:

K-NN is a supervised machine learning algorithm while K-means is an unsupervised one.
K-NN is a classification or regression algorithm while K-means is a clustering algorithm.
K-NN is a lazy learner while K-means is an eager learner: an eager learner fits a model in a training step, whereas a lazy learner has no training phase.
K-NN performs much better if all of the data have the same scale but this is not true for K-means.

Utility of Clustering Algorithms

Finding the geometric/probabilistic properties of the data (~ex.: center of mass and variance).
Describing the data in a concise fashion.
Partitioning the data into $c$ sub-samples (useful for divide and conquer algorithms).
Finding a good initialization (the centroids) for complex models such as: GMMs with Maximum Likelihood, RBFs, $\dots$
Discretization of continuous features (~ex.: we can use the centroids and the number of data in each cluster to create a histogram of the data).
Replacing the original big data set with one single cluster of it, useful when working with complex algorithms like K-NN or Parzen Window.

CNN (Competitive Neural Networks)

ARCHITECTURE: In the output layer, there are 2 more types of connections:

Lateral connections (each output neuron is connected to every other output neuron)
Self-connections (each output neuron has a connection that goes from itself to itself)

The lateral connections are inhibitory (their weights are $< 0$ ), while the self-connections are excitatory (their weights are $> 0$ ).

Also, at the end of the network there is a MAXNET component that realizes a winner-takes-all strategy: only one of the output units (a cluster) wins over the others (the losers are set to $0$ ).

DYNAMICS: Simple dynamics: the input is passed to the network, the network produces the outputs ( $\hat{y}_{i}$ for $i = 1, \dots$ ), then the MAXNET selects the winner and turns off all the other outputs.
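
One classical way to realize the winner-takes-all step is an iterative MAXNET, sketched below under stated assumptions: each unit keeps its own value through a self-connection of weight $+ 1$ and is inhibited by every other unit through lateral weights $- ε$ , with $0 < ε < \frac{1}{m}$ for $m$ output units. The default $ε$ , the iteration cap and the example outputs are hypothetical choices; ties between the largest outputs are not handled.

```python
def maxnet(outputs, epsilon=None, max_iterations=1000):
    """Iterative winner-takes-all: self weight +1, lateral weights -epsilon.
    Iterates until at most one unit is still active (greater than 0)."""
    m = len(outputs)
    if epsilon is None:
        epsilon = 1.0 / (2.0 * m)  # assumption: a value inside (0, 1/m)
    y = [max(0.0, v) for v in outputs]
    for _ in range(max_iterations):
        if sum(1 for v in y if v > 0.0) <= 1:
            break
        total = sum(y)
        # Each unit is excited by itself and inhibited by the sum of the others.
        y = [max(0.0, v - epsilon * (total - v)) for v in y]
    return y

# Hypothetical outputs of three competing units: the second one should win.
result = maxnet([0.2, 0.9, 0.5])
```

The winner keeps a positive (though reduced) activation, while all the losers end at $0$ .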