Fast Recap:
  • ANN (Artificial Neural Network)
    1. Architecture
    2. Dynamics
    3. Training & Generalization
  • MLP (Multilayer Perceptron)
  • SP (Simple Perceptron)
  • Learning Methods for ANN
    • Batch Mode
    • On-Line Mode

Recap:

k_n-NN (k_n-Nearest Neighbour) Decision Rule:

  1. Define x, the pattern to be classified.
  2. Create a hypersphere of dimension d around x.
  3. Let the sphere expand until it includes k_n = √N data, where N is the total number of data (training, validation and testing).
  4. Calculate the volume V of this hypersphere; the pdf estimate is then p̂(x) = k_n / (N·V).

NOTE: To have more flexibility we can define k_n = α·√N, where α is a hyperparameter decided arbitrarily.
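The steps above can be sketched in code. This is a minimal illustration, not a definitive implementation: the function name and the choice of a Gaussian test sample are my own, and the sphere's radius is taken as the distance to the k_n-th nearest neighbour (which is what "expand until it includes k_n data" amounts to).

```python
import numpy as np
from math import gamma, pi, sqrt

def knn_density(x, data, alpha=1.0):
    """k_n-NN pdf estimate at point x.

    Expands a hypersphere around x until it contains
    k_n = alpha * sqrt(N) training points, then returns
    p_hat(x) = k_n / (N * V), with V the sphere's volume.
    """
    N, d = data.shape
    k_n = max(1, int(round(alpha * sqrt(N))))
    # radius = distance from x to its k_n-th nearest neighbour
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    r = dists[k_n - 1]
    # volume of a d-dimensional ball of radius r
    V = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d
    return k_n / (N * V)

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=(1000, 2))  # 2-D standard Gaussian
est = knn_density(np.zeros(2), sample)
print(est)  # the true density at the origin is 1/(2*pi) ~ 0.159
```

Note how α trades bias for variance: a larger α averages over a wider ball, smoothing the estimate.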


Non-Parametric Decision Rule:

  • T_N = {(x_1, c_1), ..., (x_N, c_N)}: the supervised sample.
  • V: volume of the ball embracing k patterns.
  • k_i: how many patterns, among the k patterns, belong to class ω_i.

⇒ Since p(x, ω_i) ≈ k_i / (N·V), we will have that the probability that x will belong to class ω_i is:

  P(ω_i | x) = p(x, ω_i) / p(x)

⇒ Hence an estimate of P(ω_i | x) is given by:

  P̂(ω_i | x) = (k_i / (N·V)) / (k / (N·V)) = k_i / k

Where:

  • p(x): means the probability that x will belong to one of the classes ω_1, ..., ω_c, so it is equal to Σ_i p(x, ω_i) ≈ k / (N·V).
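The posterior estimate P(ω_i | x) ≈ k_i / k reduces to counting labels inside the ball. A minimal sketch (the function name and example labels are my own):

```python
from collections import Counter

def knn_posterior(neighbour_labels):
    """Estimate P(w_i | x) = k_i / k from the labels of the
    k patterns falling inside the ball around x."""
    k = len(neighbour_labels)
    counts = Counter(neighbour_labels)
    return {label: k_i / k for label, k_i in counts.items()}

post = knn_posterior(["w1", "w1", "w2", "w1", "w2"])
print(post)  # {'w1': 0.6, 'w2': 0.4}
```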

Nearest Neighbour Decision Rule: Assign x to class ω_i if and only if:

  1. x' is the nearest (according to Euclidean distance) to x, among all the patterns in T_N.
  2. x' belongs to class ω_i.

Where:

  • T_N: supervised sample.
  • x: pattern to be classified.

ALGORITHM: Search for the element of T_N closest to x, let's say x', which belongs to class ω_i. Then we will classify x with the class ω_i.
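The 1-NN algorithm above can be sketched as follows (the function name and toy sample are my own):

```python
import numpy as np

def nn_classify(x, patterns, labels):
    """1-NN rule: x gets the class of its nearest
    (Euclidean) neighbour x' in the supervised sample."""
    dists = np.linalg.norm(patterns - x, axis=1)
    return labels[int(np.argmin(dists))]

T = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
c = ["A", "A", "B"]
print(nn_classify(np.array([4.0, 4.5]), T, c))  # B
```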


K-Nearest Neighbour (K-NN): not to be confused with k_n-NN (k_n-Nearest Neighbour).

  • T_N: supervised sample.
  • x: pattern to be classified.

⇒ Algorithm:

  1. Consider the K patterns in T_N that are closest to x (in terms of Euclidean distance).
  2. x belongs to ω_i, where ω_i is the class with the highest relative frequency among those K patterns.

NOTE: While k_n-NN estimates a pdf, the K-NN algorithm estimates the posterior probability P(ω_i | x) directly.

  • For N → ∞, the asymptotic behaviour of the K-NN tends to be optimal (~ex.: Bayesian).
  • In case of ties (e.g. K even), or more generally in cases where 2 or more classes have the same relative frequency, we can expand the neighbourhood (increasing K) until there is only one class with a higher relative frequency than the others.
  • The higher the value of K, the more accurate the decisions taken, but it will take more time (trade-off).
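The K-NN rule, including the tie-breaking strategy described above, can be sketched as follows (function name and toy data are my own):

```python
import numpy as np
from collections import Counter

def knn_classify(x, patterns, labels, K):
    """K-NN rule with the tie-breaking from the notes: if two or
    more classes share the highest relative frequency, enlarge
    the neighbourhood (increase K) until one class dominates."""
    order = np.argsort(np.linalg.norm(patterns - x, axis=1))
    while K <= len(labels):
        counts = Counter(labels[i] for i in order[:K]).most_common()
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]
        K += 1  # tie: expand the neighbourhood
    return counts[0][0]  # the whole sample is tied: arbitrary pick

T = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
c = ["A", "A", "B", "A", "A"]
print(knn_classify(np.array([1.6]), T, c, K=2))  # A (K grows 2 -> 3 to break the B/A tie)
```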

Definition of an ANN (Artificial Neural Network): An ANN is completely specified once we define its:

1. Architecture: An ANN is a directed or undirected graph where each node (or vertex) is called a neuron and each arc (or edge) is called a synaptic connection; the nodes form three subsets: the input units, the output units and the hidden units. ⇒ For each node we can define an activation function. ⇒ For each arc we can define its weight. An architecture is completely defined by: Vertices, Edges, Input Units, Output Units, Hidden Units, Weights and Activation Functions.

1.1. Typical Activation Functions:
⇒ Step function or TLU (Threshold Logic Unit): f(a) = 1 if a ≥ 0, 0 otherwise
⇒ Linear function: f(a) = a
⇒ Sigmoid function: f(a) = 1 / (1 + e^(−a))
⇒ Hyperbolic tangent sigmoid: f(a) = tanh(a)
⇒ Gaussian: f(a) = e^(−a²)
⇒ ReLU (Rectified Linear Unit): f(a) = max(0, a)
⇒ Leaky ReLU: f(a) = a if a > 0, ε·a otherwise (with a small ε, e.g. 0.01)
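These activation functions are easy to write down explicitly. A sketch of their common textbook forms (the 0.01 slope for Leaky ReLU is an assumption; it varies by convention):

```python
import numpy as np

step       = lambda a: np.where(a >= 0, 1.0, 0.0)   # TLU
linear     = lambda a: a
sigmoid    = lambda a: 1.0 / (1.0 + np.exp(-a))
tanh_sig   = np.tanh                                # hyperbolic tangent
gaussian   = lambda a: np.exp(-a ** 2)
relu       = lambda a: np.maximum(0.0, a)
leaky_relu = lambda a: np.where(a > 0, a, 0.01 * a)

a = np.array([-2.0, 0.0, 2.0])
print(relu(a))  # [0. 0. 2.]
print(leaky_relu(a))
```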

2. Dynamics: How the signal propagates. The dynamics of an ANN represent how the input data go from start to end (how they "propagate"); we need to define how the weights interact with the data and how the data are processed by the activation functions. ~Ex.: Let's take input data x_1, ..., x_n; these data will not undergo any transformation in the input layer (even this part can be changed), then they will pass through a first hidden layer where each node computes something like:

  y_j = σ(Σ_i w_ji · x_i)

Where:
  • i: node of the old layer (from 1 to n) (in this case the input layer)
  • j: node belonging to the new layer (in this case the hidden layer)
  • σ: sigmoid function (our chosen activation function)

This process is then repeated until the signal reaches the output layer.
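This layer-by-layer propagation can be sketched as a loop over weight matrices. A minimal sketch under simplifying assumptions (σ is the sigmoid everywhere, bias terms are omitted, and the layer sizes are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights):
    """Propagate an input through the layers: each new-layer node j
    computes y_j = sigmoid(sum_i w_ji * y_i) over the old layer i.
    `weights` is a list of matrices, one per layer transition."""
    y = x
    for W in weights:
        y = sigmoid(W @ y)
    return y

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]  # 3 -> 4 -> 2 units
out = forward(np.array([0.5, -1.0, 2.0]), Ws)
print(out.shape)  # (2,)
```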

The dynamics also define the clock (time trigger) of the ANN, which depends on its family of networks.

NOTE: The ANN topology specifies the hardware architecture, the values of the weights are the software, while the ANN dynamics (the living machine) represent the running process.

3. Learning: The ANN learns from the examples contained in the training set, which is a continuous-valued data sample drawn from an underlying multivariate pdf. Main learning setups:
⇒ Supervised Learning: each training datum is a pair (x, y) of an input and its desired output.
⇒ Unsupervised Learning: training data are inputs x only, with no associated output.
⇒ Semi-Supervised Learning: part of the training data are supervised, i.e. a set of input/output pairs (x, y), while the remaining data are unsupervised: we know the input but not its output.
⇒ Reinforcement Learning: a reinforcement signal, either a penalty or a reward, is given every now and then.

3.1. Generalization: Learning is a process of progressive modification of the connection weights, aimed at inferring the (universal) law underlying the data. Far from being a mere memorization of the data, the laws learned this way are expected to generalize to new data, previously unseen; this is called the generalization capability.

