Fast Recap:

Autoencoder
- An ANN whose training set pairs every input with itself: $\tau = \{(\underline{x}_{i}, \underline{x}_{i})\}$
Normal Use of an Autoencoder
1. Let $\mathbb{R}^{d}$ be the original feature space and $\tau = \{(\underline{x}, \underline{\hat{y}}) \mid \underline{x} \in \mathbb{R}^{d}, \underline{\hat{y}} \in \mathbb{R}^{m}\}$ the supervised training set. Our goal is to train an ANN that realizes the function $\phi : \mathbb{R}^{d} \to \mathbb{R}^{m}$.
2. From $\tau$ we define $\tau'$, the training set for our autoencoder: $\tau' = \{(\underline{x}, \underline{x})\}$, and then train the autoencoder on it.
3. We remove just the output layer from the autoencoder and obtain a new function $\psi : \mathbb{R}^{d} \to \mathbb{R}^{k}$ with $k < d$. Applying $\psi$ to each input $\underline{x}$ gives a new set $\tau'' = \{(\underline{x}, \underline{z}) \mid \underline{z} = \psi(\underline{x}) \in \mathbb{R}^{k}\}$.
4. We train a new MLP via backpropagation on $\tau''' = \{(\underline{z}, \underline{\hat{y}})\}$ and obtain the function $\hat{\phi} : \mathbb{R}^{k} \to \mathbb{R}^{m}$.
5. We mount the two MLPs (the truncated autoencoder and the new MLP) on top of each other and obtain the function $\phi = \hat{\phi} \circ \psi : \mathbb{R}^{d} \to \mathbb{R}^{m}$.
6. If necessary, we can fine-tune the complete MLP via backpropagation on the original data set $\tau$.
7. We can iterate this process, stacking even more autoencoders at the beginning of the whole MLP.
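Steps 1 to 5 can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions: the toy data, the helper names `train_mlp`, `forward`, `psi` and `phi`, the tanh hidden units, the linear outputs and the squared-error criterion are all choices made for this example and are not fixed by the notes.

```python
import numpy as np

def train_mlp(X, T, n_hidden, epochs=1000, lr=0.05, seed=0):
    """One-hidden-layer MLP (tanh hidden units, linear outputs) trained by
    full-batch gradient descent on squared error. Returns its parameters."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, T.shape[1]))
    b2 = np.zeros(T.shape[1])
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                  # hidden activations
        dY = 2.0 * (H @ W2 + b2 - T) / len(X)     # error gradient w.r.t. outputs
        dH = (dY @ W2.T) * (1.0 - H ** 2)         # backpropagate through tanh
        W2 -= lr * (H.T @ dY)
        b2 -= lr * dY.sum(axis=0)
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)
    return W1, b1, W2, b2

def forward(X, params):
    """Evaluate a network returned by train_mlp."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

# Step 1: toy supervised set tau with d = 4 and m = 1 (hypothetical data).
rng = np.random.default_rng(1)
latent = rng.uniform(-1.0, 1.0, (200, 2))
A = np.array([[1.0, 0.0, 0.5, -0.5],
              [0.0, 1.0, 0.5, 0.5]])
X = latent @ A                          # inputs in R^4
Y = latent[:, :1] * latent[:, 1:]       # targets in R^1

# Step 2: train the autoencoder on tau' = {(x, x)} with bottleneck k = 2.
autoencoder = train_mlp(X, X, n_hidden=2)

# Step 3: drop the output layer; what remains is psi : R^4 -> R^2.
W1, b1, _, _ = autoencoder

def psi(X):
    return np.tanh(X @ W1 + b1)

Z = psi(X)                              # the z-part of tau''

# Step 4: train a new MLP on tau''' = {(z, y)}.
head = train_mlp(Z, Y, n_hidden=8, seed=2)

# Step 5: stack encoder and new MLP to get phi : R^4 -> R^1.
def phi(X):
    return forward(psi(X), head)

print(phi(X[:3]))
```

Step 6 (fine-tuning) would continue backpropagation through both parts on the original pairs in `X` and `Y`; it is left out here to keep the sketch short.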

Recap:

Autoencoder (Auto-associative Neural Net): A neural network with at least one hidden layer, whose last hidden layer has fewer units than the input layer, trained on a supervised set in which each input is also its own target: $Y = \{(\underline{x}_{1}, \underline{x}_{1}), \dots, (\underline{x}_{n}, \underline{x}_{n})\}$

If we separate the output layer from the rest of the network, we end up with an encoder (everything up to the last hidden layer) and a decoder (the output layer) for our input data.

- We can take just the encoder and attach it to the front of a new NN, where it reduces the dimension of the input data.
- We can use just the encoder to reduce all our input data once and then train on the reduced inputs, which is faster because the dimension is smaller.
- We can use the whole autoencoder as a noisy filter for our data: its reconstructions are degraded versions of the training data, and training on them can give a more general model.
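The encoder/decoder split and the "filtered" data can be illustrated without training anything. The sketch below swaps in a closed-form linear stand-in (PCA computed via SVD) for a trained MLP autoencoder; the random data, the bottleneck size and the names `encode` and `decode` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # toy inputs in R^5 (d = 5)
k = 2                               # bottleneck size, k < d

mean = X.mean(axis=0)
# Rows of Vt are principal directions; the top k define a linear bottleneck.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
V_k = Vt[:k].T                      # shape (d, k)

def encode(x):
    """Encoder: R^d -> R^k."""
    return (x - mean) @ V_k

def decode(z):
    """Decoder: R^k -> R^d."""
    return z @ V_k.T + mean

Z = encode(X)                       # reduced inputs for faster training
X_filtered = decode(Z)              # lossy reconstruction of the inputs

print(Z.shape, np.mean((X - X_filtered) ** 2))
```

`Z` plays the role of the reduced data set, while `X_filtered` is the degraded copy of the inputs that the whole autoencoder would produce.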

The general approach of what we want to do, though, is:

1. Let $\mathbb{R}^{d}$ be the original feature space and $\tau = \{(\underline{x}, \underline{\hat{y}}) \mid \underline{x} \in \mathbb{R}^{d}, \underline{\hat{y}} \in \mathbb{R}^{m}\}$ the supervised training set. Our goal is to train an ANN that realizes the function $\phi : \mathbb{R}^{d} \to \mathbb{R}^{m}$.
2. From $\tau$ we define $\tau'$, the training set for our autoencoder: $\tau' = \{(\underline{x}, \underline{x})\}$, and then train the autoencoder on it.
3. We remove just the output layer from the autoencoder and obtain a new function $\psi : \mathbb{R}^{d} \to \mathbb{R}^{k}$ with $k < d$. Applying $\psi$ to each input $\underline{x}$ gives a new set $\tau'' = \{(\underline{x}, \underline{z}) \mid \underline{z} = \psi(\underline{x}) \in \mathbb{R}^{k}\}$.
4. We train a new MLP via backpropagation on $\tau''' = \{(\underline{z}, \underline{\hat{y}})\}$ and obtain the function $\hat{\phi} : \mathbb{R}^{k} \to \mathbb{R}^{m}$.
5. We mount the two MLPs (the truncated autoencoder and the new MLP) on top of each other and obtain the function $\phi = \hat{\phi} \circ \psi : \mathbb{R}^{d} \to \mathbb{R}^{m}$.
6. If necessary, we can fine-tune the complete MLP via backpropagation on the original data set $\tau$.
7. We can iterate this process, stacking even more autoencoders at the beginning of the whole MLP.

ANNs: Pattern Recognition and Probability Estimation: We can use an MLP as a non-parametric estimator for pattern recognition in two ways:

1. Use MLPs as discriminant functions: train them via backpropagation on a set labeled with $0/1$ targets.
2. Give the MLP's outputs a probabilistic interpretation.

NOTE: An MLP output can be interpreted as a probability only if it is constrained to the $[0, 1]$ range, which is guaranteed when sigmoid activation functions are used in the output layer. We also need to ensure that the outputs sum to 1, which we obtain by letting $\hat{P}(\omega_{i} \mid \underline{x}) = \frac{y_{i}(\underline{x})}{\sum_{j=1}^{c} y_{j}(\underline{x})}$, where $y_{j}(\underline{x})$ is the $j$-th ANN output for the current input $\underline{x}$ and $c$ is the number of classes.
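As a small illustration of this normalization, the sketch below applies it to the sigmoid outputs of a hypothetical three-class network; the activation values and the function names are made up for the example.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, which maps any real activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def normalized_posteriors(y):
    """Divide each output by the sum of all outputs so that they sum to 1."""
    y = np.asarray(y, dtype=float)
    return y / y.sum(axis=-1, keepdims=True)

a = np.array([2.0, 0.0, -1.0])      # hypothetical output-unit activations
y = sigmoid(a)                      # each y_j lies in (0, 1), sum is arbitrary
p = normalized_posteriors(y)        # estimates of P(w_i | x), summing to 1

print(y.sum(), p, p.sum())
```

The normalization rescales the outputs without changing their order, so the most probable class is the same before and after.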

Or we can use the softmax normalization: $\hat{P}(\omega_{i} \mid \underline{x}) \simeq y_{i}(\underline{x}) = \frac{e^{a_{i}}}{\sum_{j=1}^{c} e^{a_{j}}}$, where $a_{j}$ is the activation (weighted input) of the $j$-th output unit.
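A minimal NumPy sketch of the softmax follows. Subtracting the maximum activation before exponentiating is a common numerical-stability trick that is not part of the formula in the notes; it leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis: exp(a_i) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    shifted = a - a.max(axis=-1, keepdims=True)   # avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

a = np.array([2.0, 0.0, -1.0])      # hypothetical output-unit activations
p = softmax(a)

print(p, p.sum())
```

Unlike the sum normalization of sigmoid outputs, softmax works directly on the activations, so the outputs are positive and sum to 1 by construction.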

Theorem (Richard Lippmann): If we reach the global minimum of the training criterion using $0/1$ targets and the right MLP architecture, the resulting MLP is guaranteed to be the optimal Bayesian classifier, as long as the class posteriors are continuous.
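A short sketch of why $0/1$ targets lead to class posteriors, assuming the training criterion is the squared error (the notes do not state the criterion, so this is an assumption). Let $t_{i} \in \{0, 1\}$ be the target of the $i$-th output, with $t_{i} = 1$ exactly when $\underline{x}$ belongs to class $\omega_{i}$:

$$
\begin{aligned}
E\left[(y_{i}(\underline{x}) - t_{i})^{2}\right]
  &= E_{\underline{x}}\left[\left(y_{i}(\underline{x}) - E[t_{i} \mid \underline{x}]\right)^{2}\right]
   + E_{\underline{x}}\left[\operatorname{Var}(t_{i} \mid \underline{x})\right] \\
E[t_{i} \mid \underline{x}]
  &= 1 \cdot P(\omega_{i} \mid \underline{x}) + 0 \cdot \left(1 - P(\omega_{i} \mid \underline{x})\right)
   = P(\omega_{i} \mid \underline{x})
\end{aligned}
$$

The variance term does not depend on the network, so the error is minimized by $y_{i}(\underline{x}) = P(\omega_{i} \mid \underline{x})$, and choosing the class with the largest output is then the Bayes decision rule.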

In practice, using backpropagation on real-world data, we will practically never find the global minimum, so the trained MLP only approximates the Bayes classifier.

Original Files:

NOTE: We can use this to worsen the data used in the training set, so as to make a more robust model.


AI - Lecture 16
