(PNN) Parzen Neural Network

Starting from the training set $τ = \{x_{1}, \dots, x_{n}\}$, we want to estimate $p (x_{0})$; let's call the estimate $\hat{p} (x_{0})$. To do this we can use the following algorithm:

Define $h_{1} \in \mathbb{R}$
Let $h_{n} = \frac{h _{1}}{n - 1}$
Let $V_{n} = h_{n}^{d}$
For $i = 1 : n$
1. Let $τ_{i} = τ \setminus \{x_{i}\}$
2. Let $y_{i} = \frac{1}{n - 1} \sum_{x \in τ_{i}} \frac{1}{V_{n}} φ \left(\frac{x_{i} - x}{h_{n}}\right)$
Let $S = \{(x_{i}, y_{i}) \mid i = 1, \dots, n\}$ be a new training set
Train an ANN via backpropagation over $S$ .
Let $\hat{p} (x_{0})$ be the function computed by the ANN
Return $\hat{p} (x_{0})$

NOTE: $y_{i} = \frac{1}{n - 1} \sum_{x \in τ_{i}} \frac{1}{V_{n}} φ \left(\frac{x_{i} - x}{h_{n}}\right)$ is the Parzen window estimate of the pdf at $x_{i}$, computed from the other $n - 1$ samples.

NOTE: $φ (z)$ is defined as the hypercube function.
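The target-generation loop of the algorithm can be sketched in plain Python. This is only an illustration under some assumptions: the hypercube window is taken to be the usual unit hypercube ($1$ when every component of $z$ lies in $[-1/2, 1/2]$, else $0$), the width $h_{n}$ is passed in directly rather than derived from $h_{1}$, and the names `hypercube` and `parzen_targets` are made up for this sketch. Training the ANN on the returned pairs is left out.

```python
def hypercube(z):
    """Unit hypercube window (assumed definition): 1 inside [-1/2, 1/2]^d, else 0."""
    return 1.0 if all(abs(c) <= 0.5 for c in z) else 0.0


def parzen_targets(samples, h_n):
    """Build S = [(x_i, y_i)], where y_i is the leave-one-out Parzen estimate at x_i.

    `samples` is a list of d-dimensional points (lists of floats);
    `h_n` is the window width, so the window volume is V_n = h_n ** d.
    """
    n = len(samples)
    d = len(samples[0])
    v_n = h_n ** d
    s = []
    for i, x_i in enumerate(samples):
        # tau_i: every sample except x_i itself
        rest = samples[:i] + samples[i + 1:]
        # count how many of the other samples fall inside the window centred on x_i
        hits = sum(
            hypercube([(a - b) / h_n for a, b in zip(x_i, x)])
            for x in rest
        )
        s.append((x_i, hits / ((n - 1) * v_n)))
    return s


# Three 1-D samples: the first two are close together, the third is isolated.
S = parzen_targets([[0.0], [0.1], [1.0]], h_n=0.5)
```

The isolated sample gets target $0$ because no other sample falls inside its window, while the two neighbouring samples each see one another.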

Some Insights:

If we use ReLU

Upgrade of the PNN: Better Approximation of the Tail of the Distribution

PROBLEM: The PNN output $\hat{p} (x_{0})$ never reaches exactly $0$. This is theoretically correct, but in practice it results in an error: it is better to have a PNN that outputs $0$ when $\hat{p} (x_{0})$ is really small (an approximation, if you will).

To solve this problem, first we normalize the data (which is always a good practice):

Let's choose a meaningful zero-centered interval $X$, for example $X = [-1, 1]$.
We make the data fit into this interval $X$ (by normalization).
We choose some values $x_{i}$ a little outside $X$.
We add the pairs $\{(x_{i}, 0)\}$ to the training set $S$ (the set used to train the ANN). This gives the PNN an incentive to actually output $0$ there.
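The steps above can be sketched for 1-D data as follows. The function names, the `margin` and `count` parameters, and the choice of evenly spaced points on both sides of the interval are assumptions made for illustration; the notes only say that the extra points lie "a little outside" $X$. The interval bounds `lo` and `hi` are supplied by the caller.

```python
def normalize(values, lo, hi):
    """Linearly rescale 1-D values so that the minimum maps to lo and the maximum to hi."""
    v_min, v_max = min(values), max(values)
    scale = (hi - lo) / (v_max - v_min)  # assumes the values are not all equal
    return [lo + (v - v_min) * scale for v in values]


def add_zero_targets(s, lo, hi, margin, count):
    """Return s plus (x, 0.0) pairs at `count` points on each side, just outside [lo, hi]."""
    extra = []
    for j in range(1, count + 1):
        extra.append((lo - j * margin, 0.0))  # left of the interval
        extra.append((hi + j * margin, 0.0))  # right of the interval
    return s + extra


scaled = normalize([2.0, 4.0, 6.0], lo=-1.0, hi=1.0)
augmented = add_zero_targets([(0.0, 0.5)], lo=-1.0, hi=1.0, margin=0.5, count=2)
```

The augmented set is then used in place of $S$ when training the ANN.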

Upgrade of the PNN: Cross-Validated Likelihood

Another upgrade is to split the training set $T$ into a validation set $V$ and a smaller training set $T^{'} = T \setminus V$:

Declare $k \in \mathbb{N}$ (not too big)
Choose and extract $k$ samples from $T^{'}$, creating another training set $T_{k}^{'}$
Execute the PNN algorithm over $T_{k}^{'}$
Evaluate the likelihood of the model (learned from $T_{k}^{'}$) over $V$ (this likelihood is called the cross-validated likelihood)
If the likelihood has increased significantly, repeat from step $3.$; otherwise return the output of the PNN algorithm.

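NOTE: To be comparable, the cross-validated likelihood needs to be normalized by $\int_{X} ϕ (x) d x$ at each iteration step, where $ϕ$ is the function computed by the PNN (its output is not guaranteed to integrate to $1$).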
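One possible reading of this normalization, for a 1-D interval $X = [lo, hi]$, is to divide the PNN output by its integral over $X$ before taking the likelihood of each validation point. The sketch below follows that reading; it is an assumption, not something the notes spell out. `phi` stands for any callable playing the role of the trained PNN, the integral is approximated with the trapezoidal rule, and the log-likelihood is used for numerical convenience.

```python
import math


def integrate(phi, lo, hi, steps=1000):
    """Trapezoidal approximation of the integral of phi over [lo, hi]."""
    width = (hi - lo) / steps
    inner = sum(phi(lo + k * width) for k in range(1, steps))
    return width * (0.5 * phi(lo) + inner + 0.5 * phi(hi))


def cv_log_likelihood(phi, validation, lo, hi):
    """Log-likelihood of the validation points under phi rescaled to integrate to 1.

    Assumes phi(v) > 0 for every validation point v (log(0) is undefined).
    """
    z = integrate(phi, lo, hi)
    return sum(math.log(phi(v) / z) for v in validation)


# A constant "network output" of 2 on [0, 1] normalizes to the uniform density,
# so every validation point contributes log(1) = 0.
score = cv_log_likelihood(lambda x: 2.0, [0.2, 0.7], lo=0.0, hi=1.0)
```

Comparing this score between successive iterations gives the stopping test described in the last step.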

Upgrade of the PNN: Mixture of Experts

Create $c$ PNNs, one for each class $w_{i}$, for $i = 1, \dots, c$
Train each PNN only on the data from its respective class
The output of the $i$-th PNN, $ϕ_{i} (x_{0})$, is an estimate of the class-conditional density $p (x_{0} ∣ w_{i})$
Now we can apply the Bayes Decision Rule to decide which class $x_{0}$ belongs to:

$\arg\max_{i} P (w_{i}) \cdot p (x_{0} ∣ w_{i})$
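The decision rule can be sketched as follows, assuming the priors $P (w_{i})$ are given as a list and each trained PNN is available as a callable. The name `classify` and the toy densities are hypothetical stand-ins for illustration.

```python
def classify(x0, priors, densities):
    """Return the index i that maximizes P(w_i) * p(x0 | w_i)."""
    scores = [prior * phi(x0) for prior, phi in zip(priors, densities)]
    return max(range(len(scores)), key=scores.__getitem__)


# Two toy class-conditional "PNN outputs": class 0 prefers negative inputs,
# class 1 prefers non-negative inputs.
densities = [
    lambda x: 1.0 if x < 0 else 0.1,
    lambda x: 0.1 if x < 0 else 1.0,
]
priors = [0.5, 0.5]

left = classify(-1.0, priors, densities)
right = classify(1.0, priors, densities)
```

With equal priors the decision depends only on which PNN gives the larger output; unequal priors shift the decision toward the more frequent class.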

We consider the PNNs just created as a single expert.
Another one will be:
So we now have two experts in parallel.
The gater (the gating network) is simply trained using backpropagation.

Computational Complexity

Compared with the PW, the PNN takes a long time to train, whereas the PW requires no training at all. Once training is done, however, the PNN computes an estimate at a minuscule computational cost, while the PW remains expensive at estimation time.

PNN training: $O (W \cdot T)$
PNN estimation time: $O (W)$
PW estimation time: $O (n \cdot T)$

Nonpaltry PDF

Put simply, a nonpaltry PDF is a PDF that is continuous in $\mathbb{R}$ except in a closed subset.

For a nonpaltry PDF, a theorem states that the PNN converges to the actual PDF as $n \to \infty$.

From the Slides

Algorithm: (The solutions listed as $1. X$ are all solutions to problem $1.$)

Also, in the PNN algorithm it is recommended to normalize the data within a zero-centred meaningful range ("meaningful" means neither too small nor too big, as distinct from the small zero-centred range used in ANNs with sigmoid activation functions). This is especially helpful to:

If we know $X$ we can upgrade the PNN training algorithm

Architecture Selection

TODO WHAT??

Use of PNNs for pattern classification:

Complexity:

TRAINING COMPLEXITY:

TEST COMPLEXITY:

Comparison of training and testing complexity with SVMs (TODO), PW (Parzen Window) and $k_{n}$-NN ($k_{n}$-Nearest Neighbour):

Training comparison:

Testing comparison:

To sum it up: