Classifier for Probability Estimation - K-NN, Parzen Window and Kernels

Parzen Window: Before training we can “filter” the data with a function. For example, the hypercube function selects only the data in a specific region: a hypercube with edge length $h_{n}$ in $d$ dimensions (so its volume is $V_{n} = (h_{n})^{d}$ ). We define the “window function”, or “kernel”, as:

φ (\underline{u}) = \begin{cases} 1 & \text{if } |u_{j}| \leq \frac{1}{2} \quad \forall j = 1, 2, \dots, d \\ 0 & \text{otherwise} \end{cases}

The kernel $φ (\frac{\underline{x_{0}} - \underline{x_{i}}}{h_{n}})$ takes the value $1$ only when $\underline{x_{i}}$ lies inside the hypercube centered at $\underline{x_{0}}$ with edge $h_{n}$ . The number of data in this hypercube ( $k_{n}$ ) is therefore:

k_{n} = \sum_{i=1}^{n} φ (\frac{\underline{x_{0}} - \underline{x_{i}}}{h_{n}})

As defined before, the estimate $p_{n} (\underline{x_{0}})$ of the unknown pdf $p (\cdot)$ is $\frac{k_{n}}{n V_{n}}$ , so:

p_{n} (\underline{x_{0}}) = \frac{1}{n V_{n}} \cdot \sum_{i=1}^{n} φ (\frac{\underline{x_{0}} - \underline{x_{i}}}{h_{n}})
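The formula above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the function names `hypercube_kernel` and `parzen_estimate`, the sample points and the value of `h` are all made up for the example and do not come from the lecture.

```python
from typing import Sequence


def hypercube_kernel(u: Sequence[float]) -> int:
    """Window function: 1 if every component of u lies in [-1/2, 1/2], else 0."""
    return 1 if all(abs(u_j) <= 0.5 for u_j in u) else 0


def parzen_estimate(x0: Sequence[float], data: Sequence[Sequence[float]], h: float) -> float:
    """Estimate p_n(x0) = k_n / (n * V_n) with a hypercube of edge h centered at x0."""
    d = len(x0)
    volume = h ** d  # V_n = (h_n)^d
    # k_n: number of samples that fall inside the hypercube around x0
    k = sum(hypercube_kernel([(a - b) / h for a, b in zip(x0, xi)]) for xi in data)
    return k / (len(data) * volume)


# Hypothetical 2-D samples: three close to the origin, one far away.
samples = [(0.0, 0.0), (0.2, 0.1), (0.4, -0.3), (2.0, 2.0)]
density = parzen_estimate((0.0, 0.0), samples, h=1.0)
print(density)  # 3 of 4 samples fall in a unit square: 3 / (4 * 1) = 0.75
```

The kernel is passed the rescaled difference $(\underline{x_{0}} - \underline{x_{i}}) / h_{n}$ , so the same `hypercube_kernel` works for any window width.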

NOTE: We have defined $φ (\cdot)$ as the hypercube function, but the same expression for $p_{n} (\underline{x_{0}})$ holds for any window function.

For example, one common choice of kernel is the Gaussian kernel:

φ (\underline{u}) = N (\underline{u}; \underline{0}, 1)

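$N (\cdot)$ : Gaussian pdf.
$\underline{u}$ : vector of variables (this is a multivariate Gaussian distribution).
$\underline{0}$ : mean (the zero vector).
$1$ : variance (unit variance in every dimension, i.e. the identity covariance matrix in the multivariate case).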
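As a rough sketch of how the Gaussian kernel plugs into the same estimate, the code below replaces the hypercube with a standard multivariate normal density. It assumes zero mean and identity covariance, as in the definition above; the names `gaussian_kernel` and `parzen_gaussian` are illustrative only.

```python
import math
from typing import Sequence


def gaussian_kernel(u: Sequence[float]) -> float:
    """Standard multivariate normal pdf N(u; 0, I), assuming identity covariance."""
    d = len(u)
    squared_norm = sum(u_j * u_j for u_j in u)
    return math.exp(-0.5 * squared_norm) / (2.0 * math.pi) ** (d / 2.0)


def parzen_gaussian(x0: Sequence[float], data: Sequence[Sequence[float]], h: float) -> float:
    """p_n(x0) = 1 / (n * V_n) * sum_i phi((x0 - x_i) / h), with V_n = h^d."""
    d = len(x0)
    volume = h ** d
    total = sum(gaussian_kernel([(a - b) / h for a, b in zip(x0, xi)]) for xi in data)
    return total / (len(data) * volume)


# With a single sample at the query point, the estimate is phi(0) / V_n.
single = parzen_gaussian((0.0, 0.0), [(0.0, 0.0)], h=2.0)
```

Unlike the hypercube, every sample contributes a non-zero amount here, with a weight that decays smoothly with distance from $\underline{x_{0}}$ .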

Sufficient conditions for $p_{n} (\cdot)$ to be a PDF:

$φ (\underline{u}) \geq 0 \forall \underline{u} \in R^{d}$
$\int φ (\underline{u}) d \underline{u} = 1$
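One way to see why these two conditions suffice is the short derivation below. Non-negativity of $p_{n}$ follows directly from $φ \geq 0$ ; for the normalization, substitute $\underline{u} = (\underline{x} - \underline{x_{i}}) / h_{n}$ , so that $d \underline{x} = (h_{n})^{d} d \underline{u} = V_{n} d \underline{u}$ . This is a sketch that assumes the integral and the finite sum can be exchanged freely.

```latex
\int p_{n} (\underline{x}) \, d \underline{x}
  = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_{n}} \int φ \left( \frac{\underline{x} - \underline{x_{i}}}{h_{n}} \right) d \underline{x}
  = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_{n}} \cdot V_{n} \int φ (\underline{u}) \, d \underline{u}
  = \frac{1}{n} \sum_{i=1}^{n} 1
  = 1
```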

NOTE: Whatever the choice of $φ (\cdot)$ , we can take $h_{n} = \frac{α}{n}$ , where $α \in R$ , $α > 0$ , is a hyperparameter chosen arbitrarily. As seen in the previous lecture, this guarantees that the estimated pdf $p_{n} (\underline{x})$ converges to the actual pdf $p (\underline{x})$ as the number of data $n \to \infty$ .

Dirac’s Window Function: Let us define:

δ_{n} (\underline{x}) = \frac{1}{V_{n}} \cdot φ (\frac{\underline{x}}{h_{n}})

Notice how:

If $h_{n}$ is large, $δ_{n} (\underline{x} - \underline{x_{i}})$ is a low, wide, slowly varying function: with the hypercube kernel it is a rectangle centered at $\underline{x_{i}}$ with height $\frac{1}{V_{n}}$ and total area equal to $1$ .
If $h_{n}$ is very small ( $h_{n} \to 0$ ), then $δ_{n} (\underline{x} - \underline{x_{i}})$ approaches a Dirac delta centered at $\underline{x_{i}}$ .

NOTE: This is a useful example to understand how the choice of kernel, and in particular its width $h_{n}$ , can impact the quality of the estimate $p_{n} (\underline{x})$ .
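The two regimes can be checked numerically in one dimension, where $V_{n} = h_{n}$ . The sketch below assumes the hypercube kernel; `delta_n` and `riemann_area` are hypothetical helper names, and the integration range and step count are arbitrary choices.

```python
def delta_n(x: float, h: float) -> float:
    """1-D window delta_n(x) = (1 / V_n) * phi(x / h_n), with V_n = h_n for d = 1."""
    inside = abs(x / h) <= 0.5  # hypercube kernel in one dimension
    return (1.0 / h) if inside else 0.0


def riemann_area(h: float, lo: float = -5.0, hi: float = 5.0, steps: int = 100_000) -> float:
    """Approximate the integral of delta_n over [lo, hi] with the midpoint rule."""
    dx = (hi - lo) / steps
    return sum(delta_n(lo + (k + 0.5) * dx, h) for k in range(steps)) * dx


wide_peak = delta_n(0.0, h=4.0)     # low and wide: height 1 / 4
narrow_peak = delta_n(0.0, h=0.01)  # tall and thin: height about 100
wide_area = riemann_area(4.0)       # both areas stay close to 1
narrow_area = riemann_area(0.01)
```

As $h_{n}$ shrinks, the peak grows like $\frac{1}{h_{n}}$ while the area stays fixed, which is the behaviour expected of a sequence approaching a Dirac delta.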

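$k_{n}$ -Nearest Neighbour: With $h_{n} = \frac{α}{n}$ , a poor choice of $α$ causes problems:

If $α$ is too small, the window contains almost no data ⇒ $p_{n} (\cdot) = 0$ almost everywhere.
If $α$ is too big, the window covers nearly all the data and the estimate flattens to an average ⇒ $p_{n} (\cdot) = E [p (\cdot)]$ .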
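Both failure modes are easy to reproduce on a toy data set. In the sketch below `h` stands in for the window width $h_{n}$ at a fixed $n$ ; the function name `box_estimate_1d` and the four sample values are assumptions made for the illustration.

```python
from typing import Sequence


def box_estimate_1d(x0: float, data: Sequence[float], h: float) -> float:
    """1-D Parzen estimate with the hypercube kernel: k_n / (n * h)."""
    k = sum(1 for xi in data if abs((x0 - xi) / h) <= 0.5)
    return k / (len(data) * h)


data = [0.0, 1.0, 2.0, 3.0]

# Window far narrower than the spacing between samples: nothing is captured.
too_small = box_estimate_1d(0.5, data, h=0.1)

# Window far wider than the data range: every query point sees all samples,
# so the estimate is the same flat value everywhere in the data range.
too_big = box_estimate_1d(0.5, data, h=100.0)
too_big_elsewhere = box_estimate_1d(2.5, data, h=100.0)
```

Here `too_small` is exactly zero, while `too_big` and `too_big_elsewhere` coincide, so the estimate carries no information about where the data actually concentrate.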

We can solve this problem by letting the window size adapt to the data instead of fixing it through $α$ . First, we require that:

$\lim_{n \to \infty} k_{n} = \infty$
$\lim_{n \to \infty} \frac{k_{n}}{n} = 0$

A possible solution to this problem is taking $k_{n} = \sqrt{n}$ , which satisfies both limits, so that:

p_{n} (\underline{x_{0}}) = \frac{k_{n}}{n V_{n}} = \frac{1}{\sqrt{n} V_{n}}

Now, instead of fixing the volume $V_{n}$ in advance, we construct it this way:

Choose our center $\underline{x_{0}}$ .
Create a hypersphere of dimension $d$ around $\underline{x_{0}}$ .
Let the sphere expand until it includes $k_{n} = \sqrt{n}$ data.
Calculate the volume $V_{n}$ of this hypersphere.

To have more flexibility we can define $k_{n} = α \cdot \sqrt{n}$ , where $α \in R$ , $α > 0$ , is a hyperparameter chosen arbitrarily.
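The construction above can be sketched as follows. This is a minimal illustration, not a reference implementation: `ball_volume` and `knn_density` are hypothetical names, Euclidean distance is assumed, `k` is passed explicitly, and the example picks `k` close to $\sqrt{n}$ , which is one choice compatible with the two limits required above.

```python
import math
from typing import Sequence


def ball_volume(radius: float, d: int) -> float:
    """Volume of a d-dimensional hypersphere: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0) * radius ** d


def knn_density(x0: Sequence[float], data: Sequence[Sequence[float]], k: int) -> float:
    """k-NN estimate p_n(x0) = k / (n * V_n), V_n being the smallest ball holding k samples."""
    d = len(x0)
    distances = sorted(math.dist(x0, xi) for xi in data)
    radius = distances[k - 1]  # distance to the k-th nearest sample
    # Note: radius is 0 if k samples coincide with x0; this sketch does not handle that case.
    return k / (len(data) * ball_volume(radius, d))


# Hypothetical 1-D samples; with n = 4 the example uses k = round(sqrt(4)) = 2.
points = [(0.0,), (1.0,), (2.0,), (4.0,)]
k = max(1, round(math.sqrt(len(points))))
density = knn_density((0.0,), points, k)
print(density)  # radius 1 gives an interval of length 2: 2 / (4 * 2) = 0.25
```

Because the radius is chosen by the data, the window is narrow where samples are dense and wide where they are sparse, which avoids both degenerate cases of a fixed window.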


Notation used in this section:

$n$ : number of data (samples).
$R_{n}$ : hypercube region (edge $h_{n}$ , dimension $d$ ).
$V_{n}$ : volume of the hypercube, $V_{n} = (h_{n})^{d}$ .
$φ (\underline{x})$ : window function, or kernel.
$k_{n}$ : number of data falling inside $R_{n}$ , i.e. the sum of the $n$ window-function values.
$p (\underline{x})$ : multivariate pdf, where $\underline{x}$ is a vector.
$p_{n} (\underline{x})$ : estimate of the pdf $p (\underline{x})$ obtained from $n$ data.
