University AI - Complete Mind Map

Recap:

Video explaining the differences between K-NN and Parzen Window classifiers, and what kernels are.

Differences of K-NN and Parzen Window Classifier:

K-NN (K-Nearest Neighbour): has a fixed number of points and we calculate the size of the window. In the example both windows of the 3-NN have 3 data points inside them, but one region is larger and the other is smaller.

Parzen Window: has a fixed window size and we count the number of data points inside the window. In the example both Parzen windows have the same size, but one has more data points inside it and the other has fewer.
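The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal one-dimensional illustration, not the lecture's code: the function names `knn_window_radius` and `parzen_window_count` and the toy data are assumptions made for the example.

```python
def knn_window_radius(x0, data, k):
    """k-NN view: k is fixed, the window grows until it holds k points."""
    distances = sorted(abs(x - x0) for x in data)
    return distances[k - 1]  # distance to the k-th nearest point


def parzen_window_count(x0, data, h):
    """Parzen view: the window width h is fixed, we count the points inside."""
    return sum(1 for x in data if abs(x - x0) <= h / 2)


# Hypothetical toy data: dense near 0, sparse elsewhere.
data = [0.0, 0.1, 0.2, 0.3, 2.0, 4.0, 6.0]

# Same k, different window sizes.
r_dense = knn_window_radius(0.15, data, k=3)
r_sparse = knn_window_radius(4.0, data, k=3)

# Same h, different counts.
c_dense = parzen_window_count(0.15, data, h=1.0)
c_sparse = parzen_window_count(4.0, data, h=1.0)

print(r_dense, r_sparse, c_dense, c_sparse)
```

With 3-NN the window is small in the dense region and large in the sparse one, while the fixed-width Parzen window simply holds more points in the dense region.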

NOTE: Both Parzen Window and K-NN are methods for a PROBABILITY CLASSIFIER.

Kernel: As the next image shows, it is unfair to classify based on a “step” function. We classify this point (the yellow X) as 100% blue because 2 blue dots are inside the region, while a red dot lies just outside it, so its presence is not considered during the classification. ⇒ To solve this problem we introduce KERNELs.

With the plain K-NN and Parzen Window classifiers we use a step function: if you are inside the region $R$ , “your vote” counts as $1$ , otherwise it counts as $0$ . In this other example, the point $X$ has a $66\%$ probability of being blue and a $33\%$ probability of being red, which is not very fair.

Let’s see the difference if we use a Kernel: this way, the closer you are to the center the more you count, and the farther away you are the less you count, but you still count.

NOTE: In practice the Kernel is a function. It usually takes as input the center of the region and a data point, and it returns the “filtered” weight of that particular data point.
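The difference between a step vote and a kernel-weighted vote can be illustrated with a short sketch. The Gaussian weighting, the `bandwidth` parameter, and the three labeled points below are assumptions chosen to mimic the "red dot just outside the region" situation; they are not taken from the original figures.

```python
import math


def step_vote(x0, points, radius):
    """Each point inside the region counts as 1, each point outside as 0."""
    votes = {}
    for x, label in points:
        if abs(x - x0) <= radius:
            votes[label] = votes.get(label, 0.0) + 1.0
    return votes


def gaussian_vote(x0, points, bandwidth):
    """Every point votes, with a weight that decays with its distance from x0."""
    votes = {}
    for x, label in points:
        weight = math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
        votes[label] = votes.get(label, 0.0) + weight
    return votes


def normalise(votes):
    """Turn raw votes into fractions that sum to 1."""
    total = sum(votes.values())
    return {label: v / total for label, v in votes.items()}


# Hypothetical points: two blue inside radius 1, one red just outside it.
points = [(0.8, "blue"), (0.9, "blue"), (1.05, "red")]

hard = normalise(step_vote(0.0, points, radius=1.0))
soft = normalise(gaussian_vote(0.0, points, bandwidth=1.0))

print(hard)  # the red point is ignored entirely
print(soft)  # the red point still contributes a smaller share
```

The step function gives all of the vote to blue, whereas the kernel version keeps blue as the majority but still assigns red a nonzero share.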

Parzen Window: Before training we can “filter” the data using a function, for example the hypercube function, which selects only the data in a specific region: a hypercube of edge $h_{n}$ in $d$ dimensions (so its volume is $V_{n} = (h_{n})^{d}$ ). We define the “window function” or “kernel” as:
$φ (\underline{u}) = \begin{cases} 1 & | u_{j} | \leq \frac{1}{2} \quad \forall j = 1, 2, \dots, d \\ 0 & \text{otherwise} \end{cases}$
The kernel $φ (\frac{\underline{x_{0}} - \underline{x_{i}}}{h _{n}})$ has value $1$ only within the hypercube centered at $\underline{x_{0}}$ with edge $h_{n}$ , so the number of data points in this hypercube ( $k_{n}$ ) is:
$k_{n} = \sum_{i = 1}^{n} φ (\frac{\underline{x_{0}} - \underline{x_{i}}}{h _{n}})$
As defined before, the estimate $p_{n} (\underline{x_{0}})$ of the unknown pdf $p (\cdot)$ is $\frac{k _{n}}{n V _{n}}$ , so:
$p_{n} (\underline{x_{0}}) = \frac{1}{n V _{n}} \cdot \sum_{i = 1}^{n} φ (\frac{\underline{x_{0}} - \underline{x_{i}}}{h _{n}})$

NOTE: We have defined $φ (\cdot)$ as the hypercube function, but the estimate keeps this same form for any window function.
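The hypercube estimate can be written out directly from the formula $p_{n} (\underline{x_{0}}) = \frac{k _{n}}{n V _{n}}$ . The sketch below is one possible reading of it in plain Python; the function names and the four sample points are hypothetical.

```python
def hypercube_kernel(u):
    """phi(u) = 1 if every |u_j| <= 1/2, else 0."""
    return 1.0 if all(abs(u_j) <= 0.5 for u_j in u) else 0.0


def parzen_estimate(x0, data, h, kernel=hypercube_kernel):
    """p_n(x0) = 1 / (n * V_n) * sum_i phi((x0 - x_i) / h), with V_n = h**d."""
    n = len(data)
    d = len(x0)
    volume = h ** d
    k_n = sum(kernel([(a - b) / h for a, b in zip(x0, x_i)]) for x_i in data)
    return k_n / (n * volume)


# Hypothetical 2-D data: three points near the origin, one far away.
data = [(0.0, 0.0), (0.2, 0.1), (-0.3, 0.4), (2.0, 2.0)]

p_small = parzen_estimate((0.0, 0.0), data, h=1.0)
p_large = parzen_estimate((0.0, 0.0), data, h=2.0)

print(p_small, p_large)
```

Passing the kernel as an argument mirrors the note above: swapping in another window function leaves the rest of the estimator unchanged.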

For example one common choice of a kernel is the Gaussian Kernel:
$φ (\underline{u}) = N (\underline{u}; \underline{0}, 1)$

$N (\cdot)$ : Gaussian pdf.

$\underline{u}$ : vector of variables, this is a multivariate gaussian distribution

$\underline{0}$ : mean

$1$ : variance (in the multivariate case, the identity covariance matrix)
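A Gaussian window can be plugged into the same estimator. The sketch below assumes a zero-mean Gaussian with identity covariance, so that the kernel factorises over the dimensions; `gaussian_kernel` and `parzen_gaussian` are illustrative names.

```python
import math


def gaussian_kernel(u):
    """Multivariate Gaussian pdf with zero mean and identity covariance (assumed)."""
    d = len(u)
    squared_norm = sum(u_j * u_j for u_j in u)
    return math.exp(-0.5 * squared_norm) / (2.0 * math.pi) ** (d / 2.0)


def parzen_gaussian(x0, data, h):
    """p_n(x0) = 1 / (n * h**d) * sum_i phi((x0 - x_i) / h) with a Gaussian phi."""
    n = len(data)
    d = len(x0)
    total = sum(
        gaussian_kernel([(a - b) / h for a, b in zip(x0, x_i)]) for x_i in data
    )
    return total / (n * h ** d)


# Hypothetical 1-D data set with a single sample at the origin.
data = [(0.0,)]

at_sample = parzen_gaussian((0.0,), data, h=1.0)
far_away = parzen_gaussian((3.0,), data, h=1.0)

print(at_sample, far_away)
```

Unlike the hypercube window, the Gaussian window never drops to exactly zero, so distant samples still contribute a small amount to the estimate.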

Sufficient Condition for Asserting that $p_{n} (\cdot)$ is a PDF :

$φ (\underline{u}) \geq 0 \forall \underline{u} \in R^{d}$

$\int φ (\underline{u}) d \underline{u} = 1$
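These two conditions can be checked numerically on a small example. The sketch below assumes a one-dimensional Gaussian window and a simple trapezoidal rule; the data, the value of `h`, and the integration range are arbitrary choices for illustration.

```python
import math


def gaussian_kernel_1d(u):
    """Non-negative window function that integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_1d(x, data, h):
    """One-dimensional Parzen estimate, where V_n = h."""
    return sum(gaussian_kernel_1d((x - x_i) / h) for x_i in data) / (len(data) * h)


def integrate(f, lo, hi, steps=20000):
    """Trapezoidal rule on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * width)
    return total * width


# Hypothetical samples.
data = [-1.0, 0.0, 0.5, 2.0]

area = integrate(lambda x: parzen_1d(x, data, h=0.5), -10.0, 10.0)
lowest = min(parzen_1d(-10.0 + 0.1 * i, data, h=0.5) for i in range(201))

print(area, lowest)
```

Because each of the $n$ terms contributes an area of $\frac{1}{n}$ , the estimate itself is non-negative and integrates to approximately $1$ .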

NOTE: Also, no matter the type of function $φ (\cdot)$ , we can just take $h_{n} = \frac{α}{n}$ , where $α \in R$ is a hyperparameter chosen arbitrarily, and we can guarantee, as seen in the previous lecture, that the estimated pdf $p_{n} (\underline{x})$ converges to the actual pdf $p (\underline{x})$ as the number of data points $n \to \infty$ .

Dirac’s Window Function: Let us define:
$δ_{n} (\underline{x}) = \frac{1}{V _{n}} \cdot φ (\frac{x}{h _{n}})$
Notice how:

If $h_{n}$ is large, $δ_{n} (\underline{x} - \underline{x_{i}})$ is a very smooth function with little variation: for the hypercube window it is a low, wide rectangle centered at $\underline{x_{i}}$ , with height $\frac{1}{V _{n}}$ and total area equal to $1$ .

If $h_{n}$ is really small, $h_{n} \to 0$ , then $δ_{n} (\underline{x} - \underline{x_{i}})$ approaches a Dirac delta centered at $\underline{x_{i}}$ .

NOTE: This is a useful example to understand how a different kernel can impact the quality of the estimate $p_{n} (\underline{x})$ .
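The effect of $h_{n}$ on $δ_{n}$ can be seen numerically. The sketch below assumes the one-dimensional hypercube (box) window, so that $V_{n} = h_{n}$ ; the three values of `h` and the midpoint-rule integration are illustrative choices.

```python
def box_kernel(u):
    """One-dimensional hypercube window: 1 if |u| <= 1/2, else 0."""
    return 1.0 if abs(u) <= 0.5 else 0.0


def delta_n(x, h):
    """delta_n(x) = (1 / V_n) * phi(x / h_n) in one dimension, where V_n = h_n."""
    return box_kernel(x / h) / h


def area_under(h, steps=100000, lo=-5.0, hi=5.0):
    """Midpoint-rule estimate of the integral of delta_n over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(delta_n(lo + (i + 0.5) * width, h) for i in range(steps)) * width


widths = (4.0, 1.0, 0.01)
heights = {h: delta_n(0.0, h) for h in widths}
areas = {h: area_under(h) for h in widths}

for h in widths:
    print(h, heights[h], areas[h])
```

As `h` shrinks the rectangle gets narrower and taller while its area stays at about $1$ , which is the behaviour that tends towards a Dirac delta.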

$k_{n}$ -Nearest Neighbour : Given that $h_{n} = \frac{α}{n}$ we can run into some problems:

If $α$ is too small, the window contains almost no data ⇒ $p_{n} (\cdot) = 0$ .

If $α$ is too big, the window averages over almost all the data ⇒ $p_{n} (\cdot) = E [p (\cdot)]$ .

We can solve this problem by generalizing the choice of $α$ . First, we require that:

$\lim_{n \to \infty} k_{n} = \infty$

$\lim_{n \to \infty} \frac{k _{n}}{n} = 0$

A possible solution to this problem is taking $k_{n} = \sqrt{n}$ , which satisfies both conditions, such that:
$p_{n} (\underline{x_{0}}) = \frac{k _{n}}{n V _{n}} = \frac{1}{\sqrt{n} V _{n}}$
Now, instead of deciding the volume $V_{n}$ in advance, we construct it this way:

Choose our center $\underline{x_{0}}$ .

Create a hypersphere of dimension $d$ around $\underline{x_{0}}$ .

Let the sphere expand until it includes $k_{n} = \sqrt{n}$ data points.

Calculate the volume $V_{n}$ of this hypersphere.

To have more flexibility we can define $k_{n} = α \cdot \sqrt{n}$ , where $α \in R$ is a hyperparameter chosen arbitrarily.
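The hypersphere construction described above can be sketched as follows. The value of `k` is left to the caller, the volume uses the standard formula for a $d$ -dimensional ball, and the function names and sample data are assumptions made for the example. `math.dist` requires Python 3.8 or newer.

```python
import math


def hypersphere_volume(radius, d):
    """Volume of a d-dimensional ball: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0) * radius ** d


def knn_density(x0, data, k):
    """p_n(x0) = k / (n * V_n), where V_n is the ball reaching the k-th neighbour."""
    n = len(data)
    d = len(x0)
    distances = sorted(math.dist(x0, x_i) for x_i in data)
    radius = distances[k - 1]  # grow the sphere until it holds k points
    return k / (n * hypersphere_volume(radius, d))


# Hypothetical 1-D data, evenly spaced.
data = [(0.0,), (1.0,), (2.0,), (3.0,)]

p = knn_density((0.0,), data, k=2)
print(p)
```

Note that if `x0` coincides with a sample and `k = 1`, the radius is zero and the estimate is undefined, so in practice `k` should be larger than the number of samples sitting exactly at `x0`.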

Naming :

$μ, \underline{μ}$ : mean and vector mean

$σ^{2}, Σ$ : variance and covariance matrix

$\underline{Θ}$ : parameter vector, for example: $\underline{Θ} = (\underline{μ}, Σ)$

$ω_{i}$ : the $i$ -th class, for example the gender (male/female) we want to identify.

$w_{i}$ : weight

$b_{i}$ : bias

$\underline{x}, \underline{y}$ : data, could mean data in input or training data.

$\underline{x_{0}}$ : our pattern of interest, for example an input point whose density we want to estimate after training the model.

$p (\underline{x})$ : the true pdf we are looking for.

$p_{n} (\underline{x})$ : $n$ -estimate of the pdf $p (\cdot)$ where $n$ is the number of data used to calculate this estimate.

$Y = \{\underline{y_{1}}, \dots, \underline{y_{n}}\}$ : a set of data, usually used to indicate the training set. We may also find $Y = Y_{1} \cup Y_{2} \cup \dots \cup Y_{c}$ , where $Y_{i}$ contains the data of a single class $ω_{i}$ and is drawn according to the distribution $p (Y ∣ ω_{i})$ .

$τ$ : set of data belonging to the test set.

$V$ : set of data belonging to the validation set.

$c$ : number of classes, as in $Y = Y_{1} \cup Y_{2} \cup \dots \cup Y_{c}$

$P (ω_{i} ∣ \underline{x})$ : real probability that $\underline{x}$ , the data point we want to classify, belongs to the class $ω_{i}$ (e.g. in reality the percentages of male and female are $48\% / 52\%$ ).

$\hat{P} (ω_{i} ∣ \underline{x})$ : estimate of the probability $P (ω_{i} ∣ \underline{x})$ (e.g. we estimate that the percentages of male and female are $50\% / 50\%$ , though this is not actually true).

$g_{i} (\underline{x})$ : discriminant function of class $ω_{i}$ , usually defined as $g_{i} (\underline{x}) = \log p (\underline{x} ∣ ω_{i}) + \log P (ω_{i})$ , or as $g_{i} (\underline{x}) = w_{i}^{t} \underline{x} + b_{i}$ in the linear case.

$D (\underline{x}) = ω_{i}$ : decision rule. A simple decision rule could be: $D (\underline{x}) = ω_{i}$ iff $g_{i} (\underline{x}) \geq g_{j} (\underline{x}) \ \forall j \neq i$
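The discriminant function and the decision rule can be illustrated with a small sketch. It assumes one-dimensional Gaussian class-conditional densities, which is only one possible choice for $p (\underline{x} ∣ ω_{i})$ ; the class names, means, variances, and priors below are hypothetical.

```python
import math


def log_gaussian_pdf(x, mean, variance):
    """Log of a 1-D Gaussian pdf (assumed form of p(x | omega_i))."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def discriminant(x, mean, variance, prior):
    """g_i(x) = log p(x | omega_i) + log P(omega_i)."""
    return log_gaussian_pdf(x, mean, variance) + math.log(prior)


def decide(x, classes):
    """Return the class label whose discriminant is largest at x."""
    return max(classes, key=lambda name: discriminant(x, *classes[name]))


# Each class is described by (mean, variance, prior); all values are hypothetical.
balanced = {"omega_1": (0.0, 1.0, 0.5), "omega_2": (3.0, 1.0, 0.5)}
skewed = {"omega_1": (0.0, 1.0, 0.9), "omega_2": (3.0, 1.0, 0.1)}

print(decide(2.0, balanced))  # closer to the mean of omega_2
print(decide(2.0, skewed))    # the larger prior of omega_1 shifts the boundary
```

With equal priors the point is assigned to the class with the nearer mean, while a strongly unequal prior moves the decision boundary towards the less likely class.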
