Fast Recap:
Validation of Classifiers: given our labeled data, we divide it into a training set $\mathcal{X}$, a test set $\mathcal{T}$, and a validation set $\mathcal{V}$; then:
- We select a model $M(\Theta)$.
- Using the training set $\mathcal{X}$ we estimate the parameters $\hat{\Theta}$.
- After defining a cost function, we compute the error $\hat{P}_e$ evaluated on the validation set $\mathcal{V}$.
- If $\hat{P}_e$ is not good enough, we restart from the first (or second) step.
- We do the final evaluation of the model error ($\hat{P}_e$) using the test set $\mathcal{T}$.
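The workflow above can be sketched with a toy 1-D threshold classifier (the data, the classifier, and the split sizes are made up for illustration; standard library only):

```python
import random

# Synthetic labeled 1-D data: label 1 if x > 0.5 (illustration only).
random.seed(0)
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(100))]

# 1. Split the labeled data into training, validation, and test sets.
random.shuffle(data)
train, val, test = data[:60], data[60:80], data[80:]

def fit_threshold(samples):
    """Estimate the decision threshold as the midpoint of the two class means."""
    c0 = [x for x, y in samples if y == 0]
    c1 = [x for x, y in samples if y == 1]
    return (sum(c0) / len(c0) + sum(c1) / len(c1)) / 2

def error_rate(theta, samples):
    """Cost function: fraction of misclassified samples (0/1 loss)."""
    return sum(int(x > theta) != y for x, y in samples) / len(samples)

theta = fit_threshold(train)        # 2. estimate the parameters on the training set
val_err = error_rate(theta, val)    # 3. error evaluated on the validation set
# (if val_err is too high, go back and pick a different model)
test_err = error_rate(theta, test)  # 4. final model error on the test set
```

Note that the test set is touched exactly once, at the very end; tuning against it would bias the final error estimate.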
“Leave One Out” Method: used when the data sample is small and it is difficult/expensive to collect more data.
- Let $N$ be the number of available samples.
- Loop for $i = 1, \dots, N$: use all samples except $x_i$ to estimate $\hat{\Theta}_i$; compute and store the error on $x_i$ using the model with hyperparameters $\hat{\Theta}_i$.
- The error is always evaluated on “new data”: data that is not in the training set and that the model has not yet seen.
- In the end, all data is used both for training and testing, so no data is “wasted”.
- It should only be used on small data sets: the method scales badly (one model fit per sample).
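A minimal leave-one-out sketch, reusing the same midpoint-threshold classifier on made-up 1-D data (standard library only):

```python
# Toy labeled samples (x, class); values are invented for illustration.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
N = len(data)

def fit_threshold(samples):
    """Midpoint of the two class means, as in the recap example."""
    c0 = [x for x, y in samples if y == 0]
    c1 = [x for x, y in samples if y == 1]
    return (sum(c0) / len(c0) + sum(c1) / len(c1)) / 2

errors = []
for i in range(N):
    held_out = data[i]                  # the single "new" test sample
    rest = data[:i] + data[i + 1:]      # train on the other N-1 samples
    theta = fit_threshold(rest)
    x, y = held_out
    errors.append(int(x > theta) != y)  # 0/1 loss on unseen data

loo_error = sum(errors) / N             # average over all N rounds
```

Each sample is tested exactly once, but the model is refit $N$ times, which is why the method does not scale to large data sets.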
“Many-Fold Cross-Validation” Method: an alternative to the normal method, based on the idea of the “Leave One Out” method, but it scales much better.
- Let $K$ be the number of folds.
- Loop for $k = 1, \dots, K$: create a test fold $\mathcal{T}_k$ with a certain percentage of the still-unused data; use the remaining data to estimate $\hat{\Theta}_k$; compute and store the error calculated on $\mathcal{T}_k$ using the newly found hyperparameters $\hat{\Theta}_k$.
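A hypothetical $K$-fold sketch with the same toy threshold classifier (data and fold count are invented; standard library only):

```python
import random

# Toy labeled samples (x, class); invented for illustration.
data = [(0.05, 0), (0.15, 0), (0.25, 0), (0.35, 0),
        (0.65, 1), (0.75, 1), (0.85, 1), (0.95, 1)]
random.seed(1)
random.shuffle(data)

def fit_threshold(samples):
    c0 = [x for x, y in samples if y == 0]
    c1 = [x for x, y in samples if y == 1]
    return (sum(c0) / len(c0) + sum(c1) / len(c1)) / 2

K = 4
fold_size = len(data) // K
fold_errors = []
for k in range(K):
    # Fold k of the still-unused data becomes the test set T_k ...
    test_fold = data[k * fold_size:(k + 1) * fold_size]
    # ... and everything else is used to estimate the parameters.
    train_fold = data[:k * fold_size] + data[(k + 1) * fold_size:]
    theta = fit_threshold(train_fold)
    err = sum(int(x > theta) != y for x, y in test_fold) / len(test_fold)
    fold_errors.append(err)

cv_error = sum(fold_errors) / K  # averaged cross-validation error
```

Every sample still ends up in exactly one test fold, but only $K$ models are fit instead of $N$, which is what makes this scale better than leave-one-out.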
Naming:
- $\mu$, $\boldsymbol{\mu}$: mean and mean vector.
- $\sigma^2$, $\Sigma$: variance and covariance matrix.
- $\Theta$: parameter vector, for example $\Theta = (\mu, \sigma^2)$.
- $\omega_i$: classes, for example the gender (male/female) we want to identify.
- $w$: weight (vector).
- $w_0$: bias.
- $x$: data; it could mean input data or training data.
- $\mathcal{X}$: a set of data, usually used to indicate the training set. We may find $\mathcal{X}_i$, which refers to the data of a single class $\omega_i$, given according to the distribution $p(x \mid \omega_i)$.
- $\mathcal{T}$: set of data belonging to the test set.
- $\mathcal{V}$: set of data belonging to the validation set.
- $N$: number of samples used as the training set.
- $P(\omega_i)$: real probability that $x$, the data or variable we want to classify, belongs to / is identified as the class $\omega_i$ (e.g., the true proportion of males and females in the population).
- $\hat{P}(\omega_i)$: estimated probability of $P(\omega_i)$ (e.g., the proportion of males and females we estimate from the data, though it may not match the true one).
- $g_i(x)$: discriminant function of class $\omega_i$; in the linear case it is usually defined as $g_i(x) = w^T x + w_0$.
- Decision rule: a simple decision rule could be: decide $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$.
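The linear discriminant and the arg-max decision rule can be sketched as follows (the weight vectors and biases are made-up illustrative values, not learned ones):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# One (w, w0) pair per class omega_i; hypothetical values for two classes.
classifiers = {
    "male":   ([1.0, -0.5], 0.2),
    "female": ([-1.0, 0.5], -0.2),
}

def g(x, w, w0):
    """Linear discriminant function g_i(x) = w^T x + w_0 for one class."""
    return dot(w, x) + w0

def decide(x):
    """Decision rule: pick the class whose discriminant value is largest."""
    return max(classifiers, key=lambda c: g(x, *classifiers[c]))
```

For example, `decide([1.0, 0.0])` compares $g_\text{male} = 1.2$ against $g_\text{female} = -1.2$ and returns `"male"`.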
Referring to the second “Note”:
- $P(\omega_i)$: real probability that $x$, the data or variable we want to classify, belongs to / is identified as the class $\omega_i$ (e.g., the true proportion of males and females in the population).
- $\hat{P}(\omega_i)$: estimated probability of $P(\omega_i)$ (e.g., the proportion of males and females we estimate from the data, though it may not match the true one).
- $\hat{P}_e(\omega_i)$: estimated error probability for the class $\omega_i$.


The idea of projecting the data onto $w$ is the same as in the Unscented Kalman Filter and the Particle Filter: creating a pdf given some data.
