Example of Chain Rule

Given the chain rule:

\frac{\partial}{\partial w _{i}} e_{k} = \frac{\partial V}{\partial f} \frac{\partial f}{\partial w _{i}}

In the case of NN with sigmoid activation function we have $V (f (\overset{w}{^}, \overset{x}{^}_{k}), y_{k})$

f (\overset{w}{^}, \overset{x}{^}_{k}) = σ (\overset{w}{^}, \overset{x}{^}_{k})

So its partial derivative with respect to $w$ is:

\frac{\partial f}{\partial w ^ _{i}} = \frac{\partial σ ( a _{k} )}{\partial w ^ _{i}} \overset{x}{^}_{k_{i}}

Also, Given:

V = \frac{1}{2} (f (\overset{w}{^}, \overset{x}{^}_{k}), y_{k})^{2} = \frac{1}{2} (σ (a_{k}) - y_{k})^{2}

We have that:

\frac{\partial V}{\partial f} = (σ (a_{k}) - y_{k})

So:

\frac{\partial V}{\partial w _{i}} = = \frac{\partial V}{\partial f} \frac{\partial f}{\partial w _{i}} = (σ (a_{k}) - y_{k}) \cdot \frac{\partial σ ( a _{k} )}{\partial w ^ _{i}} \overset{x}{^}_{k_{i}}

We also know that:

\frac{\partial σ ( a _{k} )}{\partial w ^ _{i}} = σ (a_{k}) \cdot (1 - σ (a_{k}))

To help us with notations we define the delta error:

δ_{k} := \frac{\partial σ ( a _{k} )}{\partial w ^ _{i}} (σ (a) - y_{k})

So:

\frac{\partial V}{\partial w _{i}} = δ_{k} \cdot \overset{x}{^}_{k_{i}}

Backpropagation Formula:

The main formula to remember for the backpropagation algorithm using a sigmoidal NN is:

\frac{\partial V}{\partial w _{i}} = σ (a_{k}) \cdot [1 - σ (a_{k})] \cdot [σ (a_{k}) - y_{k}]

One-hot Encoding

Source LABEL ENCODING (Look at the Categorical value column)

╔════════════╦═════════════════╦════════╗ 
║ CompanyName Categoricalvalue ║ Price  ║
╠════════════╬═════════════════╣════════║ 
║ VW         ╬      1          ║ 20000  ║
║ Acura      ╬      2          ║ 10011  ║
║ Honda      ╬      3          ║ 50000  ║
║ Honda      ╬      3          ║ 10000  ║
╚════════════╩═════════════════╩════════╝

ONE-HOT ENCODING:

╔════╦══════╦══════╦════════╦
║ VW ║ Acura║ Honda║ Price  ║
╠════╬══════╬══════╬════════╬
║ 1  ╬ 0    ╬ 0    ║ 20000  ║
║ 0  ╬ 1    ╬ 0    ║ 10011  ║
║ 0  ╬ 0    ╬ 1    ║ 50000  ║
║ 0  ╬ 0    ╬ 1    ║ 10000  ║
╚════╩══════╩══════╩════════╝

We usually prefer one-hot encoding in respect to categorical value for 2 main reasons

The label encoding assumes hierarchy, if our ML Model internally calculates the average then if we use label encoding we have that: VM < Acura < Honda, which doesn’t make any sense.
The one-hot encoding can also be compared to the output of a sigmoid function, or any other ML activation function that output belongs $\in [0, 1]$ .

Entropy Loss

V \frac{\partial V}{\partial a} = - y lo g (σ (a)) - (1 - y) lo g (1 - σ (a)) = y (1 - σ (a)) + (1 - y) (σ (a)) = {lo g (1 - σ (a)) lo g (σ (a)) for y = 0 for y = 1 = {1 - σ (a) σ (a) for y = 0 for y = 1

REMEMBER: For classification problems $y$ can only be $0$ or $1$

OBSERVATION: For the Entropy loss the delta error is null only for absolute minima

OBSERVATION: If the neuron is saturated ( $σ (a) = 0 or 1$ ) but the actual output is the opposite ( $y = 1 or 0$ ) the entropy returns a big value, think of it as notifying the NN that it made a big mistake. When the loss is big loss the next step made by the NN will also be big, so with the Entropy loss is much easier to escape the condition of saturation

🪴 Quartz 4.0

Explorer

ML - Lecture 11

Example of Chain Rule

Backpropagation Formula:

One-hot Encoding

Entropy Loss

Graph View

Table of Contents

Backlinks