Practical Issues and Some Insights:

Weights must be initialized randomly, and they must lie in a small interval $[- ε, + ε]$ . ⇒ If $ε$ is too big, the sigmoid will saturate. ⇒ If the weights are all equal to each other, the model will not “break symmetry”: every unit in a layer receives the same update, so the weights in each layer stay equal to each other, no matter the number of epochs.
It is good practice to normalize the inputs; a typical choice is the range $[- 1, + 1]$ .
Using a sigmoid output function automatically bounds the outputs within the interval $[0, 1]$ .
Regularization: reduce the effective dimensionality of the model. Two well-known techniques are: ⇒ Weight-Sharing: some connections are forced to use the same weight; the update for the shared weight ( $Δ w$ ) is computed by averaging the individual updates $Δ w_{ij}$ of the original connections. ⇒ Weight-Decay: numerically smaller weights mean simpler solutions, so a penalty on the weights is added to the cost:

C = \frac{1}{2} \sum_{i} (\hat{y}_{i} - y_{i})^{2} + \frac{α}{2} \sum_{i, j} (w_{ij})^{2}

Where $\frac{α}{2} \sum_{i, j} (w_{ij})^{2}$ is called the regularization term: if the weights grow in magnitude, so does the cost $C$ .
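A minimal sketch of the weight-decay cost above, in plain Python. The function names, the toy predictions and the value of `alpha` are illustrative assumptions, not part of the notes. It shows that the same prediction error costs more when the weights are larger, and that the penalty contributes $α w_{ij}$ to the gradient of each weight.

```python
def cost_with_weight_decay(y_hat, y, weights, alpha):
    """C = 1/2 * sum_i (y_hat_i - y_i)^2 + alpha/2 * sum_ij (w_ij)^2."""
    data_term = 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))
    penalty = 0.5 * alpha * sum(w ** 2 for row in weights for w in row)
    return data_term + penalty


def decay_gradient(weights, alpha):
    """Gradient of the penalty alone: d/dw_ij (alpha/2 * w_ij^2) = alpha * w_ij."""
    return [[alpha * w for w in row] for row in weights]


# Toy values (assumed for illustration only).
y_hat = [0.9, 0.2]          # network outputs
y = [1.0, 0.0]              # targets
small_w = [[0.1, -0.1], [0.2, 0.0]]
large_w = [[1.0, -1.0], [2.0, 0.0]]
alpha = 0.01

cost_small = cost_with_weight_decay(y_hat, y, small_w, alpha)
cost_large = cost_with_weight_decay(y_hat, y, large_w, alpha)
print(cost_small < cost_large)  # same error, larger weights, higher cost
```

During training, the penalty gradient is simply added to the gradient of the error term, which pulls every weight towards zero at each update.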

We can use more flexible activation functions. Their extra parameters require more time in the learning phase, but they should yield a better, faster and smaller model. For example, instead of the common sigmoid, we might opt for:

f (a) = \frac{λ}{1 + e ^{- (a - b) / θ}}
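The notes do not define $λ$, $b$ and $θ$. Reading them off the formula (and assuming $θ > 0$), $λ$ sets the upper asymptote, $b$ shifts the midpoint, and $θ$ controls the steepness. The sketch below, with a hypothetical function name and default values, evaluates the function in a numerically safe way; with $λ = 1$, $b = 0$, $θ = 1$ it reduces to the common sigmoid.

```python
import math


def flexible_sigmoid(a, lam=1.0, b=0.0, theta=1.0):
    """f(a) = lam / (1 + exp(-(a - b) / theta)), assuming theta > 0."""
    z = (a - b) / theta
    if z >= 0:
        return lam / (1.0 + math.exp(-z))
    # For negative z use the equivalent form lam * e^z / (1 + e^z),
    # which avoids overflow in exp() for very negative inputs.
    e = math.exp(z)
    return lam * e / (1.0 + e)


print(flexible_sigmoid(0.0))                   # standard sigmoid at 0
print(flexible_sigmoid(2.0, lam=3.0, b=2.0))   # midpoint: lam / 2
```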

Learning may be made more stable by including a momentum (or inertia) term in the delta rule:

Δ w (t + 1) = - η \frac{\partial C}{\partial w ( t )} + ρ Δ w (t)

Where $ρ \in (0, 1)$ is the momentum rate: how much of the old $Δ w$ carries over into the new one.
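The momentum update can be sketched for a single weight as follows. The quadratic cost $C(w) = (w - 3)^{2}$, the function names and the values of `eta` and `rho` are assumptions chosen only to make the example runnable; they do not come from the notes.

```python
def momentum_step(w, prev_delta, grad, eta, rho):
    """One update: delta_w(t+1) = -eta * dC/dw(t) + rho * delta_w(t)."""
    delta = -eta * grad + rho * prev_delta
    return w + delta, delta


def grad_C(w):
    """Gradient of the toy cost C(w) = (w - 3)^2 (assumed for illustration)."""
    return 2.0 * (w - 3.0)


w, delta = 0.0, 0.0  # initial weight, no previous update yet
for _ in range(200):
    w, delta = momentum_step(w, delta, grad_C(w), eta=0.1, rho=0.9)

print(abs(w - 3.0) < 1e-3)  # the weight has settled near the minimum at w = 3
```

With `rho=0.0` the same function reduces to the plain delta rule, which makes it easy to compare the two behaviours on the same cost.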
