Why is SOFTMAX used?

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the SOFTMAX function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.” — Page 184, Deep Learning, 2016.

SOFTMAX units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch.” — Page 196, Deep Learning, 2016.