Feed-Forward Neural Network Backpropagation: Using the Chain Rule to Train Multi-Layer Perceptrons

Feed-forward neural networks, often called multi-layer perceptrons (MLPs), learn by adjusting weights so that predictions get closer to the correct outputs. The core mechanism that makes this possible is backpropagation. Backpropagation applies the chain rule from calculus to compute gradients efficiently across multiple layers, enabling weight updates through gradient-based optimisation. If you are exploring neural networks through data science classes in Pune, understanding backpropagation at a conceptual and practical level will make model training far less mysterious.

1) The MLP training loop: forward pass, loss, and the need for gradients

An MLP is organised into layers. Each layer performs two steps: a linear transformation and a non-linear activation.

  • Linear step: z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
  • Activation: a^{(l)} = f(z^{(l)})

Here, a^{(0)} is the input vector, W^{(l)} and b^{(l)} are the weights and biases of layer l, and f(·) is an activation function such as ReLU, sigmoid, or tanh.

After the forward pass, the network produces an output \hat{y}. A loss function measures error, for example mean squared error (regression) or cross-entropy (classification). Training aims to minimise loss by changing parameters. To do that, we need the gradient: how much the loss changes when each weight changes. Computing these gradients directly, weight by weight, is expensive. Backpropagation solves this by reusing intermediate results.
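The forward pass and loss computation can be sketched in NumPy. This is a minimal illustration, not a production implementation: the layer sizes (3 → 4 → 1), the tanh activation, and the squared-error loss are all assumptions chosen for the example.

```python
import numpy as np

# Minimal sketch of a 2-layer MLP forward pass (sizes 3 -> 4 -> 1 are illustrative).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

def forward(x):
    z1 = W1 @ x + b1      # linear step of layer 1
    a1 = np.tanh(z1)      # non-linear activation of layer 1
    z2 = W2 @ a1 + b2     # linear step of the output layer
    return z1, a1, z2     # intermediates are cached for backpropagation

x = rng.normal(size=(3, 1))                      # a single input column vector
y = np.array([[1.0]])                            # its target value
z1, a1, y_hat = forward(x)
loss = 0.5 * float(np.sum((y_hat - y) ** 2))     # squared-error loss for one sample
```

Caching z1 and a1 during the forward pass is what makes the backward pass cheap: those intermediates are exactly the "reused results" mentioned above.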

2) Backpropagation intuition: the chain rule across layers

The chain rule states that if a value depends on another value which depends on a third value, then derivatives multiply along the path. In an MLP, the loss depends on outputs, outputs depend on activations, activations depend on weighted sums, and weighted sums depend on weights. Backpropagation walks backward through the network and calculates gradients layer by layer.

A useful way to think about it is error signals moving backward. For each layer, we compute:

  1. Compute the gradient of the loss with respect to that layer’s pre-activation z^{(l)}.
  2. Use it to obtain gradients for W^{(l)} and b^{(l)}.
  3. Pass the error signal back to the previous layer.

This backward flow is efficient because it avoids recomputing repeated partial derivatives.

3) Core equations: deltas and weight updates

Backpropagation typically uses the term delta for the error signal at each layer:

\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}}

For the output layer, \delta^{(L)} depends on the loss function and the activation. For example, with softmax + cross-entropy, the gradient simplifies nicely (often \hat{y} - y), which is one reason this pairing is popular.

For a hidden layer, the chain rule gives:

  • \delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot f'(z^{(l)}), where (W^{(l+1)})^T \delta^{(l+1)} propagates the error backward through the weights.
  • f'(z^{(l)}) is the derivative of the activation function.
  • \odot denotes element-wise multiplication.

Once δ(l)\delta^{(l)}δ(l) is known, gradients are straightforward:

\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T, \quad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}

Finally, gradient descent updates parameters:

W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}, \quad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial L}{\partial b^{(l)}}

where \eta is the learning rate. In practical training, these gradients are averaged over a mini-batch.
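The delta recursion and the update rule can be put together in a short end-to-end sketch. The layer sizes (3 → 4 → 1), tanh hidden activation, identity output with squared-error loss, and the learning rate are all illustrative assumptions; with squared error and an identity output, the output delta reduces to \hat{y} - y.

```python
import numpy as np

# Minimal sketch of one gradient-descent step on a tiny 2-layer MLP.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
eta = 0.01                                  # learning rate (assumed)

x = rng.normal(size=(3, 1))
y = np.array([[0.5]])

# Forward pass, caching intermediates for the backward pass.
z1 = W1 @ x + b1
a1 = np.tanh(z1)
y_hat = W2 @ a1 + b2                        # identity output activation
loss_before = 0.5 * ((y_hat - y) ** 2).item()

# Backward pass: deltas layer by layer.
delta2 = y_hat - y                          # dL/dz at the output (squared error + identity)
delta1 = (W2.T @ delta2) * (1 - a1 ** 2)    # (W^T delta) element-wise tanh'(z1) = 1 - tanh^2

# Gradients: dL/dW = delta a_prev^T, dL/db = delta.
dW2, db2 = delta2 @ a1.T, delta2
dW1, db1 = delta1 @ x.T, delta1

# Gradient-descent update.
W2 -= eta * dW2; b2 -= eta * db2
W1 -= eta * dW1; b1 -= eta * db1

loss_after = 0.5 * ((W2 @ np.tanh(W1 @ x + b1) + b2 - y) ** 2).item()
```

A single small step along the negative gradient should reduce the loss on this sample, which is an easy sanity check when implementing backprop by hand.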

4) Practical considerations: why training can fail (and how to fix it)

Even with correct backpropagation, training can be unstable if gradients vanish or explode.

  • Vanishing gradients: Common with deep networks and saturating activations (sigmoid, tanh). Gradients become tiny, so early layers learn slowly.
  • Exploding gradients: Gradients become huge, causing unstable updates.

Common fixes include:

  • Activation choice: ReLU and its variants often reduce vanishing gradients.
  • Weight initialisation: Methods like Xavier/Glorot (tanh) and He initialisation (ReLU) keep signal scales healthier.
  • Normalisation: Batch Normalisation can stabilise training by reducing internal covariate shift.
  • Learning rate control: Use schedules, warmup, or adaptive optimisers (Adam) to prevent overshooting.
  • Regularisation: L2 weight decay and dropout help generalisation, especially in small-to-medium datasets.
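The two initialisation schemes in the list above can be sketched directly from their usual formulas. The layer sizes are arbitrary placeholders; `glorot_uniform` and `he_normal` are hypothetical helper names, not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    # Xavier/Glorot: uniform on [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)),
    # which keeps activation variance roughly stable for tanh-like units.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    # He: normal with std = sqrt(2 / fan_in), which compensates for ReLU
    # zeroing out roughly half of its inputs.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W_tanh = glorot_uniform(256, 128)   # for a tanh layer with 256 inputs, 128 outputs
W_relu = he_normal(256, 128)        # for a ReLU layer of the same shape
```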

If you are practising model building in data science classes in Pune, it is also worth learning gradient checking on small networks. Numerically approximating gradients and comparing them to backprop results can confirm your implementation is correct.
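Gradient checking can be sketched as follows: compute the analytic backprop gradient for a single tanh layer with squared-error loss, then compare it to a central finite-difference approximation. The layer size and eps are illustrative assumptions.

```python
import numpy as np

# Minimal gradient-checking sketch for one weight matrix.
rng = np.random.default_rng(2)
W = rng.normal(size=(2, 3))
x = rng.normal(size=(3, 1))
y = np.array([[0.0], [1.0]])

def loss(W):
    a = np.tanh(W @ x)                        # one tanh layer
    return 0.5 * float(np.sum((a - y) ** 2))  # squared-error loss

# Analytic gradient via backprop: delta = (a - y) * (1 - a^2), dL/dW = delta x^T.
a = np.tanh(W @ x)
grad = ((a - y) * (1 - a ** 2)) @ x.T

# Numerical gradient: perturb each weight by +/- eps (central difference).
eps = 1e-6
num_grad = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_grad[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

max_diff = np.abs(grad - num_grad).max()      # should be tiny if backprop is correct
```

If `max_diff` is not very small, the analytic gradient is almost certainly wrong; this check is slow, so it is only practical on small networks.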

Conclusion

Backpropagation is the engine that trains feed-forward neural networks. By applying the chain rule systematically, it computes gradients efficiently for every weight and bias across layers, enabling stable learning through gradient-based updates. Once you understand deltas, activation derivatives, and how gradients flow backward, you can diagnose training issues with more confidence and make better architectural and optimisation choices. For learners taking data science classes in Pune, mastering backpropagation is a strong foundation for deeper topics like optimization tricks, modern architectures, and scalable training workflows.