Backpropagation Refresher

Tags: Machine Learning, Notes

Sometimes when I write reports, I end up summarizing backpropagation, which spurs me to derive it again for myself. Backpropagation isn’t particularly complicated mathematically, but if you’re not super comfortable applying the chain rule until your arm falls off, or combining calculus and linear algebra, it might feel a bit involved. Especially when my calculus is rusty, I occasionally go down some less sensible routes and take a while to get there – hopefully, writing this down will prevent me going the wrong way again. This is something of a crash-course in the derivation, without any of the diagrams or other niceties necessary to make it even slightly intelligible.

The problem

Starting with a loss function $$C$$ defined with respect to an output $$\boldsymbol{h}^L$$ from a neural network with $$L$$ layers, comprising example activation functions $$\sigma(\boldsymbol{z}^l)$$ where $$\boldsymbol{z}^l = W^l \boldsymbol{h}^{l-1} + \boldsymbol{b}^l$$ with weight matrices $$W^l$$ and biases $$b^l$$, the error of the network is given by:

$\text{loss}=C(\boldsymbol{y},\sigma(W^L\sigma(W^{L-1}\sigma(\ldots)+\boldsymbol{b}^{L-1})+\boldsymbol{b}^L))$

We want to change the weights in order to minimise the loss. To do this we need to work out the change in the error with respect to the weights and biases – then we can say the appropriate update to the weights $$\Delta W^l_{i,j} \propto - \frac{\partial{}C}{\partial W^l_{i,j}{}}$$ for weight $$(i,j)$$ in layer $$l$$, and similarly for biases.

The wrong way

When my calculus is rusty, the seemingly safest way to proceed is to simply attack, chain-rule in hand, all the way from beginning to end. While explicit, the problems with this approach are that it masks some insight into what’s going on, it’s not obvious how to convert it into matrix form, and you may drown in the index sea before making it to shore.

Assuming $$\frac{\partial{}C}{\partial\boldsymbol{h}^L{}}$$ and $$\left.\frac{\partial{}\sigma(\boldsymbol{z})}{\partial\boldsymbol{z}{}}\right|^{\boldsymbol{z}=\boldsymbol{z}'}=\sigma'(\boldsymbol{z'})$$ are easily evaluated, we can thoughtlessly jump straight to deriving with respect to one of the weights (this is what makes it bad). We’ll pick $$l=L-2$$ to capture the recursive aspect.

\begin{align} \frac{\partial{}C}{\partial W^{L-2}_{i,j}{}} &= \sum_m\frac{\partial{}C}{\partial h_m^L{}}\frac{\partial{}h_m^L}{\partial W^{L-2}_{i,j}{}},\\ &= \sum_m\frac{\partial{}C}{\partial h_m^L{}}\sigma'(z^L)_m\frac{\partial{}z_m^L}{\partial W^{L-2}_{i,j}{}}.\\ \frac{\partial{}z_m^L}{\partial W^{L-2}_{i,j}{}}&=\sum_nW^L_{m,n}\frac{\partial{}h^{L-1}_n}{\partial W^{L-2}_{i,j}{}},\\ &=\sum_nW^L_{m,n}\sigma'(z^{L-1})_n\frac{\partial{}z^{L-1}_n}{\partial W^{L-2}_{i,j}{}}. \end{align}

At this point we can see the recursive tail $$\left(\frac{\partial{}z^l_x}{\partial W^{L-2}_{i,j}{}}\right)$$, but this time we keep on expanding it to evaluate the differential and bind indices $$p, q$$.

\begin{align} \frac{\partial{}z^{L-1}_n}{\partial W^{L-2}_{i,j}{}}&=\sum_pW^{L-1}_{n,p}\sigma'(z^{L-2})_p\frac{\partial{}z^{L-2}_n}{\partial W^{L-2}_{i,j}{}},\\ &=\sum_pW^{L-1}_{n,p}\sigma'(z^{L-2})_p\frac{\partial{}}{\partial W^{L-2}_{i,j}{}}\left(\sum_q W^{L-2}_{p,q}h_q^{L-3} + b_p^{L-2} \right),\\ &=W^{L-1}_{n,i}\sigma'(z^{L-2})_ih_j^{L-3}, \end{align}

where the last line follows from

$\frac{\partial{}W^{L-2}_{p,q}}{\partial W^{L-2}_{i,j}{}} =\begin{cases} 1 &\text{if }p,q=i,j \\ 0 &\text{otherwise} \end{cases},$ allowing us to bind $$p,q$$ indices to $$i,j$$.

Now we can write the whole equation for the delta matrix in one go:

$\frac{\partial{}C}{\partial W^{L-2}_{i,j}{}} = \sum_m\frac{\partial{}C}{\partial h_m^L{}}\sigma'(z^L)_m \sum_nW^L_{m,n}\sigma'(z^{L-1})_n W^{L-1}_{n,i}\sigma'(z^{L-2})_ih_j^{L-3}.$

Enter the Matrix

Not the prettiest, but it’s correct at least - you could write a nested for-loop to read off the matrix values. However, for performance, we want to rewrite this in matrix form. We might have a notion that we want to transpose our weight matrices, but the justification doesn’t pop out from the current dependent sequence of scalar products. Instead, we want to see the transposed scalar product, $$\sum_i W_{i,j} x_i$$. To do this we will use the identity

$\sum_ia_i\sum_jb_j=\sum_{\substack{i\\j}} a_ib_j$

and rearrange.

\begin{align} \frac{\partial{}C}{\partial W^{L-2} _ {i,j}{}} &= \color{red}{\sum _ m\frac{\partial{}C}{\partial h_m^L{}}\sigma'(z^L) _ m} \color{blue}{\sum_n W^L _ {m,n}\sigma'(z^{L-1})_n} \color{green}{W^{L-1}_{n,i}\sigma'(z^{L-2})_i}h_j^{L-3},\\ &= h_j^{L-3}\sum _ {\substack{m\\n}} \color{red}{\frac{\partial{}C}{\partial h _ m^L{}}\sigma'(z^L) _ m} \color{blue}{W^L _ {m,n}\sigma'(z^{L-1}) _ n} \color{green}{W^{L-1} _ {n,i}\sigma'(z^{L-2}) _ i},\\ &= h_j^{L-3} \sum _ {\substack{m\\n}} \color{green}{W^{L-1} _ {n,i}} \color{blue}{W^L _ {m,n}} \color{red}{\frac{\partial{}C}{\partial h_m^L{}}\sigma'(z^L) _ m} \color{blue}{\sigma'(z^{L-1}) _ n} \color{green}{\sigma'(z^{L-2}) _ i},\\ &= h_j^{L-3} \color{green}{\sum_nW^{L-1} _ {n,i}} \color{blue}{\sum_m W^L _ {m,n}} \color{red}{\frac{\partial{}C}{\partial h_m^L{}}\sigma'(z^L) _ m} \color{blue}{\sigma'(z^{L-1}) _ n} \color{green}{\sigma'(z^{L-2}) _ i}. \end{align}

We’ve essentially inverted the network application, making our new input the derivative of the output error, and multiplying it by the tranposes of the weight matrices and the derivative of the activation before it.

Now we can put this into explicit matrix form. Starting with the final layer,

$\begin{gather} \color{red}{\nabla C \circ \sigma'(\boldsymbol{z}^L)}\\ \color{blue}{({{W^L}}^T} \color{red}{(\nabla C \circ \sigma'(\boldsymbol{z}^L))} \color{blue}{) \circ \sigma'(\boldsymbol{z}^{L-1})}\\ \color{green}{({{W^{L-1}}}^T} \color{blue}{({{W^L}}^T} \color{red}{(\nabla C \circ \sigma'(\boldsymbol{z}^L))} \color{blue}{) \circ \sigma'(\boldsymbol{z}^{L-1}))} \color{green}{\circ \sigma'(\boldsymbol{z}^{L-2})}\\ \color{green}{({{W^{L-1}}}^T} \color{blue}{({{W^L}}^T} \color{red}{(\nabla C \circ \sigma'(\boldsymbol{z}^L))} \color{blue}{) \circ \sigma'(\boldsymbol{z}^{L-1}))} \color{green}{\circ \sigma'(\boldsymbol{z}^{L-2})} {\boldsymbol{h}^{L-3}}^T\\ \end{gather}$

And now it becomes obvious how to break this up:

\begin{align} D^L &= \nabla C \circ \sigma'(\boldsymbol{z}^L),\\ D^l &= ({W^{l+1}}^TD^{l+1})\circ \sigma'(\boldsymbol{z}^l),\\ \frac{\partial{}C}{\partial W^l{}} &= D^{l} {\boldsymbol{h}^{l-1}}^T ,\\ \frac{\partial{}C}{\partial\boldsymbol{b}^l{}} &= D^{l}. \end{align}

Slightly nicer way

Instead of doing the whole derivative in one go, it’s helpful to break it into smaller chunks as we go, and work with vectors wherever obvious (Any derivative with respect to a tensor has the same dimensions as that tensor). So we have

\begin{align} \frac{\partial{}C}{\partial\boldsymbol{h}^L{}} &= \nabla C,\\ \frac{\partial{}C}{\partial\boldsymbol{z}^L{}} &= \nabla C \circ \sigma'(\boldsymbol{z}^L).\\ \end{align}

and then we sojourn into index notation,

\begin{align} \frac{\partial{}z_i^L}{\partial z_j^{L-1}{}} &= W^L_{i,j}\sigma'\left(z^{L-1}\right)_j,\\ \frac{\partial{}C}{\partial z^{L-1}_j{}} &= \sum_i \left(\frac{\partial{}C}{\partial\boldsymbol{z}^L{}}\right)_i \frac{\partial{}z_i^L}{\partial z_j^{L-1}{}},\\ &= \sum_i \left(\frac{\partial{}C}{\partial\boldsymbol{z}^L{}}\right) _ i W^L_{i,j}\sigma'\left(z^{L-1}\right)_j. \end{align} From this we get our expected transpose, \begin{align} \frac{\partial{}C}{\partial\boldsymbol{z}^{L-1}{}} &= \left({W^L}^T \frac{\partial{}C}{\partial\boldsymbol{z}^L{}}\right) \circ \sigma'\left(z^{L-1}\right),\\ &= \left({W^L}^T \left(\nabla C \circ \sigma'(\boldsymbol{z}^L)\right)\right) \circ \sigma'\left(z^{L-1}\right), \end{align}

and we can say

\begin{align} D^L &= \nabla C \circ \sigma'(\boldsymbol{z}^L),\\ D^l &= ({W^{l+1}}^TD^{l+1})\circ \sigma'(\boldsymbol{z}^l),\\ \frac{\partial{}C}{\partial W^l{}} &= D^{l} {\boldsymbol{h}^{l-1}}^T,\\ \frac{\partial{}C}{\partial\boldsymbol{b}^l{}} &= D^{l}, \end{align}

as before. Much neater!