In the last video, we learned how gradient descent works for the case of a single neuron. Then we wondered how gradient descent should work for feedforward neural networks that have many layers. If we use such networks, we need to train their adjustable parameters. But large networks may have many layers of neurons, with hundreds of thousands or even millions of parameters. To make gradient descent work in practice in such settings, we need some efficient numerical method for computing that many gradients. Such an algorithm, called back-propagation, which allows gradient descent to work efficiently with large neural networks, was suggested in 1986 in a groundbreaking paper by Rumelhart, Hinton, and Williams. As its name suggests, backpropagation works backwards from the outputs to the inputs, applying the chain rule of derivatives recursively. This may already sound familiar to you from our previous video about TensorFlow, and how it implements reverse-mode autodiff for automatic calculation of derivatives of arbitrary functions. And if it does, it does so for the right reason, because backpropagation is exactly gradient descent where all derivatives are computed using the reverse-mode autodiff method. To see how it works in detail, let's recall the workings of reverse-mode autodiff in TensorFlow. In our video on TensorFlow, we saw how it works on a simple example of a function of two variables, x and y. The main idea there was the combination of a forward and a backward pass, and reliance on the chain rule for the calculation of derivatives. Now, let's see how essentially the same method works to calculate the gradients of a neural network with respect to all of its parameters. Assume that we minimize a mean squared loss over a training set, as we did for linear regression. But this time, the prediction y-hat of w is given by the output f_N of the final node of some neural network.
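To make the forward-and-backward-pass idea concrete, here is a minimal sketch of reverse-mode autodiff in plain Python. The function and the values are illustrative (not the example from the earlier video): we evaluate f(x, y) = (x + y) * y at x = 2, y = 3 in a forward pass, then apply the chain rule from the output back to the inputs.

```python
def forward_backward(x, y):
    # Forward pass: evaluate the function, recording intermediate values.
    a = x + y              # intermediate node a = x + y
    f = a * y              # output node f = a * y

    # Backward pass: apply the chain rule from the output to the inputs.
    df_df = 1.0                        # d f / d f
    df_da = df_df * y                  # d f / d a = y
    df_dy = df_df * a + df_da * 1.0    # y enters both f = a*y and a = x+y
    df_dx = df_da * 1.0                # d f / d x = (d f / d a) * (d a / d x)
    return f, df_dx, df_dy

f, gx, gy = forward_backward(2.0, 3.0)
# Analytic check: f = (x + y) * y, so df/dx = y and df/dy = x + 2y.
```

Note that one backward pass yields the derivatives with respect to all inputs at once, which is exactly why this scheme scales to networks with millions of parameters.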
For example, we might have a neural network with two hidden layers of this type. We can schematically write the output f_N of such a neural network as a compound function, as shown here. The best way to understand this formula is from the outputs to the inputs. First, the function f_N depends on its parameters W_N and on its inputs. But the inputs to f_N are given by the outputs of the previous layer, which is denoted f sub N minus one here. This function depends on its own vector of parameters W sub N minus one, and so on. Now, let's see how the chain rule works backwards from the top of the network to compute all derivatives recursively. Let's start with the last node, f_N. There are two weights, W sub N one and W sub N two, that enter the final node f_N of this network. Computing the gradient with respect to either weight W sub Ni, where i equals one or two, is a straightforward application of the chain rule. So the derivative of the loss function with respect to W sub Ni is given by the product of the derivative of the loss with respect to the output f_N of the last node, times the derivative of the last node with respect to the weight W sub Ni. Let's write it as a product of delta_N times the derivative of f_N with respect to the input weight W sub Ni, where delta_N is the derivative of the loss function with respect to the output of node f_N. In this expression, the term delta_N depends on the loss observed for this set of parameters, while the second factor does not depend on the loss, but only on the form of the function f_N and the weights used there. Now, let's continue going backwards and consider the last hidden layer, L sub N minus one. For derivatives with respect to the weights W sub N minus one that specify the functions f sub N minus one here, we again apply the chain rule to express them as a product of three derivatives.
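As a hedged sketch of the last-layer rule just described (the names and the linear output node are illustrative assumptions, not the network from the slide): take a squared loss L = (f_N - y)^2 with a linear output node f_N = w1*h1 + w2*h2, so that delta_N = dL/df_N = 2*(f_N - y) and dL/dw_i = delta_N * h_i.

```python
def last_layer_grads(w, h, y):
    # w: the two weights entering the output node; h: its two inputs
    # (the outputs of the previous layer); y: the target value.
    f_N = w[0] * h[0] + w[1] * h[1]   # output of the last node
    delta_N = 2.0 * (f_N - y)         # dL/df_N for L = (f_N - y)^2
    # Chain rule: dL/dw_i = delta_N * df_N/dw_i, and df_N/dw_i = h_i here.
    return [delta_N * h[0], delta_N * h[1]]
```

As in the lecture, delta_N carries all the loss dependence, while the factors h_i depend only on the form of f_N.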
First, we take the derivative of the loss with respect to f_N, then the derivative of f_N with respect to f sub N minus one comma i, and then finally the derivative of f sub N minus one with respect to the weights W sub N minus one, for i equal to one or two. But let's note a nice trick here. Let's denote the product of the derivative of the loss with respect to f_N and the derivative of f_N with respect to f sub N minus one as delta sub N minus one comma i. As we have already computed delta sub N, we can easily evaluate the expression for delta sub N minus one. Now, note that by introducing delta sub N minus one as we just did, the derivatives of the loss with respect to the weights W sub N minus one have exactly the same form as in the previous expression, with the only difference that now we have the index N minus one instead of N. So, instead of the error delta_N at the last layer, we have a new error delta sub N minus one, made of delta_N and the derivative of f_N with respect to f sub N minus one. In other words, the error delta_N has backpropagated from the last layer L_N to level N minus one, and became delta sub N minus one. Otherwise, the calculation of gradients of the final loss function with respect to the weights of the last hidden layer is the same as for the very last layer. Now, we are already starting to see a pattern here. Let's see how we can continue such a recursive calculation for the layer L sub N minus two. If we want to calculate the derivative of the loss function with respect to the weight W sub N minus two comma two, we have to include both nodes at the higher level L sub N minus one. These are the nodes shown here. So the derivative will include a sum of two terms, each having a product of four derivatives in it. But we already know that this product is equal to delta sub N minus one. So, the whole expression can again be written in a form very similar to the previous expression, as a product of delta sub N minus two times this derivative.
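The recursion delta_{k-1} = delta_k * df_k/df_{k-1} can be shown in the simplest possible setting: a scalar chain of nodes f_k = w_k * f_{k-1} (an illustrative toy network, assumed here for clarity, with one node per layer so the sums over nodes collapse to single terms).

```python
def chain_backprop(ws, x, y):
    # Forward pass through a scalar chain f_k = w_k * f_{k-1}, f_0 = x.
    fs = [x]
    for w in ws:
        fs.append(w * fs[-1])

    # Backward pass: propagate the error delta_k = dL/df_k recursively.
    delta = 2.0 * (fs[-1] - y)        # delta_N for L = (f_N - y)^2
    grads = [0.0] * len(ws)
    for k in range(len(ws) - 1, -1, -1):
        grads[k] = delta * fs[k]      # dL/dw_k = delta_k * df_k/dw_k
        delta = delta * ws[k]         # delta_{k-1} = delta_k * df_k/df_{k-1}
    return grads
```

One forward pass and one backward pass produce the gradients for every layer, which is the pattern the lecture is building up to.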
Again, the error for this layer is given by a linear combination of the errors from the previous layer times those derivatives. This calculation can be continued for all layers and all weights of a neural network. This produces a fast recursive calculation of all derivatives with respect to all weights. Once they are all calculated, we can use the gradients to perform one step of gradient descent. For the next step, the whole procedure is repeated again until convergence. So, we saw in this video that backpropagation of the training errors provides the most practical approach to using gradient descent with neural networks that might have a very large number of weight parameters. Now, if we think of a software implementation of backpropagation in terms of a computational graph, made of nodes representing the weights and activation functions of a neural network, then we immediately realize that in TensorFlow it's actually available to us via TensorFlow's autodiff functionality. Now, we are almost ready to start playing with neural nets, gradient descent, and backpropagation, all in TensorFlow. Only one step remains, which is to see how it works for real-world datasets that tend to be large. It turns out that a version of the gradient descent method called stochastic gradient descent is best suited for such tasks. Let's see how it works in the next video.
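The loop just described — compute all gradients, take one gradient-descent step, repeat until convergence — can be sketched as follows. This is a deliberately minimal, illustrative setup (a one-parameter "network" f(w) = w*x fit to a single target), not the TensorFlow implementation discussed in the video.

```python
def train(x, y, w=0.0, lr=0.1, steps=100):
    # Repeat: backpropagate to get the gradient, then step downhill.
    for _ in range(steps):
        f = w * x                   # forward pass
        grad = 2.0 * (f - y) * x    # backward pass: dL/dw for L = (f - y)^2
        w -= lr * grad              # one gradient-descent step
    return w

w_fit = train(1.0, 3.0)  # w converges toward the target-matching value 3.0
```

In TensorFlow, the gradient line would instead be produced automatically by its autodiff machinery over the computational graph, and stochastic gradient descent would replace the full-batch update for large datasets.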