In this video, you will learn how neural networks can be trained to fit a mapping function from training data to labels, and why we need gradients and automatic differentiation to achieve this. By the end of the video, you will understand the general procedure for training neural networks, which involves an algorithmic technique called backpropagation.

Neural network training consists of repeating three simple steps over and over until the parameters of the model converge to a good solution. The first step is known as the forward step, which involves passing the data through the model to obtain the output, also called the model prediction. This is depicted in the diagram as the purple nodes. The model prediction is then compared to the ground-truth label via a loss function, which gives an estimate of how well the model is performing at the current task. The model loss is what we want to minimize during neural network training.

The next step is the backward step. This involves computing the gradient of the loss function with respect to the parameters of the network. Typically, this is done in reverse: the gradient of the loss with respect to the last layer's parameters is computed first, and then the upstream gradients are computed recursively using the chain rule. We will see very soon that computing the gradients this way is efficient and is what makes training deep neural networks computationally tractable.

The final step is the optimization step. In this step, we use the gradients computed in the backward step to update the model parameters in a way that improves the model and minimizes the loss. Repeating these steps several times causes the model parameters to evolve towards a good solution during training.

So what exactly are gradients, and why are they necessary? The gradient of a function at a given point is the value of the first derivative of the function at that point.
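The three steps can be sketched in plain Python for a one-parameter model (this is an illustrative sketch, not Gluon code; the model `y_hat = w * x`, the squared-error loss, the data values, and the learning rate are all assumptions made for the example):

```python
# Minimal sketch of the forward / backward / optimization loop
# for a one-parameter model y_hat = w * x with squared-error loss.
def train_step(w, x, y, lr=0.1):
    # Forward step: pass the data through the model to get the prediction.
    y_hat = w * x
    # Compare the prediction to the ground-truth label via the loss function.
    loss = (y_hat - y) ** 2
    # Backward step: gradient of the loss with respect to w
    # (derived analytically here; a framework would automate this).
    grad_w = 2 * (y_hat - y) * x
    # Optimization step: move w against the gradient to reduce the loss.
    w = w - lr * grad_w
    return w, loss

w = 0.0
for _ in range(50):
    w, loss = train_step(w, x=1.0, y=3.0)
# Repeating the three steps makes w converge toward the solution w = 3.0.
```

Each pass through `train_step` is one iteration of the forward, backward, and optimization steps described above.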
Visually, you can think of the gradient as the slope of a tangent line to the function at that point. For a function that depends on more than one parameter, you can express the gradient of the function with respect to each parameter as a vector. The gradient of the function at a point gives the direction of steepest increase, which means that the negative gradient gives the direction of steepest decrease. Therefore, by following the direction of the negative gradient, you eventually arrive at a local minimum of the function. This procedure is known as gradient descent, and was introduced by the French mathematician Augustin-Louis Cauchy in the middle of the 19th century.

So how exactly do we compute the gradient? For simple functions, you can compute gradients analytically by taking advantage of rules from differential calculus, such as the product rule and the quotient rule. However, for complex functions like the loss functions of deep convolutional neural networks, which are the result of long chains of operations, computing the gradients analytically in one go is not feasible, so we turn to automatic differentiation.

Automatic differentiation is a way to numerically compute the derivatives of a function at a point. It takes advantage of the fact that any function that can be expressed as a computer program will execute a sequence of arithmetic operations and simple functions with known derivatives. Typically, automatic differentiation proceeds by recording these operations to build a computational graph of the function. Then, it repeatedly applies the chain rule to the operations in the computational graph to compute the derivatives.

How do we use the gradients computed by automatic differentiation? That's where backpropagation comes in. Backpropagation is an efficient algorithm for training neural networks that uses automatic differentiation and dynamic programming to update the network parameters. In fact, we have already seen backpropagation in action.
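To make this concrete, here is a toy reverse-mode automatic differentiation sketch in plain Python (the `Value` class and its methods are hypothetical, not a real library API): each operation records its inputs and local derivatives in a computational graph, and `backward` then applies the chain rule to the recorded operations in reverse.

```python
class Value:
    """A scalar that records the operations used to compute it."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0                 # accumulated d(output)/d(self)

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream gradient times local derivative.
        self.grad += upstream
        for parent, local in zip(self.parents, self.local_grads):
            parent.backward(upstream * local)

x = Value(2.0)
y = Value(3.0)
z = x * y + x       # z = x*y + x, so dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
```

This naive recursion re-traverses shared subgraphs, which is why real implementations combine it with dynamic programming, as backpropagation does.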
The three-step iterative process for training neural networks that was introduced earlier is informed by the backpropagation algorithm. Let's work through an example for a simple model with a single hidden layer and a few parameters. For backpropagation to be efficient, we store the intermediate outputs of the hidden layers during the forward pass, as well as the value of the loss function. In the backward pass, we use the chain rule to compute the gradients of the last layer first, working backwards from the loss to the input layer. During the backward pass, we use the stored intermediate outputs and the gradients of later layers to compute the gradients of earlier layers. Once we obtain the gradients for all parameters, the gradients are then used to improve the parameter values.

Let's summarize the key points. You have seen that, at a high level, training a neural network consists of iteratively performing three steps: the forward step, the backward step, and lastly the optimization step. Specifically, you saw that to improve the parameter values and minimize the loss, you take a step in the direction of the negative gradient, a process known as gradient descent. Additionally, you saw that you can use automatic differentiation to compute the gradients of arbitrary functions, so that you can optimize them with gradient descent, simply by recording the computational graph and applying the chain rule. Finally, you saw that the efficient procedure for doing this is called the backpropagation algorithm. Next, you'll see how automatic differentiation and backpropagation are implemented in Gluon with the autograd module.
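Before moving to Gluon, the single-hidden-layer example can be sketched with hand-derived chain-rule gradients (the network shape, the `tanh` activation, and the parameter names `w1`, `w2` are illustrative assumptions, not the exact model from the diagram):

```python
import math

# One-hidden-layer network: h = tanh(w1 * x), y_hat = w2 * h,
# loss = (y_hat - y)^2. Backpropagation stores h in the forward
# pass and reuses it, plus downstream gradients, in the backward pass.
def forward_backward(w1, w2, x, y):
    # Forward pass: store the intermediate hidden output h.
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    # Backward pass: last layer first.
    dloss_dyhat = 2 * (y_hat - y)
    grad_w2 = dloss_dyhat * h            # reuses stored h
    dloss_dh = dloss_dyhat * w2          # downstream gradient, reused below
    grad_w1 = dloss_dh * (1 - h ** 2) * x  # tanh'(a) = 1 - tanh(a)^2
    return loss, grad_w1, grad_w2
```

The returned gradients would then feed the optimization step, exactly as in the three-step loop described at the start of the video.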