We train neural network models to get better as they see more data. Usually, getting better means minimizing a loss function. As you saw earlier, to achieve this you iteratively compute the gradient of the loss with respect to the parameters and then update the parameters using this gradient. While gradient calculation for simple functions is straightforward, for complex models working it out by hand can be incredibly tedious. This is where automatic differentiation comes in: we can use the autograd package from MXNet to calculate gradients of arbitrary functions.

First, let's import the MXNet framework and the autograd library from MXNet. We will also import the NDArray library. As you learned in previous sections, the NDArray module is MXNet's way of representing multidimensional arrays.

As a toy example, let's say we're interested in differentiating the function f(x) = 2x² with respect to the parameter x. We can start by assigning an initial value to x. Let's create a two-by-two array that consists simply of the numbers 1, 2, 3, and 4.

When computing the gradient of f(x) with respect to x, you will need a place to store and retrieve the gradient. For a particular NDArray, you call attach_grad to signal that you're going to compute the gradient of a function with respect to that NDArray. In this example, we're calling attach_grad on x because we want to compute the gradient of f with respect to x. We also need to allocate space to store the gradient after it's computed, and attach_grad takes care of this as well.

Now, let's evaluate the function y = f(x). First, we need to define the function. Here I have written a Python function that computes f(x) = 2x². As long as the argument x is passed in as an NDArray and we keep the objects as NDArrays inside the function, autograd will be able to compute the gradient. Now that we have the function, we can evaluate it.
To tell MXNet to record the function evaluation so that we can compute the gradients later, we need to put the evaluation inside an autograd.record scope. As you can see, the contents of y are the result of evaluating 2x². Taking the bottom-left input value of 3, the corresponding y value is 2 × 3² = 18.

Now, to compute the gradient you call the backward function on y. This invokes backpropagation on y and computes the gradient of y with respect to all of its upstream computational dependencies, storing the gradient in the space allocated by attach_grad. We can then verify that the computed gradient is correct. The analytical derivative of y = f(x) = 2x² is 4x. This means that when x is the multidimensional NDArray [1, 2, 3, 4], the derivative should be [4, 8, 12, 16]. The gradient computed for an NDArray that you called attach_grad on is stored in the grad property of x.

Sometimes it's necessary to write dynamic programs where the execution flow depends on values computed in real time as part of the execution. Typically, you use Python control-flow constructs like if statements and for and while loops to create a dynamic flow that depends on the data. For these kinds of functions, MXNet will record the execution trace and compute the gradient as well.

Consider the following function f. It takes an input vector of size two, with each element drawn randomly from the uniform distribution on negative one to one. f doubles the input vector until its norm, or length, reaches 1,000. Then it selects one element depending on the sum of the vector's elements: if the sum is positive it returns the first element, and if it's negative it returns the second. Here's how to implement that in Python using a while loop and an if statement: we multiply x by two in a loop until the norm is greater than or equal to 1,000, then pick index 0 if the sum is positive and index 1 if it's negative.
Here's a plot showing visually what the function does to some randomly initialized vectors. We see that the norm of the vector, the length of the line from the origin to the point, keeps increasing until it exceeds the circle of radius 1,000 around the origin.

Now, let's compute the gradient. First, we initialize x with two numbers drawn randomly from a uniform distribution between minus one and one. Then you attach grad to x and record the trace of the function evaluation before calling backward to compute the gradients.

How do we evaluate these gradients? Breaking it down, we know that y is a linear function of one of the elements of x. If we represent the coefficient of that linear function as k, then the gradient with respect to x will be either [k, 0] or [0, k], depending on which element of x was picked. We also know what k is: it is 2 raised to the power of the number of times we doubled x in the loop. The draw from the random uniform distribution gives us x, and the value of x.grad is consistent with this analytical reasoning.