In this section, we're going to cover some final missing pieces to keep in mind before we start actually coding up our own neural networks. Let's go over the learning goals for this section. We're going to cover some of the details of training neural network models. A lot of this will be review, as we'll go back over stochastic gradient descent, as well as other batching approaches and important terminology. The reason we do this is that once we start to actually implement our neural nets in Python, you'll have to tune each of the different parameters we're going to discuss here. Given the different data points within our dataset, we now know how to compute the derivative for each of our weights, and we went over different options for how to use that derivative to update our weights using different optimizers. Now I want to review how often we should actually go about updating our weights, as this is again something we're going to have to tune when creating our neural net models in Python. What do I mean by how often we need to update our weights? We're going back and reviewing this idea of using all of our dataset, part of our dataset, or maybe even just a single row. In the classical approach, we get the derivative over the entire dataset, and we use that derivative to update our weights. The pro of using the entire dataset is that each step is informed by all the data, but the con is that this tends to be very slow, especially as that dataset grows very large. On the other end of the spectrum, we again have stochastic gradient descent. With stochastic gradient descent, we get the derivative at just a single row, just a single point, and take a step in that direction. This means each one of those individual steps may be less informed, but you ultimately take many more steps as you run through your entire dataset.
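To make the two extremes concrete, here is a minimal sketch (my own toy example, not from the course materials) fitting a single weight to y = 3x with squared loss. The full-batch version averages the derivative over every row before each step, while the stochastic version takes one step per row:

```python
import numpy as np

# Toy data: 100 rows drawn from a standard normal, with y = 3x,
# so the "true" weight the updates should find is 3.0.
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X

def grad(w, xs, ys):
    # Derivative of mean squared error 0.5*(w*x - y)^2 over the given rows.
    return np.mean((w * xs - ys) * xs)

# Full-batch gradient descent: every step uses ALL rows (well informed,
# but each step must touch the whole dataset).
w_full = 0.0
for _ in range(100):
    w_full -= 0.1 * grad(w_full, X, y)

# Stochastic gradient descent: one step per single row (fast, noisier);
# a smaller learning rate helps keep individual missteps small.
w_sgd = 0.0
for i in range(len(X)):
    w_sgd -= 0.05 * grad(w_sgd, X[i:i+1], y[i:i+1])
```

One pass of SGD here makes 100 updates for the same data the full-batch version consumes in a single update, which is the speed-versus-accuracy trade-off in miniature.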
The hope, and the idea, is that being able to quickly take more steps will ultimately balance out any missteps you make along the way. Since you can take a misstep at every iteration, you probably want a smaller step taken each time, so you don't veer too far in the wrong direction. Also, since it won't be fitting perfectly to the entire dataset, this will also help slightly regularize your model. Then we have our compromise: mini-batch gradient descent. Here we get the derivative using just a subset of our dataset, and then take a step in that direction according to the derivative of that subset. The typical mini-batch size tends to be 16 or 32 rows, and you can tune this; roughly speaking, the more rows you choose, the longer each step takes to run. Again, think of stochastic gradient descent as a single row running very quickly; the more rows you have to compute that derivative over, the slower each step becomes. This compromise is meant to strike a balance between the extremes of full-batch gradient descent and stochastic gradient descent. Now, just to hammer this all home, let's visualize each of these approaches in comparison to one another. All the way to the left here we have faster, less accurate steps, and all the way to the right we have slower, more accurate steps. Given everything we just discussed, I want you to think about where stochastic gradient descent will fall, where mini-batch gradient descent will fall, and where full-batch gradient descent will fall. All the way to the left, as I hope you predicted on your own, we have stochastic gradient descent, with faster, less accurate steps; we see the zigzag as it tries to optimize the model. Then on the other end of the spectrum, we have full-batch gradient descent, which takes slower but more accurate steps.
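A mini-batch training loop can be sketched like this (again a toy example of my own, using the same kind of one-weight linear problem, with the batch size of 32 mentioned above):

```python
import numpy as np

# Toy data: 128 rows with y = 3x, so the updates should recover w = 3.0.
rng = np.random.default_rng(1)
X = rng.normal(size=128)
y = 3.0 * X

w = 0.0
lr = 0.1
batch_size = 32  # typical mini-batch size; a hyperparameter you tune

for epoch in range(30):
    # Walk through the dataset one batch at a time; each slice of 32 rows
    # produces one gradient estimate and one weight update.
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        g = np.mean((w * xb - yb) * xb)  # gradient over just this batch
        w -= lr * g
```

With 128 rows and a batch size of 32, each pass through the data takes 4 steps: fewer than SGD's 128, but far more than full batch's single step.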
Then finally, we have our compromise in mini-batch gradient descent, which falls somewhere in the middle: it's not quite as fast as stochastic, but faster than full batch, and it's not quite as accurate as full batch, but it is more accurate than stochastic gradient descent. Now, just to review some batching terminology: full batch uses the entire dataset to compute the gradient before updating; mini-batch uses a smaller portion of the data, but more than the single example you would use in stochastic gradient descent; and stochastic gradient descent uses just a single example to compute the gradient before updating. Something to note as you do some learning on your own: people will sometimes actually use SGD to refer to mini-batch gradient descent. Be aware of that as you start to read literature regarding choosing your batch size. Now, another piece of important terminology is this idea of an epoch, and the number of epochs is going to be one of those hyperparameters you're going to have to tune when you're actually implementing your neural nets in Python. An epoch refers to a single pass through all of the training data. What do I mean by that? If we think about full-batch gradient descent, there will be one step taken at every epoch, because every single step passes through all the data, which is a full epoch. In SGD, in stochastic gradient descent, there are going to be n steps taken per epoch: we take as many steps as there are rows in the dataset every time we run through an epoch, because again, an epoch just means that we have run through the whole dataset. Then with mini-batch, there are going to be n, the number of rows, divided by the batch size steps taken per epoch. If you think about the dataset being 360 rows and we say a batch size of 36, we will take 10 steps at every single epoch.
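The steps-per-epoch arithmetic above can be captured in a tiny helper (a hypothetical function name I'm introducing for illustration, assuming the row count divides evenly by the batch size):

```python
def steps_per_epoch(n_rows, batch_size):
    # One epoch = one full pass through the data, so the number of
    # weight updates per epoch is rows divided by rows-per-step.
    # Assumes n_rows is evenly divisible by batch_size.
    return n_rows // batch_size

# Full batch: batch size equals the dataset size -> 1 step per epoch.
# SGD: batch size of 1 -> n steps per epoch.
# Lecture example: 360 rows with a batch size of 36 -> 10 steps per epoch.
print(steps_per_epoch(360, 360))  # 1
print(steps_per_epoch(360, 1))    # 360
print(steps_per_epoch(360, 36))   # 10
```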
When training, we often refer to the number of epochs needed for the model to be trained, and that's going to be an important hyperparameter we tune as we create our own neural net models in Python. That closes out this video. In the next video, we're going to discuss another piece of terminology worth understanding, namely data shuffling.