In this video, we're going to focus on the idea of oversampling. We've seen this concept before: the goal is to increase the size of our minority class so that it's similar in size to our majority class.

What we touched on in the last videos basically comes down to random oversampling. This is the simplest approach: we just randomly resample, with replacement, rows from our minority class. There's no concern here for where these points lie in space, or for whether certain points are more or less indicative of the actual minority-class cluster. Random oversampling will work best for categorical data, where distance between samples may not carry as much underlying similarity; with categorical data, we don't have to worry about whether there's a cluster, or about defining that cluster correctly. We'll look at a short code sketch of this in a moment.

Another approach is to synthetically oversample. This time, we're actually creating new samples of the minority class that don't yet exist. The first step is to start with one of the points in the minority class. We then choose one of its K nearest neighbors, just one to start. We see here that we have chosen the line between x_i and x_zi, where x_i is our minority-class point and x_zi is one of those K nearest neighbors. Then we create a new point randomly along the line connecting the two points: this x_new can lie anywhere between x_i and x_zi. We then repeat this for each of the neighbors we've set out; if the number of neighbors is equal to three, we do this for the three nearest neighbors. This interpolation step is also sketched in code below.

Now, there are two main approaches to synthetic oversampling, both with K nearest neighbors as their foundation: one is SMOTE and the other is ADASYN. We'll start here with SMOTE.

SMOTE is short for Synthetic Minority Oversampling Technique, and it can be broken down into different variants. The first is regular SMOTE, where we connect the minority-class points to any of their nearest minority-class neighbors, with no restriction on which points get used. We then use those connecting lines we just discussed to randomly generate our new points, somewhere along those lines.

We then have borderline SMOTE, where we first have to classify our minority points as outliers, safe, or in-danger. Outliers refer to points where all neighbors are from a different class: we start with a minority-class point and look at, say, three neighbors, and if all three of those neighbors are from a different class, that point is an outlier. Safe refers to points for which all neighbors are from the same class: all three of the nearby points are minority class. And in-danger points are those where at least half of the nearest neighbors are from the same class, but not all of them: two out of our three, say.

We can break borderline SMOTE down into two types. With borderline-1, we connect minority in-danger points only to minority points, and then use those connections to generate our new samples. With borderline-2, we connect minority in-danger points to whatever is nearby, no matter what the class is. Then finally, we have SVM SMOTE, which uses an underlying support vector machine classifier to find support vectors, and then generates new samples with those support vectors in mind. For both borderline and SVM SMOTE, the neighborhood is defined using the parameter m_neighbors, which sets the number of neighbors used to decide whether a sample is in-danger, safe, or an outlier.
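Here's the random oversampling sketch mentioned earlier. This is a minimal sketch, assuming the imbalanced-learn library's RandomOverSampler and a toy dataset built with scikit-learn; the video doesn't name a specific implementation, so these are just one common way to do it:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced dataset: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))  # e.g. Counter({0: 897, 1: 103})

# Randomly resample rows from the minority class, with replacement,
# until the two classes are the same size.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))  # e.g. Counter({0: 897, 1: 897})
```

Because the resampled rows are exact duplicates of existing rows, no notion of distance is involved, which is why this approach is fine for categorical features.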
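The interpolation step for synthetic oversampling can also be written out directly. A minimal NumPy sketch of the standard SMOTE formula x_new = x_i + lambda * (x_zi - x_i), with lambda drawn uniformly from [0, 1]; the helper function name here is my own, for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def synthesize_point(x_i, x_zi):
    # Draw lambda uniformly from [0, 1] so the new point lands
    # somewhere on the segment connecting x_i and x_zi.
    lam = rng.uniform(0.0, 1.0)
    return x_i + lam * (x_zi - x_i)

x_i = np.array([1.0, 2.0])    # a minority-class point
x_zi = np.array([2.0, 4.0])   # one of its K nearest neighbors
x_new = synthesize_point(x_i, x_zi)
print(x_new)  # lies somewhere between x_i and x_zi
```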
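In code, the SMOTE variants differ mainly in which sampler you instantiate. A minimal sketch, assuming imbalanced-learn's SMOTE, BorderlineSMOTE, and SVMSMOTE classes and reusing the X, y toy data from the random oversampling sketch above:

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE

samplers = {
    # Regular SMOTE: interpolate between minority points and their
    # nearest minority-class neighbors (k_neighbors of them).
    "SMOTE": SMOTE(k_neighbors=5, random_state=42),
    # Borderline-1: only in-danger minority points generate samples,
    # and they connect only to minority neighbors.
    "Borderline-1": BorderlineSMOTE(kind="borderline-1", m_neighbors=10, random_state=42),
    # Borderline-2: in-danger points may connect to neighbors of any class.
    "Borderline-2": BorderlineSMOTE(kind="borderline-2", m_neighbors=10, random_state=42),
    # SVM SMOTE: the support vectors of an underlying SVM guide
    # where new samples are generated.
    "SVM SMOTE": SVMSMOTE(m_neighbors=10, random_state=42),
}

# m_neighbors sets the neighborhood used to classify minority points as
# safe, in-danger, or outliers; k_neighbors drives the interpolation itself.
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, X_res.shape)
```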
Now, ADASYN, or Adaptive Synthetic sampling, works very similarly to SMOTE. It again starts by looking at the classes in the neighborhood of each minority point. However, the number of samples generated for each point is proportional to the number of samples in that neighborhood which are not from the same class as that point. With ADASYN, then, more samples will be generated in the areas where the nearest neighbors rule is not respected, putting more weight on values that would originally have been misclassified. A short sketch comparing ADASYN to SMOTE follows below.

Now, all of these techniques are motivated by K nearest neighbors, but they will help with any classification problem for which class balance is an issue. We discussed random oversampling, the different versions of SMOTE, as well as ADASYN oversampling. In our next video, we will go over the different methods available for undersampling. All right, I'll see you there.
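And here's the ADASYN sketch mentioned above, again assuming imbalanced-learn and reusing the X, y toy data from the first sketch. Both samplers balance the classes, but ADASYN decides per point how many samples to generate, so its final counts are typically only approximately balanced:

```python
from collections import Counter

from imblearn.over_sampling import ADASYN, SMOTE

for sampler in (SMOTE(random_state=42), ADASYN(random_state=42)):
    # ADASYN generates more synthetic points for minority samples whose
    # neighborhoods are dominated by the majority class, i.e. where the
    # nearest neighbors rule is not respected.
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```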