Now we introduce another means of regularization for linear regression, namely lasso regression. The only difference between this and ridge regression is how we penalize the cost function using our coefficients. With ridge, or L2, we use the coefficients squared, and with lasso we'll be using the absolute value of each one of those coefficients. What we have here is built off the different norms for vector length, or in other words, just two different ways of measuring the magnitude of each one of the coefficients. So how can we compare and contrast lasso and ridge? The lasso penalty is directly proportional to the absolute value of the coefficients, rather than with ridge, where it's proportional to the square of the coefficients, so it won't be skewed as much by outlying coefficient values. Now lasso stands for least absolute shrinkage and selection operator, essentially using that absolute value to penalize our coefficients. Similar to ridge, this will work by giving the user a means to reduce complexity: an increase in lambda will again raise the bias but lower our variance, or lower our complexity. And lasso is more likely than ridge to perform feature selection. In other words, it is more likely to completely zero out certain coefficients, and we'll see why geometrically in a later video. So the different regularization term here is the L1 norm instead of the L2: here we're taking the sum of the absolute values. Larger values are still going to be penalized, but not as strongly as before, since with ridge we squared the coefficients, so larger values carried an even larger penalty relative to their size. Lasso will selectively shrink some coefficients, compared to ridge, which is more likely to shrink all the coefficients at once. As mentioned before, lasso will therefore actually eliminate certain features and perform feature selection for you.
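As a quick sketch of this contrast, here is what the two penalties do in practice on a small synthetic dataset (the data and alpha value here are just illustrative assumptions, using sklearn's `make_regression`): ridge shrinks every coefficient but leaves them all nonzero, while lasso zeroes out the uninformative ones entirely.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Toy data: 10 features, but only 3 actually carry signal.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: sum of squared coefficients
lasso = Lasso(alpha=10.0).fit(X, y)   # L1 penalty: sum of absolute coefficients

# Ridge shrinks every coefficient but essentially never sets one exactly
# to zero; lasso drives the uninformative coefficients all the way to zero.
print("coefficients set exactly to zero by ridge:", np.sum(ridge.coef_ == 0))
print("coefficients set exactly to zero by lasso:", np.sum(lasso.coef_ == 0))
```

Note that sklearn calls the regularization strength `alpha` rather than lambda.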
In practice, it's also worth noting that lasso is slower to converge than ridge due to the underlying optimization solution. So you want to weigh optimization or computation timing, which will be faster for ridge, against perhaps higher interpretability, because once you remove certain features, you can get a better idea of which features are actually important. Now looking back at the same polynomial regressions that we had looked at before, this time we try to find that underlying function not using plain polynomials and not using ridge, but rather using lasso. We will end up being able to optimize our solution by coming up with the right lambda value. If lambda is too low, which we see all the way out to the right where it's essentially zero, then we are not regularizing enough; we're not adding enough penalty to the absolute value of each one of these coefficients. All the way to the left, the lambda is too high: we are regularizing too much, putting too much penalty on high coefficients and not allowing for enough complexity. And then finally, that just-right value, as with ridge, will be somewhere in the middle that allows us to find the trade-off between bias and variance, to ensure that we're able to generalize when we create our models. So again, we look at the effects of differing values for lambda. This time, the difference between this and ridge is that lasso will quickly zero out a lot of the values. So here we see one of the coefficients being especially high compared to the rest when we have a higher lambda value. Then in the middle, we have that just-right value, where we had actually eliminated, let's say, about two-thirds of our features and only kept around one-third. So again, eliminating many of our features while also shrinking some. And then finally, if our lambda is equal to zero, then we have no regularization and all the coefficients will remain unaffected.
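That sweep over lambda values can be sketched in a few lines (the dataset and the particular alpha grid are illustrative assumptions): as the penalty strength grows, fewer and fewer coefficients survive.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 12 features, only 4 of which are informative.
X, y = make_regression(n_samples=200, n_features=12, n_informative=4,
                       noise=10.0, random_state=1)

# Count the surviving (nonzero) coefficients as the penalty strength grows.
for alpha in [0.01, 1.0, 100.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_kept = np.sum(lasso.coef_ != 0)
    print(f"alpha={alpha:>6}: {n_kept} of 12 features kept")
```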
So looking at the graph that shows the relationship between our coefficients and our lambda values, we notice that the overall values will decrease, but if there's high multicollinearity we will also see some increase, as one feature takes on more explanatory value while others begin to be zeroed out. So, similar to what we saw with ridge, where as our lambda increases our coefficients should shrink, some multicollinearity may cause one of those coefficients to increase for a short stretch before ultimately decreasing as well. And similar to what we mentioned before, if we are trying to find the optimal model and we're starting with a very complex model, we can reduce variance by quite some amount without increasing our bias too much. Again, the goal is to come up with the regularization term, and thus the right complexity, that gives you the lowest error on your holdout set. Now I want to get into elastic net by first thinking about the question: how are we going to choose between our different models? As before, if our goal is prediction accuracy, we can use validation to give us a means of figuring out which models and which hyperparameters are optimal on our holdout sets; that is, how well will each one of our models actually generalize to new data. If our goal is interpretability, lasso has the extra bonus of eliminating features that are not as important. Of course, we must also be cautious, on the other hand, as our model may truly depend on many of the features. Also, if timing is a priority, we should again highlight that ridge regression will be more computationally efficient. So the things to keep in mind when deciding between ridge and lasso: which one is going to do better on a holdout set; which one is going to give you higher interpretability, with lasso being able to get rid of many of the coefficients; and then finally, with ridge, we're able to come up with a more computationally efficient algorithm.
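Picking that lowest-holdout-error lambda can be done with cross-validation directly; a minimal sketch using sklearn's `LassoCV` (the data, alpha grid, and split are illustrative assumptions) might look like:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=15.0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# LassoCV searches a grid of penalty strengths with cross-validation,
# then refits on the full training set using the best one.
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
lasso_cv.fit(X_train, y_train)
print("chosen alpha:", lasso_cv.alpha_)
print("holdout R^2:", lasso_cv.score(X_test, y_test))
```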
And also, you may want to penalize certain large weights even more heavily, which ridge regression will do for you. Now elastic net, as we see here in the formula, is a means of coming up with a hybrid approach between ridge and lasso. The idea here is that you have your lambda, all the way out to the left, to decide how much you want to penalize higher coefficients in general, and then the alpha value that we see will represent the percentage that you attribute to the L2 and L1 penalties, the ridge and lasso weights. Thus we're introducing a third, hybrid option besides ridge and lasso, which again we should optimize using the cross-validation method that we saw earlier to find the optimal hyperparameters. So elastic net again is going to be a mixture of ridge and lasso; the idea is, can we get the best of both worlds? And this mixture is going to be determined by, here we have it written as, lambda one and lambda two. But again, this is going to be some lambda value which gives you the weight for how much you're going to penalize higher coefficients, and then how much you want to attribute that lambda to either the absolute value of the coefficients or the square of those coefficients. And the resulting effects of using elastic net, assuming that you choose the correct lambdas, will be the same, where we have either underfitting, perfectly fitting, or overfitting our underlying model. What we see here is that the two lambda values are equal, but these can obviously be different depending on which one you want to attribute more weight to, whether the square of the coefficients or the absolute value of the coefficients. Again, as with lasso, if you put more weight towards the absolute value, you'll be more likely to remove some of your coefficients; if you weight towards the ridge term, you'll be more likely to penalize higher weights.
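In sklearn, these two knobs map onto `ElasticNet`'s `alpha` (the overall lambda) and `l1_ratio` (the L1/L2 mix). A minimal sketch, with the dataset and parameter values chosen purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=3)

# alpha sets the overall penalty strength (the transcript's lambda);
# l1_ratio sets the L1/L2 mix: 1.0 is pure lasso, 0.0 is pure ridge.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000).fit(X, y)
print("nonzero coefficients:", np.sum(enet.coef_ != 0))
```

Raising `l1_ratio` toward 1.0 makes the model behave more like lasso and zero out more coefficients; lowering it toward 0.0 makes it behave more like ridge.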
Now let's briefly touch on another means of feature selection outside of just using lasso; namely, we're going to work with recursive feature elimination, or RFE. RFE is the tool that sklearn provides to do feature selection recursively and automatically. The way that it works is that we'll first be choosing a model: we as the users will have to choose which model we want to limit the features of. We will then explicitly define how many features we want to end up with for that model, and RFE will then repeatedly run the model, measure the different feature importances, and recursively remove the less important features. The way that it does this in sklearn is that the model we choose to pass through must have an attribute for either coefficients or feature importances, and RFE will then eliminate the feature with the smallest value for the coefficients or the feature importances. An important note is that, with this in mind, we must ensure that we first scale our data if we're going to do something like linear regression, so that each one of the coefficients that our linear regression learns is measured on that same scale. So how do we use this in sklearn? The first thing that we're going to want to do is import the class containing our feature selection method: from sklearn.feature_selection, we import RFE, which is just going to be the name of our sklearn object. Then, as we do with all our sklearn objects, we're going to instantiate the class. We're going to set it equal to rfe_mod, and the first thing we pass in will be an actual model. That model, which we call est, can be a linear regression object, a lasso object, or even a random forest or some more complex object, as long as that model has feature importances or coefficients for RFE to use when recursively eliminating the smallest value. And then we also want to predefine how many features we ultimately want to keep within our model.
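Putting those steps together, a minimal sketch might look like the following (the dataset, the choice of a plain linear regression as est, and keeping 5 features are all illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=4)

# Scale first so the learned coefficients are comparable across features,
# then let RFE repeatedly refit and drop the smallest-coefficient feature.
X_scaled = StandardScaler().fit_transform(X)
est = LinearRegression()
rfe_mod = RFE(estimator=est, n_features_to_select=5)
rfe_mod.fit(X_scaled, y)
print("kept features:", rfe_mod.support_)   # boolean mask of survivors
```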
We then fit the instance on the data and then predict the expected value. So rfe_mod, which we initiated before, has the model built in, right, that's est, which is either a linear regression, lasso, or something more complicated such as a random forest. So we're able to call fit, and it's going to fit that model to X_train and y_train. And then we can ultimately predict given some X_test, and it will do so using only the number of features that we chose to keep at the end of the day. Another class that's available to us is RFECV. That class, rather than doing feature elimination by looking at feature importance alone, will also incorporate cross-validation to ensure that we are doing well on each one of our holdout sets, checking, as we eliminate features, which ones actually affect our error on that holdout set. So to recap: in this section, we reminded ourselves about the relationship between model complexity and error. We started off by recapping the importance of having that holdout set, as we want to ensure that as we increase our complexity we're not also increasing our error on new data; in other words, we want to ensure that as we increase complexity, we're still able to generalize well. We discussed regularization as an approach to overfitting, introducing how regularization gives us a means to take a complex model and appropriately tune it to ensure that we have the correct level of complexity on our holdout set. We discussed some basic approaches to regularization, including ridge, lasso, and elastic net, each of these just tuning that hyperparameter of alpha or lambda. And keep in mind that, moving forward, we'll use the same technique when we work with much more complex models, to ensure that we are also able to reduce complexity using regularization.
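The RFECV variant mentioned above can be sketched in the same way; here, rather than fixing the number of features up front, the count is chosen by cross-validated score (the dataset and split below are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=15, n_informative=5,
                       noise=10.0, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

# RFECV eliminates features recursively, but decides how many to keep by
# cross-validated score instead of a user-supplied target count.
selector = RFECV(estimator=LinearRegression(), cv=5).fit(X_train, y_train)
print("features kept:", selector.n_features_)
print("holdout R^2:", selector.score(X_test, y_test))
```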
And then finally, we introduced recursive feature elimination as a way to take any model, no matter how complex, and easily eliminate its less important features. That closes out our section here. After this, we will move into a notebook to go over everything we just learned, and I look forward to seeing you there.