So with ridge regression, we're now taking the cost function that we just saw and adding on a penalty that is a function of our coefficients. Namely, the cost is going to be the residual sum of squares, which is our original error, plus that lambda value that we choose ourselves, multiplied by the sum of our squared coefficient weights. Therefore, the higher these weights (the coefficients) are, the more we add on to our cost function. So now, rather than just trying to reduce the error between outcome and prediction, we're trying to reduce error while also ensuring that our model is not too complex, right? Every time we increase our coefficients, we are also increasing our cost function, which we are ultimately trying to minimize. Now, keep in mind that with our original linear regression cost function, scaling would not have a large effect on our eventual outcome. But now that we have added these coefficient weights to our cost function, scale will be of utmost importance. So why is this the case? This is something that we discussed before. Imagine we're working with two variables to predict sales for a certain product. One of our variables is going to be the number of stores, and the other x variable is going to be the price of that item. If the price is between $8 and $10 and the number of stores is between 10,000 and 20,000 stores, then the price coefficient (with a $1 change as our unit) will be much larger, in regard to its change in sales, than the coefficient for a one-unit change in stores. Therefore, if we end up with a higher price coefficient, we'll end up being highly penalized for that coefficient, right? We're going to penalize large coefficients using our new ridge regression. Therefore, we need to make sure that all of our different features are on the same scale. And we can do that using the standardization technique that we see here: just subtracting the mean and dividing by the standard deviation.
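The standardization step described above might look like the following sketch in Python. The price and store values here are made-up numbers chosen to match the example's ranges, not real data:

```python
# Sketch: standardizing two features that live on very different scales,
# using hypothetical price ($8-$10) and store-count (10k-20k) values.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [8.0, 10_000],
    [9.0, 15_000],
    [10.0, 20_000],
])  # columns: price, number of stores

scaler = StandardScaler()  # subtracts the mean, divides by the std
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1,
# so the ridge penalty treats both features comparably.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Without this step, the penalty would fall much harder on whichever coefficient happens to sit on the larger-unit feature.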
So, some things to keep in mind overall when you're working with ridge regression. First of all, the complexity penalty lambda is going to be applied proportionally to the square of the coefficient values. So recall that as we increase or decrease lambda, what we are doing is increasing or decreasing the effect of the square of each one of the coefficient values. The penalty term therefore has the effect of shrinking our coefficients towards 0. Not exactly to 0 like lasso, but the higher the coefficient is, the larger the penalty, so we end up reducing the size of those coefficients. This is going to impose bias on the model, but also reduce variance. Recall this trade-off: the idea of regularization is inherently to reduce variance, to reduce the complexity of the model. Now, we can select the best regularization strength lambda by doing cross-validation. As we saw in our last notebook, we can increase and decrease lambda and see how well the model performed on holdout sets. And then finally, as we just mentioned on our last slide, it's going to be best practice to scale our features, for example using the standard scaler, and this will ensure that our penalties aren't impacted by the variable scale. So, coming back to our new cost function that we're using for ridge regression: we have taken the original linear regression cost function and added on a penalty to reduce the complexity of our model. The penalty shrinks the magnitude of all of our coefficients, as we discussed, by adding on this extra weight to our cost function. And then we want to note that the way it's actually going to work under the hood is that larger coefficients are even more strongly penalized because of this squaring. Note that this means a coefficient of 2 will be penalized four times as much as a coefficient of 1, right, because 1 squared is 1 and 2 squared is 4. Similarly, a coefficient of 3 would be penalized 9 times as much as a coefficient of 1.
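Both of the points above, that coefficients shrink toward zero as lambda grows and that we can pick lambda by cross-validation, can be sketched in a few lines of sklearn. The synthetic data and candidate alpha grid here are purely illustrative:

```python
# Sketch: ridge shrinkage and cross-validated penalty selection
# on a small synthetic dataset (all values here are illustrative).
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X = StandardScaler().fit_transform(X)

# The squared-coefficient penalty shrinks toward 0 (but not exactly to 0)
# as alpha -- sklearn's name for lambda -- grows.
norms = []
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.sum(model.coef_ ** 2))
    print(alpha, norms[-1])

# RidgeCV fits each candidate alpha on held-out folds and keeps the best.
cv_model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5).fit(X, y)
print(cv_model.alpha_)
```

The squared norm of the coefficients printed in the loop drops as alpha increases, which is exactly the shrinkage effect the slide describes.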
So the idea to keep in mind is that this will penalize larger weights more than proportionally, compared to the lower-weight coefficients. Now let's take this concept of lambda, and how it affects the coefficients, in relation to those polynomial graphs that we saw earlier. So again, the idea here is trying to find that true function, that blue line, given our sample data. So let's say that we are starting with a model that's just a polynomial of degree 15. And rather than testing different polynomials, polynomials of degree 1, 4, and 15, we can introduce the lambda term. So, starting with a polynomial of degree 15, we can say that lambda equals 0.1, which in this case will be referencing a very high lambda. Now, that's going to be relative according to the coefficients that you have. So you will have to play around with different lambda values in order to see which one actually fits, given the data and the coefficients that you're working with. But here we're going to assume lambda equals 0.1 is very high, and therefore we end up with high bias: we're not able to fit the data well, because we have put too much weight on the coefficients in our regularization term. Here in that middle value, we have regularized our weights so that the model won't be as complex as the full polynomial-15 model, but it won't be as simple as the one that had lambda equal to 0.1 in regards to that regularization term, right? So it's going to be something closer to the middle. And then here, if we're working with a polynomial of degree 15 and our lambda is equal to 0, that's essentially just fitting the polynomial-15 model without any regularization term. So these are going to be the regularization costs for each one of the different values that we saw, the idea being: how much will that affect the coefficients that we choose? So if we look all the way to the left, we'll see that that's going to give the lowest values for our coefficients across the board, right?
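The degree-15 setup described above can be sketched with an sklearn pipeline. The sine-curve data stands in for the "true function" on the slides, and the exact alpha values are just placeholders for the high / middle / near-zero cases (a tiny alpha is used to approximate lambda = 0):

```python
# Sketch: one degree-15 polynomial model under three ridge penalties,
# fit to hypothetical noisy samples of a smooth "true" function.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))[:, None]
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.2, size=30)

penalties = []
for alpha in [10.0, 0.1, 1e-4]:  # high, moderate, and near-zero lambda
    model = make_pipeline(
        PolynomialFeatures(degree=15),
        StandardScaler(),
        Ridge(alpha=alpha),
    ).fit(x, y)
    coefs = model.named_steps["ridge"].coef_
    # The regularization cost sum(coef**2) grows as alpha shrinks.
    penalties.append(np.sum(coefs ** 2))
    print(alpha, penalties[-1])
```

The same degree-15 model goes from heavily constrained to essentially unregularized just by turning the alpha knob, which is the whole point of the three graphs.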
That higher lambda value means that we want lower coefficients; in the middle, the coefficients will be somewhere in the middle; and all the way out to the right, with lambda equal to 0, we put no penalty on each one of our coefficient terms, and therefore we will have the highest coefficient values among these three graphs that we see. Now, what happens in regards to the relationship between the lambda values and those standardized coefficients? As we discussed, generally, moving to the right as lambda increases, the standardized coefficients should decrease, so there should be that inverse relationship. We do see this ratings coefficient increase while the other features that we're working with here are decreasing, and that would just be something due to multicollinearity. And you'll see at a certain point they all start to decrease, and they're all decreasing monotonically towards 0 once lambda reaches a certain threshold. So ideally, as you increase alpha or your lambda (alpha is going to be the value that you choose when you're actually working in sklearn, but here we call it lambda), we will decrease each one of the coefficients. So here we see this complexity trade-off, and it's possible that variance reduction may actually outpace the increase in bias. So you can find a better-fit model without having to increase bias too much. So what do we mean by this? What we mean is that we can reduce the complexity while still consistently having enough information to show that relationship between x and y in our training set. So we're not increasing bias too much, the idea here being that there may not necessarily be a linear trade-off. Rather, for the example that we see here, we're able to reduce complexity for some time while barely affecting that bias. And this may happen if we're starting with an extremely overfit model and we can keep reducing variance without increasing that bias.
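A coefficient path like the one on the slide can be traced by refitting ridge over a grid of alphas. The two correlated features below are a made-up stand-in for the multicollinearity effect mentioned above, where one coefficient can briefly move against the trend before everything heads toward zero:

```python
# Sketch: standardized ridge coefficients traced over increasing alpha,
# with two deliberately correlated features to mimic multicollinearity.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 300
f1 = rng.normal(size=n)
f2 = 0.9 * f1 + 0.1 * rng.normal(size=n)  # highly correlated with f1
f3 = rng.normal(size=n)
X = StandardScaler().fit_transform(np.column_stack([f1, f2, f3]))
y = 2.0 * f1 + 1.0 * f3 + rng.normal(scale=0.5, size=n)

alphas = [0.1, 1.0, 10.0, 100.0, 1000.0]
paths = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])

# Individual coefficients may wiggle when features are correlated, but the
# overall coefficient norm shrinks monotonically as alpha increases.
print(np.round(paths, 3))
```

Plotting each column of `paths` against `alphas` on a log scale reproduces the kind of trace shown on the slide.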
In this way, we can eventually find that optimal value that's going to give us the lowest mean squared error on our holdout set. So that closes out this video on ridge regression. In the next video, we will introduce another means of regularization for linear regression, called lasso regression. I'll see you there.