So, here we are in the TensorFlow Playground. We have a dataset that looks like this: blue dots in the upper right corner, orange dots in the lower left corner, and we are trying to draw a line that separates the two. To do that, we have X1, X2, X1 squared, X2 squared, and X1 times X2 as inputs. First of all, which of these are raw inputs, and which are created features? Well, X1 and X2 are the raw inputs. X1 squared, X2 squared, and X1X2 are features that we created from the raw inputs X1 and X2. Which of these are feature crosses? X1X2 is obviously a feature cross, but if you squint at it a little bit, you can see that X1 squared is also a feature cross. It's a self cross, a self join if you will: you're taking X1 and X1 and crossing them together to get X1 squared. So, one way to think about it is that we have two raw inputs, X1 and X2, and three feature crosses, X1 squared, X2 squared, and X1X2. But that's just terminology. You can call X1 squared and X2 squared transformations of the input rather than feature crosses. No problem.
So, we have five inputs to our model, and we want to train it. Let's go ahead and do that. I'll click the play button and we start training, and notice something strange that's happening. Right down here, at the lower left corner, did you see that blue that showed up? It went away after a while, but imagine that we didn't have that option. So, let's try this again. We don't know how long we are going to be training. Let's say we train up to this point, 230 epochs. That's a long time. We trained for 230 epochs and we come up with something strange. What? This thing here. That triangle is an indicator of overfitting. There is really no data there, so as far as the model is concerned it's a plausible explanation, and we're not doing anything to make the model simpler than that. So, it goes ahead and puts stuff in there.
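As a rough illustration of the five inputs described above, here is a minimal sketch in Python with NumPy. The function name `make_features` and the sample values are assumptions made up for this example; they are not part of the Playground itself.

```python
import numpy as np

def make_features(x1, x2):
    """Build the five inputs from the two raw inputs.

    Columns, in order: x1, x2, x1^2, x2^2, x1*x2.
    (Hypothetical helper, for illustration only.)
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([
        x1,        # raw input
        x2,        # raw input
        x1 ** 2,   # "self cross" of x1 with itself
        x2 ** 2,   # "self cross" of x2 with itself
        x1 * x2,   # feature cross of x1 and x2
    ])

# Two made-up points: one toward the upper right, one toward the lower left.
raw = np.array([[1.0, 2.0],
                [-3.0, 0.5]])
features = make_features(raw[:, 0], raw[:, 1])
print(features)
```

Note that the three created columns carry no information that isn't already in the first two; they just present the same data to the model in other forms.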
Now, one of the reasons this happens is that we are allowing the model to overfit. And one way to allow a model to overfit is to give it the same data in multiple ways. What happens if I turn off X1X2? At this point you only have X1, X2, X1 squared, and X2 squared. I'll restart this, and again notice that there is this crazy boundary in the early stage of training. Let's do this again, and this time stop at around 200 epochs. So, there we go. At 200 epochs, you again see that the boundary isn't great. There is this white stuff in here, this craziness, again because we have those extra features, X1 squared and X2 squared. What happens if we take out X1 squared and X2 squared? Now we only have the raw data, X1 and X2 alone. So, I'll start it and again stop at around 200 epochs. And you notice that now it is pretty much perfect. I just have this line. That is something you want to be aware of: you can have too much of a good thing, and feature crosses are a temptation for the model to overfit.
But there is something else to notice. Let's go back to what we started with. If we train for a very long time, this tends to get better. The artifact in the lower left corner goes away, but we still have this curved boundary, and that's another symptom of overfitting. The reason you can get a curved boundary, rather than the straight line that we know is the simplest effective model, is that we gave the model lots of degrees of freedom. Now to be frank, if you look at this, the weights of X1 and X2 are much higher than the weights of any of these three things. But X1 times X2, that feature cross, does get a weight, and because it gets a weight, it can mess things up.
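To make the point about the cross weight concrete, here is a small sketch assuming a plain linear model over the five inputs. The weights are hypothetical values picked for illustration, not numbers read off the Playground. With only X1 and X2 weighted, every point on the straight line x1 + x2 = 0 scores exactly zero, so that line is the boundary. Giving X1 times X2 even a modest weight changes the score of those same points, which means the zero-score boundary is no longer that straight line.

```python
import numpy as np

def score(x1, x2, w):
    """Linear model over the five inputs: x1, x2, x1^2, x2^2, x1*x2."""
    feats = np.array([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
    return float(np.dot(w, feats))

# Hypothetical weights, chosen only for illustration.
w_raw_only = np.array([1.0, 1.0, 0.0, 0.0, 0.0])    # straight-line model
w_with_cross = np.array([1.0, 1.0, 0.0, 0.0, 0.3])  # small weight on x1*x2

# Three points on the straight line x1 + x2 = 0.
points = [(-2.0, 2.0), (0.0, 0.0), (2.0, -2.0)]

raw_scores = [score(x1, x2, w_raw_only) for x1, x2 in points]
cross_scores = [score(x1, x2, w_with_cross) for x1, x2 in points]

print("raw only:  ", raw_scores)    # all zero: the line is the boundary
print("with cross:", cross_scores)  # away from the origin, no longer zero
```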
Surprisingly, the model's decision boundary looks kind of crazy. In particular, there is this region in the bottom left that's hinting towards blue, even though there is no visible support for that in the data. TensorFlow Playground uses a random starting point, so your result might be different. This is why I put up what I got as a picture. You might have gotten something slightly different. Notice the relative thickness of the five lines running from input to output. These lines show the relative weights of the five features. The lines emanating from X1 and X2 are much thicker than those coming from the feature crosses. So, the feature crosses are contributing far less to the model than the normal uncrossed features, but they're contributing enough to mess with generalization.
What if we remove the feature crosses completely? In other words, use only the raw data. Removing all the feature crosses gives you a more sensible model. There is no longer a curved boundary suggestive of overfitting. After 1,000 iterations, test loss should be slightly lower than when the feature crosses are used, although your results may vary a bit depending on the dataset. The data in this exercise is basically linear data plus noise. If we use a model that is too complicated for such simple data, a model with too many feature crosses, we give it the opportunity to fit the noise in the training data. You can often diagnose this by looking at how the model performs on independent test data. Incidentally, and we'll talk about regularization later in the course on the art and science of ML, this explains why L1 regularization can be such a good thing. What L1 regularization does is zero out the weight of a feature if necessary. In other words, the impact of L1 regularization is to remove features.
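One way to see how L1 regularization ends up removing features is the soft-thresholding step that some L1 solvers apply to the weights. The sketch below is an illustration under that assumption. It is not a claim about how the Playground implements L1 internally, and the weights and the `strength` value are made up for the example.

```python
import numpy as np

def soft_threshold(weights, strength):
    """Shrink each weight toward zero by `strength`; weights whose
    magnitude is at most `strength` become exactly zero."""
    weights = np.asarray(weights, dtype=float)
    return np.sign(weights) * np.maximum(np.abs(weights) - strength, 0.0)

# Hypothetical weights: large for the raw inputs, small for the crosses.
names = ["x1", "x2", "x1^2", "x2^2", "x1*x2"]
weights = np.array([1.2, 1.1, 0.05, -0.03, 0.08])

shrunk = soft_threshold(weights, strength=0.1)
kept = [name for name, w in zip(names, shrunk) if w != 0.0]

print(shrunk)  # the three small weights are now exactly zero
print(kept)    # only the raw inputs survive
```

In this made-up case the three created features drop out entirely, leaving the same raw-inputs-only model that was obtained above by turning the feature crosses off by hand.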