What if I discretize the x1 axis by drawing not just one white line but lots of these black lines? And we do the same thing for the x2 axis by drawing a whole bunch of black lines. Now we have discretized both the x1 axis and the x2 axis. When we drew two white lines, we ended up with four quadrants. So what about now? If I have m vertical lines and n horizontal lines, we end up with (m + 1) times (n + 1) grid cells, right?

Now, let's consider what this looks like when we discretize x1 and x2 and then multiply. Remember the diagram we had when we divided the input space into quadrants: essentially, we get to make a different prediction for each of the quadrants. So what about this green box? What is your prediction going to be for that box? Yellow, right? How about now? Blue, but there's a hint of yellow, too. Let's count the number of blue points and the number of yellow points and call it 85 percent blue. You see now how the probabilities are coming in. What about now?

Anyway, let's see why this works well as a linear model. When you one-hot encode the first set of values, then one-hot encode the second set of values, and then feature cross them, you're essentially left with one node that fires for points that fall into that bucket. So think about it: x3 will be 1 only if x1 equals 1 and x2 equals 1. So for any point in the input space, only one bucket fires. Now, if you take these feature-crossed values and feed them into a linear regression, what does the weight w3 have to be? Yup, the fraction of blue dots in the grid cell corresponding to x1 and x2.

So that's why a feature cross is so powerful: you essentially discretize the input space and memorize the training dataset. But can you see how this could be problematic? What if you don't have enough data? What is the model going to learn here? It's going to learn that the prediction has to be blue. Is that true? Well, there are ways around this. You don't have to discretize the input space equally. Instead, you can use different-sized boxes, with box sizes tied to the entropy or information content in each box. You can also group or cluster boxes together. So there are ways around this.

Still, you should realize that feature crosses are about memorization, and memorization is the opposite of generalization, which is what machine learning aims to do. So, should you do this? In a real-world machine learning system, there is a place for both. Memorization works when you have so much data that, for any single grid cell in your input space, the distribution of data is statistically significant. When that's the case, you can memorize; you're essentially just learning the mean for every grid cell. Of course, deep learning also needs lots of data to play in this space: whether you want to feature cross or you want to use many layers, you need lots of data.

Incidentally, if you're familiar with traditional machine learning, you may not have heard much about feature crosses. The fact that feature crosses memorize, and only work on larger datasets, is one reason you may not have heard much about them. But you will find feature crosses extremely useful in real-world datasets. The larger your data, the smaller you can make your boxes, and the more finely you can memorize. So, feature crosses are a powerful preprocessing technique on large datasets.
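To make the mechanics concrete, here is a minimal sketch in NumPy, not code from the lecture: the toy data, the 10-by-10 grid, and names like n_buckets are all illustrative assumptions. It bucketizes x1 and x2, one-hot encodes and crosses the bucket indices so exactly one column fires per point, fits a linear regression on the crossed features, and checks that each learned weight is just the mean label (the fraction of "blue") in its grid cell.

```python
# Minimal sketch (illustrative, not the lecture's code): bucketize x1 and x2,
# feature-cross the one-hot bucket indices, and fit a linear model on the cross.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: label is 1 ("blue") in the upper-right region, 0 ("yellow") elsewhere.
x1 = rng.uniform(0, 1, size=5000)
x2 = rng.uniform(0, 1, size=5000)
y = ((x1 > 0.5) & (x2 > 0.5)).astype(float)

n_buckets = 10  # m vertical and n horizontal lines -> (m+1) * (n+1) cells; here 10 x 10
b1 = np.minimum((x1 * n_buckets).astype(int), n_buckets - 1)  # bucket index along x1
b2 = np.minimum((x2 * n_buckets).astype(int), n_buckets - 1)  # bucket index along x2

# The feature cross: exactly one of the n_buckets * n_buckets columns is 1 per example.
cell = b1 * n_buckets + b2
X = np.zeros((len(y), n_buckets * n_buckets))
X[np.arange(len(y)), cell] = 1.0

# Linear regression by least squares (no intercept): one weight per grid cell.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Each learned weight equals the mean label in its cell (0.0 for empty cells).
cell_means = np.array([y[cell == k].mean() if np.any(cell == k) else 0.0
                       for k in range(n_buckets * n_buckets)])
print(np.allclose(w, cell_means, atol=1e-6))  # True: the model memorizes per-cell means
```

Under these assumptions, the sketch also shows the failure mode discussed above: any grid cell with no training points gets a weight of 0, so the model can only memorize where it has enough data per cell.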