Imagine that you are writing a machine learning model that looks at a car and tells you whether or not that car is a taxi. You and I know that white cars in Rome and yellow cars in New York tend to be taxis, but we want our machine learning model to learn this from a dataset consisting of car registrations. So assume that your input data looks like this: red, Rome; white, Rome; et cetera, and the labels are whether or not it's a taxi. Essentially, the car color and the city are your two input features, and you need to use these features in your linear model to predict whether or not the car is a taxi. How would you do it?

You take the first input, the car color, and you one-hot encode it. You take the second input, the city name, and you one-hot encode it. You take these and you send them straight to your linear model. Now, let's say you give a weight of 0.8 to yellow cars because 80 percent of the yellow cars in your training dataset are taxis. So W3 is now 0.8. Of course, you won't set this weight by hand; it will be learned by gradient descent, but that is the weight gradient descent is going to arrive at. Unfortunately, this weight of 0.8 applies to yellow cars in all cities, not just New York. How would you fix it? Would you give a high weight to New York? That doesn't work; now all cars in New York get this high weight. Do you see the problem?

Add in a feature cross, and what happens? We now have an input node corresponding to red cars in New York, another for yellow cars in New York, a third for white cars in New York, a fourth for green cars in New York, and similarly for cars in Rome. Now the model can learn quite quickly that yellow cars in New York and white cars in Rome tend to be taxis, and give those two nodes a high weight and everything else a zero weight. Problem solved. This is why feature crosses are so powerful.

Feature crosses bring a lot of power to linear models. Using feature crosses plus massive data is a very efficient strategy for learning highly complex spaces. Neural networks provide another way to learn highly complex spaces, but feature crosses let linear models stay in the game. Without feature crosses, the expressivity of linear models would be quite limited. With feature crosses, once you have a massive dataset, a linear model can learn the nooks and crannies of your input space. In other words, feature crosses allow a linear model to memorize large datasets. The idea is that you can assign a weight to each feature cross, and this way the model learns about combinations of features. So even though it's a linear model, the actual underlying relationship between inputs and outputs is non-linear.

Why are we so concerned about making linear models work well? Think back to the previous course, where we talked about convex problems and non-convex problems. Neural networks with many layers are non-convex, but optimizing linear models is a convex problem, and convex problems are much, much easier than non-convex problems. So, for a long time, sparse linear models were the only algorithm we or anyone else had that could scale to billions of training examples and billions of input features. The predecessors to TensorFlow at Google, SETI, SmartASS, and Sibyl, were all truly massive-scale learners. This has changed in the last few years, and neural networks can now also handle massive-scale data, often with the assistance of GPUs and TPUs, but sparse linear models are still a fast, low-cost option.
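To make the mechanics concrete, here is a minimal sketch in plain NumPy. The color and city vocabularies, and the 0.8 and 0.9 weights, are illustrative assumptions rather than anything from the lesson: it one-hot encodes each feature, builds the cross as the outer product of the two one-hot vectors, and shows how per-combination weights score a few examples.

```python
import numpy as np

# Sketch only: hypothetical toy vocabularies and weights.
COLORS = ["red", "yellow", "white", "green"]
CITIES = ["new_york", "rome"]

def one_hot(value, vocabulary):
    """Return a one-hot vector for `value` over `vocabulary`."""
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

def feature_cross(color, city):
    """Cross the two one-hot encodings: outer product, then flatten.
    The result has len(COLORS) * len(CITIES) entries, one per combination."""
    return np.outer(one_hot(color, COLORS), one_hot(city, CITIES)).ravel()

def cross_index(color, city):
    """Index of the (color, city) combination in the flattened crossed vector."""
    return COLORS.index(color) * len(CITIES) + CITIES.index(city)

# Hypothetical weights a trained linear model might converge to: high for
# yellow-in-New-York and white-in-Rome, zero for every other combination.
weights = np.zeros(len(COLORS) * len(CITIES))
weights[cross_index("yellow", "new_york")] = 0.8
weights[cross_index("white", "rome")] = 0.9

print(weights @ feature_cross("yellow", "new_york"))  # 0.8 -> likely a taxi
print(weights @ feature_cross("white", "rome"))       # 0.9 -> likely a taxi
print(weights @ feature_cross("yellow", "rome"))      # 0.0 -> not flagged
```

In a real pipeline you would typically hash the cross into buckets rather than enumerating every combination explicitly, since the number of crossed values grows multiplicatively with the vocabulary sizes.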
And one practical payoff: using sparse linear models as a preprocessor for your features will often mean that your neural network converges much faster.
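One closely related pattern, shown below as a sketch only, is TensorFlow's wide-and-deep estimator, which pairs a sparse linear part fed by crossed columns (like the color-by-city cross above) with a dense neural network part. This is joint training rather than a literal two-stage preprocessing step, so treat it as an illustration of combining sparse linear and neural models, not as the exact workflow described here. It assumes the estimator and feature-column APIs (available in TF 1.x and early TF 2.x, deprecated in recent releases); the feature names, vocabularies, and hyperparameters are hypothetical.

```python
import tensorflow as tf

# Sparse categorical inputs, one-hot style (hypothetical vocabularies).
color = tf.feature_column.categorical_column_with_vocabulary_list(
    "color", ["red", "yellow", "white", "green"])
city = tf.feature_column.categorical_column_with_vocabulary_list(
    "city", ["new_york", "rome"])

# The feature cross: hashed buckets, one per (color, city) combination,
# with a bucket count comfortably above the number of combinations.
color_x_city = tf.feature_column.crossed_column([color, city], hash_bucket_size=24)

# Wide-and-deep: the "wide" linear part memorizes via the sparse cross,
# while the "deep" part generalizes from a dense embedding of the same cross.
model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[color, city, color_x_city],
    dnn_feature_columns=[
        tf.feature_column.embedding_column(color_x_city, dimension=4),
    ],
    dnn_hidden_units=[16, 8],
)
```

The design intuition mirrors the lesson: the linear side gets the memorization power of feature crosses, and the neural side handles the generalization, which is one way the "sparse linear plus neural network" combination shows up in practice.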