Factorization machines are a relatively new and powerful tool for modeling high-dimensional, sparse data. In a recommendation setting, the goal is to predict the missing entries of the user-item ratings matrix; those predictions can then be used to recommend items to users. Factorization machines are predictive models that use the idea of matrix factorization to represent interactions between input variables. These interaction terms are added to the usual linear regression formula, producing a model that can account for more complicated relationships between the inputs. What distinguishes factorization machines from generic nonlinear regression is that the interaction parameters are not learned directly: the weight matrix of interactions is factorized into components corresponding to each input. A degree-2 factorization machine has pairwise interaction terms, while higher-degree factorization machines can have higher-order interactions. In each case the weight matrices for the interactions are factorized so that an individual factor vector is learned for each input. This allows the interaction terms to be trained effectively even when the data is sparse and it would otherwise be hard to learn weights for the interaction terms.

Factorization machines take the ideas behind matrix factorization and extend them into a robust predictive model for sparse data that can be used for classification or regression on any dataset. For ratings, the model sums the average rating over all users and items, the average rating given by a user, the average rating given to an item, and a pairwise interaction term that accounts for the relationship between a user and an item. The affinity between users and items is modeled as the inner product between two vectors of features: one for the user and another for the item. These features are collectively known as factors.
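As a rough sketch of the degree-2 model described above (names and shapes here are illustrative assumptions, not a fixed API), the prediction is a global bias plus linear terms plus pairwise interactions, where each interaction weight is the inner product of the two inputs' factor vectors. The pairwise sum can be computed in O(nk) time using a well-known algebraic rearrangement rather than looping over all pairs:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine prediction.

    x  : input feature vector, shape (n,)
    w0 : global bias (scalar)
    w  : linear weights, shape (n,)
    V  : factor matrix, shape (n, k); row V[i] holds the k factors for input i
    """
    linear = w0 + w @ x
    # Pairwise interactions sum_{i<j} <V[i], V[j]> x_i x_j, rewritten as
    # 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    xv = x @ V                   # shape (k,)
    x2v2 = (x ** 2) @ (V ** 2)   # shape (k,)
    interactions = 0.5 * np.sum(xv ** 2 - x2v2)
    return linear + interactions
```

In the recommendation case, x would be a sparse indicator vector with ones at the active user and item, so the interaction term reduces to the user-item inner product described above.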
The basic idea is to discover two matrices that, when multiplied together, reproduce, or at least approximate, the original matrix. The ratings matrix is therefore modeled by multiplying the user factors matrix by the item factors matrix. The user factors matrix has dimensions (number of users) by k, and the item factors matrix has dimensions k by (number of items), where k is chosen by the user. A larger k lets the factorization capture more structure in the ratings; a smaller k makes the model quicker to train. The ratings matrix is then re-created by multiplying the factors together: the predicted rating of item ij by user ui is the dot product of the two factor vectors corresponding to ui and ij.

The question then becomes how to estimate the factors P and Q. One method is to first initialize their values, then calculate the squared difference between the predictions and the observed ratings, and minimize that error iteratively using stochastic gradient descent. To avoid overfitting, an L2 regularization penalty can be added. This effectively controls the magnitude of the user and item factors so that P and Q give a good approximation to the ratings matrix without using large values.
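The SGD procedure above can be sketched as follows. This is a minimal illustration, not a production recommender: the function name, hyperparameter values, and the convention that missing ratings are stored as zero are all assumptions made for the example.

```python
import numpy as np

def factorize(ratings, k=2, lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Learn user factors P (n_users x k) and item factors Q (k x n_items)
    by SGD, so that P @ Q approximates the observed entries of `ratings`.
    Entries equal to 0 are treated as missing."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = 0.1 * rng.normal(size=(n_users, k))
    Q = 0.1 * rng.normal(size=(k, n_items))
    observed = [(u, i) for u in range(n_users)
                for i in range(n_items) if ratings[u, i] > 0]
    for _ in range(epochs):
        for u, i in observed:
            err = ratings[u, i] - P[u] @ Q[:, i]
            pu = P[u].copy()  # use the pre-update value for both gradients
            # Gradient steps; the reg term is the L2 penalty that keeps
            # the factor magnitudes small
            P[u] += lr * (err * Q[:, i] - reg * P[u])
            Q[:, i] += lr * (err * pu - reg * Q[:, i])
    return P, Q
```

Each update nudges a user vector and an item vector in the direction that reduces the squared error on one observed rating, while the regularization term pulls both vectors gently toward zero.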