Factorization machines are a relatively new and powerful tool for modeling high-dimensional, sparse data. In a recommendation setting, the goal is to predict the missing entries of the user-item ratings matrix; those predictions can then be used to recommend items to users.

Factorization machines are predictive models that use the idea of matrix factorization to represent interactions between input variables. These interaction terms are added to our usual linear regression formula to generate a model that can account for more complicated relationships between the input variables.

The key idea that distinguishes factorization machines from generic nonlinear regression is that these interaction parameters are estimated from a weight matrix that is factorized into components corresponding to each input. A degree-2 factorization machine has pairwise interaction terms, while higher-degree factorization machines can have higher-order interactions.

In each case the weight matrices corresponding to the interactions are factorized so that an individual weight vector is learned for each input. This allows the interaction terms to be trained effectively even when the data is sparse and it would otherwise be hard to learn their weights directly.
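As a concrete illustration, here is a minimal sketch (in NumPy; the function and variable names are my own) of a degree-2 factorization machine prediction, using the standard identity that lets the pairwise sum over all input pairs be computed in linear time:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine prediction.

    x  : (n,)   input feature vector
    w0 : scalar global bias
    w  : (n,)   linear regression weights
    V  : (n, k) one k-dimensional factor vector per input
    """
    linear = w0 + w @ x
    # Pairwise term: sum_{i<j} <V[i], V[j]> x_i x_j, computed in O(nk)
    # via 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ].
    xv = x @ V                                   # shape (k,)
    pairwise = 0.5 * np.sum(xv**2 - (x**2) @ (V**2))
    return linear + pairwise
```

Because the interaction weight for a pair (i, j) is the inner product of the two factor vectors V[i] and V[j], every observation involving input i contributes to learning V[i], which is what makes training feasible on sparse data.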

Factorization machines take the ideas behind matrix factorization and extend them to create a robust predictive model for sparse data that can be used for classification or regression on any dataset.

Factorization machines model the ratings by summing the average rating over all users and items, the average rating given by a user, the average rating received by an item, and a pairwise interaction term that accounts for the relationship between a user and an item. The affinity between users and items is modeled as the inner product between two vectors of features: one for the user and another for the item. These features are collectively known as factors.
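The decomposition just described can be written compactly (the symbol names below are conventional, not taken from the text) as

```latex
\hat{r}_{ui} = \mu + b_u + b_i + \langle \mathbf{p}_u, \mathbf{q}_i \rangle
```

where $\mu$ is the global average rating, $b_u$ and $b_i$ are the user and item rating biases, and $\mathbf{p}_u$ and $\mathbf{q}_i$ are the user and item factor vectors whose inner product captures the user-item affinity.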

The basic idea is to discover two matrices that, when multiplied together, return, or at least approximate, the original matrix. The ratings matrix is therefore modeled by multiplying the user factors matrix, P, by the item factors matrix, Q. P has dimensions (number of users) by k, and Q has dimensions k by (number of items), where k is chosen by the user.
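Under these conventions, the shapes line up as follows (the sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 2

P = rng.normal(size=(n_users, k))   # user factors: one k-vector per user
Q = rng.normal(size=(k, n_items))   # item factors: one k-vector per item

R_hat = P @ Q                       # reconstructed ratings matrix
print(R_hat.shape)                  # (4, 6): one predicted rating per user-item pair
```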

The larger k is, the more latent structure the factorization can capture; the smaller k is, the quicker the system is to create. The ratings matrix is then re-created by multiplying the factors together.

The predicted rating of item j by user i is calculated as the dot product of the two factor vectors corresponding to user i and item j.
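Concretely, continuing the illustrative P and Q from above, a single predicted rating looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k = 4, 6, 2
P = rng.normal(size=(n_users, k))   # user factors
Q = rng.normal(size=(k, n_items))   # item factors

i, j = 2, 5                         # user i, item j
r_hat = P[i] @ Q[:, j]              # dot product of the two factor vectors
# This is the same value as entry (i, j) of the full product P @ Q.
```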

The question then becomes how to estimate the factors P and Q. One method is to first initialize their values, then calculate the squared difference between the predictions and the observed ratings, and minimize that error iteratively using stochastic gradient descent.

To avoid overfitting, an L2 regularization penalty can be added. This effectively controls the magnitude of the user and item factors such that P and Q give a good approximation to the ratings matrix without using large values.
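A minimal sketch of this training procedure, assuming stochastic gradient descent over only the observed entries with an L2 penalty on the factors (the learning rate, regularization strength, and epoch count here are illustrative choices, not prescribed by the text):

```python
import numpy as np

def factorize(R, k=2, lr=0.01, reg=0.02, epochs=500, seed=0):
    """Approximate R (with np.nan marking missing ratings) as P @ Q via SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.normal(size=(n_users, k))   # initialize small random factors
    Q = 0.1 * rng.normal(size=(k, n_items))
    observed = [(i, j) for i in range(n_users)
                for j in range(n_items) if not np.isnan(R[i, j])]
    for _ in range(epochs):
        for t in rng.permutation(len(observed)):   # shuffle each epoch
            i, j = observed[t]
            err = R[i, j] - P[i] @ Q[:, j]         # prediction error
            # Gradient step on squared error plus L2 penalty on the factors
            pi = P[i].copy()
            P[i]    += lr * (err * Q[:, j] - reg * pi)
            Q[:, j] += lr * (err * pi      - reg * Q[:, j])
    return P, Q
```

Note that only the observed entries contribute to the loss; the missing entries are exactly what the learned P and Q are later used to predict.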