In this demonstration, we build a factorization machine to predict ratings given by movie viewers to movies. The original data set contained 10,000 users and 10,000 items, which is equivalent to a matrix with 100,000,000 values. However, the original matrix was sparse, with 90% missing values. The data set movie_ratings contains only the 10,000,000 nonmissing entries. Instead of a 10,000-by 10,000-matrix, the data is organized as a 10,000,000-by-3 table. The 10 million rows correspond to all combinations of users and items. The first column is the unique user ID, followed by the unique movie ID, and then the rating given by the user to the item ranging from 1 to 5, with 5 being the best. Both users and movies are identified with an anonymous ID. As always, let’s first load the os, sys, and SWAT packages, and let’s also load the numpy, pandas, and matplotlib packages to analyze data locally. We’ll also add the %matplotlibinline statement to print plots within the notebook and the SWAT options statement to print messages from CAS in the notebook. Next, connect to CAS with all the same arguments as previous demonstrations. Next, use the read sas wrapper to load the movie_ratings data set into memory. I’ll use the casOut argument to change the name on the CAS server to ratings. Now that the data is in-memory, I’ll use SWAT functionality to view it. I’ll print the dimension of the table and the first few rows to make sure it’s been read in correctly, and I’ll also find the overall mean rating in this data. As expected, the table is 10,000,000 by 3, the user_id and movie_id variables are anonymous, and the average overall rating is 2.51. I’ll save this value as mean_rating for later use. Next, I’ll load the simple action set and then use the distinct action to find the number of distinct values for all three variables in the data. As expected, there are 10,000 users, 10,000 movies, and five possible ratings. I’ll also use the freq action from the simple action set to find the frequency of each rating in the data and save the table as tbl. Remember, this is an instance where the table is copied to the client, so I can use native open source functionality to add the proportion of ratings for each level. The largest proportion of ratings is for levels 2 and 3, and surprisingly, there are only about 50,000 level 5 ratings out of the 10,000,000 observations. Since tbl is on the client, I can plot the table as well to visually see the proportion by rating level. Levels 2 and 3 have similar proportions, as well as levels 1 and 4. Next, I’ll load the fedsql action set to explore the biases within the data. Biases occur because users innately rate on different scales. For example, a five-star rating does not necessarily mean the same thing for two different users, or a person might rate a movie poorly simply because he or she routinely doesn’t like the genre. Or perhaps some people tend to simply give low scores because they have high standards. Factorization machines account for these biases in their predictions. They account for a global bias, the rating versus the overall average rating, a user bias, the average rating given by the user, an item bias, the average rating given to the movie, and a pairwise interaction bias, which accounts for the bias between a user and a movie. Let’s first query the data to find the user bias. I’ll select the user_id variable, count the number of ratings for each user and name it frequency, and find the average rating for each user. I’ll group the results by user and order them in ascending order. We can use open source syntax here to add the user_bias variable to this table from the query. The user bias is simply the average rating minus the mean rating that we found before, and we can use the head function to view this new table. For example, the first user rated 1015 movies and gave those movies an average rating of 1.53, which equates to a bias that is almost 1 whole unit below the overall average rating. We calculate the movie bias similarly by replacing user_id with movie_id. The first movie here was rated 992 times and had an average rating of 1.58, which again is almost a whole unit below the average. Next, let’s partition the data before we build the factorization machine. I’ll load the sampling action set and use the srs action to do so. I’ll use 90% of the data for training, which equates to 9,000,000 observations, and pass the partition indicator into the ratings table. Now I’ll load the factmac action set to build the model. I’ll specify the table name and use the partition indicator to pass in only the training data as usual. The target here is the rating variable, and the inputs and nominals are the same, both user_id and movie_id. For the purposes of speed for this demonstration, I’ll choose to use only 10 factors, optimize the factors over only 5 iterations, and use a learning rate of 0.1 for the optimization. With more time available, you’ll want to try more factors and optimization iterations to improve the model. I’ll set a seed and use the saveState argument to save the model for scoring under the name factmac_model, and the output argument enables us to save the scored training data. Submit the cell to build the factorization machine. The output displays the model specification, the number of observations, the mean and standard deviation of the target, the number of levels for the nominal inputs, the optimization history (where you can see how the loss is reduced each iteration), the final loss, and the new output CAS table containing the scored training data. Using the fetch action, we can see the contents of the new CAS table. It includes the three original variables, the partition indicator, and the predicted rating, which appears to be reasonably close to the actual rating, at least for these five observations. Because we saved the model as an analytic store, we need to load the aStore action set to score the validation data. To do so, we’ll use the score action, specify only the validation data in the table argument, restore the factmac_model table, copy all the variables into the output table for comparison purposes, and save the scored validation as factmac_scored. When I run the score action, the output just shows timing information. To evaluate the predictions, I’ll first use a DATA step to add the error as a variable to the factmac_scored table and then SQL to compute the mean square error. Let’s first load the dataStep action set and then use the runCode action to run DATA step functionality on the CAS server. I’ll use the same name, factmac_scored, in both the DATA and SET statements to add the error variable to the table, which is simply the actual rating minus the predicted rating. When I run this, you’ll see that the table has increased in size by the one column I added. Next, I’ll query the data set and compute the mean squared error by averaging the squared error terms, and I’ll also compute the root mean squared error. So for this model, the average squared error is about 0.17 and the RMSE is about 0.41. Again, the resulting error here could be reduced using more factors and optimization iterations. As an alternative to using a DATA step and then SQL to find the MSE, we could instead use SWAT functionality to mimic open source data frame manipulation. First, we’ll create a reference to the scored validation data, called factmac_scored again. Then, we can add the error and the squared error to the table as if we are adding a new variable to the data frame. Using the head function, you can see that the last three variables have been added to this reference table. To find the MSE, I’ll just take the mean of the squared error variable, and take the root of the MSE to find the RMSE. Of course, the values are the same as before, we just used a different method to find this information. Next, let’s find the average prediction value for each level of actual ratings to see which levels are the easiest to predict. Using SQL, we’ll select the rating, find the frequency of each level of rating in the validation data, find the average prediction in each level, and group the results by the rating level. Finally, I’ll sort the table by the rating level from 1 to 5. You can see that the average prediction is closest to the actual ratings for levels 2 and 3, as they are incorrect, on average, by only about 0.1. On average, the model is off by about 0.3 for levels 1 and 4. And on average, the model is actually closest to a prediction value of 4 for the actual rating level of 5. Next, I’m going to round the predictions to their nearest whole rating using a DATA step. I’ll create the variable P_rating_round, and if the value is rounded down to a value of zero, I’ll change the value to 1. I’ll run this DATA step and print the first five observations to make sure that the variable was correctly added to the table. Here the new variable was rounded up for the first few and then down for the third observation as expected. Next, I’ll use the rounded predictions to create a crosstabulation between the predictions and actual ratings. I’ll use the crossTab action from the simple action set we loaded earlier. The row variable will be rating, and the column variable will be the rounded ratings. In this case, the crosstab table is copied to the client, and I can use open source data frame functionality on it. I’ll set the column and row names appropriately based on the data, and I’ll print the matrix. So here is the frequency of predictions for each rating level. Ideally, we would see large values on the diagonal to indicate better performance. However, one thing to note from this table is that the predicted rating is only a maximum of one whole unit from the actual rating. That is, for example, we never predict a value of 4 when the actual value is 2. The proportion of correct predictions is simply the sum of the diagonal divided by the total number of predictions, which is approximately 0.75. To print the conditional probabilities of predicted ratings given the actual ratings, I’ll divide the row values by their row sums. So, for example, the probability of predicting a rating of 1 given that the rating is actually 1 is about 0.72. The first four diagonal entries are near a value of three quarters. However, the model did a poor job of correctly predicting a rating of 5, possibly in part to the fact that there were very few 5 ratings in the training data. Now that we’ve built the factorization machine and created predictions, we can actually recommend movies to users. To do this, I created a function called useri_top5, which as its name implies, finds the top five recommended items for a given user. The function simply wraps up SQL functionality in order to query the data. It searches the table for a given input user, orders the information in descending order of the rating, and then returns the top five items. Therefore, I can specify any user ID as input to the function (for example, user ID 1), and it returns the following movie IDs to recommend. Finally, end the CAS session with the endSession action.