Recommender baselines extended tutorial

This tutorial provides a detailed description of a few recommender models which are widely used in both research and industry. All of them are proved to be strong and reliable baselines for many use cases. They are also often chosen as candidate generators followed by neural networks and gradient boosting decision trees for re-ranking of retrieved items.

With RecTools you can use any of them with a few lines of code in the same interface.

Table of Contents

Prepare data
Matrix Factorization
- iALS[Hybrid]
  - Model description | Recommendations | RecTools implementation | Model Application | Links
- BPR MF
  - Model description | Recommendations | RecTools implementation | Model Application | Links
- LightFM[Hybrid] [Hot & Warm & Cold inference] [Logistic / BPR / WARP loss]
  - Model description | Recommendations | RecTools implementation | Model Application | Links
- PureSVD
  - Model description | Recommendations | RecTools implementation | Model Application | Links
Nearest Neighbours
- ItemKNN
  - Model description | Recommendations | RecTools implementation | Model Application | Links
Linear Autoencoders
- EASE
  - Model description | Recommendations | Model Application | Links

[2]:

import numpy as np
import pandas as pd
import warnings
from pathlib import Path
import os
import threadpoolctl

warnings.filterwarnings('ignore')

from implicit.als import AlternatingLeastSquares
from implicit.bpr import BayesianPersonalizedRanking
from implicit.nearest_neighbours import CosineRecommender

# lightfm extension is required for the LighFM section. You can install it with `pip install rectools[lightfm]`
try:
    from lightfm import LightFM
except ModuleNotFoundError:
    pass

from rectools import Columns
from rectools.dataset import Dataset
from rectools.models import (
    ImplicitALSWrapperModel,
    ImplicitBPRWrapperModel,
    LightFMWrapperModel,
    PureSVDModel,
    ImplicitItemKNNWrapperModel,
    EASEModel
)

# For vector models optimized ranking
os.environ["OPENBLAS_NUM_THREADS"] = "1"
threadpoolctl.threadpool_limits(1, "blas");

Prepare data

We are using KION dataset for this tutorial. The data was gathered from the users of MTS KION video streaming platform and includes user-item interactions together with user and item features

[ ]:

%%time
!wget -q https://github.com/irsafilo/KION_DATASET/raw/f69775be31fa5779907cf0a92ddedb70037fb5ae/data_en.zip -O data_en.zip
!unzip -o data_en.zip
!rm data_en.zip

[10]:

# Read data
DATA_PATH = Path("data_en")
users = pd.read_csv(DATA_PATH / 'users_en.csv', index_col=0)
print(users.shape)
users.head(2)

(840197, 5)

[10]:

	user_id	age	income	sex	kids_flg
0	973171	age_25_34	income_60_90	M	1
1	962099	age_18_24	income_20_40	M	0

[11]:

items = pd.read_csv(DATA_PATH / 'items_en.csv', index_col=0)
print(items.shape)
items.head(2)

(15963, 16)

[11]:

	item_id	content_type	title	title_orig	release_year	genres	countries	for_kids	age_rating	studios	description	keywords	actors_translated	actors_transliterated	directors_translated	transliterated
0	10711	film	Talk to her	Hable con ella	2002.0	drama, foreign, detective, melodrama	Spain	NaN	16.0	NaN	Marco, a journalist, interviews the famous Tor...	Talk, her, 2002, Spain, friends, love, strong,...	Adolfo Fernández, Ana Fernández, Dario Grandin...	Adol'fo Fernandes, Ana Fernandes, Dario Grandi...	Pedro Almodovar	Pedro Al'modovar
1	2508	film	Naked Peppers	Search Party	2014.0	foreign, adventure, comedy	USA	NaN	16.0	NaN	The main character has learned not to invite h...	Naked, Peppers, 2014, USA, friends, weddings, ...	Adam Palley, Brian Huskey, JB Smoove, Jason Ma...	Adam Palli, Brajan Haski, Dzh.B. Smuv, Dzhejso...	Scott Armstrong	Skot Armstrong

[12]:

interactions = (
    pd.read_csv(DATA_PATH / 'interactions.csv', parse_dates=["last_watch_dt"])
    .rename(columns={"last_watch_dt": Columns.Datetime})
)
print(interactions.shape)
interactions.head(2)

(5476251, 5)

[12]:

	user_id	item_id	datetime	total_dur	watched_pct
0	176549	9506	2021-05-11	4250	72.0
1	699317	1659	2021-05-29	8317	100.0

[13]:

# Process interactions
interactions[Columns.Weight] = np.where(interactions['watched_pct'] > 10, 3, 1)
interactions = interactions[["user_id", "item_id", "datetime", "weight"]]
print(interactions.shape)
interactions.head(2)

(5476251, 4)

[13]:

	user_id	item_id	datetime	weight
0	176549	9506	2021-05-11	3
1	699317	1659	2021-05-29	3

[14]:

# Process user features to the form of a flatten dataframe
users = users[["user_id", "age", "sex"]]
users.fillna('Unknown', inplace=True)
user_features_frames = []
for feature in ["sex", "age"]:
    feature_frame = users.reindex(columns=[Columns.User, feature])
    feature_frame.columns = ["id", "value"]
    feature_frame["feature"] = feature
    user_features_frames.append(feature_frame)
user_features = pd.concat(user_features_frames)
print(user_features.shape)
user_features.head(2)

(1680394, 3)

[14]:

	id	value	feature
0	973171	M	sex
1	962099	M	sex

[15]:

# Process item features to the form of a flatten dataframe
items = items.loc[items[Columns.Item].isin(interactions[Columns.Item])].copy()
items["genre"] = items["genres"].str.lower().str.replace(", ", ",", regex=False).str.split(",")
genre_feature = items[["item_id", "genre"]].explode("genre")
genre_feature.columns = ["id", "value"]
genre_feature["feature"] = "genre"
item_features = genre_feature
print(item_features.shape)
item_features.head(2)

(39942, 3)

[15]:

	id	value	feature
0	10711	drama	genre
0	10711	foreign	genre

[16]:

# Prepare hot test users
test_hot_users = [176549]  # have interactions and features
print(interactions[interactions["user_id"] == test_hot_users[0]].shape)
interactions[interactions["user_id"] == test_hot_users[0]].head(2)

(82, 4)

[16]:

	user_id	item_id	datetime	weight
0	176549	9506	2021-05-11	3
3815	176549	15469	2021-05-25	3

[17]:

print(user_features[user_features["id"] == test_hot_users[0]].shape)
user_features[user_features["id"] == test_hot_users[0]]

(2, 3)

[17]:

	id	value	feature
48323	176549	M	sex
48323	176549	age_35_44	age

[18]:

# Prepare warm test users
test_warm_users = [1097541]  # have features but don't have interactions
interactions[interactions["user_id"] == test_warm_users[0]].shape

[18]:

(0, 4)

[19]:

print(user_features[user_features["id"] == test_warm_users[0]].shape)
user_features[user_features["id"] == test_warm_users[0]]

(2, 3)

[19]:

	id	value	feature
290301	1097541	F	sex
290301	1097541	age_35_44	age

[20]:

# Prepare cold test users
test_cold_users = [99999999]  # don't have features or interactions
interactions[interactions["user_id"] == test_cold_users[0]].shape

[20]:

(0, 4)

[21]:

user_features[user_features["id"] == test_cold_users[0]].shape

[21]:

(0, 3)

[22]:

# Create dataset with both user and item features
dataset = Dataset.construct(
    interactions_df=interactions,
    user_features_df=user_features,
    cat_user_features=["sex", "age"],
    item_features_df=item_features,
    cat_item_features=["genre"],
)

RANDOM_STATE=60

Matrix Factorization

The basic idea behind Matrix Factorization is to discover user and item embeddings in a common latent space so that the product of two matrices would reconstruct the original user-item interactions matrix as close as possible. The goal is to predict the missing values in the original matrix and use revealed scores for recommendations.

Implicit ALS

Model description

Goal of the model is to present interactions matrix as a product of user(X) and item(Y) embeddings. Implicit ALS model treats all non-zero entries in the matrix as value 1. The actual weight of the interactions is treated as confidence in the observation. Zero entries receive low confidence since this they are treated as missing values and might actually hide items highly relevant to users. Non-zero entries with high confidence will have greater impact on the loss when not predicted correctly. Overall loss functions is the following:

\[min_{x*, y*}{ \sum_{u, i}^{}c_{ui}(p_{ui}-x_{u}^Ty_{i})^2 +\lambda (\sum_{u}\lVert x_{u} \rVert ^2 + \sum_{i}\lVert y_{i} \rVert ^2)}\]

$c_{ui}$ - confidence in observing the item (${c_{ui} = 1 + \alpha r_{ui}}$)

$r_{ui}$ - weight assigned to interaction of user u with item i
$\alpha$ - rate of increase in confidence. Determines to what extent change in weight modifies confidence

\(p_{ui}\) - binary representation, whether user and item had interaction
\(x_{u}\), \(y_{i}\) - vectors we need to find for users and items
\(\lambda\) - regularization term to avoid overfitting

Since both X and Y matrices have to be calculated, an alternating least squares algorithm is used. The procedure is done by repeatadly performing following 2 steps:

Fix X (user matrix) and find optimal Y (item matrix)
Fix Y (item matrix) and find optimal X (usermatrix)

This algorithm simplifies calculations, as when one of the matrices is fixed, the cost function becomes quadratic and has an easily achievable minimum. After the algorithm has converged, values of X and Y are taken as embeddings.

Recommendations

Recommendations for all users are received from multiplication of $X^T$ and $Y$, after that top-k can be extracted from $\hat{p} = X^T Y$. As an example consider recommendation procedure for one user, when item and user embeddings are received:

Calculate predicted preferences $\hat{p}_{u} = x_{u}^Ty$ (first row of $\hat{p}$ in the picture). It contains information about how likely first user is to be interested in all items
Take top K items with the greatest value of $\hat{p}_{ui}$. In the picture if K=1, first item should be recommended, as 8 > 3 and 8 > 7.4

RecTools implementation

RecTools provides a wrapper for implicit library iALS implementation which is the most efficient and widely used: implicit.als.AlternatingLeastSquares model. Additionally, RecTools offers a modification, which allows to add explicit user and item features to the model.

For items it is done in 2 steps (for user explicit features procedure is similar):

Explicit item features are added as additional columns to item embeddings
Same number of columns is added to user embeddings (paired to explicit item features)

If both user and item explicit features are used, each embedding matrix contains three logical parts: latent factors, explicit features and paired features. Explicit features will remain their original values after training. Paired features can either be fit together with latent features or separately, this is a hyper-parameter for the wrapper.

Model application

Specify latent embeddings size with AlternatingLeastSquares factors
Specify regularization
Specify iterations for number of model training epochs
Specify alpha for confidence computation with AlternatingLeastSquares alpha. Interactions weights that were already present in the dataset will be multiplied by this number. Remember from above that $c_{ui} = 1 + \alpha r_{ui}$ and interactions weights are treated as confidence in observations
Pass AlternatingLeastSquares model to the wrapper
Specify a way to fit paired features with ImplicitALSWrapperModel fit_features_together
Use dataset with features for fit and recommend to add feature processing to the model

[16]:

%%time
model = ImplicitALSWrapperModel(
    AlternatingLeastSquares(
        factors=10,  # latent embeddings size
        regularization=0.1,
        iterations=10,
        alpha=50,  # confidence multiplier for non-zero entries in interactions
        random_state=RANDOM_STATE,
    ),
    fit_features_together=False,  # way to fit paired features
)
model.fit(dataset);

CPU times: user 27.3 s, sys: 326 ms, total: 27.6 s
Wall time: 27.6 s

[17]:

recos = model.recommend(
    users=test_hot_users,
    dataset=dataset,
    k=3,
    filter_viewed=True,
)
recos.merge(items[["item_id", "title_orig"]], on="item_id")

[17]:

	user_id	item_id	score	rank	title_orig
0	176549	7793	2.856475	1	Radioflash
1	176549	16361	2.820227	2	Doom: Annihilation
2	176549	14025	2.778898	3	We Die Young

Vectors from trained model include all of the latent, explicit and paired factors. Vector size is factors + (number of explicit user features) + (number of explicit item features)

[18]:

user_vectors, item_vectors = model.get_vectors()
print(user_vectors.shape)
print(item_vectors.shape)

(962179, 176)
(15706, 176)

Links

iALS original paper: Collaborative Filtering for Implicit Feedback Datasets
Optimization for iALS training: Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering
implicit library iALS model documentation
Proving iALS quality: Revisiting the Performance of iALS on Item Recommendation Benchmarks
Proving iALS with features quality: Benchmark quality: RecTools wrapper for iALS with features

Bayesian Personalized Ranking Matrix Factorization (BPR MF)

Model description

iALS model tries to minimize (weighted) reconstruction error during matrix factorization, providing a pointwise loss: it calculates how well each point in interactions matrix is reconstructed by the model.

Bayesian personalized ranking introduces a pairwise loss instead. For each user model takes a pair of items: one positive and one negative where positive item was present in user interactions and negative item wasn’t. The goal of the algorithm is to rank positive item higher then negative one. It is useful for cases when only positive interactions are present in data and when the goal is to maximize ROC AUC.

BPR is derived from a Bayesian formulation of the problem that aims to maximize posterior probability. Resulting likelihood formula in general case is the following:

\[\sum_{(u,i,j) \in D_s} ln \sigma (\hat{p}_{uij}) - \lambda_{\theta} \lVert \theta \rVert ^2\]

\(D_s := \{(u,i,g)| i \in I_{u}^+ \wedge j \in I \backslash I_{u}^+\}\), set containing triplets of user, positive item and negative item
\(\sigma (\hat{p}_{uij})\) - probability that a user really prefers item i to item j
\(\hat{p}_{uij}\) - arbitrary function of the model which captures the relationship between user u, item i and item j
\(\theta\) - parameters of the model to find

For matrix factorization formula can be rewritten taking into account that matrix factorization implies user and item embeddings dot product as an arbitrary scoring function. Here item biases can also be present:

$\hat{p}_{uij} = \hat{p}_{ui} - \hat{p}_{uj} = x_uy_i + b_i - x_uy_j - b_j = x_u(y_i - y_j) + (b_i - b_j)$
$\Theta = (e^U, e^I)$

Thus, the formula for likelihood function for BPR matrix factorization is:

\[\sum_{(u,i,j) \in D_s} ln \sigma ((x_u(y_i - y_j) + b_i - b_j) - \lambda( \lVert e^U \rVert ^2 + \lVert e^I \rVert ^2)\]

The learning method is usually based on stochastic gradient descent.

Recommendations

Recommendations for all users are received from multiplication of learnt embedding matrixes $X^T$ and $Y$ (user and/or item biases can be added to final scores depeding on implementation). After that top-k items can be extracted.

RecTools implementation

RecTools provides a wrapper for the implicit BayesianPersonalizedRanking model. implicit model implementation has a few important features: - Model ignores the weights of interactions and treats all interacitons in the dataset equally - Model training process is not deterministic even when random_state is provided - regularization factor is shared for user and item factors - Model is trained with SGD - Popularity-based negative sampling is applied

Model application

Specify latent embeddings size with BayesianPersonalizedRanking factors
Specify learning_rate
Specify regularization
Specify iterations for number of model training epochs
Pass BayesianPersonalizedRanking model to the wrapper

[23]:

%%time
model = ImplicitBPRWrapperModel(
    BayesianPersonalizedRanking(
        factors=10,  # latent embeddings size
        regularization=0.1,
        iterations=10,
        random_state=RANDOM_STATE,
    ),
)
model.fit(dataset);

CPU times: user 776 ms, sys: 404 ms, total: 1.18 s
Wall time: 1.31 s

[24]:

recos = model.recommend(
    users=test_hot_users,
    dataset=dataset,
    k=3,
    filter_viewed=True,
)
recos.merge(items[["item_id", "title_orig"]], on="item_id")

[24]:

	user_id	item_id	score	rank	title_orig
0	176549	15297	0.085663	1	Klinika schast'ya
1	176549	10440	0.085404	2	Khrustal'nyy
2	176549	13865	0.058471	3	V2. Escape from Hell

[25]:

user_vectors, item_vectors = model.get_vectors()
print(user_vectors.shape)
print(item_vectors.shape)

(962179, 11)
(15706, 11)

[35]:

# constant `1`` was the last value of each user embedding on model initialization
# These values are used for item biases multiplication
# But note that implicit framework changes these values during training
user_vectors[-1]

[35]:

array([-0.02226695,  0.01090536,  0.00503718, -0.02011885, -0.00435659,
        0.04752969, -0.01570643,  0.00631561, -0.03203988,  0.02608355,
        1.0004302 ], dtype=float32)

[36]:

# item bias is the last value of each item embedding
item_vectors[0]

[36]:

array([ 0.00281794, -0.00169547, -0.00304997,  0.00588632,  0.00074773,
       -0.00124499,  0.00424563,  0.00047185,  0.00077127, -0.00010804,
       -0.02039964], dtype=float32)

Links

BPR Loss original paper: BPR: Bayesian Personalized Ranking from Implicit Feedback
Implicit library BPR model documentation
Comparison of BPR implementations: Revisiting BPR: A Replicability Study of a Common Recommender System Baseline

LightFM

Model description

LightFM is a python library with an equally named model that has a few special features to add to a classic Matrix Factorization task. It is widely used in both industry and RecSys challenges. Key features are:

LightFM introduces biases to both users and items in addition to their latent representations.
LightFM is a hybrid matrix factorization model, which aims to include user and item metadata into matrix factorization procedure. To incorporate it LightFm estimates an embedding for each feature and uses a linear combination of feature embeddings to find user/item latent representation.
As a result, LightFM provides a solution for building recommendations in warm and cold start scenarios. It utilises item biases and user features embeddings for this task.
When building a model, LightFM provides a few loss functions to choose from:

Logistic
BPR (Bayesian Personalized Ranking). Building a LightFM model with BPR loss with no features will result in a BPR-MF model
WARP (Weighted Approximate-Rank Pairwise) and k-OS WARP. These are usually the best performing choices for the top-K recommendation task.

To incorporate user metadata, user latent representation is derived:

Concat user identity matrix with user features (matrix A)
Take matrix of user factors with prespecified n_factors (matrix B)
Compute $A\cdot B$. Each row in this matrix is a latent representation for one user

To find item latent representations similar procedure should be performed.

LightFM with logistic loss maximizes likelihood of receiving interactions matrix. Zero entries are considered as negatives, non-zero entries are considered positives.

\[L(e^U, e^I, b^U, b^I) = \prod_{(u,i)\in S^{+}} \hat p_{ui} \times \prod_{(u,i)\in S^{-}}(1 - \hat p_{ui})\]

\(\hat p_{ui} = f(x_{u}\cdot y_{i} + b_{u} + b_{i})\), where f is sigmoid function, as binary prediction is made
\(x_{u} = \sum_{j\in f_{u}} e_{j}^{U}\), \(y_{i} = \sum_{j\in f_{i}} e_{j}^{I}\) - latent representations of user and item with \(e_{j}^{U}\) and \(e_{j}^{I}\) as feature embeddings
\(b_{u} = \sum_{j\in f_{u}} b_{j}^{U}\), \(b_{i} = \sum_{j\in f_{i}} b_{j}^{I}\) - bias term for users and items. \(b_{j}^{U}\) and \(b_{j}^{I}\) are scalar biases
\(S^{+}\) and \(S^{-}\) are observed and not observed interactions, respectively

BPR is a pairwise loss, which maximizes the difference of scores between positive and random negative examples for user. BPR is derived from a Bayesian formulation of the problem that aims to maximize posterior probability. Resulting likelihood formula in general case is the following:

\[\sum_{(u,i,j) \in D_s} ln \sigma (\hat{p}_{uij}) - \lambda_{\theta} \lVert \theta \rVert ^2\]

\(D_s := \{(u,i,g)| i \in I_{u}^+ \wedge j \in I \backslash I_{u}^+\}\), set containing triplets of user, positive item and negative item
\(\sigma (\hat{p}_{uij})\) - probability that a user really prefers item i to item j
\(\hat{p}_{uij}\) - arbitrary function of the model which captures the relationship between user u, item i and item j
\(\theta\) - parameters of the model to find

For LightFM framework formula can be rewritten taking into account that:

$\hat{p}_{uij} = \hat{p}_{ui} - \hat{p}_{uj} = x_uy_i + b_u + b_i - x_uy_j - b_u - b_j = x_u(y_i - y_j) + (b_i - b_j)$
$\Theta = (e^U, e^I, b^U, b^I)$
LightFM has 2 regularization parameters, which is applied both to embeddings and biases: $\lambda_{item}$ - for items and $\lambda_{user}$ - for users.

Thus, formula is:

\[\sum_{(u,i,j) \in D_s} ln \sigma (x_u(y_i - y_j) + (b_i - b_j)) - \lambda_{user}( \lVert e^U \rVert ^2 + \lVert b^U \rVert ^2) - \lambda_{item} (\lVert e^I \rVert ^2 + \lVert b^I \rVert ^2)\]

WARP, unlike BPR, takes a non-random negative example, which is chosen by comparing scoring predictions. In case prediction for negative example is greater than for positive, negative is taken for optimisation. Gradient update is proportional to number of examples sampled before appropriate one is found, as it reflects how many samples are ranked incorrectly. Such an approach results in better ranking, thus, it is useful when precision@k should be maximized and only positive interactions are present.

Model is trained using asynchronous stochastic gradient descent. Weights of the interactions define how much each observation affects the loss.

Recommendations

For hot user that already had interactions during training scores are computed from user and item latent representations and biases: $score = x_{u}\cdot y_{i} + b_u + b_i$
In the warm start scenario the user has no interactions, but user features are known. In this case user representation is taken from user features embeddings and then score is computed: $score = x_{u}\cdot y_{i} + b_i$
In the cold start scenario the user has no interactions and no user features are known. In this case model recommends popular items, basing on biases only $score = b_i$

Top k items with highest scores for each user are taken as recommendations. Sigmoid function can be skipped during recommendations for faster inference

RecTools implementation

RecTools provides a wrapper for the LightFM model and additionally: - offers 10-25 times faster inference - provides recommendations for hot, cold and warm users in the same interface out of the box

Model application

lightfm extension for rectools is required to run this code. You can install it with pip install rectools[lightfm]

Select loss from “logistic”, “warp”, “bpr”, “warp-kos”. “logistic” is default but it usually has the worst performance
Specify embeddings size with LightFM no_components
Specify l2 regularization penalty on features with LightFM item_alpha and user_alpha
Specify other LightFM params
Pass LightFM model to the wrapper
Specify epochs for number of model training epochs
Use dataset with features for fit and recommend to add feature embeddings to the model

[19]:

%%time
model = LightFMWrapperModel(LightFM(no_components=10, loss="bpr", random_state=RANDOM_STATE))
model.fit(dataset);

CPU times: user 6.28 s, sys: 53.8 ms, total: 6.33 s
Wall time: 6.33 s

Recommend for any type of users (hot / warm / cold) with just passing the necessary ids to recommend

[20]:

recos = model.recommend(
    users=test_hot_users + test_warm_users + test_cold_users,
    dataset=dataset,
    k=3,
    filter_viewed=True,
)
recos.merge(items[["item_id", "title_orig"]], on="item_id").sort_values(["user_id", "rank"])

[20]:

	user_id	item_id	score	rank	title_orig
0	176549	15861	-220.293716	1	The Dark Tower
1	176549	14361	-220.354294	2	Legion
2	176549	7797	-220.420486	3	Daikaijû kettô: Gamera tai Barugon
3	1097541	15297	-221.292953	1	Klinika schast'ya
5	1097541	3734	-221.551758	2	Prababushka lyogkogo povedeniya
6	1097541	10440	-221.598358	3	Khrustal'nyy
4	99999999	15297	1.297418	1	Klinika schast'ya
7	99999999	10440	1.273827	2	Khrustal'nyy
8	99999999	2657	0.924446	3	Podslushano

Vectors from model include user biases, item biases and embeddings. Vector size is no_components + 2

[21]:

user_vectors, item_vectors = model.get_vectors(dataset)  # pass dataset to process features
print(user_vectors.shape)  # these include all users with features (hot + warm)
print(item_vectors.shape)

(1058088, 12)
(15706, 12)

[22]:

# user bias is the first value, constant `1`` is the second value (for item biases multiplication)
user_vectors[0]

[22]:

array([-2.23379684e+02,  1.00000000e+00,  5.62264979e-01, -1.28065735e-01,
       -4.61941302e-01, -1.90039784e-01, -8.23755711e-02, -9.92318988e-02,
        2.55903333e-01, -1.77466646e-02, -4.70214367e-01,  3.65182817e-01])

[23]:

# constant `1`` is the first value (for user biases multiplication), item bias is the second values
item_vectors[0]

[23]:

array([ 1.        , -0.2681751 ,  0.3677758 ,  2.59228992, -1.3499459 ,
       -2.1741879 ,  0.18057333, -3.33403778,  0.66558087, -0.87214583,
       -1.09833658, -1.58371723])

Links

LightFM original paper: Metadata Embeddings for User and Item Cold-start Recommendations
LightFM documentation
BPR Loss original paper: BPR: Bayesian Personalized Ranking from Implicit Feedback
LightFM explanation on WARP and K-os WARP loss: Learning-to-rank using the WARP loss
Benchmark inference speed: RecTools wrapper for LightFM

PureSVD

Model description

Singular value decompozition (SVD) proposes the best approximation of Interactions matrix with respect to least squares error. PureSVD model does not include any regularization and it cannot assign weights to observations. Instead the model tries to reconstruct all of the values in the matrix as close as possible including zero-entries. This approach is usually not the best one for implicit feedback datasets but it was a very important step in recommender models evolution.

Model aims to factorize interactions matrix into 3 matrices $U$, $\Sigma$ and $V$, as shown in the picture below. $U$ and $V$ matrices are orthonormal and $\Sigma$ is diagonal with square roots of Interactions eigenvalues on the diagonal.

RecTools implementation

RecTools implements a truncated version of SVD. For that k factors should be extracted, meaning that k greatest singular values are taken from $\Sigma$ and k first columns from U and V.

Latent representations of users and items are the following:

user factors $X$ = $U$
item factors $Y$ = $\Sigma \cdot V^{T}$

RecTools uses scipy implementation of scipy.sparse.linalg.svds function to compute decomposition.

Recommendations

Recommendation for user/item pair is the following: $\hat{r_{ui}} = x_u y_i$
Take k items with greatest $\hat{r_{ui}}$ for fixed user

Model application

Specify embedding size with PureSVDModel factors
If we use dataset with features for fit and recommend, model just skips this info

[24]:

%%time
model = PureSVDModel(factors=10)
model.fit(dataset);

CPU times: user 681 ms, sys: 19.7 ms, total: 700 ms
Wall time: 699 ms

[25]:

recos = model.recommend(
    users=test_hot_users,
    dataset=dataset,
    k=3,
    filter_viewed=True,
)
recos.merge(items[["item_id", "title_orig"]], on="item_id")

[25]:

	user_id	item_id	score	rank	title_orig
0	176549	7571	1.893236	1	100% Wolf
1	176549	12173	1.524853	2	Avengers: Endgame
2	176549	10942	0.996734	3	MARVEL'S THE AVENGERS

[26]:

user_vectors, item_vectors = model.get_vectors()
print(user_vectors.shape)
print(item_vectors.shape)

(962179, 10)
(15706, 10)

Links

PureSVD original paper: Performance of Recommender Algorithms on Top-N Recommendation Tasks

Nearest Neighbours

ItemKNN

Model description

Model bases on the idea that users may like items similar to what they have interacted with previously. To achieve this goal model starts with computing item-to-item similarities from the interactions matrix.

Algorithm used is the following:

Get items vectors as columns in the Interactions matrix
Compute distances between item vectors (e.g. Cosine, TF-IDF, BM25)
Keep only K closest vectors for each item
Form item-item similarity matrix with every item having K filled scores for top similar items
Build recommendations by multiplying user interactions on item-item similarity matrix so that users receive similar items to the ones that they have already interacted with

$IMG_92C8B6EE9930-1.jpeg$

Recommendations

Consider an example of how to make a top-1 recommendation for one user.

Suppose user interacted with items 2 and 4, which have weights 2 and 4, respectively
Calculate similarity matrix for each item, which includes 2 neighbors (for other items set similarity equal to 0). Resulting matrix stores information on item closeness
For recommendation use interaction values as weights and calculate how likely a user is to like an item. For instance, as the user has weight 2 for item 2, we can go through the closest items and add results to the respective item. Closest to item 2 is item 1, thus, add to item 1 the following: $2 \cdot 0.8 = 1.6$, add to item 4: $2 \cdot 0.5 = 1$
After all calculations are performed, sum the values for each item and recommend items with the greatest sum. Also, it is possible to filter out already seen recommendations by setting their value to 0.

RecTools implementation

RecTools provides wrapper for implicit library ItemKNN models

Model application

pass one of the base models: implicit.nearest_neighbours.CosineRecommender, implicit.nearest_neighbours.TFIDFRecommender, implicit.nearest_neighbours.BM25Recommender
specify K for number of neighbours in item-item similarity matrix
specify model params for BM25Recommender
If we use dataset with features for fit and recommend, model just skips this info

[27]:

%%time
model = ImplicitItemKNNWrapperModel(CosineRecommender(K=10))
model.fit(dataset);

CPU times: user 764 ms, sys: 32.2 ms, total: 797 ms
Wall time: 796 ms

[28]:

recos = model.recommend(
    users=test_hot_users,
    dataset=dataset,
    k=3,
    filter_viewed=True,
)
recos.merge(items[["item_id", "title_orig"]], on="item_id")

[28]:

	user_id	item_id	score	rank	title_orig
0	176549	11749	5.105861	1	Incredibles 2
1	176549	16270	4.125425	2	Coco
2	176549	11985	3.047437	3	Toy Story 4

Links

implicit library itemKKN documentation
Comparison of Cosine, TF-IDF and BM25 metrics for the algorithm: Distance Metrics for Fun and Profit

Linear Autoencoders

Autoencoders in general attempt to make output of the model as close to input as possible. In recommendation setting there is a class of models that aim for the same task. Interactions matrix should be approximated as closely as possible based on the data that is present only in this matrix itself. Linear autoencoders like EASE and SLIM were found to be strong recommender baselines by many academy and industry practitioners.

EASE

Model description

The model’s goal is to find a dense item-item similarity matrix which will reconstruct the interactions matrix when being multiplied on it. Unlike neighborhood models, its computation is done by loss minimization.

Matrix we would like to receive is B, it can be received from the following minimization problem:

\[\begin{split}\begin{equation} \begin{cases} min_{B} \lVert X - XB \rVert _{F}^{2} + \lambda \cdot \lVert B \rvert\rvert _{F}^{2} \\ diag(B) = 0 \end{cases}\,. \end{equation}\end{split}\]

X - interactions matrix
B - weight-matrix, which should be found
\(\lVert X - XB \rvert\rvert _{F}\) - Frobenius norm
\(\lambda\) - regularization term to avoid overfitting
Constraint is needed, as otherwise model may return identity B and because of the fact that replicating input model should generalize (not recommend the same item as input)

By analyzing closed-form solution it was found out that B can be calculated from the inverse of Gram matrix = $X \cdot X^{T}$. Thus, simple algorithm is used to find B.

Algorithm:

Define $G = X \cdot X^{T}$
Add $\lambda$ to diagonal indices of G
Find inverse of G ($G^{-1}$)
B = $\frac{G^{-1}}{diag(G^{-1})}$, where $diag(G^{-1})$ is array of diagonal elements
Set all elements on the diagonal of B equal to 0

Recommendations

Recommendations for one user are defined as following:

Compute scores $s_{u} = x_{u} \cdot B$ where $x_{u}$ refers to user row in interactions matrix
Take k items with greatest scores

Model application

regularization is the only hyper-parameter for the model
This algorithm requires a lot of RAM during fit method. Out-of-memory issues are possible for big datasets. Reasonable catalog size for local development is about 30k items. Reasonable amount of interactions is about 20m.
If we use dataset with features for fit and recommend, model just skips this info

[29]:

%%time
model = EASEModel(regularization=500)
model.fit(dataset);

CPU times: user 3min 22s, sys: 2.13 s, total: 3min 25s
Wall time: 3min 25s

[30]:

recos = model.recommend(
    users=test_hot_users,
    dataset=dataset,
    k=3,
    filter_viewed=True,
)
recos.merge(items[["item_id", "title_orig"]], on="item_id")

[30]:

	user_id	item_id	score	rank	title_orig
0	176549	11749	1.545711	1	Incredibles 2
1	176549	334	1.422080	2	Brave
2	176549	8254	1.272578	3	Finding Nemo

Links

EASE original paper: Embarrassingly Shallow Autoencoders for Sparse Data
Proving EASE quality: Everyone’s a Winner! On Hyperparameter Tuning of Recommendation Models
Proving EASE quality: Top-N Recommendation Algorithms: A Quest for the State-of-the-Art
Proving EASE quality: Challenging the Myth of Graph Collaborative Filtering: a Reasoned and Reproducibility-driven Analysis
Proving linear autoencoders quality (SLIM model): Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches