Examples of calculating different metrics with RecTools
Table of Contents
Load and preprocess data: Movielens
Build model
Calculate metrics
Metrics initialization
Single metric calculation
Per user metric calculation
Multiple metrics calculation with one function
We provide metrics that measure model performance from several different angles:
Classification:
HitRate, Precision (with R-Precision variant which divides by minimum between k and number of user test items), Recall, Accuracy, MCC, F1Beta
Ranking:
MRR, MAP (with an option to divide by k or by number of user test items), NDCG (with an option to select log base)
Advanced AUC based ranking:
PartialAUC, PAP (Partial AUC + Precision joint metric)
Beyond Accuracy:
Serendipity, MeanInvUserFreq (mean inverse user frequency to calculate “novelty”), Intra-List Diversity (based on some meta features of items)
Popularity bias:
AvgRecPopularity
Recommendations data quality:
SufficientReco (share of filled recommendations in the list), UnrepeatedReco (share of unique items recommended for each user), CoveredUsers (share of test users that have at least one recommendation)
Between-model comparison:
Intersection (share of common user-item pairs between different models recommendations)
[2]:
import numpy as np
import pandas as pd
from implicit.nearest_neighbours import TFIDFRecommender
from rectools import Columns
from rectools.dataset import Dataset
from rectools.metrics import (
Precision,
NDCG,
AvgRecPopularity,
Intersection,
HitRate,
SufficientReco,
DebiasConfig,
IntraListDiversity,
Serendipity,
calc_metrics,
)
from rectools.metrics.distances import PairwiseHammingDistanceCalculator
from rectools.models import ImplicitItemKNNWrapperModel
Load and preprocess data: Movielens
[3]:
%%time
!wget -q https://files.grouplens.org/datasets/movielens/ml-1m.zip -O ml-1m.zip
!unzip -o ml-1m.zip
!rm ml-1m.zip
Archive: ml-1m.zip
inflating: ml-1m/movies.dat
inflating: ml-1m/ratings.dat
inflating: ml-1m/README
inflating: ml-1m/users.dat
CPU times: user 125 ms, sys: 55.1 ms, total: 180 ms
Wall time: 5.83 s
[4]:
%%time
ratings = pd.read_csv(
"ml-1m/ratings.dat",
sep="::",
engine="python", # The python engine is required for the two-character separator
header=None,
names=[Columns.User, Columns.Item, Columns.Weight, Columns.Datetime],
)
print(ratings.shape)
ratings.head()
(1000209, 4)
CPU times: user 2.27 s, sys: 71.3 ms, total: 2.34 s
Wall time: 2.35 s
[4]:
| | user_id | item_id | weight | datetime |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
[5]:
ratings["datetime"] = pd.to_datetime(ratings["datetime"] * 10 ** 9)
ratings["datetime"].min(), ratings["datetime"].max()
[5]:
(Timestamp('2000-04-25 23:05:32'), Timestamp('2003-02-28 17:49:50'))
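The conversion above multiplies by 10 ** 9 because pd.to_datetime treats plain integers as nanoseconds since the Unix epoch, while MovieLens stores seconds. As a minimal sketch, the same result can be obtained by stating the unit explicitly; the toy series below reuses the first three timestamps from the table above.

```python
import pandas as pd

# Raw MovieLens timestamps are integer seconds since the Unix epoch
raw_seconds = pd.Series([978300760, 978302109, 978301968])

# Variant used in this notebook: scale seconds to nanoseconds
via_nanoseconds = pd.to_datetime(raw_seconds * 10**9)

# Alternative variant: tell pandas the unit explicitly
via_unit = pd.to_datetime(raw_seconds, unit="s")

print(via_unit.iloc[0])
print((via_nanoseconds == via_unit).all())
```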
[6]:
%%time
movies = pd.read_csv(
"ml-1m/movies.dat",
sep="::",
engine="python", # The python engine is required for the two-character separator
header=None,
names=[Columns.Item, "title", "genres"],
encoding_errors="ignore",
)
print(movies.shape)
movies.head()
(3883, 3)
CPU times: user 6.26 ms, sys: 1.53 ms, total: 7.79 ms
Wall time: 8.6 ms
[6]:
| | item_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
Build model
[7]:
# Split once into train and test to demonstrate how different metrics work
split_dt = pd.Timestamp("2003-02-01")
df_train = ratings.loc[ratings["datetime"] < split_dt]
df_test = ratings.loc[ratings["datetime"] >= split_dt]
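A time-based split like the one above sends every interaction to exactly one part. The sketch below checks this on a hypothetical toy frame (the values are illustrative, not MovieLens data) and shows one way to list test users that have no training history, which can be worth inspecting before computing metrics.

```python
import pandas as pd

# Hypothetical toy interactions, not taken from MovieLens
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3],
        "item_id": [10, 11, 10, 12],
        "datetime": pd.to_datetime(
            ["2003-01-15", "2003-02-10", "2003-01-20", "2003-02-05"]
        ),
    }
)

toy_split_dt = pd.Timestamp("2003-02-01")
toy_train = toy.loc[toy["datetime"] < toy_split_dt]
toy_test = toy.loc[toy["datetime"] >= toy_split_dt]

# Every row lands in exactly one part
assert len(toy_train) + len(toy_test) == len(toy)

# Test users without any training interactions ("cold" users)
cold_users = set(toy_test["user_id"].tolist()) - set(toy_train["user_id"].tolist())
print(cold_users)
```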
[8]:
%%time
# Prepare dataset, fit model and generate recommendations
dataset = Dataset.construct(df_train)
model = ImplicitItemKNNWrapperModel(TFIDFRecommender(K=10))
model.fit(dataset)
recos = model.recommend(
users=ratings[Columns.User].unique(),
dataset=dataset,
k=10,
filter_viewed=True,
)
CPU times: user 1.02 s, sys: 40.5 ms, total: 1.06 s
Wall time: 1.08 s
Calculate metrics
Metrics initialization
To calculate a metric, you first need to create a metric object.
Most metrics have a k parameter: the number of top recommendations used in the calculation. Some metrics have additional parameters.
Simple metrics
[9]:
serendipity = Serendipity(k=10)
precision = Precision(k=10, r_precision=True) # r_precision means division by min(k, n_user_test_items)
ndcg = NDCG(k=10, log_base=3)
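To make the r_precision option concrete, here is a from-scratch sketch for a single user. It is not the RecTools implementation; the function name and the toy data are hypothetical, and it assumes the user has at least one test item.

```python
def precision_at_k(recommended, relevant, k, r_precision=False):
    """Share of hits among the top-k recommendations for one user."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    # Classic precision divides by k; the R-Precision variant divides
    # by min(k, number of test items) instead
    denominator = min(k, len(relevant)) if r_precision else k
    return hits / denominator


recommended = [5, 8, 3, 9, 1, 7, 2, 6, 4, 0]  # hypothetical top-10 list
relevant = {8, 1, 42, 77}  # hypothetical test items of the user

plain = precision_at_k(recommended, relevant, k=10)
r_variant = precision_at_k(recommended, relevant, k=10, r_precision=True)
print(plain, r_variant)
```

With two hits and four test items, the plain value is 2 / 10 while the R-Precision variant is 2 / 4, so users with few test items are not penalised for positions they could never fill.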
Metric with complex additional parameter
To calculate any diversity metric (e.g. IntraListDiversity), you need to measure the distance between items.
For example, you can use Hamming distance.
As features, let’s use movie genres.
[10]:
movies["genre"] = movies["genres"].str.split("|")
genre_exploded = movies[["item_id", "genre"]].set_index("item_id").explode("genre")
genre_dummies = pd.get_dummies(genre_exploded, prefix="", prefix_sep="").groupby("item_id").sum()
genre_dummies.head()
[10]:
| item_id | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
[11]:
distance_calculator = PairwiseHammingDistanceCalculator(genre_dummies)
ild = IntraListDiversity(k=10, distance_calculator=distance_calculator)
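As a rough illustration of what such a metric measures, the sketch below computes an unnormalised Hamming distance between the first three movies from the table above and averages it over all pairs in a list. It assumes that intra-list diversity is the mean pairwise distance within one recommendation list; the helper names are hypothetical and the library's exact computation may differ.

```python
from itertools import combinations

# Genres of the first three movies from the table above
genres = {
    1: {"Animation", "Children's", "Comedy"},
    2: {"Adventure", "Children's", "Fantasy"},
    3: {"Comedy", "Romance"},
}


def hamming(item_a, item_b):
    # For one-hot genre vectors, the number of differing positions equals
    # the size of the symmetric difference of the two genre sets
    return len(genres[item_a] ^ genres[item_b])


def intra_list_diversity(items):
    # Mean distance over all unordered pairs of items in the list
    pairs = list(combinations(items, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


toy_ild = intra_list_diversity([1, 2, 3])
print(toy_ild)
```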
Single metric calculation
The easiest way to calculate a metric is to use its calc method.
Every metric has this method, but the required arguments differ between metrics.
[17]:
precision_value = precision.calc(reco=recos, interactions=df_test)
print(f"precision: {precision_value}")
precision: 0.08501683501683503
[13]:
catalog = df_train[Columns.Item].unique()
serendipity_value = serendipity.calc(
reco=recos,
interactions=df_test,
prev_interactions=df_train,
catalog=catalog
)
print("Serendipity: ", serendipity_value)
Serendipity: 2.3436131849908687e-05
[14]:
print("NDCG: ", ndcg.calc(reco=recos, interactions=df_test))
NDCG: 0.06808226116073855
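For intuition about the log_base option, here is a from-scratch sketch of binary-relevance DCG and NDCG for one user. The function names, the toy data, and the normalisation (an ideal list with min(k, number of test items) hits on top) are assumptions for illustration and may not match how RecTools normalises.

```python
import math


def dcg_at_k(recommended, relevant, k, log_base=2):
    """Binary-relevance DCG: a hit at 1-based rank r adds 1 / log_base(r + 1)."""
    return sum(
        1 / math.log(rank + 1, log_base)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )


def ndcg_at_k(recommended, relevant, k, log_base=2):
    """DCG divided by the DCG of an ideal list (assumed normalisation)."""
    ideal = list(relevant)[: min(k, len(relevant))]  # all hits ranked first
    return dcg_at_k(recommended, relevant, k, log_base) / dcg_at_k(
        ideal, relevant, k, log_base
    )


toy_recommended = [5, 8, 3]  # hypothetical top-3 list
toy_relevant = {5, 3}  # hypothetical test items

print(dcg_at_k(toy_recommended, toy_relevant, k=3, log_base=2))
print(dcg_at_k(toy_recommended, toy_relevant, k=3, log_base=3))
print(ndcg_at_k(toy_recommended, toy_relevant, k=3, log_base=3))
```

In this particular sketch the base rescales DCG and the ideal DCG by the same factor, so it changes the raw DCG but cancels in the normalised ratio. Whether the same holds for the library depends on its normalisation, which is not shown here.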
[15]:
%%time
print("ILD: ", ild.calc(reco=recos))
ILD: 3.1908278145695363
CPU times: user 460 ms, sys: 39.5 ms, total: 499 ms
Wall time: 501 ms
Per user metric calculation
If you need the metric value for every user, use the calc_per_user method.
[18]:
precision_per_user = precision.calc_per_user(reco=recos, interactions=df_test)
print("\nprecision per user:")
display(precision_per_user.head())
print("Values are equal? ", precision_per_user.mean() == precision_value)
precision per user:
user_id
195 0.3
229 0.0
343 0.0
349 0.0
398 0.5
dtype: float64
Values are equal? True
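The check above compares two floats with ==, which holds in this run. Since the order of summation can change the last bits of a floating-point result, a tolerance-based comparison is a safer habit. A small sketch using the five per-user values printed above:

```python
import math

# The five per-user precision values shown above, keyed by user id
per_user = {195: 0.3, 229: 0.0, 343: 0.0, 349: 0.0, 398: 0.5}

# The aggregate value is the plain mean over users
overall = sum(per_user.values()) / len(per_user)

# Same mean computed in a different summation order
reordered = sum(sorted(per_user.values())) / len(per_user)

# Compare with a tolerance instead of exact equality
print(overall, math.isclose(overall, reordered))
```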
Multiple metrics calculation with one function
It is possible to calculate several metrics at once with a single function, calc_metrics.
It is optimised for performance: if several metrics share the same intermediate calculations, those are performed only once.
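The idea of sharing work between metrics can be illustrated without the library. In the plain-Python sketch below (hypothetical data and names, not RecTools internals), per-user hit counts are computed once and then reused for both hit rate and precision.

```python
# Hypothetical top-3 recommendations and test items per user
recos_by_user = {"u1": [1, 2, 3], "u2": [4, 5, 6], "u3": [7, 8, 9]}
test_by_user = {"u1": {2, 3}, "u2": {10}, "u3": {7}}
k = 3

# Shared step: count hits per user once
hits = {
    user: sum(item in test_by_user[user] for item in items[:k])
    for user, items in recos_by_user.items()
}

# Both metrics are derived from the same intermediate result
n_users = len(hits)
toy_hit_rate = sum(h > 0 for h in hits.values()) / n_users
toy_precision = sum(h / k for h in hits.values()) / n_users
print(toy_hit_rate, toy_precision)
```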
[16]:
# Here we provide a debias config for one of the metrics in addition to calculating its regular value
# Check our "Debiased metrics calculation user guide" for more info
metrics = {
"hit_rate@10": HitRate(k=10),
"hit_rate_debiased@10": HitRate(k=10, debias_config=DebiasConfig(iqr_coef=1.5, random_state=32)),
"sufficient@10": SufficientReco(k=10, deep=True),
"sufficient@20": SufficientReco(k=20, deep=True),
"pop_bias@10": AvgRecPopularity(k=10, normalize=True),
"ndcg@10": ndcg,
"serendipity@10": serendipity,
"diversity@10": ild,
"intersection@10": Intersection(k=10)
}
# Some arguments can be omitted if they are not needed for metrics calculation.
calc_metrics(
metrics,
reco=recos,
interactions=df_test, # needed for all `TruePositive` based metrics
prev_interactions=df_train, # needed for serendipity
catalog=catalog, # needed for serendipity
ref_reco={"same_model": recos}, # needed for intersection; usually this should be recos from a different model
)
[16]:
{'hit_rate@10': 0.1717171717171717,
'hit_rate_debiased@10': 0.16161616161616163,
'ndcg@10': 0.06808226116073855,
'pop_bias@10': 0.0017362993523658327,
'diversity@10': 3.1908278145695363,
'serendipity@10': 2.3436131849908687e-05,
'intersection@10_same_model': 1.0,
'sufficient@10': 1.0,
'sufficient@20': 0.5}
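The sufficient@20 value of 0.5 follows from the recommendation step: each user received 10 items (sufficient@10 is 1.0), so only half of the 20 requested positions are filled. A minimal sketch of that arithmetic, assuming the metric is the share of filled positions in the top-k as described in the overview (the helper name is hypothetical):

```python
def sufficient_share(n_recommended, k):
    """Share of the top-k positions that are actually filled for one user."""
    return min(n_recommended, k) / k


share_at_10 = sufficient_share(10, k=10)
share_at_20 = sufficient_share(10, k=20)
print(share_at_10, share_at_20)
```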