Examples of calculating different metrics with RecTools

Table of Contents

Load and preprocess data: Movielens
Build model
Calculate metrics
- Metrics initialization
- Single metric calculation
  - Per user metric calculation
- Multiple metrics calculation with one function

RecTools provides metrics that measure model performance from several angles:

Classification:
- HitRate, Precision (with an R-Precision variant that divides by the minimum of k and the number of user test items), Recall, Accuracy, MCC, F1Beta
Ranking:
- MRR, MAP (with an option to divide by k or by the number of user test items), NDCG (with an option to select the log base)
Advanced AUC based ranking:
- PartialAUC, PAP (Partial AUC + Precision joint metric)
Beyond Accuracy:
- Serendipity, MeanInvUserFreq (mean inverse user frequency, used to calculate “novelty”), Intra-List Diversity (based on meta features of items)
Popularity bias:
- AvgRecPopularity
Recommendations data quality:
- SufficientReco (share of filled recommendations in the list), UnrepeatedReco (share of unique items recommended for each user), CoveredUsers (share of test users that have at least one recommendation)
Between-model comparison:
- Intersection (share of common user-item pairs between different models recommendations)

[2]:

import numpy as np
import pandas as pd

from implicit.nearest_neighbours import TFIDFRecommender

from rectools import Columns
from rectools.dataset import Dataset
from rectools.metrics import (
    Precision,
    NDCG,
    AvgRecPopularity,
    Intersection,
    HitRate,
    SufficientReco,
    DebiasConfig,
    IntraListDiversity,
    Serendipity,
    calc_metrics,
)
from rectools.metrics.distances import PairwiseHammingDistanceCalculator
from rectools.models import ImplicitItemKNNWrapperModel


Load and preprocess data: Movielens

[3]:

%%time
!wget -q https://files.grouplens.org/datasets/movielens/ml-1m.zip -O ml-1m.zip
!unzip -o ml-1m.zip
!rm ml-1m.zip

Archive:  ml-1m.zip
  inflating: ml-1m/movies.dat
  inflating: ml-1m/ratings.dat
  inflating: ml-1m/README
  inflating: ml-1m/users.dat
CPU times: user 125 ms, sys: 55.1 ms, total: 180 ms
Wall time: 5.83 s

[4]:

%%time
ratings = pd.read_csv(
    "ml-1m/ratings.dat",
    sep="::",
    engine="python",  # The python engine is required for multi-character separators
    header=None,
    names=[Columns.User, Columns.Item, Columns.Weight, Columns.Datetime],
)
print(ratings.shape)
ratings.head()

(1000209, 4)
CPU times: user 2.27 s, sys: 71.3 ms, total: 2.34 s
Wall time: 2.35 s

[4]:

	user_id	item_id	weight	datetime
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291

[5]:

# The raw values are Unix timestamps in seconds
ratings["datetime"] = pd.to_datetime(ratings["datetime"], unit="s")
ratings["datetime"].min(), ratings["datetime"].max()

[5]:

(Timestamp('2000-04-25 23:05:32'), Timestamp('2003-02-28 17:49:50'))

[6]:

%%time
movies = pd.read_csv(
    "ml-1m/movies.dat",
    sep="::",
    engine="python",  # The python engine is required for multi-character separators
    header=None,
    names=[Columns.Item, "title", "genres"],
    encoding_errors="ignore",
)
print(movies.shape)
movies.head()

(3883, 3)
CPU times: user 6.26 ms, sys: 1.53 ms, total: 7.79 ms
Wall time: 8.6 ms

[6]:

	item_id	title	genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy

Build model

[7]:

# Split once into train and test by date to demonstrate how different metrics work
split_dt = pd.Timestamp("2003-02-01")
df_train = ratings.loc[ratings["datetime"] < split_dt]
df_test = ratings.loc[ratings["datetime"] >= split_dt]

[8]:

%%time

# Prepare dataset, fit model and generate recommendations
dataset = Dataset.construct(df_train)
model = ImplicitItemKNNWrapperModel(TFIDFRecommender(K=10))
model.fit(dataset)
recos = model.recommend(
    users=ratings[Columns.User].unique(),
    dataset=dataset,
    k=10,
    filter_viewed=True,
)

CPU times: user 1.02 s, sys: 40.5 ms, total: 1.06 s
Wall time: 1.08 s

Calculate metrics

Metrics initialization

To calculate a metric, first create a metric object.

Most metrics take a k parameter: the number of top recommendations used in the calculation. Some metrics take additional parameters.

Simple metrics

[9]:

serendipity = Serendipity(k=10)
precision = Precision(k=10, r_precision=True)  # r_precision means division by min(k, n_user_test_items)
ndcg = NDCG(k=10, log_base=3)
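
To make the effect of the k and r_precision parameters concrete, here is a minimal plain-Python sketch for a single user. The function precision_at_k and the toy data are hypothetical and only illustrate the definitions given above; this is not the RecTools implementation.

```python
def precision_at_k(recommended, relevant, k, r_precision=False):
    """Share of the top-k recommendations that appear in the user's test items.

    Assumes the user has at least one test item.
    """
    hits = sum(1 for item in recommended[:k] if item in relevant)
    # With r_precision the denominator shrinks to min(k, number of test items)
    denominator = min(k, len(relevant)) if r_precision else k
    return hits / denominator


recommended = [10, 20, 30, 40, 50]  # hypothetical ranked list for one user
relevant = {20, 50}                 # hypothetical test items of that user

print(precision_at_k(recommended, relevant, k=5))                    # 0.4
print(precision_at_k(recommended, relevant, k=5, r_precision=True))  # 1.0
```

With only two test items the user can never reach a regular precision@5 above 0.4, which is why the R-Precision variant divides by 2 instead of 5.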

Metric with complex additional parameter

To calculate any diversity metric (e.g. IntraListDiversity), you need a way to measure the distance between items.

For example, you can use Hamming distance.

As features, let’s use movie genres.

[10]:

movies["genre"] = movies["genres"].str.split("|")
genre_exploded = movies[["item_id", "genre"]].set_index("item_id").explode("genre")
genre_dummies = pd.get_dummies(genre_exploded, prefix="", prefix_sep="").groupby("item_id").sum()
genre_dummies.head()

[10]:

	Action	Adventure	Animation	Children's	Comedy	Crime	Documentary	Drama	Fantasy	Film-Noir	Horror	Musical	Mystery	Romance	Sci-Fi	Thriller	War	Western
item_id
1	0	0	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0
2	0	1	0	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0
3	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0
4	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0
5	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0

[11]:

distance_calculator = PairwiseHammingDistanceCalculator(genre_dummies)
ild = IntraListDiversity(k=10, distance_calculator=distance_calculator)
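
As a rough illustration of what this setup measures, the sketch below computes Hamming distances between one-hot genre vectors and averages them over all item pairs in one list. It assumes that intra-list diversity is the mean pairwise distance within a user's list; the helper functions are hypothetical, and the exact averaging in RecTools may differ.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def intra_list_diversity(items, features):
    """Mean pairwise Hamming distance between the items of one list."""
    pairs = list(combinations(items, 2))
    return sum(hamming_distance(features[i], features[j]) for i, j in pairs) / len(pairs)


# One-hot genre vectors over (Animation, Children's, Comedy, Romance),
# following the first rows of genre_dummies above
features = {
    1: [1, 1, 1, 0],  # Toy Story: Animation, Children's, Comedy
    3: [0, 0, 1, 1],  # Grumpier Old Men: Comedy, Romance
    5: [0, 0, 1, 0],  # Father of the Bride Part II: Comedy
}

print(hamming_distance(features[1], features[3]))  # 3
print(intra_list_diversity([1, 3, 5], features))   # 2.0
```

Because the distance counts differing genre flags, a list of items with identical genres would score 0, and values above 1 are expected.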

Single metric calculation

The easiest way to calculate a metric is to call its calc method.

Every metric has this method, but the required arguments differ between metrics.

[17]:

precision_value = precision.calc(reco=recos, interactions=df_test)
print(f"precision: {precision_value}")

precision: 0.08501683501683503

[13]:

catalog = df_train[Columns.Item].unique()

serendipity_value = serendipity.calc(
    reco=recos,
    interactions=df_test,
    prev_interactions=df_train,
    catalog=catalog
)
print("Serendipity: ", serendipity_value)

Serendipity:  2.3436131849908687e-05

[14]:

print("NDCG: ", ndcg.calc(reco=recos, interactions=df_test))

NDCG:  0.06808226116073855

[15]:

%%time
print("ILD: ", ild.calc(reco=recos))

ILD:  3.1908278145695363
CPU times: user 460 ms, sys: 39.5 ms, total: 499 ms
Wall time: 501 ms

Per user metric calculation

If you need the metric value for every user, use the calc_per_user method.

[18]:

precision_per_user = precision.calc_per_user(reco=recos, interactions=df_test)
print("\nprecision per user:")
display(precision_per_user.head())

# Compare floats with a tolerance rather than exact equality
print("Values are equal? ", np.isclose(precision_per_user.mean(), precision_value))


precision per user:

user_id
195    0.3
229    0.0
343    0.0
349    0.0
398    0.5
dtype: float64

Values are equal?  True
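
The relationship between the per-user values and the aggregated value can be sketched in plain Python. The toy data below is hypothetical, and the sketch assumes that only users present in the test interactions receive a value and that the aggregated metric is the mean over those users.

```python
from statistics import mean

# Hypothetical top-2 recommendations and test items per user
recos_by_user = {"u1": [1, 2], "u2": [3, 4], "u3": [5, 6]}
test_by_user = {"u1": {1, 2}, "u2": {4}}  # u3 has no test interactions

k = 2
# Only users present in the test data get a value;
# a test user without recommendations would score 0
per_user = {
    user: sum(item in test_items for item in recos_by_user.get(user, [])[:k]) / k
    for user, test_items in test_by_user.items()
}
overall = mean(per_user.values())

print(per_user)  # {'u1': 1.0, 'u2': 0.5}
print(overall)   # 0.75
```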

Multiple metrics calculation with one function

It is possible to calculate several metrics with a single function, calc_metrics.

It includes an important performance optimisation: if several metrics share the same intermediate calculations, those are performed only once.
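
The idea of sharing intermediate calculations can be sketched as follows. This is only an illustration with hypothetical toy data, not the internal logic of calc_metrics: the hit counts are computed once, and several hit-based metrics are then derived from them.

```python
# Hypothetical top-3 recommendations and test items per user
recos_by_user = {"u1": [1, 2, 3], "u2": [4, 5, 6]}
test_by_user = {"u1": {2, 3}, "u2": {9}}
k = 3

# Shared step, done once: number of hits in the top-k for each test user
hits = {
    user: sum(item in test_items for item in recos_by_user[user][:k])
    for user, test_items in test_by_user.items()
}

# Each metric is then a cheap transformation of the same hit counts
n_users = len(hits)
results = {
    "hit_rate@3": sum(h > 0 for h in hits.values()) / n_users,
    "precision@3": sum(h / k for h in hits.values()) / n_users,
    "recall@3": sum(h / len(test_by_user[u]) for u, h in hits.items()) / n_users,
}
print(results)
```

Matching recommendations against test interactions is the expensive part on real data, so doing it once for all hit-based metrics is where the saving would come from.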

[16]:

# Here we provide a debias config for one of the metrics in addition to calculating its regular value
# Check our "Debiased metrics calculation user guide" for more info

metrics = {
    "hit_rate@10": HitRate(k=10),
    "hit_rate_debiased@10": HitRate(k=10, debias_config=DebiasConfig(iqr_coef=1.5, random_state=32)),
    "sufficient@10": SufficientReco(k=10, deep=True),
    "sufficient@20": SufficientReco(k=20, deep=True),
    "pop_bias@10": AvgRecPopularity(k=10, normalize=True),
    "ndcg@10": ndcg,
    "serendipity@10": serendipity,
    "diversity@10": ild,
    "intersection@10": Intersection(k=10)
}


# Some arguments can be omitted if they are not needed for metrics calculation.
calc_metrics(
    metrics,
    reco=recos,
    interactions=df_test,  # needed for all `TruePositive` based metrics
    prev_interactions=df_train,  # needed for serendipity
    catalog=catalog,  # needed for serendipity
    ref_reco={"same_model": recos},  # needed for intersection; usually this should be recos from a different model
)

[16]:

{'hit_rate@10': 0.1717171717171717,
 'hit_rate_debiased@10': 0.16161616161616163,
 'ndcg@10': 0.06808226116073855,
 'pop_bias@10': 0.0017362993523658327,
 'diversity@10': 3.1908278145695363,
 'serendipity@10': 2.3436131849908687e-05,
 'intersection@10_same_model': 1.0,
 'sufficient@10': 1.0,
 'sufficient@20': 0.5}
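
The value of sufficient@20 follows from the recommendations being generated with k=10, so at most 10 of the 20 requested positions can be filled. A minimal sketch for one user, assuming the metric is the share of filled positions as described in the overview; the function sufficient_share is hypothetical.

```python
def sufficient_share(recommended, k):
    """Share of the k requested positions that are actually filled."""
    return min(len(recommended), k) / k


recommended = list(range(10))  # hypothetical list with 10 items, as produced with k=10 above

print(sufficient_share(recommended, k=10))  # 1.0
print(sufficient_share(recommended, k=20))  # 0.5
```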