Examples of calculating different metrics with RecTools

Table of Contents

  • Load and preprocess data: Movielens

  • Build model

  • Calculate metrics

    • Metrics initialization

    • Single metric calculation

      • Per user metric calculation

    • Multiple metrics calculation with one function

RecTools provides metrics that measure model performance from several angles:

  • Classification:

    • HitRate, Precision (with an R-Precision variant that divides by the minimum of k and the number of user test items), Recall, Accuracy, MCC, F1Beta

  • Ranking:

    • MRR, MAP (with an option to divide by k or by the number of user test items), NDCG (with an option to select the log base)

  • Advanced AUC based ranking:

    • PartialAUC, PAP (Partial AUC + Precision joint metric)

  • Beyond Accuracy:

    • Serendipity, MeanInvUserFreq (mean inverse user frequency, used to calculate “novelty”), Intra-List Diversity (based on some meta features of items)

  • Popularity bias:

    • AvgRecPopularity

  • Recommendations data quality:

    • SufficientReco (share of filled recommendations in the list), UnrepeatedReco (share of unique items recommended for each user), CoveredUsers (share of test users that have at least one recommendation)

  • Between-model comparison:

    • Intersection (share of common user-item pairs between the recommendations of different models)

[2]:
import numpy as np
import pandas as pd

from implicit.nearest_neighbours import TFIDFRecommender

from rectools import Columns
from rectools.dataset import Dataset
from rectools.metrics import (
    Precision,
    NDCG,
    AvgRecPopularity,
    Intersection,
    HitRate,
    SufficientReco,
    DebiasConfig,
    IntraListDiversity,
    Serendipity,
    calc_metrics,
)
from rectools.metrics.distances import PairwiseHammingDistanceCalculator
from rectools.models import ImplicitItemKNNWrapperModel

Load and preprocess data: Movielens

[3]:
%%time
!wget -q https://files.grouplens.org/datasets/movielens/ml-1m.zip -O ml-1m.zip
!unzip -o ml-1m.zip
!rm ml-1m.zip
Archive:  ml-1m.zip
  inflating: ml-1m/movies.dat
  inflating: ml-1m/ratings.dat
  inflating: ml-1m/README
  inflating: ml-1m/users.dat
CPU times: user 125 ms, sys: 55.1 ms, total: 180 ms
Wall time: 5.83 s
[4]:
%%time
ratings = pd.read_csv(
    "ml-1m/ratings.dat",
    sep="::",
    engine="python",  # Because of 2-chars separators
    header=None,
    names=[Columns.User, Columns.Item, Columns.Weight, Columns.Datetime],
)
print(ratings.shape)
ratings.head()
(1000209, 4)
CPU times: user 2.27 s, sys: 71.3 ms, total: 2.34 s
Wall time: 2.35 s
[4]:
user_id item_id weight datetime
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
[5]:
ratings["datetime"] = pd.to_datetime(ratings["datetime"], unit="s")  # Raw values are Unix timestamps in seconds
ratings["datetime"].min(), ratings["datetime"].max()
[5]:
(Timestamp('2000-04-25 23:05:32'), Timestamp('2003-02-28 17:49:50'))
[6]:
%%time
movies = pd.read_csv(
    "ml-1m/movies.dat",
    sep="::",
    engine="python",  # Because of 2-chars separators
    header=None,
    names=[Columns.Item, "title", "genres"],
    encoding_errors="ignore",
)
print(movies.shape)
movies.head()
(3883, 3)
CPU times: user 6.26 ms, sys: 1.53 ms, total: 7.79 ms
Wall time: 8.6 ms
[6]:
item_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy

Build model

[7]:
# Split once into train and test to demonstrate how different metrics work
split_dt = pd.Timestamp("2003-02-01")
df_train = ratings.loc[ratings["datetime"] < split_dt]
df_test = ratings.loc[ratings["datetime"] >= split_dt]
[8]:
%%time

# Prepare dataset, fit model and generate recommendations
dataset = Dataset.construct(df_train)
model = ImplicitItemKNNWrapperModel(TFIDFRecommender(K=10))
model.fit(dataset)
recos = model.recommend(
    users=ratings[Columns.User].unique(),
    dataset=dataset,
    k=10,
    filter_viewed=True,
)
CPU times: user 1.02 s, sys: 40.5 ms, total: 1.06 s
Wall time: 1.08 s

Calculate metrics

Metrics initialization

To calculate a metric, first create an instance of the metric class.

Most metrics have a k parameter: the number of top recommendations used in the calculation. Some metrics have additional parameters.

Simple metrics

[9]:
serendipity = Serendipity(k=10)
precision = Precision(k=10, r_precision=True)  # r_precision means division by min(k, n_user_test_items)
ndcg = NDCG(k=10, log_base=3)
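
The effect of the r_precision flag can be illustrated for a single user with a small standard-library sketch. It follows the description in the comment above (division by min(k, n_user_test_items)); the function name precision_at_k and the toy data are hypothetical, and this is not the RecTools implementation.

```python
def precision_at_k(recommended, relevant, k, r_precision=False):
    """Precision@k for one user.

    With r_precision=True the divisor is min(k, number of test items),
    so a user with few test items is not penalised for the unused slots.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    divisor = min(k, len(relevant)) if r_precision else k
    return hits / divisor if divisor else 0.0


# Hypothetical user: 10 recommended items, 3 test items, 2 of them recommended
recommended = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
relevant = {20, 50, 999}

plain = precision_at_k(recommended, relevant, k=10)                     # 2 / 10
r_prec = precision_at_k(recommended, relevant, k=10, r_precision=True)  # 2 / 3
print(plain, r_prec)
```

For users with fewer than k test items, the R-Precision variant can reach 1.0, while plain Precision@k cannot.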

Metric with complex additional parameter

To calculate any diversity metric (e.g. IntraListDiversity) you need to measure the distance between items.

For example, you can use Hamming distance.

As features, let’s use movie genres.

[10]:
movies["genre"] = movies["genres"].str.split("|")
genre_exploded = movies[["item_id", "genre"]].set_index("item_id").explode("genre")
genre_dummies = pd.get_dummies(genre_exploded, prefix="", prefix_sep="").groupby("item_id").sum()
genre_dummies.head()
[10]:
Action Adventure Animation Children's Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
item_id
1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
4 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
[11]:
distance_calculator = PairwiseHammingDistanceCalculator(genre_dummies)
ild = IntraListDiversity(k=10, distance_calculator=distance_calculator)
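
To make the idea concrete, here is a minimal standard-library sketch of Hamming distance on genre vectors and of a diversity score taken as the mean pairwise distance within one recommendation list. The genre sets are taken from the first three rows of the movies table above; the helper names and the exact averaging are assumptions for illustration and may differ from what PairwiseHammingDistanceCalculator and IntraListDiversity do internally.

```python
from itertools import combinations

# Only the genres that occur in the three example movies
GENRES = ["Adventure", "Animation", "Children's", "Comedy", "Fantasy", "Romance"]

item_genres = {
    1: {"Animation", "Children's", "Comedy"},   # Toy Story (1995)
    2: {"Adventure", "Children's", "Fantasy"},  # Jumanji (1995)
    3: {"Comedy", "Romance"},                   # Grumpier Old Men (1995)
}


def to_vector(genres):
    """One-hot encode a set of genres in the order given by GENRES."""
    return [int(genre in genres) for genre in GENRES]


def hamming(a, b):
    """Number of positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_distance(items, features):
    """Mean Hamming distance over all unordered pairs of items in a list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(hamming(features[a], features[b]) for a, b in pairs) / len(pairs)


features = {item: to_vector(genres) for item, genres in item_genres.items()}
toy_ild = mean_pairwise_distance([1, 2, 3], features)
print(toy_ild)  # pair distances are 4, 3 and 5, so the mean is 4.0
```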

Single metric calculation

The easiest way to calculate a metric is to call its calc method.

Every metric has this method, but the required arguments differ between metrics.

[17]:
precision_value = precision.calc(reco=recos, interactions=df_test)
print(f"precision: {precision_value}")
precision: 0.08501683501683503
[13]:
catalog = df_train[Columns.Item].unique()

serendipity_value = serendipity.calc(
    reco=recos,
    interactions=df_test,
    prev_interactions=df_train,
    catalog=catalog
)
print("Serendipity: ", serendipity_value)
Serendipity:  2.3436131849908687e-05
[14]:
print("NDCG: ", ndcg.calc(reco=recos, interactions=df_test))
NDCG:  0.06808226116073855
[15]:
%%time
print("ILD: ", ild.calc(reco=recos))
ILD:  3.1908278145695363
CPU times: user 460 ms, sys: 39.5 ms, total: 499 ms
Wall time: 501 ms

Per user metric calculation

If you need the metric value for every user, use the calc_per_user method.

[18]:
precision_per_user = precision.calc_per_user(reco=recos, interactions=df_test)
print("\nprecision per user:")
display(precision_per_user.head())

print("Values are equal? ", np.isclose(precision_per_user.mean(), precision_value))  # tolerant float comparison

precision per user:
user_id
195    0.3
229    0.0
343    0.0
349    0.0
398    0.5
dtype: float64
Values are equal?  True

Multiple metrics calculation with one function

Several metrics can be calculated with a single function, calc_metrics.

It also includes an important performance optimisation: if several metrics share the same intermediate calculations, those calculations are performed only once.
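
The benefit of sharing intermediate results can be sketched in plain Python: the per-user hit counts are computed once and then reused by two metrics. This only illustrates the idea; the function names and toy data are hypothetical and do not reflect how calc_metrics is implemented.

```python
def count_hits(recos, test, k):
    """For every test user, count top-k recommended items present in the test set."""
    hits = {}
    for user, relevant in test.items():
        top_k = recos.get(user, [])[:k]
        hits[user] = sum(1 for item in top_k if item in relevant)
    return hits


def hit_rate_from_hits(hits):
    """Share of users with at least one hit."""
    return sum(1 for h in hits.values() if h > 0) / len(hits)


def precision_from_hits(hits, k):
    """Mean over users of hits / k."""
    return sum(h / k for h in hits.values()) / len(hits)


# Hypothetical recommendations and test interactions for three users
toy_recos = {1: [10, 20, 30], 2: [40, 50, 60], 3: [70, 80, 90]}
toy_test = {1: {20, 30}, 2: {99}, 3: {70}}

shared_hits = count_hits(toy_recos, toy_test, k=3)  # computed once
toy_hit_rate = hit_rate_from_hits(shared_hits)      # reuses shared_hits
toy_precision = precision_from_hits(shared_hits, k=3)  # reuses shared_hits
print(shared_hits, toy_hit_rate, toy_precision)
```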

[16]:
# Here we provide a debias config for one of the metrics, in addition to calculating its regular value
# Check our "Debiased metrics calculation user guide" for more info

metrics = {
    "hit_rate@10": HitRate(k=10),
    "hit_rate_debiased@10": HitRate(k=10, debias_config=DebiasConfig(iqr_coef=1.5, random_state=32)),
    "sufficient@10": SufficientReco(k=10, deep=True),
    "sufficient@20": SufficientReco(k=20, deep=True),
    "pop_bias@10": AvgRecPopularity(k=10, normalize=True),
    "ndcg@10": ndcg,
    "serendipity@10": serendipity,
    "diversity@10": ild,
    "intersection@10": Intersection(k=10)
}


# Some arguments can be omitted if they are not needed for metrics calculation.
calc_metrics(
    metrics,
    reco=recos,
    interactions=df_test,  # needed for all `TruePositive` based metrics
    prev_interactions=df_train,  # needed for serendipity
    catalog=catalog,  # needed for serendipity
    ref_reco={"same_model": recos},  # needed for intersection; usually these should be recos from a different model
)
[16]:
{'hit_rate@10': 0.1717171717171717,
 'hit_rate_debiased@10': 0.16161616161616163,
 'ndcg@10': 0.06808226116073855,
 'pop_bias@10': 0.0017362993523658327,
 'diversity@10': 3.1908278145695363,
 'serendipity@10': 2.3436131849908687e-05,
 'intersection@10_same_model': 1.0,
 'sufficient@10': 1.0,
 'sufficient@20': 0.5}
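
The sufficient@20 value of 0.5 follows from the setup: recommendations were generated with k=10, so at most 10 of the 20 positions can be filled for any user. A minimal sketch of that arithmetic, assuming the metric is the share of filled positions as described in the metric list above (the function name is hypothetical):

```python
def filled_share(n_recommended, k):
    """Share of the k list positions that actually contain a recommendation."""
    return min(n_recommended, k) / k


share_at_10 = filled_share(n_recommended=10, k=10)
share_at_20 = filled_share(n_recommended=10, k=20)
print(share_at_10, share_at_20)  # 1.0 0.5
```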