Examples of constructing datasets with features in RecTools

Some models allow using explicit user (sex, age, etc.) and item (genre, year, …) features. Let’s see how to prepare them for a RecTools dataset.

After creating the dataset, training models with features is as simple as model.fit(dataset_with_features)

[2]:
import os
import threadpoolctl

import numpy as np
import pandas as pd
from implicit.als import AlternatingLeastSquares

from rectools import Columns
from rectools.dataset import Dataset
from rectools.models import ImplicitALSWrapperModel

# For implicit ALS
os.environ["OPENBLAS_NUM_THREADS"] = "1"
threadpoolctl.threadpool_limits(1, "blas")

Load data: MovieLens 1M

[27]:
%%time
!wget -q https://files.grouplens.org/datasets/movielens/ml-1m.zip -O ml-1m.zip
!unzip -o ml-1m.zip
!rm ml-1m.zip
Archive:  ml-1m.zip
  inflating: ml-1m/movies.dat
  inflating: ml-1m/ratings.dat
  inflating: ml-1m/README
  inflating: ml-1m/users.dat
CPU times: user 43.2 ms, sys: 62.3 ms, total: 106 ms
Wall time: 3.11 s
[28]:
%%time
ratings = pd.read_csv(
    "ml-1m/ratings.dat",
    sep="::",
    engine="python",  # Because of 2-chars separators
    header=None,
    names=[Columns.User, Columns.Item, Columns.Weight, Columns.Datetime],
)
print(ratings.shape)
ratings.head()
(1000209, 4)
CPU times: user 3.84 s, sys: 357 ms, total: 4.2 s
Wall time: 4.17 s
[28]:
   user_id  item_id  weight   datetime
0        1     1193       5  978300760
1        1      661       3  978302109
2        1      914       3  978301968
3        1     3408       4  978300275
4        1     2355       5  978824291
[29]:
%%time
users = pd.read_csv(
    "ml-1m/users.dat",
    sep="::",
    engine="python",  # Because of 2-chars separators
    header=None,
    names=[Columns.User, "sex", "age", "occupation", "zip_code"],
)
print(users.shape)
users.head()
(6040, 5)
CPU times: user 17.2 ms, sys: 2.38 ms, total: 19.6 ms
Wall time: 18.8 ms
[29]:
   user_id sex  age  occupation zip_code
0        1   F    1          10    48067
1        2   M   56          16    70072
2        3   M   25          15    55117
3        4   M   45           7    02460
4        5   M   25          20    55455
[30]:
# Select only users that are present in the 'ratings' table
users = users.loc[users["user_id"].isin(ratings["user_id"])].copy()

Data types: categorical and numerical

Generally there are two kinds of features in data: categorical and numerical. For classic recommender algorithms categorical features are usually one-hot encoded and stored in sparse format. Numerical features can be used in their original form (e.g. processed with MinMaxScaler), but they can also be binarized, converted to categorical and then one-hot encoded.
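
A minimal sketch of the binarization idea (the values below are made up purely for illustration):

# Bin a numerical feature into categories so it can be one-hot encoded
# like any other categorical feature
years = pd.Series([1977, 1994, 2003, 1982], name="year")
decades = pd.cut(years, bins=[1970, 1980, 1990, 2000, 2010], labels=["70s", "80s", "90s", "00s"])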

Depending on your data you can choose to store features in sparse or dense format within the RecTools dataset. Dense format requires all features to be numerical. Sparse format doesn’t have such constraints and can include numerical features as well.

During training RecTools models transform features to the format they require: iALS with features converts them to dense format, while LightFM and DSSM convert them to sparse. All of these transformations happen under the hood and no values are actually affected.

Now let’s go through the processing routines.

Features storage: Sparse example

For sparse format we need to create a dataframe in flattened format with columns id, feature, value. This way we can have any number of entries for each feature for any user (or item). This is often the case for movie genres, for example (one movie can have several genres).
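
For instance, a flattened frame for item genres might look like this (a hypothetical sketch; the movies file is not parsed in this notebook):

# One item can contribute several rows to the flattened feature frame, one per genre
item_genres = pd.DataFrame(
    {
        "id": [1, 1, 2],  # item 1 has two genres
        "value": ["Animation", "Comedy", "Drama"],
        "feature": ["genre"] * 3,
    }
)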

[31]:
# Let's prepare a flattened dataframe with 3 user features
user_features_frames = []
for feature in ["sex", "age", "occupation"]:
    feature_frame = users.reindex(columns=["user_id", feature])
    feature_frame.columns = ["id", "value"]
    feature_frame["feature"] = feature
    user_features_frames.append(feature_frame)
user_features = pd.concat(user_features_frames)
[32]:
# Let's see how this looks for users `1` and `2`
user_features.query("id in [1, 2]").sort_values("id")
[32]:
   id value     feature
0   1     F         sex
0   1     1         age
0   1    10  occupation
1   2     M         sex
1   2    56         age
1   2    16  occupation
[33]:
# Now we construct the dataset
sparse_features_dataset = Dataset.construct(
    ratings,
    user_features_df=user_features,  # our flatten dataframe
    cat_user_features=["sex", "age"], # these will be one-hot-encoded. All other features must be numerical already
    make_dense_user_features=False  # for `sparse` format
)

In this dataset user features are now stored in sparse format.

cat_user_features have all of their possible values retrieved, one-hot encoded and stored in a sparse matrix.

All other ("direct") features have their values stored in the same sparse matrix (one column per direct feature). Here we make “occupation” a direct feature just as a quick example of data storage; it is actually categorical by nature.

Rows of the sparse matrix correspond to internal user ids in the dataset, which are identical to row numbers of the user-item CSR matrix (ui_csr) used for training in most recommender models.
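
To check which row belongs to which external user id you can use the dataset’s id mapping (a sketch assuming the user_id_map attribute and its convert_to_internal method available in recent RecTools versions):

# Map external user ids to internal row numbers of the feature matrix
# (user_id_map / convert_to_internal are assumed here, check your RecTools version)
internal_ids = sparse_features_dataset.user_id_map.convert_to_internal([1, 2])
sparse_features_dataset.user_features.values[internal_ids].toarray()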

Let’s look inside the dataset to check how the data is stored.

[34]:
# storing format for features
sparse_features_dataset.user_features.values
[34]:
<6040x10 sparse matrix of type '<class 'numpy.float32'>'
        with 18120 stored elements in Compressed Sparse Row format>
[35]:
# feature names and values (sparse matrix columns)
sparse_features_dataset.user_features.names
[35]:
(('occupation', '__is_direct_feature'),
 ('sex', 'F'),
 ('sex', 'M'),
 ('age', 1),
 ('age', 56),
 ('age', 25),
 ('age', 45),
 ('age', 50),
 ('age', 35),
 ('age', 18))
[36]:
# example of stored features for 5 users
sparse_features_dataset.user_features.values[:5].toarray()
[36]:
array([[10.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [16.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [15.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 7.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [20.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.]], dtype=float32)

Features storage: Dense example

Now let’s create a dataset with dense features.

We need a classic dataframe with one column for each feature and one row for each subject (user or item).

Important: All feature values must be numeric.

Important: You must provide features for all objects (users or items). If a feature value is missing for some user (item), fill it with any suitable method (zero, mean value, etc.).
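
A sketch of one way to fill such gaps (the MovieLens users table has no missing values, so this is purely illustrative):

# Fill missing numerical feature values with the column mean before building a dense dataset
user_numeric = users[[Columns.User, "age", "occupation"]].copy()
user_numeric["age"] = user_numeric["age"].fillna(user_numeric["age"].mean())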

[37]:
user_numeric_features = users[[Columns.User, "age", "occupation"]]
user_numeric_features.head()
[37]:
   user_id  age  occupation
0        1    1          10
1        2   56          16
2        3   25          15
3        4   45           7
4        5   25          20
[38]:
dense_features_dataset = Dataset.construct(
    ratings,
    user_features_df=user_numeric_features,
    make_dense_user_features=True  # for `dense` format
)

Let’s look at how the data is stored now. It is a 2-d numpy array whose row numbers correspond to internal user ids in the dataset.

[39]:
# feature names (array columns)
dense_features_dataset.user_features.names
[39]:
('age', 'occupation')
[40]:
# example of stored features for 5 users
dense_features_dataset.user_features.values[:5]
[40]:
array([[ 1., 10.],
       [56., 16.],
       [25., 15.],
       [45.,  7.],
       [25., 20.]], dtype=float32)

Feeding features to models

Now we can simply fit a model using the prepared dataset. For this we choose models that support using features in training (e.g. iALS, LightFM, DSSM, PopularInCategory).

[41]:
model = ImplicitALSWrapperModel(AlternatingLeastSquares(10, num_threads=32))
model.fit(dense_features_dataset)
100%|██████████| 1/1 [00:00<00:00, 17.08it/s]
[41]:
<rectools.models.implicit_als.ImplicitALSWrapperModel at 0x7f6fe355bf10>
[42]:
model = ImplicitALSWrapperModel(AlternatingLeastSquares(10, num_threads=32))
model.fit(sparse_features_dataset)
/data/home/dmtikhono1/git_project/RecTools/rectools/dataset/features.py:399: UserWarning: Converting sparse features to dense array may cause MemoryError
  warnings.warn("Converting sparse features to dense array may cause MemoryError")
100%|██████████| 1/1 [00:00<00:00, 12.94it/s]
[42]:
<rectools.models.implicit_als.ImplicitALSWrapperModel at 0x7f6fe355be10>
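
Recommendations can then be generated as usual (a minimal sketch using the model fitted above and the standard recommend call):

# Top-5 recommendations for a few users, excluding items they have already interacted with
recos = model.recommend(
    users=users[Columns.User].head(10),  # external user ids
    dataset=sparse_features_dataset,
    k=5,
    filter_viewed=True,
)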

Final notes

  • If a model requires features in a specific format, it converts them under the hood. This is why we get a warning when fitting iALS with sparse features. The model fits anyway, just keep possible memory problems in mind.

  • LightFM and DSSM prefer one-hot encoded features, so it is a good idea to binarize all direct features and make them categorical. But you can also try applying MinMaxScaler to direct values (see the sketch after this list).

  • iALS works well with both direct and categorical features. Direct features can be MinMax-scaled.

  • PopularInCategory requires sparse features and a selected category feature by its very nature.
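
A sketch of scaling a direct feature, as mentioned above (scikit-learn is not imported in this notebook, so this is illustrative only):

# Scale a direct (numerical) feature to [0, 1] before building the dataset
from sklearn.preprocessing import MinMaxScaler

scaled_users = users[[Columns.User]].copy()
scaled_users["age"] = MinMaxScaler().fit_transform(users[["age"]])[:, 0]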