Examples of using explicit features in implicit.ALS model with RecTools
Some models allow using explicit user (sex, age, etc.) and item (genre, year, …) features.
Building ALS model
Adding features to model
Advanced feature usage
[2]:
import os
import numpy as np
import pandas as pd
from implicit.als import AlternatingLeastSquares
from rectools import Columns
from rectools.dataset import Dataset, SparseFeatures, DenseFeatures, IdMap, Interactions
from rectools.metrics import (
    MAP,
    MeanInvUserFreq,
    calc_metrics,
)
from rectools.models import ImplicitALSWrapperModel
[3]:
os.environ["OPENBLAS_NUM_THREADS"] = "1" # For implicit ALS
Load data
[4]:
%%time
!wget https://files.grouplens.org/datasets/movielens/ml-1m.zip
!unzip ml-1m.zip
--2022-07-28 11:32:59-- https://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5,6M) [application/zip]
Saving to: ‘ml-1m.zip.5’
ml-1m.zip.5 100%[===================>] 5,64M 3,88MB/s in 1,5s
2022-07-28 11:33:01 (3,88 MB/s) - ‘ml-1m.zip.5’ saved [5917549/5917549]
Archive: ml-1m.zip
creating: ml-1m/
inflating: ml-1m/movies.dat
inflating: ml-1m/ratings.dat
inflating: ml-1m/README
inflating: ml-1m/users.dat
CPU times: user 41.1 ms, sys: 26.6 ms, total: 67.7 ms
Wall time: 2.45 s
[5]:
%%time
ratings = pd.read_csv(
    "ml-1m/ratings.dat",
    sep="::",
    engine="python",  # Because of 2-chars separators
    header=None,
    names=[Columns.User, Columns.Item, Columns.Weight, Columns.Datetime],
)
print(ratings.shape)
ratings.head()
(1000209, 4)
CPU times: user 4.01 s, sys: 177 ms, total: 4.19 s
Wall time: 4.21 s
[5]:
| | user_id | item_id | weight | datetime |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
[6]:
ratings["datetime"] = pd.to_datetime(ratings["datetime"], unit="s")  # Raw values are Unix timestamps in seconds
ratings["datetime"].min(), ratings["datetime"].max()
[6]:
(Timestamp('2000-04-25 23:05:32'), Timestamp('2003-02-28 17:49:50'))
[7]:
%%time
movies = pd.read_csv(
    "ml-1m/movies.dat",
    sep="::",
    engine="python",  # Because of 2-chars separators
    header=None,
    names=[Columns.Item, "title", "genres"],
)
print(movies.shape)
movies.head()
(3883, 3)
CPU times: user 11.5 ms, sys: 1.6 ms, total: 13.1 ms
Wall time: 12.1 ms
[7]:
| | item_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
[8]:
%%time
users = pd.read_csv(
    "ml-1m/users.dat",
    sep="::",
    engine="python",  # Because of 2-chars separators
    header=None,
    names=[Columns.User, "sex", "age", "occupation", "zip_code"],
)
print(users.shape)
users.head()
(6040, 5)
CPU times: user 22.8 ms, sys: 3.26 ms, total: 26 ms
Wall time: 25.1 ms
[8]:
| | user_id | sex | age | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 02460 |
| 4 | 5 | M | 25 | 20 | 55455 |
Split by train / test and build model without features
For a correct model comparison it’s better to use cross-validation, but for simplicity we split only once here.
[9]:
split_dt = pd.Timestamp("2003-01-01")
df_train = ratings.loc[ratings["datetime"] < split_dt]
df_test = ratings.loc[ratings["datetime"] >= split_dt]
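As a hedged illustration of what going beyond a single split could look like, the sketch below builds several time-based folds with plain pandas. The helper `time_folds` and the toy data are hypothetical and not part of RecTools; each fold tests on one time window and trains on everything that happened before it.

```python
import pandas as pd


def time_folds(df, dt_col, n_folds, fold_size):
    """Yield (train, test) pairs; hypothetical helper, not a RecTools API."""
    # Right edge is exclusive, so shift it just past the last interaction
    end = df[dt_col].max() + pd.Timedelta(seconds=1)
    for i in range(n_folds, 0, -1):
        test_start = end - i * fold_size
        test_end = test_start + fold_size
        train = df.loc[df[dt_col] < test_start]
        test = df.loc[(df[dt_col] >= test_start) & (df[dt_col] < test_end)]
        yield train, test


# Toy interactions, made up for the illustration
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 3],
        "item_id": [10, 11, 10, 12, 11, 12],
        "datetime": pd.to_datetime(
            ["2003-01-01", "2003-01-05", "2003-01-09", "2003-01-12", "2003-01-16", "2003-01-20"]
        ),
    }
)

folds = list(time_folds(toy, "datetime", n_folds=2, fold_size=pd.Timedelta(days=7)))
for train, test in folds:
    print(len(train), len(test))
```

Metrics would then be computed per fold and averaged, which gives a more stable comparison than a single split.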
[10]:
metrics = {"MAP": MAP(10), "Novelty": MeanInvUserFreq(10)}
[11]:
dataset = Dataset.construct(df_train)
[12]:
def make_base_model():
    # Need to create new base model every time to use same random initializations
    return AlternatingLeastSquares(factors=32, random_state=42, num_threads=4)
[13]:
%%time
model = ImplicitALSWrapperModel(make_base_model())
model.fit(dataset)
recos = model.recommend(
    users=df_test[Columns.User].unique(),
    dataset=dataset,
    k=10,
    filter_viewed=True,
)
calc_metrics(metrics, recos, df_test, df_train)
CPU times: user 29.1 s, sys: 12.6 s, total: 41.7 s
Wall time: 7.66 s
[13]:
{'MAP': 0.017172445629322065, 'Novelty': 2.5387145697434614}
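To make the two numbers easier to read, here is a simplified per-user sketch of what they measure. It assumes that average precision at k is normalized by the number of relevant items and that novelty is the mean of -log2 of the share of users who interacted with each recommended item; the exact RecTools definitions may differ in details, and all function names and toy values below are made up for the illustration.

```python
import math


def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user (simplified sketch, not the RecTools implementation)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / len(relevant)


def mean_inv_user_freq(recommended, item_user_counts, n_users, k):
    """Novelty for one user: rarer items in the top-k give a higher value."""
    top = recommended[:k]
    return sum(-math.log2(item_user_counts[item] / n_users) for item in top) / len(top)


recommended = ["a", "b", "c", "d"]
ap = average_precision_at_k(recommended, relevant={"a", "c"}, k=4)
novelty = mean_inv_user_freq(recommended, {"a": 4, "b": 2, "c": 1, "d": 1}, n_users=4, k=4)
print(ap, novelty)
```

MAP is then the mean of the per-user AP values, so a higher MAP means relevant items are ranked closer to the top, while a higher Novelty means less popular items are recommended.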
Prepare features
There are 2 kinds of features: categorical features, which are represented as sparse, and numerical features, which are represented as dense.
Sparse features are much more common. Even if you have dense features, it’s often better to binarize them.
Here we have mostly categorical features and only 2 numerical ones:
- user age, but it’s already binarized;
- movie year, which we’ll binarize now.
We represent user and item features as flattened dataframes.
User features
[14]:
users.isna().sum()
[14]:
user_id 0
sex 0
age 0
occupation 0
zip_code 0
dtype: int64
[15]:
users.nunique()
[15]:
user_id 6040
sex 2
age 7
occupation 21
zip_code 3439
dtype: int64
[16]:
# Select only users that are present in the 'ratings' table
users = users.loc[users["user_id"].isin(ratings["user_id"])].copy()
There are too many zip codes. We will not use them here, because the currently available methods of using features work poorly with a large number of features.
[17]:
# For the 3 features, generate a common flattened table with their values
# Here all features have 1 value per user
# But there can be more than 1 value of a feature per user (item) if the feature is categorical
user_features_frames = []
for feature in ["sex", "age", "occupation"]:
    feature_frame = users.reindex(columns=["user_id", feature])
    feature_frame.columns = ["id", "value"]
    feature_frame["feature"] = feature
    user_features_frames.append(feature_frame)
user_features = pd.concat(user_features_frames)
user_features.head()
[17]:
| | id | value | feature |
|---|---|---|---|
| 0 | 1 | F | sex |
| 1 | 2 | M | sex |
| 2 | 3 | M | sex |
| 3 | 4 | M | sex |
| 4 | 5 | M | sex |
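The same flattened layout can also be produced in one step with `DataFrame.melt`. The sketch below uses a small made-up `users_toy` table rather than the real `users` frame; it is an alternative way to reach the same (id, value, feature) shape, not the approach used in this notebook.

```python
import pandas as pd

# Toy wide table with one column per feature (made-up values)
users_toy = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "sex": ["F", "M", "M"],
        "age": [1, 56, 25],
        "occupation": [10, 16, 15],
    }
)

# Wide -> long: one row per (user, feature) pair
flat = (
    users_toy.melt(
        id_vars="user_id",
        value_vars=["sex", "age", "occupation"],
        var_name="feature",
        value_name="value",
    )
    .rename(columns={"user_id": "id"})
    [["id", "value", "feature"]]
)
print(flat)
```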
Item features
Here we will use movie genre and year
[18]:
movies.isna().sum()
[18]:
item_id 0
title 0
genres 0
dtype: int64
[19]:
# Select only items that are present in the 'ratings' table
movies = movies.loc[movies["item_id"].isin(ratings["item_id"])].copy()
Genre
[20]:
# Explode genres to flatten table
movies["genre"] = movies["genres"].str.split("|")
genre_feature = movies[["item_id", "genre"]].explode("genre")
genre_feature.columns = ["id", "value"]
genre_feature["feature"] = "genre"
genre_feature.head()
[20]:
| | id | value | feature |
|---|---|---|---|
| 0 | 1 | Animation | genre |
| 0 | 1 | Children's | genre |
| 0 | 1 | Comedy | genre |
| 1 | 2 | Adventure | genre |
| 1 | 2 | Children's | genre |
Year
[21]:
# Binarize year to 10 bins and use it as a categorical feature
movies["year"] = movies["title"].str.extract(r"\((\d{4})\)").astype(int)
_, bins = pd.qcut(movies["year"], 10, retbins=True)
labels = bins[:-1]
print(labels)
[1919. 1959. 1977. 1986. 1991. 1994. 1995. 1996. 1997. 1999.]
[22]:
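year_feature = pd.DataFrame(
    {
        "id": movies["item_id"],
        # include_lowest=True keeps the earliest year inside the first bin instead of turning it into NaN
        "value": pd.cut(movies["year"], bins=bins, labels=labels, include_lowest=True),
        "feature": "year",
    }
)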
year_feature.head()
[22]:
| | id | value | feature |
|---|---|---|---|
| 0 | 1 | 1994.0 | year |
| 1 | 2 | 1994.0 | year |
| 2 | 3 | 1994.0 | year |
| 3 | 4 | 1994.0 | year |
| 4 | 5 | 1994.0 | year |
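One subtlety when bin edges from `pd.qcut` are reused in `pd.cut`: intervals are right-closed, and by default the lowest edge is excluded, so the smallest value falls outside the first bin unless `include_lowest=True` is passed. It also means that a label such as 1994.0 covers years after 1994 up to and including the next edge. The toy series below is made up to show the effect.

```python
import pandas as pd

# Made-up years, only for the illustration
years = pd.Series([1919, 1950, 1980, 1990, 1995, 1995, 1996, 1997, 1999, 2000])

_, bins = pd.qcut(years, 4, retbins=True)
labels = bins[:-1]  # label every bin by its left edge

without_lowest = pd.cut(years, bins=bins, labels=labels)
with_lowest = pd.cut(years, bins=bins, labels=labels, include_lowest=True)

print(without_lowest.isna().sum())  # the minimum year is left without a bin
print(with_lowest.isna().sum())     # every year gets a bin
```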
Combine
[23]:
item_features = pd.concat((genre_feature, year_feature))
Build model with features
There are 2 ways to use features in ALS that are implemented in RecTools: ‘separately’ and ‘together’.
Both methods work with dense features, but we prepared the data as sparse features because it’s more convenient here. Inside the model the sparse matrix is converted to a dense one, so be careful with big datasets and limit the number of features.
Note: Training a model with features is available on both CPU and GPU (as is training native ALS without features). This is controlled by the use_gpu parameter of implicit.als.AlternatingLeastSquares.
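One way to picture how explicit features can enter a factorization model is as fixed extra columns in the factor matrices, each paired with learned columns on the other side. The NumPy sketch below only illustrates that general idea with random numbers; it is an assumption about the overall structure, not RecTools' actual code, and all names in it are hypothetical. Under this picture, the two modes would differ in whether the latent and paired columns are learned in one pass or one after the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_latent = 3, 2, 4

# Explicit features are known in advance and stay fixed
user_feats = np.array([[1, 0], [0, 1], [1, 1]], dtype=np.float64)   # 2 user features
item_feats = np.array([[1, 0, 0], [0, 1, 1]], dtype=np.float64)     # 3 item features

# Learned parts (random here, standing in for trained values)
user_latent = rng.normal(size=(n_users, n_latent))
item_latent = rng.normal(size=(n_items, n_latent))
item_paired_to_user_feats = rng.normal(size=(n_items, user_feats.shape[1]))
user_paired_to_item_feats = rng.normal(size=(n_users, item_feats.shape[1]))

# Column blocks are aligned so that each explicit block meets its paired block
user_factors = np.hstack([user_latent, user_feats, user_paired_to_item_feats])
item_factors = np.hstack([item_latent, item_paired_to_user_feats, item_feats])

scores = user_factors @ item_factors.T

# The score splits into a latent term and one term per feature side
latent_part = user_latent @ item_latent.T
user_feature_part = user_feats @ item_paired_to_user_feats.T
item_feature_part = user_paired_to_item_feats @ item_feats.T
print(np.allclose(scores, latent_part + user_feature_part + item_feature_part))
```

Because every feature adds a column to both factor matrices, the cost of this layout grows with the number of features, which is consistent with the advice to limit them.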
Attempt to use all features
[24]:
%%time
dataset = Dataset.construct(
    interactions_df=df_train,
    user_features_df=user_features,
    cat_user_features=["sex", "age", "occupation"],
    item_features_df=item_features,
    cat_item_features=["year", "genre"],  # If we didn't binarize year, we wouldn't set it here
)
for fit_features_together in (True, False):
    model = ImplicitALSWrapperModel(make_base_model(), fit_features_together=fit_features_together)
    model.fit(dataset)
    recos = model.recommend(
        users=df_test[Columns.User].unique(),
        dataset=dataset,
        k=10,
        filter_viewed=True,
    )
    metric_values = calc_metrics(metrics, recos, df_test, df_train)
    print(f"Fit features together: {fit_features_together}. Metrics: {metric_values}")
/Users/eofeldman/opt/anaconda3/lib/python3.8/site-packages/rectools/dataset/features.py:384: UserWarning: Converting sparse features to dense array may cause MemoryError
warnings.warn("Converting sparse features to dense array may cause MemoryError")
Fit features together: True. Metrics: {'MAP': 0.01388652631210521, 'Novelty': 2.5681515162134265}
Fit features together: False. Metrics: {'MAP': 0.01998701037744824, 'Novelty': 2.2928462029811394}
CPU times: user 1min 54s, sys: 42.5 s, total: 2min 36s
Wall time: 27.5 s
Here we can see decreased MAP for joint feature fitting and increased MAP for separate fitting.
Let’s analyze which features have greater influence.
Note: We get the warning because the currently implemented methods of using features require them to be represented as a dense array, so be careful and do not use a large number of features.
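A rough back-of-the-envelope estimate shows why the number of features matters once the matrix is dense. The helper `dense_size_bytes` is hypothetical; the counts come from the `users.nunique()` output earlier (6040 users, 3439 zip codes, and 2 + 7 + 21 values for sex, age and occupation), and float32 storage is assumed.

```python
def dense_size_bytes(n_rows, n_cols, itemsize=4):
    """Memory needed for a dense n_rows x n_cols array (float32 by default)."""
    return n_rows * n_cols * itemsize


n_users = 6040
used_features = 2 + 7 + 21   # one-hot columns for sex, age, occupation
zip_features = 3439          # one-hot columns if zip codes were added

small = dense_size_bytes(n_users, used_features)
large = dense_size_bytes(n_users, zip_features)
print(f"{small / 1024 ** 2:.2f} MiB vs {large / 1024 ** 2:.2f} MiB")
```

For this dataset both numbers are manageable, but the size grows with the product of objects and features, so millions of users combined with thousands of one-hot columns quickly become a problem.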
Attempt to use only item features
[25]:
%%time
dataset = Dataset.construct(
    interactions_df=df_train,
    item_features_df=item_features,
    cat_item_features=["year", "genre"],
)
for fit_features_together in (True, False):
    model = ImplicitALSWrapperModel(make_base_model(), fit_features_together=fit_features_together)
    model.fit(dataset)
    recos = model.recommend(
        users=df_test[Columns.User].unique(),
        dataset=dataset,
        k=10,
        filter_viewed=True,
    )
    metric_values = calc_metrics(metrics, recos, df_test, df_train)
    print(f"Fit features together: {fit_features_together}. Metrics: {metric_values}")
Fit features together: True. Metrics: {'MAP': 0.01520886994621532, 'Novelty': 2.5870053665186767}
Fit features together: False. Metrics: {'MAP': 0.014805021434661019, 'Novelty': 2.8426244093501594}
CPU times: user 1min 43s, sys: 34.9 s, total: 2min 18s
Wall time: 22.9 s
Here both fitting modes give lower MAP than the baseline model without features (about 0.0152 and 0.0148 versus 0.0172).
Attempt to use only user features
[26]:
%%time
dataset = Dataset.construct(
    interactions_df=df_train,
    user_features_df=user_features,
    cat_user_features=["sex", "age", "occupation"],
)
for fit_features_together in (True, False):
    model = ImplicitALSWrapperModel(make_base_model(), fit_features_together=fit_features_together)
    model.fit(dataset)
    recos = model.recommend(
        users=df_test[Columns.User].unique(),
        dataset=dataset,
        k=10,
        filter_viewed=True,
    )
    metric_values = calc_metrics(metrics, recos, df_test, df_train)
    print(f"Fit features together: {fit_features_together}. Metrics: {metric_values}")
Fit features together: True. Metrics: {'MAP': 0.013806802876374652, 'Novelty': 2.506039447145849}
Fit features together: False. Metrics: {'MAP': 0.02285698694879273, 'Novelty': 1.9870892804731672}
CPU times: user 1min 46s, sys: 40.2 s, total: 2min 27s
Wall time: 30.3 s
Here we see that user features increase MAP a lot when fit separately (about 0.0229 versus 0.0172 for the baseline), but Novelty decreases (about 1.99 versus 2.54).
Attempt to use only user features with increased features weight
[27]:
%%time
dataset = Dataset.construct(
    interactions_df=df_train.eval("weight = weight / 10"),  # decrease interactions weight => increase features weight
    user_features_df=user_features,
    cat_user_features=["sex", "age", "occupation"],
)
for fit_features_together in (True, False):
    model = ImplicitALSWrapperModel(make_base_model(), fit_features_together=fit_features_together)
    model.fit(dataset)
    recos = model.recommend(
        users=df_test[Columns.User].unique(),
        dataset=dataset,
        k=10,
        filter_viewed=True,
    )
    metric_values = calc_metrics(metrics, recos, df_test, df_train)
    print(f"Fit features together: {fit_features_together}. Metrics: {metric_values}")
/Users/eofeldman/opt/anaconda3/lib/python3.8/site-packages/rectools/dataset/features.py:384: UserWarning: Converting sparse features to dense array may cause MemoryError
warnings.warn("Converting sparse features to dense array may cause MemoryError")
Fit features together: True. Metrics: {'MAP': 0.01676492598017664, 'Novelty': 2.2798496191085693}
Fit features together: False. Metrics: {'MAP': 0.016931541817397328, 'Novelty': 2.069402889295814}
CPU times: user 1min 29s, sys: 28 s, total: 1min 57s
Wall time: 21.1 s
If we give features a much bigger weight relative to interactions, the metric values for the two fitting modes become more similar.
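To compare the runs at a glance, the MAP values printed in the experiments above can be expressed as a relative change against the baseline without features. The snippet only re-uses numbers already reported in this notebook; the dictionary layout and names are chosen for the illustration.

```python
# MAP values copied from the outputs above
baseline_map = 0.017172445629322065
reported_map = {
    "all features": {"together": 0.01388652631210521, "separately": 0.01998701037744824},
    "item features only": {"together": 0.01520886994621532, "separately": 0.014805021434661019},
    "user features only": {"together": 0.013806802876374652, "separately": 0.02285698694879273},
    "user features only, weight / 10": {"together": 0.01676492598017664, "separately": 0.016931541817397328},
}

# Relative change of MAP against the baseline for every run
relative_change = {
    name: {mode: value / baseline_map - 1 for mode, value in modes.items()}
    for name, modes in reported_map.items()
}

# Configuration with the highest MAP
best = max(
    ((name, mode) for name, modes in reported_map.items() for mode in modes),
    key=lambda pair: reported_map[pair[0]][pair[1]],
)

for name, modes in relative_change.items():
    for mode, change in modes.items():
        print(f"{name:35s} {mode:10s} {change:+.1%}")
print("best:", best)
```

Since these numbers come from a single train / test split, small differences between runs should be read with caution.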
Advanced feature usage
[28]:
# Prepare external_id <-> internal_id mapping
id_map = IdMap.from_values(["u1", "u2", "u3"])
display(id_map.to_internal)
display(id_map.to_external)
u1 0
u2 1
u3 2
dtype: int64
0 u1
1 u2
2 u3
dtype: object
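Conceptually, such a mapping is just a pair of lookups between the original (external) ids and consecutive integer (internal) ids that can index matrix rows. The pandas sketch below mimics the two outputs above; it is a simplified illustration, not the actual `IdMap` implementation.

```python
import pandas as pd

external_ids = ["u1", "u2", "u3"]

# external id -> row number
to_internal = pd.Series(range(len(external_ids)), index=external_ids)
# row number -> external id
to_external = pd.Series(to_internal.index, index=to_internal.to_numpy())

# Convert a batch of external ids to matrix row numbers and back
rows = to_internal.loc[["u3", "u1"]].tolist()
restored = to_external.loc[rows].tolist()
print(rows, restored)
```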
Sparse features
When Dataset.construct is used with features, SparseFeatures.from_flatten is called.
All features are converted to a CSR matrix.
[29]:
features_df = pd.DataFrame(
    [
        ["u1", "feature_1", "x"],
        ["u1", "feature_1", "y"],
        ["u1", "feature_2", 123],
        ["u2", "feature_2", 123],
        ["u3", "feature_2", 150],
        ["u3", "feature_1", "x"],
    ],
    columns=["id", "feature", "value"],
)
[30]:
# Categorical features are converted to one-hot encoded format
sf = SparseFeatures.from_flatten(features_df, id_map, cat_features=["feature_1", "feature_2"])
print(sf.names)
sf.values.toarray()
(('feature_1', 'x'), ('feature_1', 'y'), ('feature_2', 123), ('feature_2', 150))
[30]:
array([[1., 1., 1., 0.],
[0., 0., 1., 0.],
[1., 0., 0., 1.]], dtype=float32)
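The one-hot layout shown above can be reproduced by hand: every distinct (feature, value) pair becomes a column, and a cell is set to 1 when the user has that pair. The sketch below builds a dense NumPy array for readability, whereas the library stores the result as a sparse CSR matrix; the column order (first appearance) is an assumption that happens to match the output above.

```python
import numpy as np

rows = [
    ("u1", "feature_1", "x"),
    ("u1", "feature_1", "y"),
    ("u1", "feature_2", 123),
    ("u2", "feature_2", 123),
    ("u3", "feature_2", 150),
    ("u3", "feature_1", "x"),
]
user_ids = ["u1", "u2", "u3"]

row_of = {uid: i for i, uid in enumerate(user_ids)}

# Assign a column to every (feature, value) pair in order of first appearance
col_of = {}
for _, feature, value in rows:
    col_of.setdefault((feature, value), len(col_of))

one_hot = np.zeros((len(row_of), len(col_of)), dtype=np.float32)
for uid, feature, value in rows:
    one_hot[row_of[uid], col_of[(feature, value)]] = 1.0

names = tuple(col_of)
print(names)
print(one_hot)
```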
[31]:
# Non-categorical features remain 'as is'
sf = SparseFeatures.from_flatten(features_df, id_map, cat_features=["feature_1"])
print(sf.names)
sf.values.toarray()
(('feature_2', '__is_direct_feature'), ('feature_1', 'x'), ('feature_1', 'y'))
[31]:
array([[123., 1., 1.],
[123., 0., 0.],
[150., 1., 0.]], dtype=float32)
Important: All non-numeric features must be categorical
[32]:
# If you want to increase feature weight you can use 'weight' column
features_weighted_df = features_df.copy()
features_weighted_df["weight"] = 1
features_weighted_df.loc[[0, 1, 2], "weight"] = 2
sf = SparseFeatures.from_flatten(features_weighted_df, id_map, cat_features=["feature_1"])
print(sf.names)
sf.values.toarray()
(('feature_2', '__is_direct_feature'), ('feature_1', 'x'), ('feature_1', 'y'))
[32]:
array([[246., 2., 2.],
[123., 0., 0.],
[150., 1., 0.]], dtype=float32)
Dense features
If you have features in ‘classic’ format, you can use DenseFeatures.
When creating a dataset with Dataset.construct, use the make_dense_user_features (make_dense_item_features) parameter; in this case DenseFeatures.from_dataframe will be used.
All features are saved ‘as is’.
Important: Use only numeric features.
Important: You must set features for all objects (users or items). If some feature is missing for some user (item), fill it using any method (zero, mean value, etc.).
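A minimal pandas sketch of the filling step, assuming a made-up table in which user "u2" has no features at all: reindex to the full list of ids, then fill the gaps with the column means. Zero filling or any other strategy would work the same way; the mean is just one option.

```python
import pandas as pd

all_ids = ["u1", "u2", "u3"]

# Made-up dense features; "u2" is missing
partial = pd.DataFrame(
    {
        "id": ["u1", "u3"],
        "feature_1": [10.0, 0.5],
        "feature_2": [0.5, 1.0],
    }
)

full = partial.set_index("id").reindex(all_ids)   # adds a NaN row for "u2"
full = full.fillna(full.mean())                    # fill gaps with column means
full = full.rename_axis("id").reset_index()
print(full)
```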
[33]:
features_df = pd.DataFrame(
    [
        ["u1", 10, 0.5, 22],
        ["u2", 202, 0, 2.5],
        ["u3", 0.01, 1, 10],
    ],
    columns=["id", "feature_1", "feature_2", "feature_3"],
)
[34]:
dense_features = DenseFeatures.from_dataframe(features_df, id_map)
print(dense_features.names)
dense_features.values
('feature_1', 'feature_2', 'feature_3')
[34]:
array([[1.00e+01, 5.00e-01, 2.20e+01],
[2.02e+02, 0.00e+00, 2.50e+00],
[1.00e-02, 1.00e+00, 1.00e+01]], dtype=float32)
Building dataset with manually created features
If you want, you can create features manually and then build a dataset with them.
[35]:
# Prepare id maps
user_id_map = IdMap.from_values(["u1", "u2", "u3"])
item_id_map = IdMap.from_values(["i1", "i2"])
display(user_id_map.to_internal)
display(item_id_map.to_internal)
u1 0
u2 1
u3 2
dtype: int64
i1 0
i2 1
dtype: int64
[36]:
# Prepare interactions
interactions_df = pd.DataFrame(
    {
        Columns.User: ["u1", "u1", "u2"],
        Columns.Item: ["i1", "i2", "i1"],
        Columns.Weight: 1,
        Columns.Datetime: 1,
    }
)
interactions = Interactions.from_raw(interactions_df, user_id_map, item_id_map)
interactions.df
[36]:
| | user_id | item_id | weight | datetime |
|---|---|---|---|---|
| 0 | 0 | 0 | 1.0 | 1970-01-01 00:00:00.000000001 |
| 1 | 0 | 1 | 1.0 | 1970-01-01 00:00:00.000000001 |
| 2 | 1 | 0 | 1.0 | 1970-01-01 00:00:00.000000001 |
[37]:
# Prepare user features
user_features_df = pd.DataFrame(
    [
        ["u1", "feature_1", "x"],
        ["u1", "feature_1", "y"],
        ["u1", "feature_2", 123],
        ["u2", "feature_2", 123],
        ["u3", "feature_2", 150],
        ["u3", "feature_1", "x"],
    ],
    columns=["id", "feature", "value"],
)
user_features = SparseFeatures.from_flatten(user_features_df, user_id_map, cat_features=["feature_1"])
user_features.values.toarray()
[37]:
array([[123., 1., 1.],
[123., 0., 0.],
[150., 1., 0.]], dtype=float32)
[38]:
# Prepare item features
item_features_df = pd.DataFrame(
    [
        ["i1", 10, 0.5, 22],
        ["i2", 202, 0, 2.5],
    ],
    columns=["id", "feature_1", "feature_2", "feature_3"],
)
item_features = DenseFeatures.from_dataframe(item_features_df, item_id_map)
item_features.values
[38]:
array([[ 10. , 0.5, 22. ],
[202. , 0. , 2.5]], dtype=float32)
Note: "u3" is not in interactions_df, but we can add it manually to the dataset (and features) using IdMap. This is not possible if you use Dataset.construct.
[39]:
Dataset(
    user_id_map=user_id_map,
    item_id_map=item_id_map,
    interactions=interactions,
    user_features=user_features,
    item_features=item_features,
)
[39]:
Dataset(user_id_map=IdMap(to_internal=u1 0
u2 1
u3 2
dtype: int64), item_id_map=IdMap(to_internal=i1 0
i2 1
dtype: int64), interactions=Interactions(df= user_id item_id weight datetime
0 0 0 1.0 1970-01-01 00:00:00.000000001
1 0 1 1.0 1970-01-01 00:00:00.000000001
2 1 0 1.0 1970-01-01 00:00:00.000000001), user_features=SparseFeatures(values=<3x3 sparse matrix of type '<class 'numpy.float32'>'
with 6 stored elements in Compressed Sparse Row format>, names=(('feature_2', '__is_direct_feature'), ('feature_1', 'x'), ('feature_1', 'y'))), item_features=DenseFeatures(values=array([[ 10. , 0.5, 22. ],
[202. , 0. , 2.5]], dtype=float32), names=('feature_1', 'feature_2', 'feature_3')))