{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Examples of constructing datasets with features in RecTools\n",
"\n",
"Some models allow using explicit user (sex, age, etc.) and item (genre, year, ...) features. Let's see how we can process them to RecTools dataset.\n",
"\n",
"After creating the dataset, training models with features is as simple as `model.fit(dataset_with_features)`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import threadpoolctl\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from implicit.als import AlternatingLeastSquares\n",
"\n",
"from rectools import Columns\n",
"from rectools.dataset import Dataset\n",
"from rectools.models import ImplicitALSWrapperModel\n",
"\n",
"# For implicit ALS\n",
"os.environ[\"OPENBLAS_NUM_THREADS\"] = \"1\"\n",
"threadpoolctl.threadpool_limits(1, \"blas\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data: Movielens 1m"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Archive: ml-1m.zip\n",
" inflating: ml-1m/movies.dat \n",
" inflating: ml-1m/ratings.dat \n",
" inflating: ml-1m/README \n",
" inflating: ml-1m/users.dat \n",
"CPU times: user 43.2 ms, sys: 62.3 ms, total: 106 ms\n",
"Wall time: 3.11 s\n"
]
}
],
"source": [
"%%time\n",
"!wget -q https://files.grouplens.org/datasets/movielens/ml-1m.zip -O ml-1m.zip\n",
"!unzip -o ml-1m.zip\n",
"!rm ml-1m.zip"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1000209, 4)\n",
"CPU times: user 3.84 s, sys: 357 ms, total: 4.2 s\n",
"Wall time: 4.17 s\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" item_id | \n",
" weight | \n",
" datetime | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" 1193 | \n",
" 5 | \n",
" 978300760 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 661 | \n",
" 3 | \n",
" 978302109 | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 914 | \n",
" 3 | \n",
" 978301968 | \n",
"
\n",
" \n",
" | 3 | \n",
" 1 | \n",
" 3408 | \n",
" 4 | \n",
" 978300275 | \n",
"
\n",
" \n",
" | 4 | \n",
" 1 | \n",
" 2355 | \n",
" 5 | \n",
" 978824291 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id item_id weight datetime\n",
"0 1 1193 5 978300760\n",
"1 1 661 3 978302109\n",
"2 1 914 3 978301968\n",
"3 1 3408 4 978300275\n",
"4 1 2355 5 978824291"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"ratings = pd.read_csv(\n",
" \"ml-1m/ratings.dat\",\n",
" sep=\"::\",\n",
" engine=\"python\", # Because of 2-chars separators\n",
" header=None,\n",
" names=[Columns.User, Columns.Item, Columns.Weight, Columns.Datetime],\n",
")\n",
"print(ratings.shape)\n",
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(6040, 5)\n",
"CPU times: user 17.2 ms, sys: 2.38 ms, total: 19.6 ms\n",
"Wall time: 18.8 ms\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" sex | \n",
" age | \n",
" occupation | \n",
" zip_code | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" F | \n",
" 1 | \n",
" 10 | \n",
" 48067 | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" M | \n",
" 56 | \n",
" 16 | \n",
" 70072 | \n",
"
\n",
" \n",
" | 2 | \n",
" 3 | \n",
" M | \n",
" 25 | \n",
" 15 | \n",
" 55117 | \n",
"
\n",
" \n",
" | 3 | \n",
" 4 | \n",
" M | \n",
" 45 | \n",
" 7 | \n",
" 02460 | \n",
"
\n",
" \n",
" | 4 | \n",
" 5 | \n",
" M | \n",
" 25 | \n",
" 20 | \n",
" 55455 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id sex age occupation zip_code\n",
"0 1 F 1 10 48067\n",
"1 2 M 56 16 70072\n",
"2 3 M 25 15 55117\n",
"3 4 M 45 7 02460\n",
"4 5 M 25 20 55455"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"users = pd.read_csv(\n",
" \"ml-1m/users.dat\",\n",
" sep=\"::\",\n",
" engine=\"python\", # Because of 2-chars separators\n",
" header=None,\n",
" names=[Columns.User, \"sex\", \"age\", \"occupation\", \"zip_code\"],\n",
")\n",
"print(users.shape)\n",
"users.head()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# Select only users that present in 'ratings' table\n",
"users = users.loc[users[\"user_id\"].isin(ratings[\"user_id\"])].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data types: categorical and numerical"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generally there are 2 kind of features in data: categorical and numerical. For classic recommender algorithms categorical features are usually one-hot-encoded and stored in sparse format. Numerical features can be used in the original form (e.g. processed by MinMaxScaler), but they can also be binarized, transformed to categorical and then one-hot encoded.\n",
"\n",
"Depending on your data you can select to store features in `sparse` or `dense` format within RecTools dataset. `dense` format requires all features to be numerical. `sparse` format doesn't have any constraints and can include numerical features as well.\n",
"\n",
"During training RecTools models will transform features to the format that is apllicable. iALS with features will transform feature to `dense` format. LightFM and DSSM will transform to `sparse`. All of these transformations happen under the hood and no values are actually affected.\n",
"\n",
"\n",
"\n",
"Now let's see processing routines."
]
},
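{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two treatments of a numerical feature described above can be sketched with plain pandas. This is a minimal illustration, not a RecTools API: the `toy_users` frame, its values, the bin edges and the group labels are all assumptions made up for the example.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not taken from MovieLens\n",
"toy_users = pd.DataFrame({\"user_id\": [1, 2, 3, 4], \"age\": [18, 25, 45, 56]})\n",
"\n",
"# Option 1: keep the feature numerical, rescaled to [0, 1] (min-max scaling)\n",
"age = toy_users[\"age\"]\n",
"toy_users[\"age_scaled\"] = (age - age.min()) / (age.max() - age.min())\n",
"\n",
"# Option 2: bin the feature and treat the bins as categories\n",
"toy_users[\"age_group\"] = pd.cut(\n",
"    toy_users[\"age\"],\n",
"    bins=[0, 24, 49, 120],  # assumed bin edges, right-inclusive\n",
"    labels=[\"young\", \"adult\", \"senior\"],\n",
").astype(str)\n",
"\n",
"# One-hot encoding of the binned feature, shown only to illustrate the idea\n",
"one_hot = pd.get_dummies(toy_users[\"age_group\"], prefix=\"age_group\")\n",
"```\n",
"\n",
"Which option works better depends on the model and the data, so it is worth trying both."
]
},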
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Features storage: Sparse example\n",
"For `sparse` format we need to create a dataframe in flatten format with columns `id`, `feature`, `value`. This way we can have any number of entries for each feature for any user (ot item). This is often the case for movie genres for example (one movie has 5 genres)."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# Let's prepare a flatten dataframe with 3 user features\n",
"user_features_frames = []\n",
"for feature in [\"sex\", \"age\", \"occupation\"]:\n",
" feature_frame = users.reindex(columns=[\"user_id\", feature])\n",
" feature_frame.columns = [\"id\", \"value\"]\n",
" feature_frame[\"feature\"] = feature\n",
" user_features_frames.append(feature_frame)\n",
"user_features = pd.concat(user_features_frames)"
]
},
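{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to the explicit loop, the same long layout can probably be produced with `DataFrame.melt`. The sketch below runs on a small stand-in frame (`toy_users`, holding the values of users `1` and `2`) so that it is self-contained; the variable names are assumptions for the example.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Stand-in for the `users` table: only users 1 and 2\n",
"toy_users = pd.DataFrame(\n",
"    {\"user_id\": [1, 2], \"sex\": [\"F\", \"M\"], \"age\": [1, 56], \"occupation\": [10, 16]}\n",
")\n",
"\n",
"# Wide -> long: one row per (user, feature) pair\n",
"toy_user_features = toy_users.melt(\n",
"    id_vars=\"user_id\",\n",
"    value_vars=[\"sex\", \"age\", \"occupation\"],\n",
"    var_name=\"feature\",\n",
"    value_name=\"value\",\n",
").rename(columns={\"user_id\": \"id\"})\n",
"```\n",
"\n",
"The result has the columns `id`, `feature` and `value`, with one row for every feature of every user."
]
},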
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" value | \n",
" feature | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" F | \n",
" sex | \n",
"
\n",
" \n",
" | 0 | \n",
" 1 | \n",
" 1 | \n",
" age | \n",
"
\n",
" \n",
" | 0 | \n",
" 1 | \n",
" 10 | \n",
" occupation | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" M | \n",
" sex | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" 56 | \n",
" age | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" 16 | \n",
" occupation | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id value feature\n",
"0 1 F sex\n",
"0 1 1 age\n",
"0 1 10 occupation\n",
"1 2 M sex\n",
"1 2 56 age\n",
"1 2 16 occupation"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's see how this looks for users `1` and `2`\n",
"user_features.query(\"id in [1, 2]\").sort_values(\"id\")"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# Now we construct the dataset\n",
"sparse_features_dataset = Dataset.construct(\n",
" ratings,\n",
" user_features_df=user_features, # our flatten dataframe\n",
" cat_user_features=[\"sex\", \"age\"], # these will be one-hot-encoded. All other features must be numerical already\n",
" make_dense_user_features=False # for `sparse` format\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this dataset user features are now stored in `sparse` format.\n",
"\n",
"`cat_user_features` have all their possible values retrieved, one-hot-encoded and stored in sparse matrix. \n",
"\n",
"All other features (`direct`) have their values stored in the same sparse matrix (one columns for one direct feature). Here we make \"occupation\" a direct feature just for a quick example on data storage. It actually has categorical nature.\n",
"\n",
"Rows of the sparse matrix correspond to internal user ids in dataset. Which are identical to row numbers in ui_csr matrix which is used for model training in most of the recommender models.\n",
"\n",
"Let's look inside the dataset to check how the data is stored"
]
},
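{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the layout described above concrete, here is a hand-built miniature of such a matrix using `scipy.sparse`. It only illustrates the column layout (one column per direct feature, one column per categorical value) and is not how RecTools builds the matrix internally. The values are those of users `1`-`3`, and all variable names are assumptions for the example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import sparse\n",
"\n",
"occupation = np.array([10, 16, 15], dtype=np.float32)  # direct feature\n",
"sex = [\"F\", \"M\", \"M\"]  # categorical feature\n",
"\n",
"# One-hot encode the categorical feature: one column per distinct value\n",
"sex_values = sorted(set(sex))\n",
"rows = np.arange(len(sex))\n",
"cols = np.array([sex_values.index(v) for v in sex])\n",
"sex_one_hot = sparse.csr_matrix(\n",
"    (np.ones(len(sex), dtype=np.float32), (rows, cols)),\n",
"    shape=(len(sex), len(sex_values)),\n",
")\n",
"\n",
"# The direct feature keeps its raw value in a single column\n",
"direct = sparse.csr_matrix(occupation.reshape(-1, 1))\n",
"\n",
"features = sparse.hstack([direct, sex_one_hot], format=\"csr\")\n",
"names = [(\"occupation\", \"__is_direct_feature\")] + [(\"sex\", v) for v in sex_values]\n",
"```\n",
"\n",
"Each row of `features` then holds the raw occupation code followed by the one-hot encoding of `sex`."
]
},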
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<6040x10 sparse matrix of type ''\n",
"\twith 18120 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# storing format for features\n",
"sparse_features_dataset.user_features.values"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('occupation', '__is_direct_feature'),\n",
" ('sex', 'F'),\n",
" ('sex', 'M'),\n",
" ('age', 1),\n",
" ('age', 56),\n",
" ('age', 25),\n",
" ('age', 45),\n",
" ('age', 50),\n",
" ('age', 35),\n",
" ('age', 18))"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# feature names and values (sparse matrix columns)\n",
"sparse_features_dataset.user_features.names"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[10., 1., 0., 1., 0., 0., 0., 0., 0., 0.],\n",
" [16., 0., 1., 0., 1., 0., 0., 0., 0., 0.],\n",
" [15., 0., 1., 0., 0., 1., 0., 0., 0., 0.],\n",
" [ 7., 0., 1., 0., 0., 0., 1., 0., 0., 0.],\n",
" [20., 0., 1., 0., 0., 1., 0., 0., 0., 0.]], dtype=float32)"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# example of stored features for 5 users\n",
"sparse_features_dataset.user_features.values[:5].toarray()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Features storage: Dense example\n",
"Now let's create a dataset with `dense` features. \n",
"\n",
"We need a classic dataframe with one column for each feature and one row for each subject (user or item). \n",
"\n",
"**Important:** All feature values must be numeric\n",
"\n",
"**Important:** You must set features for all objects (users or items). If you do not have some feature for some user (item) then use any method (zero, mean value, etc.) to fill it."
]
},
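{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to satisfy the second requirement is to left-join the features onto the full list of users and then fill the gaps. The sketch below uses hypothetical toy frames (`toy_ratings`, `toy_features`) and mean imputation, which is just one of the options mentioned above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: user 2 has interactions but no known age\n",
"toy_ratings = pd.DataFrame({\"user_id\": [1, 1, 2, 3], \"item_id\": [10, 11, 10, 12]})\n",
"toy_features = pd.DataFrame({\"user_id\": [1, 3], \"age\": [25.0, 45.0]})\n",
"\n",
"# One row per user that appears in the interactions\n",
"all_users = pd.DataFrame({\"user_id\": toy_ratings[\"user_id\"].unique()})\n",
"filled = all_users.merge(toy_features, on=\"user_id\", how=\"left\")\n",
"\n",
"# Fill the missing value with the mean of the known ones\n",
"filled[\"age\"] = filled[\"age\"].fillna(filled[\"age\"].mean())\n",
"```\n",
"\n",
"After this step `filled` has exactly one row per user and no missing values."
]
},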
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" age | \n",
" occupation | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" 1 | \n",
" 10 | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" 56 | \n",
" 16 | \n",
"
\n",
" \n",
" | 2 | \n",
" 3 | \n",
" 25 | \n",
" 15 | \n",
"
\n",
" \n",
" | 3 | \n",
" 4 | \n",
" 45 | \n",
" 7 | \n",
"
\n",
" \n",
" | 4 | \n",
" 5 | \n",
" 25 | \n",
" 20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id age occupation\n",
"0 1 1 10\n",
"1 2 56 16\n",
"2 3 25 15\n",
"3 4 45 7\n",
"4 5 25 20"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_numeric_features = users[[Columns.User, \"age\", \"occupation\"]]\n",
"user_numeric_features.head()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"dense_features_dataset = Dataset.construct(\n",
" ratings,\n",
" user_features_df=user_numeric_features,\n",
" make_dense_user_features=True # for `dense` format\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look how the data is stored now. This is a 2-d numpy array. Row numbers correspond to internal user ids in dataset."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('age', 'occupation')"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# feature names (array columns)\n",
"dense_features_dataset.user_features.names"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1., 10.],\n",
" [56., 16.],\n",
" [25., 15.],\n",
" [45., 7.],\n",
" [25., 20.]], dtype=float32)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# example of stored features for 5 users\n",
"dense_features_dataset.user_features.values[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feeding features to models\n",
"Now we can just fit model using prepared dataset. For this we choose models that have support for using features in training (e.g. iALS, LightFM, DSSM, PopularInCategory)."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1/1 [00:00<00:00, 17.08it/s]\n"
]
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = ImplicitALSWrapperModel(AlternatingLeastSquares(10, num_threads=32))\n",
"model.fit(dense_features_dataset)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/data/home/dmtikhono1/git_project/RecTools/rectools/dataset/features.py:399: UserWarning: Converting sparse features to dense array may cause MemoryError\n",
" warnings.warn(\"Converting sparse features to dense array may cause MemoryError\")\n",
"100%|██████████| 1/1 [00:00<00:00, 12.94it/s]\n"
]
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = ImplicitALSWrapperModel(AlternatingLeastSquares(10, num_threads=32))\n",
"model.fit(sparse_features_dataset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final notes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- If model requires features in a specific format, it will convert them under the hood. This is why we can get a warning, fitting iALS with sparse features. Model fits anyway, just remember about possible memory problems\n",
"- LightFM and DSSM prefer one-hot-encoded features. So it is a good idea to binarize all direct features and make them categorical. But you can also try to apply MinMaxScaler to direct values.\n",
"- iALS works good with both direct and categorical features. Direct features can be MinMaxScaled\n",
"- PopularInCategory requires `sparse` features and a selected category because of its nature"
]
}
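,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feeling for the memory warning mentioned above, the footprint of a dense array can be compared with that of a CSR matrix using simple arithmetic. This is a rough back-of-the-envelope sketch: the helper names are hypothetical, and the 4-byte index size is an assumption about the CSR index dtype.\n",
"\n",
"```python\n",
"def dense_bytes(n_rows, n_cols, itemsize=4):\n",
"    \"\"\"Bytes needed by a dense array (itemsize=4 corresponds to float32).\"\"\"\n",
"    return n_rows * n_cols * itemsize\n",
"\n",
"\n",
"def csr_bytes(n_rows, nnz, itemsize=4, index_itemsize=4):\n",
"    \"\"\"Approximate bytes of a CSR matrix: data + column indices + row pointers.\"\"\"\n",
"    return nnz * itemsize + nnz * index_itemsize + (n_rows + 1) * index_itemsize\n",
"\n",
"\n",
"# Shape and number of stored elements of the user features matrix shown earlier\n",
"small_dense = dense_bytes(6040, 10)\n",
"small_csr = csr_bytes(6040, 18120)\n",
"\n",
"# Hypothetical larger case: 1M users, 5000 one-hot columns, 3 non-zeros per user\n",
"big_dense = dense_bytes(1_000_000, 5_000)\n",
"big_csr = csr_bytes(1_000_000, 3_000_000)\n",
"```\n",
"\n",
"For the small MovieLens matrix both formats are tiny, but in the hypothetical larger case the dense array would need about 20 GB while the CSR matrix stays under 30 MB."
]
}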
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}