TimeRangeSplitter

class rectools.model_selection.time_split.TimeRangeSplitter(test_size: str, n_splits: int = 1, filter_cold_users: bool = True, filter_cold_items: bool = True, filter_already_seen: bool = True)

Bases: Splitter

Splitter for cross-validation by the leave-time-out scheme. For each fold, interactions of all users that fall inside the test window (starting at a fixed date-time) go to test, and all interactions before that date-time go to train. Cross-validation is achieved by sliding the window over the timeline of interactions.

The window size is given in days or hours. Test folds do not overlap and follow one another without gaps. This scheme mirrors the real-life scenario for recommender systems, where a model is trained only on past data, and so prevents data leaks from the future.

Keep daily and weekly patterns in mind: when such patterns are present in the data, it is advisable to make each fold span a full day or a full week.

Cold users, cold items and already seen items can optionally be excluded from test.

Parameters

test_size (str) – Size of a test fold in the format [1-9]\d*[DH], e.g. 1D (1 day), 4H (4 hours). Test folds are taken from the end of the interactions. The last fold includes the whole time unit that contains the last interaction. E.g. if the last interaction was at 01:25 a.m. on Monday, then with test_size = "1D" the last fold will be the full Monday, and with test_size = "1H" the last fold will be between 01:00 a.m. and 02:00 a.m. on Monday.
n_splits (int, default 1) – Number of test folds.
filter_cold_users (bool, default True) – If True, users that are not present in train will be excluded from test. WARNING: both cold and warm users will be excluded from test.
filter_cold_items (bool, default True) – If True, items that are not present in train will be excluded from test. WARNING: both cold and warm items will be excluded from test.
filter_already_seen (bool, default True) – If True, pairs (user, item) that are present in train will be excluded from test.

Examples

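>>> import pandas as pd
>>> from rectools import Columns
>>> from rectools.dataset import Interactions
>>> from rectools.model_selection import TimeRangeSplitter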
>>> df = pd.DataFrame(
...     [
...         [1, 2, 1, "2021-09-01"],  # 0
...         [2, 1, 1, "2021-09-02"],  # 1
...         [2, 3, 1, "2021-09-03"],  # 2
...         [3, 2, 1, "2021-09-03"],  # 3
...         [3, 3, 1, "2021-09-04"],  # 4
...         [4, 4, 1, "2021-09-04"],  # 5
...         [1, 2, 1, "2021-09-05"],  # 6
...     ],
...     columns=[Columns.User, Columns.Item, Columns.Weight, Columns.Datetime],
... ).astype({Columns.Datetime: "datetime64[ns]"})
>>> interactions = Interactions(df)
>>>
>>> splitter = TimeRangeSplitter(
...     "1D", n_splits=2, filter_cold_users=False, filter_cold_items=False, filter_already_seen=False
... )
>>> for train_ids, test_ids, _ in splitter.split(interactions):
...     print(train_ids, test_ids)
[0 1 2 3] [4 5]
[0 1 2 3 4 5] [6]
>>>
>>> splitter = TimeRangeSplitter(
...     "1D", n_splits=2, filter_cold_users=True, filter_cold_items=False, filter_already_seen=False
... )
>>> for train_ids, test_ids, _ in splitter.split(interactions):
...     print(train_ids, test_ids)
[0 1 2 3] [4]
[0 1 2 3 4 5] [6]
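
To make the effect of the three filter flags easier to follow, here is a minimal pure-Python sketch that reproduces the outputs above on the same toy data. It is not the library's implementation: `split_fold` and `run_folds` are hypothetical helpers, the test windows are hard-coded for this data set, and half-open windows [start, end) are an assumption.

```python
from datetime import datetime

# Toy interactions from the example above as (user, item, datetime).
# The position in the list plays the role of the row id.
ROWS = [
    (1, 2, datetime(2021, 9, 1)),  # 0
    (2, 1, datetime(2021, 9, 2)),  # 1
    (2, 3, datetime(2021, 9, 3)),  # 2
    (3, 2, datetime(2021, 9, 3)),  # 3
    (3, 3, datetime(2021, 9, 4)),  # 4
    (4, 4, datetime(2021, 9, 4)),  # 5
    (1, 2, datetime(2021, 9, 5)),  # 6
]

# Two one-day test windows taken from the end of the timeline (assumed half-open).
FOLDS = [
    (datetime(2021, 9, 4), datetime(2021, 9, 5)),
    (datetime(2021, 9, 5), datetime(2021, 9, 6)),
]


def split_fold(rows, start, end, filter_cold_users, filter_cold_items, filter_already_seen):
    """Hypothetical helper: build one leave-time-out fold and filter its test part."""
    train_ids = [i for i, (_, _, ts) in enumerate(rows) if ts < start]
    test_ids = [i for i, (_, _, ts) in enumerate(rows) if start <= ts < end]

    train_users = {rows[i][0] for i in train_ids}
    train_items = {rows[i][1] for i in train_ids}
    train_pairs = {(rows[i][0], rows[i][1]) for i in train_ids}

    if filter_cold_users:  # drop test rows whose user never appears in train
        test_ids = [i for i in test_ids if rows[i][0] in train_users]
    if filter_cold_items:  # drop test rows whose item never appears in train
        test_ids = [i for i in test_ids if rows[i][1] in train_items]
    if filter_already_seen:  # drop (user, item) pairs already present in train
        test_ids = [i for i in test_ids if (rows[i][0], rows[i][1]) not in train_pairs]
    return train_ids, test_ids


def run_folds(filter_cold_users, filter_cold_items, filter_already_seen):
    """Apply split_fold to every hard-coded test window."""
    return [
        split_fold(ROWS, start, end, filter_cold_users, filter_cold_items, filter_already_seen)
        for start, end in FOLDS
    ]


for train_ids, test_ids in run_folds(True, False, False):
    print(train_ids, test_ids)
# [0, 1, 2, 3] [4]
# [0, 1, 2, 3, 4, 5] [6]
```

In the first fold, row 5 is dropped because user 4 has no interactions in train. With all three flags set to True, this sketch also empties the second test fold, since the pair (user 1, item 2) in row 6 already occurs in train as row 0.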

Methods

`filter`(interactions, collect_fold_stats, ...)	Filter train and test indexes from one fold based on the `filter_cold_users`, `filter_cold_items` and `filter_already_seen` class fields.
`get_test_fold_borders`(interactions)	Return datetime borders of test folds based on given test fold sizes and last interaction.
`split`(interactions[, collect_fold_stats])	Split interactions into folds and apply filtration to the result.

get_test_fold_borders(interactions: Interactions) → List[Tuple[Timestamp, Timestamp]]

Return datetime borders of test folds based on given test fold sizes and last interaction.

Parameters: interactions (Interactions) –
Return type: List[Tuple[Timestamp, Timestamp]]
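
As an illustration of how such borders could be derived from test_size, n_splits and the last interaction, here is a standard-library sketch. It is not the library's code: `sketch_fold_borders` and its helpers are hypothetical names, plain `datetime` objects stand in for `Timestamp`, and two points are assumptions: borders are treated as half-open [start, end), and for multi-unit sizes such as "4H" the end of the last fold is aligned to the end of the hour (or day) that contains the last interaction.

```python
import re
from datetime import datetime, timedelta

# One time unit per supported suffix.
UNITS = {"D": timedelta(days=1), "H": timedelta(hours=1)}


def parse_test_size(test_size):
    """Split a string such as "1D" or "4H" into (count, unit letter)."""
    match = re.fullmatch(r"([1-9]\d*)([DH])", test_size)
    if match is None:
        raise ValueError(f"Unsupported test_size: {test_size!r}")
    return int(match.group(1)), match.group(2)


def floor_to_unit(ts, unit):
    """Round a datetime down to the start of its day ("D") or hour ("H")."""
    if unit == "D":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return ts.replace(minute=0, second=0, microsecond=0)


def sketch_fold_borders(last_interaction, test_size, n_splits):
    """Hypothetical helper: list of (start, end) test windows, oldest first."""
    count, unit = parse_test_size(test_size)
    window = count * UNITS[unit]
    # The last fold ends where the time unit containing the last interaction ends.
    end = floor_to_unit(last_interaction, unit) + UNITS[unit]
    borders = []
    for _ in range(n_splits):
        start = end - window
        borders.append((start, end))
        end = start  # the previous fold ends exactly where this one starts
    return borders[::-1]


# Last interaction at 01:25 a.m. on Monday, 2021-09-06.
last = datetime(2021, 9, 6, 1, 25)
print(sketch_fold_borders(last, "1D", 1))  # one window covering the full Monday
print(sketch_fold_borders(last, "1H", 1))  # one window from 01:00 to 02:00 on Monday
```

Because each window starts where the previous one ends, the folds neither overlap nor leave gaps, which matches the description of the splitter.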