TimeRangeSplit

class rectools.model_selection.time_split.TimeRangeSplit(date_range: Sequence[Union[date, datetime]], filter_cold_users: bool = True, filter_cold_items: bool = True, filter_already_seen: bool = True)

Bases: object

Splitter for cross-validation by time.

Generate train and test folds by time. Cold users, cold items, and already seen items can optionally be excluded from the test folds.

Parameters
  • date_range (array-like(date|datetime)) – Ordered test fold borders. The left border of each fold is included and the right border is excluded. Interactions before the first border are used only for training. Interactions on or after the last border are not used. Can be easily generated with [pd.date_range](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html)

  • filter_cold_users (bool, default True) – If True, users that are not in train will be excluded from test.

  • filter_cold_items (bool, default True) – If True, items that are not in train will be excluded from test.

  • filter_already_seen (bool, default True) – If True, pairs (user, item) that are in train will be excluded from test.
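
The border semantics of date_range can be illustrated with a minimal standard-library sketch. This is not the library's implementation; the helper name `locate` and its return labels are assumptions made purely for illustration.

```python
from bisect import bisect_right
from datetime import date


def locate(day, borders):
    """Classify a date against ordered fold borders (hypothetical helper).

    Returns "before" for dates earlier than the first border, "after" for
    dates on or after the last border, and otherwise the 0-based index of
    the test fold whose half-open range [left, right) contains the date.
    """
    if day < borders[0]:
        return "before"  # such interactions can only ever be in train
    if day >= borders[-1]:
        return "after"  # such interactions are not used at all
    # bisect_right counts the borders that are <= day, so an exact match
    # on a left border falls into the fold that starts at that border.
    return bisect_right(borders, day) - 1


borders = [date(2021, 9, 4), date(2021, 9, 5), date(2021, 9, 6)]
labels = [locate(date(2021, 9, d), borders) for d in (3, 4, 5, 6)]
print(labels)  # ['before', 0, 1, 'after']
```

Three borders define two test folds: a date equal to a left border belongs to the fold it opens, and a date equal to the last border is already outside every fold.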

Examples

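>>> from datetime import date
>>> import pandas as pd
>>> from rectools import Columns
>>> from rectools.model_selection.time_split import TimeRangeSplit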
>>> df = pd.DataFrame(
...         [
...             [1, 2, "2021-09-01"],  # 0
...             [2, 1, "2021-09-02"],  # 1
...             [2, 3, "2021-09-03"],  # 2
...             [3, 2, "2021-09-03"],  # 3
...             [3, 3, "2021-09-04"],  # 4
...             [3, 4, "2021-09-04"],  # 5
...             [1, 2, "2021-09-05"],  # 6
...             [4, 2, "2021-09-05"],  # 7
...             [4, 2, "2021-09-06"],  # 8
...         ],
...         columns=[Columns.User, Columns.Item, Columns.Datetime],
...     ).astype({Columns.Datetime: "datetime64[ns]"})
>>> date_range = pd.date_range(date(2021, 9, 4), date(2021, 9, 6))
>>>
>>> trs = TimeRangeSplit(date_range, False, False, False)
>>> for train_ids, test_ids, _ in trs.split(df):
...     print(train_ids, test_ids)
[0 1 2 3] [4 5]
[0 1 2 3 4 5] [6 7]
>>>
>>> trs = TimeRangeSplit(date_range, True, True, True)
>>> for train_ids, test_ids, _ in trs.split(df):
...     print(train_ids, test_ids)
[0 1 2 3] [4]
[0 1 2 3 4 5] []
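
The outputs above can be traced by hand with a pure-Python sketch of the three filters. It is a simplified re-implementation written under the parameter descriptions given earlier, not the library's code; `split_sketch` and its argument names are hypothetical, and it returns plain lists rather than arrays.

```python
from datetime import date

# (user, item, day) rows copied from the example; list index = row number.
ROWS = [
    (1, 2, date(2021, 9, 1)),
    (2, 1, date(2021, 9, 2)),
    (2, 3, date(2021, 9, 3)),
    (3, 2, date(2021, 9, 3)),
    (3, 3, date(2021, 9, 4)),
    (3, 4, date(2021, 9, 4)),
    (1, 2, date(2021, 9, 5)),
    (4, 2, date(2021, 9, 5)),
    (4, 2, date(2021, 9, 6)),
]
BORDERS = [date(2021, 9, 4), date(2021, 9, 5), date(2021, 9, 6)]


def split_sketch(rows, borders, cold_users=True, cold_items=True, already_seen=True):
    """Return a list of (train_ids, test_ids) pairs, one per fold (hypothetical helper)."""
    folds = []
    for left, right in zip(borders, borders[1:]):
        # Train: everything strictly before the fold; test: the half-open range.
        train = [i for i, (_, _, day) in enumerate(rows) if day < left]
        test = [i for i, (_, _, day) in enumerate(rows) if left <= day < right]
        train_users = {rows[i][0] for i in train}
        train_items = {rows[i][1] for i in train}
        train_pairs = {rows[i][:2] for i in train}
        if cold_users:
            test = [i for i in test if rows[i][0] in train_users]
        if cold_items:
            test = [i for i in test if rows[i][1] in train_items]
        if already_seen:
            test = [i for i in test if rows[i][:2] not in train_pairs]
        folds.append((train, test))
    return folds


unfiltered = split_sketch(ROWS, BORDERS, False, False, False)
filtered = split_sketch(ROWS, BORDERS, True, True, True)
for train_ids, test_ids in filtered:
    print(train_ids, test_ids)
```

In the first filtered fold, row 5 is dropped because item 4 never appears in train. In the second, row 6 is dropped because the pair (1, 2) was already seen in train, and row 7 because user 4 is cold, which leaves an empty test part.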

Methods

get_n_splits(df)

Return the actual number of folds.

split(df[, collect_fold_stats])

Split interactions into folds.

get_n_splits(df: DataFrame) → int

Return the actual number of folds.

Parameters

df (DataFrame) –

Return type

int

split(df: DataFrame, collect_fold_stats: bool = False) → Iterator[Tuple[ndarray, ndarray, Dict[str, Any]]]

Split interactions into folds.

Parameters
  • df (pd.DataFrame) – User-item interactions. Required columns: Columns.User, Columns.Item, Columns.Datetime.

  • collect_fold_stats (bool, default False) – If True, add statistics to the fold info, such as the sizes of the train and test parts and the number of users and items.

Returns

Yields tuples with train part row numbers, test part row numbers and fold info.

Return type

iterator(array, array, dict)
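
The kind of per-fold statistics described for collect_fold_stats can be sketched as below. The source does not list the keys the library actually puts into the fold info dict, so the function name `fold_stats_sketch` and every key name here are hypothetical.

```python
def fold_stats_sketch(pairs, train_ids, test_ids):
    """Compute simple size and cardinality stats for one fold (hypothetical helper).

    pairs holds (user, item) tuples indexed by row number;
    train_ids and test_ids are the row numbers of the two parts.
    """
    stats = {}
    for name, ids in (("train", train_ids), ("test", test_ids)):
        stats[f"{name}_size"] = len(ids)  # number of interactions
        stats[f"{name}_users"] = len({pairs[i][0] for i in ids})  # unique users
        stats[f"{name}_items"] = len({pairs[i][1] for i in ids})  # unique items
    return stats


# (user, item) pairs of rows 0..5 from the class example; first unfiltered fold.
pairs = [(1, 2), (2, 1), (2, 3), (3, 2), (3, 3), (3, 4)]
stats = fold_stats_sketch(pairs, train_ids=[0, 1, 2, 3], test_ids=[4, 5])
print(stats)
```

With the real splitter, such values would be read from the third element of each yielded tuple when collect_fold_stats=True; inspect that dict to see the keys actually provided.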