SparseFeatures

class rectools.dataset.features.SparseFeatures(values: csr_matrix, names: Tuple[Tuple[str, Any], ...])[source]

Bases: object

Storage for sparse features.

Sparse features are represented as a CSR matrix in which rows correspond to objects and columns correspond to features. Features are of two types: direct and categorical.

Each direct feature is represented in a single column with its real values. Direct features are numeric. E.g.:

+---+----+----+
|   | f1 | f2 |
+---+----+----+
| 1 | 23 | 3  |
+---+----+----+
| 2 | 36 | 5  |
+---+----+----+

Categorical features are one-hot encoded (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html); the values in the matrix are counts per category. If you want to bin (discretize) a numeric feature, make it categorical with bin indices as categories. E.g.:

+---+------+------+------+
|   | f1_a | f1_b | f2_1 |
+---+------+------+------+
| 1 | 0    | 2    | 1    |
+---+------+------+------+
| 2 | 1    | 1    | 0    |
+---+------+------+------+
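The binning advice above can be sketched in plain Python. This is only an illustration under assumed inputs: the bin edges, the object ids and the feature name age_bin are made up, and nothing here calls the library.

```python
from bisect import bisect_right

# Hypothetical bin edges and raw values, chosen only for illustration.
edges = [18, 30, 50]
ages = {1: 23, 2: 36, 3: 17}  # object id -> numeric feature value

# bisect_right gives the index of the bin a value falls into:
# bin 0 is (-inf, 18), bin 1 is [18, 30), bin 2 is [30, 50), bin 3 is [50, +inf).
flat_rows = [
    (obj_id, "age_bin", bisect_right(edges, age))
    for obj_id, age in ages.items()
]
print(flat_rows)
```

Rows shaped like these (id, feature name, bin index) could then be handled as a categorical feature, so that each bin index gets its own one-hot column.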

Usually you do not need to create this object directly; use the from_flatten class method instead. If you need custom logic, use the from_iterables class method rather than calling the constructor.

Parameters
  • values (csr_matrix) – CSR matrix containing OHE feature values.

  • names (tuple(tuple(str, any))) – Tuple of feature names. Direct features are identified by name only, so for direct features use (feature name, None). For categorical features use (feature name, value), because they are one-hot encoded. E.g. if you have a direct feature age and a categorical feature sex, names will be (("age", None), ("sex", "m"), ("sex", "f")). The number of names must be equal to the number of columns in values.
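The naming convention for names can be sketched as follows. The helper build_names is hypothetical and not part of the library; it only shows how the (feature name, None) and (feature name, value) pairs line up with matrix columns.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_names(
    direct: Iterable[str],
    categorical: Dict[str, Iterable[Any]],
) -> Tuple[Tuple[str, Any], ...]:
    """Hypothetical helper: one name per column of the values matrix."""
    # Direct features occupy one column each and carry no category value.
    names: List[Tuple[str, Any]] = [(feature, None) for feature in direct]
    # Each category of a categorical feature gets its own one-hot column.
    for feature, categories in categorical.items():
        names.extend((feature, category) for category in categories)
    return tuple(names)


names = build_names(direct=["age"], categorical={"sex": ["m", "f"]})
n_columns = 3  # column count of an assumed values matrix

# The number of names must match the number of columns.
assert len(names) == n_columns
print(names)  # (('age', None), ('sex', 'm'), ('sex', 'f'))
```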

Methods

from_flatten(df, id_map[, cat_features, ...])

Construct SparseFeatures from flatten DataFrame.

from_iterables(values, names)

Create class instance from sparse matrix and iterable feature names.

get_dense()

Return values in dense format.

get_sparse()

Return values in sparse format.

take(ids)

Take a subset of features for given subject (user or item) ids.

Attributes

values

names

classmethod from_flatten(df: DataFrame, id_map: IdMap, cat_features: Iterable[Any] = (), id_col: str = 'id', feature_col: str = 'feature', value_col: str = 'value', weight_col: str = 'weight') SparseFeatures[source]

Construct SparseFeatures from flatten DataFrame.

A flatten DataFrame has three required columns, <id of object>, <feature name> and <feature value>, plus an optional fourth column, <feature weight>. If there is no <feature weight> column, all weights are assumed to be equal to 1.

Direct features are converted to the sparse matrix as is. Their values are multiplied by weights. Values for the same object and the same feature are added up. E.g.:

+----+---------+-------+--------+
| id | feature | value | weight |
+----+---------+-------+--------+
| 1  | f1      | 10    | 1      |
+----+---------+-------+--------+
| 2  | f1      | 20    | 1.5    |
+----+---------+-------+--------+
| 1  | f1      | 15    | 1      |
+----+---------+-------+--------+
| 2  | f2      | 3     | 1      |
+----+---------+-------+--------+

Out:

+---+----+----+
|   | f1 | f2 |
+---+----+----+
| 1 | 25 |    |
+---+----+----+
| 2 | 30 | 3  |
+---+----+----+
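The aggregation rule for direct features can be checked with a small plain-Python sketch. It reproduces the numbers in the example above and is not the library's implementation; the function name aggregate_direct is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[int, str, float, float]  # (id, feature, value, weight)


def aggregate_direct(rows: Iterable[Row]) -> Dict[Tuple[int, str], float]:
    """Sum value * weight for every (object id, feature) pair."""
    totals: Dict[Tuple[int, str], float] = defaultdict(float)
    for obj_id, feature, value, weight in rows:
        totals[(obj_id, feature)] += value * weight
    return dict(totals)


rows = [
    (1, "f1", 10, 1),
    (2, "f1", 20, 1.5),
    (1, "f1", 15, 1),
    (2, "f2", 3, 1),
]
direct = aggregate_direct(rows)
print(direct)  # {(1, 'f1'): 25.0, (2, 'f1'): 30.0, (2, 'f2'): 3.0}
```

Pairs that never occur in the input, such as (1, "f2"), are simply absent, which corresponds to the empty cell in the output table.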

Categorical features are represented as horizontally stacked one-hot vectors. Duplicated values are counted. Final values (counts) are multiplied by weights. E.g.:

+----+---------+-------+--------+
| id | feature | value | weight |
+----+---------+-------+--------+
| 1  | f1      | 10    | 1      |
+----+---------+-------+--------+
| 2  | f1      | 20    | 1.5    |
+----+---------+-------+--------+
| 1  | f1      | 10    | 1      |
+----+---------+-------+--------+
| 2  | f2      | 3     | 1      |
+----+---------+-------+--------+

Out:

+---+--------+--------+-------+
|   | f1__10 | f1__20 | f2__3 |
+---+--------+--------+-------+
| 1 | 2      |        |       |
+---+--------+--------+-------+
| 2 |        | 1.5    | 1     |
+---+--------+--------+-------+
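The categorical case can be sketched in the same way. The sketch assumes that each input row contributes its weight to the cell for its (feature, value) column. That assumption reproduces the example above, but how duplicates with different weights are combined is not stated here, so treat it as an illustration only. The function name aggregate_categorical is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple

Row = Tuple[int, str, Any, float]  # (id, feature, value, weight)


def aggregate_categorical(rows: Iterable[Row]) -> Dict[Tuple[int, str], float]:
    """Accumulate weights per (object id, one-hot column) pair.

    Assumption: every row adds its weight to its one-hot column.
    """
    cells: Dict[Tuple[int, str], float] = defaultdict(float)
    for obj_id, feature, value, weight in rows:
        column = f"{feature}__{value}"  # column label style used in the table above
        cells[(obj_id, column)] += weight
    return dict(cells)


rows = [
    (1, "f1", 10, 1),
    (2, "f1", 20, 1.5),
    (1, "f1", 10, 1),
    (2, "f2", 3, 1),
]
categorical = aggregate_categorical(rows)
print(categorical)  # {(1, 'f1__10'): 2.0, (2, 'f1__20'): 1.5, (2, 'f2__3'): 1.0}
```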

Parameters
  • df (pd.DataFrame) – Flatten table with features with columns id_col, feature_col, value_col in format described above.

  • id_map (IdMap) – Mapping between external and internal ids.

  • cat_features (iterable(str), default ()) – List of categorical feature names.

  • id_col (str, default id) – Name of column with object ids.

  • feature_col (str, default feature) – Name of column with feature names.

  • value_col (str, default value) – Name of column with feature values.

  • weight_col (str, default weight) – Name of column with feature weights. If the DataFrame has no such column, all weights are assumed to be equal to 1.

Return type

SparseFeatures

classmethod from_iterables(values: csr_matrix, names: Iterable[Tuple[str, Any]]) SparseFeatures[source]

Create class instance from sparse matrix and iterable feature names.

Parameters
  • values (csr_matrix) – Feature values matrix.

  • names (iterable((str, any))) – Feature names in same format as in constructor.

Return type

SparseFeatures

get_dense() ndarray[source]

Return values in dense format.

Return type

ndarray

get_sparse() csr_matrix[source]

Return values in sparse format.

Return type

csr_matrix

take(ids: Union[Sequence[int], ndarray]) SparseFeatures[source]

Take a subset of features for given subject (user or item) ids.

Parameters

ids (array-like) – Array of internal ids to select features for.

Return type

SparseFeatures