SparseFeatures
- class rectools.dataset.features.SparseFeatures(values: csr_matrix, names: Tuple[Tuple[str, Any], ...])[source]
Bases:
object
Storage for sparse features.
Sparse features are represented as a CSR matrix, where rows correspond to objects and columns correspond to features. There are two types of features: direct and categorical.
Each direct feature is represented in a single column with its real values. Direct features are numeric. E.g.
+---+----+----+
|   | f1 | f2 |
+---+----+----+
| 1 | 23 | 3  |
+---+----+----+
| 2 | 36 | 5  |
+---+----+----+
Categorical features are one-hot encoded (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html); the values in the matrix are counts per category. If you want to bin a numeric feature, make it categorical with bin indices as categories. E.g.
+---+------+------+------+
|   | f1_a | f1_b | f2_1 |
+---+------+------+------+
| 1 | 0    | 2    | 1    |
+---+------+------+------+
| 2 | 1    | 1    | 0    |
+---+------+------+------+
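The binning advice above can be sketched in plain Python: each numeric value is replaced by the index of the bin it falls into, and that index is then treated as a category. The bin edges, the `age_bin` feature name and the helper `to_bin_index` are made up for illustration; none of them are part of RecTools.

```python
from bisect import bisect_right

# Hypothetical bin edges: (<18), [18, 35), [35, 60), [60, ...)
bin_edges = [18, 35, 60]


def to_bin_index(value, edges):
    """Return the index of the bin that `value` falls into."""
    return bisect_right(edges, value)


# Hypothetical numeric feature values keyed by object id.
ages = {1: 23, 2: 36, 3: 17}
age_bins = {obj_id: to_bin_index(age, bin_edges) for obj_id, age in ages.items()}

# Each (feature name, bin index) pair would become one one-hot column.
feature_names = sorted({("age_bin", b) for b in age_bins.values()})
```

The resulting bin indices can then be passed as the values of a categorical feature instead of the raw numbers.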
Usually you do not need to create this object directly; use the from_flatten class method instead. If you need custom logic, use the from_iterables class method rather than calling the constructor.
- Parameters
values (csr_matrix) – CSR matrix containing OHE feature values.
names (tuple(tuple(str, any))) – Tuple of feature names. Direct features are identified by name only, so for direct features use (feature name, None). For categorical features use (feature name, value), as they are one-hot encoded. E.g. if you have a direct feature age and a categorical feature sex, names will be ((age, None), (sex, m), (sex, f)). The number of names must be equal to the number of columns in values.
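As a rough illustration of the names format, the sketch below lays out the age/sex example with plain Python containers. A list of lists stands in for the CSR matrix, and the row values and the helper `check_shape` are hypothetical; this only shows how names line up with columns, not how RecTools validates its input.

```python
# Feature names in the (feature name, value) format described above.
names = (
    ("age", None),  # direct feature: one column holding the raw value
    ("sex", "m"),   # categorical feature: one column per category
    ("sex", "f"),
)

# Dense stand-in for the CSR matrix: one row per object, one column per name.
rows = [
    [23, 1, 0],  # first object: age 23, sex m
    [36, 0, 1],  # second object: age 36, sex f
]


def check_shape(rows, names):
    """Check that every row has exactly one value per feature name."""
    return all(len(row) == len(names) for row in rows)


# Direct features are the ones whose second element is None.
direct = [name for name, value in names if value is None]
categorical = sorted({name for name, value in names if value is not None})
```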
Methods
from_flatten(df, id_map[, cat_features, ...]) – Construct SparseFeatures from flatten DataFrame.
from_iterables(values, names) – Create class instance from sparse matrix and iterable feature names.
Return values in dense format.
Return values in sparse format.
take(ids) – Take a subset of features for given subject (user or item) ids.
Attributes
values
names
- classmethod from_flatten(df: DataFrame, id_map: IdMap, cat_features: Iterable[Any] = (), id_col: str = 'id', feature_col: str = 'feature', value_col: str = 'value', weight_col: str = 'weight') SparseFeatures [source]
Construct SparseFeatures from flatten DataFrame.
Flatten DataFrame has 3 obligatory columns: <id of object>, <feature name>, <feature value>, and <feature weight> as the optional fourth. If there is no <feature weight> column, all weights are assumed to be equal to 1.
Direct features are converted to a sparse matrix as is. Their values are multiplied by weights. Values for the same object and the same feature are added up. E.g.:
+----+---------+-------+--------+
| id | feature | value | weight |
+----+---------+-------+--------+
| 1  | f1      | 10    | 1      |
+----+---------+-------+--------+
| 2  | f1      | 20    | 1.5    |
+----+---------+-------+--------+
| 1  | f1      | 15    | 1      |
+----+---------+-------+--------+
| 2  | f2      | 3     | 1      |
+----+---------+-------+--------+
Out:
+---+----+----+
|   | f1 | f2 |
+---+----+----+
| 1 | 25 |    |
+---+----+----+
| 2 | 30 | 3  |
+---+----+----+
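The arithmetic of the direct-feature example can be traced in plain Python. This is only a sketch of the rule stated above (multiply each value by its weight, then sum per object and feature); the helper `aggregate_direct` is hypothetical and is not how RecTools builds the matrix internally.

```python
from collections import defaultdict

# Rows of the flatten table: (id, feature, value, weight)
flatten_rows = [
    (1, "f1", 10, 1),
    (2, "f1", 20, 1.5),
    (1, "f1", 15, 1),
    (2, "f2", 3, 1),
]


def aggregate_direct(rows):
    """Sum value * weight for every (object id, feature) pair."""
    totals = defaultdict(float)
    for obj_id, feature, value, weight in rows:
        totals[(obj_id, feature)] += value * weight
    return dict(totals)


direct_values = aggregate_direct(flatten_rows)
```

Pairs that never occur in the input, such as object 1 and feature f2, are simply absent from the result, which corresponds to the empty cells of the output table.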
Categorical features are represented as horizontally stacked one-hot vectors. Duplicated values are counted. Final values (counts) are multiplied by weights. E.g.:
+----+---------+-------+--------+
| id | feature | value | weight |
+----+---------+-------+--------+
| 1  | f1      | 10    | 1      |
+----+---------+-------+--------+
| 2  | f1      | 20    | 1.5    |
+----+---------+-------+--------+
| 1  | f1      | 10    | 1      |
+----+---------+-------+--------+
| 2  | f2      | 3     | 1      |
+----+---------+-------+--------+
Out:
+---+--------+--------+-------+
|   | f1__10 | f1__20 | f2__3 |
+---+--------+--------+-------+
| 1 | 2      |        |       |
+---+--------+--------+-------+
| 2 |        | 1.5    | 1     |
+---+--------+--------+-------+
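The categorical example can be reproduced with a similar plain-Python sketch. It assumes that each occurrence of a (feature, value) pair contributes its own weight to the cell, which matches the table above; the text does not say how duplicates with different weights are combined, so treat that part as an assumption. The helper `aggregate_categorical` is hypothetical.

```python
from collections import defaultdict

# Rows of the flatten table: (id, feature, value, weight)
flatten_rows = [
    (1, "f1", 10, 1),
    (2, "f1", 20, 1.5),
    (1, "f1", 10, 1),
    (2, "f2", 3, 1),
]


def aggregate_categorical(rows):
    """Add up weights per (object id, (feature, value)) cell."""
    totals = defaultdict(float)
    for obj_id, feature, value, weight in rows:
        # Assumption: every occurrence adds its own weight to the cell.
        totals[(obj_id, (feature, value))] += weight
    return dict(totals)


cat_values = aggregate_categorical(flatten_rows)

# One column per distinct (feature, value) pair, as in the names format.
columns = sorted({column for _, column in cat_values})
```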
- Parameters
df (pd.DataFrame) – Flatten table of features with columns id_col, feature_col, value_col and, optionally, weight_col, in the format described above.
id_map (IdMap) – Mapping between external and internal ids.
cat_features (iterable(str), default ()) – List of categorical feature names.
id_col (str, default id) – Name of column with object ids.
feature_col (str, default feature) – Name of column with feature names.
value_col (str, default value) – Name of column with feature values.
weight_col (str, default weight) – Name of column with feature weights. If there is no such column, all weights will be equal to 1.
- Return type
SparseFeatures
- classmethod from_iterables(values: csr_matrix, names: Iterable[Tuple[str, Any]]) SparseFeatures [source]
Create class instance from sparse matrix and iterable feature names.
- Parameters
values (csr_matrix) – Feature values matrix.
names (iterable((str, any))) – Feature names in same format as in constructor.
- Return type
SparseFeatures
- take(ids: Union[Sequence[int], ndarray]) SparseFeatures [source]
Take a subset of features for given subject (user or item) ids.
- Parameters
ids (array-like) – Array of internal ids to select features for.
- Return type
SparseFeatures
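Conceptually, taking a subset keeps the feature names and selects only the rows for the requested ids. The sketch below uses a list of lists in place of the CSR matrix and assumes that internal ids are row positions and that the requested order is preserved; both are assumptions for illustration, and `take_rows` is a hypothetical helper rather than the RecTools implementation.

```python
names = (("age", None), ("sex", "m"), ("sex", "f"))

# Dense stand-in for the feature matrix; the row position is the internal id.
rows = [
    [23, 1, 0],  # internal id 0
    [36, 0, 1],  # internal id 1
    [41, 1, 0],  # internal id 2
]


def take_rows(rows, ids):
    """Select rows by internal id, keeping the requested order."""
    return [rows[i] for i in ids]


subset = take_rows(rows, [2, 0])
# The feature names are unchanged: only rows are selected, never columns.
subset_names = names
```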