DataHandler

class udao.data.handler.data_handler.DataHandler(data: DataFrame, params: DataHandlerParams)

Bases: object

DataHandler class to handle data loading, splitting, feature extraction and dataset iterator creation.

Parameters:

data (pd.DataFrame) – Dataframe containing the data.
params (DataHandlerParams) – DataHandlerParams object containing the parameters of the DataHandler.

extract_features() → DataHandler

Extract features for the different splits of the data.

Returns: self
Return type: DataHandler
Raises: ValueError – Expects data to be split before extracting features.
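
The call-order constraint and the chaining style implied by the `self` return value can be illustrated with a small stand-in. This is only a sketch: `ToyHandler` is a hypothetical class written for this example, not the udao `DataHandler`, and its split and feature logic are deliberately trivial.

```python
class ToyHandler:
    """Hypothetical stand-in that mimics the documented call order."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.index_splits = {}  # filled by split_data()
        self.features = {}      # filled by extract_features()

    def split_data(self):
        # Trivial split for illustration: the last row is the test set.
        self.index_splits = {"train": self.rows[:-1], "test": self.rows[-1:]}
        return self  # returning self is what makes chaining possible

    def extract_features(self):
        # Mirrors the documented ValueError when no split exists yet.
        if not self.index_splits:
            raise ValueError("Expects data to be split before extracting features.")
        self.features = {
            split: [value * 2 for value in values]
            for split, values in self.index_splits.items()
        }
        return self


# Chained calls in the required order: split first, then extract.
handler = ToyHandler([1, 2, 3]).split_data().extract_features()
```

Calling `ToyHandler([1, 2, 3]).extract_features()` without a prior `split_data()` raises the `ValueError`.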

classmethod from_csv(csv_path: str, params: DataHandlerParams) → DataHandler

Initialize DataHandler from csv.

Parameters:

csv_path (str) – Path to the data file.
params (DataHandlerParams) – DataHandlerParams object containing the parameters of the DataHandler.

Returns:

Initialized DataHandler object.

Return type:

DataHandler

get_iterators() → Dict[Literal['train', 'val', 'test'], BaseDatasetIterator]

Return a dictionary of iterators for the different splits of the data.

Returns: Dictionary of iterators for the different splits of the data.
Return type: Dict[DatasetType, BaseDatasetIterator]

process_features() → DataHandler

split_data() → DataHandler

Split the data into train, validation and test sets. The split indices are stored in self.index_splits.

Returns: self
Return type: DataHandler
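
A minimal sketch of a fraction-based three-way split, using the `val_frac` and `test_frac` defaults from the `DataHandlerParams` signature (0.2 and 0.1). The helper `split_indices` is hypothetical and uses only the standard library; it is not the library's implementation and ignores `stratify_on`.

```python
import random
from typing import Dict, Hashable, Iterable, List, Optional


def split_indices(
    index: Iterable[Hashable],
    val_frac: float = 0.2,
    test_frac: float = 0.1,
    random_state: Optional[int] = None,
) -> Dict[str, List[Hashable]]:
    """Hypothetical helper: shuffle the index and cut it into three parts."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(index)
    # A seeded generator makes the shuffle reproducible.
    random.Random(random_state).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    return {
        "test": shuffled[:n_test],
        "val": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }


splits = split_indices(range(100), random_state=0)
```

With 100 rows and the default fractions, this yields 70 train, 20 validation and 10 test indices, and the same `random_state` always produces the same split.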

class udao.data.handler.data_handler.DataHandlerParams(index_column: str, feature_extractors: Mapping[str, Tuple[Union[Type[udao.data.extractors.base_extractors.TrainedFeatureExtractor], Type[udao.data.extractors.base_extractors.StaticFeatureExtractor]], Any]], Iterator: Type[udao.data.iterators.base_iterator.BaseDatasetIterator], feature_preprocessors: Optional[Mapping[str, List[Tuple[Union[Type[udao.data.preprocessors.base_preprocessor.TrainedFeaturePreprocessor], Type[udao.data.preprocessors.base_preprocessor.StaticFeaturePreprocessor]], Any]]]] = None, stratify_on: Optional[str] = None, val_frac: float = 0.2, test_frac: float = 0.1, dryrun: bool = False, random_state: Optional[int] = None)

Bases: object

Iterator: Type[BaseDatasetIterator]: Iterator class to be returned after feature extraction. It is assumed that the iterator class takes the keys and the features extracted by the feature extractors as arguments.

dryrun: bool = False: Dry run mode for fast computation on a large dataset (sampling of a small portion), by default False

feature_extractors: Mapping[str, Tuple[Union[Type[TrainedFeatureExtractor], Type[StaticFeatureExtractor]], Any]]

Dict that links a feature name to tuples of the form (Extractor, args) where Extractor implements FeatureExtractor and args are the arguments to be passed at initialization. N.B.: Feature names must match the iterator’s parameters.

If Extractor is a StaticFeatureExtractor, the features are extracted independently of the split.

If Extractor is a TrainedFeatureExtractor, the extractor is first fitted on the train split and then applied to the other splits.
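
The difference between the two extractor kinds can be sketched with stand-ins. Everything below is hypothetical: `ToyStaticExtractor`, `ToyTrainedExtractor`, their method signatures, and the choice to pass `args` as a list of positional arguments are assumptions made for illustration, not the udao API.

```python
class ToyStaticExtractor:
    """Hypothetical static extractor: reads a column, no fitting needed."""

    def __init__(self, column):
        self.column = column

    def extract_features(self, rows):
        return [row[self.column] for row in rows]


class ToyTrainedExtractor:
    """Hypothetical trained extractor: learns a vocabulary on the train split."""

    def __init__(self, column):
        self.column = column
        self.vocab = {}

    def extract_features(self, rows, split):
        if split == "train":
            # Fit step: assign an integer id to each value seen in train.
            for row in rows:
                self.vocab.setdefault(row[self.column], len(self.vocab))
        # Values unseen during fitting are encoded as -1.
        return [self.vocab.get(row[self.column], -1) for row in rows]


# Mapping of feature name to (Extractor, args), as in feature_extractors.
feature_extractors = {
    "size": (ToyStaticExtractor, ["size"]),
    "op": (ToyTrainedExtractor, ["op"]),
}

data_splits = {
    "train": [{"op": "scan", "size": 10}, {"op": "join", "size": 20}],
    "test": [{"op": "sort", "size": 5}, {"op": "scan", "size": 7}],
}

features = {}
for name, (extractor_cls, args) in feature_extractors.items():
    extractor = extractor_cls(*args)
    features[name] = {}
    # The train split is processed first so trained extractors are fitted on it.
    for split in ("train", "test"):
        rows = data_splits[split]
        if isinstance(extractor, ToyTrainedExtractor):
            features[name][split] = extractor.extract_features(rows, split)
        else:
            features[name][split] = extractor.extract_features(rows)
```

The static extractor returns the same result regardless of which split it sees, while the trained one encodes the test rows with the vocabulary learned from train, so the unseen `"sort"` value maps to -1.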

feature_preprocessors: Optional[Mapping[str, List[Tuple[Union[Type[TrainedFeaturePreprocessor], Type[StaticFeaturePreprocessor]], Any]]]] = None

Dict that links a feature name to a list of tuples of the form (Processor, args) where Processor implements FeatureProcessor and args are the arguments to be passed at initialization. This makes it possible to apply a series of processors to different features, e.g. to normalize the features. N.B.: Feature names must match the iterator’s parameters.

If Processor is a StaticFeaturePreprocessor, the features are processed independently of the split.

If Processor is a TrainedFeaturePreprocessor, the processor is first fitted on the train split and then applied to the other splits (typically for normalization).
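
A chain of preprocessors for one feature can be sketched as follows. The classes `ToyScale` and `ToyCenter`, the `preprocess` method name and the helper `run_chain` are all hypothetical stand-ins chosen for this example; only the (Processor, args) list shape and the fit-on-train behaviour come from the description above.

```python
import statistics


class ToyScale:
    """Hypothetical static preprocessor: multiplies by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def preprocess(self, values):
        return [value * self.factor for value in values]


class ToyCenter:
    """Hypothetical trained preprocessor: subtracts the train-split mean."""

    def __init__(self):
        self.mean = None

    def preprocess(self, values, split):
        if split == "train":
            # Fit step: the mean is computed on the train split only.
            self.mean = statistics.mean(values)
        return [value - self.mean for value in values]


def run_chain(chain, splits):
    """Apply a list of (Processor, args) tuples, in order, to every split."""
    processors = [processor_cls(*args) for processor_cls, args in chain]
    processed = {}
    # Train goes first so that trained preprocessors are fitted before reuse.
    for split in ("train", "val", "test"):
        if split not in splits:
            continue
        values = splits[split]
        for processor in processors:
            if isinstance(processor, ToyCenter):
                values = processor.preprocess(values, split)
            else:
                values = processor.preprocess(values)
        processed[split] = values
    return processed


chain = [(ToyScale, [0.5]), (ToyCenter, [])]
processed = run_chain(chain, {"train": [2, 4, 6], "test": [10]})
```

Here the train values are scaled to 1, 2, 3 and centered on their mean of 2. The test value is scaled to 5 and centered with the same train mean, which is the point of fitting on the train split only.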

index_column: str: Column that should be used as index (unique identifier)

random_state: Optional[int] = None: Random state for reproducibility, by default None

stratify_on: Optional[str] = None: Column on which to stratify the split, by default None. If None, no stratification is performed.

test_frac: float = 0.1: Fraction allotted to the test set, by default 0.1

val_frac: float = 0.2: Fraction allotted to the validation set, by default 0.2

class udao.data.handler.data_handler.FeaturePipeline(extractor: Tuple[Union[Type[udao.data.extractors.base_extractors.TrainedFeatureExtractor], Type[udao.data.extractors.base_extractors.StaticFeatureExtractor]], Any], preprocessors: Optional[List[Tuple[Union[Type[udao.data.preprocessors.base_preprocessor.TrainedFeaturePreprocessor], Type[udao.data.preprocessors.base_preprocessor.StaticFeaturePreprocessor]], Any]]] = None)

Bases: object

extractor: Tuple[Union[Type[TrainedFeatureExtractor], Type[StaticFeatureExtractor]], Any]: Tuple defining the feature extractor and its initialization arguments.

preprocessors: Optional[List[Tuple[Union[Type[TrainedFeaturePreprocessor], Type[StaticFeaturePreprocessor]], Any]]] = None: List of tuples defining feature preprocessors and their initialization arguments.

udao.data.handler.data_handler.create_data_handler_params(iterator_cls: Type[BaseDatasetIterator], *args: str) → Callable[[...], DataHandlerParams]

Creates a DataHandlerParams class dynamically based on provided iterator class and additional arguments.

Parameters:

iterator_cls (Type[BaseDatasetIterator]) – Dataset iterator class type.
args (str) – Additional feature names to be included.

Returns:

A dynamically generated DataHandlerParams class with arguments from the provided iterator class and additional arguments.

Return type:

Type[DataHandlerParams]