DataHandler
- class udao.data.handler.data_handler.DataHandler(data: DataFrame, params: DataHandlerParams)
Bases:
object
DataHandler class that handles data loading, splitting, feature extraction and dataset iterator creation.
- Parameters:
data (pd.DataFrame) – Dataframe containing the data.
params (DataHandlerParams) – DataHandlerParams object containing the parameters of the DataHandler.
- extract_features() DataHandler
Extract features for the different splits of the data.
- Returns:
self
- Return type:
DataHandler
- Raises:
ValueError – Expects data to be split before extracting features.
- classmethod from_csv(csv_path: str, params: DataHandlerParams) DataHandler
Initialize DataHandler from csv.
- Parameters:
csv_path (str) – Path to the data file.
params (DataHandlerParams) – DataHandlerParams object containing the parameters of the DataHandler.
- Returns:
Initialized DataHandler object.
- Return type:
DataHandler
- get_iterators() Dict[Literal['train', 'val', 'test'], BaseDatasetIterator]
Return a dictionary of iterators for the different splits of the data.
- Returns:
Dictionary of iterators for the different splits of the data.
- Return type:
Dict[DatasetType, BaseDatasetIterator]
- process_features() DataHandler
- split_data() DataHandler
Split the data into train, validation and test sets. The split indices are stored in self.index_splits.
- Returns:
self
- Return type:
DataHandler
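The methods above return self, so they can be chained, and extract_features() raises a ValueError when the data has not been split first. The sketch below mimics that call order with a hypothetical ToyHandler class; it is a stand-in written for illustration, not the udao implementation, and its fixed 70/20/10 cut and doubling "feature" are assumptions.

```python
from typing import Dict, Iterator, List, Optional


class ToyHandler:
    """Hypothetical stand-in for DataHandler (illustration only)."""

    def __init__(self, rows: List[int]) -> None:
        self.rows = rows
        self.index_splits: Optional[Dict[str, List[int]]] = None
        self.features: Dict[str, List[int]] = {}

    def split_data(self) -> "ToyHandler":
        # Assumed fixed cut without shuffling: 70% train, 20% val, 10% test.
        n = len(self.rows)
        n_val = int(n * 0.2)
        n_test = int(n * 0.1)
        n_train = n - n_val - n_test
        self.index_splits = {
            "train": self.rows[:n_train],
            "val": self.rows[n_train:n_train + n_val],
            "test": self.rows[n_train + n_val:],
        }
        return self

    def extract_features(self) -> "ToyHandler":
        # Same guard as documented: splitting must happen first.
        if self.index_splits is None:
            raise ValueError("Data must be split before extracting features.")
        # Toy "feature": double each row value, per split.
        self.features = {
            name: [r * 2 for r in rows]
            for name, rows in self.index_splits.items()
        }
        return self

    def get_iterators(self) -> Dict[str, Iterator[int]]:
        # One iterator per split, keyed by split name.
        return {name: iter(values) for name, values in self.features.items()}


handler = ToyHandler(list(range(10)))
iterators = handler.split_data().extract_features().get_iterators()
```

Calling extract_features() on a fresh ToyHandler, before split_data(), raises the ValueError.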
- class udao.data.handler.data_handler.DataHandlerParams(index_column: str, feature_extractors: Mapping[str, Tuple[Union[Type[udao.data.extractors.base_extractors.TrainedFeatureExtractor], Type[udao.data.extractors.base_extractors.StaticFeatureExtractor]], Any]], Iterator: Type[udao.data.iterators.base_iterator.BaseDatasetIterator], feature_preprocessors: Optional[Mapping[str, List[Tuple[Union[Type[udao.data.preprocessors.base_preprocessor.TrainedFeaturePreprocessor], Type[udao.data.preprocessors.base_preprocessor.StaticFeaturePreprocessor]], Any]]]] = None, stratify_on: Optional[str] = None, val_frac: float = 0.2, test_frac: float = 0.1, dryrun: bool = False, random_state: Optional[int] = None)
Bases:
object
- Iterator: Type[BaseDatasetIterator]
Iterator class to be returned after feature extraction. It is assumed that the iterator class takes the keys and the features extracted by the feature extractors as arguments.
- dryrun: bool = False
Dry run mode for fast computation on a large dataset (sampling of a small portion), by default False
- feature_extractors: Mapping[str, Tuple[Union[Type[TrainedFeatureExtractor], Type[StaticFeatureExtractor]], Any]]
Dict that links a feature name to tuples of the form (Extractor, args) where Extractor implements FeatureExtractor and args are the arguments to be passed at initialization. N.B.: Feature names must match the iterator’s parameters.
If Extractor is a StaticFeatureExtractor, the features are extracted independently of the split.
If Extractor is a TrainedFeatureExtractor, the extractor is first fitted on the train split and then applied to the other splits.
- feature_preprocessors: Optional[Mapping[str, List[Tuple[Union[Type[TrainedFeaturePreprocessor], Type[StaticFeaturePreprocessor]], Any]]]] = None
Dict that links a feature name to a list of tuples of the form (Processor, args), where Processor implements FeatureProcessor and args are the arguments to be passed at initialization. This makes it possible to apply a series of processors to each feature, e.g. to normalize it. N.B.: Feature names must match the iterator’s parameters.
If Processor is a StaticFeaturePreprocessor, the features are processed independently of the split.
If Processor is a TrainedFeaturePreprocessor, the processor is first fitted on the train split and then applied to the other splits (typically for normalization).
- index_column: str
Column that should be used as index (unique identifier)
- random_state: Optional[int] = None
Random state for reproducibility, by default None
- stratify_on: Optional[str] = None
Column on which to stratify the split, by default None. If None, no stratification is performed.
- test_frac: float = 0.1
Fraction allotted to the test set, by default 0.1
- val_frac: float = 0.2
Fraction allotted to the validation set, by default 0.2
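With the signature defaults val_frac = 0.2 and test_frac = 0.1, the remaining 70% of the rows go to the train split. The sketch below shows that arithmetic with a hypothetical toy_split helper built on the standard library; how udao actually shuffles, rounds or stratifies is not stated here, so treat those details as assumptions.

```python
import random
from typing import Dict, Iterable, List, Optional


def toy_split(
    index: Iterable[int],
    val_frac: float = 0.2,
    test_frac: float = 0.1,
    random_state: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Hypothetical helper: shuffle the index, then cut it into three parts."""
    shuffled = list(index)
    # A seeded generator makes the shuffle reproducible (role of random_state).
    random.Random(random_state).shuffle(shuffled)
    n_val = round(len(shuffled) * val_frac)
    n_test = round(len(shuffled) * test_frac)
    return {
        "val": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }


splits = toy_split(range(100), random_state=0)
sizes = {name: len(ids) for name, ids in splits.items()}
# 100 rows yield 70 train, 20 val and 10 test indices.
```

Passing the same random_state again reproduces the same three index lists.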
- class udao.data.handler.data_handler.FeaturePipeline(extractor: Tuple[Union[Type[udao.data.extractors.base_extractors.TrainedFeatureExtractor], Type[udao.data.extractors.base_extractors.StaticFeatureExtractor]], Any], preprocessors: Optional[List[Tuple[Union[Type[udao.data.preprocessors.base_preprocessor.TrainedFeaturePreprocessor], Type[udao.data.preprocessors.base_preprocessor.StaticFeaturePreprocessor]], Any]]] = None)
Bases:
object
- extractor: Tuple[Union[Type[TrainedFeatureExtractor], Type[StaticFeatureExtractor]], Any]
Tuple defining the feature extractor and its initialization arguments.
- preprocessors: Optional[List[Tuple[Union[Type[TrainedFeaturePreprocessor], Type[StaticFeaturePreprocessor]], Any]]] = None
List of tuples defining feature preprocessors and their initialization arguments.
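The difference between static and trained steps can be illustrated with a small pipeline: a static step applies the same rule to every split, while a trained step is fitted on the train split and then applied to all splits. The classes ToyStaticExtractor and ToyTrainedScaler and the function run_pipeline are hypothetical names invented for this sketch; they do not reproduce the udao base classes or their method names.

```python
from typing import Dict, List

Splits = Dict[str, List[float]]


class ToyStaticExtractor:
    """Stateless step: the same rule is applied to every split."""

    def extract(self, values: List[float]) -> List[float]:
        return [v * 2.0 for v in values]


class ToyTrainedScaler:
    """Stateful step: learns its scale from the train split only."""

    def fit(self, train_values: List[float]) -> "ToyTrainedScaler":
        self.scale_ = max(train_values)
        return self

    def transform(self, values: List[float]) -> List[float]:
        return [v / self.scale_ for v in values]


def run_pipeline(raw: Splits) -> Splits:
    # Static extraction: independent of the split.
    extractor = ToyStaticExtractor()
    extracted = {name: extractor.extract(vals) for name, vals in raw.items()}
    # Trained preprocessing: fit on train, then apply to every split.
    scaler = ToyTrainedScaler().fit(extracted["train"])
    return {name: scaler.transform(vals) for name, vals in extracted.items()}


raw = {"train": [1.0, 2.0, 4.0], "val": [8.0], "test": [2.0]}
processed = run_pipeline(raw)
```

Because the scale comes from the train split alone, validation values can fall outside the [0, 1] range, as the val entry does here. Fitting on train only keeps information from the other splits out of the learned parameters.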
- udao.data.handler.data_handler.create_data_handler_params(iterator_cls: Type[BaseDatasetIterator], *args: str) Callable[[...], DataHandlerParams]
Creates a DataHandlerParams class dynamically based on provided iterator class and additional arguments.
- Parameters:
iterator_cls (Type[BaseDatasetIterator]) – Dataset iterator class type.
args (str) – Additional feature names to be included.
- Returns:
A dynamically generated DataHandlerParams class with arguments from the provided iterator class and additional arguments.
- Return type:
Type[DataHandlerParams]
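One way to build a parameters class from an iterator class is to read the iterator's constructor signature and generate a dataclass with one field per feature name. The sketch below does this with inspect.signature and dataclasses.make_dataclass; ToyIterator, make_params_class and the decision to skip the keys argument are assumptions made for illustration, not a description of how create_data_handler_params is implemented.

```python
import inspect
from dataclasses import fields, make_dataclass
from typing import Any, List


class ToyIterator:
    """Hypothetical iterator: takes the keys plus one argument per feature."""

    def __init__(self, keys: List[str], tabular_features: Any, objectives: Any) -> None:
        self.keys = keys
        self.tabular_features = tabular_features
        self.objectives = objectives


def make_params_class(iterator_cls: type, *extra: str) -> type:
    # Collect constructor parameter names, skipping self and keys
    # (assumption: keys come from the index, not from a feature pipeline).
    signature = inspect.signature(iterator_cls.__init__)
    names = [n for n in signature.parameters if n not in ("self", "keys")]
    names += list(extra)
    # One dataclass field per feature name.
    return make_dataclass("ToyParams", [(name, Any) for name in names])


ToyParams = make_params_class(ToyIterator, "embeddings")
params = ToyParams(tabular_features="a", objectives="b", embeddings="c")
field_names = [f.name for f in fields(ToyParams)]
```

Generating the fields from the signature keeps the feature names aligned with the iterator's parameters, which is the constraint noted for feature_extractors and feature_preprocessors.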