Data processing

UDAO provides a data processing pipeline that can be used to prepare data for training.

Feature extraction

A feature extractor extracts one kind of information from the main data source, which is expected to be a dataframe. There are two types of feature extractors, and every feature extractor should implement one of the two corresponding interfaces.

A feature extractor is expected to return a feature container of type BaseContainer that stores the extracted features per sample. A container links an index key to the feature of the corresponding sample. It makes no assumption about the type of the returned feature, other than that each sample has exactly one such feature.
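As a rough sketch of this contract, the snippet below pairs a minimal container (mapping index keys to per-sample features) with an extractor that fills it from a dataframe-like source. The names `FeatureContainer`, `RowLengthExtractor`, and `extract_features` are illustrative stand-ins, not UDAO's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical stand-in for a BaseContainer: links an index key
# to the extracted feature for the corresponding sample.
@dataclass
class FeatureContainer:
    data: Dict[str, Any] = field(default_factory=dict)

    def get(self, key: str) -> Any:
        return self.data[key]

# Hypothetical extractor: derives one feature per sample from a
# dataframe-like source (here, a plain dict of key -> row values).
class RowLengthExtractor:
    def extract_features(self, rows: Dict[str, List[float]]) -> FeatureContainer:
        # One feature per sample: the number of values in its row.
        return FeatureContainer({key: len(values) for key, values in rows.items()})

container = RowLengthExtractor().extract_features({"q1": [0.1, 0.2], "q2": [0.3]})
```

Note that the container stores exactly one feature per key; nothing constrains the feature's type, which is the point of the interface.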

Iterator

The iterator is the main output of the data processing pipeline: it iterates over the data and returns all features for the current sample. As such, it is expected to have the following attributes:

  • a list of keys

  • a list of feature containers.

The BaseDatasetIterator class enforces these requirements and provides utility methods to prepare the data for interaction with PyTorch.

The get_dataloader() method returns a PyTorch DataLoader that can be used directly for training. Iterators can also implement a custom collate() method to define how features from different samples are batched in the dataloader.
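To illustrate the iterator contract and the role of collate(), here is a minimal sketch in plain Python (dicts stand in for feature containers and the batching torch would do). `SimpleDatasetIterator` and its methods are hypothetical names for illustration, not UDAO classes.

```python
from typing import Any, Dict, List

# Hypothetical iterator sketch: holds a list of keys and a set of
# feature containers, and returns all features for one sample.
class SimpleDatasetIterator:
    def __init__(self, keys: List[str], containers: Dict[str, Dict[str, Any]]):
        self.keys = keys
        # feature name -> {sample key: feature}
        self.containers = containers

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        key = self.keys[idx]
        return {name: c[key] for name, c in self.containers.items()}

    # Stand-in for a custom collate(): groups per-sample features
    # into per-feature lists, as a DataLoader batch would.
    @staticmethod
    def collate(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
        return {name: [s[name] for s in samples] for name in samples[0]}

it = SimpleDatasetIterator(
    keys=["q1", "q2"],
    containers={"x": {"q1": 1.0, "q2": 2.0}, "y": {"q1": 0, "q2": 1}},
)
batch = SimpleDatasetIterator.collate([it[0], it[1]])
```

In a real pipeline, collate() would typically stack tensors rather than build lists, but the shape of the transformation (per-sample dicts in, per-feature batch out) is the same.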

Full data pipeline

The DataHandler class is a wrapper around the data processing pipeline. It:

  • performs the split between training, testing and validation data

  • applies the feature extractors on the data

  • applies any preprocessing on the resulting features

  • creates iterators for each split based on the features.
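The four steps above can be sketched end to end. This is a simplified illustration of the flow (split, extract, preprocess, produce per-split data), not the DataHandler implementation: the function names, the min-max preprocessing, and the split fractions are all assumptions made for the example.

```python
import random
from typing import Dict, List

# Hypothetical split step: shuffle keys and carve out test/val/train.
def split_keys(keys: List[str], val_frac: float, test_frac: float, seed: int = 0):
    rng = random.Random(seed)
    shuffled = keys[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "test": shuffled[:n_test],
        "val": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }

# Hypothetical pipeline: the "extraction" is one feature per sample,
# and the "preprocessing" is min-max scaling fitted on the training
# split only, then applied to every split.
def build_splits(data: Dict[str, float], val_frac: float = 0.2, test_frac: float = 0.2):
    splits = split_keys(list(data), val_frac, test_frac)
    train_values = [data[k] for k in splits["train"]]
    lo, hi = min(train_values), max(train_values)
    scale = (hi - lo) or 1.0
    return {
        name: {k: (data[k] - lo) / scale for k in keys}
        for name, keys in splits.items()
    }

splits = build_splits({f"q{i}": float(i) for i in range(10)})
```

Fitting the preprocessing on the training split and reusing it on validation and test data mirrors the usual discipline for avoiding leakage; each resulting per-split feature mapping is what an iterator would then consume.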

[Figure: Diagram of UDAO data processing]