Dataset#

class mergernet.data.dataset.Dataset[source]#

High-level representation of the dataset. This class abstracts all dataset IO operations (e.g. download, prepare, split).

Parameters:: config (DatasetConfig) – The dataset configuration object, obtained from the Dataset.registry attribute

Attributes

registry – A registry containing all dataset configurations

_create_dataset_table()[source]#: Scan the images table and create a CSV table of filenames if the dataset config has no table.

_discretize_label(y: ndarray) → ndarray[source]#

Find all occurrences in the table that match a DatasetConfig.label_map key and replace them with the respective value.
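
The replacement described for this method can be sketched as a plain dictionary lookup. This is a minimal illustration and not the library's implementation: `discretize_labels`, `example_map` and the sample labels are hypothetical, `example_map` only stands in for `DatasetConfig.label_map`, and plain lists are used instead of `ndarray` to keep the sketch dependency-free.

```python
from typing import Dict, Hashable, List


def discretize_labels(y: List[Hashable], label_map: Dict[Hashable, int]) -> List[Hashable]:
    """Replace every value of y that is a key of label_map with its mapped value.

    Values without an entry in label_map are passed through unchanged
    (an assumption; the real method may behave differently).
    """
    return [label_map.get(label, label) for label in y]


# Hypothetical mapping and labels, for illustration only.
example_map = {"merger": 1, "non_merger": 0}
example_labels = ["merger", "non_merger", "merger"]
discretized = discretize_labels(example_labels, example_map)
```

Whether unmapped values are kept or rejected is a design choice; the sketch keeps them so that partially mapped tables stay usable.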

Removes all downloaded files from hard disk. This includes:

static concat_fold_column(df: DataFrame, fname_column: str | None = None, class_column: str | None = None, r_column: str | None = None, n_splits: int = 5, bins: int = 3) → DataFrame[source]#

download()[source]#: Check if destination path exists, create missing folders and download the dataset files from web resource for a specified dataset type.
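
The check-create-download sequence described for `download()` can be sketched as follows. This is a hedged illustration under stated assumptions: `download_if_missing` and `fake_fetch` are hypothetical names, the real transfer is replaced by an injected callable so that the sketch runs without network access, and skipping the download when the file already exists is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import Callable


def download_if_missing(dest: Path, fetch: Callable[[Path], None]) -> bool:
    """Create missing parent folders and fetch the file unless it already exists.

    Returns True when fetch was called, False when dest was already present.
    """
    if dest.exists():
        return False
    # Create every missing folder on the way to the destination file.
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(dest)
    return True


def fake_fetch(path: Path) -> None:
    """Stand-in for a real web download: writes placeholder bytes."""
    path.write_bytes(b"placeholder")
```

Passing the fetch step in as a callable keeps the path handling testable on its own; a real implementation would put the HTTP transfer there.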

get_X_by_fold(fold: int, kind='test') → ndarray[source]#

Get X by fold

Parameters:

Returns:: X values
Return type:: numpy.ndarray

get_fold(fold: int) → Tuple[DatasetV2, DatasetV2][source]#

Generates the train and test datasets for the selected fold.

Parameters:: fold (int) – The fold to use as the test set; all other folds are used as the train set
Returns:: A tuple containing two datasets: the first is the train dataset and the second is the test dataset
Return type:: tuple, tf.data.Dataset
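
The fold selection described above (one fold held out as test, the rest used as train) can be sketched on a small in-memory table. This is only an illustration of the split rule: `split_by_fold`, the `fold` column name and the sample rows are hypothetical, and the real method returns `tf.data.Dataset` objects rather than lists.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_by_fold(rows: List[Row], fold: int, fold_column: str = "fold") -> Tuple[List[Row], List[Row]]:
    """Return (train, test): test holds the rows of the given fold, train all others."""
    train = [row for row in rows if row[fold_column] != fold]
    test = [row for row in rows if row[fold_column] == fold]
    return train, test


# Hypothetical table with one fold index per image.
rows = [
    {"filename": "a.jpg", "fold": 0},
    {"filename": "b.jpg", "fold": 1},
    {"filename": "c.jpg", "fold": 2},
    {"filename": "d.jpg", "fold": 1},
]
train_rows, test_rows = split_by_fold(rows, fold=1)
```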

get_n_folds() → int[source]#

Get the number of folds in the dataset.

get_preds_dataset(prepare: bool = True, batch_size: int = 64) → DatasetV2[source]#

is_dataset_downloaded() → bool[source]#

Check if dataset files are downloaded locally at Experiment.local_shared_path

prepare_data(ds: DatasetV2, batch_size: int = 64, buffer_size: int = 1000, kind='train')[source]#

registry: DatasetRegistry = <mergernet.data.dataset_config.DatasetRegistry object>#: A registry containing all dataset configurations