Dataset

class mergernet.data.dataset.Dataset

Bases: object

High-level representation of the dataset. This class abstracts all I/O operations on the dataset (e.g. download, prepare, split).

Parameters:

config (DatasetConfig) – The dataset configuration object, obtained from the Dataset.registry attribute.

Attributes

registry

A registry containing all dataset configurations.

_create_dataset_table()

Scan the images table and create a CSV table with the filenames if the dataset config has no table.
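The idea of building a filename table from existing images can be sketched with the standard library alone. This is a minimal illustration, not the method's actual implementation: the function name `create_table_from_images`, the `filename` column, and the directory layout are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def create_table_from_images(images_dir: Path, table_path: Path) -> int:
    """Write a one-column CSV listing every file in images_dir; return the row count."""
    names = sorted(p.name for p in images_dir.iterdir() if p.is_file())
    with table_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename"])  # hypothetical column name
        writer.writerows([name] for name in names)
    return len(names)


# Demo on a throwaway directory with two empty placeholder "images"
with tempfile.TemporaryDirectory() as tmp:
    images_dir = Path(tmp) / "images"
    images_dir.mkdir()
    for name in ("b.jpg", "a.jpg"):
        (images_dir / name).touch()
    table_path = Path(tmp) / "table.csv"
    n_rows = create_table_from_images(images_dir, table_path)
    table_text = table_path.read_text()
```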

_discretize_label(y: ndarray) → ndarray

Find all occurrences in the table that match a DatasetConfig.label_map key and replace each with its respective value.

Parameters:

y (np.ndarray) –
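A minimal sketch of this kind of label replacement, assuming label_map is a plain dict from original values to replacement values. The standalone function `discretize_label` and the sample map are hypothetical and only illustrate the mapping step.

```python
import numpy as np


def discretize_label(y: np.ndarray, label_map: dict) -> np.ndarray:
    """Return a copy of y where each value that is a key of label_map is replaced."""
    out = y.copy()
    for old, new in label_map.items():
        # Build the mask from the original array so that chained
        # mappings (1 -> 0 and 0 -> something) cannot cascade.
        out[y == old] = new
    return out


# Hypothetical label map and labels, for illustration only
label_map = {1: 0, 2: 1}
y = np.array([1, 2, 2, 1])
encoded = discretize_label(y, label_map)
```

The input array is left untouched; only the returned copy carries the replaced values.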

_transform_images()

clear()

Remove all downloaded files from the hard disk. This includes:

  • Table file

  • Image archive

  • Extracted images folder
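The cleanup listed above amounts to deleting a mix of files and a directory. A minimal sketch follows, assuming the three locations are known paths; `clear_paths` and the file names in the demo are hypothetical, not part of the class.

```python
import shutil
import tempfile
from pathlib import Path


def clear_paths(paths):
    """Delete each path if present; return the paths that were actually removed."""
    removed = []
    for path in paths:
        if path.is_dir():
            shutil.rmtree(path)  # extracted images folder
        elif path.exists():
            path.unlink()  # table file or image archive
        else:
            continue  # nothing to remove at this location
        removed.append(path)
    return removed


# Demo: the archive is deliberately never created
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    table = root / "table.csv"
    archive = root / "images.tar.gz"
    images_dir = root / "images"
    table.write_text("filename\n")
    images_dir.mkdir()
    (images_dir / "a.jpg").touch()
    removed = clear_paths([table, archive, images_dir])
    removed_names = [p.name for p in removed]
    leftovers = sorted(p.name for p in root.iterdir())
```

Skipping missing paths keeps the operation safe to call repeatedly.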

compute_class_weight() → dict

static concat_fold_column(df: DataFrame, fname_column: str | None = None, class_column: str | None = None, r_column: str | None = None, n_splits: int = 5, bins: int = 3) → DataFrame

download()

Check whether the destination path exists, create any missing folders, and download the dataset files from the web resource for the specified dataset type.
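The check, create, and fetch sequence can be sketched with the standard library. This is an illustration under stated assumptions, not the method's implementation: `download_if_missing` is a hypothetical helper, and the demo uses a local file:// URL as a stand-in for the web resource so that it runs offline.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: Path) -> bool:
    """Fetch url into dest unless dest already exists; return True if fetched."""
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    urllib.request.urlretrieve(url, dest)
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Stand-in for the remote resource (assumption for the demo)
    source = root / "remote" / "table.csv"
    source.parent.mkdir()
    source.write_text("filename\na.jpg\n")

    dest = root / "data" / "my_dataset" / "table.csv"
    first = download_if_missing(source.as_uri(), dest)
    second = download_if_missing(source.as_uri(), dest)
    content = dest.read_text()
```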

get_X() → ndarray

get_X_by_fold(fold: int, kind='test') → ndarray

Get X by fold

Parameters:
  • fold (int) – Fold number

  • kind (str, optional) – one of: ‘train’ or ‘test’, by default ‘test’

Returns:

X values

Return type:

numpy.ndarray

get_fold(fold: int) → Tuple[DatasetV2, DatasetV2]

Generate the train and test datasets for the selected fold.

Parameters:

fold (int) – The fold to use as the test set; all other folds are used as the train set.

Returns:

A tuple containing two datasets: the first is the train dataset and the second is the test dataset.

Return type:

tuple of tf.data.Dataset
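The fold selection rule (one fold held out as test, every other fold used as train) can be shown on plain Python rows. This sketch does not build tf.data.Dataset objects; `split_by_fold` and the `fold` and `filename` keys are hypothetical names chosen for the example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_by_fold(rows: List[Row], fold: int) -> Tuple[List[Row], List[Row]]:
    """Return (train, test): rows of the chosen fold are test, the rest are train."""
    train = [row for row in rows if row["fold"] != fold]
    test = [row for row in rows if row["fold"] == fold]
    return train, test


# Illustrative table with a precomputed fold assignment
rows = [
    {"filename": "a.jpg", "fold": 0},
    {"filename": "b.jpg", "fold": 1},
    {"filename": "c.jpg", "fold": 2},
    {"filename": "d.jpg", "fold": 1},
]
train, test = split_by_fold(rows, fold=1)
```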

get_images_paths(iaunames: List[str]) → List[Path]

get_n_folds() → int

Get the number of folds in dataset

Returns:

the number of folds

Return type:

int

get_preds_dataset(prepare: bool = True, batch_size: int = 64) → DatasetV2

get_table() → DataFrame

is_dataset_downloaded() → bool

Check whether the dataset files are downloaded locally at Experiment.local_shared_path.

Returns:

True if the images dir and the table are found, False otherwise

Return type:

bool
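A minimal sketch of the check described above, assuming the images directory and the table file are two known paths. The helper `is_downloaded` and the paths in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def is_downloaded(images_dir: Path, table_path: Path) -> bool:
    """True only if both the images directory and the table file exist."""
    return images_dir.is_dir() and table_path.is_file()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    images_dir = root / "images"
    table_path = root / "table.csv"

    before = is_downloaded(images_dir, table_path)  # nothing exists yet
    images_dir.mkdir()
    only_images = is_downloaded(images_dir, table_path)  # table still missing
    table_path.write_text("filename\n")
    after = is_downloaded(images_dir, table_path)  # both present
```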

prepare_data(ds: DatasetV2, batch_size: int = 64, buffer_size: int = 1000, kind='train')

registry: DatasetRegistry = <mergernet.data.dataset_config.DatasetRegistry object>

A registry containing all dataset configurations.