DatasetSanitization

class mergernet.data.sanitization.DatasetSanitization

Bases: object

Tool for dataset (table and images) sanitization. The sanitization includes:
  • Visualize file size distribution of the stamps

  • Drop corrupted stamps after visual inspection of the file sizes

  • Remove rows of the table without corresponding stamps

Parameters:
  • table (str or Path) – The path of the table

  • images_folder (str or Path) – The path where the object stamps are stored. Each file must be named with the IAU2000 name of the object, as produced by the iauname function

check_images() → Tuple[ndarray, ndarray]

Checks which of the objects specified in the table are present in the images folder and which are not

Returns:

A tuple of two elements: (1) an array containing the iaunames of the objects that have a corresponding image, and (2) an array containing the iaunames of the objects that do not.

Return type:

tuple of arrays
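
The check described above amounts to partitioning the table's iaunames by whether a matching file exists. The following is a minimal standard-library sketch of that idea, not the library's implementation: the helper name split_by_image_presence and the ".png" extension are assumptions (the source does not state the file extension), and it returns lists rather than arrays.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def split_by_image_presence(
    iaunames: Iterable[str],
    images_folder: Path,
    extension: str = ".png",  # assumption: the real extension is not documented
) -> Tuple[List[str], List[str]]:
    """Partition iaunames into (with image, without image)."""
    names = list(iaunames)
    # Stems of the files present in the folder; Path.stem strips only the
    # last suffix, so dots inside an iauname are preserved.
    available = {p.stem for p in Path(images_folder).glob(f"*{extension}")}
    with_image = [name for name in names if name in available]
    without_image = [name for name in names if name not in available]
    return with_image, without_image
```

The second list is the one that matters for sanitization, since it identifies the rows that would be removed from the table.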

drop_images_by_filesize(threshold: float)

Remove images from images_folder with file size lower than threshold

Parameters:

threshold (float) – The cutoff value

drop_images_by_iauname(iaunames: List[str] | ndarray)

Remove images from images_folder by iauname

Parameters:

iaunames (array-like of strings) – The iaunames of the objects whose images will be removed

filesize_histogram(bins: int = 10, **kwargs)

Plot file size histogram

Parameters:
  • bins (int) – The number of bins

  • kwargs (Any) – Arguments passed directly to plt.hist

get_filesize_distribution() → ndarray

Computes the distribution of filesizes in the images folder

Returns:

The distribution of filesizes

Return type:

array
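
As an illustration of what a file size distribution looks like in practice, the sketch below collects the size of every file in a folder using only the standard library. It is not the library's code: the helper name filesizes_kb is hypothetical, the kilobyte is assumed to be 1024 bytes, and the result is a sorted list rather than an ndarray.

```python
from pathlib import Path
from typing import List

BYTES_PER_KB = 1024  # assumption: the library may use 1000 bytes per kB


def filesizes_kb(images_folder: Path) -> List[float]:
    """Return the size, in kB, of every regular file in images_folder."""
    return sorted(
        p.stat().st_size / BYTES_PER_KB
        for p in Path(images_folder).iterdir()
        if p.is_file()
    )
```

Corrupted stamps usually show up as a cluster of unusually small values at the start of this list, which is what the histogram is meant to reveal.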

get_iauname_by_filesize(lower: float | None = None, upper: float | None = None) → ndarray

Filters the files by file size, using the lower and upper cutoff values, and returns their iaunames

Parameters:
  • lower (float) – The lower cutoff value

  • upper (float) – The upper cutoff value

Returns:

Array of iaunames

Return type:

array
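
A range filter of this kind can be sketched with the standard library as shown below. This is an illustration under stated assumptions, not the library's implementation: the helper name names_by_filesize is hypothetical, sizes are taken in kB of 1024 bytes, both bounds are treated as inclusive, and a bound left as None is ignored. The source does not specify any of these details.

```python
from pathlib import Path
from typing import List, Optional


def names_by_filesize(
    images_folder: Path,
    lower: Optional[float] = None,
    upper: Optional[float] = None,
) -> List[str]:
    """Return the stems of files whose size in kB lies in [lower, upper]."""
    selected = []
    for path in sorted(Path(images_folder).iterdir()):
        if not path.is_file():
            continue
        size_kb = path.stat().st_size / 1024  # assumption: 1 kB = 1024 bytes
        if lower is not None and size_kb < lower:
            continue  # below the lower cutoff
        if upper is not None and size_kb > upper:
            continue  # above the upper cutoff
        selected.append(path.stem)
    return selected
```

Passing only upper selects the suspiciously small files, which can then be inspected visually before anything is deleted.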

sample(iaunames: List[str] | array)

Creates a sample of the dataset in temp folder

Parameters:

iaunames (array-like of string) – The objects of the sample

sanitize(threshold: float = 0.0, dry_run: bool = False) → DataFrame | None

Sanitizes the dataset by performing the following tasks, in this order:
  1. Remove all files with size (in kB) lower than the threshold parameter from images_folder

  2. Remove all objects from the table without a corresponding file in images_folder

  3. Save the new table in the same folder as the input table, with the _sanitized suffix added to the table name

If this method is called in dry-run mode, no changes are made to the files; the method only prints the changes that would be made.

Parameters:
  • threshold (float) – The file size cutoff in kilobytes; files smaller than this value are dropped from images_folder

  • dry_run (bool) – Set the dry-run mode if True

Returns:

The sanitized table if not in dry-run mode

Return type:

DataFrame or None
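
The three sanitization steps and the dry-run behaviour can be sketched end to end with the standard library. This is a hedged illustration, not the library's implementation: the function name sanitize_sketch and the "iauname" column name are hypothetical, the table is assumed to be a CSV file, 1 kB is assumed to be 1024 bytes, and the result is a list of row dictionaries rather than a DataFrame.

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional


def sanitize_sketch(
    table: Path,
    images_folder: Path,
    threshold: float = 0.0,
    dry_run: bool = False,
    name_column: str = "iauname",  # assumption: hypothetical column name
) -> Optional[List[Dict[str, str]]]:
    table, images_folder = Path(table), Path(images_folder)
    files = [p for p in images_folder.iterdir() if p.is_file()]

    # Step 1: select files smaller than the threshold (kB of 1024 bytes).
    too_small = {p for p in files if p.stat().st_size / 1024 < threshold}
    kept_names = {p.stem for p in files if p not in too_small}

    # Step 2: keep only the rows whose object still has an image.
    with table.open(newline="") as fh:
        reader = csv.DictReader(fh)
        fieldnames = reader.fieldnames or []
        rows = list(reader)
    kept_rows = [row for row in rows if row[name_column] in kept_names]

    output = table.with_name(f"{table.stem}_sanitized{table.suffix}")
    if dry_run:
        # Report the planned changes without touching any file.
        print(f"Would remove {len(too_small)} file(s) from {images_folder}")
        print(f"Would drop {len(rows) - len(kept_rows)} row(s) from the table")
        print(f"Would write {output}")
        return None

    for path in too_small:
        path.unlink()

    # Step 3: save the new table next to the input, with the _sanitized suffix.
    with output.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept_rows)
    return kept_rows
```

Computing the full plan before any deletion is what makes the dry-run mode cheap to support: the same selection logic feeds either the printed report or the destructive steps.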