DatasetSanitization
- class mergernet.data.sanitization.DatasetSanitization
Bases: object
- Tool for dataset (table and images) sanitization. The sanitization includes:
Visualize file size distribution of the stamps
Drop corrupted stamps after visual inspection of the file sizes
Remove rows of the table without corresponding stamps
- Parameters:
- check_images() → Tuple[ndarray, ndarray]
Checks which of the objects specified in the table have a corresponding image in the images folder and which do not
- Returns:
A tuple of two elements: (1) an array containing the iaunames of the objects that have a corresponding image, and (2) an array containing the iaunames of the objects that do not.
- Return type:
tuple of arrays
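The check described above can be illustrated with a minimal standalone sketch. It does not use or reproduce the mergernet implementation: the function name `split_by_image_presence`, the `extension` parameter, and the convention that each image is stored as `<iauname>.jpg` are assumptions made only for this example.

```python
from pathlib import Path
from typing import Iterable, Tuple

import numpy as np


def split_by_image_presence(
    iaunames: Iterable[str],
    images_folder: Path,
    extension: str = ".jpg",  # assumption: one file per object, named <iauname>.jpg
) -> Tuple[np.ndarray, np.ndarray]:
    """Return (names with an image, names without an image)."""
    names = np.asarray(list(iaunames), dtype=str)
    # Stems (file names without the extension) of the files present in the folder.
    present = {p.stem for p in Path(images_folder).glob(f"*{extension}")}
    # Boolean mask: True where the object has a matching file.
    has_image = np.array([name in present for name in names], dtype=bool)
    return names[has_image], names[~has_image]
```

Both returned arrays keep the order of the input names, so the second one can be passed on to whatever step drops rows from the table.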
- drop_images_by_filesize(threshold: float)
Remove images from images_folder with file size lower than threshold
- Parameters:
threshold (float) – The file size cutoff; images smaller than this value are removed
- drop_images_by_iauname(iaunames: List[str] | ndarray)
Remove images from images_folder by iauname
- Parameters:
iaunames (array-like of strings) – The iaunames of the objects whose images will be removed
- filesize_histogram(bins: int = 10, **kwargs)
Plot file size histogram
- Parameters:
bins (int) – The number of bins
kwargs (Any) – Arguments passed directly to plt.hist
- get_filesize_distribution() → ndarray
Computes the distribution of file sizes in the images folder
- Returns:
The distribution of file sizes
- Return type:
array
- get_iauname_by_filesize(lower: float | None = None, upper: float | None = None) → ndarray
Filters files by file size, using the optional lower and upper bounds, and returns their iaunames
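A filter of this kind can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name `names_by_filesize` is hypothetical, sizes are taken in kB with 1 kB = 1024 bytes, both bounds are treated as inclusive, and the iauname is assumed to be the file name without its extension.

```python
from pathlib import Path
from typing import Optional

import numpy as np


def names_by_filesize(
    images_folder: Path,
    lower: Optional[float] = None,
    upper: Optional[float] = None,
) -> np.ndarray:
    """Return file stems whose size in kB lies within the given bounds."""
    files = sorted(p for p in Path(images_folder).iterdir() if p.is_file())
    names = np.array([p.stem for p in files], dtype=str)
    # Assumption: 1 kB = 1024 bytes.
    sizes_kb = np.array([p.stat().st_size / 1024 for p in files], dtype=float)

    # Start by keeping everything, then apply whichever bounds were given.
    keep = np.ones(len(files), dtype=bool)
    if lower is not None:
        keep &= sizes_kb >= lower  # assumption: inclusive lower bound
    if upper is not None:
        keep &= sizes_kb <= upper  # assumption: inclusive upper bound
    return names[keep]
```

Calling it with only `upper` set selects the suspiciously small files, which is the typical use when looking for corrupted stamps.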
- sample(iaunames: List[str] | array)
Creates a sample of the dataset in a temporary folder
- Parameters:
iaunames (array-like of strings) – The objects of the sample
- sanitize(threshold: float = 0.0, dry_run: bool = False) → DataFrame | None
- Sanitizes the dataset by performing the following tasks (in this order):
1. Remove all files with size (in kB) lower than the threshold parameter from images_folder;
2. Remove all objects from the table without a corresponding file in images_folder;
3. Save the new table in the same folder as the input table, with the _sanitized suffix added to the table name.
If this method is called in dry-run mode, no changes are made to the files; the method only prints the changes that would be made.
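The three steps and the dry-run behaviour can be sketched in a few lines. This is a standalone illustration, not the mergernet implementation: the function name `sanitize_sketch`, the CSV table format, the `name_column` parameter, the 1 kB = 1024 bytes conversion, and returning None in dry-run mode are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def sanitize_sketch(
    table_path: Path,
    images_folder: Path,
    threshold: float = 0.0,
    dry_run: bool = False,
    name_column: str = "iauname",  # assumption: column holding the object names
) -> Optional[pd.DataFrame]:
    table_path, images_folder = Path(table_path), Path(images_folder)
    files = [p for p in images_folder.iterdir() if p.is_file()]

    # Step 1: find files smaller than the threshold (kB, assuming 1024 bytes).
    too_small = [p for p in files if p.stat().st_size / 1024 < threshold]
    kept = {p.stem for p in files} - {p.stem for p in too_small}

    # Step 2: mark the rows whose object still has a file after step 1.
    table = pd.read_csv(table_path)  # assumption: the table is a CSV file
    mask = table[name_column].astype(str).isin(kept)

    # Step 3: output path with the _sanitized suffix, next to the input table.
    out_path = table_path.with_name(
        f"{table_path.stem}_sanitized{table_path.suffix}"
    )

    if dry_run:
        # Report only; nothing on disk is touched.
        print(f"Would remove {len(too_small)} file(s)")
        print(f"Would drop {int((~mask).sum())} row(s)")
        print(f"Would write {out_path}")
        return None

    for p in too_small:
        p.unlink()
    sanitized = table[mask].reset_index(drop=True)
    sanitized.to_csv(out_path, index=False)
    return sanitized
```

Computing every planned change before touching the disk keeps the dry-run and real paths identical up to the final branch, so the printed report matches what a real run would do.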