DatasetSanitization
- class mergernet.data.sanitization.DatasetSanitization
- Bases: object
- Tool for dataset (table and images) sanitization. The sanitization includes:
- Visualize file size distribution of the stamps 
- Drop corrupted stamps after visual inspection of the file sizes 
- Remove rows of the table without corresponding stamps 
 
 - Parameters:
 - See also
 - check_images() → Tuple[ndarray, ndarray]
- Checks which of the objects listed in the table have an image in the images folder and which do not.
- Returns:
- A tuple of two elements: (1) an array with the iaunames of the objects that have a corresponding image, and (2) an array with the iaunames of the objects that do not.
- Return type:
- tuple of arrays 
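
The check described above can be illustrated with a small standalone sketch. It is not the mergernet implementation: the function name `split_by_image_presence` is hypothetical, it returns plain lists rather than arrays, and it assumes each stamp is stored as `<iauname>.<extension>` inside the images folder.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def split_by_image_presence(
    iaunames: Iterable[str], images_folder: Path
) -> Tuple[List[str], List[str]]:
    """Split iaunames into (with_image, without_image).

    Assumption: the file name without its last extension is the iauname.
    """
    names = list(iaunames)
    # Collect the stems (file names without extension) found in the folder
    stems = {p.stem for p in Path(images_folder).iterdir() if p.is_file()}
    with_image = [name for name in names if name in stems]
    without_image = [name for name in names if name not in stems]
    return with_image, without_image
```

Building the set of stems once keeps the membership test cheap even for large tables.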
 
 - drop_images_by_filesize(threshold: float)
- Remove images from images_folder with file size lower than threshold.
- Parameters:
- threshold (float) – The cutoff value 
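
As a rough illustration of dropping files by size, the sketch below deletes every file smaller than a cutoff. The function name `drop_small_files` is hypothetical, and the unit (kB, taken as 1024 bytes) is an assumption based on the description of sanitize(); the actual method may measure sizes differently.

```python
from pathlib import Path
from typing import List


def drop_small_files(images_folder: Path, threshold_kb: float) -> List[Path]:
    """Delete files smaller than ``threshold_kb`` and return the removed paths."""
    removed = []
    # sorted() materializes the listing, so deleting while looping is safe
    for path in sorted(Path(images_folder).iterdir()):
        if not path.is_file():
            continue
        size_kb = path.stat().st_size / 1024  # bytes to kB (assumed 1 kB = 1024 B)
        if size_kb < threshold_kb:
            path.unlink()
            removed.append(path)
    return removed
```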
 
 - drop_images_by_iauname(iaunames: List[str] | ndarray)
- Remove images from images_folder by iauname.
- Parameters:
- iaunames (array-like of strings) – The iaunames of the objects whose images are removed
 - See also 
 - filesize_histogram(bins: int = 10, **kwargs)
- Plot a histogram of the file sizes.
- Parameters:
- bins (int) – The number of bins
- kwargs (Any) – Keyword arguments passed directly to plt.hist
 
 
 - get_filesize_distribution() → ndarray
- Computes the distribution of file sizes in the images folder.
- Returns:
- The distribution of file sizes
- Return type:
- array 
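
To make the idea of a file size distribution concrete, here is a minimal standard-library sketch that collects the sizes and counts them in equal-width bins, which is roughly what a histogram displays. The name `filesize_counts` is hypothetical, the sizes are expressed in kB by assumption, and the result is a pair of lists rather than the array the method returns.

```python
from pathlib import Path
from typing import List, Tuple


def filesize_counts(images_folder: Path, bins: int = 10) -> Tuple[List[float], List[int]]:
    """Return (sorted sizes in kB, number of files in each equal-width bin)."""
    sizes = sorted(
        p.stat().st_size / 1024
        for p in Path(images_folder).iterdir()
        if p.is_file()
    )
    counts = [0] * bins
    if not sizes:
        return sizes, counts
    lowest, highest = sizes[0], sizes[-1]
    # Fall back to a unit width when all files have the same size
    width = (highest - lowest) / bins or 1.0
    for size in sizes:
        # The largest size falls on the last edge, so clamp it into the last bin
        index = min(int((size - lowest) / width), bins - 1)
        counts[index] += 1
    return sizes, counts
```

A bin with unusually small files and only a few entries is a natural place to look for corrupted stamps.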
 
 - get_iauname_by_filesize(lower: float | None = None, upper: float | None = None) → ndarray
- Filters files with file size lower than threshold and returns their iaunames.
 - sample(iaunames: List[str] | array)
- Creates a sample of the dataset in a temporary folder.
- Parameters:
- iaunames (array-like of strings) – The objects of the sample
 
 - sanitize(threshold: float = 0.0, dry_run: bool = False) → DataFrame | None
- Sanitizes the dataset by performing the following tasks (in this order):
- 1. Remove all files with size (in kB) lower than the threshold parameter from images_folder;
- 2. Remove all objects from the table without a corresponding file in images_folder;
- 3. Save the new table in the same folder as the input table, with the _sanitized suffix added to the table name.
 - If this method is called in dry-run mode, no changes are made to the files; the method only prints the changes that would be made.
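
The table-related steps and the dry-run behaviour can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name `sanitize_table` is hypothetical, the table is assumed to be a CSV file with an `iauname` column, each image is assumed to be named `<iauname>.<extension>`, and the sketch returns the output path instead of a DataFrame.

```python
import csv
from pathlib import Path
from typing import Optional


def sanitize_table(
    table_path: Path, images_folder: Path, dry_run: bool = False
) -> Optional[Path]:
    """Drop rows without an image and save a ``_sanitized`` copy of the table."""
    table_path = Path(table_path)
    stems = {p.stem for p in Path(images_folder).iterdir() if p.is_file()}
    with open(table_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    # Keep only the rows whose iauname has a matching image file
    kept = [row for row in rows if row["iauname"] in stems]
    out_path = table_path.with_name(f"{table_path.stem}_sanitized{table_path.suffix}")
    if dry_run:
        # Report what would happen without writing anything
        print(f"Would drop {len(rows) - len(kept)} rows and write {out_path}")
        return None
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return out_path
```

Running with `dry_run=True` first gives a chance to review the number of rows that would be dropped before any file is written.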