slideDupIdentify
`slideDupIdentify.py` is a Python script for identifying and managing duplicate whole-slide image (WSI) files. It processes files by study type and stain, and organizes duplicates according to a defined prioritization. The script writes a CSV report with metadata on the identified duplicates and logs the entire process, providing clear documentation of which files were kept, moved, or skipped.
Options and Arguments:
Option | Description |
---|---|
`--image-folder`, `-i` | Optional. The folder where the input images are located. Defaults to the current directory if not specified. |
`--study-type`, `-t` | Required. The study type prefix (e.g., AE). Files must start with this prefix to be processed. |
`--stain`, `-s` | Required. The stain name (e.g., CD34). Files must contain this string to be processed. |
`--out-file`, `-o` | Required. The output CSV file name (without extension) for saving duplicate information. |
`--force`, `-f` | Optional. Overwrite the output file if it already exists. |
`--dry-run`, `-d` | Optional. Perform a dry run in which no file operations are performed; intended actions are reported to the terminal. |
`--debug`, `-D` | Optional. Print debug information for troubleshooting. |
`--verbose`, `-v` | Optional. Show details of all duplicate samples identified. |
`--help`, `-h` | Optional. Show the help message and usage instructions. |
`--version`, `-V` | Optional. Print the script version and exit. |
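The sketch below shows how these options could be wired up with Python's `argparse`. The flag names and defaults mirror the table above, but this is only an illustration; the actual option handling inside `slideDupIdentify.py` may differ.

```python
import argparse

def parse_arguments():
    """Illustrative parser mirroring the documented options (not the script's actual code)."""
    parser = argparse.ArgumentParser(
        description="Identify and manage duplicate WSI files by study type and stain.")
    parser.add_argument("--image-folder", "-i", default=".",
                        help="Folder containing the input images (default: current directory).")
    parser.add_argument("--study-type", "-t", required=True,
                        help="Study type prefix, e.g. AE.")
    parser.add_argument("--stain", "-s", required=True,
                        help="Stain name, e.g. CD34.")
    parser.add_argument("--out-file", "-o", required=True,
                        help="Output CSV file name (without extension).")
    parser.add_argument("--force", "-f", action="store_true",
                        help="Overwrite the output file if it already exists.")
    parser.add_argument("--dry-run", "-d", action="store_true",
                        help="Report actions without performing any file operations.")
    parser.add_argument("--debug", "-D", action="store_true",
                        help="Print debug information for troubleshooting.")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Show details of all duplicate samples identified.")
    parser.add_argument("--version", "-V", action="version",
                        version="slideDupIdentify (illustrative)")
    # --help / -h is added automatically by argparse.
    return parser.parse_args()
```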
Example Usage:
```bash
python slideDupIdentify.py --image-folder /path/to/images \
    --study-type AE \
    --stain CD34 \
    --out-file duplicates_report \
    --verbose
```
In this example, the script will:
- Search `/path/to/images` for files related to the `AE` study type and `CD34` stain.
- Identify duplicates and save the information to a CSV file, `duplicates_report.AE.CD34.metadata.csv`.
- Log details about duplicate identification and prioritization.
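As a rough illustration of the matching step, the sketch below filters files by the study-type prefix and stain substring (as described in the options table) and composes the report name in the same `<out-file>.<study-type>.<stain>.metadata.csv` pattern shown above. The helper names are assumptions for illustration, not the script's internal functions.

```python
from pathlib import Path

def find_candidate_files(image_folder, study_type, stain):
    """Collect files whose names start with the study type and contain the stain (illustrative)."""
    folder = Path(image_folder)
    return sorted(
        f for f in folder.iterdir()
        if f.is_file() and f.name.startswith(study_type) and stain in f.name
    )

def output_csv_name(out_file, study_type, stain):
    """Compose the report file name following the pattern shown in the example above."""
    return f"{out_file}.{study_type}.{stain}.metadata.csv"

# Matching the example usage:
# find_candidate_files("/path/to/images", "AE", "CD34")
# output_csv_name("duplicates_report", "AE", "CD34")  # -> "duplicates_report.AE.CD34.metadata.csv"
```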
Duplicate Identification and Prioritization:
The script uses a structured process to prioritize duplicates based on:
- Preferred file type: `.ndpi` files are preferred over `.TIF`.
- Creation date: the latest file is preferred.
- Checksum and size: if files have the same type and creation date, the largest file is preferred.

For each study number and stain, one prioritized file remains, and metadata about the duplicates is stored in the output CSV. A minimal sketch of such a prioritization is shown below.
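The following sketch expresses the prioritization rules above as a sort key (preferred file type, then latest creation date, then largest size). The helper names and the use of `os.path.getctime` are assumptions for illustration, not necessarily how `slideDupIdentify.py` implements this internally.

```python
import os

# Lower rank = higher priority; .ndpi is preferred over .TIF
# (assumption: any other extension ranks last).
FILE_TYPE_RANK = {".ndpi": 0, ".tif": 1}

def priority_key(path):
    """Sort key so that, after sorting, the first element is the file to keep (illustrative)."""
    ext = os.path.splitext(path)[1].lower()
    type_rank = FILE_TYPE_RANK.get(ext, 2)      # preferred file type first
    creation_time = os.path.getctime(path)      # latest creation date preferred
    size = os.path.getsize(path)                # largest file preferred as tie-breaker
    # Negate time and size so that "later/larger" sorts first within the same type rank.
    return (type_rank, -creation_time, -size)

def pick_file_to_keep(duplicate_paths):
    """Return the prioritized file among the duplicates for one study number and stain."""
    return sorted(duplicate_paths, key=priority_key)[0]
```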