slideDupIdentify
`slideDupIdentify.py` is a Python script for identifying and managing duplicate whole-slide image (WSI) files. It processes files by study type and stain, and organizes duplicates according to a defined prioritization. The script writes a CSV report with metadata on the identified duplicates and logs the entire process, providing clear documentation of which files were kept, moved, or skipped.
Options and Arguments:
Option | Description |
---|---|
`--image-folder`, `-i` | Optional. The folder where the input images are located. Defaults to the current directory if not specified. |
`--study-type`, `-t` | Required. The study type prefix (e.g., AE). Files must start with this prefix to be processed. |
`--stain`, `-s` | Required. The stain name (e.g., CD34). Files must contain this string to be processed. |
`--out-file`, `-o` | Required. The output CSV file name (without extension) for saving duplicate information. |
`--force`, `-f` | Optional. Overwrite the output file if it already exists. |
`--dry-run`, `-d` | Optional. Perform a dry run in which no file operations are performed; intended actions are reported to the terminal. |
`--debug`, `-D` | Optional. Print debug information for troubleshooting. |
`--verbose`, `-v` | Optional. Show details of all duplicate samples identified. |
`--help`, `-h` | Optional. Show the help message and usage instructions. |
`--version`, `-V` | Optional. Print the script version and exit. |
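The sketch below shows how these options could be wired up with Python's `argparse`. The flag names and defaults mirror the table above, but this is only an illustration; the actual option handling inside `slideDupIdentify.py` may differ.

```python
import argparse

def parse_arguments():
    """Illustrative parser mirroring the documented options (not the script's actual code)."""
    parser = argparse.ArgumentParser(
        description="Identify and manage duplicate WSI files by study type and stain.")
    parser.add_argument("--image-folder", "-i", default=".",
                        help="Folder containing the input images (default: current directory).")
    parser.add_argument("--study-type", "-t", required=True,
                        help="Study type prefix, e.g. AE.")
    parser.add_argument("--stain", "-s", required=True,
                        help="Stain name, e.g. CD34.")
    parser.add_argument("--out-file", "-o", required=True,
                        help="Output CSV file name (without extension).")
    parser.add_argument("--force", "-f", action="store_true",
                        help="Overwrite the output file if it already exists.")
    parser.add_argument("--dry-run", "-d", action="store_true",
                        help="Report actions without performing any file operations.")
    parser.add_argument("--debug", "-D", action="store_true",
                        help="Print debug information for troubleshooting.")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Show details of all duplicate samples identified.")
    parser.add_argument("--version", "-V", action="version",
                        version="slideDupIdentify (illustrative)")
    # --help / -h is added automatically by argparse.
    return parser.parse_args()
```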
Example Usage:
```bash
python slideDupIdentify.py --image-folder /path/to/images \
    --study-type AE \
    --stain CD34 \
    --out-file duplicates_report \
    --verbose
```
In this example, the script will:
- Search `/path/to/images` for files related to the `AE` study type and `CD34` stain.
- Identify duplicates and save the information to a CSV file, `duplicates_report.AE.CD34.metadata.csv`.
- Log details about duplicate identification and prioritization.
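As a rough illustration of the matching step, the sketch below filters files by the study-type prefix and stain substring (as described in the options table) and composes the report name in the same `<out-file>.<study-type>.<stain>.metadata.csv` pattern shown above. The helper names are assumptions for illustration, not the script's internal functions.

```python
from pathlib import Path

def find_candidate_files(image_folder, study_type, stain):
    """Collect files whose names start with the study type and contain the stain (illustrative)."""
    folder = Path(image_folder)
    return sorted(
        f for f in folder.iterdir()
        if f.is_file() and f.name.startswith(study_type) and stain in f.name
    )

def output_csv_name(out_file, study_type, stain):
    """Compose the report file name following the pattern shown in the example above."""
    return f"{out_file}.{study_type}.{stain}.metadata.csv"

# Matching the example usage:
# find_candidate_files("/path/to/images", "AE", "CD34")
# output_csv_name("duplicates_report", "AE", "CD34")  # -> "duplicates_report.AE.CD34.metadata.csv"
```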
Duplicate Identification and Prioritization:
The script uses a structured process to prioritize duplicates based on:
- Preferred file type: `.ndpi` files are preferred over `.TIF`.
- Creation date: the latest file is preferred.
- Checksum and size: if files have the same type and creation date, the largest file is preferred.

For each study number and stain, one prioritized file remains, and metadata about the duplicates is stored in the output CSV. A minimal sketch of such a prioritization is shown below.
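The following sketch expresses the prioritization rules above as a sort key (preferred file type, then latest creation date, then largest size). The helper names and the use of `os.path.getctime` are assumptions for illustration, not necessarily how `slideDupIdentify.py` implements this internally.

```python
import os

# Lower rank = higher priority; .ndpi is preferred over .TIF
# (assumption: any other extension ranks last).
FILE_TYPE_RANK = {".ndpi": 0, ".tif": 1}

def priority_key(path):
    """Sort key so that, after sorting, the first element is the file to keep (illustrative)."""
    ext = os.path.splitext(path)[1].lower()
    type_rank = FILE_TYPE_RANK.get(ext, 2)      # preferred file type first
    creation_time = os.path.getctime(path)      # latest creation date preferred
    size = os.path.getsize(path)                # largest file preferred as tie-breaker
    # Negate time and size so that "later/larger" sorts first within the same type rank.
    return (type_rank, -creation_time, -size)

def pick_file_to_keep(duplicate_paths):
    """Return the prioritized file among the duplicates for one study number and stain."""
    return sorted(duplicate_paths, key=priority_key)[0]
```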