Duplicate Detection

Overview

The Duplicate Detection feature in Google Photos Takeout Helper (GPTH) identifies and removes duplicate files based on content analysis rather than filename matching. This is essential for Google Photos exports, which often contain the same photo multiple times across different album folders and year-based directories.

Why Duplicate Detection is Critical

Google Photos Takeout exports frequently create duplicate files due to the way Google organizes exported data:

Common Duplication Scenarios

Album and Year Organization:

  • Same photo appears in Photos from 2023/IMG_001.jpg
  • Also appears in Albums/Vacation/IMG_001.jpg
  • Both files have identical content but different locations

Export Method Variations:

  • Files downloaded multiple times from Google Photos web interface
  • Multiple takeout requests creating overlapping content
  • Files processed through various export/import cycles
  • Backup duplicates from syncing Google Photos to other services

Processing Duplicates:

  • Byte-identical copies produced by repeated export or import cycles
  • The same image saved under a different name during processing
  • Motion Photos creating multiple component files with identical content

Note that detection compares exact content hashes, so only byte-for-byte identical files count as duplicates; a re-encoded or differently compressed version of the same image has different bytes and is not flagged.

Real-World Examples

📁 Google Photos Takeout/
├── Photos from 2023/
│   ├── IMG_001.jpg     ←─┐
│   └── IMG_002.jpg       │ Same content!
└── Albums/               │
    ├── Vacation/         │
    │   └── IMG_001.jpg ←─┘
    └── Family/
        └── IMG_002.jpg   ← Same as Photos from 2023/IMG_002.jpg

How Duplicate Detection Works

Three-Phase Detection Algorithm

Phase 1: Size-Based Grouping

  1. Quick Size Calculation: Reads file metadata to get size in bytes
  2. Efficient Grouping: Groups files with identical sizes together
  3. Early Filtering: Files with unique sizes are immediately marked as non-duplicates
  4. Performance Optimization: Avoids expensive hash calculations for obviously unique files (see the sketch after this list)
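
A minimal sketch of this phase, in Python for illustration (GPTH itself is written in Dart, so the names and types here are not the project's actual API):

```python
from collections import defaultdict
from pathlib import Path

def group_by_size(paths: list[Path]) -> list[list[Path]]:
    """Group files by size in bytes; only groups with two or more
    members can contain duplicates and proceed to content hashing."""
    by_size: dict[int, list[Path]] = defaultdict(list)
    for path in paths:
        # stat() reads filesystem metadata only; no file content is touched
        by_size[path.stat().st_size].append(path)
    # Files with a unique size are ruled out immediately (early filtering)
    return [group for group in by_size.values() if len(group) > 1]
```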

Phase 2: Content Hash Verification

  1. SHA-256 Hashing: Calculates cryptographic hash of file content
  2. Streaming for Large Files: Uses memory-efficient streaming for files >50MB
  3. Parallel Processing: Calculates hashes concurrently with adaptive batch sizing
  4. Hash Caching: Stores calculated hashes to avoid recalculation (see the sketch after this list)
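
The hashing step could look like the following sketch. The 50 MB streaming threshold comes from the description above; the cache and the 1 MB chunk size are assumed values for illustration:

```python
import hashlib
from pathlib import Path

STREAMING_THRESHOLD = 50 * 1024 * 1024   # 50 MB, per the description above
_hash_cache: dict[Path, str] = {}        # hash caching: avoid recalculating

def content_hash(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file's content. Files above
    the threshold are read in chunks so memory use stays flat."""
    if path in _hash_cache:
        return _hash_cache[path]
    digest = hashlib.sha256()
    if path.stat().st_size > STREAMING_THRESHOLD:
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
    else:
        digest.update(path.read_bytes())  # small files: one read is cheaper
    _hash_cache[path] = digest.hexdigest()
    return _hash_cache[path]
```

The parallel processing described above would sit on top of this function, for example a worker pool that hashes the members of each size group concurrently and adapts batch sizes to the machine.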

Phase 3: Duplicate Resolution

  1. Quality Assessment: Evaluates which duplicate to keep based on metadata quality (a simplified heuristic is sketched after this list)
  2. Album Preservation: Maintains album associations when merging duplicates
  3. Atomic Operations: Ensures consistent state during duplicate removal
  4. Statistical Reporting: Provides detailed information about duplicates found
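
A deliberately simplified version of the resolution step is sketched below. The preference for year folders ("Photos from ...") and shorter paths is an illustrative stand-in; GPTH's actual quality assessment weighs richer metadata:

```python
from pathlib import Path

def resolve_duplicates(group: list[Path]) -> tuple[Path, list[Path]]:
    """Choose one copy to keep from a set of byte-identical files;
    the rest become removal candidates. Album folders the removed
    copies lived in can be recorded before deletion so that album
    associations survive the merge."""
    def score(path: Path) -> tuple[int, int]:
        # Illustrative heuristic: prefer the copy in a year folder,
        # then use the shortest path as a tie-breaker.
        in_year_folder = any(p.name.startswith("Photos from") for p in path.parents)
        return (0 if in_year_folder else 1, len(str(path)))

    keeper = min(group, key=score)
    removals = [p for p in group if p is not keeper]
    return keeper, removals
```

The real implementation also has to perform the removal atomically and collect statistics along the way; both concerns are omitted here.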

Content-Based vs Filename-Based Detection

Traditional Approaches (Filename-based):

  • Compare filenames only
  • Fail when the same content is stored under different names
  • Miss duplicates that have been renamed or moved between folders
  • Can incorrectly flag different files that share similar names

GPTH's Content-Based Approach:

  • Analyzes actual file content using SHA-256 hashes
  • Detects duplicates regardless of filename or location (see the combined sketch below)
  • Handles renamed files correctly
  • Ignores filesystem metadata differences such as names and timestamps (only the file's content matters)
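
Putting the phases together (reusing group_by_size and content_hash from the sketches above), content-based detection boils down to a few lines, and filenames never enter the comparison:

```python
from collections import defaultdict
from pathlib import Path

def find_duplicates(paths: list[Path]) -> list[list[Path]]:
    """Return groups of byte-identical files, regardless of name or folder."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for group in group_by_size(paths):                # Phase 1: size pre-filter
        for path in group:
            by_hash[content_hash(path)].append(path)  # Phase 2: content hash
    return [g for g in by_hash.values() if len(g) > 1]

# Photos from 2023/IMG_001.jpg and Albums/Vacation/IMG_001.jpg land in the
# same group because their bytes match, even if one of them were renamed.
```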