Duplicate Detection

Overview

The Duplicate Detection feature in Google Photos Takeout Helper (GPTH) identifies and removes duplicate files based on content analysis rather than filename matching. This is essential for Google Photos exports, which often contain the same photo multiple times across different album folders and year-based directories.

Why Duplicate Detection is Critical

Google Photos Takeout exports frequently create duplicate files due to the way Google organizes exported data:

Common Duplication Scenarios

Album and Year Organization:

  • Same photo appears in Photos from 2023/IMG_001.jpg
  • Also appears in Albums/Vacation/IMG_001.jpg
  • Both files have identical content but different locations

Export Method Variations:

  • Files downloaded multiple times from Google Photos web interface
  • Multiple takeout requests creating overlapping content
  • Files processed through various export/import cycles
  • Backup duplicates from syncing Google Photos to other services

Processing Duplicates:

  • Byte-identical copies produced by repeated export or import cycles
  • The same image saved under a different name during processing
  • Motion Photos creating multiple component files with identical content

Note that detection compares exact content hashes, so only byte-for-byte identical files count as duplicates; a re-encoded or differently compressed version of the same image has different bytes and is not flagged.

Real-World Examples

📁 Google Photos Takeout/
├── Photos from 2023/
│   ├── IMG_001.jpg     ←─┐
│   └── IMG_002.jpg       │ Same content!
└── Albums/               │
    ├── Vacation/         │
    │   └── IMG_001.jpg ←─┘
    └── Family/
        └── IMG_002.jpg   ← Same as Photos from 2023/IMG_002.jpg

How Duplicate Detection Works

Three-Phase Detection Algorithm

Phase 1: Size-Based Grouping

  1. Quick Size Calculation: Reads file metadata to get size in bytes
  2. Efficient Grouping: Groups files with identical sizes together
  3. Early Filtering: Files with unique sizes are immediately marked as non-duplicates
  4. Performance Optimization: Avoids expensive hash calculations for obviously unique files (see the sketch after this list)
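
A minimal sketch of this phase, in Python for illustration (GPTH itself is written in Dart, so the names and types here are not the project's actual API):

```python
from collections import defaultdict
from pathlib import Path

def group_by_size(paths: list[Path]) -> list[list[Path]]:
    """Group files by size in bytes; only groups with two or more
    members can contain duplicates and proceed to content hashing."""
    by_size: dict[int, list[Path]] = defaultdict(list)
    for path in paths:
        # stat() reads filesystem metadata only; no file content is touched
        by_size[path.stat().st_size].append(path)
    # Files with a unique size are ruled out immediately (early filtering)
    return [group for group in by_size.values() if len(group) > 1]
```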

Phase 2: Content Hash Verification

  1. SHA-256 Hashing: Calculates cryptographic hash of file content
  2. Streaming for Large Files: Uses memory-efficient streaming for files >50MB
  3. Parallel Processing: Calculates hashes concurrently with adaptive batch sizing
  4. Hash Caching: Stores calculated hashes to avoid recalculation (see the sketch after this list)
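
The hashing step could look like the following sketch. The 50 MB streaming threshold comes from the description above; the cache and the 1 MB chunk size are assumed values for illustration:

```python
import hashlib
from pathlib import Path

STREAMING_THRESHOLD = 50 * 1024 * 1024   # 50 MB, per the description above
_hash_cache: dict[Path, str] = {}        # hash caching: avoid recalculating

def content_hash(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file's content. Files above
    the threshold are read in chunks so memory use stays flat."""
    if path in _hash_cache:
        return _hash_cache[path]
    digest = hashlib.sha256()
    if path.stat().st_size > STREAMING_THRESHOLD:
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
    else:
        digest.update(path.read_bytes())  # small files: one read is cheaper
    _hash_cache[path] = digest.hexdigest()
    return _hash_cache[path]
```

The parallel processing described above would sit on top of this function, for example a worker pool that hashes the members of each size group concurrently and adapts batch sizes to the machine.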

Phase 3: Duplicate Resolution

  1. Quality Assessment: Evaluates which duplicate to keep based on metadata quality (a simplified heuristic is sketched after this list)
  2. Album Preservation: Maintains album associations when merging duplicates
  3. Atomic Operations: Ensures consistent state during duplicate removal
  4. Statistical Reporting: Provides detailed information about duplicates found
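
A deliberately simplified version of the resolution step is sketched below. The preference for year folders ("Photos from ...") and shorter paths is an illustrative stand-in; GPTH's actual quality assessment weighs richer metadata:

```python
from pathlib import Path

def resolve_duplicates(group: list[Path]) -> tuple[Path, list[Path]]:
    """Choose one copy to keep from a set of byte-identical files;
    the rest become removal candidates. Album folders the removed
    copies lived in can be recorded before deletion so that album
    associations survive the merge."""
    def score(path: Path) -> tuple[int, int]:
        # Illustrative heuristic: prefer the copy in a year folder,
        # then use the shortest path as a tie-breaker.
        in_year_folder = any(p.name.startswith("Photos from") for p in path.parents)
        return (0 if in_year_folder else 1, len(str(path)))

    keeper = min(group, key=score)
    removals = [p for p in group if p is not keeper]
    return keeper, removals
```

The real implementation also has to perform the removal atomically and collect statistics along the way; both concerns are omitted here.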

Content-Based vs Filename-Based Detection

Traditional Approaches (Filename-based):

  • Compare filenames only
  • Fail when the same content is stored under different names
  • Miss duplicates that have been renamed or moved between folders
  • Can incorrectly flag different files that share similar names

GPTH's Content-Based Approach:

  • Analyzes actual file content using SHA-256 hashes
  • Detects duplicates regardless of filename or location (see the combined sketch below)
  • Handles renamed files correctly
  • Ignores filesystem metadata differences such as names and timestamps (only the file's content matters)
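
Putting the phases together (reusing group_by_size and content_hash from the sketches above), content-based detection boils down to a few lines, and filenames never enter the comparison:

```python
from collections import defaultdict
from pathlib import Path

def find_duplicates(paths: list[Path]) -> list[list[Path]]:
    """Return groups of byte-identical files, regardless of name or folder."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for group in group_by_size(paths):                # Phase 1: size pre-filter
        for path in group:
            by_hash[content_hash(path)].append(path)  # Phase 2: content hash
    return [g for g in by_hash.values() if len(g) > 1]

# Photos from 2023/IMG_001.jpg and Albums/Vacation/IMG_001.jpg land in the
# same group because their bytes match, even if one of them were renamed.
```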