
External Data Sources Integration

The BAP platform integrates with multiple authoritative biodiversity databases to automatically enrich species records with external links, images, distribution maps, and taxonomic information.

  • 🌐 3 integrated data sources (Wikipedia/Wikidata, GBIF, FishBase)
  • 🌍 All species types covered (Fish, Corals, Inverts, Plants)

Overview

Why External Data Integration?

External data sources provide:

  • Credibility: Links to authoritative scientific databases
  • Visual Content: High-quality specimen photographs
  • Educational Value: Distribution maps, habitat info, conservation status
  • Verification: Cross-reference species identifications
  • Discovery: Members can learn more about species they're breeding

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Species Name Group                       │
│                    (Canonical Species)                       │
└────────────┬────────────────────────────────────────────────┘
             │
             ├──► species_external_references
             │    (Wikipedia, Wikidata, GBIF, FishBase URLs)
             │
             └──► species_images
                  (Photos from all sources)

Each species group can have:

  • Multiple external reference URLs (displayed as clickable links)
  • Multiple images (displayed in galleries)
  • Sync logs tracking when data was last updated
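
A minimal TypeScript sketch of how these records might be represented in application code (the interface names and fields below are illustrative, mirroring the schema later in this page, not the actual types in the repository):

// Illustrative shapes only; the real definitions live in the application code.
interface ExternalReference {
  groupId: number;       // FK to species_name_group
  referenceUrl: string;  // e.g. a Wikipedia, GBIF, or FishBase page
  displayOrder: number;  // 0 = shown first
}

interface SpeciesImage {
  groupId: number;
  imageUrl: string;
  displayOrder: number;
  source?: string;       // e.g. "wikipedia", "gbif", "fishbase"
  attribution?: string;  // photo credit
  license?: string;      // e.g. "CC BY-SA 4.0"
}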

Integrated Data Sources

Wikipedia/Wikidata

Coverage: All species types (Fish, Corals, Inverts, Plants)
API: Wikidata SPARQL + Wikipedia REST API
Authentication: None required

What We Extract

  1. Wikidata Entity URL - Structured taxonomic data

    • Example: https://www.wikidata.org/wiki/Q178202
  2. Wikipedia Article URLs - Encyclopedia articles

    • Example: https://en.wikipedia.org/wiki/Guppy
    • Currently extracts English articles (can be extended to multiple languages)
  3. Images - High-quality, often CC-licensed photos

    • From Wikimedia Commons
    • Includes infobox images and gallery photos

Implementation Details

  • Client: src/integrations/wikipedia.ts
  • Sync Script: scripts/sync-wikipedia-external-data.ts
  • SPARQL Query: Matches species by scientific name (wdt:P225 property)
  • Rate Limiting: 100ms between requests
  • Match Criteria: Exact scientific name match
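
As a hedged sketch of the matching step: the Wikidata SPARQL endpoint can be queried for an entity whose taxon name (wdt:P225) exactly matches the canonical scientific name. The helper name below is illustrative, not the actual function in src/integrations/wikipedia.ts:

// Hedged sketch: look up a Wikidata entity by exact scientific name (wdt:P225).
async function findWikidataEntity(scientificName: string): Promise<string | null> {
  const sparql = `
    SELECT ?item WHERE {
      ?item wdt:P225 "${scientificName}" .
    } LIMIT 1`;
  const url =
    "https://query.wikidata.org/sparql?format=json&query=" + encodeURIComponent(sparql);
  const res = await fetch(url, {
    headers: { "User-Agent": "mulm-external-data-sync (example)" },
  });
  if (!res.ok) return null;
  const data = await res.json();
  const binding = data.results?.bindings?.[0];
  return binding ? binding.item.value : null; // e.g. https://www.wikidata.org/entity/Q178202
}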

GBIF (Global Biodiversity Information Facility)

Coverage: All species types
API: GBIF REST API v1
Authentication: None required

What We Extract

  1. GBIF Species Page URL - Comprehensive species profiles

    • Example: https://www.gbif.org/species/2440951
    • Includes taxonomy, synonyms, descriptions
  2. Occurrence Map URLs - Geographic distribution

    • Static map images showing where species has been observed
    • Example: https://api.gbif.org/v2/map/occurrence/density/0/0/0@1x.png?taxonKey=2440951
  3. Specimen Images - Photos from observations worldwide

    • User-contributed photos from iNaturalist, museum collections, etc.
    • High biodiversity coverage

Implementation Details

  • Client: src/integrations/gbif.ts
  • Sync Script: scripts/sync-gbif-external-data.ts
  • Matching: Species name matching API with confidence scores
  • Rate Limiting: 100ms between requests
  • Match Criteria: Confidence ≥ 80%, match type not "NONE"

Confidence Scoring

GBIF provides confidence scores for species matches:

  • 99% - Exact match, high confidence
  • 98% - Very good match
  • 94-97% - Good match (genus-level or fuzzy)
  • < 94% - Rejected (too uncertain)
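
A hedged sketch of this matching call: the public GBIF name-matching endpoint returns confidence, matchType, and usageKey fields, and weak matches are rejected before any links or images are stored. The threshold constant and helper name below are illustrative:

// Hedged sketch of the GBIF species match with the rejection criteria above.
async function matchGbifTaxonKey(scientificName: string): Promise<number | null> {
  const url =
    "https://api.gbif.org/v1/species/match?name=" + encodeURIComponent(scientificName);
  const res = await fetch(url);
  if (!res.ok) return null;
  const match = await res.json();
  if (match.matchType === "NONE" || (match.confidence ?? 0) < 80) {
    return null; // too uncertain; skip this species
  }
  return match.usageKey; // taxonKey used for species page and occurrence map URLs
}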

FishBase

Coverage: Fish only
API: Local DuckDB cache (Parquet files)
Authentication: None required

What We Extract

  1. FishBase Species Page URL

    • Example: https://www.fishbase.se/summary/3228
    • Comprehensive fish-specific data
  2. Fish Images - Multiple life stages

    • Preferred image (general)
    • Male/female specimens
    • Juvenile/larvae stages
    • Eggs

Implementation Details

  • Client: Not API-based (uses local DuckDB cache)
  • Sync Script: scripts/sync-fishbase-external-data-duckdb.ts
  • Data Source: Parquet files in scripts/fishbase/cache/
  • Matching: Direct SQL query on genus/species
  • Rate Limiting: None (local data)
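
A hedged sketch of the local lookup, assuming the duckdb Node bindings and a Parquet cache laid out as described above (the file path and column names are illustrative and may not match the real cache):

import duckdb from "duckdb";

// Illustrative only: resolve a FishBase SpecCode from the local Parquet cache.
function findFishBaseSpecCode(genus: string, species: string): Promise<number | null> {
  const db = new duckdb.Database(":memory:");
  const sql = `
    SELECT SpecCode
    FROM read_parquet('scripts/fishbase/cache/species.parquet')
    WHERE Genus = ? AND Species = ?
    LIMIT 1`;
  return new Promise((resolve, reject) => {
    db.all(sql, genus, species, (err: Error | null, rows: any[]) => {
      if (err) return reject(err);
      resolve(rows.length ? (rows[0].SpecCode as number) : null);
    });
  });
}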

Why Not FishBase API?

The FishBase rOpenSci API (https://fishbase.ropensci.org) has SSL certificate issues. We use a local DuckDB cache of FishBase data instead, which:

  • ✅ Is faster (no network calls)
  • ✅ Is more reliable (no SSL issues)
  • ✅ Provides same data quality
  • ✅ Can be updated periodically

Database Schema

species_external_references

Stores URLs to external species pages (Wikipedia, GBIF, FishBase, etc.)

CREATE TABLE species_external_references (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  group_id INTEGER NOT NULL,
  reference_url TEXT NOT NULL,
  display_order INTEGER NOT NULL DEFAULT 0,
  FOREIGN KEY (group_id) REFERENCES species_name_group(group_id) ON DELETE CASCADE,
  UNIQUE (group_id, reference_url)
);

Fields:

  • group_id - Links to canonical species
  • reference_url - Full URL to external resource
  • display_order - Determines order of display (0 = first)

Indexes:

  • Primary key on id
  • Unique constraint on (group_id, reference_url) prevents duplicates

species_images

Stores URLs to species images from all sources.

CREATE TABLE species_images (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  group_id INTEGER NOT NULL,
  image_url TEXT NOT NULL,
  display_order INTEGER NOT NULL DEFAULT 0,
  source TEXT,
  attribution TEXT,
  license TEXT,
  FOREIGN KEY (group_id) REFERENCES species_name_group(group_id) ON DELETE CASCADE,
  UNIQUE (group_id, image_url)
);

Fields:

  • group_id - Links to canonical species
  • image_url - Full URL to image (Wikimedia, GBIF, etc.)
  • display_order - Display order
  • source - Optional: Name of source database
  • attribution - Optional: Photo credit
  • license - Optional: License (e.g., "CC BY-SA 4.0")

external_data_sync_log

Tracks sync operations for debugging and monitoring.

CREATE TABLE external_data_sync_log (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  group_id INTEGER NOT NULL,
  source TEXT NOT NULL,  -- 'wikipedia', 'gbif', 'fishbase'
  sync_date TEXT NOT NULL,
  status TEXT NOT NULL,  -- 'success', 'not_found', 'error'
  links_added INTEGER DEFAULT 0,
  images_added INTEGER DEFAULT 0,
  error_message TEXT,
  FOREIGN KEY (group_id) REFERENCES species_name_group(group_id) ON DELETE CASCADE
);

Indexes:

  • idx_external_sync_group on group_id
  • idx_external_sync_source on source
  • idx_external_sync_date on sync_date

species_name_group.last_external_sync

Timestamp field tracks when species was last synced:

ALTER TABLE species_name_group ADD COLUMN last_external_sync TEXT;

Used to avoid re-syncing recently updated species (default: 90 days).
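
A hedged sketch of the kind of query that picks species due for a refresh, using the schema above (better-sqlite3 is assumed here; the real scripts may use a different SQLite client, and the filtering to species with approved submissions and ordering by popularity is omitted):

import Database from "better-sqlite3";

// Illustrative: species never synced, or not synced within the last 90 days.
const db = new Database("database.db");
const dueForSync = db
  .prepare(
    `SELECT group_id
     FROM species_name_group
     WHERE last_external_sync IS NULL
        OR last_external_sync < datetime('now', '-90 days')`
  )
  .all();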


Sync Scripts

All sync scripts follow the same pattern and support the same CLI arguments.

Common Arguments

# Dry-run (preview what would be synced, safe)
npm run script scripts/sync-<source>-external-data.ts

# Execute (actually modify database)
npm run script scripts/sync-<source>-external-data.ts -- --execute

# Limit to N species (for testing)
npm run script scripts/sync-<source>-external-data.ts -- --limit=10

# Sync specific species by ID
npm run script scripts/sync-<source>-external-data.ts -- --species-id=123

# Filter by species type
npm run script scripts/sync-<source>-external-data.ts -- --species-type=Coral

# Force re-sync (ignore last_external_sync timestamp)
npm run script scripts/sync-<source>-external-data.ts -- --force

# Custom database path
npm run script scripts/sync-<source>-external-data.ts -- --db=/path/to/db

Available Scripts

  1. Wikipedia/Wikidata Sync

    npm run script scripts/sync-wikipedia-external-data.ts -- --execute
  2. GBIF Sync

    npm run script scripts/sync-gbif-external-data.ts -- --execute
  3. FishBase Sync (DuckDB-based)

    npm run script scripts/sync-fishbase-external-data-duckdb.ts -- --execute

Sync Behavior

Default (no --force):

  • Only syncs species with approved submissions
  • Skips species synced within last 90 days
  • Prioritizes species by submission count (most popular first)

With --force:

  • Re-syncs all species regardless of last sync date
  • Adds new data without removing existing data
  • Idempotent (safe to run multiple times)

What Gets Synced:

  • Only new links/images are added
  • Existing data is preserved
  • No deletions occur
  • Display order is maintained
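
The idempotency comes from the unique constraints in the schema above: a sync can simply attempt the insert and let duplicates be rejected. A hedged sketch, reusing the better-sqlite3 handle from the earlier sketch:

// Illustrative: INSERT OR IGNORE makes repeated syncs safe because the unique
// constraint on (group_id, reference_url) silently rejects duplicates.
function addExternalReference(groupId: number, referenceUrl: string, displayOrder = 0): boolean {
  const result = db
    .prepare(
      `INSERT OR IGNORE INTO species_external_references (group_id, reference_url, display_order)
       VALUES (?, ?, ?)`
    )
    .run(groupId, referenceUrl, displayOrder);
  return result.changes === 1; // false means the link already existed
}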

Usage

The One Command

Sync all species with external data AND download images to R2 in ONE step:

# Sync entire database (2,279 species) - downloads images to R2
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

# Run multiple times until output shows "Found 0 species"
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

What this does:

  • ✅ Queries Wikipedia, GBIF, and FishBase for ALL 2,279 species
  • ✅ Stores external reference links (Wikipedia pages, GBIF pages, etc.)
  • ✅ Downloads ALL images to Cloudflare R2 (no external URLs stored)
  • ✅ Transcodes to optimized JPEGs (800×600, 85% quality)
  • ✅ Tracks full metadata (source, attribution, license, original_url)
  • ✅ Avoids re-downloads via MD5 hash checking
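
A hedged sketch of the MD5 dedup step: hash the downloaded bytes and skip the upload if that digest has been seen before. The upload call and lookup structure are placeholders, not the real R2 code:

import { createHash } from "node:crypto";

// Illustrative: the same image bytes produce the same digest even when served
// from different external URLs, so they are only stored in R2 once.
function imageDigest(bytes: Buffer): string {
  return createHash("md5").update(bytes).digest("hex");
}

async function maybeUpload(bytes: Buffer, seen: Set<string>): Promise<boolean> {
  const digest = imageDigest(bytes);
  if (seen.has(digest)) return false; // already downloaded, skip
  seen.add(digest);
  // await uploadToR2(digest, bytes);  // placeholder for the actual upload
  return true;
}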

Individual Sources

If you need to sync specific data sources:

# Wikipedia/Wikidata only (all species types)
npm run script scripts/sync-wikipedia-all-species.ts -- --execute --download-images --batch-size=500

# GBIF only (all species types)
npm run script scripts/sync-gbif-all-species.ts -- --execute --download-images --batch-size=500

# FishBase only (fish species only)
npm run script scripts/sync-fishbase-all-species.ts -- --execute --download-images --batch-size=500

Automated Sync (Production)

For production environments, set up automated daily syncs:

# On production server (one-time setup)
ssh BAP
cd /opt/basny
./scripts/setup-external-data-cron.sh

This installs a cron job that runs daily at 3 AM. See the Automated Sync (Cron) section below for details.

Common Workflows

Sync All Species (Initial Setup)

# Sync entire database with images - run in batches
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

# Continue until "Found 0 species"
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

Sync Only Specific Species Type

# Sync all corals only
npm run script scripts/sync-wikipedia-all-species.ts -- --execute --download-images --species-type=Coral

# Sync all plants only
npm run script scripts/sync-gbif-all-species.ts -- --execute --download-images --species-type=Plant

Test with Specific Species

# Test with species ID 61 (Poecilia reticulata)
npm run script scripts/sync-wikipedia-all-species.ts -- --execute --download-images --species-id=61

Resume from Interruption

# If interrupted at species ID 1234, resume from there
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500 --start-after=1234

Coverage Statistics

Overall Coverage

Metric                 Count      Notes
Total External Links   138        Wikipedia, Wikidata, GBIF, FishBase
Total Images           173        From all sources
Species with Data      47 unique  Out of 53 with submissions
Overall Success Rate   89%        At least one source matched

By Data Source

Source              Species Synced  Links Added  Images Added  Success Rate
Wikipedia/Wikidata  36              70           92            68%
GBIF                41              41           47            77%
FishBase            27              27           ~48           75%

By Species Type

Type     Total with Submissions  Synced  Coverage
Fish     36                      34      94%
Corals   11                      7       64%
Inverts  3                       3       100%
Plants   3                       3       100%

Best Image Coverage

Species with most images (across all sources):

Species               Common Name          Type    Total Images  Sources
Caridina cantonensis  Crystal Red Shrimp   Invert  13            Wikipedia (3) + GBIF (10)
Discosoma coerulea    Blue Mushroom Coral  Coral   13            Wikipedia (3) + GBIF (10)
Discosoma ferrugata   Red Mushroom Coral   Coral   13            Wikipedia (3) + GBIF (10)
Poecilia reticulata   Guppy                Fish    11            FishBase (5) + Wikipedia (3) + GBIF (3)
Ancistrus cirrhosus   Bristlenose Pleco    Fish    6             FishBase (1) + Wikipedia (3) + GBIF (0)

Automated Sync (Cron)

Overview

The external data sync system can run automatically during off-peak hours on production servers. This is the recommended approach for production deployments.

Benefits:

  • Fully automated - Set it and forget it
  • Off-peak hours - Runs when traffic is lowest (default: 3 AM)
  • Conservative - Respectful rate limits and delays
  • Monitored - Comprehensive logs with 30-day rotation
  • Safe - Idempotent, no deletions, transaction-safe

Quick Setup

On Production Server:

# SSH to production
ssh BAP

# Navigate to project
cd /opt/basny

# Pull latest code (if not already done)
git pull

# Run setup script (installs cron for 3 AM daily)
./scripts/setup-external-data-cron.sh

The setup script will:

  1. Create log directory at /var/log/mulm/
  2. Configure log rotation (30 days retention)
  3. Install cron job
  4. Test the sync with a dry-run

That's it! The sync will now run automatically every day at 3 AM.

Custom Schedule

To use a different schedule:

# Sunday at 2 AM (weekly)
./scripts/setup-external-data-cron.sh --schedule "0 2 * * 0"

# Monday at 4 AM (weekly)
./scripts/setup-external-data-cron.sh --schedule "0 4 * * 1"

# Every other day at 3 AM
./scripts/setup-external-data-cron.sh --schedule "0 3 */2 * *"

Cron expression format: MIN HOUR DAY MONTH WEEKDAY

Recommended schedules:

Frequency             Expression   Use Case
Daily at 3 AM         0 3 * * *    Active database, new species added frequently
Weekly (Sunday 2 AM)  0 2 * * 0    Stable database, moderate activity
Weekly (Monday 4 AM)  0 4 * * 1    Alternative timing
Every other day       0 3 */2 * *  Moderate frequency

What Gets Synced

By default (without --force):

  • Species with approved submissions
  • Species NOT synced within last 90 days
  • Prioritized by submission count (most popular first)
  • All three sources: Wikipedia, GBIF, FishBase

Sync behavior:

  • Only new links/images are added
  • Existing data is preserved
  • No deletions occur
  • Display order is maintained
  • Transaction-safe (all-or-nothing per species)
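
A hedged sketch of the per-species transaction, continuing the better-sqlite3 sketches from the schema section (names are illustrative):

// Illustrative: wrapping one species' updates in a transaction makes the sync
// all-or-nothing, so an interruption never leaves a species half-updated.
const syncOneSpecies = db.transaction((groupId: number, links: string[]) => {
  const insertRef = db.prepare(
    `INSERT OR IGNORE INTO species_external_references (group_id, reference_url, display_order)
     VALUES (?, ?, 0)`
  );
  for (const url of links) insertRef.run(groupId, url);
  db.prepare(
    `UPDATE species_name_group SET last_external_sync = datetime('now') WHERE group_id = ?`
  ).run(groupId);
});
// Any exception inside rolls back everything for that species.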

Expected Duration

Typical sync times for daily runs:

New/Updated Species  Duration
0-5 species          1-3 minutes
5-10 species         2-5 minutes
10-20 species        5-10 minutes
20-50 species        10-20 minutes

The 3 AM schedule ensures plenty of time before users wake up.

Monitoring

View Logs:

# Real-time (watch live)
tail -f /var/log/mulm/external-data-sync.log

# Last sync (most recent 100 lines)
tail -100 /var/log/mulm/external-data-sync.log

# Search for errors
grep "" /var/log/mulm/external-data-sync.log

# View specific date (compressed)
zcat /var/log/mulm/external-data-sync.log-20251119.gz | less

Check Cron Status:

# View installed cron jobs
crontab -l

# Check cron execution logs
grep CRON /var/log/syslog | tail -20

# Verify cron service is running
sudo systemctl status cron

Log Format:

Each sync produces output like:

================================================================================
🌐 External Data Sync Orchestrator
================================================================================
Mode: 🔴 EXECUTE
Started: 2025-11-19T03:00:01.234Z
================================================================================

[Processing Wikipedia/Wikidata...]
[1/10] Processing Poecilia reticulata (Fish)... ✅ 2 links, 3 images
...

=== Summary ===
Total processed: 10
  ✅ Success: 8
  ❌ Not found: 2
Total new links: 15
Total new images: 20

✅ Sync completed!

Rate Limiting

The automated sync uses conservative rate limiting to be respectful to API providers:

Source              Delay Between Requests  Additional Delays
Wikipedia/Wikidata  100ms                   30s before next source
GBIF                120ms                   30s before next source
FishBase            None (local data)       30s before next source

Total sync time includes:

  • Request delays within each source
  • 30 second pauses between sources
  • Automatic retry with exponential backoff on errors
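
A hedged sketch of the pacing logic described above: a fixed delay between requests plus retry with exponential backoff on failures. The delay values mirror the table above; the helper names are not from the real codebase:

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Illustrative: retry a request with exponential backoff (1s, 2s, 4s, ...).
async function fetchWithBackoff(url: string, attempts = 3): Promise<Response> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.ok) return res;
      throw new Error(`HTTP ${res.status}`);
    } catch (err) {
      if (i === attempts - 1) throw err;
      await sleep(1000 * 2 ** i);
    }
  }
  throw new Error("unreachable");
}

// Between species lookups: await sleep(100) for Wikipedia, await sleep(120) for GBIF.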

Email Notifications (Optional)

To receive email on sync completion or errors:

# Edit crontab
crontab -e

# Add MAILTO at the top
MAILTO=you@example.com

# The cron job will email stdout/stderr
0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute

Disabling or Modifying

Temporarily Disable:

crontab -e
# Add # at start of line:
# 0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute

Change Schedule:

# Re-run setup with new schedule
./scripts/setup-external-data-cron.sh --schedule "0 2 * * 0"

Permanently Remove:

crontab -e
# Delete the line containing 'sync-all-external-data.ts'

Performance Tuning

If syncs are taking too long or using too many resources:

1. Limit species per run:

Edit crontab to add --limit argument:

crontab -e

# Change to:
0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute --limit=20

2. Run less frequently:

# Weekly instead of daily
./scripts/setup-external-data-cron.sh --schedule "0 3 * * 0"

3. Skip sources:

# Skip FishBase if it's slow
crontab -e

# Add --skip-fishbase:
0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute --skip-fishbase

4. Stagger sources:

Run different sources on different days:

crontab -e

# Wikipedia on Mondays
0 3 * * 1 cd /opt/basny && npm run script scripts/sync-wikipedia-external-data.ts -- --execute

# GBIF on Wednesdays
0 3 * * 3 cd /opt/basny && npm run script scripts/sync-gbif-external-data.ts -- --execute

# FishBase on Fridays
0 3 * * 5 cd /opt/basny && npm run script scripts/sync-fishbase-external-data-duckdb.ts -- --execute

Troubleshooting Cron

Cron not running:

# Check if cron service is active
sudo systemctl status cron

# Start if stopped
sudo systemctl start cron

# Enable on boot
sudo systemctl enable cron

Sync not executing:

Check cron logs:

grep CRON /var/log/syslog | grep sync-all-external-data

Common issues:

  • PATH not set in the cron environment (the script uses absolute paths, so this is rarely the cause)
  • Permissions (run setup script as deployment user)
  • Database locked (ensure no other writes during sync)

High resource usage:

Monitor during sync:

# Watch CPU/memory
top

# Watch process
ps aux | grep ts-node

Solutions:

  • Reduce batch size with --limit
  • Run during lower-traffic hours
  • Increase delays in integration clients

Complete Documentation

For a comprehensive guide including troubleshooting, performance tips, and advanced configurations, see:

docs/EXTERNAL_DATA_CRON_SETUP.md


Troubleshooting

Species Not Found

Symptoms: Sync reports "Not found" for a species

Common Causes:

  1. Capitalization Issues

    • Database: "Danio Kerri"
    • Should be: "Danio kerri"
    • Fix: Update canonical names to use proper capitalization
  2. Typos

    • Database: "Poecillia Reticulata"
    • Should be: "Poecilia reticulata"
    • Fix: Correct scientific names
  3. Generic/Unidentified Species

    • Example: "Ancistrus sp.", "Orange Leptostrea"
    • These names aren't found in external databases (they're placeholders)
    • Fix: Identify to species level if possible
  4. Common Names Used

    • Database stores common name instead of scientific name
    • Example: "Green rhodactis" instead of genus/species
    • Fix: Add proper scientific names

Low Confidence Matches (GBIF)

Symptoms: GBIF reports low confidence (<80%) and skips species

Solutions:

  • Check if scientific name is correct
  • Verify species exists in GBIF database
  • Try searching GBIF web interface manually: https://www.gbif.org/

SSL Certificate Errors

Symptoms: Connection errors when syncing

FishBase API: Known issue - use local DuckDB sync instead
Wikipedia/GBIF: Should not occur (stable APIs)

If you encounter SSL errors:

# Test API connectivity
curl -I https://api.gbif.org/v1/species/match?name=Poecilia%20reticulata
curl -I https://query.wikidata.org/sparql

No Species Found to Sync

Symptoms: "Found 0 species to process"

Cause: All species synced within last 90 days

Solutions:

# Force re-sync
npm run script scripts/sync-wikipedia-external-data.ts -- --execute --force

# Or sync specific species
npm run script scripts/sync-wikipedia-external-data.ts -- --execute --species-id=61

Rate Limiting

Symptoms: Errors or timeouts from APIs

Current Settings:

  • Wikipedia: 100ms between requests
  • GBIF: 120ms between requests
  • FishBase: N/A (local)

If rate limited:

  • Increase delay in integration clients
  • Run syncs during off-peak hours
  • Sync in smaller batches using --limit

Future Enhancements

Planned Data Sources

  1. SeaLifeBase - FishBase's sister project for marine species

    • Better coverage for corals and marine inverts
    • Can reuse FishBase code structure
  2. iNaturalist - Community observations

    • Real-world photos from hobbyists
    • Geographic distribution data
    • Good for popular aquarium species
  3. WoRMS - World Register of Marine Species

    • Authoritative taxonomy for marine species
    • Essential for coral identification
  4. Tropicos / World Flora Online - Plant databases

    • Better coverage for aquatic plants
    • Taxonomic authority for HAP species

Planned Features

  1. Image Downloading - COMPLETED (Nov 2025)

    • ✅ One-step sync with --download-images flag
    • ✅ Automatic download to Cloudflare R2
    • ✅ Transcoding to 800×600 JPEG (85% quality)
    • ✅ MD5-based deduplication prevents re-downloads
    • ✅ Full metadata tracking (source, attribution, license, original_url)
    • ✅ Graceful error handling with fallback to external URLs
  2. Admin UI Enhancements

    • Display external links on species detail pages
    • Image galleries
    • Occurrence maps
    • Manual sync triggers from admin panel
  3. Automated Sync Scheduling - COMPLETED

    • ✅ Cron job setup script
    • ✅ Configurable schedules (daily/weekly/custom)
    • ✅ Comprehensive logging with rotation
    • ⏳ Email reports (manual setup available)
  4. Data Quality Improvements

    • Automatic capitalization fixes
    • Scientific name validation
    • Duplicate detection
    • Synonym matching

Last Updated: November 19, 2025
Integration Version: 1.0
Maintainer: Development Team
