
External Data Sources Integration

The BAP platform integrates with multiple authoritative biodiversity databases to automatically enrich species records with external links, images, distribution maps, and taxonomic information.

  • 🌐 3 integrated data sources (Wikipedia/Wikidata, GBIF, FishBase)
  • 🌍 All species types covered (Fish, Corals, Inverts, Plants)

Overview

Why External Data Integration?

External data sources provide:

  • Credibility: Links to authoritative scientific databases
  • Visual Content: High-quality specimen photographs
  • Educational Value: Distribution maps, habitat info, conservation status
  • Verification: Cross-reference species identifications
  • Discovery: Members can learn more about species they're breeding

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Species Name Group                       │
│                    (Canonical Species)                       │
└────────────┬────────────────────────────────────────────────┘
             │
             ├──► species_external_references
             │    (Wikipedia, Wikidata, GBIF, FishBase URLs)
             │
             └──► species_images
                  (Photos from all sources)

Each species group can have:

  • Multiple external reference URLs (displayed as clickable links)
  • Multiple images (displayed in galleries)
  • Sync logs tracking when data was last updated
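
A minimal TypeScript sketch of how these records might be represented in application code (the interface names and fields below are illustrative, mirroring the schema later in this page, not the actual types in the repository):

// Illustrative shapes only; the real definitions live in the application code.
interface ExternalReference {
  groupId: number;       // FK to species_name_group
  referenceUrl: string;  // e.g. a Wikipedia, GBIF, or FishBase page
  displayOrder: number;  // 0 = shown first
}

interface SpeciesImage {
  groupId: number;
  imageUrl: string;
  displayOrder: number;
  source?: string;       // e.g. "wikipedia", "gbif", "fishbase"
  attribution?: string;  // photo credit
  license?: string;      // e.g. "CC BY-SA 4.0"
}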

Integrated Data Sources

Wikipedia/Wikidata

Coverage: All species types (Fish, Corals, Inverts, Plants)
API: Wikidata SPARQL + Wikipedia REST API
Authentication: None required

What We Extract

  1. Wikidata Entity URL - Structured taxonomic data

    • Example: https://www.wikidata.org/wiki/Q178202
  2. Wikipedia Article URLs - Encyclopedia articles

    • Example: https://en.wikipedia.org/wiki/Guppy
    • Currently extracts English articles (can be extended to multiple languages)
  3. Images - High-quality, often CC-licensed photos

    • From Wikimedia Commons
    • Includes infobox images and gallery photos

Implementation Details

  • Client: src/integrations/wikipedia.ts
  • Sync Script: scripts/sync-wikipedia-external-data.ts
  • SPARQL Query: Matches species by scientific name (wdt:P225 property)
  • Rate Limiting: 100ms between requests
  • Match Criteria: Exact scientific name match
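
As a hedged sketch of the matching step: the Wikidata SPARQL endpoint can be queried for an entity whose taxon name (wdt:P225) exactly matches the canonical scientific name. The helper name below is illustrative, not the actual function in src/integrations/wikipedia.ts:

// Hedged sketch: look up a Wikidata entity by exact scientific name (wdt:P225).
async function findWikidataEntity(scientificName: string): Promise<string | null> {
  const sparql = `
    SELECT ?item WHERE {
      ?item wdt:P225 "${scientificName}" .
    } LIMIT 1`;
  const url =
    "https://query.wikidata.org/sparql?format=json&query=" + encodeURIComponent(sparql);
  const res = await fetch(url, {
    headers: { "User-Agent": "mulm-external-data-sync (example)" },
  });
  if (!res.ok) return null;
  const data = await res.json();
  const binding = data.results?.bindings?.[0];
  return binding ? binding.item.value : null; // e.g. https://www.wikidata.org/entity/Q178202
}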

GBIF (Global Biodiversity Information Facility)

Coverage: All species types
API: GBIF REST API v1
Authentication: None required

What We Extract

  1. GBIF Species Page URL - Comprehensive species profiles

    • Example: https://www.gbif.org/species/2440951
    • Includes taxonomy, synonyms, descriptions
  2. Occurrence Map URLs - Geographic distribution

    • Static map images showing where species has been observed
    • Example: https://api.gbif.org/v2/map/occurrence/density/0/0/0@1x.png?taxonKey=2440951
  3. Specimen Images - Photos from observations worldwide

    • User-contributed photos from iNaturalist, museum collections, etc.
    • High biodiversity coverage

Implementation Details

  • Client: src/integrations/gbif.ts
  • Sync Script: scripts/sync-gbif-external-data.ts
  • Matching: Species name matching API with confidence scores
  • Rate Limiting: 100ms between requests
  • Match Criteria: Confidence ≥ 80%, match type not "NONE"

Confidence Scoring

GBIF provides confidence scores for species matches:

  • 99% - Exact match, high confidence
  • 98% - Very good match
  • 94-97% - Good match (genus-level or fuzzy)
  • < 94% - Rejected (too uncertain)
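
A hedged sketch of this matching call: the public GBIF name-matching endpoint returns confidence, matchType, and usageKey fields, and weak matches are rejected before any links or images are stored. The threshold constant and helper name below are illustrative:

// Hedged sketch of the GBIF species match with the rejection criteria above.
async function matchGbifTaxonKey(scientificName: string): Promise<number | null> {
  const url =
    "https://api.gbif.org/v1/species/match?name=" + encodeURIComponent(scientificName);
  const res = await fetch(url);
  if (!res.ok) return null;
  const match = await res.json();
  if (match.matchType === "NONE" || (match.confidence ?? 0) < 80) {
    return null; // too uncertain; skip this species
  }
  return match.usageKey; // taxonKey used for species page and occurrence map URLs
}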

FishBase

Coverage: Fish only
API: Local DuckDB cache (Parquet files)
Authentication: None required

What We Extract

  1. FishBase Species Page URL

    • Example: https://www.fishbase.se/summary/3228
    • Comprehensive fish-specific data
  2. Fish Images - Multiple life stages

    • Preferred image (general)
    • Male/female specimens
    • Juvenile/larvae stages
    • Eggs

Implementation Details

  • Client: Not API-based (uses local DuckDB cache)
  • Sync Script: scripts/sync-fishbase-external-data-duckdb.ts
  • Data Source: Parquet files in scripts/fishbase/cache/
  • Matching: Direct SQL query on genus/species
  • Rate Limiting: None (local data)
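
A hedged sketch of the local lookup, assuming the duckdb Node bindings and a Parquet cache laid out as described above (the file path and column names are illustrative and may not match the real cache):

import duckdb from "duckdb";

// Illustrative only: resolve a FishBase SpecCode from the local Parquet cache.
function findFishBaseSpecCode(genus: string, species: string): Promise<number | null> {
  const db = new duckdb.Database(":memory:");
  const sql = `
    SELECT SpecCode
    FROM read_parquet('scripts/fishbase/cache/species.parquet')
    WHERE Genus = ? AND Species = ?
    LIMIT 1`;
  return new Promise((resolve, reject) => {
    db.all(sql, genus, species, (err: Error | null, rows: any[]) => {
      if (err) return reject(err);
      resolve(rows.length ? (rows[0].SpecCode as number) : null);
    });
  });
}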

Why Not FishBase API?

The FishBase rOpenSci API (https://fishbase.ropensci.org) has SSL certificate issues. We use a local DuckDB cache of FishBase data instead, which:

  • ✅ Is faster (no network calls)
  • ✅ Is more reliable (no SSL issues)
  • ✅ Provides same data quality
  • ✅ Can be updated periodically

Database Schema

species_external_references

Stores URLs to external species pages (Wikipedia, GBIF, FishBase, etc.)

CREATE TABLE species_external_references (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  group_id INTEGER NOT NULL,
  reference_url TEXT NOT NULL,
  display_order INTEGER NOT NULL DEFAULT 0,
  FOREIGN KEY (group_id) REFERENCES species_name_group(group_id) ON DELETE CASCADE,
  UNIQUE (group_id, reference_url)
);

Fields:

  • group_id - Links to canonical species
  • reference_url - Full URL to external resource
  • display_order - Determines order of display (0 = first)

Indexes:

  • Primary key on id
  • Unique constraint on (group_id, reference_url) prevents duplicates

species_images

Stores URLs to species images from all sources.

CREATE TABLE species_images (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  group_id INTEGER NOT NULL,
  image_url TEXT NOT NULL,
  display_order INTEGER NOT NULL DEFAULT 0,
  source TEXT,
  attribution TEXT,
  license TEXT,
  FOREIGN KEY (group_id) REFERENCES species_name_group(group_id) ON DELETE CASCADE,
  UNIQUE (group_id, image_url)
);

Fields:

  • group_id - Links to canonical species
  • image_url - Full URL to image (Wikimedia, GBIF, etc.)
  • display_order - Display order
  • source - Optional: Name of source database
  • attribution - Optional: Photo credit
  • license - Optional: License (e.g., "CC BY-SA 4.0")

external_data_sync_log

Tracks sync operations for debugging and monitoring.

CREATE TABLE external_data_sync_log (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  group_id INTEGER NOT NULL,
  source TEXT NOT NULL,  -- 'wikipedia', 'gbif', 'fishbase'
  sync_date TEXT NOT NULL,
  status TEXT NOT NULL,  -- 'success', 'not_found', 'error'
  links_added INTEGER DEFAULT 0,
  images_added INTEGER DEFAULT 0,
  error_message TEXT,
  FOREIGN KEY (group_id) REFERENCES species_name_group(group_id) ON DELETE CASCADE
);

Indexes:

  • idx_external_sync_group on group_id
  • idx_external_sync_source on source
  • idx_external_sync_date on sync_date

species_name_group.last_external_sync

Timestamp field tracks when species was last synced:

ALTER TABLE species_name_group ADD COLUMN last_external_sync TEXT;

Used to avoid re-syncing recently updated species (default: 90 days).
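
A hedged sketch of the kind of query that picks species due for a refresh, using the schema above (better-sqlite3 is assumed here; the real scripts may use a different SQLite client, and the filtering to species with approved submissions and ordering by popularity is omitted):

import Database from "better-sqlite3";

// Illustrative: species never synced, or not synced within the last 90 days.
const db = new Database("database.db");
const dueForSync = db
  .prepare(
    `SELECT group_id
     FROM species_name_group
     WHERE last_external_sync IS NULL
        OR last_external_sync < datetime('now', '-90 days')`
  )
  .all();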


Sync Scripts

All sync scripts follow the same pattern and support the same CLI arguments.

Common Arguments

# Dry-run (preview what would be synced, safe)
npm run script scripts/sync-<source>-external-data.ts

# Execute (actually modify database)
npm run script scripts/sync-<source>-external-data.ts -- --execute

# Limit to N species (for testing)
npm run script scripts/sync-<source>-external-data.ts -- --limit=10

# Sync specific species by ID
npm run script scripts/sync-<source>-external-data.ts -- --species-id=123

# Filter by species type
npm run script scripts/sync-<source>-external-data.ts -- --species-type=Coral

# Force re-sync (ignore last_external_sync timestamp)
npm run script scripts/sync-<source>-external-data.ts -- --force

# Custom database path
npm run script scripts/sync-<source>-external-data.ts -- --db=/path/to/db

Available Scripts

  1. Wikipedia/Wikidata Sync

    npm run script scripts/sync-wikipedia-external-data.ts -- --execute
  2. GBIF Sync

    npm run script scripts/sync-gbif-external-data.ts -- --execute
  3. FishBase Sync (DuckDB-based)

    npm run script scripts/sync-fishbase-external-data-duckdb.ts -- --execute

Sync Behavior

Default (no --force):

  • Only syncs species with approved submissions
  • Skips species synced within last 90 days
  • Prioritizes species by submission count (most popular first)

With --force:

  • Re-syncs all species regardless of last sync date
  • Adds new data without removing existing data
  • Idempotent (safe to run multiple times)

What Gets Synced:

  • Only new links/images are added
  • Existing data is preserved
  • No deletions occur
  • Display order is maintained
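
The idempotency comes from the unique constraints in the schema above: a sync can simply attempt the insert and let duplicates be rejected. A hedged sketch, reusing the better-sqlite3 handle from the earlier sketch:

// Illustrative: INSERT OR IGNORE makes repeated syncs safe because the unique
// constraint on (group_id, reference_url) silently rejects duplicates.
function addExternalReference(groupId: number, referenceUrl: string, displayOrder = 0): boolean {
  const result = db
    .prepare(
      `INSERT OR IGNORE INTO species_external_references (group_id, reference_url, display_order)
       VALUES (?, ?, ?)`
    )
    .run(groupId, referenceUrl, displayOrder);
  return result.changes === 1; // false means the link already existed
}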

Usage

The One Command

Sync all species with external data AND download images to R2 in ONE step:

# Sync entire database (2,279 species) - downloads images to R2
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

# Run multiple times until output shows "Found 0 species"
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

What this does:

  • ✅ Queries Wikipedia, GBIF, and FishBase for ALL 2,279 species
  • ✅ Stores external reference links (Wikipedia pages, GBIF pages, etc.)
  • ✅ Downloads ALL images to Cloudflare R2 (no external URLs stored)
  • ✅ Transcodes to optimized JPEGs (800×600, 85% quality)
  • ✅ Tracks full metadata (source, attribution, license, original_url)
  • ✅ Avoids re-downloads via MD5 hash checking
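
A hedged sketch of the MD5 dedup step: hash the downloaded bytes and skip the upload if that digest has been seen before. The upload call and lookup structure are placeholders, not the real R2 code:

import { createHash } from "node:crypto";

// Illustrative: the same image bytes produce the same digest even when served
// from different external URLs, so they are only stored in R2 once.
function imageDigest(bytes: Buffer): string {
  return createHash("md5").update(bytes).digest("hex");
}

async function maybeUpload(bytes: Buffer, seen: Set<string>): Promise<boolean> {
  const digest = imageDigest(bytes);
  if (seen.has(digest)) return false; // already downloaded, skip
  seen.add(digest);
  // await uploadToR2(digest, bytes);  // placeholder for the actual upload
  return true;
}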

Individual Sources

If you need to sync specific data sources:

# Wikipedia/Wikidata only (all species types)
npm run script scripts/sync-wikipedia-all-species.ts -- --execute --download-images --batch-size=500

# GBIF only (all species types)
npm run script scripts/sync-gbif-all-species.ts -- --execute --download-images --batch-size=500

# FishBase only (fish species only)
npm run script scripts/sync-fishbase-all-species.ts -- --execute --download-images --batch-size=500

Automated Sync (Production)

For production environments, set up automated daily syncs:

# On production server (one-time setup)
ssh BAP
cd /opt/basny
./scripts/setup-external-data-cron.sh

This installs a cron job that runs daily at 3 AM. See the Automated Sync (Cron) section below for details.

Common Workflows

Sync All Species (Initial Setup)

# Sync entire database with images - run in batches
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

# Continue until "Found 0 species"
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

Sync Only Specific Species Type

# Sync all corals only
npm run script scripts/sync-wikipedia-all-species.ts -- --execute --download-images --species-type=Coral

# Sync all plants only
npm run script scripts/sync-gbif-all-species.ts -- --execute --download-images --species-type=Plant

Test with Specific Species

# Test with species ID 61 (Poecilia reticulata)
npm run script scripts/sync-wikipedia-all-species.ts -- --execute --download-images --species-id=61

Resume from Interruption

# If interrupted at species ID 1234, resume from there
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500 --start-after=1234

Coverage Statistics

Overall Coverage

Metric                 Count      Notes
Total External Links   138        Wikipedia, Wikidata, GBIF, FishBase
Total Images           173        From all sources
Species with Data      47 unique  Out of 53 with submissions
Overall Success Rate   89%        At least one source matched

By Data Source

Source              Species Synced  Links Added  Images Added  Success Rate
Wikipedia/Wikidata  36              70           92            68%
GBIF                41              41           47            77%
FishBase            27              27           ~48           75%

By Species Type

Type     Total with Submissions  Synced  Coverage
Fish     36                      34      94%
Corals   11                      7       64%
Inverts  3                       3       100%
Plants   3                       3       100%

Best Image Coverage

Species with most images (across all sources):

Species               Common Name          Type    Total Images  Sources
Caridina cantonensis  Crystal Red Shrimp   Invert  13            Wikipedia (3) + GBIF (10)
Discosoma coerulea    Blue Mushroom Coral  Coral   13            Wikipedia (3) + GBIF (10)
Discosoma ferrugata   Red Mushroom Coral   Coral   13            Wikipedia (3) + GBIF (10)
Poecilia reticulata   Guppy                Fish    11            FishBase (5) + Wikipedia (3) + GBIF (3)
Ancistrus cirrhosus   Bristlenose Pleco    Fish    6             FishBase (1) + Wikipedia (3) + GBIF (0)

Automated Sync (Cron)

Overview

The external data sync system can run automatically during off-peak hours on production servers. This is the recommended approach for production deployments.

Benefits:

  • Fully automated - Set it and forget it
  • Off-peak hours - Runs when traffic is lowest (default: 3 AM)
  • Conservative - Respectful rate limits and delays
  • Monitored - Comprehensive logs with 30-day rotation
  • Safe - Idempotent, no deletions, transaction-safe

Quick Setup

On Production Server:

# SSH to production
ssh BAP

# Navigate to project
cd /opt/basny

# Pull latest code (if not already done)
git pull

# Run setup script (installs cron for 3 AM daily)
./scripts/setup-external-data-cron.sh

The setup script will:

  1. Create log directory at /var/log/mulm/
  2. Configure log rotation (30 days retention)
  3. Install cron job
  4. Test the sync with a dry-run

That's it! The sync will now run automatically every day at 3 AM.

Custom Schedule

To use a different schedule:

# Sunday at 2 AM (weekly)
./scripts/setup-external-data-cron.sh --schedule "0 2 * * 0"

# Monday at 4 AM (weekly)
./scripts/setup-external-data-cron.sh --schedule "0 4 * * 1"

# Every other day at 3 AM
./scripts/setup-external-data-cron.sh --schedule "0 3 */2 * *"

Cron expression format: MIN HOUR DAY MONTH WEEKDAY

Recommended schedules:

Frequency             Expression   Use Case
Daily at 3 AM         0 3 * * *    Active database, new species added frequently
Weekly (Sunday 2 AM)  0 2 * * 0    Stable database, moderate activity
Weekly (Monday 4 AM)  0 4 * * 1    Alternative timing
Every other day       0 3 */2 * *  Moderate frequency

What Gets Synced

By default (without --force):

  • Species with approved submissions
  • Species NOT synced within last 90 days
  • Prioritized by submission count (most popular first)
  • All three sources: Wikipedia, GBIF, FishBase

Sync behavior:

  • Only new links/images are added
  • Existing data is preserved
  • No deletions occur
  • Display order is maintained
  • Transaction-safe (all-or-nothing per species)
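
A hedged sketch of the per-species transaction, continuing the better-sqlite3 sketches from the schema section (names are illustrative):

// Illustrative: wrapping one species' updates in a transaction makes the sync
// all-or-nothing, so an interruption never leaves a species half-updated.
const syncOneSpecies = db.transaction((groupId: number, links: string[]) => {
  const insertRef = db.prepare(
    `INSERT OR IGNORE INTO species_external_references (group_id, reference_url, display_order)
     VALUES (?, ?, 0)`
  );
  for (const url of links) insertRef.run(groupId, url);
  db.prepare(
    `UPDATE species_name_group SET last_external_sync = datetime('now') WHERE group_id = ?`
  ).run(groupId);
});
// Any exception inside rolls back everything for that species.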

Expected Duration

Typical sync times for daily runs:

New/Updated Species  Duration
0-5 species          1-3 minutes
5-10 species         2-5 minutes
10-20 species        5-10 minutes
20-50 species        10-20 minutes

The 3 AM schedule ensures plenty of time before users wake up.

Monitoring

View Logs:

# Real-time (watch live)
tail -f /var/log/mulm/external-data-sync.log

# Last sync (most recent 100 lines)
tail -100 /var/log/mulm/external-data-sync.log

# Search for errors
grep "" /var/log/mulm/external-data-sync.log

# View specific date (compressed)
zcat /var/log/mulm/external-data-sync.log-20251119.gz | less

Check Cron Status:

# View installed cron jobs
crontab -l

# Check cron execution logs
grep CRON /var/log/syslog | tail -20

# Verify cron service is running
sudo systemctl status cron

Log Format:

Each sync produces output like:

================================================================================
🌐 External Data Sync Orchestrator
================================================================================
Mode: 🔴 EXECUTE
Started: 2025-11-19T03:00:01.234Z
================================================================================

[Processing Wikipedia/Wikidata...]
[1/10] Processing Poecilia reticulata (Fish)... ✅ 2 links, 3 images
...

=== Summary ===
Total processed: 10
  ✅ Success: 8
  ❌ Not found: 2
Total new links: 15
Total new images: 20

✅ Sync completed!

Rate Limiting

The automated sync uses conservative rate limiting to be respectful to API providers:

Source              Delay Between Requests  Additional Delays
Wikipedia/Wikidata  100ms                   30s before next source
GBIF                120ms                   30s before next source
FishBase            None (local data)       30s before next source

Total sync time includes:

  • Request delays within each source
  • 30 second pauses between sources
  • Automatic retry with exponential backoff on errors
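
A hedged sketch of the pacing logic described above: a fixed delay between requests plus retry with exponential backoff on failures. The delay values mirror the table above; the helper names are not from the real codebase:

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Illustrative: retry a request with exponential backoff (1s, 2s, 4s, ...).
async function fetchWithBackoff(url: string, attempts = 3): Promise<Response> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.ok) return res;
      throw new Error(`HTTP ${res.status}`);
    } catch (err) {
      if (i === attempts - 1) throw err;
      await sleep(1000 * 2 ** i);
    }
  }
  throw new Error("unreachable");
}

// Between species lookups: await sleep(100) for Wikipedia, await sleep(120) for GBIF.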

Email Notifications (Optional)

To receive email on sync completion or errors:

# Edit crontab
crontab -e

# Add MAILTO at the top
MAILTO=you@example.com

# The cron job will email stdout/stderr
0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute

Disabling or Modifying

Temporarily Disable:

crontab -e
# Add # at start of line:
# 0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute

Change Schedule:

# Re-run setup with new schedule
./scripts/setup-external-data-cron.sh --schedule "0 2 * * 0"

Permanently Remove:

crontab -e
# Delete the line containing 'sync-all-external-data.ts'

Performance Tuning

If syncs are taking too long or using too many resources:

1. Limit species per run:

Edit crontab to add --limit argument:

crontab -e

# Change to:
0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute --limit=20

2. Run less frequently:

# Weekly instead of daily
./scripts/setup-external-data-cron.sh --schedule "0 3 * * 0"

3. Skip sources:

# Skip FishBase if it's slow
crontab -e

# Add --skip-fishbase:
0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute --skip-fishbase

4. Stagger sources:

Run different sources on different days:

crontab -e

# Wikipedia on Mondays
0 3 * * 1 cd /opt/basny && npm run script scripts/sync-wikipedia-external-data.ts -- --execute

# GBIF on Wednesdays
0 3 * * 3 cd /opt/basny && npm run script scripts/sync-gbif-external-data.ts -- --execute

# FishBase on Fridays
0 3 * * 5 cd /opt/basny && npm run script scripts/sync-fishbase-external-data-duckdb.ts -- --execute

Troubleshooting Cron

Cron not running:

# Check if cron service is active
sudo systemctl status cron

# Start if stopped
sudo systemctl start cron

# Enable on boot
sudo systemctl enable cron

Sync not executing:

Check cron logs:

grep CRON /var/log/syslog | grep sync-all-external-data

Common issues:

  • PATH not set in the cron environment (the script uses absolute paths, so this is rarely the cause)
  • Permissions (run setup script as deployment user)
  • Database locked (ensure no other writes during sync)

High resource usage:

Monitor during sync:

# Watch CPU/memory
top

# Watch process
ps aux | grep ts-node

Solutions:

  • Reduce batch size with --limit
  • Run during lower-traffic hours
  • Increase delays in integration clients

Complete Documentation

For a comprehensive guide including troubleshooting, performance tips, and advanced configurations, see:

docs/EXTERNAL_DATA_CRON_SETUP.md


Troubleshooting

Species Not Found

Symptoms: Sync reports "Not found" for a species

Common Causes:

  1. Capitalization Issues

    • Database: "Danio Kerri"
    • Should be: "Danio kerri"
    • Fix: Update canonical names to use proper capitalization
  2. Typos

    • Database: "Poecillia Reticulata"
    • Should be: "Poecilia reticulata"
    • Fix: Correct scientific names
  3. Generic/Unidentified Species

    • Example: "Ancistrus sp.", "Orange Leptostrea"
    • These names aren't found in external databases (they're placeholders)
    • Fix: Identify to species level if possible
  4. Common Names Used

    • Database stores common name instead of scientific name
    • Example: "Green rhodactis" instead of genus/species
    • Fix: Add proper scientific names

Low Confidence Matches (GBIF)

Symptoms: GBIF reports low confidence (<80%) and skips species

Solutions:

  • Check if scientific name is correct
  • Verify species exists in GBIF database
  • Try searching GBIF web interface manually: https://www.gbif.org/

SSL Certificate Errors

Symptoms: Connection errors when syncing

FishBase API: Known issue - use local DuckDB sync instead
Wikipedia/GBIF: Should not occur (stable APIs)

If you encounter SSL errors:

# Test API connectivity
curl -I https://api.gbif.org/v1/species/match?name=Poecilia%20reticulata
curl -I https://query.wikidata.org/sparql

No Species Found to Sync

Symptoms: "Found 0 species to process"

Cause: All species synced within last 90 days

Solutions:

# Force re-sync
npm run script scripts/sync-wikipedia-external-data.ts -- --execute --force

# Or sync specific species
npm run script scripts/sync-wikipedia-external-data.ts -- --execute --species-id=61

Rate Limiting

Symptoms: Errors or timeouts from APIs

Current Settings:

  • Wikipedia: 100ms between requests
  • GBIF: 120ms between requests
  • FishBase: N/A (local)

If rate limited:

  • Increase delay in integration clients
  • Run syncs during off-peak hours
  • Sync in smaller batches using --limit

Future Enhancements

Planned Data Sources

  1. SeaLifeBase - FishBase's sister project for marine species

    • Better coverage for corals and marine inverts
    • Can reuse FishBase code structure
  2. iNaturalist - Community observations

    • Real-world photos from hobbyists
    • Geographic distribution data
    • Good for popular aquarium species
  3. WoRMS - World Register of Marine Species

    • Authoritative taxonomy for marine species
    • Essential for coral identification
  4. Tropicos / World Flora Online - Plant databases

    • Better coverage for aquatic plants
    • Taxonomic authority for HAP species

Planned Features

  1. Image Downloading - COMPLETED (Nov 2025)

    • ✅ One-step sync with --download-images flag
    • ✅ Automatic download to Cloudflare R2
    • ✅ Transcoding to 800×600 JPEG (85% quality)
    • ✅ MD5-based deduplication prevents re-downloads
    • ✅ Full metadata tracking (source, attribution, license, original_url)
    • ✅ Graceful error handling with fallback to external URLs
  2. Admin UI Enhancements

    • Display external links on species detail pages
    • Image galleries
    • Occurrence maps
    • Manual sync triggers from admin panel
  3. Automated Sync Scheduling - COMPLETED

    • ✅ Cron job setup script
    • ✅ Configurable schedules (daily/weekly/custom)
    • ✅ Comprehensive logging with rotation
    • ⏳ Email reports (manual setup available)
  4. Data Quality Improvements

    • Automatic capitalization fixes
    • Scientific name validation
    • Duplicate detection
    • Synonym matching

Last Updated: November 19, 2025
Integration Version: 1.0
Maintainer: Development Team
