External Data Sources
The BAP platform integrates with multiple authoritative biodiversity databases to automatically enrich species records with external links, images, distribution maps, and taxonomic information.
- 🌐 3 integrated data sources (Wikipedia/Wikidata, GBIF, FishBase)
- 🌍 All species types covered (Fish, Corals, Inverts, Plants)
- Overview
- Integrated Data Sources
- Database Schema
- Sync Scripts
- Usage
- Coverage Statistics
- Troubleshooting
External data sources provide:
- Credibility: Links to authoritative scientific databases
- Visual Content: High-quality specimen photographs
- Educational Value: Distribution maps, habitat info, conservation status
- Verification: Cross-reference species identifications
- Discovery: Members can learn more about species they're breeding
```
┌─────────────────────────────────────────────┐
│              Species Name Group             │
│             (Canonical Species)             │
└────────────┬────────────────────────────────┘
             │
             ├──► species_external_references
             │    (Wikipedia, Wikidata, GBIF, FishBase URLs)
             │
             └──► species_images
                  (Photos from all sources)
```
Each species group can have:
- Multiple external reference URLs (displayed as clickable links)
- Multiple images (displayed in galleries)
- Sync logs tracking when data was last updated
- Coverage: All species types (Fish, Corals, Inverts, Plants)
- API: Wikidata SPARQL + Wikipedia REST API
- Authentication: None required

Data provided:

- Wikidata Entity URL - Structured taxonomic data
  - Example: https://www.wikidata.org/wiki/Q178202
- Wikipedia Article URLs - Encyclopedia articles
  - Example: https://en.wikipedia.org/wiki/Guppy
  - Currently extracts English articles (can be extended to multiple languages)
- Images - High-quality, often CC-licensed photos
  - From Wikimedia Commons
  - Includes infobox images and gallery photos

Implementation:

- Client: src/integrations/wikipedia.ts
- Sync Script: scripts/sync-wikipedia-external-data.ts
- SPARQL Query: Matches species by scientific name (the wdt:P225 property, sketched below)
- Rate Limiting: 100ms between requests
- Match Criteria: Exact scientific name match
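For orientation, here is a minimal sketch of that Wikidata lookup, assuming Node 18+ global fetch; it is not the actual src/integrations/wikipedia.ts client, just the shape of the SPARQL call:

```typescript
// Minimal sketch: match a species by exact scientific name (wdt:P225)
// against the public Wikidata SPARQL endpoint.
async function findWikidataEntity(scientificName: string): Promise<string | null> {
  const sparql = `
    SELECT ?item WHERE {
      ?item wdt:P225 "${scientificName}" .
    } LIMIT 1`;

  const url = new URL("https://query.wikidata.org/sparql");
  url.searchParams.set("query", sparql);
  url.searchParams.set("format", "json");

  const res = await fetch(url, {
    // Wikidata asks clients to identify themselves.
    headers: { "User-Agent": "mulm-sync/1.0 (species data enrichment)" },
  });
  if (!res.ok) throw new Error(`SPARQL query failed: ${res.status}`);

  const data = await res.json();
  const binding = data.results?.bindings?.[0];
  // e.g. https://www.wikidata.org/entity/Q178202 for "Poecilia reticulata"
  return binding ? binding.item.value : null;
}
```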
- Coverage: All species types
- API: GBIF REST API v1
- Authentication: None required

Data provided:

- GBIF Species Page URL - Comprehensive species profiles
  - Example: https://www.gbif.org/species/2440951
  - Includes taxonomy, synonyms, descriptions
- Occurrence Map URLs - Geographic distribution
  - Static map images showing where a species has been observed
  - Example: https://api.gbif.org/v2/map/occurrence/density/0/0/[email protected]?taxonKey=2440951
- Specimen Images - Photos from observations worldwide
  - User-contributed photos from iNaturalist, museum collections, etc.
  - High biodiversity coverage

Implementation:

- Client: src/integrations/gbif.ts
- Sync Script: scripts/sync-gbif-external-data.ts
- Matching: Species name matching API with confidence scores
- Rate Limiting: 100ms between requests
- Match Criteria: Confidence ≥ 80%, match type not "NONE"
GBIF provides confidence scores for species matches:
- 99% - Exact match, high confidence
- 98% - Very good match
- 94-97% - Good match (genus-level or fuzzy)
- < 94% - Rejected (too uncertain)
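A minimal sketch of the name-matching flow under the criteria above, assuming Node 18+ global fetch; the real src/integrations/gbif.ts client may differ in detail:

```typescript
// Fields returned by GBIF's species match endpoint that we care about here.
interface GbifMatch {
  usageKey?: number;
  scientificName?: string;
  confidence: number;
  matchType: "EXACT" | "FUZZY" | "HIGHERRANK" | "NONE";
}

async function matchSpecies(scientificName: string): Promise<GbifMatch | null> {
  const url = new URL("https://api.gbif.org/v1/species/match");
  url.searchParams.set("name", scientificName);

  const res = await fetch(url);
  if (!res.ok) throw new Error(`GBIF match failed: ${res.status}`);
  const match = (await res.json()) as GbifMatch;

  // Reject uncertain matches, per the documented thresholds.
  if (match.matchType === "NONE" || match.confidence < 80 || !match.usageKey) {
    return null;
  }
  return match;
}

// The taxon key feeds both the species page and the occurrence map URL.
async function gbifUrlsFor(name: string) {
  const match = await matchSpecies(name);
  if (!match) return null;
  return {
    speciesPage: `https://www.gbif.org/species/${match.usageKey}`,
    occurrenceMap: `https://api.gbif.org/v2/map/occurrence/density/0/0/[email protected]?taxonKey=${match.usageKey}`,
  };
}
```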
- Coverage: Fish only
- API: Local DuckDB cache (Parquet files)
- Authentication: None required

Data provided:

- FishBase Species Page URL - Comprehensive fish-specific data
  - Example: https://www.fishbase.se/summary/3228
- Fish Images - Multiple life stages
  - Preferred image (general)
  - Male/female specimens
  - Juvenile/larvae stages
  - Eggs

Implementation:

- Client: Not API-based (uses local DuckDB cache)
- Sync Script: scripts/sync-fishbase-external-data-duckdb.ts
- Data Source: Parquet files in scripts/fishbase/cache/
- Matching: Direct SQL query on genus/species (sketched below)
- Rate Limiting: None (local data)
The FishBase rOpenSci API (https://fishbase.ropensci.org) has SSL certificate issues. We use a local DuckDB cache of FishBase data instead, which:
- ✅ Is faster (no network calls)
- ✅ Is more reliable (no SSL issues)
- ✅ Provides same data quality
- ✅ Can be updated periodically
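A sketch of that local lookup, assuming the duckdb npm package; the Parquet file name and columns used here (species.parquet, Genus, Species, SpecCode) are illustrative assumptions, so check scripts/fishbase/cache/ for the real layout:

```typescript
import duckdb from "duckdb";

// Query the cached FishBase species table directly from Parquet and return
// the FishBase SpecCode, which maps to a summary URL like
// https://www.fishbase.se/summary/3228
function findSpecCode(genus: string, species: string): Promise<number | null> {
  const db = new duckdb.Database(":memory:");
  const sql = `
    SELECT SpecCode
    FROM read_parquet('scripts/fishbase/cache/species.parquet')
    WHERE Genus = ? AND Species = ?
    LIMIT 1`;

  return new Promise((resolve, reject) => {
    db.all(sql, genus, species, (err, rows) => {
      if (err) return reject(err);
      resolve(rows.length ? Number(rows[0].SpecCode) : null);
    });
  });
}
```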
The species_external_references table stores URLs to external species pages (Wikipedia, GBIF, FishBase, etc.).
```sql
CREATE TABLE species_external_references (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    group_id INTEGER NOT NULL,
    reference_url TEXT NOT NULL,
    display_order INTEGER NOT NULL DEFAULT 0,
    FOREIGN KEY (group_id) REFERENCES species_name_group(group_id) ON DELETE CASCADE,
    UNIQUE (group_id, reference_url)
);
```

Fields:

- group_id - Links to canonical species
- reference_url - Full URL to external resource
- display_order - Determines order of display (0 = first)

Indexes:

- Primary key on id
- Unique constraint on (group_id, reference_url) prevents duplicates
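As a sketch of how that unique constraint keeps re-syncs from duplicating links, assuming the better-sqlite3 driver (the database path and helper are illustrative, not the project's actual data layer):

```typescript
import Database from "better-sqlite3";

const db = new Database("db/database.db"); // path is illustrative

// INSERT OR IGNORE relies on UNIQUE (group_id, reference_url): re-running a
// sync simply skips links that are already present.
function addReference(groupId: number, url: string, displayOrder = 0): void {
  db.prepare(
    `INSERT OR IGNORE INTO species_external_references
       (group_id, reference_url, display_order)
     VALUES (?, ?, ?)`
  ).run(groupId, url, displayOrder);
}

addReference(61, "https://en.wikipedia.org/wiki/Guppy");
```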
The species_images table stores URLs to species images from all sources.
```sql
CREATE TABLE species_images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    group_id INTEGER NOT NULL,
    image_url TEXT NOT NULL,
    display_order INTEGER NOT NULL DEFAULT 0,
    source TEXT,
    attribution TEXT,
    license TEXT,
    FOREIGN KEY (group_id) REFERENCES species_name_group(group_id) ON DELETE CASCADE,
    UNIQUE (group_id, image_url)
);
```

Fields:

- group_id - Links to canonical species
- image_url - Full URL to image (Wikimedia, GBIF, etc.)
- display_order - Display order
- source - Optional: Name of source database
- attribution - Optional: Photo credit
- license - Optional: License (e.g., "CC BY-SA 4.0")
The external_data_sync_log table tracks sync operations for debugging and monitoring.
```sql
CREATE TABLE external_data_sync_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    group_id INTEGER NOT NULL,
    source TEXT NOT NULL,      -- 'wikipedia', 'gbif', 'fishbase'
    sync_date TEXT NOT NULL,
    status TEXT NOT NULL,      -- 'success', 'not_found', 'error'
    links_added INTEGER DEFAULT 0,
    images_added INTEGER DEFAULT 0,
    error_message TEXT,
    FOREIGN KEY (group_id) REFERENCES species_name_group(group_id) ON DELETE CASCADE
);
```

Indexes:

- idx_external_sync_group on group_id
- idx_external_sync_source on source
- idx_external_sync_date on sync_date
A timestamp column on species_name_group tracks when each species was last synced:

```sql
ALTER TABLE species_name_group ADD COLUMN last_external_sync TEXT;
```

This is used to avoid re-syncing recently updated species (default: 90 days).
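A sketch of how that timestamp can drive species selection, assuming better-sqlite3; the real scripts additionally restrict to species with approved submissions and order by submission count, which involves tables not shown here:

```typescript
import Database from "better-sqlite3";

const db = new Database("db/database.db"); // path is illustrative

// Pick species that have never been synced, or whose last sync is older
// than 90 days. The approved-submission filter and popularity ordering
// described in the Sync Scripts section are layered on top of this.
const staleSpecies = db.prepare(`
  SELECT group_id
  FROM species_name_group
  WHERE last_external_sync IS NULL
     OR last_external_sync < datetime('now', '-90 days')
`).all() as Array<{ group_id: number }>;
```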
All sync scripts follow the same pattern and support the same CLI arguments.
```bash
# Dry-run (preview what would be synced, safe)
npm run script scripts/sync-<source>-external-data.ts

# Execute (actually modify database)
npm run script scripts/sync-<source>-external-data.ts -- --execute

# Limit to N species (for testing)
npm run script scripts/sync-<source>-external-data.ts -- --limit=10

# Sync specific species by ID
npm run script scripts/sync-<source>-external-data.ts -- --species-id=123

# Filter by species type
npm run script scripts/sync-<source>-external-data.ts -- --species-type=Coral

# Force re-sync (ignore last_external_sync timestamp)
npm run script scripts/sync-<source>-external-data.ts -- --force

# Custom database path
npm run script scripts/sync-<source>-external-data.ts -- --db=/path/to/db
```
- Wikipedia/Wikidata Sync: npm run script scripts/sync-wikipedia-external-data.ts -- --execute
- GBIF Sync: npm run script scripts/sync-gbif-external-data.ts -- --execute
- FishBase Sync (DuckDB-based): npm run script scripts/sync-fishbase-external-data-duckdb.ts -- --execute
Default (no --force):
- Only syncs species with approved submissions
- Skips species synced within last 90 days
- Prioritizes species by submission count (most popular first)
With --force:
- Re-syncs all species regardless of last sync date
- Adds new data without removing existing data
- Idempotent (safe to run multiple times)
What Gets Synced:
- Only new links/images are added
- Existing data is preserved
- No deletions occur
- Display order is maintained
Sync all species with external data AND download images to R2 in ONE step:
```bash
# Sync entire database (2,279 species) - downloads images to R2
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

# Run multiple times until output shows "Found 0 species"
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500
```

What this does:

- ✅ Queries Wikipedia, GBIF, and FishBase for ALL 2,279 species
- ✅ Stores external reference links (Wikipedia pages, GBIF pages, etc.)
- ✅ Downloads ALL images to Cloudflare R2 (no external URLs stored)
- ✅ Transcodes to optimized JPEGs (800×600, 85% quality)
- ✅ Tracks full metadata (source, attribution, license, original_url)
- ✅ Avoids re-downloads via MD5 hash checking (see the sketch below)
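A sketch of the per-image pipeline those last three points describe, assuming Node 18+ fetch, node:crypto, and the sharp image library; the R2 upload and hash bookkeeping are stubbed out here:

```typescript
import { createHash } from "node:crypto";
import sharp from "sharp";

// Download an image, skip it if its MD5 is already known, otherwise return
// an optimized JPEG buffer ready to upload to R2.
async function processImage(
  imageUrl: string,
  knownHashes: Set<string>
): Promise<Buffer | null> {
  const res = await fetch(imageUrl);
  if (!res.ok) throw new Error(`Download failed: ${res.status}`);
  const original = Buffer.from(await res.arrayBuffer());

  // De-duplicate on the hash of the image bytes, not the URL, so the same
  // photo served from different mirrors is still recognized as a duplicate.
  const md5 = createHash("md5").update(original).digest("hex");
  if (knownHashes.has(md5)) return null;
  knownHashes.add(md5);

  // Transcode to the optimized JPEG described above (800×600, 85% quality).
  return sharp(original)
    .resize(800, 600, { fit: "inside", withoutEnlargement: true })
    .jpeg({ quality: 85 })
    .toBuffer();
}
```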
If you need to sync specific data sources:
```bash
# Wikipedia/Wikidata only (all species types)
npm run script scripts/sync-wikipedia-all-species.ts -- --execute --download-images --batch-size=500

# GBIF only (all species types)
npm run script scripts/sync-gbif-all-species.ts -- --execute --download-images --batch-size=500

# FishBase only (fish species only)
npm run script scripts/sync-fishbase-all-species.ts -- --execute --download-images --batch-size=500
```

For production environments, set up automated daily syncs:

```bash
# On production server (one-time setup)
ssh BAP
cd /opt/basny
./scripts/setup-external-data-cron.sh
```

This installs a cron job that runs daily at 3 AM. See the Automated Sync section below for details.
Examples:

```bash
# Sync entire database with images - run in batches
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

# Continue until "Found 0 species"
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500

# Sync all corals only
npm run script scripts/sync-wikipedia-all-species.ts -- --execute --download-images --species-type=Coral

# Sync all plants only
npm run script scripts/sync-gbif-all-species.ts -- --execute --download-images --species-type=Plant

# Test with species ID 61 (Poecilia reticulata)
npm run script scripts/sync-wikipedia-all-species.ts -- --execute --download-images --species-id=61

# If interrupted at species ID 1234, resume from there
npm run script scripts/sync-all-species-full-database.ts -- --execute --download-images --batch-size=500 --start-after=1234
```

| Metric | Count | Notes |
|---|---|---|
| Total External Links | 138 | Wikipedia, Wikidata, GBIF, FishBase |
| Total Images | 173 | From all sources |
| Species with Data | 47 unique | Out of 53 with submissions |
| Overall Success Rate | 89% | At least one source matched |
| Source | Species Synced | Links Added | Images Added | Success Rate |
|---|---|---|---|---|
| Wikipedia/Wikidata | 36 | 70 | 92 | 68% |
| GBIF | 41 | 41 | 47 | 77% |
| FishBase | 27 | 27 | ~48 | 75% |
| Type | Total with Submissions | Synced | Coverage |
|---|---|---|---|
| Fish | 36 | 34 | 94% |
| Corals | 11 | 7 | 64% |
| Inverts | 3 | 3 | 100% |
| Plants | 3 | 3 | 100% |
Species with most images (across all sources):
| Species | Common Name | Type | Total Images | Sources |
|---|---|---|---|---|
| Caridina cantonensis | Crystal Red Shrimp | Invert | 13 | Wikipedia (3) + GBIF (10) |
| Discosoma coerulea | Blue Mushroom Coral | Coral | 13 | Wikipedia (3) + GBIF (10) |
| Discosoma ferrugata | Red Mushroom Coral | Coral | 13 | Wikipedia (3) + GBIF (10) |
| Poecilia reticulata | Guppy | Fish | 11 | FishBase (5) + Wikipedia (3) + GBIF (3) |
| Ancistrus cirrhosus | Bristlenose Pleco | Fish | 6 | FishBase (1) + Wikipedia (3) + GBIF (0) |
The external data sync system can run automatically during off-peak hours on production servers. This is the recommended approach for production deployments.
Benefits:
- ✅ Fully automated - Set it and forget it
- ✅ Off-peak hours - Runs when traffic is lowest (default: 3 AM)
- ✅ Conservative - Respectful rate limits and delays
- ✅ Monitored - Comprehensive logs with 30-day rotation
- ✅ Safe - Idempotent, no deletions, transaction-safe
On Production Server:
```bash
# SSH to production
ssh BAP

# Navigate to project
cd /opt/basny

# Pull latest code (if not already done)
git pull

# Run setup script (installs cron for 3 AM daily)
./scripts/setup-external-data-cron.sh
```

The setup script will:

- Create the log directory at /var/log/mulm/
- Configure log rotation (30 days retention)
- Install the cron job
- Test the sync with a dry-run
That's it! The sync will now run automatically every day at 3 AM.
To use a different schedule:
```bash
# Sunday at 2 AM (weekly)
./scripts/setup-external-data-cron.sh --schedule "0 2 * * 0"

# Monday at 4 AM (weekly)
./scripts/setup-external-data-cron.sh --schedule "0 4 * * 1"

# Every other day at 3 AM
./scripts/setup-external-data-cron.sh --schedule "0 3 */2 * *"
```

Cron expression format: MIN HOUR DAY MONTH WEEKDAY

Recommended schedules:

| Frequency | Expression | Use Case |
|---|---|---|
| Daily at 3 AM | 0 3 * * * | Active database, new species added frequently |
| Weekly (Sunday 2 AM) | 0 2 * * 0 | Stable database, moderate activity |
| Weekly (Monday 4 AM) | 0 4 * * 1 | Alternative timing |
| Every other day | 0 3 */2 * * | Moderate frequency |
By default (without --force):
- Species with approved submissions
- Species NOT synced within last 90 days
- Prioritized by submission count (most popular first)
- All three sources: Wikipedia, GBIF, FishBase
Sync behavior:
- Only new links/images are added
- Existing data is preserved
- No deletions occur
- Display order is maintained
- Transaction-safe (all-or-nothing per species; sketched below)
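A sketch of what all-or-nothing per species can look like, assuming the better-sqlite3 driver; this is illustrative, not necessarily the project's actual data layer:

```typescript
import Database from "better-sqlite3";

const db = new Database("db/database.db"); // path is illustrative

// Wrap all writes for one species in a single transaction: if any insert
// throws, the links, images, and log row are rolled back together.
const recordSync = db.transaction(
  (groupId: number, source: string, links: string[], images: string[]) => {
    const addLink = db.prepare(
      `INSERT OR IGNORE INTO species_external_references
         (group_id, reference_url, display_order) VALUES (?, ?, ?)`
    );
    const addImage = db.prepare(
      `INSERT OR IGNORE INTO species_images
         (group_id, image_url, display_order, source) VALUES (?, ?, ?, ?)`
    );

    let linksAdded = 0;
    let imagesAdded = 0;
    links.forEach((url, i) => { linksAdded += addLink.run(groupId, url, i).changes; });
    images.forEach((url, i) => { imagesAdded += addImage.run(groupId, url, i, source).changes; });

    // Record the outcome so the sync log reflects what actually changed.
    db.prepare(
      `INSERT INTO external_data_sync_log
         (group_id, source, sync_date, status, links_added, images_added)
       VALUES (?, ?, datetime('now'), 'success', ?, ?)`
    ).run(groupId, source, linksAdded, imagesAdded);
  }
);

recordSync(61, "gbif", ["https://www.gbif.org/species/2440951"], []);
```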
Typical sync times for daily runs:
| New/Updated Species | Duration |
|---|---|
| 0-5 species | 1-3 minutes |
| 5-10 species | 2-5 minutes |
| 10-20 species | 5-10 minutes |
| 20-50 species | 10-20 minutes |
The 3 AM schedule ensures plenty of time before users wake up.
View Logs:
```bash
# Real-time (watch live)
tail -f /var/log/mulm/external-data-sync.log

# Last sync (most recent 100 lines)
tail -100 /var/log/mulm/external-data-sync.log

# Search for errors
grep "❌" /var/log/mulm/external-data-sync.log

# View specific date (compressed)
zcat /var/log/mulm/external-data-sync.log-20251119.gz | less
```

Check Cron Status:

```bash
# View installed cron jobs
crontab -l

# Check cron execution logs
grep CRON /var/log/syslog | tail -20

# Verify cron service is running
sudo systemctl status cron
```

Log Format:
Each sync produces output like:
```
================================================================================
🌐 External Data Sync Orchestrator
================================================================================
Mode: 🔴 EXECUTE
Started: 2025-11-19T03:00:01.234Z
================================================================================

[Processing Wikipedia/Wikidata...]
[1/10] Processing Poecilia reticulata (Fish)... ✅ 2 links, 3 images
...

=== Summary ===
Total processed: 10
✅ Success: 8
❌ Not found: 2
Total new links: 15
Total new images: 20

✅ Sync completed!
```
The automated sync uses conservative rate limiting to be respectful to API providers:
| Source | Delay Between Requests | Additional Delays |
|---|---|---|
| Wikipedia/Wikidata | 100ms | 30s before next source |
| GBIF | 120ms | 30s before next source |
| FishBase | None (local data) | 30s before next source |
Total sync includes:
- Request delays within each source
- 30 second pauses between sources
- Automatic retry with exponential backoff on errors (sketched below)
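A minimal sketch of that pacing in TypeScript; the delays match the documented defaults, while the retry count is an assumption:

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry a request with exponential backoff: 1s, 2s, 4s, ... before giving up.
async function withBackoff<T>(fn: () => Promise<T>, retries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      await sleep(1000 * 2 ** attempt);
    }
  }
}

// Process items with a fixed delay between requests,
// e.g. 100ms for Wikipedia, 120ms for GBIF.
async function syncAll<T>(
  items: T[],
  syncOne: (item: T) => Promise<void>,
  delayMs: number
): Promise<void> {
  for (const item of items) {
    await withBackoff(() => syncOne(item));
    await sleep(delayMs);
  }
}
```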
To receive email on sync completion or errors:
```bash
# Edit crontab
crontab -e

# Add MAILTO at the top
[email protected]

# The cron job will email stdout/stderr
0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute
```

Temporarily Disable:
```bash
crontab -e
# Add # at start of line:
# 0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute
```

Change Schedule:

```bash
# Re-run setup with new schedule
./scripts/setup-external-data-cron.sh --schedule "0 2 * * 0"
```

Permanently Remove:

```bash
crontab -e
# Delete the line containing 'sync-all-external-data.ts'
```

If syncs are taking too long or using too many resources:
1. Limit species per run:

   Edit the crontab to add a --limit argument:

   ```bash
   crontab -e
   # Change to:
   0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute --limit=20
   ```

2. Run less frequently:

   ```bash
   # Weekly instead of daily
   ./scripts/setup-external-data-cron.sh --schedule "0 3 * * 0"
   ```

3. Skip sources:

   ```bash
   # Skip FishBase if it's slow
   crontab -e
   # Add --skip-fishbase:
   0 3 * * * cd /opt/basny && npm run script scripts/sync-all-external-data.ts -- --execute --skip-fishbase
   ```

4. Stagger sources:

   Run different sources on different days:

   ```bash
   crontab -e
   # Wikipedia on Mondays
   0 3 * * 1 cd /opt/basny && npm run script scripts/sync-wikipedia-external-data.ts -- --execute
   # GBIF on Wednesdays
   0 3 * * 3 cd /opt/basny && npm run script scripts/sync-gbif-external-data.ts -- --execute
   # FishBase on Fridays
   0 3 * * 5 cd /opt/basny && npm run script scripts/sync-fishbase-external-data-duckdb.ts -- --execute
   ```

Cron not running:
```bash
# Check if cron service is active
sudo systemctl status cron

# Start if stopped
sudo systemctl start cron

# Enable on boot
sudo systemctl enable cron
```

Sync not executing:

Check cron logs:

```bash
grep CRON /var/log/syslog | grep sync-all-external-data
```

Common issues:

- PATH not set (script uses absolute paths, should work)
- Permissions (run setup script as deployment user)
- Database locked (ensure no other writes during sync)

High resource usage:

Monitor during sync:

```bash
# Watch CPU/memory
top

# Watch process
ps aux | grep ts-node
```

Solutions:

- Reduce batch size with --limit
- Run during lower-traffic hours
- Increase delays in integration clients
For a comprehensive guide including troubleshooting, performance tips, and advanced configuration, see:
docs/EXTERNAL_DATA_CRON_SETUP.md
Symptoms: Sync reports "Not found" for a species
Common Causes:
- Capitalization Issues (see the normalization sketch after this list)
  - Database: "Danio Kerri"
  - Should be: "Danio kerri"
  - Fix: Update canonical names to use proper capitalization
- Typos
  - Database: "Poecillia Reticulata"
  - Should be: "Poecilia reticulata"
  - Fix: Correct scientific names
- Generic/Unidentified Species
  - Example: "Ancistrus sp.", "Orange Leptostrea"
  - These names don't exist in external databases (they're placeholders)
  - Fix: Identify to species level if possible
- Common Names Used
  - Database stores a common name instead of the scientific name
  - Example: "Green rhodactis" instead of genus/species
  - Fix: Add proper scientific names
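A sketch of the kind of normalizer that could back an automatic capitalization fix (listed under Data Quality Improvements below); it is illustrative and not part of the current sync scripts:

```typescript
// Normalize a binomial name: genus capitalized, remaining epithets lower-cased.
function normalizeScientificName(name: string): string {
  const parts = name.trim().split(/\s+/);
  if (parts.length === 0) return name;
  const [genus, ...rest] = parts;
  return [
    genus.charAt(0).toUpperCase() + genus.slice(1).toLowerCase(),
    ...rest.map((p) => p.toLowerCase()),
  ].join(" ");
}

// normalizeScientificName("Danio Kerri")         -> "Danio kerri"
// normalizeScientificName("POECILIA RETICULATA") -> "Poecilia reticulata"
```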
Symptoms: GBIF reports low confidence (<80%) and skips species
Solutions:
- Check if scientific name is correct
- Verify species exists in GBIF database
- Try searching GBIF web interface manually: https://www.gbif.org/
Symptoms: Connection errors when syncing

- FishBase API: Known issue - use the local DuckDB sync instead
- Wikipedia/GBIF: Should not occur (stable APIs)

If you encounter SSL errors:

```bash
# Test API connectivity
curl -I https://api.gbif.org/v1/species/match?name=Poecilia%20reticulata
curl -I https://query.wikidata.org/sparql
```

Symptoms: "Found 0 species to process"
Cause: All species synced within last 90 days
Solutions:
```bash
# Force re-sync
npm run script scripts/sync-wikipedia-external-data.ts -- --execute --force

# Or sync specific species
npm run script scripts/sync-wikipedia-external-data.ts -- --execute --species-id=61
```

Symptoms: Errors or timeouts from APIs
Current Settings:
- Wikipedia: 100ms between requests
- GBIF: 120ms between requests
- FishBase: N/A (local)
If rate limited:
- Increase delay in integration clients
- Run syncs during off-peak hours
- Sync in smaller batches using --limit
Potential future data sources:

- SeaLifeBase - FishBase's sister project for marine species
  - Better coverage for corals and marine inverts
  - Can reuse FishBase code structure
- iNaturalist - Community observations
  - Real-world photos from hobbyists
  - Geographic distribution data
  - Good for popular aquarium species
- WoRMS - World Register of Marine Species
  - Authoritative taxonomy for marine species
  - Essential for coral identification
- Tropicos / World Flora Online - Plant databases
  - Better coverage for aquatic plants
  - Taxonomic authority for HAP species
Planned features:

- Image Downloading ✅ COMPLETED (Nov 2025)
  - ✅ One-step sync with the --download-images flag
  - ✅ Automatic download to Cloudflare R2
  - ✅ Transcoding to 800×600 JPEG (85% quality)
  - ✅ MD5-based deduplication prevents re-downloads
  - ✅ Full metadata tracking (source, attribution, license, original_url)
  - ✅ Graceful error handling with fallback to external URLs
- Admin UI Enhancements
  - Display external links on species detail pages
  - Image galleries
  - Occurrence maps
  - Manual sync triggers from admin panel
- Automated Sync Scheduling ✅ COMPLETED
  - ✅ Cron job setup script
  - ✅ Configurable schedules (daily/weekly/custom)
  - ✅ Comprehensive logging with rotation
  - ⏳ Email reports (manual setup available)
- Data Quality Improvements
  - Automatic capitalization fixes
  - Scientific name validation
  - Duplicate detection
  - Synonym matching
- Database Schema - Complete database documentation
- IUCN Red List Integration - Conservation status integration
- Species MCP Server Usage - MCP tools for species management
- Admin Species Management - Admin guide for species
Last Updated: November 19, 2025
Integration Version: 1.0
Maintainer: Development Team