# Data Loading Guide
This guide explains how to load initial data into the DTO Viewer database, including species thermal ranges, MPA polygons, and GLORYS12V1 temperature timeseries data. If you haven't already configured your Docker/PyCharm setup, please see the PyCharm Setup guide.
## Overview

The data loading process consists of three main steps that must be executed in order:

1. Load species thermal range data
2. Load MPA polygon boundaries
3. Load GLORYS temperature timeseries for each polygon
## Data Directory Structure

```
scripts/
├── data/
│   ├── GLORYS/               # Temperature timeseries CSV files
│   │   ├── [bottom average files]
│   │   └── [depth-specific files]
│   ├── MPA_Polygons/         # Shapefiles for MPA boundaries
│   │   ├── *.shp
│   │   ├── *.shx
│   │   ├── *.dbf
│   │   └── *.prj
│   └── species_range.csv     # Species thermal tolerance data
├── setup_init.py             # Main initialization script
├── load_species.py           # Species data loader
├── load_polygons.py          # MPA polygon loader
├── load_timeseries.py        # Temperature data loader
└── GLORYS_pipeline.R         # R script to generate GLORYS CSV files
```
## Prerequisites

Before loading data:

- Ensure your Docker containers are running
- Verify database connectivity (see the snippet below)
- Check that all data files are present in `scripts/data/`
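One quick connectivity check, assuming a standard Django setup, is to ask Django to open a connection from the shell; this snippet is a generic sketch, not part of the project scripts:

```python
# In the Django shell (python manage.py shell)
from django.db import connection

# Raises OperationalError if the database is unreachable
connection.ensure_connection()
print("Database connection OK")
```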
## Quick Start

The easiest way to load all data is with the main setup script: start the Docker containers (see Run Configuration), then access the Django shell through the PyCharm Pro terminal.

1. If this is the first time running the DTO Viewer application, run migrations; otherwise skip to the next step:

   ```
   python manage.py migrate
   ```

2. Start the Django shell:

   ```
   python manage.py shell
   ```

3. Import the setup_init module:

   ```python
   from scripts import setup_init
   ```

4. Run the setup function:

   ```python
   setup_init.setup()
   ```

This will run all three loading scripts in the correct order.
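If your Django version supports the shell command's `-c`/`--command` flag, the same sequence can be run non-interactively in one line (a sketch, assuming you are in the project root inside the container):

```
python manage.py shell -c "from scripts import setup_init; setup_init.setup()"
```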
## Step-by-Step Data Loading

### 1. Loading Species Data

The species thermal range data comes from Lewis et al. (2023) and defines temperature and depth ranges for various marine species.

```
# Enter Django shell
python manage.py shell

# Load species data
from scripts import load_species
load_species.load_species()
```

Data source: `scripts/data/species_range.csv`

What it does:

- Parses species thermal tolerance ranges
- Creates species records with temperature min/max
- Stores depth range preferences
- Links to the published research paper
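For orientation, a loader along these lines maps each CSV row onto a species record. This is a minimal sketch: the `Species` model name comes from the verification section below, but every column and field name here is an illustrative assumption, not the project's actual schema.

```python
import csv

from core.models import Species

# Sketch of a CSV-driven species loader; column/field names are assumptions
with open("scripts/data/species_range.csv", newline="") as f:
    for row in csv.DictReader(f):
        Species.objects.create(
            name=row["species"],                # hypothetical column
            temp_min=float(row["temp_min"]),    # hypothetical column
            temp_max=float(row["temp_max"]),    # hypothetical column
            depth_min=float(row["depth_min"]),  # hypothetical column
            depth_max=float(row["depth_max"]),  # hypothetical column
        )
```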
### 2. Loading MPA Polygons

Load the Marine Protected Area boundaries from government shapefiles.

```
# Enter Django shell
python manage.py shell

# Load MPA polygons
from scripts import load_polygons
load_polygons.load_mpas()
```

Data source: `scripts/data/MPA_Polygons/` directory

What it does:

- Reads shapefile geometry data
- Creates MPA boundary polygons in PostGIS
- Stores MPA metadata (name, designation, area)
- Prepares spatial indices for efficient queries

Note: This step must complete before loading timeseries data, as temperature data is linked to specific polygons.
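GeoDjango's `LayerMapping` utility is the standard pattern for this kind of shapefile import. The sketch below shows the general shape of such a loader; the `MPA` model name comes from the verification section, while the field names, the mapping dictionary, and the shapefile name are assumptions, not the project's actual code.

```python
from django.contrib.gis.utils import LayerMapping

from core.models import MPA

# Model field -> shapefile attribute mapping; both sides are assumptions
mpa_mapping = {
    "name": "NAME",         # hypothetical shapefile attribute
    "geometry": "POLYGON",  # hypothetical geometry field and OGR type
}

# transform=True reprojects the shapefile CRS to the model field's SRID
lm = LayerMapping(MPA, "scripts/data/MPA_Polygons/example.shp", mpa_mapping,
                  transform=True)
lm.save(strict=True, verbose=True)
```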
### 3. Loading Temperature Timeseries

Load GLORYS12V1 ocean model temperature data for each MPA.

```
# Enter Django shell
python manage.py shell

# Load temperature timeseries
from scripts import load_timeseries
load_timeseries.load_mpas()
```

Data source: `scripts/data/GLORYS/` directory

What it does:

- Loads bottom average temperature data
- Loads depth-specific temperature profiles
- Links temperature data to MPA polygons
- Creates timeseries indices for efficient retrieval

Data types available:

- Bottom average temperatures
- Temperature at specific depths
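Because each GLORYS CSV can hold many rows per MPA, a bulk insert is the natural loading pattern. The sketch below assumes hypothetical CSV columns and `TemperatureTimeseries` field names; only the model names and the `mpa` link are taken from the verification section.

```python
import csv
from datetime import date

from core.models import MPA, TemperatureTimeseries

# Sketch: bulk-load one GLORYS CSV for one MPA; columns are assumptions
mpa = MPA.objects.get(name="Example MPA")  # hypothetical MPA name
with open("scripts/data/GLORYS/example_bottom_average.csv", newline="") as f:
    rows = [
        TemperatureTimeseries(
            mpa=mpa,
            date=date.fromisoformat(row["date"]),   # hypothetical column
            temperature=float(row["temperature"]),  # hypothetical column
        )
        for row in csv.DictReader(f)
    ]
TemperatureTimeseries.objects.bulk_create(rows, batch_size=5000)
```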
## Generating GLORYS Data

If you need to regenerate or update the GLORYS CSV files:

1. Ensure you have R installed with the required packages
2. Update the GLORYS12V1 NetCDF source files
3. Run the pipeline script:

   ```
   Rscript scripts/GLORYS_pipeline.R
   ```

This will:

- Extract temperature data from GLORYS12V1 NetCDF files
- Calculate bottom averages
- Extract depth-specific profiles
- Generate CSV files in the expected format
## Verifying Data Load

After loading, verify the data:

```python
# In Django shell
from core.models import Species, MPA, TemperatureTimeseries

# Check species data
print(f"Species loaded: {Species.objects.count()}")

# Check MPA polygons
print(f"MPAs loaded: {MPA.objects.count()}")

# Check temperature records
print(f"Temperature records: {TemperatureTimeseries.objects.count()}")

# Verify data linkage
for mpa in MPA.objects.all():
    temp_count = TemperatureTimeseries.objects.filter(mpa=mpa).count()
    print(f"{mpa.name}: {temp_count} temperature records")
```
## Updating Data

### Updating Species Data

1. Update `scripts/data/species_range.csv`
2. Clear existing species data (if needed)
3. Re-run `load_species.py`
### Adding New MPAs

1. Add shapefiles to `scripts/data/MPA_Polygons/`
2. Run `load_polygons.py` (it should skip existing polygons)
3. Generate GLORYS data for the new polygons
4. Run `load_timeseries.py` for the new data
### Updating Temperature Data

1. Generate new CSV files using `GLORYS_pipeline.R`
2. Place the files in `scripts/data/GLORYS/`
3. Clear old temperature data if doing a full refresh (see the sketch below)
4. Run `load_timeseries.py`
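For a full refresh, one minimal approach, assuming the models shown in the verification section and the loader entry point used above, is to drop the old records before reloading:

```python
from core.models import TemperatureTimeseries
from scripts import load_timeseries

# Full refresh: drop all existing temperature records, then reload
TemperatureTimeseries.objects.all().delete()
load_timeseries.load_mpas()
```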
## Troubleshooting

### Common Issues

**"File not found" errors:**

- Verify all data files are in the correct directories
- Check file permissions in the Docker container
- Ensure the paths in the scripts match your directory structure

**Polygon loading fails:**

- Check shapefile integrity (all components present: .shp, .shx, .dbf, .prj)
- Verify the PostGIS extension is enabled in the database
- Check coordinate reference system compatibility

**Memory errors with large datasets:**

- Load data in batches by modifying the scripts (see the sketch below)
- Increase the Docker container's memory allocation
- Consider loading MPAs individually for large temperature datasets
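A common batching pattern, sketched here on the assumption that a loader builds unsaved model instances from CSV rows (`parse_rows` is hypothetical), is to flush with `bulk_create` at a fixed batch size so memory use stays bounded:

```python
from core.models import TemperatureTimeseries

BATCH_SIZE = 5000
batch = []
for record in parse_rows():  # hypothetical generator of unsaved instances
    batch.append(record)
    if len(batch) >= BATCH_SIZE:
        TemperatureTimeseries.objects.bulk_create(batch)
        batch.clear()
if batch:
    TemperatureTimeseries.objects.bulk_create(batch)  # flush the remainder
```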
**Duplicate data warnings:**

- The scripts should handle duplicates gracefully (see the idempotent-write sketch below)
- To force a reload, clear existing data first
- Check the unique constraints in the models
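If duplicates keep appearing, an idempotent write such as Django's `update_or_create` avoids them at the cost of an extra query per row. A minimal sketch, with hypothetical field names:

```python
from core.models import Species

# Upsert: match on name, update the remaining fields in place
Species.objects.update_or_create(
    name="Example species",                        # hypothetical lookup field
    defaults={"temp_min": 2.0, "temp_max": 14.0},  # hypothetical fields
)
```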
### Clearing Data

If you need to start fresh:

```python
# WARNING: This will delete all data!
from core.models import Species, MPA, TemperatureTimeseries

# Delete timeseries first, since temperature records are linked to MPAs
TemperatureTimeseries.objects.all().delete()
MPA.objects.all().delete()
Species.objects.all().delete()
```
## Performance Considerations

- Initial data load may take considerable time depending on:
  - Number of MPAs
  - Temporal resolution of the GLORYS data
  - Available system memory
- For production loads:
  - Run during off-peak hours
  - Monitor database performance
  - Consider loading in batches for very large datasets
## Next Steps

After successfully loading data:

- Verify data visualization in the web interface
- Test the species thermal emergence calculations
- Review data completeness for all MPAs
- Set up regular data update procedures
## References

Lewis, S.A., Stortini, C.H., Boyce, D.G., and Stanley, R.R.E. 2023. Climate change, species thermal emergence, and conservation design: a case study in the Canadian Northwest Atlantic. FACETS 8: 1–16. https://doi.org/10.1139/facets-2022-0191