Data Loading - dfo-mar-odis/dto GitHub Wiki

Data Loading Guide

This guide explains how to load initial data into the DTO Viewer database, including species thermal ranges, MPA polygons, and GLORYS12V1 temperature timeseries data. If you haven't already configured your Docker/PyCharm setup, please see the PyCharm Setup guide.

Overview

The data loading process consists of three main steps that must be executed in order:

  1. Load species thermal range data
  2. Load MPA polygon boundaries
  3. Load GLORYS temperature timeseries for each polygon

Data Directory Structure

scripts/
├── data/
│   ├── GLORYS/              # Temperature timeseries CSV files
│   │   ├── [bottom average files]
│   │   └── [depth-specific files]
│   ├── MPA_Polygons/        # Shapefiles for MPA boundaries
│   │   ├── *.shp
│   │   ├── *.shx
│   │   ├── *.dbf
│   │   └── *.prj
│   └── species_range.csv    # Species thermal tolerance data
├── setup_init.py            # Main initialization script
├── load_species.py          # Species data loader
├── load_polygons.py         # MPA polygon loader
├── load_timeseries.py       # Temperature data loader
└── GLORYS_pipeline.R        # R script to generate GLORYS CSV files

Prerequisites

Before loading data:

  1. Ensure your Docker containers are running
  2. Verify database connectivity
  3. Check that all data files are present in scripts/data/
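The third check can be scripted before you open the Django shell. A minimal sketch, using the directory layout shown above (the helper name `check_data_files` is ours, not part of the project):

```python
from pathlib import Path

def check_data_files(base="scripts/data"):
    """Return the expected data paths that are missing under `base`."""
    base = Path(base)
    expected = [
        base / "species_range.csv",   # species thermal tolerance data
        base / "MPA_Polygons",        # shapefile directory
        base / "GLORYS",              # temperature timeseries CSVs
    ]
    return [str(p) for p in expected if not p.exists()]

missing = check_data_files()
if missing:
    print("Missing before load:", ", ".join(missing))
```

If anything is reported missing, fix the data directory before running the loaders.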

Quick Start

The easiest way to load all data is with the main setup script. Start the Docker containers (see Run Configuration), then access the Django shell through the PyCharm Professional terminal.

  1. If this is the first time running the DTO Viewer application, run migrations; otherwise skip to the next step:
python manage.py migrate
  2. Start the Django shell:
python manage.py shell
  3. Import the setup_init module:
from scripts import setup_init
  4. Run the setup function:
setup_init.setup()

This will run all three loading scripts in the correct order.

Step-by-Step Data Loading

1. Loading Species Data

The species thermal range data comes from Lewis et al. (2023) and defines temperature and depth ranges for various marine species.

# Enter Django shell
python manage.py shell
# Load species data
from scripts import load_species
load_species.load_species()

Data source: scripts/data/species_range.csv

What it does:

  • Parses species thermal tolerance ranges
  • Creates species records with temperature min/max
  • Stores depth range preferences
  • Links to the published research paper
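The parsing step can be pictured like this. A sketch only — the column names and the sample row are illustrative assumptions; check scripts/data/species_range.csv for the real headers:

```python
import csv
import io

def parse_species_rows(csv_text):
    """Parse species thermal-range rows into plain dicts.

    Column names here are assumptions for illustration; the real
    headers live in scripts/data/species_range.csv.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "name": row["species"],
            "temp_min": float(row["temp_min"]),
            "temp_max": float(row["temp_max"]),
            "depth_min": float(row["depth_min"]),
            "depth_max": float(row["depth_max"]),
        })
    return rows

sample = "species,temp_min,temp_max,depth_min,depth_max\nAtlantic cod,2.0,8.5,10,200\n"
parsed = parse_species_rows(sample)
```

The actual loader writes these values into Django model records rather than plain dicts.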

2. Loading MPA Polygons

Load the Marine Protected Area boundaries from government shapefiles.

# Enter Django shell
python manage.py shell
# Load MPA polygons
from scripts import load_polygons
load_polygons.load_mpas()

Data source: scripts/data/MPA_Polygons/ directory

What it does:

  • Reads shapefile geometry data
  • Creates MPA boundary polygons in PostGIS
  • Stores MPA metadata (name, designation, area)
  • Prepares spatial indices for efficient queries

Note: This step must complete before loading timeseries data, as temperature data is linked to specific polygons.

3. Loading Temperature Timeseries

Load GLORYS12V1 ocean model temperature data for each MPA.

# Enter Django shell
python manage.py shell
# Load temperature timeseries
from scripts import load_timeseries
load_timeseries.load_mpas()

Data source: scripts/data/GLORYS/ directory

What it does:

  • Loads bottom average temperature data
  • Loads depth-specific temperature profiles
  • Links temperature data to MPA polygons
  • Creates timeseries indices for efficient retrieval

Data types available:

  • Bottom average temperatures
  • Temperature at specific depths
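Conceptually, each CSV reduces to a series of (date, temperature) pairs per polygon. A sketch under the assumption of a simple two-column layout — inspect the files in scripts/data/GLORYS/ for the actual columns:

```python
import csv
import io
from datetime import date

def parse_timeseries(csv_text):
    """Turn a date,temperature CSV into (date, float) tuples.

    Column names are illustrative assumptions about the GLORYS export.
    """
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        out.append((date.fromisoformat(row["date"]), float(row["temperature"])))
    return out

sample = "date,temperature\n1993-01-01,1.8\n1993-01-02,1.9\n"
series = parse_timeseries(sample)
```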

Generating GLORYS Data

If you need to regenerate or update the GLORYS CSV files:

  1. Ensure you have R installed with required packages
  2. Update the GLORYS12V1 NetCDF source files
  3. Run the pipeline script:
Rscript scripts/GLORYS_pipeline.R

This will:

  • Extract temperature data from GLORYS12V1 NetCDF files
  • Calculate bottom averages
  • Extract depth-specific profiles
  • Generate CSV files in the expected format

Verifying Data Load

After loading, verify the data:

# In Django shell
from core.models import Species, MPA, TemperatureTimeseries

# Check species data
print(f"Species loaded: {Species.objects.count()}")

# Check MPA polygons
print(f"MPAs loaded: {MPA.objects.count()}")

# Check temperature records
print(f"Temperature records: {TemperatureTimeseries.objects.count()}")

# Verify data linkage
for mpa in MPA.objects.all():
    temp_count = TemperatureTimeseries.objects.filter(mpa=mpa).count()
    print(f"{mpa.name}: {temp_count} temperature records")

Updating Data

Updating Species Data

  1. Update scripts/data/species_range.csv
  2. Clear existing species data (if needed)
  3. Re-run load_species.py

Adding New MPAs

  1. Add shapefiles to scripts/data/MPA_Polygons/
  2. Run load_polygons.py (it should skip existing polygons)
  3. Generate GLORYS data for new polygons
  4. Run load_timeseries.py for new data

Updating Temperature Data

  1. Generate new CSV files using GLORYS_pipeline.R
  2. Place files in scripts/data/GLORYS/
  3. Clear old temperature data if doing a full refresh
  4. Run load_timeseries.py

Troubleshooting

Common Issues

"File not found" errors:

  • Verify all data files are in the correct directories
  • Check file permissions in Docker container
  • Ensure paths in scripts match your structure

Polygon loading fails:

  • Check shapefile integrity (all components present: .shp, .shx, .dbf, .prj)
  • Verify PostGIS extension is enabled in database
  • Check coordinate reference system compatibility
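The component-file check can be automated: for every .shp, confirm its sidecar files are present. A small sketch (the helper name is ours):

```python
from pathlib import Path

REQUIRED_EXTS = (".shp", ".shx", ".dbf", ".prj")

def incomplete_shapefiles(directory):
    """Map each shapefile stem to the sidecar extensions it is missing."""
    directory = Path(directory)
    problems = {}
    for shp in directory.glob("*.shp"):
        missing = [ext for ext in REQUIRED_EXTS
                   if not shp.with_suffix(ext).exists()]
        if missing:
            problems[shp.stem] = missing
    return problems
```

Run it against scripts/data/MPA_Polygons/ before retrying the polygon load.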

Memory errors with large datasets:

  • Load data in batches by modifying scripts
  • Increase Docker container memory allocation
  • Consider loading MPAs individually for large temperature datasets

Duplicate data warnings:

  • Scripts should handle duplicates gracefully
  • To force reload, clear existing data first
  • Check unique constraints in models
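If you need to dedupe parsed rows yourself before inserting, keying on the model's unique fields is enough. A sketch with plain dicts (field names illustrative):

```python
def dedupe(rows, key_fields):
    """Keep the first row seen for each combination of key fields."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```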

Clearing Data

If you need to start fresh, delete child records first — temperature timeseries reference MPAs, so they must go before the polygons they point to:

# WARNING: This will delete all data!
from core.models import Species, MPA, TemperatureTimeseries

TemperatureTimeseries.objects.all().delete()
MPA.objects.all().delete()
Species.objects.all().delete()

Performance Considerations

  • Initial data load may take considerable time depending on:
    • Number of MPAs
    • Temporal resolution of GLORYS data
    • Available system memory
  • For production loads:
    • Run during off-peak hours
    • Monitor database performance
    • Consider loading in batches for very large datasets
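Batching can be as simple as chunking parsed rows before handing them to the ORM (Django's bulk_create also accepts a batch_size argument, but pre-chunking bounds memory on the Python side too). A sketch:

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# usage idea inside a loader:
#   for chunk in batched(rows, 5000):
#       TemperatureTimeseries.objects.bulk_create(chunk)
```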

Next Steps

After successfully loading data:

  • Verify data visualization in the web interface
  • Test species thermal emergence calculations
  • Review data completeness for all MPAs
  • Set up regular data update procedures

References

Lewis, S.A., Stortini, C.H., Boyce, D.G., and Stanley, R.R.E. 2023. Climate change, species thermal emergence, and conservation design: a case study in the Canadian Northwest Atlantic. FACETS. 8: 1-16. https://doi.org/10.1139/facets-2022-0191