Data ingestion - AtlasOfLivingAustralia/documentation Wiki

Ingest of resources

Just a Dataset

If you are working with a single dataset:

Small data resource (< 50k records)

  1. Connect to the server where your biocache tool is hosted
  2. Connect to biocache:
$ sudo biocache
  3. Run the following command line:
biocache> ingest -dr <dataresource_id>

You can also run it directly from the terminal:

$ sudo biocache ingest -dr <dataresource_id>

Important: no logs are kept unless you redirect the output to a file.
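As a minimal sketch of keeping a log, the helper below runs any command and sends both its standard output and its errors to a file. The function name `run_logged`, the id `dr123` and the log path are illustrative assumptions, not part of biocache.

```shell
# Run a command and write stdout and stderr to a log file.
# Usage: run_logged <log_file> <command> [args...]
# The exit status of the command is passed back to the caller.
run_logged() {
    log_file="$1"
    shift
    "$@" > "$log_file" 2>&1
}

# Example call (hypothetical data resource id and log path):
#   run_logged /tmp/ingest_dr123.log sudo biocache ingest -dr dr123
```

Redirecting errors as well means a failed ingest leaves its messages in the same file as the normal output.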

Large data resource (> 50k records and < 8 million)

  1. Connect to the server where your biocache tool is hosted
  2. Connect to biocache:
$ sudo biocache
  3. Run the following commands:
biocache> load <dataResource_id>
biocache> process -dr <dataResource_id>
biocache> sample -dr <dataResource_id>
biocache> index -dr <dataResource_id>

You can also run the commands above directly from the terminal by prefixing each one with:

$ sudo biocache
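The four stages can be chained so that a later stage only starts if the previous one succeeded. This is a sketch only: the function name `ingest_staged` and the `BIOCACHE` override variable are assumptions made for illustration, and the subcommands are the ones listed above.

```shell
# Command used to call biocache. Override it to rehearse the sequence,
# e.g. BIOCACHE="echo biocache" prints the commands instead of running them.
BIOCACHE="${BIOCACHE:-sudo biocache}"

# Run load, process, sample and index for one data resource, in that order.
# Stops at the first stage that fails and returns its exit status.
ingest_staged() {
    dr_id="$1"
    if [ -z "$dr_id" ]; then
        echo "usage: ingest_staged <dataResource_id>" >&2
        return 2
    fi
    # $BIOCACHE is left unquoted on purpose so "sudo biocache" splits into two words.
    $BIOCACHE load "$dr_id" &&
    $BIOCACHE process -dr "$dr_id" &&
    $BIOCACHE sample -dr "$dr_id" &&
    $BIOCACHE index -dr "$dr_id"
}
```

For example, `ingest_staged dr123 > /tmp/ingest_dr123.log 2>&1` would run the whole sequence for a hypothetical resource `dr123` and keep a log.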

Very large data resource (DwC-A size > 1 GB)

  1. Upload a reduced DwC-Archive containing only 15 occurrences, so that the dataset is created in the system.
  2. Replace the reduced archive in the /collectory/upload/ folder with the real DwC-Archive.
  3. Then run the load, process, sample and index commands:
biocache> load <dataResource_id>
biocache> process -dr <dataResource_id>
biocache> sample -dr <dataResource_id>
biocache> index -dr <dataResource_id>

Step 1 is needed because the ZipFile library used by biocache-store cannot open a file larger than 1 GB.

The server needs at least as much RAM as the size of your DwC-Archive.
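A quick way to check the RAM requirement before loading is to compare the archive size with the memory of the machine. The sketch below assumes a Linux server (it reads `/proc/meminfo`) and compares against total memory, not free memory. The function names are illustrative.

```shell
# Size of a file in kB, rounded up.
file_size_kb() {
    bytes=$(wc -c < "$1")
    echo $(( (bytes + 1023) / 1024 ))
}

# Total physical memory in kB (Linux only: read from /proc/meminfo).
mem_total_kb() {
    awk '/^MemTotal:/ {print $2}' /proc/meminfo
}

# Succeeds when the archive is no larger than the available RAM figure.
# Usage: archive_fits_in_ram <archive> [ram_kb]
# ram_kb defaults to the total memory of this machine.
archive_fits_in_ram() {
    [ "$(file_size_kb "$1")" -le "${2:-$(mem_total_kb)}" ]
}
```

For example, `archive_fits_in_ram /collectory/upload/archive.zip || echo "not enough RAM"`, where the archive file name is hypothetical.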

All of your resources

You can run a single command (as a sudo user):

$ nohup biocache ingest -all > /tmp/load.log &

Alternatively, you can run three separate commands directly from the terminal (as a sudo user):

$ biocache process-local-node
$ biocache sample-local-node
$ biocache index-local-node
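Since these runs can take a long time, it can help to run the three stages from one function that writes everything to a single log and stops at the first failure. This is a sketch under assumptions: the function name, the `BIOCACHE` override variable and the log markers are illustrative, and only the three subcommands come from this page.

```shell
# Command used to call biocache. Override it for a rehearsal,
# e.g. BIOCACHE="echo biocache".
BIOCACHE="${BIOCACHE:-biocache}"

# Run process, sample and index on the local node, appending to one log file.
# Usage: run_local_node_stages <log_file>
run_local_node_stages() {
    log_file="$1"
    for stage in process-local-node sample-local-node index-local-node; do
        echo "== $stage started: $(date)" >> "$log_file"
        # Stop at the first failing stage and record which one it was.
        $BIOCACHE "$stage" >> "$log_file" 2>&1 || {
            echo "== $stage FAILED" >> "$log_file"
            return 1
        }
    done
    echo "== all stages finished: $(date)" >> "$log_file"
}
```

A call such as `run_local_node_stages /tmp/local_node.log` (hypothetical log path) would then be made as the sudo user.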

In practice, the ALA team loads datasets during the week, and uses Jenkins jobs for offline indexing that run processing, sampling and indexing of everything twice a week, especially for large datasets (> 100k records).

In older versions of biocache:

$ biocache bulk-processor load -t 7 > /data/output_load.log
$ biocache bulk-processor process -t 6 > /data/output_process.log
$ biocache bulk-processor index -ps 1000 -t 8 > /data/output_index.log

The -t option sets the number of CPUs to use for the process.

The -ps option sets the number of occurrences per page in SOLR.
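One possible way to choose a value for -t is to derive it from the number of CPUs on the machine. The sketch below leaves one CPU free for the rest of the system; that rule is an illustrative convention, not an ALA recommendation, and the function names are assumptions.

```shell
# Number of CPUs on this machine.
# nproc comes from GNU coreutils; getconf is the fallback.
cpu_count() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

# Thread count for a given number of CPUs: all but one, never less than 1.
threads_for() {
    cpus="$1"
    if [ "$cpus" -gt 1 ]; then
        echo $((cpus - 1))
    else
        echo 1
    fi
}

# Example call:
#   biocache bulk-processor process -t "$(threads_for "$(cpu_count)")" > /data/output_process.log
```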

Usage and good practices for Biocache

You can run these tasks via Jenkins, which lets you store task logs and share tasks with your team.

@Todo: Tip: You don't need to enter the biocache environment to execute a biocache command line (flag by institution/ALA production)

Commands for the spatial module

@Todo: Instructions to add

Checks after ingestion

Some manual checks can be performed after ingesting an occurrence data resource to verify that the data was ingested correctly:

If your collection is empty, your data resource is probably not correctly mapped to an institution and/or collection.

See the Jenkins page to do this in a more automated way.

Other checks: