Data ingestion - AtlasOfLivingAustralia/documentation GitHub Wiki

Ingest of resources
- Just a Dataset
- All of your resources
Use & good practices of Biocache
Command for spatial module
Checks after ingestions

Ingest of resources

Just a Dataset

If you are working with a single dataset:

In case of a small data resource (< 50k records)

Connect to the server where your biocache tool is hosted
Connect to biocache:

$ sudo biocache

Run the following command line:

biocache> ingest -dr <dataresource_id>

You can also run it directly from the terminal:

$ sudo biocache ingest -dr <dataresource_id>

Important: no logs are kept unless you redirect the output to a file.
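As a minimal sketch of the point above, the one-shot ingest can be wrapped so that its output always lands in a log file. The function name `ingest_with_log`, the `BIOCACHE_CMD` variable and the default log path under /tmp are illustrative assumptions, not part of biocache; only `biocache ingest -dr` comes from this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of biocache): run a single-resource ingest
# and capture both stdout and stderr in a log file.

# Command used to invoke biocache; override it if sudo is not needed.
BIOCACHE_CMD="${BIOCACHE_CMD:-sudo biocache}"

ingest_with_log() {
    dr_id="$1"                                  # placeholder id, e.g. dr123
    log_file="${2:-/tmp/ingest_${dr_id}.log}"   # assumed default log location

    if [ -z "$dr_id" ]; then
        echo "usage: ingest_with_log <dataresource_id> [log_file]" >&2
        return 2
    fi

    # Left unquoted on purpose so "sudo biocache" splits into two words.
    $BIOCACHE_CMD ingest -dr "$dr_id" > "$log_file" 2>&1
}

# Usage (uncomment to run):
# ingest_with_log dr123 /tmp/ingest_dr123.log
```

Redirecting stderr together with stdout (`2>&1`) keeps error messages in the same file as the normal output, which is usually what you want when reviewing a failed ingest.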

In case of a big data resource (> 50k and < 8 million records)

Connect to the server where your biocache tool is hosted
Connect to biocache:

$ sudo biocache

Run the following commands:

biocache> load <dataResource_id>
biocache> process -dr <dataResource_id>
biocache> sample -dr <dataResource_id>
biocache> index -dr <dataResource_id>

You can also run each of the commands above directly from the terminal by prefixing it with:

$ sudo biocache
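The four stages above can be sketched as a small script that runs them in order from the terminal and stops at the first failure. The function name `run_stages` and the `BIOCACHE_CMD` variable are hypothetical; the sub-commands and their arguments are the ones listed on this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of biocache): run load, process, sample and
# index for one data resource, aborting as soon as a stage fails.

# Command used to invoke biocache; override it if sudo is not needed.
BIOCACHE_CMD="${BIOCACHE_CMD:-sudo biocache}"

run_stages() {
    dr_id="$1"   # placeholder id, e.g. dr123

    if [ -z "$dr_id" ]; then
        echo "usage: run_stages <dataResource_id>" >&2
        return 2
    fi

    # As on this page, load takes the id as a positional argument,
    # while the other stages take it through -dr.
    $BIOCACHE_CMD load "$dr_id"        || return 1
    $BIOCACHE_CMD process -dr "$dr_id" || return 1
    $BIOCACHE_CMD sample -dr "$dr_id"  || return 1
    $BIOCACHE_CMD index -dr "$dr_id"   || return 1
}

# Usage (uncomment to run, keeping a log of all stages):
# run_stages dr123 > /tmp/stages_dr123.log 2>&1
```

Stopping at the first failing stage avoids indexing records that were never processed or sampled.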

In case of a really big data resource (DwC-A size > 1 GB)

1. Upload a reduced DwC-Archive containing only 15 occurrences so that the dataset is created in the system.
2. Copy the real DwC-Archive over the reduced one in the /collectory/upload/ folder.
3. Then run the load, process, sample and index commands:

biocache> load <dataResource_id>
biocache> process -dr <dataResource_id>
biocache> sample -dr <dataResource_id>
biocache> index -dr <dataResource_id>

Step 1 is needed because the ZipFile library used by biocache-store cannot open a file larger than 1 GB.

The server needs at least as much RAM as the size of your DwC-Archive.
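The RAM requirement above can be checked before loading with a rough pre-flight script. This is only a sketch: the function names are hypothetical, it assumes a Linux server (it reads `MemTotal` from /proc/meminfo), and it compares total RAM rather than currently free memory.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of biocache, Linux only):
# compare the size of a DwC-Archive with the total RAM of the server.

# Succeeds when ram_bytes is at least archive_bytes.
ram_covers_archive() {
    archive_bytes="$1"
    ram_bytes="$2"
    [ "$ram_bytes" -ge "$archive_bytes" ]
}

check_archive_fits() {
    archive="$1"   # path to the DwC-Archive zip file

    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 2
    fi

    # Arithmetic expansion strips any padding that wc may print.
    archive_bytes=$(( $(wc -c < "$archive") ))

    # MemTotal is reported in kB in /proc/meminfo.
    ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    ram_bytes=$(( ram_kb * 1024 ))

    if ram_covers_archive "$archive_bytes" "$ram_bytes"; then
        echo "OK: archive is $archive_bytes bytes, RAM is $ram_bytes bytes"
    else
        echo "WARNING: archive ($archive_bytes bytes) exceeds RAM ($ram_bytes bytes)" >&2
        return 1
    fi
}

# Usage (uncomment to run; the path is a placeholder):
# check_archive_fits /path/to/archive.zip
```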

All of your resources

You can ingest everything with a single command (as a sudo user):

$ nohup biocache ingest -all > /tmp/load.log &

You can also run the three stages as separate commands directly from the terminal (as a sudo user):

$ biocache process-local-node
$ biocache sample-local-node
$ biocache index-local-node

In practice, the ALA team loads datasets during the week and has Jenkins jobs for offline indexing that process, sample and index everything twice a week, especially for big datasets (> 100k records).

In older versions of biocache:

$ biocache bulk-processor load -t 7 > data/output_load.log
$ biocache bulk-processor process -t 6 > data/output_process.log
$ biocache bulk-processor index -ps 1000 -t 8 > /data/output_index.log

The -t option sets the number of CPUs to use for the process.

The -ps option sets the number of occurrences per page in Solr.

Use & good practices of Biocache

You can run these tasks via Jenkins, which lets you store the task logs and share the tasks with your team.

@Todo: Tip: You don't need to enter the biocache environment to execute a biocache command line (flag by institution/ALA production)

Command for spatial module

@Todo: Instructions to add

Checks after ingestions

Some manual checks can be performed after ingesting an occurrence data resource to verify that the data was ingested correctly:

Check that the data resource's collection shows a similar number of occurrences in the collectory and in your source (IPT, DwCA). You can check this:
- Via your biocache-hub web search
- Via your biocache-ws API with calls like: https://biocache-ws.ala.org.au/ws/occurrences/search?q=data_resource_uid:drNUMBER
- Via a direct Solr index search with a similar query.
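The biocache-ws check above can be sketched from the terminal as follows. The helper names and the `BIOCACHE_WS_URL` variable are hypothetical. The sketch also assumes that your biocache-ws version accepts `pageSize=0` (to skip the record payload) and returns a `totalRecords` field in its JSON response; verify both against your own installation.

```shell
#!/bin/sh
# Hypothetical helpers (not part of biocache) to read the occurrence count
# of one data resource from biocache-ws.

# Base URL of the web service; replace it with your own biocache-ws.
BIOCACHE_WS_URL="${BIOCACHE_WS_URL:-https://biocache-ws.ala.org.au/ws}"

# Build the search URL for one data resource uid (e.g. dr123).
build_search_url() {
    # pageSize=0 is an assumption: it asks for the counts without records.
    echo "${BIOCACHE_WS_URL}/occurrences/search?q=data_resource_uid:$1&pageSize=0"
}

# Read a JSON search response on stdin and print its totalRecords value.
# The field name is assumed; a JSON tool such as jq would be more robust.
extract_total_records() {
    sed -n 's/.*"totalRecords"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage (needs curl and network access; dr123 is a placeholder):
# curl -s "$(build_search_url dr123)" | extract_total_records
```

The printed number can then be compared by hand with the record count reported by your source (IPT or DwC-A).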

If your collection is empty, your data resource is probably not correctly mapped to an institution and/or collection.

See the Jenkins page to do this in a more automated way.

Other checks:

- If your data resource has multimedia, you can search your image service using the dataResourceUid criteria. Sample:
- Check in your spatial service that your occurrences were also processed correctly.
