Requirements - AtlasOfLivingAustralia/documentation GitHub Wiki

The ALA Portal requires several components, such as Java, Tomcat and Cassandra, as well as the web applications that compose the ALA Portal itself. This software can be installed and configured automatically by Ansible playbooks, a type of script, developed alongside the project. So if you have an Ubuntu Linux instance up and running, you can use the Ansible playbooks in the ala-install project to automatically set up and configure that Linux instance as an ALA Portal.
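Ansible targets hosts through an inventory file. As a minimal sketch (the group name and hostname below are placeholders, not taken from ala-install), you would list the Ubuntu instance in an inventory and then run a playbook against it:

```shell
# Write a minimal inventory naming the Ubuntu instance to configure.
# The group and host names are placeholders; ala-install documents its own.
printf '[ala-portal]\nmy-ala-host.example.com ansible_user=ubuntu\n' > inventory
cat inventory
# A playbook from ala-install would then be run along these lines
# (the playbook name here is illustrative):
#   ansible-playbook -i inventory the-portal-playbook.yml
```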

Besides Ansible playbooks, if you don't have a Linux instance ready, or just want to set up a clean one for the ALA Portal, you can consider the following two tools to create a clean instance:

  • Vagrant: Creates and configures a virtual machine to host the ALA Portal;
  • VirtualBox: The virtualization software that runs the virtual machine;
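To sketch how the two tools fit together: Vagrant reads a Vagrantfile and drives VirtualBox to create the virtual machine. The Vagrantfile below is illustrative only; the box name and the resource figures (sized to the recommended requirements listed later in this guide) should be adjusted to your setup:

```ruby
# Illustrative Vagrantfile: Vagrant drives VirtualBox to create the VM.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/bionic64"          # an Ubuntu 18.04 base box
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 32768                        # 32 GB RAM
    vb.cpus   = 2                            # 2 CPUs
  end
end
```

Running `vagrant up` in the directory containing this file creates and boots the machine.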

The installation guide assumes you want to create a Linux instance and configure an ALA Portal from scratch. If you want to configure an existing Linux (Ubuntu) instance, you may go straight to the Ansible section of the installation guide. This tutorial can only be followed on UNIX systems (Linux/Mac OS X), as Ansible is not available for Microsoft Windows as of 24 June 2014.

Using a Macintosh as an example, here are the steps to get these tools ready:

  1. Download Vagrant and install the downloaded package.

  2. Download VirtualBox and install the downloaded package.

  3. To install Ansible, the easiest way is via Homebrew. It is also handy to have the Command Line Tools for Mac OS X installed. Once ready, the following commands install Ansible:

     $ brew update
     $ brew install ansible
    

The recommended minimal server requirements are:

  • Virtual machine with Ubuntu 18.04.
  • 100 GB of free disk storage (ideally SSD); the indexing and processing of data consumes space very quickly.
  • 32 GB RAM.
  • 2 CPUs.
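A quick way to check a machine against these minimums is a short shell script. The thresholds simply restate the list above; `df`, `free` and `nproc` are standard Linux utilities:

```shell
# Check the host against the recommended minimums (100 GB disk, 32 GB RAM, 2 CPUs).
MIN_DISK_GB=100
MIN_RAM_GB=32
MIN_CPUS=2

disk_gb=$(df -BG --output=avail / | tail -1 | tr -dc '0-9')   # free GB on /
ram_gb=$(free -g | awk '/^Mem:/ {print $2}')                  # total RAM in GB
cpus=$(nproc)                                                 # CPU count

ok=yes
[ "$disk_gb" -ge "$MIN_DISK_GB" ] || { echo "disk: ${disk_gb}GB is below ${MIN_DISK_GB}GB"; ok=no; }
[ "$ram_gb"  -ge "$MIN_RAM_GB"  ] || { echo "RAM: ${ram_gb}GB is below ${MIN_RAM_GB}GB"; ok=no; }
[ "$cpus"    -ge "$MIN_CPUS"    ] || { echo "CPUs: ${cpus} is below ${MIN_CPUS}"; ok=no; }
echo "meets minimums: $ok"
```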

Pipelines requirements

Interpretation of big datasets (anything over about 2 million records) will be very slow with embedded Spark, so a Spark cluster is recommended. ALA itself runs datasets of over 300k records on its Spark cluster; above 300k records, the performance of embedded Spark degrades in a non-linear fashion. (Source: Dave Martin.)
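To make the embedded-versus-cluster distinction concrete, a pipelines deployment typically points interpretation at an external Spark master rather than the embedded runner. The configuration fragment below is illustrative only; the key names are placeholders, not a real schema, so consult the pipelines configuration reference for the actual names:

```yaml
# Illustrative only -- key names are placeholders, not a real config schema.
interpretation:
  runner: cluster                  # "embedded" is fine below ~300k records
spark:
  master: spark://spark-master.example.com:7077   # placeholder cluster URL
  executor-memory: 16G
  executor-cores: 4
```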