5a. Advanced: Azure Machine Learning Services

The Bushfire challenge can also be run on Azure Machine Learning services (AML), a managed service that offers Jupyter Notebooks, experiment logging, and access to auto-scaling compute such as training clusters. Using AML for the challenge makes sense for advanced teams and for institutions that want to give several groups access to the same environment, where compute and notebooks can be shared. While the Cube-in-a-box setup can be readily deployed to the public cloud with a single click, the AML setup requires a few manual steps. If you have an existing Azure subscription and tenant, the Microsoft team is ready to help you set up the AML environment.

Azure Machine Learning

Azure Machine Learning (AML) is a cloud-based environment you can use to train, deploy, automate, manage, and track ML models. It integrates auto-scaling compute clusters, Jupyter notebooks, and your favourite frameworks such as Scikit-learn, XGBoost, PyTorch and TensorFlow. It also has a variety of services built in, ranging from drag-and-drop user interfaces and AutoML to MLOps functionality. Read more about it at docs.microsoft.com and try it first-hand in a free, sandboxed learning environment at Microsoft Learn.

The following is a technical description of how to set up an opendatacube/cube-in-a-box environment in AML.

Overview

Cube-in-a-box is a dockerized setup that provides you with a ready-to-go environment consisting of a PostGIS database and a Conda environment with preinstalled Opendatacube libraries. To emulate this in AML, a few manual installation steps are required, which set up all components separately.

  1. Log into Azure and create a resource group
  2. Create an AML workspace and a VM
  3. Install the Opendatacube libraries
  4. Create the PostGIS DB as Azure Container Instance
  5. Index the data-sets

Please note that you will have to reinstall the Opendatacube libraries for every new VM that you create. Also note that the indexing step only has to be executed once.

Prerequisites

The following assumes that you have the Azure CLI installed and are working in a Bash environment; you can, for example, use the Azure Cloud Shell. It also assumes that you have access to an Azure subscription with quotas lifted as needed.

Please take note of the uppercase variable names used in the steps below, e.g. LOC. They are reused in many steps.
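
If you want to confirm the CLI is in place before starting, a quick check (purely optional, not part of the original setup) is:

# confirm the Azure CLI is installed and check its version
az --version
# once you are logged in (next step), this shows the currently selected subscription
az account show --output table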

Log into Azure and create a resource group

Make sure you are logged in by running the following in your shell:

az login

If needed, list the available subscriptions and select the one of interest:

az account list --output table
az account set --subscription "My Subscription"

Next, create a resource group, which acts as a convenient container for all created resources. Note that we assign a partially random name to the resource group RG here and that the location LOC is hardcoded. See az account list-locations --output table for a list of available locations and pick the one closest to you.

RG=bushfire-challenge-${RANDOM}
LOC=southeastasia;
az group create --name $RG --location $LOC
echo "Created resource group $RG in $LOC"

Create an AML workspace and a VM

WSNAME=$RG
# install the machine learning extension first
az extension add -n azure-cli-ml
# create a workspace
az ml workspace create -w $WSNAME -g $RG

Then, create a VM:

AMLVM=notebookvm${RANDOM}
VMSIZE=Standard_DS3_v2
az ml computetarget create computeinstance --name $AMLVM \
  --vm-size  $VMSIZE  -g $RG -w $WSNAME

The VM size is hardcoded above. See az vm list-sizes -l $LOC for more options.
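
If it helps with picking a size, that list can be narrowed down with a JMESPath query; for example, to show only the DS-series sizes with their core and memory counts (purely illustrative):

# list only DS-series VM sizes in the chosen region, with cores and memory
az vm list-sizes -l $LOC --output table \
  --query "[?contains(name, 'Standard_DS')].{Name:name, Cores:numberOfCores, MemoryMB:memoryInMb}"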

Both of these steps can also be done through the Azure Portal.
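
If you prefer to verify from the CLI that the compute instance has finished provisioning, the azure-cli-ml extension provides a list command (exact subcommands and flags may vary between extension versions):

# list the compute targets in the workspace and check their provisioning state
az ml computetarget list -g $RG -w $WSNAME --output table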

Install the Opendatacube libraries

Next, we'll install the Opendatacube libraries into their own Conda environment and register the kernel with Jupyter as "Python (Datacube)". This has to be done on the AML VM itself and is best done through the Azure Portal: log into the AML workspace, click on Compute, and open a Terminal on the newly created VM. This opens a browser-based terminal on your AML VM.

conda create -y --name datacube python=3.6.9 pip ipykernel
conda activate datacube
conda install -y -c anaconda psycopg2
python -m ipykernel install --user --name datacube --display-name "Python (Datacube)"
pip install --extra-index-url="https://packages.dea.ga.gov.au" \
  aiobotocore[awscli]==1.1.2 botocore==1.17.44 boto3==1.14.44 \
  odc-apps-dc-tools odc-apps-cloud odc-ui

Next, install ipyleaflet into the base Conda environment so that maps can be displayed:

conda deactivate
pip install ipyleaflet

Note that you will have to complete the steps above for every new VM you create.
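
As an optional sanity check (not part of the original setup), you can verify from the same terminal that the kernel was registered and that the libraries import:

# confirm the Jupyter kernel is registered and the datacube library imports cleanly
jupyter kernelspec list
conda activate datacube
python -c "import datacube; print(datacube.__version__)"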

Create the PostGIS DB as Azure Container Instance

Back in your previous shell environment (not in the AML VM terminal), create the PostGIS database as follows:

GISDBNAME=postgis-${RANDOM}
az container create --resource-group $RG --name $GISDBNAME \
  --image kartoza/postgis:11.0-2.5 \
  --dns-name-label $GISDBNAME \
  --ports 5432  \
  --environment-variables POSTGRES_DB=opendatacube \
    POSTGRES_PASSWORD=opendatacubepassword \
    POSTGRES_USER=opendatacube
DB_HOSTNAME="${GISDBNAME}.${LOC}.azurecontainer.io"
echo "Created GIS DB instance ${DB_HOSTNAME}"

Save the instance hostname; it is reused later.
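
If you lose track of the hostname, it can be queried from the container instance at any time, and the container's startup logs are available the same way (optional):

# query the public FQDN of the PostGIS container and inspect its startup logs
az container show --resource-group $RG --name $GISDBNAME --query ipAddress.fqdn --output tsv
az container logs --resource-group $RG --name $GISDBNAME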

Note that the POSTGRES environment variables above are set to the same values as in https://github.com/opendatacube/cube-in-a-box/blob/master/docker-compose.yml.

WIP WARNING: there is no extra security built in here. Ideally, the GIS DB should be deployed in the same VNet as AML (possibly running on Kubernetes rather than ACI).

Index the data-sets

The datasets served through the PostGIS DB set up above need indexing, but only once!

Switch back to the AML VM terminal and run the commands below.

You will have to change DB_HOSTNAME to the hostname created in the last step.

sudo su -
export DB_HOSTNAME="postgis-XYZ.southeastasia.azurecontainer.io"  # modify according to your env
export DB_PASSWORD="opendatacubepassword"
export DB_USERNAME="opendatacube"
export DB_DATABASE="opendatacube"

conda activate datacube
wget -nd https://raw.githubusercontent.com/EY-Data-Science-Program/2021-Better-Working-World-Data-Challenge/main/install-cube.sh
cat install-cube.sh | sed -e 's,docker-compose exec -T jupyter ,,' | sed -e 's, /scripts, /opt/odc/scripts,' | grep -v docker > install-cube-mod.sh
bash ./install-cube-mod.sh secretpassword false
rm -f install-cube.sh install-cube-mod.sh

Note that the DB_* environment variables above use the same values as in https://github.com/opendatacube/cube-in-a-box/blob/master/docker-compose.yml. The script modified on the fly here is the original Cube-in-a-box setup script; the sed and grep commands strip out the Docker references, which are not needed in the AML setup.

WIP WARNING: the GitHub access above will fail as long as the repository is private. If you need to use wget on a private repo, navigate to the script install-cube.sh in your browser, click on "Raw" at the top, and copy the resulting link to use it with wget. The link includes a token that allows access to the script from the terminal.
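
As an optional check (not part of the original instructions), you can confirm from the same AML terminal, with the DB_* variables still exported and the datacube environment active, that the database is reachable and that products were added to the index:

# verify the connection to the index and list the products that were added
datacube system check
datacube product list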

Test the setup

Go back to your AML VM and select Jupyter. Then select New > Python (Datacube) and enter the code below. Note that you will have to change DB_HOSTNAME to the hostname created in the PostGIS step.

import datacube
print(datacube.__version__)

from odc.ui import DcViewer
from datacube import Datacube

import os
os.environ['DB_HOSTNAME']="postgis-XYZ.southeastasia.azurecontainer.io"  # modify according to your env
os.environ['DB_USERNAME']="opendatacube"
os.environ['DB_PASSWORD']="opendatacubepassword"
os.environ['DB_DATABASE']="opendatacube"

dc = Datacube()
DcViewer(
  dc, 
  time='2019-01-01',
  zoom=5,
  center=(-38, 142),
  height='500px', width='800px',
  products='non-empty',
  style={
    'fillOpacity': 0.05,
    'color': 'teal',
    'weight': 0.7
  }
)