1b. Getting started: Azure Environment - EY-Data-Science-Program/2021-Better-Working-World-Data-Challenge GitHub Wiki
Follow this guide to set up a data science environment using Microsoft Azure.
You can also watch this video tutorial which shows you every step from deploying your data science environment to submitting your first results! From sign up to submission in 20 minutes flat on Microsoft Azure
Contents
- Learn about Microsoft Azure
- Setting up an account
- Creating an SSH key
- Deploy data science environment
- Launch data science environment
- Next steps
- FAQ
Learn about Microsoft Azure
If you are new to Microsoft Azure, the following resources are a great place to familiarise yourself with the basic concepts. They are entirely optional, but recommended.
Microsoft Learn is a free, interactive, hands-on training platform that helps developing technical skills related to Microsoft products such as Azure.
Training modules are bite-sized and organized by learning path. Different paths and modules can be selected by choosing a role, such as data scientist.
These paths are also a good start to prepare for certifications, such as the Azure Data Scientist Associate.
Modules take between 30 minutes to two hours to complete. The learning paths are a collection of modules. We recommend to skip those that are not of interest.
Here is the full list of learning paths, recommended for aspiring Data Scientists. In addition, we recommend to cherry-pick modules from the Azure Fundamentals path, such as:
- Introduction to Azure
- Create an Azure Account
- Core Cloud Services - Manage services with the Azure portal (1h)
- Introduction to Azure virtual machines (1h)
There is also a series of youtube playlists which may be helpful.
- Getting Started with Azure
- Working with Data in Azure
- Machine Learning in Azure
- Data Science & Machine Learning
We strongly recommend that you backup your code/any data files you have created by either downloading them to your local drive or laptop, or if you're using Github, pushing them to a branch in your team's repo (see 2d. Code Management and Collaboration for instructions on how to set this up).
And finally we recommend to take the module on cost optimization and budgets. Setting up cost alerts early will help to make sure you stay within budget. Important: This feature doesn't work for Azure for Students subscriptions.
Setting up an account
Follow the steps at 1a. Creating an Azure for Students subscription (for both students and non-students) then continue here.
Creating an SSH key
Before you can launch the data science environment, you need an SSH key pair.
Note that an SSH key is required to create the environment, however connecting with SSH isn't required. It may be useful for advanced usage and technical troubleshooting, so make sure to keep it safe, and remember where it is!
SSH keys are like a special, extra secure password, typically used as the standard way to authenticate access to Linux servers, like the one used for the data science environment.
An SSH key has two parts; a public key and a private key. The public key is given to the server then you use the private key to log in. This way your secret login information is never actually sent over any network, special cryptographic algorithms make it work. Cool!
There are a number of methods for creating SSH keys:
- Using the Azure portal: No need to install anything (Recommended - courtesy of Azure). When creating an SSH key using this method, we recommend creating the SSH key in a new Resource Group that is location in the Australia East Region.
- Using the command line: Fastest method, uses tools already available on Linux, MacOS, and latest Windows versions (courtesy of GitHub)
- Using PuTTY and PuTTYGen: Common for Windows users (courtesy of Digital Ocean)
Deploy the data science environment
Now you should be prepared to deploy your very own data science environment in Azure Cloud. Follow the next steps to deploy it.
Make sure you have an Azure account and are logged in to the Azure Portal, then go ahead and click the link below:
This will launch the deployment. You will be prompted for several parameters to configure your environment (marked in yellow as shown below).
- Subscription: This should be automatically set to your current subscription
- Resource Group: Logical group name for related resources. If you created your SSH key using Azure, use the same resource group here. Otherwise, click "create new" so that you can monitor your Azure credit consumption for this specific resource group.
- Region: The recommended region is Australia East, since that is where the data is hosted. If you specified an existing resource group this will be assigned for you.
- Instance Type: Choose an instance size. Note that larger instances will use more credits. If you need to upload or create a large number of files in your VM, you may need to come back here and re-deploy your environment with a larger capacity VM. In this case, if you're not using your original resource group, you can remove it to save credits - follow 3c. Deleting your data science environment: Azure.
- SSH Key: This is where you enter your public SSH key. If you created your SSH key pair using the recommended method above, you can click "Go to resource" after the SSH key pair is created to get the public key. It should look something like
ssh-rsa xxx.......xxx
- Secret Password: This is the password you want to use to log in to Jupyter, choose something secure and don't forget this password because you will use it when you first login to Jupyter on the item #5 of this guide - Launch data science environment.
⚠️ Do not use a password which contains only numerical characters, as this causes an issue in Jupyter starting. ⚠️
Once you populated everything accordingly, then click the "Review + Create" button.
The validation screen will come up. Then, click in " Create" button (bottom left):
The ARM template deployment will start, as below. Go and grab a tea or coffee. The deployment will take 10-15 mins to complete.
Note: During this time, the following will be created: Virtual Machine (VM), internal boot disks, external disks, storage, network interface, public-ip, Network Security Group (NSG) - firewall, virtual network, SSH Key. Before cloud technology exists, this would take days or weeks!
After approximately 10-15 mins (it will depend on the region you are deploying) the ARM deployment (magic link) should complete, and then you will be able to see the data science environment deployed successfully as per example below.
Click in " Go to Resource Group" and you will be able to see the complete solution (whole infrastructure and the VM) for cubeinbox :
Note: After clicking this button, the Resource Group "cubeinbox" will be launched with all the solution deployed:
Next Steps
- Launch data science environment: Check you have completed the deployment by accessing the environment
- Schedule VM Auto-shutdown: Implement auto-shutdown to avoid consuming azure credits unnecessarily
Once you have provisioned your data science environment, check out the following technical guides
- Cost Management: Azure: Learn how to get the best out of your credits
- How to stop your data science environment: How to stop your environment and save credits.
- How to delete your data science environment: How to remove your environment. You can always re-deploy it to get the latest release.
- Code management and collaboration: Tips on how to manage your code and work with others
- Download the CSV (result file) from your data science environment: Download from Jupyter