Spark to Cosmos DB Connector Setup
Here are the steps to set up Apache Spark on Azure HDInsight and Azure Cosmos DB to run the Azure-CosmosDB-Spark Connector:
- Create an Apache Spark cluster in Azure HDInsight
- Create an Azure Cosmos DB account
- Upload your notebook to the Jupyter notebook service in Azure HDInsight
- Run the Azure-CosmosDB-Spark Connector demo end-to-end
Note, to help you determine the Cosmos DB capacity you need, here's an unofficial throughput and capacity guesstimate guide for Azure Cosmos DB: Request Units, Storage Utilization, Splits....oh my!
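As a rough rule of thumb, a point read of a 1 KB document costs about 1 Request Unit (RU), so an application sustaining 400 such reads per second needs roughly 400 RU/s of provisioned throughput; writes, queries, and larger documents cost proportionally more.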
Let's get started!
Prerequisites
- An Azure subscription. Before you begin this tutorial, you must have an Azure subscription. See Create your free Azure account today.
- Familiarity with the Azure Portal. For more information, please refer to Microsoft Azure portal overview.
Create Apache Spark cluster in Azure HDInsight
To create an Apache Spark cluster in Azure HDInsight, a full set of instructions can be found at Get started: Create Apache Spark cluster in Azure HDInsight and run interactive queries using Spark SQL. Below is an abbreviated version of the installation.
Step 1: Create an Azure HDInsight Cluster
- Go to the Azure Portal
- On the left blade, click the + > Marketplace > Intelligence + analytics > HDInsight
- This will open up the HDInsight Cluster configuration similar to the screenshot below.
The key components for this cluster configuration are:
- Type in your cluster name and subscription
- For your cluster type, choose Spark with version Spark 2.0.2 (HDI 3.5)
- By default your cluster login username is admin and type in your new cluster login password
- For your SSH username, you can either use the same password as the cluster login or use public key authentication. Choose whichever is appropriate for you; many users use a public key so they can SSH into their HDInsight cluster. For more information, please refer to How to create and use an SSH public and private key pair for Linux VMs in Azure.
- Complete the basic settings configuration by choosing your resource group and location. Keep this location in mind as you will want to configure Azure Cosmos DB to be in the same region as your Apache Spark on Azure HDInsight cluster.
- Click Next to go to your Storage Settings
For your storage settings, you can go with the defaults:
- Primary storage type: Azure Storage
- Selection method: My subscription
- All the remaining configurations are optional or filled in by default.
- Click Next to go to your configuration summary
Once you click Create, your Apache Spark on Azure HDInsight cluster will be deployed.
Step 2: Upload Spark Connector JARs to your HDI cluster's storage account
Now that you have your HDI cluster, the next step is to make the azure-cosmosdb-spark JAR available to all the nodes in your cluster. To do this, you will need to first either:
- Build the azure-cosmosdb-spark JARs via maven (for more information, please refer to the user guide), OR
- Download the JARs from the releases folder
To upload them to your HDI cluster's storage account:
- Go to your HDI cluster within the Azure Portal
- Scroll toward the bottom of the left-hand blade and, under Properties, click Storage Accounts; you should see something similar to the screenshot below.
- Click on the storage account > Blobs > Your blob storage container
- Navigate to the folder to which you would like to upload the JARs; for example (as per the screenshot below), the $container$/example/jars folder
Note, we are using the HDI cluster's default storage account rather than a separate blob storage account because, at this time, Spark on HDI can only reference JARs from its own storage account.
- Click Upload and upload the JARs you had built or downloaded in the previous step. It should look something similar to the screenshot below.
- Once the JARs are finished uploading, your folder should look similar to the screenshot below.
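Once uploaded, the JARs still need to be referenced by your Spark session. One way to do this from a Jupyter notebook on the cluster (used in Step 3 below) is a sparkmagic %%configure cell at the top of the notebook, which points the underlying Livy session at the wasb:// paths of the uploaded JARs. A minimal sketch follows; the JAR file name is a placeholder, so substitute the version you actually built or downloaded, and if you have additional dependency JARs, list them here as well:

```
%%configure
{ "jars": ["wasb:///example/jars/azure-cosmosdb-spark-0.0.3.jar"] }
```

Run the %%configure cell before any Spark code (or add -f to force-restart an already-started session with the new settings).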
Running the Azure-CosmosDB-Spark Connector demo end-to-end
In the previous step, you uploaded the azure-cosmosdb-spark JARs to be used by your HDI cluster. To get yourself up and running faster, we will use the Jupyter notebook service within Azure HDInsight.
Step 3: Upload your notebook to the Jupyter notebook service in Azure HDInsight
To get fast access to your Jupyter notebook service, you can go to it directly by typing the URL https://$clustername$.azurehdinsight.net/jupyter/tree, where:
- $clustername$ is the name of the Azure HDInsight cluster you created.
- The login is the one you specified when you built your HDInsight cluster back in Step 1.
Once you log in, it should look something like this:
We will be uploading a Scala notebook that you can download from the /samples/notebooks/ folder of this repository.
To upload the notebook:
- Click on Scala within your Jupyter UI
- On the upper right, click Upload
- Navigate to the sample notebook you had recently downloaded
- Click Open
- And then click Upload for your file.
Now your notebook has been uploaded and is ready to use.
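As a preview of what the sample notebook runs, here is a minimal sketch of reading a Cosmos DB collection into a Spark DataFrame with the connector, following the pattern in the repository's user guide. The endpoint, master key, database, and collection values are placeholders for your own Cosmos DB account, and the snippet assumes the JARs from Step 2 are on the session's classpath:

```scala
// Import the azure-cosmosdb-spark connector classes
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config

// Connection settings for your Cosmos DB account ($...$ values are placeholders)
val readConfig = Config(Map(
  "Endpoint" -> "https://$account$.documents.azure.com:443/",
  "Masterkey" -> "$masterKey$",
  "Database" -> "$database$",
  "Collection" -> "$collection$"
))

// Read the collection into a DataFrame and take a quick look
val coll = spark.sqlContext.read.cosmosDB(readConfig)
coll.createOrReplaceTempView("c")
spark.sql("SELECT COUNT(1) FROM c").show()
```

If the imports fail with a ClassNotFoundException, double-check that the %%configure cell from Step 2 points at the correct wasb:// paths for your uploaded JARs.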