Spark to Cosmos DB Connector Setup
Here are the steps to set up Apache Spark on Azure HDInsight and Azure Cosmos DB to run the Azure-CosmosDB-Spark Connector:
- Create an Apache Spark cluster in Azure HDInsight
- Create an Azure Cosmos DB account
- Upload your notebook to the Jupyter notebook service in Azure HDInsight
- Run the Azure-CosmosDB-Spark Connector demo end-to-end
Note, to help you determine the Cosmos DB capacity you need, here's an unofficial throughput and capacity guesstimate guide for Azure Cosmos DB: Request Units, Storage Utilization, Splits....oh my!
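As a rough rule of thumb, a point read of a 1 KB document costs about 1 Request Unit (RU), so an application sustaining 400 such reads per second needs roughly 400 RU/s of provisioned throughput; writes, queries, and larger documents cost proportionally more.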
Let's get started!
Prerequisites
- An Azure subscription. Before you begin this tutorial, you must have an Azure subscription. See Create your free Azure account today.
- Familiarity with the Azure Portal. For more information, please refer to Microsoft Azure portal overview.
Create Apache Spark cluster in Azure HDInsight
To create an Apache Spark cluster in Azure HDInsight, a full set of instructions can be found at Get started: Create Apache Spark cluster in Azure HDInsight and run interactive queries using Spark SQL. Below is an abbreviated version of the installation.
Step 1: Create an Azure HDInsight Cluster
- Go to the Azure Portal
- On the left blade, click the + > Marketplace > Intelligence + analytics > HDInsight
- This will open up the HDInsight Cluster configuration similar to the screenshot below.
The key components for this cluster configuration are:
- Type in your cluster name and subscription
- For your cluster type, choose Spark with version Spark 2.0.2 (HDI 3.5)
- By default your cluster login username is admin and type in your new cluster login password
- For your SSH username, you can either use the same password as the cluster login or use public key authentication. Choose whichever is appropriate for you; many users use a public key so they can SSH into their HDInsight cluster. For more information, please refer to How to create and use an SSH public and private key pair for Linux VMs in Azure.
- Complete the basic settings configuration by choosing your resource group and location. Keep this location in mind as you will want to configure Azure Cosmos DB to be in the same region as your Apache Spark on Azure HDInsight cluster.
- Click Next to go to your Storage Settings
For your storage settings, you can go with the defaults:
- Primary storage type: Azure Storage
- Selection method: My subscription
- All the remaining configurations are optional or filled in by default.
- Click Next to go to your configuration summary
Once you click Create, your Apache Spark on Azure HDInsight cluster will be deployed.
Step 2: Upload Spark Connector JARs to your HDI cluster's storage account
Now that you have your HDI cluster, the next step is to make the azure-cosmosdb-spark JAR available to all the nodes in your cluster. To do this, you will need to first either:
- Build the azure-cosmosdb-spark JARs via maven (for more information, please refer to the user guide), OR
- Download the JARs from the releases folder
To upload them to your HDI cluster's storage account:
- Go to your HDI cluster within the Azure Portal
- Scroll toward the bottom of the left-hand blade and, under Properties, click Storage Accounts; you should see something similar to the screenshot below.
- Click on the storage account > Blobs > Your blob storage container
- Navigate to the folder to which you would like to upload the JARs; for example (as per the screenshot below), the $container$/example/jars folder
Note, we are using the HDI cluster's default storage account rather than a separate blob storage account because, at this time, Spark on HDI can only reference JARs from its own storage account.
- Click Upload and upload the JARs you had built or downloaded in the previous step. It should look something similar to the screenshot below.
- Once the JARs are finished uploading, your folder should look similar to the screenshot below.
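Once uploaded, the JARs still need to be referenced by your Spark session. One way to do this from a Jupyter notebook on the cluster (used in Step 3 below) is a sparkmagic %%configure cell at the top of the notebook, which points the underlying Livy session at the wasb:// paths of the uploaded JARs. A minimal sketch follows; the JAR file name is a placeholder, so substitute the version you actually built or downloaded, and if you have additional dependency JARs, list them here as well:

```
%%configure
{ "jars": ["wasb:///example/jars/azure-cosmosdb-spark-0.0.3.jar"] }
```

Run the %%configure cell before any Spark code (or add -f to force-restart an already-started session with the new settings).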
Running the Azure-CosmosDB-Spark Connector demo end-to-end
In the previous step, you uploaded the azure-cosmosdb-spark JARs to be used by your HDI cluster. To get yourself up and running faster, we will use the Jupyter notebook service within Azure HDInsight.
Step 3: Upload your notebook to the Jupyter notebook service in Azure HDInsight
To get fast access to your Jupyter notebook service, you can go to it directly by typing the URL https://$clustername$.azurehdinsight.net/jupyter/tree, where:
- $clustername$ is the name of the Azure HDInsight cluster you created.
- The login is the one you specified when you built your HDInsight cluster back in Step 1.
Once you log in, it should look something like this:
We will be uploading a Scala notebook that you can download from the /samples/notebooks/ folder of this repository.
To upload the notebook:
- Click on Scala within your Jupyter UI
- On the upper right, click Upload
- Navigate to the sample notebook you had recently downloaded
- Click Open
- And then click Upload for your file.
Now your notebook has been uploaded and is ready to use.
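As a preview of what the sample notebook runs, here is a minimal sketch of reading a Cosmos DB collection into a Spark DataFrame with the connector, following the pattern in the repository's user guide. The endpoint, master key, database, and collection values are placeholders for your own Cosmos DB account, and the snippet assumes the JARs from Step 2 are on the session's classpath:

```scala
// Import the azure-cosmosdb-spark connector classes
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config

// Connection settings for your Cosmos DB account ($...$ values are placeholders)
val readConfig = Config(Map(
  "Endpoint" -> "https://$account$.documents.azure.com:443/",
  "Masterkey" -> "$masterKey$",
  "Database" -> "$database$",
  "Collection" -> "$collection$"
))

// Read the collection into a DataFrame and take a quick look
val coll = spark.sqlContext.read.cosmosDB(readConfig)
coll.createOrReplaceTempView("c")
spark.sql("SELECT COUNT(1) FROM c").show()
```

If the imports fail with a ClassNotFoundException, double-check that the %%configure cell from Step 2 points at the correct wasb:// paths for your uploaded JARs.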