Connecting Cosmos DB with PowerBI using spark and databricks permium - nomiero/azure-cosmosdb-spark GitHub Wiki

A powerfully fun way to visualize your data in Azure Cosmos DB is to use Power BI. While there is an ODBC Driver (refer to Connect to Azure Cosmos DB using BI analytics tools with the ODBC driver), this method requires you to download all of the data from Azure Cosmos DB into Power BI.

To workaround this issue, one technique is to use the azure-cosmosdb-spark connector which allows you to use Apache Spark as the bridge between Power BI and Azure Cosmos DB. Power BI has direct query capabilities to Apache Spark and with the azure-cosmosdb-spark connector, you can create direct connectivity from Power BI to Azure Cosmos DB.

Power BI DQ > Spark > Azure Cosmos DB

Note, these are alpha working instructions and we will over time simplify how to do this so it will be easier for you to configure this.

Setup

You will need the following components

Power BI Desktop,
A databricks premium cluster Azure Databricks Spark, and
an Azure Cosmos DB subscription.

Configuration

Attach the Cosmos DB connector to the provisioned databricks cluser

Create a databricks cluster
Attach the Cosmos DB connector to the data bricks cluster

Creating Hive (Spark) table with Cosmos DB Data Source

Now that your Databricks cluster can automatically connect to your Azure Cosmos DB cluster, you will need to build a table - the source of which is a Cosmos DB query. This way from the Power BI perspective, it is only connecting to a Spark SQL table (not realizing that it is a Cosmos DB data source underneath).

The easiest way to do this is to create a notebook that creates the sql table

you can run a command to create an external table such as the one below:

spark.sql("create table flights using com.microsoft.azure.cosmosdb.spark options (endpoint 'https://doctorwho.documents.azure.com:443/', database 'DepartureDelays', collection 'flights_pcoll', masterkey '$masterkey')")

Thus if you run the show tables command, it should look something like this:

spark.sql("show tables").show()
+---------------+-----------+
|      tableName|isTemporary|
+---------------+-----------+
|        flights|      false|
|hivesampletable|      false|
+---------------+-----------+

And now you can test it by running Spark SQL queries against it, such as:

-- Test the table works
spark.sql("select * from flights limit 10").show()
spark.sql("select date, delay, distance, origin, destination from flights limit 10").show()


-- Test a group by statement
spark.sql("select destination, count(1) from flights where origin = 'SEA' group by destination").show()

Connect Power BI to Spark

Now, you can connect Power BI to Spark as noted in the instructions Apache Spark BI using data visualization tools with Azure Databricks.

The only difference is now you will connect to a table (such as flights above) where its underlying source is an Azure Cosmos DB query.

Tips and Tricks

Based on the initial tests, here are some quick tips:

Because Power BI can ask a lot of queries (especially if you have a lot of widgets), you may want to shrink the scope of the query by creating a view. For example, if I only care about flights departing from Seattle, I can create the view below that limits Power BI to only request for flights departing Seattle. You're reducing the amount of data transferred as well as taking advantage of Cosmos DB indexing at the same time.

spark.sql("CREATE VIEW originSEA AS SELECT date, distance, delay, origin, destination FROM flights WHERE origin = 'SEA'")

Due to the spikiness of Power BI queries, you may want to enable Azure Cosmos DB RU per minute pricing.

Talk to us

As noted, this is an early preview and we will be updating this document as we optimize the query performance as well as the installation process. Meanwhile, please do not hesitate to submit issues, suggest any PRs, or contact us at [email protected].