# Data Processing using Databricks User Guide
Please note that the examples in this user guide are written mostly in Python; for examples in other languages such as Scala and R, visit the website linked under each section.
## Access files from Data Lake storage and load them into a Spark dataframe
```python
df = spark.read.format('csv').option("header", "true").load('abfss://[email protected]/demo/export-2.csv')
```

In the `abfss://` path, the first part (before the `@`) is the container name and the second part is the blob storage account.
https://docs.databricks.com/data/data-sources/azure/azure-storage.html
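Before the read above will work, the cluster needs permission to reach the storage account. Below is a minimal sketch of one common approach described in the Databricks documentation linked above, using an account key kept in a Databricks secret scope; the `storage_account` value and the secret key name are placeholders to replace with the values used in your workspace.

```python
# Grant this notebook access to the ADLS Gen2 account using an account key
# retrieved from a Databricks secret scope.
storage_account = "mystorageaccount"  # placeholder storage account name
account_key = dbutils.secrets.get(scope="datalake-key-dev", key="storage-account-key")  # placeholder key name

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key
)
```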
## Read data from an Excel file
```python
projectsDf = (spark.read.format("com.crealytics.spark.excel")
    .option("sheetName", fiscal_year + " Projects")
    .option("dataAddress", "A5")
    .option("Header", False)
    .load("dbfs:/mnt" + vMountDirectory + vFileName))
```
*Requires the crealytics `spark-excel` library to be installed on the cluster.*
## Frequently used Spark SQL Commands
#### Select rows from a dataframe by filtering:
```python
df.select("GEO", "Value", "Date").where("value > 6000").show()
```
#### Filter rows with not-null values:
```python
dfND = df.where(df["VALUE"].isNotNull())
```
#### Group by a column and get the count of rows:
```python
dfCountByLF = dfND.groupBy("Labour_force_characteristics").count()
```
#### Rename columns:
```python
renamedDf = df.withColumnRenamed("LName", "Last_Name")
```
https://spark.apache.org/docs/latest/sql-programming-guide.html
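These commands can also be chained into a single transformation. A small sketch, reusing the column names from the examples above, that filters out nulls, counts rows per labour force characteristic, and renames the resulting count column:

```python
# Filter out null values, count rows per labour force characteristic,
# and rename the generated "count" column -- chaining the commands shown above.
summaryDf = (df.where(df["VALUE"].isNotNull())
    .groupBy("Labour_force_characteristics")
    .count()
    .withColumnRenamed("count", "Row_Count"))

summaryDf.show()
```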
## Use of different programming languages within the same notebook (Python, SQL, Scala, R)
#### Convert a Spark dataframe to a regular pandas (Python) dataframe:
```python
import pandas

pandasDf = df.select("*").toPandas()
```
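The conversion also works in the opposite direction. A minimal sketch, assuming the `pandasDf` produced above, that turns a pandas dataframe back into a Spark dataframe:

```python
# Convert a pandas dataframe back into a Spark dataframe so it can be used
# with Spark SQL and distributed transformations.
sparkDf = spark.createDataFrame(pandasDf)
sparkDf.printSchema()
```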
#### Create a temporary table (view) from a dataframe and query the view using SQL
```python
dfFullTimeEmp.createOrReplaceTempView("vw_dfFullTimeEmp")
```
```sql
%sql
SELECT sum(value) Employment, NAICS
FROM vw_dfFullTimeEmp
WHERE NAICS <> 'Total, all industries'
GROUP BY NAICS
ORDER BY Employment DESC
```
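The same query can also be issued from Python with `spark.sql`, which returns the result as a Spark dataframe. A brief sketch using the view created above:

```python
# Run the same aggregation from Python; spark.sql returns a Spark dataframe.
employmentByNaicsDf = spark.sql("""
    SELECT sum(value) Employment, NAICS
    FROM vw_dfFullTimeEmp
    WHERE NAICS <> 'Total, all industries'
    GROUP BY NAICS
    ORDER BY Employment DESC
""")
employmentByNaicsDf.show()
```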
#### Convert the dataset to a SparkR dataframe
```r
%r
library(SparkR)
sdrf2 <- sql("SELECT * FROM vw_dfFullTimeEmp")
```
## Store data
#### Create a persistent table in the Data Lake
```sql
CREATE TABLE IF NOT EXISTS DEMO_LABOUR_FORCE
USING DELTA
LOCATION 'abfss://[email protected]/demo/tbl/'
AS SELECT * FROM vw_dfFullTimeEmp
```
#### Query a previous version of the Delta Lake table using SQL
```sql
SELECT * FROM DEMO_LABOUR_FORCE TIMESTAMP AS OF '2021-02-22T22:32:33.000+0000'
```
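To find out which versions and timestamps are available for time travel, the Delta table history can be inspected. A short sketch from Python, using the table created above:

```python
# List the table's versions, timestamps, and operations.
spark.sql("DESCRIBE HISTORY DEMO_LABOUR_FORCE") \
    .select("version", "timestamp", "operation") \
    .show()

# Time travel by version number instead of timestamp (version 0 is the initial write).
firstVersionDf = spark.sql("SELECT * FROM DEMO_LABOUR_FORCE VERSION AS OF 0")
```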
## Write data to a SQL Server database
```python
# Build the JDBC connection URL for the Azure SQL database
server_name = "jdbc:sqlserver://sqlserver-ciosb-datahub.database.windows.net"
database_name = "spi-drf-db"
url = server_name + ";" + "databaseName=" + database_name + ";"

# Retrieve the credentials from the Databricks secret scope
username = password = dbutils.secrets.get(scope="datalake-key-dev", key="sqladmin-pswd-ciosb-datahub")

# Write the dataframe to the EA_PROJECT_TRACKER table, replacing any existing rows
table_name_project_tracker = "EA_PROJECT_TRACKER"
(projectsDf_with_uniqueID.write.format("jdbc")
    .mode("overwrite")
    .option("url", url)
    .option("dbtable", table_name_project_tracker)
    .option("user", username)
    .option("password", password)
    .save())
```
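Reading the table back from SQL Server uses the mirrored `spark.read` call. A minimal sketch that reuses the `url`, `username`, and `password` variables defined above:

```python
# Load the table back from SQL Server into a Spark dataframe.
projectTrackerDf = (spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "EA_PROJECT_TRACKER")
    .option("user", username)
    .option("password", password)
    .load())

projectTrackerDf.show(5)
```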