Azure Databricks - barialim/architecture GitHub Wiki

Overview

Apache Spark is an open-source unified analytics engine for large-scale data processing.

or

Apache Spark is an open-source, fast cluster-computing system and a widely used framework for big data workloads. It processes data in parallel across a cluster, which improves performance, and it uses in-memory caching and optimized query execution to run fast analytic queries against data of any size.
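To make the "process data in parallel" idea concrete, here is a minimal plain-Python sketch. It is not Spark and does not use any Spark API; it only mimics the general pattern of splitting data into partitions, working on each partition independently, and combining the partial results. The function names are hypothetical, and Python threads are used purely to show the structure, not to demonstrate real speed-ups.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(partition):
    """Work done independently on each partition (the "map" step)."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4):
    """Process every partition in a worker pool, then combine the results."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    # Combine the per-partition results (the "reduce" step).
    return reduce(lambda a, b: a + b, partial_results, 0)
```

Calling `parallel_sum_of_squares(list(range(10)))` gives the same answer as a sequential loop; in Spark the partitions would live on different executors in the cluster rather than in threads of one process.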

Apache Spark is written in Scala and provides high-level APIs in Java, Scala, Python, and R. It can access data from HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source, and it can run under the Standalone, YARN, and Mesos cluster managers.

See Apache Spark for more info.

What is Azure Databricks and how is it related to Spark?

In simple terms, Azure Databricks is a managed, Apache Spark-based platform on Azure. With fully managed Spark clusters, it is used to process large data workloads, and it also supports data engineering, data exploration, data visualization, and machine learning.

Other names for Databricks

  • Analytic platform
  • Big data platform
  • Data science platform

Biggest winning point of DBX

  • Its ability to mix multiple languages in the same notebook. For example, if you create a DataFrame in Python, you can register it as a temporary view and then query it from Scala, R, or SQL by referring to that view.
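The Python side of that workflow could look like the sketch below. It assumes a `spark` session is already available, as it is in a Databricks notebook; the view name, column names, and sample rows are made up for illustration.

```python
# Hypothetical view name and query, chosen only for this example.
VIEW_NAME = "sales_view"
QUERY = f"SELECT region, SUM(amount) AS total FROM {VIEW_NAME} GROUP BY region"


def register_and_query(spark):
    """Create a DataFrame in Python, expose it as a temp view, query it with SQL."""
    df = spark.createDataFrame(
        [("north", 10.0), ("south", 5.0), ("north", 2.5)],  # made-up sample rows
        ["region", "amount"],
    )
    # The temporary view is visible to the other languages in the same Spark session.
    df.createOrReplaceTempView(VIEW_NAME)
    return spark.sql(QUERY)
```

In a notebook you would call `register_and_query(spark).show()`. A later cell could then start with the `%sql` magic command and run `SELECT * FROM sales_view`, or start with `%scala` and read the same view with `spark.table("sales_view")`.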

Why Databricks

Apart from its support for multiple languages, this service allows us to integrate easily with many Azure services like Blob Storage, Data Lake Store, SQL Database and BI tools like Power BI, Tableau, etc.

It is a great collaborative platform letting data professionals share clusters and workspaces, which leads to higher productivity.🥇

Components of Databricks

The diagram below, published by Microsoft, shows the Databricks components on Azure:

Azure Databricks components

Databricks Workspace

It offers an interactive workspace that enables data scientists, data engineers, and businesses to collaborate on notebooks and dashboards.

Databricks Runtime

Databricks Runtime is the set of software artifacts that run on the clusters of machines managed by Databricks. It includes Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics. For more, see Databricks Runtime.

Databricks File System (DBFS)

This is an abstraction layer on top of object storage. It lets you mount storage such as Azure Blob Storage and access the data as if it were on the local file system.
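As a small illustration of the "local file system" view, the sketch below translates a `dbfs:/` URI into the path under `/dbfs` where that data is commonly exposed on the driver node. This mapping is an assumption about the workspace setup (it may not apply to every cluster or access mode), and the helper name and example path are hypothetical.

```python
from pathlib import PurePosixPath

DBFS_SCHEME = "dbfs:"
LOCAL_FUSE_ROOT = "/dbfs"  # assumed local mount point of DBFS on the driver


def dbfs_uri_to_local_path(uri: str) -> str:
    """Map a URI such as dbfs:/mnt/raw/file.csv to /dbfs/mnt/raw/file.csv."""
    if not uri.startswith(DBFS_SCHEME + "/"):
        raise ValueError(f"not a DBFS URI: {uri!r}")
    # Drop the scheme and leading slashes, then re-root under the local mount.
    relative = uri[len(DBFS_SCHEME):].lstrip("/")
    return str(PurePosixPath(LOCAL_FUSE_ROOT) / relative)
```

With a mount under `/mnt`, `dbfs_uri_to_local_path("dbfs:/mnt/raw/events.csv")` yields a path that ordinary Python file APIs such as `open()` can read, while Spark readers would keep using the `dbfs:/` form.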

Scheduling/Workflows/Jobs

Azure Data Factory is preferred over Airflow.

Terminology

Asset aggregation - read into it

  • who has access to workspace, data
  • how are groups (infra, data) managed