Azure Databricks - barialim/architecture GitHub Wiki
Apache Spark is an open-source unified analytics engine for large-scale data processing.
Put another way, Apache Spark is an open-source, fast cluster-computing system and a highly popular framework for big data workloads. It processes data in parallel to boost performance, and it uses in-memory caching and optimized query execution to run fast analytic queries against data of any size.
Apache Spark is written in Scala and provides high-level APIs in Java, Scala, Python and R. It can access data from HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source, and it can run under the Standalone, YARN and Mesos cluster managers.
See Apache Spark for more info.
In simple terms, Azure Databricks is a managed implementation of Apache Spark on Azure. With fully managed Spark clusters, it is used to process large data workloads, and it also helps with data engineering, data exploration, and visualizing data using machine learning.
- Analytic platform
- Big data platform
- Data science platform
- Its ability to code in multiple languages in the same notebook. For example, if you create a DataFrame in Python, Azure Databricks lets you register it as a temporary view and then query it from Scala, R, or SQL by referring to that view.
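A minimal sketch of the temporary-view pattern described above, assuming a notebook where the `spark` session object is already defined, as it is in Databricks notebooks. The sample rows, the view name `people`, and both helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical sample data and view name, chosen only for illustration.
ROWS = [("alice", 34), ("bob", 45), ("carol", 29)]
COLUMNS = ["name", "age"]
VIEW_NAME = "people"


def register_people_view(spark):
    """Create a DataFrame in Python and expose it as a temporary view."""
    df = spark.createDataFrame(ROWS, COLUMNS)
    # The view lives only for the current Spark session.
    df.createOrReplaceTempView(VIEW_NAME)
    return df


def people_older_than(spark, min_age):
    """Query the temporary view with SQL instead of the DataFrame API."""
    query = f"SELECT name, age FROM {VIEW_NAME} WHERE age > {int(min_age)}"
    return spark.sql(query)
```

In a notebook you would call `register_people_view(spark)` in a Python cell. A later cell that starts with the `%sql` magic command can then run `SELECT * FROM people`, and a `%scala` cell can read the same view with `spark.table("people")`.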
Apart from its support for multiple languages, this service allows us to integrate easily with many Azure services like Blob Storage, Data Lake Store, SQL Database and BI tools like Power BI, Tableau, etc.
It is a great collaborative platform letting data professionals share clusters and workspaces, which leads to higher productivity.🥇
The screenshot below is the diagram published by Microsoft to explain the Databricks components on Azure:
It offers an interactive workspace that enables data scientists, data engineers and businesses to collaborate and work closely together on notebooks and dashboards.
Databricks Runtime is the set of software artifacts that run on the clusters of machines managed by Databricks. It includes Spark and adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics. For more, see Databricks Runtime.
The Databricks File System (DBFS) is an abstraction layer on top of object storage. It lets you mount storage objects such as Azure Blob Storage and access the data as if it were on the local file system.
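To make the "local file system" idea concrete: on a Databricks cluster, files under `dbfs:/` are typically also reachable through a local `/dbfs` mount, so ordinary file APIs can read mounted storage. The sketch below only shows that path mapping. The helper `dbfs_to_local_path` and the mount name `raw` are hypothetical, and whether the `/dbfs` mount is available depends on how the cluster is configured.

```python
from pathlib import PurePosixPath

DBFS_SCHEME = "dbfs:"
# Assumed local mount point for DBFS on a cluster node.
FUSE_ROOT = PurePosixPath("/dbfs")


def dbfs_to_local_path(dbfs_uri: str) -> str:
    """Map a dbfs:/ URI to the path that local file APIs would use."""
    if not dbfs_uri.startswith(DBFS_SCHEME + "/"):
        raise ValueError(f"not a DBFS URI: {dbfs_uri!r}")
    # Drop the scheme and leading slashes, then re-root under /dbfs.
    relative = dbfs_uri[len(DBFS_SCHEME):].lstrip("/")
    return str(FUSE_ROOT / relative)


# Hypothetical file inside a storage container mounted at /mnt/raw.
print(dbfs_to_local_path("dbfs:/mnt/raw/events.csv"))  # /dbfs/mnt/raw/events.csv
```

On a cluster, the returned path can be passed to plain Python calls such as `open(...)`, while Spark readers accept the `dbfs:/` form directly.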
Azure Data Factory is preferred over Airflow.
- Who has access to the workspace and the data?
- How are groups (infra, data) managed?