Databricks and Spark - amitbhilagude/userfullinks GitHub Wiki
Overview
Faster due to in-memory parallel processing
It follows the master-slave concept, i.e. a driver node and worker nodes. Worker nodes can be scaled out to improve parallel processing.
Languages supported by Spark
Scala
Java
Python
SQL
R
APIs of Spark
RDD
DataFrame: A good option for Python.
Dataset: The latest API, combining the best of RDD and DataFrame, but supported only in Scala and Java.
Narrow transformation and wide transformation: A narrow transformation works within each partition and has little performance impact; a wide transformation requires data to be shuffled across partitions, which has a performance impact. E.g. a group-by followed by a count where the grouped keys are spread across multiple partitions: rows with the same key may have to be moved into a single partition before counting.
Transformations vs. actions: Transformations are lazy; all stages of transformation are executed only when an action is performed.
Clusters
Cluster types
All-purpose cluster / interactive cluster: Good for development.
Job clusters: For scheduled pipelines; created for the job and terminated when it finishes.
Pool: A pool of nodes shared by multiple clusters; keeping nodes available reduces cluster start-up time.
Cluster Mode
Standard mode: For a single user only.
High concurrency: For team collaboration.
Single node: Only a driver node, no worker nodes.
Spot instances
Cost-optimised option that uses spare cloud capacity offered at a discount; such instances can be reclaimed by the provider at short notice.
Advanced options
Credential passthrough: Pass the user's credentials through to the data lake for access control.
Spark config and environment variables: Set cluster-scoped Spark configuration and environment variables here.
Tags
Logging: Deliver cluster logs to another location.
Init scripts: Scripts that run during cluster startup.
APIs
spark.read: Reads files in different formats such as CSV and Parquet.
Options
inferSchema: Automatically infer the schema. Spark scans the data first to determine the schema and then loads it again, so it is not recommended for large datasets.
sep: Declare the separator (delimiter).
schema: Explicitly define the schema.
load: Allows loading multiple files.
Filter Function
Add, drop or rename a column in a DataFrame
Add a new column with a constant value in all rows using the lit function.