Databricks and Spark - amitbhilagude/userfullinks GitHub Wiki
Overview
Faster due to in-memory parallel processing
It follows the master-slave concept, i.e. a driver node and worker nodes. Worker nodes can be scaled out to improve parallel processing.
Languages supported by Spark
Scala
Java
Python
SQL
R
APIs of Spark
RDD
DataFrame: A good option for Python.
Dataset: The latest API, combining the best of RDD and DataFrame, but supported only in Scala and Java.
Narrow transformation and wide transformation: A narrow transformation works within each partition and has little performance impact; a wide transformation requires data to be shuffled across partitions, which has a performance impact. E.g. a group-by followed by a count where the grouped keys are spread across multiple partitions: rows with the same key may have to be moved into a single partition before counting.
Transformations vs. actions: Transformations are lazy; all stages of transformation are executed only when an action is performed.
Clusters
Cluster types
All-purpose cluster / interactive cluster: Good for development.
Job clusters: For scheduled pipelines; created for the job and terminated when it finishes.
Pool: A pool of nodes shared by multiple clusters; keeping nodes available reduces cluster start-up time.
Cluster Mode
Standard mode: For a single user only.
High concurrency: For team collaboration.
Single node: Only a driver node, no worker nodes.
Spot instances
Cost-optimised option that uses spare cloud capacity offered at a discount; such instances can be reclaimed by the provider at short notice.
Advanced options
Credential passthrough: Pass the user's credentials through to the data lake for access control.
Spark config and environment variables: Set cluster-scoped Spark configuration and environment variables here.
Tags
Logging: Deliver cluster logs to another location.
Init scripts: Scripts that run during cluster startup.
APIs
spark.read: Reads files in different formats such as CSV and Parquet.
Options
inferSchema: Automatically infer the schema. Spark scans the data first to determine the schema and then loads it again, so it is not recommended for large datasets.
sep: Declare the separator (delimiter).
schema: Explicitly define the schema.
load: Allows loading multiple files.
Filter Function
Add, drop or rename a column in a DataFrame
Add a new column with a constant value in all rows using the lit function.