BD: Spark - dudycooly/1235 GitHub Wiki

Analytics - The process of extracting meaningful insight from the data you have (e.g. Are my sales going to grow? Why are we not meeting the target?)

Batch Processing - Processing data collected over a period of time to answer non-time-critical queries

Real Time Processing - Processing data as it arrives from the field to answer time-critical queries (Spark is not true real time like Storm; it does micro-batching)
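The micro-batching idea noted above can be sketched in plain Python. This is only an illustration, not Spark code: the `micro_batches` helper and the `readings` data are hypothetical, and batches are cut by event count here for simplicity, whereas Spark Streaming cuts them by a time interval.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield successive lists of up to batch_size events from an iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stream of sensor readings arriving one at a time.
readings = [3, 8, 1, 9, 4, 7, 2]

# Each small batch is processed as a unit (here: summed), which is why
# results lag slightly behind the events instead of being truly instantaneous.
batch_totals = [sum(batch) for batch in micro_batches(readings, batch_size=3)]
```

The stream is never processed one event at a time; it is handled as a sequence of very small batch jobs.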

Apache Spark: High Performance Distributed Cluster Computing System meant for both Batch and Real Time processing

Alternatives:

  • Hadoop + MapReduce (Batch Processing)
  • Hadoop + Apache Storm (Real Time Processing)

Applications: May not be a good fit for small-scale data (data that fits on one server)

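Building Blocks:

  • Resilient Distributed Dataset (RDD) - Manages data distribution, computation and co-ordination
  • Directed Acyclic Graph (DAG) or Lineage - Execution flow in the form of a graph: nodes are RDDs and edges are the transformations that produce them
  • SparkContext - Orchestration; the entry point for a Spark application
  • Transformation - Produces a new RDD from a source (e.g. CSV to RDD, RDD to RDD, filter); lazy, so nothing is computed until an action runs
  • Action - Triggers the computation and returns a result (e.g. collect, as in filter followed by collect)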
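To make the split between lazy transformations and actions concrete, here is a toy, pure-Python stand-in. It is not Spark's API: `ToyRDD`, `double` and `calls` are hypothetical names invented for this sketch, and nothing is distributed. It only shows that transformations record lineage while the action is what runs the computation.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source      # original data (a plain list here)
        self._lineage = lineage    # tuple of (step name, function) pairs

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    # Action: replay the recorded lineage over the source and return a list.
    def collect(self):
        data = iter(self._source)
        for name, fn in self._lineage:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)

    def lineage(self):
        """Names of the recorded steps, in execution order."""
        return [name for name, _ in self._lineage]


calls = []  # records every element that double() actually touches


def double(x):
    calls.append(x)
    return x * 2


rdd = ToyRDD([1, 2, 3, 4, 5]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only lineage has been recorded
result = rdd.collect()            # the action triggers the computation
```

In PySpark the corresponding chain would look roughly like `sc.parallelize(data).map(...).filter(...).collect()`, with `sc` being the SparkContext. The recorded lineage is also what lets a lost partition be recomputed from its source.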

Problems addressed: Originally, Hadoop

  • was only meant for batch processing
  • was slow in processing, since MapReduce writes intermediate results to disk between steps (How slow?)
  • had a complicated programming model (MapReduce)

Good Read: https://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html