ICP_02 : Spark Programming - acikgozmehmet/BigDataAnalyticsAndApplications GitHub Wiki

Objectives:

We will focus on installing Spark and getting familiar with Big Data Analytics and Applications programming concepts.

Spark

  • Spark is an open-source cluster computing framework similar to Hadoop, developed at the University of California, Berkeley. Its main uses include:
    • Machine learning
    • Stream processing (Spark Streaming)
    • Faster batch processing
  • Spark keeps distributed datasets in memory, which speeds up iterative workloads as well as interactive queries.
  • Spark is complementary to Hadoop and can run alongside it on top of the Hadoop Distributed File System (HDFS).
  • Spark supports building large-scale, low-latency data analytics applications.
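
The in-memory point above can be illustrated with a small sketch. It assumes an existing RDD is passed in; the function name `scaled_sums` and the `rounds` parameter are hypothetical and chosen only for illustration. The RDD is cached once, so each later pass reads it from memory instead of recomputing it.

```python
def scaled_sums(rdd, rounds=3):
    """Run several passes over the same RDD, reusing the cached copy.

    `rdd` is assumed to be a PySpark RDD of numbers; `rounds` is an
    illustrative parameter, not something defined by the lab.
    """
    cached = rdd.cache()  # mark the RDD to be kept in memory once computed
    try:
        # Each pass triggers an action (sum); after the first one,
        # Spark can serve the data from the in-memory copy.
        return [
            cached.map(lambda x, k=k: x * k).sum()
            for k in range(1, rounds + 1)
        ]
    finally:
        cached.unpersist()  # release the cached partitions when done
```

With a running session, a call might look like `scaled_sums(spark.sparkContext.parallelize([1, 2, 3]), rounds=2)`, where `spark` is a SparkSession you have already created.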

In Class Programming

1. Spark integration with Colab (or the IDE that you are using)
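
One possible way to do this in Colab is sketched below. It assumes PySpark has been installed in the runtime (for example with `!pip install pyspark` in a notebook cell). The helper names `pyspark_installed` and `get_spark`, the application name, and the `local[*]` master are illustrative choices, not part of the lab instructions.

```python
import importlib.util


def pyspark_installed() -> bool:
    """Return True if the pyspark package can be found in this runtime."""
    return importlib.util.find_spec("pyspark") is not None


def get_spark(app_name: str = "ICP_02", master: str = "local[*]"):
    """Create (or reuse) a local SparkSession.

    The default app_name and master are assumptions for a single-machine
    Colab or IDE setup; adjust them for a real cluster.
    """
    if not pyspark_installed():
        raise RuntimeError(
            "pyspark is not installed; in Colab run `!pip install pyspark` first"
        )
    # Imported here so the check above can fail with a clear message.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(master).appName(app_name).getOrCreate()
```

In a notebook cell, `spark = get_spark()` followed by `spark.range(5).show()` is a quick way to confirm that the session works.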

2. Create a well-commented Spark program that produces the correct results and writes them to an output file.
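
As a minimal sketch of such a program, the word count below reads a text file, counts words, and writes the result to an output directory. The task itself does not specify the program, so word count is only an assumed example; `tokenize`, `word_count`, and both path parameters are hypothetical names.

```python
import re
from typing import List


def tokenize(line: str) -> List[str]:
    """Lower-case a line and return its words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())


def word_count(input_path: str, output_dir: str) -> None:
    """Count words in input_path and save 'word<TAB>count' lines to output_dir.

    Both paths are supplied by the caller; output_dir must not exist yet,
    because saveAsTextFile refuses to overwrite an existing directory.
    """
    from pyspark.sql import SparkSession  # requires pyspark to be installed

    spark = (
        SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()
    )
    sc = spark.sparkContext

    counts = (
        sc.textFile(input_path)               # one record per line of text
        .flatMap(tokenize)                    # split every line into words
        .map(lambda word: (word, 1))          # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)      # add up the counts per word
        .sortBy(lambda kv: kv[1], ascending=False)  # most frequent first
    )

    # Spark writes one part-file per partition inside output_dir.
    counts.map(lambda kv: f"{kv[0]}\t{kv[1]}").saveAsTextFile(output_dir)
    spark.stop()
```

A call such as `word_count("input.txt", "output")` would then produce part-files under `output`; the file names here are placeholders.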

Results

Recording

Please click on the link to see the recording
