Spark

Objective:

Document my journey with Spark and learn as much about it as I can.

Steps:

  1. Download a packaged release of Spark:
cd ~
rm -rf ~/spark
mkdir ~/spark
cd ~/spark
wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar -xvzf spark-3.3.1-bin-hadoop3.tgz
rm spark-3.3.1-bin-hadoop3.tgz
cd spark-3.3.1-bin-hadoop3
pwd

  2. Use the API with Python:
cd /Users/cniackz/spark/spark-3.3.1-bin-hadoop3
./bin/pyspark

You should see:

$ ./bin/pyspark
Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/02 10:32:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

Using Python version 3.10.8 (main, Oct 13 2022 09:48:40)
Spark context Web UI available at http://192.168.1.4:4040
Spark context available as 'sc' (master = local[*], app id = local-1669998761326).
SparkSession available as 'spark'.
>>> 
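The banner above notes that the shell pre-creates a SparkContext as 'sc' and a SparkSession as 'spark'. As a quick sanity check (a minimal sketch of my own, not part of the original steps), you can try them directly:

>>> spark.range(3).count()
3
>>> sc.parallelize([1, 2, 3, 4]).sum()
10
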
  3. Have Spark read some text:
  • File: /Users/cniackz/spark/spark-3.3.1-bin-hadoop3/README.md

  • Partial Content:

# Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing,
and Structured Streaming for stream processing.
Then read the file from the PySpark shell:

textFile = spark.read.text("README.md")

You should see:

>>> textFile = spark.read.text("README.md")
>>> textFile.count()
124
>>> textFile.first()
Row(value='# Apache Spark')
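
Continuing in the same shell, a small follow-up sketch (my own addition, still using the same README.md) filters the DataFrame down to the lines that mention Spark:

>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
>>> linesWithSpark.count()  # number of lines containing the word "Spark"
>>> linesWithSpark.first()
Row(value='# Apache Spark')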

Conclusion:

You can play around with a file: get the words, count them, and build programs around it. So what is the purpose? Remember, Spark is an engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

In plain terms, it is an engine for taking your data and processing it in ways that let you squeeze more value out of it. With this basic example, I am done for now.
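
To go one step further than the interactive shell, here is a minimal sketch of a self-contained word-count program (my own example; the file name wordcount.py is hypothetical, and README.md is assumed to sit in the working directory). From the distribution directory it could be launched with ./bin/spark-submit wordcount.py:

# wordcount.py -- a minimal word-count sketch (assumption: run from the Spark distribution directory)
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read README.md as a DataFrame with a single string column named "value"
lines = spark.read.text("README.md")

# Split each line on whitespace and turn every word into its own row
words = lines.select(explode(split(lines.value, r"\s+")).alias("word"))

# Drop empty tokens, then count how many times each word appears
counts = words.filter(words.word != "").groupBy("word").count()

counts.orderBy("count", ascending=False).show(10)
spark.stop()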