Spark - jjin-choi/study_note GitHub Wiki
Β§ Why Distributed Computing?
-
Big data
- Volume : Since the size of our data is growing, we need larger data stores to store that data, and we need ways to run computation across those larger datasets.
- Velocity : As we have more and more mechanisms that can produce data, it is arriving in data pipelines at a faster and faster rate.
- Variety : This ranges from numeric and textual data to images and video streams.
- Veracity : how much do we trust the data that we do have? Some data arriving in our systems might have missing values or it might otherwise be inaccurate. For instance, with user-generated data.
-
Aparche Spark : Original Story
- Fast, general-purpose system
- distributes computation across a cluster of machines
-
Spark Architecture
- One driver : Optimizes queries and Delegates task
- One or many executors : Perform actual queries. More is not always faster
- Slot and task :
- Slot : a unit of parallelism
-
Parallelism and Scalability
- Amdahl's Law of linear scalability : μ£Όμ΄μ§ μμ
μ λ³λ ¬νν΄μ λ³Ό μ μλ μλ ν₯μμ μμ κ·Έ μμ
μ μμ λ³λ ¬λ‘ κ³μ°ν μ μλ ν¨μ ?
- The amount of acceleration we would see from parallelizing a task is a function of what portion of the task can be completed in parallel.
- Amdahl's Law of linear scalability : μ£Όμ΄μ§ μμ
μ λ³λ ¬νν΄μ λ³Ό μ μλ μλ ν₯μμ μμ κ·Έ μμ
μ μμ λ³λ ¬λ‘ κ³μ°ν μ μλ ν¨μ ?
-
When and where to use ?
- Scale out : if you have too much data to process on a single machine.
- Speed up : if your data can't fit on a single machine, you might benefit from speeding up your query by adding more computing resources.
Β§ Spark DataFrames
- Learning Objective : RDD μ DataFrame API within Spark μ μ°¨μ΄μ μ μ€λͺ
ν μ μμ.
- Version Spark 1.3 - RDD API μ μλ¨μ λ λ§μ κΈ°λ₯κ³Ό μ΅μ νλ₯Ό μ 곡νλ DataFrame API λ₯Ό λμ νμ.
- RDD (Resilient Distributed Datasets) :
- Resilient (νλ ₯μ ) : Fault-tolerant (λ΄κ²°ν¨μ±)
- λ΄κ²°ν¨μ±μ μμ±μ μννλ λ°©λ²μ, DAG (Directed Acyclic Graph) λ°©ν₯μ± λΉμν κ·Έλν
- λ°μ΄ν°μ μ μ©νλ μΌλ ¨μ transformation μ΄μ§λ§, you cannot change any of the transformations that came before you in this graph. (μ΄μ μ μ 곡λ λ³νμ λ³κ²½ν μ μλ λΉμν)
- Distributed dataset component : in which the data is distributed and stored across multiple nodes in your cluster.
- ν΄λ¬μ€ν°μ μ¬λ¬ λ Έλμ λΆμ°λκ³ μ μ₯λλ λΆμ° λ°μ΄ν° μ§ν© κ΅¬μ± μμ
- Computed across multiple notes
- Results are aggregated by the driver
- DataFrame API : RDD μμ±μ λλΆλΆμ μμν¨. (resilient + distributed) +
__metadata__
- Metadata : Number of columns, Data types : μ¦ λ°μ΄ν°λ₯Ό μ μ₯νκ³ ν΄λΉ λ°©ν₯ λΉμν κ·Έλνλ₯Ό μ¬μ©νμ¬ μ μ©ν λ³νμ μλ κ² μΈμλ λ°μ΄ν°μ metadata about the number of columns in your dataset and the datatypes
- μ¦ Excel μ΄λ csv νμΌκ³Ό μ μ¬νλ€κ³ μκ°ν μ μμ. μ΄ λ§¨ μμ data type μ μ§μ ν μ μμ (call type)
- Spark : is not a database. It's compute engine that can read from databases.
- But data is ephemeral (μΌμμ ). μ¦, spark cluster κ° λ€μ΄λλλΌλ data λ₯Ό μμ§ μμ.
- You can think about it if one of your friends goes out for lunch from your Spark cluster, leaves your cluster, you haven't lost the data that friend was responsible for.
- DataFrame μ SQL table λ μλκ³ excel μ΄λ csv λ μλ. κ·Έκ²μ abstractions on top of these underlying data sources
- κΈ°λ³Έ λ°μ΄ν° μμ€ μμ μΆμν ?
- The analogy I'm going to give you when we're talking about Catalyst is that when you're using the DataFrame API, you specify what you want to be done not how you want it to be done.
- Spark DataFrame Execution
- Unresolved logical plan before look-up in data catalog
- Then Catalyst resolves them and creates a logical plan
-
- Link: Spark DataFrame Execution
- Catalyst μΈμλ DataFrame μ΄ RDD λ³΄λ€ ν¨κ³Ό μλ μ΄μ μ κΈ°μ¬ν Project Tungsten μ΄ μμ
Β§ Databricks Environment
- Databricks Environment : a unified analytics platform that enables data science and engineering teams to run all analytics in one place.
- This includes running reports, empowering dashboards, as well as running extract transform load jobs known as ETL. This is where new data is cleaned and inserted into databases. On Databricks you can also run machine learning and streaming jobs as well.
- Most important to us in this course is the hosted notebook environment. This means that we can interact with our data in real time by running cells of code hosted on a Databricks server.
- This code will actually be executing against a Spark cluster.
- Spark can be tricky to set up, since it involves networking together different machines.
- Databricks is going to manage the installation and setup for us so that we can focus on doing our analytics in SQL.
- In practice though, Spark is about scaling computation.
- So Community Edition allows us to prototype code but not quite unleash the full power of distributed computation.
- navigate to Databricks.com
============= Β§ Pandas UDF for pyspark
λ³Έλ‘ μΌλ‘ λμμ, Pandas UDFλ μΈνκ³Ό μμνμ ννμ λ°λΌ 3κ°μ§λ‘ λΆλ₯λλ€.
Name Input Output Scalar UDFs pandas.Series pandas.Series Grouped Map UDFs pandas.DataFrame pandas.DataFrame Grouped Aggregate UDFs pandas.Series scala
-
μ°Έκ³ μλ£ : https://chioni.github.io/posts/pandasudf/ https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
-
ImportError: PyArrow >= 0.15.1 must be installed; however, it was not found.
- Pyarrow μ€μΉνκΈ°
-
spark νλ : https://gritmind.blog/2020/10/16/spark_tune/
-
pandas UDF : https://chioni.github.io/posts/pandasudf/
Β§