
Introduction to Spark

Β§ Why Distributed Computing?

  • Big data

    • Volume : Since the size of our data is growing, we need larger data stores to store that data, and we need ways to run computation across those larger datasets.
    • Velocity : As we have more and more mechanisms that can produce data, it is arriving in data pipelines at a faster and faster rate.
    • Variety : This ranges from numeric and textual data to images and video streams.
    • Veracity : How much do we trust the data we have? Some data arriving in our systems might have missing values or might otherwise be inaccurate, as with user-generated data.
  • Apache Spark : Original Story

    • Fast, general-purpose system
    • Distributes computation across a cluster of machines
  • Spark Architecture

    • One driver : Optimizes queries and delegates tasks
    • One or many executors : Perform the actual work. More executors are not always faster
    • Slot and task :
      • Slot : a unit of parallelism; each slot runs one task at a time
      • Task : a unit of work the driver delegates to an executor's slot
  • Parallelism and Scalability

    • Amdahl's Law and the limits of linear scalability : the amount of speedup we would see from parallelizing a task is a function of what portion of the task can be completed in parallel (see the formula after this list).
  • When and where to use ?

    • Scale out : if you have too much data to process on a single machine.
    • Speed up : even if your data fits on a single machine, you might benefit from speeding up your query by adding more computing resources.
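
For reference, here is Amdahl's Law in formula form (a standard statement added for clarity, not from the original notes), where p is the fraction of the task that can run in parallel and n is the number of processors:

```latex
% Amdahl's Law: upper bound on speedup when the fraction p
% of a task is parallelized across n processors.
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}
% As n grows, S(n) approaches 1 / (1 - p): the serial portion
% caps the benefit, which is why more executors are not always faster.
```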

Β§ Spark DataFrames

  • Learning Objective : Be able to explain the difference between the RDD and DataFrame APIs within Spark.
    • Spark 1.3 introduced the DataFrame API, which provides more functionality and optimization on top of the RDD API.
    • RDD (Resilient Distributed Datasets) :
      • Resilient : fault-tolerant
      • Fault tolerance is achieved through a DAG (Directed Acyclic Graph)
      • The DAG is a series of transformations applied to the data, and it is acyclic because you cannot change any of the transformations that came before you in the graph.
      • Distributed dataset component : the data is distributed and stored across multiple nodes in your cluster.
      • Computed across multiple nodes
      • Results are aggregated by the driver
    • DataFrame API : inherits most of the RDD properties (resilient + distributed) + __metadata__
      • Metadata : beyond storing the data and the DAG of transformations to apply, a DataFrame also knows metadata about the dataset, such as the number of columns and their data types.
      • In other words, you can think of it as similar to an Excel or CSV file, where a data type can be specified at the top of each column (column type).
    • Spark : is not a database. It is a compute engine that can read from databases.
      • The data inside Spark is ephemeral; the source of truth remains in the underlying storage, so even if the Spark cluster goes down, the data is not lost.
      • You can think of it this way : if one of your friends (an executor) goes out for lunch and leaves your Spark cluster, you have not lost the data that friend was responsible for.
      • A DataFrame is not a SQL table, nor an Excel or CSV file. It is an abstraction on top of these underlying data sources.
      • The analogy for Catalyst : when you use the DataFrame API, you specify what you want to be done, not how you want it to be done.
    • Spark DataFrame Execution
      • A query starts as an unresolved logical plan, before look-up in the data catalog
      • Catalyst then resolves references against the catalog and creates a logical plan, followed by an optimized logical plan
      • Physical plans are generated, compared with a cost model, and the selected plan is compiled down to RDDs
      • Link: Spark DataFrame Execution
      • Besides Catalyst, Project Tungsten (efficient memory and CPU management) also contributes to why DataFrames perform better than RDDs
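
A minimal sketch of how to see Catalyst's plans from PySpark. It assumes PySpark is installed locally and uses a hypothetical people.csv file; the point is that .explain() prints the plans Catalyst derived from what we asked for, not how we asked for it to be done:

```python
# Minimal sketch: build a DataFrame query and ask Catalyst for its plans.
# "people.csv" (with age and city columns) is a hypothetical input file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# DataFrames carry metadata (column names and types), unlike raw RDDs.
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# Declarative: we say WHAT we want (filter + aggregate), not HOW.
result = df.filter(df["age"] > 30).groupBy("city").count()

# Prints the parsed (unresolved), analyzed, and optimized logical plans,
# plus the selected physical plan that Catalyst produced.
result.explain(extended=True)
```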

Β§ Databricks Environment

  • Databricks Environment : a unified analytics platform that enables data science and engineering teams to run all analytics in one place.
    • This includes running reports, powering dashboards, and running extract-transform-load (ETL) jobs, where new data is cleaned and inserted into databases. On Databricks you can also run machine learning and streaming jobs.
    • Most important to us in this course is the hosted notebook environment. This means that we can interact with our data in real time by running cells of code hosted on a Databricks server.
      • This code will actually be executing against a Spark cluster.
      • Spark can be tricky to set up, since it involves networking together different machines.
      • Databricks is going to manage the installation and setup for us so that we can focus on doing our analytics in SQL.
    • In practice though, Spark is about scaling computation.
    • So Community Edition allows us to prototype code but not quite unleash the full power of distributed computation.
    • To get started, navigate to Databricks.com

Β§ Pandas UDF for PySpark

Returning to the main topic : Pandas UDFs are classified into three types depending on the form of their input and output.

  Name                     Input               Output
  Scalar UDFs              pandas.Series       pandas.Series
  Grouped Map UDFs         pandas.DataFrame    pandas.DataFrame
  Grouped Aggregate UDFs   pandas.Series       scalar
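
A minimal sketch of the first and third types, assuming Spark 2.3+ with PyArrow installed and an active SparkSession named spark (the PandasUDFType constants correspond to the three-way classification above; Spark 3 also offers an equivalent type-hint style):

```python
# Minimal sketch of Scalar and Grouped Aggregate Pandas UDFs.
# Assumes Spark >= 2.3 with PyArrow and a SparkSession `spark`.
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# Scalar UDF: pandas.Series in, pandas.Series out.
# Runs on batches of rows instead of one row at a time.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

# Grouped Aggregate UDF: pandas.Series in, one scalar out per group.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_v(v: pd.Series) -> float:
    return v.mean()

df.withColumn("v_plus_one", plus_one(df["v"])).show()
df.groupBy("id").agg(mean_v(df["v"]).alias("mean_v")).show()
```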
