pyspark - jjin-choi/study_note GitHub Wiki

How to remove duplicates in pyspark
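The note leaves this heading without a body, so what follows is only a minimal sketch of one common approach, assuming the built-in `DataFrame.dropDuplicates` method is what is meant. The helper name `drop_duplicate_rows`, the `demo` function, and the sample data are hypothetical and not from the original note.

```python
def drop_duplicate_rows(df, subset=None):
    """Return a new DataFrame with duplicate rows removed.

    subset: optional list of column names. When given, rows are compared on
    those columns only and one row per distinct combination is kept.
    """
    if subset is None:
        return df.dropDuplicates()       # compare entire rows
    return df.dropDuplicates(subset)     # compare only the listed columns


def demo():
    # pyspark is imported here so the helper above can be loaded without it.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("dedup-demo")           # hypothetical app name
        .getOrCreate()
    )
    # Hypothetical sample data: rows 1 and 2 are exact duplicates.
    df = spark.createDataFrame(
        [(1, "a"), (1, "a"), (1, "b"), (2, "c")],
        ["id", "tag"],
    )
    full = drop_duplicate_rows(df)            # distinct whole rows
    by_id = drop_duplicate_rows(df, ["id"])   # one row per id
    counts = (full.count(), by_id.count())
    spark.stop()
    return counts
```

With pyspark installed, `demo()` should return `(3, 2)`: three distinct rows overall and two distinct `id` values. When a subset is given, which of the duplicate rows survives is not guaranteed, so sort or aggregate first if a specific row must be kept.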

How to connect to multiple DBs in pyspark

from pyspark.sql import SparkSession
 
# MongoDB
# The builder chain is wrapped in parentheses so it can span several lines.
spark = (
    SparkSession.builder
    .appName('mongoDB')
    .config('spark.mongodb.input.uri', 'mongodb://10.230.74.24:27020/dvlr.sim_log')
    .config('spark.mongodb.output.uri', 'mongodb://10.230.74.24:27020/dvlr.sim_log')
    .getOrCreate()
)
df = (
    spark.read.format('mongo')
    .option('uri', 'mongodb://10.230.74.24:27020/dvlr.sim_log')
    .load()
)

# PostgreSQL
# getOrCreate() returns an already running session if one exists, so the
# spark.jars setting below may only take effect in a fresh session.
spark = (
    SparkSession.builder
    .appName('Pyspark connected with Postgre')
    .config('spark.jars', 'postgresql-42.2.23.jar')
    .getOrCreate()
)
df = (
    spark.read.format('jdbc')
    .option('url', 'jdbc:postgresql://10.230.74.162:5432/ibuilder')
    .option('dbtable', 'NRE_LICENSE')
    .option('user', 'jongwoo6969')
    .option('password', 'fbwhddn77^^')
    .option('driver', 'org.postgresql.Driver')
    .load()
)
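Because `getOrCreate()` reuses a running session, one way to read from both databases in the same program is to configure a single session up front and point each reader at its own source. The sketch below assumes that setup. The function names (`build_session`, `read_mongo`, `read_postgres`) and the `mongo_package` parameter are hypothetical. The `'mongo'` format name and the `spark.mongodb.input.uri` key are taken from the snippet above and belong to the older MongoDB connector releases, so they are an assumption about which connector version is installed.

```python
def build_session(app_name, mongo_uri, jdbc_jar, mongo_package):
    """Create one SparkSession configured for both MongoDB and PostgreSQL.

    jdbc_jar:      path to the PostgreSQL JDBC driver jar
    mongo_package: Maven coordinates of the MongoDB Spark connector
                   (assumption: a build matching your Spark/Scala version)
    """
    # pyspark is imported here so the reader helpers below load without it.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars", jdbc_jar)
        .config("spark.jars.packages", mongo_package)
        .config("spark.mongodb.input.uri", mongo_uri)
        .config("spark.mongodb.output.uri", mongo_uri)
        .getOrCreate()
    )


def read_mongo(spark, uri):
    """Load one MongoDB collection; the uri names the database and collection."""
    return spark.read.format("mongo").option("uri", uri).load()


def read_postgres(spark, url, table, user, password):
    """Load one PostgreSQL table through the JDBC data source."""
    return (
        spark.read.format("jdbc")
        .option("url", url)
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .option("driver", "org.postgresql.Driver")
        .load()
    )
```

With this layout the two results can live side by side as separate variables (for example `mongo_df` and `pg_df`) instead of both being assigned to `df`. The password can also be passed in from an environment variable rather than written into the script.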

pyspark function

  • collect() : returns every row of the DataFrame (as a list of Row objects on the driver)
import pandas as pd

# df : Spark DataFrame
# Pass the column names explicitly so the pandas DataFrame keeps them.
pddf = pd.DataFrame(df.collect(), columns=df.columns)
  • cache() : keeps a frequently used DataFrame in memory by calling cache() on it
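The two functions above can be sketched together as follows. This is only an illustration: the helper names `rows_as_dicts` and `cache_and_materialize`, the `demo` function, and the sample data are hypothetical. It relies on `cache()` being lazy, so the sketch runs an action (`count()`) right after it to actually fill the cache.

```python
def rows_as_dicts(df):
    """collect() pulls every row to the driver; asDict() turns each Row into a dict."""
    return [row.asDict() for row in df.collect()]


def cache_and_materialize(df):
    """Mark df for caching, then run an action so the cache is populated."""
    cached = df.cache()   # lazy: nothing is stored yet
    cached.count()        # first action computes the data and stores it
    return cached


def demo():
    # pyspark is imported here so the helpers above can be loaded without it.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("collect-cache-demo")   # hypothetical app name
        .getOrCreate()
    )
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])
    cached = cache_and_materialize(df)
    records = rows_as_dicts(cached)      # served from the cached data
    cached.unpersist()                   # release the memory when done
    spark.stop()
    return records
```

Since `collect()` copies the whole result into driver memory, it is best kept for small results. For a direct conversion to pandas, `df.toPandas()` is the built-in alternative and preserves column names.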