Writing to DDB table in PySpark
First, create a Hive external table on top of the source data set in S3:

```sql
CREATE EXTERNAL TABLE access_ddb_spark (a_col string, b_col string, c_col string, d_col bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-s3-bucket/data-set/';
```
Then launch PySpark with the spark-dynamodb connector:

```bash
pyspark --packages com.audienceproject:spark-dynamodb_2.11:1.0.1
```
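The `--packages` flag is only needed for the interactive shell. If you run the job as a standalone script instead, the same dependency can be declared on the session itself. The snippet below is a minimal sketch assuming the same connector version and that the cluster can download it from Maven; the app name is hypothetical.

```python
from pyspark.sql import SparkSession

# Sketch: build a session for a standalone script (not the pyspark shell),
# pulling the spark-dynamodb connector via spark.jars.packages and enabling
# Hive support so the external table created above is visible.
spark = (
    SparkSession.builder
    .appName("hive-to-dynamodb")  # hypothetical app name
    .config("spark.jars.packages", "com.audienceproject:spark-dynamodb_2.11:1.0.1")
    .enableHiveSupport()
    .getOrCreate()
)
```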
```python
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
t = spark.table("default.access_ddb_spark")
t.show(5)
```

```
+------+------+------+-----+
| a_col| b_col| c_col|d_col|
+------+------+------+-----+
|Line 0|field2|field3|   59|
|Line 1|field2|field3|   47|
|Line 2|field2|field3|   37|
|Line 3|field2|field3|   22|
|Line 4|field2|field3|   57|
+------+------+------+-----+
only showing top 5 rows
```
Next, I created a destination DynamoDB table named "sample-spark" in the us-east-1 region, with a single partition key "lineno" (type string) and a provisioned write capacity of 20,000 WCU.

Question: Why do we need to set such a high provisioned write capacity?

Response:
- In DynamoDB, a single partition can accommodate only up to 1,000 WCU or 3,000 RCU per second.
- If we simply tried to load all the data at once, the job would throw a provisioned-throughput-exceeded error, because it would hit that per-partition limit very quickly.
- So I would suggest pre-partitioning the table.
- If you set 20,000 WCU on the table, DynamoDB creates approximately 21 partitions in the backend (see the quick estimate after this list).
- That way we avoid per-partition throughput errors during the initial bulk load.
- Once the table is created, you may reduce the provisioned capacity to 5.
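As a rough illustration of where the "approximately 21 partitions" figure comes from: a commonly cited rule of thumb for provisioned tables is `ceil(RCU / 3000 + WCU / 1000)`. DynamoDB's actual partitioning logic is internal and not guaranteed, and the read capacity value below is an assumption, so treat this purely as an estimate.

```python
import math

def estimate_partitions(rcu, wcu):
    """Back-of-the-envelope partition estimate for a provisioned DynamoDB table.

    Mirrors the commonly cited ceil(RCU/3000 + WCU/1000) rule of thumb; the
    real partitioning behaviour is internal to DynamoDB.
    """
    return math.ceil(rcu / 3000 + wcu / 1000)

# 20,000 WCU plus a small assumed read capacity of 5 RCU
print(estimate_partitions(rcu=5, wcu=20000))  # -> 21
```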
Once the step above completed, I reduced the WCU to 5. I also changed the capacity mode to on-demand, which can absorb sudden peaks of incoming writes.
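Both of those capacity changes can also be scripted. The sketch below uses boto3's `update_table` call; note that the two calls are alternatives (a table is either provisioned or on-demand), and capacity-mode switches are limited to once per 24 hours, so pick the one that fits your workload.

```python
import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")

# Option 1: stay on provisioned mode but drop the capacity after the bulk load.
ddb.update_table(
    TableName="sample-spark",
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

# Option 2 (an alternative, not in addition to the above): switch the table to
# on-demand mode so it absorbs bursty writes without capacity planning.
# ddb.update_table(TableName="sample-spark", BillingMode="PAY_PER_REQUEST")
```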
Finally, write the DataFrame to the DynamoDB table through the connector:

```python
t.write \
    .option("tableName", "sample-spark") \
    .option("region", "us-east-1") \
    .format("dynamodb") \
    .save()
```
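To sanity-check the load, the same connector can read the table back into a DataFrame. This is a sketch that mirrors the options used for the write above; the `check` variable name is just for illustration.

```python
# Read the freshly written items back through the same spark-dynamodb connector.
check = (
    spark.read
    .option("tableName", "sample-spark")
    .option("region", "us-east-1")
    .format("dynamodb")
    .load()
)
check.show(5)
print(check.count())  # should match t.count() if the load completed
```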