Writing to DDB table in PySpark

First, create an external Hive table over the comma-delimited source data in S3:

```
hive> CREATE EXTERNAL TABLE access_ddb_spark (a_col string, b_col string, c_col string, d_col bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://your-s3-bucket/data-set/';
```

Start PySpark with the spark-dynamodb connector on the classpath (the _2.11 suffix must match the Scala version of your Spark build):

```
pyspark --packages com.audienceproject:spark-dynamodb_2.11:1.0.1
```

Inside the PySpark shell, read the Hive table and check a few rows:

```python
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
t = spark.table("default.access_ddb_spark")
t.show(5)
```

```
+------+------+------+-----+
| a_col| b_col| c_col|d_col|
+------+------+------+-----+
|Line 0|field2|field3|   59|
|Line 1|field2|field3|   47|
|Line 2|field2|field3|   37|
|Line 3|field2|field3|   22|
|Line 4|field2|field3|   57|
+------+------+------+-----+
only showing top 5 rows
```
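Equivalently, if you would rather not go through the Hive metastore, the same DataFrame can be built by reading the CSV files directly. This is only a sketch: the bucket path and column names simply mirror the DDL above.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Schema mirrors the Hive DDL: three string columns and one bigint column.
schema = StructType([
    StructField("a_col", StringType()),
    StructField("b_col", StringType()),
    StructField("c_col", StringType()),
    StructField("d_col", LongType()),
])

# Same placeholder S3 location as in the CREATE EXTERNAL TABLE statement.
t = spark.read.csv("s3://your-s3-bucket/data-set/", schema=schema, sep=",")
```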

Created a destination DynamoDB table "sample-spark" in the us-east-1 region, with a single partition key "lineno" of type string and provisioned write capacity of 20,000 WCU (a boto3 sketch of this step follows the list below).

Question: Why do we need to set such a high number of provisioned write capacity units?

Response:

  • In DynamoDB, a single partition can accommodate only up to 1,000 WCU or 3,000 RCU per second.

  • Now, if we simply tried to load all the data at once, the job would throw a provisioned throughput exceeded error, since it would easily hit the per-partition limit.

  • So, I would suggest pre-partitioning the table:

    • If you set 20,000 WCU for the table, DynamoDB will create approximately 21 partitions in the backend.
    • This eliminates the per-partition limit exceeded errors while the initial data load runs.
    • Once the table is created, you may reduce the provisioned capacity to 5.
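For reference, a minimal boto3 sketch of creating such a table. The read capacity value of 5 is an assumption (the walkthrough only fixes the write side), and DynamoDB permissions/credentials are assumed to be configured.

```python
import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")

# Single string partition key "lineno"; high initial WCU so DynamoDB
# pre-splits the table across many partitions before the load starts.
ddb.create_table(
    TableName="sample-spark",
    AttributeDefinitions=[{"AttributeName": "lineno", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "lineno", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 20000},  # RCU value assumed
)

# Block until the table is ACTIVE before starting the Spark write.
ddb.get_waiter("table_exists").wait(TableName="sample-spark")
```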

Once the above step was completed, I reduced the WCU to 5. I also changed the capacity mode to on-demand, which can sustain sudden peaks of incoming writes.
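Either adjustment can be made with boto3 as well. A sketch, assuming the same client setup as in the previous block; the two calls are alternatives depending on which billing model you want.

```python
import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")

# Option 1: stay in provisioned mode and drop the write capacity back to 5 WCU.
ddb.update_table(
    TableName="sample-spark",
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

# Option 2: switch the table to on-demand billing instead, which absorbs
# sudden write peaks without capacity planning.
# ddb.update_table(TableName="sample-spark", BillingMode="PAY_PER_REQUEST")
```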

Finally, write the DataFrame to the DynamoDB table through the connector:

```python
t.write.option("tableName", "sample-spark").option("region", "us-east-1").format("dynamodb").save()
```
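As a quick sanity check (not part of the original steps), the same connector can read the table back:

```python
# Read the DynamoDB table back through the connector and spot-check a few rows.
written = (
    spark.read.option("tableName", "sample-spark")
    .option("region", "us-east-1")
    .format("dynamodb")
    .load()
)
written.show(5)
```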