AWS Glue 500 Errors with Spark Jobs and AWS S3
This error means you have exceeded the Amazon Simple Storage Service (Amazon S3) request rate (3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket). When you receive this error, the S3 service recommends retrying the request from your application, which in this case is your Glue job.
https://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html
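If parts of your job call S3 directly through boto3 (for example, a Glue Python shell job or helper code in an ETL script), the SDK's built-in retry configuration implements this recommendation. A minimal sketch, assuming boto3/botocore are available (they are on Glue by default); the bucket name and prefix are placeholders:

```python
import boto3
from botocore.config import Config

# Retry throttled/5xx responses up to 10 times, backing off adaptively.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

# Placeholder bucket/prefix just to exercise the client.
response = s3.list_objects_v2(Bucket="my-example-bucket", Prefix="output/")
print(response.get("KeyCount", 0))
```

Note that S3 reads/writes issued by Spark itself go through EMRFS, not boto3, and are tuned separately as described below.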
To give some background, Glue currently uses a library based on the AWS S3 Java SDK to build an S3 client and make calls to your S3 bucket. The resolution for S3 500s is described here:
https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/
This article describes increasing the "fs.s3.maxRetries" property so that Spark retries more times when it encounters these errors. The parameter only applies if you are using the s3:// prefix (on the input and output paths of your job), which invokes the EMRFS libraries (com.amazon.ws.emr.hadoop.fs.EmrFileSystem) present on EMR clusters.
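As an alternative to job parameters (described next), the same property can in principle be set from inside the script through the SparkContext's Hadoop configuration, before the first s3:// read or write. This is an untested, best-effort sketch; sc._jsc is an internal PySpark handle:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
# Raise the EMRFS retry count; must happen before any s3:// access.
sc._jsc.hadoopConfiguration().set("fs.s3.maxRetries", "35")

glueContext = GlueContext(sc)
spark = glueContext.spark_session
```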
How do you pass these arguments to your Glue job? (Not tested; provided on a best-effort basis.)
Glue officially doesn't support passing custom Spark/Hadoop parameters.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
However, using --conf as the key, we can work around this and pass the Spark parameters listed at https://spark.apache.org/docs/2.4.3/configuration.html
To set this property in your AWS Glue job, refer to the steps below and set "spark.hadoop.fs.s3.maxRetries" (a scripted boto3 alternative is sketched after the steps):
----
Step 1: In the Glue job console, navigate to 'Script libraries and job parameters (optional)' -> Job parameters, and enter a key/value pair.
Step 2: Enter the following key and value:
--
Key: --conf
Value: spark.hadoop.fs.s3.maxRetries=35
--
----
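If you prefer to set this programmatically rather than in the console, the same key/value pair goes into the job's DefaultArguments. A best-effort boto3 sketch; "my-glue-job" is a placeholder job name:

```python
import boto3

glue = boto3.client("glue")

# Fetch the current definition so required fields can be re-sent.
job = glue.get_job(JobName="my-glue-job")["Job"]

args = job.get("DefaultArguments", {})
args["--conf"] = "spark.hadoop.fs.s3.maxRetries=35"

glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": job["Role"],        # update_job replaces the whole definition,
        "Command": job["Command"],  # so Role and Command must be included again
        "DefaultArguments": args,
    },
)
```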
Note that a Glue job cannot accept two or more separate "--conf" keys, but as a workaround you can chain multiple parameters into a single value in the console:
----
Key: --conf
Value: spark.network.timeout=120s --conf spark.hadoop.fs.s3.maxRetries=35
----
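In the boto3 sketch above, the chained parameters are set the same way, as one value:

```python
args["--conf"] = "spark.network.timeout=120s --conf spark.hadoop.fs.s3.maxRetries=35"
```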
NOTE: Changing Spark parameters is NOT officially recommended and hasn't been approved by the Glue team. Service updates might change this behavior.
Other mitigation strategies:
Glue:
- Check the Spark history UI to get an idea of the point at which the job fails (while reading your input files or while writing your output). You will probably have to check all executor logs (for 500s) to get an overall picture here; a log-filtering sketch follows this list.
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui.html
- Reduce the parallelism of the Spark/Glue job. The more tasks that try to access data in parallel, the more load on the S3 service, and the sooner you cross the S3 limits. To do this, you can decrease the number of DPUs, divide your dataset so that it has fewer partitions, or use df.repartition() in Spark to reduce the number of partitions before you run transformations on an RDD/DataFrame (see the sketch after this list).
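To scan the executor logs for 500s without opening every stream by hand, you can filter the job's CloudWatch log group. A best-effort boto3 sketch; it assumes the default Glue log groups (/aws-glue/jobs/output and /aws-glue/jobs/error) and uses a placeholder job run ID:

```python
import boto3

logs = boto3.client("logs")

# Each Glue job run writes streams prefixed with its run ID (jr_...).
response = logs.filter_log_events(
    logGroupName="/aws-glue/jobs/error",
    logStreamNamePrefix="jr_0123456789abcdef",  # placeholder run ID
    filterPattern='"500"',
    limit=50,
)

for event in response["events"]:
    print(event["logStreamName"], event["message"])
```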
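A sketch of the repartition option, with placeholder paths and an arbitrary partition count:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Placeholder input path.
df = spark.read.parquet("s3://my-example-bucket/input/")

# Fewer partitions means fewer concurrent tasks hitting S3. 10 is an
# arbitrary example; tune it to your data volume. Note repartition()
# triggers a shuffle; coalesce() avoids one when only reducing partitions.
df = df.repartition(10)

df.write.mode("overwrite").parquet("s3://my-example-bucket/output/")
```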
S3:
- Use separate buckets for intermediate data/different applications/roles.
- Use significantly different paths (prefixes) for different datasets in the same bucket; because the request-rate limits above apply per prefix, spreading datasets across prefixes raises your effective limit.
- Check with the S3 team whether they can raise the request-rate limits beyond the defaults mentioned above, and ask whether they can pre-warm your S3 bucket.
- Enable CloudTrail on S3 to log requests and gain insight into how many calls are being made to your S3 buckets from Glue.
https://docs.aws.amazon.com/AmazonS3/latest/dev/cloudtrail-logging.html
https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html
https://aws.amazon.com/premiumsupport/knowledge-center/athena-tables-search-cloudtrail-logs/
- Alternatively, use S3 access logs to gain similar insights.
https://docs.aws.amazon.com/AmazonS3/latest/dev/using-s3-access-logs-to-identify-requests.html
https://aws.amazon.com/premiumsupport/knowledge-center/analyze-logs-athena/
EMRFS:
- Tune other fs.s3.* parameters so that your application makes fewer calls to S3.