AWS Glue 500 Errors with Spark Jobs using Amazon S3

This error means you have exceeded the Amazon Simple Storage Service (Amazon S3) request rate (3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket). When you receive this error, the S3 service recommends retrying the request from your application, in this case your Glue job.

https://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html
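
If your job also makes direct S3 calls outside of Spark (for example with boto3 in the Glue script), a minimal sketch of enabling more aggressive client-side retries is below; the bucket, key, and retry values are placeholders, not recommendations.

    import boto3
    from botocore.config import Config

    # Retry configuration for the S3 client: "adaptive" mode adds client-side
    # rate limiting on top of retries with backoff.
    s3 = boto3.client(
        "s3",
        config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
    )

    # This call is now retried automatically on throttling / 5xx responses.
    # Bucket and key are placeholders.
    response = s3.get_object(Bucket="my-example-bucket", Key="input/part-00000.csv")
    print(response["ContentLength"])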

As background, Glue currently uses a library based on the AWS S3 Java SDK to build an S3 client and make calls to your S3 bucket. The resolution for S3 500 errors is described here:

https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/

The article talks about increasing the "fs.s3.maxRetries" property to further increase retries when Spark encounters these errors. This parameter only applies if you are using the s3:// prefix (on the input and output of your job), which invokes the EMRFS library (com.amazon.ws.emr.hadoop.fs.EmrFileSystem) present on EMR clusters.
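
For reference, a Glue Spark script only goes through EMRFS when it reads and writes with the s3:// scheme; a minimal sketch (bucket names and paths are placeholders) is:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session

    # s3:// paths are handled by EMRFS (com.amazon.ws.emr.hadoop.fs.EmrFileSystem),
    # so fs.s3.* settings such as fs.s3.maxRetries apply to these reads and writes.
    # Bucket names and paths are placeholders.
    df = spark.read.parquet("s3://my-input-bucket/input/")
    df.write.mode("overwrite").parquet("s3://my-output-bucket/output/")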


How to pass these arguments to your Glue job? (Not tested; provided on a best-effort basis)

Glue does not officially support passing custom Spark/Hadoop parameters.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html

However, using --conf as a job parameter we can work around this and pass the Spark parameters listed here: https://spark.apache.org/docs/2.4.3/configuration.html

To set these properties in an AWS Glue job, refer to the steps below and set "spark.hadoop.fs.s3.maxRetries":

    ----
    Step 1: Navigate to 'Script libraries and job parameters (optional)' in the Glue job console -> Job parameters -> enter a key/value pair.
    Step 2: Enter the following key and value:
    --
    Key: --conf  Value: spark.hadoop.fs.s3.maxRetries=35
    --
    ----

    Actually, a Glue job cannot accept two or more separate "--conf" keys, but as a workaround you can chain them into a single value in the console, as shown below.

    ----
    Key: --conf
    Value: spark.network.timeout=120s --conf  spark.hadoop.fs.s3.maxRetries=35
    ----
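
    The same key/value pair can also be set programmatically, for example with boto3 when the job is created or updated outside the console. This is an untested sketch under the same caveat as the NOTE below; the job name, role ARN, and script location are placeholders.

    import boto3

    glue = boto3.client("glue")

    # Chain several Spark settings into the single "--conf" value, because a Glue
    # job only accepts the "--conf" key once. All names/ARNs below are placeholders.
    glue.update_job(
        JobName="my-glue-job",
        JobUpdate={
            "Role": "arn:aws:iam::123456789012:role/MyGlueJobRole",
            "Command": {
                "Name": "glueetl",
                "ScriptLocation": "s3://my-script-bucket/scripts/job.py",
            },
            "DefaultArguments": {
                "--conf": "spark.network.timeout=120s --conf spark.hadoop.fs.s3.maxRetries=35",
            },
        },
    )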

    NOTE: Changing Spark parameters is NOT officially recommended and has not been approved by the Glue team. Service updates might change this behavior.

Other mitigation strategies:

Glue:

- I recommend checking the Spark history UI to get an idea of the point at which the job is failing (while reading your input files or while writing your output). You will probably have to check all executor logs (for 500s) to get an overall picture here.
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui.html
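
For the Spark history UI to be available for a Glue job, Spark event logging has to be enabled through job parameters (see the linked documentation). For example, with a placeholder S3 path for the event logs:

    ----
    Key: --enable-spark-ui          Value: true
    Key: --spark-event-logs-path    Value: s3://your-bucket/sparkHistoryLogs/
    ----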

EMRFS:

  - Tune other fs.s3.* parameters so that your application makes fewer calls to S3.
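
Before tuning, it can help to confirm which fs.s3.* values are currently in effect inside the job. A rough sketch is below; note that _jsc is an internal PySpark attribute rather than a public API, and fs.s3.maxRetries is just an example key.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the effective Hadoop/EMRFS configuration through the underlying JVM
    # handle (_jsc is internal to PySpark, not a public API).
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    print("fs.s3.maxRetries =", hadoop_conf.get("fs.s3.maxRetries"))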