DynamoDB Internal 500 System Errors - isgaur/AWS-BigData-Solutions GitHub Wiki

Explanation:

Let's understand when the system error occurs:

An Internal Server Error (a system error) is a server-side error caused entirely by a transient issue on the DynamoDB service side. Transient failures in DynamoDB are expected and can result from network blips, tables undergoing partition splits, master fail-overs, concurrent request handling, resource exhaustion, and similar events.

When DynamoDB receives an API call, the service endpoint hands the request to one of the back-end servers for processing. If that back-end server processes the request successfully, your application receives a 200 OK with very low latency. If that back-end server does not process the request within a reasonable timeout period, the service endpoint hands the request to another back-end server. If that second server processes the request successfully, your application receives a 200 OK with slightly elevated latency.
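The hand-off described above can be pictured with a small toy model. This is only an illustrative sketch, not DynamoDB's actual implementation: `route_request`, the backend callables, and the timeout value are all hypothetical names introduced here, and the final 500 branch is an assumption about what happens when no back-end server succeeds.

```python
def route_request(backends, request, timeout_s):
    """Toy model: try each backend in turn and report (status, latency).

    Each backend is a hypothetical callable returning
    (seconds_taken, succeeded). Nothing here reflects real DynamoDB internals.
    """
    elapsed = 0.0
    for backend in backends:
        took, ok = backend(request)
        if ok and took <= timeout_s:
            # Success: latency includes time already spent on earlier attempts.
            return 200, elapsed + took
        # The endpoint waits at most timeout_s before trying another server.
        elapsed += min(took, timeout_s)
    # Assumption: if every attempt fails, the caller sees a 500.
    return 500, elapsed


# Hypothetical backends for illustration.
fast = lambda request: (0.005, True)     # healthy server
slow = lambda request: (1.0, True)       # server that exceeds the timeout
broken = lambda request: (0.002, False)  # server that fails outright

first_try = route_request([fast], "GetItem", timeout_s=0.05)
second_try = route_request([slow, fast], "GetItem", timeout_s=0.05)
no_luck = route_request([broken, slow], "GetItem", timeout_s=0.05)
```

In this model `first_try` succeeds with low latency, `second_try` still returns 200 but carries the extra time spent waiting on the first server, and `no_luck` ends in a 500.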

DynamoDB requests often fail with an "Internal Server Error" (status code 5xx). Because DynamoDB is a distributed service, such errors are expected, and the service has built-in mechanisms that self-heal once an issue is detected. These errors can have several causes, such as a network issue, a hardware failure, or a partition split that changes mastership between replica nodes.

Conclusion:

In a distributed system as large as DynamoDB, failures such as these are inevitable, but DynamoDB is built to recover from them quickly. The service team is constantly working to make these failures rarer and less impactful, but they still happen from time to time. They can occur for a variety of reasons, including resource contention, node quarantine, or a network timeout.

The elevated latency that accompanies a system error is outside DynamoDB's normal operating thresholds, but it is intermittent: the service has mechanisms in place to detect and respond to such anomalies by monitoring latencies and failures.

Recommendation:

Because these errors cannot be avoided, the only way to handle them is to retry:

Implement retries with an exponential backoff algorithm, so that a request that fails with an internal error is retried until it returns successfully. [2][3]
Use an AWS SDK, since the SDKs implement automatic retry logic. The SDKs also use an exponential backoff algorithm, waiting progressively longer between retries of consecutive error responses, before finally failing a request with Internal Server Error (system error 500).
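For applications that cannot rely on an SDK's built-in retries, the retry recommendation can be sketched as follows. This is a minimal, generic illustration rather than the AWS SDK's actual logic: `InternalServerError`, `call_with_backoff`, and the default delay values are hypothetical, and the randomized ("full jitter") delay is one common variant of exponential backoff chosen here as an assumption.

```python
import random
import time


class InternalServerError(Exception):
    """Hypothetical stand-in for an HTTP 500 response from the service."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on InternalServerError with backoff.

    The delay ceiling doubles after each failed attempt, capped at
    max_delay. The actual wait is drawn uniformly from [0, ceiling] so
    that many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except InternalServerError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

A caller would wrap each request in a zero-argument callable, for example `call_with_backoff(lambda: do_request())`, where `do_request` is whatever hypothetical function issues the request and raises `InternalServerError` on a 500. Only transient server-side errors are retried here; client-side errors such as validation failures should not be retried blindly.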

---------- References ----------

[1] DynamoDB Status Code 5xx - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.MessagesAndCodes.http5xx

[2] Error Retries and Exponential Backoff in AWS - https://docs.aws.amazon.com/general/latest/gr/api-retries.html

[3] Error Handling - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html

[4] Error Handling - Error Retries and Exponential Backoff - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html?shortFooter=true#Programming.Errors.RetryAndBackoff