Frequently Asked Questions: Autotuner for Apache Spark

General

Why would someone use the Autotuner for Apache Spark?

When choosing Spark configuration and infrastructure options for a job, developers face a vast number of possible combinations. Selecting the right combination by hand every time is impractical, and a poor choice leads to failed jobs, increased cost, and reduced performance. The Sync Autotuner for Apache Spark predicts cost and runtime across thousands of possible combinations, solving this problem in seconds.

How is this autotuner better than previous solutions?

What platforms does the Autotuner work with?

The Autotuner Beta supports Apache Spark running on AWS EMR. This does not yet include EKS or Serverless support. We will soon be launching support for Databricks on AWS.

What information do you need from users?

To run the Autotuner, we need your cluster information together with the Spark event log from the most recent successful run of the job. If you aren't sure how to access this information, please see our Client Tools.
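As a quick sanity check before submitting a log, the sketch below scans a Spark event log for the application-end event and for failed jobs. It relies only on the standard event log layout (one JSON object per line, with an "Event" field and a "Job Result" entry on job-end events). The function name `summarize_event_log` is hypothetical, and this is not the Autotuner's own validation; it also assumes the log is uncompressed.

```python
import json


def summarize_event_log(lines):
    """Scan Spark event log lines (one JSON object per line).

    Returns whether the application-end event was seen and how many
    jobs succeeded or failed. Hypothetical helper, for illustration only.
    """
    app_ended = False
    jobs_succeeded = 0
    jobs_failed = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        name = event.get("Event")
        if name == "SparkListenerApplicationEnd":
            app_ended = True
        elif name == "SparkListenerJobEnd":
            result = event.get("Job Result", {}).get("Result")
            if result == "JobSucceeded":
                jobs_succeeded += 1
            else:
                jobs_failed += 1
    return {
        "app_ended": app_ended,
        "jobs_succeeded": jobs_succeeded,
        "jobs_failed": jobs_failed,
    }
```

To use it on a file, open the log and pass the file object, for example `summarize_event_log(open(path))`, where `path` points at your own event log. A log with `app_ended` set to True and `jobs_failed` equal to 0 is a reasonable candidate for a "successful run".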

How long does it take to get results back?

This depends on the size of your log file, but in most cases results are returned in less than 10 minutes. We are working on speeding this up so that results come back in seconds.

How long are the recommended configurations valid for?

Many elements of a data pipeline can change daily, such as data size, data skew, the code itself, spot pricing, and spot availability. Because of this, we recommend running the Autotuner right before you run your real job, so that the recommendation is not based on stale inputs. Eventually, the Autotuner is meant to be run every time your production Spark job runs.

Where does your pricing information come from for the current and predicted cost?

We query the AWS public APIs as part of the prediction process to ensure we have the most current on-demand and spot pricing. We know many users have special vendor discounts, as well as Reserved Instances (RIs) and Savings Plans, that affect on-demand costs. Eventually we will take these into account when reporting costs.

Do you recommend Spot instances?

Currently we assume that you use on-demand instances for master (driver) nodes and spot instances for worker nodes. In the future we plan to add user settings for on-demand and spot preferences, as well as options for instance fleet settings.

How do you take into account spot frequency of interruption in your predictions?

We check this at runtime, and won't recommend highly interruptible instance types. This is one of the reasons you should run the Autotuner as close to your job run as possible, as these rates change frequently.

Do you have an API?

The user API is under development, and we anticipate having it available soon!

Troubleshooting

There was an error when submitting my cluster config and/or log information, what do I do?

Please email [email protected], or open an issue: https://github.com/synccomputingcode/user_documentation/issues

My Spark event logs contain sensitive data that I do not want to share outside my organization, what are my options?

We have an open source log parser that will remove sensitive information, and only include the info that our Autotuner needs. You can access the parser here: https://github.com/synccomputingcode/spark_log_parser

Can I just implement a few of the recommendations, or do I need to implement all of them?

Every setting in a recommendation must be implemented as given to achieve the predicted results. The settings are tuned together and should not be treated as independent recommendations.
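One way to avoid applying only part of a recommendation is to generate the job submission from the full set of settings. The sketch below is a minimal illustration: `build_spark_submit_args` is a hypothetical helper, the property values are made up, and the dictionary is not the Autotuner's actual output format. It uses only the standard `spark-submit --conf key=value` syntax, and it covers Spark properties only, not infrastructure choices such as instance types.

```python
def build_spark_submit_args(recommended_conf, app_path):
    """Build a spark-submit argument list that applies every recommended
    Spark property as a --conf flag. Hypothetical helper, illustration only."""
    args = ["spark-submit"]
    for key in sorted(recommended_conf):  # sorted for a stable, reviewable order
        args += ["--conf", f"{key}={recommended_conf[key]}"]
    args.append(app_path)
    return args


# Made-up values; real ones come from your own recommendation.
example_conf = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
}
example_args = build_spark_submit_args(example_conf, "my_job.py")
```

The resulting list can be passed to `subprocess.run` or joined into a shell command.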

The recommended instances have low availability, and my job couldn’t start.

We check spot interruption rates when the Autotuner runs, and won't recommend highly interruptible instance types. These rates change frequently, which is one of the reasons you should run the Autotuner as close to your job run as possible.

Did I implement the recommendations correctly?

While in the future we will build support for a feedback loop and tracking of ROI, we would love to hear from you on results of running our recommended configurations. Please email [email protected], or open an issue: https://github.com/synccomputingcode/user_documentation/issues

The predicted runtime/cost did not match my real job run values, why?

This is still a beta product, so you may very well have found a bug. Another large source of error we've found is that the recommended settings were not implemented correctly. Sometimes companies forbid certain settings from being changed, or your infrastructure may have an edge case we've never seen before. Either way, please send us a message and we can help troubleshoot. Email us at [email protected], or open an issue: https://github.com/synccomputingcode/user_documentation/issues
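Before reporting a mismatch, it can help to confirm that the settings in effect match the recommendation. The sketch below compares two plain dictionaries of Spark properties; `find_config_mismatches` is a hypothetical helper and the values shown are made up. It does not check instance types or cluster size.

```python
def find_config_mismatches(recommended, applied):
    """Return {property: (recommended_value, applied_value)} for every
    recommended property that is missing or different in the applied config.
    Hypothetical helper, illustration only."""
    mismatches = {}
    for key, want in recommended.items():
        got = applied.get(key)  # None if the property was never set
        if got != want:
            mismatches[key] = (want, got)
    return mismatches


# Made-up values for illustration.
recommended = {"spark.executor.memory": "8g", "spark.executor.cores": "4"}
applied = {"spark.executor.memory": "8g", "spark.executor.cores": "2"}
diff = find_config_mismatches(recommended, applied)
```

In a PySpark session, the applied properties can be read with `dict(spark.sparkContext.getConf().getAll())`, assuming `spark` is your active SparkSession. An empty result means every recommended Spark property was applied as given.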