Frequently Asked Questions: Autotuner for Apache Spark

General

Why would someone use the Autotuner for Apache Spark?

When choosing Spark configuration and infrastructure options for a job, developers face a huge number of possible combinations. Picking the best combination every time is impractical, and poor choices lead to failed jobs, increased cost, and reduced performance. The Sync Autotuner for Apache Spark predicts cost and runtime across thousands of possible combinations, solving this problem in seconds.

How is this autotuner better than previous solutions?

Most other solutions recommend tuning single parameters. Each Autotuner prediction is a combination of infrastructure and application configuration parameters, and it takes into account the interrelationships between parameters and how they affect cost and runtime.
We need just one log to give you better performance. Other solutions require many runs to train their models.
We provide many recommendations with different runtime and cost values, so users can choose what works for them. Other solutions give you just one option with unclear runtime and cost impact.

What platforms does the Autotuner work with?

The Autotuner Beta supports Apache Spark running on AWS EMR, as well as Databricks running on AWS. This does not yet include EKS or Serverless support.

What information do you need from users?

To run the Autotuner, we need your cluster information and the associated Spark event log from the most recent successful job run. If you aren't sure how to access this information, please see our User Documentation.
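As a rough way to confirm that an event log comes from a successful run before submitting it, the sketch below scans the log for job-end events. It assumes an uncompressed, JSON-lines event log using the field names written by open-source Spark (`SparkListenerJobEnd`, `Job Result`, `Result`). The function name `all_jobs_succeeded` and the sample lines are illustrative, and whether this matches the Autotuner's own definition of a successful run is an assumption.

```python
import json


def all_jobs_succeeded(event_log_lines):
    """Return True if the log has at least one finished job and none failed."""
    results = []
    for line in event_log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        if event.get("Event") == "SparkListenerJobEnd":
            # Spark records the outcome under "Job Result" -> "Result".
            results.append(event.get("Job Result", {}).get("Result"))
    return bool(results) and all(r == "JobSucceeded" for r in results)


# Hand-written sample lines for illustration, not taken from a real log.
sample = [
    '{"Event": "SparkListenerJobEnd", "Job ID": 0, "Job Result": {"Result": "JobSucceeded"}}',
    '{"Event": "SparkListenerJobEnd", "Job ID": 1, "Job Result": {"Result": "JobSucceeded"}}',
]
print(all_jobs_succeeded(sample))
```

For a real log, the same function can be called with an open file object, for example `all_jobs_succeeded(open(path))`, where `path` points at your event log.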

Why can't I find my event logs?

Ensure that spark.eventLog.enabled is set to true for any jobs you are interested in optimizing. For more information, see: https://spark.apache.org/docs/latest/configuration.html
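One way to turn event logging on is to pass the settings at submit time. The sketch below builds the `--conf` flags for `spark-submit`. `spark.eventLog.enabled` and `spark.eventLog.dir` are standard Spark properties; the helper name `event_log_conf_args`, the bucket path, and the script name are placeholders for illustration only.

```python
def event_log_conf_args(log_dir):
    """Return spark-submit flags that turn on event logging.

    log_dir is a placeholder; use a directory or bucket that your
    cluster can write to.
    """
    settings = {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical bucket and script name, shown only to illustrate the shape.
command = (
    ["spark-submit"]
    + event_log_conf_args("s3://my-bucket/spark-events/")
    + ["my_job.py"]
)
print(" ".join(command))
```

The same two properties can instead be set in `spark-defaults.conf` or in your platform's cluster configuration, which avoids repeating them on every submit.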

How long does it take to get results back?

This depends on the size of your log file, but most of the time results are returned in less than 10 minutes. We are working on speeding this up so that results arrive in seconds.

How long are the recommended configurations valid for?

Many elements of a data pipeline can change daily, such as data size, data skew, the code itself, spot pricing, and spot availability. Because of this, we recommend running the Autotuner right before you run your real job, so that the recommendations do not go stale. Eventually, the Autotuner is meant to be run every time your production Spark job runs.

Where does your pricing information come from for the current and predicted cost?

We query the AWS public APIs as part of the prediction process to ensure we have the most current On-Demand and Spot pricing. We know many users have special vendor discounts, as well as Reserved Instances (RIs) and Savings Plans, that affect On-Demand costs. Eventually we will take these into account when reporting on costs. For Databricks, we use the list price for DBUs: https://databricks.com/product/aws-pricing
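To make the role of list prices concrete, here is a back-of-the-envelope cost calculation. It is only an illustrative sketch, not the Autotuner's cost model: the function name `estimated_cost` and every number below are made up, and the formula ignores discounts, storage, EMR service fees, and any difference between driver and worker nodes.

```python
def estimated_cost(runtime_hours, node_count, instance_price_per_hour,
                   dbu_per_node_hour=0.0, dbu_price=0.0):
    """Rough cluster cost: hours x nodes x (instance price + DBU charge).

    Leave the DBU arguments at 0.0 for a cluster with no DBU charge.
    """
    per_node_hour = instance_price_per_hour + dbu_per_node_hour * dbu_price
    return runtime_hours * node_count * per_node_hour


# All prices and rates here are invented for illustration.
on_demand = estimated_cost(runtime_hours=2.0, node_count=10,
                           instance_price_per_hour=0.40)
spot = estimated_cost(runtime_hours=2.0, node_count=10,
                      instance_price_per_hour=0.15)
with_dbus = estimated_cost(runtime_hours=2.0, node_count=10,
                           instance_price_per_hour=0.40,
                           dbu_per_node_hour=1.0, dbu_price=0.10)
print(f"on-demand: {on_demand:.2f}, spot: {spot:.2f}, with DBUs: {with_dbus:.2f}")
```

Because the instance price enters the formula directly, a change in Spot pricing between the prediction and the real run changes the cost even when the runtime is identical.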

Do you recommend Spot instances?

Currently, our recommendations follow the On-Demand or Spot choice found in your log and cluster input data. In the future, we plan to add user settings for On-Demand and Spot preferences, as well as options for instance fleet settings.

How do you take into account spot frequency of interruption in your predictions?

We check this at runtime, and won't recommend highly interruptible instance types. This is one of the reasons you should run the Autotuner as close to your job run as possible, as these rates change frequently.

Do you have an API?

The user API is under development, and we anticipate having it available soon!

Troubleshooting

There was an error when submitting my cluster config and/or log information, what do I do?

Please email [email protected].

My Spark event logs contain sensitive data that I do not want to share outside my organization, what are my options?

We have an open source log parser that will remove sensitive information, and only include the info that our Autotuner needs. You can access the parser here: https://github.com/synccomputingcode/spark_log_parser

Can I just implement a few of the recommendations, or do I need to implement all of them?

Each configuration in a recommendation must be implemented as given to achieve the predicted results. They should not be considered independent recommendations.
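Since a recommendation only holds when every setting is applied together, it can help to compare what you actually ran against what was recommended. The sketch below does that with two plain dictionaries. The function name `config_mismatches` and the sample values are hypothetical and do not reflect the Autotuner's real output format.

```python
def config_mismatches(recommended, applied):
    """Map each recommended key to (recommended, applied) where they differ.

    A key missing from `applied` is reported with None as the applied value.
    """
    return {
        key: (value, applied.get(key))
        for key, value in recommended.items()
        if applied.get(key) != value
    }


# Hypothetical values for illustration; not real Autotuner output.
recommended = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.executor.instances": "10",
}
applied = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "2",
}
for key, (want, got) in config_mismatches(recommended, applied).items():
    print(f"{key}: recommended {want}, applied {got}")
```

An empty result means every recommended setting was applied as given. Anything else is worth fixing before comparing real runtime and cost against the prediction.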

The recommended instances have low availability, and my job couldn’t start.

Did I implement the recommendations correctly?

While in the future we will build support for a feedback loop and tracking of ROI, we would love to hear from you on results of running our recommended configurations. Please email [email protected].

The predicted runtime/cost did not match my real job run values, why?

This is still a beta product, so you may well have found a bug. Another large source of error we’ve found is recommended settings that were not implemented correctly. Sometimes companies forbid certain settings from being changed, or your infrastructure may have an edge case we’ve never seen before. Either way, please send us a message and we can help troubleshoot. Email us at [email protected].