From GPT (what should a developer with 4 years of experience know about Spark and Java code?) - ayushmathur94/Spark GitHub Wiki
A 4+ years experienced developer in Spark and Java should have a strong understanding of the following concepts and technologies:
1.) Core Spark concepts such as Resilient Distributed Datasets (RDDs), DataFrames, Datasets, and Spark SQL. They should be able to write Spark code with both the RDD and DataFrame/Dataset APIs and understand the trade-offs between them.
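To make the RDD-vs-DataFrame trade-off concrete, here is a sketch of the same word count written both ways in Java. This is illustrative only: it assumes `spark-sql` is on the classpath, `local[*]` is used just for demonstration, and `input.txt` is a placeholder path.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class WordCountComparison {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("WordCountComparison")
                .master("local[*]")   // local mode, for illustration only
                .getOrCreate();

        // RDD API: low-level, explicit transformations, no Catalyst optimization
        JavaRDD<String> lines = spark.read().textFile("input.txt").javaRDD();
        lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
             .mapToPair(word -> new scala.Tuple2<>(word, 1))
             .reduceByKey(Integer::sum)
             .take(10)
             .forEach(System.out::println);

        // DataFrame API: declarative, optimized by Catalyst and Tungsten
        Dataset<Row> words = spark.read().text("input.txt")
                .select(explode(split(col("value"), "\\s+")).as("word"));
        words.groupBy("word").count().show(10);

        spark.stop();
    }
}
```

The RDD version gives full control over each step, while the DataFrame version lets Spark's optimizer pick the physical plan, which is usually faster for relational-style workloads.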
2.) Spark's cluster manager integration: They should be familiar with deploying Spark applications on different cluster managers such as YARN, Kubernetes, Mesos (deprecated as of Spark 3.2), and the standalone cluster manager, and understand the configuration settings required for each.
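The cluster-manager choice mostly shows up in `spark-submit` options. A sketch of a YARN submission follows; the resource values and the class/jar names are illustrative placeholders, not recommendations.

```shell
# Submit to YARN in cluster deploy mode; resource figures are example values
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  --conf spark.dynamicAllocation.enabled=true \
  --class com.example.MyApp \
  my-app.jar

# The same application on a standalone cluster mainly needs a different master URL:
#   --master spark://master-host:7077
```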
3.) Spark Streaming: They should understand the fundamentals of Spark Streaming and be able to write applications that process real-time data streams, both with the older DStreams API and with the newer Structured Streaming API built on DataFrames.
4.) Spark's Machine Learning Library (MLlib): They should have a good understanding of the MLlib library and be able to use it to build, evaluate, and deploy machine learning models.
5.) Spark's GraphX library: They should have a good understanding of the GraphX library and be able to use it to process graph data and perform graph computations.
6.) Java 8: They should be proficient in using the new features of Java 8, such as lambda expressions, streams, and functional interfaces, to write more concise and expressive code.
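The Java 8 features listed above appear constantly in Spark code, since Spark's Java API accepts lambdas wherever a functional interface is expected. A small self-contained illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class Java8Features {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java", "stream", "lambda");

        // Lambda + Stream pipeline: filter, transform, collect
        List<String> upper = words.stream()
                .filter(w -> w.length() > 4)        // lambda as a Predicate
                .map(String::toUpperCase)           // method reference
                .collect(Collectors.toList());
        System.out.println(upper);                  // [SPARK, STREAM, LAMBDA]

        // Functional interface: Function composed with andThen
        Function<String, Integer> length = String::length;
        System.out.println(length.andThen(n -> n * 2).apply("spark")); // 10

        // Grouping with a downstream collector
        Map<Integer, Long> byLength = words.stream()
                .collect(Collectors.groupingBy(String::length, Collectors.counting()));
        System.out.println(byLength);
    }
}
```

The same lambda and method-reference syntax carries over directly to Spark transformations such as `map` and `filter` on a `JavaRDD` or `Dataset`.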
7.) Advanced Java concepts: They should have a good understanding of advanced Java concepts such as multithreading, concurrency, and memory management and be able to apply them to optimize Spark applications.
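As a small example of the concurrency knowledge above, the following fans out independent tasks to a bounded thread pool with `CompletableFuture` and combines the results. It is plain JDK code, not Spark-specific.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrencyDemo {
    public static void main(String[] args) throws Exception {
        // Fixed thread pool: bounds concurrency and reuses threads
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Submit independent tasks; each runs asynchronously on the pool
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (int i = 1; i <= 4; i++) {
            final int n = i;
            futures.add(CompletableFuture.supplyAsync(() -> n * n, pool));
        }

        // Join the partial results; sum of squares 1..4 = 30
        int total = futures.stream().mapToInt(CompletableFuture::join).sum();
        System.out.println(total); // prints 30

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The same pattern of bounding parallelism and joining results is useful on the driver side of a Spark application, for example when launching several independent jobs against one `SparkSession`.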
8.) Big Data technologies: They should have a good understanding of other big data technologies such as Hadoop, HDFS, Hive, and HBase and be able to integrate them with Spark.
9.) Data serialization: They should be familiar with different storage and serialization formats, such as the row-based Avro and the columnar Parquet and ORC, and be able to choose the right one for a given use case.
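In Spark the format choice reduces to a writer call. A sketch, assuming an existing `SparkSession` named `spark` with `spark-sql` (and the `spark-avro` package for the Avro writer) on the classpath; `events.json` and the output paths are placeholders.

```java
// Sketch only: assumes `spark` is an existing SparkSession
Dataset<Row> events = spark.read().json("events.json");   // placeholder input

// Columnar formats (Parquet, ORC): suit analytical scans over a few columns
events.write().mode("overwrite").parquet("events_parquet");
events.write().mode("overwrite").orc("events_orc");

// Row-based Avro: suits record-at-a-time pipelines and schema evolution
events.write().format("avro").mode("overwrite").save("events_avro");
```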
10.) Data pipelines and ETL: They should have experience building data pipelines and ETL processes and be familiar with tools such as Apache NiFi and Apache Kafka.
11.) Security: They should understand the security aspects of Spark, such as authentication, authorization, and encryption, and be able to implement them in their applications.
12.) Debugging and Monitoring: They should be proficient with the tools and techniques for debugging and monitoring Spark applications, such as the Spark web UI, driver and executor logs, and Spark event logs.
13.) Continuous integration and deployment: They should be familiar with CI/CD pipelines and tools such as Jenkins and Travis CI, and be able to implement them for Spark applications.
14.) Cloud services: They should be familiar with deploying Spark applications on cloud services such as AWS EMR, GCP Dataproc, and Azure HDInsight and be able to take advantage of the cloud-specific features and services.