Apache Spark

Spark provides APIs in several programming languages, including Scala, Python, Java, and R, so developers can write applications in the language they prefer. It also ships with a set of high-level libraries, including the following:

  • Spark SQL: A library for querying structured data with SQL or the Dataset/DataFrame APIs. It supports data sources such as Hive, Avro, Parquet, ORC, JSON, and JDBC.
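  • Spark Streaming: The original library for processing real-time data streams, built on DStreams (discretized streams). It lets developers build scalable and fault-tolerant streaming applications, and is now considered the previous generation of Spark's streaming engine.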
  • Spark Structured Streaming: A stream processing library built on the Spark SQL engine that lets you express streaming computations much like batch queries. It treats a stream as a table that is continuously appended to and, by default, processes newly arrived data in small micro-batches. It supports data sources such as Kafka, Azure Event Hubs, and Amazon Kinesis, which makes it a popular choice for real-time data processing in modern data architectures.
  • MLlib: A machine learning library with algorithms for regression, classification, clustering, and recommendation, as well as model evaluation and hyperparameter tuning tools.
  • GraphX: A graph processing library that offers a flexible graph computation API and a set of built-in graph algorithms.
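
The micro-batch idea mentioned for Structured Streaming can be illustrated without Spark itself. The following is a minimal pure-Python sketch of the concept only, not Spark's API: the helper names `micro_batches` and `running_word_counts`, the batch size, and the sample events are all assumptions made for illustration. It splits an incoming stream into small batches and updates a running aggregate after each one, which is roughly how a streaming query result grows as new rows are appended.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Split an incoming stream into small consecutive batches (hypothetical helper)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def running_word_counts(stream: Iterable[str], batch_size: int = 2) -> Iterator[Dict[str, int]]:
    """Update a word-count aggregate incrementally as each micro-batch arrives."""
    totals: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        for line in batch:
            totals.update(line.split())
        # Emit a snapshot of the aggregate after this batch has been processed.
        yield dict(totals)


# Sample input standing in for a real-time source (illustrative data only).
events = ["spark sql", "spark streaming", "mllib", "graphx spark"]
snapshots = list(running_word_counts(events, batch_size=2))

for i, snapshot in enumerate(snapshots, start=1):
    print(f"after batch {i}: {snapshot}")
```

In real Spark code the engine handles the batching, state, and fault tolerance; the sketch only shows why a streaming result can be treated as a batch-style aggregate over an ever-growing input.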