pyspark.sql module - awantik/spark GitHub Wiki
### Important classes of Spark SQL and DataFrames:
- pyspark.sql.SQLContext Main entry point for DataFrame and SQL functionality.
- pyspark.sql.DataFrame A distributed collection of data grouped into named columns.
- pyspark.sql.Column A column expression in a DataFrame.
- pyspark.sql.Row A row of data in a DataFrame.
- pyspark.sql.GroupedData Aggregation methods, returned by DataFrame.groupBy().
- pyspark.sql.DataFrameNaFunctions Methods for handling missing data (null values).
- pyspark.sql.DataFrameStatFunctions Methods for statistics functionality.
- pyspark.sql.functions List of built-in functions available for DataFrame.
- pyspark.sql.types List of data types available.
- pyspark.sql.Window For working with window functions.
### Various types of deployment
- YARN as cluster manager - does not come built in with Spark
- Mesos as cluster manager - does not come built in with Spark
- Spark Standalone - provided by Spark and comes bundled with it
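The deployment mode is usually selected through the master URL passed to `spark-submit`. A rough sketch of the common forms (hostnames and ports here are placeholders):

```
spark-submit --master local[4] app.py             # local mode, 4 cores
spark-submit --master spark://master:7077 app.py  # Spark Standalone cluster
spark-submit --master yarn app.py                 # YARN cluster manager
spark-submit --master mesos://host:5050 app.py    # Mesos cluster manager
```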
##### master
The URL of the cluster. If everything runs on the local machine, use `"local[4]"` (run locally with 4 cores). If you have a standalone cluster, use its URL, e.g. `spark://master:7077`.
##### config
Holds all key-value configuration pairs. You can pass a whole `SparkConf` object, `SparkSession.builder.config(conf=SparkConf())`, or set a single pair, `config("info", "great")`.
#### SparkSession - spark
Wrapper around SQLContext.
#### SparkContext - sc
Entry point to all of Spark.
#### SQLContext - sqlContext
Entry point for working with structured data.