House Price Problem - datacouch-io/spark-java GitHub Wiki

Objective: The goal of this Spark program is to read a dataset of house listings from "in/RealEstate.csv," group the data by location, compute the average price per square foot (SQ Ft) and the maximum price for each location, and sort the results by the average price per SQ Ft in ascending order.

Data Description: The dataset, named "RealEstate.csv," contains a collection of recent real estate listings in San Luis Obispo county and its surroundings. Each entry in the dataset represents a house listing and includes the following fields:

  1. MLS: Multiple listing service number for the house (unique ID).
  2. Location: The city or town where the house is located. Most locations are in San Luis Obispo county and northern Santa Barbara county, with some out-of-area locations.
  3. Price: The most recent listing price of the house (in dollars).
  4. Bedrooms: The number of bedrooms in the house.
  5. Bathrooms: The number of bathrooms in the house.
  6. Size: The size of the house in square feet.
  7. Price/SQ.ft: The price of the house per square foot.
  8. Status: The type of sale, which can be Short Sale, Foreclosure, or Regular.

Each field in the dataset is separated by commas (CSV format).
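Because the file has a header row, Spark can pick up the column names directly. Here is a minimal sketch of loading and inspecting the file with Spark's Java API (the class name `ReadRealEstate` is illustrative; schema inference is used instead of declaring column types explicitly):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadRealEstate {
    public static void main(String[] args) {
        // Local SparkSession for a quick look at the data
        SparkSession spark = SparkSession.builder()
                .appName("ReadRealEstate")
                .master("local[*]")
                .getOrCreate();

        // header=true uses the first CSV row as column names;
        // inferSchema=true converts numeric columns to numeric types
        Dataset<Row> realEstate = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("in/RealEstate.csv");

        // Print the inferred column names and types
        realEstate.printSchema();

        spark.stop();
    }
}
```

With schema inference enabled, columns such as Price and Size should appear as numeric types rather than strings, which is what the aggregations below require.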

Sample Output: The expected output of the Spark program will be a DataFrame containing the location, average price per SQ Ft, and the maximum price for each location. The results will be sorted in ascending order based on the average price per SQ Ft. Here's a sample output format:

+----------------+----------------+----------+
|        Location|avg(Price SQ Ft)|max(Price)|
+----------------+----------------+----------+
|          Oceano|          1145.0|   1195000|
|         Bradley|           606.0|   1600000|
| San Luis Obispo|           459.0|   2369000|
|      Santa Ynez|           391.4|   1395000|
|         Cayucos|           387.0|   1500000|
|             ...|             ...|       ...|
+----------------+----------------+----------+

Implementation Steps:

  1. Create a SparkSession to initialize Spark.

  2. Read the "RealEstate.csv" file into a DataFrame using spark.read() with the csv format. Make sure to set the header option to "true" so the first row is used as column names.

  3. Group the DataFrame by the "Location" column.

  4. Calculate the average price per SQ Ft for each group using the avg function.

  5. Find the maximum price for each group using the max function.

  6. Sort the result DataFrame in ascending order based on the average price per SQ Ft using the orderBy function.

  7. Display the final result using the show method.

  8. Stop the SparkSession to release resources.
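The steps above can be sketched as a single Java program. This is one possible implementation, assuming the price-per-square-foot column is named `Price SQ Ft` (as the sample output suggests) and the class name `HousePriceProblem` is illustrative:

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HousePriceProblem {
    public static void main(String[] args) {
        // 1. Create a SparkSession to initialize Spark
        SparkSession spark = SparkSession.builder()
                .appName("HousePriceProblem")
                .master("local[*]")
                .getOrCreate();

        // 2. Read the CSV, using the first row as column names
        Dataset<Row> realEstate = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("in/RealEstate.csv");

        // 3-5. Group by location, then compute the average price per
        // SQ Ft and the maximum price for each group
        // 6. Sort ascending by the average price per SQ Ft
        Dataset<Row> result = realEstate.groupBy("Location")
                .agg(avg("Price SQ Ft"), max("Price"))
                .orderBy(col("avg(Price SQ Ft)"));

        // 7. Display the final result
        result.show();

        // 8. Stop the SparkSession to release resources
        spark.stop();
    }
}
```

Note that `agg(avg("Price SQ Ft"), max("Price"))` produces columns named `avg(Price SQ Ft)` and `max(Price)`, matching the sample output above, so `orderBy` refers to the aggregated column by that generated name.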

By following these steps, the program will process the dataset and provide the desired output, showing the average price per SQ Ft and the maximum price for houses in each location, sorted by the average price per SQ Ft.