Spark 2 Recruitment Challenge - bdu-xpand-it/BDU-Recruitment-Challenges GitHub Wiki
Introduction
For this challenge you'll have to develop a Spark application in Scala (Spark version 2 or greater) that processes 2 CSV files (google-play-store-apps.zip).
For this project you should use Maven for dependency management.
Important Info
- When writing code and structuring the project, try to produce clean and concise implementations.
- Produce code that does not need adaptations to run (self-contained and runnable).
- Bottom line: try to produce a well-structured and tested project.
If you have doubts about interpretation or detect inconsistencies in the supplied material, you have full autonomy to make any decision you consider necessary, as long as you clearly detail and present your justification.
You should submit a compilable and executable project solution, as well as any artifact and documentation that you consider relevant for the correct evaluation of this challenge.
Datasets
Both CSV files contain scraped data from the Google Play Store.
googleplaystore.csv
This dataset contains all the information about the mobile applications registered in the Google Play Store.
Columns
- App - Application name
- Category - Category the app belongs to
- Rating - Overall user rating of the app (as when scraped)
- Reviews - Number of user reviews for the app (as when scraped)
- Size - Size of the app (as when scraped)
- Installs - Number of user downloads/installs for the app (as when scraped)
- Type - Paid or Free
- Price - Price of the app (as when scraped)
- Content Rating - Age group the app is targeted at - Children / Mature 21+ / Adult
- Genres - An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
- Last Updated - Date when the app was last updated on Play Store (as when scraped)
- Current Ver - Current version of the app available on Play Store (as when scraped)
- Android Ver - Min required Android version (as when scraped)
Example/Sample 1
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Photo Editor | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
Coloring book | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
U Launcher Lite. | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
Sketch | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
Pixel Draw | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
Clash of Clans | GAME | 4.6 | 44893888 | 98M | 100,000,000+ | Free | 0 | Everyone 10+ | Strategy | July 15, 2018 | 10.322.16 | 4.1 and up |
Clash of Clans | FAMILY | 4.6 | 44881447 | 98M | 100,000,000+ | Free | 0 | Everyone 10+ | Strategy | July 15, 2018 | 10.322.16 | 4.1 and up |
googleplaystore_user_reviews.csv
This dataset contains all the information about users' reviews of the mobile apps registered in the Google Play Store.
Columns
- App - Name of app
- Translated_Review - User review (Preprocessed and translated to English)
- Sentiment - Positive/Negative/Neutral (Preprocessed)
- Sentiment_Polarity - Sentiment polarity score
- Sentiment_Subjectivity - Sentiment subjectivity score
Example/Sample 2
App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
---|---|---|---|---|
10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.25 | 0.28846153846153844 |
10 Best Foods for You | Best idea us | Positive | 1.0 | 0.3 |
10 Best Foods for You | Amazing | Positive | 0.6000000000000001 | 0.9 |
10 Best Foods for You | ||||
10 Best Foods for You | Looking forward app, | Neutral | 0.0 | 0.0 |
Exercise
In this exercise you will be asked to develop a basic Spark application that performs some basic operations on both datasets and stores the final results in other files.
Part 1
From googleplaystore_user_reviews.csv create a Dataframe (df_1) with the following structure:
Column name | Data type | Default Value | Notes |
---|---|---|---|
App | String | ||
Average_Sentiment_Polarity | Double | 0 (instead of NULL) | Average of the column Sentiment_Polarity grouped by App name |
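One possible way to build df_1 is sketched below. It is only an illustration under a few assumptions: the object and function names (`Part1Sketch`, `averagePolarity`) are hypothetical, the reviews file is assumed to be read with a header and every column as a string, and missing polarity values are assumed to appear as empty or non-numeric strings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, coalesce, col, isnan, lit, when}

object Part1Sketch {
  // Hypothetical helper: one row per App with its mean Sentiment_Polarity.
  def averagePolarity(reviews: DataFrame): DataFrame = {
    // Non-numeric strings become null (or NaN, depending on the Spark version) when cast.
    val polarity = col("Sentiment_Polarity").cast("double")
    // Keep only real numbers; NaN is turned into null so avg() ignores it.
    val cleaned = when(!isnan(polarity), polarity)
    reviews
      .groupBy("App")
      // avg() returns null when a group has no numeric value; default that to 0.
      .agg(coalesce(avg(cleaned), lit(0.0)).as("Average_Sentiment_Polarity"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("part1-sketch").getOrCreate()
    // The path is an assumption: adjust it to where the CSV was extracted.
    val reviews = spark.read.option("header", "true").csv("googleplaystore_user_reviews.csv")
    averagePolarity(reviews).show(5, truncate = false)
    spark.stop()
  }
}
```

Keeping the transformation in a function that takes and returns a Dataframe makes it straightforward to unit test with a small in-memory Dataframe.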
Part 2
- Read googleplaystore.csv as a Dataframe and obtain all apps with a "Rating" greater than or equal to 4.0, sorted in descending order.
- Save that Dataframe as a CSV (delimiter: "§") named "best_apps.csv"
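The two steps above could be sketched as follows. This is a minimal illustration, not a reference solution: the names `Part2Sketch`, `bestApps` and `saveBestApps` are hypothetical, the sort is assumed to be on "Rating", and the input is assumed to have been read with every column as a string.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, isnan}

object Part2Sketch {
  // Hypothetical helper: apps rated 4.0 or higher, best rated first.
  def bestApps(apps: DataFrame): DataFrame = {
    val rating = col("Rating").cast("double")
    apps
      // Spark orders NaN above every other number, so it is excluded explicitly.
      .filter(rating >= 4.0 && !isnan(rating))
      .orderBy(rating.desc)
  }

  // Hypothetical helper: writes the result as CSV using "§" as the delimiter.
  def saveBestApps(best: DataFrame, path: String): Unit =
    best.write
      .mode("overwrite")
      .option("header", "true")
      .option("sep", "§")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("part2-sketch").getOrCreate()
    // The input path is an assumption: adjust it to where the CSV was extracted.
    val apps = spark.read.option("header", "true").csv("googleplaystore.csv")
    saveBestApps(bestApps(apps), "best_apps.csv")
    spark.stop()
  }
}
```

Note that Spark writes a directory named "best_apps.csv" containing one or more part files. If a single file is required, one option is to repartition to a single partition before writing and rename the part file afterwards.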
Part 3
- From googleplaystore.csv create a Dataframe (df_3) with the structure from the table below
Attention
- App should be a unique value;
- In case of App duplicates, the column "Categories" of the resulting row should contain an array with all the possible categories (without duplicates) for that app (compare example 1 with 3);
- In case of App duplicates (for all columns except categories), the remaining columns should have the same values as the ones on the row with the maximum number of reviews (compare example 1 with 3).
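The deduplication rules above can be approached with window functions. The sketch below is one hedged possibility: the names `Part3DedupSketch` and `dedupe` are hypothetical, and "Reviews" is assumed to still be a string column that can be cast to a long.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, row_number}

object Part3DedupSketch {
  // Hypothetical helper: keeps one row per App and gathers its distinct categories.
  def dedupe(apps: DataFrame): DataFrame = {
    val byApp = Window.partitionBy("App")
    val reviews = col("Reviews").cast("long")
    apps
      // Every row of an app receives the full set of that app's categories.
      .withColumn("Categories", collect_set(col("Category")).over(byApp))
      // Rank the rows of each app by review count, highest first, unparseable values last.
      .withColumn("rank", row_number().over(byApp.orderBy(reviews.desc_nulls_last)))
      // Keep only the row with the maximum number of reviews.
      .filter(col("rank") === 1)
      .drop("rank", "Category")
  }
}
```

For the two "Clash of Clans" rows in sample 1, this keeps the row with 44893888 reviews and a "Categories" array holding both GAME and FAMILY. The order of the elements produced by `collect_set` is not guaranteed.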
The table below gives the structure of the final Dataframe:
Column name | Data type | Default Value | IMPORTANT NOTES |
---|---|---|---|
App | String | | Remove duplicates |
Categories | Array[String] | | Rename column from 'Category' |
Rating | Double | null | |
Reviews | Long | 0 | |
Size | Double | null | Convert from string to double (value in MB). Attention - Not all values end in "M" |
Installs | String | null | |
Type | String | null | |
Price | Double | null | Convert from string to double and present the value in euros (All values are in dollars) (Consider conversion rate: 1$ = 0.9€) |
Content_Rating | String | null | Rename column from 'Content Rating' |
Genres | Array[String] | null | Convert string to array of strings (delimiter: ";") |
Last_Updated | Date | null | Convert string to date. Rename column from 'Last Updated' |
Current_Version | String | null | Rename column from 'Current Ver' |
Minimum_Android_Version | String | null | Rename column from 'Android Ver' |
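The column conversions in the table could be sketched as below. Several points are assumptions rather than facts from the material: the names `Part3ColumnsSketch` and `convertColumns` are hypothetical, sizes that do not end in "M" are assumed to end in "k" (kilobytes, converted with 1 MB = 1024 kB) or to be non-numeric, prices are assumed to carry an optional leading "$", and dates are assumed to follow the "January 7, 2018" pattern from sample 1.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, isnan, lit, regexp_replace, split, to_date, when}

object Part3ColumnsSketch {
  // Hypothetical helper: applies the type conversions and renames from the table.
  def convertColumns(apps: DataFrame): DataFrame = {
    val rating = col("Rating").cast("double")
    val size = col("Size")
    apps
      // NaN and unparseable ratings become null.
      .withColumn("Rating", when(!isnan(rating), rating))
      // Unparseable review counts default to 0.
      .withColumn("Reviews", coalesce(col("Reviews").cast("long"), lit(0L)))
      // "19M" -> 19.0; "201k" -> 201 / 1024 (assumed unit); anything else -> null.
      .withColumn("Size",
        when(size.endsWith("M"), regexp_replace(size, "M$", "").cast("double"))
          .when(size.endsWith("k"), regexp_replace(size, "k$", "").cast("double") / 1024))
      // Strip the dollar sign and convert to euros at 1$ = 0.9€.
      .withColumn("Price", regexp_replace(col("Price"), "[$]", "").cast("double") * 0.9)
      .withColumn("Genres", split(col("Genres"), ";"))
      .withColumn("Last_Updated", to_date(col("Last Updated"), "MMMM d, yyyy"))
      .drop("Last Updated")
      .withColumnRenamed("Content Rating", "Content_Rating")
      .withColumnRenamed("Current Ver", "Current_Version")
      .withColumnRenamed("Android Ver", "Minimum_Android_Version")
  }
}
```

Columns not mentioned in the function (for example "App" or an already built "Categories" array) pass through unchanged, so the conversion can be applied before or after removing duplicates.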
Example/sample 3
App | Categories | Rating | Reviews | Size | Installs | Type | Price | Content_Rating | Genres | Last_Updated | Current_Version | Minimum_Android_Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Photo Editor | ["ART_AND_DESIGN"] | 4.1 | 159 | 19 | 10,000+ | Free | 0 | Everyone | ["Art & Design"] | 2018-01-07 00:00:00 | 1.0.0 | 4.0.3 |
Coloring book | ["ART_AND_DESIGN"] | 3.9 | 967 | 14 | 500,000+ | Free | 0 | Everyone | ["Art & Design", "Pretend Play"] | 2018-01-15 00:00:00 | 2.0.0 | 4.0.3 |
U Launcher Lite. | ["ART_AND_DESIGN"] | 4.7 | 87510 | 8.7 | 5,000,000+ | Free | 0 | Everyone | ["Art & Design"] | 2018-08-01 00:00:00 | 1.2.4 | 4.0.3 |
Sketch | ["ART_AND_DESIGN"] | 4.5 | 215644 | 25 | 50,000,000+ | Free | 0 | Teen | ["Art & Design"] | 2018-06-08 00:00:00 | Varies with device | 4.2 |
Pixel Draw | ["ART_AND_DESIGN"] | 4.3 | 967 | 2.8 | 100,000+ | Free | 0 | Everyone | ["Art & Design", "Creativity"] | 2018-06-20 00:00:00 | 1.1 | 4.4 |
Clash of Clans | ["GAME", "FAMILY"] | 4.6 | 44893888 | 98 | 100,000,000+ | Free | 0 | Everyone 10+ | ["Strategy"] | 2018-07-15 00:00:00 | 10.322.16 | 4.1 |
Part 4
- Given the Dataframes produced in Part 1 and Part 3, produce a Dataframe with all the information from df_3 plus the 'Average_Sentiment_Polarity' column calculated in Part 1
- Save the final Dataframe as a parquet file with gzip compression with the name "googleplaystore_cleaned"
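A minimal sketch of these two steps follows, assuming df_1 and df_3 share the "App" column. The names `Part4Sketch`, `withPolarity` and `saveCleaned` are hypothetical. A left join is used here because it keeps apps that have no reviews, with a null polarity; this is one interpretation of the requirement.

```scala
import org.apache.spark.sql.DataFrame

object Part4Sketch {
  // Hypothetical helper: keeps every row of df3 and adds the polarity from df1 when present.
  def withPolarity(df3: DataFrame, df1: DataFrame): DataFrame =
    df3.join(df1, Seq("App"), "left")

  // Hypothetical helper: writes the result as gzip-compressed parquet.
  def saveCleaned(df: DataFrame, path: String): Unit =
    df.write
      .mode("overwrite")
      .option("compression", "gzip")
      .parquet(path)
}
```

A call such as `Part4Sketch.saveCleaned(Part4Sketch.withPolarity(df_3, df_1), "googleplaystore_cleaned")` would then produce the output directory. If a default of 0 is preferred over null for apps without reviews, `na.fill` on the polarity column is one option.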
Example/Sample 4
App | Categories | Rating | Reviews | Size | Installs | Type | Price | Content_Rating | Genres | Last_Updated | Current_Version | Minimum_Android_Version | Average_Sentiment_Polarity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Photo Editor | ["ART_AND_DESIGN"] | 4.1 | 159 | 19 | 10,000+ | Free | 0 | Everyone | ["Art & Design"] | 2018-01-07 00:00:00 | 1.0.0 | 4.0.3 | 6.83523 |
Coloring book | ["ART_AND_DESIGN"] | 3.9 | 967 | 14 | 500,000+ | Free | 0 | Everyone | ["Art & Design", "Pretend Play"] | 2018-01-15 00:00:00 | 2.0.0 | 4.0.3 | null |
U Launcher Lite. | ["ART_AND_DESIGN"] | 4.7 | 87510 | 8.7 | 5,000,000+ | Free | 0 | Everyone | ["Art & Design"] | 2018-08-01 00:00:00 | 1.2.4 | 4.0.3 | null |
Sketch | ["ART_AND_DESIGN"] | 4.5 | 215644 | 25 | 50,000,000+ | Free | 0 | Teen | ["Art & Design"] | 2018-06-08 00:00:00 | Varies with device | 4.2 | null |
Pixel Draw | ["ART_AND_DESIGN"] | 4.3 | 967 | 2.8 | 100,000+ | Free | 0 | Everyone | ["Art & Design", "Creativity"] | 2018-06-20 00:00:00 | 1.1 | 4.4 | 7.211111 |
Clash of Clans | ["GAME", "FAMILY"] | 4.6 | 44893888 | 98 | 100,000,000+ | Free | 0 | Everyone 10+ | ["Strategy"] | 2018-07-15 00:00:00 | 10.322.16 | 4.1 | null |
Note: the Average_Sentiment_Polarity values in this example may not be correct
Part 5
- Using df_3 create a new Dataframe (df_4) containing the number of applications, the average rating and the average sentiment polarity by genre and save it as a parquet file with gzip compression with the name "googleplaystore_metrics".
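The aggregation by genre could be sketched as below. The names `Part5Sketch` and `metricsByGenre` are hypothetical, and the input is assumed to already contain "Genres" as an array column together with "Rating" and "Average_Sentiment_Polarity" (for instance, the Dataframe produced in Part 4), since the average polarity has to come from somewhere.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, count, explode}

object Part5Sketch {
  // Hypothetical helper: one row per genre with app count and average metrics.
  def metricsByGenre(apps: DataFrame): DataFrame =
    apps
      // An app with several genres contributes one row to each of them.
      .withColumn("Genre", explode(col("Genres")))
      .groupBy("Genre")
      .agg(
        count(col("App")).as("Count"),
        // avg() skips nulls, so apps without a rating or polarity do not affect the mean.
        avg(col("Rating")).as("Average_Rating"),
        avg(col("Average_Sentiment_Polarity")).as("Average_Sentiment_Polarity"))
}
```

The result can be written with `.write.option("compression", "gzip").parquet("googleplaystore_metrics")`, in the same way as the Part 4 output.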
Example/Sample 5
Genre | Count | Average_Rating | Average_Sentiment_Polarity |
---|---|---|---|
Art & Design | 345 | 2.6 | -2.1 |
Pretend Play | 78 | 3.12222 | null |
Creativity | 139 | 3.8 | 3.3332 |
Strategy | 451 | 4.2 | 8.2 |