icp3module3

What are DataFrames? DataFrames have the following features:

1. Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster.
2. Support for a wide array of data formats and storage systems.
3. State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer.
4. Seamless integration with all big data tooling and infrastructure via Spark.
5. APIs for Python, Java, Scala, and R.

Hence, each dataset is imported into Scala by creating a DataFrame from it.

Step 1: Open the source code in IntelliJ; it automatically adjusts the version settings and generates the build.sbt, so no configuration errors occur at runtime.
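
A minimal build.sbt of the kind referred to above might look like the sketch below; the exact Scala and Spark version numbers are assumptions and should match the local installation:

```scala
// build.sbt -- a minimal sketch; the version numbers are assumptions
name := "icp3module3"
version := "0.1"
scalaVersion := "2.11.12"

// spark-sql provides the DataFrame API and the Catalyst optimizer
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
```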

Step 2: Import the dataset into IntelliJ with the required Apache Spark dependencies, create the DataFrame directly on import, and save it as a temp view named 'survey'.
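
A sketch of this step, assuming the dataset is a CSV file named survey.csv with a header row (both assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Local SparkSession for running inside IntelliJ
val spark = SparkSession.builder()
  .appName("icp3module3")
  .master("local[*]")
  .getOrCreate()

// Create the DataFrame directly on import;
// the file name "survey.csv" is an assumption
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("survey.csv")

// Register it as the 'survey' temp view so it can be queried with SQL
df.createOrReplaceTempView("survey")
```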

Step 3: Run the GROUP BY SQL query in Scala as follows:
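
A hedged sketch of such a query; the grouping column Country is an assumption about the dataset's schema:

```scala
// GROUP BY via Spark SQL against the 'survey' temp view;
// the column name "Country" is an assumption
val grouped = spark.sql(
  "SELECT Country, COUNT(*) AS cnt FROM survey GROUP BY Country")
grouped.show()
```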

Output of GROUP BY:

Step 4: A normal join is possible by splitting the dataset into slices of 50 and 80 rows and then joining them, as sketched below:
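
One reading of the 50/80 split is two overlapping slices taken with limit, joined on a shared key; the key column id is an assumption and should be replaced with a real column from the dataset:

```scala
// Split the dataset into slices of 50 and 80 rows
val df50 = df.limit(50)
val df80 = df.limit(80)

// Normal (inner) join on an assumed key column "id"
val joined = df50.join(df80, Seq("id"))
joined.show()
```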

Output of normal join:

Step 5: Aggregate query input and output:
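
A sketch of an aggregate query of this kind; the numeric column Age is an assumption:

```scala
import org.apache.spark.sql.functions.{avg, max, min}

// Aggregate over an assumed numeric column "Age"
val aggregated = df.agg(
  min("Age").alias("min_age"),
  max("Age").alias("max_age"),
  avg("Age").alias("avg_age"))
aggregated.show()
```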

Step 6: The 13th row in the dataset is fetched as follows:
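
DataFrames have no positional index, so one simple approach is to take the first 13 rows and keep the last one; a minimal sketch:

```scala
// take(13) pulls the first 13 rows to the driver;
// index 12 is the 13th row (zero-based indexing)
val thirteenth = df.take(13)(12)
println(thirteenth)
```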

Step 7 (bonus): Parse the comma-delimited values in the dataset.
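
If the bonus file is read as raw text, the comma-separated fields can be parsed by splitting each line; this sketch assumes an input file bonus.txt and no quoted fields containing embedded commas:

```scala
// Read the file as plain text and split each line on commas;
// "bonus.txt" and the simple split (no quoted fields) are assumptions
val lines = spark.sparkContext.textFile("bonus.txt")
val fields = lines.map(_.split(","))
fields.take(5).foreach(arr => println(arr.mkString(" | ")))
```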
