Splitting Text Data
In this lab, we will work with text data, specifically page view data from Wikimedia. The data is structured, but because it's in text format, Spark cannot automatically infer its structure. Therefore, we will use DataFrame tools in Apache Spark to create a DataFrame with a defined schema, making it easy to query and analyze the data.
The primary objective of this lab is to gain hands-on experience in processing and structuring text data into a format suitable for analysis. We will cover various aspects of working with text data, including loading, splitting, and transforming it into a structured DataFrame.
The estimated runtime for this lab is approximately 30 to 40 minutes, depending on your familiarity with Spark and the complexity of the data.
Before starting this lab, ensure that you have the following prerequisites in place:
- Apache Spark environment set up and configured.
- Basic knowledge of Apache Spark and its DataFrame API.
In this lab, we will perform the following tasks to process and structure text data:
Wikimedia PageView Data File
We provide a data file (`<path>/data/wiki-pageviews.txt`) that contains a dump of pageview data for many projects under the Wikimedia umbrella.
Each line in the data file contains four fields (an example line follows the list):
- Domain.project (e.g. "en.b")
- Page name (e.g. "AP_Biology/Evolution")
- Page view count (e.g. 1)
- Total response size in bytes (e.g. 10662 - but for this particular dump, the value is always 0).
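Putting the example values above together, a raw line looks something like this (fields separated by whitespace):

```
en.b AP_Biology/Evolution 1 0
```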
The data is in simple text format, so when we read it in, we get a DataFrame with a single column - a string containing all the data in each row. This is cumbersome to work with. In this lab, we'll apply a better schema to this data.
Tasks
- Create a DataFrame by reading in the page view data in `sparklabs/data/wiki-pageviews.txt` (a minimal sketch follows this list).
- Once you've created it, view a few lines to see the format of the data.
- You'll see that you have one line of input per DataFrame row.
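A minimal sketch of this first step, assuming a running SparkSession named spark and that the file sits at sparklabs/data/wiki-pageviews.txt relative to your working directory:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read the raw text file; each line becomes one row with a single string column named "value"
Dataset<Row> viewsDF = spark.read().text("sparklabs/data/wiki-pageviews.txt");

// Peek at a few rows to see the raw, unstructured format
viewsDF.limit(5).show(false);
```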
Our first step in creating an easier-to-use schema is to split each row into separate columns. We'll use the split() function defined in the Spark SQL functions.
We've used this in our word count examples in the main manual.
Tasks
- Create a DataFrame by splitting each line up.
- Use split() to split on whitespace (pattern "\s+").
Java: You'll need a static import of the Spark SQL functions, as shown in the Java snippet below.
import static org.apache.spark.sql.functions.*;
Dataset<Row> splitViewsDF = viewsDF.withColumn("split_data", split(col("your_column_name"), "your_delimiter"));
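Applied concretely to this lab's data - where the raw column is named value, the default for spark.read().text() - the call looks roughly like this (the same line appears in the full program further below):

```java
import static org.apache.spark.sql.functions.*;

// Split each raw line on runs of whitespace; the result is a single array<string> column
Dataset<Row> splitViewsDF = viewsDF.select(split(col("value"), "\\s+").as("splitLine"));

splitViewsDF.printSchema();        // splitLine: array<string>
splitViewsDF.limit(5).show(false);
```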
We'll create a DataFrame with an easier-to-use schema containing the following columns, which align with the data in the views file.
- domain: String - The domain.project data.
- pageName: String - The page name.
- viewCount: Integer - The view count.
- size: Long - The response size (always 0 in this file, but we'll keep it in our dataframe).
Execute the queries in the program below one after the other.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class WikimediaPageViewData {
    public static void main(String[] args) {
        // Create a SparkSession
        SparkSession spark = SparkSession.builder()
                .appName("WikimediaPageViewData")
                .getOrCreate();

        // Read the data file and show some lines
        Dataset<Row> viewsDF = spark.read().text("data/wiki-pageviews.txt");
        viewsDF.limit(5).show();

        // Split the lines on whitespace, show some lines, and show the schema
        Dataset<Row> splitViewsDF = viewsDF.select(split(col("value"), "\\s+").as("splitLine"));
        splitViewsDF.limit(5).show(false);
        splitViewsDF.printSchema();

        // Create a better schema, view the schema, and view some data
        Dataset<Row> viewsWithSchemaDF = splitViewsDF.select(
                col("splitLine").getItem(0).as("domain"),
                col("splitLine").getItem(1).as("pageName"),
                col("splitLine").getItem(2).cast("integer").as("viewCount"),
                col("splitLine").getItem(3).cast("long").as("size")
        );
        viewsWithSchemaDF.printSchema();
        viewsWithSchemaDF.limit(5).show();

        // Try some queries
        viewsWithSchemaDF.filter(col("viewCount").gt(500)).limit(5).show();
        viewsWithSchemaDF.filter(col("domain").equalTo("en").and(col("viewCount").gt(500))).limit(5).show();
        viewsWithSchemaDF.filter(col("domain").equalTo("en")).orderBy(col("viewCount").desc()).limit(5).show();
        // Use not(...) to exclude pages whose names contain a colon
        viewsWithSchemaDF.filter(col("domain").equalTo("en")
                        .and(col("viewCount").gt(500))
                        .and(not(col("pageName").contains(":"))))
                .orderBy(col("viewCount").desc()).limit(5).show();

        // Stop the SparkSession
        spark.stop();
    }
}
We can see that text data can require a little more work than data like JSON, which comes with a pre-existing structure. Once you've restructured it with a clear schema - which is usually not difficult - the full power of DataFrames can be applied easily.
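For example, with the structured viewsWithSchemaDF in hand, a typical follow-up query such as totalling view counts per domain becomes a short chain of DataFrame calls (a sketch reusing the column names defined above):

```java
import static org.apache.spark.sql.functions.*;

// Total view count per domain, largest first - straightforward once the schema is in place
viewsWithSchemaDF.groupBy("domain")
        .agg(sum("viewCount").as("totalViews"))
        .orderBy(col("totalViews").desc())
        .limit(5)
        .show();
```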