EMR 008 Alluxio on EMR - qyjohn/AWS_Tutorials GitHub Wiki

(1) Start a single-node EMR cluster with the following configuration:

[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.alluxio.impl": "alluxio.hadoop.FileSystem"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": ":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/alluxio-1.8.1/client/alluxio-1.8.1-client.jar",
      "spark.executor.extraClassPath": ":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/alluxio-1.8.1/client/alluxio-1.8.1-client.jar"
    }
  }
]
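
Once the cluster is running, one way to confirm that the spark-defaults classification took effect is to look for the Alluxio client jar in both classpath settings. The following is a minimal sketch, assuming the rendered settings live in /etc/spark/conf/spark-defaults.conf (the usual location on EMR, but check your release). The function name check_alluxio_classpath is hypothetical.

```shell
# Check that both Spark classpath settings mention the Alluxio client jar.
# Usage: check_alluxio_classpath <spark-defaults.conf> <jar-path>
check_alluxio_classpath() {
    conf_file=$1
    jar_path=$2
    for key in spark.driver.extraClassPath spark.executor.extraClassPath; do
        # Select the line for this key, then search it for the jar path
        # as a literal string (-F) so the dots are not treated as regex.
        if ! grep "^${key}[[:space:]]" "$conf_file" | grep -qF "$jar_path"; then
            echo "missing in ${key}"
            return 1
        fi
    done
    echo "ok"
}

# Example (assumed file location):
# check_alluxio_classpath /etc/spark/conf/spark-defaults.conf \
#     /home/hadoop/alluxio-1.8.1/client/alluxio-1.8.1-client.jar
```

If the function reports a missing key, the jar path in the configuration above probably did not reach that setting.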

(2) Install Alluxio

tar -xzf alluxio-1.8.1-bin.tar.gz
cd alluxio-1.8.1
cp conf/alluxio-site.properties.template conf/alluxio-site.properties
# Add the following line to conf/alluxio-site.properties
# (replace 192.168.2.89 with the private IP address of your own node):
#   alluxio.master.hostname=192.168.2.89
./bin/alluxio validateEnv local
./bin/alluxio format
./bin/alluxio-start.sh local SudoMount
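
The manual edit of conf/alluxio-site.properties in the steps above can also be scripted. This is a minimal sketch, not part of the Alluxio tooling: set_property is a hypothetical helper that appends a key=value line, or rewrites it if the key is already set. It assumes the value contains no `|` character, which holds for IP addresses and hostnames.

```shell
# Append key=value to a properties file, or update the line if the key exists.
# Usage: set_property <file> <key> <value>
set_property() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        # Rewrite the existing line via a temporary file (portable, no sed -i).
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    else
        echo "${key}=${value}" >> "$file"
    fi
}

# Example, run before "./bin/alluxio format":
# set_property conf/alluxio-site.properties alluxio.master.hostname 192.168.2.89
```

Commented-out template lines that start with `#` are left untouched, because only lines beginning with the key itself are matched.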

Create a local file test.csv with the following content:

12345,ABCDE
98760,ZAWWE
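
If you prefer not to open an editor, the file can be written from the shell. A small sketch, where make_test_csv is a hypothetical helper name:

```shell
# Write the two sample rows, each terminated by a newline.
# Usage: make_test_csv <path>
make_test_csv() {
    printf '12345,ABCDE\n98760,ZAWWE\n' > "$1"
}

make_test_csv test.csv
# Two rows of 11 characters plus a newline each give 24 bytes.
wc -c < test.csv
```

The 24-byte size is the same value that `alluxio fs ls` reports for /test.csv after the copy.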

Copy the local file to Alluxio:

./bin/alluxio fs copyFromLocal test.csv /test.csv
./bin/alluxio fs ls /
-rw-r--r-- hadoop         hadoop                      24   NOT_PERSISTED 04-19-2019 00:15:24:804 100% /test.csv

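To read from Alluxio, Spark needs the Alluxio client jar on its classpath. The configuration in step (1) points to it at:

/home/hadoop/alluxio-1.8.1/client/alluxio-1.8.1-client.jar

If you do not have a good place to put this file, put it under emrfs/lib/ (the /usr/share/aws/emr/emrfs/lib/ directory in the classpath of step (1)).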
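
Copying the jar can be sketched as follows. install_client_jar is a hypothetical helper, and the destination directory is taken from the classpath in step (1). On a real node the target is owned by root, so the copy itself would need sudo.

```shell
# Copy the Alluxio client jar into a directory that is on the Spark classpath.
# Usage: install_client_jar <jar> <destination-directory>
install_client_jar() {
    jar=$1
    dest_dir=$2
    if [ ! -f "$jar" ]; then
        echo "client jar not found: $jar" >&2
        return 1
    fi
    cp "$jar" "$dest_dir/"
}

# Equivalent one-liner on the EMR node (assumed paths):
# sudo cp /home/hadoop/alluxio-1.8.1/client/alluxio-1.8.1-client.jar /usr/share/aws/emr/emrfs/lib/
```

On a single-node cluster one copy is enough. With more nodes, the jar would have to be present on every node that runs executors.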

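Read the file from Alluxio in spark-shell:

val input = "alluxio://192.168.2.89:19998/test.csv"
val df = spark.read.format("csv").option("header", "false").load(input)
// Print the rows to confirm that the read works
df.show()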

The core-site classification in step (1) corresponds to the following property in core-site.xml:

<property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
</property>