EMR 008 Alluxio on EMR - qyjohn/AWS_Tutorials GitHub Wiki
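(1) Start a single-node EMR cluster with the following configuration. It registers the Alluxio file system implementation with Hadoop and appends the Alluxio client jar (installed in step 2) to the Spark driver and executor classpaths: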
[
{
"Classification": "core-site",
"Properties": {
"fs.alluxio.impl": "alluxio.hadoop.FileSystem"
}
},
{
"Classification": "spark-defaults",
"Properties": {
"spark.driver.extraClassPath": ":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/alluxio-1.8.1/client/alluxio-1.8.1-client.jar",
"spark.executor.extraClassPath": ":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/alluxio-1.8.1/client/alluxio-1.8.1-client.jar"
}
}
]
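Because the driver and executor classpaths above are identical, the configuration file can be generated from a single variable, which avoids the two long strings drifting apart. The following is a minimal sketch, not the author's method: the output file name alluxio-config.json and the variable names are assumptions made for illustration, and the classpath entries are copied from the configuration above.

```shell
# Classpath entries copied from the configuration above, without the Alluxio jar.
EMR_CP=":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar"

# Location of the Alluxio client jar after the install step.
ALLUXIO_JAR="/home/hadoop/alluxio-1.8.1/client/alluxio-1.8.1-client.jar"

# The same value is used for both the driver and the executor.
FULL_CP="${EMR_CP}:${ALLUXIO_JAR}"

# Write the configuration (alluxio-config.json is a hypothetical file name).
cat > alluxio-config.json <<EOF
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.alluxio.impl": "alluxio.hadoop.FileSystem"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "${FULL_CP}",
      "spark.executor.extraClassPath": "${FULL_CP}"
    }
  }
]
EOF
```

The generated file can then be supplied when the cluster is created, for example through the `--configurations` option of `aws emr create-cluster`.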
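(2) Install Alluxio. On the cluster node, extract the Alluxio 1.8.1 binary tarball in /home/hadoop, set the master hostname, then validate the environment, format Alluxio, and start it locally: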
tar -xzf alluxio-1.8.1-bin.tar.gz
cd alluxio-1.8.1
cp conf/alluxio-site.properties.template conf/alluxio-site.properties
# Add the following line to conf/alluxio-site.properties,
# replacing 192.168.2.89 with the private IP address of your own node:
# alluxio.master.hostname=192.168.2.89
./bin/alluxio validateEnv local
./bin/alluxio format
./bin/alluxio-start.sh local SudoMount
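The manual edit to conf/alluxio-site.properties in the steps above can also be scripted. The sketch below assumes a hypothetical helper named set_property (it is not part of Alluxio) that replaces any existing entry for a key, so running it twice does not leave duplicate lines. To keep the example self-contained, it is demonstrated on a temporary file.

```shell
# Hypothetical helper: set key=value in a properties file, replacing any
# existing line that starts with "key=" so reruns do not add duplicates.
set_property() {
  file="$1"; key="$2"; value="$3"
  touch "$file"
  tmp="$(mktemp)"
  # Keep every line that does not start with "key=" (plain string match).
  awk -v k="$key" 'index($0, k "=") != 1' "$file" > "$tmp"
  # Append the new entry.
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Copy back in place so the original file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Demonstration on a temporary file; calling the helper twice leaves one entry.
demo="$(mktemp)"
set_property "$demo" alluxio.master.hostname 192.168.2.89
set_property "$demo" alluxio.master.hostname 192.168.2.89
cat "$demo"

# Against the real file, the call would be:
#   set_property conf/alluxio-site.properties alluxio.master.hostname 192.168.2.89
```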
Create a local file test.csv, with the following content:
12345,ABCDE
98760,ZAWWE
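The file can also be created from the command line. This is a minimal sketch that writes test.csv to the current directory and prints its size in bytes, which should be 24 (two 11-character rows plus two newlines) and matches the size that `alluxio fs ls` reports after the copy.

```shell
# Write the two sample rows to test.csv in the current directory.
printf '%s\n' '12345,ABCDE' '98760,ZAWWE' > test.csv

# Print the file size in bytes.
wc -c < test.csv
```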
Copy the local file to Alluxio:
./bin/alluxio fs copyFromLocal test.csv /test.csv
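./bin/alluxio fs ls /
The listing should show the file. NOT_PERSISTED means it is stored in Alluxio but has not been written to the under storage:
-rw-r--r-- hadoop hadoop 24 NOT_PERSISTED 04-19-2019 00:15:24:804 100% /test.csv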
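To read from Alluxio, Spark needs the Alluxio client jar on both the driver and executor classpaths. The jar is located at:
/home/hadoop/alluxio-1.8.1/client/alluxio-1.8.1-client.jar
If you do not have a good place to put this file, copy it to /usr/share/aws/emr/emrfs/lib/, which the classpaths in step (1) already include.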
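With the jar on the classpath, read the file from spark-shell. The URI uses the Alluxio master address and the default master port 19998:
val input = "alluxio://192.168.2.89:19998/test.csv"
val df = spark.read.format("csv").option("header", "false").load(input)
// Print the two rows to confirm the read worked.
df.show()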
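If you configure Hadoop manually instead of through the EMR configuration, add the following property to core-site.xml: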
<property>
<name>fs.alluxio.impl</name>
<value>alluxio.hadoop.FileSystem</value>
</property>
The equivalent EMR configuration, shown here with only the driver classpath set:
[
{
"Classification": "core-site",
"Properties": {
"fs.alluxio.impl": "alluxio.hadoop.FileSystem"
}
},
{
"Classification": "spark-defaults",
"Properties": {
"spark.driver.extraClassPath": ":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/alluxio-1.8.1/client/alluxio-1.8.1-client.jar"
}
}
]