Running on the Daplab - derlin/bda-lsa-project GitHub Wiki
Assuming you have access to the Daplab, follow the steps below to run the project there.
copy the jar to the Daplab:
scp -P 2201 target/scala-2.11/bda-project-lsa-assembly-1.0.jar [email protected]:
ssh into the Daplab:
ssh -p 2201 [email protected]
create the config.properties file:
echo "path.wikidump=wikidump-1500.xml
path.base=/Users/Lin/Documents/spark-wiki/1500" > config.properties
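The same file can also be written with a heredoc, which keeps each property on its own line and avoids quoting mistakes. The sketch below reuses the two values from the command above (adjust them to your own dump and base path) and then checks that both keys ended up in the file. The `key_count` variable is only an illustrative name.

```shell
# Write config.properties with a quoted heredoc (nothing inside is expanded).
cat > config.properties <<'EOF'
path.wikidump=wikidump-1500.xml
path.base=/Users/Lin/Documents/spark-wiki/1500
EOF

# Sanity check: count the lines that define one of the two expected keys.
key_count=$(grep -c -E '^path\.(wikidump|base)=' config.properties)
echo "keys found: $key_count"
```

If the count is not 2, one of the properties is missing or misspelled.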
create a new screen:
screen -US bda-wiki
(as a reminder: ctrl+A D to detach the screen, screen -r bda-wiki to reattach).
export environment variables:
export SPARK_MAJOR_VERSION=2
export HADOOP_CONF_DIR=/etc/hadoop/conf
export LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native:$LD_LIBRARY_PATH
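Since these exports are lost whenever a new shell or screen window is opened, it can help to verify them before launching Spark. The following is a minimal sketch: `check_env` is a hypothetical helper, not something provided by the Daplab or by Spark, and it only checks that each variable is non-empty, not that the paths exist.

```shell
# Same exports as above; ${LD_LIBRARY_PATH:-} tolerates an unset variable.
export SPARK_MAJOR_VERSION=2
export HADOOP_CONF_DIR=/etc/hadoop/conf
export LD_LIBRARY_PATH="/usr/hdp/current/hadoop-client/lib/native:${LD_LIBRARY_PATH:-}"

# check_env (hypothetical helper): returns non-zero if any named
# variable is unset or empty, and reports each missing name on stderr.
check_env() {
  rc=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      rc=1
    fi
  done
  return $rc
}

check_env SPARK_MAJOR_VERSION HADOOP_CONF_DIR LD_LIBRARY_PATH && echo "environment ready"
```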
launch the shell on yarn:
spark-shell --master yarn --deploy-mode client --jars target/scala-2.11/bda-project-lsa-assembly-1.0.jar
For better performance, you can tune the driver memory, the memory per executor, and the number of cores per executor with the following options:
spark-shell --master yarn --deploy-mode client --jars target/scala-2.11/bda-project-lsa-assembly-1.0.jar \
--driver-memory 2G --executor-memory 15G --executor-cores 8
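To try different resource settings without retyping the whole command, the options can be collected in variables first. This is only a sketch: the variable names (`DRIVER_MEMORY`, `EXECUTOR_MEMORY`, `EXECUTOR_CORES`, `JAR`, `spark_cmd`) are illustrative, the defaults simply mirror the values used above, and the script prints the command rather than running it.

```shell
# Illustrative variables; each can be overridden from the environment.
DRIVER_MEMORY="${DRIVER_MEMORY:-2G}"
EXECUTOR_MEMORY="${EXECUTOR_MEMORY:-15G}"
EXECUTOR_CORES="${EXECUTOR_CORES:-8}"
JAR="${JAR:-target/scala-2.11/bda-project-lsa-assembly-1.0.jar}"

# Assemble the same spark-shell invocation as above from the variables.
spark_cmd="spark-shell --master yarn --deploy-mode client --jars $JAR"
spark_cmd="$spark_cmd --driver-memory $DRIVER_MEMORY"
spark_cmd="$spark_cmd --executor-memory $EXECUTOR_MEMORY"
spark_cmd="$spark_cmd --executor-cores $EXECUTOR_CORES"

# Dry run: print the command for review instead of executing it.
echo "$spark_cmd"
```

Once the printed command looks right, it can be pasted into the screen session to launch the shell.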
View your yarn job: http://hadoop-rm.daplab.ch
Access your Spark job UI:
- open a tunnel to the Daplab:
sshuttle --dns -r [email protected]:2201 10.10.10.0/24
- find your Spark UI address by clicking on your job in the http://hadoop-rm.daplab.ch interface, then clicking on ApplicationMaster