Running on the Daplab - derlin/bda-lsa-project GitHub Wiki

Assuming you have access to the Daplab, follow the steps below to run the project there.

copy the jar to the Daplab:

scp -P 2201 target/scala-2.11/bda-project-lsa-assembly-1.0.jar [email protected]:

ssh into the Daplab:

ssh -p 2201 [email protected]

create the config.properties file:

echo "path.wikidump=wikidump-1500.xml
path.base=/Users/Lin/Documents/spark-wiki/1500" > config.properties
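
As an alternative sketch, the same file can be written with a quoted here-document, which avoids quoting issues, and then checked for both keys. The values are copied from the command above; adapt `path.base` if that location does not match your setup.

```shell
# Write config.properties with a quoted here-document (no variable expansion).
cat > config.properties <<'EOF'
path.wikidump=wikidump-1500.xml
path.base=/Users/Lin/Documents/spark-wiki/1500
EOF

# Sanity check: each key should appear exactly once.
for key in path.wikidump path.base; do
  count=$(grep -c "^${key}=" config.properties)
  if [ "$count" -ne 1 ]; then
    echo "config.properties: expected one '${key}' entry, found ${count}" >&2
  fi
done
```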

create a new screen:

screen -US bda-wiki

(as a reminder: ctrl+A D to detach the screen, screen -r bda-wiki to reattach).

export environment variables:

export SPARK_MAJOR_VERSION=2
export HADOOP_CONF_DIR=/etc/hadoop/conf
export LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native:$LD_LIBRARY_PATH
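
Since these exports are lost when a new shell or screen window is opened, it can help to verify them before launching Spark. A minimal sketch: `check_env` is a hypothetical helper, not part of the project, and the `LD_LIBRARY_PATH` line is a variant of the export above that avoids a trailing colon when the variable was previously empty.

```shell
export SPARK_MAJOR_VERSION=2
export HADOOP_CONF_DIR=/etc/hadoop/conf
# Append the previous value only if there was one.
export LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

# check_env (hypothetical helper): returns non-zero if any variable is unset or empty.
check_env() {
  missing=0
  for name in SPARK_MAJOR_VERSION HADOOP_CONF_DIR LD_LIBRARY_PATH; do
    # Indirect variable lookup that also works in plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_env && echo "environment ready"
```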

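launch the shell on yarn (the scp above places the jar in your home directory, so adjust the --jars path if you launch from there):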

spark-shell --master yarn --deploy-mode client --jars target/scala-2.11/bda-project-lsa-assembly-1.0.jar

For better performance, you can set the driver memory, the executor memory and the number of cores per executor with the following options:

spark-shell --master yarn --deploy-mode client --jars target/scala-2.11/bda-project-lsa-assembly-1.0.jar \
      --driver-memory 2G --executor-memory 15G --executor-cores 8 
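
To try different resource settings without retyping the whole command, the options can be kept in variables. This is only a sketch: the variable names and the `spark_shell_cmd` function are hypothetical, the defaults mirror the values above, and the function only prints the command (a dry run).

```shell
# Hypothetical variable names; defaults mirror the values used above.
JAR=${JAR:-target/scala-2.11/bda-project-lsa-assembly-1.0.jar}
DRIVER_MEMORY=${DRIVER_MEMORY:-2G}
EXECUTOR_MEMORY=${EXECUTOR_MEMORY:-15G}
EXECUTOR_CORES=${EXECUTOR_CORES:-8}

# Print the spark-shell command instead of running it.
# Drop the leading "echo" to launch the shell for real.
spark_shell_cmd() {
  echo spark-shell --master yarn --deploy-mode client \
    --jars "$JAR" \
    --driver-memory "$DRIVER_MEMORY" \
    --executor-memory "$EXECUTOR_MEMORY" \
    --executor-cores "$EXECUTOR_CORES"
}

spark_shell_cmd
```

For example, running the script with `EXECUTOR_MEMORY=8G` set in the environment prints the same command with `--executor-memory 8G`.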

View your yarn job: http://hadoop-rm.daplab.ch

Access your spark job UI:

  1. open a tunnel to the Daplab:

     sshuttle --dns -r [email protected]:2201 10.10.10.0/24
    
  2. find your Spark UI address: click on your job in the http://hadoop-rm.daplab.ch interface, then on ApplicationMaster