spark & kaspacore

Back to HOME

spark

Example Environment

| Item | Value |
| --- | --- |
| Spark IP address | 172.16.2.50 |
| Hadoop IP address (network interface) | 172.16.2.50 (must be the same host as Spark) |
| Hadoop IP address (docker0 interface) | 172.17.0.1 |
| Kafka IP address | 172.16.2.40 |
| Hadoop user | ubuntu |

Prerequisites

✅ Ubuntu 20.04 LTS installed and updated with the following command.

sudo apt update && sudo apt -y upgrade

✅ Time Zone and NTP already set.
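If the time zone still needs to be set, a minimal sketch using timedatectl (with Asia/Jakarta as an example zone) is:

# set the time zone (replace Asia/Jakarta with your zone) and enable NTP synchronization
sudo timedatectl set-timezone Asia/Jakarta
sudo timedatectl set-ntp true

# confirm the current settings
timedatectl status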

✅ Docker 20.10 or later installed with the following command.

sudo apt -y install docker.io

✅ Docker Compose 2.13 or later installed with the following command.

sudo curl -L "https://github.com/docker/compose/releases/download/v2.13.0/docker-compose-$(uname -s)-$(uname -m)"\
 -o /usr/bin/docker-compose && sudo chmod +x /usr/bin/docker-compose
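To confirm that both tools meet the version requirements, for example:

# Docker should report 20.10 or later, Docker Compose 2.13 or later
sudo docker version --format '{{.Server.Version}}'
docker-compose version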

Setup Spark

1. Configure Spark

▶️ Clone Mata-Elang-Stable/spark-asset from GitHub to your server.

git clone https://github.com/mata-elang-stable/spark-asset.git ~/spark

▶️ Configure .env to set the environment variables.

mv ~/spark/.env.example ~/spark/.env
nano ~/spark/.env
Configuration

🔑 Change the ubuntu value of HADOOP_USER_NAME to your user account if necessary. (e.g. hadoop)

🔑 Change the ubuntu in "/user/ubuntu" to your user account if necessary. (e.g. /user/hadoop)

HADOOP_USER_NAME=ubuntu
SPARK_EVENTLOG_DIR=hdfs://172.17.0.1:9000/user/ubuntu/spark/spark-events
SPARK_APP_JAR_PATH=hdfs://172.17.0.1:9000/user/ubuntu/kaspacore/files/kaspacore.jar
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://172.17.0.1:9000/user/ubuntu/spark/spark-events"
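Since the containers reach HDFS through the docker0 address, it can help to confirm that the NameNode is actually reachable on 172.17.0.1:9000 before starting Spark. A quick check (assuming the hdfs client and netcat are available on this host) is:

# confirm the NameNode RPC port is reachable on the docker0 address
nc -zv 172.17.0.1 9000

# list the HDFS root through the same address the containers will use
hdfs dfs -ls hdfs://172.17.0.1:9000/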

▶️ Create a Hadoop-HDFS directory.

🔑 Change the ubuntu in "/user/ubuntu" to your user account if necessary. (e.g. /user/hadoop)

hdfs dfs -mkdir -p hdfs://localhost:9000/user/ubuntu/spark/spark-events
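To verify that the directory was created, for example:

# the new spark-events directory should be listed
hdfs dfs -ls hdfs://localhost:9000/user/ubuntu/spark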

▶️ Configure app.properties.

mv ~/spark/conf/app.properties.example ~/spark/conf/app.properties
nano ~/spark/conf/app.properties
Configuration

🔑 Change the ubuntu in "/user/ubuntu" to your user account if necessary. (e.g. /user/hadoop)

🔑 Change TIMEZONE to match your time zone if necessary. (e.g. Asia/Jakarta)

🔑 Change KAFKA_BOOTSTRAP_SERVERS to the Kafka server IP address and port number. (e.g. 172.16.2.40:9093)

SPARK_MASTER=spark://spark-master:7077
SPARK_CHECKPOINT_PATH=hdfs://172.17.0.1:9000/user/ubuntu/kafka-checkpoint
TIMEZONE=UTC

KAFKA_BOOTSTRAP_SERVERS=172.17.0.1:9093
KAFKA_INPUT_STARTING_OFFSETS=latest

SENSOR_STREAM_INPUT_TOPIC=sensor_events
SENSOR_STREAM_OUTPUT_TOPIC=sensor_events_with_geoip

MAXMIND_DB_PATH=hdfs://172.17.0.1:9000/user/ubuntu/kaspacore/files/GeoLite2-City.mmdb
MAXMIND_DB_FILENAME=GeoLite2-City.mmdb
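The SPARK_APP_JAR_PATH and MAXMIND_DB_PATH settings assume that kaspacore.jar and GeoLite2-City.mmdb already exist in HDFS. If they have not been uploaded yet, a sketch of the upload (assuming both files are already in your home directory; change ubuntu to your user account if necessary) is:

# create the kaspacore files directory in HDFS
hdfs dfs -mkdir -p hdfs://localhost:9000/user/ubuntu/kaspacore/files

# upload the application jar and the MaxMind GeoIP database
hdfs dfs -put ~/kaspacore.jar hdfs://localhost:9000/user/ubuntu/kaspacore/files/
hdfs dfs -put ~/GeoLite2-City.mmdb hdfs://localhost:9000/user/ubuntu/kaspacore/files/

# confirm both files are in place
hdfs dfs -ls hdfs://localhost:9000/user/ubuntu/kaspacore/files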

▶️ Prepare spark-defaults.conf.

mv ~/spark/conf/spark-defaults.conf.example ~/spark/conf/spark-defaults.conf
Click here if you want to edit the configuration.

▶️ Configure spark-defaults.conf.

nano ~/spark/conf/spark-defaults.conf

The contents of the configuration file are as follows:

# Worker
spark.worker.cleanup.enabled=true
spark.worker.cleanup.interval=1800
spark.worker.cleanup.appDataTtl=14400

# History Server
spark.history.ui.port=18080
spark.history.retainedApplications=10
spark.history.fs.update.interval=10s
spark.history.fs.cleaner.enabled=true
spark.history.fs.cleaner.interval=1d
spark.history.fs.cleaner.maxAge=7d

# App Configuration
spark.master=spark://spark-master:7077
spark.eventLog.enabled=true

▶️ Prepare log4j2.properties.

mv ~/spark/conf/log4j2.properties.example ~/spark/conf/log4j2.properties
Click here if you want to edit the configuration.

nano ~/spark/conf/log4j2.properties

The contents of the configuration file are as follows:

log4j.rootLogger=ERROR, console

# set the log level for these components
log4j.logger.com.test=DEBUG
log4j.logger.org=ERROR
log4j.logger.org.apache.spark=ERROR
log4j.logger.org.spark-project=ERROR
log4j.logger.org.apache.hadoop=ERROR
log4j.logger.io.netty=ERROR
log4j.logger.org.apache.zookeeper=ERROR

# add a ConsoleAppender to the logger stdout to write to the console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# use a simple message format
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n

▶️ Configure docker-compose.yaml.

nano ~/spark/docker-compose.yaml
Configuration

🔑 Change services.spark-worker.deploy.replicas to increase the number of workers as needed.

services:
  spark-worker:
    environment:
      <<: *spark-worker-default-env
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 4G
    deploy:
      mode: replicated
      replicas: 2
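Adjust SPARK_WORKER_CORES and SPARK_WORKER_MEMORY to fit the host's resources. If you prefer not to edit the file, Docker Compose can also override the worker count at start time with the --scale flag, for example:

# start three workers instead of the replicas value in the file
sudo docker-compose -f ~/spark/docker-compose.yaml up -d --scale spark-worker=3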

2. Start Spark

▶️ Start Spark service.

sudo docker-compose -f ~/spark/docker-compose.yaml up -d

▶️ Confirm the containers are running.

sudo docker-compose -f ~/spark/docker-compose.yaml ps -a
Result: It takes about 30 seconds for the spark-submit-* services to complete the registration process and exit with status 0.

spark-spark-historyserver-1   "/opt/entrypoint.sh …"   spark-historyserver   running             0.0.0.0:18080->18080/tcp, :::18080->18080/tcp
spark-spark-master-1          "/opt/entrypoint.sh …"   spark-master          running             0.0.0.0:8080->8080/tcp, :::8080->8080/tcp
spark-spark-submit-aggr-1     "/opt/entrypoint.sh …"   spark-submit-aggr     exited (0)
spark-spark-submit-enrich-1   "/opt/entrypoint.sh …"   spark-submit-enrich   exited (0)
spark-spark-worker-1          "/opt/entrypoint.sh …"   spark-worker          running
spark-spark-worker-2          "/opt/entrypoint.sh …"   spark-worker          running
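If you want to confirm that the submit jobs finished cleanly rather than failed, their exit codes can be checked directly (container names as shown above):

# 0 means the spark-submit container completed and exited normally
sudo docker inspect --format '{{.State.ExitCode}}' spark-spark-submit-enrich-1
sudo docker inspect --format '{{.State.ExitCode}}' spark-spark-submit-aggr-1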

Admin Web UI

▶️ Open the following URL to see the Spark Master.

  • URL: http://<SPARK_SERVER_IP_OR_NAME (e.g. 172.16.2.50)>:8080/
Click to view screen image

(Screenshot: Spark Master web UI)

▶️ Open the following URL to see the Spark History Server.

  • URL: http://<SPARK_SERVER_IP_OR_NAME (e.g. 172.16.2.50)>:18080/
Click to view screen image

(Screenshot: Spark History Server web UI)
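If the server has no desktop browser available, the same endpoints can be checked from the command line, for example (replace 172.16.2.50 with your Spark server address):

# both requests should print HTTP status 200 when the UIs are up
curl -sS -o /dev/null -w '%{http_code}\n' http://172.16.2.50:8080/
curl -sS -o /dev/null -w '%{http_code}\n' http://172.16.2.50:18080/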

Useful Commands

Click to show commands

Service Commands

✅ Show service status

sudo docker-compose -f ~/spark/docker-compose.yaml ps -a
Result

spark-spark-historyserver-1   "/opt/entrypoint.sh …"   spark-historyserver   running             0.0.0.0:18080->18080/tcp, :::18080->18080/tcp
spark-spark-master-1          "/opt/entrypoint.sh …"   spark-master          running             0.0.0.0:8080->8080/tcp, :::8080->8080/tcp
spark-spark-submit-aggr-1     "/opt/entrypoint.sh …"   spark-submit-aggr     exited (0)
spark-spark-submit-enrich-1   "/opt/entrypoint.sh …"   spark-submit-enrich   exited (0)
spark-spark-worker-1          "/opt/entrypoint.sh …"   spark-worker          running
spark-spark-worker-2          "/opt/entrypoint.sh …"   spark-worker          running

✅ Start services

sudo docker-compose -f ~/spark/docker-compose.yaml up -d

✅ Stop services (and remove containers)

sudo docker-compose -f ~/spark/docker-compose.yaml down

✅ Stop services (and keep containers)

sudo docker-compose -f ~/spark/docker-compose.yaml stop

✅ Restart services

sudo docker-compose -f ~/spark/docker-compose.yaml restart

Maintenance Commands

✅ Build Mata Elang Spark image.

  • Please prepare another host to build the image.
# update packages and install docker
sudo apt update && sudo apt -y upgrade
sudo apt -y install docker.io

# download Spark
wget -P ~/ https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3-scala2.13.tgz
tar -xzf ~/spark-3.3.1-bin-hadoop3-scala2.13.tgz -C ~/

# build docker image
cd ~/spark-3.3.1-bin-hadoop3-scala2.13
sudo docker build -t <REPOSITORY>/<IMAGE>[:TAG] -f kubernetes/dockerfiles/spark/Dockerfile .

# push image to your Docker Hub
sudo docker login -u <USERNAME>
Password:
sudo docker push <REPOSITORY>/<IMAGE>[:TAG]
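Before pushing, you can confirm the image was built and reports the expected Spark version. This sketch assumes the image keeps the standard /opt/spark layout of the Apache Spark Dockerfile:

# list the freshly built image
sudo docker images <REPOSITORY>/<IMAGE>

# print the Spark version baked into the image
sudo docker run --rm <REPOSITORY>/<IMAGE>[:TAG] /opt/spark/bin/spark-submit --version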

Configuration Commands

✅ Show environment variables

sudo docker inspect --format='{{range .Config.Env}}{{println .}}{{end}}' spark-spark-master-1
sudo docker inspect --format='{{range .Config.Env}}{{println .}}{{end}}' spark-spark-worker-1
sudo docker inspect --format='{{range .Config.Env}}{{println .}}{{end}}' spark-spark-submit-enrich-1
sudo docker inspect --format='{{range .Config.Env}}{{println .}}{{end}}' spark-spark-submit-aggr-1
sudo docker inspect --format='{{range .Config.Env}}{{println .}}{{end}}' spark-spark-historyserver-1

✅ Show the loaded configurations

sudo docker-compose -f ~/spark/docker-compose.yaml exec spark-master cat /opt/spark/conf/app.properties
sudo docker-compose -f ~/spark/docker-compose.yaml exec spark-master cat /opt/spark/conf/spark-defaults.conf
sudo docker-compose -f ~/spark/docker-compose.yaml exec spark-master cat /opt/spark/conf/log4j2.properties

Log Commands

✅ Show Spark log

sudo docker-compose -f ~/spark/docker-compose.yaml logs spark-master
sudo docker-compose -f ~/spark/docker-compose.yaml logs spark-worker
sudo docker-compose -f ~/spark/docker-compose.yaml logs spark-submit-aggr
sudo docker-compose -f ~/spark/docker-compose.yaml logs spark-submit-enrich
sudo docker-compose -f ~/spark/docker-compose.yaml logs spark-historyserver
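To follow a service's log in real time instead of printing it once, the usual -f and --tail options work as well, for example:

# stream the last 100 lines of the master log and keep following it
sudo docker-compose -f ~/spark/docker-compose.yaml logs -f --tail=100 spark-master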

Version Commands

✅ Show Spark version

sudo docker-compose -f ~/spark/docker-compose.yaml exec spark-master /opt/spark/bin/spark-shell --version

✅ Show Docker version

sudo docker version

✅ Show Docker Compose version

docker-compose version

✅ Show OS version

cat /etc/os-release

Next Step >>
