FATE On Spark: Leverage the external cluster

Overview

As described in FATE on Spark, FATE supports using Spark, HDFS, and Pulsar as its computing, storage, and federation engines respectively. This document describes how to leverage these external services with KubeFATE.

The following figure describes the overall architecture of the deployment, where all components except Spark and HDFS run on Kubernetes.

(Figure: leverage external Spark cluster)

Spark and HDFS version requirements

  • Spark 2.4.1
  • HDFS 2.7

Configuration of FATE Cluster

HDFS

To use an external HDFS, a user only needs to specify the correct "namenode" address in the "python" component of the cluster's deployment file. For example:

python:
  ...
  hdfs:
    name_node: hdfs://192.168.0.1:9000
    path_prefix:

Here, 192.168.0.1:9000 is the namenode's address.

In an existing cluster, a user can update "python-config", the configmap of the "python" component, to switch to another HDFS cluster; however, the pod must be restarted for the update to take effect.
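
For example, a minimal sketch with kubectl, assuming the cluster was deployed into the namespace fate-9999 and the deployment of the "python" component is named python (both names are illustrative and may differ in your environment):

# edit the name_node address in the python component's configmap
kubectl edit configmap python-config -n fate-9999

# restart the python pod so the new setting takes effect
kubectl rollout restart deployment python -n fate-9999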

Pulsar

Similar to using an external HDFS, a user only needs to make adjustments in two parts:

the python component

A user needs to specify the correct "pulsar" address in the "python" component of the cluster's deployment file. For example, if "pulsar" is listening on 192.168.0.2:6650:

python:
  ...
  pulsar:
    host: 192.168.0.2
    port: 6650
    sslPort: 6651
    mng_port: 8080

the pulsar's route table

The route table of the "pulsar" component also needs to be updated. Here's an example, assuming the local party ID is "9999" while other parties remain unchanged:

pulsar:
  route_table:
    9999:
      host: 192.168.0.2
      port: 6650
      sslPort: 6651
    10000:
      host: 192.168.220.12
      port: 6650
      sslPort: 6651

For a deployed FATE cluster to connect to another "pulsar" cluster, a user needs to update the "python-config" and "pulsar-route-table" configmaps respectively.
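
A minimal sketch with kubectl, again assuming the namespace fate-9999 and deployment names python and pulsar (all illustrative):

# point the python component at the new pulsar address
kubectl edit configmap python-config -n fate-9999

# update the pulsar route table
kubectl edit configmap pulsar-route-table -n fate-9999

# restart the affected pods so the changes take effect
kubectl rollout restart deployment python -n fate-9999
kubectl rollout restart deployment pulsar -n fate-9999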

Spark

Connecting to an external Spark cluster is relatively more complicated and involves three parts:

Add FATE's dependencies to the Spark cluster

It is recommended to use a separate Python environment to install FATE's dependencies and then refer to this environment when submitting the Spark application. For details on how to set up the Python environment and install the dependencies, a user can refer to the blog Execute Federated Learning Tasks With Spark in FATE.
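
As a rough sketch of the idea, assuming the environment lives at /data/projects/python/venv on every Spark worker node (the same path referenced by pysparkPython later in this document); the exact package list comes from the blog above and the requirements file name here is hypothetical:

# on every Spark worker node: create a dedicated virtual environment
python3 -m venv /data/projects/python/venv
source /data/projects/python/venv/bin/activate

# install FATE's python dependencies into it
# (replace requirements.txt with the actual dependency list from the FATE release)
pip install -r requirements.txt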

Prepare and open ports for Spark driver

Once the Spark application is submitted and the executors are created, the executors connect back to the driver for further processing. However, since the driver runs inside Kubernetes while the Spark cluster runs outside of it, some ports must be opened in Kubernetes so that the executors can communicate with the driver.

Every Spark application requires at least two ports: one for the driver and one for the block manager. If a user needs to run multiple Spark applications/FATE jobs at the same time, these two ports should be opened for each job in advance.

For example, assume a user has a FATE cluster called "9999". Whether the cluster is deployed yet or not, the user can use kubectl apply to open the relevant ports for the Spark driver with the following "Service" definition:

apiVersion: v1
kind: Service
metadata:
  name: spark-driver
  namespace: fate-9999
spec:
  ports:
  - name: driver-1
    port: 9000
    protocol: TCP
  - name: driver-2
    port: 9001
    protocol: TCP
  - name: block-manager-1
    port: 8000
    protocol: TCP
  - name: block-manager-2
    port: 8001
    protocol: TCP
  selector:
    fateMoudle: python
    name: fate-9999
    partyId: "9999"
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer: {}

This definition opens two pairs of ports and thus supports two jobs running simultaneously. A user can replace the LoadBalancer service type with NodePort if preferred.
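
For example, assuming the definition above is saved as spark-driver-service.yaml (the file name is illustrative), it can be applied and checked with:

# create the service in the party's namespace
kubectl apply -f spark-driver-service.yaml

# confirm the exposed ports and the load balancer's external address
kubectl get service spark-driver -n fate-9999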

Update the Spark-related config in the python component

Assuming the load balancer's address is "192.168.0.3", a user should specify the Spark driver's configuration in the "python" component of the deployment file as in the following example:

python:
  spark:
    driverHost: 192.168.0.3
    driverStartPort: 9000
    blockManagerStartPort: 8000
    pysparkPython: /data/projects/python/venv/bin/python

Alternatively, a user can update the "spark-defaults.conf" section of the "python-config" configmap accordingly.
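
As a hedged sketch, the corresponding entries in spark-defaults.conf would look roughly like the following, using standard Spark property names (the exact keys rendered by the chart may differ):

# spark-defaults.conf (values matching the example above)
spark.driver.host          192.168.0.3
spark.driver.port          9000
spark.blockManager.port    8000
spark.pyspark.python       /data/projects/python/venv/bin/python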

Verification

A user can use the following steps to verify the configuration, making adjustments to fit the actual environment:

  1. log in to the python pod
  2. cd to /data/projects/fate/examples/toy_example
  3. append the following content to the job_parameters section of "toy_example_conf.json" (a sketch of the resulting layout follows these steps):
"spark_run": {
  "num-executors": 1,
  "executor-cores": 1,
  "executor-memory": "2G",
  "total-executor-cores": 1
}
  4. run the example with the following command:
python run_toy_example.py -b 2 9999 9999 1

If the job returns success, the configuration is working as intended.
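
For reference, a sketch of how the appended block nests inside job_parameters; the other fields are elided since they depend on the FATE version's toy_example_conf.json:

"job_parameters": {
  ...,
  "spark_run": {
    "num-executors": 1,
    "executor-cores": 1,
    "executor-memory": "2G",
    "total-executor-cores": 1
  }
}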