Running FATE with Eggroll On Kunpeng CPU and Ascend 910A NPU - FederatedAI/KubeFATE GitHub Wiki

Server configuration:

Item   Configuration
OS     EulerOS 2.0 (SP8)
CPU    HiSilicon Kunpeng 920 (4 x 48 cores @ 2.6 GHz)
RAM    770 GB
NPU    8 x Ascend 910A (32 GB HBM)
npu-smi info

+-------------------------------------------------------------------------------------------+
| npu-smi 23.0.rc1                 Version: 23.0.rc1                                        |
+----------------------+---------------+----------------------------------------------------+
| NPU   Name           | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                 | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+======================+===============+====================================================+
| 0     910B           | OK            | 72.1        35                0    / 0             |
| 0                    | 0000:C1:00.0  | 0           951  / 15137      0    / 32768         |
+======================+===============+====================================================+
| 1     910B           | OK            | 68.8        36                0    / 0             |
| 0                    | 0000:81:00.0  | 0           1297 / 15137      2    / 32768         |
+======================+===============+====================================================+
| 2     910B           | OK            | 71.6        36                0    / 0             |
| 0                    | 0000:41:00.0  | 0           2694 / 15137      2    / 32768         |
+======================+===============+====================================================+
| 3     910B           | OK            | 69.0        36                0    / 0             |
| 0                    | 0000:01:00.0  | 0           1699 / 15039      2    / 32768         |
+======================+===============+====================================================+
| 4     910B           | OK            | 71.2        35                0    / 0             |
| 0                    | 0000:C2:00.0  | 0           937  / 15137      1    / 32768         |
+======================+===============+====================================================+
| 5     910B           | OK            | 68.2        36                0    / 0             |
| 0                    | 0000:82:00.0  | 0           881  / 15137      0    / 32768         |
+======================+===============+====================================================+
| 6     910B           | OK            | 71.0        36                0    / 0             |
| 0                    | 0000:42:00.0  | 0           1842 / 15137      1    / 32768         |
+======================+===============+====================================================+
| 7     910B           | OK            | 69.1        35                0    / 0             |
| 0                    | 0000:02:00.0  | 0           2980 / 15039      2    / 32768         |
+======================+===============+====================================================+
+----------------------+---------------+----------------------------------------------------+
| NPU     Chip         | Process id    | Process name             | Process memory(MB)      |
+======================+===============+====================================================+
| No running processes found in NPU 0                                                       |
+======================+===============+====================================================+
| No running processes found in NPU 1                                                       |
+======================+===============+====================================================+
| No running processes found in NPU 2                                                       |
+======================+===============+====================================================+
| No running processes found in NPU 3                                                       |
+======================+===============+====================================================+
| No running processes found in NPU 4                                                       |
+======================+===============+====================================================+
| No running processes found in NPU 5                                                       |
+======================+===============+====================================================+
| No running processes found in NPU 6                                                       |
+======================+===============+====================================================+
| No running processes found in NPU 7                                                       |
+======================+===============+====================================================+

1. Create a Base Image

1.1 Start the NPU Container

docker run -u root -it --ipc=host \
--name base-fate-npu \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /data:/home/HwHiAiUser/data \
-d \
ascendhub.huawei.com/public-ascendhub/pytorch-modelzoo:23.0.RC2-1.11.0 \
/bin/bash
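
Optionally verify that the container can see the devices and tools mapped above before installing anything; npu-smi should work inside the container because both the binary and the driver directory are bind-mounted:

docker exec base-fate-npu ls -l /dev/davinci0 /dev/davinci1 /dev/davinci_manager
docker exec base-fate-npu /usr/local/sbin/npu-smi info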

Install dependencies

docker exec -it base-fate-npu bash
mkdir -p /data/projects/fate
mkdir -p /data/projects/eggroll
apt update
apt-get install -y sudo git lzma liblzma-dev libbz2-dev vim mysql-client lsof

1.2 Install Python 3.9

apt install -y build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev
cd /data/projects
wget https://www.python.org/ftp/python/3.9.18/Python-3.9.18.tgz
tar xvzf Python-3.9.18.tgz
cd Python-3.9.18
./configure --enable-optimizations
make -j 8
make install
cd /data/projects
python3.9 -m venv venv
export FATE_VENV_BASE=$PWD/venv
source venv/bin/activate
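
A quick sanity check that the virtual environment is active and uses the freshly built interpreter:

python -V        # expect Python 3.9.18
which python     # expect /data/projects/venv/bin/python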

1.3 Install FATE

cd /data/projects/fate
git clone https://github.com/FederatedAI/FATE.git .
git submodule init
git submodule update
export FATE_PROJECT_BASE=$PWD
sed -i 's/cd \/usr\/lib\/x86_64-linux-gnu/cd \/usr\/lib\/aarch64-linux-gnu/g' bin/install_os_dependencies.sh
bash bin/install_os_dependencies.sh
sed -i 's/tensorflow-cpu==2.11.1/tensorflow/g; s/torch==1.13.1+cpu/torch==1.13.1/g; s/torchvision==0.14.1+cpu/torchvision==0.14.1/g; s/ipcl-python==2.0.0/#ipcl-python==2.0.0/g' python/requirements.txt
apt install -y llvm-10 libxml2-dev libxslt-dev
pip install llvmlite
pip install decorator attrs psutil absl-py cloudpickle scipy synr==0.5.0 tornado
LLVM_CONFIG=/usr/bin/llvm-config-10 pip install -r python/requirements.txt
pip install -r python/requirements-fate-llm.txt 

cd ${FATE_PROJECT_BASE}
sed -i "s#PYTHONPATH=.*#PYTHONPATH=$PWD/python:$PWD/fateflow/python:/data/projects/eggroll/python#g" bin/init_env.sh
sed -i "s#venv=.*#venv=${FATE_VENV_BASE}#g" bin/init_env.sh

1.4 Install Torch-NPU for Ascend 910 (Optional)

Install PyTorch 2.0 and torch_npu:

cd ${FATE_PROJECT_BASE}
source bin/init_env.sh
pip3 install torch==2.0.1
wget https://gitee.com/ascend/pytorch/releases/download/v5.0.rc2-pytorch2.0.1/torch_npu-2.0.1rc1-cp39-cp39-linux_aarch64.whl
pip3 install torch_npu-2.0.1rc1-cp39-cp39-linux_aarch64.whl
pip3 install torchvision==0.15.2

vi bin/init_env.sh

  • add:
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1
    

vi python/federatedml/nn/homo/trainer/fedavg_trainer.py

  • Line 19, insert:
    try:
        import torch_npu
        from torch_npu.contrib import transfer_to_npu
    except ImportError:
        torch_npu = None
    

vi fateflow/python/fate_flow/worker/task_executor.py

  • Line 46, insert:
    try:
        import torch_npu
        from torch_npu.contrib import transfer_to_npu
    except ImportError:
        torch_npu = None
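
After these edits, a minimal sanity check that the NPU build of PyTorch is usable from the FATE environment; this assumes the torch.npu namespace exposed by torch_npu 2.x and the two davinci devices mapped into this container:

cd ${FATE_PROJECT_BASE}
source bin/init_env.sh
python -c "import torch, torch_npu; print(torch.npu.is_available(), torch.npu.device_count())"
# expected (assumption): True 2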
    

1.5 Install Eggroll

cd ${FATE_PROJECT_BASE}
apt install -y openjdk-8-jdk maven
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
sed -i "s#JAVA_HOME=.*#JAVA_HOME=${JAVA_HOME}#g" bin/init_env.sh
sed -i.bak "s#\$PATH:\$JAVA_HOME/bin#\$JAVA_HOME/bin:\$PATH#g" bin/init_env.sh
source bin/init_env.sh

export EGGROLL_HOME=/data/projects/eggroll
cd ${FATE_PROJECT_BASE}/eggroll
bash deploy/auto-packaging.sh
cp eggroll.tar.gz $EGGROLL_HOME
cd $EGGROLL_HOME
tar xvzf eggroll.tar.gz
sed -i.bak "s#EGGROLL_HOME=.*#EGGROLL_HOME=${EGGROLL_HOME}#g" $FATE_PROJECT_BASE/bin/init_env.sh
exit

1.6 Save the Base Image

docker commit base-fate-npu base-fate-npu:latest
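
To confirm the image is available for the standalone and cluster deployments below:

docker images | grep base-fate-npu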

2. Deploy and Test FATE in Standalone Mode

docker run -u root -it --ipc=host \
--name standalone-fate-npu \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /data:/home/HwHiAiUser/data \
-d \
base-fate-npu:latest \
/bin/bash
docker exec -it standalone-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh

vi conf/service_conf.yaml and change the default engines to standalone:

default_engines:
  computing: standalone
  federation: standalone
  storage: standalone

cd ${FATE_PROJECT_BASE}
source bin/init_env.sh
bash fateflow/bin/service.sh start

cd python/fate_client/
python setup.py install

cd ../../
flow init -c conf/service_conf.yaml
pipeline init --ip 127.0.0.1 --port 9380
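
Optionally confirm that FATE Flow is listening on the HTTP port used by pipeline init above (lsof was installed in step 1.1):

lsof -i :9380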

Toy test:

flow test toy -gid 9999 -hid 9999

If successful, the screen displays a statement similar to the following:

success to calculate secure_sum, it is 2000.0

Unit tests:

cd ${FATE_PROJECT_BASE}
bash ./python/federatedml/test/run_test.sh

If successful, the screen displays a statement similar to the following:

there are 0 failed test

3. Deploy and Test FATE in Cluster Mode

3.1 Deploy the 1st Party

3.1.1 Launch MySQL

docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=eggroll_meta -e MYSQL_USER=fate -e MYSQL_PASSWORD=fate_dev -e user=root -d --name 9999-mysql-npu mysql:8.0.34
docker exec -it 9999-mysql-npu mysql
create database fate_flow;
GRANT ALL PRIVILEGES ON fate_flow.* TO 'fate'@'%';
quit
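
The service_conf.yaml and eggroll.properties edits below need this container's IP address. Assuming the MySQL container runs on the default Docker bridge network, it can be looked up, and the grants verified, like this:

docker inspect -f '{{.NetworkSettings.IPAddress}}' 9999-mysql-npu
docker exec 9999-mysql-npu mysql -ufate -pfate_dev -e "SHOW DATABASES;"   # expect eggroll_meta and fate_flow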

3.1.2 Launch FATE + Eggroll

docker run -u root -it --ipc=host \
--name 9999-eggroll-fate-npu \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /data:/home/HwHiAiUser/data \
-d \
base-fate-npu:latest \
/bin/bash
docker exec -it 9999-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh

vim conf/service_conf.yaml and change the following:

database:
  ....
  passwd: fate_dev
  host: <mysql ip of 9999-mysql-npu>
default_engines:
  computing: eggroll
  federation: eggroll
  storage: eggroll

Configure Eggroll

cd ${EGGROLL_HOME}

vim conf/eggroll.properties

  • Change the MySQL IP and the timezone in eggroll.resourcemanager.clustermanager.jdbc.url
    (set the timezone from Asia/Shanghai to UTC).

  • Add the MySQL username (fate) and password (fate_dev).

  • Also change the following (a consolidated sketch of the edited properties follows this list):

    eggroll.resourcemanager.process.tag=9999
    eggroll.resourcemanager.bootstrap.egg_pair.venv=/data/projects/venv
    eggroll.rollsite.party.id=9999
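
Taken together, the edited lines in conf/eggroll.properties should end up roughly like the sketch below. The property names are the standard Eggroll ones; the exact JDBC URL parameters shipped with your Eggroll version may differ, so adjust the existing URL in place rather than replacing it wholesale:

eggroll.resourcemanager.clustermanager.jdbc.url=jdbc:mysql://<mysql ip of 9999-mysql-npu>:3306/eggroll_meta?useSSL=false&serverTimezone=UTC&characterEncoding=utf8&allowPublicKeyRetrieval=true
eggroll.resourcemanager.clustermanager.jdbc.username=fate
eggroll.resourcemanager.clustermanager.jdbc.password=fate_dev
eggroll.resourcemanager.process.tag=9999
eggroll.resourcemanager.bootstrap.egg_pair.venv=/data/projects/venv
eggroll.rollsite.party.id=9999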
    

vim conf/route_table.json

  • add an item:
    "9999":
    {
      "default":[
        {
          "port": 9370,
          "ip": "127.0.0.1"
        }
      ],
      "fateflow":[
        {
          "port": 9360,
          "ip": "127.0.0.1"
        }
      ]
    }
    
  • if needed, add other parties like:
    "10000":
    {
      "default":[
        {
          "port": 9370,
          "ip": "<other party's rollsite ip>"
        }
      ]
    }
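
Putting the two entries above together, the complete conf/route_table.json for party 9999 should look roughly like the sketch below. The surrounding route_table/permission wrapper is the usual Eggroll layout; keep any other default entries your file already contains:

{
  "route_table": {
    "9999": {
      "default": [
        { "ip": "127.0.0.1", "port": 9370 }
      ],
      "fateflow": [
        { "ip": "127.0.0.1", "port": 9360 }
      ]
    },
    "10000": {
      "default": [
        { "ip": "<other party's rollsite ip>", "port": 9370 }
      ]
    }
  },
  "permission": {
    "default_allow": true
  }
}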
    

vim conf/create-eggroll-meta-tables.sql

  • Uncomment the CREATE DATABASE and USE statements, then import the schema:

mysql -u fate -pfate_dev -h <mysql_ip of 9999-mysql-npu> < conf/create-eggroll-meta-tables.sql
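
A quick check that the schema was created in eggroll_meta (run from this container, which has the mysql client installed in step 1.1):

mysql -u fate -pfate_dev -h <mysql_ip of 9999-mysql-npu> -e "USE eggroll_meta; SHOW TABLES;"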

source $FATE_PROJECT_BASE/bin/init_env.sh

bash bin/eggroll.sh all start
bash bin/eggroll.sh all status

Launch fateflow

cd ${FATE_PROJECT_BASE}
source bin/init_env.sh
bash fateflow/bin/service.sh start

cd python/fate_client/
python setup.py install

cd ../../
flow init -c conf/service_conf.yaml
pipeline init --ip 127.0.0.1 --port 9380

Toy test:

flow test toy -gid 9999 -hid 9999

If successful, the screen displays a statement similar to the following:

success to calculate secure_sum, it is 2000.0

Unit tests:

cd ${FATE_PROJECT_BASE}
bash ./python/federatedml/test/run_test.sh

If successful, the screen displays a statement similar to the following:

there are 0 failed test

3.2 Deploy the 2nd Party

3.2.1 Launch MySQL

docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=eggroll_meta -e MYSQL_USER=fate -e MYSQL_PASSWORD=fate_dev -e user=root -d --name 10000-mysql-npu mysql:8.0.34
docker exec -it 10000-mysql-npu mysql
create database fate_flow;
GRANT ALL PRIVILEGES ON fate_flow.* TO 'fate'@'%';
quit
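
As with the first party, look up this MySQL container's bridge-network IP for the configuration steps below and verify the grants:

docker inspect -f '{{.NetworkSettings.IPAddress}}' 10000-mysql-npu
docker exec 10000-mysql-npu mysql -ufate -pfate_dev -e "SHOW DATABASES;"   # expect eggroll_meta and fate_flow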

3.2.2 Launch FATE + Eggroll

docker run -u root -it --ipc=host \
--name 10000-eggroll-fate-npu \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /data:/home/HwHiAiUser/data \
-d \
base-fate-npu:latest \
/bin/bash
docker exec -it 10000-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh

vim conf/service_conf.yaml and change the following:

database:
  ....
  passwd: fate_dev
  host: <mysql ip of 10000-mysql-npu>
default_engines:
  computing: eggroll
  federation: eggroll
  storage: eggroll

Configure Eggroll

cd ${EGGROLL_HOME}

vim conf/eggroll.properties

  • Change the MySQL IP and the timezone in eggroll.resourcemanager.clustermanager.jdbc.url
    (set the timezone from Asia/Shanghai to UTC).

  • Add the MySQL username (fate) and password (fate_dev).

  • Also change the following (a consolidated sketch of the edited properties follows this list):

    eggroll.resourcemanager.process.tag=10000
    eggroll.resourcemanager.bootstrap.egg_pair.venv=/data/projects/venv
    eggroll.rollsite.party.id=10000
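
As for party 9999, the edited properties should end up roughly like the sketch below (the same caveat about the JDBC URL parameters applies); the route_table.json edit that follows mirrors the party-9999 example, with the local entry keyed by 10000:

eggroll.resourcemanager.clustermanager.jdbc.url=jdbc:mysql://<mysql ip of 10000-mysql-npu>:3306/eggroll_meta?useSSL=false&serverTimezone=UTC&characterEncoding=utf8&allowPublicKeyRetrieval=true
eggroll.resourcemanager.clustermanager.jdbc.username=fate
eggroll.resourcemanager.clustermanager.jdbc.password=fate_dev
eggroll.resourcemanager.process.tag=10000
eggroll.resourcemanager.bootstrap.egg_pair.venv=/data/projects/venv
eggroll.rollsite.party.id=10000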
    

vim conf/route_table.json

  • add an item:
    "10000":
    {
      "default":[
        {
          "port": 9370,
          "ip": "127.0.0.1"
        }
      ],
      "fateflow":[
        {
          "port": 9360,
          "ip": "127.0.0.1"
        }
      ]
    }
    
  • if needed, add other parties like:
    "9999":
    {
      "default":[
        {
          "port": 9370,
          "ip": "<other party's rollsite ip>"
        }
      ]
    }
    

vim conf/create-eggroll-meta-tables.sql

  • Uncomment the CREATE DATABASE and USE statements, then import the schema:

mysql -u fate -pfate_dev -h <mysql_ip of 10000-mysql-npu> < conf/create-eggroll-meta-tables.sql

source $FATE_PROJECT_BASE/bin/init_env.sh

bash bin/eggroll.sh all start
bash bin/eggroll.sh all status

Launch fateflow

cd ${FATE_PROJECT_BASE}
source bin/init_env.sh
bash fateflow/bin/service.sh start

cd python/fate_client/
python setup.py install

cd ../../
flow init -c conf/service_conf.yaml
pipeline init --ip 127.0.0.1 --port 9380

Toy test:

flow test toy -gid 10000 -hid 10000

If successful, the screen displays a statement similar to the following:

success to calculate secure_sum, it is 2000.0

Unit tests:

cd ${FATE_PROJECT_BASE}
bash ./python/federatedml/test/run_test.sh

If successful, the screen displays a statement similar to the following:

there are 0 failed test

3.3 Two-Party Test

3.3.1 Configure the 1st Party

docker exec -it 9999-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
cd ${EGGROLL_HOME}

vim conf/route_table.json and add the other party:

    "10000":
    {   
      "default":[
        {   
          "port": 9370,
          "ip": "<other party's rollsite ip>"
        }
      ]
    }

bash bin/eggroll.sh all start
cd ${FATE_PROJECT_BASE}
bash fateflow/bin/service.sh start
flow init -c conf/service_conf.yaml
pipeline init --ip 127.0.0.1 --port 9380

3.3.2 Configure the 2nd Party

docker exec -it 10000-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
cd ${EGGROLL_HOME}

vim conf/route_table.json and add the other party:

    "9999":
    {   
      "default":[
        {   
          "port": 9370,
          "ip": "<other party's rollsite ip>"
        }
      ]
    }

bash bin/eggroll.sh all start
cd ${FATE_PROJECT_BASE}
bash fateflow/bin/service.sh start
flow init -c conf/service_conf.yaml
pipeline init --ip 127.0.0.1 --port 9380

3.3.3 Two-Party Test

On 1st party:

docker exec -it 9999-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
flow test toy -gid 9999 -hid 10000

If successful, the screen displays a statement similar to the following:

success to calculate secure_sum, it is 2000.0

On 2nd party:

docker exec -it 10000-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
flow test toy -gid 10000 -hid 9999

If successful, the screen displays a statement similar to the following:

success to calculate secure_sum, it is 2000.0