Running FATE with Eggroll On Kunpeng CPU and Ascend 910A NPU
- 1. Create a base image
- 2. Deploy and Test FATE in Standalone Mode
- 3. Deploy and Test FATE in Cluster Mode
Server configuration:
Item | Configuration |
---|---|
OS | EulerOS 2.0 (SP8) |
CPU | Hisilicon Kunpeng 920(4*48Core @ 2.6GHz) |
RAM | 770GB |
NPU | 8 * Ascend-910A (32G HBM) |
npu-smi info
+-------------------------------------------------------------------------------------------+
| npu-smi 23.0.rc1 Version: 23.0.rc1 |
+----------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+======================+===============+====================================================+
| 0 910B | OK | 72.1 35 0 / 0 |
| 0 | 0000:C1:00.0 | 0 951 / 15137 0 / 32768 |
+======================+===============+====================================================+
| 1 910B | OK | 68.8 36 0 / 0 |
| 0 | 0000:81:00.0 | 0 1297 / 15137 2 / 32768 |
+======================+===============+====================================================+
| 2 910B | OK | 71.6 36 0 / 0 |
| 0 | 0000:41:00.0 | 0 2694 / 15137 2 / 32768 |
+======================+===============+====================================================+
| 3 910B | OK | 69.0 36 0 / 0 |
| 0 | 0000:01:00.0 | 0 1699 / 15039 2 / 32768 |
+======================+===============+====================================================+
| 4 910B | OK | 71.2 35 0 / 0 |
| 0 | 0000:C2:00.0 | 0 937 / 15137 1 / 32768 |
+======================+===============+====================================================+
| 5 910B | OK | 68.2 36 0 / 0 |
| 0 | 0000:82:00.0 | 0 881 / 15137 0 / 32768 |
+======================+===============+====================================================+
| 6 910B | OK | 71.0 36 0 / 0 |
| 0 | 0000:42:00.0 | 0 1842 / 15137 1 / 32768 |
+======================+===============+====================================================+
| 7 910B | OK | 69.1 35 0 / 0 |
| 0 | 0000:02:00.0 | 0 2980 / 15039 2 / 32768 |
+======================+===============+====================================================+
+----------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+======================+===============+====================================================+
| No running processes found in NPU 0 |
+======================+===============+====================================================+
| No running processes found in NPU 1 |
+======================+===============+====================================================+
| No running processes found in NPU 2 |
+======================+===============+====================================================+
| No running processes found in NPU 3 |
+======================+===============+====================================================+
| No running processes found in NPU 4 |
+======================+===============+====================================================+
| No running processes found in NPU 5 |
+======================+===============+====================================================+
| No running processes found in NPU 6 |
+======================+===============+====================================================+
| No running processes found in NPU 7 |
+======================+===============+====================================================+
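Before starting any containers, it can help to confirm that every NPU reports a healthy state. The following is a minimal sketch, assuming the `npu-smi info` column layout shown in the listing above (Health in the second column); the helper name `count_ok_npus` is hypothetical, and other driver versions may format the table differently.

```shell
# Count NPUs whose Health column reads "OK" in `npu-smi info` output (read from stdin).
# Assumes the pipe-separated layout shown in the listing above.
count_ok_npus() {
    awk -F'|' '$3 ~ /^ *OK *$/ { n++ } END { print n + 0 }'
}

# Example with three rows copied from the listing; on the server, use:
#   npu-smi info | count_ok_npus
printf '%s\n' \
    '| 0 910B | OK | 72.1 35 0 / 0 |' \
    '| 0 | 0000:C1:00.0 | 0 951 / 15137 0 / 32768 |' \
    '| 1 910B | OK | 68.8 36 0 / 0 |' | count_ok_npus
```

On the server described here, the count would be expected to match the 8 NPUs in the configuration table.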
1. Create a base image
1.1 Start the NPU Container
docker run -u root -it --ipc=host \
--name base-fate-npu \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /data:/home/HwHiAiUser/data \
-d \
ascendhub.huawei.com/public-ascendhub/pytorch-modelzoo:23.0.RC2-1.11.0 \
/bin/bash
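Each container in this guide is given a different pair of NPUs (davinci0/1 here, then 2/3, 4/5 and 6/7) together with the same three management devices. As a sketch of how the `--device` flags could be generated instead of typed by hand, the hypothetical helper `npu_device_flags` below prints them for any list of NPU indices.

```shell
# Print the --device flags for the given NPU indices, followed by the
# management devices that every container in this guide also receives.
npu_device_flags() {
    for idx in "$@"; do
        printf '%s ' "--device=/dev/davinci$idx"
    done
    printf '%s\n' "--device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc"
}

# Flags for the base container (NPU 0 and 1).
npu_device_flags 0 1
```

In a `docker run` command the call would be left unquoted, for example `$(npu_device_flags 2 3)`, so that the shell splits the output into separate arguments.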
Install dependencies
docker exec -it base-fate-npu bash
mkdir -p /data/projects/fate
mkdir -p /data/projects/eggroll
apt update
apt-get install -y sudo git lzma liblzma-dev libbz2-dev vim mysql-client lsof
1.2 Install Python3.9
apt install -y build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev
cd /data/projects
wget https://www.python.org/ftp/python/3.9.18/Python-3.9.18.tgz
tar xvzf Python-3.9.18.tgz
cd Python-3.9.18
./configure --enable-optimizations
make -j 8
make install
cd /data/projects
python3.9 -m venv venv
export FATE_VENV_BASE=$PWD/venv
source venv/bin/activate
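Since the base image may ship its own Python, a quick check that the activated venv really uses the freshly built 3.9 interpreter can save debugging later. This is a small sketch; the function name `is_python39` is an assumption for illustration only.

```shell
# Succeed only if the given `python --version` string reports a 3.9.x release.
is_python39() {
    case "$1" in
        "Python 3.9."*) return 0 ;;
        *) return 1 ;;
    esac
}

# Inside the activated venv this would be:
#   is_python39 "$(python --version 2>&1)"
if is_python39 "Python 3.9.18"; then
    echo "venv interpreter is 3.9.x"
fi
```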
1.3 Install FATE
cd /data/projects/fate
git clone https://github.com/FederatedAI/FATE.git .
git submodule init
git submodule update
export FATE_PROJECT_BASE=$PWD
sed -i 's/cd \/usr\/lib\/x86_64-linux-gnu/cd \/usr\/lib\/aarch64-linux-gnu/g' bin/install_os_dependencies.sh
bash bin/install_os_dependencies.sh
sed -i 's/tensorflow-cpu==2.11.1/tensorflow/g; s/torch==1.13.1+cpu/torch==1.13.1/g; s/torchvision==0.14.1+cpu/torchvision==0.14.1/g; s/ipcl-python==2.0.0/#ipcl-python==2.0.0/g' python/requirements.txt
apt install -y llvm-10 libxml2-dev libxslt-dev
pip install llvmlite
pip install decorator attrs psutil absl-py cloudpickle scipy synr==0.5.0 tornado
LLVM_CONFIG=/usr/bin/llvm-config-10 pip install -r python/requirements.txt
pip install -r python/requirements-fate-llm.txt
cd ${FATE_PROJECT_BASE}
sed -i "s#PYTHONPATH=.*#PYTHONPATH=$PWD/python:$PWD/fateflow/python:/data/projects/eggroll/python#g" bin/init_env.sh
sed -i "s#venv=.*#venv=${FATE_VENV_BASE}#g" bin/init_env.sh
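The `sed` call on `python/requirements.txt` above replaces the x86 CPU-only pins with generic ones and comments out `ipcl-python`. To preview its effect without touching the file, the same expressions can be run on standard input. The sketch below does that on four sample lines; the wrapper name `patch_requirements` is hypothetical.

```shell
# Same substitutions as in the guide, applied to stdin instead of editing in place.
patch_requirements() {
    sed 's/tensorflow-cpu==2.11.1/tensorflow/g; s/torch==1.13.1+cpu/torch==1.13.1/g; s/torchvision==0.14.1+cpu/torchvision==0.14.1/g; s/ipcl-python==2.0.0/#ipcl-python==2.0.0/g'
}

# Sample input containing only the four pins that the guide rewrites.
printf '%s\n' \
    'tensorflow-cpu==2.11.1' \
    'torch==1.13.1+cpu' \
    'torchvision==0.14.1+cpu' \
    'ipcl-python==2.0.0' | patch_requirements
```

Running `patch_requirements < python/requirements.txt | diff python/requirements.txt -` before the in-place edit would show exactly which lines change.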
1.4 Install Torch-NPU for Ascend 910 (Optional)
Install pytorch2.0 and torch_npu:
cd ${FATE_PROJECT_BASE}
source bin/init_env.sh
pip3 install torch==2.0.1
wget https://gitee.com/ascend/pytorch/releases/download/v5.0.rc2-pytorch2.0.1/torch_npu-2.0.1rc1-cp39-cp39-linux_aarch64.whl
pip3 install torch_npu-2.0.1rc1-cp39-cp39-linux_aarch64.whl
pip3 install torchvision==0.15.2
vi bin/init_env.sh
- add:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1
vi python/federatedml/nn/homo/trainer/fedavg_trainer.py
- Line 19, insert:
try:
    import torch_npu
    from torch_npu.contrib import transfer_to_npu
except ImportError:
    torch_npu = None
vi fateflow/python/fate_flow/worker/task_executor.py
- Line 46, insert:
try:
    import torch_npu
    from torch_npu.contrib import transfer_to_npu
except ImportError:
    torch_npu = None
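Because the two Python files are edited by hand, it is easy to miss one. A minimal sketch of a check that a file contains the optional `torch_npu` import follows; the helper name `has_npu_import` is hypothetical, and the sample file stands in for the real sources.

```shell
# Succeed if the file contains an `import torch_npu` line (any indentation).
has_npu_import() {
    grep -q '^[[:space:]]*import torch_npu' "$1"
}

# Demonstration on a throwaway file holding the snippet from this step.
sample=$(mktemp)
printf '%s\n' 'try:' '    import torch_npu' 'except ImportError:' '    torch_npu = None' > "$sample"
has_npu_import "$sample" && echo "torch_npu import present"
rm -f "$sample"
```

Inside the container the same function could be pointed at `python/federatedml/nn/homo/trainer/fedavg_trainer.py` and `fateflow/python/fate_flow/worker/task_executor.py`.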
1.5 Install Eggroll
cd ${FATE_PROJECT_BASE}
apt install -y openjdk-8-jdk maven
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
sed -i "s#JAVA_HOME=.*#JAVA_HOME=${JAVA_HOME}#g" bin/init_env.sh
sed -i.bak "s#\$PATH:\$JAVA_HOME/bin#\$JAVA_HOME/bin:\$PATH#g" bin/init_env.sh
source bin/init_env.sh
export EGGROLL_HOME=/data/projects/eggroll
cd ${FATE_PROJECT_BASE}/eggroll
bash deploy/auto-packaging.sh
cp eggroll.tar.gz $EGGROLL_HOME
cd $EGGROLL_HOME
tar xvzf eggroll.tar.gz
sed -i.bak "s#EGGROLL_HOME=.*#EGGROLL_HOME=${EGGROLL_HOME}#g" $FATE_PROJECT_BASE/bin/init_env.sh
exit
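Sections 1.3 and 1.5 rewrite four assignments in `bin/init_env.sh` with `sed` (`PYTHONPATH`, `venv`, `JAVA_HOME`, `EGGROLL_HOME`). If one of these lines is absent from the script, the corresponding `sed` silently does nothing. The sketch below, meant to be run inside the container before leaving it, lists any of the four names that have no assignment; the helper name `missing_env_vars` is hypothetical, and it assumes assignments appear at the start of a line, optionally prefixed by `export`.

```shell
# Print the names of expected variables that have no assignment in the given script.
missing_env_vars() {
    for name in PYTHONPATH venv JAVA_HOME EGGROLL_HOME; do
        grep -Eq "^[[:space:]]*(export[[:space:]]+)?${name}=" "$1" || echo "$name"
    done
}

# Demonstration on a throwaway file; in the container, pass bin/init_env.sh instead.
sample=$(mktemp)
printf '%s\n' 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64' 'venv=/data/projects/venv' > "$sample"
missing_env_vars "$sample"
rm -f "$sample"
```

An empty result means all four assignments are present; it does not verify that their values are correct.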
1.6 Save base image
docker commit base-fate-npu base-fate-npu:latest
2. Deploy and Test FATE in Standalone Mode
docker run -u root -it --ipc=host \
--name standalone-fate-npu \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /data:/home/HwHiAiUser/data \
-d \
base-fate-npu:latest \
/bin/bash
docker exec -it standalone-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
vi conf/service_conf.yaml
- set:
default_engines:
  computing: standalone
  federation: standalone
  storage: standalone
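After editing, the engine settings can be read back without opening the file again. This is a sketch only: it assumes `default_engines` is a top-level key whose children are indented beneath it, and the helper name `show_default_engines` is hypothetical.

```shell
# Print the "key: value" lines found under the top-level default_engines: key.
show_default_engines() {
    awk '
        /^default_engines:/ { inside = 1; next }   # start of the block
        inside && /^[^[:space:]]/ { inside = 0 }   # next top-level key ends it
        inside && NF { print $1, $2 }
    ' "$1"
}

# Demonstration on a throwaway file; in the container, pass conf/service_conf.yaml.
sample=$(mktemp)
printf '%s\n' 'default_engines:' '  computing: standalone' '  federation: standalone' '  storage: standalone' 'fateflow:' '  host: 127.0.0.1' > "$sample"
show_default_engines "$sample"
rm -f "$sample"
```

The same check applies in cluster mode, where all three values should read `eggroll`.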
cd ${FATE_PROJECT_BASE}
source bin/init_env.sh
bash fateflow/bin/service.sh start
cd python/fate_client/
python setup.py install
cd ../../
flow init -c conf/service_conf.yaml
pipeline init --ip 127.0.0.1 --port 9380
Toy test:
flow test toy -gid 9999 -hid 9999
If successful, the screen displays a statement similar to the following:
success to calculate secure_sum, it is 2000.0
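When the toy test is run from a script rather than interactively, the success message can be checked automatically. The sketch below looks for the message quoted above on standard input; the function name `toy_test_passed` is hypothetical, and the exact wording is taken from this guide rather than from the FATE source.

```shell
# Succeed if the toy-test output on stdin contains the success message.
toy_test_passed() {
    grep -q 'success to calculate secure_sum'
}

# In the container this would be:
#   flow test toy -gid 9999 -hid 9999 2>&1 | toy_test_passed
echo 'success to calculate secure_sum, it is 2000.0' | toy_test_passed && echo "toy test passed"
```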
Unit tests:
cd ${FATE_PROJECT_BASE}
bash ./python/federatedml/test/run_test.sh
If successful, the screen displays a statement similar to the following:
there are 0 failed test
3. Deploy and Test FATE in Cluster Mode
3.1 Deploy the 1st Party
3.1.1 Launch MySQL
docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=eggroll_meta -e MYSQL_USER=fate -e MYSQL_PASSWORD=fate_dev -e user=root -d --name 9999-mysql-npu mysql:8.0.34
docker exec -it 9999-mysql-npu mysql
create database fate_flow;
GRANT ALL PRIVILEGES ON fate_flow.* TO 'fate'@'%';
quit
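Later steps ask for the IP address of this MySQL container. One way to look it up, assuming the container is attached to a Docker network (the default bridge in the commands above), is `docker inspect` with a Go template. The wrapper name `container_ip` is hypothetical.

```shell
# Print the IP address Docker assigned to a container on its network(s).
# If the container is attached to several networks, the addresses are
# printed back to back, so this sketch suits single-network containers.
container_ip() {
    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$1"
}

# Usage on the host:
#   container_ip 9999-mysql-npu
```

The printed address is the value to substitute for `<mysql ip of 9999-mysql-npu>` in the configuration files.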
3.1.2 Launch FATE + Eggroll
docker run -u root -it --ipc=host \
--name 9999-eggroll-fate-npu \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /data:/home/HwHiAiUser/data \
-d \
base-fate-npu:latest \
/bin/bash
docker exec -it 9999-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
vim conf/service_conf.yaml and change the following:
database:
  ....
  passwd: fate_dev
  host: <mysql ip of 9999-mysql-npu>
default_engines:
  computing: eggroll
  federation: eggroll
  storage: eggroll
Configure Eggroll
cd ${EGGROLL_HOME}
vim conf/eggroll.properties
- change the MySQL IP and timezone in eggroll.resourcemanager.clustermanager.jdbc.url
  - set the timezone from Asia/Shanghai to UTC
- add the MySQL username (fate) and password (fate_dev)
- also change:
eggroll.resourcemanager.process.tag=9999
eggroll.resourcemanager.bootstrap.egg_pair.venv=/data/projects/venv
eggroll.rollsite.party.id=9999
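The property changes listed above can also be scripted, which keeps the two parties consistent. The following sketch replaces the value of a key that already exists in a `key=value` file; the helper name `set_prop` is hypothetical, it does not append missing keys, and it assumes values contain neither `|` nor `&`.

```shell
# Replace the value of an existing key in a Java-style properties file.
set_prop() {
    file=$1 key=$2 value=$3
    # Escape dots so the key is matched literally rather than as a regex.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    tmp=$(mktemp)
    sed "s|^${key_re}=.*|${key}=${value}|" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demonstration on a throwaway file; in the container, pass conf/eggroll.properties.
props=$(mktemp)
printf '%s\n' 'eggroll.resourcemanager.process.tag=' 'eggroll.rollsite.party.id=10001' > "$props"
set_prop "$props" eggroll.resourcemanager.process.tag 9999
set_prop "$props" eggroll.rollsite.party.id 9999
cat "$props"
rm -f "$props"
```

The JDBC URL is left out of the example on purpose, since it typically contains `&` and is safer to edit by hand.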
vim conf/route_table.json
- add an item:
"9999": { "default":[ { "port": 9370, "ip": "127.0.0.1" } ], "fateflow":[ { "port": 9360, "ip": "127.0.0.1" } ] }
- if needed, add other parties like:
"10000": { "default":[ { "port": 9370, "ip": "<other party's rollsite ip>" } ] }
vim conf/create-eggroll-meta-tables.sql
- uncomment the statements that create and use the database
mysql -u fate -pfate_dev -h <mysql_ip of 9999-mysql-npu> < conf/create-eggroll-meta-tables.sql
source $FATE_PROJECT_BASE/bin/init_env.sh
bash bin/eggroll.sh all start
bash bin/eggroll.sh all status
Launch fateflow
cd ${FATE_PROJECT_BASE}
source bin/init_env.sh
bash fateflow/bin/service.sh start
cd python/fate_client/
python setup.py install
cd ../../
flow init -c conf/service_conf.yaml
pipeline init --ip 127.0.0.1 --port 9380
Toy test:
flow test toy -gid 9999 -hid 9999
If successful, the screen displays a statement similar to the following:
success to calculate secure_sum, it is 2000.0
Unit tests:
cd ${FATE_PROJECT_BASE}
bash ./python/federatedml/test/run_test.sh
If successful, the screen displays a statement similar to the following:
there are 0 failed test
3.2 Deploy the 2nd Party
3.2.1 Launch MySQL
docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=eggroll_meta -e MYSQL_USER=fate -e MYSQL_PASSWORD=fate_dev -e user=root -d --name 10000-mysql-npu mysql:8.0.34
docker exec -it 10000-mysql-npu mysql
create database fate_flow;
GRANT ALL PRIVILEGES ON fate_flow.* TO 'fate'@'%';
quit
3.2.2 Launch FATE + Eggroll
docker run -u root -it --ipc=host \
--name 10000-eggroll-fate-npu \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /data:/home/HwHiAiUser/data \
-d \
base-fate-npu:latest \
/bin/bash
docker exec -it 10000-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
vim conf/service_conf.yaml and change the following:
database:
  ....
  passwd: fate_dev
  host: <mysql ip of 10000-mysql-npu>
default_engines:
  computing: eggroll
  federation: eggroll
  storage: eggroll
Configure Eggroll
cd ${EGGROLL_HOME}
vim conf/eggroll.properties
- change the MySQL IP and timezone in eggroll.resourcemanager.clustermanager.jdbc.url
  - set the timezone from Asia/Shanghai to UTC
- add the MySQL username (fate) and password (fate_dev)
- also change:
eggroll.resourcemanager.process.tag=10000
eggroll.resourcemanager.bootstrap.egg_pair.venv=/data/projects/venv
eggroll.rollsite.party.id=10000
vim conf/route_table.json
- add an item:
"10000": { "default":[ { "port": 9370, "ip": "127.0.0.1" } ], "fateflow":[ { "port": 9360, "ip": "127.0.0.1" } ] }
- if needed, add other parties like:
"9999": { "default":[ { "port": 9370, "ip": "<other party's rollsite ip>" } ] }
vim conf/create-eggroll-meta-tables.sql
- uncomment the statements that create and use the database
mysql -u fate -pfate_dev -h <mysql_ip of 10000-mysql-npu> < conf/create-eggroll-meta-tables.sql
source $FATE_PROJECT_BASE/bin/init_env.sh
bash bin/eggroll.sh all start
bash bin/eggroll.sh all status
Launch fateflow
cd ${FATE_PROJECT_BASE}
source bin/init_env.sh
bash fateflow/bin/service.sh start
cd python/fate_client/
python setup.py install
cd ../../
flow init -c conf/service_conf.yaml
pipeline init --ip 127.0.0.1 --port 9380
Toy test:
flow test toy -gid 10000 -hid 10000
If successful, the screen displays a statement similar to the following:
success to calculate secure_sum, it is 2000.0
Unit tests:
cd ${FATE_PROJECT_BASE}
bash ./python/federatedml/test/run_test.sh
If successful, the screen displays a statement similar to the following:
there are 0 failed test
3.3 Two-Party Test
3.3.1 Configure the 1st Party
docker exec -it 9999-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
cd ${EGGROLL_HOME}
vim conf/route_table.json, add other parties like:
"10000":
{
"default":[
{
"port": 9370,
"ip": "<other party's rollsite ip>"
}
]
}
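The remote-party entry above has the same shape for every party, differing only in the party ID and the rollsite address. As a sketch, it can be generated rather than typed; the helper name `route_entry` is hypothetical, and `192.0.2.10` is a documentation-only placeholder address, not a value from this deployment.

```shell
# Print a route_table.json entry for a remote party's rollsite.
# Arguments: party id, rollsite ip, optional port (defaults to 9370 as in this guide).
route_entry() {
    party=$1 ip=$2 port=${3:-9370}
    printf '"%s": { "default": [ { "port": %s, "ip": "%s" } ] }\n' "$party" "$port" "$ip"
}

# Entry that the 1st party would add for party 10000 (placeholder address).
route_entry 10000 192.0.2.10
```

The printed fragment still has to be pasted into the existing route table by hand, with a separating comma where needed.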
bash bin/eggroll.sh all start
cd ${FATE_PROJECT_BASE}
bash fateflow/bin/service.sh start
flow init -c conf/service_conf.yaml
pipeline init --ip 127.0.0.1 --port 9380
3.3.2 Configure the 2nd Party
docker exec -it 10000-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
cd ${EGGROLL_HOME}
vim conf/route_table.json, add other parties like:
"9999":
{
"default":[
{
"port": 9370,
"ip": "<other party's rollsite ip>"
}
]
}
bash bin/eggroll.sh all start
cd ${FATE_PROJECT_BASE}
bash fateflow/bin/service.sh start
flow init -c conf/service_conf.yaml
pipeline init --ip 127.0.0.1 --port 9380
3.3.3 Two-party test
On 1st party:
docker exec -it 9999-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
flow test toy -gid 9999 -hid 10000
If successful, the screen displays a statement similar to the following:
success to calculate secure_sum, it is 2000.0
On 2nd party:
docker exec -it 10000-eggroll-fate-npu bash
cd /data/projects/fate
source bin/init_env.sh
flow test toy -gid 10000 -hid 9999
If successful, the screen displays a statement similar to the following:
success to calculate secure_sum, it is 2000.0