
FAQs

Docker Compose

1. When launching a cluster using docker-compose, the python container restarts several times during initialization.

TBD
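One way to start diagnosing this, assuming the confs-xxxx_<service>_1 container naming convention used elsewhere in this FAQ (the project name confs-xxxx is a placeholder), is to check the python container's restart status, exit code, and logs:

$ docker ps -a | grep python
$ docker inspect -f '{{.State.ExitCode}}' confs-xxxx_python_1
$ docker logs confs-xxxx_python_1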

2. How to clean up the data of a FATE cluster created with docker-compose?

Log in to the machine where the cluster is located and run: TBD
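A minimal cleanup sketch, assuming the cluster was started from a generated confs-xxxx directory containing the docker-compose.yml (the directory name is a placeholder; adjust it to your deployment, and note that --volumes removes the cluster's data volumes):

$ cd confs-xxxx
$ docker-compose down --volumes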

3. How to resolve the issue: "can not upload file over 100M with spark backend"

We plan to fix this issue in KubeFATE v1.9.0. Until then, if you are using docker-compose to deploy a FATE cluster, please check the workaround here.

Usage

1. Running the toy example fails.

If run_toy_example fails, you may see logs like this:

    2019-11-14 07:27:48,165 - task_executor.py[line:127] - ERROR: <_Rendezvous of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "172.18.0.8:8011: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException

  • First, check whether all containers are running normally, especially the egg, roll, and meta-service containers.

  • The machine's resources, such as memory, may be insufficient; check the kernel log to verify whether the OOM killer was invoked.

  • The storage-service of the egg service requires CPU instruction set extensions such as AVX2. Please make sure your CPU supports these instructions; otherwise the storage-service will fail to start with the following error: Illegal instruction (core dumped)

    Use the following commands to check the storage-service log:

    $ docker exec -it confs-xxxx_egg_1 bash
    $ cat storage-service-cxx/logs/error.log
    
  • If the job can start but cannot finish, decrease the processor count of the egg service.

    For a Docker-Compose deployment:

    1. Check the number of processors with
    $ cat /proc/cpuinfo | grep processor | wc -l

    2. Log in to the egg container and update the config
    $ docker exec -it confs-xxx_egg_1 bash
    $ vi egg/conf/egg.properties

    Set `eggroll.computing.processor.session.max.count` to the output of step 1.

    3. Restart the egg container
    $ docker restart confs-xxx_egg_1

    For a k8s deployment, change the egg setting (e.g. in namespace fate-9999) with:

    kubectl edit configmap egg-config -n fate-9999

    By default, eggroll.computing.processor.session.max.count is set to 16. Change it to match the number of your CPU processors and save the change; the egg pod will then restart by itself to pick up the new setting.
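    For reference, the edited ConfigMap might look roughly like the excerpt below. This is a sketch only: the exact data layout of egg-config depends on the chart version, so verify the key names against what kubectl edit actually shows you.

    # Hypothetical excerpt of the egg-config ConfigMap in namespace fate-9999
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: egg-config
      namespace: fate-9999
    data:
      egg.properties: |
        # ... other eggroll properties ...
        # set to the number of your CPU processors
        eggroll.computing.processor.session.max.count=8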

Kubernetes

Q: How does KubeFATE support the Kubernetes PodSecurityPolicy (PSP)?

A: You can enable PSP support by adding the following configuration to cluster.yaml:

podSecurityPolicy:
  enabled: true

When deploying, KubeFATE will create the PodSecurityPolicy and the corresponding Role, RoleBinding, and ServiceAccount resources.
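Note that the PodSecurityPolicy API was removed in Kubernetes v1.25, so this only applies to clusters running older versions. After deployment you can verify that the resources were created; the namespace below follows the fate-9999 example used elsewhere in this FAQ:

$ kubectl get psp
$ kubectl get role,rolebinding,serviceaccount -n fate-9999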

Q: How to work around Docker Hub's limit on the number of image pulls?

A: You can configure your own imagePullSecrets:

image:
  imagePullSecrets:
  - name: myregistrykey

Then create your own imagePullSecrets in the corresponding namespace; refer to Use image pull secrets.
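If you have not created the secret yet, it can be created with kubectl as shown below; myregistrykey matches the name referenced above, while the namespace and credential values are placeholders to replace with your own:

$ kubectl create secret docker-registry myregistrykey \
    --docker-server=https://index.docker.io/v1/ \
    --docker-username=<your-username> \
    --docker-password=<your-password> \
    --docker-email=<your-email> \
    -n fate-9999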

Q: How to change the resource requirements for computing engines?

By "computing engines" we mean eggroll's nodemanager and Spark's spark-worker.

These two components may require more resources from the K8s cluster than the others. In the Helm chart, we set a default request for them: at least 2 CPU cores and 4 GB of memory. For more information about resource requests, please check the official Kubernetes documentation. In our cluster.yaml file, you can set a customized resource configuration for each component of FATE, including nodemanager and spark-worker.

Taking this cluster.yaml file as an example, we can add the lines below to enlarge the resource request for the spark worker:

spark:
  worker:
    replicas: 2
    resources:
      requests:
        cpu: "4"
        memory: "8Gi"
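A similar block can be added for eggroll's nodemanager. The key names below are an assumption modeled on the spark worker example above and may differ between chart versions, so check them against your cluster.yaml template:

nodemanager:
  resources:
    requests:
      cpu: "4"
      memory: "8Gi"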