Installing KubeFlow Training with CodeFlare SDK

Kubeflow Training Operator: https://github.com/kubeflow/training-operator

The goal is to get one of these training examples running: https://www.kubeflow.org/docs/components/training/

0. Prerequisites

0.1 An OpenShift cluster

0.2 Logged in to the OpenShift console (web UI)

0.3 Logged in from a terminal with oc login

0.4 An opendatahub namespace created:

oc new-project opendatahub

1. Install ODH with Fast Channel

Using your terminal where you're logged in with oc login, issue this command:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/opendatahub-operator.openshift-operators: ""
  name: opendatahub-operator
  namespace: openshift-operators
spec:
  channel: fast
  installPlanApproval: Automatic
  name: opendatahub-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: opendatahub-operator.v2.4.0
EOF

You can check that the operator pod started with:

oc get pods -n openshift-operators
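
You can also confirm that the operator's ClusterServiceVersion reached the Succeeded phase:

oc get csv -n openshift-operators | grep opendatahub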

2. Deploy the DataScienceCluster with:

cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/part-of: opendatahub-operator
  name: example-dsc
  namespace: opendatahub
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Managed
    workbenches:
      managementState: Managed
EOF

You can check that the pods all started with:

oc get pods -n opendatahub

and it should look like this:

NAME                                               READY   STATUS    RESTARTS   AGE
kuberay-operator-5d9567bdf4-7rt2n                  1/1     Running   0          59s
notebook-controller-deployment-6468bbf669-rlt64    1/1     Running   0          71s
odh-dashboard-649fdc86bb-2jdv2                     2/2     Running   0          73s
odh-dashboard-649fdc86bb-pgkzw                     2/2     Running   0          73s
odh-notebook-controller-manager-86d9b47b54-s9g45   1/1     Running   0          72s

3. Install a stable release of Kubeflow Training:

oc apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

Note: alternatively, I'm trying the master branch here:

oc apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"

You can check that it started with:

oc get pods -n kubeflow
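
It should look something like this (the pod hash and age will differ):

NAME                        READY   STATUS    RESTARTS   AGE
training-operator-<hash>    1/1     Running   0          1m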

Note: if you're having image-pull issues with docker.io, you can point your deployment at a copy on quay.io instead with this:

oc set image deployment training-operator training-operator=quay.io/jbusche/training-operator:v1-855e096 -n kubeflow

Note: the initContainer pulls from docker.io/alpine:3.10 automatically, which causes trouble on some clusters that are rate-limited by docker.io. To get around this, you can run the following command to patch the training-operator to use a different repo for the initContainer:

oc patch deployment training-operator -n kubeflow --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/manager",  "--pytorch-init-container-image=quay.io/jbusche/alpine:3.10"]}]'

4. Access the spawner page by going to your Open Data Hub dashboard. It'll be in the format of:

https://odh-dashboard-$ODH_NAMESPACE.apps.<your cluster's uri>

You can find it with this command:

oc get route -n opendatahub | grep dash

For example: https://odh-dashboard-odh.apps.jimbig412.cp.fyre.ibm.com/
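
Or, to print just the dashboard host (assuming the route is named odh-dashboard, as in the grep above):

oc get route odh-dashboard -n opendatahub -o jsonpath='{.spec.host}'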

- If prompted, log in with your kubeadmin user and password
- If prompted, grant the requested access as well

4.1 On the far left, click on "Data Science Projects" and then click on Create a Data Science Project. (The project name will become a new namespace.)

for example:

Name: demo-dsp
Description: Demo's DSP

Then press "Create"

4.2 Within your new Data Science Project, select "Create workbench"

  • give it a name, like "demo-wb"
  • choose "Jupyter Data Science" for the image
  • click "Create workbench" at the bottom.

4.3 You'll see the status as "Starting" initially.

  • Once it's in the running status, click on the blue "Open" link in the workbench to get access to the notebook.

4.4 Click on the black "Terminal" tile under the Other section to open up a terminal window.

Inside this terminal, do an "oc login" so that the terminal has access to your OpenShift cluster. For example:

oc login --token=sha256~lamzJ-exoR16UsbltkT-l0nKCL7XTSvLqqB4i54psBM --server=https://api.jimmed414.cp.fyre.ibm.com:6443

4.5 Now you should be able to see the pods on your OpenShift cluster. For example:

oc get pods

will return the pods in your newly created namespace:

NAME       READY   STATUS    RESTARTS   AGE
demo-wb-0   2/2     Running   0          14m

4.6 In your Jupyter notebook, install the kubeflow-training SDK for the Kubeflow Training Operator:

pip install kubeflow-training

Note: if you want the SDK from the main branch instead, skip this pip install; clone the repo (step 5.1) and install it by hand (step 5.2).

5. Let's try a PyTorch simple example job:

5.1 In your Jupyter notebook, clone the training-operator repo:

git clone https://github.com/kubeflow/training-operator.git

5.2 If you're installing the SDK by hand from the main branch, run:

cd /opt/app-root/src/training-operator/sdk/python
pip install -e .

5.3 On the left-hand side, expand the path to find simple.yaml:

--> training-operator --> examples --> pytorch --> simple.yaml

5.4 Open up simple.yaml and change the namespace:

from:
namespace: kubeflow
to:
namespace: demo-dsp

and then save simple.yaml.

Note: if your system is rate-limited pulling images from docker.io, you can also switch simple.yaml to pull from quay.io instead.

Change (in 2 places):
docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
to 
quay.io/jbusche/pytorch-mnist:v1beta1-45c5727
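
If you'd rather make both edits from the terminal, a couple of sed one-liners will do it (paths assume the clone location from step 5.1):

cd /opt/app-root/src/training-operator/examples/pytorch
sed -i 's|namespace: kubeflow|namespace: demo-dsp|' simple.yaml
sed -i 's|docker.io/kubeflowkatib/pytorch-mnist|quay.io/jbusche/pytorch-mnist|g' simple.yaml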

5.5 Using the terminal, apply simple.yaml:

cd /opt/app-root/src/training-operator/examples/pytorch
oc apply -f simple.yaml

and then watch it:

watch oc get pods,pytorchjobs -n demo-dsp

and it should look like this when it's all done:

Every 2.0s: oc get pods,pytorchjobs                                                                             demo-wb-0: Thu Feb  1 18:45:46 2024

NAME                          READY   STATUS      RESTARTS   AGE
pod/demo-wb-0                 2/2     Running     0          12m
pod/pytorch-simple-master-0   0/1     Completed   0          6m
pod/pytorch-simple-worker-0   0/1     Completed   0          6m

NAME                                     STATE       AGE
pytorchjob.kubeflow.org/pytorch-simple   Succeeded   6m
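
You can also check the training output from the master pod's log (pod name taken from the listing above):

oc logs pytorch-simple-master-0 -n demo-dsp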

5.6 Then you can delete it:

oc delete pytorchjob pytorch-simple -n demo-dsp

6.0 Run a PyTorch CPU job:

Note: for this demo, an image had to be built using the given Dockerfile. I've built the image and stored it on Quay.io.

6.1 On the left-hand side, navigate to training-operator --> examples --> pytorch --> cpu-demo

6.2 Double-click on demo.yaml and change the image in two places:

image: pytorch-cpu:py3.8
to
image: quay.io/jbusche/pytorch-cpu:py3.8

6.3 Save demo.yaml and then, using the terminal, submit it:

cd /opt/app-root/src/training-operator/examples/pytorch/cpu-demo
oc apply -f demo.yaml

6.4 And then watch it:

watch oc get pods,pytorchjobs -n demo-dsp

and it should look like this when it's all done:

Every 2.0s: oc get pods,pytorchjobs -n demo-dsp                                                                 demo-wb-0: Thu Feb  1 19:07:06 2024

NAME                        READY   STATUS      RESTARTS   AGE
pod/demo-wb-0               2/2     Running     0          34m
pod/torchrun-cpu-master-0   0/1     Completed   0          2m41s
pod/torchrun-cpu-worker-0   0/1     Completed   0          2m41s

NAME                                   STATE       AGE
pytorchjob.kubeflow.org/torchrun-cpu   Succeeded   2m41s

6.5 And then you can delete it when you're done:

oc delete pytorchjob torchrun-cpu -n demo-dsp

7.0 Let's try an MNIST example. Note: for this demo, an image had to be built using the given Dockerfile. I've built the image and stored it on Quay.io.

7.1 On the left-hand side, navigate to training-operator --> examples --> pytorch --> mnist --> v1

7.2 Double-click on pytorch_job_mnist_gloo.yaml and change the image in two places:

image: gcr.io/<your_project>/pytorch_dist_mnist:latest
to
image: quay.io/jbusche/pytorch_dist_mnist:latest

and also change the GPU count to 0 if you don't have any GPUs:

nvidia.com/gpu: 1
to
nvidia.com/gpu: 0

7.3 Save the file and then, using the terminal, submit it:

cd /opt/app-root/src/training-operator/examples/pytorch/mnist/v1
oc apply -f pytorch_job_mnist_gloo.yaml

7.4 And then watch it:

watch oc get pods,pytorchjobs -n demo-dsp

and it should look like this when it's all done (mine took about 9 minutes):

Every 2.0s: oc get pods,pytorchjobs -n demo-dsp                                  demo-wb-0: Thu Feb  1 20:11:02 2024

NAME                                   READY   STATUS      RESTARTS   AGE
pod/demo-wb-0                          2/2     Running     0          98m
pod/pytorch-dist-mnist-gloo-master-0   0/1     Completed   0          8m53s
pod/pytorch-dist-mnist-gloo-worker-0   0/1     Completed   0          8m53s

NAME                                              STATE       AGE
pytorchjob.kubeflow.org/pytorch-dist-mnist-gloo   Succeeded   8m53s

7.5 And then you can delete it when you're done:

oc delete pytorchjob pytorch-dist-mnist-gloo -n demo-dsp

Getting the training-operator SDK to work from a Jupyter Data Science notebook

Problems:

Problem 1. The release version of the kubeflow-training SDK doesn't have all the constants needed to run the examples.

Usually you would install the sdk with this command:

pip install kubeflow-training

and you'd get what is currently the 1.7.0 version of the SDK.

But to actually get it to work with the examples, you'd want to install from the main branch:

git clone https://github.com/kubeflow/training-operator.git
cd /opt/app-root/src/training-operator/sdk/python
pip install -e .
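
Alternatively, pip can install the SDK straight from the main branch in one step (this assumes git is available in the notebook image, which it is here since we cloned with it above):

pip install "git+https://github.com/kubeflow/training-operator.git@master#subdirectory=sdk/python"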

Problem 2. We need to add the workbench service account to the training-operator clusterrolebinding so that the workbench has permission to create PyTorchJobs:

oc edit clusterrolebinding training-operator

and at the end, append the service account you're using to the subjects list, for example:

- kind: ServiceAccount
  name: demo-wb
  namespace: demo-dsp
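
If you prefer a non-interactive command, the same subject can be appended with a JSON patch (using the example workbench service account from above):

oc patch clusterrolebinding training-operator --type='json' -p='[{"op": "add", "path": "/subjects/-", "value": {"kind": "ServiceAccount", "name": "demo-wb", "namespace": "demo-dsp"}}]'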

Problem 3. For the SDK to be able to pull the log results, a rule needs to be added to the training-operator clusterrole under the pods resources. Do this:

oc edit clusterrole training-operator

and under the - pods resources, add:

  - pods/log

otherwise, we see:

cannot get resource \"pods/log\" in API group \"\" in the namespace \"demo-dsp\"","reason":"Forbidden"

Problem 4. For the SDK to be able to use PVCs, the training-operator clusterrole needs a persistentvolumeclaims rule as well. Do this:

oc edit clusterrole training-operator

and under the - pods resources, add:

  - persistentvolumeclaims

otherwise, we see:

forbidden: User \"system:serviceaccount:demo-dsp:demo-wb\" cannot create resource \"persistentvolumeclaims\" in API group \"\" in the namespace \"demo-dsp\"","reason":"Forbidden","details":{"kind":"persistentvolumeclaims"},"code":403}
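
After making both clusterrole edits, you can verify the permissions with oc auth can-i (this assumes you're logged in as a user allowed to impersonate service accounts, such as kubeadmin); both commands should print "yes":

oc auth can-i get pods/log --as=system:serviceaccount:demo-dsp:demo-wb -n demo-dsp
oc auth can-i create persistentvolumeclaims --as=system:serviceaccount:demo-dsp:demo-wb -n demo-dsp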

SDK Demos:

There are four SDK demos (also in your clone, under training-operator/examples/sdk): https://github.com/kubeflow/training-operator/tree/master/examples/sdk

create-pytorchjob-from-func.ipynb

create-pytorchjob.ipynb

create-tfjob.ipynb

train_api.ipynb