Installing ODH on OpenShift and adding LM-Eval

Installing ODH on an OpenShift Cluster:

Note: this is the method I use to get LM-Eval installed. It's not the only method, and I'm not sure it's the best one, but it should help you get an environment up and running quickly.


0. Prerequisites

0.1 You need an OpenShift cluster. (Mine was OpenShift 4.16.11.)

0.2 You need to be logged in to the OpenShift web console. I install everything from the command line, but the console's "Copy login command" is the easiest way to get the oc login token.

0.3 You need to be logged in to the cluster with oc login. Something like this:

oc login --token=sha256~XXXX --server=https://api.jim414fips.cp.fyre.ibm.com:6443
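
You can sanity-check that the login worked and that you're pointed at the right cluster with:

oc whoami
oc whoami --show-server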

1. Install ODH with Fast Channel

Using your terminal where you're logged in with oc login, issue this command:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/opendatahub-operator.openshift-operators: ""
  name: opendatahub-operator
  namespace: openshift-operators
spec:
  channel: fast
  installPlanApproval: Automatic
  name: opendatahub-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: opendatahub-operator.v2.23.0
EOF

You can check it started with:

watch oc get pods,csv -n openshift-operators
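
If you'd rather block until the operator is actually installed instead of eyeballing the watch output, a jsonpath wait along these lines should work (this assumes the startingCSV name above; adjust it if a newer CSV has already rolled out):

oc wait csv/opendatahub-operator.v2.23.0 -n openshift-operators --for=jsonpath='{.status.phase}'=Succeeded --timeout=300s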

2. Install the DSCI prerequisite Operators

2.1 Install service mesh

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.6.5
EOF

And then check it with:

watch oc get pods,csv -n openshift-operators

2.2 Install the serverless operator

cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-serverless
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: serverless-operators
  namespace: openshift-serverless
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable 
  name: serverless-operator 
  source: redhat-operators 
  sourceNamespace: openshift-marketplace 
EOF

And then check it with:

watch oc get pods -n openshift-serverless
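
The CSV for this one lands in the openshift-serverless namespace, so it's also worth a quick:

oc get csv -n openshift-serverless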

3. Install the DSCI

cat << EOF | oc apply -f -
apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default-dsci
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: dscinitialization
    app.kubernetes.io/part-of: opendatahub-operator
spec:
  applicationsNamespace: opendatahub
  devFlags:
    logmode: production
  monitoring:
    namespace: opendatahub
    managementState: Managed
  serviceMesh:
    auth:
      audiences:
        - 'https://kubernetes.default.svc'
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed
  trustedCABundle:
    customCABundle: ''
    managementState: Managed
EOF

And then check it (it should go into the "Ready" state after a minute or so):

watch oc get dsci
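
If you want to wait on it from a script rather than watch it, something like this should work (assuming the Ready state shown by oc get dsci comes from .status.phase):

oc wait dsci/default-dsci --for=jsonpath='{.status.phase}'=Ready --timeout=600s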

Also note that the Istio control plane (data-science-smcp) will start up as well, in the istio-system namespace:

oc get pods -n istio-system

4. Install the DSC

cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    modelregistry:
      managementState: Removed
    ray:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Managed
    workbenches:
      managementState: Removed
EOF

Check that the pods are running:

watch oc get pods -n opendatahub

You should see these pods:

oc get pods -n opendatahub
NAME                                                            READY   STATUS    RESTARTS   AGE
kserve-controller-manager-5766998974-mjxjc                      1/1     Running   0          21m
kubeflow-training-operator-5dbf85f955-j9cf6                     1/1     Running   0          5h26m
kueue-controller-manager-5449d484c7-phmm6                       1/1     Running   0          5h27m
odh-model-controller-688594d55b-qwfxm                           1/1     Running   0          22m
trustyai-service-operator-controller-manager-5d7f76d9fb-8xc2r   1/1     Running   0          21m
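
Before moving on to LM-Eval, you can also confirm that the TrustyAI operator registered the LMEvalJob CRD:

oc get crd lmevaljobs.trustyai.opendatahub.io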

5. Configure your Kueue minimum requirements

Note: skip this for now; the Kueue-specific steps will follow later.

cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 5
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
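
Once you do apply these, you can confirm the queue objects exist with something like:

oc get resourceflavors,clusterqueues,workloadpriorityclasses
oc get localqueues -n default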

6. Configure LM-Eval

See the "Getting started with LM-Eval" article for the latest info.

Note: with the latest ODH release, steps 6.1-6.3 below are no longer needed.

If you're on an older release, get LM-Eval going by swapping in a newer pod image with these steps:

6.1 Turn off the ODH operator so the configmap doesn't get overwritten

oc scale deploy opendatahub-operator-controller-manager -n openshift-operators --replicas=0

6.2 Update the configmap to reflect the latest available image:

oc edit cm trustyai-service-operator-config -n opendatahub

and change:

lmes-pod-image: quay.io/trustyai/ta-lmes-job:v1.30.0
to
lmes-pod-image: quay.io/trustyai/ta-lmes-job:v1.31.0
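
If you'd rather not edit the configmap interactively, a merge patch does the same thing (using the same key and image tag as above):

oc patch cm trustyai-service-operator-config -n opendatahub --type merge -p '{"data":{"lmes-pod-image":"quay.io/trustyai/ta-lmes-job:v1.31.0"}}'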

6.3 Then restart the trustyai-service-operator by killing its pod. For example:

oc get pods -n opendatahub |grep trustyai

and then

oc delete pod trustyai-service-operator-controller-manager-5f85dfbd95-5cwst -n opendatahub
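
Alternatively, a rollout restart saves you looking up the pod's random suffix (this assumes the deployment is named trustyai-service-operator-controller-manager, matching the pod prefix above):

oc rollout restart deploy/trustyai-service-operator-controller-manager -n opendatahub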

7. Submit a sample LM-Eval job

cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
  namespace: default
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF

And then watch that it starts and runs:

watch oc get pods,lmevaljobs -n default

Once the image is pulled, the job runs for about 5 minutes; while it's running it should look like this:

oc get pods,lmevaljobs -n default

NAME                 READY   STATUS    RESTARTS   AGE
pod/evaljob-sample   1/1     Running   0          25s

NAME                                               STATE
lmevaljob.trustyai.opendatahub.io/evaljob-sample   Running
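
When the job finishes, the scores are written into the custom resource's status. Something like this should pull them out (assuming the results land under .status.results, which is where recent TrustyAI builds put them):

oc get lmevaljob evaljob-sample -n default -o jsonpath='{.status.results}'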

OFFLINE Testing

Essentially I'm following the instructions from here: https://github.com/trustyai-explainability/reference/blob/main/lm-eval/LM-EVAL-NEXT.md#testing-local-mode-offline

0.1 Create a test namespace

oc create namespace test
oc project test

0.2 Create a PVC to hold the offline models, datasets, etc.

oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/pvc.yaml
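
That pvc.yaml is just a plain PVC; if you need to tweak it (for example to pin a storage class), a minimal equivalent matching the name and size shown in the output below looks roughly like this:

cat << EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lmeval-data
  namespace: test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  # storageClassName: <your-storage-class>   # uncomment to pin a specific class
EOF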

0.3 Check it:

oc get pvc -n test

And it should look something like this:

NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                   AGE
lmeval-data   Bound    pvc-6ff1abbf-a995-459a-8e9d-f98e5bf1c2ae   20Gi       RWO            portworx-watson-assistant-sc   29s
1. ARC Easy Testing

1.1 Deploy a Pod that will copy the models and datasets to the PVC:

oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/disconnected-flan-arceasy.yaml

1.2 Wait for it to complete:

watch oc get pods -n test

1.3 Delete the lmeval-downloader pod once it's complete:

oc delete pod lmeval-downloader -n test

1.4 Apply the YAML for the ARC Easy job:

cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "lmeval-arceasy-test"
  labels:
    opendatahub.io/dashboard: "true"
    lmevaltests: "vllm"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskNames:
      - "arc_easy"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF

It should start up a pod

watch oc get pods -n test

and it'll look like this:

lmeval-arceasy-test         0/1     Completed   0          14m
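
Once it shows Completed, the evaluation output is in the pod logs, and (assuming the same .status.results field as in the online example) on the LMEvalJob itself:

oc logs lmeval-arceasy-test -n test
oc get lmevaljob lmeval-arceasy-test -n test -o jsonpath='{.status.results}'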
2. UNITXT Testing (this takes a while)

2.1 Using the same PVC as above, apply the UNITXT downloader:

oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/downloader-flan-20newsgroups.yaml

2.2 Wait for it to complete:

watch oc get pods -n test

2.3 Delete the lmeval-downloader pod

oc delete pod lmeval-downloader -n test

2.4 Apply the UNITXT YAML:

cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "lmeval-unitxt-test"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        template: "templates.classification.multi_class.title"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF

And the pod should start up:

watch oc get pods -n test

And it'll look like this:

lmeval-unitxt-test         0/1     Completed   0          14m

Cleanup

Clean up your lmevaljob(s), for example:

oc delete lmevaljob evaljob-sample -n default
oc delete lmevaljob lmeval-arceasy-test lmeval-unitxt-test -n test

Clean up the lmeval-data PVC:

oc delete pvc lmeval-data -n test
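
Alternatively, if the test namespace was only created for this exercise, deleting the whole namespace takes the PVC and any leftover jobs with it:

oc delete namespace test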

Clean up your Kueue resources, if you want that:

cat <<EOF | kubectl delete -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF

Clean up the DSC (if you want that):

oc delete dsc default-dsc

Clean up the DSCI (if you want that):

oc delete dsci default-dsci

Clean up the ODH operators (if you want that). Note: the CSV versions change on occasion; run oc get csv to get the latest version(s).

oc delete sub authorino-operator opendatahub-operator servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete csv authorino-operator.v0.13.0 opendatahub-operator.v2.23.0 servicemeshoperator.v2.6.5 -n openshift-operators
oc delete csv serverless-operator.v1.35.0 -n openshift-serverless
oc delete crd servicemeshcontrolplanes.maistra.io  servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io  servicemeshpolicies.authentication.maistra.io  servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io