Installing RHOAI on OpenShift and adding LM-Eval

Refer to the Red Hat docs here for more detail: https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.12/html-single/installing_and_uninstalling_openshift_ai_self-managed/index


0. Prerequisites

0.1 An OpenShift cluster up and running. (I've been using OpenShift 4.14.17.)

0.2 Logged into the OpenShift UI. Note: I install everything from the command line, but I need the UI's "Copy login command" to get the oc login token.

0.3 Also logged into the terminal with oc login, for example:

oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443

Note: If you have a GPU cluster:

0.4 You also need the GPU prerequisites from here: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html
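At a high level, the GPU prerequisites come down to the Node Feature Discovery (NFD) operator plus the NVIDIA GPU operator. As a rough sketch (operator name and channel assumed from the Red Hat/NVIDIA docs; verify against your OCP version), the NFD piece looks something like this:

cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

The NVIDIA GPU operator (gpu-operator-certified, from the certified-operators catalog) follows the same Subscription pattern; treat the NVIDIA link above as the authoritative steps.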

1. Install the Red Hat OpenShift AI Operator

1.1 Create a namespace and OperatorGroup:

cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator 
EOF
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
EOF

1.2 Install the Service Mesh operator

Note: if you are installing in production, you probably want installPlanApproval: Manual so that you're not surprised by operator updates before you've had a chance to verify them on a dev/stage cluster first.

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

and make sure it works:

watch oc get pods,csv -n openshift-operators

and it should look something like this:

NAME                              READY   STATUS    RESTARTS   AGE
istio-operator-6c99f6bf7b-rrh2j   1/1     Running   0          13m

1.3 Install the Serverless operator

cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-serverless
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: serverless-operators
  namespace: openshift-serverless
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable 
  name: serverless-operator 
  source: redhat-operators 
  sourceNamespace: openshift-marketplace 
EOF

And then check it with:

watch oc get pods,csv -n openshift-serverless
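If you want a scriptable check instead of watch, oc wait can poll the CSV phase (a sketch, assuming the label that OLM applies to the CSV and an oc new enough to support --for=jsonpath):

oc wait csv -n openshift-serverless \
  -l operators.coreos.com/serverless-operator.openshift-serverless \
  --for=jsonpath='{.status.phase}'=Succeeded --timeout=300s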

Note: for RHOAI 2.19+ there may be at least two more prerequisite operators.

Authorino:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/authorino-operator.openshift-operators: ""
  name: authorino-operator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: authorino-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

And kiali-ossm:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/kiali-ossm.openshift-operators: ""
  name: kiali-ossm
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: kiali-ossm
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
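Then check that both CSVs reach the Succeeded phase:

oc get csv -n openshift-operators | grep -E 'authorino|kiali'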

1.4 Create a Subscription for the RHOAI operator (recommended: change installPlanApproval to Manual in production)

Note: if you want the stable RHOAI 2.16.x version instead, change the channel below from fast to stable.

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator 
spec:
  name: rhods-operator
  channel: fast
  installPlanApproval: Automatic 
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

And watch that it starts:

watch oc get pods,csv -n redhat-ods-operator

2. Monitor the DSCI

Watch the DSCInitialization (DSCI) until it's Ready:

watch oc get dsci

and it'll finish up like this:

NAME           AGE   PHASE   CREATED AT
default-dsci   16m   Ready   2024-07-02T19:56:18Z
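If you'd rather block until it's done, oc wait should work too (assuming the phase lands in .status.phase, as the PHASE column suggests):

oc wait dsci default-dsci --for=jsonpath='{.status.phase}'=Ready --timeout=10m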

3. Install the Red Hat OpenShift AI components via DSC

cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      defaultDeploymentMode: RawDeployment
      serving:
        ingressGateway:
          certificate:
            secretName: knative-serving-cert
            type: SelfSigned
        managementState: Managed
        name: knative-serving 
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Removed
    workbenches:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Managed
EOF
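And watch the DataScienceCluster until it reports Ready:

watch oc get dsc

A scriptable equivalent, assuming the DSC exposes .status.phase like the DSCI does:

oc wait dsc default-dsc --for=jsonpath='{.status.phase}'=Ready --timeout=10m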

4. Check that everything is running

4.1 Check that your operators are running:

oc get pods -n redhat-ods-operator

Will return:

NAME                              READY   STATUS    RESTARTS   AGE
rhods-operator-7c54d9d6b5-j97mv   1/1     Running   0          22h

4.2 Check that the service mesh operator is running:

oc get pods -n openshift-operators 

Will return:

NAME                              READY   STATUS    RESTARTS        AGE
istio-cni-node-v2-5-9qkw7         1/1     Running   0               84s
istio-cni-node-v2-5-dbtz5         1/1     Running   0               84s
istio-cni-node-v2-5-drc9l         1/1     Running   0               84s
istio-cni-node-v2-5-k4x4t         1/1     Running   0               84s
istio-cni-node-v2-5-pbltn         1/1     Running   0               84s
istio-cni-node-v2-5-xbmz5         1/1     Running   0               84s
istio-operator-6c99f6bf7b-4ckdx   1/1     Running   1 (2m39s ago)   2m56s

4.3 Check that the DSC components are running:

watch oc get pods -n redhat-ods-applications

Will return:

NAME                                                            READY   STATUS    RESTARTS   AGE
kserve-controller-manager-7784c9878b-4fkv9                      1/1     Running   0          51s
kubeflow-training-operator-cb487d469-s78ch                      1/1     Running   0          2m11s
kueue-controller-manager-5fb585c7c4-zpdcj                       1/1     Running   0          4m21s
odh-model-controller-7b57f4b9d8-ztrgx                           1/1     Running   0          5m6s
trustyai-service-operator-controller-manager-5745f74966-2hc2z   1/1     Running   0          2m16s

6. Online LM-Eval job

See the "Getting started with LM-Eval" article for the latest info.

6.1 Create a test namespace. For now, all jobs must run in a namespace other than default.

oc create namespace test
oc project test

6.2 For online jobs (which pull the model from Hugging Face themselves), you need to turn on:

  lmes-allow-code-execution: "true"
  lmes-allow-online: "true"

To do that, change "false" to "true" for each key in the ConfigMap, then restart the trustyai operator pod. Either patch the ConfigMap directly:

kubectl patch configmap trustyai-service-operator-config -n redhat-ods-applications --type merge -p '{"data":{"lmes-allow-online":"true","lmes-allow-code-execution":"true"}}'

or edit it by hand:

oc edit cm trustyai-service-operator-config -n redhat-ods-applications

NOTE: for RHOAI 2.17.0, 2.18.0, 2.19.0, and 2.20.0, you also have to annotate the ConfigMap so the RHOAI operator doesn't overwrite your changes:

oc edit cm trustyai-service-operator-config -n redhat-ods-applications

and append under metadata.annotations:

opendatahub.io/managed: "false"
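Alternatively, oc annotate can set the same annotation in one step:

oc annotate cm trustyai-service-operator-config -n redhat-ods-applications opendatahub.io/managed=false --overwrite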

Then restart the trustyai operator:

kubectl rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications
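To verify the flags took effect, read them back with jsonpath:

oc get cm trustyai-service-operator-config -n redhat-ods-applications \
  -o jsonpath='{.data.lmes-allow-online}{" "}{.data.lmes-allow-code-execution}{"\n"}'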

6.3 Submit a sample LM-Eval job

cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-lmeval-glue"
  namespace: test
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      #template: "templates.classification.multi_class.relation.default"
      template:
        name: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF

And then watch that it starts and runs:

watch oc get pods,lmevaljobs -n test

And once it pulls the image and runs for about 5 minutes, it should look like this:

NAME                     READY   STATUS    RESTARTS   AGE
pod/online-lmeval-glue   1/1     Running   0          25s

NAME                                                   STATE
lmevaljob.trustyai.opendatahub.io/online-lmeval-glue   Running
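Once the STATE reaches Complete, the scores end up on the job itself; a sketch that pretty-prints them, assuming the operator writes them to .status.results as a JSON string and that jq is installed:

oc get lmevaljob online-lmeval-glue -n test -o jsonpath='{.status.results}' | jq '.results'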

To clean it up, do this:

oc delete lmevaljob online-lmeval-glue -n test

Another test you can try is "online-unitxt":

cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-unitxt"
  namespace: test
spec:
  allowOnline: true
  model: hf
  modelArgs:
    - name: pretrained
      value: "google/flan-t5-base"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        #template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
EOF

And then watch that it starts and runs:

watch oc get pods,lmevaljobs -n test

To clean it up, do this:

oc delete lmevaljob online-unitxt -n test

7. Offline Testing with Unitxt

7.0.1 First, change the ConfigMap back to false for online execution and restart the trustyai pod:

kubectl patch configmap trustyai-service-operator-config -n redhat-ods-applications --type merge -p '{"data":{"lmes-allow-online":"false","lmes-allow-code-execution":"false"}}'

Then restart the trustyai pod so it picks up the updated ConfigMap, either with a rollout restart:

kubectl rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications

or by scaling the deployment down:

oc scale deploy trustyai-service-operator-controller-manager -n redhat-ods-applications --replicas=0

and then start it back up again:

oc scale deploy trustyai-service-operator-controller-manager -n redhat-ods-applications --replicas=1

7.0.2 Create a test namespace (if you haven't already)

oc create namespace test
oc project test

7.0.3 Create a PVC to hold the offline models and datasets:

oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/pvc.yaml
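If you can't reach that URL (e.g., in a disconnected environment), a minimal equivalent PVC looks roughly like this (a sketch; the name lmeval-data is what the jobs below expect, and the StorageClass is left to your cluster's default):

cat << EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lmeval-data
  namespace: test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF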

7.0.4 Check the PVC

oc get pvc -n test

And it should look something like this:

NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                   AGE
lmeval-data   Bound    pvc-6ff1abbf-a995-459a-8e9d-f98e5bf1c2ae   20Gi       RWO            portworx-watson-assistant-sc   29s

7.1 ARC-Easy Testing

7.1.1 Deploy a Pod that will copy the models and datasets to the PVC:

oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/disconnected-flan-arceasy.yaml

7.1.2 Check for when it's complete

watch oc get pods -n test

7.1.3 Delete the lmeval-downloader pod once it's complete:

oc delete pod lmeval-downloader -n test

7.1.4 Apply the YAML for the ARC-Easy job

cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "lmeval-arceasy-test"
  labels:
    opendatahub.io/dashboard: "true"
    lmevaltests: "vllm"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskNames:
      - "arc_easy"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF

It should start up a pod:

watch oc get pods -n test

and it'll look like this:

lmeval-arceasy-test         0/1     Completed   0          14m
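To see the scores, check the pod logs (the pod shares the job's name), or pull .status.results as in section 6:

oc logs lmeval-arceasy-test -n test | tail -n 40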

7.1.5 Even though the pod has completed, it doesn't release the PVC, so the next test may fail if this pod still exists and the next loader job lands on a different worker node. Delete the first offline job once you're sure it has completed:

oc delete lmevaljob lmeval-arceasy-test -n test

7.2 Testing UNITXT (this takes a while)

7.2.1 Using the same PVC as above, apply the Unitxt loader:

oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/downloader-flan-20newsgroups.yaml

7.2.2 Check for when it's complete

watch oc get pods -n test

7.2.3 Delete the lmeval-downloader pod

oc delete pod lmeval-downloader -n test

7.2.4 Apply the Unitxt YAML

cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "lmeval-unitxt-test"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        #template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF

And the pod should start up:

watch oc get pods -n test

And it'll look like this eventually when it's done:

lmeval-unitxt-test         0/1     Completed   0          14m

8. Cleanup

8.1 Cleanup of your lmevaljob(s), for example:

oc delete lmevaljob evaljob-sample -n test
oc delete lmevaljob lmeval-arceasy-test lmeval-unitxt-test -n test

8.2 Cleanup of your Kueue resources, if you want that:

oc delete flavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete cq cq-small
oc delete lq lq-trainer
oc delete WorkloadPriorityClass p1 p2

8.3 Cleanup of the DSC (if you want that)

oc delete dsc default-dsc

8.4 Cleanup of DSCI (if you want that)

oc delete dsci default-dsci

8.5 Cleanup of the Operators (if you want that). Note: the CSV versions below are examples; adjust them to match what's installed (oc get csv -A).

oc delete sub servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete csv servicemeshoperator.v2.6.5 -n openshift-operators
oc delete csv serverless-operator.v1.35.0 -n openshift-serverless
oc delete crd servicemeshcontrolplanes.maistra.io  servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io  servicemeshpolicies.authentication.maistra.io  servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io
oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv rhods-operator.2.16.1 -n redhat-ods-operator
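If you're unsure of the exact CSV versions, here is a sketch that deletes by the labels OLM applies to each CSV (label names assumed from the Subscriptions above) instead:

oc delete csv -n openshift-operators -l operators.coreos.com/servicemeshoperator.openshift-operators
oc delete csv -n openshift-serverless -l operators.coreos.com/serverless-operator.openshift-serverless
oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator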

8.6 Cleanup of the OperatorGroup

oc delete OperatorGroup rhods-operator -n redhat-ods-operator