Installing RHOAI on OpenShift and adding LM-Eval
Refer to the Red Hat docs here for more detail: https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.12/html-single/installing_and_uninstalling_openshift_ai_self-managed/index
Table of Contents
- Prerequisites
- Install the Red Hat OpenShift AI Operator
- Monitor DSCI
- Install the Red Hat OpenShift AI components via DSC
- Check that everything is running
- TBD - Kueue setup
- Online LM-Eval Job
- Offline testing with unitxt
- Cleanup
0. Prerequisites
0.1 OpenShift Cluster up and running. (I've been using OpenShift 4.14.17)
0.2 Logged in to the OpenShift UI. Note: I install everything from the command line, but I need the UI's "Copy login command" to get the oc login token.
0.3 Also logged in to the terminal with oc login, for example:
oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
Note: if you have a GPU cluster:
0.4 You also need the GPU prerequisites from here: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html
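A quick way to sanity-check the GPU setup, assuming the NVIDIA GPU Operator is installed in its default nvidia-gpu-operator namespace and GPU Feature Discovery is labelling your nodes:
oc get pods -n nvidia-gpu-operator
oc get nodes -l nvidia.com/gpu.present=true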
1. Install the Red Hat OpenShift AI Operator
1.1 Create a namespace and OperatorGroup:
cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator
EOF
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
EOF
1.2 Install the Service Mesh operator
Note: if you are installing in production, you probably want installPlanApproval: Manual so that you're not surprised by operator updates until you've had a chance to verify them on a dev/stage server first.
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
and make sure it works:
watch oc get pods,csv -n openshift-operators
and it should look something like this:
NAME READY STATUS RESTARTS AGE
istio-operator-6c99f6bf7b-rrh2j 1/1 Running 0 13m
1.3 Install the serverless operator
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-serverless
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: serverless-operators
  namespace: openshift-serverless
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable
  name: serverless-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
And then check it with:
watch oc get pods,csv -n openshift-serverless
Note: for RHOAI 2.19+ there may be at least two more prerequisites.
Authorino:
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/authorino-operator.openshift-operators: ""
  name: authorino-operator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: authorino-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
And kiali-ossm
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/kiali-ossm.openshift-operators: ""
  name: kiali-ossm
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: kiali-ossm
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
1.4 Create a subscription (Recommend changing installPlanApproval to Manual in production)
Note: If you want the stable RHOAI 2.16.x version instead, change the channel below from fast to stable.
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  name: rhods-operator
  channel: fast
  installPlanApproval: Automatic
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
And watch that it starts:
watch oc get pods,csv -n redhat-ods-operator
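If you do set installPlanApproval: Manual as recommended for production, each operator update waits for an approved InstallPlan. A minimal sketch of approving one (the plan name below is a placeholder; look it up with the first command):
oc get installplan -n redhat-ods-operator
oc patch installplan <install-plan-name> -n redhat-ods-operator --type merge -p '{"spec":{"approved":true}}'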
2. Monitor DSCI
Watch the dsci until it's complete:
watch oc get dsci
and it'll finish up like this:
NAME AGE PHASE CREATED AT
default-dsci 16m Ready 2024-07-02T19:56:18Z
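If the DSCI sits in a non-Ready phase, the status conditions usually explain why:
oc describe dsci default-dsci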
3. Install the Red Hat OpenShift AI components via DSC
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      defaultDeploymentMode: RawDeployment
      serving:
        ingressGateway:
          certificate:
            secretName: knative-serving-cert
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Removed
    workbenches:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Managed
EOF
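You can watch the DataScienceCluster the same way you watched the DSCI (the exact columns vary by RHOAI version):
watch oc get dsc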
4. Check that everything is running
4.1 Check that your operators are running:
oc get pods -n redhat-ods-operator
Will return:
NAME READY STATUS RESTARTS AGE
rhods-operator-7c54d9d6b5-j97mv 1/1 Running 0 22h
4.2 Check that the service mesh operator is running:
oc get pods -n openshift-operators
Will return:
NAME READY STATUS RESTARTS AGE
istio-cni-node-v2-5-9qkw7 1/1 Running 0 84s
istio-cni-node-v2-5-dbtz5 1/1 Running 0 84s
istio-cni-node-v2-5-drc9l 1/1 Running 0 84s
istio-cni-node-v2-5-k4x4t 1/1 Running 0 84s
istio-cni-node-v2-5-pbltn 1/1 Running 0 84s
istio-cni-node-v2-5-xbmz5 1/1 Running 0 84s
istio-operator-6c99f6bf7b-4ckdx 1/1 Running 1 (2m39s ago) 2m56s
4.3 Check that the DSC components are running:
watch oc get pods -n redhat-ods-applications
Will return:
NAME READY STATUS RESTARTS AGE
kserve-controller-manager-7784c9878b-4fkv9 1/1 Running 0 51s
kubeflow-training-operator-cb487d469-s78ch 1/1 Running 0 2m11s
kueue-controller-manager-5fb585c7c4-zpdcj 1/1 Running 0 4m21s
odh-model-controller-7b57f4b9d8-ztrgx 1/1 Running 0 5m6s
trustyai-service-operator-controller-manager-5745f74966-2hc2z 1/1 Running 0 2m16
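You can also confirm that the LMEvalJob CRD from TrustyAI is installed before submitting any jobs (the same CRD name appears again in the Cleanup section):
oc get crd lmevaljobs.trustyai.opendatahub.io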
6. Online LM-Eval Job
See the Getting started with LM-Eval article for the latest information.
6.1 Create a test namespace. For the moment, all of the jobs need to run in a namespace other than default.
oc create namespace test
oc project test
6.2 For online jobs (which pull the model from Hugging Face themselves) you need to turn on
lmes-allow-code-execution: "true"
lmes-allow-online: "true"
To do that, change "false" to "true" for each of those keys in the configmap, then restart the trustyai pod.
kubectl patch configmap trustyai-service-operator-config -n redhat-ods-applications --type merge -p '{"data":{"lmes-allow-online":"true","lmes-allow-code-execution":"true"}}'
or
oc edit cm trustyai-service-operator-config -n redhat-ods-applications
NOTE: for RHOAI 2.17.0, 2.18.0, 2.19.0, and 2.20.0 you also have to add an annotation to the configmap so the RHOAI operator doesn't overwrite your changes:
oc edit cm trustyai-service-operator-config -n redhat-ods-applications
and append the following under metadata.annotations:
opendatahub.io/managed: "false"
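Equivalently, you can set that annotation without an interactive edit:
oc annotate configmap trustyai-service-operator-config -n redhat-ods-applications opendatahub.io/managed="false" --overwrite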
Restart the trustyai operator
kubectl rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications
6.3 Submit a sample LM-Eval job
cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-lmeval-glue"
  namespace: test
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      - card:
          name: "cards.wnli"
        #template: "templates.classification.multi_class.relation.default"
        template:
          name: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n test
And once it pulls the image and runs for about 5 minutes it should look like this:
oc get pods,lmevaljobs -n test
NAME                     READY   STATUS    RESTARTS   AGE
pod/online-lmeval-glue   1/1     Running   0          25s

NAME                                                    STATE
lmevaljob.trustyai.opendatahub.io/online-lmeval-glue    Running
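Once the job's STATE changes to Complete, the evaluation results should be available in the job's status; a sketch, assuming the results are written to .status.results as described in the TrustyAI LM-Eval docs:
oc get lmevaljob online-lmeval-glue -n test -o jsonpath='{.status.results}'
The output is a JSON document, so pipe it through jq if you have it installed.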
To clean it up, do this:
oc delete lmevaljob online-lmeval-glue -n test
Another test you can try is an online unitxt job:
cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-unitxt"
  namespace: test
spec:
  allowOnline: true
  model: hf
  modelArgs:
    - name: pretrained
      value: "google/flan-t5-base"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        #template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n test
To clean it up, do this:
oc delete lmevaljob online-unitxt -n test
7. Offline Testing with Unitxt
7.0.1 First change the configmap back to false for online execution and restart the trustyai pod
kubectl patch configmap trustyai-service-operator-config -n redhat-ods-applications --type merge -p '{"data":{"lmes-allow-online":"false","lmes-allow-code-execution":"false"}}'
Then restart the trustyai pod so it picks up the updated configmap, either with a rollout restart:
kubectl rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications
or by scaling the deployment down:
oc scale deploy trustyai-service-operator-controller-manager -n redhat-ods-applications --replicas=0
and then start it back up again:
oc scale deploy trustyai-service-operator-controller-manager -n redhat-ods-applications --replicas=1
7.0.2 Create a test namespace (skip this if you still have the one from section 6):
oc create namespace test
oc project test
7.0.3 Create a PVC to hold the offline models and datasets:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/pvc.yaml
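If your cluster can't reach raw.githubusercontent.com, the manifest is roughly equivalent to this sketch (same PVC name and size as shown in the check below; the storage class is cluster-specific, so it is omitted here and the default is used):
cat << EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lmeval-data
  namespace: test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF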
7.0.4 Check the pvc
oc get pvc -n test
And it should look something like this:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
lmeval-data Bound pvc-6ff1abbf-a995-459a-8e9d-f98e5bf1c2ae 20Gi RWO portworx-watson-assistant-sc 29s
7.1 ARC-Easy Testing
7.1.1 Deploy a Pod that will copy the models and datasets to the PVC:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/disconnected-flan-arceasy.yaml
7.1.2 Check for when it's complete
watch oc get pods -n test
7.1.3 Delete the lmeval-downloader pod once it's complete:
oc delete pod lmeval-downloader -n test
7.1.4 Apply the YAML for the ARC-Easy job:
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "lmeval-arceasy-test"
  labels:
    opendatahub.io/dashboard: "true"
    lmevaltests: "vllm"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskNames:
      - "arc_easy"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF
It should start up a pod
watch oc get pods -n test
and it'll look like this:
lmeval-arceasy-test 0/1 Completed 0 14m
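Once it shows Completed, you can read the scores from the pod logs, or (hedged as above, assuming results land in .status.results) from the job status:
oc logs pod/lmeval-arceasy-test -n test
oc get lmevaljob lmeval-arceasy-test -n test -o jsonpath='{.status.results}'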
7.1.5 Even though the pod is done, it doesn't release the PVC. The next test can fail if this first pod still exists and the downloader pod lands on a different worker node, so delete the first offline job once you're sure it has completed.
oc delete lmevaljob lmeval-arceasy-test -n test
7.2 Testing UNITXT (this takes a while)
7.2.1 Using the same PVC as above, apply the unitxt downloader:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/downloader-flan-20newsgroups.yaml
7.2.2 Check for when it's complete
watch oc get pods -n test
7.2.3 Delete the lmeval-downloader pod
oc delete pod lmeval-downloader -n test
7.2.4 Apply the unitxt YAML:
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "lmeval-unitxt-test"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        #template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF
And the pod should start up:
watch oc get pods -n test
And it'll look like this eventually when it's done:
lmeval-unitxt-test 0/1 Completed 0 14m
8. Cleanup
8.1 Clean up your lmevaljob(s), for example:
oc delete lmevaljob evaljob-sample -n test
oc delete lmevaljob lmeval-arceasy-test lmeval-unitxt-test -n test
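If you also want to remove the PVC and test namespace used for these jobs (deleting the namespace removes anything left inside it):
oc delete pvc lmeval-data -n test
oc delete namespace test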
8.2 Cleanup of your Kueue resources, if you want that:
oc delete flavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete cq cq-small
oc delete lq lq-trainer
oc delete WorkloadPriorityClass p1 p2
8.3 Cleanup of dsc items (if you want that)
oc delete dsc default-dsc
8.4 Cleanup of DSCI (if you want that)
oc delete dsci default-dsci
8.5 Cleanup of the Operators (if you want that)
oc delete sub servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete csv servicemeshoperator.v2.6.5 -n openshift-operators
oc delete csv serverless-operator.v1.35.0 -n openshift-serverless
oc delete crd servicemeshcontrolplanes.maistra.io servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io servicemeshpolicies.authentication.maistra.io servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io
oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv rhods-operator.2.16.1 -n redhat-ods-operator
8.6 Cleanup of the operatorgroup
oc delete OperatorGroup rhods-operator -n redhat-ods-operator