Installing ODH on OpenShift and adding LM-Eval
Installing ODH on an OpenShift Cluster:
Note, this is the method I use to get LM-Eval installed. It's not the only method and I'm not even sure it's the best method, but perhaps you can leverage it to quickly get an environment up and running.
Table of Contents
- Prerequisites
- Install ODH with Fast Channel
- Install the DSCI prerequisite Operators
- Install the DSCI
- Install the DSC
- Configure your Kueue minimum requirements
- Configure LM-Eval
- Submit a sample LM-Eval job
- OFFLINE Testing
- Cleanup
0. Prerequisites
0.1 You need an OpenShift cluster. (Mine was OpenShift 4.16.11.)
0.2 You need to be logged into the OpenShift web console. Note, I install everything from the command line, but I use the console's "Copy login command" option to get the oc login token.
0.3 You need to be logged into the cluster with oc login. Something like this:
oc login --token=sha256~XXXX --server=https://api.jim414fips.cp.fyre.ibm.com:6443
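You can confirm the login worked with:
oc whoami --show-server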
1. Install ODH with Fast Channel
Using your terminal where you're logged in with oc login, issue this command:
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
labels:
operators.coreos.com/opendatahub-operator.openshift-operators: ""
name: opendatahub-operator
namespace: openshift-operators
spec:
channel: fast
installPlanApproval: Automatic
name: opendatahub-operator
source: community-operators
sourceNamespace: openshift-marketplace
startingCSV: opendatahub-operator.v2.23.0
EOF
You can check that it started with:
watch oc get pods,csv -n openshift-operators
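If you'd rather block until the operator finishes installing instead of watching, something like this works (a sketch; the CSV name matches the startingCSV above, so adjust it if the fast channel has moved on):
oc wait --for=jsonpath='{.status.phase}'=Succeeded csv/opendatahub-operator.v2.23.0 -n openshift-operators --timeout=300s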
2. Install the DSCI prerequisite Operators
2.1 Install service mesh
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
labels:
operators.coreos.com/servicemeshoperator.openshift-operators: ""
name: servicemeshoperator
namespace: openshift-operators
spec:
channel: stable
installPlanApproval: Automatic
name: servicemeshoperator
source: redhat-operators
sourceNamespace: openshift-marketplace
startingCSV: servicemeshoperator.v2.6.5
EOF
And then check it with:
watch oc get pods,csv -n openshift-operators
2.2 Install the serverless operator
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
name: openshift-serverless
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: serverless-operators
namespace: openshift-serverless
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: serverless-operator
namespace: openshift-serverless
spec:
channel: stable
name: serverless-operator
source: redhat-operators
sourceNamespace: openshift-marketplace
EOF
And then check it with:
watch oc get pods -n openshift-serverless
3. Install the DSCI
cat << EOF | oc apply -f -
apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
name: default-dsci
labels:
app.kubernetes.io/created-by: opendatahub-operator
app.kubernetes.io/instance: default
app.kubernetes.io/managed-by: kustomize
app.kubernetes.io/name: dscinitialization
app.kubernetes.io/part-of: opendatahub-operator
spec:
applicationsNamespace: opendatahub
devFlags:
logmode: production
monitoring:
namespace: opendatahub
managementState: Managed
serviceMesh:
auth:
audiences:
- 'https://kubernetes.default.svc'
controlPlane:
metricsCollection: Istio
name: data-science-smcp
namespace: istio-system
managementState: Managed
trustedCABundle:
customCABundle: ''
managementState: Managed
EOF
And then check it: (It should go into "Ready" state after about a minute or so)
watch oc get dsci
Also note that you'll see the Istio control plane start up as well here:
oc get pods -n openshift-operators
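If you want to block until the DSCI is ready rather than watching, this should work (a sketch, relying on the phase field that oc get dsci displays):
oc wait --for=jsonpath='{.status.phase}'=Ready dsci/default-dsci --timeout=600s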
4. Install the DSC
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
name: default-dsc
spec:
components:
codeflare:
managementState: Removed
dashboard:
managementState: Removed
datasciencepipelines:
managementState: Removed
kserve:
managementState: Managed
serving:
ingressGateway:
certificate:
type: SelfSigned
managementState: Managed
name: knative-serving
kueue:
managementState: Managed
modelmeshserving:
managementState: Removed
modelregistry:
managementState: Removed
ray:
managementState: Removed
trainingoperator:
managementState: Managed
trustyai:
managementState: Managed
workbenches:
managementState: Removed
EOF
Check that the pods are running:
watch oc get pods -n opendatahub
You should see these pods:
oc get pods -n opendatahub
NAME READY STATUS RESTARTS AGE
kserve-controller-manager-5766998974-mjxjc 1/1 Running 0 21m
kubeflow-training-operator-5dbf85f955-j9cf6 1/1 Running 0 5h26m
kueue-controller-manager-5449d484c7-phmm6 1/1 Running 0 5h27m
odh-model-controller-688594d55b-qwfxm 1/1 Running 0 22m
trustyai-service-operator-controller-manager-5d7f76d9fb-8xc2r 1/1 Running 0 21m
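You can also check the DataScienceCluster itself. A sketch along the same lines as the DSCI check above, relying on the phase column that oc get dsc displays:
oc get dsc
oc wait --for=jsonpath='{.status.phase}'=Ready dsc/default-dsc --timeout=600s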
5. Configure your Kueue minimum requirements
Note: you can skip this section for now; more detailed Kueue steps will follow later.
cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: "cq-small"
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
flavors:
- name: "cpu-flavor"
resources:
- name: "cpu"
nominalQuota: 5
- name: "memory"
nominalQuota: 20Gi
- name: "nvidia.com/gpu"
nominalQuota: 5
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
name: lq-trainer
namespace: default
spec:
clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
name: p2
value: 10000
description: "low priority"
EOF
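You can verify that the Kueue objects were created (ResourceFlavor, ClusterQueue, and WorkloadPriorityClass are cluster-scoped; the LocalQueue lives in the default namespace):
oc get resourceflavors,clusterqueues,workloadpriorityclasses
oc get localqueues -n default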
6. Configure LM-Eval
See this Getting started with LM-Eval article for the latest info
Note: with the latest ODH release, steps 6.1-6.3 are no longer needed. If you're on an older release and need to swap in a newer pod image to get LM-Eval going, use these steps:
6.1 Turn off the ODH operator so the configmap doesn't get overwritten
oc scale deploy opendatahub-operator-controller-manager -n openshift-operators --replicas=0
6.2 Update the configmap to reflect the latest available image:
oc edit cm trustyai-service-operator-config -n opendatahub
and change:
lmes-pod-image: quay.io/trustyai/ta-lmes-job:v1.30.0
to
lmes-pod-image: quay.io/trustyai/ta-lmes-job:v1.31.0
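If you prefer a non-interactive edit, a patch like this does the same thing (a sketch, using the key and image tag shown above):
oc patch configmap trustyai-service-operator-config -n opendatahub --type merge -p '{"data":{"lmes-pod-image":"quay.io/trustyai/ta-lmes-job:v1.31.0"}}'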
6.3 Then restart the trustyai-service-operator by killing its pod. For example:
oc get pods -n opendatahub |grep trustyai
and then
oc delete pod trustyai-service-operator-controller-manager-5f85dfbd95-5cwst -n opendatahub
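Alternatively, restarting the deployment saves you looking up the pod name (assuming the deployment is named after the pod prefix shown above):
oc rollout restart deployment/trustyai-service-operator-controller-manager -n opendatahub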
7. Submit a sample LM-Eval job
cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
namespace: default
spec:
allowOnline: true
allowCodeExecution: true
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base
taskList:
taskRecipes:
- card:
name: "cards.wnli"
template: "templates.classification.multi_class.relation.default"
logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n default
And once it pulls the image and runs for about 5 minutes, it should look like this:
oc get pods,lmevaljobs -n default
NAME READY STATUS RESTARTS AGE
pod/evaljob-sample 1/1 Running 0 25s
NAME STATE
lmevaljob.trustyai.opendatahub.io/evaljob-sample Running
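Once the job finishes, you can inspect the run. A sketch for pulling the results, assuming they land in the job's .status.results field as in recent TrustyAI releases:
oc logs pod/evaljob-sample -n default
oc get lmevaljob evaljob-sample -n default -o template --template={{.status.results}}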
OFFLINE Testing
Essentially I'm following the instructions from here: https://github.com/trustyai-explainability/reference/blob/main/lm-eval/LM-EVAL-NEXT.md#testing-local-mode-offline
0.1 Create a test namespace
oc create namespace test
oc project test
0.2 Create a PVC to hold the offline models, datasets, etc.:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/pvc.yaml
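If your terminal can't reach raw.githubusercontent.com, here's a minimal PVC sketch along the same lines (the name, size, and access mode match the output in 0.3; the storage class is left to the cluster default):
cat << EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lmeval-data
  namespace: test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF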
0.3 Check it:
oc get pvc -n test
And it should look something like this:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
lmeval-data Bound pvc-6ff1abbf-a995-459a-8e9d-f98e5bf1c2ae 20Gi RWO portworx-watson-assistant-sc 29s
- ARC Easy Testing
1.1 Deploy a Pod that will copy the models and datasets to the PVC:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/disconnected-flan-arceasy.yaml
1.2 Wait for it to complete:
watch oc get pods -n test
1.3 Delete the lmeval-downloader pod once it's complete:
oc delete pod lmeval-downloader -n test
1.4 Apply the YAML for the ARC Easy job:
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: "lmeval-arceasy-test"
labels:
opendatahub.io/dashboard: "true"
lmevaltests: "vllm"
spec:
model: hf
modelArgs:
- name: pretrained
value: "/opt/app-root/src/hf_home/flan"
taskList:
taskNames:
- "arc_easy"
logSamples: true
offline:
storage:
pvcName: "lmeval-data"
pod:
container:
env:
- name: HF_HUB_VERBOSITY
value: "debug"
- name: UNITXT_DEFAULT_VERBOSITY
value: "debug"
EOF
It should start up a pod
watch oc get pods -n test
and it'll look like this:
lmeval-arceasy-test 0/1 Completed 0 14m
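To inspect the run after it completes (the pod sticks around until you delete it):
oc logs pod/lmeval-arceasy-test -n test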
- UNITXT Testing (this takes a while)
2.1 Using the same PVC as above, apply the Unitxt downloader:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/downloader-flan-20newsgroups.yaml
2.2 Wait for it to complete:
watch oc get pods -n test
2.3 Delete the lmeval-downloader pod
oc delete pod lmeval-downloader -n test
2.4 Apply the Unitxt YAML:
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: "lmeval-unitxt-test"
spec:
model: hf
modelArgs:
- name: pretrained
value: "/opt/app-root/src/hf_home/flan"
taskList:
taskRecipes:
- card:
name: "cards.20_newsgroups_short"
template: "templates.classification.multi_class.title"
logSamples: true
offline:
storage:
pvcName: "lmeval-data"
pod:
container:
env:
- name: HF_HUB_VERBOSITY
value: "debug"
- name: UNITXT_DEFAULT_VERBOSITY
value: "debug"
EOF
And the pod should start up:
watch oc get pods -n test
And it'll look like this:
lmeval-unitxt-test 0/1 Completed 0 14m
Cleanup
Clean up your lmevaljob(s), for example:
oc delete lmevaljob evaljob-sample -n default
oc delete lmevaljob lmeval-arceasy-test lmeval-unitxt-test -n test
Clean up your Unitxt PVC:
oc delete pvc lmeval-data -n test
Clean up your Kueue resources, if you want that:
cat <<EOF | kubectl delete -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: "cq-small"
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources: ["cpu", "memory"]
flavors:
- name: "cpu-flavor"
resources:
- name: "cpu"
nominalQuota: 5
- name: "memory"
nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
name: lq-trainer
namespace: default
spec:
clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
name: p2
value: 10000
description: "low priority"
EOF
Clean up the DSC (if you want that):
oc delete dsc default-dsc
Clean up the DSCI (if you want that):
oc delete dsci default-dsci
Clean up the ODH operators (if you want that). Note: the CSV versions change on occasion; run oc get csv to get the latest version(s).
oc delete sub authorino-operator opendatahub-operator servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete csv authorino-operator.v0.13.0 opendatahub-operator.v2.23.0 servicemeshoperator.v2.6.5 -n openshift-operators
oc delete csv serverless-operator.v1.35.0 -n openshift-serverless
oc delete crd servicemeshcontrolplanes.maistra.io servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io servicemeshpolicies.authentication.maistra.io servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io