Caetano discussed his progress testing gantry on our staging cluster.
To make further progress, he will need a Prometheus instance on staging to ingest job metrics. We are going to help him get this set up.
We also discussed his current method of detecting and recording jobs that were OOMKilled. None of us could think of any better way to achieve this beyond scraping log output.
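As a rough illustration of the log-scraping approach, the sketch below flags jobs whose log text contains an OOM marker. It is not Caetano's actual implementation: the function names and the pattern list are assumptions, and the real marker strings depend on how the runner and Kubernetes report the failure.

```python
import re

# Assumed markers, not taken from gantry. "OOMKilled" is the reason Kubernetes
# reports for a container killed for exceeding its memory limit. Exit code 137
# (128 + SIGKILL) is a weaker hint, since other SIGKILLs produce it too.
OOM_PATTERNS = [
    re.compile(r"OOMKilled"),
    re.compile(r"\bexit code 137\b", re.IGNORECASE),
]


def looks_oom_killed(log_text: str) -> bool:
    """Return True if any assumed OOM marker appears in the log text."""
    return any(pattern.search(log_text) for pattern in OOM_PATTERNS)


def find_oom_jobs(logs: dict[str, str]) -> list[str]:
    """Given a mapping of job id to log text, return the sorted ids that look OOM-killed."""
    return sorted(job_id for job_id, text in logs.items() if looks_oom_killed(text))
```

Because this is a text heuristic, it can miss jobs whose logs were truncated before the marker was written, which is part of why we looked for a better signal.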
GitLab CI issues
Peter discussed CI issues he encountered while reviewing PR #46992, which included a full "rebuild everything" pipeline.
Peter expressed an interest in having a scheduled "rebuild everything" job so we can better identify and address any issues affecting full rebuild pipelines.
We think this is a good idea. The current plan is to schedule a "rebuild everything" pipeline to populate a separate binary mirror, which would allow us to compare it against our weekly snapshot mirrors in the case of regressions.
We also spent a little time using Metabase to calculate how expensive a "rebuild everything" pipeline is. For a single, recent "rebuild everything" pipeline, the answer we found was ~$50. This does not include resource costs from UO.
Peter also wanted a way to search GitLab CI job logs for a given error string to help answer the question, "is this the first time this failure has occurred, or is it more widespread?"
We demonstrated how to do this using OpenSearch. The particular error affecting PR #46992 ("Process terminated due to timeout") wasn't specific to this topic branch.
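For reference, a search like this can be expressed as an exact-phrase query in the OpenSearch query DSL. The sketch below only builds the request body. The field name "message" and the helper name are assumptions, since the notes do not record how the job logs are indexed.

```python
import json


def build_log_search(error_text: str, field: str = "message", size: int = 20) -> dict:
    """Build a request body that matches log entries containing the exact phrase.

    `field` is a hypothetical name for the indexed log-text field.
    """
    return {
        "size": size,  # maximum number of hits to return
        "query": {"match_phrase": {field: error_text}},
    }


body = build_log_search("Process terminated due to timeout")
print(json.dumps(body, indent=2))
```

The resulting body can be sent to the `_search` endpoint of whichever index holds the job logs. Hits across many unrelated pipelines suggest the failure is widespread rather than specific to one branch.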