GSoC 2023 Term: Optimize Eclipse Adoptium Pipelines (Techniques for Reducing Test Node Usage)

Eclipse Adoptium has a large inventory of infrastructure to support builds and tests. We gather metrics on machine utilization and store them in an application written as part of the Eclipse AQAvit project (TRSS, the Test Results Summary Service; a live instance runs at https://trss.adoptium.net/ and the source code lives in https://github.com/adoptium/aqa-test-tools).

We want to extend the information that we gather and store in the TRSS database (to include queue times and machine idleness) and assess various approaches to optimizing our use of the limited nodes we have, including the use of dynamic agents, different parallelism strategies, and different scheduling tactics to minimize idle time. This project will involve data gathering, analysis, and hands-on pipeline changes to trial and measure new variations of node usage. A possible starting point for the data-gathering side is sketched below.
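
As one concrete way to gather the new data, the following is a minimal sketch that polls a Jenkins controller's JSON API for queue wait times and node idleness. It is an illustration, not production code: the controller URL is a placeholder and persistence into the TRSS database is stubbed out as a console.log; the /queue/api/json and /computer/api/json endpoints and the fields read from them are part of Jenkins' standard JSON API.

```js
// Minimal sketch: poll a Jenkins controller for queue wait times and node
// idleness. Assumes Node.js 18+ (global fetch). JENKINS_URL is a placeholder
// for one of the project's controllers; writing the samples into the TRSS
// database is stubbed out below.
const JENKINS_URL = 'https://ci.example.org'; // placeholder controller URL

async function getJson(path) {
  const res = await fetch(`${JENKINS_URL}${path}`);
  if (!res.ok) throw new Error(`${path} returned HTTP ${res.status}`);
  return res.json();
}

// Queue items carry inQueueSince, a millisecond epoch timestamp.
async function sampleQueueTimes() {
  const { items = [] } = await getJson('/queue/api/json');
  const now = Date.now();
  return items.map((item) => ({
    task: item.task && item.task.name,
    why: item.why,                      // human-readable reason it is waiting
    queuedMs: now - item.inQueueSince,  // time spent in the queue so far
  }));
}

// The computer API reports whether each node is idle or offline.
async function sampleNodeIdleness() {
  const { computer = [] } = await getJson('/computer/api/json');
  return computer.map((c) => ({
    name: c.displayName,
    idle: c.idle,        // true when none of the node's executors are busy
    offline: c.offline,
  }));
}

Promise.all([sampleQueueTimes(), sampleNodeIdleness()])
  .then(([queue, nodes]) => {
    // In the real project these samples would be written to the TRSS database.
    console.log(JSON.stringify({ sampledAt: new Date().toISOString(), queue, nodes }, null, 2));
  })
  .catch(console.error);
```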

Expected outcomes

We expect that this project will provide us with metrics that let us refine our build and test pipelines to make better use of our limited machine resources. A survey of current CI/CD best practices, together with a comparative analysis of several pipeline variations, should be one of the concrete outcomes of this work, and will also be valuable to share with other projects. Another outcome will be a dynamic map of our infrastructure network, or other visualizations of the data, represented in the TRSS application. Ideally the findings will allow us to reduce our infrastructure costs and maintenance requirements.

Skills required/preferred

Successful candidates should have a good understanding of the JavaScript programming language, and be prepared to gather and analyze large amounts of data and to present regularly on findings. Skills in data visualization are also helpful.

Project size - proposed to be 350 hours


Inspect, measure and assess current machine usage, and trial various 'best practice' approaches, comparing utilization KPIs to see whether there are better ways to use our resources.

  • Define a set of KPIs to use for comparison during trials
  • Establish a 'baseline' of pipeline information to compare against (average values of the KPIs over a chosen window of time; a small computation sketch follows this list)
  • Enhance and use the measurement tools the project already has (such as TRSS and the aqastats scripts linked below)
  • Modify and run pipeline variants, then measure and compare them against the current approach; modifications should include trialing features already supported by TKG (such as dynamic parallelization and dynamic compilation)
  • Communicate findings via a short, preferably visual, presentation
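
To make the 'baseline' step concrete, here is a minimal sketch of computing two candidate KPIs, mean and 95th-percentile build duration, from a batch of historical build records. The record shape and sample values are hypothetical; in practice the records would come from TRSS or the Jenkins API.

```js
// Sketch: baseline KPIs (mean and p95 duration) over historical build records.
// The record shape and sample data below are hypothetical placeholders.
function percentile(sortedValues, p) {
  const idx = Math.min(
    sortedValues.length - 1,
    Math.ceil((p / 100) * sortedValues.length) - 1
  );
  return sortedValues[Math.max(0, idx)];
}

function baselineKpis(builds) {
  const durations = builds.map((b) => b.durationMs).sort((a, b) => a - b);
  const mean = durations.reduce((sum, d) => sum + d, 0) / durations.length;
  return {
    runs: durations.length,
    meanDurationMs: mean,
    p95DurationMs: percentile(durations, 95),
  };
}

// Hypothetical sample records for illustration only.
const sample = [
  { buildName: 'Test_openjdk17_hs_sanity.functional_x86-64_linux', durationMs: 54 * 60e3 },
  { buildName: 'Test_openjdk17_hs_sanity.functional_x86-64_linux', durationMs: 61 * 60e3 },
  { buildName: 'Test_openjdk17_hs_sanity.functional_x86-64_linux', durationMs: 58 * 60e3 },
];
console.log(baselineKpis(sample));
```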

Resources

Example types of test automation optimizations to vary in trials (from BrowserStack)

  • Target an acceptable test execution time (see the dynamic parallelization feature; a scheduling sketch follows this list)
  • Configure the CI setup for efficiency (scheduling and test frequency)
  • Keep tests short (assess which test targets are the largest and whether they can be decomposed into smaller pieces; some work has already been done in this area, particularly for jck test targets)
  • Execute tests in parallel (we have several ways of running tests in parallel, which we should compare)
  • Set up alerting, monitoring and bug tracking (we have several ways of doing this, to contrast and compare)
  • Keep test hygiene in check (we have established naming conventions, keep SHAs of each run, etc.; are we making the most of these hygienic practices?)
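
To illustrate the idea behind dynamic parallelization, splitting test targets across machines based on expected execution times, the sketch below uses the classic longest-processing-time (LPT) greedy heuristic. This is not TKG's actual implementation (see the Dynamic Parallelism link below); the target names and time estimates are assumptions for illustration.

```js
// Sketch of longest-processing-time (LPT) greedy scheduling: assign test
// targets to N parallel lists so the longest list is as short as possible.
// Illustrates the technique behind dynamic parallelization; not TKG's code.
function splitTargets(targets, numLists) {
  const lists = Array.from({ length: numLists }, () => ({ totalMs: 0, targets: [] }));
  // Place the longest targets first, each onto the currently shortest list.
  for (const t of [...targets].sort((a, b) => b.estimatedMs - a.estimatedMs)) {
    const shortest = lists.reduce((min, l) => (l.totalMs < min.totalMs ? l : min));
    shortest.targets.push(t.name);
    shortest.totalMs += t.estimatedMs;
  }
  return lists;
}

// Hypothetical per-target estimates (in practice, derived from TRSS history).
const targets = [
  { name: 'sanity.openjdk', estimatedMs: 90 * 60e3 },
  { name: 'extended.functional', estimatedMs: 70 * 60e3 },
  { name: 'sanity.system', estimatedMs: 40 * 60e3 },
  { name: 'sanity.perf', estimatedMs: 30 * 60e3 },
];
console.log(JSON.stringify(splitTargets(targets, 2), null, 2));
```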

Pulling metrics from TRSS: https://github.com/smlambert/aqastats
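
For a feel of what pulling data out of TRSS looks like, here is a small sketch that fetches build history over HTTP, assuming Node.js 18+. The endpoint path and its parameters are assumptions for illustration only; the aqastats repository linked above contains queries known to work against the live instance.

```js
// Sketch: fetch historical build data from a TRSS server. The endpoint path
// and query parameters below are assumptions for illustration; consult the
// aqastats scripts for queries known to work against the live instance.
const TRSS_URL = 'https://trss.adoptium.net';

async function fetchBuildHistory(buildName) {
  // Hypothetical endpoint; adjust to the real TRSS API.
  const url = `${TRSS_URL}/api/getBuildHistory?buildName=${encodeURIComponent(buildName)}`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`TRSS returned HTTP ${res.status}`);
  return res.json();
}

fetchBuildHistory('Test_openjdk17_hs_sanity.functional_x86-64_linux')
  .then((builds) => console.log(`fetched ${builds.length} records`))
  .catch(console.error);
```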

Latest work in machine monitoring: https://github.com/adoptium/infrastructure/wiki/Outreachy:-Open-Infrastructure-Monitoring-Configuration:-Project-MonCon

Example of CI/CD best practices: https://www.jetbrains.com/teamcity/ci-cd-guide/ci-cd-best-practices/

Dynamic Parallelism: https://github.com/adoptium/TKG/pull/67

Dynamic Compile: https://github.com/adoptium/TKG/pull/238

Related Issues

Assess test target execution time & define test schedule: https://github.com/adoptium/aqa-tests/issues/2037

Change-based testing EPIC: https://github.com/adoptium/aqa-tests/issues/2186

Gather queue times in TRSS: https://github.com/adoptium/aqa-test-tools/issues/777


For questions and discussion, please join the #gsoc channel in our Slack workspace (https://adoptium.net/slack/).


Mentors: Shelley Lambert, Scott Fryer, Sophia Guo

Additional consultant mentors: Lan Xia, Renfei Wang