Commissioning on the grid - PanDAWMS/panda-harvester GitHub Wiki

This page describes commissioning efforts on the grid with the HTCondor plugins that use Condor-C to feed pilots to HTCondor, CREAM, and ARC CEs.

Running a few thousand workers on a single PanDA queue (PQ)

Goal

To see whether a harvester instance degrades when it runs a few thousand workers on a single PQ.

Result

Done on CERN public resources in PULL mode (Jan 2018). Several bugs were fixed, but no performance issue was found. The number of running jobs at CERN-PROD-preprod stayed steadily around 2k.


Testing unified pilot streaming (UPS) with a small number of workers

Goal

To see whether UPS works properly.

Preparation

UPS was implemented. A unified PQ was defined on CERN public resources.

Results

Done at CERN-PROD-DEV_UCORE (Feb 2018). Multiple flavors of pilots were submitted to the PQ, and they shared the underlying CPU resources. The number of single-core jobs was properly limited. More details are in the presentation.


Running a harvester with hundreds of PQs

Goals

To see whether there is a performance issue in the harvester instance, and to check that pilots are automatically submitted to each PQ with the proper attributes, without any manual intervention.

Preparation

A harvester instance will be configured to submit pilots to hundreds of PQs to which APF is already submitting pilots. Only a small number of pilots will be submitted for each PQ. Pilots from harvester and APF will go to the same PQs. SchedulerID=harvester-cern_cloud is set on harvester pilots so that they can be distinguished from APF pilots and so that a summary is available on a pandamon page. A couple of CondorCE and CreamCE PQs in Asia, the EU and the US should be tried before going through all PQs. The same pilot wrapper is used for all PQs, i.e., there is no special wrapper for the US.
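The SchedulerID tag is what lets the two pilot streams be told apart on a shared PQ. The following is a minimal sketch of that kind of summary, not the pandamon implementation. It assumes pilot records are available as dictionaries with the hypothetical keys `pq` and `schedulerID`. The PQ names and the non-harvester scheduler ID in the sample are made up for illustration; only `harvester-cern_cloud` comes from the text above.

```python
from collections import Counter

# Scheduler ID quoted in the text above; everything else here is illustrative.
HARVESTER_SCHEDULER_ID = "harvester-cern_cloud"


def count_pilots_by_source(records):
    """Count pilots per (PQ, source), where source is 'harvester' or 'other'.

    Each record is assumed to be a dict with hypothetical keys
    'pq' and 'schedulerID'.
    """
    counts = Counter()
    for record in records:
        if record.get("schedulerID") == HARVESTER_SCHEDULER_ID:
            source = "harvester"
        else:
            source = "other"  # e.g. APF pilots on the same PQ
        counts[(record["pq"], source)] += 1
    return counts


# Made-up sample records for two hypothetical PQs.
SAMPLE_RECORDS = [
    {"pq": "PQ_A", "schedulerID": "harvester-cern_cloud"},
    {"pq": "PQ_A", "schedulerID": "apf-example"},
    {"pq": "PQ_A", "schedulerID": "apf-example"},
    {"pq": "PQ_B", "schedulerID": "harvester-cern_cloud"},
]

if __name__ == "__main__":
    summary = count_pilots_by_source(SAMPLE_RECORDS)
    for (pq, source), n in sorted(summary.items()):
        print(pq, source, n)
```

Grouping by the pair (PQ, source) keeps harvester and APF pilots comparable per queue, which is the comparison this test needs.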

Results

Completed in mid-March 2018. All types of CEs worked fine except GT5 CEs, which were being retired and were therefore ignored. Several changes were made to harvester to reduce its CPU consumption.


Switching a big site to Harvester from APF

Goal

To establish procedures for migrating from APF to Harvester, to keep a sufficient number of running jobs at the site, and to check the performance of the harvester instance, such as CPU and memory usage, and the stability and robustness of the service.

Preparation

Set up one more harvester node to avoid a single point of failure.

Monitoring

Job monitoring, Harvester monitoring, and Node monitoring.

Results

Started at BNL on 20 Mar 2018. One harvester instance fed pilots+jobs to BNL_PROD in PUSH mode to avoid empty pilots, while 4 APFs fed pilots to the same PQ in PULL mode. On 27 Mar it was confirmed that harvester was running 800 jobs while the APFs were running 900 jobs. Some jobs failed with "Condor HoldReason: CE job in status 1 put on hold by SYSTEM_PERIODIC_HOLD due to non-existent route or entry in JOB_ROUTER_ENTRIES. ; Condor RemoveReason: via condor_rm (by user atlpan)" because they were HELD for 6 hours and then killed. This kind of failure will disappear once the PQ is changed to use PULL. After that, it was decided to create a separate PQ, served only by harvester, that shares the same slot allocation with BNL_PROD. TBD: whether to use UPS or PULL for that PQ to avoid empty pilots.
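When many jobs fail this way, it helps to separate the hold reason from the remove reason and count the failures that come from the job router. The following is a minimal sketch and not part of harvester. It assumes the diagnostic is available as a single string in the format quoted above, with fields separated by ";". The function names are hypothetical.

```python
import re


def parse_condor_reasons(message):
    """Split a combined diagnostic into a dict such as
    {'HoldReason': '...', 'RemoveReason': '...'}.

    Assumes fields look like 'Condor <Name>: <text>' and are separated
    by ';', as in the message quoted above. A reason that itself
    contains ';' would be truncated by this simple split.
    """
    reasons = {}
    for part in message.split(";"):
        match = re.match(r"\s*Condor (\w+):\s*(.*)", part, flags=re.DOTALL)
        if match:
            reasons[match.group(1)] = match.group(2).strip()
    return reasons


def is_job_router_hold(message):
    """True if the job was held by SYSTEM_PERIODIC_HOLD for a missing route."""
    hold = parse_condor_reasons(message).get("HoldReason", "")
    return "SYSTEM_PERIODIC_HOLD" in hold and "JOB_ROUTER_ENTRIES" in hold


# The diagnostic string quoted in the text above.
EXAMPLE_MESSAGE = (
    "Condor HoldReason: CE job in status 1 put on hold by "
    "SYSTEM_PERIODIC_HOLD due to non-existent route or entry in "
    "JOB_ROUTER_ENTRIES. ; Condor RemoveReason: via condor_rm (by user atlpan)"
)

if __name__ == "__main__":
    print(parse_condor_reasons(EXAMPLE_MESSAGE))
    print(is_job_router_hold(EXAMPLE_MESSAGE))
```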


Testing UPS at a large site

Goals

To confirm UPS works properly and to see if harvester can automatically discover a proper queue limit.

Preparation

A new UCORE PQ will be created at a site still to be chosen, and the harvester pilot submission rate there will be increased gradually. Eventually, the existing PQs will be set offline to let UPS manage all CPU resources. The underlying batch system will be reconfigured if necessary.
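One way to picture a gradual ramp-up that settles on a queue limit is a simple feedback rule: raise the limit while queued pilots stay few compared with running ones, and back off when they pile up. The sketch below only illustrates that idea. It is not harvester's actual algorithm, and the function name, parameters and default values are all assumptions.

```python
def next_submission_limit(current_limit, n_queued, n_running,
                          step=50, max_queue_ratio=0.5, floor=10):
    """Return the pilot limit to use in the next submission cycle.

    Hypothetical rule: if queued pilots are at most `max_queue_ratio`
    times the running pilots, the site is absorbing pilots well, so
    raise the limit by `step`. Otherwise lower it by `step`, but never
    below `floor`.
    """
    if n_queued <= max_queue_ratio * max(n_running, 1):
        return current_limit + step
    return max(floor, current_limit - step)


if __name__ == "__main__":
    limit = 100
    # Made-up (queued, running) observations over four cycles.
    for n_queued, n_running in [(20, 200), (40, 300), (400, 350), (100, 350)]:
        limit = next_submission_limit(limit, n_queued, n_running)
        print(n_queued, n_running, limit)
```

An additive step in both directions keeps the ramp slow and predictable, which suits a commissioning test where the batch system may still be reconfigured along the way.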

Results

To be done.


Migration of all PULL PQs to Harvester

Goals

To fully migrate to harvester all PULL PQs that are currently served by APF.

Results

To be done.