Condor troubleshooting - PanDAWMS/panda-harvester GitHub Wiki

This page collects some Condor knowledge and troubleshooting help.

Configuration

There are some example templates for the HTCondor plugin here. In order to properly point the configuration to your Condor resources, you need to:

  • Look up in AGIS the name of the condor-ce endpoint (replace the name of your PanDA queue in the URL).
  • Look up the schedd name:
$ condor_q -p <ce-endpoint> -g | grep Schedd
-- Schedd: <schedd>...
  • Change the grid_resource line in the sdf template to
grid_resource = condor <schedd> <ce-endpoint>
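
For illustration, a fully substituted snippet of the sdf might look like the following (the hostnames below are hypothetical placeholders; use the schedd name and CE endpoint you looked up above):

universe = grid
grid_resource = condor ce123.example.org ce123.example.org:9619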

Troubleshooting

When job/worker submission stops...

Harvester may stop submitting workers for many reasons, but it may also be related to abnormal events on the condor schedd.

Many Condor Jobs Held

It is a bad sign when there are many condor jobs held. For example:

[root@aipanda024 ~]# condor_q -nob

-- Schedd: aipanda024.cern.ch : <137.138.157.183:19696> @ 07/20/18 10:30:07
 ID       OWNER            SUBMITTED     RUN_TIME ST PRI SIZE    CMD
...
 5400.0   atlpan          7/18 14:18   1+20:08:57 R  0       0.0 runpilot3-wrapper.sh -s RRC-KI-T1 -h RRC-KI-T1 -p 25443 -w https://pandaserver.cern.ch -u manag
 6575.0   atlpan          7/18 20:31   0+18:31:09 R  0       0.0 runpilot3-wrapper.sh -s UKI-SCOTGRID-ECDF_MCORE_SL7 -h UKI-SCOTGRID-ECDF_MCORE_SL7 -p 25443 -w 
 8341.0   atlpan          7/19 06:07   0+16:03:32 R  0       0.0 runpilot3-wrapper.sh -s FMPhI-UNIBA_MCORE -h FMPhI-UNIBA-all-prod-CEs_MCORE -p 25443 -w https:/
 9482.0   atlpan          7/19 09:15   1+00:17:21 R  0    1221.0 runpilot3-wrapper.sh -s BNL_PROD -h BNL_PROD-condor -p 25443 -w https://pandaserver.cern.ch -u 
 9662.0   atlpan          7/19 09:18   0+00:00:00 H  0       0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
 9665.0   atlpan          7/19 09:18   0+05:36:20 H  0    2198.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
 9666.0   atlpan          7/19 09:18   0+00:00:00 H  0       0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
 9670.0   atlpan          7/19 09:18   0+00:00:00 H  0       0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
 9671.0   atlpan          7/19 09:18   0+00:00:00 H  0       0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
 9673.0   atlpan          7/19 09:18   0+05:35:16 H  0    2686.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
 9682.0   atlpan          7/19 09:18   0+00:00:00 H  0       0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
...
 9998.0   atlpan          7/19 09:20   0+00:00:00 H  0       0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
10038.0   atlpan          7/19 09:21   1+00:15:13 R  0     733.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
10147.0   atlpan          7/19 09:24   0+05:36:20 H  0    2198.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern

Since the held status of a condor job is not a final state (it can later become idle or running), harvester will treat held condor jobs as submitted workers until they reach a timeout (2 hours by default) and then cancel those workers.

Thus, too many held jobs can prevent new worker submission once the nQueueLimitWorkers limit is reached.
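
To get a quick count of the held jobs on the schedd, one can for example run (just one possible check):

$ condor_q -constraint 'JobStatus == 5' -af ClusterId ProcId | wc -l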

One can check the HoldReason of the held condor jobs for more detail, e.g.:

[root@aipanda024 ~]# condor_q -l -constraint " JobStatus == 5" | egrep 'GridResource |HoldReason '
GridResource = "condor ce516.cern.ch ce516.cern.ch:9619"
HoldReason = "Error connecting to schedd ce516.cern.ch: SECMAN:2007:Failed to received post-auth 
ClassAd|AUTHENTICATE:1004:Failed to authenticate using FS"
GridResource = "condor ce507.cern.ch ce507.cern.ch:9619"
HoldReason = "CE job in status 1 put on hold by SYSTEM_PERIODIC_HOLD due to non-existent route in JOB_ROUTER_ENTRIES or route job limit."
...

Handle the issues indicated in HoldReason.
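
Once the underlying problem has been fixed (or if the jobs have no chance to recover), the held jobs can be released or removed with standard condor commands; for example, to act on all held jobs (adjust the constraint to your situation):

$ condor_release -constraint 'JobStatus == 5'
$ condor_rm -constraint 'JobStatus == 5'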

Problems of Communication with CE or remote schedd

When condor jobs are submitted with the grid universe, communication errors with the remote CE or schedd may cause the condor jobs to be held.

One can check the condor GridmanagerLog for details, e.g. /var/log/condor/GridmanagerLog.atlpan
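
For instance, to look for recent errors involving a given CE in that log (the CE name and patterns below are only an example; adapt them to your case):

$ grep 'ce516.cern.ch' /var/log/condor/GridmanagerLog.atlpan | grep -iE 'error|fail' | tail -n 20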
