Torque Process Tracking - adaptivecomputing/torque GitHub Wiki

Default

By default, Torque tracks every process that it launches, as well as any child processes that share a session ID with a process launched by Torque.

When a job is launched, the master process for the job is a child of the pbs_mom daemon on the mother superior node. If that process forks, each child shares a session ID with the master process, so Torque tracks those children automatically.
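The session ID inheritance described above is ordinary POSIX behavior and can be observed without Torque. The following is a minimal sketch, assuming a POSIX system; the function name child_session_ids is made up for this illustration. It forks one child that keeps the inherited session and one that calls setsid() to start a new session. A process in the second situation no longer shares the master process's session ID.

```python
import os


def child_session_ids():
    """Fork two children and report the session ID each one sees.

    Returns (parent_sid, inherited_sid, detached_sid).
    Illustration only: this does not talk to Torque.
    """
    parent_sid = os.getsid(0)
    observed = []
    for detach in (False, True):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: optionally leave the inherited session, then report.
            os.close(read_fd)
            if detach:
                os.setsid()  # start a new session with this child as leader
            os.write(write_fd, str(os.getsid(0)).encode())
            os._exit(0)
        # Parent: read the session ID the child reported, then reap it.
        os.close(write_fd)
        with os.fdopen(read_fd) as reader:
            observed.append(int(reader.read()))
        os.waitpid(pid, 0)
    return parent_sid, observed[0], observed[1]


if __name__ == "__main__":
    parent, inherited, detached = child_session_ids()
    print("parent session:", parent)
    print("forked child session:", inherited)
    print("child after setsid():", detached)
```

The first child reports the same session ID as its parent, which is the property that session-based tracking relies on. The second child reports a different one.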

The Task Manager API

If a job uses the Task Manager (TM) API to launch a process, then that process is also tracked automatically, along with its children. Most, if not all, MPI libraries can be built to interact with Torque. When properly configured to do so, an MPI library will either launch processes in the job through Torque or inform Torque that it has launched a new process that should be part of the job.

Launching Through Torque

The TM API provides the function tm_spawn(). When an MPI library or some other program calls it, the function sends the executable path or name, along with all of its arguments and environment, to the local pbs_mom. The request also carries instructions for where the process should be launched and some data for identifying and tracking that process. The local mom then launches the process itself if the target is local, or forwards the information to the mom on the remote host if the process should be launched on another host that is part of the job.
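The local-versus-remote decision described above can be pictured with a small toy model. This is a conceptual sketch only, not Torque code or the TM API: SpawnRequest, route_spawn, and all field and host names are hypothetical and exist just to show the routing step.

```python
from dataclasses import dataclass, field


@dataclass
class SpawnRequest:
    """Hypothetical stand-in for the data a spawn call hands to the local mom."""
    argv: list                 # executable path or name plus its arguments
    target_host: str           # where the process should be launched
    task_id: int               # data for identifying and tracking the process
    env: dict = field(default_factory=dict)


def route_spawn(request, local_host, job_hosts):
    """Decide what a (toy) local mom does with a spawn request.

    Returns ("launch", host) when the target is the local host and
    ("forward", host) when the request must go to a remote mom.
    """
    if request.target_host not in job_hosts:
        raise ValueError(f"{request.target_host} is not part of the job")
    if request.target_host == local_host:
        return ("launch", local_host)
    return ("forward", request.target_host)


if __name__ == "__main__":
    hosts = {"node01", "node02"}  # hypothetical hosts allocated to the job
    req = SpawnRequest(argv=["hostname"], target_host="node02", task_id=1)
    print(route_spawn(req, local_host="node01", job_hosts=hosts))
```

In this model, a request targeting a host outside the job is rejected, which mirrors the requirement that the remote host be part of the job.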

The pbsdsh command that comes with Torque uses tm_spawn() to launch processes that will be part of the job. For proof-of-concept work, pbsdsh serves as a built-in launcher with a few basic options for launching processes within a Torque job.

Informing Torque of Other Processes

Another way to make a process part of a job is the tm_adopt() function. Some MPI implementations have their own mechanism for starting processes, whether remote or local, and use it instead of the one provided by Torque. To accommodate this, tm_adopt() can be called to inform the mom that it should track the process as part of the job. NOTE: tm_adopt() must be called on the host where the process was launched.

The pbs_track command can be used to launch a process that will be adopted by a specified Torque job, or to inform the local mom that an existing process should be adopted by a specified Torque job. In either case, the specified Torque job must be currently executing on the local mom.

With Cgroups (cpusets work the same way if you are using those instead)

With cgroups, process tracking generally works the same way, but instead of following session IDs, Torque considers any process in the job's cgroup to be part of the job. Generally speaking, a process launched by a process within a cgroup inherits its parent's cgroup; this inheritance is handled by the operating system rather than by Torque. For processes that are launched or adopted through the TM API, the mom daemon adds them to the job's cgroup.
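On Linux, the cgroup membership of a process can be read from /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The following is a minimal sketch of checking which controllers place a process in a given job's cgroup. The helper names and the sample path /torque/123.headnode are assumptions made for illustration; the actual cgroup path used for a job depends on how Torque and the system are configured.

```python
from pathlib import Path


def parse_cgroup_text(text):
    """Parse the contents of /proc/<pid>/cgroup into {controller-list: path}."""
    membership = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        _hierarchy_id, controllers, path = line.split(":", 2)
        membership[controllers] = path
    return membership


def read_cgroups(pid="self"):
    """Read and parse the cgroup membership of a process (Linux only)."""
    return parse_cgroup_text(Path(f"/proc/{pid}/cgroup").read_text())


def controllers_in_job(membership, job_dir):
    """Return the controllers whose cgroup path ends in the job's directory."""
    suffix = "/" + job_dir
    return sorted(c for c, path in membership.items() if path.endswith(suffix))


if __name__ == "__main__":
    # Hypothetical sample; real paths depend on the local configuration.
    sample = (
        "4:cpuset:/torque/123.headnode\n"
        "3:memory:/torque/123.headnode\n"
        "1:name=systemd:/user.slice\n"
    )
    print(controllers_in_job(parse_cgroup_text(sample), "123.headnode"))
```

Running the same check against the PID of a process started by an external launcher is one way to see whether that process ended up inside or outside the job's cgroup.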

If a process is launched outside of Torque in a way that bypasses the job's cgroup, then it will not be restricted by that cgroup. The only way to guarantee that jobs are properly restricted is to ensure that process launchers (usually MPI implementations) are configured either to launch through Torque or to inform Torque of the processes that they launch.