Instrumentation and monitoring tool - openucx/ucx GitHub Wiki
Introduction
The UCX library provides a tool for analyzing UCX-based applications at runtime. The tool creates a representation of each process that uses the UCX library in a virtual filesystem (VFS). The VFS directory hierarchy shows the relations between UCX library objects. Files grouped in a directory describe the properties of a UCX library object, and the content of each file gives the value of one specific property.
How to
The tool is based on the Filesystem in Userspace (FUSE) interface. The FUSE v3 development package is required to build the tool.
If the tool was built successfully, the UCX install directory contains its binary. To enable analysis of UCX-based applications, launch the daemon process with the following command:
$ <path_to_ucx_install_dir>/bin/ucx_vfs
Each running process that uses the UCX library has a corresponding directory, /tmp/ucx/<PID>.
When you no longer want to analyze your applications, stop the daemon with the following command:
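As a minimal sketch of how the per-process directories could be enumerated, the helper below lists the numeric entries under the VFS root. The function name `list_ucx_pids` is hypothetical and not part of UCX; the only detail taken from the text is the /tmp/ucx/<PID> layout.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCX): print the PIDs that currently
# have a directory under the UCX VFS root. The root defaults to /tmp/ucx.
list_ucx_pids() {
    root="${1:-/tmp/ucx}"
    # Nothing to list if the daemon is not running or no process uses UCX.
    [ -d "$root" ] || return 0
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        pid=$(basename "$dir")
        # Keep only purely numeric names, i.e. entries that look like a PID.
        case "$pid" in
            ''|*[!0-9]*) continue ;;
        esac
        printf '%s\n' "$pid"
    done
}
```

Calling `list_ucx_pids` with no argument inspects /tmp/ucx; passing another directory is useful for trying the function on a copy of the tree.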
$ <path_to_ucx_install_dir>/bin/ucx_vfs stop
VFS hierarchy
The directory /tmp/ucx/<PID> represents the use of the UCX library by the corresponding process. It contains three grouping sub-directories: UCP, UCT, and UCS. A directory represents a UCX library object, groups objects of the same type, or groups properties of an object. A file describes a property of a UCX library object.
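Because every property is a file, the whole hierarchy of a process can be printed with standard tools. The sketch below assumes that properties show up as regular files whose content is short text; `dump_ucx_vfs` is a hypothetical helper name, and the exact directory names below /tmp/ucx/<PID> are deliberately not hard-coded.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCX): print every property file below
# the given directory as "<relative path> = <content>".
# Assumption: properties appear as regular files containing short text.
dump_ucx_vfs() {
    dir="$1"
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    (
        cd "$dir" || exit 1
        # Sort in the C locale so the output order is stable.
        find . -type f | LC_ALL=C sort | while IFS= read -r file; do
            printf '%s = %s\n' "${file#./}" "$(cat "$file" 2>/dev/null)"
        done
    )
}
```

For example, `dump_ucx_vfs /tmp/ucx/<PID>` would print one line per property of that process.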
Files in VFS
UCP
context
| File name | Description |
| --- | --- |
| mem_address | Memory address of the pointer |
endpoint
| File name | Description |
| --- | --- |
| error_mode | Error handling mode |
| local_address/IPv[4\|6] * | Local address: ip |
| local_address/port * | Local address: port |
| mem_address | Memory address of the pointer |
| peer_name | Remote worker address name |
| remote_address/IPv[4\|6] * | Peer address: ip |
| remote_address/port * | Peer address: port |
* The [local|remote]_address directory is created only for endpoints created in client-server mode.
listener
| File name | Description |
| --- | --- |
| ip | Listening socket address: IP address |
| port | Listening socket address: Port number |
worker
| File name | Description |
| --- | --- |
| address_name | Worker address name composed of host name and process id |
| counters/ep_closures | Number of endpoint closures |
| counters/ep_creations | Number of requests to create endpoint |
| counters/ep_creation_failures | Number of failed requests to create endpoint |
| counters/ep_failures | Number of failed endpoints |
| keepalive/ep_count | Keepalive: Number of endpoints processed in current time slot |
| keepalive/round_count | Keepalive: Number of rounds done |
| mem_address | Memory address of the pointer |
| num_all_eps | Number of all endpoints (except internal endpoints) |
| thread_mode | Thread safety mode which worker and the associated resources should be created with |
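The worker counters listed above can be aggregated across all workers of a process. The following is a sketch under two assumptions: each counter file holds a single decimal number, and the counters live in a `counters` directory somewhere below the process directory. `sum_worker_counter` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCX): sum one worker counter, for
# example ep_failures, over every "counters/<name>" file below a directory.
# Assumption: each counter file contains a single decimal number.
sum_worker_counter() {
    root="$1"
    name="$2"
    [ -d "$root" ] || { echo "no such directory: $root" >&2; return 1; }
    # awk prints 0 when no matching counter file exists.
    find "$root" -type f -path "*/counters/$name" -exec cat {} + |
        awk '{ total += $1 } END { print total + 0 }'
}
```

For instance, `sum_worker_counter /tmp/ucx/<PID> ep_creation_failures` would report how many endpoint creation requests failed in that process.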
UCS
global_opts
| File name | Description |
| --- | --- |
| log_level | Log level above which log messages will be printed |
memtrack
| File name | Description |
| --- | --- |
| all | Memory tracking output. Count and size of objects created by the library |
rcache
| File name | Description |
| --- | --- |
| gc_list/length | Number of regions waiting to be destroyed because they could not be destroyed from the memory hook |
| inv_q/length | Number of regions which were invalidated during memory events |
| max_regions | Maximum number of regions |
| max_size | Maximum total size of regions |
| num_regions | Total number of managed regions |
| regions_distribution/threshold/count | Number of regions with a size smaller than threshold |
| regions_distribution/threshold/total_size | Total size of regions with a size smaller than threshold |
| total_size | Total size of registered memory |
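As an illustration of combining two of these properties, the sketch below reports how full a registration cache is by comparing num_regions with max_regions. It assumes both files sit directly in the rcache directory passed as the argument and that num_regions is a decimal number; if max_regions is not a positive number, only the region count is printed. `rcache_usage` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCX): report rcache fill level from
# the num_regions and max_regions files of one rcache directory.
rcache_usage() {
    dir="$1"
    num=$(cat "$dir/num_regions") || return 1
    max=$(cat "$dir/max_regions") || return 1
    case "$num" in
        ''|*[!0-9]*) echo "unexpected num_regions value: $num" >&2; return 1 ;;
    esac
    case "$max" in
        # Empty, non-numeric, or zero limit: no percentage can be computed.
        ''|*[!0-9]*|0) printf '%s regions (no numeric limit)\n' "$num" ;;
        *) printf '%s/%s regions (%s%%)\n' "$num" "$max" "$((100 * num / max))" ;;
    esac
}
```

The percentage uses integer division, so it is rounded down.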
UCT
dct
| File name | Description |
| --- | --- |
| qp_num | Number of queue pairs |
dci
| File name | Description |
| --- | --- |
| available | Number of available queue pairs |
| unsignaled | Number of unsignaled completions |
| qp_num | Number of queue pairs |
| sw_pi | Producer index for next work queue entry (WQE) |
| prev_sw_pi | Producer index where last WQE started |
| qstart | Pointer to the beginning of queue |
| qend | Pointer to the end of queue |
| bb_max | Maximum building block number |
| sig_pi | Producer index for last signaled WQE |
| hw_ci | Consumer index |
iface
| File name | Description |
| --- | --- |
| rx_available | Available credit for rx queue (UD only) |
| rx_qp_len | Length of qp rx queue (UD only) |
| tx_available | Available credit for tx queue (UD only) |
| tx_qp_len | Length of qp tx queue (UD only) |
capability/flag
The presence of the file means that the interface supports the feature.
| File name | Description |
| --- | --- |
| am_bcopy | Buffered active message |
| am_dup | Active messages may be received with duplicates |
| am_short | Short active message |
| am_zcopy | Zero-copy active message |
| atomic_cpu | Atomic communications are consistent with respect to CPU operations |
| atomic_device | Atomic communications are consistent only with respect to other atomics on the same device |
| cb_async | Supports setting a callback which will be invoked within a reasonable amount of time if uct_worker_progress() is not being called |
| cb_sync | Supports setting a callback which is invoked only from the calling context of uct_worker_progress() |
| connect_to_ep | Supports connecting to specific endpoint |
| connect_to_iface | Supports connecting to interface |
| connect_to_sockaddr | Supports connecting to sockaddr |
| ep_check | Endpoint check |
| ep_keepalive | Transport endpoint has built-in keepalive feature |
| errhandle_am_id | Invalid AM id on remote |
| errhandle_bcopy_buf | Invalid buffer for buffered operation |
| errhandle_bcopy_len | Invalid length for buffered operation |
| errhandle_peer_failure | Remote peer failures/outage |
| errhandle_remote_mem | Remote memory access |
| errhandle_short_buf | Invalid buffer for short operation |
| errhandle_zcopy_buf | Invalid buffer for zero copy operation |
| get_bcopy | Buffered get |
| get_short | Short get |
| get_zcopy | Zero-copy get |
| pending | Pending operations |
| put_bcopy | Buffered put |
| put_short | Short put |
| put_zcopy | Zero-copy put |
| tag_eager_bcopy | Hardware tag matching buffered eager support |
| tag_eager_short | Hardware tag matching short eager support |
| tag_eager_zcopy | Hardware tag matching zero-copy eager support |
| tag_rndv_zcopy | Hardware tag matching rendezvous zero-copy support |
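Since a capability is signalled by the mere presence of a file, a feature check reduces to an existence test. The sketch below assumes the flag files live in a `capability/flag` sub-directory of the interface directory, as the heading above suggests; `iface_supports` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCX): succeed only if the interface
# directory contains a flag file for every requested capability.
# Assumption: flags are stored under <iface_dir>/capability/flag/.
iface_supports() {
    iface_dir="$1"
    shift
    for flag in "$@"; do
        # A missing file means the interface does not support the feature.
        [ -e "$iface_dir/capability/flag/$flag" ] || return 1
    done
    return 0
}
```

A call such as `iface_supports <iface_dir> am_short put_zcopy && echo supported` would print the message only when both flag files exist.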
capability/am
| File name | Description |
| --- | --- |
| align_mtu | MTU used for alignment |
| max_bcopy | Total maximum size (including header) for buffered active message |
| max_hdr | Maximum header size for zero-copy active message |
| max_iov | Maximum number of elements in iov for zero-copy active message |
| max_short | Total maximum size (including header) for short active message |
| max_zcopy | Total maximum size (including header) for zero-copy active message |
| min_zcopy | Minimum size for zero-copy active message |
| opt_zcopy_align | Optimal alignment for zero-copy buffer address |
capability/get
| File name | Description |
| --- | --- |
| align_mtu | MTU used for alignment |
| max_bcopy | Total maximum size (including header) for buffered get |
| max_iov | Maximum number of elements in iov for zero-copy get |
| max_short | Total maximum size (including header) for short get |
| max_zcopy | Total maximum size (including header) for zero-copy get |
| min_zcopy | Minimum size for zero-copy get |
| opt_zcopy_align | Optimal alignment for zero-copy buffer address |
capability/put
| File name | Description |
| --- | --- |
| align_mtu | MTU used for alignment |
| max_bcopy | Total maximum size (including header) for buffered put |
| max_iov | Maximum number of elements in iov for zero-copy put |
| max_short | Total maximum size (including header) for short put |
| max_zcopy | Total maximum size (including header) for zero-copy put |
| min_zcopy | Minimum size for zero-copy put |
| opt_zcopy_align | Optimal alignment for zero-copy buffer address |
rx_available - Available credit for RX
memory_domain
| File name | Description |
| --- | --- |
| local_cpus | Mask of CPUs near the resource |
| reg_cost | Memory registration cost estimation (time, seconds) as a linear function of the buffer size |
| rkey_packed_size | Size of buffer needed for packed rkey |
capability
| File name | Description |
| --- | --- |
| access_mem_types | Memory types that Memory Domain can access |
| alloc_mem_types | Bitmap of memory types that Memory Domain can allocate memory on |
| detect_mem_types | Bitmap of memory types that Memory Domain can detect if address belongs to it |
| max_alloc | Maximum allocation size |
| max_reg | Maximum registration size |
| reg_mem_types | Bitmap of memory types that Memory Domain can be registered with |
capability/flag
| File name | Description |
| --- | --- |
| advise | Memory advice support |
| alloc | Memory allocation support |
| fixed | Memory allocation with fixed address support |
| invalidate | Memory invalidation support |
| need_memh | The transport needs a valid local memory handle for zero-copy operations |
| need_rkey | The transport needs a valid remote memory key for remote memory operations |
| reg | Memory registration support |
| rkey_ptr | Direct access to remote memory via a pointer that is returned by uct_rkey_ptr |
| sockaddr | Client-server connection establishment via sockaddr support |