Instrumentation and monitoring tool - openucx/ucx GitHub Wiki

Introduction

The UCX library provides a tool for analyzing UCX-based applications at runtime. The tool represents each process that uses the UCX library as a tree in a virtual filesystem (VFS). The directory hierarchy reflects the relations between UCX library objects. The files in a directory describe the properties of one UCX library object, and the content of each file holds the value of a specific property.

How to

The tool is built on the Filesystem in Userspace (FUSE) interface, and the FUSE v3 development package is required to build it. After a successful build, the tool's binary file is available in the UCX install directory. To enable analysis of UCX-based applications, launch the daemon process with the following command:

$ <path_to_ucx_install_dir>/bin/ucx_vfs

Each running process that uses the UCX library has a corresponding directory, /tmp/ucx/<PID>. When you no longer want to analyze your applications, stop the daemon with the following command:

$ <path_to_ucx_install_dir>/bin/ucx_vfs stop
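While the daemon is running, the processes that can be inspected are the ones that have a directory under /tmp/ucx. A minimal sketch of listing them is shown here. The function name `ucx_vfs_pids` and its optional root argument are hypothetical helpers introduced for illustration; they are not part of UCX.

```shell
# Print the PID of every process that has a directory under the VFS root.
# ucx_vfs_pids is a hypothetical helper, not a UCX command.
# The optional first argument overrides the documented root, /tmp/ucx.
ucx_vfs_pids() {
    root="${1:-/tmp/ucx}"
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue       # the glob matched nothing
        pid=$(basename "$dir")
        case "$pid" in
            ''|*[!0-9]*) continue ;;    # keep purely numeric names only
        esac
        echo "$pid"
    done
}
```

Calling `ucx_vfs_pids` without arguments prints one PID per line, or nothing if no UCX-based process is currently represented.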

VFS hierarchy

The directory /tmp/ucx/<PID> represents the use of the UCX library by the corresponding process. It contains three grouping sub-directories: UCP, UCT, and UCS. A directory represents a UCX library object, groups objects of the same type, or groups the properties of an object. A file describes a single property of a UCX library object.
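Because every property is a file, the whole hierarchy of one process can be walked with standard tools. The sketch below prints each property as a relative path followed by its content. It assumes that property files appear as regular files and that their content is short text; `ucx_vfs_dump` is a hypothetical helper name.

```shell
# Print every property file under one process directory as "path: value".
# ucx_vfs_dump is a hypothetical helper, not a UCX command.
# Assumption: properties are exposed as regular files with short text content.
ucx_vfs_dump() {
    proc_dir="${1%/}"                   # drop a trailing slash, if any
    find "$proc_dir" -type f | sort | while IFS= read -r file; do
        # Strip the process directory prefix to show the relative path.
        printf '%s: %s\n' "${file#"$proc_dir"/}" "$(cat "$file" 2>/dev/null)"
    done
}
```

For example, `ucx_vfs_dump /tmp/ucx/<PID>` lists all properties of the process with that PID.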

Files in VFS

UCP

context

File name Description
mem_address Memory address of the pointer

endpoint

File name Description
error_mode Error handling mode
local_address/IPv[4|6] * Local address: IP address
local_address/port * Local address: port number
mem_address Memory address of the pointer
peer_name Remote worker address name
remote_address/IPv[4|6] * Peer address: IP address
remote_address/port * Peer address: port number
  * The local_address and remote_address directories are created only for endpoints created in client-server mode.

listener

File name Description
ip Listening socket address: IP address
port Listening socket address: Port number

worker

File name Description
address_name Worker address name composed of host name and process id
counters/ep_closures Number of endpoint closures
counters/ep_creations Number of requests to create endpoint
counters/ep_creation_failures Number of failed requests to create endpoint
counters/ep_failures Number of failed endpoints
keepalive/ep_count Keepalive: Number of endpoints processed in current time slot
keepalive/round_count Keepalive: Number of rounds done
mem_address Memory address of the pointer
num_all_eps Number of all endpoints (except internal endpoints)
thread_mode Thread safety mode with which the worker and its associated resources are created
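The worker counters lend themselves to simple aggregation. The following sketch sums one counter, such as ep_failures, over all workers of a process. It locates the files by the `counters/<name>` suffix from the table above rather than by a full path, because the names of the intermediate directories are not specified here. It also assumes that each counter file holds a single integer. `ucx_vfs_sum_counter` is a hypothetical helper name.

```shell
# Sum one worker counter over all workers of a process.
# ucx_vfs_sum_counter is a hypothetical helper, not a UCX command.
# Assumption: each counter file contains a single integer.
ucx_vfs_sum_counter() {
    proc_dir="${1%/}"
    counter="$2"
    # Emit each file followed by a newline so values never run together,
    # then let awk add up the non-empty lines.
    find "$proc_dir" -type f -path "*/counters/$counter" \
        -exec sh -c 'for f; do cat "$f"; echo; done' sh {} + |
        awk 'NF { sum += $1 } END { print sum + 0 }'
}
```

For example, `ucx_vfs_sum_counter /tmp/ucx/<PID> ep_failures` prints the total number of failed endpoints, or 0 when no such counter file exists.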

UCS

global_opts

File name Description
log_level Log level above which log messages will be printed

memtrack

File name Description
all Memory tracking output. Count and size of objects created by the library

rcache

File name Description
gc_list/length Number of regions queued for destruction (regions that could not be destroyed from the memhook)
inv_q/length Number of regions which were invalidated during memory events
max_regions Maximum number of regions
max_size Maximum total size of regions
num_regions Total number of managed regions
regions_distribution/threshold/count Number of regions with a size smaller than threshold
regions_distribution/threshold/total_size Total size of regions with a size smaller than threshold
total_size Total size of registered memory
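As an illustration of combining two rcache properties, the sketch below derives the average region size from total_size and num_regions. It assumes that both files contain plain integers and skips any directory where that is not the case. `ucx_vfs_rcache_avg` is a hypothetical helper name.

```shell
# Print the average region size for every directory that holds both
# num_regions and total_size.
# ucx_vfs_rcache_avg is a hypothetical helper, not a UCX command.
# Assumption: both files contain plain non-negative integers.
ucx_vfs_rcache_avg() {
    proc_dir="${1%/}"
    find "$proc_dir" -type f -name num_regions | sort | while IFS= read -r file; do
        dir=$(dirname "$file")
        [ -f "$dir/total_size" ] || continue
        count=$(cat "$file")
        size=$(cat "$dir/total_size")
        case "$count" in ''|*[!0-9]*) continue ;; esac   # not an integer
        case "$size" in ''|*[!0-9]*) continue ;; esac    # not an integer
        [ "$count" -gt 0 ] || continue                   # avoid division by zero
        printf '%s %s\n' "${dir#"$proc_dir"/}" "$((size / count))"
    done
}
```

Each output line contains the directory, relative to the process directory, and the integer average in the same unit as total_size.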

UCT

dct

File name Description
qp_num Number of queue pairs

dci

File name Description
available Number of available queue pairs
unsignaled Number of unsignaled completions
qp_num Number of queue pairs
sw_pi Producer index for next work queue entry
prev_sw_pi Producer index where last WQE started
qstart Pointer to the beginning of the queue
qend Pointer to the end of the queue
bb_max Maximum building block number
sig_pi Producer index for last signaled WQE
hw_ci Consumer index

iface

File name Description
rx_available Available credit for rx queue (UD only)
rx_qp_len Length of qp rx queue (UD only)
tx_available Available credit for tx queue (UD only)
tx_qp_len Length of qp tx queue (UD only)

capability/flag

The presence of the file means that the interface supports the feature.

File name Description
am_bcopy Buffered active message
am_dup Active messages may be received with duplicates
am_short Short active message
am_zcopy Zero-copy active message
atomic_cpu Atomic communications are consistent with respect to CPU operations
atomic_device Atomic communications are consistent only with respect to other atomics on the same device
cb_async Supports setting a callback which will be invoked within a reasonable amount of time if uct_worker_progress() is not being called
cb_sync Supports setting a callback which is invoked only from the calling context of uct_worker_progress()
connect_to_ep Supports connecting to a specific endpoint
connect_to_iface Supports connecting to an interface
connect_to_sockaddr Supports connecting to a sockaddr
ep_check Endpoint check
ep_keepalive Transport endpoint has built-in keepalive feature
errhandle_am_id Invalid AM id on remote
errhandle_bcopy_buf Invalid buffer for buffered operation
errhandle_bcopy_len Invalid length for buffered operation
errhandle_peer_failure Remote peer failures/outage
errhandle_remote_mem Remote memory access
errhandle_short_buf Invalid buffer for short operation
errhandle_zcopy_buf Invalid buffer for zero copy operation
get_bcopy Buffered get
get_short Short get
get_zcopy Zero-copy get
pending Pending operations
put_bcopy Buffered put
put_short Short put
put_zcopy Zero-copy put
tag_eager_bcopy Hardware tag matching buffered eager support
tag_eager_short Hardware tag matching short eager support
tag_eager_zcopy Hardware tag matching zero-copy eager support
tag_rndv_zcopy Hardware tag matching rendezvous zero-copy support
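Since a flag is signalled by the mere presence of a file, checking for a feature reduces to an existence test. The sketch below lists every object directory whose capability/flag sub-directory contains a given flag file. It matches on the `capability/flag/<name>` suffix only, so it makes no assumption about the directories above it. `ucx_vfs_objects_with_flag` is a hypothetical helper name.

```shell
# List the object directories that expose a given capability flag.
# ucx_vfs_objects_with_flag is a hypothetical helper, not a UCX command.
# A flag counts as supported when a file with its name exists under
# capability/flag; the file content is not read.
ucx_vfs_objects_with_flag() {
    proc_dir="${1%/}"
    flag="$2"
    find "$proc_dir" -path "*/capability/flag/$flag" | sort |
        sed "s|/capability/flag/$flag\$||"   # keep only the owning directory
}
```

For example, `ucx_vfs_objects_with_flag /tmp/ucx/<PID> am_zcopy` prints the directories of the interfaces that support zero-copy active messages, and prints nothing if none do.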

capability/am

File name Description
align_mtu MTU used for alignment
max_bcopy Total maximum size (including header) for buffered active message
max_hdr Maximum header size for zero-copy active message
max_iov Maximum number of elements in iov for zero-copy active message
max_short Total maximum size (including header) for short active message
max_zcopy Total maximum size (including header) for zero-copy active message
min_zcopy Minimum size for zero-copy active message
opt_zcopy_align Optimal alignment for zero-copy buffer address

capability/get

File name Description
align_mtu MTU used for alignment
max_bcopy Total maximum size (including header) for buffered get
max_iov Maximum number of elements in iov for zero-copy get
max_short Total maximum size (including header) for short get
max_zcopy Total maximum size (including header) for zero-copy get
min_zcopy Minimum size for zero-copy get
opt_zcopy_align Optimal alignment for zero-copy buffer address

capability/put

File name Description
align_mtu MTU used for alignment
max_bcopy Total maximum size (including header) for buffered put
max_iov Maximum number of elements in iov for zero-copy put
max_short Total maximum size (including header) for short put
max_zcopy Total maximum size (including header) for zero-copy put
min_zcopy Minimum size for zero-copy put
opt_zcopy_align Optimal alignment for zero-copy buffer address

rx_available - Available credit for RX

memory_domain

File name Description
local_cpus Mask of CPUs near the resource
reg_cost Memory registration cost estimation (time, seconds) as a linear function of the buffer size
rkey_packed_size Size of buffer needed for packed rkey

capability

File name Description
access_mem_types Memory types that Memory Domain can access
alloc_mem_types Bitmap of memory types that Memory Domain can allocate memory on
detect_mem_types Bitmap of memory types that Memory Domain can detect if address belongs to it
max_alloc Maximum allocation size
max_reg Maximum registration size
reg_mem_types Bitmap of memory types that Memory Domain can be registered with

capability/flag

File name Description
advise Memory advice support
alloc Memory allocation support
fixed Memory allocation with fixed address support
invalidate Memory invalidation support
need_memh The transport needs a valid local memory handle for zero-copy operations
need_rkey The transport needs a valid remote memory key for remote memory operations
reg Memory registration support
rkey_ptr Direct access to remote memory via a pointer that is returned by uct_rkey_ptr
sockaddr Client-server connection establishment via sockaddr support