
Building and Running: NEC SX-Aurora TSUBASA (Vector Engine)

HAM-Offload has had initial support for the NEC SX-Aurora TSUBASA platform since v0.3.

The steps in this document were tested with VEOS 2.1.3 and NEC C++ Compiler version 2.3.1.

Download

git clone https://github.com/noma/ham
cd ham

Building

We need to create two builds:

one for the Vector Host (VH), i.e. our CPU, using the system's C++ compiler (e.g. GCC)
one for the Vector Engine (VE), i.e. the accelerator card, using the NEC C++ compiler

Applications consist of the VH executable and a VE library, both generated from the same source, containing the same code. The VE library can be either statically (recommended) or dynamically linked. The provided CMake files use static linking. See the "Details" section below for further information.

Assumption: your terminal's working directory is the freshly cloned HAM repository.

Vector Host:

mkdir build.vh
cd build.vh
cmake ../ham
make -j
cd ..

Vector Engine:

mkdir build.ve
cd build.ve
CC=/opt/nec/ve/bin/ncc CXX=/opt/nec/ve/bin/nc++ cmake -DCMAKE_CXX_COMPILE_FEATURES='cxx_auto_type;cxx_range_for;cxx_variadic_templates' ../ham
make -j
cd ..
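
Since the VE build depends on the NEC compilers being installed at the path used above, a quick pre-flight check can save a confusing CMake failure. The following is a minimal sketch; `check_ve_toolchain` is a hypothetical helper that is not part of HAM-Offload, and the default directory is simply the one from the command above.

```shell
# check_ve_toolchain is a hypothetical helper, not part of HAM-Offload.
# Usage: check_ve_toolchain [compiler directory]
# The default directory is the one used in the VE build command above.
check_ve_toolchain() {
    dir=${1:-/opt/nec/ve/bin}
    status=0
    for tool in ncc nc++; do
        if [ -x "$dir/$tool" ]; then
            echo "found:   $dir/$tool"
        else
            echo "missing: $dir/$tool" >&2
            status=1
        fi
    done
    return $status
}

# Report the result; a non-zero status means at least one compiler is missing.
check_ve_toolchain || echo "NEC compilers not found, the VE build will likely fail" >&2
```

If the compilers live elsewhere on your system, pass that directory as the first argument and adjust `CC` and `CXX` accordingly.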

HAM-Offload provides two communication backends for the NEC SX-Aurora TSUBASA:

VEO-only (slower reference, uses only VEO calls)
VEO + VEDMA (fast, uses Vector Engine user DMA and special load/store instructions for host memory access)

Binary naming scheme:

<application/library>_<veo|vedma>_<ve|vh>

build.vh/inner_product_veo_vh
build.vh/inner_product_vedma_vh
build.ve/inner_product_veo_ve
build.ve/inner_product_vedma_ve
...
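
To make the naming scheme concrete, here is a small sketch that composes a binary name from its three parts and rejects unknown backends or sides. `binary_name` is a hypothetical helper for illustration only; it is not provided by HAM-Offload.

```shell
# binary_name is a hypothetical helper, not part of HAM-Offload.
# It composes <application/library>_<veo|vedma>_<ve|vh>.
# Usage: binary_name <application> <veo|vedma> <ve|vh>
binary_name() {
    app=$1 backend=$2 side=$3
    case "$backend" in
        veo|vedma) ;;
        *) echo "unknown backend: $backend" >&2; return 1 ;;
    esac
    case "$side" in
        ve|vh) ;;
        *) echo "unknown side: $side" >&2; return 1 ;;
    esac
    printf '%s_%s_%s\n' "$app" "$backend" "$side"
}

binary_name inner_product vedma vh   # prints inner_product_vedma_vh
```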

Note: The NEC compiler pretends to be GCC but then does not behave like it in CMake's compiler tests, which is why the VE build above needs that cumbersome cmake command line. An initial CMake toolchain file is available as an alternative, but the command line shown above is currently the easiest way.

Running

In general, the application is started on the host with the corresponding executable, while the VE library is passed as a command line argument. Data transfer bandwidths can benefit from using hugectl to allocate huge memory pages for the heap and shared memory regions.

[hugectl --heap --shm] <host_binary> --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib <ve_binary>

Running the inner-product example could look like this:

./build.vh/inner_product_vedma_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_inner_product_vedma_ve

Option	Description
`--ham-cpu-affinity`	Pin the host process to a specific core. This reduces latencies; what matters most is choosing a core on the correct socket with respect to the PCIe topology between CPUs and VE cards.
`--ham-process-count`	Total number of processes to start; 2 means 1 VH + 1 VE process. (NOTE: For now, the underlying VEO only supports using a single VE.)
`--ham-veo-ve-nodes`	Comma-separated list (no spaces) of VE nodes to use as targets. It has one entry fewer than the value of `--ham-process-count`. See note above.
`--ham-veo-ve-lib`	The VE binary.
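
The relation between the options can be sketched as a small command builder: the process count is derived from the node list (one VE process per entry plus the VH process), and the binary paths follow the examples on this page. `ham_cmd` is a hypothetical helper, not part of HAM-Offload, and it only prints the command line. Keep in mind the note above that VEO currently supports a single VE, so a multi-entry node list is shown only to illustrate the counting rule.

```shell
# ham_cmd is a hypothetical helper, not part of HAM-Offload.
# It only prints a command line; it does not run anything.
# Usage: ham_cmd <application> <veo|vedma> <comma-separated VE nodes> [cpu core]
ham_cmd() {
    app=$1 backend=$2 nodes=$3 core=${4:-0}
    # Count the entries of the node list by stripping one "x," prefix at a time.
    rest=$nodes
    ve_count=1
    while [ "${rest#*,}" != "$rest" ]; do
        rest=${rest#*,}
        ve_count=$((ve_count + 1))
    done
    procs=$((ve_count + 1))   # all VE processes plus the single VH process
    printf '%s\n' "./build.vh/${app}_${backend}_vh --ham-cpu-affinity ${core} --ham-process-count ${procs} --ham-veo-ve-nodes ${nodes} --ham-veo-ve-lib ./build.ve/veorun_${app}_${backend}_ve"
}

ham_cmd inner_product vedma 0
```

To actually launch the application, the printed line can be copied into the terminal, optionally prefixed with `hugectl --heap --shm`.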

See also Runtime Configuration.

Further examples:

# inner_product VEO-only
./build.vh/inner_product_veo_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_inner_product_veo_ve
# inner_product VEO + VEDMA
./build.vh/inner_product_vedma_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_inner_product_vedma_ve

# ham_offload_test VEO-only
./build.vh/ham_offload_test_veo_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_ham_offload_test_veo_ve
# ham_offload_test VEO + VEDMA
./build.vh/ham_offload_test_vedma_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_ham_offload_test_vedma_ve

# ham_offload_test_explicit VEO-only
./build.vh/ham_offload_test_explicit_veo_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_ham_offload_test_explicit_veo_ve
# ham_offload_test_explicit VEO + VEDMA
./build.vh/ham_offload_test_explicit_vedma_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_ham_offload_test_explicit_vedma_ve

# test_argument_transfer VEO-only
./build.vh/test_argument_transfer_veo_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_test_argument_transfer_veo_ve
# test_argument_transfer VEO + VEDMA
./build.vh/test_argument_transfer_vedma_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_test_argument_transfer_vedma_ve

# test_data_transfer VEO-only
./build.vh/test_data_transfer_veo_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_test_data_transfer_veo_ve
# test_data_transfer VEO + VEDMA
./build.vh/test_data_transfer_vedma_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_test_data_transfer_vedma_ve

# test_multiple_targets VEO-only
./build.vh/test_multiple_targets_veo_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_test_multiple_targets_veo_ve
# test_multiple_targets VEO + VEDMA
./build.vh/test_multiple_targets_vedma_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_test_multiple_targets_vedma_ve
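
All of the commands above follow the same pattern, so they can also be generated in a loop. The sketch below assumes both builds exist in `build.vh` and `build.ve` as described earlier; `ham_examples` and the `RUN` variable are hypothetical names introduced here. By default it only prints the commands, and it executes them when `RUN=1` is set.

```shell
# ham_examples is a hypothetical helper, not part of HAM-Offload.
# It prints every example command; with RUN=1 it executes them instead.
ham_examples() {
    for app in inner_product ham_offload_test ham_offload_test_explicit \
               test_argument_transfer test_data_transfer test_multiple_targets; do
        for backend in veo vedma; do
            # Assemble the command as positional parameters to keep quoting intact.
            set -- "./build.vh/${app}_${backend}_vh" \
                --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 \
                --ham-veo-ve-lib "./build.ve/veorun_${app}_${backend}_ve"
            if [ "${RUN:-0}" = 1 ]; then
                "$@" || echo "FAILED: ${app} (${backend})" >&2
            else
                printf '%s\n' "$*"
            fi
        done
    done
}

ham_examples
```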

Details

Since we are using NEC VEO, HAM-Offload applications are built and run like VEO applications.

See this README for examples of how to build VEO applications manually on the command line. This is essentially what the provided CMake files do for the applications in the repository. The library targets are built as libraries.

The VE part of the application can be built as a static or a dynamic library. Dynamic linking has some caveats, while static linking requires the mk_veorun_static tool to generate the VE binary from the compiler-generated static library. The provided CMake files perform a static build and call mk_veorun_static as needed. The HAM-Offload libraries are built with HAM_COMM_VEO_STATIC, which switches between the two corresponding initialisation paths of VEO, i.e. veo_proc_create_static(...) vs. veo_proc_create(...).

Debugging

The underlying NEC VEO and VEOS use log4c for logging, which can be enabled by placing a .log4crc file with the content below in your home directory. See the VEO documentation for more details.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE log4c SYSTEM "">
<log4c>
        <config>
                <bufsize>1024</bufsize>
                <debug level="0"/>
                <nocleanup>0</nocleanup>
        </config>
        <category name="veos.veo" priority="DEBUG" appender="veo_appender" />
        <appender name="veo_appender" layout="ve" type="rollingfile" rollingpolicy="veo_rp" logdir="." prefix="veo.log"/>
        <rollingpolicy name="veo_rp" type="sizewin" maxsize="4194304" maxnum="10" />
        <layout name="ve" type="ve_layout"/>
</log4c>