OpenMP Target Offload - khuck/apex-tutorial GitHub Wiki

This tutorial introduces APEX using OpenMP target offload examples.

Source Code

The instructions below apply to any of the OpenMP target offload examples. The walkthrough that follows uses the ompt_target_matmult.c example.

Running the OpenMP example

To run one of the examples, first set the number of OpenMP threads (recommended), then launch the executable:

[khuck@gilgamesh apex-tutorial]$ export OMP_NUM_THREADS=4
[khuck@gilgamesh apex-tutorial]$ ./build/bin/ompt_target_matmult
Iteration 0 of 3:...
compute_target
Iteration 1 of 3:...
compute_target
Iteration 2 of 3:...
compute_target
Done.

Running the OpenMP example with APEX and OpenMP support

As described in the C POSIX Pthreads and Standard C++ threads tutorials, APEX provides the apex_exec wrapper script to preload the APEX measurement library and set the appropriate environment variables. Three of its options are relevant to OpenMP support:

    --apex:ompt                   enable OpenMP profiling (requires runtime support)
    --apex:ompt_simple            only enable OpenMP Tools required events
    --apex:ompt_details           enable all OpenMP Tools events

As mentioned in the help message, OpenMP Tools (OMPT) support requires that the OpenMP runtime implement the tools interface defined in the OpenMP 5.0 specification. APEX provides the tool implementation that connects to that runtime support. Compilers with known OMPT support include:

  • LLVM-based compilers, like Clang, Cray, AMD Clang, others
  • Intel OneAPI compilers
  • NVHPC version 22+ (special flags required at link time)
  • IBM XL compilers

You'll notice that GCC is not on this list: currently there is no known effort to provide OMPT support in the GCC compilers.

NOTE: Not all compilers/runtimes support the full set of OMPT events (some events are optional), and not all runtimes implement them correctly yet. The following examples use the AMD Clang/Clang++ compilers from ROCm 5.2.0 (reported as AMD Clang 14.0.0 in the output below).

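Enabling OMPT support in APEX requires at least one of the three flags above. Basic support is provided with the --apex:ompt flag. The run below also adds --apex:ompt_details to enable all OMPT events and --apex:tasktree to write the task tree file that apex-treesummary.py reads after the run (for information about the other APEX flags, see the C POSIX Pthreads and Standard C++ threads tutorials):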

[khuck@gilgamesh apex-tutorial]$ apex_exec --apex:ompt --apex:ompt_details --apex:tasktree ./build/bin/ompt_target_matmult
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
libomp --> OMPT: Connecting with libomptarget
libomp --> OMPT: Exit libomp_ompt_connect
Iteration 0 of 3:...
compute_target
Iteration 1 of 3:...
compute_target
Iteration 2 of 3:...
compute_target
Done.

Start Date/Time: 26/02/2023 13:12:41
Elapsed time: 0.214143 seconds
Total processes detected: 1
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 5
Available CPU time on rank 0: 1.07072 seconds
Available CPU time on all ranks: 1.07072 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                       GPU: OpenMP Target Data Alloc :      9 1.05e+06 1.05e+06
                      GPU: OpenMP Target Data Delete :      9     0.00     0.00
                 GPU: OpenMP Target DataOp BW (MB/s) :     18   261.56  1588.00
                     GPU: OpenMP Target DataOp Bytes :     18 5.24e+05 1.05e+06
Iterations: OpenMP Work Loop: .omp_outlined.:0x206b… :     36   512.00   512.00
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 4.18e+05 4.18e+05
                                     status:VmExe kB :      1    32.00    32.00
                                     status:VmHWM kB :      1 7.73e+04 7.73e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 2.65e+05 2.65e+05
                                     status:VmPTE kB :      1   580.00   580.00
                                    status:VmPeak kB :      1 8.71e+05 8.71e+05
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 7.73e+04 7.73e+04
                                    status:VmSize kB :      1 8.51e+05 8.51e+05
                                     status:VmStk kB :      1   136.00   136.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     1.00     1.00
                      status:voluntary_ctxt_switches :      1    15.00    15.00
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total
--------------------------------------------------------------------------------
GPU: OpenMP Target Submit: compute_target(float*, f… :      3     0.00     0.00
GPU: OpenMP Target DataOp Delete: do_work():0x207cc8 :      3     0.00     0.00
 GPU: OpenMP Target DataOp Alloc: do_work():0x207b08 :      3     0.00     0.00
                     GPU: OpenMP Target DataOp Alloc :      1     0.00     0.00
GPU: OpenMP Target DataOp Delete: do_work():0x207c28 :      3     0.00     0.00
 GPU: OpenMP Target DataOp Alloc: do_work():0x207a5e :      2     0.00     0.00
 GPU: OpenMP Target DataOp Alloc: do_work():0x207ab3 :      3     0.00     0.00
GPU: OpenMP Target DataOp Delete: do_work():0x207c78 :      3     0.00     0.00
GPU: OpenMP Target: compute_target(float*, float*, … :      3     0.00     0.00
GPU: OpenMP Target: compute_target(float*, float*, … :      3     0.00     0.00
GPU: OpenMP Target: compute_target(float*, float*, … :      3     0.00     0.00
              GPU: OpenMP Target: do_work():0x207a5e :      2     0.00     0.00
              GPU: OpenMP Target: do_work():0x207ab3 :      3     0.00     0.00
              GPU: OpenMP Target: do_work():0x207b08 :      3     0.00     0.00
              GPU: OpenMP Target: do_work():0x207c28 :      3     0.00     0.00
              GPU: OpenMP Target: do_work():0x207cc8 :      3     0.00     0.00
              GPU: OpenMP Target: do_work():0x207c78 :      3     0.00     0.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.21     0.21
        int apex_preload_main(int, char **, char **) :      1     0.01     0.01
            OpenMP Implicit Task: do_work():0x207bb6 :      9     0.00     0.01
           OpenMP Work Loop: .omp_outlined.:0x206b33 :     36     0.00     0.01
         OpenMP Implicit Barrier: do_work():0x207bb6 :      9     0.00     0.01
    OpenMP Implicit Barrier Wait: do_work():0x207bb6 :      9     0.00     0.01
            OpenMP Implicit Task: do_work():0x207b7c :     12     0.00     0.00
            OpenMP Implicit Task: do_work():0x207b42 :     12     0.00     0.00
OpenMP Target: compute_target(float*, float*, float… :      3     0.00     0.00
                   OpenMP Target: do_work():0x207a5e :      3     0.00     0.00
          OpenMP Parallel Region: do_work():0x207b42 :      3     0.00     0.00
          OpenMP Parallel Region: do_work():0x207bb6 :      3     0.00     0.00
          OpenMP Parallel Region: do_work():0x207b7c :      3     0.00     0.00
         OpenMP Implicit Barrier: do_work():0x207b7c :     12     0.00     0.00
         OpenMP Implicit Barrier: do_work():0x207b42 :     12     0.00     0.00
    OpenMP Implicit Barrier Wait: do_work():0x207b7c :     12     0.00     0.00
    OpenMP Implicit Barrier Wait: do_work():0x207b42 :     12     0.00     0.00
                   OpenMP Target: do_work():0x207cc8 :      3     0.00     0.00
                   OpenMP Target: do_work():0x207b08 :      3     0.00     0.00
                   OpenMP Target: do_work():0x207c28 :      3     0.00     0.00
OpenMP Target: compute_target(float*, float*, float… :      3     0.00     0.00
                   OpenMP Target: do_work():0x207ab3 :      3     0.00     0.00
                   OpenMP Target: do_work():0x207c78 :      3     0.00     0.00
OpenMP Target: compute_target(float*, float*, float… :      3     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 219
Writing: .//apex_tasktree.csv
[khuck@gilgamesh apex-tutorial]$ apex-treesummary.py --ascii --dot
Reading tasktree...
Read 43 rows
Found 0 ranks, with max graph node index of 42 and depth of 5
building common tree...
Rank 0 ...
1-> 0.214 - 100.000% [1] {min=0.214, max=0.214, mean=0.214, threads=1} APEX MAIN
1 |-> 0.010 - 4.602% [1] {min=0.010, max=0.010, mean=0.010, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.002 - 0.906% [3] {min=0.002, max=0.002, mean=0.001, threads=1} OpenMP Target: compute_target(float*, float*, float*, int, int, int):0x207616
1 | | |-> 0.214 - 100.000% [3] {min=0.214, max=0.214, mean=0.071, threads=1} GPU: OpenMP Target: compute_target(float*, float*, float*, int, int, int):0x207616
1 | | | |-> 0.002 - 0.797% [3] {min=0.002, max=0.002, mean=0.001, threads=1} GPU: OpenMP Target Submit: compute_target(float*, float*, float*, int, int, int):0x207616
1 | |-> 0.001 - 0.442% [3] {min=0.001, max=0.001, mean=0.000, threads=1} OpenMP Target: do_work():0x207a5e
1 | | |-> 0.214 - 100.000% [2] {min=0.214, max=0.214, mean=0.107, threads=1} GPU: OpenMP Target: do_work():0x207a5e
1 | | | |-> 0.000 - 0.003% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: OpenMP Target DataOp Alloc: do_work():0x207a5e
1 | |-> 0.001 - 0.425% [3] {min=0.001, max=0.001, mean=0.000, threads=1} OpenMP Parallel Region: do_work():0x207b42
1 | | |-> 0.002 - 1.088% [12] {min=0.002, max=0.002, mean=0.000, threads=4} OpenMP Implicit Task: do_work():0x207b42
1 | | | |-> 0.002 - 0.875% [12] {min=0.002, max=0.002, mean=0.000, threads=4} OpenMP Work Loop: .omp_outlined.:0x206b33
1 | | | |-> 0.000 - 0.152% [12] {min=0.000, max=0.000, mean=0.000, threads=4} OpenMP Implicit Barrier: do_work():0x207b42
1 | | | | |-> 0.000 - 0.113% [12] {min=0.000, max=0.000, mean=0.000, threads=4} OpenMP Implicit Barrier Wait: do_work():0x207b42
1 | |-> 0.001 - 0.351% [3] {min=0.001, max=0.001, mean=0.000, threads=1} OpenMP Parallel Region: do_work():0x207bb6
1 | | |-> 0.008 - 3.657% [9] {min=0.008, max=0.008, mean=0.001, threads=4} OpenMP Implicit Task: do_work():0x207bb6
1 | | | |-> 0.006 - 2.706% [9] {min=0.006, max=0.006, mean=0.001, threads=4} OpenMP Implicit Barrier: do_work():0x207bb6
1 | | | | |-> 0.006 - 2.682% [9] {min=0.006, max=0.006, mean=0.001, threads=4} OpenMP Implicit Barrier Wait: do_work():0x207bb6
1 | | | |-> 0.003 - 1.219% [12] {min=0.003, max=0.003, mean=0.000, threads=4} OpenMP Work Loop: .omp_outlined.:0x206b33
1 | |-> 0.001 - 0.278% [3] {min=0.001, max=0.001, mean=0.000, threads=1} OpenMP Parallel Region: do_work():0x207b7c
1 | | |-> 0.002 - 1.094% [12] {min=0.002, max=0.002, mean=0.000, threads=4} OpenMP Implicit Task: do_work():0x207b7c
1 | | | |-> 0.002 - 0.903% [12] {min=0.002, max=0.002, mean=0.000, threads=4} OpenMP Work Loop: .omp_outlined.:0x206b33
1 | | | |-> 0.000 - 0.154% [12] {min=0.000, max=0.000, mean=0.000, threads=4} OpenMP Implicit Barrier: do_work():0x207b7c
1 | | | | |-> 0.000 - 0.123% [12] {min=0.000, max=0.000, mean=0.000, threads=4} OpenMP Implicit Barrier Wait: do_work():0x207b7c
1 | |-> 0.000 - 0.074% [3] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Target: do_work():0x207cc8
1 | | |-> 0.214 - 100.000% [3] {min=0.214, max=0.214, mean=0.071, threads=1} GPU: OpenMP Target: do_work():0x207cc8
1 | | | |-> 0.000 - 0.070% [3] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: OpenMP Target DataOp Delete: do_work():0x207cc8
1 | |-> 0.000 - 0.036% [3] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Target: do_work():0x207b08
1 | | |-> 0.214 - 100.000% [3] {min=0.214, max=0.214, mean=0.071, threads=1} GPU: OpenMP Target: do_work():0x207b08
1 | | | |-> 0.000 - 0.031% [3] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: OpenMP Target DataOp Alloc: do_work():0x207b08
1 | |-> 0.000 - 0.014% [3] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Target: do_work():0x207c28
1 | | |-> 0.214 - 100.000% [3] {min=0.214, max=0.214, mean=0.071, threads=1} GPU: OpenMP Target: do_work():0x207c28
1 | | | |-> 0.000 - 0.004% [3] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: OpenMP Target DataOp Delete: do_work():0x207c28
1 | |-> 0.000 - 0.007% [3] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Target: compute_target(float*, float*, float*, int, int, int):0x207529
1 | | |-> 0.214 - 100.000% [3] {min=0.214, max=0.214, mean=0.071, threads=1} GPU: OpenMP Target: compute_target(float*, float*, float*, int, int, int):0x207529
1 | |-> 0.000 - 0.007% [3] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Target: do_work():0x207ab3
1 | | |-> 0.214 - 100.000% [3] {min=0.214, max=0.214, mean=0.071, threads=1} GPU: OpenMP Target: do_work():0x207ab3
1 | | | |-> 0.000 - 0.002% [3] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: OpenMP Target DataOp Alloc: do_work():0x207ab3
1 | |-> 0.000 - 0.006% [3] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Target: do_work():0x207c78
1 | | |-> 0.214 - 100.000% [3] {min=0.214, max=0.214, mean=0.071, threads=1} GPU: OpenMP Target: do_work():0x207c78
1 | | | |-> 0.000 - 0.002% [3] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: OpenMP Target DataOp Delete: do_work():0x207c78
1 | |-> 0.000 - 0.003% [3] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Target: compute_target(float*, float*, float*, int, int, int):0x2076cc
1 | | |-> 0.214 - 100.000% [3] {min=0.214, max=0.214, mean=0.071, threads=1} GPU: OpenMP Target: compute_target(float*, float*, float*, int, int, int):0x2076cc
1 |-> 0.000 - 0.008% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: OpenMP Target DataOp Alloc
44 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.

DOT task graph of ompt_target_matmult example