TensileCreateLibrary - ROCm/Tensile GitHub Wiki

Warning

This wiki is obsolete. For the latest documentation, go to rocm.docs.amd.com/projects/Tensile

Definitions

Benchmark Run: Given a benchmark config (see Tensile/Configs/ for examples), generate a set of potential suitor solutions to solve a problem defined by parameterized properties. Run the generated kernels with specified sizes / ranges and record their performance. The best performing kernels for the given benchmark are selected and written to out as Library Logic.

Library Logic: One or more .yaml files containing benchmarking output results (usually one per gfx architecture, per benchmark problem). They consist of kernel meta-data and mappings to problems that it can solve, and performance data at particular sizes.

Solution: Solutions are basically a parameterized representations of a kernels intended to solve a specific problem. In the case we are solving a GEMM type of problem, a solution will encapsulate a set of fixed parameters that are applied to a generalized GEMM algorithm.

Predicate: A predicate is basically a test to either affirm or negate a condition.

Rationale

Every problem that can be solved with a generalized algorithm can be customized based on the application. In this case, we are solving problems on AMDGPUs which are highly configurable and can get things done in many ways. We have many choices of distributing workload, hardware resources, vector parallelization or unrolling loops just to name a few. How do we know which are the best choices, guaranteeing correct results with the best performance?

One way that we can deal with this is by considering parameterized benchmarks. We choose parameters and problem sizes that we are interested in, and try different combinations of these values to gain insight on their effects of correctness and performance. Once validated for correctness and show relative performance advantages, each of these combinations is considered a valid 'solution' to the problem.

Introduction

TensileCreateLibrary has a very important purpose in Tensile to generate and organize libraries of kernels to run on AMDGPU architectures. Given a set of previously defined solutions to a problem, each solution metadata is used to generate concrete GPU code (kernel) either in assembly or c++ source. The kernels themselves are basic operation building blocks that form the foundation of solving more complex problems and are common to many different applications. For example, kernels that implement the Generalized Matrix-Matrix Multiplication (GEMM) could ultimately end up being used in more complex machine learning or image processing techniques.

Of course we can define any solution meta-data that we want, but we have an aim to generate the most efficient and highest performing kernels possible. The benchmarking process that we use helps us find optimal kernels by testing out many varieties of parameterized solutions. Components from TensileCreateLibraries are used to generate the initial set of test solutions from the parameterized meta-data. After testing, the most optimal solutions are then selected and stored as configuration files. These configuration files are passed to TensileCreateLibraries to finally become the Master Solution Library.

Benchmarking process

Components from TensileCreateLibrary are used in a few different steps of the benchmarking process. Let's review:

Benchmark config files (.yaml) are created by hand. Problem type and properties are defined, then interesting solution properties are parameterized to ranges of values that are applicable. Example: see Tensile/Configs
Tensile benchmark script is called with a benchmark config as input E.g. Tensile/bin/Tensile path/to/config.yaml output/dir/
Benchmark config is read, and several 'potential' solutions are generated to test their performance in solving the problem.
Kernels are generated from the solutions and linked into a test solution library.
A client application is generated to load the test solution library with Tensile API and run performance benchmarks.
Performance results are written to file and are analyzed for optimal 'winning' kernels to given problem sizes.
Winning kernel meta-data (solutions) are written as Logic Files (.yaml) with performance data.
Repeat process for all benchmark configs.
Optimal solutions are then read from all Logic File outputs.
Kernel libraries are generated from the solutions and their meta-data combined into a master solution library (TensileLibrary.dat)

More specific details about the benchmark configs and protocol can be found here.

The final kernel libraries and master solution library can then be loaded by a client using the Tensile API. Tensile API will map user defined problems to specific kernels that are optimally suited to get the job done.

Where does TensileCreateLibrary come in? TensileCreateLibrary contains components that are suitable for processing solution metadata to generate kernel sources, assemble/compile them, and link them into code object libraries. It is also suitable for combining and managing solution meta-data and their problem mappings.

TensileCreateLibrary is mainly used to aggregate a given set of solutions into a master solution library and to generate and bundle associated kernels into code object libraries. The outputs are specifically the master solution library (TensileLibrary.dat) and the code object libraries (.co/.hsaco).

Code Object Libraries

Kernels are compiled into shared object libraries. These are identified by their file extensions .co/.hsaco. The only difference between them is that .co libs are ASM kernels and .hsaco libs are SOURCE kernels. Same ABI, same format.

ASM Kernels

Solutions that have the KernelLanguage: Assembly property use the KernelWriterAssembly object to generate the actual kernel code. This object has the very complicated task of translating all of the solution properties, or meta-data, into a sequence of code modules that are eventually rendered into a string of targeted asm code and saved to file as assembly. The ISA property determines the target graphics architecture of the assembly which is especially important for the object to choose the correct assembly instruction set. For example, the KernelWriterAssembly may dynamically test the assembler for v_mfma instructions for ISA (9,0,6) and will find alternatives if unsupported.

ASM kernels are assembled into .o object files and finally linked into .co files. The file names are obtained from the Solution's Name property.

Source Kernels

Solutions that have the KernelLanguage: Source property use the KernelWriterSource object to generate the actual kernel code. This object has the very complicated task of translating all of the solution properties, or meta-data, into a sequence of code modules that are eventually rendered into a string of C++ code and saved to file as .h and .cpp sources. The ISA property for source kernels is (0,0,0) which means that source kernels are compiled for all architecture targets.

Source kernels are compiled and assembled into .o files and extracted into .hsaco code modules per architecture. The file names are obtained from the Solution's Name property and decorated with the architecture target.

Final Code Object Libraries

Depending on TensileCreateLibrary state for 'MergeFiles', the code object libraries may be linked together into monolithic libraries for each architecture. This just affects the number of library files in the final result of TensileCreateLibrary. Again, there .co and .hsaco files have basically the same format, however the extensions allow distinguishing between ASM and SOURCE kernels.

Usage:

$>Tensile/bin/TensileCreateLibrary <Options...> <LogicPath> <OutputPath> <RuntimeLanguage>

[-h] Display usage help

[--cxx-compiler {hcc,hipcc}] Compiler override (default hipcc).

[--code-object-version {V2,V3}] Code object version override (default V3).

[--architecture {all,gfx000,gfx803,gfx900,gfx906,gfx908}] Architecture override (default all).

[--merge-files] / [--no-merge-files] If merge, all kernels are merged to one code object file per architecture (default on).

[--short-file-names] / [--no-short-file-names] Windows only option for shortening output files names. Currently disabled.

[--library-print-debug] / [--no-library-print-debug] Solutions will print enqueue info when enqueueing a kernel (default off)

[--no-enumerate] Disable enumeration of host graphics architecture.

[--package-library]

[--no-legacy-components] Don't generate solution source code for old client.

[--embed-library EMBEDLIBRARY] Embed (new) library files into static variables. Specify the name of the library.

[--embed-library-key EMBEDLIBRARYKEY] Prefix embedded symbols with EMBEDLIBRARYKEY.

[--new-client-only] Only build libraries for the new client.

[--version VERSION] Embed version into library files.

[--generate-manifest-and-exit] In the output directory, create a manifest file of expected outputs only.

[--library-format {yaml,msgpack}] Choose format of output library (default msgpack). Respective file extensions {.yaml,.dat}

LogicPath Path to LibraryLogic .yaml files.

OutputPath Output directory.

{OCL,HIP,HSA} Chosen runtime language.

Tensile API

So we put Library Logic files into TensileCreateLibrary and it spits out a Master Solution Library and Code Object Libraries. Now who uses them?

'Clients' can invoke the Tensile API to run kernels on their data and solve the problems they are interested in. It was previously mentioned that clients, or Tensile API users don't need expert domain knowledge on AMDGPUs to run the best possible kernels to solve interesting problems. All they need is a domain knowledge on the problem they are trying to solve, and Tensile takes care of the rest.

The Tensile API provides a C++ entrypoint to load TensileCreateLibrary outputs and to describe problems and invoke kernels. It also has support to accommodate both HIP and OpenCL backends to suit the users' workflow.

Loading Master Solution Library

Tensile API conveniently has a function to read in the master solution library to memory:

 #include "Tensile/Tensile.hpp"

 ...

 auto library = LoadLibraryFile<ContractionProblem>(envDir.file("TensileLibrary").native());
 ASSERT_NE(library, nullptr);

Loading Code Object Files

Tensile API conveniently has wrapper classes to read code object files using either HIP or OpenCL backends under the hood:

 #include Tensile/Hip/HipSolutionAdapter.h"
 auto adapter = std::make_shared<hip::SolutionAdapter>(debug, "TENSILE_TEST_LIBRARY");

 // Alternatively, in OpenCL
 // #include Tensile/ocl/OclSolutionAdapter.h"
 // auto adapter = std::make_shared<ocl::SolutionAdapter>(debug, "TENSILE_TEST_LIBRARY");

 ASSERT_NE(adapter, nullptr);

 for(auto file : envDir.glob("*.co"))
 {
     adapter->loadCodeObjectFile(file.native());
 }

 for(auto file : envDir.glob("*.hsaco"))
 {
     try
     {
         adapter->loadCodeObjectFile(file.native());
     }
     catch(std::logic_error& exc)
     {
         ...
     }
 }

Initialize Hardware

Enumerate the current hardware architecture with HIP or OpenCL.

#include "hip/HipHardware.hpp"
auto hardware = hip::GetCurrentDevice();

// Alternatively, in OpenCL
// #include "ocl/OclHardware.hpp"
// auto hardware = ocl::GetCurrentDevice();

ASSERT_NE(hardware, nullptr);

Defining the Problem

A domain expert in the problem of interest can define its properties using the Tensile API:

#include "Tensile/ContractionProblem.hpp"

auto problem = ContractionProblem::GEMM(...);

Initialize Resources and Data

Domain expert would need to allocate and initialize their host and device memory.

using TypedInputs = TypedContractionInputs<AType, BType, CType, DType, AlphaType, BetaType> 

TypedInputs hostData;
TypedInputs deviceInputs;

// Initialize host data and init values
hostData.a = new AType[aSize];
foreach(a = hostData.a, aSize)
{
    a = ...
}

// Initialize device allocations and data
HIP_CHECK_EXC(hipMalloc(deviceInputs.a, aSize));

// Alternatively, in OpenCL
// cl::Buffer devA = {cl::Context::getDefault(), CL_MEM_READ_WRITE, aSize};
// deviceInputs.a = static_cast<AType*>(devA());

...

// Copy data to GPU, if necessary
HIP_CHECK_EXC(
        hipMemcpy(deviceInputs.a, hostData.a, aSize, hipMemcpyHostToDevice));

// Alternatively, in OpenCL
// cl::CommandQueue::getDefault().enqueueWriteBuffer(devA, CL_TRUE, 0, aSize, hostData.a);

...

Solve the Problem

The domain expert would then use the objects already attained from the API to solve the problem and launch the kernels. library->findBestSolution evaluates the Master Solution Library predicate tree to find the optimal solution for the problem. The solution then provides the kernel meta-data as KernelInvocation objects. Each one represents a single kernel. The adapter then finds the kernel code inside the loaded hip module with via hipModuleGetFunction(). Finally, the kernel is invoked by the adapter via hipExtModuleLaunchKernel().

// Use the master solution library to find the best solution to the problem
auto solution = library->findBestSolution(problem, hardware*);

// Build the input parameters to pass to the kernel
std::vector<KernelInvocation> result = solution->solve(problem, inputs_d, *hardware);

// Launch the kernels
adapter->launchKernels(result);

// Copy the data back to host from GPU
HIP_CHECK_EXC(
        hipMemcpy(hostData.a, deviceInputs.a, aSize, hipMemcpyDeviceToHost));

// Alternatively, in OpenCL
// cl::CommandQueue::getDefault().enqueueReadBuffer(devA, CL_TRUE, 0, aSize, hostData.a);

...
// Release memory
delete[] hostData.a;
hipFree(deviceInputs.a);

// Alternatively, in OpenCL C++ RAII destructor will clean up
...

Conclusion

This page has in a nutshell described the context surrounding TensileCreateLibrary and it's role in benchmarking, generating kernel libraries and solution libraries, and how they are consumed by a client application using the Tensile API. Code snippets in this document should serve as pseudo code only as the API may change or evolve. It nevertheless serves as an overview of potential usage. Specific examples of this process can be seen in more complex projects such as HostLibraryTests/RunGEMMKernel_test.cpp among others.