Quick Start Guide - agency-library/agency GitHub Wiki

Introduction

Agency is a C++ template library for parallel programming based on a modular abstraction of execution. Agency allows you to build parallel algorithms using programming primitives that, while low-level, remain portable across different platforms.

This document describes how to develop parallel algorithms with Agency. The tutorial is intended to be accessible, even if you have limited C++ or parallel programming experience.

Prerequisites

Agency requires a compiler that supports C++11 or later.

Simple Example

Let's compile a simple program with Agency to ensure that all prerequisites are satisfied. Save the following source code to a file named version.cpp.

#include <agency/version.hpp>
#include <iostream>

int main()
{
  int major = AGENCY_MAJOR_VERSION;
  int minor = AGENCY_MINOR_VERSION;

  std::cout << "Agency v" << major << "." << minor << std::endl;

  return 0;
}

Now compile version.cpp with g++. If Agency is in the current directory, the following commands should work:

$ ls
agency version.cpp
$ g++ -std=c++11 -I. version.cpp -o version
$ ./version
Agency v0.0

If the top-level agency directory is placed somewhere else, use the -I option to tell g++ where to look. For example, if agency was placed in ~/include then the following command should be used.

$ g++ -std=c++11 -I ~/include version.cpp -o version

Agency may also be compiled with nvcc to create programs which may execute on GPUs. In this case, a .cu file extension should be used:

$ nvcc -std=c++11 -I. version.cu -o version

Alternatively, we can tell nvcc to treat the .cpp file as a CUDA source file:

$ nvcc -std=c++11 -I. -x cu version.cpp -o version

Other C++11 compilers, such as clang and icc, are also known to work with Agency.

Execution Agents

When programming algorithms with Agency, you create groups of execution agents and have them execute tasks. In Agency, an execution agent is a stand-in for some underlying physical thing -- maybe a thread, a SIMD lane, or a process -- which actually executes our task of interest. Programming algorithms in terms of execution agents instead of something really specific -- like threads -- makes our programs portable across different kinds of hardware platforms.

Hello World

In Agency, a task is simply any function whose first parameter is some type of execution agent:

void hello(agency::sequenced_agent& self)
{
  std::cout << "Hello, world from agent " << self.index() << std::endl;
}

The type of execution agent this hello() task requires is a sequenced_agent. This type of agent executes our task in sequence, that is, in order with the other sequenced_agents in its group. The reason we create agents in groups is so that we can coordinate their efforts. Imposing the structure of a group onto our collections of agents makes it easier for us to reason about their behavior, and also makes it easier for Agency to map the agents onto the underlying system.

bulk_invoke()

Now that we know how to define a task, how do we actually create execution agents to execute it? That's the job of functions like bulk_invoke. The following line of code calls bulk_invoke() to create a group of ten sequenced_agents which invoke the hello() function in bulk:

agency::bulk_invoke(agency::seq(10), hello);

Try it yourself! Compile and execute the hello_world.cpp example program found in Agency's examples directory:

$ g++ -std=c++11 -I. examples/hello_world.cpp -o hello_world
$ ./hello_world 
Hello, world from agent 0
Hello, world from agent 1
Hello, world from agent 2
Hello, world from agent 3
Hello, world from agent 4
Hello, world from agent 5
Hello, world from agent 6
Hello, world from agent 7
Hello, world from agent 8
Hello, world from agent 9

You can see that the hello() task was invoked ten times -- once for each sequenced_agent in the group we created with bulk_invoke().

Lambda Functions

If you've programmed with C++ recently, you know that lambda functions are a nice way to define little functions like hello() inline, at their point of use, without having to stick them in some faraway place in the code. Lambdas are especially handy when the function will only be called once -- just like in our hello world program. Instead of defining a named function hello(), we could have used a lambda with bulk_invoke() directly:

agency::bulk_invoke(agency::seq(10), [](agency::sequenced_agent& self)
{
  std::cout << "Hello world from agent " << self.index() << std::endl;
});

You can read more about C++ lambda functions on Wikipedia.

Of course, Agency is fully interoperable with lambdas -- here's a complete Agency hello world program which uses a lambda function to define our task instead of a named function like hello():

#include <agency/execution_policy.hpp>
#include <iostream>

int main()
{
  // create 10 sequenced_agents to greet us in bulk
  agency::bulk_invoke(agency::seq(10), [](agency::sequenced_agent& self)
  {
    std::cout << "Hello, world from agent " << self.index() << std::endl;
  });

  return 0;
}

For the rest of the guide, unless there's a reason not to, we'll stick to using lambdas.

Execution Policies

You may have noticed that in our Hello World program we passed bulk_invoke() two parameters:

agency::bulk_invoke(agency::seq(10), task);

The second parameter was the lambda function defining the task we wanted to execute, but what does seq(10) mean? You may have guessed from the 10 that this thing has something to do with the number of execution agents bulk_invoke() will create.

Sequenced Execution with seq

In fact, seq is a type of execution policy describing the group of execution agents to create, and their relationship to one another within the group. Execution policies describe the size of the group of agents, as well as their execution ordering semantics. In this case, the seq execution policy tells bulk_invoke() to create a group of sequenced_agents that will execute in sequence, or one after the other. However, there are other possibilities.

Parallel Execution with par

Sometimes we don't care so much about the order in which agents execute. For example, suppose we want to write a really simple program that just fills an array with some number. All we care about is that the array gets filled with the correct value -- the order in which array elements are accessed just doesn't matter. We could start setting elements from the beginning and walk through the array to the end, or we could start from the end and walk backwards, or we could go in a totally random order -- or even no order at all. At the end of the day, the result will be the same.

An easy way to use Agency to do this would be to create a whole bunch of execution agents -- one agent for each element in the array. The task for each agent would simply be to store the number to the corresponding array element:

// create an array
std::vector<int> array(100);

// create a group of parallel agents -- one agent per array element
agency::bulk_invoke(agency::par(array.size()), [&array](agency::parallel_agent& self)
{
  // to find the corresponding array element, use self.index()
  array[self.index()] = 42;
});

This program uses the par execution policy to create a group of parallel agents to fill the array. In Agency, parallel is a weaker form of execution than sequenced. Basically, par weakens the ordering guarantees that we require of our agents and allows Agency to schedule execution agents onto threads. This means that if Agency happens to schedule execution agents on different threads, they will execute in an unordered fashion. Since the relaxed ordering guarantees of parallelism allow us to use more than one thread in our program, there's a chance that the program will execute faster than it would otherwise -- which is the whole point of introducing parallelism in the first place!

When writing Agency programs, it's useful to distinguish types of execution agents from one another based on their abilities and what assumptions they make regarding execution. You might have noticed that our latest program uses a different type of execution agent than we used in our Hello World programs -- parallel_agent instead of sequenced_agent. When we use an execution policy with bulk_invoke(), we need to be sure that the type of execution agent in our lambda matches. This is to make sure that our lambda doesn't violate any guarantees made by the execution policy. In other words, when we ask Agency for parallel execution agents, our lambda promises not to try to do anything a parallel_agent can't do. We'll see when that matters in the next section.
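
For instance, here's a hedged illustration of the kind of mismatch the type system catches; because parallel_agent and sequenced_agent are distinct types, this should be rejected at compile time (the exact diagnostic varies by compiler):

// ERROR: par creates parallel_agents, but this lambda demands a sequenced_agent
agency::bulk_invoke(agency::par(10), [](agency::sequenced_agent& self)
{
  std::cout << "Hello, world from agent " << self.index() << std::endl;
});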

Since Agency is allowed to use threads when it sees par, we need to build our program in the correct manner. When compiling with g++, this means we should link against the pthread library:

$ g++ -std=c++11 -I. examples/fill.cpp -pthread -o fill
$ ./fill 
OK

SAXPY

Now that we know how to program very simple parallel algorithms with Agency, let's use it to implement a real computation: SAXPY ("single-precision a times x plus y"), a simple vector arithmetic kernel that many folks interested in high performance computing will be familiar with. Let's use bulk_invoke() to implement it:

void saxpy(size_t n, float a, const float* x, const float* y, float* z)
{
  using namespace agency;

  bulk_invoke(par(n), [=](parallel_agent &self)
  {
    int i = self.index();
    z[i] = a * x[i] + y[i];
  });
}

Our SAXPY program works a lot like our previous fill example. For each element in the output array, we create a parallel_agent whose only task is to compute a single multiply-add.
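
For example, calling saxpy(4, 2.0f, x, y, z) with x = {1, 1, 1, 1} and y = {2, 2, 2, 2} stores 2 * 1 + 2 = 4 into each of the four elements of z.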

So how fast is this program? Let's measure its achieved memory bandwidth by calling saxpy() several times in a loop and dividing the size of the input and output by the mean time it takes to execute:

std::vector<float> x(n, 1), y(n, 2), z(n);
float a = 13.;

size_t num_trials = 20;

auto start = std::chrono::high_resolution_clock::now();
for(size_t i = 0; i < num_trials; ++i)
{
  saxpy(n, a, x.data(), y.data(), z.data());
}
std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start;

double seconds = elapsed.count() / num_trials;
double gigabytes = double(3 * n * sizeof(float)) / (1 << 30);
double bandwidth = gigabytes / seconds;
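
To make the arithmetic concrete: if n were 1 << 25 (about 33.5 million elements), each call to saxpy() would read x and y and write z, moving 3 * n * sizeof(float) = 384 MiB per trial; at the roughly 12 GB/s measured below, that works out to about 32 milliseconds per call. (This value of n is only an illustration; the example program chooses its own problem size.)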

On my system, saxpy's bandwidth is around 12 GB/s:

$ g++ -std=c++11 -I. -O3 -pthread examples/saxpy.cpp -o saxpy
$ ./saxpy 
Measuring performance...
SAXPY Bandwidth: 11.8806 GB/s

Can we make it go any faster? One of the benefits of Agency is that it allows us to take our parallel program and port it easily to other platforms. My system happens to have an NVIDIA Tesla GPU inside, which should have quite a bit more memory bandwidth available to it. Let's target it with our saxpy program by making a few changes.

The most important modification to make is to change the execution policy the saxpy() function uses when calling bulk_invoke():

void saxpy(size_t n, float a, const float* x, const float* y, float* z)
{
  using namespace agency;

  bulk_invoke(cuda::par(n), [=] __device__ (parallel_agent &self)
  {
    int i = self.index();
    z[i] = a * x[i] + y[i];
  });
}

Instead of simply par, we've used cuda::par to create our execution agents. cuda::par is an execution policy just like regular par, except it gives Agency special license to use CUDA threads that execute on the GPU. The other minor change is to annotate our lambda function with the __device__ keyword. This tells the NVIDIA compiler to compile the lambda in a special way to allow it to execute on a GPU device.

The other change to our program is to move where the data is stored. NVIDIA GPUs can't access normal system memory without special programmer intervention. To keep things simple, we should put our vectors in the GPU's memory. In our new program, we've replaced our use of vector with Thrust's device_vector, but you could use any allocation scheme you like.
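
For reference, here's a minimal sketch of what the driver code might look like with thrust::device_vector; the setup in examples/saxpy.cu may differ in its details, and the problem size here is arbitrary:

#include <thrust/device_vector.h>

size_t n = 32 << 20;

// allocate and initialize the vectors in GPU memory
thrust::device_vector<float> x(n, 1), y(n, 2), z(n);
float a = 13.;

// pass raw pointers to GPU memory into the CUDA version of saxpy()
saxpy(n, a, x.data().get(), y.data().get(), z.data().get());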

Let's compile our new program with the NVIDIA compiler and see how it runs:

$ nvcc -std=c++11 -expt-extended-lambda -O3 -I. examples/saxpy.cu -o gpu_saxpy
$ ./gpu_saxpy 
Measuring performance...
SAXPY Bandwidth: 129.432 GB/s

It's around ten times faster! Of course, the comparison isn't completely fair -- we had to move the data to the GPU, which would have non-negligible cost in a real world application. And anyway, who cares how fast SAXPY can go? That's not really the point. What you should take away from this exercise is that Agency gives you the flexibility to move execution around in a modular way when there is some benefit in doing so.

Concurrent Execution with con

Part of the power of structuring our execution agents into groups is that it allows us to coordinate their efforts. Coordinating execution agents often requires communication -- that is, the ability to send and receive messages between agents within the same group. Communication requires that agents execute at the same time. That is, we require a guarantee of concurrency.

Neither seq nor par gives us this guarantee. Sequenced agents are guaranteed not to be concurrent, because they execute their tasks in order. Parallel agents, meanwhile, may or may not execute concurrently. For example, one parallel execution agent may happen to be scheduled onto one thread while another is scheduled onto a second thread. By pure luck, the two agents may happen to execute concurrently, but we can't rely on it.

Writing correct programs that rely on the concurrent execution of agents requires a strong guarantee of concurrency. This is what the con execution policy guarantees. Let's use con to create some concurrent_agents to play a game of ping pong:

int ball = 0;
std::string names[2] = {"ping", "pong"};
std::mutex mut;

// create two concurrent agents
agency::bulk_invoke(agency::con(2), [&](agency::concurrent_agent& self)
{
  auto name = names[self.index()];

  // play for 20 volleys
  for(int next_state = self.index(); next_state < 20; next_state += 2)
  {
    // wait for the next volley
    while(ball != next_state)
    {
      mut.lock();
      std::cout << name << " waiting for return" << std::endl;
      mut.unlock();
        
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    // return the ball
    mut.lock();
    ball += 1;
    std::cout << name << "! ball is now " << ball << std::endl;
    mut.unlock();

    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
});

The two concurrent_agents, which we've named ping and pong, communicate through the ball variable to let each other know when it is their turn to update its value. After an agent updates ball, it waits in a while loop for the next state. Because the agents share access to the ball variable, they use a mutex to get exclusive access. Otherwise, ping could update ball while pong is trying to look at its value, or vice versa.
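
As an aside, the manual lock()/unlock() pairs in the example could be written more idiomatically with std::lock_guard, which releases the mutex automatically at the end of the enclosing scope:

// equivalent to mut.lock(); ... mut.unlock();
{
  std::lock_guard<std::mutex> guard(mut);
  ball += 1;
  std::cout << name << "! ball is now " << ball << std::endl;
} // guard's destructor releases mut here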

The exact output of the program is non-deterministic and depends on how the underlying platform decides to schedule the activity of the two concurrent_agents. Sometimes one agent may need to wait a little while for the other to return the ball. However, because we've used the con execution policy to create these agents, we're guaranteed that they will execute concurrently. Try it out yourself to watch the game unfold.

Here's how one game played out on my system:

$ g++ -std=c++11 -I. -pthread examples/concurrent_ping_pong.cpp -o concurrent_ping_pong
$ ./concurrent_ping_pong 
pong waiting for return
ping! ball is now 1
pong! ball is now 2
ping! ball is now 3
pong! ball is now 4
ping waiting for return
pong waiting for return
ping! ball is now 5
pong! ball is now 6
ping! ball is now 7
pong! ball is now 8
ping! ball is now 9
pong! ball is now 10
ping! ball is now 11
pong! ball is now 12
ping! ball is now 13
pong! ball is now 14
ping waiting for return
pong waiting for return
ping! ball is now 15
pong! ball is now 16
ping waiting for return
ping! ball is now 17
pong waiting for return
pong! ball is now 18
ping waiting for return
ping! ball is now 19
pong waiting for return
pong! ball is now 20

So what would have happened if we tried to play this game with either sequenced or parallel agents? In the case of sequenced_agents, we can be sure that the game would have never gotten started. The first sequenced_agent, ping, would have waited forever for pong's first return. Since the agents must execute in sequence, only one is allowed on the court at a time.

For parallel_agents, it's hard to say how the game would play out without trying it since Agency has a lot of leeway to schedule parallelism. Here's what happened on my system:

$ g++ -std=c++11 -I. -pthread parallel_ping_pong.cpp -o parallel_ping_pong
$ ./parallel_ping_pong 
ping! ball is now 1
ping waiting for return
ping waiting for return
ping waiting for return
ping waiting for return
ping waiting for return
ping waiting for return
ping waiting for return
ping waiting for return
...

Apparently, both ping and pong happened to be scheduled in sequence. pong never made it onto the court. When programming algorithms like our ping pong game, where one or more agents needs to be able to block the forward progress of another, it's critical that we create concurrency to be sure our program has a chance of executing correctly.

For Future Reference: The execution policies we've covered so far illustrate a few of the basic different ways in which agents may execute. In addition to these simple policies, Agency provides policies which can create agents which execute on a GPU, agents which execute on SIMD lanes, and even tools for creating your own policies. We'll revisit these topics later in this guide.

Asynchrony

Up to this point in the guide, we've focused on using bulk_invoke() to create a group of agents to invoke a function en masse. So far, our usage of bulk_invoke() works a lot like a for loop -- when we call bulk_invoke() or begin executing the iterations of a for loop, we have to wait until everything finishes before we can proceed. This is because both of these constructs are synchronous -- the caller of bulk_invoke() synchronizes with the completion of the execution agents created inside.

But sometimes we don't want to wait. For example, maybe we have a few different tasks we want to start working on. If these tasks don't depend on each other, there's really no reason to wait for one to finish before starting the next. In this case, we desire asynchrony, which gives us the flexibility to wait on the completion of a task at our leisure. In other words, asynchrony allows us to decouple execution agent creation from synchronization.

bulk_async()

In Agency, we use the bulk_async() function to create a group of execution agents to invoke a function in bulk, asynchronously. This program creates two Hello World tasks asynchronously:

using namespace agency;

std::cout << "Starting two tasks asynchronously..." << std::endl;

std::mutex mut;

// asynchronously create 5 agents to greet us in bulk
auto f1 = bulk_async(par(5), [&](parallel_agent& self)
{
  mut.lock();
  std::cout << "Hello, world from agent " << self.index() << " in task 1" << std::endl;
  mut.unlock();
});

// asynchronously create 5 agents to greet us in bulk
auto f2 = bulk_async(par(5), [&](parallel_agent& self)
{
  mut.lock();
  std::cout << "Hello, world from agent " << self.index() << " in task 2" << std::endl;
  mut.unlock();
});

std::cout << "Sleeping before waiting on the tasks..." << std::endl;
std::this_thread::sleep_for(std::chrono::seconds(1));
std::cout << "Woke up, waiting for the tasks to complete..." << std::endl;

// wait for tasks 1 & 2 to complete before continuing
f1.wait();
f2.wait();

std::cout << "OK" << std::endl;

bulk_async() is just like bulk_invoke() except that it is asynchronous. While the agents execute, the caller of bulk_async() is free to do other things, for example, make more calls to bulk_async(). It's up to the caller to decide when to synchronize through the future object bulk_async() returns as a result. In our program above, these are the calls to f1.wait() and f2.wait(). The future's .wait() function causes the caller to wait until the execution agents associated with this future are finished executing before proceeding.

Because we see par used in the program above, we know that the execution agents created by the calls to bulk_async() may execute in parallel within the two groups. But what is the relationship between the two groups? When we use bulk_async(), we're signalling that a task is immediately ready to execute and has no dependency on anything else. For our example, this means that the two groups of execution agents created by bulk_async() may execute in parallel since they don't depend on each other. On my system, when I ran this program, everything happened in sequence:

$ g++ -std=c++11 -I. -pthread examples/hello_async.cpp -o hello_async
$ ./hello_async 
Starting two tasks asynchronously...
Sleeping before waiting on the tasks...
Woke up, waiting for the tasks to complete...
Hello, world from agent 0 in task 1
Hello, world from agent 1 in task 1
Hello, world from agent 2 in task 1
Hello, world from agent 3 in task 1
Hello, world from agent 4 in task 1
Hello, world from agent 0 in task 2
Hello, world from agent 1 in task 2
Hello, world from agent 2 in task 2
Hello, world from agent 3 in task 2
Hello, world from agent 4 in task 2
OK

As you can see, the two groups of agents didn't even begin executing until the calls to .wait() -- and that's OK. As long as the agents are finished executing after calling .wait(), that's all that matters.

For Future Reference: Later in this guide, we'll see how to get fine-grained control over the ordering relationships between different groups of execution agents by creating execution agent hierarchies.

Shared Parameters

So far, all the examples of tasks we've shown in the guide take a single argument: a reference to some type of execution agent. In general, the functions invoked by bulk_invoke() and bulk_async() may take any number of arguments. For instance, we could rewrite our saxpy() function to pass our arrays into the lambda as parameters rather than capture everything by value:

void verbose_saxpy(size_t n, float a, const float* x, const float* y, float* z)
{
  using namespace agency;

  bulk_invoke(par(n),
    [](parallel_agent& self,
       size_t n, float a, const float* x, const float* y, float* z)
  {
    int i = self.index();
    z[i] = a * x[i] + y[i];
  },
  n, a, x, y, z);
}

In this verbose_saxpy() function, instead of capturing the parameters of the SAXPY operation in the lambda, we've explicitly passed them through bulk_invoke() as additional parameters given to the lambda. This is a lot more verbose than simply capturing them, but the effect is exactly the same -- the parameters get replicated across the execution agents. Each agent receives its own private copy of all the parameters. Since it's so easy to just capture and copy variables with a lambda, there's usually no benefit to this kind of parameter passing.

Sometimes we don't want each agent to have its own private copy of something. Instead, we'd like all agents to share ownership of a single object. Shared parameters also pop up when a group of agents needs to communicate among themselves. Since every agent sharing a parameter can see and modify the same object, a shared parameter is a medium for communication. Because every agent is potentially able to observe and mutate a shared parameter at the same time, we usually need to take special care to synchronize access to it to ensure the program is free of data races.

Let's see how to use a shared vector of integers as a scratchpad while computing the sum of its elements:

int result = 0;

bulk_invoke(con(data.size()),
  [&](concurrent_agent& self, std::vector<int>& scratch)
{
  auto i = self.index();
  auto n = scratch.size();

  while(n > 1)
  {
    if(i < n/2)
    {
      scratch[i] += scratch[n - i - 1];
    }

    // wait for every agent in the group to reach this point
    self.wait();

    // cut the number of active agents in half
    n -= n/2;
  }

  // the first agent stores the result
  if(i == 0)
  {
    result = scratch[0];
  }
},
share<0>(data));

This program computes a reduction over the elements in the input data using a shared parameter named scratch. While the reduction executes, scratch acts as storage for partial sums. To create the scratch parameter, we pass a third parameter to the call to bulk_invoke() using the share() function. We tell bulk_invoke() to share a copy of the input data at the 0th level of the execution hierarchy (more on hierarchies later). Before the concurrent_agents begin executing, bulk_invoke() will create a copy of data and pass it to the lambda through the parameter named scratch.

The code implementing the reduction is pretty simple. The idea is that the agents proceed in phases while they sum up the array. During each phase, it's each active agent's task to compute a single partial sum at location i. To compute the sum, the agent "leap frogs" over the other active agents to find the array element which will be the right hand side of the sum:

scratch[i] += scratch[n - i - 1];

After storing the new partial sum, each agent has to wait for the entire phase to finish before the next phase can proceed:

self.wait();

This kind of waiting operation is called a barrier. Its purpose in our program is to ensure that there are no data races when reading and writing the scratch array. Without it, an agent could race through all its phases, computing sums before the operands were ready. Using a barrier ensures that no agent gets ahead of itself. concurrent_agent is special in this way -- no other type of execution agent has a barrier, because a barrier requires concurrency.

In each phase, half of the agents of the previous phase drop out, until there is only a single agent left which stores the total sum to scratch[0]. After the loop, it's also this agent's job to copy the total to the result variable.
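
To make the phases concrete, suppose n starts at 8. In the first phase, agents 0 through 3 compute the pair sums (0,7), (1,6), (2,5), and (3,4), leaving n = 4; in the second, agents 0 and 1 compute (0,3) and (1,2), leaving n = 2; in the final phase, agent 0 adds scratch[1] into scratch[0], which then holds the total.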

You may be wondering why we created a special shared parameter for scratch when we could have made the lambda capture an equivalent variable by reference. After all, this program captures the result parameter by reference, couldn't we have done the same for the scratch variable?

When we use a shared parameter, we're giving Agency important information about locality which allows bulk_invoke() to intelligently locate data used during the computation. When we reference capture some variable bar with a lambda, what happens under the covers is that a pointer to bar gets stored inside the lambda object. When agents use bar during the computation, they dereference this pointer to find it. Depending on what kind of agents we're dealing with, and where they are executing, this dereference operation could be expensive. This is especially true if we require repeated access to the variable, as we do in our sum example. Using a shared parameter allows Agency to place the parameter near the agents to ensure access and sharing is as efficient as possible.

TODO: redo this section when bulk_invoke() & bulk_async() are able to return single results. Doing the reduction example with bulk_async() will make it a lot more clear why shared parameters are necessary.

Continuations

TODO: describe bulk_then()

Execution Hierarchies

So far, all of the computational kernels we've written have looked a lot like dressed-up for loops. We define a loop body using a lambda, and then request some sort of ordering guarantee on the invocations of the loop body. Maybe we even execute the loop asynchronously if we're feeling adventurous. Lots of programming systems work this way.

What makes Agency unique is the ability to create execution hierarchies -- that is, multiple groups of execution agents with structured, hierarchical relationships. We've seen how we can reason about the internal relationships describing how execution agents execute within their own group. We can go a step further and reason about the external relationships between agent groups in a hierarchical fashion.

In order to understand how to structure the execution of our code hierarchically, let's revisit our concurrent ping pong match program. Before, we had a pair of concurrent execution agents simulate a single game of ping pong. This time, we'll simulate an entire ping pong tournament. To keep things simple, let's have our execution agents play two semifinal matches followed by a final match. In total, four players will compete in the tournament.

The basic idea of the program will be to create two separate pairs (groups) of concurrent execution agents. Not only will each pair of agents execute concurrently within their group, but their group as a whole will execute concurrently with respect to the other group. In other words, there will be two levels of concurrency -- inner and outer. Each of the two inner groups will begin by playing two separate ping pong matches. After playing the semifinal matches, the two inner groups will synchronize to communicate who won. Then, a single group of execution agents will play the final match.

Hierarchical Tasks

To allow us to focus our attention on the structure of the execution hierarchy, the code which implements the ping pong match is encapsulated within the ping_pong_match() function. ping_pong_tournament() is the function coordinating the high-level tasks of the tournament. Let's step through each part:

void ping_pong_tournament(concurrent_group<concurrent_agent>& self,
                          const vector<vector<string>>& semifinalists,
                          int num_volleys,
                          movable_mutex& mut,
                          vector<string>& finalists,
                          int& ball)

First, let's decode the function's parameters. Like all Agency tasks, the first parameter is a handle to the execution agent which executes the task. Unlike our previous examples, because we've arranged our agents into a hierarchy, the agent handle is itself hierarchical. The outer concurrent_group<> template is a handle to the outermost level of the execution hierarchy, which is a concurrent group of groups of agents. The inner concurrent_agent type is what we've seen before -- this is a handle to the innermost level of the execution hierarchy.

Next, semifinalists is a vector of vectors of strings naming the four contestants in the semifinals matches. This data structure is hierarchical just like the agent handle -- the inner strings are the names we've given to the four execution agents.

The num_volleys variable is just like what we saw in the previous example -- it's simply a limit on the number of ping pong iterations to execute before finishing the match.

mut is our first shared variable. We use it during the tournament to guard access to std::cout and ensure that when we stream output from the concurrent matches there is no data race. Otherwise, the printed output would be a jumbled mess. There's a single mutex for the entire tournament guarding std::cout.

finalists is a shared container which collects the names of the two players who will compete during the final match. Since there is only a single final match, there is only a single instance of this vector which holds two strings.

ball is our last shared variable which counts the number of volleys during each match. Unlike mut, there will actually be two instances of the ball variable during the tournament -- one per group of execution agents. Since there will be two semifinals matches playing concurrently, we need two separate ball variables to track their status. For the final match, we'll only use one of these variables.

Creating Execution Hierarchies

Before digging into the implementation of ping_pong_tournament(), let's look at how we bulk_invoke() it so we can see the details of how to create a hierarchy along with shared parameters:

bulk_invoke(con(2, con(2)),
            ping_pong_tournament,
            semifinalists,
            num_volleys,
            share<0,movable_mutex>(),
            share<0,vector<string>>(2),
            share<1,int>());

The first thing to notice is that the execution policy passed to bulk_invoke() is hierarchical -- we've wrapped one policy inside another. The outer con(2, ...) policy tells bulk_invoke() to create the outermost concurrent_group<...> with two groups nested inside. The inner con(2) policy gives the parameters of the inner group -- two concurrent_agents. The net result is a two-level hierarchy of two inner groups of concurrent agents executing concurrently at the outer level.

The next three parameters are given just like the previous examples we've seen. ping_pong_tournament is the name of the function to invoke, and semifinalists and num_volleys are parameters that become replicated as per-agent private variables within the task.

Hierarchies and Shared Parameters

The last three parameters are shared. The scope of sharing is given by the first template parameter to the share() function. For example, the first two calls to share() instruct bulk_invoke() to create a movable_mutex and a vector<string> to share across the outermost group of the execution hierarchy. These calls create the parameters mut and finalists; because they are shared at level 0, a single instance of each is created which all execution agents in the task may access.

The parameters to share<i,T>() are interpreted as constructor parameters. They are simply forwarded along to T's constructor. For example, when we call share<0,vector<string>>(2), this constructs a vector containing two empty strings. If we don't pass any parameters to share<i,T>(), bulk_invoke() just uses the default constructor.
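
As another illustration (hypothetical, not taken from the examples), share<0,std::string>(5, 'x') would create a single shared string containing "xxxxx", because the arguments 5 and 'x' are forwarded to std::string's constructor.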

The last call to share<1,int>() instructs bulk_invoke() to create an int for each group of agents at level 1 -- the inner level of the hierarchy. There are two of these, so bulk_invoke() creates two instances of the ball parameter.

Hierarchy Coordination

Now let's step into the implementation of ping_pong_tournament(). The first thing we do is announce the beginning of the semifinal matches. We only want one announcement, so we need to elect a single agent to act as the announcer. We'll choose agent 0 of group 0 arbitrarily. To ensure that the announcement happens before the semifinals begin, we need to make everyone wait. Here's how to do it:

if(self.inner().index() == 0)
{
  if(self.outer().index() == 0)
  {
    std::cout << "Starting semifinal matches..." << std::endl;
  }

  self.outer().wait();
}

self.inner().wait();

Global Synchronization

We use .inner().index() and .outer().index() to access the agent's inner and outer indices, respectively, and compare them to 0 to find the zeroth agent of the zeroth group. After announcing the semifinals, we need to synchronize all the agents before the matches begin. To synchronize globally across all execution agents, we perform a hierarchical barrier by first having agent 0 of each group wait on the outer group with .outer().wait(). Then, we have every agent wait on the agents within their own inner group with .inner().wait(). The effect is that all agents globally synchronize.
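
This two-step pattern is worth remembering. As a sketch, it could be wrapped in a small helper function of our own (barrier_all is a hypothetical name, not part of Agency; it assumes using namespace agency, as in the example):

// hierarchical barrier: globally synchronize a two-level concurrent hierarchy
void barrier_all(concurrent_group<concurrent_agent>& self)
{
  // one representative per inner group synchronizes at the outer level
  if(self.inner().index() == 0)
  {
    self.outer().wait();
  }

  // then every agent synchronizes within its own inner group
  self.inner().wait();
}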

Modularity

After the announcement, we invoke the ping_pong_match() function to execute the semifinals matches:

auto semifinal_winner_idx =
  ping_pong_match(self.inner(),
                  semifinalists[self.outer().index()],
                  num_volleys,
                  mut,
                  ball);

The first parameter of the ping_pong_match() function is a handle to a concurrent_agent participating in the match. Because the two matches are independent of one another, they don't require any information about the organization of the execution hierarchy above the innermost groups. When we call self.inner(), we temporarily discard this extra information and pass a reference to the individual inner concurrent_agents. Structuring our code this way means we can build modular, general purpose functions like ping_pong_match() that don't require global visibility of the entire execution hierarchy. ping_pong_match()'s special first parameter ensures that we can't mistakenly call it without presenting proof that the right type of execution agent is currently executing.

After executing the semifinals matches, we need to determine the names of the winners and record them in the finalists array. There is one winner per match, so we have both agent 0s of the inner groups record the names of the finalists. At the same time, we perform another global barrier before beginning the final match:

if(self.inner().index() == 0)
{
  // the first agent of each group reports who won the semifinal match
  auto semifinal_winner =
    semifinalists[self.outer().index()][semifinal_winner_idx];

  finalists[self.outer().index()] = semifinal_winner;

  // have the first player of each group wait for the other group
  self.outer().wait();
}

// have each inner group wait for each other
self.inner().wait();

Finally, we'll execute the final match. Because there is only one final match, only one group of execution agents needs to participate. We'll nominate group 0 for this job -- the other group can exit the task. We'll announce the final match just like before and ensure that the announcement happens before the match begins:

// group 0 plays the final match while group 1 sits it out
if(self.outer().index() == 0)
{
  // agent 0 of group 0 announces the final match
  if(self.inner().index() == 0)
  {
    std::cout << std::endl << finalists[0] << " and " << finalists[1] << " starting the final match..." << std::endl;
  }

  // wait until agent 0 makes the announcement before starting the final match
  self.inner().wait();

  // play the final match
  auto final_winner_idx =
    ping_pong_match(self.inner(), finalists, num_volleys, mut, ball);

  // agent 0 of group 0 reports the winner
  if(self.inner().index() == 0)
  {
    std::cout << std::endl << finalists[final_winner_idx] << " is the tournament champion!" << std::endl;
  }
}

This bit of code uses the same principles just demonstrated to execute and synchronize one last ping pong match.

Try the program yourself! Here's the output on my system:

$ g++ -std=c++11 -I. -pthread examples/ping_pong_tournament.cpp -o ping_pong_tournament
$ ./ping_pong_tournament 
Starting semifinal matches...
pong waiting for return
foo! ball is now 1
ping! ball is now 1
bar waiting for return
pong! ball is now 2
foo waiting for return
ping! ball is now 3
bar! ball is now 2
pong! ball is now 4
foo! ball is now 3
ping! ball is now 5
bar waiting for return
pong! ball is now 6
foo waiting for return
ping! ball is now 7
bar! ball is now 4
pong! ball is now 8
whiff... foo loses!
ping! ball is now 9
pong! ball is now 10
ping! ball is now 11
whiff... pong loses!

ping and bar starting the final match...
bar waiting for return
ping! ball is now 1
bar! ball is now 2
ping waiting for return
bar waiting for return
ping! ball is now 3
bar! ball is now 4
ping! ball is now 5
bar! ball is now 6
ping! ball is now 7
bar! ball is now 8
ping waiting for return
bar waiting for return
whiff... ping loses!
bar wins!

bar is the tournament champion!

You can see that the output from the two semifinal matches is interleaved. This shouldn't be surprising, because both semifinal matches execute concurrently.

Why Hierarchy Matters

We've seen that organizing our computation into a hierarchical structure helps ensure that our code is correct by embedding scopes of sharing and synchronization into the C++ type system. Calling a function that demands a particular variety of execution agent requires that we present adequate identification. In Agency, there is no way to forge a counterfeit ID. The structure execution hierarchies impose on our code encourages correctness.

At the same time, execution hierarchies are important tools that help us achieve performance. Manufacturers tend to deliver increased hardware performance in each generation by building wider computational units and deeper memory hierarchies. We observe these changes as increased core counts, wider vector widths, and NUMA memories. Each individual processor core may be grouped together with a number of SIMD lanes or on-chip memories. Hardware architectures are naturally hierarchical.

The performance of an application often depends critically on the existence of a good mapping of the computation onto the target architecture. To enable this mapping, Agency allows the programmer to expose the hierarchical organization of the execution agents involved. Because hardware architectures have a similar hierarchical organization, there's often a straightforward mapping between the two.

Matrix Transpose

To see this principle in action, let's look at an example of mapping a computation onto a GPU. NVIDIA GPUs are organized into a two-level hierarchy: a GPU is a collection of processing cores each containing a number of finer-grained processing units on which threads execute. For our application, we'd like to take a square, two-dimensional matrix and transpose it. One way to approach this problem is to break the matrix into smaller tiles and transpose each of them in parallel.

Here's how to do it with Agency:

agency::cuda::future<void> async_square_transpose(size_t matrix_dim,
                                                  const float* input_matrix,
                                                  float* transposed_matrix)
{
  using namespace agency;

  int tile_dim = 32;
  int num_rows_per_group = 8;

  size2 outer_shape{matrix_dim/tile_dim, matrix_dim/tile_dim};
  size2 inner_shape{tile_dim, num_rows_per_group};

  return cuda::bulk_async(grid(outer_shape, inner_shape),
    [=] __device__ (parallel_group_2d<cuda::concurrent_agent_2d>& self)
  {
    auto idx = tile_dim * self.outer().index() + self.inner().index();

    for(int j = 0; j < tile_dim; j += num_rows_per_group)
    {
      transposed_matrix[idx[0]*matrix_dim + (idx[1]+j)] =
        input_matrix[(idx[1]+j)*matrix_dim + idx[0]];
    }
  });
}

The async_square_transpose() function receives a square input_matrix and returns its transpose through the transposed_matrix parameter. The idea of the implementation is to break the input into square, 32 by 32 tiles, and create a group of concurrent execution agents for each. Because our problem is naturally two-dimensional, we'll create execution agents that have two-dimensional indices.

We need to create one group of execution agents per tile, so the outer_shape of the agent hierarchy is a square matrix_dim/tile_dim on a side. Rather than create an inner execution agent per element of the tile, we'll create rectangular inner groups with dimensions {tile_dim, num_rows_per_group}, where num_rows_per_group is chosen arbitrarily to be 8. This means that each execution agent will perform several matrix element copies instead of just one.
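
For instance, a 1024 by 1024 matrix would produce an outer shape of {32, 32} -- one inner group per tile -- with each inner group containing 32 * 8 = 256 agents, and each agent performing 32 / 8 = 4 element copies.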

To create our execution hierarchy, we use the grid() execution policy:

cuda::bulk_async(grid(outer_shape, inner_shape), lambda);

The expression

grid(outer_shape, inner_shape)

is shorthand for

cuda::par(outer_shape, cuda::con(inner_shape))

and creates a "grid" of CUDA threads. In this case, because both the outer_shape and inner_shape are two-dimensional, the execution agents created by grid() are two-dimensional as well.

The execution agent parameter is

parallel_group_2d<cuda::concurrent_agent_2d>& self

whose type distinguishes it from the one-dimensional agents we've seen in previous examples. It works just like what we've seen previously, except that its .index() and .group_shape() functions return two-dimensional values.

The body of the lambda is pretty simple. It's each agent's job to copy tile_dim/num_rows_per_group matrix elements to the output, while transposing the element's position along the way. To compute where to begin copying, we first compute the origin of the group's tile:

auto idx = tile_dim * self.outer().index() ...

and add an offset for the individual agent:

... + self.inner().index()

The result is the two-dimensional index idx.
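
For example, with these parameters, the agent at inner index {3, 5} in outer group {1, 2} computes idx = 32 * {1, 2} + {3, 5} = {35, 69}. Its four loop iterations then write the elements of transposed_matrix at row 35, columns 69, 77, 85, and 93, reading the corresponding elements from rows 69, 77, 85, and 93 of column 35 of input_matrix.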

Executors

TODO

  • demonstrate .on(exec) syntax
  • show how we can write a single kernel and move it around using .on(...)
  • show how to make a really simple user-defined executor based on OpenMP