Quick Start Guide
Agency is a C++ template library for parallel programs based on a modular abstraction of execution. Agency allows you to build parallel algorithms using programming primitives that, while low-level, remain portable across different platforms.
This document describes how to develop parallel algorithms with Agency. The tutorial is intended to be accessible, even if you have limited C++ or parallel programming experience.
Agency is compatible with recent compilers that support C++11 or later.
Let's compile a simple program with Agency to ensure that all prerequisites are satisfied. Save the following source code to a file named version.cpp.
#include <agency/version.hpp>
#include <iostream>

int main()
{
  int major = AGENCY_MAJOR_VERSION;
  int minor = AGENCY_MINOR_VERSION;

  std::cout << "Agency v" << major << "." << minor << std::endl;

  return 0;
}
Now compile version.cpp with g++. If Agency is in the current directory, the following commands should work:
$ ls
agency version.cpp
$ g++ -std=c++11 -I. version.cpp -o version
$ ./version
Agency v0.0
If the top-level agency directory is placed somewhere else, use the -I option to tell g++ where to look. For example, if agency was placed in ~/include, then the following command should be used:
$ g++ -std=c++11 -I ~/include version.cpp -o version
Agency may also be compiled with nvcc to create programs which may execute on GPUs. In this case, a .cu file extension should be used:
$ nvcc -std=c++11 -I. version.cu -o version
Alternatively, we can tell nvcc to treat the .cpp file as a CUDA source file:
$ nvcc -std=c++11 -I. -x cu version.cpp -o version
Other C++11 compilers, such as clang and icc, are also known to work with Agency.
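For example, an equivalent clang invocation might look like the following (this command line is an illustration rather than one taken from Agency's documentation):
$ clang++ -std=c++11 -I. version.cpp -o version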
When programming algorithms with Agency, you create groups of execution agents and have them execute tasks. In Agency, an execution agent is a stand-in for some underlying physical thing -- maybe a thread, a SIMD lane, or a process -- which actually executes our task of interest. Programming algorithms in terms of execution agents instead of something really specific -- like threads -- makes our programs portable across different kinds of hardware platforms.
In Agency, a task is simply any function whose first parameter is some type of execution agent:
void hello(agency::sequenced_agent& self)
{
  std::cout << "Hello, world from agent " << self.index() << std::endl;
}
The type of execution agent this hello() task requires is a sequenced_agent. This type of agent executes our task in sequence, that is, in order with the other sequenced_agents in its group. The reason we create agents in groups is so that we can coordinate their efforts. Imposing the structure of a group onto our collections of agents makes it easier for us to reason about their behavior, and also makes it easier for Agency to map the agents onto the underlying system.
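A task can also ask its agent about the group it belongs to. Here's a small sketch of a task that reports both its index and its group's shape; it assumes that one-dimensional agents expose .group_shape() the same way the two-dimensional agents shown later in this guide do:
void report(agency::sequenced_agent& self)
{
  // index() identifies this agent within its group, while
  // group_shape() describes how many agents the group contains
  std::cout << "Agent " << self.index() << " of " << self.group_shape() << std::endl;
}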
Now that we know how to define a task, how do we actually create execution agents to execute it? That's the job of functions like bulk_invoke(). The following line of code calls bulk_invoke() to create a group of ten sequenced_agents which invoke the hello() function in bulk:
agency::bulk_invoke(agency::seq(10), hello);
Try it yourself! Compile and execute the hello_world.cpp example program found in Agency's examples directory:
$ g++ -std=c++11 -I. examples/hello_world.cpp -o hello_world
$ ./hello_world
Hello, world from agent 0
Hello, world from agent 1
Hello, world from agent 2
Hello, world from agent 3
Hello, world from agent 4
Hello, world from agent 5
Hello, world from agent 6
Hello, world from agent 7
Hello, world from agent 8
Hello, world from agent 9
You can see that the hello() task was invoked ten times -- once for each sequenced_agent in the group we created with bulk_invoke().
If you've programmed with C++ recently, you know that lambda functions are a nice way to define little functions like hello() inline, at their point of use, without having to stick them in some faraway place in the code. Lambdas are nice when the function you want to call will only be called once -- just like in our hello world program. Instead of defining a named function hello(), we could have used a lambda with bulk_invoke() directly:
agency::bulk_invoke(agency::seq(10), [](agency::sequenced_agent& self)
{
  std::cout << "Hello world from agent " << self.index() << std::endl;
});
You can read more about C++ lambda functions on Wikipedia.
Of course, Agency is fully interoperable with lambdas -- here's a complete Agency hello world program which uses a lambda function to define our task instead of a named function like hello():
#include <agency/execution_policy.hpp>
#include <iostream>

int main()
{
  // create 10 sequenced_agents to greet us in bulk
  agency::bulk_invoke(agency::seq(10), [](agency::sequenced_agent& self)
  {
    std::cout << "Hello, world from agent " << self.index() << std::endl;
  });

  return 0;
}
For the rest of the guide, unless there's a reason not to, we'll stick to using lambdas.
You may have noticed that in our Hello World program we passed bulk_invoke() two parameters:
agency::bulk_invoke(agency::seq(10), task);
The second parameter was the lambda function defining the task we wanted to execute, but what does seq(10) mean? You may have guessed from the 10 that this thing has something to do with the number of execution agents bulk_invoke() will create.
In fact, seq is a type of execution policy describing the group of execution agents to create, and their relationship to one another within the group. Execution policies describe the size of the group of agents, as well as their execution ordering semantics. In this case, the seq execution policy tells bulk_invoke() to create a group of sequenced_agents that will execute in sequence, or one after the other. However, there are other possibilities.
Sometimes we don't care so much about the order in which agents execute. For example, suppose we want to write a really simple program that just fills an array with some number. All we care about is that the array gets filled with the correct value -- the order in which array elements are accessed just doesn't matter. We could start setting elements from the beginning and walk through the array to the end, or we could start from the end and walk backwards, or we could go in a totally random order -- or even no order at all. At the end of the day, the result will be the same.
An easy way to use Agency to do this would be to create a whole bunch of execution agents -- one agent for each element in the array. The task for each agent would simply be to store the number into the corresponding array element:
// create an array
std::vector<int> array(100);

// create a group of parallel agents -- one agent per array element
agency::bulk_invoke(agency::par(array.size()), [&array](agency::parallel_agent& self)
{
  // to find the corresponding array element, use self.index()
  array[self.index()] = 42;
});
This program uses the par execution policy to create a group of parallel agents to fill the array. In Agency, parallel is a weaker form of execution than sequenced. Basically, par weakens the ordering guarantees that we require of our agents and allows Agency to schedule execution agents onto threads. This means that if Agency happens to schedule execution agents on different threads, they will execute in an unordered fashion. Since the relaxed ordering guarantees of parallelism allow us to use more than one thread in our program, there's a chance that the program will execute faster than it would otherwise -- which is the whole point of introducing parallelism in the first place!
When writing Agency programs, it's useful to distinguish types of execution agents from one another based on their abilities and what assumptions they make regarding execution. You might have noticed that our latest program uses a different type of execution agent than we used in our Hello World programs -- parallel_agent instead of sequenced_agent. When we use an execution policy with bulk_invoke(), we need to be sure that the type of execution agent in our lambda matches. This is to make sure that our lambda doesn't violate any guarantees made by the execution policy. In other words, when we ask Agency for parallel execution agents, our lambda promises not to try to do anything a parallel_agent can't do. We'll see when that matters in the next section.
Since Agency is allowed to use threads when it sees par, we need to build our program in the correct manner. When compiling with g++, this means we should link against the pthread library:
$ g++ -std=c++11 -I. examples/fill.cpp -pthread -o fill
$ ./fill
OK
Now that we know how to program very simple parallel algorithms with Agency, let's use it to implement a computation. SAXPY is a simple vector arithmetic kernel that many folks interested in high performance computing will be familiar with. Let's use bulk_invoke() to implement it:
void saxpy(size_t n, float a, const float* x, const float* y, float* z)
{
  using namespace agency;

  bulk_invoke(par(n), [=](parallel_agent& self)
  {
    int i = self.index();
    z[i] = a * x[i] + y[i];
  });
}
Our SAXPY program works a lot like our previous fill example. For each element in the output array, we create a parallel_agent whose only task is to compute a single element of the result.
So how fast is this program? Let's measure its achieved memory bandwidth by calling saxpy() several times in a loop and dividing the size of the input and output by the mean time it takes to execute:
std::vector<float> x(n, 1), y(n, 2), z(n);
float a = 13.;

size_t num_trials = 20;

auto start = std::chrono::high_resolution_clock::now();
for(size_t i = 0; i < num_trials; ++i)
{
  saxpy(n, a, x.data(), y.data(), z.data());
}
std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start;

double seconds = elapsed.count() / num_trials;
double gigabytes = double(3 * n * sizeof(float)) / (1 << 30);
double bandwidth = gigabytes / seconds;
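The computed bandwidth can then be printed; for example (this particular line is illustrative, not copied from examples/saxpy.cpp):
std::cout << "SAXPY Bandwidth: " << bandwidth << " GB/s" << std::endl;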
On my system, saxpy's bandwidth is around 12 GB/s:
$ g++ -std=c++11 -I. -O3 -pthread examples/saxpy.cpp -o saxpy
$ ./saxpy
Measuring performance...
SAXPY Bandwidth: 11.8806 GB/s
Can we make it go any faster? One of the benefits of Agency is that it allows us to take our parallel program and port it easily to other platforms. My system happens to have an NVIDIA Tesla GPU inside, which should have quite a bit more memory bandwidth available to it. Let's target it with our saxpy program by making a few changes.
The most important modification to make is to change the execution policy the saxpy() function uses when calling bulk_invoke():
void saxpy(size_t n, float a, const float* x, const float* y, float* z)
{
  using namespace agency;

  bulk_invoke(cuda::par(n), [=] __device__ (parallel_agent& self)
  {
    int i = self.index();
    z[i] = a * x[i] + y[i];
  });
}
Instead of simply par, we've used cuda::par to create our execution agents. cuda::par is an execution policy just like regular par, except it gives Agency special license to use CUDA threads that execute on the GPU. The other minor change is to annotate our lambda function with the __device__ keyword. This tells the NVIDIA compiler to compile the lambda in a special way to allow it to execute on a GPU device.
The other change to our program is to move where the data is stored. NVIDIA GPUs can't access normal system memory without special programmer intervention. To keep things simple, we should put our vectors in the GPU's memory. In our new program, we've replaced our use of vector with Thrust's device_vector, but you could use any allocation scheme you like.
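A minimal sketch of that data setup is shown below. It assumes Thrust is available alongside the CUDA toolkit and reuses the n and a values from the timing code above; the details differ from the actual examples/saxpy.cu, but the idea is the same -- allocate the vectors in GPU memory and hand raw device pointers to saxpy():
#include <thrust/device_vector.h>

// allocate the vectors directly in GPU memory
thrust::device_vector<float> x(n, 1), y(n, 2), z(n);
float a = 13.;

// raw_pointer_cast converts thrust::device_ptr<float> into the
// plain float* pointers our __device__ lambda expects
saxpy(n, a, thrust::raw_pointer_cast(x.data()),
            thrust::raw_pointer_cast(y.data()),
            thrust::raw_pointer_cast(z.data()));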
Let's compile our new program with the NVIDIA compiler and see how it runs:
$ nvcc -std=c++11 -expt-extended-lambda -O3 -I. examples/saxpy.cu -o gpu_saxpy
$ ./gpu_saxpy
Measuring performance...
SAXPY Bandwidth: 129.432 GB/s
It's around ten times faster! Of course, the comparison isn't completely fair -- we had to move the data to the GPU, which would have non-negligible cost in a real world application. And anyway, who cares how fast SAXPY can go? That's not really the point. What you should take away from this exercise is that Agency gives you the flexibility to move execution around in a modular way when there is some benefit in doing so.
Part of the power of structuring our execution agents into groups is that it allows us to coordinate their efforts. Coordinating execution agents often requires communication -- that is, the ability to send and receive messages between agents within the same group. Communication requires that agents execute at the same time. That is, we require a guarantee of concurrency.
Neither seq nor par guarantees this. Sequenced agents are guaranteed not to be concurrent, because they execute their tasks in order. Parallel agents, on the other hand, may or may not execute concurrently. For example, a parallel execution agent A may happen to be scheduled on thread A and another parallel execution agent B may happen to be scheduled on thread B. By pure luck, the two parallel agents may happen to be scheduled concurrently, but we can't rely on it.
Writing correct programs that rely on the concurrent execution of agents requires a strong guarantee of concurrency. This is what the con execution policy guarantees. Let's use con to create some concurrent_agents to play a game of ping pong:
int ball = 0;
std::string names[2] = {"ping", "pong"};
std::mutex mut;

// create two concurrent agents
agency::bulk_invoke(agency::con(2), [&](agency::concurrent_agent& self)
{
  auto name = names[self.index()];

  // play for 20 volleys
  for(int next_state = self.index(); next_state < 20; next_state += 2)
  {
    // wait for the next volley
    while(ball != next_state)
    {
      mut.lock();
      std::cout << name << " waiting for return" << std::endl;
      mut.unlock();

      std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    // return the ball
    mut.lock();
    ball += 1;
    std::cout << name << "! ball is now " << ball << std::endl;
    mut.unlock();

    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
});
The two concurrent_agents, which we've named ping and pong, communicate through the ball variable to let each other know when it is their turn to update its value. After an agent updates ball, it waits in a while loop for the next state. Because the agents share access to the ball variable, they use another mutex variable to get exclusive access. Otherwise, ping could update ball while pong is trying to look at its value, or vice versa.
The exact output of the program is non-deterministic and depends on how the underlying platform decides to schedule the activity of the two concurrent_agents. Sometimes one agent may need to wait a little while for the other to return the ball. However, because we've used the con execution policy to create these agents, we're guaranteed that they will execute concurrently. Try it out yourself to watch the game unfold.
Here's how one game played out on my system:
$ g++ -std=c++11 -I. -pthread examples/concurrent_ping_pong.cpp -o concurrent_ping_pong
$ ./concurrent_ping_pong
pong waiting for return
ping! ball is now 1
pong! ball is now 2
ping! ball is now 3
pong! ball is now 4
ping waiting for return
pong waiting for return
ping! ball is now 5
pong! ball is now 6
ping! ball is now 7
pong! ball is now 8
ping! ball is now 9
pong! ball is now 10
ping! ball is now 11
pong! ball is now 12
ping! ball is now 13
pong! ball is now 14
ping waiting for return
pong waiting for return
ping! ball is now 15
pong! ball is now 16
ping waiting for return
ping! ball is now 17
pong waiting for return
pong! ball is now 18
ping waiting for return
ping! ball is now 19
pong waiting for return
pong! ball is now 20
So what would have happened if we tried to play this game with either sequenced or parallel agents? In the case of sequenced_agents, we can be sure that the game would have never gotten started. The first sequenced_agent, ping, would have waited forever for pong's first return. Since the agents must execute in sequence, only one is allowed on the court at a time.
For parallel_agents, it's hard to say how the game would play out without trying it since Agency has a lot of leeway to schedule parallelism. Here's what happened on my system:
$ g++ -std=c++11 -I. -pthread parallel_ping_pong.cpp -o parallel_ping_pong
$ ./parallel_ping_pong
ping! ball is now 1
ping waiting for return
ping waiting for return
ping waiting for return
ping waiting for return
ping waiting for return
ping waiting for return
ping waiting for return
ping waiting for return
...
Apparently, both ping and pong happened to be scheduled in sequence. pong never made it onto the court. When programming algorithms like our ping pong game, where one or more agents need to be able to block the forward progress of another, it's critical that we create concurrency to be sure our program has a chance of executing correctly.
For Future Reference: The execution policies we've covered so far illustrate a few of the basic different ways in which agents may execute. In addition to these simple policies, Agency provides policies which can create agents which execute on a GPU, agents which execute on SIMD lanes, and even tools for creating your own policies. We'll revisit these topics later in this guide.
Up to this point in the guide, we've focused on using bulk_invoke() to create a group of agents to invoke a function en masse. So far, our usage of bulk_invoke() works a lot like a for loop -- when we call bulk_invoke() or begin executing the iterations of a for loop, we have to wait until everything finishes before we can proceed. This is because both of these constructs are synchronous -- the caller of bulk_invoke() synchronizes with the completion of the execution agents created inside.
But sometimes we don't want to wait. For example, maybe we have a few different tasks we want to start working on. If these tasks don't depend on each other, there's really no reason to wait for one to finish before starting the next. In this case, we desire asynchrony, which gives us the flexibility to wait on the completion of a task at our leisure. In other words, asynchrony allows us to decouple execution agent creation from synchronization.
In Agency, we use the bulk_async() function to create a group of execution agents to invoke a function in bulk, asynchronously. This program creates two Hello World tasks asynchronously:
using namespace agency;

std::cout << "Starting two tasks asynchronously..." << std::endl;

std::mutex mut;

// asynchronously create 5 agents to greet us in bulk
auto f1 = bulk_async(par(5), [&](parallel_agent& self)
{
  mut.lock();
  std::cout << "Hello, world from agent " << self.index() << " in task 1" << std::endl;
  mut.unlock();
});

// asynchronously create 5 agents to greet us in bulk
auto f2 = bulk_async(par(5), [&](parallel_agent& self)
{
  mut.lock();
  std::cout << "Hello, world from agent " << self.index() << " in task 2" << std::endl;
  mut.unlock();
});

std::cout << "Sleeping before waiting on the tasks..." << std::endl;
std::this_thread::sleep_for(std::chrono::seconds(1));
std::cout << "Woke up, waiting for the tasks to complete..." << std::endl;

// wait for tasks 1 & 2 to complete before continuing
f1.wait();
f2.wait();

std::cout << "OK" << std::endl;
bulk_async() is just like bulk_invoke() except that it is asynchronous. While the agents execute, the caller of bulk_async() is free to do other things, for example, make more calls to bulk_async(). It's up to the caller to decide when to synchronize through the future object bulk_async() returns as a result. In our program above, these are the calls to f1.wait() and f2.wait(). The future's .wait() function causes the caller to wait until the execution agents associated with this future are finished executing before proceeding.
Because we see par used in the program above, we know that the execution agents created by the calls to bulk_async() may execute in parallel within the two groups. But what is the relationship between the two groups? When we use bulk_async(), we're signalling that a task is immediately ready to execute and has no dependency on anything else. For our example, this means that the two groups of execution agents created by bulk_async() may execute in parallel since they don't depend on each other. On my system, when I ran this program, everything happened in sequence:
$ g++ -std=c++11 -I. -pthread examples/hello_async.cpp -o hello_async
$ ./hello_async
Starting two tasks asynchronously...
Sleeping before waiting on the tasks...
Woke up, waiting for the tasks to complete...
Hello, world from agent 0 in task 1
Hello, world from agent 1 in task 1
Hello, world from agent 2 in task 1
Hello, world from agent 3 in task 1
Hello, world from agent 4 in task 1
Hello, world from agent 0 in task 2
Hello, world from agent 1 in task 2
Hello, world from agent 2 in task 2
Hello, world from agent 3 in task 2
Hello, world from agent 4 in task 2
OK
As you can see, the two groups of agents didn't even begin executing until the calls to .wait() -- and that's OK. As long as the agents are finished executing after calling .wait(), that's all that matters.
For Future Reference: Later in this guide, we'll see how to get fine-grained control over the ordering relationships between different groups of execution agents by creating execution agent hierarchies.
So far, all the examples of tasks we've shown in the guide take a single argument: a reference to some type of execution agent. In general, the functions invoked by bulk_invoke() and bulk_async() may take any number of arguments. For instance, we could rewrite our saxpy() function to pass our arrays into the lambda as parameters rather than capture everything by value:
void verbose_saxpy(size_t n, float a, const float* x, const float* y, float* z)
{
  using namespace agency;

  bulk_invoke(par(n),
    [](parallel_agent& self,
       size_t n, float a, const float* x, const float* y, float* z)
    {
      int i = self.index();
      z[i] = a * x[i] + y[i];
    },
    n, a, x, y, z);
}
In this verbose_saxpy() function, instead of capturing the parameters of the SAXPY operation in the lambda, we've explicitly passed them through bulk_invoke() as additional parameters given to the lambda. This is a lot more verbose than simply capturing them, but the effect is exactly the same -- the parameters get replicated across the execution agents. Each agent receives its own private copy of all the parameters. Since it's so easy to just capture and copy variables with a lambda, there's usually no benefit to this kind of parameter passing.
Sometimes we don't want each agent to have its own private copy of something. Instead, we'd like all agents to share ownership of a single object. Shared parameters also pop up when a group of agents needs to communicate among themselves. Since every agent sharing a parameter can see and modify the same object, a shared parameter is a medium for communication. Because every agent is potentially able to observe and mutate shared parameters at the same time, we usually need to take special care to synchronize their access to it to ensure the program is free of data races.
Let's see how to use a shared vector of integers as a scratchpad while computing the sum of its elements:
int result = 0;

bulk_invoke(con(data.size()),
  [&](concurrent_agent& self, std::vector<int>& scratch)
  {
    auto i = self.index();
    auto n = scratch.size();

    while(n > 1)
    {
      if(i < n/2)
      {
        scratch[i] += scratch[n - i - 1];
      }

      // wait for every agent in the group to reach this point
      self.wait();

      // cut the number of active agents in half
      n -= n/2;
    }

    // the first agent stores the result
    if(i == 0)
    {
      result = scratch[0];
    }
  },
  share<0>(data));
This program computes a reduction over the elements in the input data using a shared parameter named scratch. While the reduction executes, scratch acts as storage for partial sums. To create the scratch parameter, we pass a third parameter to the call to bulk_invoke() using the share() function. We tell bulk_invoke() to share a copy of the input data at the 0th level of the execution hierarchy (more on hierarchies later). Before the concurrent_agents begin executing, bulk_invoke() will create a copy of data and pass it to the lambda through the parameter named scratch.
The code implementing the reduction is pretty simple. The idea is that the agents proceed in phases while they sum up the array. During each phase, it's each active agent's task to compute a single partial sum at location i. To compute the sum, the agent "leap frogs" over the other active agents to find the array element which will be the right hand side of the sum:
scratch[i] += scratch[n - i - 1];
After storing the new partial sum, each agent has to wait for the entire phase to finish before the next phase can proceed:
self.wait();
This kind of waiting operation is called a barrier. Its purpose in our program is to ensure that there are no data races when reading and writing the scratch array. Without it, an agent could race through all its phases, computing sums before the operands were ready. Using a barrier ensures that no agent gets ahead of itself. concurrent_agent is special in this way -- no other type of execution agent has a barrier, because barriers require concurrency.
In each phase, half of the agents of the previous phase drop out, until there is only a single agent left, which stores the total sum to scratch[0]. After the loop, it's also this agent's job to copy the total to the result variable.
You may be wondering why we created a special shared parameter for scratch when we could have made the lambda capture an equivalent variable by reference. After all, this program captures the result variable by reference -- couldn't we have done the same for the scratch variable?
When we use a shared parameter, we're giving Agency important information about locality, which allows bulk_invoke() to intelligently locate data used during the computation. When we reference-capture some variable bar with a lambda, what happens under the covers is that a pointer to bar gets stored inside the lambda object. When agents use bar during the computation, they dereference this pointer to find it. Depending on what kind of agents we're dealing with, and where they are executing, this dereference operation could be expensive. This is especially true if we require repeated access to the variable, as we do in our sum example. Using a shared parameter allows Agency to place the parameter near the agents to ensure access and sharing are as efficient as possible.
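To make the contrast concrete, here's a sketch of the reference-capture alternative. It computes the same answer, but every access to scratch now goes through a pointer captured inside the lambda rather than through storage Agency placed for the group (the code below is illustrative and not taken from Agency's examples):
// make our own copy of the data outside the lambda...
std::vector<int> scratch = data;
int result = 0;

bulk_invoke(con(data.size()),
  // ...and capture it by reference instead of using share<0>()
  [&](concurrent_agent& self)
  {
    auto i = self.index();
    auto n = scratch.size();

    while(n > 1)
    {
      if(i < n/2)
      {
        scratch[i] += scratch[n - i - 1];
      }

      self.wait();

      n -= n/2;
    }

    if(i == 0)
    {
      result = scratch[0];
    }
  });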
TODO: redo this section when bulk_invoke() & bulk_async() are able to return single results. Doing the reduction example with bulk_async() will make it a lot more clear why shared parameters are necessary.
TODO: describe bulk_then()
So far, all of the computational kernels we've written have looked a lot like dressed-up for loops. We define a loop body using a lambda, and then request some sort of ordering guarantee on the invocations of the loop body. Maybe we even execute the loop asynchronously if we're feeling adventurous. Lots of programming systems work this way.
What makes Agency unique is the ability to create execution hierarchies -- that is, multiple groups of execution agents with structured, hierarchical relationships. We've seen how we can reason about the internal relationships describing how execution agents execute within their own group. We can go a step further and reason about the external relationships between agent groups in a hierarchical fashion.
In order to understand how to structure the execution of our code hierarchically, let's revisit our concurrent ping pong match program. Before, we had a pair of concurrent execution agents simulate a single game of ping pong. This time, we'll simulate an entire ping pong tournament. To keep things simple, let's have our execution agents play two semifinal matches followed by a final match. In total, four players will compete in the tournament.
The basic idea of the program will be to create two separate pairs (groups) of concurrent execution agents. Not only will each pair of agents execute concurrently within their group, but their group as a whole will execute concurrently with respect to the other group. In other words, there will be two levels of concurrency -- inner and outer. Each of the two inner groups will begin by playing two separate ping pong matches. After playing the semifinals matches, the two inner groups will synchronize to communicate who won. Then, a single group of execution agents will play the final match.
To allow us to focus our attention on the structure of the execution hierarchy, the code which implements the ping pong match is encapsulated within the ping_pong_match() function. ping_pong_tournament() is the function coordinating the high-level tasks of the tournament. Let's step through each part:
void ping_pong_tournament(concurrent_group<concurrent_agent>& self,
                          const vector<vector<string>>& semifinalists,
                          int num_volleys,
                          movable_mutex& mut,
                          vector<string>& final_contestants,
                          int& ball)
First, let's decode the function's parameters. Like all Agency tasks, the first parameter is a handle to the execution agent which executes the task. Unlike our previous examples, because we've arranged our agents into a hierarchy, the agent handle is itself hierarchical. The outer concurrent_group<> template is a handle to the outermost level of the execution hierarchy, which is a concurrent group of groups of agents. The inner concurrent_agent type is what we've seen before -- it's a handle to the innermost level of the execution hierarchy.
Next, semifinalists is a vector of vectors of strings naming the four contestants in the semifinals matches. This data structure is hierarchical just like the agent handle -- the inner strings are the names we've given to the four execution agents.
The num_volleys variable is just like what we saw in the previous example -- it's simply a limit on the number of ping pong iterations to execute before finishing the match.
mut is our first shared variable. We use it during the tournament to guard access to std::cout and ensure that there is no data race when we stream output from the concurrent matches. Otherwise, the printed output would be a jumbled mess. There's a single mutex for the entire tournament guarding std::cout.
finalists is a shared container which collects the names of the two players who will compete during the final match. Since there is only a single final match, there is only a single instance of this vector, which holds two strings.
ball is our last shared variable which counts the number of volleys during each match. Unlike mut, there will actually be two instances of the ball variable during the tournament -- one per group of execution agents. Since there will be two semifinals matches playing concurrently, we need two separate ball variables to track their status. For the final match, we'll only use one of these variables.
Before digging into the implementation of ping_pong_tournament(), let's look at how we bulk_invoke() it so we can see the details of how to create a hierarchy along with shared parameters:
bulk_invoke(con(2, con(2)),
            ping_pong_tournament,
            semifinalists,
            num_volleys,
            share<0,movable_mutex>(),
            share<0,vector<string>>(2),
            share<1,int>());
The first thing to notice is that the execution policy passed to bulk_invoke() is hierarchical -- we've wrapped one policy inside another. The outer con(2, ...) policy tells bulk_invoke() to create the outermost concurrent_group<...> with two groups nested inside. The inner con(2) policy gives the parameters of the inner group -- two concurrent_agents. The net result is a two-level hierarchy of two inner groups of concurrent agents executing concurrently at the outer level.
The next three parameters are given just like the previous examples we've seen. ping_pong_tournament is the name of the function to invoke, and semifinalists and num_volleys are parameters that become replicated as per-agent private variables within the task.
The last three parameters are shared. The scope of sharing is given by the first template parameter to the share() function. For example, the first two calls to share() instruct bulk_invoke() to create a movable_mutex and a vector<string> to share across the outermost group of the execution hierarchy. These calls create the parameters mut and final_contestants, which all execution agents in the task may access because they are single instances shared at level 0.
The parameters to share<i,T>() are interpreted as constructor parameters. They are simply forwarded along to T's constructor. For example, when we call share<0,vector<string>>(2), this constructs a vector containing two empty strings. If we don't pass any parameters to share<i,T>(), bulk_invoke() just uses the default constructor.
The last call to share<1,int>() instructs bulk_invoke() to create an int for each group of agents at level 1 -- the inner level of the hierarchy. There are two of these, so bulk_invoke() creates two instances of the ball parameter.
Now let's step into the implementation of ping_pong_tournament(). The first thing we do is announce the beginning of the semifinals matches. We only want one announcement, so we need to elect a single agent to act as the announcer. We'll choose agent 0 of group 0 arbitrarily. To ensure that the announcement happens before semifinals begin, we need to make everyone wait. Here's how to do it:
if(self.inner().index() == 0)
{
  if(self.outer().index() == 0)
  {
    std::cout << "Starting semifinal matches..." << std::endl;
  }

  self.outer().wait();
}

self.inner().wait();
We use .inner().index() and .outer().index() to access the agent's inner and outer indices, respectively, and compare them to 0 to find the zeroth agent of the zeroth group. After announcing semifinals, we need to synchronize all the agents before the matches begin. To synchronize globally across all execution agents, we perform a hierarchical barrier by first having agent 0 of each group wait on the outer group with .outer().wait(). Then, we have every agent wait on the agents within their own inner group with .inner().wait(). The effect is that all agents globally synchronize.
After the announcement, we invoke the ping_pong_match() function to execute the semifinals matches:
auto semifinal_winner_idx =
  ping_pong_match(self.inner(),
                  semifinalists[self.outer().index()],
                  num_volleys,
                  mut,
                  ball);
The first parameter of the ping_pong_match() function is a handle to a concurrent_agent participating in the match. Because the two matches are independent of one another, they don't require any information about the organization of the execution hierarchy above the innermost groups. When we call self.inner(), we temporarily discard this extra information and pass a reference to the individual inner concurrent_agent. Structuring our code this way means we can build modular, general purpose functions like ping_pong_match() that don't require global visibility of the entire execution hierarchy. ping_pong_match()'s special first parameter ensures that we can't mistakenly call it without presenting proof that the right type of execution agent is currently executing.
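For reference, the call above implies a declaration along these lines. This is an inferred sketch (using the same unqualified names as the surrounding example), not necessarily the exact signature used in examples/ping_pong_tournament.cpp:
// plays one match between contestants[0] and contestants[1] and
// returns the index (0 or 1) of the winning player
size_t ping_pong_match(concurrent_agent& self,
                       const vector<string>& contestants,
                       int num_volleys,
                       movable_mutex& mut,
                       int& ball);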
After executing the semifinals matches, we need to determine the names of the winners and record them in the finalists array. There is one winner per match, so we have both agent 0s of the inner groups record the names of the finalists. At the same time, we perform another global barrier before beginning the final match:
if(self.inner().index() == 0)
{
  // the first agent of each group reports who won the semifinal match
  auto semifinal_winner =
    semifinalists[self.outer().index()][semifinal_winner_idx];

  finalists[self.outer().index()] = semifinal_winner;

  // have the first player of each group wait for the other group
  self.outer().wait();
}

// have each inner group wait for each other
self.inner().wait();
Finally, we'll execute the final match. Because there is only one final match, only one group of execution agents needs to participate. We'll nominate group 0 for this job -- the other group can exit the task. We'll announce the final match just like before and ensure that the announcement happens before the match begins:
// group 0 plays the final match while group 1 sits it out
if(self.outer().index() == 0)
{
  // agent 0 initializes the contestant names for the final match
  if(self.inner().index() == 0)
  {
    std::cout << std::endl << finalists[0] << " and " << finalists[1] << " starting the final match..." << std::endl;
  }

  // wait until agent 0 initializes the contestant names before starting the final match
  self.inner().wait();

  // play the final match
  auto final_winner_idx =
    ping_pong_match(self.inner(), finalists, num_volleys, mut, ball);

  // agent 0 of group 0 reports the winner
  if(self.inner().index() == 0)
  {
    std::cout << std::endl << finalists[final_winner_idx] << " is the tournament champion!" << std::endl;
  }
}
This bit of code uses the same principles just demonstrated to execute and synchronize one last ping pong match.
Try the program yourself! Here's the output on my system:
$ g++ -std=c++11 -I. -pthread examples/ping_pong_tournament.cpp -o ping_pong_tournament
$ ./ping_pong_tournament
Starting semifinal matches...
pong waiting for return
foo! ball is now 1
ping! ball is now 1
bar waiting for return
pong! ball is now 2
foo waiting for return
ping! ball is now 3
bar! ball is now 2
pong! ball is now 4
foo! ball is now 3
ping! ball is now 5
bar waiting for return
pong! ball is now 6
foo waiting for return
ping! ball is now 7
bar! ball is now 4
pong! ball is now 8
whiff... foo loses!
ping! ball is now 9
pong! ball is now 10
ping! ball is now 11
whiff... pong loses!
ping and bar starting the final match...
bar waiting for return
ping! ball is now 1
bar! ball is now 2
ping waiting for return
bar waiting for return
ping! ball is now 3
bar! ball is now 4
ping! ball is now 5
bar! ball is now 6
ping! ball is now 7
bar! ball is now 8
ping waiting for return
bar waiting for return
whiff... ping loses!
bar wins!
bar is the tournament champion!
You can see that the output from the two semifinal matches is interleaved. This shouldn't be surprising because both semifinal matches execute concurrently.
We've seen that organizing our computation into a hierarchical structure helps ensure that our code is correct by embedding scopes of sharing and synchronization into the C++ type system. Calling a function that demands a particular variety of execution agent requires that we present adequate identification. In Agency, there is no way to forge a counterfeit ID. The structure execution hierarchies impose on our code encourages correctness.
At the same time, execution hierarchies are important tools that help us achieve performance. Manufacturers tend to deliver increased hardware performance upon generational updates by building wider computational units and deeper memories. We observe these changes as increased core counts, wider vector widths, and NUMA memories. Each individual processor core may be grouped together with a number of SIMD lanes or on-chip memories. Hardware architectures are naturally hierarchical.
The performance of an application often depends critically on the existence of a good mapping of the computation onto the target architecture. To enable this mapping, Agency allows the programmer to expose the hierarchical organization of the execution agents involved. Because hardware architectures have a similar hierarchical organization, there's often a straightforward mapping between the two.
To see this principle in action, let's look at an example of mapping a computation onto a GPU. NVIDIA GPUs are organized into a two-level hierarchy: a GPU is a collection of processing cores each containing a number of finer-grained processing units on which threads execute. For our application, we'd like to take a square, two-dimensional matrix and transpose it. One way to approach this problem is to break the matrix into smaller tiles and transpose each of them in parallel.
Here's how to do it with Agency:
agency::cuda::future<void> async_square_transpose(size_t matrix_dim,
                                                  const float* input_matrix,
                                                  float* transposed_matrix)
{
  using namespace agency;

  int tile_dim = 32;
  int num_rows_per_group = 8;

  size2 outer_shape{matrix_dim/tile_dim, matrix_dim/tile_dim};
  size2 inner_shape{tile_dim, num_rows_per_group};

  return cuda::bulk_async(grid(outer_shape, inner_shape),
    [=] __device__ (parallel_group_2d<cuda::concurrent_agent_2d>& self)
  {
    auto idx = tile_dim * self.outer().index() + self.inner().index();

    for(int j = 0; j < tile_dim; j += num_rows_per_group)
    {
      transposed_matrix[idx[0]*matrix_dim + (idx[1]+j)] =
        input_matrix[(idx[1]+j)*matrix_dim + idx[0]];
    }
  });
}
The async_square_transpose() function receives a square input_matrix and returns its transpose through the transposed_matrix parameter. The idea of the implementation is to break the input into square, 32 by 32 tiles, and create a group of concurrent execution agents for each. Because our problem is naturally two-dimensional, we'll create execution agents that have two-dimensional indices.
We need to create one group of execution agents per tile, so the outer_shape of the agent hierarchy is a square matrix_dim/tile_dim groups on a side. Rather than create an inner execution agent per element of the tile, we'll create rectangular inner groups with dimensions {tile_dim, num_rows_per_group}, where num_rows_per_group is chosen arbitrarily to be 8. This means that each execution agent will perform several matrix element copies instead of just one.
To create our execution hierarchy, we use the grid() execution policy:
cuda::bulk_async(grid(outer_shape, inner_shape), lambda);
The expression grid(outer_shape, inner_shape) is shorthand for cuda::par(outer_shape, cuda::con(inner_shape)) and creates a "grid" of CUDA threads. In this case, because both the outer_shape and inner_shape are two-dimensional, the execution agents created by the use of grid() are as well.
The execution agent parameter is parallel_group_2d<cuda::concurrent_agent_2d>& self, whose type distinguishes it from the one-dimensional agents we've seen in previous examples. It works just like what we've seen previously, except that its .index() and .group_shape() functions return two-dimensional values.
The body of the lambda is pretty simple. It's each agent's job to copy tile_dim/num_rows_per_group matrix elements to the output, while transposing the element's position along the way. To compute where to begin copying, we first compute the origin of the group's tile:
auto idx = tile_dim * self.outer().index() ...
and add an offset for the individual agent:
... + self.inner().index()
The result is the two-dimensional index idx.
TODO:
- demonstrate .on(exec) syntax
- show how we can write a single kernel and move it around using .on(...)
- show how to make a really simple user-defined executor based on OpenMP