How To Parallelize Code! - abrantsma/Pneumo GitHub Wiki

Using the Octave Forge parallel package:

Setup:

Yeah, I'll get to this at some point.
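Until the proper write-up exists, here is a minimal sketch of what setup usually looks like, assuming a stock Octave install with network access. The struct dependency is my understanding of the package requirements rather than something checked on these machines, so treat it as an assumption.

```octave
% Run once per machine (needs network access).
% ASSUMPTION: the parallel package depends on the struct package.
pkg install -forge struct
pkg install -forge parallel

% Run at the start of every session (or add it to ~/.octaverc).
pkg load parallel

% nproc() is a core Octave function that reports the number of usable cores.
printf("cores available: %d\n", nproc());
```

The number printed by nproc() is a reasonable first guess for the first argument of parcellfun.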

How to parallelize a for-loop-ed function:

So we started with this:

tic();
for i = 1:size(marbleCoord, 1) % one iteration per row (marble); length() would return the largest dimension instead
    targets{i} = mk_c2f_circ_mapping(img.fwd_model, transpose(marbleCoord(i,:)) );
    img.elem_data = img.elem_data + DelC1*targets{i}(:,1);
end
toc()

This takes about 800 seconds. That's too long. But we can make it shorter!

Let's do a quick test of parallelizing a single instance of the for loop into one job for one core:

one_transposed_cell_of_marbleCoord = {transpose(marbleCoord(1,:))};
tic();
targets{1} = parcellfun(1, @(xyzr)mk_c2f_circ_mapping(img.fwd_model, xyzr), one_transposed_cell_of_marbleCoord);
toc()

This works! Although it takes 12-14 seconds...

But since we have four cores and want to do it for every element of marbleCoord, we do:

n_marbles = size(marbleCoord, 1); % one row per marble; length() would return the largest dimension instead
transposed_marbleCoord = transpose(marbleCoord);
transposed_cells_of_marbleCoord = num2cell(transposed_marbleCoord, 1); % one cell per marble, each holding a column vector
tic();
targets = parcellfun(4, @(xyzr)mk_c2f_circ_mapping(img.fwd_model, xyzr), transposed_cells_of_marbleCoord, "UniformOutput", false);
for i = 1:n_marbles
    img.elem_data = img.elem_data + DelC1*targets{i}(:,1);
end
toc()

With four cores this should ideally take around 200 seconds (800 / 4)! In practice it wound up taking around 240-250 seconds, for various reasons like the cores also running other background processes.
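If you want to see the num2cell/parcellfun pattern without EIDORS loaded, here is a small sketch. The coords matrix and fake_mapping are hypothetical stand-ins for marbleCoord and mk_c2f_circ_mapping; only num2cell, cellfun and parcellfun are real calls.

```octave
pkg load parallel

% Hypothetical stand-in for marbleCoord: one row per marble, columns x, y, z, r.
coords = [0.1 0.2 0.3 0.05;
          0.4 0.5 0.6 0.05;
          0.7 0.8 0.9 0.05];

% num2cell(A, 1) makes one cell per COLUMN, so transpose first
% to get one 4x1 column vector per marble.
cols = num2cell(transpose(coords), 1);

% Hypothetical stand-in for mk_c2f_circ_mapping: returns a 5x1 vector.
fake_mapping = @(xyzr) xyzr(4) * ones(5, 1);

% Serial reference with cellfun, then the same call spread over 2 processes.
serial   = cellfun(fake_mapping, cols, "UniformOutput", false);
parallel = parcellfun(2, fake_mapping, cols, "UniformOutput", false);

% Both should give the same 1x3 cell array.
assert(isequal(serial, parallel));
```

Checking the parallel result against a plain cellfun run on a few inputs is a cheap sanity test before committing to the full 800-second job.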

Now to get it to work in a cluster! First, start the Octave parallel servers. On each machine's command line (even the client machine, if you want it to use its cores in processing), type octave --eval "pserver(struct('use_tls',false))".

To connect the client machine, we add these lines at the top of the code:

connections = pconnect({'master', 'slave'}, struct('use_tls', false)); % Connect to all machines.
pause(3); % to be safe. (pause() instead of sleep(), which has been removed from newer Octave releases.)
reval("run /home/rpi00/eidors/eidors-v3.8/eidors/startup.m", connections) % Need to set up the eidors environment on all machines since we are using eidors functions.
pause(5); % again, to be safe.

And then change the parcellfun line to:

targets = netcellfun(connections, @(xyzr)mk_c2f_circ_mapping(img.fwd_model, xyzr), transposed_cells_of_marbleCoord, "UniformOutput", false);

And there you have it!! With our two RPis (one overclocked to 1GHz, the other at stock speeds) we got the loop runtime down to 144 seconds. Looking forward to trying it soon with more RPis!
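To put those timings in perspective, here is a quick back-of-the-envelope speedup calculation using only the numbers reported on this page. The cluster core count is an assumption (four cores per RPi), and the overclocked board means the cores are not really identical, so read the efficiency figures as rough.

```octave
% Timings reported above, in seconds.
t_serial  = 800;         % original for loop
t_local   = [240 250];   % parcellfun on 4 cores (reported range)
t_cluster = 144;         % netcellfun across the two RPis

cores_local   = 4;
cores_cluster = 8;       % ASSUMPTION: both RPis contribute 4 cores each

% Speedup = serial time / parallel time; efficiency = speedup / cores.
speedup_local   = t_serial ./ t_local;
speedup_cluster = t_serial / t_cluster;
eff_local   = speedup_local / cores_local;
eff_cluster = speedup_cluster / cores_cluster;

printf("local:   speedup %.2f-%.2f, efficiency %.2f-%.2f\n", ...
       min(speedup_local), max(speedup_local), min(eff_local), max(eff_local));
printf("cluster: speedup %.2f, efficiency %.2f\n", speedup_cluster, eff_cluster);
```

Under that assumption the cluster is faster overall but less efficient per core than the single machine, which is what you would expect once network transfer gets involved.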

Huge, huge, huge props to Alberto Andreotti's blog post! No idea how I would have figured this out without it.