VEGAS_GPU_Code_Documentation - david-macmahon/wiki_convert_test GitHub Wiki
The VEGAS GPU spectrometer is a Polyphase Filter Bank (PFB) spectrometer. In a nutshell, what a PFB spectrometer does is this: Instead of taking an N-point FFT directly, a block of data of size N x P = M is read, and multiplied point-by-point with a window function (in other words, the data is 'weighted'). The shape of the window function determines the shape of the single-bin frequency response. Since we wish the single-bin frequency response to resemble a rectangular function as much as possible, we choose its Fourier Transform pair, the sinc function, as our window function. Once the multiplication is done, the block of data is split into P subsets of length N each, and added point-by-point. This array is then passed to a regular DFT routine to get an N-point transform that exhibits less leakage. This is graphically depicted in the figure below. For a more detailed background on the PFB technique, see the PFB memo.
The basic code flow is depicted below:
Initialisation
+------------------+ +--------------------------------------------+
| | |1. Get device properties |
| Initialisation | |2. Calculate kernel launch parameters |
| | |3. Load filter coefficients to device memory|
+---------+--------+ |4. Create FFT plan |
| +--------------------------------------------+
+-----------------------+
| |
| +------------+------------+
| | |
| | Copy time series data |
| | from host to device |
| | |
| +------------+------------+
| |
| |
| +-----------+-----------+
| | |
| | Perform |
| | pre-filtering |
| | |
| +-----------+-----------+
| |
| |
| +----------+----------+
| | |
| | Perform |
| | Fourier Transform |
| | |
| +----------+----------+
| |
| |
| +-----------+-----------+
| | |
| | Accumulate spectrum |
| | |
| +-----------+-----------+
| |
| |
| +--------------+--------------+
| | |
| | Copy accumulated spectrum |
| | from host to device |
| | |
| +--------------+--------------+
| |
| |
+-----------------------+
vegas_devel/src/vegas_hpc/src/vegas_pfb_thread.c : Entry point to the GPU thread
vegas_devel/src/vegas_hpc/src/pfb_gpu.cu : The main GPU code
vegas_devel/src/vegas_hpc/src/pfb_gpu.h : Include file for the main GPU code
vegas_devel/src/vegas_hpc/src/pfb_gpu_kernels.cu : The GPU kernel definitions
vegas_devel/src/vegas_hpc/src/pfb_gpu_kernels.h : Include file for the GPU kernel definitions
vegas_devel/src/gpu_dev/vegas_gencoeff.py : Python script to generate filter coefficients
To generate the polyphase filter coefficients file, run vegas_gencoeff.py. For usage information, run:
vegas_gencoeff_gencoeff.py -h
As an example, to generate an 8-tap filter for the 8-sub-band, 4K-channel mode, you would run the script as follows:
vegas_gencoeff.py -n 4096 -t 8 -b 8 -d float
File: pfb_gpu.cu Function: init_gpu()
The initialisation routine performs all GPU start-up related operations. In addition to allocation of memory used by the various data arrays, this routine does the following operations:
-
`cudaError_t `[`cudaGetDeviceProperties`](http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDART__DEVICE_g5aa4f47938af8276f08074d09b7d520c.html)`(struct cudaDeviceProp* prop, int device)` - This function populates the fields in the structure variable `prop` with device properties. The property we are interested in is an integer field named `maxThreadsPerBlock`, which contains the maximum number of threads a threadblock can have. You can use the `struct cudaDeviceProp` variable `stDevProp` that has been declared for you in `Init()`. For PCs with a single GPU, the device number is 0. Store the obtained value of maximum number of threads per block in the variable `iMaxThreadsPerBlock`.
-
Kernel launch parameters are usually CUDA structures of type `dim3` that tell the GPU the number of threads per block and the number of blocks per grid to use when launching the kernel. It also specifies the layout of the block and grid. For example, here is how you create a *M*x*N*x*P* threadblock: - <code> dim3 dimBlock; ... dimBlock.x = M; dimBlock.y = N; dimBlock.z = P; </code> For this spectrometer, the simplest layout is a one-dimensional threadblock, having a length equal to the number of points in the transform, `N`. If the number of points in the transform is more than `cudaDeviceProp::maxThreadsPerBlock`, then we need to set the length to the maximum allowed value, and use multiple such blocks, as shown in the below example, for the kernel launch parameters for the PFB kernel: - <code> g_dimBPFB.x = iMaxThreadsPerBlock; g_dimBPFB.y = 1; g_dimBFB.z = 1; g_dimGPFB.x = (g_iNumSubBands \* g_nchan) / iMaxThreadsPerBlock; g_dimGPFB.y = 1; </code> `cudaDeviceProp::maxThreadsPerBlock` is always a power of 2. The assumption in this piece of code is that the number of points in the transform, `N` is also a power of 2, and that `N` \> `cudaDeviceProp::maxThreadsPerBlock`.
-
`cudaError_t `[`cudaMemcpy`](http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDART__MEMORY_g48efa06b81cc031b2aa6fdc2e9930741.html)`(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)` - Loading the filter coefficients to device memory is a regular host-to-device memory copy operation, and just involves an invocation of this function, with `kind` being `cudaMemcpyHostToDevice`. The destination buffer is `g_pfPFBCoeff_d`, the source buffer is `g_pfPFBCoeff`, and the count, in bytes, is calculated as `g_iNTaps * g_iNFFT * sizeof(float)`.
-
`cufftResult `[`cufftPlanMany`](http://developer.download.nvidia.com/compute/DevZone/docs/html/CUDALibraries/doc/CUFFT_Library.pdf)`(cufftHandle *plan, int rank, int *n, int *inembed, int istride, int idist, int *onembed, int ostride, int odist, cufftType type, int batch)` - This function -- a Beta feature of the CUFFT 4.0 library -- is used to create an FFT plan that enables multiple Fourier Transforms to be performed simultaneously. A brief description of some of the arguments is given below. - `rank` is the dimension of the transform - `n` is the number of points in the FFT - `inembed` is the storage dimension of the input data, which, in this case, is equal to the number of points in the FFT - `istride` is the distance between two successive input elements. In our case, the data is packed thus: ![Gpu_tutorial_input_data_format.gif](Gpu_tutorial_input_data_format.gif "Gpu_tutorial_input_data_format.gif") <!-- end list --> - - Here, *S1* to *Sn* are *n* sub-bands, whose data is packed in a completely interleaved fashion. One 'element' is one complex time sample, consisting of both the real and imaginary parts. The distance between two such successive elements is *n* times the number of polarisations (2), or 2*n*. <!-- end list --> - `idist` is the distance between the first elements of two consecutive batches in the input data. For the *n*-sub-band data shown above, `cufftPlanMany` creates plans to perform 2*n* (number of polarisations times the number of sub-bands) FFTs in parallel. Each FFT operation is a 'batch', so that the distance between the first elements of two consecutive batches would just one complex time sample. - `onembed` is similar to `inembed`, for the output data. - `ostride` is similar to `istride`, for the output data. - `odist` is similar to `idist`, for the output data. - `type` is the kind of Fourier Transform to be performed. The only supported type, which meets our requirements, is `CUFFT_C2C`, the complex-to-complex Fourier Transform. - `batch` is the number of FFTs performed in parallel, which is 2*n*.
cudaError_t
cudaMemcpy
(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
- Copying time series data to device memory is a regular
host-to-device memory copy operation, and just involves an
invocation of this function, with
kind
beingcudaMemcpyHostToDevice
.
File: pfb_gpu_kernels.cu Function: DoPFB()
This function is of the following form.
__global__ void DoPFB(char4 *pc4Data, float4 *pf4FFTIn, float *pfPFBCoeff);
The input and output data are of special CUDA data types. As the name
implies, char4
is a struct
that encapsulates four char
variables,
named x
, y
, z
, and w
. For example, pc4Data->x
gives you the
first char
and so on. float4
is of similar structure, encapsulating
floating-point variables.
Since our input data is packed as shown in the figure in Task A.4.,
pc4Data->x
will contain (Re(X)), pc4Data->y
will contain
(Im(X)), pc4Data->z
will contain (Re(Y)), and pc4Data->w
will
contain (Im(Y)).
The first step in writing any CUDA kernel is calucating the thread ID. Since we are using one-dimensional threadblocks in a one-dimensional grid, the thread ID calculation is as follows.
int i = (blockIdx.x * blockDim.x) + threadIdx.x;
Since we are performing an N-point transform, the input to the PFB function is of length N x P, where the number of taps, P = 8, in this case. We need the value of N in the kernel for indexing into the data arrays. There are two ways to do this: copy the value of N into device memory prior to invoking the kernel and use it here, or, as has been done in the solution set, compute it on the fly. The kernel launch parameters have been selected such that the dimension of a threadblock times the dimension of the grid gives you N, as shown below.
int iNFFT = (gridDim.x * blockDim.x);
We need some memory to hold the output of the PFB operation, and the
best data type that we can use in this case is a float4
-
floating-point, because the next stage (CUFFT) requires that data type
as input, and specifically float4
because each time sample, as
described before, is composed of real and imaginary values for both X
and Y polarisations. One thing to note is that this variable needs to be
declared using a CUDA constructor to ensure proper alignment.
float4 f4PFBOut = make_float4(0.0, 0.0, 0.0, 0.0);
The PFB algorithm is as follows, in pseudo code.
for each tap
compute index into data arrays
out_re(x) += in_re(x) * coefficient
out_im(x) += in_im(x) * coefficient
out_re(y) += in_re(y) * coefficient
out_im(y) += in_im(y) * coefficient
Here, coefficient
is given in pfPFBCoeff
. For optimisation, each
coefficient repeats 4 times, so that even though the index is
incremented within the loop, the coefficient is the same for all four
multiplications.
File: pfb_gpu.cu Function: do_fft()
cufftResult
cufftExecC2C
(cufftHandle plan, cufftComplex *idata, cufftComplex *odata, int direction)
- This function executes all the single-precision complex-to-complex Fourier Transforms in parallel.
File: pfb_gpu_kernels.cu Function: Accumulate()
This kernel performs the following computations, and adds these values to corresponding accumulators.
- (A = Re^2(X) + Im^2(X))
- (B = Re^2(Y) + Im^2(Y))
- (C = Re(XY^*) = Re(X) Re(Y) + Im(X) Im(Y))
- (D = Im(XY^*) = Im(X) Re(Y) - Re(X) Im(Y))
Here, A and B are the powers in X and Y polarisations, respectively. The Stokes parameters of the signal can (later, offline) be calculated from these four values as follows:
(I = A + B)
(Q = A - B)
(U = 2C)
(V = 2D)
The kernel is of the form:
__global__ void Accumulate(float4 *pf4FFTOut, float4 *pf4SumStokes);
File: tut5_main.cu
cudaError_t
cudaMemcpy
(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
- Copying the computed spectra back to device memory is a regular
device-to-host memory copy operation, and just involves an
invocation of this function, with
kind
beingcudaMemcpyDeviceToHost
.