Tutorial_5:_Heterogeneous_Instrumentation - david-macmahon/wiki_convert_test GitHub Wiki

This page is under construction.


SAlbcgt9_casper_logo-1.png

Tutorial 5: Heterogeneous Instrumentation Tutorial

Authors: Mark Wagner, Jayanth Chennamangalam (Version 1)

Expected Completion Time: 4 hours


Audience

This tutorial is intended for those who are familiar with CASPER tools, understand CASPER-based design at the level of Tutorial 2, and who are also familiar with basic CUDA programming in C.

Introduction

In this tutorial, you will learn how to build an astronomical signal-processing instrument - specifically, a polyphase filterbank spectrometer - using both FPGA and Graphics Processing Unit (GPU) technologies. A typical heterogeneous instrument consists of three parts

  • an ADC-FPGA system, a PC-based data-acquisition mechanism, and an online processing system that uses a GPU. Likewise, this tutorial is divided into three sections. In the first section (FPGA Design), you will build an FPGA design using Simulink and CASPER tools, in the second section (Data Acquisition), you will use a simple Python script to capture data on a 10GbE link and record it on a PC, and in the third section (Polyphase Filter Bank Spectrometer), you will write CUDA kernels that will perform polyphase filtering and spectrometry.

Setup

Add info on CASPER libraries, etc.

You will need an NVIDIA CUDA-capable GPU with compute capability 2.x and CUDA Toolkit 4.0 or higher. You will also need the PGPLOT library.

FPGA Design

The FPGA design essentially consists of a block RAM that contains some time domain data (say, the sum of two sine waves), a packet-size-controlling and rate-limiting mechanism, and a Ten GbE v2 block as the data transmitter.

Tut5_mdl_a.png

Set the dest_ip register to the IP address of the PC interface that will receive the packets. Set the dest_port to the destination port number, which is 60000.

Tut5_mdl_b.png

The packet-size-controlling and rate-limiting mechanism is very similar to that of Tutorial 2. The payload_len software register should be loaded with the size of the packet in terms of 64-bit words. That is, to send a packet with 1024 bytes of data, load this register with 128 (128 * 8 bytes = 1024 bytes). The period software register is used for rate-limiting. Set this to 4 times payload_len, that is, 512.

To start transmitting data, set enable to 1.

The whole procedure is available as the configuration script tut5.py. Note that you may need to edit the path to the Python interpreter.

Data Acquisition

To acquire data, run the script udp_grab.py. You may need to change the path to the Python interpreter and/or value of the variable udp_ip, depending on your configuration. This file records ten 35 MB files of data. Depending on the data rate, the first file may take a while to appear. Once you see a file named 'file0000', you can start running the GPU spectrometer code.

NOTE: There is a bug in the spectrometer code that prevents files less than 32 MB in size from being processed. So if you choose to modify the code in udp_grab.py, make sure the data files that get written to disk are greater than 32 MB in size.

Polyphase Filter Bank Spectrometer

At this point, you will have a set of files, named 'file0000', 'file0001', 'file0002', and so on in numerical sequence, that contains time domain data. The next step is to do the actual spectrometry.

A spectrometer is a mechanism that converts time domain data to the frequency domain, outputting spectra at regular intervals of time. A simple spectrometer is made up of a Discrete Fourier Transform (DFT) -- usually, a Fast Fourier Transform (FFT) -- operation performed on the time series data, resulting in output spectra that may or may not be added up (accumulated) for a given time duration, to increase the signal-to-noise ratio. Such a basic spectrometer suffers from a couple of inherent drawbacks of the DFT, namely, spectral leakage and scalloping loss. The Polyphase Filter Bank (PFB) is a signal-processing technique that reduces these drawbacks.

In a nutshell, what a PFB spectrometer does is this: Instead of taking an N-point transform directly, a block of data of size N x P = M is read, and multiplied point-by-point with a window function (in other words, the data is 'weighted'). The shape of the window function determines the shape of the single-bin frequency response. Since we wish the single-bin frequency response to resemble a rectangular function as much as possible, we choose its Fourier Transform pair, the sinc function, as our window function. Once the multiplication is done, the block of data is split into P subsets of length N each, and added point-by-point. This array is then passed to a regular DFT routine to get an N-point transform that exhibits less leakage. This is graphically depicted in the figure below. For a more detailed background on the PFB technique, see the PFB memo.

Memo_pfb_polyphase.png

In this section of the tutorial, you will be given a skeleton CUDA/C code that reads data from the input files, and plots the data from an output memory buffer. The objective of this section of the tutorial is to write CUDA kernel-related code, namely, kernel launch parameter calculation, and the actual kernels that perform PFB, FFT, and accumulation of spectra.

This code is for a general-purpose software that performs an 8-tap polyphase filtering, with N channels, and some S sub-bands. For the purpose of this tutorial, though, we will stick to S = 1 and N = 32768.

The basic code flow is depicted below:



                                                                     Initialisation
                           +------------------+                      +--------------------------------------------+
                           |                  |                      |1. Get device properties                    |
                           |  Initialisation  |                      |2. Calculate kernel launch parameters       |
                           |                  |                      |3. Load filter coefficients to device memory|
                           +---------+--------+                      |4. Create FFT plan                          |
                                     |                               +--------------------------------------------+
             +-----------------------+
             |                       |
             |          +------------+------------+
             |          |                         |
             |          |  Copy time series data  |
             |          |   from host to device   |
             |          |                         |
             |          +------------+------------+
             |                       |
             |                       |
             |           +-----------+-----------+
             |           |                       |
             |           |       Perform         |
             |           |  polyphase filtering  |
             |           |                       |
             |           +-----------+-----------+
             |                       |
             |                       |
             |            +----------+----------+
             |            |                     |
             |            |      Perform        |
             |            |  Fourier Transform  |
             |            |                     |
             |            +----------+----------+
             |                       |
             |                       |
             |           +-----------+-----------+
             |           |                       |
             |           |  Accumulate spectrum  |
             |           |                       |
             |           +-----------+-----------+
             |                       |
             |                       |
             |        +--------------+--------------+
             |        |                             |
             |        |  Copy accumulated spectrum  |
             |        |     from host to device     |
             |        |                             |
             |        +--------------+--------------+
             |                       |
             |                       |
             +-----------------------+

Directory Structure

Download the problem set and the solution set (given for reference).

Search for the string 'Task ' (where = A, B, and so on, as given below) in the problem set source files to see the tasks you need to do (computing kernel launch parameters, writing the CUDA kernel code for PFB, FFT and accumulation, etc.). If you choose not to write your own kernels, or just wish to see how the final output looks like, you may run the solution set first.

NOTE: You may need to modify the makefile, depending on the configuration of the PC that you use.

   bin/             : Directory in which the output binaries will be created    tut5_fileread.cu : Data-file-reading routines    tut5_fileread.h  : Header for data-file-reading routines    tut5_kernels.cu  : CUDA kernels    tut5_kernels.h   : Header for CUDA kernels    tut5_main.cu     : Top-level file    tut5_main.h      : Top-level header file    tut5_plot.cu     : Plotting routines    tut5_plot.h      : Header for plotting routines    Makefile         : The makefile    tut5_gencoeff.py : Python script to generate filter coefficients

To run the solution set, or the code that you write, you need to generate the polyphase filter coefficients file, by running the script tut5_gencoeff.py. For usage information, run:

tut5_gencoeff.py -h

For the purpose of this tutorial, to generate an 8-tap filter for 32K-channel, single-sub-band operation, you would run the script as follows:

tut5_gencoeff.py -n 32768 -t 8 -b 1 -d float

Tasks

Task A. Initialisation

File: tut5_main.cu

The initialisation routine Init() performs all start-up related operations. In addition to allocation of memory used by the various data arrays, this routine does the following operations that you will code:

  1. Get device properties

    `cudaError_t
    `[`cudaGetDeviceProperties`](http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDART__DEVICE_g5aa4f47938af8276f08074d09b7d520c.html)`(struct
    cudaDeviceProp* prop, int device)`
      -
        This function populates the fields in the structure variable
        `prop` with device properties. The property we are
        interested in is an integer field named
        `maxThreadsPerBlock`, which contains the maximum number of
        threads a threadblock can have.
        You can use the `struct cudaDeviceProp` variable `stDevProp`
        that has been declared for you in `Init()`.
        For PCs with a single GPU, the device number is 0. Store the
        obtained value of maximum number of threads per block in the
        variable `iMaxThreadsPerBlock`.
    
  2. Calculate kernel launch parameters

    Kernel launch parameters are usually CUDA structures of type
    `dim3` that tell the GPU the number of threads per block and the
    number of blocks per grid to use when launching the kernel. It
    also specifies the layout of the block and grid. For example,
    here is how you create a *M*x*N*x*P* threadblock:
      -
        <code>
        dim3 dimBlock;
        ...
        dimBlock.x = M;
        dimBlock.y = N;
        dimBlock.z = P;
        </code>
    For this spectrometer, the simplest layout is a one-dimensional
    threadblock, having a length equal to the number of points in
    the transform, `N`. If the number of points in the transform is
    more than `cudaDeviceProp::maxThreadsPerBlock`, then we need to
    set the length to the maximum allowed value, and use multiple
    such blocks, as shown below, for the kernel launch parameters
    for the PFB kernel:
      -
        <code>
        g_dimBPFB.x = iMaxThreadsPerBlock;
        g_dimBPFB.y = 1;
        g_dimBFB.z = 1;
        g_dimGPFB.x = g_iNFFT / dimBlock.x;
        g_dimGPFB.y = 1;
        </code>
        You need to do the same for the kernel launch parameters for
        the accumulation kernel.
    `cudaDeviceProp::maxThreadsPerBlock` is always a power of 2. The
    assumption in this piece of code is that the number of points in
    the transform, `N` is also a power of 2, and that `N` \>
    `cudaDeviceProp::maxThreadsPerBlock`.
    
  3. Load filter coefficients to device memory

    `cudaError_t
    `[`cudaMemcpy`](http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDART__MEMORY_g48efa06b81cc031b2aa6fdc2e9930741.html)`(void*
    dst, const void* src, size_t count, enum cudaMemcpyKind kind)`
      -
        Loading the filter coefficients to device memory is a
        regular host-to-device memory copy operation, and just
        involves an invocation of this function, with `kind` being
        `cudaMemcpyHostToDevice`.
        The destination buffer is `g_pfPFBCoeff_d`, the source
        buffer is `g_pfPFBCoeff`, and the count, in bytes, is
        calculated as `g_iNTaps * g_iNFFT * sizeof(float)`.
    
  4. Create FFT plan

    `cufftResult
    `[`cufftPlanMany`](http://developer.download.nvidia.com/compute/DevZone/docs/html/CUDALibraries/doc/CUFFT_Library.pdf)`(cufftHandle
    *plan, int rank, int *n, int *inembed, int istride, int idist,
    int *onembed, int ostride, int odist, cufftType type, int
    batch)`
      -
        This function -- a Beta feature of the CUFFT 4.0 library --
        is used to create an FFT plan that enables multiple Fourier
        Transforms to be performed simultaneously. A brief
        description of some of the arguments is given below.
          - `rank` is the dimension of the transform
          - `n` is the number of points in the FFT
          - `inembed` is the storage dimension of the input data,
            which, in this case, is equal to the number of points in
            the FFT
          - `istride` is the distance between two successive input
            elements. In our case, the data is packed thus:
            ![Gpu_tutorial_input_data_format.gif](Gpu_tutorial_input_data_format.gif
            "Gpu_tutorial_input_data_format.gif")
        <!-- end list -->
          -
    
              -
                Here, *S1* to *Sn* are *n* sub-bands, whose data is
                packed in a completely interleaved fashion. One
                'element' is one complex time sample, consisting of
                both the real and imaginary parts. The distance
                between two such successive elements is *n* times
                the number of polarisations (2), or 2*n*.
        <!-- end list -->
          - `idist` is the distance between the first elements of
            two consecutive batches in the input data. For the
            *n*-sub-band data shown above, `cufftPlanMany` creates
            plans to perform 2*n* (number of polarisations times the
            number of sub-bands) FFTs in parallel. Each FFT
            operation is a 'batch', so that the distance between the
            first elements of two consecutive batches would just one
            complex time sample.
          - `onembed` is similar to `inembed`, for the output data.
          - `ostride` is similar to `istride`, for the output data.
          - `odist` is similar to `idist`, for the output data.
          - `type` is the kind of Fourier Transform to be performed.
            The only supported type, which meets our requirements,
            is `CUFFT_C2C`, the complex-to-complex Fourier
            Transform.
          - `batch` is the number of FFTs performed in parallel,
            which is 2*n*.
    

Task B. Copy Time Series Data from Host to Device

File: tut5_fileread.cu

This task has already been done for you.

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)

  • Copying time series data to device memory is a regular host-to-device memory copy operation, and just involves an invocation of this function, with kind being cudaMemcpyHostToDevice.

Task C. Perform Polyphase Filtering

File: tut5_kernels.cu

In this section, you write code for an 8-tap polyphase filter. This function is of the following form.

   __global__ void DoPFB(char4 *pc4Data, float4 *pf4FFTIn, float *pfPFBCoeff);

The input and output data are of special CUDA data types. As the name implies, char4 is a struct that encapsulates four char variables, named x, y, z, and w. For example, pc4Data->x gives you the first char and so on. float4 is of similar structure, encapsulating floating-point variables.

Since our input data is packed as shown in the figure in Task A.4., pc4Data->x will contain (Re(X)), pc4Data->y will contain (Im(X)), pc4Data->z will contain (Re(Y)), and pc4Data->w will contain (Im(Y)).

The first step in writing any CUDA kernel is calucating the thread ID. Since we are using one-dimensional threadblocks in a one-dimensional grid, the thread ID calculation is as follows.

   int i = (blockIdx.x * blockDim.x) + threadIdx.x;

Since we are performing an N-point transform, the input to the PFB function is of length N x P, where the number of taps, P = 8, in this case. We need the value of N in the kernel for indexing into the data arrays. There are two ways to do this: copy the value of N into device memory prior to invoking the kernel and use it here, or, as has been done in the solution set, compute it on the fly. The kernel launch parameters have been selected such that the dimension of a threadblock times the dimension of the grid gives you N, as shown below.

   int iNFFT = (gridDim.x * blockDim.x);

We need some memory to hold the output of the PFB operation, and the best data type that we can use in this case is a float4 - floating-point, because the next stage (CUFFT) requires that data type as input, and specifically float4 because each time sample, as described before, is composed of real and imaginary values for both X and Y polarisations. One thing to note is that this variable needs to be declared using a CUDA constructor to ensure proper alignment.

   float4 f4PFBOut = make_float4(0.0, 0.0, 0.0, 0.0);

The PFB algorithm is as follows, in pseudo code.

   for each tap        compute index into data arrays        out_re(x) += in_re(x) * coefficient        out_im(x) += in_im(x) * coefficient        out_re(y) += in_re(y) * coefficient        out_im(y) += in_im(y) * coefficient

Here, coefficient is given in pfPFBCoeff. For optimisation, each coefficient repeats 4 times, so that even though the index is incremented within the loop, the coefficient is the same for all four multiplications.

Task D. Perform Fourier Transform

File: tut5_kernels.cu

cufftResult cufftExecC2C(cufftHandle plan, cufftComplex *idata, cufftComplex *odata, int direction)

  • This function executes all the single-precision complex-to-complex Fourier Transforms in parallel.

Task E. Accumulate Spectrum

File: tut5_kernels.cu

The kernel you write should perform the following computations, and add these values to corresponding accumulators.

  • (A = Re^2(X) + Im^2(X))
  • (B = Re^2(Y) + Im^2(Y))
  • (C = Re(XY^*) = Re(X) Re(Y) + Im(X) Im(Y))
  • (D = Im(XY^*) = Im(X) Re(Y) - Re(X) Im(Y))

Here, A and B are the powers in X and Y polarisations, respectively. The Stokes parameters of the signal can (later, offline) be calculated from these four values as follows:

(I = A + B)

(Q = A - B)

(U = 2C)

(V = 2D)

For writing the kernel, follow the general instructions given in Task C. The kernel is of the form:

   __global__ void Accumulate(float4 *pf4FFTOut, float4 *pf4SumStokes);

Task F. Copy Accumulated Spectrum from Host to Device

File: tut5_main.cu

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)

  • Copying the computed spectra back to device memory is a regular device-to-host memory copy operation, and just involves an invocation of this function, with kind being cudaMemcpyDeviceToHost.

Running the Spectrometer

To compile the code that you have written, or to compile the solution set code, just type make at the command prompt. Note that you may need to modify the makefile for successful compilation. Now you should have a binary named tut5 in the bin directory. For usage, type:

bin/tut5 -h

As an example, for single-sub-band, 32768-channel operation, run the program as follows:

bin/tut5 -b 1 -n 32768 -p

Conclusion

By the end of this tutorial, you would have learnt how to construct a heterogeneous astronomical signal processing instrument that uses an FPGA design to packetize time domain data and send it to a PC over 10GbE, acquire the data on the PC, and generate polyphase-filtered spectra.

⚠️ **GitHub.com Fallback** ⚠️