Low latency filtering: Partitioned Convolution

work in progress - conceptual thoughts

Text and figures © DD4WH, under GNU GPLv3

A longstanding plan is to move the main audio filtering in UHSDR from time domain filtering to Fast Convolution filtering. However, in order to obtain filters that are steep enough, the FFT size for the FFT-iFFT audio chain has to be at least 2048 or even 4096 (FIR filter impulse response with 1025 / 2049 coefficients; decimation could reduce this). An FFT size of 4096 produces an inherent delay of 170 ms at 24 ksps sample rate (4096 / 24000 s), which is unacceptable for CW operators and can also be annoying for operators in other modes.

A solution to this problem, called "Partitioned Convolution", was highlighted by Warren Pratt in his HAMRADIO 2018 talk at the Software Defined Radio Academy (see also Kulp 1988, Armelloni et al. 2003). In Partitioned Convolution, the filter's impulse response is partitioned into separate blocks. The convolution is then performed block by block with small FFTs instead of with one big FFT for the whole impulse response.

For UHSDR running on OVI40 with the STM32F7 processor, we would like to implement Fast Convolution filtering with partitioned convolution in order to minimize filter latency while maintaining a high quality filter with steep filter skirts ("brickwall").

Partitioned Convolution has been used extensively in the past ten years to enable very large FIR filter lengths with acceptable "real time" latency. However, this imposes a very large processor load. Garcia (2003) gives some hints and formulae to calculate MCU load for different partition schemes.

We will use multiple block convolution with uniform partition using a frequency domain delay line (FDL), sometimes also called Single-FDL Convolution (Garcia 2003). This is a quite efficient scheme, because each block of input samples needs only one FFT and one inverse FFT, instead of one inverse FFT per partition of the impulse response.

[the following is just notes taken from understanding wdsp, firmin.c, "Standalone Partitioned overlap-save bandpass", Pratt 2018]

abbreviations

SR = sample rate

DF = decimation factor

size = no. of input and output samples processed at one time

FFT_size = size of the FFT and inverse FFT used [FFT_size = 2 * size, because we use 50% overlap]

nc = no. of filter coefficients for the complex FIR bandpass filter

max_BW = maximum bandwidth useable for the filter [dependent on the sample rate and the decimation factor, --> Nyquist!]

N_blocks = the impulse response for the FIR filter with nc coeffs is partitioned into N_blocks blocks

latency = time that is needed for acquisition of the required number of input samples (size). latency = (size * DF) / SR

window used for the calculation of the filter coeffs: Blackman-Harris 4-term, which provides 110 dB stopband attenuation (Pratt 2017, p. 21), more than enough for our purposes.
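As an illustration of the coefficient calculation, here is a minimal Python sketch of a windowed-sinc complex bandpass with a 4-term Blackman-Harris window. It is a generic textbook design and not necessarily how wdsp computes its coefficients. The function names, the 65 taps and the 300 to 2700 Hz passband at 12 ksps are assumptions chosen only for the example.

```python
import cmath
import math

# Standard 4-term Blackman-Harris window coefficients
BH4 = (0.35875, 0.48829, 0.14128, 0.01168)

def blackman_harris(n_taps):
    """Return the 4-term Blackman-Harris window with n_taps points."""
    a0, a1, a2, a3 = BH4
    d = n_taps - 1
    return [a0 - a1 * math.cos(2 * math.pi * n / d)
               + a2 * math.cos(4 * math.pi * n / d)
               - a3 * math.cos(6 * math.pi * n / d) for n in range(n_taps)]

def complex_bandpass(n_taps, f_low, f_high, sample_rate):
    """Windowed-sinc lowpass, shifted to the passband centre (hypothetical helper)."""
    fc = 0.5 * (f_low + f_high) / sample_rate   # normalised centre frequency
    fw = 0.5 * (f_high - f_low) / sample_rate   # normalised half bandwidth
    mid = 0.5 * (n_taps - 1)
    window = blackman_harris(n_taps)
    h = []
    for n in range(n_taps):
        t = n - mid
        lowpass = 2 * fw if t == 0 else math.sin(2 * math.pi * fw * t) / (math.pi * t)
        h.append(window[n] * lowpass * cmath.exp(2j * math.pi * fc * t))
    return h

def response(h, freq, sample_rate):
    """Frequency response of the complex filter h at freq (Hz)."""
    return sum(c * cmath.exp(-2j * math.pi * freq * n / sample_rate)
               for n, c in enumerate(h))

h = complex_bandpass(65, 300.0, 2700.0, 12000.0)
passband = abs(response(h, 1500.0, 12000.0))    # centre of the passband
image = abs(response(h, -1500.0, 12000.0))      # opposite sideband
```

Because the coefficients are complex, the filter passes only one sideband: the response at +1500 Hz is close to 1, while the response at -1500 Hz is strongly attenuated.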

Variables used

maskgen[FFT_size * 2] = holds the coefficients for each block of the impulse response prior to the FFT

fmask[N_blocks][FFT_size * 2] = holds the results of the FFTs of the blocks of the impulse response, used for the complex multiplication in the frequency domain

fftin[FFT_size * 2] = input buffer for the main real time FFT

fftout[N_blocks][FFT_size * 2] = output buffer for the FFT results, they need to be stored for each block in order to be used in subsequent rounds

accum[FFT_size * 2] = accumulator for the input to the final inverse FFT for each round

Setup (repeat every time the filter is adjusted):

1. calculate the complex FIR filter coefficients (= impulse response) with windowing (Blackman-Harris 4-term) --> results in nc * 2 coefficient values (real and imaginary parts)
2. partition the nc * 2 values into N_blocks blocks of size * 2 values each
3. fill the first half of the maskgen buffer [total size = size * 4] with size * 2 zeros
4. put the size * 2 values of one block into the second half of the maskgen buffer
5. calculate a complex FFT of size FFT_size (= size * 2) of this block [gives size * 4 output values]
6. store the FFT result in the fmask row that belongs to this block (fmask[N_blocks][FFT_size * 2])
7. continue with step 3 for the next block until the whole impulse response has been processed
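The setup steps can be sketched in Python as follows. This is only an illustration: it uses Python complex numbers instead of interleaved real/imaginary float buffers, a naive DFT as a stand-in for the optimized FFT on the MCU, and the hypothetical function names `dft` and `build_fmask`.

```python
import cmath

def dft(x):
    """Naive O(N^2) forward DFT, standing in for a real FFT routine."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def build_fmask(coeffs, size):
    """Partition the impulse response and return one frequency mask per block."""
    assert len(coeffs) % size == 0, "nc must be a multiple of size"
    n_blocks = len(coeffs) // size
    fmask = []
    for j in range(n_blocks):
        block = [complex(c) for c in coeffs[j * size:(j + 1) * size]]
        maskgen = [0j] * size + block      # zeros in first half, block in second half
        fmask.append(dft(maskgen))         # FFT of length FFT_size = 2 * size
    return fmask

# Toy example: nc = 4 coefficients, size = 2, so N_blocks = 2 and FFT_size = 4
fmask = build_fmask([1.0, 2.0, 3.0, 4.0], size=2)
```

The masks only change when the filter is adjusted, so this part stays out of the real-time path.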

Real-time filter process:

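1. collect size new I & Q samples (size * I + size * Q)
2. overlap 50% with the previous samples: the new samples go into the second half of fftin, the previous block stays in the first half
3. complex FFT of those size * 4 values
4. copy FFT result into fftout[buffidx]
5. fill accum buffer with zeroes
6. k = buffidx
7. repeat for j = 0; j < N_blocks; j++ {
8. complex-multiply fftout[k] with fmask[j]
9. accumulate result of complex-multiply in accum[size * 4]
10. k-- (and wrap-around) }
11. buffidx++ (and wrap-around)
12. copy second half of fftin buffer into first half of fftin buffer for next time
13. inverse FFT of accum[size * 4]
14. take the first (size * I + size * Q) samples as output and discard the second half [overlap & save]; because the coefficients sit in the second half of maskgen, the samples free of circular wrap-around end up in the first half of the result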
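The real-time loop with its frequency domain delay line can be sketched in Python as below. This is a minimal illustration, not the wdsp or UHSDR code: the class name `PartitionedConvolver` and its method `process` are made up, a naive DFT replaces the FFT, and Python complex numbers replace the interleaved float buffers. With the coefficients placed in the second half of each zero-padded partition, this sketch takes its output from the first `size` samples of the inverse transform.

```python
import cmath

def dft(x, sign=-1):
    """Naive O(N^2) DFT (sign=-1) or unscaled inverse DFT (sign=+1)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

class PartitionedConvolver:
    """Uniformly partitioned overlap-save convolution with one FDL (sketch)."""

    def __init__(self, coeffs, size):
        assert len(coeffs) % size == 0, "nc must be a multiple of size"
        self.size = size
        self.n_blocks = len(coeffs) // size
        # Setup: FFT of each partition, zeros first and coefficients second
        self.fmask = [dft([0j] * size + list(coeffs[j * size:(j + 1) * size]))
                      for j in range(self.n_blocks)]
        self.fftin = [0j] * (2 * size)
        self.fftout = [[0j] * (2 * size) for _ in range(self.n_blocks)]
        self.buffidx = 0

    def process(self, block):
        size = self.size
        assert len(block) == size
        self.fftin[size:] = block                     # new samples into second half
        self.fftout[self.buffidx] = dft(self.fftin)   # the only forward FFT per block
        accum = [0j] * (2 * size)
        k = self.buffidx
        for j in range(self.n_blocks):                # newest spectrum meets block 0
            for i in range(2 * size):
                accum[i] += self.fftout[k][i] * self.fmask[j][i]
            k = (k - 1) % self.n_blocks               # step back in the delay line
        self.buffidx = (self.buffidx + 1) % self.n_blocks
        self.fftin[:size] = self.fftin[size:]         # keep block for 50% overlap
        y = dft(accum, sign=+1)                       # the only inverse FFT per block
        return [v / (2 * size) for v in y[:size]]     # scale, keep first half

# Toy example: nc = 8, size = 2, N_blocks = 4
h = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 3.0]
x = [float((7 * i) % 5 - 2) for i in range(16)]
pc = PartitionedConvolver(h, size=2)
y = []
for m in range(0, len(x), 2):
    y += pc.process(x[m:m + 2])
```

The output `y` can be compared sample by sample against a direct time domain convolution of `x` with `h`, which is a convenient check when porting the scheme to fixed buffers in C.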

One example of a low latency filter with reasonable filter size would be:

nc = 1024 coefficients, N_blocks = 8, SR = 48 ksps, DF = 4, size = 128, FFT_size = 256, latency = 10.67 ms, max_BW = 5 kHz

memory consumption: about 40kbytes
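The figures of this example can be reproduced with a few lines of Python. The helper name `partition_parameters` is hypothetical, and the memory estimate assumes 32-bit floats and counts only the five buffers listed under "Variables used".

```python
def partition_parameters(nc, n_blocks, sample_rate, decimation, bytes_per_float=4):
    """Derive block size, latency and buffer memory for a uniform partition."""
    size = nc // n_blocks                    # samples processed per block
    fft_size = 2 * size                      # 50% overlap
    floats_per_buffer = 2 * fft_size         # interleaved real and imaginary parts
    latency_ms = 1000.0 * size * decimation / sample_rate
    # fmask and fftout have N_blocks rows each; maskgen, fftin, accum have one
    n_buffers = 2 * n_blocks + 3
    memory_bytes = n_buffers * floats_per_buffer * bytes_per_float
    return {"size": size, "fft_size": fft_size,
            "latency_ms": latency_ms, "memory_bytes": memory_bytes}

params = partition_parameters(nc=1024, n_blocks=8, sample_rate=48000, decimation=4)
```

Under these assumptions the buffers add up to 38912 bytes, which agrees with the estimate of about 40 kbytes.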

processor load

--> for each block of 128 new samples (128 * I & 128 * Q) coming in, we have to do the following calculations:

one FFT256 (1280 complex multiplies & 1280 complex additions)
complex multiply-accumulate with the 8 filter blocks: 6 * 512 * 8 = 24576 multiplications and 4 * 512 * 8 = 16384 additions
one inverse FFT256 (1280 complex multiplies & 1280 complex additions)

--> about 27136 multiplications and 18944 additions per block, to be performed on a 216 MHz machine (STM32F7)

--> weighting each multiplication with 14 and each addition with 4 clock cycles: (14 * 27136 + 4 * 18944) / 216000000 = (380000 + 76000) / 216000000 = 2.11 ms

for a latency of 10.67 ms this is a processor load of 2.11 / 10.67 = 20%
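The same estimate can be written as a small Python function, which makes it easy to try other partition schemes. The function name `processor_load` is hypothetical. The operation count per FFT (1280) and the weights of 14 and 4 are taken from the calculation above; reading those weights as clock cycles per operation is an assumption.

```python
def processor_load(size, n_blocks, latency_ms, clock_hz=216e6,
                   cycles_per_mul=14, cycles_per_add=4, fft_ops=1280):
    """Rough load estimate for one block of `size` complex input samples."""
    values = 4 * size                                # 512 float values per buffer
    muls = 6 * values * n_blocks + 2 * fft_ops       # multiply-accumulate + FFT + iFFT
    adds = 4 * values * n_blocks + 2 * fft_ops
    time_ms = 1000.0 * (cycles_per_mul * muls + cycles_per_add * adds) / clock_hz
    return muls, adds, time_ms, time_ms / latency_ms

muls, adds, time_ms, load = processor_load(size=128, n_blocks=8, latency_ms=10.67)
```

For the example above this gives 27136 multiplications, 18944 additions, about 2.11 ms of computation per block and a load of roughly 20%.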

References

Armelloni et al. (2003): Implementation of real-time partitioned convolution on a DSP board. - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

Garcia, G. (2003): Optimal filter partition for efficient convolution with short input/output delay. - Audio Engineering Society Convention paper 5660, 1-9.

Kulp, B.D. (1988): Digital Equalization using Fourier Transform Techniques.

Pratt, W. (2017): WDSP guide.

Pratt, W. (2018): Open source DSP library wdsp.

Wefers, F. & M. Vorländer (2011): Optimal filter partitioning for real-time FIR filtering using uniformly-partitioned FFT-based convolution in the frequency domain. - Proc. of the 14th Conference on Digital Audio Effects (DAFx-11).