Low latency filtering: Partitioned Convolution - df8oe/UHSDR GitHub Wiki
work in progress - conceptual thoughts
Text and figures © DD4WH, under GNU GPLv3
A longstanding plan is the modification of the main audio filtering in UHSDR from time domain filtering to Fast Convolution filtering. However, in order to obtain filters steep enough, the FFT size for the FFT-iFFT audio chain has to be at least 2048 or even 4096 (FIR filter impulse response with 1025 / 2049 coefficients; however, by using decimation this could be reduced). This produces an inherent delay of 170msec @24ksps sample rate, which is unacceptable for CW operators and can also be annoying for operators in other modes.
A solution to this problem has been highlighted by Warren Pratt in his HAMRADIO 2018 talk at the Software Defined Academy, which is called "Partitioned Convolution" (see also Kulp 1988, Armelloni et al. 2003). In Partitioned Convolution, the filters impulse response is partitioned into separate blocks and so are the convolutions which are performed for the separate blocks and not one big FFT for the whole impulse response.
For UHSDR running on OVI40 with the STM32F7 processor, we would like to implement Fast Convolution filtering with partitioned convolution in order to minimize filter latency while maintaining a high quality filter with steep filter skirts ("brickwall").
Partitioned Convolution has been used extensively in the past ten years to enable very large FIR filter lengths with acceptable "real time" latency. However, this imposes a very large processor load. Garcia (2003) gives some hints and formulae to calculate MCU load for different partition schemes.
We will use multiple block convolution with uniform partition using a frequency domain delay line (FDL), sometimes also called Single-FDL Convolution (Garcia 2003). This is a quite efficient scheme, because for each block of input samples, only one FFT and one inverse FFT is needed to perform instead of one inverse FFT per partitioned block.
[the following is just notes taken from understanding wdsp, firmin.c, "Standalone Partitioned overlap-save bandpass", Pratt 2018]
abbreviations
SR = sample rate
DF = decimation factor
size = no. of input and output samples processed at one time
FFT_size = size of the FFT and inverse FFT used [FFT_size = 2 * size, because we use 50% overlap]
nc = no. of filter coefficients for the complex FIR bandpass filter
max_BW = maximum bandwidth useable for the filter [dependent on the sample rate and the decimation factor, --> Nyquist!]
N_blocks = the impulse response for the FIR filter with nc coeffs is partitioned into N_blocks blocks
latency = time that is needed for acquisition of the required number of input samples (size). latency = (size * DF) / SR
window used for the calculation of the filter coeffs: Blackman-Harris 4-term, this provides 110dB stopband attenuation (Pratt 2017, pp. 21), more than enough for our purposes.
Variables used
maskgen[FFT_size * 2] = holds the coefficients for each block of the impulse response prior to the FFT
fmask[N_blocks][FFT_size * 2] = holds the results of the FFTs of the blocks of the impulse response, used for the complex multiplication in the frequency domain
fftin[FFT_size * 2] = input buffer for the main real time FFT
fftout[N_blocks][FFT_size * 2] = ouput buffer for the FFT results, they need to be stored for each block in order to be used in subsequent rounds
accum[FFT_size * 2] = accumulator for the input to the final inverse FFT for each round
Setup (repeat every time the filter is adjusted):
- calculate the complex FIR filter coefficients (= impulse response) with windowing (Blackman-Harris 4-term) --> results in nc * 2 coefficients
- partition nc * 2 coefficients into Nblocks blocks of size * 2 coeffs
- fill first half of maskgen buffer [total size = size * 4] with size * 2 zeros
- put size * 2 coefficients of one block into second half of maskgen buffer [which has the total size: size * 4]
- Calculate a complex FFT of size FFT_size (= size * 2) of this block [gives size * 4 output values]
- store FFT results in fmask[N_blocks][FFT_size * 2]
- continue with 3. until the whole impulse response has been processed
Real-time filter process:
- collect I & Q samples (size * I + size * Q)
- overlap 50% with previous samples
- complex FFT of those size * 4 samples
- copy FFT result into fftout[buffidx]
- fill accum buffer with zeroes
- k = buffidx
- repeat for j=0; j < N_blocks; i++ {
- complex-multiply fftout[k] with fmask[j]
- accumulate result of complex-multiply in accum[size * 4]
- k-- (and wrap-around) }
- buffidx++
- copy second half of fftin buffer into first half of fftin buffer for next time
- inverse FFT of accum[size * 4]
- discard first half and take last (128 * I + 128 * Q) samples as output [overlap & save]
One example of a low latency filter with reasonable filter size would be:
nc = 1024 coefficients, N_blocks = 8, SR = 48ksps, DF = 4, size = 128, FFT_size = 256, latency = 10.7 millisec, max_BW = 5kHz
memory consumption: about 40kbytes
processor load
--> for 128 new samples (128 * I & 128 * Q) coming in, we have to do the following calculations:
- one FFT256 (1280 complex multiplies & 1280 complex additions)
- 6 * 512 * 8 = 24576 multiplications
- 4 * 512 * 8 = 16384 additions
- one inverse FFT256 (1280 complex multiplies & 1280 complex additions)
--> about 27136 multiplications and 18944 additions to be performed on a 216MHz machine (STM32F7)
--> (14 * 27136 + 4 * 18944) / 216000000 = (380000 + 76000) / 216000000 = 2.11ms
for a latency of 10.67ms this is a processor load of 2.11/10.67 = 20%
References
Armenolli et al. (2003): Implementation of real-time partitioned convolution on a DSP board. - IEEE workshop on Applications of Signal Processing to Audio and Acoustics - HERE
Garcia, G. (2003): Optimal filter partition for efficient convolution with short input/output delay. - Audio Engineering Socoety Convention paper 5660, 1-9. - HERE
Kulp, B.D. (1988): Digital Equalization using Fourier Transform Techniques. - HERE
Pratt, W. (2017): WDSP guide. - HERE
Pratt, W. (2018): Open source DSP library wdsp. - HERE
Wefers, F. & M. Vorländer (2011): Optimal filter partitioning for real-time FIR filtering using uniformly-partitioned FFT-based convolution in the frequency domain. - Proc. of the 14th Conference on Digital Audio Effects (DAFx-11) - HERE