Noise reduction - df8oe/UHSDR GitHub Wiki
Noise reduction in the UHSDR software
Text and figures © DD4WH, under GNU GPLv3
[We do want other people to be able to understand the firmware and the underlying theoretical considerations. We see that this is a real challenge and a major flaw in many other software projects]
fig. 0: Noise reduction in the UHSDR: work in progress [2017-11-18]
NEW (February 4th, 2018): new spectral NR in final test for OVI40, mcHF and mcHF with small flash
fig. 1: Noise reduction in the UHSDR: current implementation [2018-02-04]
SWITCH ON THE SPECTRAL NOISE REDUCTION WITH THE NORMAL PROCEDURE: BUTTON M2 (DSP BOX)
- Adjust NR strength with encoder 2
- ready!
- there is no more need to adjust anything, just switch on the NR
A few adjustments are in the DEBUG MENU however, just in case you want to play around with them: SPECTRAL NOISE REDUCTION
- alpha = NR setting in DSP box = smoothing constant for SNRprio calculation
- beta = [default 0.96]
- asnr = active SNR in dB [default 30dB]
Noise reduction is a pretty complex task. The UHSDR already has a working noise reduction (based on the least mean squares algorithm LMS, also called a Wiener filter), which is useful in some situations. In most situations, however, this noise reduction tends to produce more additional unwanted noise than reducing the noise. This subsequently leads to an unnatural sound like somebody shouting into a pipe (which is not due to the specific implementation of the LMS noise reduction on the UHSDR, but a general feature of ALL implementations of this kind of noise reduction).
There are better, but also much more complex noise reduction algorithms out there (most of them working in the frequency domain): The most promising algorithms mentioned in the professional noise reduction literature are spectral subtraction and spectral weighting methods (Martin 1994, Kim et al. 2002).
There is a longstanding wish to add a really good state-of-the-art noise reduction to the UHSDR. What are the design goals for such a noise reduction?
- Noise reduction should improve the Signal-to-Noise-Ratio of a signal by taking some of the noise out and leaving the speech signal inside the signal.
- A noise reduction cannot improve a signal with a very low SNR. Thus applying noise reduction to a very low quality signal does not improve the signal.
- Signal quality should remain approximately the same without distortion or "musical tones".
- It should be "light" enough to run on our small STM32F4 processor/or on the STM32F7.
Inventing a new noise reduction is far from our capabilities. So we should look around how others have dealt with this problem and use existing approaches. By reading through some papers of the very very rich literature (that´s awesome and awful at the same time ;-)) on this subject I drew these subjective conclusions:
- Use an algorithm in the frequency domain rather than in the time domain. See for example Schmitt et al. (2002).
- Use the algorithm by Ephraim & Malah (1984, 1985) or (much better) some modified less CPU-demanding version of that algorithm for the estimation of the noise amplitude in each frequency bin
- Use an algorithm that has explicitly been designed for low CPU load AND suppression of the musical tones often associated with Spectral Weighting, e.g. the low CPU load rule by Romanin et al. (2009).
- Use a VAD (voice activity detector) to decide which bins/frames should be weighted by the algorithm.
- Use relatively small sizes for the FFT-iFFT chain in order to restrict the memory needs, i.e. 256-point FFT/iFFTs.
- Use the algorithm in the decimated path, eg. in 12ksps (= 48ksps & decimate-by-4).
- Think hard whether the latency introduced by this implementation is small enough to be negligible even for CW freaks with QSK etc. and potentially new applications like VOX. Is a latency of 21.333ms (256 * 4 / 48000) acceptable???
I will now provide the starting point for a discussion:
Our time domain signal is x(k) which consists of the pure speech signal s(k) and noise n(k)
x(k) = s(k) + n(k) (eq. 1 of Schmitt et al. 2002)
Have a look at figure 1: Spectral Noise Reduction transforms the time domain signal into the frequency domain. This is done for "frames", ie. blocks of samples one at a time. The frame number is n. When we transform into the frequency domain, we get a spectrum of the signal with a number of frequency bins bin[i].
The principle of Spectral noise reduction is to estimate the noise in each bin (Nest) and multiply the bin amplitude with (1 - Nest) in order to eliminate the noise.
Y(n, bin[i]) = X(n, bin[i]) * (1 - (Nest(n,bin[i] / X(n, bin[i]))) (eq. 2 of Schmitt et al. 2002)
which is equivalent to:
Y(n, bin[i]) = X(n, bin[i]) * H(n, bin[i]) (eq. 3 of Schmitt et al. 2002)
So we need two core elements for the noise reduction:
- the noise estimator Nest to estimate the noise in each bin
- the algorithm/function H to calculate the weight to be applied to each bin
After that the weighting function is multiplied with the bin spectrum X(n, bin[i]) and that is transformed to the time domain again. Ready!
Diagram A: General functioning blocks of the noise reduction (modified from Schmitt et al. 2002). (c) DD4WH, under GNU GPLv3
I will now explain a relatively simple algorithm using spectral weighting and noise estimation by minimum statistics:
- FFT size is 256 points, overlap is 50% (half-overlapped data buffers), window function is Hann, frame step size is 128
- fill first half of FFT_buffer with last frame´s audio samples
- fill second half of FFT buffer with recent frame´s audio samples
- do windowing on the whole FFT buffer
- perform FFT 256 points
- NR_X[bin] --> calculate magnitude for 128 frequency bins (sqrtf(real * real + imag * imag))
- NR_E[bin] --> average magnitude over 2 to 6 frames
- NR_M[bin] --> search bin-wise for the minimum magnitude value in the last 12 to 40 frames
- NR_lambda[bin] --> calculate signal-noise-ration NR_T[bin] = NR[X] / NR_M[bin]; if NR_T > NR_PSI (this is an SNR threshold user-definable from 2 to 7) --> NR_lambda[bin] = NR_M[bin] else NR_lambda[bin] = NR_E[bin]
- gain weight calculation: NR_G[bin] = 1 - (NR_lambda[bin] / NR_X[bin])
- time smoothing (exponential averaging) of gain weights: NR_Gts[t, bin] = alpha * NR_Gts[t-1, bin] + (1 - alpha) * NR_G[t, bin]
- frequency smoothing of gain weights: NR_Gfs[bin] = beta * NR_Gts[bin - 1] + (1 - 2 * beta) * NR_Gts[bin] + beta * NR_Gts[bin + 1]
- apply gain weighting to all the FFT bins in the first half of the FFT buffer, real AND imaginary
- do this also for the second half of the FFT buffer in order to account for conjugate symmetry
- perform 256 point inverse FFT
- do overlap-add: take real part of first half of current iFFT result and add to 2nd half of last frames´ iFFT result
- DONE !
Acceptable preliminary results (clear noise suppression, some minor bubbling sounds, better intelligibility) have been achieved with the following settings:
- decimation-by-4 --> effective sample rate 12ksps
- number of frames to average magnitudes: L = 3
- number of averaged magnitudes values for minimum search: N = 18
- threshold for noise floor PSI = 2.5
- time averaging constant alpha = 0.98
- frequency averaging constant beta = 0.25
A second way of doing this is described in the scheme by Romanin et al. (2009) and Schmitt et al. (2002):
We need a fast weighting function Hk (taken from Romanin et al. 2009):
The weighting function estimates the signal-noise-ration SNR prior to the weighting SNRprio and the SNR posterior the weighting SNRpost.
- set time constant alpha for estimation of apriori SNR
- set time constant beta for exponential averager for noise power estimation to 0.85
- estimate the noise power spectrum in each bin by applying an exponential averager with beta = 0.85: Nest(n, bin[i]) = beta * Nest(n-1, bin[i]) + X(n, bin[i]) * (1 - beta) (eq. 5 of Schmitt et al. 2002, eq. 12 of Romanin et al. 2009)
- calculate SNRpost (n, bin[i]) = (X(n, bin[i])^2 / Nest(n, bin[i])^2) - 1 (eq. 13 of Schmitt et al. 2002)
- calculate SNRprio (n, bin[i]) = (1 - alpha) * Q(SNRpost(n, bin[i]) + alpha * (Hk(n - 1, bin[i]) * X(n - 1, bin[i])^2 / Nest(n, bin[i])^2 (eq. 14 of Schmitt et al. 2002, eq. 13 of Romanin et al. 2009) [Q[x] = x if x>=0, else Q[x] = 0]
- calculate vk = SNRprio(n, bin[i]) / (SNRprio(n, bin[i]) + 1) * SNRpost(n, bin[i]) (eq. 12 of Schmitt et al. 2002, eq. 9 of Romanin et al. 2009)
- finally calculate the weighting function for each bin: Hk(n, bin[i]) = 1 / SNRpost(n, [i]) * sqrtf(0.7212 * vk + vk * vk) (eq. 26 of Romanin et al. 2009)
Although the algorithm is faster than the original Ephraim/Malah algorithm by a factor of 115 (!) on average according to Romanin et al. (2009), it still consumes a lot of resources. Therefore we should update the noise spectrum estimation of Nest only in those frames n, where only noise is present (Romanin et al. 2009, p. 239). Those bins are detected by a voice activity detector VAD. The approach in Hu et al. (2001) is followed for this detector. The detector is a statistical model-based voice activity detection approach, which computes the likelihood ratio of speech being present or absent in the input frame n. The threshold of speech detection has to be determined: VAD_thresh
- We have to calculate (eq. 10) from Hu et al.(their eq.(10)) for one frame n:
VAD = 1/frame_size * sum of i = 0 to frame_size -1 of (X(n, bin[i])^2 / Nest(n, bin[i]^2) - logf (X(n, bin[i])^2 / Nest(n, bin[i]^2)) - 1)
If VAD is larger than VAD_thresh --> speech present! If VAD is smaller than VAD_thresh --> noise present!
Only in the latter case we should estimate the noise spectrum in that frame!
Diagram B: Implementation blocks of the noise reduction (modified from Romanin et al. 2009). (c) DD4WH, under GNU GPLv3
Preliminary tests of this algorithm implemented in the Teensy Convolution SDR show very promising results of strong noise reduction effects with neglectable artefacts !
The final (current) implementation of the spectral noise reduction in UHSDR draws together the best elements of the following papers:
-
noise estimation by MMSE (minimum mean square error) algorithm with SPP (speech presence probability) calculation in order to weight the update of the noise estimate with the probability of speech being present in the current frame (Gerkmann & Hendriks 2012, MATLAB voicebox estnoiseg)
-
SNRpost & SNRprio estimation as originally proposed by Ephraim & Malah 1984
-
however, contrary to Ephraim & Malah (1984), gain calculation is performed with the approximation formula by Romanin et al. (2009), which is about two orders of magnitude faster than the original algorithm
-
musical noise reduction by dynamic averaging as proposed by Warren Pratt in his wdsp library (2017)
-
we use a sample rate of 6ksps (filter bandwidth below 2.9kHz) or 12ksps (filter bandwidth > 2.7kHz and < 5.0kHz) and an FFT size of 256 with a frame overlap of 50%, a squareroot von Hann FFT window and a final synthesis window (squareroot von Hann) with overlap-add after the final inverse FFT
References
Berouti, M., R. Schwartz & J. Makhoul (1979): Enhancement of speech corrupted by acoustic noise. - Proc. IEEE Int Conf Acoust Speech Signal Proc, April 1979, pages 208-211. - HERE
Ephraim, Y. & D. Malah (1984): Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. - IEEE Trans Acoust Speech Signal Proc ASSP-32 (6), Dec 1984, pages 1109-1121. - HERE
Ephraim, Y. & D. Malah (1985): Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. - IEEE Trans Acoust Speech Signal ASSP-33 (2), April 1985, pages 443-445. - HERE
Esch, T. & P. Vary (2009): Efficient musical noise suppression for speech enhancement systems. - ICASSP 2009: 4409-4412. - HERE
Hu, Y., M. Bhatnagar & P. Loizou (2001): A cross-correlation technique for enhancing speech corrupted with correlated noise. - ICASSP 1, May 2001, pages 673-676. - HERE
Gerkmann, T. & R.C. Hendriks (2012): Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. - IEEE Transactions on Speech and Audio Processing 13(5): 857-869. HERE
Kim, H.-G. & D. Ruwisch (2002): SPEECH ENHANCEMENT IN NON-STATIONARY NOISE ENVIRONMENTS. - 7th international Conference on Spoken Language Processing Denver, Colorado, USA; September 16-20, 2002. HERE
Kim, H.-G., Obermayer, K., Bode, M. & D. Ruwisch (2001): Efficient Speech Enhancement by Diffusive Gain Factors (DGF). - Eurospeech 2001, Scandinavia. HERE
Martin, R. (2001): Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics. - IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 5, JULY 2001 HERE
Martin, R. (1994): Spectral Subtraction Based on Minimum Statistics. - Proc. EUSIPCO 94: 1182-1185. HERE
MATLAB voicebox implementation - MMSE noise estimation. HERE
Romanin, M., E. Marchetto & F. Avanzini (2009): A spectral subtraction rule for real-time DSP implementation of noise reduction in speech signals. - Proc. 12th Conf. DAFx-09, pages 235-239. - HERE
Schmitt, S., M. Sandrock & J. Cronemeyer (2002): Single channel noise reduction for hands free operation in automotive environments. - AES 112th Convention, Munich 2002 May 11-14. - HERE
Sohn, J. & W. Sung(): A voice activity detector employing soft decision based noise spectrum adaptation. - HERE