Pitch Detection Algorithms Comparison

Introduction

Pitch, an essential characteristic of speech, is tied to the concept of fundamental frequency (F0). It corresponds to the rate at which the vocal folds vibrate and is present in both speech and music. The emotional information that pitch can carry makes it an important attribute for researchers analyzing human perception and communication.

[1] Human vocal tract and vocal folds

Speech typically contains both voiced and unvoiced segments: voiced sections carry periodic vibration, while unvoiced sections are aperiodic. Pitch exists only in the periodic signals. Pitch detection algorithms (PDAs) identify the voiced and unvoiced portions of a speech signal and then determine the pitch of the voiced parts. PDAs are implemented in speech applications to identify speakers, determine intonation, and distinguish tones. Their applications extend to auditory aids for the deaf, automatic score transcription in music processing, language translation, and vocoder systems.

Pitch Detection Algorithms

The sections below introduce and discuss three PDAs: pYIN, YAAPT, and CREPE. Together they cover both conventional signal-processing methods and a deep learning model.

pYIN (Probabilistic YIN)

The probabilistic YIN algorithm operates in the time domain and is formulated from the autocorrelation method. It is an improved version of YIN, a well-known algorithm for fundamental frequency estimation.

The steps performed in YIN are autocorrelation, the difference function, the cumulative mean normalized difference (CMND) function, pitch period estimation, and fundamental frequency estimation. Autocorrelation compares a signal with a time-shifted version of itself. The difference function builds on this by summing the squared differences between the signal and its lagged copy at each candidate lag, d(τ) = ∑_n (x(n) − x(n+τ))². The CMND function then normalizes the difference function by its cumulative mean over shorter lags, which suppresses the spurious dip at zero lag and makes the dips at the true period easier to identify. In the following step, the algorithm estimates the pitch period by locating the first minimum of the CMND function that falls below a threshold, and from that period the fundamental frequency is calculated (F0 = sampling rate / period).
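To make these steps concrete, here is a minimal NumPy sketch of the core YIN computation for a single frame. The function name, defaults, and frequency range are illustrative choices, not part of the original algorithm specification:

```python
import numpy as np

def yin_f0(x, sr, fmin=65.0, fmax=500.0, threshold=0.1):
    """Estimate F0 for one frame via YIN's difference function and CMND."""
    tau_min, tau_max = int(sr / fmax), int(sr / fmin)
    n = len(x) - tau_max  # fixed comparison window so all lags are comparable

    # Difference function: d(tau) = sum_n (x[n] - x[n + tau])^2
    d = np.array([np.sum((x[:n] - x[tau:tau + n]) ** 2)
                  for tau in range(tau_max + 1)])

    # Cumulative mean normalized difference: d'(0) = 1,
    # d'(tau) = d(tau) / ((1/tau) * sum_{j=1..tau} d(j))
    cmnd = np.ones_like(d)
    cmnd[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(np.cumsum(d[1:]), 1e-12)

    # First dip below the absolute threshold gives the pitch period;
    # fall back to the global minimum if no dip qualifies.
    for tau in range(tau_min, tau_max):
        if cmnd[tau] < threshold:
            while tau + 1 < tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1  # slide down to the local minimum
            return sr / tau
    return sr / (tau_min + np.argmin(cmnd[tau_min:tau_max]))
```

For example, calling `yin_f0(frame, sr=16000)` on a 2048-sample voiced frame returns an F0 estimate in hertz.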

[2] Comparison of the first steps of the original YIN algorithm and the proposed pYIN algorithm

YIN outputs only one pitch value per frame, which limits the options for smoothing the output pitch contour. Instead of relying on a single absolute threshold, pYIN uses a distribution of thresholds, as shown in the figure above, and outputs multiple pitch candidates per frame together with their probabilities. This probabilistic thresholding allows a more optimal path because the output contour is now determined by the combined probabilities: the candidates are tracked over time with a hidden Markov model (HMM), which decodes the most likely pitch trajectory. Along with some modifications to the difference function, proper normalization, and thresholding that reduce the effect of a low sampling rate, pYIN performs better than YIN and avoids pitch doubling, making it more robust and effective. pYIN is noted for recovering higher pitch values, and is thus useful for high-pitched voices or music.
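pYIN is available off the shelf in librosa, so a quick way to try it looks roughly like this (the file path and note range are illustrative; `librosa.pyin` returns a per-frame F0 track plus voicing information):

```python
import librosa

# Load a mono recording (the path is a placeholder)
y, sr = librosa.load('speech.wav')

# Per-frame F0 estimates, a voiced/unvoiced flag, and voicing probabilities
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz('C2'),   # ~65 Hz search floor
    fmax=librosa.note_to_hz('C7'),   # ~2093 Hz search ceiling
    sr=sr)
```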

YAAPT (Yet Another Algorithm for Pitch Tracking)

The next algorithm is YAAPT, which stands for "yet another algorithm for pitch tracking." This method incorporates both time-domain and frequency-domain processing. Like pYIN, YAAPT is based on a previous algorithm, RAPT (robust algorithm for pitch tracking). RAPT is typically used to detect pitch in challenging environments, such as very noisy settings. It is time-domain based and includes a harmonic model to represent the periodic structure of the signal, and it was designed to be less sensitive to outliers so that noise picked up along the signal does not get amplified.

RAPT applies the normalized cross-correlation function (NCCF), another measure of the similarity between two signals. The underlying cross-correlation is CCF(m) = ∑_n x(n)·y(n+m), where x and y are the two signals (in pitch tracking, y is a time-shifted copy of x) and m is the lag between them. The NCCF normalizes this value by the energies of the two segments so that it ranges from −1 to 1 and is independent of the signals' amplitudes. Once the peaks of the NCCF are identified, the pitch can be calculated, because the peaks mark the lags at which the two signals are most similar. Despite its effectiveness, however, RAPT produces a large number of gross pitch errors due to frequent pitch doubling. To counter this, YAAPT combines frequency-domain analysis with the existing time-domain processing, and the result is more accurate.
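As an illustration, the NCCF of a frame against lagged copies of itself can be sketched in a few lines of NumPy (the function and parameter names here are our own, not from the RAPT/YAAPT papers):

```python
import numpy as np

def nccf(x, lag_min, lag_max, win):
    """Normalized cross-correlation of a frame against lagged copies of itself."""
    out = np.zeros(lag_max - lag_min + 1)
    a = x[:win]
    e_a = np.sqrt(np.sum(a * a))            # energy of the reference segment
    for i, m in enumerate(range(lag_min, lag_max + 1)):
        b = x[m:m + win]
        e_b = np.sqrt(np.sum(b * b))        # energy of the lagged segment
        # Normalization keeps the value in [-1, 1], independent of amplitude
        out[i] = np.sum(a * b) / max(e_a * e_b, 1e-12)
    return out  # peaks mark candidate pitch periods
```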

[3] YAAPT algorithm

As the diagram above shows, YAAPT consists of four stages: preprocessing, fundamental frequency track calculation, fundamental frequency candidate estimation, and final pitch determination. First, the original signal is processed into two versions: filtered speech and filtered squared speech. Filtered speech is a bandpass-filtered version of the original signal that keeps frequencies between roughly 50 and 1500 Hz, whereas filtered squared speech is a nonlinear version of the original signal (its square); squaring restores energy at the fundamental even when the fundamental itself has been attenuated, as in telephone speech. The spectrograms of these two preprocessed signals are analyzed to identify a pitch track using spectral harmonics correlation and the normalized low-frequency energy ratio. The preprocessed signals are then fed into the NCCF to estimate pitch candidates in the time domain, and the results are compared with the track obtained in the second stage. Finally, with these references, YAAPT determines the pitch accurately using dynamic programming.
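In Python, YAAPT is available through the third-party amfm_decompy package; assuming its documented API, usage looks roughly like this (the file path is a placeholder):

```python
import amfm_decompy.basic_tools as basic
import amfm_decompy.pYAAPT as pYAAPT

signal = basic.SignalObj('speech.wav')  # load the waveform (placeholder path)
pitch = pYAAPT.yaapt(signal)            # run the full four-stage pipeline
print(pitch.samp_values)                # per-frame F0 track in Hz (0 = unvoiced)
```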

CREPE (Convolutional Representation for Pitch Estimation)

In contrast to pYIN and YAAPT, CREPE, which stands for convolutional representation for pitch estimation, takes a neural network approach. PDAs based on machine learning have emerged more recently.

[4] The architecture of the CREPE pitch tracker. The six convolutional layers operate directly on the time-domain audio signal, producing an output vector that approximates a Gaussian curve, which is then used to derive the exact pitch estimate (Equations 2 and 3 in the original paper)

The CREPE algorithm operates in the time domain while employing a deep convolutional neural network. The network contains six convolutional layers followed by a densely connected output layer, which produces a vector of probabilities over pitch bins; the pitch estimate is derived from this distribution. The diagram above illustrates the overall architecture of CREPE.
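The shape of that architecture can be sketched as follows. This is a simplified stand-in written in PyTorch rather than the Keras used by the authors, and the filter counts, kernel sizes, and strides are illustrative placeholders rather than the paper's exact values (the 360 pitch bins, however, match the paper):

```python
import torch
import torch.nn as nn

class CrepeSketch(nn.Module):
    """Six 1-D conv layers on raw waveform frames, then a dense sigmoid
    output over pitch bins (layer sizes are illustrative, not the paper's)."""
    def __init__(self, n_bins=360):
        super().__init__()
        layers, in_ch = [], 1
        for i, out_ch in enumerate((64, 64, 64, 128, 128, 256)):
            layers += [
                # Wider kernel and stride 4 on the first layer, as in the paper
                nn.Conv1d(in_ch, out_ch,
                          kernel_size=64 if i == 0 else 16,
                          stride=4 if i == 0 else 1,
                          padding=32 if i == 0 else 8),
                nn.ReLU(),
                nn.MaxPool1d(2),
            ]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.out = nn.LazyLinear(n_bins)   # dense output layer over pitch bins

    def forward(self, x):                  # x: (batch, 1, 1024) audio frames
        h = self.conv(x).flatten(1)
        return torch.sigmoid(self.out(h))  # per-bin pitch probabilities

probs = CrepeSketch()(torch.randn(2, 1, 1024))  # -> shape (2, 360)
```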

Like the YIN algorithm, this model determines the periodicity of signals by working on the waveform itself, drawing on the phase (relative timing) information of the input instead of amplitude-based estimation. The latter is less robust in noisy environments because it is sensitive to changes in amplitude, which fluctuate easily in the presence of noise; accuracy then suffers when some frequency components change. Phase information, on the other hand, captures the relative positions of the different components within a signal and is less sensitive to changes in dynamic conditions.

CREPE trains its network on datasets annotated with ground-truth pitch values: it produces an output from the raw waveform of the input signal and checks whether that output matches the annotations. The content and size of the training data are therefore crucial for this algorithm, along with the training parameters, because they affect the final performance. Researchers can also train it specifically to target certain types of voice signals. CREPE is designed for real-time pitch detection as well: by default, audio is resampled to 16 kHz and the network processes a 64 ms (1024-sample) analysis window every 10 ms.

A pre-trained CREPE model is available at https://github.com/marl/crepe.
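Following that repository's README, running the pre-trained model on a recording looks like this (the file name is a placeholder):

```python
import crepe
from scipy.io import wavfile

sr, audio = wavfile.read('speech.wav')  # placeholder path
# viterbi=True smooths the frame-wise estimates into a continuous pitch track
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)
```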

Conclusion

Despite extensive study in the domain of pitch detection, most existing PDAs are still not ideal. Sounds are produced by the human body, which is an imperfect source: the waveforms we produce are usually not perfectly periodic after traveling through the vocal tract. This makes it harder for PDAs to distinguish voiced from unvoiced segments. Another challenge is differentiating between unvoiced speech and very low-frequency voiced speech. Researchers today are still targeting these bottlenecks while working toward better pitch detection algorithms.

References

[1] https://personal.utdallas.edu/~hxb076000/citing_papers/bartosek_Bartosek_Comparing%20Pitch%20Detection%20Algorithms%20for%20Voice%20Applications.pdf (Figure 1)

[2] https://www.eecs.qmul.ac.uk/~simond/pub/2014/MauchDixon-PYIN-ICASSP2014.pdf (Figure 1)

[3] https://medium.com/@neurodatalab/pitch-tracking-or-how-to-estimate-the-fundamental-frequency-in-speech-on-the-examples-of-praat-fe0ca50f61fd (Picture 1)

[4] https://arxiv.org/pdf/1802.06182.pdf (Figure 1)

https://web.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/107_comparative%20pitch%20detectors.pdf

https://www.ee.columbia.edu/~dpwe/e4896/lectures/E4896-L08.pdf

https://ccrma.stanford.edu/~pdelac/154/m154paper.htm