Additional Datasets for ASR - cereal-d3v/LLM-ASR GitHub Wiki

Datasets for Automatic Speech Recognition (ASR)

This section covers widely used datasets for training, evaluating, and developing ASR systems. It also looks at datasets that could improve ASR performance further, including diverse and multilingual ones.

1. Mozilla Common Voice

Mozilla Common Voice is a free, multilingual speech dataset that is open to the public. It aims to help build better speech recognition systems for everyone, including underrepresented languages. The dataset is crowdsourced, which means anyone can contribute their voice by reading sentences aloud.

State of the Art:

Key Features:

  • Multilingual: Supports a wide range of languages, making it ideal for building ASR systems across different regions and dialects.
  • Diverse speakers: The dataset includes voice samples from people of varying ages, genders, and accents, contributing to the diversity and robustness of ASR models.
  • Open-source: The data is freely available and can be used for academic and commercial purposes.

Potential Use for ASR:

Mozilla Common Voice can be a valuable resource for improving the accuracy of ASR systems, especially for languages with limited available data. It also supports research into accent and dialect recognition, which could enhance the adaptability of ASR systems to diverse speech patterns.
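As a rough illustration of the accent-oriented analysis mentioned above, the sketch below tallies clips per accent label from a Common Voice-style metadata TSV. The column names (`path`, `sentence`, `accents`) and the sample rows are assumptions made for illustration; check the header of the release you download, since column names may differ between versions.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt in the style of a Common Voice metadata TSV.
# Column names and rows are illustrative assumptions, not real data.
SAMPLE_TSV = (
    "path\tsentence\taccents\n"
    "clip_0001.mp3\tThe quick brown fox.\tScottish English\n"
    "clip_0002.mp3\tHello world.\t\n"
    "clip_0003.mp3\tGood morning.\tIndian English\n"
    "clip_0004.mp3\tSee you soon.\tScottish English\n"
)


def count_accents(tsv_file):
    """Count clips per accent label; empty labels are grouped as 'unspecified'."""
    # QUOTE_NONE keeps quotation marks inside sentences from confusing the parser.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        label = (row.get("accents") or "").strip() or "unspecified"
        counts[label] += 1
    return counts


accent_counts = count_accents(io.StringIO(SAMPLE_TSV))
for label, n in accent_counts.most_common():
    print(f"{label}: {n}")
```

With a real release, the same function could be called on an open file handle for the metadata TSV instead of the in-memory sample.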

Access:

Accessible here. The dataset can be downloaded in pieces; the total size is 83 GB.

2. WeNetSpeech

WeNetSpeech is a large-scale, open-source dataset for speech recognition and other speech-related tasks. It is designed to facilitate the development of both end-to-end ASR models and traditional pipeline models.

Key Features:

  • Massive dataset: Contains over 10,000 hours of transcribed speech data, making it one of the largest ASR datasets available.
  • Chinese speech: Primarily focused on Mandarin Chinese, it provides a valuable resource for developing state-of-the-art ASR models for this language.
  • High-quality annotations: The dataset is meticulously transcribed, ensuring high-quality training data for speech recognition tasks.

Potential Use for ASR:

WeNetSpeech can be particularly useful for developing ASR systems tailored to Mandarin Chinese. Given the scale of the dataset, it also offers opportunities for training robust models with high accuracy across different speech styles.
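To get a feel for the scale before training, it can help to total the transcribed audio per subset from the metadata. The sketch below assumes a JSON layout with an `audios` list whose entries carry `segments` with `begin_time`, `end_time`, and `subsets` fields. That layout is an assumption here and the sample values are made up, so verify the field names against the metadata file shipped with the dataset.

```python
import json
from collections import defaultdict

# Hypothetical metadata excerpt. Field names follow the layout assumed in the
# lead-in above; all identifiers, times, and text are invented for illustration.
SAMPLE_METADATA = json.loads("""
{
  "audios": [
    {
      "aid": "A0001",
      "source": "youtube",
      "segments": [
        {"sid": "A0001_S0", "begin_time": 0.0, "end_time": 4.5,
         "subsets": ["L", "M"], "text": "你好"},
        {"sid": "A0001_S1", "begin_time": 5.0, "end_time": 8.0,
         "subsets": ["L"], "text": "谢谢"}
      ]
    },
    {
      "aid": "A0002",
      "source": "podcast",
      "segments": [
        {"sid": "A0002_S0", "begin_time": 1.0, "end_time": 3.5,
         "subsets": ["L", "M", "S"], "text": "再见"}
      ]
    }
  ]
}
""")


def seconds_per_subset(metadata):
    """Sum segment durations (in seconds) for every subset a segment belongs to."""
    totals = defaultdict(float)
    for audio in metadata["audios"]:
        for segment in audio["segments"]:
            duration = segment["end_time"] - segment["begin_time"]
            for subset in segment["subsets"]:
                totals[subset] += duration
    return dict(totals)


subset_seconds = seconds_per_subset(SAMPLE_METADATA)
for name in sorted(subset_seconds):
    print(f"subset {name}: {subset_seconds[name] / 3600:.6f} hours")
```

The full metadata file for a corpus of this size is large, so a streaming JSON parser may be preferable to `json.load` in practice.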

Access:

Accessible here

Download using the Linux shell:

    bash utils/download_wenetspeech.sh DOWNLOAD_DIR UNTAR_DIR

Download using modelscope:

    conda create -n modelscope python=3.7
    conda activate modelscope
    pip install torch
    pip install modelscope -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html


3. ESC-50

ESC-50 is an environmental sound dataset consisting of 50 classes of sounds from everyday environments, such as animal noises, human activities, and natural phenomena.

Key Features:

  • 50 different sound categories: The dataset includes a variety of non-speech sounds, organized into categories like human, nature, and mechanical sounds.
  • 5-second clips: Each sound file is 5 seconds long, making it a useful dataset for detecting and identifying short, distinct sounds.
  • Balanced dataset: The number of examples for each category is balanced, ensuring fairness in training and evaluation.
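The balance claim is easy to check from the dataset's metadata, and the same metadata can drive a fold-based train/test split. The sketch below assumes a CSV with `filename`, `fold`, `target`, and `category` columns; the rows shown are invented stand-ins, not actual dataset entries.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the style of an ESC-50 metadata CSV (assumed columns).
SAMPLE_CSV = """filename,fold,target,category
1-0001-A-0.wav,1,0,dog
2-0002-A-0.wav,2,0,dog
1-0003-A-10.wav,1,10,rain
2-0004-A-10.wav,2,10,rain
"""


def load_rows(csv_file):
    """Read every metadata row into a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))


def split_by_fold(rows, test_fold):
    """Hold out one fold for testing and keep the remaining folds for training."""
    train = [r for r in rows if int(r["fold"]) != test_fold]
    test = [r for r in rows if int(r["fold"]) == test_fold]
    return train, test


rows = load_rows(io.StringIO(SAMPLE_CSV))
class_counts = Counter(r["category"] for r in rows)
# The set is balanced when every category has the same number of clips.
is_balanced = len(set(class_counts.values())) == 1
train_rows, test_rows = split_by_fold(rows, test_fold=1)
print(dict(class_counts), is_balanced, len(train_rows), len(test_rows))
```

Splitting on a predefined fold column, rather than shuffling rows at random, keeps results comparable across experiments that use the same folds.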

Potential Use for ASR:

While ESC-50 is not a traditional ASR dataset, it could be useful for building models that need to differentiate between speech and non-speech sounds. For example, it can help ASR systems learn to ignore background noise or to detect environmental sounds that could impact speech recognition quality.
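One common way to use non-speech clips for this purpose is noise augmentation: mixing a noise clip into clean speech at a chosen signal-to-noise ratio (SNR). This is a general technique, not something the dataset itself prescribes. The sketch below uses plain Python lists and synthetic signals as stand-ins for real audio, and the function names are hypothetical.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the noise if it is shorter than the speech, then trim to length.
    repeats = -(-len(speech) // len(noise))
    noise = (list(noise) * repeats)[: len(speech)]
    noise_power = power(noise)
    if noise_power == 0:
        return list(speech)  # silent noise clip: nothing to add
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins: a 440 Hz tone for speech, seeded random values for noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]

mixed = mix_at_snr(speech, noise, snr_db=10.0)
residual = [m - s for m, s in zip(mixed, speech)]
achieved_snr_db = 10 * math.log10(power(speech) / power(residual))
print(f"achieved SNR: {achieved_snr_db:.2f} dB")
```

In a training pipeline, the SNR would typically be drawn at random per example so the model sees a range of noise levels.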

Access:

Accessible here

Zip File Link (warning: 600 MB zip file that contains the whole dataset)

4. CHiME-5

CHiME-5 is part of the CHiME series of datasets, which focus on speech recorded with high levels of environmental noise or low fidelity in order to encourage robustness in ASR research. Researchers built CHiME-5 by staging dinner parties and recording the audio with strategically placed Xbox Kinect microphones, yielding a dataset of realistic speech captured on consumer-grade equipment.

Key Features:

  • Multi-channel recording: The dataset gives the user access to every Kinect recording channel from the recorded sessions.

Potential Use for ASR:

CHiME-5 is a useful source of edge cases: recordings in which several voices are present in addition to the target voice. The consumer-grade equipment also reflects real-world use cases for ASR systems.
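Access to several channels makes it possible to combine them before recognition. The sketch below shows the simplest such combination, averaging time-aligned channels, on synthetic signals. It assumes the channels are already aligned with zero relative delay, which real microphone arrays generally are not; practical systems estimate per-channel delays first (delay-and-sum beamforming) or use more advanced methods. All names and signals here are illustrative.

```python
import math
import random


def average_channels(channels):
    """Average equal-length, time-aligned channels sample by sample."""
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same length")
    return [sum(frame) / len(channels) for frame in zip(*channels)]


def mean_squared_error(estimate, reference):
    """Average squared difference between two equal-length signals."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


# Synthetic stand-in: one clean tone observed on four channels, each with
# its own independent noise (seeded so the run is repeatable).
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
channels = [[s + rng.gauss(0.0, 0.5) for s in clean] for _ in range(4)]

combined = average_channels(channels)
single_error = mean_squared_error(channels[0], clean)
combined_error = mean_squared_error(combined, clean)
print(f"single channel MSE: {single_error:.4f}")
print(f"averaged MSE:       {combined_error:.4f}")
```

Averaging helps here because the noise is independent across channels while the target signal is identical; overlapping speakers, as in dinner-party recordings, are a harder case that this sketch does not address.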

Access:

Accessible here

Potential Future Directions

Beyond using each dataset on its own, future ASR research can combine them to build more robust models that perform well across different languages, dialects, and noisy environments. Some ideas include:

  • Multilingual ASR models using Mozilla Common Voice’s diverse language support.
  • Noise-aware ASR systems that incorporate environmental sound recognition using ESC-50, improving ASR performance in real-world settings.
  • Fine-tuning on large datasets like WeNetSpeech to develop models with higher accuracy and better generalization, especially for specific languages like Mandarin Chinese.


These datasets offer a broad range of opportunities for improving ASR systems. By leveraging a combination of these resources, it is possible to build more versatile and accurate ASR systems that work well across different environments and languages.