The Librispeech Corpus

IN PROGRESS

This page serves to outline the repository's treatment of the corpus, as well as any information one should know when working with it.

Summary

Librispeech is a freely available speech corpus derived from LibriVox. Utterances are extracted from amateur, publicly sourced readings of books in the public domain. The corpus is partitioned into 7 tranches: train-clean-100 (~101hrs), train-clean-360 (~363hrs), train-other-500 (~497hrs), dev-clean (~5hrs), dev-other (~5hrs), test-clean (~5hrs), and test-other (~5hrs). "clean" and "other" roughly correspond to "easy" and "hard" utterances, respectively. Evaluation is traditionally performed at the word level.
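For reference, here is the same breakdown as a small Python mapping. The hour counts are just the approximate figures quoted above; this is not taken from the repository's own code.

```python
# Approximate hours of audio per Librispeech partition, as listed above.
LIBRISPEECH_HOURS = {
    "train-clean-100": 101,
    "train-clean-360": 363,
    "train-other-500": 497,
    "dev-clean": 5,
    "dev-other": 5,
    "test-clean": 5,
    "test-other": 5,
}

# The three training tranches together come to roughly 961 hours.
total_train_hours = sum(
    hours for name, hours in LIBRISPEECH_HOURS.items() if name.startswith("train"))
```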

Acquisition

The corpus is freely available at OpenSLR. The page lists the corpus as licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). It is derived from LibriVox recordings, which are themselves released into the public domain.

The corpus's standard n-gram language models are also available from OpenSLR; unlike the audio, they are in the public domain.
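If all you need are the raw tarballs, something like the following works. This is only a sketch: the mirror URL and the resources/12 path are my assumptions about OpenSLR's current layout (check https://www.openslr.org/12 for the canonical links), and it is not the repository's own download code.

```python
import os
import urllib.request

# Assumed OpenSLR layout: resource 12 is Librispeech, with one tarball per
# partition. Verify the links and available mirrors on https://www.openslr.org/12
# before relying on these URLs.
BASE_URL = "https://www.openslr.org/resources/12"
PARTITIONS = (
    "dev-clean", "dev-other", "test-clean", "test-other",
    "train-clean-100", "train-clean-360", "train-other-500",
)


def download_librispeech(out_dir="librispeech_tars"):
    """Fetch each partition's tarball, skipping any already downloaded."""
    os.makedirs(out_dir, exist_ok=True)
    for part in PARTITIONS:
        dest = os.path.join(out_dir, part + ".tar.gz")
        if not os.path.exists(dest):
            urllib.request.urlretrieve(f"{BASE_URL}/{part}.tar.gz", dest)


if __name__ == "__main__":
    download_librispeech()
```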

Detailed Description

Librispeech was collected by Vassil Panayotov, a key contributor to the Kaldi ASR project. The ICASSP submission explicitly mentions that the accompanying Kaldi recipes "demonstrate how high quality acoustic models can be trained on this data." Thus, the submission and the Kaldi recipe itself can be considered authoritative on corpus matters. The submission is short and readable enough that I recommend going through it. Below are a few key qualities of the corpus.

Reception

In my opinion, Librispeech has become the most popular corpus for ASR evaluation, as well as a popular source of supplementary training data for so-called "pre-training" of recognizers aimed at other evaluations. I attribute this mainly to two factors: its size (~1000 hours!) and the fact that it's free. In addition, its evaluation protocol isn't nearly as complicated as those of the various ARPA initiatives like WSJ or TIMIT, making it easy to get up and running. Finally, I suspect that engineers like corpora that make their systems look good, and Librispeech error rates are consistently lower than those of other corpora.

Generalizability

Librispeech has no doubt been a force in the big data approach to ASR. If you submit to an academic conference and don't test on Librispeech, relying instead on some smaller corpus, there's a good chance a reviewer will ask "why no Librispeech?" It is also hard to deny that supplementing your training data with Librispeech will often improve your scores on other corpora.

That said, Librispeech is not a be-all, end-all solution to ASR. To illustrate, take a look at this paper by Facebook researchers, specifically Table 4. With the exception of WSJ (a much smaller corpus of news readings), replacing a corpus' training set with Librispeech data leads to worse performance than if the original training set had been used. While Librispeech training does lead to the best averaged error rates in that paper, you're much better off trying to figure out which corpus is closest to the type of data you'll be deploying on. I suspect conversational corpora like Fisher are better suited to industry applications, if you can afford the license. Well... unless you're trying to transcribe an audiobook.

Training with Librispeech in addition to your in-domain corpus is another matter. I can't imagine a situation where training first with Librispeech followed by fine-tuning with your in-domain corpus would hurt.
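Here is a minimal sketch of that recipe, assuming a PyTorch workflow: the toy model, the synthetic tensors standing in for real features and labels, and the lower fine-tuning learning rate are all placeholders for illustration, not anything mandated by the corpus or by this repository.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: a tiny feed-forward "acoustic model" and random tensors in
# place of real features/labels. Only the order of operations matters here.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 29))


def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, labels in loader:
            opt.zero_grad()
            loss_fn(model(feats), labels).backward()
            opt.step()


librispeech = DataLoader(
    TensorDataset(torch.randn(512, 80), torch.randint(0, 29, (512,))),
    batch_size=32, shuffle=True)
in_domain = DataLoader(
    TensorDataset(torch.randn(128, 80), torch.randint(0, 29, (128,))),
    batch_size=32, shuffle=True)

train(model, librispeech, epochs=5, lr=1e-3)  # pre-train on Librispeech
train(model, in_domain, epochs=5, lr=1e-4)    # then fine-tune on in-domain data
```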

Low Error Rates

I mentioned earlier that Librispeech's state-of-the-art error rates are lower than those of other corpora. This can't be explained away by the sheer quantity of training data: if it could, we'd expect error rates on corpora like Common Voice (CV), which is over twice the size of Librispeech, to be even lower. Instead, CV error rates are much higher than those of Librispeech.

What's going on? Though I can't be sure, I'm guessing it's a combination of the type of utterance that makes up LibriVox and the sanitization process by which utterances are extracted for Librispeech. While the speakers are (presumably) untrained, LibriVox prompts were extracted from books. The prompts can be considered well-formed English (if archaic) with few specialized technical terms. In contrast, CV prompts were pulled from Wikipedia and thus are not so well-formed. This impacts both the quality of language modelling and the manner in which prompts are delivered.

In addition, the mechanism by which Librispeech automatically extracted and sanitized utterances from LibriVox likely ended up choosing the clips that are easiest for ASR to transcribe. After all, a precondition for a clip's inclusion in the corpus is that a GMM-based ASR system can force-align the book's text to the audio (plus a number of other niceties, such as the presence of surrounding silence). Therefore, even the utterances in the "other" partitions are still of relatively high quality. In contrast, CV vets transcriptions through crowd-sourcing. While that perhaps admits utterances that are more difficult for ASR, I'd imagine it's also less reliable.

This comparison isn't meant to say that Librispeech's approach is wrong per se. As stated before, every corpus is tied to the domain it represents, and these design decisions are part of that domain. If anything, I'd criticize CV instead for being too sloppy.

Unsupervised Learning

A common use of Librispeech is as a source for unsupervised learning (e.g. WaveNet). This sort of training only needs the audio itself. In that case, however, you might as well switch to Libri-Light, which uses a much larger portion of LibriVox.

MP3s and De-Noising

The ICASSP paper mentions that the LibriVox recordings were saved in MP3 format and also denoised (though apparently neither practice was consistently enforced). I can't be bothered to determine how much of the corpus was originally compressed and/or denoised. Given that Librispeech training was effective on the WSJ evaluation, the latter being a clean corpus that was neither compressed nor denoised, I'm not too worried.