# The Librispeech Corpus
IN PROGRESS
This page serves to outline the repository's treatment of the corpus, as well as any information one should know when working with the corpus.
The Librispeech corpus is a freely-available corpus extracted from LibriVox. Utterances are drawn from novice, publicly-sourced readings of books in the public domain. The database is partitioned into 7 tranches: `train-clean-100` (~101hrs), `train-clean-360` (~363hrs), `train-other-500` (~497hrs), `dev-clean` (~5hrs), `dev-other` (~5hrs), `test-clean` (~5hrs), and `test-other` (~5hrs). `clean` and `other` roughly correspond to "easy" and "hard" utterances. Evaluation is traditionally at the word level.
The corpus is freely available at OpenSLR. The page lists the corpus as protected under the Creative Commons Attribution 4.0 International License (CC BY 4.0). It is derived from LibriVox, itself protected by CC BY 4.0.
The corpus' standard n-gram language models are also available from OpenSLR, though they are in the public domain.
We provide a command to automatically download all the materials. See the Quickstart section below.
## Quickstart

The primary interface is through `librispeech.py`. One can download the corpus and format it using just these commands:

```
python pytorch-database-prep/librispeech.py data download
python pytorch-database-prep/librispeech.py data preamble
python pytorch-database-prep/librispeech.py data init_word
python pytorch-database-prep/librispeech.py data torch_dir  # very slow!
```

These commands will populate the `data/` folder with something like:
```
data/
    dev_clean/
        feat/
        ref/
    dev_other/
    test_clean/
    test_other/
    train_2kshort/
    train_5k/
    train_10k/
    train_clean_100/
    train_clean_360/
    train_other_500/
    ext/
        dev_clean.ref.trn
        dev_other.ref.trn
        ...
        train_other_500.ref.trn
        token2id.txt
        id2token.txt
        ...
    local/
```
All folders from `dev_clean` to `train_other_500` contain two subdirectories each: `feat/` and `ref/`. Files in the `feat/` and `ref/` subdirectories have the format `(feat|ref)/<utt_id>.pt`, where `utt_id` is the corresponding utterance ID from the Librispeech corpus. `feat/` stores feature sequences for utterances; `ref/` stores reference sequences. A feature sequence is a float tensor of shape `(T, F)`, where `T` is the number of audio feature frames and `F` is the number of features per frame. By default, `F=41` for the standard 40-dimensional triangular Mel-scaled filter bank plus one energy coefficient. A reference sequence is a long tensor of shape `(R,)`, where `R` is the number of word tokens in the utterance. Librispeech does not include any segmentations. Token IDs can be mapped back and forth to words with the files `ext/token2id.txt` and `ext/id2token.txt`. While these folders are set up for use with `pydrobert-pytorch`'s `SpectDataLoader`, this is by no means a requirement.
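For example, here is a minimal sketch of inspecting one prepared utterance without `pydrobert-pytorch`. The paths assume the quickstart's `data` directory, the utterance ID is hypothetical, and the `<id> <token>` column order assumed for `ext/id2token.txt` should be checked against the actual file:

```python
import torch

# Assumed layout: one "<id> <token>" pair per line of ext/id2token.txt.
# Swap the two fields below if the file is actually "<token> <id>".
id2token = {}
with open("data/ext/id2token.txt") as f:
    for line in f:
        idx, token = line.strip().split()
        id2token[int(idx)] = token

utt_id = "84-121123-0000"  # hypothetical dev_clean utterance ID

feats = torch.load(f"data/dev_clean/feat/{utt_id}.pt")  # float tensor of shape (T, F)
ref = torch.load(f"data/dev_clean/ref/{utt_id}.pt")     # long tensor of shape (R,)

print(feats.shape)                              # e.g. torch.Size([T, 41]) with default features
print(" ".join(id2token[int(i)] for i in ref))  # words of the reference transcript
```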
Textual transcripts of references can be found in the files `ext/<partition>.ref.trn`. They are in NIST's TRN format, which is basically a space-delimited list of tokens followed by an utterance ID in parentheses. If your ASR system outputs hypothesis token sequences to a `ref/`-like directory, you can convert those to a TRN with `pydrobert`'s `torch-token-data-dir-to-trn`. Then, assuming you saved the file as, e.g., `test_clean.hyp.trn`, you can compute WERs using our script (whose backend is jiwer):

```
python pytorch-database-prep/error-rates-from-trn.py data/ext/test_clean.ref.trn test_clean.hyp.trn
```

Note that some tokens in the dev/test sets cannot be found in the standard 200k (+3 special) word-type vocabulary. They are mapped to the `<UNK>` token in the `ref/` folders. For a correct WER, make sure to use the TRN files as reference, not the saved token sequence tensors in `ref/`.
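For reference, each line of a TRN file is just the (space-delimited) transcript followed by the utterance ID in parentheses. A hypothetical hypothesis file (made-up transcripts and IDs) might look like:

```
i am a hypothetical transcript (84-121123-0000)
and i am another one (84-121123-0001)
```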
The `download` command downloads the corpus to the `local/data` subdirectory of the data directory (in the quickstart, this was `data`). The result has the structure:
```
data/local/data/
    dev-clean/
        84/
            121123/
                84-121123-0000.flac
                ...
        ...
    dev-other/
    test-clean/
    test-other/
    train-clean-100/
    train-clean-360/
    train-other-500/
    lm/
    librispeech-vocab.txt
    SPEAKERS.TXT
    ...
```
If one wants to do any language modelling - either by using the pretrained n-gram language models or by using the training materials - one should add the `--lm` flag to the `download` command to download those resources as well:

```
python pytorch-database-prep/librispeech.py data download --lm
```

If one wants to download only a subset of the possible files, one can specify those files with the `--files` flag, e.g.

```
python pytorch-database-prep/librispeech.py data download --files {dev,test}-{clean,other}.tar.gz train-clean-100.tar.gz librispeech-vocab.txt
```

Though any number of files can be downloaded this way, the example above lists the files necessary for subsequent commands to succeed. Note that, in this example, it will be necessary to set the `--compute-up-to` flag to `100` when running the `torch_dir` command to keep from trying to compute features for the missing `train-clean-360` and `train-other-500` partitions.
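For example, the later feature-extraction step would then look something like:

```
python pytorch-database-prep/librispeech.py data torch_dir --compute-up-to 100
```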
Alternatively, one can skip the command and manually download the files dev-clean.tar.gz, dev-other.tar.gz, test-clean.tar.gz, test-other.tar.gz, train-clean-100.tar.gz, train-clean-360.tar.gz, train-other-500.tar.gz, and librispeech-vocab.txt, plus the files 3-gram.arpa.gz, 3-gram.pruned.1e-7.arpa.gz, 3-gram.pruned.3e-7.arpa.gz, 4-gram.arpa.gz, and librispeech-lm-norm.txt.gz if you want to do any explicit language modelling. Store them all in a single folder and extract the `x.tar.gz` files into subfolders named `x`. The subsequent `preamble` and `init_word` commands must then receive the path to that folder as a positional argument:

```
LIBRISPEECH=/path/to/downloaded/data
python pytorch-database-prep/librispeech.py data preamble $LIBRISPEECH
python pytorch-database-prep/librispeech.py data init_word $LIBRISPEECH
```

The `download` command is provided as a convenience; manual or automatic downloading should result in the same prepared data.
The `preamble` command has two variants. The first relates to how speakers are defined. By default, we follow Kaldi's convention of treating each reader-chapter pair as a speaker. They cite "simplicity" and per-chapter CMVN as the reason for this. Clearly the more accurate mapping is to treat "reader" and "speaker" identically. One may use this mapping with a flag:

```
python pytorch-database-prep/librispeech.py data preamble --readers-are-speakers
```

The remainder of the recipe merely passes the speaker information on. Unless one plans on performing speaker-dependent ASR, the distinction is moot.
The second variant of `preamble` allows you to exclude some training subsets from future consideration with the `--exclude-subsets` flag. The subsets `train_2kshort`, `train_5k`, and `train_10k` are all strict subsets of `train-clean-100`. As such, no training data is lost by excluding them. They exist as part of Kaldi's training procedure, which trains smaller models on less data first. If you don't plan on using these partitions in your recipe, excluding them will save a bunch of time copying features.
`init_word` has only one variant, and not a particularly meaningful one (yet). In this stage, one of the downloaded language models (if available) is chosen to be copied into the configuration subdirectory as `arpa.lm.gz`, eventually to make its way into `data/ext`. By default, the largest pre-trained LM (`4-gram.arpa.gz`) is copied. The copy is no different from any of the downloaded language models.
`torch_dir` has the usual variants for creating features, including `--raw` (simply saves the audio files as PyTorch tensors), `--computer-json` to configure features as something other than 41-dimensional f-bank features, `--preprocess` and `--postprocess` to add things like dithering or deltas, and `--features-from` to copy features from another directory. `--compute-up-to {100,360,500}` is unique to Librispeech. It allows one to avoid computing features for the larger partitions, which can save a lot of time and memory if you're not doing anything with them. `--force-compute-subsets`, also unique to Librispeech, forces the command to re-compute the features for `train_2kshort`, `train_5k`, and `train_10k`, rather than just copying the relevant ones from `train-clean-100`.
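As a sketch of how these flags might be combined (consult the command's `--help` output for authoritative usage), one could store raw audio tensors for only the 100-hour training condition with something like:

```
python pytorch-database-prep/librispeech.py data torch_dir --raw --compute-up-to 100
```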
Librispeech was collected by Vassil Panayotov, a key contributor to the Kaldi ASR project. The ICASSP submission explicitly mentions that the Kaldi recipes "demonstrate how high quality acoustic models can be trained on this data." Thus, the submission and the Kaldi recipe itself can be considered authoritative with respect to corpus matters. The submission is short and easy enough that I recommend reading it. Below are a few key qualities of the corpus.
In my opinion, Librispeech has become the most popular corpus for ASR evaluation, as well as a popular source of supplementary training data for so-called "pre-training" of recognizers for other evaluations. I attribute this mainly to two factors: its size (~1000 hours!) and the fact it's free. In addition, the corpus evaluation isn't nearly as complicated as the various ARPA initiatives like WSJ or TIMIT, making it easy to get up and running. Finally, I suspect that engineers like corpora that make their systems look good, and Librispeech error rates are consistently lower than those of other corpora.
Librispeech has no doubt been a force in the big data approach to ASR. If you submit to an academic conference and don't test on Librispeech, relying instead on some smaller corpus, there's a good chance a reviewer will ask "why no Librispeech?" It is also hard to deny that supplementing your training data with Librispeech will often improve your scores on other corpora.
That said, Librispeech is not a be-all, end-all solution to ASR. To illustrate, take a look at this paper by Facebook researchers, specifically Table 4. With the exception of WSJ (a much smaller corpus of news readings), replacing a corpus' training set with Librispeech data leads to worse performance than using the original training set. While Librispeech training does lead to the best averaged error rates in that paper, you're much better off trying to figure out which corpus is closest to the type of data you'll be deploying on. I suspect conversational corpora like Fisher are better suited to industry applications, if you can afford the license. Well... unless you're trying to transcribe an audiobook.
Training with Librispeech in addition to your in-domain corpus is another matter. I can't imagine a situation where training first with Librispeech followed by fine-tuning with your in-domain corpus would hurt.
I mentioned earlier that Librispeech's state-of-the-art error rates are lower than those of other corpora. This can't be explained away by the quantity of data in the training partition, since then we'd expect error rates on corpora like Common Voice (CV), which is over twice the size of Librispeech, to be even lower. But CV error rates are much higher than those of Librispeech.
What's going on? Though I can't be sure, I'm guessing it's a combination of the type of utterances composing LibriVox and the sanitization process by which utterances are extracted for Librispeech. While the speakers are (presumably) untrained, LibriVox prompts were extracted from books. The prompts can be considered well-formed English (if archaic) with few specific technical terms. In contrast, CV prompts were pulled from Wikipedia and thus are not so well-formed. This impacts both the quality of language modelling and the manner in which prompts are delivered. In addition, the mechanism whereby Librispeech automatically extracted and sanitized utterances from LibriVox likely ended up choosing the easiest clips for ASR to transcribe. After all, a precondition for a clip's inclusion in the corpus is that a GMM-based ASR system can force-align the text in the book to the audio (plus a number of other niceties, including the presence of surrounding silence). Therefore, even the utterances in the "other" partitions are still of relatively high quality. In contrast, CV vets transcriptions through crowd-sourcing. While that perhaps lets in utterances more difficult for ASR, I'd imagine it's also less reliable.
This comparison isn't meant to say that Librispeech is wrong in its approach per se. As stated before, all corpora are tied to the domain they represent, and these decisions fold into that domain. If anything, I'd criticize CV instead for being too sloppy.
A common use of Librispeech is as a source for unsupervised learning (e.g. WaveNet). This sort of training only needs the audio itself. In that case, however, you might as well switch to Libri-Light, which uses a much larger portion of LibriVox.
The ICASSP paper mentions that the LibriVox recordings were saved in MP3 format and also denoised (though apparently neither was consistently enforced). I can't be bothered to determine how much of the corpus was originally compressed and/or denoised. Given that Librispeech training was effective on the WSJ evaluation - WSJ being a clean corpus that was neither compressed nor denoised - I'm not too worried.