The TIMIT Corpus

This page serves to outline the repository's treatment of the corpus, as well as any information one should know when working with the corpus.

Summary

The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a very small corpus for recognizing sequences of phones instead of words. It was developed jointly by DARPA, Texas Instruments, and MIT, and released in 1988. The training set clocks in at slightly over 3 hours; the core test set, at about 15 minutes. Speakers were classified into one of 8 American regional dialects (one being "Army brat").

Acquisition

The corpus is available from the LDC under the catalog number LDC93S1.

Quickstart

The primary interface to TIMIT is timit.py. Running the following commands will set up the standard TIMIT environment for training.

TIMIT=/path/to/timit/directory  # immediate subdirectories should include 'DOC', 'TRAIN', and 'TEST'
python pytorch-database-prep/timit.py data preamble $TIMIT
python pytorch-database-prep/timit.py data init_phn
python pytorch-database-prep/timit.py data torch_dir

This will populate the data/ folder with something like

data/
    local/
    train/
        feat/
        ref/
    dev/
        feat/
        ref/
    test/
        feat/
        ref/
    ext/

Files in the feat/ and ref/ subdirectories have the format (feat|ref)/<utt_id>.pt, where utt_id is the corresponding utterance ID from the TIMIT corpus. feat stores feature sequences for utterances; ref stores reference sequences. A feature sequence is a float tensor of shape (T, F), where T is the number of audio feature frames and F is the number of features per frame. By default, F=41 for the standard 40-dimensional triangular Mel-scaled filter bank plus one energy coefficient.

A reference sequence is a long tensor of shape (R, 3), where R is the number of phones/tokens in the reference sequence and the triple (the 3 dimension) stores the token id, the start frame of the token (inclusive) between 0 and T-1, and the end frame of the token (exclusive) between 1 and T. The token id can be mapped back and forth to the phone string with the files ext/token2id.txt and ext/id2token.txt. TIMIT includes phone alignment information by default, though most end-to-end systems won't use it. To extract the token sequence only, merely slice the reference sequence tensor: ref[..., 0].

While these folders are set up for use with pydrobert-pytorch's SpectTrainingDataLoader and SpectEvaluationDataLoader, this is by no means a requirement. Feel free to build your own loaders.
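As a concrete illustration, here is a minimal sketch of inspecting one utterance with plain PyTorch. The utterance ID is a placeholder, and the assumption that ext/id2token.txt holds whitespace-separated "<id> <token>" pairs should be checked against your generated files.

import torch

utt_id = "some_utt_id"  # hypothetical; substitute any ID present in data/train/feat/

feats = torch.load(f"data/train/feat/{utt_id}.pt")  # float tensor of shape (T, F)
refs = torch.load(f"data/train/ref/{utt_id}.pt")    # long tensor of shape (R, 3)

# build id -> phone map (assumed format: "<id> <token>" per line)
id2token = {}
with open("data/ext/id2token.txt") as f:
    for line in f:
        idx, token = line.split()
        id2token[int(idx)] = token

token_ids = refs[..., 0]  # drop the start/end frame columns
print(feats.shape, refs.shape)
print(" ".join(id2token[int(i)] for i in token_ids))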

Textual reference transcripts of each partition can be found in ext/(train|dev|test).ref.(ctm|stm|trn). The different file types correspond to NIST's CTM, STM, and TRN formats. The official online documentation for these formats can be accessed in raw HTML here, but a few illustrative examples can be seen in the pydrobert-pytorch documentation. The TRN format is simplest: it's merely a space-delimited list of tokens followed by the utterance id in parentheses.
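Since the TRN format is so simple, a minimal parser is only a handful of lines. The sketch below reads a TRN file into a dictionary keyed by utterance ID, assuming only the layout described above (it is illustrative, not part of the repository):

# parse a TRN file into {utt_id: [token, ...]}
# assumes lines of the form "tok1 tok2 ... tokN (utt_id)"
def read_trn(path):
    transcripts = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            tokens, _, utt = line.rpartition("(")
            transcripts[utt.rstrip(")").strip()] = tokens.split()
    return transcripts

refs = read_trn("data/ext/test.ref.trn")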

The ext/*.ref.* files have already been mapped down to the 39-phone evaluation set, whereas the tensor reference sequences use the 48-phone set by default. Since evaluation on the 39-phone set is standard, do not compute error rates directly against the reference directories (e.g. through pydrobert-pytorch's compute-torch-token-data-dir-error-rates). First, convert your hypothesis token sequences to a TRN file (e.g. through pydrobert-pytorch's torch-token-data-dir-to-trn). Assuming your output is in test.hyp.trn, you can map it to the 39-phone set using the command

python pytorch-database-prep/timit.py data filter test.hyp.trn test.hyp.filt.trn

Then you can either use SCLITE or our script (whose backend is jiwer) to compute phone error rates (PERs):

python pytorch-database-prep/error-rates-from-trn.py data/ext/test.ref.trn test.hyp.filt.trn 
sclite -r data/ext/test.ref.trn -h test.hyp.filt.trn -i swb
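If you would rather compute the number yourself, the following sketch does so with jiwer directly, reusing the read_trn helper from the sketch above and treating each phone as a "word". It assumes every reference utterance has a matching hypothesis; it is an illustration, not the official scoring path.

import jiwer

refs = read_trn("data/ext/test.ref.trn")
hyps = read_trn("test.hyp.filt.trn")

# join phones with spaces so jiwer's word error rate acts as a phone error rate
ref_strs = [" ".join(refs[utt]) for utt in sorted(refs)]
hyp_strs = [" ".join(hyps[utt]) for utt in sorted(refs)]
print("PER: {:.2%}".format(jiwer.wer(ref_strs, hyp_strs)))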

Other setups

There is some flexibility to the above setup, albeit less so than for the WSJ corpus. Take a look at python pytorch-database-prep/timit.py --help for a full list.

Setup common to all configurations is performed in the preamble step, so there aren't any options there.

The init_phn stage allows for two options. The first is to build a Modified Kneser-Ney language model of arbitrary order over phone sequences. This doesn't change anything except that an arpa.lm.gz file is added to the ext/ subdirectory after the torch_dir command. The second is to specify the phone vocabulary size to be used for training (61, 48, or 39). More information about the distinction can be found in the description below. Note that the language model (if specified) will be trained on that vocabulary size. Here's an example command:

python pytorch-database-prep/timit.py data init_phn --vocab-size 39 --lm  # trigram phone LM over the 39-phone vocabulary

It is technically possible to do word recognition on TIMIT. At some later point I may include an init_word command that can be used in place of init_phn.

torch_dir builds the feat/ and ref/ folders from a given init_phn configuration. If there's only one such configuration, it's chosen automatically. Otherwise, you have to specify e.g. phn<vocab_size> as your second positional argument (assuming you've already called init_phn for that vocabulary). It also stores the requisite maps and info files in the ext/ folder. You can change your feature representation by specifying a different configuration file with the --computer-json flag. The default can be found in this project's conf/fbank_41.json file. Others can be found in the conf/ folder and follow pydrobert-speech JSON formatting. You can instead set the --raw flag to store the raw audio as tensors; in this case, a frame is a sample and there is one coefficient per frame. Finally, you can change the development and test sets according to the detailed discussion below. Here are some example calls.

# 1. Default vocab size (assuming init_phn called once) with raw audio, stored in data/{train,dev,test,ext}
# 2. 48-phone vocab size with 41-dimensional frame vectors using Gabor filters, stored in data/{train,dev,test,ext}
# 3. 39-phone vocab size with 41-dimensional frame vectors using Gammatone filters and short integration, stored in data/foo/{train,dev,test,ext}
python pytorch-database-prep/timit.py data torch_dir --raw  # 1.
python pytorch-database-prep/timit.py data torch_dir phn48 --computer-json conf/gbank_41.json  # 2.
python pytorch-database-prep/timit.py data torch_dir phn39 foo --computer-json conf/sitonebank_41.json  # 3.

Finally, while error rates computed with text-based alignments are the standard by which ASR is evaluated on TIMIT, the additional alignment info provided with the corpus (and stored in data/ext/(train,dev,test).ref.ctm) can be paired with hypotheses that also include alignment information to perform time-mediated alignments if so desired. Insofar as the hypothesis alignments are accurate, time-mediated alignments will better represent the mistakes made by the recognizer (though likely resulting in higher error rates). The filter command can also be used to apply the 39-phone map to either CTM or STM files, as illustrated below:

# map to 39 phones, then apply time-mediated alignments
python pytorch-database-prep/timit.py data filter test.hyp.ctm test.hyp.filt.ctm --both-ctm
sclite -r data/ext/test.ref.ctm ctm -h test.hyp.filt.ctm ctm -T
# drop STM file down to trn at the same time as mapping to 39 phones, then apply text-based alignments (the standard)
python pytorch-database-prep/timit.py data filter test.hyp.stm test.hyp.filt.trn --in-stm
# as before...
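If your recognizer produces frame-level alignments, you can also write them out as a CTM file yourself for time-mediated scoring. The sketch below assumes hypotheses stored as (token, start frame, end frame) triples like the reference tensors, a 10 ms frame shift, and the usual CTM field order "<utt> <channel> <start seconds> <duration seconds> <token>"; adjust all of these assumptions to your own setup.

FRAME_SHIFT = 0.01  # seconds per frame; match your feature configuration

def write_ctm(hyps, path, channel="A"):
    # hyps: {utt_id: [(token, start_frame, end_frame), ...]}
    with open(path, "w") as f:
        for utt_id in sorted(hyps):
            for token, start, end in hyps[utt_id]:
                begin = start * FRAME_SHIFT
                dur = (end - start) * FRAME_SHIFT
                f.write(f"{utt_id} {channel} {begin:.2f} {dur:.2f} {token}\n")

write_ctm({"some_utt_id": [("sh", 0, 12), ("iy", 12, 30)]}, "test.hyp.ctm")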

Detailed descriptions

Phone mappings

Almost immediately after the release of TIMIT, Lee and Hon proposed reducing the 61-phone set with which TIMIT was transcribed to 48 phones for training, then down to 39 for testing. The glottal stop q is removed entirely. According to the authors, the change with the greatest impact on performance is collapsing 9 silence-like phones into one, sil.

Unfortunately, the authors provide little reasoning as to why they map 61 down to 48 down to 39, except for performance. Most changes appear reasonable enough. For example, the specific type of closure that occurs before a stop consonant, while often distinct acoustically, can easily be inferred from the subsequent plosion (e.g. gcl g can be mapped to cl g with no loss of information). A bunch of other reductions make sense when phones would be mapped to the same phoneme. I do take strong issue with the collapse of the voiced/unvoiced fricatives zh (e.g. a_z_ure) and sh (e.g. _sh_e) together as they are very distinct, but whatever.

While reduction to the 39 phone set is mandatory before evaluation, there is little consensus on what's best before then. You could use the 61 phone set, the 60 phone set (w/o q), the 48 phone set, the 39 phone set, or some random other set. The default settings drop to the 48 phone set for training in order to match the standard Kaldi recipe.
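If you ever need to apply such a reduction outside of the filter command (which handles it for you), the mechanics are just a per-phone lookup. The map in the sketch below is a deliberately tiny fragment for illustration, not the full Lee and Hon table; unlisted phones map to themselves.

# collapse a phone sequence with a (partial, illustrative) reduction map
REDUCE = {
    "zh": "sh",   # the voiced/unvoiced fricative collapse lamented above
    "cl": "sil",  # closures and other silence-like phones collapse to sil
    "vcl": "sil",
    "epi": "sil",
}

def reduce_phones(phones, mapping=REDUCE):
    return [mapping.get(p, p) for p in phones]

print(reduce_phones(["sh", "iy", "vcl", "zh"]))  # ['sh', 'iy', 'sil', 'sh']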

SI, SA, and SX utterances

The full corpus, clocking in at about 5 hours, consists of 1260 SA utterances, 3150 SX utterances, and 1890 SI utterances. The SA utterances are read aloud by every speaker in the corpus from two "dialect sentences" engineered to elicit shibboleths (idiosyncrasies that identify the dialect). The SX utterances are read aloud from a pool of 450 "phonetically compact" sentences (5 per speaker, with an overlap of 7 speakers per sentence) chosen to maximize coverage of pairs of phones. The SI utterances are read aloud from unique, "phonetically diverse" prompts, each uttered by only one speaker and each speaker given 3 to utter. These prompts maximize "allophonic variation," which can be read about in the tech report, but I can't be bothered.

If the full corpus is about 5 hours, why do we only account for about 3 and a half in the summary above? The answer is that all SA utterances are discarded. This is because a) the SA utterances feature speakers from the test set (though this is easily remedied by merely excluding those of the test speakers), and b) the SA utterances are based on only two prompts, which can bias the trained model toward those specific phone sequences. The latter point is merely concern for our model's well-being (aww), so as long as you protect against the former, there isn't an obligation (besides legacy and comparability) to exclude all SA utterances, which could give you another hour of training data.

Silent but deadly

While the following may appear to be yet another petty concern of mine, this one, at least, isn't. Despite its insignificant appearance, this choice can lead to big (~1% absolute) changes in state-of-the-art performance.

In word-level speech recognition, it is common to remove silence tokens from both reference and hypothesis transcripts before calculating error rates. This makes a lot of sense when calculating word error rates: silence is not a word, and the point at which an inter- or intra-word pause becomes a silence is ill-defined.

However, in TIMIT error rate evaluation, all silences are treated as tokens (even the closures during stop-consonants!). While I think you can make an argument for why this shouldn't have been the decision in the first place, it was. Now we have to live with it.
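To see how much this can matter mechanically, the sketch below scores the same toy hypothesis twice, once keeping silence tokens and once stripping them. It leans on jiwer again and assumes sil is the lone silence label in the reduced vocabulary; it's a toy, not a scoring recipe.

import jiwer

def per(refs, hyps, drop_sil=False):
    # refs/hyps: lists of phone lists, already reduced to the 39-phone set
    def prep(seqs):
        if drop_sil:
            seqs = [[p for p in seq if p != "sil"] for seq in seqs]
        return [" ".join(seq) for seq in seqs]
    return jiwer.wer(prep(refs), prep(hyps))

refs = [["sil", "sh", "iy", "sil"]]
hyps = [["sh", "iy", "sil"]]           # the recognizer missed the leading silence
print(per(refs, hyps))                 # 0.25: the deleted sil counts as an error
print(per(refs, hyps, drop_sil=True))  # 0.0: the mistake silently disappears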

Kaldi respected this until 2014, when this commit briefly removed silences from scoring. In a later commit, Kaldi fixed its error and started including silences in its PER calculations again. If you take a look at the changes in the Kaldi RESULTS file, you'll find that the PER of all models increased by a minimum of ~0.5% and a maximum of 2.0%. At the state-of-the-art, this change is non-negligible.

While the Kaldi developers did the right thing in changing the evaluation script back, you'll still see the silence-stripping scoring script kicking around in some of the state-of-the-art models on TIMIT. If there is source code for a state-of-the-art model, you should double-check this. If you choose to report scores with silences removed, please say so in your paper. Otherwise, direct comparison of results becomes very difficult.

Partitions and the mystery development set

Because Kaldi is such a popular repository that many rely on to at least set up their databases, the train/dev/test split they've provided has become the de facto standard. However, the official documentation only specifies a training partition, a "complete" test partition, and a "core" test partition.

The core test set features all utterances of two male and one female speaker from each of the eight dialect regions; the complete test set includes the core test speakers as well as any speakers that read aloud the same prompts as the core test set. The training set consists of the remaining speakers, so, regardless of whether you use the core or complete test set, the training data will not include the same prompts as the test data.

A review from 2011 by Lopes and Perdigao reveals that there was still disagreement as to whether PERs should be reported for the complete or core test set. Because the Kaldi dev set is a subset of the complete-minus-core set, it must not have been standard until sometime later. The first Kaldi commit of the TIMIT recipe (including the dev set) was in 2012.

Morris and Fossler-Lussier refer to the MIT development set of 50 speakers back in 2008. Following Jim Glass back in time, there is a paper from 1996 by Glass et al. referring to a development set of the same size. After discussion with Abdel-Rahman Mohamed, who confirmed it was listed in the dissertation of one of Glass' students, I was able to find it in that of Andrew Halberstadt. Note that its acceptance date, 1998, was well before the set was canonicalized.

Reading the relevant section of the thesis gave no clues as to why this development set was chosen over some other 50-speaker subset or even the entire "complete" test set minus the "core." The thesis, and other sources that explicitly mention the development set, state the choice matter-of-factly.

I can easily come up with a reason for not choosing this dev set. Building a dev set out of the speakers in the test partition does allow for a slight advantage on the remaining test cases. Speakers were added to the complete test partition if any of their prompts matched those in the core partition, which means any speaker in this dev set is going to have at least one prompt in common with a speaker in the core test set. To me, this negates (or even reverses) any benefit to extracting the complete test set in the first place. Once you've made the choice to violate this separation, you might as well take the entire "complete-minus-core" set, which is three times as large as the 50-speaker dev set. Given that the precedent is for the 50-speaker set, however, using the "complete-minus-core" set is a non-default option. You should also state in any papers that that's what you're doing.

Calculating edit distances

This one is definitely petty, so please skip it. If you won't, please first read the section called Implementation Choices. In it, I mentioned the canonical edit distance weights for NIST being 3 for insertions and deletions and 4 for substitutions, which might slightly increase error rates over a uniform penalty. While you can make a strong argument that this is canon for WSJ and later tasks, TIMIT preceded WSJ. While the embedded SCORE software in the WSJ corpus does allow for TIMIT scoring, I think this is a retroactive convention. I don't think that anyone had thought about exactly how to evaluate it yet. You're probably safe to use whichever method you'd prefer. Funnily enough, the Kaldi scoring code for TIMIT relies on NIST's Hubscr, which relies on SCLITE, which uses the 3-4 penalties. This is in contrast to Kaldi's WSJ recipe, which definitely should have used the 3-4 penalties, but didn't.
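For the curious, here is a minimal sketch of a Levenshtein cost with configurable weights, defaulting to the NIST-style 3 (insertion), 3 (deletion), 4 (substitution) penalties. The weights only matter to error rates when they change which alignment is cheapest, which in turn changes the counted insertions, deletions, and substitutions.

# minimal weighted Levenshtein cost between a reference and hypothesis sequence
def edit_cost(ref, hyp, ins=3, dele=3, sub=4):
    # dp[i][j] = minimal cost of aligning ref[:i] with hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = i * dele
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j * ins
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j - 1] + sub,
                               dp[i - 1][j] + dele,
                               dp[i][j - 1] + ins)
    return dp[-1][-1]

print(edit_cost("sh iy".split(), "s iy".split()))           # 4: one substitution
print(edit_cost("sh iy".split(), "s iy".split(), 1, 1, 1))  # 1 under uniform weights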
