The Wall Street Journal Speech Corpus

This page serves to outline the repository's treatment of the WSJ corpus, as well as any information one should know when working with the corpus.

Summary

The Wall Street Journal CSR Corpus contains both text-only and dictated portions of the Wall Street Journal newspaper. The corpus contains about 80 hours of recorded speech in the standard training condition. Evaluation has traditionally been performed at the word level, though character-level evaluations do exist.

Though the corpus has additional training and test data, the "standard" paradigm is to pick one of two training conditions (SI-84 or SI-284), one of two test suites ('92 or '93), and one vocabulary size (5k or 64k). The latter option of each is considered the "more standard" one, though Kaldi likes to report on the '92 set.

Acquisition

The corpus spans multiple LDC entries:

  • WSJ0: LDC93S6A or LDC93S6B. The earlier release contains Speaker-Independent entries for 84 speakers (the so-called SI-84 data) and the development and evaluation data from November 1992. The corpus description paper is based on this.
  • WSJ1: LDC94S13A or LDC94S13B. The later release. Adds 200 new non-journal Speaker-Independent entries. When added to SI-84, the total dataset is called SI-284.

The latter options (*B) only feature Sennheiser data, but that's all that's needed.

Quickstart

The primary interface for WSJ is through wsj.py. You can get a standard word-level setup with a 64k vocabulary, the si284 training set, the dev93 development set, and the eval92 test set by calling

WSJ0=/path/to/wsj0
WSJ1=/path/to/wsj1
python pytorch-database-prep/wsj.py data/ preamble $WSJ0 $WSJ1
python pytorch-database-prep/wsj.py data/ init_word $WSJ0 $WSJ1
python pytorch-database-prep/wsj.py data/ torch_dir

This will leave you with the folder data/ populated as follows:

data/
    local/
    si284/
        feat/
        ref/
    dev93/
        feat/
        ref/
    eval92/
        feat/
        ref/
    ext/
        si284.ref.trn
        dev93.ref.trn
        eval92.ref.trn
        token2id.txt
        id2token.txt
        ...

Files in feat/ and ref/ have the format (feat|ref)/<utt_id>.pt, where utt_id is a corresponding utterance ID from the WSJ corpus. feat stores feature sequences for utterances; ref stores reference sequences. A feature sequence has shape (T, F), where T is the number of audio frames in the utterance and F is the number of coefficients per frame. A reference sequence is of shape (R,), where R is the number of tokens in the sequence. token2id.txt and id2token.txt map between token ids (numerical) and token types (strings). While these folders are set up for use with pydrobert-pytorch's SpectDataLoader, this is by no means a requirement. Feel free to build your own loaders.
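For example, here's a minimal sketch of loading one utterance's features and reference tokens with PyTorch and mapping the tokens back to strings. The utterance ID is made up for illustration, and I'm assuming each line of id2token.txt is "<id> <token>" - check the file to confirm the order:

import torch

# hypothetical utterance ID - substitute one that actually exists in data/eval92/feat/
utt_id = "440c0401"

feats = torch.load(f"data/eval92/feat/{utt_id}.pt")  # shape (T, F)
ref = torch.load(f"data/eval92/ref/{utt_id}.pt")     # shape (R,)

# assumed format: one "<id> <token>" pair per line of id2token.txt
id2token = dict()
with open("data/ext/id2token.txt") as fp:
    for line in fp:
        id_, token = line.strip().split()
        id2token[int(id_)] = token

print(feats.shape)                              # (T, F)
print(" ".join(id2token[int(i)] for i in ref))  # the reference transcript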

si284.ref.trn has <NOISE> symbols, whereas dev93.ref.trn and eval92.ref.trn don't.

Once you've generated a hypothesis transcription file eval92.hyp.trn, you should filter it in order to remove <NOISE> symbols. You can do so with:

python pytorch-database-prep/wsj.py data/ filter_hyp eval92.hyp.trn eval92.hyp.filt.trn

And then feed eval92.hyp.filt.trn into sclite:

sclite -r data/ext/eval92.ref.trn -h eval92.hyp.filt.trn -i wsj

Other setups

There is a lot more flexibility than just the setup above. Most of the information can be found by calling python pytorch-database-prep/wsj.py --help and working from there. We cover some of the big pieces here, such as different training/evaluation sets, word vs. character vs. subword modelling, filter banks, and language models for shallow fusion.

We separate the setup into three commands in order to avoid repeating computations across configurations.

All setup begins with the preamble command, which does a lot of preparation in a similar fashion to Kaldi but with a couple minor changes and a bunch of unneeded things removed. This command stores a bunch of stuff in data/local/data. It need only be run once - files written to data/local/data are agnostic to future configuration.

After the preamble, you can call one of init_word, init_char, or init_subword to set up the recipe for word, character, or subword recognition, respectively. At this stage, you must also decide whether you'll be running on the 64k-vocabulary test sets (standard) or the 5k-vocabulary ones. You can also use the flag --lm to train an n-gram language model near-identical to that of kenlm in order to rescore hypotheses. Here are some example configurations:

# 1. 5k closed word-level vocabulary with trigram Modified Kneser-Ney language model. Stored in data/local/word5k
# 2. 32 character vocabulary on 64k test set with 5-gram LM. Stored in data/local/char64k
# 3. 108 subword vocabulary on 5k test set using byte pair encoding with 4-gram LM. Stored in data/local/bpe108_5k
python pytorch-database-prep/wsj.py data/ init_word $WSJ0 $WSJ1 --vocab-size 5 --lm  # 1. 
python pytorch-database-prep/wsj.py data/ init_char $WSJ0 $WSJ1 --lm  # 2.
python pytorch-database-prep/wsj.py data/ init_subword $WSJ0 $WSJ1 --lm  # 3.

Any generated language model will be stored under ext/lm.arpa.gz after the torch_dir command.
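The ARPA format is plain text under gzip, so it's easy to peek at. Here's a minimal sketch (not part of the recipe) that pulls the unigram log-probabilities out of a generated LM, assuming --lm was passed during init_*:

import gzip

def read_unigram_logprobs(path):
    """Collect unigram log10 probabilities from a gzipped ARPA file."""
    probs = {}
    in_unigrams = False
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line == "\\1-grams:":
                in_unigrams = True
                continue
            if in_unigrams:
                # stop at the next section header or blank separator
                if not line or line.startswith("\\"):
                    break
                fields = line.split()
                # each entry is: <log10 prob> <token> [<log10 backoff>]
                probs[fields[1]] = float(fields[0])
    return probs

unigrams = read_unigram_logprobs("data/ext/lm.arpa.gz")
print(len(unigrams))

For actual shallow fusion you'd want to load the full model with a proper n-gram toolkit or your own code rather than parse it by hand.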

Subword vocabularies are determined using SentencePiece. The default algorithm performs Byte Pair Encoding, though unigram weighting can be performed instead with the --algorithm flag. Subword units are by no means common in WSJ setups; the default subword vocabulary size (108) is based on the experimentation of Xu et al.
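If you'd like to see what such a vocabulary looks like outside the recipe, here's a minimal sketch of training a comparable model directly with the sentencepiece Python package. The training text path is a placeholder - the recipe handles this for you:

import sentencepiece as spm

# placeholder path: in the recipe, the training text comes from the WSJ LM data
spm.SentencePieceTrainer.train(
    input="train_text.txt",   # one normalized transcript per line
    model_prefix="wsj_bpe",   # writes wsj_bpe.model and wsj_bpe.vocab
    vocab_size=108,           # the recipe's default subword vocabulary size
    model_type="bpe",         # or "unigram" for unigram weighting
)

sp = spm.SentencePieceProcessor(model_file="wsj_bpe.model")
print(sp.encode("THE QUICK BROWN FOX", out_type=str))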

The torch_dir command sets up the feat and ref directories for a given configuration. If init_* has only been called once, as in the quickstart, the command can figure out the configuration on its own. Otherwise, you have to point it to the configuration you're using under data/local. At this point, you can also decide what type of filter bank to use and whether to use the '92 evaluation set, the '93 evaluation set, or both for testing. If you're using more than one configuration, we suggest providing an additional argument that tells the command which subdirectory of data/ to put the labels/features in. Here are some examples:

# 1. Same as quickstart setup, but also create eval93 folder in data/. 40 Mel-scale triangular filters + 1 energy
# 2. Using 64k vocabulary character setup, create *only* eval93 test setup. Store in data/char/{si284,dev93,eval93,ext}.
#    40 Gabor filters + 1 energy.
# 3. Using 5k vocabulary BPE subword setup, restrict training data to SI-84 and store in data/bpe
python pytorch-database-prep/wsj.py data/ torch_dir word64k --both-evals  # 1.
python pytorch-database-prep/wsj.py data/ torch_dir char64k char \
    --eval93 --computer-json pytorch-database-prep/conf/feats/gbank_41.json  # 2.
python pytorch-database-prep/wsj.py data/ torch_dir bpe108_5k bpe --si84  # 3.

Detailed descriptions

The corpus

Documentation for the corpus is labyrinthine because the corpus itself has too many branching goals. I admit that I might not have everything down perfectly - corrections are appreciated.

The WSJ0 corpus was considered a "pilot" corpus, intended to be extended into a much greater thing. First, a large portion of Wall Street Journal news articles was cleaned up to produce prompts. 10% of the text was held out for testing, leaving the remaining 90% available for training. All of that training data can be used for language modelling, but only a portion was then dictated. The recorded portion is split into three tranches based on intent: Speaker-Independent (SI) training, Speaker-Dependent (SD) training, and Longitudinal Speaker-Dependent (LSD) training. The tranches are roughly equal in size, but have different balances of speakers and utterances: the SI set has 84 speakers with about 100 utterances per speaker, the SD set has 12 speakers with about 600 utterances per speaker, and the LSD set has 3 speakers with about 2400 utterances per speaker. The SI and SD tranches are non-overlapping, but the LSD speakers are a strict subset of the SD ones.

The Nov '92 eval set is a suite of tests, though nobody cares about most of them. The primary division is between Speaker-Independent testing and Speaker-Dependent testing. The former tests on 8 brand-new speakers (with 8 other brand-new speakers in the dev set); the latter tests on the same 12 speakers from the SD partition. Within each split, utterances are broken down into "verbalized punctuation" and "non-verbalized punctuation." Each division is further broken down into a closed vocabulary of ~5,000 words (no out-of-vocabulary words) and an open vocabulary of ~20,000 words. Calling the open condition "20k words" is a tad misleading, however: the full vocabulary is about 64k words, of which the 20k are just the most frequent (about 97.8% coverage), and the realized vocabulary within the test set is about 13k words (not guaranteed to be fully covered by the 20k). It appears the 20k is a mere guideline, with the corpus description paper noting

Since this data set was produced in a vocabulary-insensitive manner, it can be used without bias for open and closed recognition vocabulary testing at any vocabulary size up to 64K words.

The Nov '93 eval set seems to corroborate this perspective, given it no longer refers to the test sets as the "20K" condition but the "64K" condition, with a significance test comparing against a pre-trained language model with a 20K vocabulary size. Regardless, it appears nowadays that we use as many words of the 64K as possible, with most Kaldi recipes relying on a "big dictionary" model extended to cover training OOVs. For end-to-end models, it makes sense to stick to the 20K vocabulary to avoid a prohibitively large softmax, unless a hierarchical softmax or the like is used.

Returning to the '92 eval set, the instructions strongly advise testing: a) the spontaneous speech partition (neither the 5k nor the 20k condition - I haven't talked about it); b) the alternative microphone condition on the 5k closed read data; and c), to quote

OPEN VOCABULARIES -- Test on 8 SI speakers, read speech, 40 NVP utts per speaker, Sennheiser mic, either 5K or 20K vocabulary.

First off, forget about a) and b) -- nobody does them. The use of the word "OPEN" seems to suggest that we're using the 64K test data in c) regardless of whether we use the 5K or 20K language model. Fortunately, this interpretation fell by the wayside very quickly as HTK divided c) into two more sensible tasks: run one experiment on the 5k test set using the 5K closed vocabulary, and one on the 64k test set using the 20k open vocabulary. The '92 eval set doesn't have a canonical development set.

In addition, none of the "strongly recommended" tests mention "verbalized punctuation" or the speaker-dependent test set. In effect, only a small portion of the WSJ test data is actually ever used: 330 utterances across 8 speakers of read speech in the non-verbalized, open vocabulary SI condition, and 333 utterances across 8 other speakers of read speech in the non-verbalized, closed vocabulary SI condition.

Though the task guidelines allow any one of the SI, SD, or LSD tranches to be used as training data, the mismatch between training and testing conditions likely made the SD and LSD tranches less attractive. By now, I doubt many know that the SD or LSD data exist. It's also worth mentioning that even though testing can be performed solely on the non-verbalized punctuation tests, the verbalized punctuation training data is usually used in addition to the non-verbalized.

We evaluate a test set using the "standard" Word Error Rate (WER) score.
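For reference, that's the number of substitutions (S), deletions (D), and insertions (I) in the chosen alignment, divided by the number of words in the reference transcription (N):

    WER = (S + D + I) / N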

The WSJ1 corpus extends the WSJ0 corpus and is clearly more fleshed-out in terms of tasks. The core partition again consists of SI data, this time with 200 speakers saying approximately 150 utterances each. Note that this is about 50% more data per speaker than the SI partition in WSJ0. Another half of the data is speaker-dependent, with 24 speakers speaking about 1200 utterances each. There is also a smaller partition of 20 journalists speaking 200 read and 200 spontaneous utterances. This time WSJ1 calls the SI partition the "standard" training set, relegating the others to oblivion (even if the '93 evaluation does not seem to restrict the choice of training data).

The Nov '93 eval set is much more fleshed-out than its predecessor. This time, the authors single out two tests as most important: evaluation against a 64k-word vocabulary data set of 10 speakers dictating about 20 utterances each, and evaluation against a closed 5k-word vocabulary with the same number of speakers, the same number of utterances, and the same recording style. These are called the "Hub 1" and "Hub 2" tests. Though the setup is the same as in '92, the 64k set no longer mentions a "20k vocabulary" except in contrastive tests and spoke tests. Spoke tests are auxiliary to the hubs and test things like spontaneous speech, the alternate microphone, and non-native speech. All very cool stuff. These days they are ignored. The contrastive tests are ignored as well. The latter made more sense at the time, when all of this was a competition. Contrastive tests were attached to each hub and spoke to try and tease out statistical significance from a baseline (such as a recognizer trained on a 20k open vocabulary). Of course, a lot of time has passed since then, making significance against their baselines a foregone conclusion.

There is a corresponding development set in WSJ1 that resembles the hubs and spokes of the '93 eval set. The partitions for the hubs are significantly larger, however. For both Hub 1 and Hub 2, 10 speakers each read about 50 utterances' worth of text.

Implementation choices

The core recipe was modeled initially after Kaldi's s5 recipe, specifically after its wsj_data_prep.sh script. We largely follow along, with some exceptions. The noteworthy exceptions are below and in no particular order. You probably shouldn't read this section - it acts primarily as a reference for people like myself who spend hours dealing with small decisions which may or may not impact anything. You've been warned.

Though Kaldi claims that the '93 Hub 1 condition is the "most standard" in the prep script, they only ever report error rates on the '92 20k open vocab condition and the WSJ1 64k development condition. Why? Who knows. This has become the de facto pair for reporting on the WSJ. Thus we use the '92 eval set as the default test set.

Eval '93 has a helpful filter tool that handles some word- and utterance-level equivalences. Scoring allows you to convert from one to the other in your hypothesis transcriptions without penalty. For example, at the word-level, you're allowed to map "BUYOUT" to "BUY OUT" (reference transcriptions always use the latter) because they're lexical equivalents. Verbalized punctuation gets mapped to what was spoken (e.g. "COMMA," to "COMMA"). Etc. Utterance-level equivalences appear to be someone's idea of a grace for ambiguous audio. Regardless, they can only improve the error rates of the hypothesis transcriptions, so we apply them in our scoring function.

A brief word on the Character Error Rate (CER) or, equivalently, the Letter Error Rate (LER). This is a non-standard evaluation for WSJ and there doesn't seem to be an agreed-upon way of doing it. From what I can tell, it's as simple as treating every character (including spaces) as a token and calculating a WER. Since there's no standard, we'll default to the popular vote and use uniform penalties of 1 in the edit distance calculation (i.e. a Levenshtein distance).
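In other words, the "tokenization" for CER is just the following (a plain-Python sketch, with the space between words kept as its own token):

# split a word-level transcript into character tokens, spaces included
def to_char_tokens(transcript):
    return list(transcript)

print(to_char_tokens("THE CAT SAT"))
# ['T', 'H', 'E', ' ', 'C', 'A', 'T', ' ', 'S', 'A', 'T']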

On the topic of edit distances: while the WER definition is agreed upon, how to determine the alignment - namely, what should count as insertions, deletions, and substitutions - is not terribly clear. This is a good article on the topic, as well as on the ARPA CSR initiative (which includes the WSJ). There was a lot of bandying back and forth about how to go about aligning text, including time-based and pronunciation-based initiatives. These approaches are costly (e.g. providing word alignments in the corpus, or phonemic dictionaries for the vocabulary - though the latter can be mostly automated with grapheme-to-phoneme systems) or require the ASR system to produce unnecessary output (such as alignments or confidence scores), so the edit distance dynamic programming algorithm has remained the go-to method for computing error rates. However, there are uncountably many ways to define an edit distance depending on how the insertion, deletion, and substitution penalties are set (see the Wiki article for more info). Suffice it to say that NIST (and the SCORE software they bundled with the WSJ corpus) chose a penalty of 3 for insertions and deletions and 4 for substitutions. 3 and 4 shall be the numbers; the numbers shall be 3 and 4.

Perhaps because WER is conceptualized as a normalized Levenshtein distance, and the Levenshtein distance is identical to an edit distance with uniform penalties, the vast majority of online implementations of WER - including Kaldi, CMU Sphinx, and jiwer - use uniform penalties of 1. NIST software and HTK use the 3-4 penalties.

What effect do the unequal weights have on the word error rate? I believe one can show that the unequal weights will never decrease the WER and may increase it: they open the door to an alignment with more operations overall but a lower total cost, and WER only cares about the number of operations. Using uniform weights may therefore unfairly decrease the WER. That said, I doubt the effect size is very large. Also, if most people report WER based on Levenshtein alignments, one could argue that that is the standard, regardless of the initial intent. My take is to use the NIST software, at the risk of a very slightly higher WER, in order to avoid others calling you a cheater.
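For concreteness, here's a minimal sketch (not the repository's scoring code) of the alignment under both penalty schemes. It returns the number of errors along the minimum-cost alignment, which is what gets divided by the reference length:

def align_error_count(ref, hyp, ins=3, dele=3, sub=4):
    """Number of edit operations in the minimum-cost alignment of hyp to ref.

    Default penalties follow NIST SCORE (ins=del=3, sub=4); pass 1, 1, 1 for a
    plain Levenshtein alignment. Ties in cost are broken toward fewer operations.
    """
    R, H = len(ref), len(hyp)
    # dp[i][j] = (cost, ops) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0)] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        dp[i][0] = (i * dele, i)
    for j in range(1, H + 1):
        dp[0][j] = (j * ins, j)
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                diag = (dp[i - 1][j - 1][0], dp[i - 1][j - 1][1])            # match
            else:
                diag = (dp[i - 1][j - 1][0] + sub, dp[i - 1][j - 1][1] + 1)  # substitution
            dele_ = (dp[i - 1][j][0] + dele, dp[i - 1][j][1] + 1)            # deletion
            ins_ = (dp[i][j - 1][0] + ins, dp[i][j - 1][1] + 1)              # insertion
            dp[i][j] = min(diag, dele_, ins_)
    return dp[R][H][1]

ref = "THE CAT SAT ON THE MAT".split()
hyp = "THE CAT SAT THE MAT".split()
print(align_error_count(ref, hyp) / len(ref))           # NIST-style penalties
print(align_error_count(ref, hyp, 1, 1, 1) / len(ref))  # uniform (Levenshtein) penalties

On most utterance pairs the two schemes agree; differences only show up when a cheaper-but-longer alignment exists under the 3-4 weights.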

An extremely minor point: for all the test sets, we derive the reference transcriptions from the filtered transcriptions rather than the raw ones. Kaldi uses the raw, filtering out things like noise and emphasis at test time. This didn't produce any meaningful distinctions except for these three (left is Kaldi, right is us):

< 4obc020e -SEOUL YUK STATION HE REPEATS PRACTICING HIS NEW ENGLISH WORD
---
> 4obc020e SEOUL YUK STATION HE REPEATS PRACTICING HIS NEW ENGLISH WORD
< 4odc0207 ALLEGHENY INTERNATIONAL AGREED TO GO PRIVATE IN A FIVE HUNDRED MILLION DOLLAR BUYOUT ARRANGED BY FIRST BOSTON
---
> 4odc0207 ALLEGHENY INTERNATIONAL AGREED TO GO PRIVATE IN A FIVE HUNDRED MILLION DOLLAR BUY BACK ARRANGED BY FIRST BOSTON
< 4oic020a FED CHAIRMAN GR- ALAN GREENSPAN IS SCHEDULED TO GIVE HIS MID YEAR ECONOMIC REPORT AT A HOUSE BANKING COMMITTEE HEARING ON FEBRUARY TWENTY THIRD
---
> 4oic020a FED CHAIRMAN ALAN GREENSPAN IS SCHEDULED TO GIVE HIS MID YEAR ECONOMIC REPORT AT A HOUSE BANKING COMMITTEE HEARING ON FEBRUARY TWENTY THIRD

Interestingly, after listening to the tape, it turns out Kaldi's 4odc0207 transcription is the correct one. There was a kerfuffle involving lexical equivalences that might've been a cause for the mislabeling. I decided that keeping a mislabeling because it's standard would be a bit too perverse, so I swapped it manually with "BUY OUT" ("BUYOUT" always gets mapped to "BUY OUT" when using lexical equivalences).
