Tips and tricks - sdrobert/pytorch-database-prep GitHub Wiki

Windows Defender Real-Time Protection

If you use Windows Defender, "Real-time protection" is probably enabled on your device. As the name suggests, it analyzes activity in real time, including every file saved to disk. The data directories built here contain thousands to millions of files, so the scanner can bog down the creation process, spending more time checking files than it takes to produce them. If you can, consider temporarily disabling real-time protection while building the directories. To do so, try START > Windows Security > Virus & threat protection > Manage settings > Real-time protection.
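
To check whether real-time scanning is actually the bottleneck on your machine, you could time the creation of many files once with protection on and once with it off. The following is only a rough sketch: the function name `time_file_writes`, the file count, and the file size are illustrative assumptions and are not part of this repo.

```python
import os
import tempfile
import time


def time_file_writes(num_files=1000, size=1024, root=None):
    """Write num_files files of size bytes each; return elapsed seconds.

    Hypothetical helper for illustration. Files go into a temporary
    directory (under root, if given) that is deleted afterwards.
    """
    payload = b"\0" * size
    with tempfile.TemporaryDirectory(dir=root) as tmp:
        start = time.perf_counter()
        for i in range(num_files):
            with open(os.path.join(tmp, f"{i}.bin"), "wb") as f:
                f.write(payload)
        return time.perf_counter() - start


if __name__ == "__main__":
    print(f"wrote 1000 files in {time_file_writes():.2f}s")
```

Run it on the same drive where the data directories will live. If the two timings are close, the scanner is probably not what is slowing you down.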

N-gram LMs

The file ngram_lm.py builds ARPA-format n-gram language models using a variety of smoothing techniques, including modified Kneser-Ney. It is far less efficient than KenLM, which should be preferred for large corpora. For small corpora, however, there should be little practical difference. The only external requirement of ngram_lm.py is NumPy, so the file can be copied into your project verbatim. Just please be sure to acknowledge this repo :)
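
For readers who have not seen the ARPA format, here is a minimal sketch that writes an unsmoothed (maximum-likelihood) unigram model in that layout. It deliberately does not use ngram_lm.py, whose API is not described on this page; `unigram_arpa` is a hypothetical function, and a real model would add higher orders, backoff weights, and smoothing.

```python
import math
from collections import Counter


def unigram_arpa(sents):
    """Return an ARPA-format string for a maximum-likelihood unigram model.

    Hypothetical helper for illustration only; sents is a list of
    token lists.
    """
    counts = Counter()
    for sent in sents:
        counts.update(sent)
        counts["</s>"] += 1  # one end-of-sentence token per sentence
    total = sum(counts.values())
    # Header: number of n-grams of each order (only order 1 here)
    lines = ["\\data\\", f"ngram 1={len(counts)}", "", "\\1-grams:"]
    for word in sorted(counts):
        # ARPA files store base-10 log-probabilities, tab-separated
        lines.append(f"{math.log10(counts[word] / total):.6f}\t{word}")
    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"


arpa_text = unigram_arpa([["a", "b", "a"], ["b", "c"]])
print(arpa_text)
```

The output begins with a `\data\` header giving the number of n-grams per order, lists one `log10-probability<TAB>n-gram` entry per line under `\1-grams:`, and closes with `\end\`.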