NLP - stereoboy/Study GitHub Wiki
Contents
Project 01
https://github.com/stereoboy/AIND-NLP.git
AIND-NLP: Text Processing
- Cleaning
from bs4 import BeautifulSoup
- Normalization
- Lower() & Punctuation Removal
- Tokenization
- NLTK: Natural Language ToolKit
- Named Entity Recognition
- Stemming & Lemmatization
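A minimal sketch of the steps above using BeautifulSoup and NLTK (assumes the NLTK punkt and wordnet data packages are downloaded; names are illustrative):

```python
import re
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

def preprocess(html_text):
    # Cleaning: strip markup, keep only the visible text
    text = BeautifulSoup(html_text, "html.parser").get_text()
    # Normalization: lowercase and remove punctuation
    text = re.sub(r"[^a-z0-9]", " ", text.lower())
    # Tokenization
    tokens = word_tokenize(text)
    # Stemming and lemmatization, shown side by side for comparison
    stems = [PorterStemmer().stem(t) for t in tokens]
    lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]
    return tokens, stems, lemmas
```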
https://github.com/stereoboy/NLP-Exercises.git
NLP-Exercises: Viterbi Algorithm
- Dynamic programming to find the optimal state path with the highest probability
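A toy NumPy sketch of the Viterbi recursion; the array-based HMM interface (start/transition/emission probability matrices) is illustrative, not the exercise's actual API:

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence.
    obs: observation indices; start_p: (S,), trans_p: (S, S), emit_p: (S, V)."""
    S, T = len(start_p), len(obs)
    best = np.zeros((T, S))             # best path probability ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    best[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = best[t - 1] * trans_p[:, s] * emit_p[s, obs[t]]
            back[t, s] = np.argmax(scores)
            best[t, s] = np.max(scores)
    # Trace back the optimal path with the highest probability
    path = [int(np.argmax(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))
```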
Further Reading: Speech and Language Processing by Daniel Jurafsky and James H. Martin.
- Chapter 9: Sequence Processing with Recurrent Networks
- Chapter 10: Encoder-Decoder Models, Attention, and Contextual Embeddings
Main Project: HMM-Tagger
Hidden Markov Model Part of Speech tagger project
- https://github.com/stereoboy/hmm-tagger/
- Speech and Language Processing (3rd ed. draft)
- Dan Jurafsky and James H. Martin
- https://web.stanford.edu/~jurafsky/slp3/
- References
- AI in Practice: Identifying Parts of Speech in Python
- Learning POS Tagging & Chunking in NLP
- Part-Of-Speech Tagging for Social Media Texts
Project 02
Lesson 01: Feature extraction and embedding
- Keywords
- Bag of Words / TF-IDF (see the sketch below)
- One-hot-encoding
- Word Embeddings/Word2Vec/GloVe
- t-SNE
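A minimal Bag of Words / TF-IDF sketch with scikit-learn (an assumption; the lesson exercises may use different tooling):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Bag of Words: raw term counts per document
bow = CountVectorizer().fit_transform(corpus)

# TF-IDF: the same counts reweighted by inverse document frequency
tfidf = TfidfVectorizer().fit_transform(corpus)

print(bow.toarray())
print(tfidf.toarray())
```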
Lesson 02: Topic Modeling
- Latent Dirichlet Allocation (http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
- Jupyter Notebook Exercise
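A small LDA sketch with scikit-learn on a toy corpus (the linked exercise may use another library such as gensim):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cats purr and meow", "dogs bark and fetch", "stocks rise and fall"]
counts = CountVectorizer().fit_transform(docs)

# Fit a 2-topic LDA model and print each document's topic mixture
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics)
```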
Lesson 05: Deep Learning Attention
- Additive or multiplicative attention (see the sketch after the paper links below)
- Neural Machine Translation by Jointly Learning to Align and Translate
- Effective Approaches to Attention-based Neural Machine Translation
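A toy NumPy sketch contrasting the two scoring functions; all shapes and weight names here are illustrative:

```python
import numpy as np

def multiplicative_score(query, keys, W):
    # Luong-style (multiplicative) attention: score_i = k_i^T W q
    return keys @ (W @ query)

def additive_score(query, keys, W_q, W_k, v):
    # Bahdanau-style (additive) attention: score_i = v^T tanh(W_q q + W_k k_i)
    return np.tanh(query @ W_q.T + keys @ W_k.T) @ v

def attend(scores, values):
    # Softmax the scores into weights, then return the weighted sum of values (the context vector)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values
```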
Super interesting computer vision applications using attention:
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention [pdf]
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [pdf]
- Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks [pdf]
- Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos [pdf]
- Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge [pdf]
- Visual Question Answering: A Survey of Methods and Datasets [pdf]
NLP Application: Google Neural Machine Translation
The best demonstration of an application is a real-world system that is in production right now. In late 2016, Google released the following paper describing Google’s Neural Machine Translation System:
- Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation [pdf]
This system later went into production, powering Google Translate.
Take a stab at reading the paper and connecting it to what we've discussed in this lesson so far. Below are a few questions to guide this external reading:
- Is Google’s Neural Machine Translation System a sequence-to-sequence model?
- Does the model utilize attention?
- If the model does use attention, does it use additive or multiplicative attention?
- What kind of RNN cell does the model use?
- Does the model use bidirectional RNNs at all?
Text Summarization:
- Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
Other Attention Methods
- Transformer
- https://youtu.be/VmsR9FVpQiM
- https://youtu.be/F-XN72bQiMQ
- Paper: Attention Is All You Need
- Talk: Attention is all you need attentional neural network models – Łukasz Kaiser
Lesson 6: RNN Keras Lab, Deciphering Code with Character-Level RNN
Main Project: Machine Translation
Project 03
Lesson 3: Speech Recognition
04 References: Signal Analysis
- Sound: https://en.wikipedia.org/wiki/Sound
- Signal Analysis: Cassidy, Steve. "Speech recognition." Sydney Australia (2002): Chapter 3.
- Fourier Analysis: Fourier Transforms – the most important tool in mathematics?. (2014). IB Maths Resources from British International School Phuket.
- Spectrograms: Marcus, Mitch. "CIS 391 Artificial Intelligence." Philadelphia (2015). Seas.upenn.edu.
07 Feature Extraction
- Feature Extraction: A summary of methods used in ASR:
- Narang, Shreya, and Ms Divya Gupta. "Speech Feature Extraction Techniques: A Review." International Journal of Computer Science and Mobile Computing 4.3 (2015): 107-114.
- Mel Scale
The Mel Scale was developed in 1937 and is based on human studies of pitch perception. At lower pitches (frequencies), humans can distinguish pitches better. Read more about it on Wikipedia (https://en.wikipedia.org/wiki/Mel_scale).
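A small sketch of the commonly used Hz-to-mel conversion formula, m = 2595 · log10(1 + f/700):

```python
import numpy as np

def hz_to_mel(f_hz):
    # Convert frequency in Hz to mels
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse conversion back to Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in mel space correspond to smaller frequency steps at the low end,
# matching the finer pitch resolution humans have at lower frequencies.
print(hz_to_mel(np.array([100.0, 1000.0, 8000.0])))
```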
The Source/Filter Model
- Cassidy, Steve. "Speech recognition." Sydney Australia (2002): Chapter 7.
- Cepstral Analysis : Cepstral Analysis of Speech (Theory) : Speech Signal Processing Laboratory : Electronics & Communications : IIT GUWAHATI Virtual Lab
MFCC
Mel Frequency Cepstrum Coefficient analysis is the reduction of an audio signal to essential speech component features using both mel frequency analysis and cepstral analysis. The range of frequencies is reduced and binned into groups of frequencies that humans can distinguish. The signal is further separated into source and filter so that variations between speakers unrelated to articulation can be filtered away. The following reference provides nice visualizations of the audio → spectrogram → MFCC process:
- Prahallad, Kishore. "Speech Technology: A Practical Introduction, topic: Spectrogram, Cepstrum and Mel-Frequency Analysis." Carnegie Mellon University
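A minimal sketch of extracting MFCCs with librosa (librosa and the file name are assumptions; the course exercises may use a different library):

```python
import librosa

# Hypothetical audio file
y, sr = librosa.load("speech.wav", sr=16000)
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr)   # mel-binned spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # cepstral analysis on the mel spectrum
print(mfcc.shape)                                        # (13, number_of_frames)
```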
MFCC Deltas and Delta-Deltas
Intuitively, it makes sense that changes in frequencies (deltas) and changes in those changes (delta-deltas) might also be meaningful features in speech recognition. The following succinct MFCC tutorial includes a short discussion of deltas and delta-deltas:
- Mel Frequency Cepstral Coefficient (MFCC) tutorial. Practical Cryptography
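A short sketch of stacking deltas and delta-deltas on top of the MFCCs, again assuming librosa and a hypothetical audio file:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)          # hypothetical audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)                    # changes in the MFCCs
delta2 = librosa.feature.delta(mfcc, order=2)          # changes in the changes
features = np.vstack([mfcc, delta, delta2])            # stacked (39, frames) feature matrix
```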
09 Phonetics
- Phoneme/Grapheme
- Lexical Decoding
- Lexicon
12 Voice Data Lab
- https://github.com/udacity/AIND-VUI-Lab-Voice-Data
- Sonic Visualiser
14. Acoustic Models and the Trouble with Time
- DTW (Dynamic Time Warping)
- CTC (Connectionist Temporal Classification)
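A toy sketch of the DTW dynamic program on two 1-D sequences (illustrative only; real ASR front ends align frame-level feature vectors):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping cost between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allow stretching/compressing time by taking the cheapest alignment step
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([1, 2, 3, 3, 4], [1, 2, 3, 4]))
```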
19. References: Traditional ASR
A bit of Computer History Museum nostalgia on Speech Recognition presents what we think of now as "Traditional" ASR:
- Kai-Fu Lee (Apple) in 1993. Computer History Museum video.
Acoustic Models with HMMs
HMMs are the primary probabilistic model in traditional ASR systems. The following slide decks from Carnegie Mellon include very helpful and detailed visualizations of HMMs, the Viterbi trellis, state tying, and more:
- Raj, Bhiksha, and Rita Singh. "Design and implementation of speech recognition systems." Carnegie Mellon School of Computer Science (2011).
- slides - HMMs
- slides - Continuous Speech
- slides - HMM tying
N-Grams
N-grams provide a way to score a sequence of words by chaining together the probabilities of each word given the words that came before it. For more on creating and using N-grams, see the references below:
- Martin, James H., and Daniel Jurafsky. "Speech and language processing." International Edition 710 (2014). Chapter 4 Draft.
- Jurafsky, Daniel. "CS124 - From Languages to Information". Stanford University.Language Modeling. Slides
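A tiny sketch of maximum-likelihood bigram probabilities, chaining counts from a toy token sequence:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()

# Count unigrams and adjacent word pairs (bigrams)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))   # 2/3
```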
22. Connectionist Temporal Classification
- Graves, Alex, et al. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
23. References: Deep Neural Network ASR
- Deep Speech 2
The following presentation, slides, and paper from Baidu on DeepSpeech 2 were important resources for the development of this course and its capstone project:
- Amodei, Dario, et al. "Deep speech 2: End-to-end speech recognition in english and mandarin." International Conference on Machine Learning. 2016.
- Presentation: https://www.youtube.com/watch?v=g-sndkf7mCs
- Slides: https://cs.stanford.edu/~acoates/ba_dls_speech2016.pdf
- Language modeling with CTC
- Gram-CTC from Baidu on integrating a language model into CTC for better performance:
- Liu, Hairong, et al. "Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling." arXiv preprint arXiv:1703.00096 (2017).
- Language modeling with CTC based on weighted finite-state transducers (WFSTs):
- Miao, Yajie, Mohammad Gowayyed, and Florian Metze. "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding." Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015.
- Slides: http://people.csail.mit.edu/jrg/meetings/CTC-Dec07.pdf
Main Project: DNN Speech Recognizer
- https://github.com/stereoboy/AIND-VUI-Capstone
- LibriSpeech dataset
- References
- Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients
- On the difficulty of training Recurrent Neural Networks
- Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
- A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
- WaveNet: A Generative Model for Raw Audio
- Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
- Hybrid Speech Recognition with Deep Bidirectional LSTM
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin