NLP - stereoboy/Study GitHub Wiki
Contents
Project 01
https://github.com/stereoboy/AIND-NLP.git
AIND-NLP: Text Processing
- Cleaning
from bs4 import BeautifulSoup
- Normalization
- Lower() & Punctuation Removal
- Tokenization
- NLTK: Natural Language ToolKit
- Named Entity Recognition
- Stemming & Lemmatization
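A minimal sketch of the steps above using BeautifulSoup and NLTK (assumes the NLTK punkt and wordnet data packages are downloaded; names are illustrative):

```python
import re
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

def preprocess(html_text):
    # Cleaning: strip markup, keep only the visible text
    text = BeautifulSoup(html_text, "html.parser").get_text()
    # Normalization: lowercase and remove punctuation
    text = re.sub(r"[^a-z0-9]", " ", text.lower())
    # Tokenization
    tokens = word_tokenize(text)
    # Stemming and lemmatization, shown side by side for comparison
    stems = [PorterStemmer().stem(t) for t in tokens]
    lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]
    return tokens, stems, lemmas
```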
https://github.com/stereoboy/NLP-Exercises.git
NLP-Exercises: Viterbi Algorithm
- Dynamic programming to find the optimal state path with the highest probability
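A toy NumPy sketch of the Viterbi recursion; the array-based HMM interface (start/transition/emission probability matrices) is illustrative, not the exercise's actual API:

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence.
    obs: observation indices; start_p: (S,), trans_p: (S, S), emit_p: (S, V)."""
    S, T = len(start_p), len(obs)
    best = np.zeros((T, S))             # best path probability ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    best[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = best[t - 1] * trans_p[:, s] * emit_p[s, obs[t]]
            back[t, s] = np.argmax(scores)
            best[t, s] = np.max(scores)
    # Trace back the optimal path with the highest probability
    path = [int(np.argmax(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))
```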
Further Reading: Speech and Language Processing by Daniel Jurafsky and James H. Martin.
- Chapter 9: Sequence Processing with Recurrent Networks
- Chapter 10: Encoder-Decoder Models, Attention, and Contextual Embeddings
Main Project: HMM-Tagger
Hidden Markov Model Part of Speech tagger project
- https://github.com/stereoboy/hmm-tagger/
- Speech and Language Processing (3rd ed. draft)
- Dan Jurafsky and James H. Martin
- https://web.stanford.edu/~jurafsky/slp3/
- References
- AI in Practice: Identifying Parts of Speech in Python
- Learning POS Tagging & Chunking in NLP
- Part-Of-Speech Tagging for Social Media Texts
Project 02
Lesson 01: Feature extraction and embedding
- Keywords
- Bag of Words / TF-IDF (see the sketch below)
- One-hot-encoding
- Word Embeddings/Word2Vec/GloVe
- t-SNE
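A minimal Bag of Words / TF-IDF sketch with scikit-learn (an assumption; the lesson exercises may use different tooling):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Bag of Words: raw term counts per document
bow = CountVectorizer().fit_transform(corpus)

# TF-IDF: the same counts reweighted by inverse document frequency
tfidf = TfidfVectorizer().fit_transform(corpus)

print(bow.toarray())
print(tfidf.toarray())
```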
Lesson 02: Topic Modeling
- Latent Dirichlet Allocation (http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
- Jupyter Notebook Exercise
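A small LDA sketch with scikit-learn on a toy corpus (the linked exercise may use another library such as gensim):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cats purr and meow", "dogs bark and fetch", "stocks rise and fall"]
counts = CountVectorizer().fit_transform(docs)

# Fit a 2-topic LDA model and print each document's topic mixture
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics)
```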
Lesson 05: Deep Learning Attention
- Additive or multiplicative attention (see the sketch after the paper links below)
- Neural Machine Translation by Jointly Learning to Align and Translate
- Effective Approaches to Attention-based Neural Machine Translation
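A toy NumPy sketch contrasting the two scoring functions; all shapes and weight names here are illustrative:

```python
import numpy as np

def multiplicative_score(query, keys, W):
    # Luong-style (multiplicative) attention: score_i = k_i^T W q
    return keys @ (W @ query)

def additive_score(query, keys, W_q, W_k, v):
    # Bahdanau-style (additive) attention: score_i = v^T tanh(W_q q + W_k k_i)
    return np.tanh(query @ W_q.T + keys @ W_k.T) @ v

def attend(scores, values):
    # Softmax the scores into weights, then return the weighted sum of values (the context vector)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values
```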
Super interesting computer vision applications using attention:
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention [pdf]
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [pdf]
- Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks [pdf]
- Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos [pdf]
- Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge [pdf]
- Visual Question Answering: A Survey of Methods and Datasets [pdf]
NLP Application: Google Neural Machine Translation
The best demonstration of an application is a real-world system that is in production right now. In late 2016, Google released the following paper describing Google’s Neural Machine Translation System:
- Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation [pdf]
This system later went into production, powering Google Translate.
Take a stab at reading the paper and connecting it to what we've discussed in this lesson so far. Below are a few questions to guide this external reading:
- Is Google’s Neural Machine Translation System a sequence-to-sequence model?
- Does the model utilize attention?
- If the model does use attention, does it use additive or multiplicative attention?
- What kind of RNN cell does the model use?
- Does the model use bidirectional RNNs at all?
Text Summarization:
- Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
Other Attention Methods
- Transformer
- https://youtu.be/VmsR9FVpQiM
- https://youtu.be/F-XN72bQiMQ
- Paper: Attention Is All You Need
- Talk: Attention is all you need attentional neural network models – Łukasz Kaiser
Lesson 6: RNN Keras Lab, Deciphering Code with Character-Level RNN
Main Project: Machine Translation
Project 03
Lesson 3: Speech Recognition
04 References: Signal Analysis
- Sound: https://en.wikipedia.org/wiki/Sound
- Signal Analysis: Cassidy, Steve. "Speech recognition." Sydney Australia (2002): Chapter 3.
- Fourier Analysis: Fourier Transforms – the most important tool in mathematics?. (2014). IB Maths Resources from British International School Phuket.
- Spectrograms: Marcus, Mitch. "CIS 391 Artificial Intelligence." Philadelphia (2015). Seas.upenn.edu.
07 Feature Extraction
- Feature Extraction: A summary of methods used in ASR:
- Narang, Shreya, and Ms Divya Gupta. "Speech Feature Extraction Techniques: A Review." International Journal of Computer Science and Mobile Computing 4.3 (2015): 107-114.
- Mel Scale
The Mel Scale was developed in 1937 and is based on human studies of pitch perception. At lower pitches (frequencies), humans can distinguish pitches better. Read more about it on Wikipedia (https://en.wikipedia.org/wiki/Mel_scale).
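A small sketch of the commonly used Hz-to-mel conversion formula, m = 2595 · log10(1 + f/700):

```python
import numpy as np

def hz_to_mel(f_hz):
    # Convert frequency in Hz to mels
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse conversion back to Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in mel space correspond to smaller frequency steps at the low end,
# matching the finer pitch resolution humans have at lower frequencies.
print(hz_to_mel(np.array([100.0, 1000.0, 8000.0])))
```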
The Source/Filter Model
- Cassidy, Steve. "Speech recognition." Sydney Australia (2002): Chapter 7.
- Cepstral Analysis : Cepstral Analysis of Speech (Theory) : Speech Signal Processing Laboratory : Electronics & Communications : IIT GUWAHATI Virtual Lab
MFCC
Mel Frequency Cepstrum Coefficient analysis is the reduction of an audio signal to essential speech component features using both mel frequency analysis and cepstral analysis. The range of frequencies is reduced and binned into groups of frequencies that humans can distinguish. The signal is further separated into source and filter so that variations between speakers unrelated to articulation can be filtered away. The following reference provides nice visualizations of the audio → spectrogram → MFCC process:
- Prahallad, Kishore. "Speech Technology: A Practical Introduction, topic: Spectrogram, Cepstrum and Mel-Frequency Analysis." Carnegie Mellon University
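A minimal sketch of extracting MFCCs with librosa (librosa and the file name are assumptions; the course exercises may use a different library):

```python
import librosa

# Hypothetical audio file
y, sr = librosa.load("speech.wav", sr=16000)
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr)   # mel-binned spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # cepstral analysis on the mel spectrum
print(mfcc.shape)                                        # (13, number_of_frames)
```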
MFCC Deltas and Delta-Deltas
Intuitively, it makes sense that changes in frequencies (deltas) and changes in those changes (delta-deltas) might also be meaningful features in speech recognition. The following succinct MFCC tutorial includes a short discussion of deltas and delta-deltas:
- Mel Frequency Cepstral Coefficient (MFCC) tutorial. Practical Cryptography
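A short sketch of stacking deltas and delta-deltas on top of the MFCCs, again assuming librosa and a hypothetical audio file:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)          # hypothetical audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)                    # changes in the MFCCs
delta2 = librosa.feature.delta(mfcc, order=2)          # changes in the changes
features = np.vstack([mfcc, delta, delta2])            # stacked (39, frames) feature matrix
```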
09 Phonetics
- Phoneme/Grapheme
- Lexical Decoding
- Lexicon
12 Voice Data Lab
- https://github.com/udacity/AIND-VUI-Lab-Voice-Data
- Sonic Visualiser
14. Acoustic Models and the Trouble with Time
- DTW (Dynamic Time Warping)
- CTC (Connectionist Temporal Classification)
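A toy sketch of the DTW dynamic program on two 1-D sequences (illustrative only; real ASR front ends align frame-level feature vectors):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping cost between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allow stretching/compressing time by taking the cheapest alignment step
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([1, 2, 3, 3, 4], [1, 2, 3, 4]))
```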
19. References: Traditional ASR
A bit of Computer History Museum nostalgia on Speech Recognition presents what we think of now as "Traditional" ASR:
- Kai-Fu Lee (Apple) in 1993. Computer History Museum video.
Acoustic Models with HMMs
HMMs are the primary probabilistic model in traditional ASR systems. The following slide decks from Carnegie Mellon include very helpful and detailed visualizations of HMMs, the Viterbi trellis, state tying, and more:
- Raj, Bhiksha, and Rita Singh. "Design and implementation of speech recognition systems." Carnegie Mellon School of Computer Science (2011).
- slides - HMMs
- slides - Continuous Speech
- slides - HMM tying
N-Grams
N-grams provide a way to score a sequence of words by chaining together the probabilities of each word given the words that came before it. For more on creating and using N-grams, see the references below:
- Martin, James H., and Daniel Jurafsky. "Speech and language processing." International Edition 710 (2014). Chapter 4 Draft.
- Jurafsky, Daniel. "CS124 - From Languages to Information". Stanford University.Language Modeling. Slides
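A tiny sketch of maximum-likelihood bigram probabilities, chaining counts from a toy token sequence:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()

# Count unigrams and adjacent word pairs (bigrams)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))   # 2/3
```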
22. Connectionist Temporal Classification
- Graves, Alex, et al. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
23. References: Deep Neural Network ASR
- Deep Speech 2
The following presentation, slides, and paper from Baidu on DeepSpeech 2 were important resources for the development of this course and its capstone project:
- Amodei, Dario, et al. "Deep speech 2: End-to-end speech recognition in english and mandarin." International Conference on Machine Learning. 2016.
- Presentation: https://www.youtube.com/watch?v=g-sndkf7mCs
- Slides: https://cs.stanford.edu/~acoates/ba_dls_speech2016.pdf
- Language modeling with CTC
- Gram-CTC from Baidu on integrating a language model into CTC for better performance:
- Liu, Hairong, et al. "Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling." arXiv preprint arXiv:1703.00096 (2017).
- Language modeling with CTC based on weighted finite-state transducers (WFSTs):
- Miao, Yajie, Mohammad Gowayyed, and Florian Metze. "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding." Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015.
- Slides: http://people.csail.mit.edu/jrg/meetings/CTC-Dec07.pdf
Main Project: DNN Speech Recognizer
- https://github.com/stereoboy/AIND-VUI-Capstone
- LibriSpeech dataset
- References
- Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients
- On the difficulty of training Recurrent Neural Networks
- Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
- A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
- WaveNet: A Generative Model for Raw Audio
- Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
- Hybrid Speech Recognition with Deep Bidirectional LSTM
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin