Natural Language Processing (NLP) - spinningideas/resources GitHub Wiki
Natural Language Processing is a branch of Machine Learning that uses various approaches and frameworks/libraries to extract meaning and metadata from textual data. You can take inputs of textual data and provide value back to users or systems by extracting entities, classifying the text (sentiment, emotion, relations), identifying keywords, or by providing answers to any questions posed.
Introductions
Use Cases
Classification
- Text Classification: assigning a category to a sentence or document (e.g. spam filtering).
- Sentiment Analysis: identifying the polarity of a piece of text.
- Fake News Detection: detecting and filtering out texts containing false and misleading information.
- Stance Detection: determining an individual’s reaction to a primary actor’s claim. It is a core part of a set of approaches to fake news assessment.
- Hate Speech Detection: detecting if a piece of text contains hate speech.
Information Retrieval and Document Ranking
- Sentence/document similarity: determining how similar two texts are.
- Question Answering: the task of answering a question in natural language.
Text-to-Text Generation
- Machine Translation: translating from one language to another.
- Text Generation: creating text that appears indistinguishable from human-written text.
- Text Summarization: creating a shortened version of several documents that preserves most of their meaning.
- Text Simplification: making a text easier to read and understand, while preserving its main ideas and approximate meaning.
- Lexical Normalization: translating/transforming a non-standard text to a standard register.
- Paraphrase Generation: creating an output sentence that preserves the meaning of input but includes variations in word choice and grammar.
Knowledge bases, entities and relations
- Relation extraction: extracting semantic relationships from a text. Extracted relationships usually occur between two or more entities and fall into specific semantic categories (e.g. lives in, sister of, etc).
- Relation prediction: identifying a named relation between two named semantic entities.
- Named Entity Recognition: tagging entities in text with their corresponding type, typically in BIO notation.
- Entity Linking: recognizing and disambiguating named entities to a knowledge base (typically Wikidata).
Topics and Keywords
- Topic Modeling: identifying abstract “topics” underlying a collection of documents.
- Keyword Extraction: identifying the most relevant terms to describe the subject of a document
Chatbots
- Intent Detection: capturing the semantics behind messages from users and assigning them to the correct label.
- Slot Filling: aims to extract the values of certain types of attributes (or slots, such as cities or dates) for a given entity from texts.
- Dialog Management: managing of state and flow of conversations.
Text Reasoning
- Common Sense Reasoning: use of “common sense” or world knowledge to make inferences.
- Natural Language Inference: determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.
Text-to-Data and Data-to-Text
- Text-to-Speech: technology that reads digital text aloud.
- Speech-to-Text: transcribing speech to text.
- Text-to-Image: generating photo-realistic images which are semantically consistent with the text descriptions.
- Data-to-Text: producing text from non-linguistic input, such as databases of records, spreadsheets, and expert system knowledge bases.
Text Preprocessing
- Coreference Resolution: clustering mentions in text that refer to the same underlying real-world entities.
- Part Of Speech (POS) tagging: tagging a word in a text with its part of speech. A part of speech is a category of words with similar grammatical properties, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.
- Word Sense Disambiguation: associating words in context with their most suitable entry in a pre-defined sense inventory (typically WordNet).
- Grammatical Error Correction: correcting different kinds of errors in text such as spelling, punctuation, grammatical, and word choice errors.
- Feature Extraction: extraction of generic numerical features from text, usually embeddings.
Transformers
- https://github.com/huggingface/transformers
- https://medium.com/nlplanet/two-minutes-nlp-20-learning-resources-for-transformers-1bbff88b7524
- https://jalammar.github.io/illustrated-transformer/
Models
Additional Resources
https://github.com/spinningideas/resources/wiki/AIML-Resources