Platforms and Tools Taxonomy - TWIML/NLP_Working_Group GitHub Wiki

NLP

NLP is the field of Artificial Intelligence concerned with applications that process and analyze large amounts of natural language data.

The Problem to solve

Where will the opportunities be for NLP practitioners? According to Wikipedia - NLP, challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

1. speech recognition

There are already many speech recognition software packages.

Systems like GPT-2 could be used to create better speech recognition systems.

Speech Recognition != Natural Language Understanding. Speech recognition (really: transcription) means knowing the words that were uttered, but not yet knowing what they mean, semantically and pragmatically, i.e. what we want the computer to do with the command/utterance/response heard. Understanding and interpreting what one heard is a separate step.

So, we’re missing a huge step to enable a computer to engage in meaningful dialog with us: the act of understanding what the user is saying. Once we’re in the text domain, we then need the computer to understand. That phase is referred to as natural language understanding. The output of this step is what’s called a semantic representation, or semantic interpretation.
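The shape of that "semantic representation" output can be illustrated with a toy intent-and-slot parser. This is a hypothetical, rule-based sketch; real NLU systems learn this mapping from data, and the intent names and slot keys below are made up for illustration.

```python
# Toy NLU step: map a transcribed utterance to a semantic representation
# (intent + slots). Rule-based and illustrative only; production NLU
# systems use trained models for this mapping.

def interpret(utterance: str) -> dict:
    """Return a toy semantic representation for a transcribed command."""
    words = utterance.lower().split()
    if "play" in words:
        # Everything after "play" is treated as the thing to play.
        idx = words.index("play")
        return {"intent": "play_music",
                "slots": {"query": " ".join(words[idx + 1:])}}
    if "weather" in words:
        return {"intent": "get_weather", "slots": {}}
    return {"intent": "unknown", "slots": {}}

print(interpret("Play some jazz"))
# {'intent': 'play_music', 'slots': {'query': 'some jazz'}}
```

The same utterance transcribed perfectly by a speech recognizer is still just text; only after this step does the system know what action to take.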

2. natural language understanding

This Medium article offers "A good rule of thumb is to use the term NLU if you’re just talking about a machine’s ability to understand what we say".

3. natural language generation.

According to this blog about NLG, Gartner’s recent Hype Cycle for BI and Analytics sums up the difference between NLG and NLP (Natural Language Processing) well: “Whereas NLP is focused on deriving analytic insights from textual data, NLG is used to synthesize textual content by combining analytic output with contextualized narratives.”

NLP libraries

This section probably needs to be rearranged and classified based on whether the library is written on top of TensorFlow or PyTorch.

PyTorch: Hugging Face, AllenNLP, spaCy
TensorFlow:
Apache MXNet: GLUON
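The backend-based grouping above can be sketched as a simple mapping. The assignments below reflect this section's list, not an authoritative survey (spaCy, for instance, is actually built on its own Thinc backend with PyTorch interoperability).

```python
# Sketch of the proposed classification: NLP libraries keyed by the
# deep learning framework they are primarily built on. Assignments
# follow the wiki's own list and are not exhaustive.
BACKENDS = {
    "PyTorch": ["Hugging Face", "AllenNLP", "spaCy"],
    "TensorFlow": [],
    "Apache MXNet": ["GLUON"],
}

def libraries_for(backend: str) -> list:
    """Return the libraries listed under a given framework."""
    return BACKENDS.get(backend, [])
```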

List of NLP libraries

Here is a list of NLP libraries

Following are NLP libraries not listed in the above blog article:

AllenNLP, Intel NLP Architect, PyTorch-NLP, scikit-learn, Spark NLP, Textacy, and numerous more non-English NLP libraries

JavaScript-based libraries: NLP.js, Retext, Compromise, Natural

Java-based libraries: Apache OpenNLP, StanfordNLP, CogCompNLP

Platforms: AWS AI Services, IBM Watson, Google Cloud Natural Language, Azure Cognitive Services, Oracle OCI Data Science/Digital Assistant

Closer look at AllenNLP

AI2 has released the official v1 of its free NLP library. Here is the related white paper:

.....many research codebases bury high-level parameters under implementation details, are challenging to run and debug, and are difficult enough to extend that they are more likely to be rewritten. This paper describes AllenNLP, a library for applying deep learning methods to NLP research, which addresses these issues with easy-to-use command-line tools, declarative configuration-driven experiments, and modular NLP abstractions.

AllenNLP is an ongoing open-source effort maintained by several full-time engineers and researchers at the Allen Institute for Artificial Intelligence, as well as interns from top PhD programs and contributors from the broader NLP community.

AllenNLP is built on PyTorch (Paszke et al., 2017), which provides many attractive features for NLP research. PyTorch supports dynamic networks, has a clean “Pythonic” syntax, and is easy to use. The AllenNLP library provides (1) a flexible data API that handles intelligent batching and padding, (2) high-level abstractions for common operations in working with text, and (3) a modular and extensible experiment framework that makes doing good science easy.
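The "intelligent batching and padding" that AllenNLP's data API automates can be sketched in a few lines: variable-length token-id sequences in a batch are padded to a common length, with a mask marking real tokens. The function below is illustrative and is not AllenNLP's actual API.

```python
# Sketch of batch padding for variable-length token-id sequences.
# AllenNLP handles this (and bucketing by length) automatically; this
# toy shows only the padded-ids-plus-mask pattern.

def pad_batch(sequences, pad_id=0):
    """Pad sequences to the longest length; return padded ids and a mask."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

batch = [[5, 8, 2], [7, 1], [3, 9, 4, 6]]
ids, mask = pad_batch(batch)
print(ids)   # [[5, 8, 2, 0], [7, 1, 0, 0], [3, 9, 4, 6]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```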

Many existing NLP pipelines, such as Stanford CoreNLP (Manning et al., 2014) and spaCy, focus on predicting linguistic structures rather than modeling NLP architectures. While AllenNLP supports making predictions using pretrained models, its core focus is on enabling novel research.

While Keras’ abstractions and functionality are useful for general machine learning, they are somewhat lacking for NLP, where input data types can be very complex and dynamic graph frameworks are more often necessary.

Problems adopting AllenNLP

The contextualized word embedding is set to be the average of the embeddings of all the subtokens in a word.
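That averaging behavior can be shown concretely. This is a minimal sketch using plain Python lists in place of real embedding tensors; the vectors and subtoken split are made up for illustration.

```python
# Sketch of the behavior described above: a word's contextualized
# embedding is taken as the component-wise mean of its subtoken
# embeddings. Plain lists stand in for real tensors.

def word_embedding(subtoken_embeddings):
    """Average a word's subtoken embedding vectors component-wise."""
    n = len(subtoken_embeddings)
    dim = len(subtoken_embeddings[0])
    return [sum(vec[i] for vec in subtoken_embeddings) / n
            for i in range(dim)]

# e.g. "unhappiness" -> subtokens "un", "happi", "ness" (toy 2-d vectors)
subtokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(word_embedding(subtokens))  # [3.0, 4.0]
```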

AllenNLP Code Base

GitHub repo

Fast.ai

In the article fastai: A Layered API for Deep Learning:

.....clarity and development speed of Keras [2] and the customizability of PyTorch

The library itself is built on top of PyTorch [5], NumPy [6], PIL [7], pandas [8], and various other libraries.

The tokenization is flexible and can support many different tokenizers. The default used is spaCy. A SentencePiece tokenizer [15] is also provided by the library. Subword tokenization [16] [17], such as that provided by SentencePiece, has been used in many recent NLP breakthroughs [18] [19].

fastai’s text models are based on AWD-LSTM [21]. The user community has provided external connectors to the popular Hugging Face Transformers library [22].

The pandas library [8] already provides excellent support for processing tabular data sets, and fastai does not attempt to replace it. Instead, it adds additional functionality to pandas DataFrames through various pre-processing functions, such as automatically adding features that are useful for modelling with date data.
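The kind of date feature engineering described above (fastai's `add_datepart` transform does this over a pandas DataFrame column) can be sketched with only the standard library. The feature names below mirror the idea, not fastai's exact output.

```python
# Sketch of date feature engineering: expand one date into several
# model-friendly features. fastai's add_datepart applies this idea to a
# whole pandas DataFrame column at once; this standalone version uses
# only the stdlib.
from datetime import date

def date_features(d: date) -> dict:
    """Expand a date into categorical/numeric modelling features."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "dayofweek": d.weekday(),   # Monday == 0
        "is_month_start": d.day == 1,
        "is_year_start": d.month == 1 and d.day == 1,
    }

print(date_features(date(2020, 7, 1)))
```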

Because fastai provides a layered architecture, users of the software can customize every part, as they need. The layered architecture is also an important foundation in allowing PyTorch users to incrementally add fastai functionality to existing code bases. Furthermore, fastai’s layers are reused across all applications, so an investment in learning them can be leveraged across many different projects.

The fastai library provides most data augmentation in computer vision on the GPU at the batch level. .......using a dedicated library such as PIL [7] or OpenCV [52]......

nbdev for exploratory programming

In order to assist in developing this library, we built a programming environment called nbdev, which allows users to create complete Python packages, including tests and a rich documentation system, all in Jupyter Notebooks [53]. nbdev is a system for exploratory programming.

fast.ai in production

1.) Deploy a fastai model into a Kubernetes cluster using BentoML

2.) Fast.ai already has a list of production-ready platforms such as Render

MLflow and Kubeflow

MLflow is a Databricks project and Kubeflow is widely backed by Google.

MLflow is a Python package that covers some key steps in model management. Kubeflow is a combination of open-source libraries that depends on a Kubernetes cluster to provide a computing environment for ML model development and production tools.
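What "covering key steps in model management" means in practice is largely run tracking: logging parameters and metrics per experiment run. The in-memory sketch below imitates the shape of MLflow's `log_param`/`log_metric` tracking calls without the `mlflow` package itself; the class and run name are made up.

```python
# Minimal sketch of experiment tracking in the style of MLflow's
# tracking API (mlflow.log_param / mlflow.log_metric). Not MLflow
# itself: everything is kept in memory for illustration.

class Run:
    def __init__(self, name):
        self.name, self.params, self.metrics = name, {}, {}

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        # Keep the full history so the best value can be recovered later.
        self.metrics.setdefault(key, []).append(value)

run = Run("lstm-baseline")
run.log_param("lr", 0.001)
run.log_metric("val_loss", 0.91)
run.log_metric("val_loss", 0.74)
best = min(run.metrics["val_loss"])  # best validation loss across epochs
```

The real MLflow persists this to a tracking server or local files so runs can be compared in a UI; Kubeflow layers similar bookkeeping onto Kubernetes-managed pipelines.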

MLOps vs. DevOps

HPC and Distributed Machine Learning

A Survey on Distributed Machine learning

Linear Algebra Libraries for High-Performance Computing

AI Chips

SGD: CPU vs. GPU, Sync vs. Async
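The sync vs. async distinction for distributed SGD can be shown on a 1-D toy loss. This is an illustrative sketch, not a distributed implementation: in synchronous SGD, worker gradients are averaged behind a barrier before one update; in asynchronous SGD, each worker applies its gradient as it arrives (in a real system, possibly computed from a stale parameter read).

```python
# Toy comparison of synchronous vs. asynchronous SGD on the 1-D loss
# f(w) = (w - 3)^2, whose gradient is 2 * (w - 3). Numbers are
# illustrative only.

def grad(w):
    return 2.0 * (w - 3.0)

def sync_sgd_step(w, num_workers=4, lr=0.1):
    """Average gradients from all workers, then apply one update."""
    g = sum(grad(w) for _ in range(num_workers)) / num_workers
    return w - lr * g

def async_sgd_steps(w, num_workers=4, lr=0.1):
    """Each worker applies its own update in turn, with no barrier."""
    for _ in range(num_workers):
        w = w - lr * grad(w)
    return w

w_sync = sync_sgd_step(0.0)     # one averaged update: 0 - 0.1 * (-6) = 0.6
w_async = async_sgd_steps(0.0)  # four sequential updates, moves further
```

With identical data the averaged synchronous gradient equals one worker's gradient, so a sync round makes one step while the async round makes four; with heterogeneous data and stale reads, async trades this extra progress for noisier updates.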