Python Libraries for NLP - FHNW-IGEO/Geoharvester GitHub Wiki

Natural Language Processing (NLP)

NLTK (Natural Language Toolkit): This is a comprehensive library for NLP tasks such as tokenization, stemming, lemmatization, and part-of-speech tagging. It also includes a wide range of corpora and resources for working with human language data.
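As a rough illustration of the tokenization and stemming mentioned above, here is a minimal sketch. It assumes NLTK is installed and deliberately uses only components that need no extra corpus download (`wordpunct_tokenize` and `PorterStemmer`). The sample sentence is made up for this example.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample sentence, not taken from the project data.
text = "The harvesters were indexing geodata services."

# Split on word characters and punctuation (regex based, no corpus needed).
tokens = wordpunct_tokenize(text.lower())

# Reduce each alphabetic token to its Porter stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(tok) for tok in tokens if tok.isalpha()]

print(tokens)
print(stems)
```

Lemmatization and part-of-speech tagging work along the same lines, but they rely on NLTK data packages that have to be fetched first with `nltk.download(...)`.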

Possible approach for the model: https://medium.com/analytics-vidhya/nlp-tutorial-for-text-classification-in-python-8f19cd17b49e

SpaCy: This is a powerful and efficient NLP library that is designed for production use. It includes support for many NLP tasks such as tokenization, part-of-speech tagging, and named entity recognition.
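A minimal tokenization sketch with spaCy could look like the following. It assumes spaCy is installed and uses a blank English pipeline, so no trained model has to be downloaded. The sample sentence is made up for this example.

```python
import spacy

# A blank pipeline contains only the language-specific tokenizer.
nlp = spacy.blank("en")

# Hypothetical sample sentence.
doc = nlp("Geoharvester indexes Swiss geodata services.")
tokens = [tok.text for tok in doc]

print(tokens)
```

Part-of-speech tagging and named entity recognition need a trained pipeline instead, for example one loaded with `spacy.load("en_core_web_sm")` after installing that model package separately.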

Gensim: This is a library for topic modeling, document indexing, and similarity retrieval with large corpora. It includes efficient implementations of popular algorithms such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).

TextBlob: This is a simple and easy-to-use library for common NLP tasks such as part-of-speech tagging, noun phrase extraction, and sentiment analysis.

Stanford CoreNLP: This is a powerful Java-based library for NLP tasks such as tokenization, part-of-speech tagging, and named entity recognition. It can be used from Python through wrappers, for example the CoreNLP client included in Stanza, the Stanford NLP Group's Python package.

Pattern: This is a library for web mining, natural language processing, and machine learning. It includes tools for text processing, such as tokenization, part-of-speech tagging, and named entity recognition.

Polyglot: This is a natural language processing library that supports a wide range of languages and NLP tasks. It includes support for tokenization, part-of-speech tagging, and named entity recognition.

Scikit-learn: This is a general machine learning library for Python that includes tools for text feature extraction. It supports vectorization of text data (CountVectorizer, TfidfVectorizer) and decomposition techniques such as Latent Semantic Analysis (LSA, via TruncatedSVD) and Latent Dirichlet Allocation (LDA, via LatentDirichletAllocation).
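One possible way to run LSA with scikit-learn is to apply a truncated SVD to a TF-IDF matrix, as sketched below. The example assumes scikit-learn is installed. The toy documents and the choice of two components are made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents.
docs = [
    "map service for geodata layers",
    "geodata layers served as wms service",
    "machine translation of text",
    "language model for text translation",
]

# Turn the documents into a sparse TF-IDF matrix (documents x terms).
tfidf = TfidfVectorizer().fit_transform(docs)

# LSA: project the term space onto two latent components.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(reduced.shape)
```

Each row of `reduced` is a dense two-dimensional representation of one document, which can then be fed to a clustering or similarity step.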

PyNLPl (pronounced "pineapple"): This is a library for natural language processing that covers basic tasks such as n-gram extraction, frequency lists, and simple language models, and includes readers for several NLP file formats.

FastText: This is a library for fast text classification and representation learning. It includes support for efficient training of word embeddings and classification of text using those embeddings.

Hugging Face: Pretrained models such as BERT and similar.

Topic Modelling

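stm: Structural Topic Model (STM), best known as an R package. Not to be confused with software transactional memory, which shares the abbreviation.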

Tidytext: Text processing with pandas DataFrames.

Translations

translate-toolkit: Useful localization tools with Python API for building localization & translation systems

argostranslate: Open-source offline translation library written in Python

deepl: Official Python library for the DeepL language translation API.

Idea for Rumantsch (Romansh)

TF-IDF

bind_tf_idf in tidytext

sklearn.feature_extraction.text.TfidfVectorizer

TF-IDF can also be calculated from scratch (https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089)
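For reference, a from-scratch calculation can be sketched in a few lines of plain Python. This is a minimal version using the textbook definition (term frequency times the logarithm of the inverse document frequency). It is not the code from the linked article, and the function name `tf_idf` and the toy documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy documents, tokenized by whitespace.
docs = [d.split() for d in ["the cat sat", "the dog sat", "the cat ran"]]
scores = tf_idf(docs)
print(scores[0])
```

A term that occurs in every document, such as "the" here, gets a weight of zero. The numbers differ from those of scikit-learn's TfidfVectorizer, which by default uses a smoothed IDF and normalizes each document vector.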