NLP with Hugging Face Transformers

Hugging Face Transformers

(Credit: Google DeepMind, Unsplash.com)


You may want to check: Introduction to Hugging Face


Overview of the NLP and Hugging Face Transformers session

1. Introduction to NLP

Natural Language Processing (NLP) is a field at the intersection of linguistics, computer science, and artificial intelligence. It focuses on the interaction between computers and human languages, enabling machines to understand, interpret, and generate human language. NLP powers applications like language translation, sentiment analysis, chatbots, and more.

2. Main NLP Tasks

Common NLP tasks that Transformers models are applied to include:

  • Text classification (e.g., sentiment analysis)
  • Token classification (e.g., named entity recognition)
  • Question answering
  • Summarization
  • Translation
  • Text generation
  • Fill-mask (masked language modeling)

Many of these can be run with a single pipeline() call, as shown in the sketch below.
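
A minimal sketch of the pipeline API for two of these tasks. The pipeline downloads a default model for each task unless one is specified; "bert-base-uncased" is used here only as an example checkpoint for fill-mask.

from transformers import pipeline

# Sentiment analysis: returns a label (POSITIVE/NEGATIVE) and a confidence score
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face Transformers makes NLP approachable."))

# Fill-mask: predicts likely tokens for the masked position
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Natural language processing is a [MASK] field."))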

3. Advantages of Hugging Face Transformers

  • Pre-trained Models: Access to a vast library of pre-trained models that can be fine-tuned for specific tasks, reducing the need for large amounts of labeled data.
  • Ease of Use: User-friendly APIs that simplify model deployment and integration into applications.
  • State-of-the-Art Performance: Many models in the Hugging Face library achieve cutting-edge performance on various NLP tasks.
  • Community and Support: Developers are backed by a large, active community and extensive documentation.
  • Flexibility: Support for a wide range of tasks, including text classification, question answering, translation, and more.
  • Interoperability: Integration with popular deep learning frameworks like PyTorch and TensorFlow (see the sketch after this list).
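
As a small illustration of the last point, the same checkpoint can be loaded as either a PyTorch or a TensorFlow model. A minimal sketch, assuming both torch and tensorflow are installed:

from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification

# The same "bert-base-uncased" checkpoint as a PyTorch nn.Module ...
pt_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# ... and as a TensorFlow/Keras model
tf_model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")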

4. Learning Resources

Books

Papers

Online Courses

Documentation

Tutorials and Blogs

5. General References

This workshop will introduce participants to the foundational concepts of NLP, its various tasks, and how Hugging Face Transformers can be leveraged to tackle these tasks efficiently.

6. Jupyter Notebook example

Open and run the following example code as a Jupyter Notebook in Google Colab.


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Basic notions: NLP, tokenization, transformers, fine-tuning
# Advantages of Hugging Face Transformers: pre-trained models, easy-to-use API, community
# Trends: large language models, generative AI, transfer learning

# Load a pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
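# Note: the sequence-classification head on top of BERT is newly initialized (2 labels by default),
# so the model must be fine-tuned before its predictions are meaningful.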

# Load a dataset (e.g., IMDB sentiment analysis)
from datasets import load_dataset

dataset = load_dataset("imdb")
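# IMDB provides "train", "test", and "unsupervised" splits (there is no separate "validation" split).
# Optional: for a quick run in Colab, subsample, e.g. dataset["train"].shuffle(seed=42).select(range(2000))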

# Preprocess the data
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(preprocess_function, batched=True)
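# Each example now also has "input_ids", "token_type_ids", and "attention_mask" columns alongside "text" and "label"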

# Fine-tune the model
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases (4.41+)
    save_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    # IMDB has no "validation" split; evaluate on the held-out "test" split instead
    eval_dataset=tokenized_datasets["test"],
)

trainer.train()

# Activity: Experiment with different pre-trained models and hyperparameters.
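
After training, the fine-tuned model can be tried on new text. A minimal sketch; the save directory "./fine-tuned-imdb" is an arbitrary example path:

from transformers import pipeline

# Save the fine-tuned model and tokenizer, then load them into an inference pipeline
trainer.save_model("./fine-tuned-imdb")
tokenizer.save_pretrained("./fine-tuned-imdb")

sentiment = pipeline("sentiment-analysis", model="./fine-tuned-imdb", tokenizer="./fine-tuned-imdb")
# For IMDB, label 0 = negative and 1 = positive; outputs appear as LABEL_0 / LABEL_1 unless id2label is set
print(sentiment("This movie was a delightful surprise."))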

Created: 08/16/2024 (C. Lizárraga); Last update: 08/26/2024 (C. Lizárraga)

CC BY-NC-SA

UArizona DataLab, Data Science Institute, University of Arizona, 2024.