NLP with Hugging Face Transformers

Hugging Face Transformers

(Credit: Google DeepMind, Unsplash.com)


You may want to check: Introduction to Hugging Face


Overview of the NLP and Hugging Face Transformers session

1. Introduction to NLP

Natural Language Processing (NLP) is a field at the intersection of linguistics, computer science, and artificial intelligence. It focuses on the interaction between computers and human languages, enabling machines to understand, interpret, and generate human language. NLP powers applications like language translation, sentiment analysis, chatbots, and more.

2. Main NLP Tasks

Common NLP tasks that Transformers models are applied to include:

  • Text classification (e.g., sentiment analysis)
  • Token classification (e.g., named entity recognition)
  • Question answering
  • Summarization
  • Translation
  • Text generation
  • Fill-mask (masked language modeling)

Many of these can be run with a single pipeline() call, as shown in the sketch below.
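
A minimal sketch of the pipeline API for two of these tasks. The pipeline downloads a default model for each task unless one is specified; "bert-base-uncased" is used here only as an example checkpoint for fill-mask.

from transformers import pipeline

# Sentiment analysis: returns a label (POSITIVE/NEGATIVE) and a confidence score
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face Transformers makes NLP approachable."))

# Fill-mask: predicts likely tokens for the masked position
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Natural language processing is a [MASK] field."))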

3. Advantages of Hugging Face Transformers

  • Pre-trained Models: Access to a vast library of pre-trained models that can be fine-tuned for specific tasks, reducing the need for large amounts of labeled data.
  • Ease of Use: User-friendly APIs that simplify model deployment and integration into applications.
  • State-of-the-Art Performance: Many models in the Hugging Face library achieve cutting-edge performance on various NLP tasks.
  • Community and Support: Developers are backed by a large, active community and extensive documentation.
  • Flexibility: Support for a wide range of tasks, including text classification, question answering, translation, and more.
  • Interoperability: Integration with popular deep learning frameworks like PyTorch and TensorFlow (see the sketch after this list).
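
As a small illustration of the last point, the same checkpoint can be loaded as either a PyTorch or a TensorFlow model. A minimal sketch, assuming both torch and tensorflow are installed:

from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification

# The same "bert-base-uncased" checkpoint as a PyTorch nn.Module ...
pt_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# ... and as a TensorFlow/Keras model
tf_model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")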

4. Learning Resources

Books

Papers

Online Courses

Documentation

Tutorials and Blogs

5. General References

This workshop will introduce participants to the foundational concepts of NLP, its various tasks, and how Hugging Face Transformers can be leveraged to tackle these tasks efficiently.

6. Jupyter Notebook example

Open and run the following example code as a Jupyter Notebook in Google Colab.


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Basic notions: NLP, tokenization, transformers, fine-tuning
# Advantages of Hugging Face Transformers: pre-trained models, easy-to-use API, community
# Trends: large language models, generative AI, transfer learning

# Load a pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
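# Note: the sequence-classification head on top of BERT is newly initialized (2 labels by default),
# so the model must be fine-tuned before its predictions are meaningful.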

# Load a dataset (e.g., IMDB sentiment analysis)
from datasets import load_dataset

dataset = load_dataset("imdb")
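# IMDB provides "train", "test", and "unsupervised" splits (there is no separate "validation" split).
# Optional: for a quick run in Colab, subsample, e.g. dataset["train"].shuffle(seed=42).select(range(2000))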

# Preprocess the data
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(preprocess_function, batched=True)
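# Each example now also has "input_ids", "token_type_ids", and "attention_mask" columns alongside "text" and "label"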

# Fine-tune the model
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases (4.41+)
    save_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    # IMDB has no "validation" split; evaluate on the held-out "test" split instead
    eval_dataset=tokenized_datasets["test"],
)

trainer.train()

# Activity: Experiment with different pre-trained models and hyperparameters.
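
After training, the fine-tuned model can be tried on new text. A minimal sketch; the save directory "./fine-tuned-imdb" is an arbitrary example path:

from transformers import pipeline

# Save the fine-tuned model and tokenizer, then load them into an inference pipeline
trainer.save_model("./fine-tuned-imdb")
tokenizer.save_pretrained("./fine-tuned-imdb")

sentiment = pipeline("sentiment-analysis", model="./fine-tuned-imdb", tokenizer="./fine-tuned-imdb")
# For IMDB, label 0 = negative and 1 = positive; outputs appear as LABEL_0 / LABEL_1 unless id2label is set
print(sentiment("This movie was a delightful surprise."))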

Created: 08/16/2024 (C. Lizárraga); Last update: 08/26/2024 (C. Lizárraga)

CC BY-NC-SA

UArizona DataLab, Data Science Institute, University of Arizona, 2024.