hpc: rag llm - decalyu/HPC-AI-Resources GitHub Wiki

🚀✨ HPC: Introduction to Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) ✨🚀

🎯 Goal

🤖 Learn how to implement Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) efficiently using CSUSB HPC. On their own, LLMs cannot access information newer than their training data. RAG addresses this by retrieving external knowledge at query time, which improves accuracy for tasks like answering complex queries.

📌 What You Will Learn 🧠💡

✅ Why Use HPC Instead of a Local Computer?
✅ Access HPC Terminal via CSUSB HPC Portal
✅ Install Dependencies, Setup Kaggle, and Download Dataset
✅ Load and Preprocess the Dataset in Jupyter Notebook
✅ Implementing Retrieval-Augmented Generation (RAG)
✅ Analyzing Retrieval Efficiency

🚀 Retrieval-Augmented Generation (RAG) Overview

📌 Introduction

🔹 Retrieval-Augmented Generation (RAG) is an AI framework that enhances Large Language Models (LLMs) by integrating real-time knowledge retrieval with text generation. Instead of relying solely on pre-trained knowledge, RAG retrieves relevant external documents 📚 to provide more accurate, context-aware responses.

🏗 RAG System Architecture

📊 Workflow Diagram

Source: What is Retrieval-Augmented Generation?

⚙️ How RAG Works (Aligned with Diagram)

🔹 Key Processing Steps

1️⃣ User Input 💬 → The user provides a query (e.g., a question or request for information).
2️⃣ Search Relevant Information 🔎 → The system searches an external knowledge base 📂 for relevant information.
3️⃣ Retrieve Information 📖 → The system finds the most contextually relevant documents and extracts useful content.
4️⃣ Augment Prompt 📝 → The retrieved information is merged with the original query, creating an enriched prompt.
5️⃣ Process with LLM 🧠 → The augmented prompt is sent to the language model to generate a fact-based answer.
6️⃣ Generate Final Response ✅ → The user receives a response grounded in retrieved knowledge.
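
The six steps above can be sketched in a few lines of plain Python. This is a toy illustration only: the knowledge base, the word-overlap ranking, and the `generate` stub are all assumptions made for this example, not part of the workshop code. A real system would use embeddings for retrieval and an actual LLM for generation.

```python
# Toy RAG loop: retrieve -> augment -> generate (all names are illustrative).

KNOWLEDGE_BASE = [
    "Solar panel efficiency improved in recent lab tests.",
    "Wind turbines are being built offshore at larger scale.",
    "Chess engines use tree search to pick moves.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace(".", "").replace("?", "").split())


def retrieve(query, documents, k=2):
    """Steps 2-3: rank documents by word overlap with the query, keep the top k."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def augment(query, context_docs):
    """Step 4: merge the retrieved context with the original query."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate(prompt):
    """Step 5: stand-in for an LLM call; a real system would send the prompt to a model."""
    return f"[model output for a prompt of {len(prompt)} characters]"


def answer(query):
    """Steps 1-6 end to end."""
    docs = retrieve(query, KNOWLEDGE_BASE)
    prompt = augment(query, docs)
    return generate(prompt)


print(answer("How efficient are solar panel designs?"))
```

Keeping retrieval, augmentation, and generation as separate functions makes it easy to swap any one stage later without touching the others.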

🔍 Core Components (Matching Workflow Elements)

Component	Function
User Query 💬	Initial question or request for information
Search Relevant Information 🔍	Finds matching knowledge in external sources
Knowledge Sources 📚	Databases, documents, or repositories storing retrievable data
Retrieved Context 📄	Relevant snippets extracted for augmentation
Enhanced Prompt ✍️	Merged user query + additional context
LLM Processing 🧠	Generates responses using both pre-trained knowledge and retrieved content

🎯 Why Use RAG?

✔ Reduces Hallucinations 🚫 – Grounds answers in retrieved documents instead of relying only on what the model memorized during training.
✔ More Context-Aware 📌 – Uses retrieved data to enhance LLM responses.
✔ Scalable & Efficient ⚡ – Works with large document repositories without retraining the model.
✔ Improves Accuracy 🎯 – Helps answers align with the retrieved sources.

🔄 Example Use Case

Scenario: AI-powered Research Assistant 📑

📌 User Query: "What are the latest advancements in renewable energy?"

🔍 Without RAG:

The LLM might generate an outdated response, because it can only draw on training data that ends at a cutoff date.

✅ With RAG:

The system retrieves recent research papers 📰 and trusted articles.
The model incorporates external knowledge, ensuring a current and factual answer.

🏁 Conclusion

By combining retrieval with generation, RAG significantly improves response accuracy by grounding LLM outputs in real-world information. This framework is widely used in applications like intelligent search engines 🔍, enterprise AI assistants 🤖, and automated research tools 📊.

🖥️ 1. Why Use HPC Instead of a Local Computer?

Limitations of Local Machines:

🚫 Limited memory (RAM) can slow down LLM inference and retrieval.
🚫 CPUs struggle with large datasets, making retrieval inefficient.
🚫 Fine-tuning needs GPU resources that most local machines lack.

Advantages of HPC:

✅ Faster processing: Leverages powerful CPUs and GPUs.
✅ Handles large-scale datasets: Works seamlessly with vector search (FAISS).
✅ Parallel processing: Multiple cores process data in parallel, accelerating RAG models.
✅ Remote execution: No need to burden local machines with heavy computation.
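
To make the parallel-processing point concrete, here is a minimal sketch that splits a list of texts into chunks and processes the chunks concurrently. The function names and the keyword-counting task are hypothetical, chosen only for illustration. It uses threads so that it runs anywhere; for CPU-heavy pure-Python work you would likely swap in `ProcessPoolExecutor`, since Python threads share one interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def count_matches(chunk, keyword):
    """Count how many texts in one chunk contain the keyword."""
    return sum(keyword in text.lower() for text in chunk)


def split_into_chunks(items, n_chunks):
    """Split a list into roughly equal consecutive chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_count(texts, keyword, workers=None):
    """Process chunks concurrently and combine the partial counts."""
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(texts, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(lambda c: count_matches(c, keyword), chunks))
    return sum(partial_counts)


print("Cores visible to Python:", os.cpu_count())
print("Matches:", parallel_count(["Photo Editor", "Chess", "photo lab"], "photo"))
```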

🔍 2. Access HPC Terminal via JupyterHub

1️⃣ Go to CSUSB HPC if you're a student or teacher at CSUSB. If not, ask a teacher from your school to create an account for you through the ACCESS CI program, which provides free access to computing tools like Jupyter for classroom use.
2️⃣ Click CI Logon to log in using your school account.
3️⃣ Select the GPU model that best fits your needs.
4️⃣ After logging in, you will land in JupyterLab.
✅ You're ready to go!

Important: Make sure to select Python 3 as your notebook kernel. This is essential for all code in this tutorial to work correctly!

We'll be switching between the Terminal (for installing packages) and the Jupyter Notebook (for running code). Make sure you're in the correct environment when following each step.

💻 3. Install Dependencies, Setup Kaggle, and Download Dataset

⚠️ Switch back to your Terminal for the following steps.

1️⃣ Click Terminal in JupyterLab.
2️⃣ Run the following commands:

🔗 ChatGPT prompt to generate the code

# Install Pandas, a powerful library for data manipulation and analysis, 
# providing data structures like DataFrames for handling structured data.
pip install --user pandas  

# Install Seaborn, a statistical data visualization library built on top of Matplotlib, 
# useful for creating informative and attractive graphs.
pip install --user seaborn  

# Install Matplotlib, a fundamental plotting library for Python, 
# enabling the creation of static, animated, and interactive visualizations.
pip install --user matplotlib  

# Install the Kaggle API library, which allows users to download datasets, 
# submit solutions, and interact with Kaggle's platform programmatically.
pip install --user kaggle  

# Install FAISS (Facebook AI Similarity Search), an efficient library for 
# similarity search and clustering of dense vectors, optimized for fast retrieval.
pip install --user faiss-cpu  

# Install Hugging Face's Transformers library, which provides pre-trained models 
# for various NLP tasks such as text classification, translation, and generation.
pip install --user transformers   

# Temporarily add ~/.local/bin (where pip --user installs the kaggle command) to your PATH
export PATH=~/.local/bin:$PATH  

# Make the change permanent by appending it to ~/.bashrc
echo 'export PATH=~/.local/bin:$PATH' >> ~/.bashrc  

# Apply the changes immediately
source ~/.bashrc  

# Create a new directory for dataset storage
mkdir -p ~/playstore_data  

# Navigate to dataset directory
cd ~/playstore_data  

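# Download the Google Playstore dataset from Kaggle
# Note: The Kaggle CLI needs your API credentials first, for example a
# kaggle.json token file placed in ~/.kaggle/ (created from your Kaggle account settings).
# Note: This dataset may not be found. If you encounter an error,
# check whether the dataset still exists or download it manually.
kaggle datasets download -d gauthamp10/google-playstore-apps  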

# Unzip the downloaded dataset
unzip google-playstore-apps.zip  

# List extracted files to confirm successful download
ls -lh

🔗 ChatGPT explanation for the code

3️⃣ Press Enter to run the commands in the Terminal and check the output!

✅ Now you have installed dependencies and downloaded the dataset! 🎉
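
If the download step fails with an authentication error, the Kaggle credentials are the first thing to check. The sketch below looks in the two places the Kaggle API commonly reads them from: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, and a `kaggle.json` file in `~/.kaggle/`. The helper name is made up for this example, and it only checks that credentials exist, not that they are valid.

```python
import os
from pathlib import Path


def kaggle_credentials_status(home=None, env=None):
    """Report where Kaggle API credentials were found, if anywhere."""
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env

    # Option 1: both environment variables are set
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment variables"

    # Option 2: the token file downloaded from your Kaggle account page
    token_file = home / ".kaggle" / "kaggle.json"
    if token_file.is_file():
        return f"token file at {token_file}"

    return "not found"


print("Kaggle credentials:", kaggle_credentials_status())
```

You can run this in a notebook cell before retrying the download in the Terminal.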

📚 4. Load and Preprocess the Dataset in Jupyter Notebook

⚠️ Switch back to your Jupyter Notebook for the following steps.

🐍 Create a New Python Notebook 📓

1️⃣ Click on the "+" button in the top left corner of the JupyterLab interface.
2️⃣ Select "Python 3 (ipykernel)" under the "Notebook" section.
3️⃣ This will create a new untitled Jupyter notebook where you can run your Python code.
4️⃣ You can rename the notebook by right-clicking on "Untitled.ipynb" in the file browser and selecting "Rename" ✏️
5️⃣ Choose a name you like for your notebook! Maybe "RAG_Workshop" or "MyFirstRAG" 🚀

Step 1: Load the Dataset into a Pandas DataFrame

➕🐍 Add a New Code Cell

1️⃣ Click + Code in Jupyter Notebook to add a new code cell.
2️⃣ Copy and paste the following code:

🔗 ChatGPT prompt to generate the code

# Import Pandas for data handling
import pandas as pd  

# Define dataset path
dataset_path = "~/playstore_data/Google-Playstore.csv"  

# Load dataset into Pandas DataFrame
playstore_df = pd.read_csv(dataset_path)  

# Display first few rows
print(playstore_df.head())

🔗 ChatGPT explanation for the code

3️⃣ Click Run (▶) and check the output.

✅ You should now see your dataset displayed! 🎉
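
If `pd.read_csv` raises a `FileNotFoundError` instead, the extracted file may have a different name than the one in `dataset_path`. A small sketch like the following lists the CSV files that are actually in the folder. The helper name `find_csv_files` is an assumption for this example; the folder path is the one created earlier in the Terminal.

```python
from pathlib import Path


def find_csv_files(directory):
    """Return the CSV files in a directory, sorted by name."""
    folder = Path(directory).expanduser()
    if not folder.is_dir():
        return []
    return sorted(folder.glob("*.csv"))


# Print each CSV file with its size so you can pick the right one
for csv_path in find_csv_files("~/playstore_data"):
    size_mb = csv_path.stat().st_size / 1_000_000
    print(f"{csv_path.name}: {size_mb:.1f} MB")
```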

Step 2: Clean and Prepare Data for RAG

➕🐍 Add a New Code Cell

1️⃣ Click + Code in Jupyter Notebook to add a new code cell.
2️⃣ Copy and paste the following code:

🔗 ChatGPT prompt to generate the code

# Check for missing values in the dataset - this helps identify data quality issues
missing_values = playstore_df.isnull().sum()  
print("Missing Values:\n", missing_values)  

# Remove rows with missing values - ensures complete data for analysis
# This step is important because incomplete data can cause errors in our analysis
playstore_df.dropna(inplace=True)  

# Remove duplicate entries - prevents bias from counting the same data multiple times
# Duplicates can skew statistics and analysis results
playstore_df.drop_duplicates(inplace=True)  

# Display cleaned dataset
print(playstore_df.head())

🔗 ChatGPT explanation for the code

3️⃣ Click Run (▶) and check the output.

✅ Data cleaning completed! Your dataset is now free of missing values and duplicates. Ready for accurate analysis! 📊✨🎉
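
One thing to keep in mind: `dropna()` with no arguments removes every row that has a missing value in any column, which can discard far more data than you expect. The sketch below uses a tiny made-up table (the rows and the website column are illustrative, not taken from the real dataset) to compare that behavior with `dropna(subset=...)`, which only requires the columns you name.

```python
import pandas as pd

# Tiny made-up table for illustration only
toy_df = pd.DataFrame(
    {
        "App Name": ["Alpha", "Beta", "Beta", None],
        "Category": ["Tools", "Games", "Games", "Tools"],
        "Developer Website": [None, "https://example.com", "https://example.com", None],
    }
)

# dropna() with no arguments drops every row that has ANY missing value
strict = toy_df.dropna().drop_duplicates()

# subset= only requires the listed columns to be present
lenient = toy_df.dropna(subset=["App Name", "Category"]).drop_duplicates()

print(f"Original rows: {len(toy_df)}")
print(f"After dropna(): {len(strict)}")
print(f"After dropna(subset=...): {len(lenient)}")
```

Which version fits depends on which columns your analysis actually needs.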

🤖 5. Implementing Retrieval-Augmented Generation (RAG)

Step 1: Import Required Libraries

➕🐍 Add a New Code Cell

1️⃣ Click + Code in the top left to add a new code cell.
2️⃣ Copy and paste the following code into the new code cell.

🔗 ChatGPT prompt to generate the code

# Import required libraries
import numpy as np
from transformers import pipeline
import faiss

🔗 ChatGPT explanation for the code

3️⃣ Click Run (▶) to import the libraries!

✅ Required libraries imported successfully! Ready to roll with NumPy, Transformers, and FAISS. 📚

⚠️ Don't worry about that warning! The message about "TqdmWarning: IProgress not found" is completely normal and won't affect your code at all.

💡 Pro tip: This is just a suggestion about an optional update for progress bars. You can safely ignore it and continue with the workshop - everything will work perfectly fine!
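
If the import cell fails with a `ModuleNotFoundError` instead of a warning, the notebook kernel may not see the packages installed from the Terminal. Here is a minimal sketch, using only the standard library, that reports which modules the current kernel cannot find. The list and the helper name are assumptions for this example. Note that the import name can differ from the pip name (`faiss-cpu` is imported as `faiss`).

```python
import importlib.util

# Import names, not pip names (faiss-cpu installs the module "faiss")
REQUIRED_MODULES = ["numpy", "pandas", "seaborn", "matplotlib", "faiss", "transformers"]


def missing_modules(module_names):
    """Return the modules that the current Python kernel cannot find."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```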

Step 2: Create Text Embeddings

➕🐍 Add a New Code Cell

1️⃣ Click + Code in the top left to add a new code cell.
2️⃣ Copy and paste the following code into the new code cell.

🔗 ChatGPT prompt to generate the code

import zlib

# Define a simple function to create placeholder embeddings for our text data
def create_embeddings(texts, dimension=768):
    """
    Create placeholder embeddings for demonstration purposes.

    Each vector is random, so it carries no meaning. It is seeded from the
    text itself, so the same text always maps to the same vector.
    In a real implementation, you would use a pre-trained embedding model.
    """
    embeddings = []

    for text in texts:
        # Seed a generator from the text so results are reproducible per text
        rng = np.random.default_rng(zlib.crc32(str(text).encode("utf-8")))
        # Create a random vector to represent the embedding
        embedding = rng.normal(0, 1, dimension)
        # Normalize the embedding vector to unit length
        embedding = embedding / np.linalg.norm(embedding)
        embeddings.append(embedding)

    return np.array(embeddings).astype('float32')

# Build the knowledge base from the 'App Name' column
# (this tutorial uses app names as stand-ins for descriptions)
# Limit to a smaller number for demonstration purposes
sample_size = min(1000, len(playstore_df))
app_descriptions = playstore_df['App Name'].iloc[:sample_size].tolist()

print(f"Created a knowledge base with {len(app_descriptions)} app names")

# Create embeddings for our knowledge base entries
embeddings = create_embeddings(app_descriptions)
print(f"Generated embeddings with shape: {embeddings.shape}")
🔗 ChatGPT explanation for the code

3️⃣ Click Run (▶) to create embeddings!

✅ Embeddings generated! You've successfully created a knowledge base of up to 1,000 app names (standing in for descriptions) and one embedding per entry. Ready for further analysis! 📊✨🎉
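
Because the embeddings above are random, they cannot capture which texts are similar. As a stepping stone toward a real embedding model, here is a minimal sketch of a hashed bag-of-words embedding written with the standard library only. The function names and the dimension of 1024 are assumptions for this example. Texts that share words end up with a higher cosine similarity, which random vectors cannot offer. It is still far weaker than a pre-trained model, since it knows nothing about synonyms or word order.

```python
import hashlib
import math


def hashed_embedding(text, dimension=1024):
    """Map text to a unit-length vector by hashing each word into a bucket."""
    vector = [0.0] * dimension
    for word in text.lower().split():
        # md5 gives a hash that is stable across Python sessions
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % dimension] += 1.0
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else vector


def cosine_similarity(a, b):
    """Dot product of two unit-length vectors equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


query_vector = hashed_embedding("photo editor")
for name in ["Photo Editor Pro", "Chess Game Online"]:
    print(f"{name}: {cosine_similarity(query_vector, hashed_embedding(name)):.2f}")
```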

Step 3: Build a FAISS Index for Retrieval

➕🐍 Add a New Code Cell

1️⃣ Click + Code in the top left to add a new code cell.
2️⃣ Copy and paste the following code into the new code cell.

🔗 ChatGPT prompt to generate the code

# Build a FAISS index for fast similarity search
dimension = embeddings.shape[1]  # Get the dimension of our embeddings
index = faiss.IndexFlatL2(dimension)  # Create a simple L2 distance index
index.add(embeddings)  # Add our embeddings to the index

print(f"Created FAISS index with {index.ntotal} vectors of dimension {dimension}")

🔗 ChatGPT explanation for the code

3️⃣ Click Run (▶) to build the search index!

✅ FAISS index created successfully! You've added 1000 vectors to the index, ready for fast similarity searches. 🚀🔍🎉
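
To see what a flat L2 index does conceptually, here is a brute-force sketch in plain Python: it compares the query against every stored vector and keeps the k closest. Like `IndexFlatL2`, it reports squared Euclidean distances. The function names and the toy vectors are made up for this example. FAISS performs the same exhaustive comparison with optimized native code, which is why it stays fast on much larger collections.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Compare the query with every stored vector and return the k closest.

    Returns (distances, indices), both ordered from nearest to farthest.
    """
    scored = sorted(
        (squared_l2(vector, query), position)
        for position, vector in enumerate(vectors)
    )
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
toy_distances, toy_indices = flat_l2_search(stored, [0.9, 0.1], k=2)
print(toy_indices, toy_distances)
```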

Step 4: Load a Text Generation Model

➕🐍 Add a New Code Cell

1️⃣ Click + Code in the top left to add a new code cell.
2️⃣ Copy and paste the following code into the new code cell.

🔗 ChatGPT prompt to generate the code

# Load a text generation model
try:
    # Try to load the text generation model
    model = pipeline("text-generation", model="gpt2")
    print("Successfully loaded GPT-2 model")
except Exception as e:
    print(f"Error loading model: {e}")
    print("Using a fallback mock model for demonstration")
    
    # Define a mock model function
    def mock_model(prompt, max_length=100, truncation=True):
        return [{"generated_text": f"This is a mock response to: '{prompt}'"}]
    
    model = mock_model

🔗 ChatGPT explanation for the code

3️⃣ Click Run (▶) to load the text generation model!

✅ Model loaded successfully! You've got GPT-2 ready for text generation. 📚🤖✨🎉 If there's an issue, a fallback mock model will kick in to keep things running smoothly. 🚑👌

Step 5: Define Query and Retrieve Information

➕🐍 Add a New Code Cell

1️⃣ Click + Code in the top left to add a new code cell.
2️⃣ Copy and paste the following code into the new code cell.

🔗 ChatGPT prompt to generate the code

# Define a user query
query = "What are the top categories in Google Playstore?"

# Create an embedding for the query
# (a random placeholder, so the matches below are not semantically meaningful)
query_embedding = create_embeddings([query])[0].reshape(1, -1)

# Search the FAISS index for the most similar documents
k = 5  # Number of similar documents to retrieve
distances, indices = index.search(query_embedding, k)

# Retrieve the most similar documents
retrieved_docs = [app_descriptions[i] for i in indices[0]]

print("Query:", query)
print(f"Retrieved {len(retrieved_docs)} similar documents:")
for i, doc in enumerate(retrieved_docs):
    print(f"{i+1}. {doc}")

# Create an augmented prompt with the retrieved documents
augmented_prompt = f"""
Query: {query}
Retrieved information:
{', '.join(retrieved_docs)}

Based on the above information, please answer the query.
"""

print("\nAugmented prompt created for the language model")

🔗 ChatGPT explanation for the code

3️⃣ Click Run (▶) to retrieve relevant information!

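✅ Augmented prompt created successfully! Your query has been matched with the 5 nearest documents in the knowledge base. Because the embeddings are random placeholders, these documents will not be truly related to your query until you swap in a real embedding model. Ready for the language model to work its magic! 🔍🤖✨🎉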

🎯 Challenge: Replace the default query with your own question about apps, such as "What are popular gaming apps?" 💡 Extra Tip: Try modifying both the query text and the number of documents to retrieve (k value) to see how it affects your results.
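
As you experiment with different queries and k values, it can help to build the prompt with a small function instead of editing the f-string each time. The sketch below is one possible layout, and the function name and wording are assumptions for this example. It numbers the retrieved documents and ends with "Answer:" so that a plain text-completion model such as GPT-2 has a natural place to continue.

```python
def build_prompt(query, documents, max_docs=5):
    """Format retrieved documents as a numbered context block above the question."""
    selected = documents[:max_docs]
    if selected:
        context = "\n".join(
            f"{number}. {doc}" for number, doc in enumerate(selected, start=1)
        )
    else:
        context = "(no documents retrieved)"
    return (
        "Retrieved information:\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


print(build_prompt("Which apps edit photos?", ["Photo Editor Pro", "Snap Filters"]))
```

In the notebook you could call it as `build_prompt(query, retrieved_docs, max_docs=k)`.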

Step 6: Generate a Response

➕🐍 Add a New Code Cell

1️⃣ Click + Code in the top left to add a new code cell.
2️⃣ Copy and paste the following code into the new code cell.

🔗 ChatGPT prompt to generate the code

# Generate a response from the augmented prompt
# (the GPT-2 pipeline and the mock fallback accept the same arguments, so one call covers both)
output = model(augmented_prompt, max_length=100, truncation=True)

# Extract generated response
generated_response = output[0]['generated_text']

# Print generated answer
print("\nGenerated Answer:", generated_response)

🔗 ChatGPT explanation for the code

3️⃣ Click Run (▶) to generate the response!

✅ Response generated! Your language model has crafted an answer based on the augmented prompt. Ready to see the AI’s take on your query! 🤖✨🎉
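
You may notice that the printed answer starts by repeating your whole prompt. By default, the Transformers text-generation pipeline returns the prompt followed by the continuation in `generated_text`. A small helper like the one below (the name `extract_answer` is made up for this example) keeps only the newly generated part. Passing `return_full_text=False` to the pipeline call should have a similar effect, though the mock fallback model in this tutorial does not accept that argument.

```python
def extract_answer(generated_text, prompt):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt is not a prefix
    return generated_text.strip()


example_prompt = "Question: What is FAISS?\nAnswer:"
example_output = example_prompt + " A library for similarity search."
print(extract_answer(example_output, example_prompt))
```

In the notebook you could call it as `extract_answer(generated_response, augmented_prompt)`.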

📊 6. Analyzing Retrieval Efficiency

➕🐍 Add a New Code Cell

1️⃣ Click + Code in Jupyter Notebook to add a new code cell.
2️⃣ Copy and paste the following code:

🔗 ChatGPT prompt to generate the code

# Import visualization libraries
import matplotlib.pyplot as plt  
import seaborn as sns  

# Set figure size for better visualization
plt.figure(figsize=(12, 5))  

# Create bar plot for app categories
sns.countplot(x=playstore_df["Category"])  

# Rotate x-axis labels for readability
plt.xticks(rotation=90)  

# Set title
plt.title("Distribution of App Categories in Google Playstore")  

# Display the plot
plt.show()

🔗 ChatGPT explanation for the code

3️⃣ Click Run (▶) and check the output.

✅ You should see a distribution of app categories.
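
To put a number on retrieval speed as well, you can time the search call itself. The sketch below is a generic timing helper built on `time.perf_counter`. The helper name and the sorting workload are stand-ins chosen so the example runs on its own.

```python
import time


def average_seconds(function, *args, repeats=5, **kwargs):
    """Call function(*args, **kwargs) several times and return the mean duration."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        function(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


# Stand-in workload so the example is self-contained
numbers = list(range(100_000, 0, -1))
print(f"Average time: {average_seconds(sorted, numbers) * 1000:.3f} ms")
```

In the notebook, `average_seconds(index.search, query_embedding, k)` would time the FAISS search from Step 5. Try different values of k or sample_size to see how the timing changes.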

🎯 7. Wrap-Up & Next Steps

🎉 Congratulations! You learned how to:

✅ Use HPC for Large-Scale Retrieval-Augmented Generation (RAG)
✅ Download and process large datasets efficiently
✅ Generate responses using an AI model
✅ Retrieve documents with FAISS and visualize the dataset

🚀 Next Workshop: 🔍 Ethical AI & Future Trends

🔗 Additional AI & HPC Resources 📚

Project Jupyter Documentation
Python Introduction (Use only the two green buttons "Previous" and "Next" to navigate the tutorial and avoid ads.)
Microsoft: RAG and Knowledge Retrieval Fundamentals
ACCESS CI (Free access to HPC for all using the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) U.S. government program)

🎉 Keep learning AI, and see you at the next workshop! 🚀

📝 Workshop Feedback Survey

Thanks for completing this workshop!🎆

We'd love to hear what you think so we can make future workshops even better. 💡

📌 Survey link