AI - robbiehume/CS-Notes GitHub Wiki

Read Later

Useful prompts

ChatGPT

  • I'm going to use this chat for learning more about health topics. I might ask you for answers, tips, or queries to learn more about specific topics. I may also send you links to save. For each link I send, read the article and give a headline title, a short summary (1-2 sentences), and a link to the URL. If you can't access the full article, then use the information from the URL itself or whatever you can get from the page. You don't need to save this information to memory, I just wanted to add it to the context. Do you have any questions or would any additional information be helpful?

Perplexity

  • For this query, accuracy and comprehensiveness are more important than speed. Please conduct an in-depth search and analysis. <query>

Services to look into

Look into

Overview

Humans are a mix of AI approaches

  • Sometimes we know "if this happens then do that" (AI)
  • Sometimes we've seen a lot of similar things before, and we classify them (ML)
  • Sometimes we haven't seen something before, but we have "learned" a lot of similar concepts, so we can make a decision (Deep Learning)
  • Sometimes, we get creative, and based on what we've learned, we can generate content (GenAI)

What is AI

  • AI is a broad field for the development of intelligent systems capable of performing tasks that typically require human intelligence:
    • Perception
    • Reasoning
    • Learning
    • Problem solving
    • Decision making
  • AI is an umbrella term for various techniques
  • Use cases:
    • Intelligent Document Processing (IDP): automatically extract structured data from various types of documents, such as invoices, contracts, and forms

AI Components

  • Data Layer: collect vast amounts of data
  • ML Framework and Algorithm Layer: data scientists and engineers work together to understand use cases, requirements, and frameworks that can solve them
  • Model Layer: implement a model and train it
    • Define the model structure, parameters, and functions, and set an optimizer function
  • Application Layer: how to serve the model and its capabilities to users

What is Machine Learning (ML)

  • ML is a type of AI for building methods that allow machines to learn
  • Data is leveraged to improve computer performance on a set of tasks
  • It's used to make predictions based on data used to train the model
  • You don't explicitly program the rules; instead, you give the data to the algorithm and it builds its own model to classify the data or understand how it's structured

What is Deep Learning (DL)

  • It is a subset of ML
  • It uses neurons and synapses (like our brain) to train a model
  • It's able to process more complex patterns in the data than traditional ML
  • It's called Deep Learning because there's more than one layer of learning
  • Ex:
    • Computer Vision: image classification, object detection, image segmentation
    • NLP: text classification, sentiment analysis, machine translation, language generation
  • To have a good DL model you need a very large amount of input data and a GPU

What is Generative AI (GenAI)

  • It's a subset of Deep Learning
  • It's built on multipurpose foundation models backed by neural networks
  • They can be fine-tuned if necessary to better fit our use cases
  • GenAI utilizes Transformer Models (LLM)
    • They're able to process a sentence as a whole instead of word by word
    • This provides faster and more efficient text processing (less training time)
    • They give relative importance to specific words in a sentence (more coherent sentences)
  • Transformer-based LLMs
    • Powerful models that can understand and generate human-like text
    • Trained on vast amounts of text data from the internet, books, and other sources, and learn patterns and relationships between words and phrases
    • Ex: Google BERT, ChatGPT (based on GPT, the Generative Pre-trained Transformer)
  • Diffusion models for images

When is ML NOT appropriate

  • For deterministic problems (the solution can be computed), it's better to write the computer code that is adapted to the problem
    • If we use (un)supervised learning or reinforcement learning, we may have an "approximation" of the result

Phases of an ML Project

  • Fill in from udemy video 66

Define Business Goals

ML Problem Framing

Data Collection & Preparation

Model Development

Model Evaluation

Model Deployment

Model Monitoring

Model Iterations

Hyperparameter Tuning

Hyperparameter

  • Settings that define the model structure and control the learning algorithm and training process
  • Set before training begins
  • Examples: learning rate, batch size, number of epochs, and regularization
  • Hyperparameters have nothing to do with the data, they're just about the algorithm used to train the model

Hyperparameter Tuning

  • Finding the best hyperparameter values to optimize the model performance
  • Improves model accuracy, reduces overfitting, and enhances generalization

Implementations

  • Grid search, random search
  • Using services such as SageMaker Automatic Model Tuning (AMT)
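
As a rough illustration of grid search and random search (not SageMaker AMT), here's a sketch using scikit-learn; the estimator, parameter values, and dataset are arbitrary choices for demonstration:

```python
# Illustrative hyperparameter tuning with grid search and random search (scikit-learn assumed installed)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try (illustrative, not recommendations)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid search: exhaustively tries every combination with 5-fold cross-validation
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)
print("Grid search best params:", grid.best_params_)

# Random search: samples a fixed number of combinations (cheaper for large search spaces)
rand = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_grid, n_iter=5, cv=5, random_state=42)
rand.fit(X, y)
print("Random search best params:", rand.best_params_)
```

SageMaker AMT automates a similar search in the cloud; the scikit-learn version just shows the idea locally.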

Important Hyperparameters

  • Learning rate
    • How large or small the steps are when updating the model's weights during training
    • High learning rate can lead to faster convergence, but risks overshooting the optimal solution
    • Low learning rate may result in more precise but slower convergence
  • Batch size
    • How many training examples used to update the model weights in one iteration
    • Smaller batches can lead to more stable learning, but require more time to compute
    • Larger batches are faster but may lead to less stable updates
  • Number of Epochs
    • How many times the model will iterate over the entire training dataset
    • Too few epochs can lead to underfitting
    • Too many epochs may cause overfitting
  • Regularization
    • Adjusts the balance between a simple and a complex model
    • Increase regularization to reduce overfitting
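
To show where these hyperparameters actually live, here's a minimal training-loop sketch in PyTorch (assumed installed); the model, data, and specific values are placeholders, not recommendations:

```python
# Illustrative only: shows where each hyperparameter plugs into a training loop (PyTorch assumed)
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 1,000 samples with 20 features, binary labels
X = torch.randn(1000, 20)
y = torch.randint(0, 2, (1000,))
dataset = TensorDataset(X, y)

# Hyperparameters (set before training begins)
learning_rate = 1e-3   # step size for weight updates
batch_size = 32        # examples per weight update
num_epochs = 10        # full passes over the training data
weight_decay = 1e-4    # L2 regularization strength

loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```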

Training Data

  • To train our model we must have good data (garbage in --> garbage out)
  • Collecting data, cleaning it, and ensuring it's useful for your purpose is one of the most critical parts of building a good model

Labeled vs Unlabeled Data

  • Labeled data: data that includes both input features and corresponding output labels
    • Ex: dataset with images of animals where each image is labeled with the corresponding animal type (cat, dog, etc.)
    • Use case: supervised learning, where the model is trained to map inputs to known outputs
  • Unlabeled data: data that includes only input features without any output labels
    • Ex: a collection of images without any associated labels
    • Use case: unsupervised learning, where the model tries to find patterns or structures in the data

Structured vs Unstructured Data

  • Structured data: data that is organized in a structured format, often in rows and columns (like Excel)
    • Tabular data: data is arranged in a table with rows representing records and columns representing features
      • Ex: customers database with fields such as name, age, and total purchase amount
    • Time series data: data points collected or recorded at successive points in time
      • Ex: stock prices recorded daily over a year
  • Unstructured data: data that doesn't follow a specific structure and is often text-heavy or multimedia content
    • Text data: unstructured data such as articles, social media posts, or customer reviews
      • Ex: a collection of product reviews from an e-commerce site
    • Image data: data in the form of images, which can vary widely in format and content
      • Ex: images used for object recognition tasks

Training vs Validation vs Test Set

  • Training set: used to train the model
    • Percentage: typically 60-80% of the dataset
    • Ex: 800 labeled images from a dataset of 1,000 images
  • Validation set: used to tune model parameters and validate performance
    • Percentage: typically 10-20% of the dataset
    • Ex: 100 labeled images for hyperparameter tuning (tune the settings of the algorithm to make it more efficient)
  • Test set: used to evaluate the final model performance
    • Percentage: typically 10-20% of the dataset
    • Ex: 100 labeled images to test the model's accuracy
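
One common way to produce the split described above is two successive calls to scikit-learn's train_test_split; this sketch assumes scikit-learn is available and uses placeholder data to hit the typical 80/10/10 percentages:

```python
# Hypothetical 80/10/10 train/validation/test split (scikit-learn assumed)
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)           # placeholder features
y = np.random.randint(0, 2, 1000)     # placeholder labels

# First carve out the 80% training set
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
# Then split the remaining 20% evenly into validation and test sets (10% each overall)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800, 100, 100
```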

Feature Engineering

  • The process of using domain knowledge to select and transform raw data into meaningful features
  • It helps enhance the performance of ML models
  • It's especially meaningful for supervised learning
  • Techniques:
    • Feature Extraction: extracting useful info from raw data
      • Ex: deriving age from date of birth
    • Feature Selection: selecting a subset of relevant features
      • Ex: choosing a subset of the data to use only the important predictors in a regression model
    • Feature Transformation: transforming the data for better model performance
      • Ex: normalizing numerical data
  • Feature Engineering on Structured Data (tabular data)
    • Ex: predicting house prices based on features like size, location, and number of rooms
    • Feature engineering tasks:
      • Feature creation: deriving new features like "price per square foot"
      • Feature selection: identifying and retaining important features such as location or number of bedrooms
      • Feature transformation: normalizing features to ensure they are on a similar scale, which helps algorithms like gradient descent converge faster
  • Feature Engineering on Unstructured Data (text, images)
    • Ex: sentiment analysis of customer reviews
    • Feature engineering tasks
      • Text Data: converting text into numerical features using techniques like TF-IDF or word embeddings
      • Image Data: extracting features such as edges or textures using techniques like convolutional neural networks (CNNs)
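
A small pandas sketch of the three structured-data tasks from the house-price example above (feature creation, selection, and transformation); the column names and values are made up for illustration:

```python
# Illustrative feature engineering on a tiny, made-up housing table (pandas assumed)
import pandas as pd

df = pd.DataFrame({
    "price": [300000, 450000, 250000],
    "sqft": [1500, 2200, 1100],
    "bedrooms": [3, 4, 2],
    "listing_id": [101, 102, 103],
})

# Feature creation: derive a new feature from existing ones
df["price_per_sqft"] = df["price"] / df["sqft"]

# Feature selection: keep only the predictors we believe matter (drop the ID column)
features = df[["sqft", "bedrooms", "price_per_sqft"]]

# Feature transformation: min-max normalize so all features share a similar scale
normalized = (features - features.min()) / (features.max() - features.min())
print(normalized)
```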

ML Algorithms

Supervised Learning

  • Want to learn a mapping function that can predict the output for new unseen input data
  • Models are trained on labeled data: very powerful, but difficult to perform on millions of datapoints
  • Techniques:
    • Classification: predicts a discrete categorical label for the input data
      • Use cases: scenarios where decisions or predictions need to be made between distinct categories (fraud, image classification, customer retention, diagnostics)
      • Examples:
        • Binary classification (one or the other): classify emails as "spam" or "not spam"
        • Multi-class classification (more than two): classify animals in a zoo as "mammal", "bird", or "reptile"
        • Multi-label classification (can assign multiple to one): assign multiple labels to a movie, like "action" and "comedy"
      • Key algorithm: K-nearest neighbors (k-NN) model
    • Regression: predicts a continuous numeric value
      • Use cases: used when the goal is to predict a quantity or real value
      • Examples: probabilities or scores; sales forecasts, temperature predictions
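
Since k-NN is called out above as a key classification algorithm, here's a minimal scikit-learn sketch; the dataset and the choice of k = 5 are just for illustration:

```python
# Illustrative k-NN classifier on the built-in iris dataset (scikit-learn assumed)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# k=5: each prediction is the majority label among the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```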

Unsupervised Learning

  • Models work with unlabeled data to find patterns, relationships, groupings, or underlying structures
  • The machine must uncover and create the groups itself, but humans still put labels on the output groups
  • Even though it uses unlabeled data, feature engineering can still help improve the quality of the training data
  • Techniques:
    • Clustering: used to group similar data points together into clusters based on their features
      • Use cases: customer segmentation, targeted marketing, recommender systems
      • Example: customer segmentation:
        • Scenario: e-commerce company wants to segment its customers to understand different purchasing behaviors
        • Data: a dataset contains customer purchase history (e.g. purchase frequency, average order value)
        • Goal: identify distinct groups of customers based on their purchasing behavior
        • Technique: K-means clustering (see the sketch after this list)
        • Outcome: the company can target each segment with tailored marketing strategies
    • Dimensionality Reduction
    • Anomaly Detection:
      • Example: Fraud Detection
        • Scenario: detect fraudulent credit card transactions
        • Data: transaction data, including amount, location, and time
        • Goal: identify transactions that deviate significantly from typical behavior
        • Technique: Isolation Forest
        • Outcome: the system flags potentially fraudulent transactions for further investigation
    • Association Rule Learning:
      • Example: Market Basket Analysis
        • Scenario: supermarket wants to understand which products are frequently bought together
        • Data: transaction records from customer purchases
        • Goal: identify associations between products to optimize product placement and promotions
        • Technique: Apriori algorithm
        • Outcome: the supermarket can place associated products together to boost sales
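
Here's the K-means sketch referenced in the customer-segmentation example; the two features and the choice of 3 clusters are assumptions made up for illustration (scikit-learn assumed available):

```python
# Illustrative K-means clustering on made-up customer data (scikit-learn assumed)
import numpy as np
from sklearn.cluster import KMeans

# Placeholder features: [purchase frequency per month, average order value]
customers = np.array([
    [1, 20], [2, 25], [1, 22],      # low-frequency, low-value shoppers
    [8, 30], [9, 35], [10, 28],     # frequent, moderate-value shoppers
    [3, 200], [2, 180], [4, 220],   # infrequent, high-value shoppers
])

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(customers)
print("Cluster assignments:", labels)
print("Cluster centers:\n", kmeans.cluster_centers_)
```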

Semi-Supervised Learning

  • Use a small amount of labeled data and a large amount of unlabeled data to train systems
    • It's useful because labeled data is valuable but expensive to obtain, so labeling only part of the dataset is a happy medium
  • After that, the partially trained algorithm itself labels the unlabeled data
    • This is called pseudo-labeling
  • Now that everything is labeled, retrain the model on the entire dataset
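
A rough sketch of that pseudo-labeling flow using scikit-learn; the confidence threshold, model choice, and synthetic data are all illustrative assumptions:

```python
# Illustrative pseudo-labeling: train on a small labeled set, label the rest, retrain (scikit-learn assumed)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=42)
X_labeled, y_labeled = X[:100], y[:100]   # small labeled portion
X_unlabeled = X[100:]                     # large unlabeled portion

# 1) Train on the labeled data only
model = LogisticRegression().fit(X_labeled, y_labeled)

# 2) Pseudo-label the unlabeled data, keeping only confident predictions (threshold is arbitrary)
probs = model.predict_proba(X_unlabeled).max(axis=1)
confident = probs > 0.9
pseudo_labels = model.predict(X_unlabeled[confident])

# 3) Retrain on labeled + confidently pseudo-labeled data
X_combined = np.vstack([X_labeled, X_unlabeled[confident]])
y_combined = np.concatenate([y_labeled, pseudo_labels])
model = LogisticRegression().fit(X_combined, y_combined)
```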

Self-Supervised Learning

  • Have a model generate pseudo-labels for its own data without having humans label any data first
    • This is useful since having humans label data can be expensive
  • Then, using the pseudo labels, solve problems traditionally solved by supervised learning
  • This is widely used in NLP (to create the BERT and GPT models for example) and in image recognition tasks
  • Self-supervised learning intuitive example:
    • Create "pre-text tasks" to have the model solve simple tasks to learn patterns in the dataset
    • Pretext tasks are not "useful" as-is, but will teach our model to create a "representation" of our dataset
      • Predict any part of the input from any other part
      • Predict the future from the past
      • Predict the masked from the visible
      • Predict any occluded part from all available parts
    • After solving the pre-text tasks, we have a model trained that can solve our end goal: "downstream tasks"

Reinforcement Learning (RL)

  • A type of ML where an agent interacts with an environment and learns to make optimal decisions by receiving rewards or penalties
  • Good YouTube channel: channel, video
  • Key concepts:
    • Agent: the learner or decision-maker
    • Environment: the external system the agent interacts with
    • Action: the choices made by the agent
    • Reward: the feedback from the environment based on the agent's actions
    • State: the current situation of the environment
    • Policy: the strategy the agent uses to determine actions based on the state
  • How does RL work?
    • Learning Process
      • The Agent observes the current State of the Environment
      • It selects an Action based on its Policy
      • The environment transitions to a new State and provides a Reward
      • The Agent updates its Policy to improve future decisions
    • Goal: Maximize cumulative reward over time
  • Example: RL in action (see the code sketch after this section)
    • Scenario: training a robot to navigate a maze
    • Steps: robot (Agent) observes its position (State)
      • Chooses a direction to move (Action)
      • Receives a reward (-1 for taking a step, -10 for hitting a wall, +100 for going to the exit)
      • Updates its Policy based on the Reward and new position
    • Outcome: the robot learns to navigate the maze efficiently over time
  • Applications of RL
    • Gaming – teaching AI to play complex games (e.g., Chess, Go)
    • Robotics – navigating and manipulating objects in dynamic environments
    • Finance – portfolio management and trading strategies
    • Healthcare – optimizing treatment plans
    • Autonomous Vehicles – path planning and decision-making
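
Here's the code sketch referenced in the maze example: a tiny tabular Q-learning loop on a 1-D corridor. The environment, reward values, and hyperparameters are made up, and Q-learning is just one common RL algorithm, not the only approach:

```python
# Toy Q-learning: agent learns to walk right along a 5-cell corridor to reach the exit at cell 4
import random

n_states, n_actions = 5, 2              # states 0..4; actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate

def step(state, action):
    """Environment: returns (next_state, reward). Exit at state 4 gives +100, each step costs -1."""
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 100 if next_state == n_states - 1 else -1
    return next_state, reward

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy policy: mostly exploit the best known action, sometimes explore
        action = random.randrange(n_actions) if random.random() < epsilon else Q[state].index(max(Q[state]))
        next_state, reward = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted best future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print([row.index(max(row)) for row in Q])  # greedy action per state (the terminal state is never updated)
```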

RLHF: Reinforcement Learning from Human Feedback

  • Use human feedback to help ML models self-learn more efficiently
  • RLHF significantly enhances the model performance
  • In RL there's a reward function. In RLHF, human feedback is incorporated in the reward function, to be more aligned with human goals, wants, and needs
    • First, the model's responses are compared to human responses
    • Then, a human assesses the quality of the model's responses
  • RLHF is used throughout GenAI applications, including LLM models
  • Ex: grading text translations from "technically correct" to "human"
  • For the AWS exam, mostly focus on knowing the 4 steps of RLHF below
  • Example of how RLHF works: internal company knowledge chatbot
    • Data collection
      • Set of human-generated prompts and responses are created
      • “Where is the location of the HR department in Boston?”
    • Supervised fine-tuning of a language model
      • Fine-tune an existing model with internal knowledge
      • Then the model creates responses for the human-generated prompts
      • Responses are mathematically compared to human-generated answers
    • Build a separate reward model
      • Humans can indicate which response they prefer from the same prompt
      • The reward model can now estimate how a human would prefer a prompt response
    • Optimize the language model with the reward-based model
      • Use the reward model as a reward function for RL
      • This part can be fully automated

Model Fit, Bias, and Variance

  • You want no underfitting or overfitting, and low bias and variance

Model Fit

  • In case your model has poor performance, you need to look at its fit
  • Overfitting (high variance): model performs well on the training data but doesn't perform well on the evaluation data
    • This corresponds to high variance
    • Occurs due to:
      • Training data size too small or doesn't represent all possible input values
      • The model trains too long on a single sample set of data
      • Model complexity is high and learns from the "noise" within the training data
    • How to prevent it:
      • Increase training data size
      • Early stopping the training of the model
      • Can do data augmentation to increase the diversity in the dataset
      • Can adjust hyperparameters, but usually not the best method
  • Underfitting (high bias): model performs poorly on training data
    • Could be a problem of having model too simple or poor data features
  • What you want is a balanced fit: neither overfitting nor underfitting
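
As a concrete sketch of the "early stopping" prevention technique above, here's a generic loop that halts when validation loss stops improving; the patience value and the simulated losses are illustrative assumptions rather than any framework's built-in callback:

```python
# Illustrative early stopping: halt when validation loss hasn't improved for `patience` epochs
def train_with_early_stopping(train_one_epoch, compute_val_loss, max_epochs=100, patience=5):
    """train_one_epoch() runs one training epoch; compute_val_loss() returns the current validation loss."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = compute_val_loss()
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}: validation loss stopped improving")
            break

# Tiny simulated usage: validation loss improves, then plateaus, triggering the stop
losses = iter([1.0, 0.8, 0.7, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76])
train_with_early_stopping(lambda: None, lambda: next(losses), max_epochs=10, patience=3)
```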

Bias (high = underfitting)

  • The difference or error between the predicted and actual value
  • Occurs due to the wrong choice of model or assumptions in the ML process
  • High bias: means the model doesn't closely match the training data
    • Ex: linear regression function on a non-linear dataset
    • Considered as underfitting
  • Reducing the bias
    • Use a more complex model
    • Or increase the number of features

Variance (high = overfitting)

  • How much the performance of a model changes if trained on a different dataset that has a similar distribution
  • High variance: means the model is very sensitive to changes in the training data
    • This occurs when overfitting: the model performs well on training data, but poorly on unseen test data
  • Reducing the variance
    • Use feature selection: consider fewer features (only the important ones)
    • Split the training and test data multiple times

Model Evaluation

Binary and multi-class classification

  • Confusion Matrix
    • Uses true positives (TP) and negatives (TN), false positives (FP) and negatives (FN)
    • It's the best way to evaluate performance of a model that does classifications
    • Can be used for binary classification or for multi-class classification (multi-dimensional confusion matrix)
  • Metrics
    • Precision – Best when false positives (FP) are costly
      • Precision = TP / (TP + FP)
    • Recall – Best when false negatives (FN) are costly
      • Recall = TP / (TP + FN)
    • F1 Score – Best when you want a balance between precision and recall, especially in imbalanced datasets
      • F1 = (2 * Precision * Recall) / (Precision + Recall)
    • Accuracy (rarely used) – Best for balanced datasets
      • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • For the AWS exam, you don't need to know the formulas, just need to know precision, recall, f1, and accuracy are used for binary classification
  • Area Under the ROC Curve (AUC)
    • Used for performance evaluation of binary classification models making probabilistic predictions
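
A quick sketch computing all of these metrics from made-up predictions with scikit-learn (assumed available); note that AUC needs predicted probabilities rather than hard labels:

```python
# Illustrative: confusion matrix and classification metrics on made-up binary predictions (scikit-learn assumed)
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                    # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities, used for AUC

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))    # uses probabilities, not hard labels
```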

Regression Metrics

  • MAE (Mean Absolute Error)
  • MAPE (Mean Absolute Percentage Error)
  • RMSE (Root Mean Squared Error)
  • R^2 (R-squared)
  • These are used for evaluating models that predict a continuous value (i.e. regressions)
    • For MAE, MAPE, and RMSE, the lower the better
    • For R^2, the closer to 1 the better
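
A matching sketch for the regression metrics, again with made-up values and scikit-learn assumed (RMSE is computed here as the square root of MSE):

```python
# Illustrative regression metrics on made-up predictions (scikit-learn assumed)
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])

print("MAE: ", mean_absolute_error(y_true, y_pred))             # lower is better
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))  # lower is better
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))     # lower is better
print("R^2: ", r2_score(y_true, y_pred))                        # closer to 1 is better
```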

Inferencing

  • Inferencing is when a model is making a prediction on new data
  • Real time
    • Computers have to make decisions quickly as data arrives
    • Speed is preferred over perfect accuracy
    • Ex: chatbots
  • Batch
    • Large amount of data that is analyzed all at once
    • Often used for data analysis
    • Speed of the results is usually not a concern, accuracy is
  • Inferencing at the Edge
    • Edge devices usually have less computing power and are close to where the data is generated, in places where internet connections can be limited
    • Small Language Model (SLM) on the edge device
      • Very low latency
      • Low compute footprint
      • Offline capability; local inference
    • Large Language Model (LLM) on a remote server
      • More powerful model
      • Higher latency
      • Must be online to be accessed

Responsible AI & Security

Responsible AI

  • Making sure AI systems are transparent and trustworthy
  • Need to mitigate potential risk and negative outcomes
  • Core facets of Responsible AI
    • Fairness: promote inclusion and prevent discrimination
    • Explainability
    • Privacy and security: individuals control when and if their data is used
    • Transparency:
    • Veracity and robustness: reliable even in unexpected situations
    • Governance: define, implement and enforce responsible AI practices
    • Safety: algorithms are safe and beneficial for individuals and society
    • Controllability: ability to align to human values and intent
  • AWS Services for Responsible AI
    • Bedrock: human or automatic model evaluation
    • Guardrails for Bedrock
    • SageMaker Clarify:
      • FM evaluation on accuracy, robustness, and toxicity
      • Bias detection (e.g., data skewed towards one race)
    • SageMaker Data Wrangler: fix bias by balancing dataset (e.g. with augmented data)
    • SageMaker Model Monitor: quality analysis in production
    • Amazon Augmented AI (A2I): human review of ML predictions
    • Governance: SageMaker Role Manager, Model Cards, and Model Dashboard
    • Also have AWS AI Service Cards for some services

Security

  • Ensure that confidentiality, integrity, and availability are maintained
  • This applies to your data, information assets, and infrastructure

Governance & Compliance

Governance

  • Ensure AI adds value and manages risk in the operation of the business
  • Need clear policies, guidelines, and oversight mechanisms to ensure AI systems align with legal and regulatory requirements

Compliance

  • Ensure adherence to regulations and guidelines
  • Especially for sensitive domains such as healthcare, finance, and legal applications

Prompt tips

Parameters
