Sometimes we know "if this happens then do that" (AI)
Sometimes we've seen a lot of similar things before, and we classify them (ML)
Sometimes we haven't seen something before, but we have "learned" a lot of similar concepts, so we can make a decision (Deep Learning)
Sometimes, we get creative, and based on what we've learned, we can generate content (GenAI)
What is AI
AI is a broad field for the development of intelligent systems capable of performing tasks that typically require human intelligence:
Perception
Reasoning
Learning
Problem solving
Decision making
AI is an umbrella term for various techniques
Use cases:
Intelligent Document Processing (IDP): automatically extract structured data from various types of documents, such as invoices, contracts, and forms
AI Components
Data Layer: collect vast amounts of data
ML Framework and Algorithm Layer: data scientists and engineers work together to understand use cases, requirements, and frameworks that can solve them
Model Layer: implement a model and train it
We define the model's structure, parameters, and functions, and set an optimizer function
Application Layer: how to serve the model and its capabilities to users
What is Machine Learning (ML)
ML is a type of AI for building methods that allow machines to learn
Data is leveraged to improve computer performance on a set of tasks
It's used to make predictions based on data used to train the model
You don't explicitly program the rules; you just give data to the algorithm, and it builds its own model to classify the data or understand how it's structured
What is Deep Learning (DL)
It is a subset of ML
It uses neurons and synapses (like our brain) to train a model
It's able to process more complex patterns in the data than traditional ML
It's called Deep Learning because there's more than one layer of learning
NLP: text classification, sentiment analysis, machine translation, language generation
To have a good DL model you need a very large amount of input data and a GPU
What is Generative AI (GenAI)
It's a subset of Deep Learning
It uses multipurpose foundation models backed by neural networks
They can be fine-tuned if necessary to better fit our use cases
GenAI utilizes Transformer models (e.g., LLMs)
They're able to process a sentence as a whole instead of word by word
It provides faster and more efficient text processing (less training time)
It gives relative importance to specific words in a sentence (more coherent sentences)
Transformer-based LLMs
Powerful models that can understand and generate human-like text
Trained on vast amounts of text data from the internet, books, and other sources, and learn patterns and relationships between words and phrases
Ex: Google BERT, ChatGPT (Generative Pre-trained Transformer)
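One quick way to try a pretrained transformer-based model is Hugging Face's transformers library; a minimal sketch, assuming the library is installed (the first call downloads a default sentiment model):

```python
# A quick way to run a pretrained transformer-based model locally,
# using Hugging Face's transformers library (assumed installed;
# the first call downloads a default sentiment-analysis model).
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers process the whole sentence at once."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```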
Diffusion models for images
When is ML NOT appropriate
For deterministic problems (the solution can be computed), it's better to write the computer code that is adapted to the problem
If we use (un)supervised learning or reinforcement learning, we may have an "approximation" of the result
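A minimal sketch of the point above: for a fully deterministic problem like compound interest, plain code gives the exact answer, so a trained model could only approximate it:

```python
# Compound interest is fully deterministic: plain code computes the
# exact answer, so an ML model could only approximate it.
def compound_interest(principal: float, rate: float, years: int) -> float:
    return principal * (1 + rate) ** years

print(compound_interest(1000, 0.05, 10))  # exact result, no training data needed
```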
Phases of an ML Project
Define Business Goals
ML Problem Framing
Data Collection & Preparation
Model Development
Model Evaluation
Model Deployment
Model Monitoring
Model Iterations
Hyperparameter Tuning
Hyperparameter
Settings that define the model structure and learning algorithm and process
Set before training begins
Examples: learning rate, batch size, number of epochs, and regularization
Hyperparameters have nothing to do with the data; they're only about the algorithm used to train the model
Hyperparameter Tuning
Finding the best hyperparameter values to optimize the model performance
Improves model accuracy, reduces overfitting, and enhances generalization
Implementations
Grid search, random search
Using services such as SageMaker Automatic Model Tuning (AMT)
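A minimal grid search sketch, assuming scikit-learn and a toy dataset (the candidate values in the grid are illustrative):

```python
# Grid search: try every combination of candidate hyperparameter values
# and keep the best one, scored by cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=42)  # toy dataset

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # illustrative values

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```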
Important Hyperparameters
Learning rate
How large or small the steps are when updating the model's weights during training
High learning rate can lead to faster convergence, but risks overshooting the optimal solution
Low learning rate may result in more precise but slower convergence
Batch size
How many training examples used to update the model weights in one iteration
Smaller batches can lead to more stable learning, but require more time to compute
Larger batches are faster but may lead to less stable updates
Number of Epochs
How many times the model will iterate over the entire training dataset
Too few epochs can lead to underfitting
Too many epochs may cause overfitting
Regularization
Adjusting the balance between a simple and a complex model
Increase regularization to reduce overfitting
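A minimal sketch showing where these four hyperparameters are set, using scikit-learn's MLPClassifier as one illustrative choice (the parameter names are sklearn's):

```python
# All four hyperparameters set on one model, using scikit-learn's
# MLPClassifier (one illustrative choice; parameter names are sklearn's).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)

model = MLPClassifier(
    learning_rate_init=0.001,  # learning rate: step size for weight updates
    batch_size=32,             # batch size: examples per weight update
    max_iter=50,               # epochs: passes over the training set
    alpha=0.0001,              # regularization (L2) strength
    random_state=0,
)
model.fit(X, y)
```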
Training Data
To train our model we must have good data (garbage in --> garbage out)
Collecting data, cleaning it, and ensuring it's useful for your purpose is one of the most critical parts of building a good model
Labeled vs Unlabeled Data
Labeled data: data that includes both input features and corresponding output labels
Ex: dataset with images of animals where each image is labeled with the corresponding animal type (cat, dog, etc.)
Use case: supervised learning, where the model is trained to map inputs to known outputs
Unlabeled data: data that includes only input features without any output labels
Ex: a collection of images without any associated labels
Use case: unsupervised learning, where the model tries to find patterns or structures in the data
Structured vs Unstructured Data
Structured data: data that is organized in a structured format, often in rows and columns (like Excel)
Tabular data: data is arranged in a table with rows representing records and columns representing features
Ex: customers database with fields such as name, age, and total purchase amount
Time series data: data points collected or recorded at successive points in time
Ex: stock prices recorded daily over a year
Unstructured data: data that doesn't follow a specific structure and is often text-heavy or multimedia content
Text data: unstructured data such as articles, social media posts, or customer reviews
Ex: a collection of product reviews from an e-commerce site
Image data: data in the form of images, which can vary widely in format and content
Ex: images used for object recognition tasks
Training vs Validation vs Test Set
Training set: used to train the model
Percentage: typically 60-80% of the dataset
Ex: 800 labeled images from a dataset of 1,000 images
Validation set: used to tune model parameters and validate performance
Percentage: typically 10-20% of the dataset
Ex: 100 labeled images for hyperparameter tuning (tune the settings of the algorithm to make it more efficient)
Test set: used to evaluate the final model performance
Percentage: typically 10-20% of the dataset
Ex: 100 labeled images to test the model's accuracy
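A minimal sketch of an 80/10/10 split built from two calls to scikit-learn's train_test_split, matching the 1,000-example breakdown above:

```python
# An 80/10/10 split built from two calls to train_test_split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve out the 80% training set...
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the remaining 20% evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```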
Feature Engineering
The process of using domain knowledge to select and transform raw data into meaningful features
It helps enhance the performance of ML models
It's especially meaningful for supervised learning
Techniques:
Feature Extraction: extracting useful info from raw data
Ex: deriving age from date of birth
Feature Selection: selecting a subset of relevant features
Ex: keeping only the important predictors in a regression model
Feature Transformation: transforming the data for better model performance
Ex: normalizing numerical data
Feature Engineering on Structured Data (tabular data)
Ex: predicting house prices based on features like size, location, and number of rooms
Feature engineering tasks:
Feature creation: deriving new features like "price per square foot"
Feature selection: identifying and retaining important features such as location or number of bedrooms
Feature transformation: normalizing features to ensure they are on a similar scale, which helps algorithms like gradient descent converge faster
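A minimal sketch of those three tasks on a made-up house-price table (the column names are invented for illustration), assuming pandas and scikit-learn:

```python
# Feature creation, selection, and transformation on a tiny house-price table.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "price": [250_000, 400_000, 320_000],
    "sqft": [1_000, 2_000, 1_600],
    "bedrooms": [2, 4, 3],
    "year_built": [1990, 2005, 1978],
})

# Feature creation: derive "price per square foot"
df["price_per_sqft"] = df["price"] / df["sqft"]

# Feature selection: keep only the predictors believed to matter
features = df[["sqft", "bedrooms", "price_per_sqft"]]

# Feature transformation: normalize so all columns share a similar scale
scaled = StandardScaler().fit_transform(features)
print(scaled)
```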
Feature Engineering on Unstructured Data (text, images)
Ex: sentiment analysis of customer reviews
Feature engineering tasks
Text Data: converting text into numerical features using techniques like TF-IDF or word embeddings
Image Data: extracting features such as edges or textures using techniques like convolutional neural networks (CNNs)
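A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer, using invented review text:

```python
# TF-IDF turns raw text into a numeric matrix: one row per document,
# one column per term, weighted by how distinctive the term is.
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "great product, fast shipping",
    "terrible quality, would not buy again",
    "great value and great quality",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)  # sparse matrix of TF-IDF weights

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```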
ML Algorithms
Supervised Learning
Want to learn a mapping function that can predict the output for new unseen input data
Models are trained on labeled data: very powerful, but labeling millions of data points is difficult and expensive
Techniques:
Classification: predicts a discrete categorical label for the input data
Use cases: scenarios where decisions or predictions need to be made between distinct categories (fraud, image classification, customer retention, diagnostics)
Examples:
Binary classification (one or the other): classify emails as "spam" or "not spam"
Multi-class classification (more than two): classify animals in a zoo as "mammal", "bird", or "reptile"
Multi-label classification (can assign multiple to one): assign multiple labels to a movie, like "action" and "comedy"
Key algorithm: K-nearest neighbors (k-NN) model
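A minimal binary-classification sketch with scikit-learn's KNeighborsClassifier on synthetic data:

```python
# Binary classification with k-NN: a new point gets the majority label
# of its k nearest training points.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on unseen data
```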
Regression: predicts a continuous numeric value
Use cases: used when the goal is to predict a quantity or real value
Examples: probabilities or scores; sales forecasts, temperature predictions
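A minimal regression sketch, using scikit-learn's LinearRegression on synthetic data (one illustrative choice of algorithm):

```python
# Regression predicts a continuous number rather than a class label.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

reg = LinearRegression().fit(X, y)
print(reg.predict(X[:2]))  # continuous values, e.g. sales or temperature
```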
Unsupervised Learning
Models work with unlabeled data to find patterns, relationships, groupings, or underlying structures
The machine must uncover and create the groups itself, but humans still put labels on the output groups
Even though it uses unlabeled data, feature engineering can still help improve the quality of the training data
Techniques:
Clustering: used to group similar data points together into clusters based on their features
Use cases: customer segmentation, targeted marketing, recommender systems
Example: customer segmentation:
Scenario: e-commerce company wants to segment its customers to understand different purchasing behaviors
Data: a dataset contains customer purchase history (e.g. purchase frequency, average order value)
Goal: identify distinct groups of customers based on their purchasing behavior
Technique: K-means clustering
Outcome: the company can target each segment with tailored marketing strategies
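A minimal sketch of this segmentation example with K-means; the two features and k=3 are illustrative assumptions:

```python
# K-means groups customers by purchasing behavior with no labels given.
import numpy as np
from sklearn.cluster import KMeans

# Columns: purchase frequency, average order value (invented numbers)
X = np.array([[2, 30], [3, 25], [20, 5], [22, 8], [1, 200], [2, 180]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment per customer
```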
Dimensionality Reduction: reducing the number of features while preserving the important information (e.g., Principal Component Analysis)
Anomaly Detection:
Example: Fraud Detection
Scenario: detect fraudulent credit card transactions
Data: transaction data, including amount, location, and time
Goal: identify transactions that deviate significantly from typical behavior
Technique: Isolation Forest
Outcome: the system flags potentially fraudulent transactions for further investigation
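A minimal sketch of this fraud-detection example with scikit-learn's IsolationForest; the transaction features are invented:

```python
# Isolation Forest flags points that are easy to isolate from the rest.
import numpy as np
from sklearn.ensemble import IsolationForest

# Columns: amount, hour of day (mostly typical, plus one extreme transaction)
X = np.array([[25, 12], [40, 13], [30, 11], [35, 14], [5000, 3]])

clf = IsolationForest(contamination=0.2, random_state=0).fit(X)
print(clf.predict(X))  # -1 flags an anomaly, 1 means normal
```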
Association Rule Learning:
Example: Market Basket Analysis
Scenario: supermarket wants to understand which products are frequently bought together
Data: transaction records from customer purchases
Goal: identify associations between products to optimize product placement and promotions
Technique: Apriori algorithm
Outcome: the supermarket can place associated products together to boost sales
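A minimal market-basket sketch with the Apriori algorithm, assuming the third-party mlxtend library and invented baskets:

```python
# Apriori finds itemsets that appear together often, then derives rules.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

baskets = [["bread", "milk"], ["bread", "diapers", "beer"],
           ["milk", "diapers", "beer"], ["bread", "milk", "diapers"]]

# One-hot encode the transactions (rows = baskets, columns = products)
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "confidence"]])
```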
Semi-Supervised Learning
Use a small amount of labeled data and a large amount of unlabeled data to train systems
It's useful because labeling data is expensive, so labeling only a small portion is a happy medium
After that, the partially trained algorithm itself labels the unlabeled data
This is called pseudo-labeling
Now that everything is labeled, retrain the model on the entire dataset
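A minimal pseudo-labeling sketch with scikit-learn: train on the small labeled set, pseudo-label the rest, then retrain on everything:

```python
# Pseudo-labeling: train on the few human labels, label the rest with
# the model, then retrain on the combined dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
X_labeled, y_labeled = X[:50], y[:50]   # small human-labeled portion
X_unlabeled = X[50:]                    # the rest arrives without labels

model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
pseudo_labels = model.predict(X_unlabeled)  # model labels the unlabeled data

# Retrain on everything: real labels plus pseudo-labels
X_full = np.vstack([X_labeled, X_unlabeled])
y_full = np.concatenate([y_labeled, pseudo_labels])
model = LogisticRegression(max_iter=1000).fit(X_full, y_full)
```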
Self-Supervised Learning
Have a model generate pseudo-labels for its own data without having humans label any data first
This is useful since having humans label data can be expensive
Then, using the pseudo labels, solve problems traditionally solved by supervised learning
This is widely used in NLP (to create the BERT and GPT models for example) and in image recognition tasks
Self-supervised learning intuitive example:
Create "pre-text tasks" to have the model solve simple tasks to learn patterns in the dataset
Pretext tasks are not "useful" as-is, but will teach our model to create a "representation" of our dataset
Predict any part of the input from any other part
Predict the future from the past
Predict the masked from the visible
Predict any occluded part from all available parts
After solving the pre-text tasks, we have a model trained that can solve our end goal: "downstream tasks"
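A minimal sketch of generating pretext-task training pairs for "predict the masked from the visible": the labels come from the data itself, with no human annotation (the helper name is hypothetical):

```python
# Pretext-task data generation: mask one token and use it as the label.
# No human annotation needed; the label comes from the text itself.
import random

def make_masked_pair(sentence: str, seed: int = 0):  # hypothetical helper
    random.seed(seed)
    tokens = sentence.split()
    i = random.randrange(len(tokens))
    target = tokens[i]
    tokens[i] = "[MASK]"
    return " ".join(tokens), target  # (model input, pseudo-label)

print(make_masked_pair("the cat sat on the mat"))
# e.g. ('the cat sat on the [MASK]', 'mat')
```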
Reinforcement Learning (RL)
A type of ML where an agent interacts with an environment and learns to make decisions that maximize its cumulative reward, by receiving rewards or penalties for the actions it takes
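A minimal sketch of the agent/environment/reward loop as tabular Q-learning on a tiny made-up corridor world (all names and constants are illustrative):

```python
# Tabular Q-learning on a 5-state corridor: the agent starts at state 0
# and is rewarded only for reaching the rightmost state.
import random
import numpy as np

n_states, actions = 5, [0, 1]          # action 0 = move left, 1 = move right
Q = np.zeros((n_states, len(actions))) # learned value of each (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

for _ in range(500):                   # episodes
    s = 0
    while s != n_states - 1:
        # Explore sometimes, otherwise take the best known action
        a = random.choice(actions) if random.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward at the goal only
        # Q-learning update: nudge Q toward reward plus discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)  # "move right" should score higher in every state
```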