ML Concepts - liniribeiro/machine_learning GitHub Wiki

Supervised learning

Supervised learning is a category of machine learning that uses labeled datasets to train algorithms to predict outcomes and recognize patterns.

How does supervised learning work?

The data used in supervised learning is labeled — meaning that it contains examples of both inputs (called features) and correct outputs (labels). The algorithms analyze a large dataset of these training pairs to infer what a desired output value would be when asked to make a prediction on new data.
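As a minimal sketch of learning from labeled training pairs, the example below stores a handful of (features, label) pairs and predicts the label of a new input by copying the label of the closest stored example (a 1-nearest-neighbor rule). The data, the feature meanings, and the `predict` helper are all hypothetical and chosen only for illustration. Real systems usually rely on a library and far more data.

```python
import math

# Hypothetical labeled training pairs: (features, label).
# Features are [height_cm, weight_kg]; the labels are made-up categories.
training_pairs = [
    ([150.0, 50.0], "small"),
    ([155.0, 55.0], "small"),
    ([180.0, 85.0], "large"),
    ([185.0, 90.0], "large"),
]


def predict(features):
    """Return the label of the closest training example (1-nearest neighbor)."""
    nearest = min(training_pairs, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]


# Ask for a prediction on an input that was not in the training data.
prediction = predict([178.0, 80.0])
print(prediction)
```

Because the predicted output is a category, this sketch is also a small instance of classification.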

Types of supervised learning

Supervised learning in machine learning is generally divided into two categories: classification and regression.

Classification

Classification algorithms are used to group data by predicting a categorical label or output variable based on the input data. Classification is used when output variables are categorical, meaning there are two or more classes.

Regression

Regression algorithms are used to predict a real or continuous value, where the algorithm detects a relationship between two or more variables.
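A rough illustration of regression, assuming the simplest case of one input feature and a straight-line relationship: the sketch below fits a line by ordinary least squares and uses it to predict a continuous value. The data and the `fit_line` helper are hypothetical.

```python
# Hypothetical data: hours studied (feature) and exam score (continuous target).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(hours, scores)

# Predict the score for an unseen input of 6 hours.
predicted_score = slope * 6.0 + intercept
print(slope, intercept, predicted_score)
```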

Real-world supervised learning examples

Supervised learning models can be used for a number of different business use cases that help address a wide range of problems. Common supervised learning examples include the following:

  • Risk assessment: Supervised machine learning models can help banks and other financial services companies determine whether customers are likely to default on loans, helping to minimize risk in their portfolios.
  • Image classification: Supervised machine learning algorithms are often trained to classify objects in images and videos. For example, an algorithm might be used to recognize a person in an image and automatically tag them on a social media platform.
  • Fraud detection: Supervised learning underpins many fraud detection systems, enabling enterprises to recognize fraudulent activity. These models are trained on datasets that contain both fraudulent and non-fraudulent activity so they can be used to flag suspicious activity in real time.
  • Recommendation systems: Supervised learning algorithms are used by online platforms and streaming services to power recommendations based on previous customer behavior or shopping history. The models extract important information about a user's behavior and suggest similar products and content.

Supervised learning vs. unsupervised learning

The primary difference between supervised and unsupervised learning is the type of input data used to train the model. Supervised learning uses labeled training datasets to teach a model a specific, predefined goal.

By comparison, unsupervised learning uses unlabeled data and operates autonomously to try and learn the structure of the data without being given any explicit instructions.

Unsupervised learning

Unsupervised learning in artificial intelligence is a type of machine learning that learns from data without human supervision. Unlike supervised learning, unsupervised machine learning models are given unlabeled data and allowed to discover patterns and insights without any explicit guidance or instruction.

Unsupervised machine learning methods

In general, there are three types of unsupervised learning tasks: clustering, association rules, and dimensionality reduction.

Clustering

Clustering is a technique for exploring raw, unlabeled data and breaking it down into groups (or clusters) based on similarities or differences. It is used in a variety of applications, including customer segmentation, fraud detection, and image analysis. Clustering algorithms split data into natural groups by finding similar structures or patterns in uncategorized data.

Clustering is one of the most popular unsupervised machine learning approaches. There are several types of unsupervised learning algorithms that are used for clustering, which include exclusive, overlapping, hierarchical, and probabilistic.

  • Exclusive clustering: Data is grouped in a way where a single data point can only exist in one cluster. This is also referred to as “hard” clustering. A common example of exclusive clustering is the K-means clustering algorithm, which partitions data points into a user-defined number K of clusters.
  • Overlapping clustering: Data is grouped in a way where a single data point can exist in two or more clusters with different degrees of membership. This is also referred to as “soft” clustering.
  • Hierarchical clustering: Data is organized into a hierarchy of nested clusters based on similarity. There are two main types of hierarchical clustering: agglomerative, which starts with each data point in its own cluster and repeatedly merges the most similar clusters, and divisive, which starts with a single cluster and repeatedly splits it. This method is also referred to as hierarchical cluster analysis (HCA).
  • Probabilistic clustering: Data is grouped into clusters based on the probability of each data point belonging to each cluster. This approach differs from the other methods, which group data points based on their similarities to others in a cluster.
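To make exclusive ("hard") clustering concrete, here is a minimal K-means sketch in plain Python. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The data, the `k_means` helper, and the choice to seed the centroids with the first K points are assumptions made for a deterministic example. Library implementations typically use smarter initialization and a convergence check.

```python
import math


def k_means(points, k, iterations=10):
    """Partition points into k clusters; each point gets exactly one cluster."""
    # Assumption: seed the centroids with the first k points so the result
    # is deterministic. Real implementations usually pick seeds more carefully.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical 2D data with two visibly separate groups.
data = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = k_means(data, k=2)
print(assignments)
print(centroids)
```

Each entry in `assignments` is a single cluster index, which is what distinguishes exclusive clustering from the overlapping and probabilistic variants, where a point carries a degree of membership or a probability for every cluster.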

Association

Association rule mining is a rule-based approach to reveal interesting relationships between data points in large datasets. Unsupervised learning algorithms search for frequent if-then associations—also called rules—to discover correlations and co-occurrences within the data and the different connections between data objects.
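An if-then rule such as "if bread, then butter" is commonly scored with two measures: support (how often the items appear together) and confidence (how often the rule holds when its "if" part is present). The sketch below computes both on a tiny, made-up set of transactions. The `support` and `confidence` helpers are hypothetical names, and a full rule-mining algorithm would also search for the frequent itemsets rather than check hand-picked ones.

```python
# Hypothetical shopping baskets; each transaction is a set of items.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    matches = sum(1 for t in transactions if itemset <= t)
    return matches / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"bread", "butter"}))
print(confidence({"bread"}, {"butter"}))
print(confidence({"butter"}, {"bread"}))
```

Note that confidence is not symmetric: in this data every basket with butter also has bread, but not every basket with bread has butter.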

Dimensionality reduction

Dimensionality reduction is an unsupervised learning technique that reduces the number of features, or dimensions, in a dataset. More data is generally better for machine learning, but a large number of features can make the data harder to visualize and models harder to train.

Dimensionality reduction extracts the most important features from the dataset and reduces the number of irrelevant or random ones. Common algorithms include principal component analysis (PCA) and singular value decomposition (SVD), which reduce the number of data inputs while preserving as much of the structure of the original data as possible.
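The following is a simplified sketch of the idea behind PCA, restricted to 2D data so it fits in plain Python. It centers the data, builds the 2x2 covariance matrix, finds the direction of greatest variance with power iteration, and projects each point onto that direction, turning two features into one. The data and the `first_principal_component` helper are hypothetical. Libraries usually compute the components through an eigendecomposition or SVD instead, and this sketch assumes the starting vector is not orthogonal to the main direction.

```python
import math


def first_principal_component(points, iterations=100):
    """Return the top principal direction of 2D points and their 1D projections."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Power iteration: repeatedly multiply a vector by the covariance matrix
    # and renormalize; it converges toward the direction of largest variance.
    vx, vy = 1.0, 0.0
    for _ in range(iterations):
        wx = sxx * vx + sxy * vy
        wy = sxy * vx + syy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm

    # Project each centered point onto the direction: two features become one.
    projected = [x * vx + y * vy for x, y in centered]
    return (vx, vy), projected


# Hypothetical data in which both features rise together.
samples = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
component, reduced = first_principal_component(samples)
print(component)
print(reduced)
```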

Real-world unsupervised learning examples

  • Anomaly detection: Unsupervised clustering can process large datasets and discover data points that are atypical in a dataset.
  • Recommendation engines: Using association rules, unsupervised machine learning can help explore transactional data to discover patterns or trends that can be used to drive personalized recommendations for online retailers.
  • Customer segmentation: Unsupervised learning is also commonly used to generate buyer persona profiles by clustering customers’ common traits or purchasing behaviors. These profiles can then be used to guide marketing and other business strategies.
  • Fraud detection: Unsupervised learning is useful for anomaly detection, revealing unusual data points in datasets. These insights can help uncover events or behaviors that deviate from normal patterns in the data, revealing fraudulent transactions or unusual behavior like bot activity.
  • Natural language processing (NLP): Unsupervised learning is commonly used for various NLP applications, such as categorizing articles in news sections, text translation and classification, or speech recognition in conversational interfaces.
  • Genetic research: Genetic clustering is another common unsupervised learning example. Hierarchical clustering algorithms are often used to analyze DNA patterns and reveal evolutionary relationships.
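The anomaly and fraud detection items above come down to flagging data points that sit far from the rest without any labels. As a deliberately simple stand-in for a clustering-based detector, the sketch below flags values that lie more than a chosen number of standard deviations from the mean. The transaction amounts, the `find_anomalies` helper, and the threshold of 2.0 are all assumptions for illustration.

```python
import statistics

# Hypothetical transaction amounts; no labels say which ones are unusual.
amounts = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 50.0]


def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / spread > threshold]


anomalies = find_anomalies(amounts)
print(anomalies)
```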

Unsupervised learning is well suited for tasks that require exploring large amounts of unlabeled data. This approach makes it easier for businesses to gain insights from data when no labels are present, helping them to understand the underlying structure of a dataset and identify patterns and relationships within it without the need for a human to label the data first.