Machine Learning - sgml/signature GitHub Wiki

Data preprocessing involves gathering and cleaning data, which is essential for any machine learning task.

In new research accepted for publication in Chaos, Ott and his collaborators showed that improved predictions of chaotic systems like the Kuramoto-Sivashinsky equation become possible by hybridizing the data-driven, machine-learning approach and traditional model-based prediction. Ott sees this as a more likely avenue for improving weather prediction and similar efforts, since we don’t always have complete high-resolution data or perfect physical models. “What we should do is use the good knowledge that we have where we have it,” he said, “and if we have ignorance we should use the machine learning to fill in the gaps where the ignorance resides.”
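The hybrid idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the logistic map stands in for the chaotic system (it is not the Kuramoto-Sivashinsky equation), the "physical model" is the same map with a slightly wrong parameter, and a small least-squares fit stands in for whatever learner the researchers used. The point is only the structure: keep the imperfect model, and learn a correction for its residual error from data.

```python
def true_step(x):
    """Stand-in for the real system: logistic map with r = 3.9 (chaotic)."""
    return 3.9 * x * (1.0 - x)


def model_step(x):
    """Imperfect knowledge-based model: right form, wrong parameter (r = 3.7)."""
    return 3.7 * x * (1.0 - x)


def fit_residual(xs, residuals):
    """Least-squares fit of residual ~ c1*x + c2*x**2 (2x2 normal equations)."""
    s11 = sum(x ** 2 for x in xs)
    s12 = sum(x ** 3 for x in xs)
    s22 = sum(x ** 4 for x in xs)
    b1 = sum(x * r for x, r in zip(xs, residuals))
    b2 = sum(x * x * r for x, r in zip(xs, residuals))
    det = s11 * s22 - s12 * s12
    c1 = (b1 * s22 - s12 * b2) / det
    c2 = (s11 * b2 - s12 * b1) / det
    return c1, c2


# "Observed" trajectory of the real system, used as training data.
trajectory = [0.2]
for _ in range(200):
    trajectory.append(true_step(trajectory[-1]))

# Residual = what was observed next minus what the physical model predicted.
inputs = trajectory[:-1]
residuals = [nxt - model_step(x) for x, nxt in zip(inputs, trajectory[1:])]
c1, c2 = fit_residual(inputs, residuals)


def hybrid_step(x):
    """Physical model plus the learned correction for its error."""
    return model_step(x) + c1 * x + c2 * x * x


x0 = 0.5
print("model error :", abs(model_step(x0) - true_step(x0)))
print("hybrid error:", abs(hybrid_step(x0) - true_step(x0)))
```

In this contrived setup the model error has exactly the form the correction can represent, so the hybrid step matches the true system almost perfectly. Real systems offer no such guarantee; the sketch only shows where the learned component sits relative to the model.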

AI is skilled at games Machine learning is skilled at statistics "Deep learning is also highly susceptible to bias. When Google's facial recognition system was initially rolled out, for instance, it tagged many black faces as gorillas.

Child Psychology

Storytelling

I think stories are what make us different from chimpanzees and Neanderthals. And if story-understanding is really where it’s at, we can’t understand our intelligence until we understand that aspect of it.

Obscurity

writers_editors:
  - name: "Walter Bagehot (England)"
    url: "https://www.economicshelp.org/blog/26107/economics/walter-bagehot/"

  - name: "Benjamin Constant (France)"
    url: "https://plato.stanford.edu/entries/constant/"

  - name: "Giuseppe Mazzini (Italy)"
    url: "https://spartacus-educational.com/ITmazzini.htm"

  - name: "John Paul II (Poland)"
    url: "https://www.jstor.org/stable/10.5325/jjohnpajstud.4.2.0001"

  - name: "José Martí (Cuba)"
    url: "https://www.jstor.org/stable/30209112"

Scaffolding

Concepts

Research

SQL Algorithms

Data Quality

Techniques

OCR

Decision Trees

Relational Data Interoperability

Full Text Faceted Search Engine Marketing

NLP

Music

NLG

Gender Prediction

Codegen

Datasets

Packages

Classifiers

Deepfake Detection

Fantasy Basketball

Fantasy Football

Dictionaries

Chatbots

Slackbots

Finite State Machine

Relational Data

Semantic Data

Recognition

Video Games

Prediction

Deepfakes

Vector Features in RDBMSs

| Database | Feature | Description | URL | Version Introduced |
| --- | --- | --- | --- | --- |
| PostgreSQL | pgvector | An open-source extension that adds support for vector operations and similarity searches. | pgvector | 12.4 |
| MySQL | MySQL HeatWave | Includes support for vector store and generative AI capabilities, performing similarity searches with LLMs. | HeatWave | 8.0 |
| MariaDB | MariaDB Vector | Allows storing and searching vector data using a modified HNSW algorithm for fast similarity searches. | Vector | 11.7 |
| Sybase | Sybase Features | Currently does not have built-in vector database features similar to PostgreSQL, MySQL, and MariaDB. | Sybase Features | N/A |
| Teradata | Teradata Features | Teradata provides advanced vector capabilities for data analysis and machine learning applications. | Teradata Features | 16.20 |
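
As a rough illustration of what a "similarity search" over stored vectors means, here is a minimal brute-force sketch in plain Python. It is not how any of the databases above implement the feature (an index such as HNSW exists precisely to avoid scanning every row), and the row ids, vectors, and function names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), row_id) for row_id, vec in rows.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [row_id for _, row_id in scored[:k]]


# Hypothetical table of embeddings keyed by row id.
rows = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

result = top_k([1.0, 0.0, 0.0], rows, k=2)
print(result)
```

The scan is linear in the number of rows, which is acceptable for a few thousand vectors and impractical well beyond that; that gap is what the approximate indexes in these products are meant to close.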

Bad Fit

What types of data are most poorly labeled among publicly traded companies?

Publicly traded companies often struggle with labeling certain types of data accurately. Some of the most poorly labeled data include:

  1. Soft Information: This includes intangible assets like the value of research and development, employee training, and morale. These are difficult to quantify and often lead to inconsistencies in reporting.
  2. Financial Data: Despite efforts to standardize financial reporting with formats like XBRL (Extensible Business Reporting Language), there are still issues with comparability and accuracy.
  3. Non-Financial Metrics: Data related to environmental, social, and governance (ESG) factors can be inconsistently labeled and reported, leading to difficulties in comparison and analysis.


Bad Data

create_llm_with_bad_quality_data:
  steps_and_pitfalls:
    data_collection:
      poor_data_sources: "Using unreliable or unverified sources can result in collecting irrelevant or incorrect information."
      lack_of_diversity: "If the data lacks diversity in language, style, and context, the model will struggle to generalize and understand different inputs."
    data_preprocessing:
      minimal_cleaning: "Not cleaning or preprocessing data properly leads to noisy inputs, including spelling mistakes, grammatical errors, and inconsistent formatting."
      biased_data: "Training with biased data can reinforce harmful stereotypes and provide skewed responses."
    model_training:
      overfitting: "Training with poor quality data can cause overfitting, where the model performs well on the training data but poorly on new, unseen data."
      low_accuracy: "The model will have low accuracy and poor generalization capabilities due to the flawed input data."
    evaluation_and_tuning:
      inaccurate_evaluation: "Evaluating the model with low-quality validation data results in misleading performance metrics."
      poor_tuning: "Inadequate hyperparameter tuning can further degrade the model's performance."
  consequences:
    unreliable_outputs: "The model will generate inaccurate and unreliable responses, undermining its usefulness."
    reinforcement_of_biases: "Using biased data can perpetuate and amplify existing biases."
    increased_risks: "Deploying such a model can lead to misinformation and ethical concerns."
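
The `minimal_cleaning` pitfall above can be made concrete with a small sketch of the opposite habit. This is a minimal example under assumed requirements (Unicode normalization, whitespace collapsing, dropping empty and duplicate records); the function name and sample records are hypothetical, and a real training pipeline would need far more than this.

```python
import re
import unicodedata


def clean_records(records):
    """Normalize text records and drop empty or duplicate entries, keeping order."""
    seen = set()
    cleaned = []
    for text in records:
        # Fold compatibility characters (e.g. full-width forms) to a canonical form.
        text = unicodedata.normalize("NFKC", text)
        # Collapse runs of spaces, tabs, and newlines into single spaces.
        text = re.sub(r"\s+", " ", text).strip()
        if not text:
            continue
        # Case-insensitive duplicate check; the first spelling seen is kept.
        key = text.casefold()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


raw = ["  Hello   world ", "hello world", "", "Second\tline\n", "   "]
cleaned = clean_records(raw)
print(cleaned)
```

Deduplicating before splitting into training and validation sets also helps with the `inaccurate_evaluation` pitfall, since duplicated records that land on both sides of the split inflate validation scores.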

Purposeful Exacerbation of Logical Fallacies in Algorithms

When algorithms are purposefully designed to exacerbate logical fallacies like the Gambler's Fallacy and the Hot-Hand Fallacy, it is often to exploit certain behaviors or patterns for specific outcomes, typically in trading or investment contexts. Here are some ways it might happen:

Market Manipulation

Algorithms can be designed to create artificial demand or supply in the market by repeatedly trading based on recent performance, making it appear as though certain stocks are trending. This can mislead other traders into believing there is a sustained trend, which isn't actually based on underlying fundamentals.

Reinforcing Biases

By emphasizing recent trends and neglecting the inherent randomness, these algorithms can exploit the Hot-Hand Fallacy. This can drive up the prices of certain assets artificially, creating a bubble that savvy traders might plan to exploit by shorting once the market corrects itself.

Echo Chamber Effect

Algorithms that amplify the Gambler's Fallacy might focus on past losses or downturns, driving prices down further than warranted by fundamentals. This can lead to undervaluation of assets, which can be exploited later when market corrections occur.

Encouraging Risky Behavior

By giving undue weight to recent successes, such algorithms can encourage investors to make increasingly risky bets, believing their 'streak' will continue. This can lead to greater market volatility, which can be advantageous to certain trading strategies.

High-Frequency Trading (HFT)

In HFT, algorithms may exploit micro-trends by making thousands of trades per second, based on minute price movements and past performance trends. This can distort market prices and create opportunities for profit by capitalizing on these artificial fluctuations.

It's worth noting that such practices are often scrutinized and regulated to prevent market abuse and protect investors.
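
Both fallacies named in this section misread independent events: one expects a streak to reverse, the other expects it to continue. A short simulation, assuming a fair coin as a stand-in for independent outcomes (real price series are not claimed to behave this way), shows that the outcome following a streak is no more or less likely than usual. The function name and parameters are illustrative only.

```python
import random


def rate_after_streak(n_flips, streak_len, seed=0):
    """Fraction of heads among flips that immediately follow a run of heads."""
    rng = random.Random(seed)
    flips = [rng.random() < 0.5 for _ in range(n_flips)]
    followers = []
    for i in range(streak_len, n_flips):
        # Record flip i only when the previous `streak_len` flips were all heads.
        if all(flips[i - streak_len:i]):
            followers.append(flips[i])
    return sum(followers) / len(followers), len(followers)


rate, count = rate_after_streak(200_000, streak_len=3)
print(f"heads after 3 heads in a row: {rate:.3f} over {count} cases")
```

With independent flips the rate stays close to one half regardless of the streak length, so a strategy that bets on either continuation or reversal gains nothing from the streak itself.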

Data Cleaning

https://www.kaggle.com/code/loganlauton/basic-data-clean-helper-nba-players-team-data

Case Studies on Garbage Data

studies:
  - title: "Quantifying Outlierness of Funds from their Categories using Supervised Similarity"
    description: "This study explores the impact of miscategorization in mutual funds using a machine learning approach. The researchers found a strong relationship between miscategorization and future returns, highlighting the significant implications for allocation decisions and investment fund managers."
    url: "https://arxiv.org/abs/2003.02924"

  - title: "Bias and Unfairness in Machine Learning Models: A Systematic Review"
    description: "This systematic review examines the current knowledge on bias and unfairness in machine learning models. It discusses various datasets, tools, fairness metrics, and methods for identifying and mitigating bias. The review emphasizes the importance of addressing miscategorization to ensure fair and unbiased models."
    url: "https://www.mdpi.com/2076-3417/10/18/6462"

  - title: "Evolution and Impact of Bias in Human and Machine Learning Algorithm Interaction"
    description: "This research investigates the iterative interaction between humans and machine learning algorithms. The study highlights how biased data and miscategorization can lead to algorithmic bias, which can further exacerbate the problem through iterative processes."
    url: "https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0226801"