Machine Learning - sgml/signature GitHub Wiki

Data preprocessing involves gathering and cleaning data, which is essential for any machine learning task.

In new research accepted for publication in Chaos, Ott and his collaborators showed that improved predictions of chaotic systems like the Kuramoto-Sivashinsky equation become possible by hybridizing the data-driven, machine-learning approach and traditional model-based prediction. Ott sees this as a more likely avenue for improving weather prediction and similar efforts, since we don’t always have complete high-resolution data or perfect physical models. “What we should do is use the good knowledge that we have where we have it,” he said, “and if we have ignorance we should use the machine learning to fill in the gaps where the ignorance resides.”

AI is skilled at games.

Machine learning is skilled at statistics.

"Deep learning is also highly susceptible to bias. When Google's facial recognition system was initially rolled out, for instance, it tagged many black faces as gorillas."

Child Psychology

Storytelling

I think stories are what make us different from chimpanzees and Neanderthals. And if story-understanding is really where it’s at, we can’t understand our intelligence until we understand that aspect of it.

Obscurity

writers_editors:
  - name: "Walter Bagehot (England)"
    url: "https://www.economicshelp.org/blog/26107/economics/walter-bagehot/"

  - name: "Benjamin Constant (France)"
    url: "https://plato.stanford.edu/entries/constant/"

  - name: "Giuseppe Mazzini (Italy)"
    url: "https://spartacus-educational.com/ITmazzini.htm"

  - name: "Juan Pablo II (Poland)"
    url: "https://www.jstor.org/stable/10.5325/jjohnpajstud.4.2.0001"

  - name: "José Martí (Cuba)"
    url: "https://www.jstor.org/stable/30209112"

Scaffolding

Concepts

Research

SQL Algorithms

Data Quality

Techniques

OCR

https://www.makeuseof.com/tag/convert-handwriting-text-ocr/

Decision Trees

Relational Data Interoperability

Full Text Faceted Search Engine Marketing

NLP

Music

Patterns vs Modeling in OOP terminology

OOP metaphors for regex vs machine learning:

Regex as a static class: defined once, tightly scoped, and does exactly what it says with no surprises.
Regex as a factory pattern: it instantiates only those objects (strings) that precisely match its blueprint.
Regex as an interface contract: every string must implement the expected “methods” (e.g., format) to be accepted.
Regex as a sealed class: behavior is closed for modification; the structure is rigidly defined.
Regex generation as compile-time metaprogramming: optimized, constrained, zero guesswork.
ML as an abstract base class: it learns how to implement methods by example, leaving behaviors open-ended.
ML as a dynamic dispatch system: chooses its output based on learned runtime context.
ML as a strategy pattern: selects among learned approaches depending on the situation.
ML as a polymorphic generator: a single method (generate_text) might yield wildly different results depending on training.
ML as reflection: introspecting data and adapting to it—even if no prior blueprint exists.
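
The contrast above can be made concrete with a minimal sketch. The ZIP-code pattern, the toy training examples, and the `CharProfileClassifier` class are all hypothetical, chosen only for illustration; the "learning" here is a deliberately naive nearest-profile rule, not a real ML library.

```python
import re
from collections import Counter

# Regex as a "sealed class": the blueprint is fixed when the pattern is compiled.
ZIP_CODE = re.compile(r"\d{5}(-\d{4})?")  # hypothetical pattern: US ZIP or ZIP+4

def is_zip(text: str) -> bool:
    """Accept only strings that match the blueprint exactly."""
    return ZIP_CODE.fullmatch(text) is not None

def char_profile(text: str) -> Counter:
    """Count digits, letters, and other characters in a string."""
    profile = Counter()
    for ch in text:
        kind = "digit" if ch.isdigit() else "alpha" if ch.isalpha() else "other"
        profile[kind] += 1
    return profile

class CharProfileClassifier:
    """Toy 'ML as abstract base class': behavior comes from examples, not rules."""

    def __init__(self):
        self.examples = []  # (profile, label) pairs seen during training

    def fit(self, texts, labels):
        self.examples = [(char_profile(t), y) for t, y in zip(texts, labels)]
        return self

    def predict(self, text):
        # Return the label of the training example with the closest profile.
        target = char_profile(text)

        def distance(profile):
            keys = set(profile) | set(target)
            return sum(abs(profile[k] - target[k]) for k in keys)

        return min(self.examples, key=lambda ex: distance(ex[0]))[1]

model = CharProfileClassifier().fit(
    ["90210", "10001-1234", "hello", "Main Street"],
    ["zip", "zip", "word", "word"],
)

print(is_zip("9021"), model.predict("9021"))
```

Here the regex rejects the four-digit string outright, while the toy model still labels it by its nearest example: a rigid contract on one side, learned and open-ended behavior on the other.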

NLG

Gender Prediction

Codegen

Datasets

Packages

Classifiers

Overkill / Over Engineering

options:
  - name: "GenAI"
    suited_for: "Large-scale creative generation tasks"
    cpu_cost: "High"
    gpu_cost: "High"
    cloud_IDE_time: "High"
    money_efficiency: "Low"
    comment: "Overkill for a CSV dataset of 500 rows."

  - name: "Discriminative AI"
    suited_for: "Large-scale classification or predictive tasks"
    cpu_cost: "Medium to High"
    gpu_cost: "High"
    cloud_IDE_time: "High"
    money_efficiency: "Low"
    comment: "Not cost-effective for only 500 rows."

  - name: "Machine Learning"
    suited_for: "Statistical modeling and predictions on moderate to large datasets"
    cpu_cost: "Moderate to High"
    gpu_cost: "Potentially high if complex models are used"
    cloud_IDE_time: "High (due to setup and iterative tuning)"
    money_efficiency: "Low to Moderate"
    comment: "The added overhead does not justify its use for such a small dataset."

  - name: "MechanicalTurk"
    suited_for: "Human-powered tasks (e.g., annotation) rather than computation"
    cpu_cost: "N/A"
    gpu_cost: "N/A"
    cloud_IDE_time: "Not applicable"
    money_efficiency: "Low (costs for human labor are high)"
    comment: "Not applicable for computational analysis of the dataset."

  - name: "Pandas"
    suited_for: "Data manipulation and analysis on small to moderate datasets"
    cpu_cost: "Very low"
    gpu_cost: "None required"
    cloud_IDE_time: "Minimal (runs locally on a standard CPU)"
    money_efficiency: "High (free and open source)"
    comment: "Ideal choice for a 500-row CSV dataset."

  - name: "Spreadsheet Macro"
    suited_for: "Simple data tasks in tools like Excel or Google Sheets"
    cpu_cost: "Very low"
    gpu_cost: "Not applicable"
    cloud_IDE_time: "Minimal (often built into desktop or web apps)"
    money_efficiency: "High (if software/subscriptions are already in use)"
    comment: "Works well for basic tasks but lacks the flexibility of Pandas for more in-depth analysis."
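
To make the cost comparison concrete, here is a minimal sketch of summarizing a small CSV. The column names and rows are made up for illustration, and only the standard library is used so the example has no dependencies; in Pandas the same aggregation would typically be a `groupby` followed by `mean`.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical stand-in for a 500-row CSV file; only a few rows are shown.
RAW_CSV = """team,points
A,10
B,7
A,14
B,9
"""

def mean_by_group(csv_text, group_col, value_col):
    """Return the mean of value_col for each distinct value of group_col."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[group_col]].append(float(row[value_col]))
    return {name: mean(values) for name, values in groups.items()}

summary = mean_by_group(RAW_CSV, "team", "points")
print(summary)
```

At this scale the whole job runs in milliseconds on a laptop CPU, which is the point of the table above.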

Deepfake Detection

Fantasy Basketball

Fantasy Football

Dictionaries

https://www.tutorialspoint.com/adobe_robohelp/adobe_robohelp_creating_indexes_keywords.htm

Chatbots

Slackbots

Finite State Machine

Relational Data

Semantic Data

Recognition

Video Games

Prediction

Biased Data

Deepfakes

Missing Data

Invention	Inventor	Location	Family	Heirs	Legacy	Reason for Unknowns
Wheel-lock Musket	Unknown	Europe	Not specifically documented	Not specifically documented	The invention paved the way for future advancements in firearm design, influencing the development of more sophisticated ignition mechanisms such as the flintlock.	The specific inventor is not documented due to the collaborative and evolving nature of firearm technology during this period.
Horizontal Water Wheel	Unknown	Europe	Not specifically documented	Not specifically documented	Its broader application and inventor of this specific design are not clearly documented.	The broader application and specific inventor are not documented, as many designs were often conceptualized and adapted by various individuals over time.

Vector Features in RDBMSs

Database	Feature	Description	URL	Version Introduced
PostgreSQL	pgvector	An open-source extension that adds support for vector operations and similarity searches.	pgvector	12.4
MySQL	MySQL HeatWave	Includes support for vector store and generative AI capabilities, performing similarity searches with LLMs.	HeatWave	8.0
MariaDB	MariaDB Vector	Allows storing and searching vector data using a modified HNSW algorithm for fast similarity searches.	Vector	11.7
Sybase	Sybase Features	Currently does not have built-in vector database features similar to PostgreSQL, MySQL, and MariaDB.	Sybase Features	N/A
Teradata	Teradata Features	Teradata provides advanced vector capabilities for data analysis and machine learning applications.	Teradata Features	16.20
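
As a rough illustration of what a vector similarity search computes, the sketch below ranks stored vectors by cosine similarity to a query. It is a brute-force scan in plain Python over made-up three-dimensional vectors; it does not use any of the database features listed above, which add indexing (the MariaDB entry mentions HNSW) so that not every row has to be compared.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, rows, k=1):
    """Return the k row ids most similar to the query (brute-force scan)."""
    ranked = sorted(
        rows,
        key=lambda row_id: cosine_similarity(query, rows[row_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings keyed by row id.
ROWS = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.9, 0.1, 0.0],
}

print(nearest([1.0, 0.05, 0.0], ROWS, k=2))
```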

Bad Fit

What types of data are most poorly labeled among publicly traded companies?

Publicly traded companies often struggle with labeling certain types of data accurately. Some of the most poorly labeled data include:

Soft Information: This includes intangible assets like the value of research and development, employee training, and morale. These are difficult to quantify and often lead to inconsistencies in reporting.
Financial Data: Despite efforts to standardize financial reporting with formats like XBRL (Extensible Business Reporting Language), there are still issues with comparability and accuracy.
Non-Financial Metrics: Data related to environmental, social, and governance (ESG) factors can be inconsistently labeled and reported, leading to difficulties in comparison and analysis.

Bad Data

create_llm_with_bad_quality_data:
  steps_and_pitfalls:
    data_collection:
      poor_data_sources: "Using unreliable or unverified sources can result in collecting irrelevant or incorrect information."
      lack_of_diversity: "If the data lacks diversity in language, style, and context, the model will struggle to generalize and understand different inputs."
    data_preprocessing:
      minimal_cleaning: "Not cleaning or preprocessing data properly leads to noisy inputs, including spelling mistakes, grammatical errors, and inconsistent formatting."
      biased_data: "Training with biased data can reinforce harmful stereotypes and provide skewed responses."
    model_training:
      overfitting: "Training with poor quality data can cause overfitting, where the model performs well on the training data but poorly on new, unseen data."
      low_accuracy: "The model will have low accuracy and poor generalization capabilities due to the flawed input data."
    evaluation_and_tuning:
      inaccurate_evaluation: "Evaluating the model with low-quality validation data results in misleading performance metrics."
      poor_tuning: "Inadequate hyperparameter tuning can further degrade the model's performance."
  consequences:
    unreliable_outputs: "The model will generate inaccurate and unreliable responses, undermining its usefulness."
    reinforcement_of_biases: "Using biased data can perpetuate and amplify existing biases."
    increased_risks: "Deploying such a model can lead to misinformation and ethical concerns."

Data Quality Limitations

Physical world games of chance have triggered absurd responses from machine learning algorithms, such as this:

import random

# Initialize move history
move_history = []

# Function to predict the next move
def predict_next_move(history):
    if len(history) < 2:
        return random.choice(['rock', 'paper', 'scissors'])
    
    # Analyze the last two moves
    last_move = history[-1]
    second_last_move = history[-2]
    
    # Predict based on pattern
    if second_last_move == 'rock' and last_move == 'paper':
        return 'scissors'
    elif second_last_move == 'paper' and last_move == 'scissors':
        return 'rock'
    elif second_last_move == 'scissors' and last_move == 'rock':
        return 'paper'
    
    return random.choice(['rock', 'paper', 'scissors'])

# Simulate a game
for _ in range(10):
    opponent_move = random.choice(['rock', 'paper', 'scissors'])
    move_history.append(opponent_move)
    predicted_move = predict_next_move(move_history)
    print(f"Opponent Move: {opponent_move}, Predicted Move: {predicted_move}")
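
One way to see why such output is absurd is to score a pattern-based predictor against a uniformly random opponent. The sketch below uses a deliberately simple stand-in predictor (`repeat_last`, a hypothetical helper, not the function above) and a fixed seed so the run is repeatable. Against independent random moves, no history-based rule should be expected to do much better than one hit in three.

```python
import random

MOVES = ["rock", "paper", "scissors"]

def repeat_last(history):
    """Hypothetical stand-in predictor: guess that the last move repeats."""
    return history[-1] if history else MOVES[0]

def hit_rate(predictor, rounds, seed=0):
    """Fraction of rounds in which the predictor guesses the next random move."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    history, hits = [], 0
    for _ in range(rounds):
        guess = predictor(history)
        actual = rng.choice(MOVES)  # the opponent ignores history entirely
        hits += guess == actual
        history.append(actual)
    return hits / rounds

rate = hit_rate(repeat_last, rounds=10_000)
print(f"hit rate: {rate:.3f}")
```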

Purposeful Exacerbation of Logical Fallacies in Algorithms

When algorithms are purposefully designed to exacerbate logical fallacies like the Gambler's Fallacy and the Hot-Hand Fallacy, it is often to exploit certain behaviors or patterns for specific outcomes, typically in trading or investment contexts. Here are some ways it might happen:

Market Manipulation

Algorithms can be designed to create artificial demand or supply in the market by repeatedly trading based on recent performance, making it appear as though certain stocks are trending. This can mislead other traders into believing there is a sustained trend, which isn't actually based on underlying fundamentals.

Reinforcing Biases

By emphasizing recent trends and neglecting the inherent randomness, these algorithms can exploit the Hot-Hand Fallacy. This can drive up the prices of certain assets artificially, creating a bubble that savvy traders might plan to exploit by shorting once the market corrects itself.

Echo Chamber Effect

Algorithms that amplify the Gambler's Fallacy might focus on past losses or downturns, driving prices down further than warranted by fundamentals. This can lead to undervaluation of assets, which can be exploited later when market corrections occur.

Encouraging Risky Behavior

By giving undue weight to recent successes, such algorithms can encourage investors to make increasingly risky bets, believing their 'streak' will continue. This can lead to greater market volatility, which can be advantageous to certain trading strategies.

High-Frequency Trading (HFT)

In HFT, algorithms may exploit micro-trends by making thousands of trades per second, based on minute price movements and past performance trends. This can distort market prices and create opportunities for profit by capitalizing on these artificial fluctuations.

It's worth noting that such practices are often scrutinized and regulated to prevent market abuse and protect investors.
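
Both fallacies named above rest on the belief that a streak changes the odds of the next independent event. A minimal simulation, assuming a fair coin and a fixed seed, suggests it does not: the share of heads immediately after three heads in a row stays near one half. The function name and streak length are illustrative choices.

```python
import random

def heads_rate_after_streak(flips, streak=3):
    """Share of heads among flips that directly follow `streak` heads in a row."""
    followers = [
        flips[i]
        for i in range(streak, len(flips))
        if all(flips[i - streak:i])
    ]
    return sum(followers) / len(followers)

rng = random.Random(42)
flips = [rng.random() < 0.5 for _ in range(100_000)]  # True means heads

print(f"overall heads rate:      {sum(flips) / len(flips):.3f}")
print(f"heads rate after streak: {heads_rate_after_streak(flips):.3f}")
```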

Data Cleaning

https://www.kaggle.com/code/loganlauton/basic-data-clean-helper-nba-players-team-data

Case Studies on Garbage Data

studies:
  - title: "Quantifying Outlierness of Funds from their Categories using Supervised Similarity"
    description: "This study explores the impact of miscategorization in mutual funds using a machine learning approach. The researchers found a strong relationship between miscategorization and future returns, highlighting the significant implications for allocation decisions and investment fund managers."
    url: "https://arxiv.org/abs/2003.02924"

  - title: "Bias and Unfairness in Machine Learning Models: A Systematic Review"
    description: "This systematic review examines the current knowledge on bias and unfairness in machine learning models. It discusses various datasets, tools, fairness metrics, and methods for identifying and mitigating bias. The review emphasizes the importance of addressing miscategorization to ensure fair and unbiased models."
    url: "https://www.mdpi.com/2076-3417/10/18/6462"

  - title: "Evolution and Impact of Bias in Human and Machine Learning Algorithm Interaction"
    description: "This research investigates the iterative interaction between humans and machine learning algorithms. The study highlights how biased data and miscategorization can lead to algorithmic bias, which can further exacerbate the problem through iterative processes."
    url: "https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0226801"

Human Friendly, Machine Unfriendly

Legacy MIME types that are unusable or poorly suited for machine learning are typically those that don't describe their content or are not structured in a way that machine learning algorithms can process. Examples include:

- **application/octet-stream**: This is a generic binary data format that doesn't provide any information about the content, making it unusable for machine learning without additional processing.
- **application/x-unknown**: This MIME type indicates that the content type is unknown, which means there's no specific format or structure to work with.
- **text/plain**: While plain text can be used for some machine learning tasks, it lacks the structure and richness of more complex data formats like JSON or XML, making it less useful for many applications.

In general, for machine learning, you want data in structured formats like JSON, XML, CSV, or specific file formats like images (JPEG, PNG) or audio (MP3, WAV) that can be easily parsed and processed by algorithms.
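
A pipeline can triage files along these lines before any parsing. The sketch below uses Python's standard `mimetypes` module to guess a type from the file name; the `STRUCTURED` set and the `triage` helper are assumptions made for this example, not an established taxonomy.

```python
import mimetypes

# Assumed grouping for illustration, following the formats named above.
STRUCTURED = {"application/json", "text/csv", "application/xml", "text/xml"}

def triage(filename):
    """Classify a file as 'structured', 'media', 'plain', or 'opaque' by MIME type."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # conventional fallback for unknown content
    if mime in STRUCTURED:
        return "structured"
    if mime.startswith(("image/", "audio/")):
        return "media"
    if mime == "text/plain":
        return "plain"
    return "opaque"

for name in ["records.json", "photo.png", "notes.txt", "dump.unknownext"]:
    print(name, "->", triage(name))
```

Guessing from the file name is only a first pass; a file served as `application/octet-stream` still needs content inspection before it can be used.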

Unusable Data

MetaDatasets:
  - IncompleteData:
      Description: "Datasets heavily fragmented with significant amounts of missing data."
  - NonQuantifiableData:
      Description: "Data that cannot be quantified or is purely qualitative without any structured format."
      Example: "Vague descriptions without standardized terms."
  - InconsistentData:
      Description: "Data that lacks consistency in formats, units, or types."
      Example: "Mixing numerical values with text in the same column."
  - PoorQualityScansForOCR:
      Description: "Handwritten notes or documents that are poorly scanned, blurry, or too faint."
  - RandomOrNoisyData:
      Description: "Data that appears random or contains a high level of noise with no discernible patterns."
  - OutdatedOrIrrelevantData:
      Description: "Data that is too old or not relevant to the current context or domain."
  - UnstructuredTextData:
      Description: "Free-form text data without categorization, tagging, or structuring."
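
The `InconsistentData` case is easy to detect mechanically. A minimal sketch, assuming rows arrive as dictionaries of strings (as `csv.DictReader` would produce); the sample rows and the `mixed_columns` helper are hypothetical.

```python
def is_number(text):
    """True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def mixed_columns(rows):
    """Return, sorted, the columns that hold both numeric and non-numeric values."""
    kinds = {}
    for row in rows:
        for column, value in row.items():
            kinds.setdefault(column, set()).add(is_number(value))
    return sorted(col for col, seen in kinds.items() if len(seen) == 2)

# Hypothetical rows: 'height' mixes numbers with free text.
ROWS = [
    {"name": "Ada", "height": "170"},
    {"name": "Bo", "height": "tall"},
    {"name": "Cy", "height": "182.5"},
]

print(mixed_columns(ROWS))
```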

Java

machine_learning_libraries:
  - name: "Weka"
    description: "Popular machine learning library written in Java. Provides tools for data pre-processing, classification, regression, clustering, and visualization. Suitable for smaller-scale ML tasks on a Raspberry Pi."
    github_url: "https://github.com/weka/weka"
  - name: "Deeplearning4j"
    description: "Deep learning library for Java and Scala. Designed to be scalable and run on various hardware, including Raspberry Pi. Optimizations and lightweight neural networks are recommended for efficient performance."
    github_url: "https://github.com/deeplearning4j/deeplearning4j"
  - name: "Java-ML"
    description: "Machine learning library for Java offering algorithms for classification, clustering, regression, and more. Relatively lightweight and suitable for running on a Raspberry Pi."
    github_url: "https://github.com/AbeelLab/javaml"
  - name: "MOA (Massive Online Analysis)"
    description: "Framework for data stream mining and machine learning, written in Java. Used for real-time analytics on the Raspberry Pi. Designed to handle large-scale data streams efficiently."
    github_url: "https://github.com/Waikato/moa-2014"

Sports and Stocks

1. Associative Property

Sports: In team sports, the order in which players or strategies are combined can lead to the same outcome. For example, the outcome of a relay race is the same regardless of the order in which the team's legs are combined.
Stocks: In portfolio management, the order in which returns are compounded over time follows the associative property. For example, the overall return of a portfolio is the same regardless of the order in which individual stock returns are compounded.
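
The compounding claim can be checked numerically. A minimal sketch with made-up period returns: growth factors multiply, and because multiplication is associative (and commutative), regrouping or reordering the periods leaves the total return unchanged up to floating-point rounding.

```python
import math
from itertools import permutations

def compound(returns):
    """Total return from compounding a sequence of per-period returns."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0

RETURNS = [0.05, -0.02, 0.10]  # hypothetical period returns

# Compound the same returns in every possible order.
totals = [compound(order) for order in permutations(RETURNS)]
print(f"total return: {totals[0]:.4f}")
print("all orders agree:", all(math.isclose(t, totals[0]) for t in totals))
```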

2. Commutative Property

Sports: In scoring systems, the order of scoring can be changed without affecting the total score. For example, a basketball team scoring 2 points and then 3 points results in the same total score as scoring 3 points and then 2 points.
Stocks: The commutative property applies to the addition of returns in a diversified portfolio. For example, the sum of returns from two stocks is the same regardless of the order in which they are added.

3. Distributive Property

Sports: In resource allocation, distributing resources (e.g., energy, time, training) across different players or activities follows the distributive property. For example, the total training time can be distributed across different exercises for each player.
Stocks: In financial analysis, distributing investment across different asset classes follows the distributive property. For example, the overall risk of a portfolio can be distributed across stocks, bonds, and other assets.

4. Linear Regression

Sports: Linear regression is used to predict player performance based on various factors such as training, past performance, and physical attributes.
Stocks: Linear regression is used to analyze the relationship between stock prices and various economic indicators, helping to predict future stock prices.
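
A minimal sketch of ordinary least squares for a single predictor, written out by hand so the arithmetic is visible. The training-hours and points figures are invented for illustration and carry no real-world meaning.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data: weekly training hours vs. points per game.
hours = [2, 4, 6, 8]
points = [9, 13, 17, 21]

slope, intercept = fit_line(hours, points)
print(f"predicted points at 10 hours: {slope * 10 + intercept:.1f}")
```

The sample data lie exactly on a line, so the fit is perfect; real performance or price data would leave residuals, and a fitted slope says nothing about causation.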

5. Game Theory

Sports: Game theory is used to analyze strategic interactions between teams or players, helping to determine optimal strategies in competitive scenarios.
Stocks: Game theory is used to analyze the interactions between market participants, helping to determine optimal trading strategies in competitive markets.

6. Optimization

Sports: Optimization techniques are used to improve team performance, player training schedules, and game strategies.
Stocks: Optimization techniques are used in portfolio management to maximize returns and minimize risk.

7. Probability and Statistics

Sports: Probability and statistics are used to analyze player and team performance, predict game outcomes, and make strategic decisions.
Stocks: Probability and statistics are used to analyze market trends, predict stock prices, and assess investment risks.

Mathematics and Large Language Models (LLMs)

When it comes to mathematics and large language models (LLMs), there are some interesting considerations:

Mathematical Operations: Mathematical operations can be idempotent in certain contexts. For example, adding zero to a number repeatedly or multiplying a number by one doesn't change the result.
Consistency in Responses: While LLMs are not inherently idempotent due to their probabilistic nature, they can still provide consistent responses for straightforward mathematical queries. For example, if you ask for the sum of 2 + 2, the model should consistently respond with 4.
Complex Calculations: For more complex mathematical problems, LLMs may use built-in functions or external tools to perform calculations. While the underlying computations are consistent, the model's response might vary slightly in phrasing or presentation due to its language generation capabilities.
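
Idempotence has a precise test: applying a function twice gives the same result as applying it once. A minimal sketch follows; the `is_idempotent` helper and the sample inputs are illustrative, and checking a handful of values demonstrates the property without proving it.

```python
def is_idempotent(func, samples):
    """True if func(func(x)) == func(x) for every sample value."""
    return all(func(func(x)) == func(x) for x in samples)

SAMPLES = [-3, 0, 2, 7]

checks = {
    "abs": is_idempotent(abs, SAMPLES),
    "multiply by one": is_idempotent(lambda x: x * 1, SAMPLES),
    "add zero": is_idempotent(lambda x: x + 0, SAMPLES),
    "add one": is_idempotent(lambda x: x + 1, SAMPLES),
}

for name, result in checks.items():
    print(f"{name}: {result}")
```

A deterministic function like these can be tested this way; an LLM's sampled output cannot be, which is why its consistency on simple arithmetic is an observed tendency rather than a guarantee.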

ASCII

Artist / Group	Active Era	Notable For	URL / Reference
Early Anonymous ASCII Pioneers	1960s–1970s	Pioneering the creation of computer-based text art on early mainframes and teletype systems.	SCI Python – ASCII Art
Scott Fahlman	1982	Coining the first emoticon (:-)) as a simple form of text-based expression—a precursor to broader ASCII art culture.	Wikipedia: Emoticon
Joan Stark (jgs)	1990s	Compiling and popularizing a vast collection of intricate ASCII art; her work remains one of the most recognized online.	ASCII Art Archive at asciiart.eu
Modern ASCII Art Community	2000s–Present	A diverse group of online contributors using updated tools and techniques to keep the ASCII art tradition alive.	ASCII Everything

XML Dialects

xml_dialects:
  - name: AIML (Artificial Intelligence Markup Language)
    url: http://www.aiml.foundation/doc.html
    creation_date: 2001
    last_update_date: 2018
  - name: Collada (Collaborative Design Activity)
    url: https://www.khronos.org/collada/
    creation_date: 2004
    last_update_date: 2016
  - name: CityGML
    url: https://www.ogc.org/publications/standard/citygml/
    creation_date: 2008
    last_update_date: 2012
  - name: BeerXML
    url: http://beerxml.com/beerxml.htm
    creation_date: 2003
    last_update_date: Unknown
  - name: CellML
    url: https://www.cellml.org/specifications/cellml_2.0
    creation_date: 1999
    last_update_date: 2019
  - name: XBRL (eXtensible Business Reporting Language)
    url: https://specifications.xbrl.org/
    creation_date: 1998
    last_update_date: 2023
  - name: IMS Content Packaging
    url: https://www.imsglobal.org/specifications.html
    creation_date: 2000
    last_update_date: 2022
  - name: Akoma Ntoso
    url: https://docs.oasis-open.org/legaldocml/akn-core/v1.0/akn-core-v1.0-part2-specs.html
    creation_date: 2004
    last_update_date: 2018
  - name: CMIS (Content Management Interoperability Services)
    url: https://specifications.oasis-open.org/cmis/
    creation_date: 2008
    last_update_date: 2017
  - name: DITA (Darwin Information Typing Architecture)
    url: https://www.oasis-open.org/committees/dita/
    creation_date: 2005
    last_update_date: 2023
  - name: OPML (Outline Processor Markup Language)
    url: https://dev.opml.org/
    creation_date: 2000
    last_update_date: 2022
  - name: MSBuild (Microsoft Build Engine)
    url: https://learn.microsoft.com/en-us/visualstudio/msbuild/msbuild
    creation_date: 2003
    last_update_date: 2023
  - name: OpenSearch
    url: https://opensearch.org/
    creation_date: 2005
    last_update_date: 2021
  - name: EPUB (Electronic Publication)
    url: https://www.w3.org/publishing/epub3/
    creation_date: 1999
    last_update_date: 2023
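
Because these dialects are all XML, a generic parser can read them. The sketch below walks a small, hand-written OPML outline with Python's standard `xml.etree.ElementTree`; the document content is invented, and only the `outline` element and its `text` attribute from OPML are relied on.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML document, written inline for illustration.
OPML = """<opml version="2.0">
  <head><title>Reading list</title></head>
  <body>
    <outline text="Machine Learning">
      <outline text="Decision Trees"/>
      <outline text="NLP"/>
    </outline>
    <outline text="Data Quality"/>
  </body>
</opml>"""

def outline_paths(xml_text):
    """Return each outline entry as a ' > '-joined path from the top level."""
    def walk(node, prefix):
        for child in node.findall("outline"):
            path = prefix + [child.get("text", "")]
            yield " > ".join(path)
            yield from walk(child, path)

    body = ET.fromstring(xml_text).find("body")
    return list(walk(body, []))

for line in outline_paths(OPML):
    print(line)
```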

Limitations

comparison:
  concepts:
    Wet-on-Wet Painting:
      description: "Involves continuous blending of colors on a wet surface, allowing for organic transitions."
      plateaus:
        - plateaus due to formulaic use: "Over-reliance on predictable blending methods results in paintings that lack variation or originality."
        - loss of skill refinement: "Artists may neglect controlled brush techniques, reducing their ability to create detailed or layered effects."
        - handicap in manual execution: "Without intentional layering discipline, artists struggle with dry brush techniques or structured painting styles."
        - over adaptive to external inputs: "Environmental conditions (humidity, drying time) dictate the painting process more than deliberate artist control."
        - creativity restrictions: "The expectation of fluid blending limits experimental approaches such as defined edges, hard contrasts, and unconventional textures."
    
    Coding with AI Assistance:
      description: "Developers interact dynamically with AI-generated suggestions, refining code iteratively."
      plateaus:
        - plateaus due to formulaic use: "Repeated reliance on AI-suggested patterns leads to generic coding structures, reducing innovation in problem-solving."
        - loss of skill refinement: "Critical coding techniques like algorithm design, memory optimization, and debugging may weaken over time."
        - handicap in manual execution: "Developers struggle when coding without AI prompts, finding it harder to construct solutions independently."
        - over adaptive to external inputs: "AI biases in suggestions may override better manual approaches, resulting in suboptimal code structures."
        - creativity restrictions: "Programmers may avoid unconventional coding patterns or experimental solutions that AI does not readily suggest."
    
    GPS-Based Driving:
      description: "Real-time navigation adjusts dynamically based on external inputs like traffic and road conditions."
      plateaus:
        - plateaus due to formulaic use: "Drivers default to GPS routes instead of exploring alternatives, limiting their awareness of geography and route diversity."
        - loss of skill refinement: "Ability to manually plan routes, read road signs, and estimate travel times deteriorates over time."
        - handicap in manual execution: "In situations where GPS fails, drivers struggle to navigate using maps, intuition, or spatial reasoning."
        - over adaptive to external inputs: "Drivers become overly dependent on live traffic updates, reacting passively rather than proactively choosing efficient paths."
        - creativity restrictions: "Rigid adherence to suggested routes prevents improvisation, such as scenic detours or alternate paths that might be more efficient."

  comparison_summary: "Each technique—painting, coding with AI assistance, and GPS-based driving—faces plateaus when used repetitively or with excessive dependence on automated suggestions. Over-reliance can lead to skill degradation, reduced manual problem-solving ability, and diminished creativity."

footer:
  links:
    Wet-on-Wet Painting Overview: "https://www.arts.gov/stories/blog/2021/wet-wet-technique-artistic-expression"
    AI-Assisted Coding Overview: "https://www.nist.gov/news-events/news/2023/ai-assistance-coding-nist-insights"
    GPS Navigation Algorithms: "https://www.transportation.gov/research-and-technology/gps-navigation-impact-traffic"

Paywall Trade Secrets

API Name	Primary Use Case	Access Requirements	Documentation Availability	Alternative Public API (Limited Scope)
Xignite API	Market data, stock quotes, financial analytics	Paid subscription required	Limited public details; full API behind paywall	Alpha Vantage
Bloomberg Terminal API	Real-time financial data, analytics, trading insights	Requires Bloomberg Terminal subscription	Restricted to Bloomberg clients	Twelve Data
FactSet API	Financial research, portfolio management	Enterprise-level subscription	Only available to FactSet clients	Financial Modeling Prep
Morningstar Direct API	Investment research, fund analysis	Morningstar Direct subscription	Limited public access; full API restricted	Quandl Free Tier
S&P Capital IQ API	Company fundamentals, market insights	S&P Capital IQ subscription	Restricted to paying customers	OpenFIGI
Refinitiv Eikon API	Market data, financial analytics, trading tools	Requires Refinitiv Eikon subscription	Limited public details; full API behind paywall	IEX Cloud
Quandl Premium APIs	Alternative financial data, economic indicators	Paid subscription for premium datasets	Limited details for premium datasets	EOD Historical Data

Benchmarking

https://arxiv.org/abs/2310.03128