Machine Learning

Data preprocessing involves gathering and cleaning data, which is essential for any machine learning task.
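A minimal sketch of that step, assuming a hypothetical CSV with a free-text `label` column and a numeric `amount` column (file name and column names are placeholders, not part of any particular dataset):

```python
import pandas as pd

# Hypothetical raw export; the file name and columns are assumptions for illustration.
df = pd.read_csv("raw_data.csv")

# Drop exact duplicates and rows that are missing the target label.
df = df.drop_duplicates()
df = df.dropna(subset=["label"])

# Normalize free-text labels so "Spam", " spam " and "SPAM" collapse into one class.
df["label"] = df["label"].str.strip().str.lower()

# Coerce a numeric feature, discarding values that cannot be parsed.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount"])

df.to_csv("clean_data.csv", index=False)
```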

In new research accepted for publication in Chaos, Ott and his collaborators showed that improved predictions of chaotic systems like the Kuramoto-Sivashinsky equation become possible by hybridizing the data-driven, machine-learning approach with traditional model-based prediction. Ott sees this as a more likely avenue for improving weather prediction and similar efforts, since we don’t always have complete high-resolution data or perfect physical models. “What we should do is use the good knowledge that we have where we have it,” he said, “and if we have ignorance we should use the machine learning to fill in the gaps where the ignorance resides.”
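One way to picture that hybridization (a toy sketch with invented dynamics, not the reservoir-computing setup used in the paper): run an imperfect physical model, then let a small learned corrector fill in the part of the dynamics the model gets wrong.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-ins: both functions are assumptions for illustration only.
def imperfect_model(x):
    return 0.9 * np.sin(x)            # "physics" with a bias and a missing term

def true_system(x):
    return np.sin(x) + 0.1 * x        # the actual dynamics we want to predict

x = np.linspace(-3, 3, 500)
model_pred = imperfect_model(x)
truth = true_system(x)

# Hybrid idea: learn a data-driven correction on top of the model's forecast,
# so the machine learning only has to cover what the physics misses.
features = np.column_stack([x, model_pred])
corrector = Ridge(alpha=1.0).fit(features, truth - model_pred)
hybrid_pred = model_pred + corrector.predict(features)

print("model-only MSE:", np.mean((truth - model_pred) ** 2))
print("hybrid MSE    :", np.mean((truth - hybrid_pred) ** 2))
```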

AI is skilled at games; machine learning is skilled at statistics. Deep learning is also highly susceptible to bias: when Google's facial recognition system was initially rolled out, for instance, it tagged many black faces as gorillas.

Child Psychology

Storytelling

I think stories are what make us different from chimpanzees and Neanderthals. And if story-understanding is really where it’s at, we can’t understand our intelligence until we understand that aspect of it.

Scaffolding

Concepts

Research

SQL Algorithms

Data Quality

Techniques

OCR

Decision Trees

Relational Data Interoperability

Full Text Faceted Search Engine Marketing

NLP

Music

NLG

Gender Prediction

Codegen

Datasets

Packages

Classifiers

Deepfake Detection

Fantasy Basketball

Fantasy Football

Dictionaries

Chatbots

Slackbots

Finite State Machine

Relational Data

Semantic Data

Recognition

Video Games

Prediction

Deepfakes

Bad Fit

What types of data are most poorly labeled among publicly traded companies?

Publicly traded companies often struggle with labeling certain types of data accurately. Some of the most poorly labeled data include:

  1. Soft Information: This includes intangible assets like the value of research and development, employee training, and morale. These are difficult to quantify and often lead to inconsistencies in reporting.
  2. Financial Data: Despite efforts to standardize financial reporting with formats like XBRL (eXtensible Business Reporting Language), there are still issues with comparability and accuracy.
  3. Non-Financial Metrics: Data related to environmental, social, and governance (ESG) factors can be inconsistently labeled and reported, leading to difficulties in comparison and analysis.

Would you like to know more about how companies can improve their data labeling practices?

Bad Data

```yaml
create_llm_with_bad_quality_data:
  steps_and_pitfalls:
    data_collection:
      poor_data_sources: "Using unreliable or unverified sources can result in collecting irrelevant or incorrect information."
      lack_of_diversity: "If the data lacks diversity in language, style, and context, the model will struggle to generalize and understand different inputs."
    data_preprocessing:
      minimal_cleaning: "Not cleaning or preprocessing data properly leads to noisy inputs, including spelling mistakes, grammatical errors, and inconsistent formatting."
      biased_data: "Training with biased data can reinforce harmful stereotypes and provide skewed responses."
    model_training:
      overfitting: "Training with poor quality data can cause overfitting, where the model performs well on the training data but poorly on new, unseen data."
      low_accuracy: "The model will have low accuracy and poor generalization capabilities due to the flawed input data."
    evaluation_and_tuning:
      inaccurate_evaluation: "Evaluating the model with low-quality validation data results in misleading performance metrics."
      poor_tuning: "Inadequate hyperparameter tuning can further degrade the model's performance."
  consequences:
    unreliable_outputs: "The model will generate inaccurate and unreliable responses, undermining its usefulness."
    reinforcement_of_biases: "Using biased data can perpetuate and amplify existing biases."
    increased_risks: "Deploying such a model can lead to misinformation and ethical concerns."
```
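A small sketch of the kind of filtering the data_preprocessing step above implies; the length threshold, alphabetic-ratio cutoff, and blocklist terms are placeholders rather than recommended values.

```python
import re

def passes_quality_checks(text, seen_hashes, blocklist=("lorem ipsum",)):
    """Cheap filters for the pitfalls listed above: noise, junk sources, duplicates."""
    cleaned = re.sub(r"\s+", " ", text).strip()

    if len(cleaned) < 20:                                        # too short to carry signal
        return False
    if any(term in cleaned.lower() for term in blocklist):       # known junk markers
        return False
    if sum(c.isalpha() for c in cleaned) / len(cleaned) < 0.6:   # mostly symbols or noise
        return False

    fingerprint = hash(cleaned.lower())
    if fingerprint in seen_hashes:                               # exact-duplicate removal
        return False
    seen_hashes.add(fingerprint)
    return True

seen = set()
corpus = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "A real, informative training sentence about the domain.",
    "A real, informative training sentence about the domain.",
]
print([t for t in corpus if passes_quality_checks(t, seen)])
# Only the first copy of the informative sentence survives.
```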

Purposeful Exacerbation of Logical Fallacies in Algorithms

When algorithms are purposefully designed to exacerbate logical fallacies like the Gambler's Fallacy and the Hot-Hand Fallacy, it is often to exploit certain behaviors or patterns for specific outcomes, typically in trading or investment contexts. Here are some ways it might happen:

Market Manipulation

Algorithms can be designed to create artificial demand or supply in the market by repeatedly trading based on recent performance, making it appear as though certain stocks are trending. This can mislead other traders into believing there is a sustained trend, which isn't actually based on underlying fundamentals.

Reinforcing Biases

By emphasizing recent trends and neglecting the inherent randomness, these algorithms can exploit the Hot-Hand Fallacy. This can drive up the prices of certain assets artificially, creating a bubble that savvy traders might plan to exploit by shorting once the market corrects itself.

Echo Chamber Effect

Algorithms that amplify the Gambler's Fallacy might focus on past losses or downturns, driving prices down further than warranted by fundamentals. This can lead to undervaluation of assets, which can be exploited later when market corrections occur.
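As a concrete illustration of the two biases described above (a toy sketch, not a real trading strategy): a hot-hand style signal buys after a run of gains, while a gambler's-fallacy style signal buys after a run of losses on the assumption that a bounce is "due." On independent, randomly generated returns neither has a real edge.

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=1000)   # i.i.d. returns: no genuine streaks exist

def hot_hand_signal(past, window=5):
    # Overweights recent performance: buy because the last few moves were up.
    return 1 if np.all(past[-window:] > 0) else 0

def gamblers_fallacy_signal(past, window=5):
    # Expects reversal: buy because the last few moves were down.
    return 1 if np.all(past[-window:] < 0) else 0

hot_pnl = gf_pnl = 0.0
for t in range(5, len(returns)):
    past, nxt = returns[:t], returns[t]
    hot_pnl += hot_hand_signal(past) * nxt
    gf_pnl += gamblers_fallacy_signal(past) * nxt

# Any apparent "edge" here is noise; rerunning with a different seed changes the sign.
print(f"hot-hand P&L: {hot_pnl:.4f}   gambler's-fallacy P&L: {gf_pnl:.4f}")
```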

Encouraging Risky Behavior

By giving undue weight to recent successes, such algorithms can encourage investors to make increasingly risky bets, believing their 'streak' will continue. This can lead to greater market volatility, which can be advantageous to certain trading strategies.

High-Frequency Trading (HFT)

In HFT, algorithms may exploit micro-trends by making thousands of trades per second, based on minute price movements and past performance trends. This can distort market prices and create opportunities for profit by capitalizing on these artificial fluctuations.

It's worth noting that such practices are often scrutinized and regulated to prevent market abuse and protect investors.