AI Discussions - jmadison222/knowledge GitHub Wiki
General AI discussions for reference.
Summaries on this page were either taken from the original work or written by AI.
My current ranking of the major AI engines:
- Claude - Gives concise answers. Actually admits when it doesn’t know. Has a bias toward coding. Spares me the compliments.
- Gemini - Good all-around solution.
- Copilot - Too verbose, but when I do need a verbose answer I go here. Will run you in circles when it doesn’t know. Too many sugary compliments.
- ChatGPT - Didn’t care for it. And Sam Altman is evil, so I refuse to use it.
- A Small Number of Samples Can Poison LLMs of Any Size - In a joint study with the UK AI Security Institute and the Alan Turing Institute, we found that as few as 250 malicious documents can produce a "backdoor" vulnerability in a large language model—regardless of model size or training data volume. Although a 13B parameter model is trained on over 20 times more training data than a 600M model, both can be backdoored by the same small number of poisoned documents. Our results challenge the common assumption that attackers need to control a percentage of training data; instead, they may just need a small, fixed amount. Our study focuses on a narrow backdoor (producing gibberish text) that is unlikely to pose significant risks in frontier models. Nevertheless, we’re sharing these findings to show that data-poisoning attacks might be more practical than believed, and to encourage further research on data poisoning and potential defenses against it.
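The "fixed amount, not percentage" point is easy to see with back-of-envelope arithmetic. A quick sketch (the corpus document counts below are made up for illustration; only the 250-document figure comes from the study):

```python
# The same 250 poisoned documents are a shrinking fraction of a growing
# training set -- yet the study found both models could be backdoored.
POISONED = 250
corpora = {  # document counts are illustrative, not from the paper
    "600M-param model": 12_000_000,
    "13B-param model": 260_000_000,
}
for model, n_docs in corpora.items():
    frac = POISONED / n_docs
    print(f"{model}: {frac:.6%} of training documents")
```

The attacker's cost stays constant while the defender's haystack grows, which is why the finding matters.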
- Computing Machinery and Intelligence - Paper by Alan Turing. Poses the question "Can machines think?" but Turing found this formulation problematic, since defining "machines" and "think" could lead to misleading conclusions. Instead, he reframed it using what he called the "Imitation Game" (now known as the Turing Test). In this test, a human judge converses with both a computer and a human through typed messages, and both try to convince the judge they are human. If the judge cannot consistently distinguish between them, the machine passes the test.
- Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability - AI assistants like GitHub Copilot and Cursor improve developer speed, but their impact on maintainability is less understood. This preregistered, two‑phase experiment with 151 mostly professional developers tested whether AI‑assisted code is harder or easier for others to evolve. Developers first added a feature with or without AI, and then new developers later modified those solutions without AI. The study found no meaningful differences in evolution time or code quality between AI‑assisted and non‑assisted code. While AI users in Phase 1 were significantly faster, the resulting code showed no systematic maintainability advantages or drawbacks. The authors note that future work should examine risks such as code bloat and cognitive offloading.
  - Summary Video - Summary of the paper by Modern Software Engineering.
- Firm Data on AI - We present the first representative international data on firm-level AI use. We survey almost 6,000 CFOs, CEOs and executives from stratified firm samples across the US, UK, Germany and Australia. We find four key facts. First, around 70% of firms actively use AI, particularly younger, more productive firms. Second, while over two thirds of top executives regularly use AI, their average use is only 1.5 hours a week, with one quarter reporting no AI use. Third, firms report little impact of AI over the last 3 years, with over 80% of firms reporting no impact on either employment or productivity. Fourth, firms predict sizable impacts over the next 3 years, forecasting AI will boost productivity by 1.4%, increase output by 0.8% and cut employment by 0.7%. We also survey individual employees, who predict a 0.5% increase in employment in the next 3 years as a result of AI. This contrast implies a sizable gap in expectations, with senior executives predicting reductions in employment from AI and employees predicting net job creation.
  - Summary Video - Summary of the paper by The Tech Report.
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models - Recent advancements in Large Language Models (LLMs) have sparked interest in their mathematical reasoning capabilities. While performance on the widely popular GSM8K benchmark has improved, questions remain about whether reported evaluation metrics are reliable and whether the reasoning abilities of LLMs have truly advanced. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. When we add a single clause that appears relevant to the question, we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause does not contribute to the reasoning chain needed to reach the final answer. Overall, our work provides a more nuanced understanding of LLMs’ capabilities and limitations in mathematical reasoning.
  - My note: The paper that points out that if you scramble the benchmark trivially, performance drops by up to 65%. The benchmark found its way into the training set. Oops.
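The symbolic-template idea is simple to sketch. A toy example (my own illustration, not one of the paper's actual templates): hold the question's structure fixed, vary the names and numbers, and compute the ground-truth answer programmatically so every instantiation can be graded.

```python
# Toy GSM8K-style symbolic template: one problem shape, many instantiations.
# A model that "reasons" should score the same on all of them; a model that
# memorized the benchmark wording will not.
import random

random.seed(1)

TEMPLATE = ("{name} picks {k} kiwis per day for {d} days, "
            "then gives away {g}. How many kiwis are left?")

def instantiate(template):
    name = random.choice(["Sophie", "Liam", "Mia"])
    k = random.randint(2, 9)
    d = random.randint(2, 9)
    g = random.randint(1, k * d - 1)
    question = template.format(name=name, k=k, d=d, g=g)
    answer = k * d - g  # ground truth derived from the symbolic form
    return question, answer

variants = [instantiate(TEMPLATE) for _ in range(3)]
for q, a in variants:
    print(q, "->", a)
```

Because the answer is derived from the template's variables rather than stored alongside a fixed question, the evaluation stays valid no matter how many fresh variants are generated.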
- Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models - With widespread adoption of transformer-based language models (“LLMs”) in AI, there is significant interest in the limits of LLMs’ capabilities, specifically so-called “hallucinations”, occurrences in which LLMs provide spurious, factually incorrect or nonsensical [1, 2] information when prompted on certain subjects. Furthermore, there is growing interest in “agentic” uses of LLMs - that is, using LLMs to create "agents" that act autonomously or semi-autonomously to carry out various tasks, including tasks with applications in the real world. This makes it important to understand the types of tasks LLMs can and cannot perform. We explore this topic from the perspective of the computational complexity of LLM inference. We show that LLMs are incapable of carrying out computational and agentic tasks beyond a certain complexity, and further that LLMs are incapable of verifying the accuracy of tasks beyond a certain complexity. We present examples of both, then discuss some consequences of this work.
  - Summary Video - Summary of the paper by Calib Ulku.
- Learning Representations by Back-Propagating Errors - Published in 1986, this paper broke through the long-standing problem of how to move past a single-layer neural network. It provided a clear, general algorithm for training multi-layer neural networks — what we now simply call backpropagation. The key problem before backpropagation was: how do you assign credit or blame to hidden units in a network? If the output is wrong, which intermediate neurons caused it, and by how much? Backpropagation solves this by applying the chain rule of calculus in reverse — propagating the error gradient from the output layer back through each layer, computing how much each weight contributed to the error. Weights are then adjusted to reduce that error via gradient descent.
  - Learning Video - Video that animates the concept by 3Blue1Brown.
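The credit-assignment idea above fits in a few dozen lines. A minimal sketch (network size, learning rate, and the XOR task are my choices; only the algorithm is the paper's): gradients flow backward through the chain rule, and each hidden unit's "blame" is the output error scaled by the weight connecting it to the output.

```python
# Hand-written backpropagation on XOR -- the classic task a single-layer
# network cannot learn. Hyperparameters here are illustrative.
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
H, lr = 8, 0.5
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x0, x1):
    h = [sigmoid(W1[j][0] * x0 + W1[j][1] * x1 + b1[j]) for j in range(H)]
    y = sigmoid(sum(W2[j] * h[j] for j in range(H)) + b2)
    return h, y

def total_error():
    return sum((forward(x0, x1)[1] - t) ** 2 for (x0, x1), t in data)

initial_error = total_error()
for _ in range(5000):
    for (x0, x1), t in data:
        h, y = forward(x0, x1)
        # Chain rule in reverse: for E = (y - t)^2 / 2, dE/dy = y - t;
        # times sigmoid'(y) = y(1 - y) gives the output unit's error signal.
        dy = (y - t) * y * (1 - y)
        for j in range(H):
            dh = dy * W2[j] * h[j] * (1 - h[j])  # blame assigned to hidden unit j
            W2[j] -= lr * dy * h[j]
            W1[j][0] -= lr * dh * x0
            W1[j][1] -= lr * dh * x1
            b1[j] -= lr * dh
        b2 -= lr * dy
final_error = total_error()
print(f"squared error: {initial_error:.3f} -> {final_error:.3f}")
```

The `dh` line is the whole contribution of the paper in miniature: without it, there is no principled way to tell hidden unit j how to change.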
- LLMs Will Keep Producing Bugs Forever - Video by Zoran. Large language models will always produce bugs, and not because of poor training or careless engineering, but because of the mathematics that governs how they work. LLMs generate text by predicting the next most likely token, which means they operate on probability rather than logic. That probabilistic nature guarantees that some portion of their output will always be wrong, especially in fields like programming where precision is non‑negotiable.
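The core claim is easy to demonstrate in miniature. A toy sketch (the token names and logit values are invented; real models have vocabularies of tens of thousands of tokens): even when the correct token is the most likely one, softmax sampling gives every alternative nonzero probability, so some fraction of draws is wrong by construction.

```python
# Softmax over hypothetical next-token logits, then repeated sampling:
# the wrong tokens are rare but never impossible.
import math
import random

random.seed(42)

logits = {"return": 2.0, "retrun": 0.1, "pass": 0.5}  # invented example
z = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / z for tok, v in logits.items()}

N = 10_000
draws = random.choices(list(probs), weights=list(probs.values()), k=N)
wrong = sum(1 for t in draws if t != "return") / N
print(f"P(correct) = {probs['return']:.2f}, observed wrong-token rate = {wrong:.2f}")
```

Lowering the sampling temperature shrinks the wrong-token rate but cannot make it zero while the distribution assigns any mass to incorrect tokens, which is the video's point about programming.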
- PwC’s 29th Global CEO Survey - Leading Through Uncertainty in the Age of AI - According to the survey by professional services network PwC, more than half of the 4,454 CEO respondents said “their companies aren’t yet seeing a financial return from investments in AI.” Only 30 percent reported increased revenue from AI in the last 12 months. However, a far more significant 56 percent said AI has failed to either boost revenue or lower costs. A mere 12 percent of CEOs reported that AI had accomplished both goals.
  - Summary Article - Summary of the survey by Futurism.
- Remote Labor Index: Measuring AI Automation of Remote Work - Paper by the Center for AI Safety and Scale AI. It introduces the Remote Labor Index (RLI), a new benchmark designed to measure how well AI agents can perform real, economically valuable remote work. Unlike traditional AI benchmarks that focus on narrow tasks or academic problems, RLI consists of full end‑to‑end freelance projects sourced directly from platforms like Upwork. Each project includes the original client brief, the input files, and the professional human deliverable, allowing evaluators to judge whether an AI’s output would realistically be accepted by a paying client.
  - Summary Video - Summary of the paper by ColdFusion.
- The Curse of Recursion: Training on Generated Data Makes Models Forget - Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as model collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
  - My Summary: Training LLMs on LLM-generated content causes model collapse.
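The tail-loss dynamic can be shown with a toy simulation (my own sketch, far simpler than the paper's experiments): repeatedly fit a Gaussian to samples drawn from the previous generation's fit. Finite-sample refitting systematically underestimates the spread, so over many generations the distribution's tails vanish.

```python
# Toy model collapse: generation n+1 is fit only to data generated by
# generation n. Sample size and generation count are arbitrary choices.
import random
import statistics

random.seed(0)

mu, sigma = 0.0, 1.0        # the "real" data distribution
sigma0 = sigma
N = 50                      # samples per generation (small on purpose)
GENERATIONS = 1000

for _ in range(GENERATIONS):
    samples = [random.gauss(mu, sigma) for _ in range(N)]
    mu = statistics.fmean(samples)      # refit mean on generated data
    sigma = statistics.pstdev(samples)  # refit spread on generated data

print(f"spread after {GENERATIONS} generations: {sigma:.4f} (started at {sigma0})")
```

Each refit only ever sees what the previous model produced, never the original data, so rare events stop being generated and then stop being learned — the feedback loop the paper calls model collapse.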