Core AI concepts 26.0406 - terrytaylorbonn/auxdrone GitHub Wiki
LLMs are mostly black boxes. The recent Anthropic leak revealed the reason for that.
This wiki page (WIP) summarizes my take on AI (based on GPT-3). Below: 3D visualizations of inference algorithms. Beautiful. But I am still not sure how they (like many of the accepted technical narratives) are supposed to help someone understand how LLM AI really works.

- 1a Human intelligence is Something magical (that AI does not have).
- 1b Machine (AI) intelligence is merely Binary computation. Nothing more. No real intelligence. 0. The strength of AI is electronic signal/switching speeds and virtually unlimited power/size (compared to biological platforms).
2 An LLM (GPT-3) is programmed (not "trained") to output a text response to text input (and recently other media). The response is a statistical approximation based on text input used to program ("train") the LLM NN ("neural" network). An LLM has 2 main components:
- Transformer (TF) that generates output tokens based on input tokens (the NN).
- Internal "agent". A deterministic (like Python) program loop that interfaces between the human user and the TF.
- 3a LLM TF main components. The FFN (that detects meanings) and the AH (that shares info between related tokens).
- 3b LLM TF main loop (GPT-3). AI is computational statistics running on clocked binary logic.
- 3c LLM internal Agent (iAgent) main loop. (TODO)
4 The last half century of development has led to practical AI. It's only recently become possible to implement brute-force AI computing methods (similar to how PC-based video was not possible until the required computing HW for 3D graphics was developed). The key developments were:
- 4.1 Binary computing based on integrated semiconductors
- 4.2 AI based on binary computing
- 5 Why LLM transformers (TFs) can be used for many types of applications (not just for LLMs). TFs can be used for any function that (1) is too complex to model with deterministic equations and (2) can be "trained" (programmed) using massive amounts of training data.
6 Training (GPT-3). "Training" uses complex SW tools and techniques to program the TF weights and biases. The training SW compares the input and output of massive amounts of training text to tweak the TF weight/bias values. This is perhaps the most important topic:
- TF design is driven by trainability.
- Training is hit or miss, trial and error. It's an art. It often fails and must restart from zero.
- The trainers do not see weight and bias values. They do not control the exact values.
- Researchers run tests to discover where model "intelligence" is actually implemented in the TF.
- Limitations and hype
A mystery that is hosted/incarnated in the human brain. The most fascinating thing in the universe. Something the brilliant, all-knowing AI gurus, those who will lead us to a more enlightened future, compare to crude brute-force clocked binary circuit algorithms.
Human communication is pure magic:
- Input = hearing or reading words. Humans can hear and see. AI just has no senses. None.
- Language (written, spoken) = a very primitive inexact method of communicating the vastly complex thoughts, concepts, and ideas of humans.
- Each language has different structures for distilling / encoding meaning into written/audio form (I speak 4 languages, I know this from personal experience).
- You read words, you formulate thoughts, and you distill your response into language-specific text/words. Your thoughts are the driving force. AI (LLM TFs) just compute the next token.
Intelligence is the ability to understand things, imagine things, see relationships. It's what makes us human. AI has none of these things. None. AI's only value is as a tool for intelligent humans. And in that role AI is revolutionary.
Human intel is bound to time. It's like music: a snapshot in time means nothing.
Diagram: The very term "artificial (man-made) intelligence" is a marketing gimmick
AI "intel" is clocked binary computation. A clock pulse causes circuits to compute outputs from inputs. After enough time has passed for the output signals to reach the inputs of other circuits, the next clock pulse is generated. And so on. AI intel is simply a state machine; time itself means nothing. You can take a snapshot that reveals the entire state of the machine and its "intel".
AI as a man-machine interface. Previous attempts put the machine in the human brain. Those mostly failed (notable were attempts to read magnetic fields to determine what the human was thinking; but the brain is electrochemical, not electromagnetic or mechanical). Current AI is a successful MMI. The interface? Words, audio, video, imagery.
GPT: Why current AI succeeded. Not because we solved “thinking”. But because we used: language = already-encoded human thought.
GPT-3 LLM consists of
- transformer (TF) (the token sequence generator NN)
- an "internal agent", the code that interfaces between the human user and the TF.
This section describes the 2 main components of the TF
- FFN (red box below)
- AH (orange box)
The diagram below shows (for GPT-3) the FFN/AH for 1 of 2048 tokens and 1 of 96 layers.
DIAGRAM 3a
The FFN detects meanings or patterns in the 12288 dimensions for each token. This section discusses
- 1 Neurons and interconnections
- 2 Detectors
- Each of the 12288 token hidden-layer dimensions (floating-point numbers x1...x12288) is input to each of the 49152 h-layer detector neurons (h1...h49152).
- Each of the 49152 detector neuron outputs (h1...h49152) is input to each of the 12288 y-layer output neurons (y1...y12288).
DIAGRAM
The h neurons perform detection (detection is explained later).
For each h layer neuron
- multiply x1 * W(x1) (weight)
- repeat for x2...x12288
- add all 12288 results
- add bias
- run GELU and output the result
The detections are merged back into the 12288 dims in the y layer.
For each y layer neuron
- multiply h1 * W(h1) (weight)
- repeat for h2...h49152
- add all 49152 results
- add bias
- output as the final result (in the standard GPT-3 FFN the y layer has no activation function; GELU is applied only in the h layer)
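The h-layer and y-layer steps above can be sketched as a tiny FFN. This is a hedged sketch: real GPT-3 uses d_model = 12288 and d_hidden = 49152, the weights here are random stand-ins, and the GELU uses the common tanh approximation.

```python
import numpy as np

# Shrunken stand-in for the GPT-3 FFN sizes (real: 12288 / 49152)
d_model, d_hidden = 8, 32

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d_hidden, d_model))  # x -> h weights
b1 = np.zeros(d_hidden)                        # h biases
W2 = rng.normal(0, 0.02, (d_model, d_hidden))  # h -> y weights
b2 = np.zeros(d_model)                         # y biases

def gelu(v):
    # tanh approximation of GELU
    return 0.5 * v * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (v + 0.044715 * v**3)))

def ffn(x):
    h = gelu(W1 @ x + b1)  # h layer: weighted sum + bias + GELU (detection)
    return W2 @ h + b2     # y layer: merge detections back into d_model dims

x = rng.normal(size=d_model)  # one token's hidden-state vector
print(ffn(x).shape)           # (8,)
```

In GPT-3 this computation runs independently for each of the 2048 token positions in each of the 96 layers.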
A hidden layer turns a problem that cannot be solved by adding inputs into one that can — by first building detectors.
This section describes detectors with an example. Many of the concepts and diagrams in this section are my own.
An XOR is used to detect when only 1 of 2 inputs is 1. This is a good example for FFN detectors, which are often used to detect such a relationship between 2 dims for a token.
diagram 2.1: boolean xor gate
The following table shows example values
x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 0
The reason for doing it this way is that the FFN has no direct AI-gate equivalent of an XOR. We need to build one from "simple" gates (I am not sure what else to call them).
diagram 2.2: boolean xor using simple gates
The following table shows example values
x1 x2 h1 h2 y
0 0 0 1 0
0 1 0 0 1
1 0 0 0 1
1 1 1 0 0
The h and y neurons will serve as our "additive" gates. That means you have 12K or 49K inputs, each of which is multiplied by a weight and added to the internal sum. Add the bias to the sum, then do GELU and output.
diagram 2.3a: neuron xor
We want an XOR FFN using neurons that
- add the weighted inputs
- add bias to the sum
- (h only) output non-linear (detection only) outputs (0 or 1, not -1)
h1 (Hidden layer neuron):
input x1 with W(h1.x1) = 1
input x2 with W(h1.x2) = 1
bias = -1
GELU
h1 = x1 AND x2
h2:
input x1 with W(h2.x1) = -1
input x2 with W(h2.x2) = -1
bias = +1
GELU
h2 = x1 NOR x2
y (output layer neuron):
input h1 with W(y.h1) = -1
input h2 with W(y.h2) = -1
bias = +0.5
GELU (?)
y = h1 NOR h2
result:
y = x1 XOR x2
diagram 2.3b: h1, h2 as f ( x1, x2)
diagram 2.3c: y1 as f (h1, h2)
What the diagrams above show is that our additive gates will perform the (rough) equivalent to an XOR gate. There is much more explanation to write here (TODO).
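The weights and biases above can be checked numerically. A minimal sketch using the exact GELU (via the normal CDF); note that with GELU instead of a hard threshold the outputs are only approximately 0/1: roughly 0.35 when XOR is true and near 0 when it is false.

```python
import math

def gelu(v):
    # exact GELU: v * Phi(v), where Phi is the standard normal CDF
    return v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def xor_ffn(x1, x2):
    # weights and biases exactly as in the example above
    h1 = gelu( 1 * x1 + 1 * x2 - 1.0)   # ~ x1 AND x2
    h2 = gelu(-1 * x1 - 1 * x2 + 1.0)   # ~ x1 NOR x2
    y  = gelu(-1 * h1 - 1 * h2 + 0.5)   # ~ h1 NOR h2
    return y

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xor_ffn(x1, x2), 3))
```

Replacing GELU with a hard step function would give exact 0/1 outputs; GELU is used here only because that is what the FFN text above specifies.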
AH is based on Wx + b computations with "activation" function Softmax (not GELU).
This drawing summarizes all the computations in __ziptieai_BOOK2_LLM_v86_251230 section 17.1.1 "AH (attention heads)". Quite complex.
- Below: The upper example shows 2 matrices multiplied. A 1x128 W matrix on the left and a 128x1 hidden layer on the right produce a 1x1 scalar (like FFN neurons: many inputs, one output).
- Below: The lower half shows a 128 x 12288 W matrix dotted with a 12288 x 1 vector (all hidden-layer dims). Very typical for AH: many inputs, many outputs (AH does more modification and sharing of info, not FFN-style detection). This is the equivalent of 128 neurons.
- QK determines how much one token (128 dims) influences another token (the 128 dim V matrix is added to target token).
- Softmax (SM) zeroes out the influence of all but 1-3 strongly related tokens (similar to how GELU zeroes out the influence of negative sums).
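The QK/softmax/V steps above can be sketched as one toy attention head. This is a hedged sketch: shapes are shrunk from GPT-3's 128-dim heads, the weights are random stand-ins, and a causal mask keeps each token from attending to later tokens.

```python
import numpy as np

# Toy single attention head (GPT-3 uses d_head = 128; shrunk here)
n_tokens, d_model, d_head = 3, 16, 4

rng = np.random.default_rng(1)
X  = rng.normal(size=(n_tokens, d_model))    # hidden states, one row per token
Wq = rng.normal(0, 0.1, (d_model, d_head))
Wk = rng.normal(0, 0.1, (d_model, d_head))
Wv = rng.normal(0, 0.1, (d_model, d_head))

Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d_head)           # QK: how strongly tokens relate

# causal mask: a token may only attend to itself and earlier tokens
mask = np.triu(np.ones((n_tokens, n_tokens)), k=1).astype(bool)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # softmax: each row sums to 1

out = weights @ V    # each token receives a weighted mix of V vectors
print(out.shape)     # (3, 4)
```

The softmax rows show the down-weighting described above: weakly related tokens end up with near-zero weight, while a few strongly related tokens dominate.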
This section describes the AH, FFN h layer, and FFN y output layer (inference only; for training see section 6).
See also "_ziptieai_BOOK2_LLM.docx" for
- Ch1 "LLM basics"
- Sections "5.1.1 Inference workflow overview" and "17.1.1 AH (attention heads)" (for more diagrams)
TOC
- 2.0 Overview diagram
- STEP 2.1 initial embedding
- STEP 2.2 AH
- STEP 2.3 FFN (overview)
- STEP 2.4 95 more layers / Simple Emergent Structure develops
- STEP 2.5 (after layer 96) predict next token from final state
- STEP 2.6 then start whole process all over at STEP 1 (with new token appended)
The following shows how the input to the TF starts out as a prompt; after each loop the new token (computed by the TF) is appended to the input.
All steps.
Steps diagram.
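The outer generation loop described above can be sketched as follows. This is hypothetical: `transformer_next_token` is a dummy stand-in name for the real 96-layer TF, here just a toy deterministic function.

```python
def transformer_next_token(tokens):
    # placeholder: a real TF would run 96 layers of AH + FFN here
    return sum(tokens) % 100  # dummy deterministic "prediction"

def generate(prompt_tokens, max_new, stop_token=None):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        nxt = transformer_next_token(tokens)
        if nxt == stop_token:
            break
        tokens.append(nxt)   # new token becomes part of the next input
    return tokens

print(generate([5, 7, 11], max_new=3))  # -> [5, 7, 11, 23, 46, 92]
```

The `stop_token` check is a stand-in for the TF or the internal agent deciding to stop generation (STEP 2.6 above).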
For simplicity, I use example diagrams with 3 tokens as input.
Input: 3 tokens (3 sets of about 1-6 ASCII characters). These are converted to 12288 FP numbers per token (embedding).

Each token has 12288 scalars that define everything about the token.
The initial embedding is then input to the 96-layer loop. When input, it is converted into what's called the hidden state (still 12288 numbers), in the "hidden layer".
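The embedding step above can be sketched as a lookup table: each token ID indexes a row of a learned embedding matrix. Sizes are shrunk here (GPT-3 uses a vocabulary of roughly 50K tokens and 12288 dims), and real GPT also adds positional embeddings, which are omitted.

```python
import numpy as np

# Toy embedding lookup (GPT-3: vocab ~50K, d_model 12288; shrunk here)
vocab_size, d_model = 100, 12

rng = np.random.default_rng(2)
E = rng.normal(0, 0.02, (vocab_size, d_model))  # learned embedding table

token_ids = [17, 42, 3]   # 3 input tokens as IDs
hidden = E[token_ids]     # shape (3, d_model): the initial hidden state
print(hidden.shape)       # (3, 12)
```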
STEP 2.2 AH NEW (inside hidden layers)
The algorithm is quite complex.
AHs share info between related tokens.
The handwritten diagram below shows what is happening in the second diagram below.
"_ziptieai_BOOK2_LLM.docx" section 17.1.1 shows the notation I created that makes the details clear (GPT confirmed that my notation and text are correct).
In reality, in GPT-3 there are (I believe) about 500 million AH connections for 2048 tokens (max context size).
NOTE: The FFN (to me) is the main part of the UFA. Section "3.2.2 Detectors" of this wiki page shows a simple FFN with 2 x inputs, 4 h neurons, and 1 y output.
A basic FFN example is discussed in "3 Very basic UFA example" of docx #600.
In reality, in GPT-3 there are
- x1… x12288 inputs to each h layer.
- 49152 (12288 x 4) h neurons per layer (h1… h49152).
- y outputs y1..y12288.
A super massive computational surface.
- Repeat the above loop 95 times.
- The 12288 scalars that define each token state “interact” during these 96 layers.
- Inside those 12288 scalars an “emergent structure” is created.
See “#514 Simple emergent structure (as layers progress)”.
During these 96 loops (LLM attention across layers):
First: the 12288 numbers for each token are being (1) shared between related tokens by the AH and (2) detected within each token by the FFN.
Middle
Late
Final
https://youtu.be/wjZofJX0v4M?t=1317
Until the TF or the agent stops the generation of tokens.
There is no better example of AI hype than the beginning of this video, where Marc Andreessen claims that the computer industry took a wrong turn 80 years ago when it decided to build binary computers rather than AI computers. That is one very bizarre claim (I find it hard to believe Andreessen really said that). So I asked GPT: "Were 'AI computers' possible 80 years ago?" The answer:
"Modern AI systems require:
- Massive matrix multiplications
- Floating-point arithmetic
- Huge memory bandwidth
- Billions of parameters
Even a small GPT-class model would have been unimaginable in 1950 hardware terms. So saying the industry “chose binary instead of AI” is historically misleading. AI requires digital computation. It was not an alternative path." "Historically misleading": a typical politicized GPT answer. AI is booming now because the required foundational tech only recently became available. Pretty basic stuff.
The following summarizes the basic history of binary and AI computing (my own perspective). See also
- "__ziptieai_BOOK2_LLM_.docx" ch2 "How we got here"
- docx #600
4.1a For the past 50 years, we sometimes did not even know why certain aspects of integrated semiconductors worked. Researchers simply tested, discovered, and then tried to figure out why it worked.
4.1b But it all stabilized gradually. Binary computing based on integrated semiconductor transistors has been making unimaginable gains ever since. When I worked in an IBM VLSI test lab in the early 1980s, state of the art was 2-3 microns (2000-3000 nm) and 25MHz with yields ~10%. Now state of the art (set by world leader Taiwan) is clock frequencies in GHz and 3-5 nm.
Below: Intel 4004 from 1971
4.2a The era of practical AI has only recently started thanks to recent gains in performance in the semiconductor transistor-based GPUs that it runs on. Experts are often not sure exactly how certain aspects of AI work, but they continue to test, discover, and then develop theories that explain it all.
4.2b But AI is developing at a much faster rate than semiconductors. Where this will lead is hard to say. But as I recently mentioned in a chat with ChatGPT, the stunning performance and usefulness of GPT (something I could not have imagined just a few years ago) shows that the massive investments in AI are not a bubble.
Below: GeForce RTX (left), UFA 3d map (center; from Welch labs), and resulting logic map (right). The map is of a UFA that has 2 inputs (lat/longitude), and the output is binary (in Belgium or Netherlands).
The similarities of AI GPU algorithms across different types of apps (basic idea: for example, CNN and TF similarities). For details see the following.
"_ziptieai_BOOK2_LLM.docx" ch 16 LLM analogies
"_ziptieai_BOOK2_LLM.docx" ch 15a Other UFA
(My own take.) "Training" uses training SW to program the TF (set the weights and biases). The training SW compares the input and output for massive amounts of text to compute the weight/bias values. The TF learns nothing.
I don't quite understand the math, but basically the loss function is a very simple math function (chosen for easy computing).
The loss function should be a 2D f(x) = (loss equation). But you often see these 3D visualizations. I'm not really sure what these are supposed to be. Changing a single Wx will change the loss curve to some extent for all other Wx's. But this diagram apparently shows 2 inputs with the output on the z axis. I still have not figured out the meaning.
https://youtu.be/qx7hirqgfuU?t=900
Assume W = 0.5 is the best fit. Then as W goes to +infinity the squared loss (and its slope) gets bigger; as W goes to -infinity the squared loss also gets bigger (but not quite as fast). These slopes are the gradients.
All 12288 gradients sum up linearly for an h neuron. Bias is added to the sum. If bias = -2, the output is shifted right by 2; this means the sum does not go positive until the weighted inputs exceed 2.
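A minimal gradient-descent sketch of the idea above: a 1-parameter squared loss whose best fit is W = 0.5 by construction. The learning rate and data points are arbitrary stand-ins.

```python
# Squared loss of a 1-parameter model y_pred = W * x
def loss(W, data):
    return sum((W * x - y) ** 2 for x, y in data) / len(data)

def grad(W, data):
    # d/dW of (W*x - y)^2 = 2 * (W*x - y) * x
    return sum(2 * (W * x - y) * x for x, y in data) / len(data)

data = [(x, 0.5 * x) for x in (1.0, 2.0, 3.0)]  # targets generated with W = 0.5

W = 5.0                        # start far from the best fit
for _ in range(200):
    W -= 0.05 * grad(W, data)  # step downhill along the gradient

print(round(W, 4))  # -> 0.5
```

The gradient's sign points away from the minimum at W = 0.5, so repeatedly subtracting a small multiple of it walks W toward the best fit, which is the same tweaking of weight values the training SW performs at scale.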
see "_ziptieai_BOOK2_LLM.docx" chapters
- 5.1.4
- 19 LLM transformer training