Core AI concepts 26.0406 - terrytaylorbonn/auxdrone GitHub Wiki

26.0405 Lab notes (Gdrive) Git


This page is a work in progress (WIP).


LLMs are mostly black boxes. The recent Anthropic leak revealed the reason for that.

This wiki page (WIP) summarizes my take on AI (based on GPT-3). Below: 3D visualizations of inference algorithms. Beautiful. But I am still not sure how they (like many of the accepted technical narratives) are supposed to help someone understand how LLM AI really works.

image image


TOC

The minimum you need to understand:

  • 1a Human intelligence is something magical (that AI does not have).
  • 1b Machine (AI) intelligence is merely binary computation. Nothing more. No real intelligence. 0. The strength of AI is electronic signal/switching speeds and virtually unlimited power/size (compared to biological platforms).
  • 2 An LLM (GPT-3) is programmed (not "trained") to output a text response to text input (and recently other media). The response is a statistical approximation based on the text input used to program ("train") the LLM NN ("neural" network). An LLM has 2 main components:
    • Transformer (TF) that generates output tokens based on input tokens (the NN).
    • Internal "agent". A deterministic (like Python) program loop that interfaces between the human user and the TF.
  • 3a LLM TF main components. The FFN (that detects meanings) and the AH (that shares information between related tokens).
  • 3b LLM TF main loop (GPT-3). AI is computational statistics running on clocked binary logic.
  • 3c LLM internal Agent (iAgent) main loop. (TODO)

Also good to understand:

  • 4 The last half century of development that has led to practical AI. It's only recently become possible to implement brute-force AI computing methods (this situation is similar to how PC-based video was not possible until the required computing HW for 3D graphics was developed). The key developments were
    • 4.1 Binary computing based on integrated semiconductors
    • 4.2 AI based on binary computing
  • 5 Why LLM transformers (TFs) can be used for many types of applications (not just for LLMs). TFs can be used for any function that (1) is too complex to model with deterministic equations and (2) can be "trained" (programmed) using massive amounts of training data.
  • 6 Training (GPT-3). "training" uses complex SW tools and techniques to program TF weights and biases. The training SW compares the input and output of massive amounts of training text to tweak TF weight/bias values. This is perhaps the most important topic:
    • TF design is driven by trainability.
    • Training is hit or miss, trial and error. It's an art. It often fails, and must be restarted from scratch.
    • The trainers do not see weight and bias values. They do not control the exact values.
    • Researchers run tests to discover where model "intelligence" is actually implemented in the TF.
  • Limitations and hype


1a What is human intelligence?

A mystery that is hosted/incarnated in the human brain. The most fascinating thing in the universe. Something that brilliant, all-knowing AI gurus (those who will lead us to a more enlightened future) compare to crude brute-force clocked binary circuit algorithms.

Human communication is pure magic:

  • Input = hearing or reading words. Humans can hear and see. AI just has no senses. None.
  • Language (written, spoken) = a very primitive inexact method of communicating the vastly complex thoughts, concepts, and ideas of humans.
  • Each language has different structures for distilling / encoding meaning into written/audio form (I speak 4 languages, I know this from personal experience).
  • You read words, you formulate thoughts, and you distill your response into language-specific text/words. Your thoughts are the driving force. AI (LLM TFs) just compute the next token.

Intelligence is the ability to understand things, imagine things, and see relationships. It's what makes us human. AI has none of these things. None. AI's only value is as a tool for intelligent humans. And in that role AI is revolutionary.

Human intelligence is bound to time. It's like music: a snapshot in time means nothing.



1b What is machine (AI) intelligence?

Diagram: The very term "artificial (man-made) intelligence" is a marketing gimmick

image

AI "intel" is clocked binary computation. A clock pulse causes circuits to compute outputs from inputs. After enough time has passed for the output signals to reach the inputs of other circuits, the next clock pulse is generated. And so on. AI intel is simply a state machine. Time itself means nothing. You can take a snapshot that reveals the entire state of the machine and its "intel".

AI as a man-machine interface. Previous attempts put the machine in the human brain. Those mostly failed (notable were attempts to read magnetic fields to determine what the human was thinking; but the brain is electrochemical, not electromagnetic or mechanical). Current AI is a successful MMI. The interface? Words, audio, video, imagery.

GPT: Why current AI succeeded. Not because we solved “thinking”. But because we used: language = already-encoded human thought.



2 What is an LLM (GPT-3)

An LLM is programmed (not "trained") to output a text response to text input (and recently other media). The response is a statistical approximation based on the text input used to program ("train") the LLM NN ("neural" network). An LLM has 2 main components:

  • Transformer (TF) that generates output tokens based on input tokens (the NN).
  • Internal "agent". A deterministic (like Python) program loop that interfaces between the human user and the TF.



3a The 2 main transformer (TF) components (FFN / AH)

GPT-3 LLM consists of

  • transformer (TF) (the token sequence generator NN)
  • an "internal agent", the code that interfaces between the human user and the TF.

This section describes the 2 main components of the TF

  • FFN (red box below)
  • AH (orange box)

The diagram below shows (GPT-3) the FFN/AH for 1 of 2048 tokens and 1 of 96 layers.

DIAGRAM 3a

image

3a.1 FFN

The FFN detects meanings or patterns in the 12288 dimensions for each token. This section discusses

  • 1 Neurons and interconnections
  • 2 Detectors  

1 Neurons and interconnections

1.1 overview for a single token

    1. Each of the 12288 token hidden layer dimensions (floating point numbers x1...x12288) is input to each of the 49152 h layer detector neurons (h1...h49152).
    2. Each of the 49152 detector neuron outputs (h1...h49152) is input to each of the 12288 y layer output neurons (y1...y12288).

DIAGRAM

image

1.2 h neurons

The h neurons perform detection (detection is explained later).

For each h layer neuron

  • multiply x1 * W(x1) (weight)
  • repeat for x2...x12288
  • add all 12288 results
  • add bias
  • run GELU and output as result
image

1.3 y neurons

The detections are merged back into the 12288 dims in the y layer.

For each y layer neuron

  • multiply h1 * W(h1) (weight)
  • repeat for h2...h49152
  • add all 49152 results
  • add bias
  • run GELU (??) and output as final result
image
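The h and y layer steps above can be sketched in plain Python. This is a toy sketch only: the dimensions are shrunk (4 inputs and 16 h neurons instead of 12288 and 49152) and the weights in the usage example are made-up placeholders, but the per-neuron computation (weighted sum + bias + GELU) is the same shape. Note I leave the y layer linear, matching the "(??)" above: in published GPT-3 descriptions the second FFN projection has no activation.

```python
import math

def gelu(x):
    # exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn_forward(x, W_h, b_h, W_y, b_y):
    """One FFN block for a single token.

    x:   d_model inputs (12288 in GPT-3, tiny here)
    W_h: d_ff rows of d_model weights (the h "detector" neurons)
    W_y: d_model rows of d_ff weights (the y "merge" neurons)
    """
    # h layer: each neuron sums all weighted inputs, adds bias, runs GELU
    h = [gelu(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_h, b_h)]
    # y layer: merge the d_ff detections back into d_model dims
    # (left linear here; GPT-3's second FFN projection has no activation)
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(W_y, b_y)]
    return y

# toy usage: d_model = 4, d_ff = 16 (the same 4x ratio GPT-3 uses)
x = [1.0, -0.5, 0.25, 0.0]
W_h = [[0.1] * 4 for _ in range(16)]   # placeholder weights
b_h = [0.0] * 16
W_y = [[0.05] * 16 for _ in range(4)]
b_y = [0.0] * 4
y = ffn_forward(x, W_h, b_h, W_y, b_y)  # 4 floats out
```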

2 Detector (example)

A hidden layer turns a problem that cannot be solved by adding inputs into one that can — by first building detectors.

This section describes detectors with an example. Many of the concepts and diagrams in this section are my own.

2.1 boolean XOR

An XOR is used to detect when only 1 of 2 inputs is 1. This is a good example for FFN detectors, which are often used to detect such a relationship between 2 dims for a token.

diagram 2.1: boolean xor gate

image

The following table shows example values:

| x1 | x2 | y |
|----|----|---|
| 0  | 0  | 0 |
| 0  | 1  | 1 |
| 1  | 0  | 1 |
| 1  | 1  | 0 |

2.2 boolean XOR using "simple" gates

The reason for doing it this way is that to implement an XOR in the FFN we don't have AI gate equivalents of an XOR. We need to use "simple" gates (I am not sure what else to call them).

diagram 2.2: boolean xor using simple gates

image

The following table shows example values:

| x1 | x2 | h1 | h2 | y |
|----|----|----|----|---|
| 0  | 0  | 0  | 1  | 0 |
| 0  | 1  | 0  | 0  | 1 |
| 1  | 0  | 0  | 0  | 1 |
| 1  | 1  | 1  | 0  | 0 |

2.3 ai neuron xor using additive gates

The h and y neurons will serve as our "additive" gates. That means you have 12K or 49K inputs, each of which is multiplied by a weight and added to the internal sum. Add the bias to the sum, then do GELU and output.

diagram 2.3a: neuron xor

image

We want an XOR FFN using neurons that

  • add the weighted inputs
  • add bias to the sum
  • (h only) output non-linear (detection only) outputs (0 or 1, not -1)

h1 (Hidden layer neuron):
input x1 with W(h1.x1) = 1
input x2 with W(h1.x2) = 1
bias = -1
GELU
h1 = x1 AND x2

h2:
input x1 with W(h2.x1) = -1
input x2 with W(h2.x2) = -1
bias = +1
GELU
h2 = x1 NOR x2

y (output layer neuron):
input h1 with W(y.h1) = -1
input h2 with W(y.h2) = -1
bias = +0.5
GELU (?)
y = h1 NOR h2

result:
y = x1 XOR x2
diagram 2.3b: h1, h2 as f(x1, x2)

image

diagram 2.3c: y1 as f(h1, h2)

image

What the diagrams above show is that our additive gates perform the rough equivalent of an XOR gate. There is much more explanation to write here (TODO).
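The weight/bias values listed above can be checked directly in Python. A sketch only: because GELU is a smooth approximation of a hard gate, the "true" outputs land around 0.35 and the "false" outputs around -0.08 rather than clean 1/0, so a threshold of roughly 0.2 separates the two cases.

```python
import math

def gelu(x):
    # exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def xor_net(x1, x2):
    # h1 ~ AND: weights (1, 1), bias -1
    h1 = gelu(1.0 * x1 + 1.0 * x2 - 1.0)
    # h2 ~ NOR: weights (-1, -1), bias +1
    h2 = gelu(-1.0 * x1 - 1.0 * x2 + 1.0)
    # y ~ NOR(h1, h2): weights (-1, -1), bias +0.5
    return gelu(-1.0 * h1 - 1.0 * h2 + 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_net(a, b), 3))
# prints: 0 0 -0.078 / 0 1 0.346 / 1 0 0.346 / 1 1 -0.078
```

So the additive-gate XOR works, up to GELU's softness: thresholding the output at 0.2 recovers the boolean XOR table.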



3a.2 AH

AH is based on Wx + b computations with "activation" function Softmax (not GELU).

This drawing summarizes all the computations in __ziptieai_BOOK2_LLM_v86_251230 section 17.1.1 "AH (attention heads)". Quite complex.

image

AH logic gates

  • Below: The upper example shows 2 matrices multiplied. A 1x128 W matrix on the left and a 128x1 hidden layer on the right produce a 1x1 scalar (like FFN transistors: many inputs, one output).
  • Below: The lower half shows a 128x12288 W matrix dot a 12288x1 (all hidden layer vectors). Very typical for AH, many inputs, many outputs (AH is more modification and sharing of info, not FFN-style detection). This is the equivalent of 128 transistors.
image

AH detectors (QK + Softmax)

Softmax ref: https://towardsdatascience.com/understanding-sigmoid-logistic-softmax-functions-and-cross-entropy-loss-log-loss-dbbbe0a17efb/

  • QK determines how much one token (128 dims) influences another token (the 128 dim V matrix is added to the target token).
  • Softmax (SM) zeroes out the influence of all but 1-3 strongly related tokens (similar to how GELU zeroes out the influence of negative sums).
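A minimal sketch of the QK + Softmax idea (toy 4-dim Q/K vectors instead of 128; all values made up): the dot product scores how related two tokens are, and softmax turns the scores into weights that concentrate on the strongly related token(s).

```python
import math

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys):
    """Dot-product QK scores for one query token against all key tokens,
    scaled by sqrt(d_k), then softmax into weights that sum to 1."""
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in keys]
    return softmax(scores)

# toy example: the strongly related key dominates after softmax
q = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0],    # strongly related token
        [0.1, 0.0, 0.0, 0.0],    # weakly related token
        [-1.0, 0.0, -1.0, 0.0]]  # unrelated token
w = attention_weights(q, keys)   # first weight is by far the largest
```

The resulting weights then decide how much of each token's V vector gets added to the target token.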



3b The TF main loop (GPT-3)

This section describes the AH, FFN h layer, and FFN y output layer (inference only; for training see section 6).

See also "_ziptieai_BOOK2_LLM.docx" for

  • Ch1 "LLM basics"
  • Sections "5.1.1 Inference workflow overview" and "17.1.1 AH (attention heads)" (for more diagrams)

TOC

  • 2.0 Overview diagram
  • STEP 2.1 initial embedding
  • STEP 2.2 AH
  • STEP 2.3 FFN (overview)
  • STEP 2.4 95 more layers / Simple Emergent Structure develops
  • STEP 2.5 (after layer 96) predict next token from final state
  • STEP 2.6 then start whole process all over at STEP 1 (with new token appended)

2.0 Overview diagram

The following shows how the input to the TF starts out as a prompt; after each loop the new token (computed by the TF) is added to the input.

image

All steps.

image

Steps diagram.

image

For simplicity, I use example diagrams with 3 tokens as input.

image

STEP 2.1 initial embedding

Input 3 tokens (each token ID represents roughly 1-6 ASCII characters). These are converted to 12288 FP numbers per token (the embedding).

image image

Each token has 12288 scalars that define everything about the token.

image

The initial embedding is then input to the 96 layer loop. When input, it is converted into what's called the hidden state (still 12288 numbers), in the "hidden layer".
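A toy sketch of STEP 2.1. Assumptions: a 3-word vocabulary, d_model of 8 instead of 12288, and random placeholder embedding values (a real table is learned during training; GPT-3 also adds positional embeddings, omitted here).

```python
import random

random.seed(0)
D_MODEL = 8                         # GPT-3 uses 12288; tiny here
VOCAB = {"The": 0, "cat": 1, "sat": 2}  # hypothetical 3-token vocabulary

# embedding table: one d_model vector of floats per token id
embedding_table = [[random.uniform(-0.1, 0.1) for _ in range(D_MODEL)]
                   for _ in range(len(VOCAB))]

def embed(tokens):
    """Map token strings -> ids -> d_model float vectors (the embedding)."""
    return [embedding_table[VOCAB[t]] for t in tokens]

# 3 input tokens -> 3 vectors of D_MODEL floats = the initial hidden state
hidden_state = embed(["The", "cat", "sat"])
```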

STEP 2.2 AH NEW (inside hidden layers)

The algorithm is quite complex.

AHs share info between related tokens.

image image

The handwritten diagram below shows what is happening in the second diagram below.

image image

"_ziptieai_BOOK2_LLM.docx" section 17.1.1 shows the notation I created that makes the details clear (GPT confirmed that my notation and text are correct).

image

In reality, in GPT-3 there are (I believe) about 500 million AH connections for 2048 tokens (the max context size).

image

STEP 2.3 FFN (overview) NEW

NOTE: The FFN (to me) is the main part of the UFA (universal function approximator). "3.2.2 detectors" of this wiki page shows a simple 2 x, 4 h, and 1 y FFN.

A basic FFN example is discussed in "3 Very basic UFA example" of docx #600.

image

In reality, in GPT-3 there are

  • x1… x12288 inputs to each h layer neuron.
  • 49152 (12288x4) total h layer neurons (h1… h49152).
  • y outputs y1...y12288.

A super massive computational surface.

image

STEP 2.4 95 more layers / Simple Emergent Structure develops

  • Repeat the above loop 95 times.
  • The 12288 scalars that define each token state “interact” during these 96 layers.
  • Inside those 12288 scalars an “emergent structure” is created.

This is the core "intelligence" of the LLM (yep, that's it, a bunch of numbers).

See “#514 Simple emergent structure (as layers progress)”.

image

During these 96 loops (LLM attention across layers):

First: the 12288 numbers for each token are being (1) AH-shared between tokens and (2) FFN-detected within a token.

image

Middle

image

Late

image

Final

image image

STEP 2.5 (after layer 96) predict next token from final state

https://youtu.be/wjZofJX0v4M?t=1317

image

STEP 2.6 Then start whole process all over at STEP 2.1 (with new token appended)

Until the TF or the agent stops the generation of tokens.
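The overall loop (STEPs 2.1-2.6) can be sketched as follows. The transformer_forward stand-in is hypothetical (a real TF runs the 96 layers and returns probabilities over a ~50K-token vocabulary, from which the next token is picked); the point is the outer loop: generate one token, append it, and feed the grown sequence back in until a stop condition.

```python
def transformer_forward(tokens):
    # hypothetical stand-in for the 96-layer TF: deterministically
    # "predicts" a next token from the current sequence length
    return "tok%d" % len(tokens)

def generate(prompt_tokens, max_new_tokens=5, stop_token=None):
    """The outer TF loop: run the TF, append the predicted token,
    and feed the grown sequence back in, until a stop condition."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_tok = transformer_forward(tokens)
        if next_tok == stop_token:
            break
        tokens.append(next_tok)  # STEP 2.6: new token appended
    return tokens

out = generate(["Hello", "world"], max_new_tokens=3)
```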

image



3c How the iAgent really works (TODO)



4 The last half century of development that has led to practical AI

There is no better example of AI hype than the beginning of this video, where Marc Andreessen claims that the computer industry took a wrong turn 80 years ago when it decided to build binary computers rather than AI computers. That is one very bizarre claim (I find it hard to believe Andreessen really said that). So I asked GPT "Were 'AI computers' possible 80 years ago?" The answer:

"Modern AI systems require:

  • Massive matrix multiplications
  • Floating-point arithmetic
  • Huge memory bandwidth
  • Billions of parameters

Even a small GPT-class model would have been unimaginable in 1950 hardware terms. So saying the industry “chose binary instead of AI” is historically misleading. AI requires digital computation. It was not an alternative path." "Historically misleading". Typical GPT politicized answer. AI is booming now because the required foundational tech only recently became available. Pretty basic stuff.

The following summarizes the basic history of binary and AI computing (my own perspective). See also

4.1 Binary computing based on integrated semiconductors

4.1a For the past 50 years, we sometimes did not even know why certain aspects of integrated semiconductors worked. Researchers simply tested, discovered, and then tried to figure out why it worked.

4.1b But it all stabilized gradually. Binary computing based on integrated semiconductor transistors has been making unimaginable gains ever since. When I worked in an IBM VLSI test lab in the early 1980s, state of the art was 2-3 microns (2000-3000 nm) and 25MHz with yields ~10%. Now state of the art (set by world leader Taiwan) is clock frequencies in GHz and 3-5 nm.

Below: Intel 4004 from 1971

image

4.2 AI based on binary computing

4.2a The era of practical AI has only recently started thanks to recent gains in performance in the semiconductor transistor-based GPUs that it runs on. Experts are often not sure exactly how certain aspects of AI work, but they continue to test, discover, and then develop theories that explain it all.

4.2b But AI is developing at a much faster rate than semiconductors. Where this will lead is hard to say. But as I recently mentioned in a chat with ChatGPT, the stunning performance and usefulness of GPT (something I could not have imagined just a few years ago) shows that the massive investments in AI are not a bubble.

Below: GeForce RTX (left), UFA 3D map (center; from Welch Labs), and resulting logic map (right). The map is of a UFA that has 2 inputs (latitude/longitude), and the output is binary (in Belgium or the Netherlands).

image

5 Why TFs can be used for many types of applications (not just for LLMs)

The analogous similarities of AI GPU algorithms for different types of apps (basic idea: for example, CNN and TF similarities). For details see the following.

Docx #600

image

"_ziptieai_BOOK2_LLM.docx" ch 16 LLM analogies

image

"_ziptieai_BOOK2_LLM.docx" ch 15a Other UFA

image

6 Training (GPT-3) details

(My own take.) "Training" uses training SW to program the TF (set the weights and biases). The training SW compares the input and output for massive amounts of text to compute the weight/bias values. The TF learns nothing.

AH softmax

https://www.geeksforgeeks.org/machine-learning/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss/

I don't quite understand the math, but basically the loss function is a very simple math function (chosen for easy computation).
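For reference, the cross-entropy loss described in the link above is just minus the log of the probability the model assigned to the correct next token (a sketch with made-up probabilities):

```python
import math

def cross_entropy(probs, target_index):
    """Loss = -log of the probability the model gave the correct token.
    High probability on the right token -> small loss, and vice versa."""
    return -math.log(probs[target_index])

# model assigns 70% to the correct token (index 1) -> small loss (~0.357)
good = cross_entropy([0.2, 0.7, 0.1], 1)
# model assigns only 5% to the correct token -> large loss (~3.0)
bad = cross_entropy([0.9, 0.05, 0.05], 1)
```

Simple to compute, and (per the linked article) its derivative combined with softmax is also simple, which is what makes it practical for training.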

image

FFN training

About the 3d loss "hills and valleys" diagrams

The loss function should be a 2D f(x) = (loss equation). But you often see these 3D visualizations. Not really sure what these are supposed to be. Changing a single Wx will change the loss curve to some extent for all other Wx's. But this diagram apparently shows 2 inputs and the output on the z axis. I still have not figured out the meaning.

https://youtu.be/qx7hirqgfuU?t=900

image

h layer W

Assume W = 0.5 is the best fit. Then as W goes to +infinity the square (slope) gets bigger; as W goes to -infinity the square (slope) also gets bigger (but not quite as fast). These slopes are the gradients.
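A numeric sketch of the "W = 0.5 is the best fit" idea, using a plain squared loss (a real loss curve is not this symmetric): the gradient is the slope of the loss, and one gradient-descent step moves W toward the best fit.

```python
def loss(W, best=0.5):
    # toy squared loss, minimized at W = best
    return (W - best) ** 2

def gradient(W, best=0.5):
    # analytic slope: d/dW (W - best)^2 = 2 * (W - best)
    return 2 * (W - best)

# one gradient-descent step: move W against the slope
W = 3.0
lr = 0.1                    # learning rate
W = W - lr * gradient(W)    # 3.0 - 0.1 * 5.0 = 2.5, closer to 0.5
```

Training repeats this step (with gradients computed by backpropagation) across all weights and biases at once.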

image

h layer bias ??

All 12288 gradients sum up linearly for an h neuron. Bias is added to the sum. If bias = -2, the output is shifted right by 2; this means the sum does not go positive until the input is > 2.

h layer GELU ??

image

y (output) layer W

image

y (output) layer b (??)

For details

see "_ziptieai_BOOK2_LLM.docx" chapters

  • 5.1.4
  • 19 LLM transformer training
image

