Core AI concepts - terrytaylorbonn/auxdrone GitHub Wiki

26.0411 Lab notes (Gdrive) Git


This wiki page (WIP) focuses on

  • LLMs: Summarizes AI LLM concepts (based on GPT-3; the core concepts are probably remarkably similar for the latest LLM models). Much of the material is my own.
  • Inference: Focuses on the mechanics (code, algorithms, math, etc) of GPT-3 inference (run-time prompt processing). ML engineers try everything to improve LLM TF (transformer) capabilities and performance. The end product is amazing.
  • Training: Only touches on basic (TF) training concepts (most of us will never train a model). The main constraint of TF (transformer) design is training convergence.

Note: AI terminology is a challenge. "Attention heads" (AHs) (that have nothing to do with attention), feed forward networks (FFNs) (that don't feed anything), token "vectors" (12288 FP numbers), "hyperspace", "embeddings", "hidden states", "hidden layers", etc etc. AI chat tools use the official lingo. I try to explain using simple terms wherever possible.


TOC

The most important stuff:

  • 1 Real intelligence is something magical (that only higher order animals have).
  • 2 Machine (AI) intelligence is an amazing technological revolution. The strength of AI is electronic signal/switching speeds and virtually unlimited power/size (compared to biological platforms).
  • 3 LLMs contain a transformer (TF) neural network (NN) and an internal "agent". Both working together create the simulation of intelligence.
  • 4 LLM TFs (GPT-3) generate output tokens based on input tokens. TFs contain
    • 4.1 TF main loop,
    • 4.2 AHs ("attention heads") that mix context info between tokens (modifying token "hidden layers" that started out as token "embeddings"), and
    • 4.3 FFNs ("feed forward networks"; the name is historical, not descriptive) that modify the hidden layer within individual tokens.
  • 5a Internal agents (iAgents) are deterministic (non-AI) program loops that interface between the human user and the TF.
  • 5b External agents (typically Python scripts) are external to the LLM (written by app devs).

Also good to know:

  • 6 AI technical foundations. Half a century of development has led to practical AI. This situation is similar to how PC-based video was not possible until the required computing HW for 3D graphics was developed.
  • 7 TF versatility. TFs can be used for many types of applications (not just LLMs). TFs can be used for any function that (1) is too complex to model with deterministic equations and (2) can be "trained" (programmed) using massive amounts of training data.
  • 8 TF training. Training is the most important, most complex, and most secretive aspect of LLM development. The number one requirement for TF design is trainability.
  • Limitations and hype.


1 Real intelligence

Real intelligence is a mystery that is

  • Hosted exclusively (as far as we know) in brains of higher order animals. Such animals have feelings, senses (vision, hearing, touch, etc) and an awareness of one's self. AI devices do not.
  • Bound to time. Thoughts are like music in that they only exist with the passage of time.

    image

Written languages are an amazing invention customized to the unique capabilities of humans.

  • They are a primitive but effective method of communicating complex thoughts, concepts, and ideas.
  • Each language has different structures for distilling / encoding meaning into written/audio form (I speak 4 languages, so I know this from personal experience).


2 Machine (AI) intelligence

AI "intelligence" is based on clocked binary computation. A clock pulse causes circuits to compute outputs from inputs. After enough time has passed for the output signals to become input signals (propagation), the next clock pulse is generated. And so on.

AI has become so successful recently because it is the ultimate man-machine interface (MMI), bridging the gap between time-based intelligence and machine computation.

image



3 LLM

An LLM is programmed ("trained") to output a text response to text input (and recently other media besides text). The response is a statistical approximation based on the text used to program the LLM NN (neural network). An LLM has 2 main components:

  • Transformer (TF) that generates output tokens based on input tokens (the NN).
  • Internal "agent", a deterministic (Python, etc) program loop that interfaces between the human user and the TF.

GPT: LLM overall workflow (generated by GPT and from BOOK 2)

image



4 LLM transformer (TF)

image

This section describes inference (token creation, not training) for the GPT-3 TF

  • 4.1 Main loop,
  • 4.2 AHs ("attention heads") that mix context info between tokens, and
  • 4.3 FFNs ("feed forward networks") that modify the hidden layer of individual tokens.

See also "_ziptieai_BOOK2_LLM.docx" ch1 "LLM basics", "5.1.1 Inference workflow overview", and "17.1.1 AH (attention heads)" for more diagrams.


4.1 TF main loop

Each GPT-3 TF main loop creates a new token that is then appended to the next loop's input [ prompt + response ]. The following diagram shows the 3 main functions of the main loop:

  • (1) AH (attention heads). Sharing context info between tokens.
  • (2) FFN (feed forward network). Detecting and injecting meaning within a token.
  • (3) After doing (1) and (2) 96 times: Generating a new token (based on extensive statistical data (the "emergent structure") stored in the last token).
image image



The following describes the 6 main steps of the TF main loop:

  • S1 Create initial embedding (once only for each new token generation)
  • S2 AH (for more details see 4.2)
  • S3 FFN (for more details see 4.3)
  • S4 Repeat AH/FFN (total of 96 subloops)
  • S5 Select new token based on the last token's final emergent structure (12288 numbers) (for more details see 4.4)
  • S6 Repeat the main loop (at S2) (with new token added to the prompt + current answer as the TF input)
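The six steps can be sketched in runnable form (a minimal sketch in Python; `embed`, `attention`, `ffn`, `select_token`, and `is_stop` are hypothetical stand-ins passed in by the caller, not GPT-3's actual code):

```python
def transformer_main_loop(prompt_tokens, embed, attention, ffn, select_token, is_stop):
    """Sketch of the S1-S6 inference main loop described above."""
    tokens = list(prompt_tokens)
    while True:
        # S1: one set of numbers (12288 in GPT-3) per token
        hidden = [embed(t) for t in tokens]
        # S2-S4: 96 subloops of AH (mix between tokens) + FFN (within a token)
        for layer in range(96):
            hidden = attention(hidden, layer)          # S2: share context info
            hidden = [ffn(h, layer) for h in hidden]   # S3: detect patterns per token
        # S5: select the next token from the LAST token's final state
        new_token = select_token(hidden[-1])
        tokens.append(new_token)                       # S6: append and repeat
        if is_stop(new_token):
            return tokens
```

With trivial stand-in functions the loop just appends a stop token to the prompt; the structure, not the math, is the point here.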



S1 Create initial embedding

At the beginning of the loop, each token is converted to an embedding, which is the initial 12288 numbers per token based only on the token ("token" = minimal part of a word that represents meaning). For GPT-3 there are about 50K tokens in the entire vocabulary.

image

Note: These 12288 numbers can be FP16 or BF16 (on GPUs), sometimes FP32 internally (for accumulation). FP = floating point. GPT insists on calling these numbers "vectors", a term you will run into constantly: you will see videos about LLMs with arrows depicting vectors in something like "meaning hyperspace". The numbers are therefore often referred to as token "dimensions" (dims). I usually refer to them simply as "numbers".
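A sketch of what the S1 lookup amounts to (toy vocabulary and width; the table values here are made up, and the real GPT-3 table is ~50K tokens x 12288 numbers):

```python
# Toy embedding lookup (S1). Real GPT-3: ~50K vocabulary tokens x 12288 numbers.
VOCAB = {"the": 0, "cat": 1, "sat": 2}   # token text -> token id (toy)
D_MODEL = 4                              # 12288 in GPT-3

# One row of D_MODEL numbers per vocabulary token. The values are learned
# during training; these are made-up constants for illustration.
EMBEDDING_TABLE = [
    [0.1, -0.3, 0.2, 0.0],   # "the"
    [0.5, 0.1, -0.2, 0.4],   # "cat"
    [-0.1, 0.2, 0.3, -0.5],  # "sat"
]

def embed(token_text):
    """Token -> initial embedding: a plain table lookup, no computation."""
    return EMBEDDING_TABLE[VOCAB[token_text]]
```

The embedding depends only on the token itself; context is mixed in later by the AHs.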



S2 AH (attention heads): Mix information between tokens

For AH details see 4.2.

Note the following:

  • An attention head refers to a 128-number section of the 12288 numbers (96 AHs x 128 numbers = 12288). The hidden layer was split into AHs in GPT-3 because of computational limitations (this is an excellent example of how AI performance is limited by computational power).
  • During most of the attention computation, mixing only occurs selectively between the corresponding AHs of different tokens (black arrow in diagram below).
  • Mixing between a token's own AHs (red arrows in diagram below) only occurs at the end of the AH layer.
  • One "layer" (loop) of AH can require up to 10^12 FLOPs of computation (for 2048 tokens, the max for GPT-3).
  • After being modified in the first AH layer, a token's 12288 numbers (vectors) are referred to as the token's "hidden state", and the layer is called a "hidden layer". This is because the values are now "hidden" from the external world (they live on the GPU). Something like that.

Information mixing between tokens during AH

image
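The per-head mixing between tokens can be sketched as scaled dot-product attention. This is a minimal sketch: the widths are shrunk from 128, and the learned Q/K/V projections are left as identity (a real TF trains these, so the numbers below are only structural):

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_head(head_slices):
    """One attention head over one slice of numbers per token.

    Causal mask: token i may only attend to tokens 0..i (autoregressive,
    as in GPT-3). Q, K and V projections are identity here for brevity.
    """
    d = len(head_slices[0])
    out = []
    for i, q in enumerate(head_slices):
        # similarity scores against tokens 0..i, scaled by sqrt(d)
        scores = [sum(qa * ka for qa, ka in zip(q, head_slices[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        w = softmax(scores)
        # weighted mix of earlier tokens' slices -> context-adjusted slice
        out.append([sum(w[j] * head_slices[j][a] for j in range(i + 1))
                    for a in range(d)])
    return out
```

The first token can only attend to itself, so its slice passes through unchanged; later tokens come out as weighted mixes of everything before them.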



S3 FFN: Detect token patterns / update token vectors

For FFN details see 4.3.

Note the following:

  • (10^12 FLOPs)

Detect token patterns / update token vectors (simple demo: 2 inputs, 1 output; detect XOR pattern)

image



S4 Repeat AH/FFN (total of 96 loops)

What I've never seen discussed anywhere, except in my discussions with GPT, is what happens to the token hidden layer during each layer of AH/FFN. There are over 12000 FP numbers in the HL for a reason: they are used to store a vast amount of analytical info about not only the token, but the entire storyline (from the viewpoint of the token). Things must be done this way because LLMs have no (0) intelligence. It's a brute-force method of "AI". Each layer refines the emergent structure (ES) defined in the token HL.

For emergent structure details see 4.4.

Note:

  • A simple emergent structure (ES) develops in each token as the layers progress from 1 to 96.
  • The concepts (what I call the "storyline") in the ES become more high level with each layer.
  • The structure of the ES develops during training, independent of direct human design.
  • The final ES for the last token is used to generate the probability list for the next token (used for selection in S5).
  • (10^14 FLOPs)

Emergent structure for a token after final layer 96 (GPT depiction; TODO: create better diagram)

image



S5 Select new token based on the last token's final emergent structure (12288 numbers)

After AH/FFN layer 96 the emergent structure has reached completion (because the model was trained with 96 layers; if the LLM had been trained with 97 layers, we'd need 97 layers during inference). Now it's time to

  • (1) generate a probability for each vocabulary token that it should be the new token and
  • (2) select a token as the new token.
image

The following equation and diagram show how the probability of each vocabulary token (50257 total) being the next token is computed. Basically:

  1. Wu maps the hidden state to one score (logit) per token
  2. A bias is added to each score
  3. Softmax converts the scores into probabilities
image

image

A token is chosen from the probabilities: either the top token (argmax) or a sampled token when temperature is applied.
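The three steps plus the final selection can be sketched as follows (toy sizes; `Wu` and `bias` here are made-up stand-ins for the trained unembedding matrix and bias, and sampling with temperature T is implemented as softmax of logits/T):

```python
import math, random

def next_token_probs(hidden, Wu, bias):
    """Steps 1-3: hidden state -> one logit per vocab token -> softmax."""
    # 1. Wu maps the hidden state to one score (logit) per vocabulary token
    # 2. a bias is added to each score
    logits = [sum(h * w for h, w in zip(hidden, col)) + b
              for col, b in zip(Wu, bias)]
    # 3. softmax converts the scores into probabilities
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def select_token(probs, temperature=0.0):
    """Greedy (argmax) when temperature is 0, otherwise sample.
    p**(1/T), renormalized, equals softmax(logits/T)."""
    if temperature == 0.0:
        return max(range(len(probs)), key=lambda i: probs[i])
    weights = [p ** (1.0 / temperature) for p in probs]
    return random.choices(range(len(probs)), weights=weights)[0]
```

With temperature 0 the highest-probability token always wins; raising the temperature flattens the distribution and lets lower-ranked tokens through.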



S6 Repeat the main loop (at S2) (until stop occurs)

Note:

  • The new (computed) token is added to the original prompt + current (computed) response tokens to become the new TF input.
  • End token generation if the TF detects a stop condition or agent commands to stop.
  • Otherwise repeat the main loop starting at S1 (with new token appended).
    • Computation starts with a "clean slate" each loop because the algorithm fails otherwise (this is the only way they could get the algorithm to work).
image



4.2 AHs

AH in the TF algorithm

image

The AH algorithm performs "context adjustment" ("attention" suggests intelligence, so it's a better marketing term). The algorithm is the result of extensive experimentation (researchers tried everything, and this is what worked best).

Floating-point matrix multiply-accumulate operations to generate 1 new token: MAX = 10^14 (with 2048 input tokens) (GPT-3).

See also __ziptieai_BOOK2_LLM_.docx section 17.1.1 for details (a bit outdated, but with details).


4.2.1 Main steps

AH adjusts a token's hidden layer values (12288 numbers) based on its relationship with the few tokens that have the biggest effect on the token's context (T1 = token 1; GPT-3 handles up to 2048 tokens).

image

Note that in GPT-3 the token hidden layer (12288 numbers) was divided into 96 attention "heads" (128 numbers each), probably because of HW computational limitations. The diagram below shows how the 96 AH heads are mixed at the end of a layer.

image
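The split into heads and the merge at the end of the layer amount to simple slicing and concatenation. A sketch (in the real TF the merge is followed by a learned output projection, which is where information mixes between a token's heads):

```python
D_MODEL, N_HEADS = 12288, 96      # GPT-3 sizes
HEAD_DIM = D_MODEL // N_HEADS     # 128 numbers per head

def split_heads(hidden):
    """Split one token's 12288 numbers into 96 slices of 128."""
    return [hidden[h * HEAD_DIM:(h + 1) * HEAD_DIM] for h in range(N_HEADS)]

def merge_heads(heads):
    """Concatenate the 96 per-head results back into 12288 numbers.
    (GPT-3 then applies a learned output projection to this result.)"""
    return [x for head in heads for x in head]
```

Splitting and merging lose no information by themselves; they just let each 128-number slice be attention-processed independently (and in parallel).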



4.2.2 Details (diagrams)

The diagrams below (my own) summarize the AH algorithm. A great overview, but an incomplete explanation.

image image

The following diagrams (26.0405) give the detail required for real understanding. The AH algorithm is quite complex, but can be understood with a bit of effort. The reason I show them here is not to suggest you learn these details, but to show that AI is brute-force probability computation (not intelligence). I plan to add text to these diagrams in the near future.

image image image image image image image image image image

4.3 FFNs

The FFN detects meanings or patterns in the 12288 dimensions for each token. This section discusses

  • 4.3.1 Neurons and interconnections
  • 4.3.2 Detector example
image



4.3.1 Neurons and interconnections


4.3.1.1 Overview of neurons (single token)

  • Each of the 12288 token hidden-layer dimensions (floating point numbers x1...x12288) is input to each of the 49152 h-layer detector neurons (h1...h49152).
  • Each of the 49152 detector neuron outputs (h1...h49152) is input to each of the 12288 y-layer output neurons (y1...y12288) (for the single token).

image

4.3.1.2 h (hidden layer) neurons (detection)

The following pseudo equation and the diagram below define what happens in the h layer (the FFN detection layer).

image image

The following pseudo-equations sum up the h layer computations

image

The following is a text description

  • For h1out
    • multiply x1 * W[h1,x1]
    • repeat for x2...x12288
    • add all 12288 results
    • add bias
    • run GELU and output the result as h1out
  • Do the same for h2out ... h49152out
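These steps can be written directly as code (a sketch with toy sizes instead of 12288 x 49152; the weight values in the test are made up):

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), the activation used in GPT-3's FFN."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def h_layer(x, W, b):
    """FFN detection layer, exactly the steps listed above: for each
    h neuron, multiply every input by its weight, sum all products,
    add the bias, then run GELU.
    In GPT-3: len(x) = 12288 inputs, len(W) = 49152 h neurons."""
    return [gelu(sum(xi * wi for xi, wi in zip(x, row)) + bi)
            for row, bi in zip(W, b)]
```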

4.3.1.3 y (output layer) neurons

The detections are merged back into the 12288 dims in the y layer. The following pseudo equation and the diagram below define what happens in the y layer (adding back into the hidden layer).

image image

The following is a text description

  • For y1out
    • multiply h1 * W(y1,h1)
    • repeat for h2...h49152
    • add all 49152 results
    • add bias
    • output as final result (no GELU)
  • Do the same for y2out ... y12288out
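And the y-layer counterpart (same sketch style with toy sizes; note the absence of GELU on the output):

```python
def y_layer(h, W, b):
    """FFN output layer, per the steps above: merges the detector
    outputs back into the token's numbers. Same weighted-sum-plus-bias
    as the h layer, but with NO GELU on the output.
    In GPT-3: len(h) = 49152 inputs, len(W) = 12288 y neurons."""
    return [sum(hi * wi for hi, wi in zip(h, row)) + bi
            for row, bi in zip(W, b)]
```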



4.3.2 Detector example

The FFN hidden layer h detects patterns in the input (x1...x12288). The FFN output layer y combines the detected patterns to produce new values for all token dimensions (y1...y12288).

The best way to really understand what's happening is a simplified example. Most of this demo is my own.

Note that in this demo there are only

  • 2 inputs (x1, x2) (FFN has 12288)
  • 2 h layer detector neurons (FFN has 49152)
  • 1 y output (FFN has 12288)

4.3.2.1 Boolean XOR

An XOR makes a detection (outputs 1) when exactly 1 of its 2 inputs is 1. This is a good example for FFN detectors, which are often used to detect such mutually exclusive relationships between the token hidden-layer numbers (12288).

Boolean xor gate

image

image

4.3.2.2 Boolean XOR using "simple" gates (no mutual exclusion)

Mutual exclusion (XOR) is not possible with a single AI neuron. But AND and NOR are possible. So the XOR needs to be built from AND and NOR gates.

Boolean XOR using simple gates

image

image

4.3.2.3 AI neuron XOR using additive gates

The h and y neurons serve as our "additive" gates. That means

  • 12K (h) or 49K (y) inputs
  • multiply each input by a weight and add to the internal sum
  • add a bias to the sum
  • (h only) do GELU
  • output
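The additive-gate XOR can be written out directly. The weights are hand-picked by me for illustration (not from any trained model); because of GELU the outputs are only approximately 0 and 1, never exactly boolean:

```python
import math

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def xor_ffn(x1, x2):
    """Tiny FFN (2 inputs, 2 h detectors, 1 y output) approximating XOR.

    Hand-picked weights: h1 detects "at least one input on", h2 detects
    "both inputs on"; y subtracts the both-on case, leaving exclusive-or.
    """
    h1 = gelu(2.0 * x1 + 2.0 * x2 + 0.0)   # fires for (1,0), (0,1), (1,1)
    h2 = gelu(2.0 * x1 + 2.0 * x2 - 2.0)   # fires strongly only for (1,1)
    # y layer: weighted sum + bias, no GELU.
    # Bias chosen as gelu(-2.0) so that xor_ffn(0, 0) is exactly 0.
    y1 = 0.5 * h1 - 1.0 * h2 + gelu(-2.0)
    return y1
```

The outputs land near 0 for (0,0) and (1,1) and near 0.93 for the exclusive cases; a trained FFN would find such weights itself via gradient descent.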

AI neuron XOR

image

The following describes in detail:

image

image

image

image

The following 2 diagrams show the result visually

  • In the h layer, h1 and h2 (green) are a function of (x1, x2).
  • The dark red stripe / dot shows where the example FFN output (y1) is closest to 1.
  • This is roughly the equivalent of an XOR (the inputs and outputs, however, are not boolean)

Note that in a real FFN there can be up to

  • 12288 inputs to an h neuron
  • 49152 inputs to a y neuron

This creates a platform that can make a virtually infinite number of different detection combinations (the inputs are FP numbers).

h1, h2 as f ( x1, x2)

image

y1 as f (h1, h2)

image



4.4 Simple Emergent Structure development

The emergent structure develops during the 96 AH/FFN layers of each main loop.

  • The 12288 scalars that define each token state “interact”.
  • Inside those 12288 scalars an “emergent structure” is created.

First: the 12288 numbers for each token are (1) shared between tokens by the AHs and (2) pattern-detected within each token by the FFNs.

image

Middle

image

Late

image

Final

image image



5a Internal agents

The following is from BOOK 2 section "18.2 Overall workflow". Generated by GPT.

image

5b External agents

The following is from #603_PAL_.docx on the gdrive.

Demo PAL v1. The simplest example of how a Python program (the external agent) interacts with a model API to create "agentic AI".

The LLM has an "internal agent" (the code leaked recently by Anthropic was probably for the iAgent) that interfaces between the human and the LLM. Its design is confidential (for good reason: iAgent code would expose the truth about LLMs, which we are told will have "super human intelligence" in just a year (Musk) or a few years (Anthropic)).

image

The real magic (see code below):

  • you don't have to program the LLM. You merely use human language to tell it what data structure to respond with.
  • you simply put the DB data in the prompt
  • the response is human-language style.

Magic.

image image

The response:

image
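A minimal sketch of the external-agent pattern described above. The prompt wording and the JSON structure are my own choices (not from the PAL demo code), and the actual model API call is not shown; `parse_reply` simply assumes the model followed the requested structure:

```python
import json

def build_prompt(db_rows, question):
    """External-agent pattern: no 'programming' of the LLM, just plain
    language instructions plus the DB data pasted into the prompt."""
    return (
        "Answer the question using ONLY the data below.\n"
        'Respond with a JSON object: {"answer": <string>}.\n\n'
        f"DATA: {json.dumps(db_rows)}\n"
        f"QUESTION: {question}\n"
    )

def parse_reply(reply_text):
    """Because the model was told (in English) which structure to respond
    with, the agent can parse the reply as ordinary JSON."""
    return json.loads(reply_text)["answer"]
```

The agent sends `build_prompt(...)` to whatever model API it uses, then feeds the reply to `parse_reply`; that round trip is the whole "agentic" loop.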








6 AI technical foundations

There is no better example of AI hype than the beginning of this video, where Marc Andreessen claims that the computer industry took a wrong turn 80 years ago when it decided to build binary computers rather than AI computers. That is one very bizarre claim (I find it hard to believe Andreessen really meant that). So I asked GPT: "Were 'AI computers' possible 80 years ago?" The answer:

"Modern AI systems require:

  • Massive matrix multiplications
  • Floating-point arithmetic
  • Huge memory bandwidth
  • Billions of parameters

Even a small GPT-class model would have been unimaginable in 1950 hardware terms. So saying the industry “chose binary instead of AI” is historically misleading. AI requires digital computation. It was not an alternative path." AI is booming now because the required foundational tech only recently became available. Pretty basic stuff.

The following docs summarize the basic history of binary and AI computing (my own perspective): "__ziptieai_BOOK2_LLM_.docx" ch2 "How we got here" and docx #600.

6.1 Binary computing based on integrated semiconductors

  • For the past 50 years, we sometimes did not even know why certain aspects of integrated semiconductors worked. Researchers simply tested, discovered, and then tried to figure out why it worked.

  • But it all stabilized gradually. Binary computing based on integrated semiconductor transistors has been making unimaginable gains ever since. When I worked in an IBM VLSI test lab in the early 1980s, state of the art was 2-3 microns (2000-3000 nm) and 25MHz with yields ~10%. Now state of the art (set by world leader Taiwan) is clock frequencies in GHz and 3-5 nm.

Intel 4004 from 1971

image

6.2 AI based on binary computing

  • The era of practical AI has only recently started thanks to recent gains in performance in the semiconductor transistor-based GPUs that it runs on. Experts are often not sure exactly how certain aspects of AI work, but they continue to test, discover, and then develop theories that explain it all.

  • But AI is developing at a much faster rate than semiconductors. Where this will lead is hard to say. But as I recently mentioned in a chat with ChatGPT, the stunning performance and usefulness of GPT (something I could not have imagined just a few years ago) shows that the massive investments in AI are not a bubble.

Below: GeForce RTX

image



7 TF versatility

(WIP) TFs can be used for many types of applications (not just for LLMs).

Docx #600

image

"_ziptieai_BOOK2_LLM.docx" ch 16 LLM analogies

image

"_ziptieai_BOOK2_LLM.docx" ch 15a Other UFA

image



8 TF training

(WIP) Training is the most important, most complex, and most secretive aspect of LLM development. It's also the aspect that most of us will never be involved in.

  • The number one requirement for TF design is trainability.
  • Training uses complex SW tools and techniques that program TF weights and biases based on massive amounts of training text. The training SW compares the input and output for massive amounts of text to compute the weights/bias values. The TF itself learns nothing.
  • Training is hit or miss, trial and error. It's an art. It often fails (requiring a restart from zero).
  • Researchers run tests to discover how AI functionality is actually implemented in the TF.

See "_ziptieai_BOOK2_LLM.docx" chapters 5.1.4 and 19 "LLM transformer training".

8.1 AH training

I don't understand much about AH training yet. Basically the loss function is a very simple math function (chosen for easy computing).

image

8.2 FFN training

(WIP) The loss function should be a 2D curve, f(x) = (loss equation), but you often see these 3D visualizations, and I'm not really sure what they are supposed to represent. Changing a single Wx will change the loss curve to some extent for all the other Wx's. But this diagram apparently shows 2 inputs (x, y) and the output (z). Not sure.

Gradient descent (Youtube)

image

h layer W

Assume W = 0.5 is the best fit. Then as W goes to +infinity, square(slope) gets bigger. As W goes to -infinity, square(slope) also gets bigger (but not quite as fast). These are the gradients.

image
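A sketch of gradient descent on this 1-D picture (assuming, as above, that W = 0.5 is the best fit, so loss(W) = (W - 0.5)**2):

```python
def train_weight(w_start, lr=0.1, steps=100):
    """1-D gradient descent on loss(W) = (W - 0.5)**2.

    The gradient is d(loss)/dW = 2*(W - 0.5); each step moves W a small
    amount (lr) against the gradient, so W converges to the best fit 0.5.
    Real training does this simultaneously for billions of weights."""
    w = w_start
    for _ in range(steps):
        grad = 2.0 * (w - 0.5)
        w -= lr * grad
    return w
```

Starting from either side of 0.5, the weight slides down the loss curve to the minimum.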

h layer bias

All 12288 weighted inputs sum linearly for an h neuron, and the bias is added to the sum. If bias = -2, the output curve is shifted right by 2: the sum does not go positive until the weighted input sum exceeds 2.

image

h layer GELU

xxx

y (output) layer W

image

y (output) layer b

xxx



