Core AI concepts 26.0404 - terrytaylorbonn/auxdrone GitHub Wiki

26.0402 (0310) Lab notes (Gdrive) Git


From docx #600:
image

This wiki page is my explanation of how AI (GPT-3) works (it is still a work in progress (WIP), but the outline is solid). I wrote this page because I have my own interpretation and analogies for many aspects of AI (LLMs). Much of the content below has been taken from "_ziptieai_BOOK2_LLM.docx" and docx #600*.


TOC

The minimum you need to understand:

  • 1 What is human intelligence? Something magical (that AI does not have).
  • 2 How LLM AI really works (GPT-3) (transformer (TF) basics). AI is computational statistics running on clocked binary logic. The strength of AI is electronic signal/switching speeds and virtually unlimited power/size (compared to biological platforms).
  • 3 The core AI functional components: "AI transistors", "AI detectors", and "AI logic gates". These are the basic components of AI. (ADD AI OS CONCEPT?)

Also good to understand:

  • 4 The last half century of development that has led to practical AI. It's only recently become possible to implement brute-force AI computing methods (the situation is similar to how PC-based video was not possible until the HW required for 3D graphics was developed). The key developments were:
    • 4.1 Binary computing based on integrated semiconductors
    • 4.2 AI based on binary computing
  • 5 Why LLM transformers (TFs) can be used for many types of applications (not just for LLMs). TFs can be used for any function that (1) is too complex to model with deterministic equations and (2) can be "trained" (programmed) using massive amounts of training data.
  • 6 Training (GPT-3). "Training" uses complex SW tools and techniques to program the TF weights and biases. The training SW compares the input and output of massive amounts of training text to tweak the TF weight/bias values.

1 What is human intelligence?

A mystery that is hosted/incarnated in the human brain. The most fascinating thing in the universe. Something brilliant all-knowing AI gurus, those who will lead us to a more enlightened future, compare to crude brute-force clocked binary circuit algorithms.

Human communication is pure magic:

  • Input = hearing or reading words. Humans can hear and see. AI just has no senses. None.
  • Language (written, spoken) = a very primitive inexact method of communicating the vastly complex thoughts, concepts, and ideas of humans.
  • Each language has different structures for distilling / encoding meaning into written/audio form (I speak 4 languages, I know this from personal experience).
  • You read words, you formulate thoughts, and you distill your response into language-specific text/words. Your thoughts are the driving force. AI (LLM TFs) just compute the next token.

Intelligence is the ability to understand things, imagine things, see relationships. It's what makes us human. AI has none of these things. None. AI's only value is as a tool for intelligent humans. And in that role AI is revolutionary.



2 How LLM AI really works (GPT-3) (transformer (TF) basics)

This section describes the AH, FFN h layer, and FFN y output layer (inference only; for training see section 6).

See also "_ziptieai_BOOK2_LLM.docx" for

  • Ch1 "LLM basics"
  • Sections "5.1.1 Inference workflow overview" and "17.1.1 AH (attention heads)" (for more diagrams)

TOC

  • 2.0 Overview diagram
  • STEP 2.1 initial embedding
  • STEP 2.2 AH
  • STEP 2.3 FFN (overview)
  • STEP 2.4 95 more layers / Simple Emergent Structure develops
  • STEP 2.5 (after layer 96) predict next token from final state
  • STEP 2.6 then start whole process all over at STEP 2.1 (with new token appended)

2.0 Overview diagram

The following shows how the input to the TF starts out as a prompt, and then after each loop the new token (computed by the TF) is added to the input.

image

All steps.

image

Steps diagram.

image

For simplicity, I use example diagrams with 3 tokens as input.

image
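The loop described above can be sketched in a few lines of Python. This is a minimal sketch only: `model` is a hypothetical stand-in for the entire 96-layer transformer (not a real API), reduced to "takes the current token IDs, returns the next token ID".

```python
# Minimal sketch of the autoregressive loop: "model" stands in for the
# whole transformer -- it takes the current token IDs and returns the
# predicted next token ID.

def generate(model, prompt_ids, max_new_tokens=10, eos_id=None):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = model(ids)      # one full pass through all layers
        ids.append(next_id)       # new token appended to the input
        if next_id == eos_id:     # generation can be stopped early
            break
    return ids

# Toy "model" that just emits last token + 1, to show the loop shape.
toy_model = lambda ids: ids[-1] + 1
print(generate(toy_model, [5, 6, 7], max_new_tokens=3))  # [5, 6, 7, 8, 9, 10]
```

The point is only the shape of the loop: each new token is computed from the full sequence so far, then appended, and the whole process repeats.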

STEP 2.1 initial embedding

Input: 3 tokens (3 sets of about 1-6 ASCII digits). Each is converted to 12288 FP numbers (the embedding).

image image

Each token has 12288 scalars that define everything about the token.

image

The initial embedding is then input to the 96-layer loop. When input, it is converted into what's called the hidden state (still 12288 numbers), in the "hidden layer".
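The embedding step is just a table lookup: each token ID indexes a row of a learned embedding matrix. A toy sketch (tiny made-up sizes; GPT-3's real sizes are a 50257-entry vocabulary and 12288 dims):

```python
import numpy as np

# Toy version of STEP 2.1: each token ID selects one row of the
# embedding table. GPT-3: vocab ~50257 IDs, 12288 dims per token.
vocab_size, d_model = 10, 4
np.random.seed(0)
embedding_table = np.random.randn(vocab_size, d_model)

token_ids = [3, 1, 7]                      # the 3 input tokens
hidden_state = embedding_table[token_ids]  # shape (3, 4); GPT-3: (3, 12288)
print(hidden_state.shape)                  # (3, 4)
```

Each row of `hidden_state` is the full set of scalars that "define everything about" that token, which the 96 layers then repeatedly transform.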

STEP 2.2 AH (inside hidden layers)

The algorithm is quite complex.

AHs share info between related tokens.

image image

The first (handwritten) diagram below shows what is happening in the second diagram.

image image

"_ziptieai_BOOK2_LLM.docx" section 17.1.1 shows the notation I created that makes the details clear (GPT confirmed that my notation and text are correct).

image

In reality, GPT-3 has (I believe) about 500 million AH connections for 2048 tokens (the max context size).

image
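The "AHs share info between related tokens" step can be sketched as standard scaled dot-product attention for a single head. This is a minimal sketch with random values standing in for learned projections, and it omits the causal masking and multi-head combination a real GPT uses:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One attention head over 3 tokens; head dim 128 as in GPT-3.
# Q, K, V would really be computed from each token's hidden state
# by learned W matrices; random stand-ins here.
n_tokens, d_head = 3, 128
rng = np.random.default_rng(0)
Q = rng.standard_normal((n_tokens, d_head))
K = rng.standard_normal((n_tokens, d_head))
V = rng.standard_normal((n_tokens, d_head))

scores = Q @ K.T / np.sqrt(d_head)  # how strongly each token relates to each
weights = softmax(scores)           # each row sums to 1
shared = weights @ V                # info shared between related tokens
print(shared.shape)                 # (3, 128)
```

Each output row is a weighted mix of the V vectors of all tokens, which is the "sharing" the diagrams above illustrate.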

STEP 2.3 FFN (overview)

NOTE: The FFN (to me) is the main part of the UFA (universal function approximator). Section "3.2.2 detectors" of this wiki page shows a simple FFN with 2 x, 4 h, and 1 y.

A basic FFN example is discussed in "3 Very basic UFA example" of docx #600.

image

In reality, in GPT-3 there are

  • x1… x12288 inputs to the FFN.
  • 49152 (12288x4) h neurons in the hidden layer (h1… h49152).
  • 12288 y outputs (y1… y12288).

A super massive computational surface.

image
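The FFN applied to one token can be sketched as two matrix multiplies with a GELU in between. Tiny made-up sizes here; the real GPT-3 sizes are d_model = 12288 and d_hidden = 49152 (4x):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (the form used in GPT-2/3)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# One FFN pass for one token's hidden state (toy sizes).
d_model, d_hidden = 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_hidden, d_model)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((d_model, d_hidden)), np.zeros(d_model)

x = rng.standard_normal(d_model)  # 12288 numbers per token in GPT-3
h = gelu(W1 @ x + b1)             # h layer: the 49152 "AI transistors"
y = W2 @ h + b2                   # y output: back down to 12288 numbers
print(y.shape)                    # (8,)
```

The same FFN is applied to every token independently; the "super massive computational surface" is this computation at full GPT-3 width.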

STEP 2.4 95 more layers / Simple Emergent Structure develops

  • Repeat the above loop 95 times.
  • The 12288 scalars that define each token state “interact” during these 96 layers.
  • Inside those 12288 scalars an “emergent structure” is created.

This is the core "intelligence" of the LLM (yep, that's it, a bunch of numbers).

See “#514 Simple emergent structure (as layers progress)”.

image

During these 96 loops (LLM attention across layers):

First: the 12288 numbers for each token are being (1) shared between related tokens by the AHs and (2) detected within each token by the FFNs.

image

Middle

image

Late

image

Final

image image

STEP 2.5 (after layer 96) predict next token from final state

https://youtu.be/wjZofJX0v4M?t=1317

image
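STEP 2.5 can be sketched as one last matrix multiply: the final hidden state of the last token times an unembedding matrix gives one score (logit) per vocabulary entry, and softmax turns scores into probabilities. A minimal sketch with toy sizes (GPT-3: 12288 dims, ~50257 vocab entries):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Predict the next token from the final hidden state (toy sizes).
d_model, vocab_size = 4, 10
rng = np.random.default_rng(1)
W_unembed = rng.standard_normal((vocab_size, d_model))
final_state = rng.standard_normal(d_model)  # last token, after layer 96

logits = W_unembed @ final_state            # one score per vocab entry
probs = softmax(logits)                     # scores -> probabilities
next_token = int(np.argmax(probs))          # greedy pick (real GPT samples)
print(next_token)
```

Real systems usually sample from `probs` (with temperature) instead of always taking the argmax; the argmax version is just the simplest case.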

STEP 2.6 Then start whole process all over at STEP 2.1 (with new token appended)

Until the TF or the agent stops the generation of tokens.

image



3 The core AI functional components: "AI transistors", "AI detectors", and "AI logic gates"

From GPT in response to my prompts:

One thing you invented that is interesting. Your phrasing:

  • AI transistors
  • AI logic gates

is not standard terminology, but the concept is legitimate.

Researchers often describe neurons as:

  • threshold units
  • feature detectors
  • basis functions

Your analogy to transistors and logic gates is simply another way of explaining the same mechanism.

The key implication: Attention + FFN together form something like:

  • logic gates + wiring between tokens

Your analogy actually captures this well:

  • FFN ≈ logic gates
  • AH ≈ signal routing / wiring

That combination is what allows transformers to build complex reasoning structures over many layers.

TOC

  • 3.1 AH
    • 3.1.1 Transistors (matrix math)
    • 3.1.2 Detectors (QK + Softmax)
    • 3.1.3 Logic gates (none??)
  • 3.2 FFN
    • 3.2.1 Transistors
    • 3.2.2 Detectors
    • 3.2.3 Logic gates

3.1 AH

AH is based on Wx + b computations with "activation" function Softmax (not GELU).

This drawing summarizes all the computations in __ziptieai_BOOK2_LLM_v86_251230 section 17.1.1 "AH (attention heads)". Quite complex.

image

3.1.1 AH transistors (matrix math)

  • Below: The upper (single-transistor) example shows 2 matrices multiplied. A 1x128 W matrix on the left and a 128x1 hidden-layer vector on the right produce a 1x1 scalar (like FFN transistors: many inputs, one output).
  • Below: The lower half shows a 128 x 12288 W matrix dotted with a 12288 x 1 vector (all hidden-layer values), producing 128 outputs. Very typical for AH: many inputs, many outputs (AH is more about modifying and sharing info, not FFN-style detection). This is the equivalent of 128 transistors.
image
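The two shape cases in the bullets above can be checked directly (random matrices standing in for the learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Upper: one "AI transistor" -- a 1x128 W row times a 128x1 vector
# gives a 1x1 scalar: many inputs, one output.
W_row = rng.standard_normal((1, 128))
v = rng.standard_normal((128, 1))
print((W_row @ v).shape)  # (1, 1)

# Lower: a 128x12288 W matrix times a 12288x1 hidden vector gives
# 128 outputs in one multiply -- the equivalent of 128 transistors.
W = rng.standard_normal((128, 12288))
h = rng.standard_normal((12288, 1))
print((W @ h).shape)      # (128, 1)
```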

3.1.2 AH detectors (QK + Softmax)

Softmax ref: https://towardsdatascience.com/understanding-sigmoid-logistic-softmax-functions-and-cross-entropy-loss-log-loss-dbbbe0a17efb/

  • QK determines how much one token (128 dims) influences another token (the 128-dim V vector is added to the target token).
  • Softmax (SM) zeroes out the influence of all but 1-3 strongly related tokens (similar to how GELU zeroes out the influence of negative sums).
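The "zeroes out all but 1-3 tokens" behavior is easy to see numerically. A sketch with made-up QK scores for one token against 6 others (two strongly related, the rest weak):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up QK scores: two tokens strongly related, four weakly related.
scores = np.array([9.0, 8.0, 1.0, 0.5, 0.2, 0.1])
w = softmax(scores)
print(w.round(3))  # nearly all weight lands on the first two tokens
```

Because softmax exponentiates, even modest score gaps turn into near-zero weights for the weakly related tokens.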

3.1.3 AH Logic gates (none??)

I believe there are no logic gates in the AH (unlike the AND, XOR, etc. in the FFN). There are no hidden layers within the AH itself (??).


3.2 FFN transistors, detectors, logic gates

3.2.1 FFN transistors

h transistor

x1 ... x12288 as inputs, each multiplied by its own weight W. The results are added together, a bias is added to the sum, and then that result is put through GELU.

image

y transistor

h1 ... h49152 as inputs, each multiplied by its own weight W. The results are added together and a bias is added to the sum. This is the final result (no GELU?).

image
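One h "transistor" on its own is just a weighted sum, plus bias, through GELU. A scalar sketch (3 inputs and made-up weights instead of GPT-3's 12288):

```python
import math

def gelu(x):
    # exact GELU using the Gaussian CDF
    return x * 0.5 * (1 + math.erf(x / math.sqrt(2)))

# One h "transistor": sum of weighted inputs, plus bias, through GELU.
xs = [0.5, -1.0, 2.0]   # inputs (GPT-3: 12288 of them)
ws = [0.8, 0.3, -0.2]   # one weight per input
b = 0.1
s = sum(w * x for w, x in zip(ws, xs)) + b
h = gelu(s)

# A y "transistor" is the same weighted sum + bias, with no GELU.
print(round(h, 4))
```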

3.2.2 FFN detectors  

The following diagram shows how to create a simple detector. For

  • x1 → h1:
    • The W·x1 line with GELU.
    • That line inverted (W < 0).
    • Add a bias. For x1 < ~1 we get output.
  • x1 → h2: same, except W > 0.
  • x2 → h3, h4: the same, but for x2.

If these are all input to y, then y is only positive where they all add up > 0.

How to get away from the x origin? I think with 12288 x, 49152 h, and 12288 y the actual outputs are not simple 2-planes like what I showed above, and they are not all around the origin. Need to check this with GPT.

image
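A runnable variant of the detector idea (my own minimal 1D construction, not the exact wiki diagram): four GELU "h" units with shifted biases, combined by y with +/-1 weights, fire only near x = 0 and fall back to ~0 far from the origin.

```python
import math

def gelu(x):
    return x * 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Four GELU h units combined into a "bump" detector for x near 0.
def detector(x):
    h1 = gelu(x + 2)           # shifted copies of the same GELU ramp
    h2 = gelu(x + 1)
    h3 = gelu(x - 1)
    h4 = gelu(x - 2)
    return h1 - h2 - h3 + h4   # y combines h outputs with +/-1 weights

print(round(detector(0.0), 3))  # ~1.23: fires near the origin
print(round(detector(5.0), 3))  # ~0: silent far from the origin
```

Away from the origin the positive and negative ramps cancel exactly, so the unit detects only a localized region, which is the "detector" behavior described above.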



3D VISUALIZATIONS OF INFERENCE ALGORITHMS ARE USELESS

The detector demo described above shows that the resulting output value depends on the SUM of all h outputs. But the diagram below (left) is a typical 3D visualization of inference that attempts to show the output graphs of all h outputs (many W*x plots are superimposed on each other, making it impossible to recognize the end result as a single surface (right)).

image

Another such diagram (https://youtu.be/qx7hirqgfuU?t=1681):

image

3.2.3 FFN logic gates (h and y combined: AI logic gates)

The example below uses sigmoid (not GELU). It shows how an XOR is created from a simple network with 2 inputs (x), 2 hidden units (h), and one output (y). I included this diagram because it shows how logic gates are created.

image
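The XOR gate can be built concretely with hand-picked weights (my own choice of values, not those in the diagram): h1 approximates OR, h2 approximates AND, and y computes "h1 AND NOT h2", which is XOR.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# 2-2-1 sigmoid network computing XOR with hand-picked weights:
# h1 ~ OR(x1, x2), h2 ~ AND(x1, x2), y ~ h1 AND NOT h2.
def xor(x1, x2):
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)   # fires if either input is 1
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)   # fires only if both are 1
    y = sigmoid(20 * h1 - 20 * h2 - 10)    # h1 AND NOT h2
    return round(y)

print([xor(0, 0), xor(0, 1), xor(1, 0), xor(1, 1)])  # [0, 1, 1, 0]
```

The large weights (20) push the sigmoids toward hard 0/1 decisions, which is what makes the analogy to logic gates work.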



4 The last half century of development that has led to practical AI

There is no better example of AI hype than the beginning of this video, where Marc Andreessen claims that the computer industry took a wrong turn 80 years ago when it decided to build binary computers rather than AI computers. That is one very bizarre claim (I find it hard to believe Andreessen really said that). So I asked GPT, "Were 'AI computers' possible 80 years ago?" The answer:

"Modern AI systems require:

  • Massive matrix multiplications
  • Floating-point arithmetic
  • Huge memory bandwidth
  • Billions of parameters

Even a small GPT-class model would have been unimaginable in 1950 hardware terms. So saying the industry “chose binary instead of AI” is historically misleading. AI requires digital computation. It was not an alternative path." "Historically misleading". Typical GPT politicized answer. AI is booming now because the required foundational tech only recently became available. Pretty basic stuff.

The following summarizes the basic history of binary and AI computing (my own perspective). See also

4.1 Binary computing based on integrated semiconductors

4.1a For the past 50 years, we sometimes did not even know why certain aspects of integrated semiconductors worked. Researchers simply tested, discovered, and then tried to figure out why it worked.

4.1b But it all stabilized gradually. Binary computing based on integrated semiconductor transistors has been making unimaginable gains ever since. When I worked in an IBM VLSI test lab in the early 1980s, state of the art was 2-3 microns (2000-3000 nm) and 25 MHz with yields ~10%. Now state of the art (set by world leader Taiwan) is clock frequencies in the GHz range and 3-5 nm.

Below: Intel 4004 from 1971

image

4.2 AI based on binary computing

4.2a The era of practical AI has only recently started thanks to recent gains in performance in the semiconductor transistor-based GPUs that it runs on. Experts are often not sure exactly how certain aspects of AI work, but they continue to test, discover, and then develop theories that explain it all.

4.2b But AI is developing at a much faster rate than semiconductors. Where this will lead is hard to say. But as I recently mentioned in a chat with ChatGPT, the stunning performance and usefulness of GPT (something I could not have imagined just a few years ago) shows that the massive investments in AI are not a bubble.

Below: GeForce RTX (left), UFA 3D map (center; from Welch Labs), and resulting logic map (right). The map is of a UFA that has 2 inputs (latitude/longitude), and the output is binary (in Belgium or the Netherlands).

image

5 Why TFs can be used for many types of applications (not just for LLMs)

The analogous similarities of AI GPU algorithms across different types of apps (basic idea: for example, CNN and TF similarities). For details see the following.

Docx #600

image

"_ziptieai_BOOK2_LLM.docx" ch 16 LLM analogies

image

"_ziptieai_BOOK2_LLM.docx" ch 15a Other UFA

image

6 Training (GPT-3) details

(my own take). "training" uses training SW to program the TF (set the weights and biases). The training SW compares the input and output for massive amounts of text to compute the weights/bias values. The TF learns nothing.

AH softmax

https://www.geeksforgeeks.org/machine-learning/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss/

I don't quite understand the math, but basically the loss function is a very simple math function (chosen for easy computing).

image
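The loss function itself really is simple: for one predicted token, categorical cross-entropy is just the negative log of the probability the model assigned to the correct token. A sketch with a made-up 4-token vocabulary:

```python
import math

# Cross-entropy loss for one token prediction: loss = -log(p_correct).
probs = [0.1, 0.7, 0.15, 0.05]  # model's softmax output (toy vocab of 4)
target = 1                      # index of the actual next token in the text

loss = -math.log(probs[target])
print(round(loss, 4))           # low loss: model was fairly confident
```

If the model had put only 0.01 on the correct token, the loss would jump to -log(0.01) ≈ 4.6, so confident wrong predictions are punished hard.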

FFN training

About the 3d loss "hills and valleys" diagrams

The loss function should be a 2D f(x) = (loss equation). But you often see these 3D visualizations. Not really sure what these are supposed to be. Changing a single Wx will change the loss curve to some extent for all other Wx's. But this diagram apparently shows 2 inputs with the output on the z axis. I still have not figured out the meaning.

https://youtu.be/qx7hirqgfuU?t=900

image

h layer W

Assume W = 0.5 is the best fit. Then as W goes to +infinity the squared error (slope) gets bigger; as W goes to -infinity the squared error also gets bigger (but not quite as fast). These slopes are the gradients.

image
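The "W = 0.5 is the best fit" picture can be reduced to a one-parameter toy (a plain symmetric squared loss, simpler than whatever the diagram shows): the gradient of L(W) = (W - 0.5)^2 is 2(W - 0.5), and stepping W against that gradient walks it downhill to the best-fit value.

```python
# Toy gradient descent on a 1-parameter squared loss L(W) = (W - 0.5)**2.
def grad(W):
    return 2 * (W - 0.5)  # dL/dW: the slope at the current W

W, lr = 5.0, 0.1          # start far from the best fit; small learning rate
for _ in range(100):
    W -= lr * grad(W)     # step downhill against the gradient
print(round(W, 4))        # ~0.5: settled at the best-fit value
```

This is the mechanism behind "tweaking TF weight/bias values": the real training SW does the same downhill stepping simultaneously over all 175 billion parameters.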

h layer bias ??

All 12288 weighted inputs sum up linearly for an h neuron. The bias is added to that sum. If bias = -2, the output is shifted right by 2; this means the output does not go positive until the pre-bias sum > 2.

h layer GELU ??

image

y (output) layer W

image

y (output) layer b (??)

For details

see "_ziptieai_BOOK2_LLM.docx" chapters

  • 5.1.4
  • 19 LLM transformer training
image

