Transformers - doraithodla/notes GitHub Wiki

  • Feed-forward neural network
  • Self-attention
  • Encoder-Decoder Attention
  • Output Generation

Questions

  • How are transformer architectures parallelizable?

https://willthompson.name/what-we-know-about-llms-primer

When people say “Large Language Models”, they are typically referring to a type of deep learning architecture called a Transformer. Transformers are models that work with sequence data (e.g. text, images, time series, etc.) and are part of a larger family of models called Sequence Models.

What differentiates the Transformer from its predecessors is its ability to learn the contextual relationships of values within a sequence through a mechanism called (self-) Attention.
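As a rough illustration of that mechanism, the sketch below is a minimal single-head self-attention in NumPy. It is a simplification: a real Transformer layer would first project the inputs into separate query, key, and value matrices with learned weights; here the raw embeddings stand in for all three.

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention: each position attends to every other.

    Simplified sketch: uses X directly as queries, keys, and values instead of
    learned projections of X, as a real attention layer would.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # pairwise similarity between all positions
    # Softmax over key positions turns similarities into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X  # each output is a context-weighted mix of the values

X = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, 8-dim embeddings
out = self_attention(X)
print(out.shape)  # one contextualized vector per token
```

Every output row is a weighted average of all input rows, which is the sense in which attention learns contextual relationships across the whole sequence.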

Marsha is a functional, higher-level, English-based programming language that gets compiled into tested Python software by an LLM

https://github.com/alantech/marsha

"Transformers can be generally categorized into one of three categories: “encoder only” (a la BERT); “decoder only” (a la GPT); and having an “encoder-decoder” architecture (a la T5). Although all of these architectures can be rigged for a broad range of tasks (e.g. classification, translation, etc), encoders are thought to be useful for tasks where the entire sequence needs to be understood (such as sentiment classification), whereas decoders are thought to be useful for tasks where text needs to be completed (such as completing a sentence). Encoder-decoder architectures can be applied to a variety of problems, but are most famously associated with language translation."
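The main mechanical difference between encoder-style and decoder-style attention is the mask: a decoder applies a causal mask so each position can only attend to itself and earlier positions, while an encoder attends over the full sequence. A minimal sketch of that distinction, assuming uniform (all-zero) raw scores for clarity:

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Softmax over key positions. With causal=True (decoder-style), attention
    to future positions is blocked; without it (encoder-style), every position
    sees the whole sequence."""
    scores = scores.astype(float)
    if causal:
        n = scores.shape[0]
        # Strictly-upper-triangular entries are "future" positions: mask them out
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))
encoder_w = attention_weights(scores)               # uniform over all 3 positions
decoder_w = attention_weights(scores, causal=True)  # position i ignores positions > i
print(decoder_w)
```

With the causal mask, row 0 puts all its weight on position 0, row 1 splits weight over positions 0 and 1, and so on, which is what lets a decoder generate text left to right.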

Decoder-only Transformers such as ChatGPT & GPT-4 are the class of LLM ubiquitously referred to as “generative AI”.

The Transformer is a model that uses attention to boost the speed with which sequence models can be trained. The Transformer outperforms the Google Neural Machine Translation model on specific tasks. The biggest benefit, however, comes from how well the Transformer lends itself to parallelization.
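This also answers the earlier question about parallelizability. A rough NumPy sketch of the contrast, with a single shared weight matrix standing in for the learned projections: in a Transformer, every position is processed by one batched matrix multiply, whereas an RNN-style update forces a sequential loop because each hidden state depends on the previous one.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))  # 5 tokens, 4-dim embeddings
W = rng.normal(size=(4, 4))  # stand-in for a learned projection matrix

# Transformer-style: all positions projected in one matmul, so the work
# parallelizes trivially across positions (and across GPU cores).
parallel = X @ W

# RNN-style: h_t depends on h_{t-1}, so the time steps cannot run in parallel.
h = np.zeros(4)
sequential = []
for x in X:
    h = np.tanh(x @ W + h)  # each step must wait for the previous one
    sequential.append(h)

# The attention scores are likewise one big matmul over all query-key pairs:
scores = (X @ W) @ (X @ W).T
print(parallel.shape, scores.shape)
```

The batched product gives the same per-position result as projecting each token one at a time; the difference is purely that nothing in the Transformer path has to wait on an earlier time step.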