Basics of Large Language Models (LLMs)
Introduction
Large Language Models (LLMs) are a class of deep learning models trained on vast amounts of data, allowing them to understand and generate natural language. By training on large text datasets, these models learn the linguistic structure, patterns, context, and nuances of human language. As deep learning has evolved, the capabilities of LLMs have advanced with the introduction of neural networks and, in particular, the transformer architecture, allowing LLMs to process language more effectively with far more parameters. LLMs are a breakthrough in natural language processing (NLP), poised to transform a wide range of applications.
How LLMs Work
Modern LLMs rely on the transformer architecture and an extensive training process. Five main components make up a large language model: embedding, tokenization, attention, pre-training, and transfer learning.
Embedding
A primary function for an LLM is creating embeddings, which are vector representations of words. In this stage, each word is transformed into a multi-dimensional vector. This allows the model to understand the semantic relationships between words, based on how close their vectors are in the vector space. For example, the word “cat” will likely have an embedding closer to “dog” than to “stethoscope” because “cat” and “dog” are semantically more similar (both animals) than “cat” and “stethoscope” (an object).
Figure 1: Example of 2D semantic space as vectors
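To make the vector-space idea concrete, the following is a minimal sketch that compares toy embeddings with cosine similarity. The vectors and their values are made up purely for illustration; real LLM embeddings are learned during training and typically have hundreds or thousands of dimensions.

```python
import numpy as np

# Toy 4-dimensional embeddings (made-up values for illustration only).
embeddings = {
    "cat":         np.array([0.8, 0.6, 0.1, 0.0]),
    "dog":         np.array([0.7, 0.7, 0.2, 0.1]),
    "stethoscope": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: closer to 1 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))          # high (about 0.98)
print(cosine_similarity(embeddings["cat"], embeddings["stethoscope"]))  # low (about 0.12)
```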
Tokenization
Tokenization breaks the input text into smaller chunks known as “tokens,” which allows the LLM to manage its vocabulary. These tokens are then converted into embeddings. Consider the sentence “They aren’t able to eat!” It could be tokenized into [“They”, “aren”, “‘t”, “able”, “to”, “eat!”] or perhaps into [“They”, “aren’t”, “able”, “to”, “eat”], and the choice of scheme affects how well the model handles specialized vocabulary, unusual characters, and ambiguity.
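The sketch below contrasts two simple tokenization schemes for that sentence using plain Python. Real LLM tokenizers typically rely on learned subword algorithms such as byte-pair encoding, so this only illustrates how the same text can split into different token sequences.

```python
import re

sentence = "They aren't able to eat!"

# Scheme 1: split on whitespace only.
whitespace_tokens = sentence.split()
# -> ['They', "aren't", 'able', 'to', 'eat!']

# Scheme 2: also split off contractions and punctuation, loosely mimicking
# how subword tokenizers break words into smaller reusable pieces.
subword_style_tokens = re.findall(r"\w+|'\w+|[^\w\s]", sentence)
# -> ['They', 'aren', "'t", 'able', 'to', 'eat', '!']

print(whitespace_tokens)
print(subword_style_tokens)
```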
Attention
A defining feature of the transformer models underlying LLMs is the self-attention mechanism, which lets the model weigh the importance of every other word in the sentence when processing a given word. Unlike embeddings, self-attention weights do not directly correspond to semantic meaning. This mechanism lets transformers capture relationships between words regardless of their positions in the sentence.
Figure 2: Visualization of attention in transformers
Figure 2 shows the model assigning higher attention weights between “it” and both “monkey” and “banana,” since “it” is contextually linked to a noun, which could be either the banana or the monkey.
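The following is a minimal NumPy sketch of the scaled dot-product attention computation at the core of self-attention. The token list and random vectors stand in for the sentence in Figure 2; in a real transformer, the queries, keys, and values are learned linear projections of the token embeddings rather than the embeddings themselves.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, returning outputs and attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V, weights                        # weighted sum of value vectors

# Stand-in 4-dimensional vectors for the tokens of an example sentence.
tokens = ["the", "monkey", "ate", "the", "banana", "because", "it", "was", "hungry"]
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), 4))

# Using the same matrix for Q, K, and V keeps the sketch simple.
output, attn = scaled_dot_product_attention(X, X, X)
print(attn[tokens.index("it")].round(2))  # how strongly "it" attends to every token
```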
Pre-Training
Pre-training is the phase in which the model undergoes unsupervised/self-supervised learning on large text datasets. Training on billions of sentences, the model learns linguistic patterns such as grammar and context in order to later predict words and sentences. In many cases, LLMs can also pick up factual knowledge when the same facts appear consistently across the training data.
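As a loose analogy for the “predict the next word” objective, the sketch below counts which word follows which in a tiny made-up corpus. A real LLM learns such statistics with a neural network over billions of sentences rather than explicit counts; the point is only that the text itself supplies the training signal, with no human labels required.

```python
from collections import Counter, defaultdict

# A tiny stand-in corpus; real pre-training uses billions of sentences.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# Count which word follows which (a crude stand-in for the statistical
# patterns a transformer learns during self-supervised pre-training).
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        next_word_counts[current][following] += 1

# "Predict" plausible continuations after "the" from the observed counts.
print(next_word_counts["the"].most_common(3))
```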
Transfer Learning
Transfer learning occurs after pre-training, when the model is fine-tuned. This phase adapts the model to the requirements of specific tasks. After the model understands general language patterns, it is further customized for tasks such as translation or question answering.
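Below is a minimal fine-tuning sketch assuming the Hugging Face transformers library and a hypothetical two-class downstream task; the model name and the decision to freeze the encoder are illustrative choices rather than a prescribed recipe.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Start from a general-purpose pre-trained model (illustrative choice)
# and attach a new two-class classification head for the downstream task.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze the pre-trained encoder so only the new head is trained,
# one common and inexpensive form of transfer learning.
for param in model.base_model.parameters():
    param.requires_grad = False

# A standard supervised training loop over labeled task data would then
# fine-tune the remaining trainable parameters.
```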
Applications of LLMs
Large Language Models are becoming more integrated in many industries, redefining workflows and business processes.
Content Generation
Large Language Models can create text for blogs, social media, product descriptions, and more. These models produce coherent text that closely mimics human language patterns, including their nuances. Additionally, LLMs can be used to generate ideas, offering a starting point for brainstorming in various fields.
AI Assistants
LLMs can be used to develop chatbots and virtual assistants, which are common in customer service. These chatbots take queries from customers and may process refunds, file reports, and answer questions without the need for a human agent. Integrating these models streamlines operations and automates repetitive tasks for employees.
Code Generation
LLMs trained on programming languages can assist developers by suggesting code snippets, offering solutions to common coding problems, and debugging errors. Advanced tools such as GitHub Copilot, which are trained on a variety of programming languages, help improve productivity by providing real-time code completions and reducing the time spent on trivial code.
Figure 3: GitHub Copilot generating a unit test
Certain Large Language Models, such as Gemini and ChatGPT, are not restricted to a specific task specialization, making them applicable in a wide range of fields. These models act as conversational agents, coding assistants, and text generators. Such models, however, require significant computational resources for training and inference.
Challenges of LLMs
Inaccuracy
Since LLMs rely on predictive decision making, their outputs are based on probabilities derived from their training data rather than on verified knowledge or logical reasoning. This may result in inaccurate or illogical statements, especially when a model attempts to generate information outside the scope of its training data. While outputs may appear reasonable, they can be inconsistent with real-world information. Such issues are mitigated, though not fully resolved, through validation processes during testing.
Figure 4: ChatGPT incorrectly responding to question
Bias
Like other deep learning models, LLMs are prone to inheriting biases present in their training data, which is a significant ethical consideration when developing them. These biases can lead to harmful outputs that discriminate on the basis of gender, race, culture, and more. They are mitigated through diverse datasets and by auditing the training process to identify harmful patterns.
Cost
The development, training, and deployment of Large Language Models come with significant costs. Training typically requires specialized GPUs or TPUs, which lead to large electricity and infrastructure costs. ChatGPT, for example, is reported to have cost hundreds of millions of dollars in computing infrastructure alone.
Data acquisition is another cost of developing LLMs. Models may utilize several billions of text samples, which must be gathered from diverse sources to ensure robustness. The preprocessing of this data requires significant storage, labor, and processing power.
After deployment, LLMs require continuous computational resources for inference. A challenge in LLMs is managing the cost of server maintenance, cloud infrastructure, and low-latency responses.
Future of LLMs
Reinforcement Learning from Human Feedback (RLHF)
A growing trend in LLM research is RLHF, which incorporates human feedback into the model’s learning process. This steers the model’s outputs toward human preferences and desired behaviors.
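One core ingredient of RLHF is a reward model trained on pairs of responses ranked by humans. The sketch below shows the pairwise preference loss commonly used for that step, with hypothetical reward scores; the later reinforcement-learning stage that optimizes the LLM against this reward model is omitted.

```python
import numpy as np

def preference_loss(reward_chosen: np.ndarray, reward_rejected: np.ndarray) -> float:
    """Pairwise loss: pushes the reward of the human-preferred response
    above the reward of the rejected one (negative log-sigmoid of the gap)."""
    gap = reward_chosen - reward_rejected
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-gap)))))

# Hypothetical scalar rewards for three prompts, each with a response
# humans preferred ("chosen") and one they ranked lower ("rejected").
chosen = np.array([2.1, 0.3, 1.5])
rejected = np.array([0.4, 0.9, -0.2])
print(preference_loss(chosen, rejected))  # smaller when rewards agree with human rankings
```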
Efficiency
As model size increases, so do the computational requirements. Developers aim to improve the efficiency of LLMs through optimization techniques such as pruning and quantization, as well as by harnessing advanced hardware. These techniques work toward reducing operational costs and environmental impact.
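As a rough illustration of quantization, the sketch below stores a small weight matrix as 8-bit integers plus a single scale factor, cutting memory use fourfold at the cost of a small rounding error. Production schemes (per-channel scales, activation quantization, and so on) are considerably more involved.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric 8-bit quantization: int8 values plus one float scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)).astype(np.float32)   # a stand-in weight matrix
q, scale = quantize_int8(W)

print("memory: float32 =", W.nbytes, "bytes, int8 =", q.nbytes, "bytes")
print("max rounding error:", np.abs(W - dequantize(q, scale)).max())
```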
Conclusion
Large Language Models represent a major advance in natural language processing. By continually evolving through training on extensive datasets, these models become more effective across a wide range of industries. While the cost and reliability of LLMs still need improvement, ongoing research into their optimization is expected to make these models more accessible and sustainable.
Sources
https://aws.amazon.com/what-is/large-language-model/
https://arxiv.org/pdf/2307.06435
https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/tutorial.html
https://cedar.buffalo.edu/~srihari/CSE676/12.4.6%20AttentionModels.pdf
https://www.researchgate.net/publication/350714675_Deep_Learning_Enabled_Semantic_Communication_Systems
https://community.openai.com/t/incorrect-count-of-r-characters-in-the-word-strawberry/829618