AI Platforms journey in 2025 LLM‐s GenAI - mazsola2k/genaiprompt GitHub Wiki
Background
Machine Learning (ML) is a broad field enabling computers to learn patterns and make predictions from data using various learning methods such as supervised learning (with labeled data), unsupervised learning (without labels, e.g., clustering customers), and reinforcement learning (learning by feedback/reward). Neural networks are one approach in ML, but not all ML models use them (e.g., linear regression and clustering do not).
Generative AI (GenAI) is a subset of ML focused on creating new content—such as text, images, or code—rather than just predictions (e.g., generating images with Stable Diffusion or DALL-E). GenAI primarily relies on neural networks and often uses unsupervised or self-supervised learning for pretraining, sometimes combined with supervised or reinforcement learning for fine-tuning.
Large Language Models (LLMs) are a type of GenAI specialized in understanding and generating human-like text (e.g., chatbots like GPT-4 or Llama-3). LLMs are built using deep neural networks—typically transformers—trained first via self-supervised learning on massive collections of written texts, then often fine-tuned with supervised or reinforcement learning.
In summary: LLM is a subset of GenAI, which is a subset of Machine Learning, with neural networks and a blend of learning methods (supervised, unsupervised, self-supervised, reinforcement) providing the foundation for GenAI and LLMs.
Major cloud and on-prem platforms, implementations, and libraries:
Feature | OpenAI GPT (API) | Google Gemini (ex-Bard) | Amazon Bedrock (AI) | HuggingFace | llama.cpp | Ollama |
---|---|---|---|---|---|---|
Provider/Infra | OpenAI (Azure, AWS) | Google Cloud | AWS | Community/Cloud/Local | Local (C/C++) | Local (Go, wraps llama.cpp) |
Language | REST, Python, JS, etc. | REST, Python, JS, etc. | REST, Python, JS, etc. | Python (PyTorch, TF) | C/C++ | Go |
Model Format | Proprietary | Proprietary | Proprietary (Titan, Claude, etc.) | PyTorch, Safetensors, etc. | GGUF/GGML | GGUF |
Training/Fine-tuning | No (API only) | No (API only) | No (API only) | Yes | No | No |
Inference | Cloud only | Cloud only | Cloud only | GPU/CPU (flexible) | CPU (main), some GPU | CPU (main), some GPU |
Quantization | No | No | No | Some support | Full, custom (2–8 bit) | Full, via llama.cpp |
API/Integration | REST API, SDKs | REST API, SDKs | REST API, SDKs | Python API | C++ API, CLI | REST API, CLI |
Typical Use | SaaS, production apps | SaaS, production apps | SaaS, production apps | R&D, prod, cloud/local | Desktop, edge | Local/private, easy |
Model Download/Sharing | Not allowed | Not allowed | Not allowed | HuggingFace Hub | Manual/3rd party repos | Built-in (like Docker) |
Multi-modal Support | Yes (GPT-4o, etc.) | Yes | Some | Yes (audio, vision, etc.) | No (text only) | No (text only) |
Popular Models
Model Name | Parameters (B) | VRAM (GB) | RAM (GB) | Summary of Size Differences | On-Prem | License |
---|---|---|---|---|---|---|
TinyLlama 1.1B | 1.1B | 1 | 2 | Extremely small, ultra-lightweight, for edge devices and minimal hardware | Yes | Apache-2.0 |
Gemma 2B | 2B | 2 | 5 | Ultra-lightweight, for minimal hardware, simple queries only | Yes | Apache-2.0 |
Phi-3 Mini | 3.8B | 4 | 7 | Very small, lightweight, good for simple queries | Yes | MIT |
Mistral 1B | 1B | 1 | 2 | Smallest Mistral, fast, for very lightweight tasks and simple queries | Yes | Apache-2.0 |
Mistral 3B | 3B | 3 | 6 | Small, fast, more capable than 1B, still very resource-light | Yes | Apache-2.0 |
Mistral 7B | 7B | 6 | 10 | Small, fast, efficient, good for quick/simple tasks | Yes | Apache-2.0 |
Llama 2 7B | 7B | 6 | 10 | Smallest Llama 2, fast, needs little RAM/VRAM, less accurate for complex tasks | Yes | Meta (restr.) |
Llama 3 8B | 8B | 6 | 10 | Small, efficient, good for basic tasks and fast responses | Yes | Meta (restr.) |
Llama 2 13B | 13B | 12 | 20 | Medium, better language skills and reasoning, higher resource use | Yes | Meta (restr.) |
Llama 4 Scout 17B | 17B | 14 | 24 | Mid-large, strong performance, modern architecture, balances skill and efficiency | Yes | Meta (restr.) |
Mixtral 8x7B | 46B (Mixture) | 24 | 50 | Uses multiple 7B experts, high performance, needs more RAM/VRAM | Yes | Apache-2.0 |
DeepSeek 7B | 7B | 6 | 10 | Small, efficient, strong at code and reasoning for its size | Yes | DeepSeek |
DeepSeek 67B | 67B | 36 | 88 | Large, excels at coding and complex tasks, high-end hardware required | Yes | DeepSeek |
Llama 2 70B | 70B | 40 | 100 | Large, much better at nuanced tasks, requires high-end hardware | Yes | Meta (restr.) |
Llama 3 70B | 70B | 40 | 100 | Large, state-of-the-art, more context and accuracy, high resource demands | Yes | Meta (restr.) |
Llama 3.1 405B | 405B | 200 | 600 | Extremely large, SOTA reasoning, needs cluster-scale hardware | Yes | Meta (restr.) |
GPT-3.5 | ~13B* | N/A | N/A | Very capable, good for most tasks, only available via API/service | No | Proprietary |
GPT-4 | ~70B* | N/A | N/A | Highly capable, advanced reasoning, only available via API/service | No | Proprietary |
GPT-4.1 | ~70B* | N/A | N/A | Most advanced GPT in this table (released 2025), improved reasoning, only via API/service | No | Proprietary |
*Parameter counts for GPT-3.5, GPT-4, and GPT-4.1 are not official; they are estimated for comparison purposes.
Notes:
- "Yes" under On-Prem means you can run the model on your own hardware with enough resources (usually with GGUF, PyTorch, or similar formats).
Parameters
“Billion parameters” refers to the number of adjustable values (weights) in an AI model’s neural network; more parameters generally mean a larger and more capable, but also more resource-intensive, model. For example, Mistral 7B has 7 billion parameters, while Mixtral 8x7B has about 46 billion in total. The larger Mixtral model can typically handle more complex language understanding, generate more detailed responses, and maintain context better than Mistral 7B, but it also needs more memory and computing power.
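The link between parameter count and memory can be sketched with simple arithmetic. The helper below is a minimal illustration, not a sizing tool: `weight_memory_gb` is a hypothetical name, it counts only the weights (no context cache or runtime overhead), and it assumes every parameter is stored at the same precision.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory needed just to hold the weights, in GB (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# Parameter counts taken from the table above.
for name, size_b in [("Mistral 7B", 7), ("Mixtral 8x7B", 46)]:
    print(f"{name}: ~{weight_memory_gb(size_b):.0f} GB at 16-bit")
# Mistral 7B: ~14 GB at 16-bit
# Mixtral 8x7B: ~92 GB at 16-bit
```

Real requirements are higher than this weights-only figure, which is one reason the table values should be read as rough guidance.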
Real-life example:
If you ask both models for detailed instructions on how to build a complex software system, Mixtral 8x7B might provide a thorough, step-by-step plan, while Mistral 7B could miss some important details or oversimplify. However, even the bigger Mixtral model might not know about a brand-new programming framework released yesterday, since its training data only goes up to a certain date and it can’t access the latest information.
Bottleneck for On-Prem: Video RAM & RAM
A VRAM bottleneck occurs when your GPU doesn’t have enough memory to load and run large AI models efficiently. This limits the size and speed of the models you can use, especially for deep learning and large language models.
Solution: Model Quantization
- Quantization shrinks AI models by reducing numerical precision (e.g., from 16-bit to 4-bit), drastically lowering memory use.
- This allows large models to run on standard laptops, desktops, and even some mobile devices—no need for powerful GPUs.
- GGUF is a modern, efficient file format tailored for quantized models and is supported by tools like llama.cpp. The final GGUF file is much smaller and runs efficiently, with only minor quality trade-offs.
- Quantized GGUF models unlock advanced AI capabilities for everyone, enabling fast, local inference on regular hardware and reducing dependency on expensive GPUs.
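To make "reducing numerical precision" concrete, here is a toy sketch of symmetric linear quantization in plain Python. It is an illustration only: the function names and sample weights are made up, and the quantization schemes actually used in GGUF files work block-wise and are more elaborate than this.

```python
def quantize(values, bits=4):
    """Map floats to signed integer codes of the given bit width, plus one scale."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit codes
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest else 1.0  # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -0.05, 0.31]      # made-up sample weights
codes, scale = quantize(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("restored:", [round(r, 2) for r in restored])
print(f"max error: {max_error:.3f}")
```

Each weight now needs only 4 bits plus a shared scale, at the cost of a rounding error of at most half the scale. That error is the "minor quality trade-off" mentioned above.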
AI models require both VRAM (GPU memory) and RAM (system memory), and these are separate resources with different limits on most computers. When running a large language model (like Llama 2 13B), the model and its data must fit into VRAM for fast GPU inference, while RAM is used for loading the model and managing system tasks. For example, a quantized Llama 2 13B model in GGUF format might need 10–14GB VRAM and 16–24GB RAM. If either memory type is insufficient, the model may not run or will run much slower, often falling back to CPU-only processing. Quantization and efficient formats like GGUF help reduce these requirements, making powerful models accessible on consumer hardware.
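When a model does not fit entirely in VRAM, llama.cpp-based tools can keep only some layers on the GPU and run the rest on the CPU. The sketch below estimates how many layers might fit. It is a rough heuristic under loud assumptions: all layers are treated as equally sized, `reserve_gb` is a guessed allowance for the context cache and other overhead, and the function name and example numbers are hypothetical.

```python
def gpu_layers_that_fit(model_file_gb, n_layers, vram_gb, reserve_gb=1.0):
    """Rough count of layers that fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)   # keep headroom for cache/overhead
    return min(n_layers, int(usable_gb / per_layer_gb))


# Hypothetical example: an 8 GB quantized file with 32 layers on a 6 GB GPU.
print(gpu_layers_that_fit(8.0, 32, 6.0))    # 20 layers on the GPU, the rest on CPU
print(gpu_layers_that_fit(8.0, 32, 24.0))   # 32: the whole model fits
```

A result below the total layer count means part of the model runs on the CPU, which is where the slowdown described above comes from.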
Bottleneck for Cloud: Training/Fine-tuning Not User-Available
Major cloud AI models like OpenAI GPT, Google Gemini (Bard), and the models on Amazon Bedrock do not give users the model weights for direct training; only their providers can update the core models.
Users can customize behavior using prompt engineering, API parameters, or, in some cases, provider-managed “customization” tools, but this is not the same as training a model you control.
Solution: Open-source Models Through Local Tools
True training and fine-tuning are supported for open-source models (like Llama, Mistral, Mixtral) via frameworks such as HuggingFace; the resulting models can then be run locally with tools like llama.cpp or Ollama, which handle inference only.
For advanced, tailored AI solutions, organizations should use open models or cloud platforms that explicitly support bring-your-own-model (BYOM) and user-level training.
Cloud AI APIs are best for production, scale, and up-to-date capabilities, but come with restrictions on customization and data privacy.
More Restrictions - Licensing
- Meta (restr.): Commercial use allowed with limits (e.g., not for >700M monthly users or training competing LLMs); some sector and user restrictions.
- DeepSeek: Permissive, but blocks use by certain big companies and for training competitors.
- Apache-2.0: Very permissive; requires attribution and license inclusion.
- MIT: Highly permissive; requires attribution.
- Proprietary: API/service only; no download, modification, or self-host.
Try Ollama Platform
Ollama is a platform and toolset that simplifies running large language models locally on your own hardware. It provides an easy interface for downloading, managing, and interacting with open-source AI models, focusing on privacy, speed, and user control. Ollama is popular for enabling fast, offline AI model inference without relying on cloud services.
Download
Download and install from https://ollama.com/download. It is now also available on Windows.
Python Chatbot prompt script to interface with Ollama
View Ollama Python client sample
Key sections:
OLLAMA_MODEL = "llama4"  # alternatively, try another model, e.g. "mistral"
OLLAMA_URL = "http://localhost:11434"

def ensure_model(model):
    ...

def generate_response(prompt):
    ...

if __name__ == "__main__":
    print("Ollama Llama4 CLI. Type your prompt and press Enter. Type 'exit' to quit.\n")
    if ensure_model(OLLAMA_MODEL):
        while True:
            prompt = input("Prompt: ")
            if prompt.strip().lower() == "exit":
                break
            generate_response(prompt)
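The helper bodies are elided above. As a rough sketch of what a generation call could look like (not the wiki script's actual code), the following uses only the standard library and assumes Ollama's REST endpoint `/api/generate` with streaming turned off. The names `build_generate_request` and `generate_text` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local Ollama address, as above


def build_generate_request(model, prompt):
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_text(model, prompt, timeout=300):
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # With "stream": False the server answers with a single JSON object.
        return json.loads(resp.read())["response"]
```

With the Ollama server running and the model already pulled, `print(generate_text("mistral", "Say hello in one sentence."))` would print the reply. Streaming output and error handling are left out to keep the sketch short.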
Good to go for testing
python ollama-llama4.py
Ollama Llama4 CLI. Type your prompt and press Enter. Type 'exit' to quit.
Pulling model 'llama4' from Ollama Hub...
pulling 9d507a36062c
Model 'llama4' pulled successfully.
Enter your prompt: please generate an ansible snippet to automate to install an nginx podman container and expose port 443 to the service
Generating response...
Generated Output:
**Installing Nginx with Podman using Ansible**
====================================================
Below is an example Ansible playbook snippet that installs an Nginx container using Podman and exposes port 443.
```yml
---
- name: Install and run Nginx container with Podman
  hosts: [localhost]
  become: yes
  tasks:
    - name: Install Podman
      dnf:
        name: podman
        state: present
    - name: Pull Nginx image
      podman_image:
        name: nginx
        state: present
    - name: Run Nginx container
      podman_container:
        name: nginx-container
        image: nginx
        published_ports:
          - "443:443"
        restart_policy: always
        state: started
```
Example Usage:
Save the playbook to a file (e.g., nginx-podman.yml) and run it using the command:
ansible-playbook nginx-podman.yml
Try llama_cpp Platform
- llama.cpp: Fast, open-source C/C++ library for running Llama and similar LLMs locally on CPUs or lightweight hardware, supporting quantized models and cross-platform use.
- llama-cpp-python: Python bindings for llama.cpp, making it easy to use llama.cpp’s efficient local inference from Python code and integrate with Python ML workflows.
For llama_cpp you have to download the models yourself
# Examples for Quantized GGUF models:
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
https://huggingface.co/mradermacher/Llama-4-Scout-17B-6E-Instruct-i1-GGUF/resolve/main/Llama-4-Scout-17B-6E-Instruct.i1-Q4_K_S.gguf
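Instead of downloading by hand, the file can also be fetched from Python. The sketch below is one possible approach, assuming the `huggingface_hub` package is installed and that the link follows the `.../resolve/<revision>/<filename>` pattern shown above. `split_hf_url` and `download_gguf` are hypothetical helper names.

```python
from urllib.parse import urlparse


def split_hf_url(url):
    """Split a huggingface.co 'resolve' URL into (repo_id, revision, filename)."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 5 or parts[2] != "resolve":
        raise ValueError(f"not a resolve URL: {url}")
    org, repo, _, revision = parts[:4]
    return f"{org}/{repo}", revision, "/".join(parts[4:])


def download_gguf(url, local_dir="models"):
    """Download the file via huggingface_hub and return its local path."""
    # Imported lazily so the URL helper works without the package installed.
    from huggingface_hub import hf_hub_download

    repo_id, revision, filename = split_hf_url(url)
    return hf_hub_download(
        repo_id=repo_id, filename=filename, revision=revision, local_dir=local_dir
    )
```

Calling `download_gguf(...)` with the first URL above would place the file under `models/`; that path can then be passed as `model_path` when loading the model. Gated repositories additionally require a Hugging Face login.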
If you are using Llama 3 or earlier models, a plain pip install is enough:
pip install llama-cpp-python
If you use Llama 4, you need to patch and recompile the package.
Additionally if you are on Windows 11:
# Ensure Visual Studio Build Tools installed with the below options:
# - MSVC v143 - VS 2022 C++ x64/x86 build tools
# - C++ CMake tools for Windows
# - Windows 10 SDK (not 11, be careful)
Compile and install the patched version of the pip package:
pip install --no-cache-dir --force-reinstall llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121/
After the latest version has compiled and installed, you should see:
pip show llama-cpp-python
Name: llama_cpp_python
Version: 0.3.9
Summary: Python bindings for the llama.cpp library
Home-page: https://github.com/abetlen/llama-cpp-python
Python Chatbot prompt script with llama_cpp
Key Sections
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./meta-llama-3-8b.Q4_K_M.gguf",  # Update to your downloaded file
    n_ctx=4096,  # Context window size in tokens
    n_gpu_layers=-1,  # Full GPU offload
    verbose=False,
)
...
while True:
    prompt = input("\nEnter your prompt (or type 'exit' to quit):\n> ")
    if prompt.strip().lower() == "exit":
        break
    print("Calculating response...", end="", flush=True)
    start = time.time()
    output = llm(
        prompt,
        max_tokens=256,
        temperature=0.7,
        top_p=0.95,
    )
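The excerpt stops before the result is printed. A minimal sketch of that last step follows, assuming the call returns a completion-style dictionary with the text under `choices[0]["text"]`, which is how llama-cpp-python structures its completion results. The helper names are hypothetical, and a stand-in function replaces the real model so the snippet runs on its own.

```python
import time


def first_completion_text(output):
    """Return the text of the first choice from a completion-style result dict."""
    return output["choices"][0]["text"].strip()


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


def fake_llm(prompt, **kwargs):
    """Stand-in for the real `llm` object, used only for this demonstration."""
    return {"choices": [{"text": f" echo: {prompt} "}]}


output, seconds = timed(fake_llm, "hello", max_tokens=256)
print(f"Response ready in {seconds:.1f} seconds.")
print(first_completion_text(output))
```

In the real script, `fake_llm` would be replaced by the `llm` object created above.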
You are good to go and test:
python llama_cpp-llama4-17b-gguf.py
Enter your prompt (or type 'exit' to quit):
> please generate an ansible snippet to automate to install an nginx podman container and expose port 443 to the service
Response ready in 59.9 seconds.
## To answer your question, here's an Ansible snippet to automate the installation of an Nginx Podman container and expose port 443 to the service:
---
- name: Install and configure Nginx Podman container
...
Try Hugging Face
Hugging Face is known for its open-source Transformers library, which provides easy access to advanced machine learning models for natural language processing. It also hosts the Hugging Face Hub, a platform for sharing models and datasets, supporting collaborative and accessible AI development.
Preparations
- CUDA: If you have an NVIDIA card, install the CUDA Toolkit.
- Hugging Face account: Register at huggingface.co and get a token.
- Models approval: Request access for some models (e.g., Meta’s Llama models may require approval).
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install transformers accelerate huggingface-hub
huggingface-cli login
Python Chatbot prompt script with Hugging Face / PyTorch
View PyTorch Hugging Face sample
Key Sections
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# hf_token and prompt are read from user input elsewhere in the script
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, token=hf_token).to(device)
inputs = tokenizer(f"Question: {prompt}\nAnswer:", return_tensors="pt").to(device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    top_p=0.8,
    top_k=10,
    temperature=0.7,
)
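`generate` returns token IDs that include the prompt, so the answer still has to be decoded. One way to do that is sketched below: slice off the prompt tokens and decode the rest with `skip_special_tokens=True`. The helper name is hypothetical, and a tiny stand-in tokenizer is used so the snippet runs without downloading a model.

```python
def decode_new_tokens(tokenizer, output_ids, prompt_length):
    """Decode only the tokens generated after the prompt."""
    new_ids = output_ids[0][prompt_length:]  # first (only) sequence in the batch
    return tokenizer.decode(new_ids, skip_special_tokens=True).strip()


class StandInTokenizer:
    """Minimal stand-in with a decode() method, for demonstration only."""

    vocab = {1: "Question:", 2: "hi", 3: "Answer:", 4: "hello", 5: "there"}

    def decode(self, ids, skip_special_tokens=False):
        return " ".join(self.vocab[i] for i in ids)


fake_output_ids = [[1, 2, 3, 4, 5]]  # three prompt tokens, then two new tokens
print(decode_new_tokens(StandInTokenizer(), fake_output_ids, prompt_length=3))
# hello there
```

With the real objects from the excerpt above, the call would be `decode_new_tokens(tokenizer, output_ids, inputs["input_ids"].shape[1])`.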
Good to go for testing:
python pytorch-llama2-7b.py
Enter your Hugging Face token (leave blank to use default login):
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████| 2/2 [00:00<00:00, 40.64it/s]
Enter your prompt (or type 'exit' to quit):
> please generate an ansible snippet to automate to install an nginx podman container and expose port 443 to the service
Here is an example Ansible snippet that can be used to automate the installation of an Nginx Podman container and expose port 443 to the service:
---
- name: Install Nginx and expose port 443
  podman:
    image: nginx:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /path/to/nginx.conf:/etc/nginx/conf.d/default.conf
    ports:
      - "443:443"
    command: ["nginx", "-g", "daemon off;"]