AI Platforms Journey in 2025: LLMs and GenAI - mazsola2k/genaiprompt GitHub Wiki

Background

Machine Learning (ML) is a broad field enabling computers to learn patterns and make predictions from data using various learning methods such as supervised learning (with labeled data), unsupervised learning (without labels, e.g., clustering customers), and reinforcement learning (learning by feedback/reward). Neural networks are one approach in ML, but not all ML models use them (e.g., linear regression and clustering do not).

AI

Generative AI (GenAI) is a subset of ML focused on creating new content—such as text, images, or code—rather than just predictions (e.g., generating images with Stable Diffusion or DALL-E). GenAI primarily relies on neural networks and often uses unsupervised or self-supervised learning for pretraining, sometimes combined with supervised or reinforcement learning for fine-tuning.

Large Language Models (LLMs) are a type of GenAI specialized in understanding and generating human-like text (e.g., chatbots like GPT-4 or Llama-3). LLMs are built using deep neural networks—typically transformers—trained first via self-supervised learning on massive collections of written texts, then often fine-tuned with supervised or reinforcement learning.

In summary: LLMs are a subset of GenAI, which is a subset of Machine Learning. Neural networks and a blend of learning methods (supervised, unsupervised, self-supervised, reinforcement) provide the foundation for both GenAI and LLMs.

Major cloud and on-prem platforms, implementations, and libraries:

Feature	OpenAI GPT (API)	Google Gemini (ex-Bard)	Amazon Bedrock (AI)	HuggingFace	llama.cpp	Ollama
Provider/Infra	OpenAI (also via Azure)	Google Cloud	AWS	Community/Cloud/Local	Local (C/C++)	Local (Go, wraps llama.cpp)
Language	REST, Python, JS, etc.	REST, Python, JS, etc.	REST, Python, JS, etc.	Python (PyTorch, TF)	C/C++	Go
Model Format	Proprietary	Proprietary	Proprietary (Titan, Claude, etc.)	PyTorch, Safetensors, etc.	GGUF/GGML	GGUF
Training/Fine-tuning	Managed only (no weight access)	Managed only (no weight access)	Managed only (no weight access)	Yes	No	No
Inference	Cloud only	Cloud only	Cloud only	GPU/CPU (flexible)	CPU and GPU (CUDA, Metal, Vulkan)	CPU and GPU (via llama.cpp)
Quantization	No	No	No	Some support	Full, custom (2–8 bit)	Full, via llama.cpp
API/Integration	REST API, SDKs	REST API, SDKs	REST API, SDKs	Python API	C++ API, CLI	REST API, CLI
Typical Use	SaaS, production apps	SaaS, production apps	SaaS, production apps	R&D, prod, cloud/local	Desktop, edge	Local/private, easy
Model Download/Sharing	Not allowed	Not allowed	Not allowed	HuggingFace Hub	Manual/3rd party repos	Built-in (like Docker)
Multi-modal Support	Yes (GPT-4o, etc.)	Yes	Some	Yes (audio, vision, etc.)	Limited (some vision models)	Limited (some vision models)

Popular Models

Model Name	Parameters (B)	VRAM (GB)	RAM (GB)	Summary of Size Differences	On-Prem	License
TinyLlama 1.1B	1.1B	1	2	Extremely small, ultra-lightweight, for edge devices and minimal hardware	Yes	Apache-2.0
Gemma 2B	2B	2	5	Ultra-lightweight, for minimal hardware, simple queries only	Yes	Gemma Terms of Use
Phi-3 Mini	3.8B	4	7	Very small, lightweight, good for simple queries	Yes	MIT
Mistral 1B	1B	1	2	Smallest Mistral, fast, for very lightweight tasks and simple queries	Yes	Apache-2.0
Mistral 3B	3B	3	6	Small, fast, more capable than 1B, still very resource-light	Yes	Apache-2.0
Mistral 7B	7B	6	10	Small, fast, efficient, good for quick/simple tasks	Yes	Apache-2.0
Llama 2 7B	7B	6	10	Smallest Llama 2, fast, needs little RAM/VRAM, less accurate for complex tasks	Yes	Meta (restr.)
Llama 3 8B	8B	6	10	Small, efficient, good for basic tasks and fast responses	Yes	Meta (restr.)
Llama 2 13B	13B	12	20	Medium, better language skills and reasoning, higher resource use	Yes	Meta (restr.)
Llama 4 Scout 17B	17B	14	24	Mid-large, strong performance, modern architecture, balances skill and efficiency	Yes	Meta (restr.)
Mixtral 8x7B	46B (Mixture)	24	50	Uses multiple 7B experts, high performance, needs more RAM/VRAM	Yes	Apache-2.0
DeepSeek 7B	7B	6	10	Small, efficient, strong at code and reasoning for its size	Yes	DeepSeek
DeepSeek 67B	67B	36	88	Large, excels at coding and complex tasks, high-end hardware required	Yes	DeepSeek
Llama 2 70B	70B	40	100	Large, much better at nuanced tasks, requires high-end hardware	Yes	Meta (restr.)
Llama 3 70B	70B	40	100	Large, state-of-the-art, more context and accuracy, high resource demands	Yes	Meta (restr.)
Llama 3.1 405B	405B	200	600	Extremely large, SOTA reasoning, needs cluster-scale hardware	Yes	Meta (restr.)
GPT-3.5	Undisclosed*	N/A	N/A	Very capable, good for most tasks, only available via API/service	No	Proprietary
GPT-4	Undisclosed*	N/A	N/A	Highly capable, advanced reasoning, only available via API/service	No	Proprietary
GPT-4.1	Undisclosed*	N/A	N/A	Newest GPT-4-series model listed here (released 2025), improved reasoning, only via API/service	No	Proprietary

*OpenAI has not published parameter counts for GPT-3.5, GPT-4, or GPT-4.1; third-party estimates vary widely and are unofficial.

Notes:

"Yes" under On-Prem means you can run the model on your own hardware with enough resources (usually with GGUF, PyTorch, or similar formats).

Parameters

“Billion parameters” refers to the number of adjustable values (weights) in an AI model’s neural network; more parameters generally mean a larger and more capable, but also more resource-intensive, model. For example, Mistral 7B has about 7 billion parameters, while Mixtral 8x7B has about 46 billion in total. Mixtral is a mixture-of-experts model, so only a subset of those parameters is active for each token, but all of them must still be loaded into memory. The larger model can typically handle more complex language understanding, generate more detailed responses, and maintain context better than Mistral 7B, at the cost of more memory and computing power.

Real-life example:
If you ask both models for detailed instructions on how to build a complex software system, Mixtral 8x7B might provide a thorough, step-by-step plan, while Mistral 7B could miss some important details or oversimplify. However, even the larger model might not know about a brand-new programming framework released yesterday, since its training data ends at a cutoff date and it cannot access newer information on its own.
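
To make "parameters" concrete, the sketch below counts the weights and biases of a small fully connected network. The layer sizes are made up for illustration; real LLMs use transformer layers, but the principle is the same: every weight and bias is one trainable parameter.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of each layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input/output pair
        total += n_out         # one bias per output unit
    return total


# Hypothetical toy network: 784 inputs, two hidden layers, 10 outputs.
toy_params = count_dense_parameters([784, 256, 128, 10])
print(f"Toy network: {toy_params:,} parameters")
```

A 7B model has roughly 30,000 times as many parameters as this toy network.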

Bottleneck for On-Prem: Video RAM & RAM

A VRAM bottleneck occurs when your GPU does not have enough memory to load and run a large AI model efficiently. This limits the size and speed of the models you can use, especially large language models.

Solution: Model Quantization

Quantization shrinks AI models by reducing numerical precision (e.g., from 16-bit to 4-bit weights), which cuts memory use roughly in proportion to the bit width.
This allows large models to run on standard laptops, desktops, and even some mobile devices, without a high-end GPU.
GGUF is a file format designed for quantized models and is supported by tools like llama.cpp.
A quantized GGUF file is much smaller than the original weights and runs efficiently, usually with only minor quality trade-offs.
Quantized GGUF models make fast, local inference practical on regular hardware and reduce dependency on expensive GPUs.
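
The effect of precision on size can be estimated with simple arithmetic: bytes = parameters × bits per weight / 8. The sketch below applies that formula to a 7-billion-parameter model. It covers the weights only; real GGUF files are somewhat larger because they mix precisions and carry metadata, and inference needs extra memory for the context cache. The function name is illustrative.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    size = weight_memory_gb(7e9, bits)
    print(f"7B model at {bits}-bit: about {size:.1f} GB")
```

Going from 16-bit to 4-bit cuts the weight footprint to a quarter, which is what moves a 7B model from workstation-class GPUs into laptop territory.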

AI models require both VRAM (GPU memory) and RAM (system memory), and these are separate resources with different limits on most computers. When running a large language model (like Llama 2 13B), the model and its data must fit into VRAM for fast GPU inference, while RAM is used for loading the model and managing system tasks. For example, a quantized Llama 2 13B model in GGUF format might need 10–14GB VRAM and 16–24GB RAM. If either memory type is insufficient, the model may not run or will run much slower, often falling back to CPU-only processing. Quantization and efficient formats like GGUF help reduce these requirements, making powerful models accessible on consumer hardware.
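
When a model does not fit entirely in VRAM, tools such as llama.cpp can place only some of the model's layers on the GPU and keep the rest in system RAM. The helper below is a rough, hypothetical estimate of how many layers fit. It assumes all layers are the same size and reserves a guessed amount of VRAM for the context cache and scratch buffers, so treat the result as a starting point to tune, not a guarantee.

```python
def layers_that_fit(model_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many transformer layers fit in GPU memory.

    Assumes every layer is the same size and that reserve_gb is kept
    free for the context cache and scratch buffers (a guessed value).
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))


# Example: an 8 GB quantized model with 40 layers on two different GPUs.
print(layers_that_fit(8.0, 40, vram_gb=12.0))  # whole model fits
print(layers_that_fit(8.0, 40, vram_gb=6.0))   # partial offload
```

The remaining layers run on the CPU, which is slower but still lets the model load.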

Bottleneck for Cloud: Training/Fine-tuning Not User-Available

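Major cloud AI services like OpenAI GPT, Google Gemini (formerly Bard), and Amazon Bedrock do not give users access to the model weights, so users cannot train or modify the core models directly; only the providers can update them.

Users can customize behavior using prompt engineering, API parameters, or, in some cases, provider-managed fine-tuning and other “customization” tools, but the resulting model stays on the provider's platform, so this is not the same as training a model you control.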

Solution: Open-source Models Through Local Tools

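Full training and fine-tuning are possible for open-weight models (like Llama, Mistral, Mixtral) using frameworks such as Hugging Face Transformers and PyTorch. Local tools such as llama.cpp and Ollama focus on inference: they run the resulting model, typically after conversion to GGUF.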
For advanced, tailored AI solutions, organizations should use open models or cloud platforms that explicitly support bring-your-own-model (BYOM) and user-level training.

Cloud AI APIs are best for production, scale, and up-to-date capabilities, but come with restrictions on customization and data privacy.

More Restrictions - Licensing

Meta (restr.): Commercial use allowed with limits (e.g., not for >700M monthly users or training competing LLMs); some sector and user restrictions.
DeepSeek: Allows commercial use, but the model license adds use-based restrictions (e.g., no unlawful or harmful applications).
Apache-2.0: Very permissive; requires attribution and license inclusion.
MIT: Highly permissive; requires attribution.
Proprietary: API/service only; no download, modification, or self-host.

Try Ollama Platform

Ollama is a platform and toolset that simplifies running large language models locally on your own hardware. It provides an easy interface for downloading, managing, and interacting with open-source AI models, focusing on privacy, speed, and user control. Ollama is popular for enabling fast, offline AI model inference without relying on cloud services.

Download

Download and install from https://ollama.com/download (Windows is now supported as well).

Python Chatbot prompt script to interface with Ollama

View Ollama Python client sample

Key sections:

OLLAMA_MODEL = "llama4"  # Other models also work, e.g. "mistral"
OLLAMA_URL = "http://localhost:11434"  # Address of the local Ollama server
def ensure_model(model):
    ...  # Body omitted in this excerpt
def generate_response(prompt):
    ...  # Body omitted in this excerpt
if __name__ == "__main__":
    print("Ollama Llama4 CLI. Type your prompt and press Enter. Type 'exit' to quit.\n")
    # Only start the prompt loop once the model is available
    if ensure_model(OLLAMA_MODEL):
        while True:
            prompt = input("Prompt: ")
            if prompt.strip().lower() == "exit":
                break
            generate_response(prompt)
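
The excerpt above omits the function bodies. As a rough idea of what talking to Ollama involves, the sketch below sends one prompt to the server's `/api/generate` REST endpoint using only the standard library. It is not the author's implementation: the function names `build_generate_request` and `ask_ollama` are hypothetical, and the code assumes a local Ollama server with the model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local Ollama address


def build_generate_request(model, prompt):
    """Build a POST request for Ollama's /api/generate endpoint."""
    # stream=False asks for one JSON object instead of a chunked stream.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(model, prompt, timeout=300):
    """Send one prompt and return the generated text.

    Requires a running Ollama server with the model already pulled.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, `print(ask_ollama("mistral", "Say hello in one sentence."))` would print the model's reply.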

Run the script to test it:

python ollama-llama4.py

Ollama Llama4 CLI. Type your prompt and press Enter. Type 'exit' to quit.

Pulling model 'llama4' from Ollama Hub...
pulling 9d507a36062c

Model 'llama4' pulled successfully.
Enter your prompt: please generate an ansible snippet to automate to install an nginx podman container and expose port 443 to the service
Generating response...

Generated Output:
**Installing Nginx with Podman using Ansible**
====================================================

Below is an example Ansible playbook snippet that installs an Nginx container using Podman and exposes port 443.

```yml
---
- name: Install and run Nginx container with Podman
  hosts: [localhost]
  become: yes

  tasks:
  - name: Install Podman
    dnf:
      name: podman
      state: present

  - name: Pull Nginx image
    podman_image:
      name: nginx
      state: present

  - name: Run Nginx container
    podman_container:
      name: nginx-container
      image: nginx
      published_ports:
        - "443:443"
      restart_policy: always
      state: started
```

Example Usage:
Save the playbook to a file (e.g., nginx-podman.yml) and run it using the command:

ansible-playbook nginx-podman.yml

Try llama_cpp Platform

llama.cpp: Fast, open-source C/C++ library for running Llama and similar LLMs locally on CPUs or lightweight hardware, supporting quantized models and cross-platform use.
llama-cpp-python: Python bindings for llama.cpp, making it easy to use llama.cpp’s efficient local inference from Python code and integrate with Python ML workflows.

With llama_cpp, you have to download the models yourself.

# Examples of quantized GGUF models:
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
https://huggingface.co/mradermacher/Llama-4-Scout-17B-6E-Instruct-i1-GGUF/resolve/main/Llama-4-Scout-17B-6E-Instruct.i1-Q4_K_S.gguf

If you are using Llama 3 or earlier models, a plain pip install is enough:

pip install llama-cpp-python

If you use Llama 4, you need to patch and recompile the package.

Additionally, if you are on Windows 11:

# Ensure Visual Studio Build Tools installed with the below options:
# - MSVC v143 - VS 2022 C++ x64/x86 build tools
# - C++ CMake tools for Windows
# - Windows 10 SDK (not 11, be careful)

Then force a clean reinstall so that pip fetches or builds the latest version of the package:

pip install --no-cache-dir --force-reinstall llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121/

After installation, pip show should report the new version:

pip show llama-cpp-python
Name: llama_cpp_python
Version: 0.3.9
Summary: Python bindings for the llama.cpp library
Home-page: https://github.com/abetlen/llama-cpp-python

Python Chatbot prompt script with llama_cpp

View llama-cpp-python sample

Key Sections

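import time

from llama_cpp import Llama

# Load the quantized GGUF model
llm = Llama(
    model_path="./meta-llama-3-8b.Q4_K_M.gguf",  # Update to your downloaded file
    n_ctx=4096,  # Context window size in tokens
    n_gpu_layers=-1,  # Full GPU offload
    verbose=False,
)
...
while True:
    prompt = input("\nEnter your prompt (or type 'exit' to quit):\n> ")
    if prompt.strip().lower() == "exit":
        break
    print("Calculating response...", end="", flush=True)
    start = time.time()
    output = llm(
        prompt,
        max_tokens=256,
        temperature=0.7,
        top_p=0.95,
    )
    # Report the elapsed time and show the generated text
    print(f"\rResponse ready in {time.time() - start:.1f} seconds.")
    print(output["choices"][0]["text"])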
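
When called without streaming, llama-cpp-python returns an OpenAI-style completion dictionary. The sketch below shows one way to pull out the generated text and a tokens-per-second figure from it. The helper name `summarize_completion` is hypothetical, and the sample dictionary is hand-written for illustration, not real model output.

```python
def summarize_completion(output, elapsed_seconds):
    """Return the generated text and a tokens-per-second figure.

    Expects the OpenAI-style completion dictionary that llama-cpp-python
    returns when the model is called without streaming.
    """
    text = output["choices"][0]["text"].strip()
    n_tokens = output.get("usage", {}).get("completion_tokens", 0)
    speed = n_tokens / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return text, speed


# Hand-written stand-in for a real response, for illustration only.
fake_output = {
    "choices": [{"text": " Hello there!", "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 4, "total_tokens": 9},
}
text, speed = summarize_completion(fake_output, elapsed_seconds=2.0)
print(f"{text} ({speed:.1f} tokens/s)")
```

Tokens per second is a more useful number than total wall time when comparing quantization levels or GPU offload settings.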

Run the script to test it:

python llama_cpp-llama4-17b-gguf.py

Enter your prompt (or type 'exit' to quit):
> please generate an ansible snippet to automate to install an nginx podman container and expose port 443 to the service
Response ready in 59.9 seconds.

## To answer your question, here's an Ansible snippet to automate the installation of an Nginx Podman container and expose port 443 to the service:

---
- name: Install and configure Nginx Podman container
...

Try Hugging Face

Hugging Face is known for its open-source Transformers library, which provides easy access to advanced machine learning models for natural language processing. It also hosts the Hugging Face Hub, a platform for sharing models and datasets, supporting collaborative and accessible AI development.

Preparations

CUDA: If you have an NVIDIA card, install the CUDA Toolkit.
Hugging Face account: Register at huggingface.co and get a token.
Models approval: Request access for some models (e.g., Meta’s Llama models may require approval).

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install transformers accelerate huggingface-hub
huggingface-cli login

Python Chatbot prompt script with Hugging Face / PyTorch

View PyTorch Hugging Face sample

Key Sections

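import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# hf_token and prompt are set elsewhere in the full script
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, token=hf_token).to(device)

# Tokenize the prompt and sample up to 150 new tokens
inputs = tokenizer(f"Question: {prompt}\nAnswer:", return_tensors="pt").to(device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    top_p=0.8,
    top_k=10,
    temperature=0.7,
)
# Convert the generated token IDs back to text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))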
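
The `temperature`, `top_k`, and `top_p` arguments control how the next token is sampled. The pure-Python sketch below illustrates the idea on a tiny list of made-up logits: temperature rescales the scores, top-k keeps the k most likely tokens, and top-p keeps the smallest group of those whose combined probability reaches p. It is a simplified illustration, not the Transformers implementation, which works on tensors and may differ in detail.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, top-p.

    temperature must be greater than zero. Illustrative only.
    """
    # Temperature: divide logits before softmax (lower = sharper).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    kept_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix of the survivors whose
    # renormalized probability mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i] / kept_mass
        if mass >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    final_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / final_mass for i in kept}


# Made-up logits for a four-token vocabulary.
candidates = filter_distribution(
    [2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=3, top_p=0.8
)
print(candidates)
```

The next token is then drawn at random from the surviving candidates, which is why lower values of any of the three settings make the output more predictable.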

Run the script to test it:

python pytorch-llama2-7b.py
Enter your Hugging Face token (leave blank to use default login):
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████| 2/2 [00:00<00:00, 40.64it/s]

Enter your prompt (or type 'exit' to quit):
> please generate an ansible snippet to automate to install an nginx podman container and expose port 443 to the service

Here is an example Ansible snippet that can be used to automate the installation of an Nginx Podman container and expose port 443 to the service:

---
- name: Install Nginx and expose port 443
  podman:
    image: nginx:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /path/to/nginx.conf:/etc/nginx/conf.d/default.conf
    ports:
      - "443:443"
    command: ["nginx", "-g", "daemon off;"]