AI Platforms journey in 2025 LLM‐s GenAI - mazsola2k/genaiprompt GitHub Wiki

Background

Machine Learning (ML) is a broad field enabling computers to learn patterns and make predictions from data using various learning methods such as supervised learning (with labeled data), unsupervised learning (without labels, e.g., clustering customers), and reinforcement learning (learning by feedback/reward). Neural networks are one approach in ML, but not all ML models use them (e.g., linear regression and clustering do not).

AI

Generative AI (GenAI) is a subset of ML focused on creating new content—such as text, images, or code—rather than just predictions (e.g., generating images with Stable Diffusion or DALL-E). GenAI primarily relies on neural networks and often uses unsupervised or self-supervised learning for pretraining, sometimes combined with supervised or reinforcement learning for fine-tuning.

Large Language Models (LLMs) are a type of GenAI specialized in understanding and generating human-like text (e.g., chatbots like GPT-4 or Llama-3). LLMs are built using deep neural networks—typically transformers—trained first via self-supervised learning on massive collections of written texts, then often fine-tuned with supervised or reinforcement learning.

In summary: LLM is a subset of GenAI, which is a subset of Machine Learning, with neural networks and a blend of learning methods (supervised, unsupervised, self-supervised, reinforcement) providing the foundation for GenAI and LLMs.


Major Cloud / On-prem platforms / implementation, libraries:

Feature OpenAI GPT (API) Google Gemini (ex-Bard) Amazon Bedrock (AI) HuggingFace llama.cpp Ollama
Provider/Infra OpenAI (Azure, AWS) Google Cloud AWS Community/Cloud/Local Local (C/C++) Local (Go, wraps llama.cpp)
Language REST, Python, JS, etc. REST, Python, JS, etc. REST, Python, JS, etc. Python (PyTorch, TF) C/C++ Go
Model Format Proprietary Proprietary Proprietary (Titan, Claude, etc.) PyTorch, Safetensors, etc. GGUF/GGML GGUF
Training/Fine-tuning No (API only) No (API only) No (API only) Yes No No
Inference Cloud only Cloud only Cloud only GPU/CPU (flexible) CPU (main), some GPU CPU (main), some GPU
Quantization No No No Some support Full, custom (2–8 bit) Full, via llama.cpp
API/Integration REST API, SDKs REST API, SDKs REST API, SDKs Python API C++ API, CLI REST API, CLI
Typical Use SaaS, production apps SaaS, production apps SaaS, production apps R&D, prod, cloud/local Desktop, edge Local/private, easy
Model Download/Sharing Not allowed Not allowed Not allowed HuggingFace Hub Manual/3rd party repos Built-in (like Docker)
Multi-modal Support Yes (GPT-4o, etc.) Yes Some Yes (audio, vision, etc.) No (text only) No (text only)

Popular Models

Model Name Parameters (B) VRAM (GB) RAM (GB) Summary of Size Differences On-Prem License
TinyLlama 1.1B 1.1B 1 2 Extremely small, ultra-lightweight, for edge devices and minimal hardware Yes Apache-2.0
Gemma 2B 2B 2 5 Ultra-lightweight, for minimal hardware, simple queries only Yes Apache-2.0
Phi-3 Mini 3.8B 4 7 Very small, lightweight, good for simple queries Yes MIT
Mistral 1B 1B 1 2 Smallest Mistral, fast, for very lightweight tasks and simple queries Yes Apache-2.0
Mistral 3B 3B 3 6 Small, fast, more capable than 1B, still very resource-light Yes Apache-2.0
Mistral 7B 7B 6 10 Small, fast, efficient, good for quick/simple tasks Yes Apache-2.0
Llama 2 7B 7B 6 10 Smallest Llama 2, fast, needs little RAM/VRAM, less accurate for complex tasks Yes Meta (restr.)
Llama 3 8B 8B 6 10 Small, efficient, good for basic tasks and fast responses Yes Meta (restr.)
Llama 2 13B 13B 12 20 Medium, better language skills and reasoning, higher resource use Yes Meta (restr.)
Llama 4 Scout 17B 17B 14 24 Mid-large, strong performance, modern architecture, balances skill and efficiency Yes Meta (restr.)
Mixtral 8x7B 46B (Mixture) 24 50 Uses multiple 7B experts, high performance, needs more RAM/VRAM Yes Apache-2.0
DeepSeek 7B 7B 6 10 Small, efficient, strong at code and reasoning for its size Yes DeepSeek
DeepSeek 67B 67B 36 88 Large, excels at coding and complex tasks, high-end hardware required Yes DeepSeek
Llama 2 70B 70B 40 100 Large, much better at nuanced tasks, requires high-end hardware Yes Meta (restr.)
Llama 3 70B 70B 40 100 Large, state-of-the-art, more context and accuracy, high resource demands Yes Meta (restr.)
Llama 3.1 405B 405B 200 600 Extremely large, SOTA reasoning, needs cluster-scale hardware Yes Meta (restr.)
GPT-3.5 ~13B* N/A N/A Very capable, good for most tasks, only available via API/service No Proprietary
GPT-4 ~70B* N/A N/A Highly capable, advanced reasoning, only available via API/service No Proprietary
GPT-4.1 ~70B* N/A N/A Most advanced GPT (as of 2024), improved reasoning, only via API/service No Proprietary

*Parameter counts for GPT-3.5 and GPT-4 are not official; they are estimated for comparison purposes.

Notes:

  • "Yes" under On-Prem means you can run the model on your own hardware with enough resources (usually with GGUF, PyTorch, or similar formats).

Parameters

“Billion parameters” refers to the number of adjustable values in an AI model’s neural network; more parameters generally mean a larger and more capable (but also more resource-intensive) model. For example, Mistral 7B has 7 billion parameters, while Mixtral 46B has 46 billion. The larger Mixtral 46B model can handle more complex language understanding, generate more detailed responses, and maintain context better than Mistral 7B, but it also needs more memory and computing power.

Real-life example:
If you ask both models for detailed instructions on how to build a complex software system, Mixtral 46B might provide a thorough, step-by-step plan, while Mistral 7B could miss some important details or oversimplify. However, even the bigger Mixtral 46B might not know about a brand-new programming framework released yesterday, since its training data only goes up to a certain date and it can’t access the latest information.


Bottleneck for On-Prem: Video RAM & RAM

VRAM bottleneck happens when your GPU doesn’t have enough memory to load and run large AI models efficiently. This limits the size and speed of models you can use, especially for deep learning and large language models.

Solution: Model Quantization

  • Quantization shrinks AI models by reducing numerical precision (e.g., from 16-bit to 4-bit), drastically lowering memory use.
  • This allows large models to run on standard laptops, desktops, and even some mobile devices—no need for powerful GPUs.
  • GGUF is a modern, efficient file format tailored for quantized models and is supported by tools like llama.cpp.
    The final GGUF file is much smaller and runs efficiently, with only minor quality trade-offs.
  • Quantized GGUF models unlock advanced AI capabilities for everyone, enabling fast, local inference on regular hardware and reducing dependency on expensive GPUs.

AI models require both VRAM (GPU memory) and RAM (system memory), and these are separate resources with different limits on most computers. When running a large language model (like Llama 2 13B), the model and its data must fit into VRAM for fast GPU inference, while RAM is used for loading the model and managing system tasks. For example, a quantized Llama 2 13B model in GGUF format might need 10–14GB VRAM and 16–24GB RAM. If either memory type is insufficient, the model may not run or will run much slower, often falling back to CPU-only processing. Quantization and efficient formats like GGUF help reduce these requirements, making powerful models accessible on consumer hardware.


Bottleneck for Cloud: Training/Fine-tuning Not User-Available

Major cloud AI models like OpenAI GPT, Google Gemini (Bard), and Amazon Bedrock do not allow direct user training or fine-tuning; only their providers can update the core models.

Users can customize behaviors using prompt engineering, API parameters, or—in some cases—limited “customization” tools, but this is not true training.

Solution: Open-source Models Through Local Tools

True training and fine-tuning are supported for open-source models (like Llama, Mistral, Mixtral) via platforms such as HuggingFace or local tools (llama.cpp, Ollama).
For advanced, tailored AI solutions, organizations should use open models or cloud platforms that explicitly support bring-your-own-model (BYOM) and user-level training.

Cloud AI APIs are best for production, scale, and up-to-date capabilities, but come with restrictions on customization and data privacy.


More Restrictions - Licensing

  • Meta (restr.): Commercial use allowed with limits (e.g., not for >700M monthly users or training competing LLMs); some sector and user restrictions.
  • DeepSeek: Permissive, but blocks use by certain big companies and for training competitors.
  • Apache-2.0: Very permissive; requires attribution and license inclusion.
  • MIT: Highly permissive; requires attribution.
  • Proprietary: API/service only; no download, modification, or self-host.

Try Ollama Platform

Ollama is a platform and toolset that simplifies running large language models locally on your own hardware. It provides an easy interface for downloading, managing, and interacting with open-source AI models, focusing on privacy, speed, and user control. Ollama is popular for enabling fast, offline AI model inference without relying on cloud services.

Download

Download and Install from https://ollama.com/download - available now also on Windows!

Python Chatbot prompt script to interface with Ollama

View Ollama Python client sample

Key sections:

OLLAMA_MODEL = "llama4" # alternative you can try others for ex: "mistral"
OLLAMA_URL = "http://localhost:11434"
def ensure_model(model):
...
def generate_response(prompt):
...
if __name__ == "__main__":
    print("Ollama Mistral CLI. Type your prompt and press Enter. Type 'exit' to quit.\n")
    if ensure_model(OLLAMA_MODEL):
        while True:
            prompt = input("Prompt: ")
            if prompt.strip().lower() == "exit":
                break
            generate_response(prompt)

Good to go for testing

python ollama-llama4.py

Ollama Llama4 CLI. Type your prompt and press Enter. Type 'exit' to quit.

Pulling model 'llama4' from Ollama Hub...
pulling 9d507a36062c

Model 'llama4' pulled successfully.
Enter your prompt: please generate an ansible snippet to automate to install an nginx podman container and expose port 443 to the service
Generating response...

Generated Output:
**Installing Nginx with Podman using Ansible**
====================================================

Below is an example Ansible playbook snippet that installs an Nginx container using Podman and exposes port 443.

```yml
---
- name: Install and run Nginx container with Podman
  hosts: [localhost]
  become: yes

  tasks:
  - name: Install Podman
    dnf:
      name: podman
      state: present

  - name: Pull Nginx image
    podman_image:
      name: nginx
      state: present

  - name: Run Nginx container
    podman_container:
      name: nginx-container
      image: nginx
      published_ports:
        - "443:443"
      restart_policy: always
      state: started

Example Usage:
Save the playbook to a file (e.g., nginx-podman.yml) and run it using the command:

ansible-playbook nginx-podman.yml

Try llama_cpp Platform

  • llama.cpp: Fast, open-source C/C++ library for running Llama and similar LLMs locally on CPUs or lightweight hardware, supporting quantized models and cross-platform use.
  • lama-cpp-python: Python bindings for llama.cpp, making it easy to use llama.cpp’s efficient local inference from Python code and integrate with Python ML workflows.

For llama_cpp you have to download the models yourself

# Examples for Quantized GGUF models:
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
https://huggingface.co/mradermacher/Llama-4-Scout-17B-6E-Instruct-i1-GGUF/resolve/main/Llama-4-Scout-17B-6E-Instruct.i1-Q4_K_S.gguf

If you are using llama3 or earlier models - pip install

pip install llama-cpp-python

If you use llama4 you need to patch and recompile

Additionally if you are on Windows 11:

# Ensure Visual Studio Build Tools installed with the below options:
# - MSVC v143 - VS 2022 C++ x64/x86 build tools
# - C++ CMake tools for Windows
# - Windows 10 SDK (not 11, be careful)

Start the patched version pip package compilation and install:

pip install --no-cache-dir --force-reinstall llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121/

You should see after the latest version compiled and installed:

pip show llama-cpp-python
Name: llama_cpp_python
Version: 0.3.9
Summary: Python bindings for the llama.cpp library
Home-page: https://github.com/abetlen/llama-cpp-python

Python Chatbot prompt script with llama_cpp

View llama-cpp-python sample

Key Sections
llm = Llama(
    model_path="./meta-llama-3-8b.Q4_K_M.gguf",  # Update to your downloaded file
    n_ctx=4096,
    n_gpu_layers=-1,  # Full GPU offload
    verbose=False,
)
...
while True:
    prompt = input("\nEnter your prompt (or type 'exit' to quit):\n> ")
    if prompt.strip().lower() == "exit":
        break
    print("Calculating response...", end="", flush=True)
    start = time.time()
    output = llm(
        prompt,
        max_tokens=256,
        temperature=0.7,
        top_p=0.95,
    )

You are good to go and test:

python llama_cpp-llama4-17b-gguf.py

Enter your prompt (or type 'exit' to quit):
> please generate an ansible snippet to automate to install an nginx podman container and expose port 443 to the service
Response ready in 59.9 seconds.

## To answer your question, here's an Ansible snippet to automate the installation of an Nginx Podman container and expose port 443 to the service:

---
- name: Install and configure Nginx Podman container
...

Try Hugging Face

Hugging Face is known for its open-source Transformers library, which provides easy access to advanced machine learning models for natural language processing. It also hosts the Hugging Face Hub, a platform for sharing models and datasets, supporting collaborative and accessible AI development.

Preparations

  • CUDA: If you have an NVIDIA card, install CUDA Toolkit
  • Hugging Face account: Register at huggingface.co and get a token.
  • Models approval: Request access for some models (e.g., Meta’s Llama models may require approval).
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install accelerate huggingface-hub
huggingface-cli login

Python Chatbot prompt script with Hugging Face / Pytorch

View Pytorch Hugging Face sample

Key Sections
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, token=hf_token).to(device)

inputs = tokenizer(f"Question: {prompt}\nAnswer:", return_tensors="pt").to(device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    top_p=0.8,
    top_k=10,
    temperature=0.7,
)

Good to go for testing:

python pytorch-llama2-7b.py
Enter your Hugging Face token (leave blank to use default login):
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████| 2/2 [00:00<00:00, 40.64it/s]

Enter your prompt (or type 'exit' to quit):
> please generate an ansible snippet to automate to install an nginx podman container and expose port 443 to the service

Here is an example Ansible snippet that can be used to automate the installation of an Nginx Podman container and expose port 443 to the service:

---
- name: Install Nginx and expose port 443
  podman:
    image: nginx:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /path/to/nginx.conf:/etc/nginx/conf.d/default.conf
    ports:
      - "443:443"
    command: ["nginx", "-g", "daemon off;"]