C++ interfaces to LLMs

| Category | Representative libraries / approaches | Main Characteristics |
| --- | --- | --- |
| Thin REST wrappers | libcpr (C++ Requests) • cpp-httplib + nlohmann/json • Boost.Beast (+ your JSON lib) • community OpenAI wrappers (liboai, openai-cpp) | Minimal abstractions: you build HTTP requests and JSON payloads yourself. Fast to integrate and easy to vendor (often header-only), but you own retries, streaming chunk handling, auth flows, and error policy. A minimal request sketch follows this table. ([GitHub][1], [Boost][2]) |
| Full SDKs / service clients (C++) | AWS SDK for C++ (Amazon Bedrock Runtime) • Google Cloud Vertex AI C++ client (AI Platform) • NVIDIA Triton Inference Server C++ client | Higher-level clients with built-in auth, retries, pagination/streaming (often gRPC), and telemetry hooks, at the cost of larger dependency trees. Note: OpenAI does not ship an official C++ SDK; use REST or community libs. A Bedrock sketch follows below. ([sdk.amazonaws.com][3], [Google Cloud][4], [NVIDIA Docs][5], [OpenAI Platform][6], [GitHub][7]) |
| Embedded / on-device | llama.cpp (C/C++ API; optional REST server) • TensorRT-LLM C++ runtime (NVIDIA GPUs) • ONNX Runtime C++ API • whisper.cpp (local ASR) • GGUF/GGML executors | Run models locally, with no network dependency. Great for edge/HPC; requires model conversion/quantization and careful memory/VRAM budgeting. Footprint ranges from tiny CPU-only builds to GPU-accelerated runtimes. A llama.cpp sketch follows below. ([GitHub][8], [NVIDIA Docs][9], [ONNX Runtime][10]) |
| Framework-style orchestration libs | llama.cpp server + C++/HTTP clients • instinct.cpp (agent/RAG in C++) • LLMKit++ (LangChain-inspired, Ollama-backed) | Opinionated chains/agents, tools, and memory abstractions. These ecosystems are smaller and younger in C++ than in Python; useful when you want built-ins beyond raw API calls. A local-server sketch follows below. ([GitHub][8]) |
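
To make the thin-wrapper trade-off concrete, here is a minimal sketch of one chat-completion request with cpp-httplib and nlohmann/json. The endpoint path and payload shape follow OpenAI's REST API; the `OPENAI_API_KEY` environment variable and the `gpt-4o-mini` model id are assumptions for illustration, and everything beyond the single POST (retries, backoff, streaming) is left to you, exactly as the table warns.

```cpp
// Minimal sketch: one chat-completion request via cpp-httplib + nlohmann/json.
// HTTPS requires building cpp-httplib with OpenSSL support.
#define CPPHTTPLIB_OPENSSL_SUPPORT
#include "httplib.h"
#include <nlohmann/json.hpp>
#include <cstdlib>
#include <iostream>

int main() {
    const char * key = std::getenv("OPENAI_API_KEY");      // assumed env var
    if (!key) { std::cerr << "OPENAI_API_KEY not set\n"; return 1; }

    httplib::Client cli("https://api.openai.com");
    httplib::Headers headers = {{"Authorization", std::string("Bearer ") + key}};

    nlohmann::json req;
    req["model"] = "gpt-4o-mini";                          // illustrative model id
    req["messages"] = nlohmann::json::array({
        {{"role", "user"}, {"content", "Say hello from C++."}}
    });

    // One bare POST: retries, backoff, and streaming chunk handling are all on you.
    auto res = cli.Post("/v1/chat/completions", headers, req.dump(), "application/json");
    if (res && res->status == 200) {
        auto j = nlohmann::json::parse(res->body);
        std::cout << j["choices"][0]["message"]["content"].get<std::string>() << "\n";
    } else {
        std::cerr << "request failed: "
                  << (res ? std::to_string(res->status) : httplib::to_string(res.error()))
                  << "\n";
    }
    return 0;
}
```

Because the wrapper does nothing for you, this short program also documents exactly what a production client must add on top: retry/backoff, SSE streaming, and timeout policy.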
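
For contrast, here is the same kind of call through a full SDK, sketched against the AWS SDK for C++ and Amazon Bedrock Runtime. The SDK handles credentials, SigV4 signing, region resolution, and retries; the model id and the Anthropic-style request body are illustrative assumptions, since Bedrock payload schemas are model-specific (check the Bedrock Runtime C++ reference for exact types and request shapes).

```cpp
// Sketch: invoking a Bedrock-hosted model via the AWS SDK for C++.
// The SDK supplies auth, retries, and endpoint resolution; the payload
// schema and model id below are illustrative and model-specific.
#include <aws/core/Aws.h>
#include <aws/bedrock-runtime/BedrockRuntimeClient.h>
#include <aws/bedrock-runtime/model/InvokeModelRequest.h>
#include <iostream>
#include <sstream>

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::BedrockRuntime::BedrockRuntimeClient client;   // region/creds from env or config

        Aws::BedrockRuntime::Model::InvokeModelRequest request;
        request.SetModelId("anthropic.claude-3-haiku-20240307-v1:0"); // illustrative id
        request.SetContentType("application/json");

        // Body schema follows Anthropic's Bedrock "messages" format (an assumption;
        // other Bedrock models expect different JSON).
        auto body = Aws::MakeShared<Aws::StringStream>("InvokeModelBody");
        *body << R"({"anthropic_version":"bedrock-2023-05-31",)"
                 R"("max_tokens":256,)"
                 R"("messages":[{"role":"user","content":"Say hello from C++."}]})";
        request.SetBody(body);

        auto outcome = client.InvokeModel(request);
        if (outcome.IsSuccess()) {
            std::stringstream ss;
            ss << outcome.GetResult().GetBody().rdbuf();    // raw JSON response
            std::cout << ss.str() << "\n";
        } else {
            std::cerr << outcome.GetError().GetMessage() << "\n";
        }
    }
    Aws::ShutdownAPI(options);
    return 0;
}
```

Note what disappeared relative to the thin-wrapper sketch: no API key plumbing, no manual retry policy, but a much larger dependency tree to build and link.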
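
On the embedded side, here is a sketch of local generation with the llama.cpp C API. This API changes frequently, so the names below (e.g. `llama_model_load_from_file`, the `llama_sampler_*` chain) approximate recent versions; treat the `llama.h` in your checkout as the source of truth, and note that `model.gguf` stands in for any converted/quantized model file.

```cpp
// Minimal sketch of local generation with the llama.cpp C API.
// NOTE: this API churns; check llama.h in your version before relying
// on any signature below.
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // any GGUF file
    const llama_vocab * vocab = llama_model_get_vocab(model);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048;                       // context window; budget RAM/VRAM here
    llama_context * ctx = llama_init_from_model(model, cparams);

    // Tokenize the prompt: size query first (negative return), then fill.
    const std::string prompt = "Explain RAII in one sentence.";
    int n = -llama_tokenize(vocab, prompt.c_str(), (int32_t)prompt.size(),
                            nullptr, 0, true, false);
    std::vector<llama_token> toks(n);
    llama_tokenize(vocab, prompt.c_str(), (int32_t)prompt.size(),
                   toks.data(), n, true, false);

    // Greedy sampler chain; swap in top-k/top-p samplers as needed.
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // Feed the prompt, then generate token by token.
    llama_batch batch = llama_batch_get_one(toks.data(), (int32_t)toks.size());
    for (int i = 0; i < 128; ++i) {
        if (llama_decode(ctx, batch) != 0) break;
        llama_token tok = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, tok)) break;          // end-of-generation token
        char buf[256];
        int len = llama_token_to_piece(vocab, tok, buf, sizeof(buf), 0, false);
        if (len > 0) fwrite(buf, 1, len, stdout);
        batch = llama_batch_get_one(&tok, 1);               // next step decodes the new token
    }

    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_shutdown();
    return 0;
}
```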
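
Finally, the orchestration row's simplest form: llama.cpp's bundled `llama-server` exposes an OpenAI-compatible HTTP endpoint, so the thin-wrapper approach from the first sketch works unchanged against a local model. The port and launch command below are assumptions based on the server's common defaults.

```cpp
// Sketch: the same OpenAI-style request against a local llama.cpp server
// (started with e.g. `llama-server -m model.gguf --port 8080`).
// No API key or TLS involved; everything stays on the local machine.
#include "httplib.h"
#include <nlohmann/json.hpp>
#include <iostream>

int main() {
    httplib::Client cli("http://localhost:8080");   // assumed llama-server port

    nlohmann::json req;
    req["messages"] = nlohmann::json::array({
        {{"role", "user"}, {"content", "Summarize RAII in one sentence."}}
    });

    // llama-server serves an OpenAI-compatible chat endpoint.
    auto res = cli.Post("/v1/chat/completions", req.dump(), "application/json");
    if (res && res->status == 200) {
        auto j = nlohmann::json::parse(res->body);
        std::cout << j["choices"][0]["message"]["content"].get<std::string>() << "\n";
    } else {
        std::cerr << "request failed\n";
    }
    return 0;
}
```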

Notes & sources (updated Aug 2025):

  • libcpr releases and repo; cpp-httplib releases; nlohmann/json; Boost.Beast docs. ([GitHub][11], [Boost][2])
  • Community OpenAI C++ libraries: liboai, openai-cpp. OpenAI’s official SDKs focus on Python/JS/.NET, not C++. ([GitHub][12], [OpenAI Platform][6])
  • Full SDKs / clients: AWS Bedrock Runtime (C++), Google Cloud Vertex AI C++ client, NVIDIA Triton C++ client libs. ([sdk.amazonaws.com][3], [Google Cloud][4], [NVIDIA Docs][5])
  • Embedded / on‑device: llama.cpp, GGUF format, TensorRT‑LLM C++ runtime, ONNX Runtime C++ API, whisper.cpp. ([GitHub][8], [NVIDIA Docs][9], [ONNX Runtime][10])