llama.cpp reranker - mostlygeek/llama-swap GitHub Wiki

Configuration for supporting the `v1/rerank` endpoint with llama-server and the BGE reranker v2 m3 model.

Config

```yaml
models:
  "reranker":
    env:
      - "CUDA_VISIBLE_DEVICES=GPU-eb1"
    cmd: |
      /path/to/llama-server/llama-server-latest
      --port ${PORT}
      -ngl 99
      -m /path/to/models/bge-reranker-v2-m3-Q4_K_M.gguf
      --ctx-size 8192
      --reranking
      --no-mmap
```

> [!TIP]
> Shortened placeholder paths (`/path/to/...`) are used in this example; replace them with the actual locations of your llama-server binary and model files.

Testing

```sh
$ curl -s http://10.0.1.50:8080/v1/rerank \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "reranker",
    "query": "What is the best way to learn Python?",
    "documents": [
      "Python is a popular programming language used for web development and data analysis.",
      "The best way to learn Python is through online courses and practice.",
      "Python is also used for artificial intelligence and machine learning applications.",
      "To learn Python, start with the basics and build small projects to gain experience."
    ],
    "max_reranked": 2
  }' | jq .
```

Output

```json
{
  "model": "reranker",
  "object": "list",
  "usage": {
    "prompt_tokens": 110,
    "total_tokens": 110
  },
  "results": [
    {
      "index": 0,
      "relevance_score": -2.9403347969055176
    },
    {
      "index": 1,
      "relevance_score": 7.181779861450195
    },
    {
      "index": 2,
      "relevance_score": -4.595512866973877
    },
    {
      "index": 3,
      "relevance_score": 3.0560922622680664
    }
  ]
}
```
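Note that the scores are raw model outputs, so higher means more relevant but the values are not normalized, and each result's `index` refers back to the position in the `documents` array you sent. A minimal client-side sketch (not part of llama-swap; the variable names here are illustrative) for sorting the response and picking the top documents:

```python
# Sort rerank results by relevance_score (descending) and map them
# back to the original documents. "response" below is the sample
# output from the curl request above.
response = {
    "results": [
        {"index": 0, "relevance_score": -2.9403347969055176},
        {"index": 1, "relevance_score": 7.181779861450195},
        {"index": 2, "relevance_score": -4.595512866973877},
        {"index": 3, "relevance_score": 3.0560922622680664},
    ]
}

documents = [
    "Python is a popular programming language used for web development and data analysis.",
    "The best way to learn Python is through online courses and practice.",
    "Python is also used for artificial intelligence and machine learning applications.",
    "To learn Python, start with the basics and build small projects to gain experience.",
]

# Highest score first; each result's "index" points into the documents list.
ranked = sorted(response["results"], key=lambda r: r["relevance_score"], reverse=True)
top_docs = [documents[r["index"]] for r in ranked[:2]]
print(top_docs[0])  # the "best way to learn Python" document scores highest
```

This is useful because the server returns results in document order, not score order, so ranking has to happen on the client.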