LLM and KGQA Evaluation Tools - statnett/Talk2PowerSystem GitHub Wiki
- Surveys
- Tools/Frameworks
- Comparisons
- Other Resources
"Retrieval-Augmented Generation for Large Language Models: A Survey", 2024, Gao et al. (Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, Haofen Wang), arXiv:2312.10997: https://arxiv.org/abs/2312.10997
The survey considers a number of evaluation aspects and the metrics applicable to each of them, and it lists 3 evaluation tools.
"A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks"
"A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al., arXiv: https://arxiv.org/abs/2406.08216
The survey takes a very pragmatic approach and looks for tools available on GitHub.
It's a brief paper and not very deep, but introduces a number of perspectives/aspects for potential evaluation ("Correctness" is just one, although the most often implemented)
https://github.com/confident-ai/deepeval: 7k stars, 631 forks, 154 contributors
- Confident AI Home Page
- Documentation
- README: Metrics and Features, Getting Started, Integrations
- Blog
- Colab notebook
- Discord
- Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
The LLM Evaluation Platform
An all-in-one LLM evaluation platform to easily evaluate, compare, and share test results to identify LLM regressions. Automatically Catch Regressions in LLM Systems.
- Similar to Pytest but specialized for unit testing of LLM outputs.
- Metrics (incorporates the latest research):
- G-Eval
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS (average of 4 of the above metrics)
- Hallucination
- Toxicity
- Bias
- Implemented benchmarks:
- MMLU
- HellaSwag
- DROP
- BIG-Bench Hard
- TruthfulQA
- HumanEval
- GSM8K
- Covers RAG, fine-tuning, LangChain, LlamaIndex
- Determine the optimal hyperparameters to improve your RAG pipeline
- Transition from OpenAI to hosting your own Llama with confidence
- Supply ground truths as benchmarks to evaluate your LLM outputs. Evaluate performance against expected outputs to pinpoint areas for iteration.
- Dataset generation: Automatically generate expected queries and responses for evaluation.
- Advanced diff tracking to iterate towards the optimal LLM stack
- LLM observability and monitoring to identify areas of focus
- Output classification: Discover recurring queries and responses to optimize for specific use cases.
- Run evaluations on the cloud through simple APIs. Judge your LLM application on one, centralized platform.
- Or evaluate LLMs that run locally on your machine
Also:
- Modular components that are extremely simple to plug in and use. You can easily mix and match different metrics, or even use DeepEval to build your own evaluation pipeline if needed.
- Treats evaluations as unit tests. With an integration for Pytest, DeepEval is a complete testing suite most developers are familiar with.
- Allows you to generate synthetic datasets using your knowledge base as context, or load datasets from CSVs, JSONs, or Hugging Face.
- Offers a hosted platform with a generous free tier to run real-time evaluations in production.
- Effectively writing python unit tests
- Custom metrics available on answers
- Fully integrated with Confident AI
- Metrics
- Answer, conversation and multimodal metrics
- Answer metrics overall
- Accept input (question), actual_answer (target) and LLM model
- Also have various other custom parameters that should be used
- G-Eval, Tool Correctness
- G-Eval is CoT-based detailed evaluation of the result, effectively extra chatbot-evaluating chatbot
- Tool Correctness is definitely a support metric for our agents/assistants
- Major issue: aims at working with GPT-4o; it supports calling custom models, but support for assistants/agents is clunkier. Still, something is offered, and probably more is coming in the future.
- Moderate issue: it seems everything is sent/calculated in the cloud on their end. Weird, since you can use local models and they do offer support for sensitive data, but it is definitely not just a local library.
- Concern: all metrics are based on asking a model whether another model's response has some property, e.g. is biased, toxic, factual, accurate, etc.
- Upside: Confirms what metrics and approaches we want to aim for and how to implement them.
- Open-source evaluation framework for LLMs
- Easily "unit test" LLM outputs in a similar way to Pytest.
- Plug-and-use 14+ LLM-evaluated metrics, most with research backing.
- Synthetic dataset generation with state-of-the-art evolution techniques.
- Metrics are simple to customize and cover all use cases.
- Real-time evaluations in production.
- Using local models
deepeval set-local-model --model-name=<model_name> \
    --base-url="http://localhost:11434/v1/" \
    --api-key="ollama"
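A minimal Pytest-style sketch of a DeepEval test, assuming a recent DeepEval release (the question, answer and retrieval context below are made up):

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    # Hypothetical question/answer pair with its retrieval context
    test_case = LLMTestCase(
        input="Which voltage levels are present in the grid model?",
        actual_output="The grid model contains 132 kV, 300 kV and 420 kV levels.",
        retrieval_context=["The model includes equipment at 132 kV, 300 kV and 420 kV."],
    )
    # Fails the test if the LLM-judged relevancy score falls below the threshold
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

Run it like a normal test suite with `deepeval test run test_answers.py`.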
- Metrics available
- Answer metrics
- Generally each has input, expected answer and evaluating model
- G-Eval
- Most versatile metric (a usage sketch appears after this metrics list)
- Steps:
- generate a series of evaluation_steps using a chain of thought (CoTs) based on the given criteria;
- use the generated steps to determine the final score using the parameters presented in an LLMTestCase.
- Summarization
- Use LLM to evaluate coverage and accuracy of summary
- Summarization = min(Alignment Score, Coverage Score)
- GPT-4o by default but can use custom models
- Details on building the metric: A Step-By-Step Guide to Evaluating an LLM Text Summarization Task - Confident AI
- Faithfulness
- Requires retrieval_context in addition to other parameters
- Faithfulness = Truthful Claims / Total Claims
- Answer Relevancy
- Relevancy = Relevant Statements / Total Statements
- Contextual Relevancy
- Might be very relevant to us as an approach when retrieving results so long as we can get accurate target sets
- Based on retrieval context
- Contextual Precision
- Might be very relevant to us as an approach when retrieving results so long as we can get accurate target sets
- Based on retrieval context
- Strong emphasis on the top result
- Rewards relevant ordering in result selection
- Contextual Recall
- Might be very relevant to us as an approach when retrieving results if we can get accurate target sets
- Based on retrieval context
- Tool correctness
- Define input, actual_output, tools_called and expected_tools
- We definitely want this metric, but making use of it requires us to do essentially all the heavy lifting (see the sketch after this metrics list)
- Tool Correctness = Correctly Used Tools / Total Tools Called
- Ragas
- Composite score for a RAG pipeline's generator and retriever, combining 4 of the metrics above
- Hallucination
- Looks for contradictions between produced and expected result
- Toxicity
- Bias
- An LLM judge decides whether the model's response is biased
- Custom metrics
- Easy to define, run on their end
- The only way to use a non-LLM metric
- Conversation metrics
- Expected data is always a list of turns, each with an input and an actual_output
- Role adherence
- Extra chatbot_role needs to be specified
- Adherence = Adhered Turns / Total Turns
- Knowledge Retention
- Retention = No Knowledge Attrition Turns / Total Turns
- Conversation Completeness
- Completeness = Satisfied User Intentions / User Intentions
- Conversation Relevancy
- Relevancy = Turns with Relevant Output / Total Turns
- Multimodal metrics
- Text to Image
- Image Editing
- Answer metrics
- Using custom LLMs
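The G-Eval and Tool Correctness metrics above could be wired up roughly as follows (a sketch assuming a recent DeepEval version; in older releases tools_called/expected_tools were plain strings rather than ToolCall objects):

```python
from deepeval.metrics import GEval, ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams, ToolCall

# G-Eval: an LLM judge scores the answer against free-form criteria via CoT steps
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output answers the input question factually.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Tool Correctness: compares the tools the agent actually called with the expected ones
tool_correctness = ToolCorrectnessMetric()

test_case = LLMTestCase(
    input="List all 300 kV substations in the dataset.",  # hypothetical example
    actual_output="There are 12 substations operating at 300 kV: ...",
    tools_called=[ToolCall(name="sparql_query")],
    expected_tools=[ToolCall(name="sparql_query")],
)

correctness.measure(test_case)
tool_correctness.measure(test_case)
print(correctness.score, correctness.reason)
print(tool_correctness.score)
```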
https://github.com/mlflow/mlflow 20.7k stars, 856 contributors, 4.6k forks, 62.1k declared users
MLflow is a platform for the machine learning lifecycle (traditional ML and LLM).
Characteristics:
- Open Source: Integrate with any ML library and platform
- Comprehensive: Manage end-to-end ML and GenAI workflows, from development to production
- Unified: Unified platform for both traditional ML and GenAI applications
Features:
- Improve generative AI quality
- Enhance LLM observability with tracing
- Build applications with prompt engineering
- Track progress during fine-tuning
- Package and deploy models
- Securely host LLMs at scale with MLflow Deployments
- LLM Evaluate: a modular, simple package for running evaluations in your own evaluation pipelines. It offers RAG evaluation and QA evaluation with an intuitive developer experience (a usage sketch follows this list).
- Use MLflow Evaluate with the Prompt Engineering UI
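A minimal sketch of MLflow's QA evaluation, assuming MLflow 2.x (`mlflow.evaluate` with `model_type="question-answering"`); the data and the `qa_model` callable are made up:

```python
import mlflow
import pandas as pd


def qa_model(df: pd.DataFrame) -> list:
    # Placeholder for a call into your own LLM / KGQA pipeline
    return ["Oslo"] * len(df)


# Hypothetical evaluation data; "inputs"/"ground_truth" follow the QA evaluator's conventions
eval_df = pd.DataFrame(
    {
        "inputs": ["What is the capital of Norway?"],
        "ground_truth": ["Oslo"],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model=qa_model,
        data=eval_df,
        targets="ground_truth",
        model_type="question-answering",  # enables built-in QA metrics such as exact_match
    )
    print(results.metrics)
```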
It is available as a managed service in Databricks. MLflow is tightly integrated into Databricks.
- A tracking system that lets the user track model parameters and model metrics
- Streamlines the model development and deployment processes
- Ensures reproducibility, scalability, and traceability
- Smooth quick start
- The tracking server can be deployed locally
- Nice interface that allows for easy comparison of results and parameters
- Models can be loaded using the mlflow.pyfunc module
- Autologging is available
- Supported libraries
Tracking is central to the MLflow ecosystem, facilitating the systematic organization of experiments and runs:
- Experiments and Runs: Each experiment encapsulates a specific aspect of your research, and each experiment can house multiple runs. Runs document critical data like metrics, parameters, and the code state.
- Artifacts: Store crucial output from runs, be it models, visualizations, datasets, or other metadata. This repository of artifacts ensures traceability and easy access.
- Metrics and Parameters: By allowing users to log parameters and metrics, MLflow makes it straightforward to compare different runs, facilitating model optimization.
- Dependencies and Environment: The platform automatically captures the computational environment, ensuring that experiments are reproducible across different setups.
- Input Examples and Model Signatures: These features allow developers to define the expected format of the model’s inputs, making validation and debugging more straightforward.
- UI Integration: The integrated UI provides a visual overview of all runs, enabling easy comparison and deeper insights.
- Search Functionality: Efficiently sift through your experiments using MLflow’s robust search functionality.
- APIs: Comprehensive APIs are available, allowing users to interact with the tracking system programmatically, integrating it into existing workflows.
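A minimal sketch of the tracking workflow described above (experiment name, parameters, metric values and the artifact file are illustrative):

```python
import mlflow

mlflow.set_experiment("kgqa-eval")  # illustrative experiment name

with mlflow.start_run(run_name="baseline"):
    # Parameters: e.g. the configuration of a RAG/KGQA pipeline
    mlflow.log_param("llm", "gpt-4o")
    mlflow.log_param("chunk_size", 512)
    # Metrics: e.g. scores produced by an external evaluator
    mlflow.log_metric("answer_relevancy", 0.87)
    mlflow.log_metric("faithfulness", 0.91)
    # Artifacts: any local file, e.g. a full evaluation report (must exist on disk)
    mlflow.log_artifact("eval_report.json")
```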
Ensuring model quality is paramount:
- Auto-generated Metrics: MLflow automatically evaluates models, providing key metrics for regression (like RMSE, MAE) and classification (such as F1-score, AUC-ROC).
- Visualization: Understand your model better with automatically generated plots. For instance, MLflow can produce confusion matrices, precision-recall curves, and more for classification tasks.
- Extensibility: While MLflow provides a rich set of evaluation tools out of the box, it’s also designed to accommodate custom metrics and visualizations.
This feature acts as a catalog for models:
- Versioning: As models evolve, keeping track of versions becomes crucial. The Model Registry handles versioning, ensuring that users can revert to older versions or compare different iterations.
- Annotations: Models in the registry can be annotated with descriptions, use-cases, or other relevant metadata.
- Lifecycle Stages: Track the stage of each model version, be it ‘staging’, ‘production’, or ‘archived’. This ensures clarity in deployment and maintenance processes.
MLflow simplifies the transition from development to production:
- Consistency: By meticulously recording dependencies and the computational environment, MLflow ensures that models behave consistently across different deployment setups.
- Docker Support: Facilitate deployment in containerized environments using Docker, encapsulating all dependencies and ensuring a uniform runtime environment.
- Scalability: MLflow is designed to accommodate both small-scale deployments and large, distributed setups, ensuring that it scales with your needs.
Notebook on Databricks
https://docs.databricks.com/en/mlflow/quick-start-python.html
https://github.com/open-compass/opencompass 5.4k stars, 585 forks, 151 contributors
Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
A Universal Evaluation Platform for Foundation Models.
An LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) on 100+ datasets.
https://github.com/truera/trulens/ 2.5k stars, 218 forks, 62 contributors
Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
This is a general LLM evaluation tool that focuses on answer quality, and also some other aspects such as safety. It's been called "a framework for explaining deep network behavior".
It supports (instruments and evaluates) LangChain, LlamaIndex and NVIDIA Nemo Guardrails.
Links:
- Home: https://www.trulens.org/
- Documentation: getting started, detailed (tracking)
- Colab Notebook
- Slack
"RAG Triad":
https://www.trulens.org/trulens/getting_started/core_concepts/rag_triad/
- Context Relevance: is the retrieved context relevant to the question?
- Groundedness: is the answer based on the provided context?
- Answer Relevance: is the answer relevant to the question?
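Conceptually, each leg of the RAG triad is an LLM-as-judge question over (query, context, answer). The sketch below is not the TruLens API; `llm_judge` is a hypothetical helper wrapping whatever judge model you use:

```python
def llm_judge(prompt: str) -> float:
    """Hypothetical helper: ask a judge LLM to return a score in [0, 1]."""
    raise NotImplementedError


def rag_triad(query: str, context: str, answer: str) -> dict:
    return {
        # Context Relevance: is the retrieved context relevant to the question?
        "context_relevance": llm_judge(
            f"How relevant is this context to the question?\nQ: {query}\nContext: {context}"
        ),
        # Groundedness: is the answer supported by the provided context?
        "groundedness": llm_judge(
            f"Is this answer supported by the context?\nContext: {context}\nAnswer: {answer}"
        ),
        # Answer Relevance: does the answer address the question?
        "answer_relevance": llm_judge(
            f"Does this answer address the question?\nQ: {query}\nAnswer: {answer}"
        ),
    }
```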
Mentions of TruLens in Google Scholar
- Exploring Conceptual Soundness with TruLens (NeurIPS 2021): "interactive application built on TruLens that we use to explore the conceptual soundness of various pre-trained models"
- MedInsight: A Multi-Source Context Augmentation Framework for Generating Patient-Centric Medical Responses using Large Language Models (arXiv 2403): quantitative evaluation using the Ragas metric and TruLens for answer similarity and answer correctness
- Safeguarding Large Language Models: A Survey: focuses on safety
- Development and Evaluation of a Retrieval-Augmented Generation Tool for Creating SAPPhIRE Models of Artificial Systems: "we use the state-of-the-art automated RAG evaluation tool called the 'RAG Triad' provided by TruLens. This approach emphasizes three quality scores of the RAG system"
- Retrieval-Augmented Generation for Large Language Models: A Survey: introduces the "RAG triad" concept
- Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models (Ital-IA 2024): estimates Answer Relevance with TruLens and RAGAS, and Answer Correctness with RAGAS; explains the evaluation approach (Spearman correlations, etc.)
- Towards Increased Truthfulness in LLM Applications: Application-Oriented Methods from Current Research (Towards Data Science blog)
- RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems (arXiv 2407)
- RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation (arXiv 2408)
- Enterprise LLMOps: Advancing Large Language Models Operations Practice (2024 IEEE Cloud Summit)
- CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models (arXiv 2401)
- LLMs in Production: book chapter in Large Language Models: A Deep Dive
- Causal Reasoning in Large Language Models using Causal Graph Retrieval Augmented Generation
https://github.com/explodinggradients/ragas 9.4k stars, 925 forks, 201 contributors
Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
Reviewed in "Retrieval-Augmented Generation for Large Language Models: A Survey", 2024, Gao et al.
Ragas: Automated evaluation of retrieval augmented generation, arXiv:2309.15217
- Works without having to rely on ground truth human annotations
- A reference-free (not tied to having ground truth available) evaluation framework for retrieval augmented generation
Resources:
- Documentation
- How-to Guides
- References
- Synthetic Test Data generation
- Utilizing User Feedback
- README: Installation, Quickstart
- Hugging Face
- Discord
RAGAS best covers the metrics mentioned in the Survey, and actually has more metrics:
- Faithfulness
- Answer relevancy
- Context recall
- Context precision
- Context utilization
- Context entity recall
- Noise Sensitivity
- Summarization Score
Extra source repos:
- https://github.com/alextakele/python-random-quote (given in PWC)
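A minimal sketch of running the metrics above with RAGAS, assuming a pre-0.2 release where the metrics are importable singletons (newer releases use metric classes, and the expected column names vary slightly between versions):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# Hypothetical single-sample dataset with the columns RAGAS expects
data = Dataset.from_dict(
    {
        "question": ["Which substation feeds the Oslo region?"],
        "answer": ["Substation X feeds the Oslo region."],
        "contexts": [["Substation X is the main feed for the Oslo region."]],
        "ground_truth": ["Substation X"],
    }
)

result = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # dict-like mapping of metric name to score
```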
https://github.com/stanford-futuredata/ares 599 stars, 59 forks, 8 contributors
Reviewed in "Retrieval-Augmented Generation for Large Language Models: A Survey", 2024, Gao et al.
Ares: An automated evaluation framework for retrieval-augmented generation systems, arXiv:2311.09476
- By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components.
- To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI).
- Tried on eight different knowledge-intensive tasks in KILT, SuperGLUE, and AIS
Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
https://github.com/langchain-ai/langchain, 109k stars, 17.7k forks, 3632 contributors
There are some matches related to evaluation in the LangChain documentation: https://python.langchain.com/. They mostly point to:
- Allows you to closely trace, monitor and evaluate your LLM application.
- Seamlessly integrates with LangChain and LangGraph, and you can use it to inspect and debug individual steps of your chains and agents as you build.
- LangSmith helps with every step of evaluation from creating a dataset to defining metrics to running evaluators.
- Provides an evaluation framework that helps you define metrics and run your app against your dataset
- Allows you to track results over time and automatically run your evaluators on a schedule or as part of CI/CD
Concepts:
- Evaluation is the process of assessing the performance and effectiveness of your LLM-powered applications. It involves testing the model's responses against a set of predefined criteria or benchmarks to ensure it meets the desired quality standards and fulfills the intended purpose. This process is vital for building reliable applications.
- Tracing gives you observability inside your chains and agents. It is the series of steps that your application takes to go from input to output. Traces contain individual steps called runs. These can be individual calls from a model, retriever, tool, or sub-chains. Tracing gives you observability inside your chains and agents, and is vital in diagnosing issues.
- LangSmith is not open source: https://docs.smith.langchain.com/pricing
- Provided as SaaS, but can also be self-hosted. Works on Docker, Kubernetes. Works on all major cloud platforms.
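A minimal sketch of a LangSmith evaluation run, assuming a recent `langsmith` SDK; the dataset name, target function and evaluator are made up:

```python
from langsmith.evaluation import evaluate


def answer_question(inputs: dict) -> dict:
    # Placeholder for a call into your chain / agent
    return {"answer": "..."}


def exact_match(run, example) -> dict:
    # Custom evaluator: compare the run's output with the reference answer
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": int(predicted.strip() == expected.strip())}


results = evaluate(
    answer_question,
    data="talk2powersystem-qa",  # name of a LangSmith dataset assumed to exist
    evaluators=[exact_match],
    experiment_prefix="baseline",
)
```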
Resources:
https://github.com/langchain-ai/auto-evaluator, 765 stars, 104 forks, 5 contributors
Description:
- Challenge: The quality of QA systems can vary considerably; for example, we have seen cases of hallucination and poor answer quality due to specific parameter settings. But it is not always obvious how to (1) evaluate the answer quality in a systematic way and (2) use this evaluation to guide improved QA chain settings (e.g., chunk size) or components (e.g., model or retriever choice).
- App overview: This app aims to address the above limitations. Recent work from Anthropic has used model-written evaluation sets. OpenAI and others have shown that model-graded evaluation is an effective way to evaluate models. This app combines both of these ideas into a single workspace, auto-generating a QA test set and auto-grading the result of the specified QA chain.
https://github.com/Giskard-AI/giskard, 4.6k stars, 324 forks, 51 contributors
Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
- Home: https://www.giskard.ai/
- Documentation
- Discord
- Colab notebook
- Tutorials and guides on Machine Learning Testing
- Resources
- Glossary
The Evaluation & Testing framework for LLMs & ML models
- Control risks of performance, bias and security issues in AI models.
RAG Evaluation Toolkit (RAGET):
- Automatically generate evaluation datasets & evaluate RAG application answers
- If you're testing a RAG application, you can get an even more in-depth assessment using RAGET, Giskard's RAG Evaluation Toolkit.
- RAGET can automatically generate a list of question, reference_answer and reference_context items from the knowledge base of the RAG. You can then use this generated test set to evaluate your RAG agent.
- RAGET computes scores for each component of the RAG agent. The scores are computed by aggregating the correctness of the agent's answers on different question types. Components evaluated with RAGET:
  - Generator: the LLM used inside the RAG to generate the answers
  - Retriever: fetches relevant documents from the knowledge base according to a user query
  - Rewriter: rewrites the user query to make it more relevant to the knowledge base or to account for chat history
  - Router: filters the query of the user based on their intentions
  - Knowledge Base: the set of documents given to the RAG to generate the answers
- See the raget_demo demonstration notebook (a usage sketch also follows below)
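A minimal RAGET sketch following the Giskard docs (exact signatures may differ between Giskard versions; the knowledge base contents and the agent are made up):

```python
import pandas as pd
from giskard.rag import KnowledgeBase, evaluate, generate_testset

# Hypothetical knowledge base built from a dataframe of text chunks
kb = KnowledgeBase.from_pandas(
    pd.DataFrame({"text": ["Substation X is the main feed for the Oslo region.", "..."]})
)

testset = generate_testset(
    kb,
    num_questions=30,
    agent_description="A chatbot answering questions about the power grid knowledge graph",
)


def answer_fn(question: str, history=None) -> str:
    # Placeholder for your RAG agent
    return "..."


report = evaluate(answer_fn, testset=testset, knowledge_base=kb)
report.to_html("raget_report.html")
```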
Uses a freemium model: https://www.giskard.ai/pricing. There is no free hosted evaluation offering, but you can deploy and run the Python library yourself.
https://github.com/promptfoo/promptfoo, 7k stars, 554 forks, 178 contributors
Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
Features
- Test your prompts, agents, and RAGs.
- Red teaming, pentesting, and vulnerability scanning for LLMs.
- Compare performance of OpenAI GPT, Azure, Anthropic Claude, Google Gemini, Llama, HuggingFace models, or integrate custom API providers for any LLM API
- Simple declarative configs with command line and CI/CD integration
- Build reliable prompts, models, and RAGs with benchmarks specific to your use-case
- Secure your apps with automated red teaming and pentesting
- Speed up evaluations with caching, concurrency, and live reloading
- Score outputs automatically by defining metrics
- Use as a CLI, library, or in CI/CD
- Developer friendly: fast, with quality-of-life features like live reloads and caching.
- Battle-tested: Originally built for LLM apps serving over 10 million users in production. Our tooling is flexible and can be adapted to many setups.
- Simple, declarative test cases: Define evals without writing code or working with heavy notebooks.
- Language agnostic: Use Python, Javascript, or any other language.
- Share & collaborate: Built-in share functionality & web viewer for working with teammates.
- Open-source: LLM evals are a commodity and should be served by 100% open-source projects with no strings attached.
- Private: This software runs completely locally. The evals run on your machine and talk directly with the LLM.
https://github.com/uptrain-ai/uptrain, 2.3k stars, 198 forks, 40 contributors
https://uptrain.ai/ is an open-source unified platform to evaluate and improve Generative AI applications.
- Provides grades for 20+ preconfigured evaluations (covering language, code, embedding use cases),
- Performs root cause analysis on failure cases and gives insights on how to resolve them.
- Covers all your LLMOps needs: Enterprise grade tooling to help you iterate faster and stay ahead of competitors
- Faster and Systematic Experimentation: Get quantitative scores and make the right decisions. Eliminate guesswork, subjectivity and hours of manual review.
- Automated Regression Testing: Automated testing for each prompt-change/config-change/code-change across a diverse test set. Prompt versioning allows you to roll back changes hassle-free.
- Know Where Things Are Going Wrong: Not just monitoring, UpTrain isolates error cases and finds common patterns among them. UpTrain provides root cause analysis and helps make improvements faster.
- Enriched Datasets for your testing needs: UpTrain helps create diverse test sets for different use cases. You can also enrich your existing datasets by capturing different edge cases encountered in production.
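A minimal sketch following UpTrain's README-style API (assumed; class and check names may differ between versions, and the API key and data are illustrative):

```python
from uptrain import EvalLLM, Evals

eval_llm = EvalLLM(openai_api_key="sk-...")  # illustrative key

data = [
    {
        "question": "Which substation feeds the Oslo region?",
        "context": "Substation X is the main feed for the Oslo region.",
        "response": "Substation X feeds the Oslo region.",
    }
]

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_RELEVANCE, Evals.FACTUAL_ACCURACY],
)
print(results)
```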
Notes:
- Mentioned as a "provider" in langchain documentation: https://python.langchain.com/docs/integrations/providers/uptrain/
- https://demo.uptrain.ai/evals_demo/ - this site can't provide a secure connection (ERR_SSL_PROTOCOL_ERROR)
https://github.com/lunary-ai/lunary, 1.3k stars, 155 forks, 9 contributors
The production toolkit for LLMs. Observability, prompt management and evaluations.
- Cost and usage analytics, user tracking, tracing, monitoring, evaluation tools.
- Formerly "llmonitor"
- Mentioned in LangChain documentation: integrations/providers/llmonitor
https://docs.arize.com/phoenix
Open source? The Phoenix library itself is open source (https://github.com/Arize-ai/phoenix), while the broader Arize platform is commercial.
- Open-Source Tracing and Evaluation: Trace, evaluate, and iterate on generative AI applications
- AI observability and LLM evaluation platform
- Support for LangChain applications
- Detailed traces of input, embeddings, retrieval, functions, and output messages.
Docs:
- LLM Evaluation. In-depth topics:
- LLM As a Judge
- LLM Model Evals vs. LLM System Evals
- LLM Model Evals
- LLM System Evals
- When To Use Each
- LLM System Evaluation Metrics
- Top LLM System Evaluation Metrics
- LLM RAG Retrieval Metrics
- Exercise: Evaluating Context Relevance
- How To Build An LLM Eval
- LLM Benchmarks: Why Precision, Recall
- How To Run LLM Evals
- Questions To Consider
- Needle In a Haystack Tests
- LLM Tracing
- RAG Evaluation
- Troubleshooting Retrieval and Responses
- Response Evaluation Metrics
- Retrieval Evaluation Metrics
- Troubleshooting RAG Workflows
- Scenario 1: Good Response, Good Retrieval
- Scenario 2: Bad Response, Bad Retrieval
- Scenario 3: Bad Response, Mixed Retrieval
https://github.com/HumanSignal/label-studio, 22.3k stars, 2.8k forks, 167 contributors, 844 declared users
Open Source Data Labeling Platform
- Home: https://labelstud.io/
- Documentation
- Blog
- Videos, Starter Tutorial
- Community, Academic Program
- Slack
The most flexible data labeling platform to fine-tune LLMs, prepare training data or validate AI models. Multi-type data labeling and annotation tool with standardized output format.
This is more for creating labeled datasets, not so much for evaluation. Features:
- Flexible and configurable: Configurable layouts and templates adapt to your dataset and workflow.
- Integrate with your ML/AI pipeline: Webhooks, Python SDK and API allow you to authenticate, create projects, import tasks, manage model predictions, and more.
- ML-assisted labeling: Save time by using predictions to assist your labeling process with ML backend integration.
- Connect your cloud storage: Connect to cloud object storage and label data there directly with S3 and GCP.
- Explore & understand your data: Prepare and manage your dataset in our Data Manager using advanced filters.
- Multiple projects and users: Support multiple projects, use cases and data types in one platform.
Freemium: comparison between open source and enterprise version
https://github.com/deepchecks/deepchecks, 3.8k stars, 269 forks, 52 contributors
Tests for Continuous Validation of ML Models & Data. A holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production.
Notes:
- Geared more towards evaluating the LLM itself, rather than LLM systems/applications.
- Complicated developer experience
- Open-source offering is unique as it focuses heavily on the dashboards and the visualization UI, which makes it easy for users to visualize evaluation results.
https://github.com/microsoft/promptflow, 10.4k stars, 982 forks, 107 contributors, 1.9k declared users
By Microsoft.
Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.
https://microsoft.github.io/promptflow/
Suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications, from ideation, prototyping, testing, evaluation to production deployment and monitoring. It makes prompt engineering much easier and enables you to build LLM apps with production quality.
With prompt flow, you will be able to:
- Create flows that link LLMs, prompts, Python code and other tools together in an executable workflow.
- Debug and iterate your flows, especially tracing interaction with LLMs with ease.
- Evaluate your flows, calculate quality and performance metrics with larger datasets.
- Integrate the testing and evaluation into your CI/CD system to ensure quality of your flow.
- Deploy your flows to the serving platform you choose or integrate into your app's code base easily.
- Collaborate with your team by leveraging the cloud version of Prompt flow in Azure AI.
Visualization of a DAG flow in Promptflow using Visual Studio Code
https://github.com/i-am-bee/bee-agent-framework: 2.5k stars, 293 forks, 42 contributors
By IBM.
This is an open-source platform to discover, run, and compose AI agents from any framework.
- LLMs & GenAI Playground
- This doesn't seem to be open source, which is a deal-breaker
https://github.com/Helicone/helicone : 3.9k stars, 381 forks, 83 contributors
- The open-source LLM-Observability Platform for Developers (logging, monitoring, and debugging).
- One-line integration for monitoring, metrics, evals, agent tracing, prompt management, playground, etc.
- Supports OpenAI SDK, Vercel AI SDK, Anthropic SDK, LiteLLM, LlamaIndex, LangChain…
- Free tier: monthly 100k requests
https://dev.to/guybuildingai/-top-5-open-source-llm-evaluation-frameworks-in-2024-98m
It seems like every week there is a new open-source repo trying to do the same thing as the other 30+ frameworks that already exist.
- DeepEval
- MLflow
- RAGAS
- DeepChecks
- Arize Phoenix
DeepChecks Blog: Best 10 LLM Evaluation Tools in 2024.
- DeepChecks
- LLMbench
- MLflow
- Arize Phoenix
- DeepEval
- RAGAS
- ChainForge
- Guardrails AI
- OpenPipe
- Prompt Flow
https://www.superannotate.com/blog/llm-evaluation-guide#top-10-llm-evaluation-frameworks-and-tools
This lists mostly commercial platforms:
- Amazon Bedrock
- NVidia Nemo
- Azure AI Studio
- Google Vertex AI Studio
- LangSmith
And some open source tools that were already listed:
- TruLens
- DeepEval
- Prompt Flow
LLM Observability Tools: 2024 Comparison
LLM observability refers to gaining total visibility into all layers of an LLM-based software system, including the application, prompt, and answer.
Each response must be reviewed for cleanliness and relevance. To meet your monitoring objectives, you must set up recording of your LLM prompts and replies, followed by contextual analysis.
- Lunary
- LangSmith
- Portkey: maybe not an eval tool: "a proxy that lets you keep a prompt library and supply variables in the template to access your LLM. The tool maintains all of your integration's fundamental parameters, including temperature. It provides tools for caching responses, creating load balancing between models, and configuring fallbacks."
- Helicone
- TruLens
- Arize Phoenix
- Traceloop OpenLLMetry
- Datadog: maybe not an eval tool: "an infrastructure and application monitoring software that has expanded its integrations into the world of LLMs and associated tools. It provides out-of-the-box dashboards for LLM observability. You can enable OpenAI usage tracing"
These are not evaluation tools, but other related resources.
https://github.com/openai/evals, 16.3k stars, 2.7k forks, 459 contributors
A framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
This was the first collection of very diverse LLM evaluation cases. At the beginning, OpenAI asked people to contribute to these evals in order to gain GPT-4 API access. I guess the depth/quality of evaluations differs by branch, but it's worth exploring the registry.
- modelgraded is a set of evals that are graded (evaluated) by LLM
- modelgraded/closedqa is the same for closed-answer (factual) Q&A
https://github.com/THUDM/AgentBench, 2.6k stars, 183 forks, 15 contributors
LLMbench consists of 3 benchmarks where 25 LLMs are evaluated:
- Agent: LLM capabilities
- Safety
- Alignment
Here we describe only the first one.
Home: https://llmbench.ai/agent
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
It encompasses 8 distinct environments to provide a more comprehensive evaluation of the LLMs' ability to operate as autonomous agents in various scenarios:
- Operating System (OS)
- Database (DB)
- Knowledge Graph (KG)
- Digital Card Game (DCG)
- Lateral Thinking Puzzles (LTP)
- House-Holding (HH) (ALFWorld), e.g. following commands in the kitchen
- Web Shopping (WS) (WebShop)
- Web Browsing (WB) (Mind2Web)
It shows a significant gap between leading commercial LLMs and open-source LLMs:
E.g., in the KG environment, GPT-4 scores 58 while Llama 2 scores 8.