LLM and KGQA Evaluation Tools - statnett/Talk2PowerSystem GitHub Wiki
- Surveys
- Tools/Frameworks
- Comparisons
- Other Resources
"Retrieval-Augmented Generation for Large Language Models: A Survey", 2024, Gao et al. (Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, Haofen Wang), arXiv:2312.10997: https://arxiv.org/abs/2312.10997
The survey considers a number of evaluation aspects and the metrics applicable to each of them, and it lists 3 evaluation tools.
"A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks"
"A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al., arXiv: https://arxiv.org/abs/2406.08216
The survey takes a very pragmatic approach and looks for tools available on GitHub.
It's a brief paper and not very deep, but introduces a number of perspectives/aspects for potential evaluation ("Correctness" is just one, although the most often implemented)
https://github.com/confident-ai/deepeval: 7k stars, 631 forks, 154 contributors
- Confident AI Home Page
- Documentation
- README: Metrics and Features, Getting Started, Integrations
- Blog
- Colab notebook
- Discord
- Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
The LLM Evaluation Platform
An all-in-one LLM evaluation platform to easily evaluate, compare, and share test results to identify LLM regressions. Automatically Catch Regressions in LLM Systems.
- Similar to Pytest but specialized for unit testing of LLM outputs.
- Metrics (incorporates the latest research):
- G-Eval
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS (average of 4 of the above metrics)
- Hallucination
- Toxicity
- Bias
- Implemented benchmarks:
- MMLU
- HellaSwag
- DROP
- BIG-Bench Hard
- TruthfulQA
- HumanEval
- GSM8K
- Covers RAG, fine-tuning, LangChain, LlamaIndex
- Determine the optimal hyperparameters to improve your RAG pipeline
- Transition from OpenAI to hosting your own Llama with confidence
- Supply ground truths as benchmarks to evaluate your LLM outputs. Evaluate performance against expected outputs to pinpoint areas for iteration.
- Dataset generation: Automatically generate expected queries and responses for evaluation.
- Advanced diff tracking to iterate towards the optimal LLM stack
- LLM observability and monitoring to identify areas of focus
- Output classification: Discover recurring queries and responses to optimize for specific use cases.
- Run evaluations on the cloud through simple APIs. Judge your LLM application on one, centralized platform.
- Or evaluate LLMs that run locally on your machine
Also:
- Modular components that are extremely simple to plug in and use. You can easily mix and match different metrics, or even use DeepEval to build your own evaluation pipeline if needed.
- Treats evaluations as unit tests. With an integration for Pytest, DeepEval is a complete testing suite most developers are familiar with.
- Allows you to generate synthetic datasets using your knowledge base as context, or load datasets from CSVs, JSONs, or Hugging Face.
- Offers a hosted platform with a generous free tier to run real-time evaluations in production.
- Effectively writing python unit tests
- Custom metrics available on answers
- Fully integrated with Confident AI
- Metrics
- Answer, conversation and multimodal metrics
- Answer metrics overall
- Accept input (question), actual_answer (target) and LLM model
- Also have various other custom parameters that should be used
- G-Eval, Tool Correctness
- G-Eval is CoT-based detailed evaluation of the result, effectively extra chatbot-evaluating chatbot
- Tool Correctness is definitely a support metric for our agents/assistants
- Major issue: aims at working with GPT-4o; it supports calling custom models, but support for assistants/agents is clunkier. Still, something is offered, and probably more is coming in the future.
- Moderate issue: it seems everything is sent/calculated in the cloud on their end. Weird, since you can use local models and they do offer support for sensitive data, but it is definitely not just a local library.
- Concern: all metrics are based on asking a model whether another model's response has some property, e.g. is biased, toxic, factual, accurate, etc.
- Upside: Confirms what metrics and approaches we want to aim for and how to implement them.
- Open-source evaluation framework for LLMs
- Easily "unit test" LLM outputs in a similar way to Pytest.
- Plug-and-use 14+ LLM-evaluated metrics, most with research backing.
- Synthetic dataset generation with state-of-the-art evolution techniques.
- Metrics are simple to customize and cover all use cases.
- Real-time evaluations in production.
- Using local models
deepeval set-local-model --model-name=<model_name> \
    --base-url="http://localhost:11434/v1/" \
    --api-key="ollama"
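A minimal Pytest-style sketch of a DeepEval test, assuming a recent DeepEval release (the question, answer and retrieval context below are made up):

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    # Hypothetical question/answer pair with its retrieval context
    test_case = LLMTestCase(
        input="Which voltage levels are present in the grid model?",
        actual_output="The grid model contains 132 kV, 300 kV and 420 kV levels.",
        retrieval_context=["The model includes equipment at 132 kV, 300 kV and 420 kV."],
    )
    # Fails the test if the LLM-judged relevancy score falls below the threshold
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

Run it like a normal test suite with `deepeval test run test_answers.py`.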
- Metrics available
- Answer metrics
- Generally each has input, expected answer and evaluating model
- G-Eval
- Most versatile metric (a usage sketch appears after this metrics list)
- Steps:
- generate a series of evaluation_steps using a chain of thought (CoTs) based on the given criteria;
- use the generated steps to determine the final score using the parameters presented in an LLMTestCase.
- Summarization
- Use LLM to evaluate coverage and accuracy of summary
- Summarization = min(Alignment Score, Coverage Score)
- GPT-4o by default but can use custom models
- Details on building the metric: A Step-By-Step Guide to Evaluating an LLM Text Summarization Task - Confident AI
- Faithfulness
- Requires retrieval_context in addition to other parameters
- Faithfulness = Truthful Claims / Total Claims
- Answer Relevancy
- Relevancy = Relevant Statements / Total Statements
- Contextual Relevancy
- Might be very relevant to us as an approach when retrieving results so long as we can get accurate target sets
- Based on retrieval context
- Contextual Precision
- Might be very relevant to us as an approach when retrieving results so long as we can get accurate target sets
- Based on retrieval context
- Strong emphasis on the top result
- Rewards relevant ordering in result selection
- Contextual Recall
- Might be very relevant to us as an approach when retrieving results if we can get accurate target sets
- Based on retrieval context
- Tool correctness
- Define input, actual_output, tools_called and expected_tools
- We definitely want this metric, but making use of it requires us to do essentially all the heavy lifting (see the sketch after this metrics list)
- Tool Correctness = Correctly Used Tools / Total Tools Called
- Ragas
- Composite score for a RAG pipeline's generator and retriever, combining 4 of the metrics above
- Hallucination
- Looks for contradictions between produced and expected result
- Toxicity
- Bias
- An LLM judge decides whether the model's response is biased
- Custom metrics
- Easy to define, run on their end
- The only way to use a non-LLM metric
- Conversation metrics
- Expected data is always a list of turns, each with an input and an actual_output
- Role adherence
- Extra chatbot_role needs to be specified
- Adherence = Adhered Turns / Total Turns
- Knowledge Retention
- Retention = No Knowledge Attrition Turns / Total Turns
- Conversation Completeness
- Completeness = Satisfied User Intentions / User Intentions
- Conversation Relevancy
- Relevancy = Turns with Relevant Output / Total Turns
- Multimodal metrics
- Text to Image
- Image Editing
- Answer metrics
- Using custom LLMs
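The G-Eval and Tool Correctness metrics above could be wired up roughly as follows (a sketch assuming a recent DeepEval version; in older releases tools_called/expected_tools were plain strings rather than ToolCall objects):

```python
from deepeval.metrics import GEval, ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams, ToolCall

# G-Eval: an LLM judge scores the answer against free-form criteria via CoT steps
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output answers the input question factually.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Tool Correctness: compares the tools the agent actually called with the expected ones
tool_correctness = ToolCorrectnessMetric()

test_case = LLMTestCase(
    input="List all 300 kV substations in the dataset.",  # hypothetical example
    actual_output="There are 12 substations operating at 300 kV: ...",
    tools_called=[ToolCall(name="sparql_query")],
    expected_tools=[ToolCall(name="sparql_query")],
)

correctness.measure(test_case)
tool_correctness.measure(test_case)
print(correctness.score, correctness.reason)
print(tool_correctness.score)
```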
https://github.com/mlflow/mlflow 20.7k stars, 856 contributors, 4.6k forks, 62.1k declared users
MLflow is a platform for the machine learning lifecycle (traditional ML and LLM).
Characteristics:
- Open Source: Integrate with any ML library and platform
- Comprehensive: Manage end-to-end ML and GenAI workflows, from development to production
- Unified: Unified platform for both traditional ML and GenAI applications
Features:
- Improve generative AI quality
- Enhance LLM observability with tracing
- Build applications with prompt engineering
- Track progress during fine-tuning
- Package and deploy models
- Securely host LLMs at scale with MLflow Deployments
- LLM Evaluate: a modular, simple package for running evaluations in your own evaluation pipelines. It offers RAG evaluation and QA evaluation with an intuitive developer experience (a usage sketch follows this list).
- Use MLflow Evaluate with the Prompt Engineering UI
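A minimal sketch of MLflow's QA evaluation, assuming MLflow 2.x (`mlflow.evaluate` with `model_type="question-answering"`); the data and the `qa_model` callable are made up:

```python
import mlflow
import pandas as pd


def qa_model(df: pd.DataFrame) -> list:
    # Placeholder for a call into your own LLM / KGQA pipeline
    return ["Oslo"] * len(df)


# Hypothetical evaluation data; "inputs"/"ground_truth" follow the QA evaluator's conventions
eval_df = pd.DataFrame(
    {
        "inputs": ["What is the capital of Norway?"],
        "ground_truth": ["Oslo"],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model=qa_model,
        data=eval_df,
        targets="ground_truth",
        model_type="question-answering",  # enables built-in QA metrics such as exact_match
    )
    print(results.metrics)
```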
It is available as a managed service in Databricks. MLflow is tightly integrated into Databricks.
- A tracking system that lets the user track model parameters and model metrics
- Streamlines the model development and deployment processes
- Ensures reproducibility, scalability, and traceability
- Smooth quick start
- The tracking server can be deployed locally
- Nice interface that allows for easy comparison of results and parameters
- Models can be loaded using the mlflow.pyfunc module
- Autologging is available
- Supported libraries
Tracking is central to the MLflow ecosystem, facilitating the systematic organization of experiments and runs:
- Experiments and Runs: Each experiment encapsulates a specific aspect of your research, and each experiment can house multiple runs. Runs document critical data like metrics, parameters, and the code state.
- Artifacts: Store crucial output from runs, be it models, visualizations, datasets, or other metadata. This repository of artifacts ensures traceability and easy access.
- Metrics and Parameters: By allowing users to log parameters and metrics, MLflow makes it straightforward to compare different runs, facilitating model optimization.
- Dependencies and Environment: The platform automatically captures the computational environment, ensuring that experiments are reproducible across different setups.
- Input Examples and Model Signatures: These features allow developers to define the expected format of the model’s inputs, making validation and debugging more straightforward.
- UI Integration: The integrated UI provides a visual overview of all runs, enabling easy comparison and deeper insights.
- Search Functionality: Efficiently sift through your experiments using MLflow’s robust search functionality.
- APIs: Comprehensive APIs are available, allowing users to interact with the tracking system programmatically, integrating it into existing workflows.
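A minimal sketch of the tracking workflow described above (experiment name, parameters, metric values and the artifact file are illustrative):

```python
import mlflow

mlflow.set_experiment("kgqa-eval")  # illustrative experiment name

with mlflow.start_run(run_name="baseline"):
    # Parameters: e.g. the configuration of a RAG/KGQA pipeline
    mlflow.log_param("llm", "gpt-4o")
    mlflow.log_param("chunk_size", 512)
    # Metrics: e.g. scores produced by an external evaluator
    mlflow.log_metric("answer_relevancy", 0.87)
    mlflow.log_metric("faithfulness", 0.91)
    # Artifacts: any local file, e.g. a full evaluation report (must exist on disk)
    mlflow.log_artifact("eval_report.json")
```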
Ensuring model quality is paramount:
- Auto-generated Metrics: MLflow automatically evaluates models, providing key metrics for regression (like RMSE, MAE) and classification (such as F1-score, AUC-ROC).
- Visualization: Understand your model better with automatically generated plots. For instance, MLflow can produce confusion matrices, precision-recall curves, and more for classification tasks.
- Extensibility: While MLflow provides a rich set of evaluation tools out of the box, it’s also designed to accommodate custom metrics and visualizations.
This feature acts as a catalog for models:
- Versioning: As models evolve, keeping track of versions becomes crucial. The Model Registry handles versioning, ensuring that users can revert to older versions or compare different iterations.
- Annotations: Models in the registry can be annotated with descriptions, use-cases, or other relevant metadata.
- Lifecycle Stages: Track the stage of each model version, be it ‘staging’, ‘production’, or ‘archived’. This ensures clarity in deployment and maintenance processes.
MLflow simplifies the transition from development to production:
- Consistency: By meticulously recording dependencies and the computational environment, MLflow ensures that models behave consistently across different deployment setups.
- Docker Support: Facilitate deployment in containerized environments using Docker, encapsulating all dependencies and ensuring a uniform runtime environment.
- Scalability: MLflow is designed to accommodate both small-scale deployments and large, distributed setups, ensuring that it scales with your needs.
Notebook on Databricks
https://docs.databricks.com/en/mlflow/quick-start-python.html
https://github.com/open-compass/opencompass 5.4k stars, 585 forks, 151 contributors
Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
A Universal Evaluation Platform for Foundation Models.
An LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) on 100+ datasets.
https://github.com/truera/trulens/ 2.5k stars, 218 forks, 62 contributors
Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
This is a general LLM evaluation tool that focuses on answer quality, and also some other aspects such as safety. It's been called "a framework for explaining deep network behavior".
It supports (instruments and evaluates) LangChain, LlamaIndex and NVIDIA Nemo Guardrails.
Links:
- Home: https://www.trulens.org/
- Documentation: getting started, detailed (tracking)
- Colab Notebook
- Slack
"RAG Triad":
https://www.trulens.org/trulens/getting_started/core_concepts/rag_triad/
- Context Relevance: is the retrieved context relevant to the question?
- Groundedness: is the answer based on the provided context?
- Answer Relevance: is the answer relevant to the question?
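Conceptually, each leg of the RAG triad is an LLM-as-judge question over (query, context, answer). The sketch below is not the TruLens API; `llm_judge` is a hypothetical helper wrapping whatever judge model you use:

```python
def llm_judge(prompt: str) -> float:
    """Hypothetical helper: ask a judge LLM to return a score in [0, 1]."""
    raise NotImplementedError


def rag_triad(query: str, context: str, answer: str) -> dict:
    return {
        # Context Relevance: is the retrieved context relevant to the question?
        "context_relevance": llm_judge(
            f"How relevant is this context to the question?\nQ: {query}\nContext: {context}"
        ),
        # Groundedness: is the answer supported by the provided context?
        "groundedness": llm_judge(
            f"Is this answer supported by the context?\nContext: {context}\nAnswer: {answer}"
        ),
        # Answer Relevance: does the answer address the question?
        "answer_relevance": llm_judge(
            f"Does this answer address the question?\nQ: {query}\nAnswer: {answer}"
        ),
    }
```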
Mentions of TruLens in Google Scholar
- Exploring Conceptual Soundness with TruLens (NeurIPS 2021): "interactive application built on TruLens that we use to explore the conceptual soundness of various pre-trained models"
- MedInsight: A Multi-Source Context Augmentation Framework for Generating Patient-Centric Medical Responses using Large Language Models (arXiv 2403): quantitative evaluation using the Ragas metric and TruLens for answer similarity and answer correctness
- Safeguarding Large Language Models: A Survey: focuses on safety
- Development and Evaluation of a Retrieval-Augmented Generation Tool for Creating SAPPhIRE Models of Artificial Systems: "we use the state-of-the-art automated RAG evaluation tool called the 'RAG Triad' provided by TruLens. This approach emphasizes three quality scores of the RAG system"
- Retrieval-Augmented Generation for Large Language Models: A Survey: introduces the "RAG triad" concept
- Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models (Ital-IA 2024): estimates Answer Relevance with TruLens and RAGAS, and Answer Correctness with RAGAS; explains the evaluation approach (Spearman correlations, etc.)
- Towards Increased Truthfulness in LLM Applications: Application-Oriented Methods from Current Research (Towards Data Science blog)
- RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems (arXiv 2407)
- RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation (arXiv 2408)
- Enterprise LLMOps: Advancing Large Language Models Operations Practice (2024 IEEE Cloud Summit)
- CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models (arXiv 2401)
- LLMs in Production: book chapter in Large Language Models: A Deep Dive
- Causal Reasoning in Large Language Models using Causal Graph Retrieval Augmented Generation
https://github.com/explodinggradients/ragas 9.4k stars, 925 forks, 201 contributors
Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
Reviewed in "Retrieval-Augmented Generation for Large Language Models: A Survey", 2024, Gao et al.
Ragas: Automated evaluation of retrieval augmented generation, arXiv:2309.15217
- Works without having to rely on ground truth human annotations
- A reference-free (not tied to having ground truth available) evaluation framework for retrieval augmented generation
Resources:
- Documentation
- How-to Guides
- References
- Synthetic Test Data generation
- Utilizing User Feedback
- README: Installation, Quickstart
- Hugging Face
- Discord
RAGAS best covers the metrics mentioned in the Survey, and actually has more metrics:
- Faithfulness
- Answer relevancy
- Context recall
- Context precision
- Context utilization
- Context entity recall
- Noise Sensitivity
- Summarization Score
Extra source repos:
- https://github.com/alextakele/python-random-quote (given in PWC)
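A minimal sketch of running the metrics above with RAGAS, assuming a pre-0.2 release where the metrics are importable singletons (newer releases use metric classes, and the expected column names vary slightly between versions):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# Hypothetical single-sample dataset with the columns RAGAS expects
data = Dataset.from_dict(
    {
        "question": ["Which substation feeds the Oslo region?"],
        "answer": ["Substation X feeds the Oslo region."],
        "contexts": [["Substation X is the main feed for the Oslo region."]],
        "ground_truth": ["Substation X"],
    }
)

result = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # dict-like mapping of metric name to score
```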
https://github.com/stanford-futuredata/ares 599 stars, 59 forks, 8 contributors
Reviewed in "Retrieval-Augmented Generation for Large Language Models: A Survey", 2024, Gao et al.
Ares: An automated evaluation framework for retrieval-augmented generation systems, arXiv:2311.09476
- By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components.
- To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI).
- Tried on eight different knowledge-intensive tasks in KILT, SuperGLUE, and AIS
Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
https://github.com/langchain-ai/langchain, 109k stars, 17.7k forks, 3632 contributors
There are some matches related to evaluation in the LangChain documentation: https://python.langchain.com/. They mostly point to:
- Allows you to closely trace, monitor and evaluate your LLM application.
- Seamlessly integrates with LangChain and LangGraph, and you can use it to inspect and debug individual steps of your chains and agents as you build.
- LangSmith helps with every step of evaluation from creating a dataset to defining metrics to running evaluators.
- Provides an evaluation framework that helps you define metrics and run your app against your dataset
- Allows you to track results over time and automatically run your evaluators on a schedule or as part of CI/CD
Concepts:
- Evaluation is the process of assessing the performance and effectiveness of your LLM-powered applications. It involves testing the model's responses against a set of predefined criteria or benchmarks to ensure it meets the desired quality standards and fulfills the intended purpose. This process is vital for building reliable applications.
- Tracing gives you observability inside your chains and agents. It is the series of steps that your application takes to go from input to output. Traces contain individual steps called runs. These can be individual calls from a model, retriever, tool, or sub-chains. Tracing gives you observability inside your chains and agents, and is vital in diagnosing issues.
- LangSmith is not open source: https://docs.smith.langchain.com/pricing
- Provided as SaaS, but can also be self-hosted. Works on Docker, Kubernetes. Works on all major cloud platforms.
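A minimal sketch of a LangSmith evaluation run, assuming a recent `langsmith` SDK; the dataset name, target function and evaluator are made up:

```python
from langsmith.evaluation import evaluate


def answer_question(inputs: dict) -> dict:
    # Placeholder for a call into your chain / agent
    return {"answer": "..."}


def exact_match(run, example) -> dict:
    # Custom evaluator: compare the run's output with the reference answer
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": int(predicted.strip() == expected.strip())}


results = evaluate(
    answer_question,
    data="talk2powersystem-qa",  # name of a LangSmith dataset assumed to exist
    evaluators=[exact_match],
    experiment_prefix="baseline",
)
```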
Resources:
https://github.com/langchain-ai/auto-evaluator, 765 stars, 104 forks, 5 contributors
Description:
- Challenge: The quality of QA systems can vary considerably; for example, we have seen cases of hallucination and poor answer quality due to specific parameter settings. But it is not always obvious how to (1) evaluate the answer quality in a systematic way and (2) use this evaluation to guide improved QA chain settings (e.g., chunk size) or components (e.g., model or retriever choice).
- App overview: This app aims to address the above limitations. Recent work from Anthropic has used model-written evaluation sets. OpenAI and others have shown that model-graded evaluation is an effective way to evaluate models. This app combines both of these ideas into a single workspace, auto-generating a QA test set and auto-grading the result of the specified QA chain.
https://github.com/Giskard-AI/giskard, 4.6k stars, 324 forks, 51 contributors
Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
- Home: https://www.giskard.ai/
- Documentation
- Discord
- Colab notebook
- Tutorials and guides on Machine Learning Testing
- Resources
- Glossary
The Evaluation & Testing framework for LLMs & ML models
- Control risks of performance, bias and security issues in AI models.
RAG Evaluation Toolkit (RAGET):
- Automatically generate evaluation datasets & evaluate RAG application answers
- If you're testing a RAG application, you can get an even more in-depth assessment using RAGET, Giskard's RAG Evaluation Toolkit.
- RAGET can automatically generate a list of question, reference_answer and reference_context items from the knowledge base of the RAG. You can then use this generated test set to evaluate your RAG agent.
- RAGET computes scores for each component of the RAG agent. The scores are computed by aggregating the correctness of the agent's answers on different question types. Components evaluated with RAGET:
  - Generator: the LLM used inside the RAG to generate the answers
  - Retriever: fetches relevant documents from the knowledge base according to a user query
  - Rewriter: rewrites the user query to make it more relevant to the knowledge base or to account for chat history
  - Router: filters the query of the user based on their intentions
  - Knowledge Base: the set of documents given to the RAG to generate the answers
- See the raget_demo demonstration notebook (a usage sketch also follows below)
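A minimal RAGET sketch following the Giskard docs (exact signatures may differ between Giskard versions; the knowledge base contents and the agent are made up):

```python
import pandas as pd
from giskard.rag import KnowledgeBase, evaluate, generate_testset

# Hypothetical knowledge base built from a dataframe of text chunks
kb = KnowledgeBase.from_pandas(
    pd.DataFrame({"text": ["Substation X is the main feed for the Oslo region.", "..."]})
)

testset = generate_testset(
    kb,
    num_questions=30,
    agent_description="A chatbot answering questions about the power grid knowledge graph",
)


def answer_fn(question: str, history=None) -> str:
    # Placeholder for your RAG agent
    return "..."


report = evaluate(answer_fn, testset=testset, knowledge_base=kb)
report.to_html("raget_report.html")
```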
Uses a freemium model: https://www.giskard.ai/pricing. There is no free hosted evaluation offering, but you can deploy and run the Python library yourself.
https://github.com/promptfoo/promptfoo, 7k stars, 554 forks, 178 contributors
Reviewed in "A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks", 2024, Hudson et al.
Features
- Test your prompts, agents, and RAGs.
- Red teaming, pentesting, and vulnerability scanning for LLMs.
- Compare performance of OpenAI GPT, Azure, Anthropic Claude, Google Gemini, Llama, HuggingFace models, or integrate custom API providers for any LLM API
- Simple declarative configs with command line and CI/CD integration
- Build reliable prompts, models, and RAGs with benchmarks specific to your use-case
- Secure your apps with automated red teaming and pentesting
- Speed up evaluations with caching, concurrency, and live reloading
- Score outputs automatically by defining metrics
- Use as a CLI, library, or in CI/CD
- Developer friendly: fast, with quality-of-life features like live reloads and caching.
- Battle-tested: Originally built for LLM apps serving over 10 million users in production. Our tooling is flexible and can be adapted to many setups.
- Simple, declarative test cases: Define evals without writing code or working with heavy notebooks.
- Language agnostic: Use Python, Javascript, or any other language.
- Share & collaborate: Built-in share functionality & web viewer for working with teammates.
- Open-source: LLM evals are a commodity and should be served by 100% open-source projects with no strings attached.
- Private: This software runs completely locally. The evals run on your machine and talk directly with the LLM.
https://github.com/uptrain-ai/uptrain, 2.3k stars, 198 forks, 40 contributors
https://uptrain.ai/ is an open-source unified platform to evaluate and improve Generative AI applications.
- Provides grades for 20+ preconfigured evaluations (covering language, code, embedding use cases),
- Performs root cause analysis on failure cases and gives insights on how to resolve them.
- Covers all your LLMOps needs: Enterprise grade tooling to help you iterate faster and stay ahead of competitors
- Faster and Systematic Experimentation: Get quantitative scores and make the right decisions. Eliminate guesswork, subjectivity and hours of manual review.
- Automated Regression Testing: Automated testing for each prompt-change/config-change/code-change across a diverse test set. Prompt versioning allows you to roll back changes hassle-free.
- Know Where Things Are Going Wrong: Not just monitoring, UpTrain isolates error cases and finds common patterns among them. UpTrain provides root cause analysis and helps make improvements faster.
- Enriched Datasets for your testing needs: UpTrain helps create diverse test sets for different use cases. You can also enrich your existing datasets by capturing different edge cases encountered in production.
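A minimal sketch following UpTrain's README-style API (assumed; class and check names may differ between versions, and the API key and data are illustrative):

```python
from uptrain import EvalLLM, Evals

eval_llm = EvalLLM(openai_api_key="sk-...")  # illustrative key

data = [
    {
        "question": "Which substation feeds the Oslo region?",
        "context": "Substation X is the main feed for the Oslo region.",
        "response": "Substation X feeds the Oslo region.",
    }
]

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_RELEVANCE, Evals.FACTUAL_ACCURACY],
)
print(results)
```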
Notes:
- Mentioned as a "provider" in langchain documentation: https://python.langchain.com/docs/integrations/providers/uptrain/
- https://demo.uptrain.ai/evals_demo/ - this site can't provide a secure connection (ERR_SSL_PROTOCOL_ERROR)
https://github.com/lunary-ai/lunary, 1.3k stars, 155 forks, 9 contributors
The production toolkit for LLMs. Observability, prompt management and evaluations.
- Cost and usage analytics, user tracking, tracing, monitoring, evaluation tools.
- Formerly "llmonitor"
- Mentioned in LangChain documentation: integrations/providers/llmonitor
https://docs.arize.com/phoenix
Open source? The Phoenix library itself is open source (https://github.com/Arize-ai/phoenix), while the broader Arize platform is commercial.
- Open-Source Tracing and Evaluation: Trace, evaluate, and iterate on generative AI applications
- AI observability and LLM evaluation platform
- Support for LangChain applications
- Detailed traces of input, embeddings, retrieval, functions, and output messages.
Docs:
- LLM Evaluation. In-depth topics:
- LLM As a Judge
- LLM Model Evals vs. LLM System Evals
- LLM Model Evals
- LLM System Evals
- When To Use Each
- LLM System Evaluation Metrics
- Top LLM System Evaluation Metrics
- LLM RAG Retrieval Metrics
- Exercise: Evaluating Context Relevance
- How To Build An LLM Eval
- LLM Benchmarks: Why Precision, Recall
- How To Run LLM Evals
- Questions To Consider
- Needle In a Haystack Tests
- LLM Tracing
- RAG Evaluation
- Troubleshooting Retrieval and Responses
- Response Evaluation Metrics
- Retrieval Evaluation Metrics
- Troubleshooting RAG Workflows
- Scenario 1: Good Response, Good Retrieval
- Scenario 2: Bad Response, Bad Retrieval
- Scenario 3: Bad Response, Mixed Retrieval
https://github.com/HumanSignal/label-studio, 22.3k stars, 2.8k forks, 167 contributors, 844 declared users
Open Source Data Labeling Platform
- Home: https://labelstud.io/
- Documentation
- Blog
- Videos, Starter Tutorial
- Community, Academic Program
- Slack
The most flexible data labeling platform to fine-tune LLMs, prepare training data or validate AI models. Multi-type data labeling and annotation tool with standardized output format.
This is more for creating labeled datasets, not so much for evaluation. Features:
- Flexible and configurable: Configurable layouts and templates adapt to your dataset and workflow.
- Integrate with your ML/AI pipeline: Webhooks, Python SDK and API allow you to authenticate, create projects, import tasks, manage model predictions, and more.
- ML-assisted labeling: Save time by using predictions to assist your labeling process with ML backend integration.
- Connect your cloud storage: Connect to cloud object storage and label data there directly with S3 and GCP.
- Explore & understand your data: Prepare and manage your dataset in our Data Manager using advanced filters.
- Multiple projects and users: Support multiple projects, use cases and data types in one platform.
Freemium: comparison between open source and enterprise version
https://github.com/deepchecks/deepchecks, 3.8k stars, 269 forks, 52 contributors
Tests for Continuous Validation of ML Models & Data. A holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production.
Notes:
- Geared more towards evaluating the LLM itself, rather than LLM systems/applications.
- Complicated developer experience
- Open-source offering is unique as it focuses heavily on the dashboards and the visualization UI, which makes it easy for users to visualize evaluation results.
https://github.com/microsoft/promptflow, 10.4k stars, 982 forks, 107 contributors, 1.9k declared users
By Microsoft.
Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.
https://microsoft.github.io/promptflow/
Suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications, from ideation, prototyping, testing, evaluation to production deployment and monitoring. It makes prompt engineering much easier and enables you to build LLM apps with production quality.
With prompt flow, you will be able to:
- Create flows that link LLMs, prompts, Python code and other tools together in an executable workflow.
- Debug and iterate your flows, especially tracing interaction with LLMs with ease.
- Evaluate your flows, calculate quality and performance metrics with larger datasets.
- Integrate the testing and evaluation into your CI/CD system to ensure quality of your flow.
- Deploy your flows to the serving platform you choose or integrate into your app's code base easily.
- Collaborate with your team by leveraging the cloud version of Prompt flow in Azure AI.
Visualization of a DAG flow in Promptflow using Visual Studio Code
https://github.com/i-am-bee/bee-agent-framework: 2.5k stars, 293 forks, 42 contributors
By IBM.
This is an open-source platform to discover, run, and compose AI agents from any framework.
- LLMs & GenAI Playground
- This doesn't seem to be open source, which is a deal-breaker
https://github.com/Helicone/helicone : 3.9k stars, 381 forks, 83 contributors
- The open-source LLM-Observability Platform for Developers (logging, monitoring, and debugging).
- One-line integration for monitoring, metrics, evals, agent tracing, prompt management, playground, etc.
- Supports OpenAI SDK, Vercel AI SDK, Anthropic SDK, LiteLLM, LlamaIndex, LangChain…
- Free tier: monthly 100k requests
https://dev.to/guybuildingai/-top-5-open-source-llm-evaluation-frameworks-in-2024-98m
It seems like every week there is a new open-source repo trying to do the same thing as the other 30+ frameworks that already exist.
- DeepEval
- MLflow
- RAGAS
- DeepChecks
- Arize Phoenix
DeepChecks Blog: Best 10 LLM Evaluation Tools in 2024.
- DeepChecks
- LLMbench
- MLflow
- Arize Phoenix
- DeepEval
- RAGAS
- ChainForge
- Guardrails AI
- OpenPipe
- Prompt Flow
https://www.superannotate.com/blog/llm-evaluation-guide#top-10-llm-evaluation-frameworks-and-tools
This lists mostly commercial platforms:
- Amazon Bedrock
- NVidia Nemo
- Azure AI Studio
- Google Vertex AI Studio
- LangSmith
And some open source tools that were already listed:
- TruLens
- DeepEval
- Prompt Flow
LLM Observability Tools: 2024 Comparison
LLM observability refers to gaining total visibility into all layers of an LLM-based software system, including the application, prompt, and answer.
Each response must be reviewed for cleanliness and relevance. To meet your monitoring objectives, you must set up recording of your LLM prompts and replies, followed by contextual analysis.
- Lunary
- LangSmith
- Portkey: maybe not an eval tool: "a proxy that lets you keep a prompt library and supply variables in the template to access your LLM. The tool maintains all of your integration's fundamental parameters, including temperature. It provides tools for caching responses, creating load balancing between models, and configuring fallbacks."
- Helicone
- TruLens
- Arize Phoenix
- Traceloop OpenLLMetry
- Datadog: maybe not an eval tool: "an infrastructure and application monitoring software that has expanded its integrations into the world of LLMs and associated tools. It provides out-of-the-box dashboards for LLM observability. You can enable OpenAI usage tracing"
These are not evaluation tools, but other related resources.
https://github.com/openai/evals, 16.3k stars, 2.7k forks, 459 contributors
A framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
This was the first collection of very diverse LLM evaluation cases. At the beginning, OpenAI asked people to contribute to these evals in order to gain GPT-4 API access. I guess the depth/quality of evaluations differs by branch, but it's worth exploring the registry.
- modelgraded is a set of evals that are graded (evaluated) by LLM
- modelgraded/closedqa is the same for closed-answer (factual) Q&A
https://github.com/THUDM/AgentBench, 2.6k stars, 183 forks, 15 contributors
LLMbench consists of 3 benchmarks where 25 LLMs are evaluated:
- Agent: LLM capabilities
- Safety
- Alignment
Here we describe only the first one.
Home: https://llmbench.ai/agent
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
It encompasses 8 distinct environments to provide a more comprehensive evaluation of the LLMs' ability to operate as autonomous agents in various scenarios:
- Operating System (OS)
- Database (DB)
- Knowledge Graph (KG)
- Digital Card Game (DCG)
- Lateral Thinking Puzzles (LTP)
- House-Holding (HH) (ALFWorld), e.g. following commands in the kitchen
- Web Shopping (WS) (WebShop)
- Web Browsing (WB) (Mind2Web)
It shows a significant gap between leading commercial LLMs and open-source LLMs:
E.g., in the KG environment, GPT-4 scores 58 while Llama 2 scores 8.