Leaderboard
Several reputable leaderboards for large language models (LLMs) have emerged as industry standards, providing valuable insights into model performance across various dimensions. Here are some of the most notable LLM leaderboards:
- https://huggingface.co/spaces/galileo-ai/agent-leaderboard
- https://huggingface.co/spaces/gaia-benchmark/leaderboard
- https://openai.com/index/swe-lancer/ (OpenAI's SWE-Lancer benchmark)
- https://lmarena.ai/?leaderboard (LMArena, formerly the LMSYS Chatbot Arena)
- https://trackingai.org/home (IQ tests of AI)
- https://klu.ai/llm-leaderboard
Generalist agents
- WindowsAgentArena (Bonatti et al., 2024): https://microsoft.github.io/WindowsAgentArena/
Medical question answering
https://huggingface.co/spaces/openlifescienceai/open_medical_llm_leaderboard
Agentic coding capabilities
As of April 2025, the landscape for ranking AI models on agentic coding capabilities includes at least two significant leaderboards:
- Galileo AI’s Agent Leaderboard, focusing on tool use: https://huggingface.co/spaces/galileo-ai/agent-leaderboard
- Aider’s polyglot benchmark on autonomous editing: https://aider.chat/docs/leaderboards/
Developers should leverage both, mindful of methodological differences and ongoing debates over metrics, to select models best suited for their needs.
Graphics-related
Source (Grok conversation): https://grok.com/share/bGVnYWN5_b797002b-521a-4892-bb2a-1e7b48aa2111
Direct Answer
Ranking AI models on extracting insights or raw data from diagrams, figures, and plots is a niche area, and there is no definitive top-10 list of specialized leaderboards. Based on current research, the most relevant leaderboards and benchmarks for this task are listed below, along with some related resources.
Relevant Leaderboards
- Visual Question Answering (VQA) Leaderboards: the most directly applicable, ranking models on their ability to answer questions about images, which can include charts and plots (see the sketch after this list).
- Document Understanding Competitions: ICDAR competitions regularly include tasks that involve extracting information from documents, which may contain plots or charts, though they are not exclusively focused on graphics.
- Table Extraction Leaderboards: platforms like Kaggle host competitions on table data extraction, which is related to extracting data from plots but not specific to graphical data.
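A minimal illustration of the VQA-style probing these leaderboards rank: the sketch below asks a question about a chart image with the Hugging Face transformers visual-question-answering pipeline. The image path is a placeholder, and the pipeline's default model is a general-purpose VQA model rather than one specialized for plots, so treat this only as a starting point.

```python
# Minimal sketch: ask a VQA model a question about a chart image.
# Assumes `pip install transformers pillow`; "chart.png" is a placeholder
# path, and the default pipeline model is not specialized for plots.
from transformers import pipeline

vqa = pipeline("visual-question-answering")  # loads a default VQA model

result = vqa(image="chart.png",
             question="What is the highest value shown on the y-axis?")
print(result)  # list of {"answer": ..., "score": ...} candidates
```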
Related Benchmarks
- While not public leaderboards with rankings, benchmarks such as Chart-to-Text and ChartX evaluate models on understanding and reasoning over charts, which involves extracting the underlying data. They are research-focused and provide baseline models but lack a community ranking system.
Unexpected Detail
- An interesting aspect is that tools like PlotDigitizer and WebPlotDigitizer are designed for humans to extract data from plots, but there is no comparable competitive leaderboard for AI models, which highlights a gap in the field.
This list covers the key resources available as of March 12, 2025; given the niche nature of the task, users may need to adapt these resources for specific plot-data-extraction workflows.
Open LLM Leaderboard
Hosted by Hugging Face, this leaderboard uses EleutherAI's Language Model Evaluation Harness to benchmark models on six tasks: the AI2 Reasoning Challenge (ARC), HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K[2]. It offers detailed numerical results and model specifics, making it a comprehensive resource for evaluating open-source LLMs[1].
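A minimal sketch of reproducing this kind of harness-based score locally, assuming the lm-eval package (v0.4+) is installed; the model name, task subset, and batch size below are placeholders, and argument names can differ between harness versions.

```python
# Evaluate a small Hugging Face model on two leaderboard tasks with
# EleutherAI's lm-evaluation-harness. Model and tasks are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["arc_challenge", "hellaswag"],            # subset of the six tasks
    num_fewshot=0,
    batch_size=8,
)

# results["results"] maps each task to its metric dict (e.g. acc, acc_norm).
for task, metrics in results["results"].items():
    print(task, metrics)
```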
LMSYS Chatbot Arena Leaderboard
This crowdsourced platform has collected over 1,000,000 human pairwise comparisons to rank LLMs using the Bradley-Terry model. It includes 102 models and uses Elo-scale rankings, providing a user-centric evaluation of chatbot performance[2].
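To make the ranking mechanism concrete, here is a self-contained toy sketch of deriving Elo-style ratings from pairwise votes. The vote data is synthetic and the online Elo update is a simplification; Chatbot Arena itself fits a Bradley-Terry model over the full set of comparisons.

```python
# Toy Elo-style ratings from synthetic pairwise "battles".
# Chatbot Arena fits a Bradley-Terry model over all votes; this simple
# online Elo update is only a stand-in to show the idea.
from collections import defaultdict

K = 32          # update step size
BASE = 1000.0   # initial rating

def expected(r_a, r_b):
    """Probability that A beats B under the Elo/Bradley-Terry curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_ratings(battles):
    """battles: iterable of (winner, loser) model-name pairs."""
    ratings = defaultdict(lambda: BASE)
    for winner, loser in battles:
        p_win = expected(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - p_win)
        ratings[loser] -= K * (1.0 - p_win)
    return dict(ratings)

# Synthetic votes: model_a usually beats model_b, which usually beats model_c.
votes = ([("model_a", "model_b")] * 30 + [("model_b", "model_c")] * 25
         + [("model_c", "model_a")] * 5)
print(sorted(elo_ratings(votes).items(), key=lambda kv: -kv[1]))
```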
Berkeley Function-Calling Leaderboard (BFCL)
BFCL focuses on evaluating LLMs' ability to call functions and tools, a critical capability for agent frameworks such as LangChain and AutoGPT. It features a diverse dataset of 2,000 question-function-answer pairs across multiple languages and scenarios[2].
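As a rough illustration of what scoring a single function-calling example involves, the sketch below exact-matches a model's JSON function call against a reference answer. The example schema and matching rule are simplified assumptions, not BFCL's actual data format or its AST-based checker.

```python
# Hypothetical, simplified scoring of one function-calling example.
# BFCL's real evaluation (AST matching, executable checks) is stricter.
import json

example = {
    "question": "What's the weather in Paris in celsius?",
    "functions": [{
        "name": "get_weather",
        "parameters": {"city": "string", "unit": "string"},
    }],
    "answer": {"name": "get_weather",
               "arguments": {"city": "Paris", "unit": "celsius"}},
}

# Pretend this string came back from the model under test.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'

def score_call(output_str, reference):
    """Return True if the predicted call matches the reference exactly."""
    try:
        call = json.loads(output_str)
    except json.JSONDecodeError:
        return False
    return (call.get("name") == reference["name"]
            and call.get("arguments") == reference["arguments"])

print(score_call(model_output, example["answer"]))  # True
```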
Artificial Analysis LLM Performance Leaderboard
This leaderboard benchmarks LLMs on serverless API endpoints, measuring both quality and performance from a customer perspective. It includes metrics such as Time to First Token (TTFT), throughput, and total response time[2].
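A minimal sketch of how these serving metrics are defined and measured, assuming a streaming client; `stream_tokens` below is a placeholder generator standing in for a real API, so the numbers it produces are simulated.

```python
# Measure time-to-first-token (TTFT), output throughput, and total
# response time for a streamed reply. `stream_tokens` is a stand-in
# for a real streaming API client.
import time

def stream_tokens(prompt):
    """Placeholder generator simulating a model streaming tokens."""
    time.sleep(0.3)                 # pretend prompt-processing latency
    for tok in "This is a simulated streamed answer .".split():
        time.sleep(0.05)            # pretend per-token decode latency
        yield tok

def measure(prompt):
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_at - start                   # time to first token
    total = end - start                             # total response time
    throughput = n_tokens / (end - first_token_at)  # tokens/sec after TTFT
    return ttft, throughput, total

print(measure("Explain LLM leaderboards in one sentence."))
```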
OpenCompass: CompassRank
OpenCompass 2.0 is a versatile benchmarking platform that evaluates LLMs across multiple domains using both open-source and proprietary benchmarks. It includes CompassKit for evaluation tools, CompassHub for benchmark repositories, and CompassRank for leaderboard rankings[4].
ScaleAI Leaderboard
ScaleAI's leaderboards use proprietary, private datasets and expert-led evaluations to provide unbiased and uncontaminated results in a dynamic, contest-like environment[4].
These leaderboards offer diverse perspectives on LLM performance, covering aspects such as reasoning ability, function calling, real-world application performance, and human evaluation. By considering multiple leaderboards, researchers and organizations can gain a comprehensive understanding of LLM capabilities and make informed decisions about model selection and development.
Citations:
- [1] https://klu.ai/llm-leaderboard
- [2] https://www.marktechpost.com/2024/06/02/top-12-trending-llm-leaderboards-a-guide-to-leading-ai-models-evaluation/
- [3] https://www.galileo.ai/blog/llm-benchmarks-performance-evaluation-guide
- [4] https://www.nebuly.com/blog/llm-leaderboards
- [5] https://llm-stats.com
- [6] https://www.reddit.com/r/LocalLLaMA/comments/1d5il4c/is_there_an_unbiased_leaderboard_for_all/
- [7] https://www.shakudo.io/blog/top-9-large-language-models
- [8] https://www.acorn.io/resources/learning-center/llm-leaderboards/