Leaderboard
Several reputable leaderboards for large language models (LLMs) have emerged as industry standards, providing valuable insights into model performance across various dimensions. Here are some of the most notable LLM leaderboards:
- https://huggingface.co/spaces/galileo-ai/agent-leaderboard
- https://huggingface.co/spaces/gaia-benchmark/leaderboard
- https://openai.com/index/swe-lancer/ (OpenAI's SWE-Lancer benchmark)
- https://lmarena.ai/?leaderboard (LMArena, formerly the LMSYS Chatbot Arena)
- https://trackingai.org/home (IQ tests of AI)
- https://klu.ai/llm-leaderboard
Generalist agents
- WindowsAgentArena (Bonatti et al., 2024): https://microsoft.github.io/WindowsAgentArena/
Medical question answering
https://huggingface.co/spaces/openlifescienceai/open_medical_llm_leaderboard
Agentic coding capabilities
As of April 2025, the landscape for ranking AI models on agentic coding capabilities includes at least two significant leaderboards:
- Galileo AI’s Agent Leaderboard, focusing on tool use: https://huggingface.co/spaces/galileo-ai/agent-leaderboard
- Aider’s polyglot benchmark on autonomous editing: https://aider.chat/docs/leaderboards/
Developers should leverage both, mindful of methodological differences and ongoing debates over metrics, to select models best suited for their needs.
Graphics-related
Source (Grok conversation): https://grok.com/share/bGVnYWN5_b797002b-521a-4892-bb2a-1e7b48aa2111
Direct Answer
Ranking AI models on extracting insights or raw data from diagrams, figures, and plots is a niche area, and there is no definitive top-10 list of specialized leaderboards. Based on current research, the most relevant leaderboards and benchmarks for this task are listed below, along with some related resources.
Relevant Leaderboards
- Visual Question Answering (VQA) Leaderboards: the most directly applicable, ranking models on their ability to answer questions about images, which can include charts and plots (see the sketch after this list).
- Document Understanding Competitions: ICDAR competitions regularly include tasks that involve extracting information from documents, which may contain plots or charts, though they are not exclusively focused on graphics.
- Table Extraction Leaderboards: platforms like Kaggle host competitions on table data extraction, which is related to extracting data from plots but not specific to graphical data.
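A minimal illustration of the VQA-style probing these leaderboards rank: the sketch below asks a question about a chart image with the Hugging Face transformers visual-question-answering pipeline. The image path is a placeholder, and the pipeline's default model is a general-purpose VQA model rather than one specialized for plots, so treat this only as a starting point.

```python
# Minimal sketch: ask a VQA model a question about a chart image.
# Assumes `pip install transformers pillow`; "chart.png" is a placeholder
# path, and the default pipeline model is not specialized for plots.
from transformers import pipeline

vqa = pipeline("visual-question-answering")  # loads a default VQA model

result = vqa(image="chart.png",
             question="What is the highest value shown on the y-axis?")
print(result)  # list of {"answer": ..., "score": ...} candidates
```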
Related Benchmarks
- While not public leaderboards with rankings, benchmarks such as Chart-to-Text and ChartX evaluate models on understanding and reasoning over charts, which involves extracting the underlying data. They are research-focused and provide baseline models but lack a community ranking system.
Unexpected Detail
- An interesting aspect is that tools like PlotDigitizer and WebPlotDigitizer are designed for humans to extract data from plots, but there is no comparable competitive leaderboard for AI models, which highlights a gap in the field.
This list covers the key resources available as of March 12, 2025; given the niche nature of the task, users may need to adapt these resources for specific plot-data-extraction workflows.
Open LLM Leaderboard
Hosted by Hugging Face, this leaderboard uses EleutherAI's Language Model Evaluation Harness to benchmark models on six tasks: the AI2 Reasoning Challenge (ARC), HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K[2]. It offers detailed numerical results and model specifics, making it a comprehensive resource for evaluating open-source LLMs[1].
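A minimal sketch of reproducing this kind of harness-based score locally, assuming the lm-eval package (v0.4+) is installed; the model name, task subset, and batch size below are placeholders, and argument names can differ between harness versions.

```python
# Evaluate a small Hugging Face model on two leaderboard tasks with
# EleutherAI's lm-evaluation-harness. Model and tasks are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["arc_challenge", "hellaswag"],            # subset of the six tasks
    num_fewshot=0,
    batch_size=8,
)

# results["results"] maps each task to its metric dict (e.g. acc, acc_norm).
for task, metrics in results["results"].items():
    print(task, metrics)
```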
LMSYS Chatbot Arena Leaderboard
This crowdsourced platform has collected over 1,000,000 human pairwise comparisons to rank LLMs using the Bradley-Terry model. It includes 102 models and uses Elo-scale rankings, providing a user-centric evaluation of chatbot performance[2].
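To make the ranking mechanism concrete, here is a self-contained toy sketch of deriving Elo-style ratings from pairwise votes. The vote data is synthetic and the online Elo update is a simplification; Chatbot Arena itself fits a Bradley-Terry model over the full set of comparisons.

```python
# Toy Elo-style ratings from synthetic pairwise "battles".
# Chatbot Arena fits a Bradley-Terry model over all votes; this simple
# online Elo update is only a stand-in to show the idea.
from collections import defaultdict

K = 32          # update step size
BASE = 1000.0   # initial rating

def expected(r_a, r_b):
    """Probability that A beats B under the Elo/Bradley-Terry curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_ratings(battles):
    """battles: iterable of (winner, loser) model-name pairs."""
    ratings = defaultdict(lambda: BASE)
    for winner, loser in battles:
        p_win = expected(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - p_win)
        ratings[loser] -= K * (1.0 - p_win)
    return dict(ratings)

# Synthetic votes: model_a usually beats model_b, which usually beats model_c.
votes = ([("model_a", "model_b")] * 30 + [("model_b", "model_c")] * 25
         + [("model_c", "model_a")] * 5)
print(sorted(elo_ratings(votes).items(), key=lambda kv: -kv[1]))
```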
Berkeley Function-Calling Leaderboard (BFCL)
BFCL focuses on evaluating LLMs' ability to call functions and tools, a critical capability for agent frameworks such as LangChain and AutoGPT. It features a diverse dataset of 2,000 question-function-answer pairs across multiple languages and scenarios[2].
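As a rough illustration of what scoring a single function-calling example involves, the sketch below exact-matches a model's JSON function call against a reference answer. The example schema and matching rule are simplified assumptions, not BFCL's actual data format or its AST-based checker.

```python
# Hypothetical, simplified scoring of one function-calling example.
# BFCL's real evaluation (AST matching, executable checks) is stricter.
import json

example = {
    "question": "What's the weather in Paris in celsius?",
    "functions": [{
        "name": "get_weather",
        "parameters": {"city": "string", "unit": "string"},
    }],
    "answer": {"name": "get_weather",
               "arguments": {"city": "Paris", "unit": "celsius"}},
}

# Pretend this string came back from the model under test.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'

def score_call(output_str, reference):
    """Return True if the predicted call matches the reference exactly."""
    try:
        call = json.loads(output_str)
    except json.JSONDecodeError:
        return False
    return (call.get("name") == reference["name"]
            and call.get("arguments") == reference["arguments"])

print(score_call(model_output, example["answer"]))  # True
```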
Artificial Analysis LLM Performance Leaderboard
This leaderboard benchmarks LLMs on serverless API endpoints, measuring both quality and performance from a customer perspective. It includes metrics such as Time to First Token (TTFT), throughput, and total response time[2].
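A minimal sketch of how these serving metrics are defined and measured, assuming a streaming client; `stream_tokens` below is a placeholder generator standing in for a real API, so the numbers it produces are simulated.

```python
# Measure time-to-first-token (TTFT), output throughput, and total
# response time for a streamed reply. `stream_tokens` is a stand-in
# for a real streaming API client.
import time

def stream_tokens(prompt):
    """Placeholder generator simulating a model streaming tokens."""
    time.sleep(0.3)                 # pretend prompt-processing latency
    for tok in "This is a simulated streamed answer .".split():
        time.sleep(0.05)            # pretend per-token decode latency
        yield tok

def measure(prompt):
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_at - start                   # time to first token
    total = end - start                             # total response time
    throughput = n_tokens / (end - first_token_at)  # tokens/sec after TTFT
    return ttft, throughput, total

print(measure("Explain LLM leaderboards in one sentence."))
```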
OpenCompass: CompassRank
OpenCompass 2.0 is a versatile benchmarking platform that evaluates LLMs across multiple domains using both open-source and proprietary benchmarks. It includes CompassKit for evaluation tools, CompassHub for benchmark repositories, and CompassRank for leaderboard rankings[4].
ScaleAI Leaderboard
ScaleAI's leaderboards use proprietary, private datasets and expert-led evaluations to provide unbiased and uncontaminated results in a dynamic, contest-like environment[4].
These leaderboards offer diverse perspectives on LLM performance, covering aspects such as reasoning ability, function calling, real-world application performance, and human evaluation. By considering multiple leaderboards, researchers and organizations can gain a comprehensive understanding of LLM capabilities and make informed decisions about model selection and development.
Citations:
- [1] https://klu.ai/llm-leaderboard
- [2] https://www.marktechpost.com/2024/06/02/top-12-trending-llm-leaderboards-a-guide-to-leading-ai-models-evaluation/
- [3] https://www.galileo.ai/blog/llm-benchmarks-performance-evaluation-guide
- [4] https://www.nebuly.com/blog/llm-leaderboards
- [5] https://llm-stats.com
- [6] https://www.reddit.com/r/LocalLLaMA/comments/1d5il4c/is_there_an_unbiased_leaderboard_for_all/
- [7] https://www.shakudo.io/blog/top-9-large-language-models
- [8] https://www.acorn.io/resources/learning-center/llm-leaderboards/