SOTA on Local LLMs - statnett/Talk2PowerSystem GitHub Wiki
This page serves as an overview and brief state of the art of currently available open source language models. It covers possible applications of open source LLMs within the project, an overview of the major current families of open source LLMs, and approaches to fine-tuning LLMs.
Applications
The first versions of the chatbot agent developed in the project are based on OpenAI models. Specifically, they use GPT-4o as the driving LLM that interprets the question, handles tool calls, and processes responses to deliver a final answer. There are a few different ways in which other language models can be incorporated into this process.
LLM as Primary Agent
The most obvious approach is to replace the OpenAI models with a locally hosted open source model. This is the best way to ensure oversight over data usage throughout the whole process. This kind of application requires a comparably powerful model that can work with large context windows and supports intricate tool use. In practice, the only models worth considering are chat- and instruct-focused LLMs that run on a cluster of powerful GPUs. Only a few open source models fit these criteria, but they do exist and are available.
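For illustration, here is a minimal sketch of what such a swap could look like, assuming the local model is served through an OpenAI-compatible endpoint (as e.g. vLLM provides); the URL, model name, and tool definition below are placeholders, not the project's actual configuration:

```python
from openai import OpenAI

# Point the agent's client at a locally hosted, OpenAI-compatible
# server instead of api.openai.com. URL and model are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# A minimal tool definition in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "run_sparql",
        "description": "Execute a SPARQL query against the CIM knowledge graph.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",  # whichever model the server hosts
    messages=[{"role": "user", "content": "List all substations in the grid model."}],
    tools=tools,
)
print(response.choices[0].message)
```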
Applying fine-tuning to models of this size is also beyond the scope of our project, as the resource, time, and data requirements are quite high. A long-term possibility is to create a CIM-specialized LLM that loses some of its general chat and instruct capabilities but excels in the domain.
LLM as Planner
Reasoner models are quite popular nowadays and exhibit some impressive abilities. Unfortunately, they are all either incapable of or inferior at actual tool use, so their current versions are unsuitable for the role of primary agent. They could, however, serve as a planner tool that breaks the question down into steps and instructs how to use the tools to best effect, leaving the primary agent to only carry out the plan.
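As a rough sketch of the idea, a planner could be exposed to the primary agent as just another tool; the endpoint and model name below are illustrative assumptions:

```python
from openai import OpenAI

# Client for a locally hosted reasoner model (placeholder URL).
planner_client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def plan_steps(question: str) -> str:
    """Tool the primary agent can call: asks a reasoner model to break
    the question into explicit tool-level steps, without answering it."""
    response = planner_client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-0528",  # placeholder reasoner model
        messages=[
            {"role": "system", "content": (
                "Break the user's question into a numbered plan. For each "
                "step, name the tool to call and its inputs. Do not answer "
                "the question yourself.")},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```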
The effectiveness of such a tool is as yet untested for our use case, but it might be very valuable for truly complex multi-step queries. Still, there is much uncertainty, and it is an improvement to consider in the long term.
SLM as a Specialized Tool
There is also the option of deploying a powerful specialized tool based on a fine-tuned Small Language Model (SLM). The advantage here is that proper fine-tuning of an SLM can produce a model with narrowly specialized knowledge that surpasses any LLM in its specialty. We have explored training a model for writing complex SPARQL queries over a specific version of the ontology, but that is not the only possible application. Still, each unique field would require that a separate SLM be trained for its specific task.
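As a sketch of how such a tool could be wrapped, the snippet below calls a hypothetical fine-tuned SLM behind an OpenAI-compatible endpoint and uses rdflib to reject queries that do not even parse; the model name and endpoint are made up for illustration:

```python
from openai import OpenAI
from rdflib.plugins.sparql import prepareQuery

# Endpoint serving the (hypothetical) fine-tuned SLM.
slm_client = OpenAI(base_url="http://localhost:8001/v1", api_key="unused")

def write_sparql(question: str) -> str:
    """Generate a SPARQL query with the specialized SLM and validate
    its syntax before handing it back to the primary agent."""
    response = slm_client.chat.completions.create(
        model="cim-sparql-slm",  # placeholder name for the fine-tuned model
        messages=[{"role": "user", "content": question}],
    )
    query = response.choices[0].message.content
    prepareQuery(query)  # raises an exception on invalid SPARQL syntax
    return query
```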
Notable Families of Open Source LLMs
Recommending a specific model is a challenge. For one, each of the major "families" of models releases one or two new versions per year, and entirely new models appear frequently. More importantly, determining which of the currently available models is "best" in general is nearly impossible.
The Open LLM Leaderboard used to be the universally accepted place for unbiased comparison of baseline models. It was discontinued in early 2025 after its maintainers concluded that a general ranking, even across multiple custom, actively maintained datasets, is neither possible nor useful.
A tentative successor is the UGI Leaderboard, although it is not nearly as widely adopted, does not have complete coverage of models, and its different metrics produce significantly different rankings.
The screenshot above presents the top-performing open models there based on one specific metric in June 2025. Unfortunately, performance varies massively between metrics and the rankings are often dominated by a proliferation of merged or fine-tuned models. There is an overwhelming number of such models, and some subset of them ranks highly on each metric in a manner reminiscent of p-hacking.
Let's look through just the baseline open-source models across the available metrics.
Above, the UGI metric. "UGI: Uncensored General Intelligence. A benchmark measuring both willingness to answer and accuracy in fact-based contentious questions. The test set is made of roughly 100 questions/tasks, covering topics that are commonly difficult to get LLMs to answer."
Above, the W/10 metric. "W/10: Willingness/10. A more narrow subset of the UGI questions, solely focused on measuring how far a model can be pushed before going against its instructions or refusing to answer."
Above, the NatInt metric. "NatInt: Natural Intelligence. A general knowledge quiz covering real-world subjects that LLMs are not commonly benchmarked on, such as pop culture trivia. This measures if the model understands a diverse range of topics, as opposed to over-training on textbook information and the types of questions commonly tested on benchmarks."
Unsurprisingly, model size is crucial, with proper LLMs outperforming SLMs across the board. Very broadly, language models vary in size from 60M to over 1T parameters (almost five orders of magnitude). As a rule of thumb: anything below 3B parameters needs to be fine-tuned for any application; 7-20B parameters produce competent small language models that can do small, clearly defined tasks reasonably well but need fine-tuning to outperform LLMs on those same tasks; and the flagship models have over 400B parameters. New models and new generations of popular families are released frequently, and even the imperfect comparison dashboards that are available run a few months behind on evaluating the newest releases.
DeepSeek
The DeepSeek family includes the current best performing open source models on benchmarks. Most notable are DeepSeek-R1-0528 and DeepSeek-V3-0324, both 685B parameter models: a reasoner released in May 2025 and a chat model released in March 2025, respectively.
The family also includes a series of distilled models ranging in size from 1.5B to 70B parameters. **Distillation** is the process of training a smaller LM to emulate the behavior of a larger and better LM. The final result should be a cheaper and quicker model with similar behavior and minimal loss of capability. These distilled models are variously based on the Qwen and Llama architectures.
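At its core, distillation trains the student to match the teacher's output distribution rather than just the hard training labels. A minimal sketch of the classic soft-label loss follows; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both next-token distributions, then minimize the KL
    # divergence pulling the student towards the teacher.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2
```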
Mistral
Mistral is a French company specializing in open source models. Their models should be deployable in Azure in a similar fashion to OpenAI's models, in addition to being available for local hosting. Their current flagship model is Mistral-large-2411 from November 2024, a 123B parameter model available under Mistral's own research license.
Another model of interest is Mistral-Small-3.2-24B-Instruct-2506, a 24B parameter model released in June 2025 and available under an Apache 2.0 license. They also have smaller SLMs that are frequently updated.
QWEN
Qwen is the family of open source language models released by Alibaba. They released their third generation of models in May 2025 including Qwen3-235B-A22B (235B parameters) and several SLMs in the sub-1B to 32B parameter range.
Other
There are many other families of models with new ones emerging all the time and others dropping out of popularity.
MiniMax
MiniMax-M1 is a new 456B parameter open source model by the Chinese company MiniMax AI, released in June 2025. It seems to place near the top of most benchmarks, but there are no smaller models in the family for now.
Llama
Llama models by Meta used to be among the top performers. The third generation varies in size from 1B to 70B parameters, with Llama-3.3-70B-Instruct being their current top performer. The fourth generation is beginning to appear, with the first 17B (active parameter) models released in April and May 2025.
Microsoft Phi
Microsoft's Phi family got its fourth generation in April to June 2025. The models vary in size from 3.8B to 14.7B parameters, so they are good SLM candidates.
Fine-tuning
Fine-tuning language models refers to taking a specific (open source) model with existing weights as a baseline and performing additional training to change its behavior. It can be done in different ways and with a variety of purposes, but there are a few important commonalities:
- It requires significantly more computational power than simply running the model
- It requires a relatively large dataset
- It is aimed at a particular task and done with a corresponding training technique
- The amount of data and compute needed to reach a particular level of improvement is not known ahead of time
Generally speaking, fine-tuning is the most complex and expensive step. It is only undertaken when prompt improvements and custom tools have reached their limit.
As a note, fine-tuning can be done on models of any size, but for practical purposes we are limited to considering only SLMs. The most likely use case for fine-tuning in our work is producing an advanced custom tool: for example, a specialized SLM capable of writing complex queries using a specific version of the CIM ontology. Such a tool could then write queries that follow the ontology more reliably, more quickly, and at lower cost.
Low-Rank Adaptation (LoRA)
The baseline approach to fine-tuning, full fine-tuning, simply continues the model's training on a new, more specialized dataset. This means we feed it an input, compare the output it produces to the expected output, and update all weights according to our training strategy.
Low-rank Adaptation of Large Language Models refers to a technique where additional, relatively small matrices (LoRA adapters) are added to the model to encode the new knowledge. In the course of training, only these newly added matrices are updated, which reduces the computational requirements significantly and ensures the specialized knowledge is encoded separately from the general language knowledge. The LoRA approach has been shown to match or exceed the results of full fine-tuning.
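Using the Hugging Face peft library, attaching LoRA adapters takes only a few lines; the base model and hyperparameters below are illustrative, not a recommendation:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")

lora_config = LoraConfig(
    r=16,                                 # rank of the update matrices
    lora_alpha=32,                        # scaling of the adapter output
    target_modules=["q_proj", "v_proj"],  # attach to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```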
QLoRA (i.e. Quantized LoRA) is a further optimization that quantizes the values of the model during training to further reduce the memory requirements. This comes at some cost in quality of the results, but it would allow us to fine-tune mid-sized language models whose baseline performance is closer to the flagship models than to SLMs.
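In code, QLoRA amounts to loading the frozen base model through a 4-bit quantization config before attaching the adapters; a sketch with bitsandbytes, again with placeholder model and settings:

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the frozen base weights in 4-bit NF4 to cut memory roughly 4x;
# the LoRA adapters stay in higher precision and are trained as usual.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Instruct-2501",  # placeholder mid-sized model
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         task_type="CAUSAL_LM"))
```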
Fine-tuning Techniques
There is a variety of fine-tuning techniques, each appropriate under different circumstances.
Continued Pre-training (CPT)
This method can produce a domain-adapted language model. It essentially continues the regular training but replaces the general world dataset with a large domain-specific dataset. It is easy to implement, the data is easy to prepare, the computational requirements are relatively small, and it can add knowledge to the model. The downside is that it continues to train the model on pure next-token prediction, which can make it forget how to chat, reason, and perform other more advanced interactions.
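A minimal CPT sketch with the Hugging Face Trainer, doing plain next-token prediction over a raw domain corpus; the model choice, file name, and hyperparameters are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "microsoft/Phi-4-mini-instruct"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# A raw text file with domain documents (placeholder file name).
dataset = load_dataset("text", data_files="cim_domain_corpus.txt")["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-out", num_train_epochs=1),
    train_dataset=dataset,
    # mlm=False means plain causal (next-token) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```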
Supervised Fine-tuning (SFT)
This is similar to CPT, but here each input is paired with a target output. This makes the dataset harder to prepare, but otherwise the approach shares all the advantages of CPT and minimizes the downside: the model will respond as shown, though its general knowledge can still be degraded.
The main challenge here is that the required dataset is still quite large.
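With the trl library, SFT on prompt/completion pairs looks roughly like the sketch below; the base model and the two toy rows are placeholders, and a real dataset needs to be far larger:

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Two toy prompt/completion rows; the SPARQL bodies are elided on purpose.
data = Dataset.from_list([
    {"prompt": "List all AC lines longer than 100 km.",
     "completion": "SELECT ?line WHERE { ... }"},
    {"prompt": "Which substations connect to busbar B1?",
     "completion": "SELECT ?sub WHERE { ... }"},
])

trainer = SFTTrainer(
    model="microsoft/Phi-4-mini-instruct",  # placeholder base model
    train_dataset=data,
    args=SFTConfig(output_dir="sft-out"),
)
trainer.train()
```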
Direct Preference Optimization (DPO)
A further extension of SFT: here the dataset consists of an input and multiple ranked outputs. This teaches the model even more effectively what the desired outputs look like. The dataset does not need to be as large, but it is even harder to collect or produce.
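A DPO sketch with trl, where each row pairs a prompt with a preferred and a rejected answer; the model and toy data are placeholders:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "microsoft/Phi-4-mini-instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# One toy preference row; real training needs many such rankings.
data = Dataset.from_list([
    {"prompt": "List all AC lines longer than 100 km.",
     "chosen": "SELECT ?line WHERE { ... }",  # the preferred query (elided)
     "rejected": "SELECT * WHERE { ... }"},   # a plausible but worse query
])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out"),
    train_dataset=data,
    processing_class=tokenizer,
)
trainer.train()
```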
Reinforcement Learning (RL)
The previous three techniques are all examples of supervised learning: for any given output, we can immediately calculate its desirability and adjust the model accordingly. RL is a type of semi-supervised learning where the training signals are provided to the model only occasionally.
Similar to the CPT technique, the training dataset consists only of prompts. The important addition is a reward function that rates the quality of answers. Much of the complexity in RL comes from this reward function, as the results are determined entirely by the kind of behavior it rewards. In practice, this is best implemented with access to a live environment that can give real-time feedback during training and presents a realistic, consistent state of the world the model is expected to interact with in production.
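For our SPARQL use case, such a reward function might execute each generated query against a live triplestore and score the outcome; a sketch, where the endpoint URL and scoring scheme are illustrative assumptions:

```python
from SPARQLWrapper import JSON, SPARQLWrapper

# Live SPARQL endpoint used as the training environment (placeholder URL).
endpoint = SPARQLWrapper("http://localhost:7200/repositories/cim")
endpoint.setReturnFormat(JSON)

def reward(generated_query: str, expected_rows: list) -> float:
    """Rate a generated query: failures score lowest, executable but
    wrong results score low, matching results score highest."""
    endpoint.setQuery(generated_query)
    try:
        rows = endpoint.query().convert()["results"]["bindings"]
    except Exception:
        return -1.0  # did not parse or failed at the endpoint
    return 1.0 if rows == expected_rows else 0.0
```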
This approach is core to many of the more impressive feats LLMs demonstrate nowadays, such as reasoning and performing complex multi-step tasks. Unfortunately, it is the most computationally expensive approach, and the reward functions are difficult to define and often produce unexpected behavior.
Summary
The application of locally hosted LLMs can come in two very distinct varieties.
Replacing the OpenAI models as the central "brain" behind the agent can be done with some of the biggest flagship models. Current viable candidates are Mistral-large-2411, DeepSeek-V3-0324, and MiniMax-M1, but they require a dedicated cluster of powerful GPUs (e.g. an 8xH200 cluster, i.e. roughly 1.1 TB of VRAM).
Alternatively, fine-tuned SLMs can be used in specialized tools. There is a wide variety of available models, but there is a significant degree of experimentation here, and a large dataset and a well-chosen training technique are probably more important than the baseline model. SLMs have much lower GPU requirements for day-to-day operation, and even their fine-tuning can be carried out on a single H200 GPU or a cluster of less powerful GPUs.