Introduction
Picking the right large language model (LLM) for your use case can be challenging. You will most likely consult leaderboards such as Chatbot Arena or Hugging Face to look up benchmarks, licensing, and other relevant aspects of candidate models. However, each model comes with its own cost-to-quality proposition, so when building a system with LLMs, it is important to map the right task to the right model.
Multi-model solutions are a common hybrid approach: you pick just the right model for each task. Luckily, most LLM frameworks (LangChain, LlamaIndex, CrewAI) support configuring the model down to each task or LLM call.
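For example, here is a minimal sketch of per-task model selection with LangChain. The model names and the task split are illustrative assumptions, not recommendations:

```python
# A minimal sketch of per-task model assignment with LangChain.
# Model names below are illustrative assumptions, not recommendations.
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# A cheap, fast model for high-volume, low-stakes work such as summarization.
summarizer_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# A stronger (and pricier) model reserved for complex reasoning.
reasoning_llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0)

long_document = "..."  # your input text
requirements = "..."   # your task description

summary = summarizer_llm.invoke(f"Summarize: {long_document}")
plan = reasoning_llm.invoke(f"Draft a migration plan for: {requirements}")
```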
However, sometimes, you might need more flexibility.
Enter LLM Routers
While it’s possible to select each LLM at design time, in certain cases it can be useful to do this dynamically. A model router is a potential solution in this scenario.
A model router is an LLM trained to evaluate a query and select the model that best balances quality against cost.
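To make the idea concrete, below is a deliberately simplified sketch, not any vendor's implementation: a toy scoring function stands in for the trained router model, and a threshold tunes the quality/cost trade-off.

```python
# Conceptual sketch of routing: a score stands in for a trained router model.
def complexity_score(query: str) -> float:
    """Stand-in for a trained router that predicts how much a strong model
    would outperform a weak one on this query (0.0 to 1.0)."""
    return min(len(query) / 500, 1.0)  # toy heuristic, not a real predictor

def route(query: str, threshold: float = 0.5) -> str:
    """Return the model to use; raising the threshold favors the cheap model."""
    return "strong-model" if complexity_score(query) >= threshold else "cheap-model"

print(route("What is 2 + 2?"))                           # -> cheap-model
print(route("Design a sharded, multi-region ..." * 20))  # -> strong-model
```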
As a developer, you can build your own or use a pre-trained router. Some of the options are as follows:
RouteLLM
Released by LMSYS Org, RouteLLM is an open-source paper and solution, with the dataset and model available on Hugging Face. LMSYS claims that benchmarks indicate RouteLLM's routers 'demonstrate very strong results, achieving the same performance as commercial routers while being over 40% cheaper'. You can also extend the solution and train your own routing model. According to the presentation, RouteLLM selects from a model pair (high quality vs low cost).
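The snippet below follows the usage shown in the RouteLLM repository at the time of writing; the model names, the 'mf' router, and the threshold value come from its README and may have changed, so treat this as a sketch:

```python
import os
from routellm.controller import Controller

os.environ["OPENAI_API_KEY"] = "sk-..."         # strong model provider
os.environ["ANYSCALE_API_KEY"] = "esecret_..."  # weak model provider

# The "mf" (matrix factorization) router picks between one strong
# and one weak model per request.
client = Controller(
    routers=["mf"],
    strong_model="gpt-4-1106-preview",
    weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)

# The threshold embedded in the model name (0.11593 here) sets the
# cost/quality trade-off; RouteLLM ships a calibration tool to pick it.
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```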
Martian
Martian is an enterprise offering: they work with you to train a router for your specific needs. However, it appears Martian supports a preselected set of models to choose from; you pass the models you want to compare with each API request. Martian exposes four parameters to control the router (illustrated in the sketch after this list):
max_cost - The maximum total cost of the request, in USD. Lets you specify an upper bound on what you are willing to pay for the request.
max_cost_per_million_tokens - The maximum cost per million tokens in the request, in USD. Lets you specify a minimum efficiency for your models.
model - The set of models the router may choose from. Lets you restrict routing to models within a certain cost range.
willingness_to_pay - The value of getting better output, measured in dollars. A value of 0.1, for example, indicates that each 10% improvement in performance is worth 10 cents. If this parameter is not set, it defaults to infinity, meaning the router optimizes only for performance.
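Here is a hypothetical request body showing how those four controls might fit together. The payload shape and values are assumptions for illustration; consult Martian's documentation for the actual API contract:

```python
import json

# Hypothetical payload: field names match the four documented parameters,
# but the overall request shape is an assumption, not Martian's actual API.
request_body = {
    "model": ["gpt-4o", "claude-3-5-sonnet", "llama-3-70b"],  # candidate set
    "messages": [{"role": "user", "content": "Refactor this SQL query ..."}],
    "max_cost": 0.05,                     # hard cap for this request, in USD
    "max_cost_per_million_tokens": 15.0,  # minimum efficiency bar
    "willingness_to_pay": 0.1,            # each 10% quality gain worth $0.10
}
print(json.dumps(request_body, indent=2))
```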
UnifyAI
On top of a commercial router offering, UnifyAI also offers a uniform API endpoint with configurable models and providers, freeing you from having to manage model endpoints in your application. They appear to have a larger list of available models (https://unify.ai/benchmarks). Unify exposes the following controls (via a router string instead of individual parameters):
models: Models to choose from
providers: Specific providers to select from (for example, OpenAI, Perplexity, aws-bedrock, anyscale, and fireworks-ai). A full list is available on the benchmarks page linked above.
The remaining parameters specify relative weights (what matters to you in this specific request): quality (q), cost (c), time-to-first-token (t), and inter-token latency (i).
The resulting string looks something like this:
router@models:model1,model2,model3|providers:provider1,provider2|q:1|c:4.65e-03|t:2.08e-05|i:2.07e-03
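Since the endpoint is OpenAI-compatible, the router string can be passed as the model name. The base URL and the model/provider names below are assumptions based on Unify's documentation at the time of writing; verify them before use:

```python
from openai import OpenAI

# Router string following the format above; model and provider names
# are illustrative, check https://unify.ai/benchmarks for valid values.
router = (
    "router@models:gpt-4o,claude-3-5-sonnet"
    "|providers:openai,anthropic"
    "|q:1|c:4.65e-03|t:2.08e-05|i:2.07e-03"
)

# Assumed OpenAI-compatible endpoint; confirm the URL in Unify's docs.
client = OpenAI(base_url="https://api.unify.ai/v0/", api_key="UNIFY_API_KEY")
response = client.chat.completions.create(
    model=router,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```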
Evaluating LLM Routers / Benchmarks
Some of the available benchmarks for router evaluation:
RouterBench
Probably the only benchmark built specifically for multi-LLM routing systems. RouterBench is 'a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies.'
MT-bench
MT-bench is 'a benchmark consisting of 80 high-quality multi-turn questions. MT-bench is designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions to differentiate models.'
MMLU | arXiv:2009.03300
MMLU is a 'test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more'. Although not a router benchmark itself, it can be used to test the accuracy of a trained router.
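As an illustration, the sketch below runs a router over a small MMLU slice and scores the answers of whichever model it picks. The route() and ask() helpers are hypothetical stand-ins for your router and model clients; 'cais/mmlu' is the MMLU mirror on the Hugging Face Hub:

```python
from datasets import load_dataset

def route(prompt: str) -> str:
    return "cheap-model"  # hypothetical stand-in: your router's choice

def ask(model: str, prompt: str) -> str:
    return "A"            # hypothetical stand-in: call the chosen model

mmlu = load_dataset("cais/mmlu", "college_computer_science", split="test")

correct = 0
sample = mmlu.select(range(50))  # small slice to keep costs down
for row in sample:
    prompt = row["question"] + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip("ABCD", row["choices"])
    )
    answer = ask(route(prompt), prompt)
    correct += answer.strip().startswith("ABCD"[row["answer"]])

print(f"Accuracy with routed models: {correct / len(sample):.2%}")
```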
Cost & Performance Considerations
Considering a router is an LLM itself, you will need to self-host it or use a managed service. Martian and UnifyAI are managed services, while RouteLLM is an option you can host yourself. While deploying a router is an additional cost item, it is worth checking whether the routing savings offset that cost. Tools such as AgentOps may be used to bring cost observability, as in the sketch below.
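Based on the AgentOps documentation at the time of writing, initializing the SDK instruments supported LLM clients so that each call's token usage and cost show up in its dashboard; treat the exact calls below as a sketch, since the SDK's API has evolved:

```python
import agentops
from openai import OpenAI

# init() instruments supported LLM clients (e.g., the OpenAI SDK) so each
# call's tokens and cost are recorded against the current session.
agentops.init(api_key="AGENTOPS_API_KEY")

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Close the session and flush its cost data to the dashboard.
agentops.end_session("Success")
```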
Conclusion
While, on the surface, a router sounds like an interesting idea, if your app serves a fixed set of queries that does not change over time, a router might be unnecessary in production. However, as a development aid for evaluating models, it might still provide value.