Large language models (LLMs) are reshaping how we interact with machines, generating human-quality text, translating languages, and writing different kinds of creative content. But this power comes at a cost. Training and running LLMs can be expensive, limiting their accessibility for many businesses and researchers.
Researchers have found practical strategies for getting high performance from LLMs while staying within budget.
Adaptive RAG for Optimizing Supporting Document Numbers to LLM
Retrieval Augmented Generation (RAG) helps LLMs answer questions by searching through a collection of documents and passing the relevant ones to the LLM as context. However, deciding how many documents to include in the prompt is nuanced. Including more documents can enhance accuracy by providing a richer context, but it also raises costs, because a longer prompt means more tokens for the model to process.
A study illustrates how accuracy changes with the number of supporting documents given to a RAG question-answering system that uses a budget-friendly LLM.
The study's accuracy graph shows the following. With one supporting document, the model is accurate 68% of the time. Accuracy improves to nearly 80% with ten context documents but only slightly surpasses 82% with fifty documents. Accuracy decreases slightly with 100 context documents, suggesting that too much information may overwhelm the model.
This study introduces adaptive RAG, which controls cost by varying the number of supporting documents based on the LLM’s response. By relying on the LLM’s ability to recognize when it cannot answer a query, this method achieves accuracy comparable to large-context RAG setups at a lower cost. Adaptive RAG also improves explainability: because most answers are produced from fewer supporting documents, it is easier to identify which documents were relevant and to trace where a response came from.
A small prompt with a single LLM call is sufficient for most questions. For complex or ambiguous questions, however, the LLM may need to be asked again if its first response does not contain an answer. Adaptive RAG therefore needs a strategy for expanding the prompt on each retry.
There are two primary ways to grow the prompt: a geometric series and a linear series. In the geometric series, the number of documents provided to the LLM doubles with each attempt (e.g., 1, 2, 4, …), so the prompt reaches a large context in few retries while staying cheap for simpler questions that are answered early. In the linear series, a fixed number of documents is added with each attempt (e.g., 5, 10, 15, …), which may become more costly and time-consuming, especially for complex questions that need many retries.
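To make the difference between the two growth patterns concrete, here is a minimal sketch that generates both schedules and totals the documents sent in the worst case, where every attempt fails until the last one. It assumes each attempt re-sends the whole prompt and that retrieval is capped at 100 documents. The function names and the cap are illustrative assumptions, not taken from the study.

```python
from itertools import accumulate


def geometric_schedule(start: int, max_docs: int) -> list[int]:
    """Prompt sizes that double each attempt, capped at max_docs."""
    sizes = []
    k = start
    while k < max_docs:
        sizes.append(k)
        k *= 2
    sizes.append(max_docs)  # final attempt uses everything retrieved
    return sizes


def linear_schedule(step: int, max_docs: int) -> list[int]:
    """Prompt sizes that grow by a fixed step each attempt, capped at max_docs."""
    sizes = list(range(step, max_docs, step))
    sizes.append(max_docs)
    return sizes


def cumulative_docs_sent(sizes: list[int]) -> list[int]:
    """Running total of documents sent if every attempt re-sends its full prompt."""
    return list(accumulate(sizes))


geo = geometric_schedule(start=1, max_docs=100)
lin = linear_schedule(step=5, max_docs=100)
print(len(geo), cumulative_docs_sent(geo)[-1])  # attempts and worst-case documents sent
print(len(lin), cumulative_docs_sent(lin)[-1])
```

Under these assumptions the geometric schedule reaches the full 100 documents in far fewer attempts, and with far fewer documents sent overall, than the linear one.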
If the LLM fails to find an answer with the provided documents, two alternative methods are proposed for building the next prompt: the overlapping prompts strategy and the non-overlapping prompts strategy. The overlapping prompts strategy re-sends the documents the LLM has already seen together with additional ones, while the non-overlapping prompts strategy sends only documents the LLM has not seen yet, which can be helpful in specific scenarios.
The cost versus accuracy plot clearly shows that both adaptive RAG strategies are more efficient than the basic variant, despite having the option to consult more articles if necessary. However, the non-overlapping adaptive RAG strategy, while less costly, doesn’t achieve the same peak performance as the overlapping prompts strategy, even with access to all 100 retrieved context documents.
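The retry loop and the two prompt-building strategies can be sketched as follows. This is an illustrative outline rather than the study's implementation. `ask_llm` is a hypothetical callable that returns `None` when the model reports it cannot answer, `stub_llm` is a stand-in used only for the demo, and geometric growth from one document is assumed.

```python
from typing import Callable, Optional, Sequence


def select_docs(ranked_docs: Sequence[str], round_index: int,
                overlapping: bool, start: int = 1) -> list[str]:
    """Choose the documents for one attempt under geometric growth."""
    size = start * 2 ** round_index
    if overlapping:
        # Re-send everything seen so far plus new documents (top-k, k doubling).
        return list(ranked_docs[:size])
    # Non-overlapping: skip the documents sent in earlier attempts.
    already_seen = start * (2 ** round_index - 1)
    return list(ranked_docs[already_seen:already_seen + size])


def adaptive_rag(question: str, ranked_docs: Sequence[str],
                 ask_llm: Callable[[str, list[str]], Optional[str]],
                 overlapping: bool = True, start: int = 1):
    """Retry with a growing prompt until the LLM answers or documents run out.

    Returns (answer or None, total number of documents sent across attempts).
    """
    round_index = 0
    docs_sent = 0
    while True:
        docs = select_docs(ranked_docs, round_index, overlapping, start)
        if not docs:
            return None, docs_sent  # nothing new left to show the model
        docs_sent += len(docs)
        answer = ask_llm(question, docs)
        if answer is not None:
            return answer, docs_sent
        if overlapping and len(docs) == len(ranked_docs):
            return None, docs_sent  # the full context was already tried
        round_index += 1


def stub_llm(question: str, docs: list[str]) -> Optional[str]:
    """Hypothetical stand-in for a real LLM call, used only for this demo."""
    return "found" if any("answer" in d for d in docs) else None


corpus = [f"doc{i}" for i in range(10)]
corpus[5] = "the answer is here"
print(adaptive_rag("q", corpus, stub_llm, overlapping=True))
print(adaptive_rag("q", corpus, stub_llm, overlapping=False))
```

In this toy run the non-overlapping variant sends fewer documents in total, because it never repeats one. The trade-off is that the model never sees earlier and later documents side by side.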
Cutting Costs and Enhanced Performance with Smaller LLMs
Opting for task-specific, smaller models over large, general-purpose ones brings significant benefits, particularly in cost reduction and performance optimization. These specialized models, tailored to specific tasks like sentiment analysis or text summarization, can deliver superior results within their niche while requiring fewer computational resources for training and deployment, which lowers infrastructure costs. With faster inference times, they also lower operational expenses for processing data. Additionally, smaller models are cheaper to fine-tune and easier to scale, providing flexibility while keeping overall expenses low.
Semantic Caching for Smart Storage and Instant Retrieval of Data
Traditional caching systems work by storing exact matches of queries, but this isn’t always effective for natural-language queries like those sent to LLMs, where the same question can be phrased in many ways. Instead of calling the LLM every time, semantic caching looks up stored queries by similarity of meaning rather than by exact match, making it more likely to find a hit even if the wording isn’t the same.
Tools like GPTCache do this by converting queries into embeddings and running a similarity search over them. When a new query comes in, GPTCache checks if it’s similar to any queries already stored. If it finds a match, it can return the stored answer quickly without calling the LLM again. This not only saves time but also reduces the amount of computing power needed. By caching responses to frequently asked questions or queries, developers can significantly reduce the overall cost of their projects, sometimes by more than 50%.
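The lookup flow described above can be illustrated with a small, self-contained sketch. It does not use GPTCache's API. The `SemanticCache` class, the bag-of-words `embed` function, and the 0.8 threshold are all assumptions made for illustration, and a real system would use a sentence-embedding model and a vector index instead.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query: str):
        """Return the response of the most similar stored query, or None."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


def answer(query: str, cache: SemanticCache, call_llm) -> str:
    """Serve from the cache when possible; otherwise call the LLM and store the result."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.put(query, response)
    return response
```

The threshold is the main design choice. Set too low, the cache returns answers to questions that only look similar. Set too high, it behaves like an exact-match cache and saves little.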
Prompt Compression Boosts Model Efficiency and Cuts RAG Costs by 80%
Prompt compression simplifies the original prompt while keeping the important details. It helps the language model process the inputs faster to provide quick and accurate answers. This method works because language often has unnecessary repetition. There are various prompt compression techniques to reduce LLM cost.
AutoCompressors are language models that condense long text into short vector representations called summary vectors, which then act as soft prompts for the model. In soft prompting, a small number of trainable embeddings are added to the input in place of written instructions and are optimized for the task at hand.
Selective context compression removes predictable tokens from the prompt based on their self-information scores. A token's self-information is the negative log of its probability, so highly predictable tokens score low. Tokens with low self-information are removed to compress the prompt while retaining the most informative content.
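A toy version of this idea is sketched below. Self-information is computed as -log2 p(token), but the probabilities come from smoothed unigram counts rather than from a language model, which is a deliberate simplification. The `reference_counts` table, the function names, and the `keep_ratio` parameter are hypothetical.

```python
import math
from collections import Counter


def self_information(tokens: list[str], reference_counts: Counter) -> list[float]:
    """Score each token as -log2 p(token) using add-one smoothed unigram counts."""
    total = sum(reference_counts.values())
    vocab = len(reference_counts)
    return [
        -math.log2((reference_counts.get(t, 0) + 1) / (total + vocab + 1))
        for t in tokens
    ]


def compress(text: str, reference_counts: Counter, keep_ratio: float = 0.5) -> str:
    """Keep the highest-scoring fraction of tokens, preserving their original order."""
    tokens = text.split()
    scores = self_information([t.lower() for t in tokens], reference_counts)
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore reading order
    return " ".join(tokens[i] for i in keep)


# Hypothetical frequency table: common function words are highly predictable.
reference_counts = Counter({"the": 50, "of": 30, "is": 20, "a": 20, "in": 15})
print(compress("The capital of France is Paris", reference_counts, keep_ratio=0.5))
```

Frequent function words get low scores and are dropped first, while rare content words survive. A context-aware language model would give better scores than unigram counts, since a word's predictability depends on what precedes it.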
LLMLingua offers a powerful solution for prompt compression, allowing for the efficient transformation of prompts into streamlined representations without sacrificing meaning. Using compact, well-trained language models like GPT2-small or LLaMA-7B, LLMLingua intelligently identifies and removes non-essential tokens, achieving up to 20x compression while maintaining output quality. This enables cost-effective processing of prompts, reducing token count and inference times without compromising accuracy.
One study evaluates LongLLMLingua prompt compression using a query about Nicolas Cage’s education as an example. First, relevant information from Cage’s Wikipedia page is combined with the query to create a prompt for the language model. LongLLMLingua is then applied to compress the prompt, reducing input tokens by nearly seven times and saving $0.00202 on that single query. Despite this compression, the language model accurately identifies Cage’s education in its response, demonstrating that the method can shrink prompts for efficient inference without compromising accuracy.
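The saving on a single query is simple arithmetic: the tokens removed multiplied by the price per input token. The sketch below shows the calculation. The token count and price in the usage line are hypothetical placeholders, not the figures from the study.

```python
def compression_savings(original_tokens: int, compression_ratio: float,
                        usd_per_1k_input_tokens: float) -> float:
    """Dollar saving per query when the prompt shrinks by compression_ratio."""
    compressed_tokens = original_tokens / compression_ratio
    saved_tokens = original_tokens - compressed_tokens
    return saved_tokens / 1000 * usd_per_1k_input_tokens


# Hypothetical inputs: a 7,000-token prompt, 7x compression, $0.001 per 1k tokens.
print(compression_savings(7000, 7, 0.001))
```

A saving of a fraction of a cent looks negligible for one query, but it scales linearly with query volume.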
By adopting these budget-friendly strategies, companies and researchers can confidently navigate the intricacies of LLM usage, achieving impressive outcomes without overspending. Striking the right balance between cost and quality is important, and RandomWalk can help you learn more about effective knowledge management strategies. Visit our website to explore how we can revolutionize your approach to knowledge management and integrate state-of-the-art AI technology for your use cases.