How to Optimize the Cost of an LLM System

--

In the era of AI, Large Language Models (LLMs) stand out as remarkable tools for comprehending and producing text that mirrors human language. Yet, beneath their impressive capabilities lies a significant challenge: the cost of running these models can be substantial.

If you use GPT-3.5-turbo with 4k tokens for both input and output, it costs roughly $0.002 per prediction, or $2 per 1,000 predictions. DoorDash's ML models make around 10 billion predictions per day; at $0.002 per prediction, that would be $20 million a day.
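For intuition, that estimate is easy to reproduce; here is a quick back-of-the-envelope calculation using the article's illustrative numbers (not current list prices):

```python
# Back-of-the-envelope LLM cost estimate using the illustrative numbers above,
# not current pricing.
COST_PER_PREDICTION_USD = 0.002        # ~4k tokens on GPT-3.5-turbo
PREDICTIONS_PER_DAY = 10_000_000_000   # DoorDash-scale example

daily_cost = COST_PER_PREDICTION_USD * PREDICTIONS_PER_DAY
print(f"Estimated daily cost: ${daily_cost:,.0f}")  # -> $20,000,000
```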

This article explores various strategies to optimize the cost of the LLM system without compromising on accuracy.

RAG system

Retrieval-Augmented Generation (RAG) is an architectural approach that enhances the efficacy of Large Language Models (LLMs) by incorporating real-time, external data retrieval. Here’s a concise breakdown of how it works:

  1. Ingestion Pipeline: Data is digitized and chunked, preparing it for further processing.
  2. Artifact: The embedded data resides in a vector store, facilitating easy retrieval.
  3. Production: When a user poses a question, the system searches the vector store for the most relevant context (top N chunks). This context, along with the original question and instructions, forms a prompt for the LLM layer, and the LLM (open-source or SaaS) generates an answer (a minimal sketch of these stages follows the figure).
Figure: Retrieval-Augmented Generation (RAG) Pipeline
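For concreteness, here is a minimal sketch of those three stages. The `embed()` function and the in-memory list are placeholders standing in for a real embedding model and vector database:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; swap in a real embedding model in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

# 1. Ingestion: chunk the documents.
documents = ["LLM cost depends on tokens processed.", "RAG retrieves external context."]
chunks = [c for doc in documents for c in doc.split(". ")]

# 2. Artifact: embedded chunks live in a (here: in-memory) vector store.
vector_store = [(chunk, embed(chunk)) for chunk in chunks]

# 3. Production: retrieve top-N chunks for a question and build the prompt.
def retrieve(question: str, n: int = 2):
    q = embed(question)
    scored = [(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)), c)
              for c, v in vector_store]
    return [c for _, c in sorted(scored, reverse=True)[:n]]

question = "What drives LLM cost?"
context = "\n".join(retrieve(question))
prompt = f"Answer using the context.\n\nContext:\n{context}\n\nQuestion: {question}"
# The prompt is then sent to an open-source or SaaS LLM.
```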

50% off on LLM Interview Questions And Answers Course

  • 100+ Interview Questions & Answers: Interview questions from leading tech giants like Google, Microsoft, Meta, and other Fortune 500 companies.
  • Regular Updates
  • Boost Your Earning Potential
  • 100+ Self-Assessment Questions
  • Community Support
  • Certification

As a special offer, we are providing a 50% discount using the coupon code below.

Course Link: https://www.masteringllm.com/course/llm-interview-questions-and-answers?utm_source=medium&utm_medium=post&utm_campaign=llmoptimization

Course Coupon: MED50

Coupon expiration: 31st March 2024

Cost optimization opportunities

Figure: Cost optimization opportunities in RAG

Chunking

LLMs process information in chunks. Most applications use default chunking with overlaps, which can reduce the accuracy of the LLM system and increase cost and latency.

Implementing logical and context-aware chunking based on factors like the nature of the content and the type of question the user is asking can help reduce context size and improve efficiency.
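To make the contrast concrete, here is a small sketch comparing default fixed-size chunking with a simple structure-aware split on paragraph boundaries; real systems would also use headings, sections, or semantic similarity:

```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 50):
    """Default-style chunking: fixed windows with overlap, blind to structure."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def context_aware_chunks(text: str, max_chars: int = 500):
    """Split on paragraph boundaries and pack paragraphs up to a size budget,
    so chunks follow the logical structure of the content."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```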

Semantic caching

Frequently asked questions, greetings, and feedback can burden LLMs unnecessarily. Caching mechanisms such as GPTCache can store and retrieve commonly used responses, saving LLM calls and improving response time. LangChain integrates many caching tools; see its LLM Caching integrations.
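Rather than tying the example to a specific library API, here is a minimal sketch of the idea behind GPTCache-style semantic caching, with `embed_fn` and `call_llm` as hypothetical placeholders for a real embedding model and LLM client:

```python
import numpy as np

class SemanticCache:
    """Return a cached answer when a new question is semantically close
    to one answered before, skipping the LLM call entirely."""

    def __init__(self, embed_fn, threshold: float = 0.9):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def lookup(self, question: str):
        q = self.embed_fn(question)
        for emb, answer in self.entries:
            sim = np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb))
            if sim >= self.threshold:
                return answer  # cache hit: no LLM call needed
        return None

    def store(self, question: str, answer: str):
        self.entries.append((self.embed_fn(question), answer))

# Usage sketch:
#   cached = cache.lookup(question)
#   answer = cached if cached is not None else call_llm(question)
#   if cached is None: cache.store(question, answer)
```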

Search space optimization

I have seen many use cases where developers pass the top N context chunks from search to the LLM without checking the similarity score (embedding cosine similarity) or relevance score (output of a re-ranking model), which inflates the context with irrelevant chunks and reduces the accuracy of the LLM.

Implementing effective search mechanisms that deliver only relevant chunks can reduce the computational load. This can be achieved with metadata-based filtering to reduce the search space, followed by re-ranking.
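Here is a minimal sketch of that flow. It assumes each retrieved chunk carries its text, metadata, and vector-search similarity, and that `rerank_fn` wraps a re-ranking model; all of these names are placeholders for illustration:

```python
def select_context(question, chunks, metadata_filter, rerank_fn,
                   sim_threshold: float = 0.75, top_n: int = 5):
    """Reduce the search space with metadata, drop low-similarity chunks,
    then re-rank and keep only the top N."""
    # 1. Metadata-based filtering narrows the candidate set.
    candidates = [c for c in chunks if metadata_filter(c["metadata"])]
    # 2. Discard chunks below the similarity threshold instead of
    #    blindly passing the raw top N to the LLM.
    candidates = [c for c in candidates if c["similarity"] >= sim_threshold]
    # 3. Re-rank the survivors with a cross-encoder-style relevance model.
    reranked = sorted(candidates,
                      key=lambda c: rerank_fn(question, c["text"]),
                      reverse=True)
    return reranked[:top_n]
```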

Chat History Summarization

We enable coherent conversations by allowing LLMs to remember previous interactions via chat history. However, lengthy chat histories can quickly accumulate tokens, impacting cost efficiency.

To address this, we can reduce the number of tokens needed to process chat history by summarizing lengthy conversations and storing only the essential parts. This retains relevant context while minimizing token usage. A more cost-effective small language model (SLM) can distill lengthy chats into concise summaries.

This trick pays off when you pass more than about 5 question/answer pairs in the chat history. If you only keep the last 2 question/answer pairs, it will not be cost-effective, since the summarization itself requires an extra LLM call.
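A minimal sketch of this pattern, with `call_small_llm` as a hypothetical wrapper around a cheaper model:

```python
def compact_history(history, call_small_llm, keep_last: int = 2, max_turns: int = 5):
    """Summarize older turns with a cheaper model once the history grows,
    keeping the most recent turns verbatim."""
    if len(history) <= max_turns:
        return history  # short histories are cheaper to pass as-is

    old, recent = history[:-keep_last], history[-keep_last:]
    transcript = "\n".join(f"Q: {q}\nA: {a}" for q, a in old)
    summary = call_small_llm(
        "Summarize this conversation, keeping only facts needed "
        "to answer follow-up questions:\n" + transcript
    )
    return [("conversation summary", summary)] + recent
```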

Prompt Compression

The rise of prompting techniques such as chain-of-thought (CoT) and in-context learning (ICL) has driven up prompt length. In some instances, prompts now extend to tens of thousands of tokens, which reduces the capacity for retaining contextual information and increases API costs, both in monetary terms and in computational resources.

Techniques like LLMLingua, tested on various datasets, can compress prompts up to 20x while preserving their capabilities, particularly in in-context learning (ICL) and reasoning tasks. LLMLingua uses a small language model to remove unimportant tokens from prompts, enabling LLMs to infer from compressed prompts. LLMLingua has been integrated into LlamaIndex.
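A sketch of how LLMLingua is typically invoked; treat the exact argument names and return keys as assumptions, since they may differ across library versions:

```python
# pip install llmlingua
from llmlingua import PromptCompressor

context_texts = ["<retrieved chunk 1>", "<retrieved chunk 2>"]  # placeholder context
user_question = "What drives LLM cost?"

compressor = PromptCompressor()  # loads a small LM used to score token importance
result = compressor.compress_prompt(
    context_texts,
    question=user_question,
    target_token=500,  # rough token budget for the compressed prompt
)
compressed_prompt = result["compressed_prompt"]
# compressed_prompt is sent to the LLM instead of the full-length prompt.
```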

Foundation model selection (SLM vs LLM)

A plethora of options are available for the foundation model. Selecting the most suitable model based on requirements is challenging. LLMs are massive models that require substantial computational resources, particularly for training and fine-tuning. For many use cases, using an LLM may not be cost-effective. Evaluating the potential use of smaller, task-specific models for given use cases can help optimize costs.

Create a framework to guide the selection of the most suitable foundation model (SaaS or Open-Source) based on factors like data security, use case, usage patterns, and operational cost.

Model Distillation

The process involves taking a larger model’s knowledge and “distilling” it into a smaller model. The smaller model is trained to mimic the larger model’s outputs, which allows it to achieve similar performance with less computational resources.

Check the Google paper Distilling step-by-step. The research demonstrated that a smaller model (770M parameters) trained with this distillation technique outperformed a much larger model (540B parameters) on benchmark datasets. This suggests the distillation process successfully transferred the larger model's knowledge to the smaller model, allowing it to achieve comparable performance with significantly fewer computational resources.
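As a rough illustration of the general idea (not the exact Distilling step-by-step recipe, which also distills the teacher's rationales), here is a minimal PyTorch sketch of a standard knowledge-distillation loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Standard distillation objective: match the teacher's softened output
    distribution (KL term) plus the usual hard-label cross-entropy.
    Logits have shape (batch, num_classes); labels are class indices."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```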

Fine-tuning

There are complex use cases where you need to provide few-shot examples to the LLM in the prompt, often 10-15 of them, so that the model generalizes well. In such scenarios it is better to fine-tune the model, which reduces the number of tokens required by eliminating the need for few-shot examples while maintaining high-quality results.
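To see why this matters, here is a small sketch that counts the token overhead of the few-shot examples alone using tiktoken; the example strings and request volume are made up for illustration:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Stand-in for the 10-15 few-shot examples a complex task might need per request.
few_shot_examples = "\n\n".join(
    f"Input: example text {i}\nOutput: label {i}" for i in range(12)
)

tokens_per_call = len(enc.encode(few_shot_examples))
calls_per_day = 1_000_000
print(f"Few-shot overhead: {tokens_per_call} tokens/call, "
      f"{tokens_per_call * calls_per_day:,} extra tokens/day")
# A fine-tuned model can drop these examples from every request.
```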

Model compression

Running LLMs can be challenging due to their high GPU compute requirements. Model quantization reduces the precision of model weights, typically from 32-bit floating point (FP32) to lower-bit representations (e.g., 8-bit, 4-bit). This significantly shrinks the model and makes it deployable on devices with limited resources. Techniques like GPTQ and GGML reduce model size while preserving performance, enabling LLM deployment on less resource-intensive hardware. The bitsandbytes library is a powerful tool for quantizing large language models.
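For example, here is a sketch of loading a model in 4-bit with Hugging Face Transformers and bitsandbytes; the model name is only an example and the exact config options can vary across library versions:

```python
# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
# The 4-bit model fits on far smaller GPUs than its fp32/fp16 counterpart.
```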

Inference optimization

LLMs must use the available hardware efficiently to maximize throughput (requests/min). Tools like vLLM, HF TGI, and TensorRT-LLM can speed up LLM inference, improving efficiency.
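As an illustration, a minimal vLLM offline-inference sketch (the model name is just an example); vLLM's continuous batching and PagedAttention are what raise requests/min on the same hardware:

```python
# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = ["Summarize why continuous batching improves GPU utilization."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```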

Infrastructure optimization

Selecting the right set of infrastructure for LLM-based system operationalization is critical and has a high impact on overall operational costs. Tailoring LLM infrastructure costs based on usage patterns (batch processing vs. real-time) and implementing effective Financial Operations (FinOps) strategies can optimize cloud infrastructure costs in alignment with LLM usage.

Choose the right hardware and inference options based on model size and required FLOPs, optimizing for cost and performance.

In conclusion, optimizing the cost of LLMs involves a multi-faceted approach, considering everything from data ingestion to infrastructure optimization. By implementing these strategies, organizations can cost-effectively harness the power of LLMs.

Your feedback, in the form of comments and claps, encourages us to create better content for the community.

Can you give multiple claps? Yes, you can!

Seize Your Last Opportunity for a 50% Discount: Explore the Course Now

Course Link: https://www.masteringllm.com/course/llm-interview-questions-and-answers?utm_source=medium&utm_medium=post&utm_campaign=llmoptimization

Course Coupon: MED50

Coupon expiration: 31st March 2024
