Mastering Caching Methods in Large Language Models (LLMs)
Large Language Models (LLMs) like OpenAI’s GPT-4 have transformed natural language processing, enabling applications ranging from chatbots to content generation. However, their computational demands can lead to high latency and increased operational costs, especially when handling a large volume of requests.
Caching is a powerful technique to mitigate these issues by storing and reusing responses, thus improving response times and reducing costs. In this comprehensive guide, we’ll explore various caching strategies to optimize LLM performance, including:
- In-memory caching
- Disk-based caching
- Semantic caching
We’ll provide code examples to demonstrate how caching improves response times and visually explain the concepts.
Are you preparing for a Gen AI interview? Check out our LLM Interview Preparation Course:
- 100+ Questions spanning 14 categories & Real Case Studies
- 100+ curated assessments for each category
- Well-researched real-world interview questions based on FAANG & Fortune 500 companies
- Focus on Visual learning
- Certificate of completion
50% off Coupon Code — LLM50
Link for the course:
Understanding Caching in LLMs
Caching involves storing previously computed responses so they can be quickly retrieved without recomputation. In the context of LLMs, caching helps:
- Reduce Latency: Faster responses enhance user experience.
- Lower Costs: Fewer API calls translate to reduced operational expenses.
- Improve Scalability: Efficiently handle higher volumes of requests.
Types of Caching Strategies
In-Memory Caching
Stores data in RAM for rapid access.
Pros:
- Extremely fast read/write speeds.
- Ideal for high-frequency queries.
Cons:
- Limited storage capacity.
- Data is lost if the system restarts.
Example: Using a dictionary in Python or tools like Redis for quick data retrieval.
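For instance, a minimal sketch using the redis-py client (this assumes a Redis server running locally; the key prefix, connection details, and one-hour expiry are illustrative choices):

```python
import redis

# Assumes a local Redis server; host/port are illustrative.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_response(prompt: str, response: str) -> None:
    # Store the response under the prompt with a one-hour expiry (illustrative TTL).
    r.set(f"llm:{prompt}", response, ex=3600)

def get_cached_response(prompt: str):
    # Returns the cached response, or None if the prompt isn't cached.
    return r.get(f"llm:{prompt}")
```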
Disk-Based Caching
Stores data on disk using databases like SQLite.
Pros:
- Persistent storage across sessions.
- Larger capacity than memory.
Cons:
- Slower than in-memory caching due to disk I/O.
Example: Utilizing SQLite or other disk-based databases to store cached responses.
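Besides SQLite (used in the full example later in this post), one lightweight alternative is the third-party diskcache library. A minimal sketch, with the cache directory chosen arbitrarily:

```python
from diskcache import Cache

# Persistent cache stored on disk in the given directory (path is illustrative).
cache = Cache("./llm_disk_cache")

def get_cached_response(prompt: str):
    # Returns the cached response, or None on a cache miss.
    return cache.get(prompt)

def cache_response(prompt: str, response: str) -> None:
    # Persist the response so it survives restarts.
    cache.set(prompt, response)
```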
Semantic Caching
Stores responses based on the semantic meaning of queries using embeddings.
Pros:
- Handles semantically similar queries.
- Increases cache hit rates in natural language applications.
Cons:
- Additional computational overhead for computing embeddings.
- Complexity in setting appropriate similarity thresholds.
Semantic caching involves:
- Embedding Queries: Convert textual prompts into vector representations using models like Sentence Transformers.
- Storing Embeddings: Save embeddings alongside responses in a vector database.
- Similarity Search: When a new prompt arrives, compute its embedding and search for similar embeddings within a defined threshold.
AgenticRAG with LlamaIndex Course
Look into our AgenticRAG with LlamaIndex Course with 5 real-time case studies.
- RAG fundamentals through practical case studies
- Learn advanced AgenticRAG techniques, including:
  - Routing agents
  - Query planning agents
  - Structure planning agents
  - ReAct agent with a human in the loop
- Dive into 5 real-time case studies with code walkthroughs
Demonstrating Caching Improvements with Code Examples
Without Caching
Code Example:
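A minimal sketch of the uncached baseline (the helper name `get_response_no_cache()` and the example prompts are illustrative):

```python
import time

def call_llm_api(prompt: str) -> str:
    """Simulated LLM API call: no real model is invoked, only artificial latency."""
    time.sleep(2)  # pretend the API takes about 2 seconds to respond
    return f"Response to: {prompt}"

def get_response_no_cache(prompt: str) -> str:
    return call_llm_api(prompt)  # every request goes to the (simulated) API

for prompt in ["What is caching?", "What is caching?"]:
    start = time.time()
    get_response_no_cache(prompt)
    print(f"Response time: {time.time() - start:.2f} seconds")
```

Both calls take roughly the full simulated two seconds, because nothing is reused between requests.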
Note: For simplicity, we are not making a real LLM call; `call_llm_api()` simulates the latency of one.
With In-Memory Caching
Code Example:
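A minimal sketch using a plain Python dictionary as the in-memory cache; `call_llm_api()` is the same simulated call as above, and the helper name `get_response_in_memory()` is illustrative:

```python
import time

def call_llm_api(prompt: str) -> str:
    time.sleep(2)  # same simulated LLM latency as before
    return f"Response to: {prompt}"

cache = {}  # in-memory cache: prompt -> response

def get_response_in_memory(prompt: str) -> str:
    if prompt in cache:
        return cache[prompt]            # cache hit: returned instantly from RAM
    response = call_llm_api(prompt)     # cache miss: call the (simulated) LLM
    cache[prompt] = response            # store the response for future requests
    return response

for prompt in ["What is caching?", "What is caching?"]:
    start = time.time()
    get_response_in_memory(prompt)
    print(f"Response time: {time.time() - start:.2f} seconds")
```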
Explanation:
- First Call: The prompt is not in the cache, so it calls the LLM API and stores the response.
- Second Call: The prompt is found in the cache, and the response is returned instantly.
Disk-Based Caching
Code Example:
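A minimal sketch using Python's built-in `sqlite3` module; the function and table names follow the explanation below, while the database filename and prompts are arbitrary choices:

```python
import sqlite3
import time

def call_llm_api(prompt: str) -> str:
    time.sleep(2)  # same simulated LLM latency as in the earlier examples
    return f"Response to: {prompt}"

def initialize_cache(db_path: str = "llm_cache.db") -> None:
    """Create the SQLite database and the cache table if they don't exist."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (prompt TEXT PRIMARY KEY, response TEXT)"
    )
    conn.commit()
    conn.close()

def get_response_disk_cache(prompt: str, db_path: str = "llm_cache.db") -> str:
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT response FROM cache WHERE prompt = ?", (prompt,))
    result = cur.fetchone()
    if result is None:                       # cache miss
        response = call_llm_api(prompt)      # simulated LLM call
        conn.execute(
            "INSERT INTO cache (prompt, response) VALUES (?, ?)", (prompt, response)
        )
        conn.commit()
    else:                                    # cache hit
        response = result[0]
    conn.close()
    return response

initialize_cache()
for prompt in ["What is caching?", "What is caching?"]:
    start = time.time()
    get_response_disk_cache(prompt)
    print(f"Response time: {time.time() - start:.2f} seconds")
```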
Explanation:
- Initialization: The `initialize_cache()` function creates a SQLite database and a table named `cache`, with `prompt` as the primary key and `response` as the stored response.
- Cache Check: In `get_response_disk_cache()`, the function checks whether the prompt exists in the database by querying the `cache` table using the prompt as the primary key.

Cache Miss:
- If the prompt is not found (`result` is `None`), it indicates a cache miss.
- The function calls `call_llm_api(prompt)` to simulate getting a response from the LLM.
- It then inserts the prompt and response into the database for future queries.

Cache Hit:
- If the prompt is found, the cached response is retrieved from the database.
- The response is returned immediately without calling the LLM API.
With Semantic Caching
Code Example:
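A minimal sketch using Sentence Transformers, with a plain Python list standing in for a vector database; the model name, the 0.8 similarity threshold, the helper name, and the prompts are illustrative choices:

```python
import time
from sentence_transformers import SentenceTransformer, util

def call_llm_api(prompt: str) -> str:
    time.sleep(2)  # same simulated LLM latency as in the earlier examples
    return f"Response to: {prompt}"

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
semantic_cache = []  # list of (embedding, response) pairs; use a vector DB in production
SIMILARITY_THRESHOLD = 0.8  # illustrative value; tune for your application

def get_response_semantic_cache(prompt: str) -> str:
    query_embedding = model.encode(prompt, convert_to_tensor=True)
    for cached_embedding, cached_response in semantic_cache:
        similarity = util.cos_sim(query_embedding, cached_embedding).item()
        if similarity >= SIMILARITY_THRESHOLD:   # cache hit: semantically similar query
            return cached_response
    response = call_llm_api(prompt)              # cache miss: call the (simulated) LLM
    semantic_cache.append((query_embedding, response))
    return response

for prompt in ["What is caching in LLMs?", "What is caching in large language models?"]:
    start = time.time()
    get_response_semantic_cache(prompt)
    print(f"Response time: {time.time() - start:.2f} seconds")
```

Whether a reworded prompt actually counts as a hit depends on the embedding model and the chosen threshold, which is exactly the trade-off discussed in the optimization section below.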
Explanation:
- First Call: Cache miss; the response is stored along with its embedding.
- Second Call: Cache hit; the semantically similar query retrieves the cached response quickly.
Optimizing Cache Usage
- Setting Cache Rules: Define specific rules for when to cache and when to invalidate cache entries; a simple time-to-live rule is sketched below.
- Adjusting Similarity Thresholds: In semantic caching, tuning the similarity threshold balances accuracy against cache hit rate.
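As a concrete illustration of an invalidation rule, here is a minimal sketch of a time-to-live (TTL) policy layered on the in-memory cache; the one-hour TTL is an arbitrary choice:

```python
import time

CACHE_TTL_SECONDS = 3600  # illustrative: entries older than one hour are invalidated
cache = {}  # prompt -> (response, timestamp)

def get_with_ttl(prompt: str):
    entry = cache.get(prompt)
    if entry is None:
        return None                        # never cached
    response, stored_at = entry
    if time.time() - stored_at > CACHE_TTL_SECONDS:
        del cache[prompt]                  # entry expired: invalidate it
        return None
    return response                        # still fresh: cache hit

def set_with_ttl(prompt: str, response: str) -> None:
    cache[prompt] = (response, time.time())
```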
Tracking Performance
- Cache Hit Ratio: Percentage of requests served from the cache.
- Average Latency: Time taken to serve requests.
- Cost Savings: Reduction in API calls.
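One simple way to track these metrics is a small counter wrapped around each cache lookup; the `CacheMetrics` class and the assumed per-call cost below are purely illustrative:

```python
class CacheMetrics:
    """Minimal tracker for cache hit ratio, average latency, and estimated cost savings."""

    def __init__(self, cost_per_call: float = 0.002):  # assumed cost per API call, in dollars
        self.hits = 0
        self.misses = 0
        self.latencies = []
        self.cost_per_call = cost_per_call

    def record(self, hit: bool, latency: float) -> None:
        # Call this once per request with whether it was a cache hit and how long it took.
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        self.latencies.append(latency)

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def average_latency(self) -> float:
        return sum(self.latencies) / len(self.latencies) if self.latencies else 0.0

    @property
    def cost_savings(self) -> float:
        return self.hits * self.cost_per_call  # each hit avoids one paid API call
```

Record each request as it is served and periodically read off `hit_ratio`, `average_latency`, and `cost_savings` to judge whether the cache is earning its keep.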
All Previous Coffee Break Concepts
You can find all of our previous volumes of Coffee Break Concepts here:
Follow us here; your feedback in the form of comments and claps encourages us to create better content for the community.