Best Practices for RAG Pipeline

--

Over the past few years, RAG has matured, and multiple studies have been conducted to understand the patterns and behaviors that deliver high accuracy at low cost. One such study is the paper Searching for Best Practices in Retrieval-Augmented Generation.

Typical RAG Workflow

Credit: https://arxiv.org/pdf/2407.01219

A typical RAG (Retrieval-Augmented Generation) workflow has several steps:

  • Query Classification: Check if the user’s question needs document retrieval.
  • Retrieval: Find and get the most relevant documents quickly.
  • Re-ranking: Arrange the retrieved documents in order of relevance.
  • Re-packing: Organize the documents into a structured format.
  • Summarization: Extract key points to generate clear, concise answers and avoid repetition.

Implementing RAG also involves deciding how to break down documents into chunks, choosing the right embeddings for understanding the text’s meaning, selecting a suitable vector database for storing features efficiently, and finding ways to fine-tune language models, as shown in Figure 1.
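
Stitched together, the stages look roughly like the sketch below. All component names (classifier, retriever, reranker, repacker, summarizer, llm) are placeholders for whatever implementations you choose, not a specific library's API.

```python
def rag_answer(query, classifier, retriever, reranker, repacker, summarizer, llm, k=5):
    """Minimal RAG pipeline following the workflow above (all components are placeholders)."""
    # 1. Query classification: skip retrieval when the LLM's own knowledge suffices.
    if not classifier.needs_retrieval(query):
        return llm.generate(query)

    # 2. Retrieval: fetch the top-k candidate documents from the corpus.
    candidates = retriever.search(query, k=k)

    # 3. Re-ranking: reorder candidates by estimated relevance to the query.
    ranked = reranker.rerank(query, candidates)

    # 4. Re-packing: arrange the ranked documents into a prompt-friendly order.
    packed = repacker.pack(ranked)

    # 5. Summarization: compress the packed context to cut redundancy and length.
    context = summarizer.compress(query, packed)

    # 6. Generation: answer the query conditioned on the compressed context.
    return llm.generate(f"Context:\n{context}\n\nQuestion: {query}")
```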

Let’s evaluate each step in detail:

Want correct and accurate answers to questions like these? Check out our LLM Interview Course:

  • 100+ Questions spanning 14 categories
  • Curated 100+ assessments for each category
  • Well-researched real-world interview questions based on FAANG & Fortune 500 companies
  • Focus on Visual learning
  • Real Case Studies & Certification

50% off Coupon Code — LLM50

Link for the course

Query Classification

Why is query classification important?

  • Not all questions require additional retrieval because LLMs have built-in knowledge.
  • While RAG (Retrieval-Augmented Generation) can enhance accuracy and reduce errors, frequent document retrieval can slow down response times.
  • To optimize performance, we classify queries to determine if retrieval is necessary.

When is retrieval recommended?

  • Retrieval is usually needed when the answer requires information not contained within the model itself.

How are tasks categorized?

  • The paper groups tasks into 15 types and checks whether each one provides sufficient information on its own.
  • Tasks that can be answered with only the provided user information are marked as “sufficient” and don’t need retrieval.
  • Tasks that need more information are marked as “insufficient” and may require document retrieval.
Classification of retrieval requirements for different tasks. In cases where information is not provided, we differentiate tasks based on the functions of the model. Image credit — https://arxiv.org/pdf/2407.01219

This classification process is automated by training a classifier.
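
The paper trains a dedicated classifier for this step. As a lightweight stand-in, the sketch below trains a tiny scikit-learn model on invented, purely illustrative examples (label 1 = retrieval needed, 0 = not needed); it is not the paper's classifier or training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative training data: 1 = "insufficient" (retrieval needed), 0 = "sufficient".
train_queries = [
    "Translate 'good morning' into French",                          # sufficient
    "Summarize the paragraph I pasted above",                        # sufficient
    "What did the company announce in its Q3 2023 earnings call?",   # insufficient
    "Who won the most recent Nobel Prize in Physics?",               # insufficient
]
train_labels = [0, 0, 1, 1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_queries, train_labels)

def needs_retrieval(query: str) -> bool:
    """Return True when the classifier predicts the query needs document retrieval."""
    return bool(clf.predict([query])[0])
```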

Results of the Query Classifier

Chunking

Why is document chunking important?

  • Breaking down a document into smaller chunks helps improve retrieval accuracy and prevents issues related to document length when using LLMs (Large Language Models).

What are the levels of chunking?

  • Token-level chunking: Splits text by a set number of tokens. It’s simple but can break sentences, which might lower retrieval quality.
  • Semantic-level chunking: Uses LLMs to identify natural breakpoints, keeping the context intact but requiring more processing time.
  • Sentence-level chunking: Divides text at sentence boundaries, balancing the preservation of meaning with efficiency and simplicity.

Which chunking method is preferred?

  • Sentence-level chunking is often used as it strikes a good balance between maintaining the text’s meaning and being easy to implement.
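
As a reference point, here is a minimal sentence-level chunker. The regex sentence splitter and whitespace token count are simplifications; a proper tokenizer or sentence splitter can be dropped in.

```python
import re

def sentence_chunks(text: str, max_tokens: int = 175, overlap_sentences: int = 1):
    """Greedy sentence-level chunking: pack whole sentences until a token budget is reached.
    Token counting is a simple whitespace split here; swap in a real tokenizer if needed."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]   # carry a little context forward
            current_len = sum(len(s.split()) for s in current)
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```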

Chunk Size

Importance of Chunk Size: The size of each chunk in a document greatly impacts performance.

  • Larger chunks provide more context, which helps in understanding the content better but can slow down processing.
  • Smaller chunks are processed faster and improve recall rates but might not provide enough context for accurate comprehension.
Comparison of different chunk sizes.

Key Metrics for Evaluation

As shown in the figure below, chunk sizes are evaluated on two metrics:

  • Faithfulness: Assesses whether the response is accurate or deviates (hallucinates) from the retrieved text.
  • Relevancy: Checks whether the retrieved text and the response are closely related to the query.
Comparison of different chunk sizes.

How are Chunks Organized?

The analysis in Figure 5 compares different chunk sizes:

  • Smaller chunk size: 175 tokens.
  • Larger chunk size: 512 tokens.
  • Chunk overlap: 20 tokens, used to ensure some continuity between chunks.
Comparison of different chunk sizes
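
To make these numbers concrete, below is a minimal token-window splitter using the sizes and overlap quoted above. The use of tiktoken for token counting is an assumption; any tokenizer with encode/decode works.

```python
import tiktoken  # assumed tokenizer; any tokenizer with encode/decode works

def token_chunks(text: str, chunk_size: int = 512, overlap: int = 20):
    """Fixed-size token windows with a small overlap (e.g. 175 or 512 tokens, 20-token overlap)."""
    enc = tiktoken.get_encoding("cl100k_base")
    token_ids = enc.encode(text)
    chunks, start = [], 0
    while start < len(token_ids):
        window = token_ids[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(token_ids):
            break
        start += chunk_size - overlap
    return chunks
```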

Embedding Model

  • The selection of an embedding model is crucial for balancing performance and resource usage.
  • According to Figure 6, LLM-Embedder offers performance similar to BAAI/bge-large-en but is only about one-third the size.
  • Due to its smaller size and comparable effectiveness, LLM-Embedder is chosen as the preferred model to optimize both performance and efficiency.
Results for different embedding models on namespace-Pt/msmarco
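
As a minimal sketch of how the embedding model plugs in, the snippet below encodes chunks and a query with the sentence-transformers library (an assumption, not the paper's code). BAAI/bge-large-en is one of the compared models; LLM-Embedder can be swapped in the same way.

```python
from sentence_transformers import SentenceTransformer

# One of the models from the comparison; LLM-Embedder is a drop-in alternative.
model = SentenceTransformer("BAAI/bge-large-en")

chunks = [
    "Query classification decides whether retrieval is needed at all.",
    "Sentence-level chunking balances meaning preservation and simplicity.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)       # one vector per chunk
query_vector = model.encode(["Which chunking method is preferred?"],
                            normalize_embeddings=True)
scores = query_vector @ chunk_vectors.T                               # cosine similarity
print(scores)
```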

Metadata Addition

Enhancing chunk blocks with metadata like titles, keywords, and hypothetical questions can improve retrieval.

The paper does not include specific experiments but leaves them for future work.

Vector Databases

  • Vector Databases Comparison: Includes Weaviate, Faiss, Chroma, Qdrant, and Milvus.
  • Top Choice: Milvus excels in performance and meets all basic criteria better than other options.
Comparison of Various Vector Databases
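
Milvus is the paper's pick, but the store-and-search flow is similar across all of these databases. For a self-contained illustration, here is a minimal sketch using Faiss (also part of the comparison), which runs in-process without a server; the dimension and vectors are placeholders.

```python
import faiss
import numpy as np

dim = 1024                                     # must match the embedding model's output size
index = faiss.IndexFlatIP(dim)                 # inner product == cosine on normalized vectors

chunk_vectors = np.random.rand(100, dim).astype("float32")   # stand-in for real embeddings
faiss.normalize_L2(chunk_vectors)
index.add(chunk_vectors)                       # store the chunk vectors

query_vector = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vector)
scores, ids = index.search(query_vector, 5)    # retrieve the top-5 nearest chunks
```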

Retrieval

For user queries, the retrieval module selects the top k documents most relevant to the query from a pre-constructed corpus, based on their similarity.

Retrieval Techniques:

  • Query Rewriting: Improves queries for better document matching using LLMs.
  • Query Decomposition: Retrieves documents based on sub-questions from the original query.
  • Pseudo-Document Generation: Uses hypothetical documents to retrieve similar real documents; a notable example is HyDE.
Results for different retrieval methods on TREC DL19/20

Evaluation Results (Figure above):

  • Supervised methods outperform unsupervised methods.
  • HyDE + Hybrid Search: Achieved the highest performance score.
  • Hybrid Search: Combines BM25 (sparse retrieval) and LLM embeddings (dense retrieval) for optimal performance and low latency.

Recommendation: Use HyDE + hybrid search as the default retrieval method.
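
The sketch below is a rough illustration of HyDE + hybrid search, assuming the rank_bm25 and sentence-transformers libraries. `llm_generate` is a placeholder for any LLM call, and the mixing weight alpha is illustrative rather than a value from the paper.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en")

def hyde_hybrid_search(query, corpus, llm_generate, k=5, alpha=0.5):
    """HyDE + hybrid search sketch: embed an LLM-written hypothetical answer and
    blend its dense scores with BM25 scores. `llm_generate` is any text-in/text-out LLM call."""
    # HyDE: retrieve with the embedding of a hypothetical answer instead of the raw query.
    hypothetical_doc = llm_generate(f"Write a short passage that answers: {query}")
    dense_query = embedder.encode([hypothetical_doc], normalize_embeddings=True)
    dense_docs = embedder.encode(corpus, normalize_embeddings=True)
    dense_scores = (dense_query @ dense_docs.T)[0]

    # Sparse side: BM25 over whitespace-tokenized text.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    sparse_scores = np.array(bm25.get_scores(query.split()))

    # Min-max normalize each score list, then mix with weight alpha.
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    hybrid = alpha * norm(dense_scores) + (1 - alpha) * norm(sparse_scores)
    top = np.argsort(-hybrid)[:k]
    return [corpus[i] for i in top]
```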

Re-ranking

Re-Ranking Phase: Enhances document relevance after initial retrieval.

Re-Ranking Methods:

  • DLM Re-Ranking: Uses Deep Language Models (DLMs) to classify document relevance; documents are ranked based on the probability of being “true.”
  • TILDE Re-Ranking: Scores documents by summing log probabilities of query terms, with TILDEv2 improving efficiency by indexing only relevant terms and using NCE loss.
Results of different reranking methods on the dev set of the MS MARCO Passage ranking dataset

Evaluation Results (Figure above):

  • monoT5: Recommended for a balance of performance and efficiency.
  • RankLLaMA: Best for optimal performance.
  • TILDEv2: Suitable for quick experiments with fixed sets.

Recommendation: Use monoT5 for comprehensive performance and efficiency.
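
For reference, here is a minimal monoT5-style re-ranking sketch. It assumes the Hugging Face transformers library and a published monoT5 checkpoint name; scoring follows the “Query: … Document: … Relevant:” prompt with the probability of the token “true”, as described above.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed checkpoint name (monoT5 checkpoints are published by the castorini group).
MODEL = "castorini/monot5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(MODEL)
model = T5ForConditionalGeneration.from_pretrained(MODEL).eval()

def monot5_score(query: str, doc: str) -> float:
    """Return the probability that monoT5 judges `doc` relevant ('true') to `query`."""
    prompt = f"Query: {query} Document: {doc} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    # Feed only the decoder start token and read the logits of the first generated token.
    decoder_input = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input).logits[0, -1]
    true_id = tokenizer.encode("true")[0]    # id of the 'true' token
    false_id = tokenizer.encode("false")[0]  # id of the 'false' token
    probs = torch.softmax(logits[[false_id, true_id]], dim=0)
    return probs[1].item()

# Rank retrieved documents by their monoT5 relevance score.
docs = ["Passage A ...", "Passage B ..."]
reranked = sorted(docs, key=lambda d: monot5_score("what is RAG?", d), reverse=True)
```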

Re-packing

  • Re-Packing Module: Ensures effective LLM response generation by optimizing document order after re-ranking.

Re-Packing Methods:

  • Forward: Orders documents in descending relevance.
  • Reverse: Orders documents in ascending relevance.
  • Sides: Places relevant information at the beginning or end, based on the “Lost in the Middle” concept.

Evaluation: A detailed assessment of these methods appears in the comprehensive evaluation section below.
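
The three orderings are straightforward to implement. The sketch below assumes the documents arrive from the re-ranker sorted most-relevant-first.

```python
def repack(docs_ranked_desc, mode="reverse"):
    """Reorder re-ranked documents. `docs_ranked_desc` is sorted most-relevant-first."""
    if mode == "forward":        # descending relevance
        return list(docs_ranked_desc)
    if mode == "reverse":        # ascending relevance: best documents end up nearest the query
        return list(reversed(docs_ranked_desc))
    if mode == "sides":          # "Lost in the Middle": best documents placed at both ends
        ordered = [None] * len(docs_ranked_desc)
        left, right = 0, len(docs_ranked_desc) - 1
        for i, doc in enumerate(docs_ranked_desc):
            if i % 2 == 0:
                ordered[left] = doc
                left += 1
            else:
                ordered[right] = doc
                right -= 1
        return ordered
    raise ValueError(f"unknown mode: {mode}")
```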

Summarization

  • Importance of Summarization: Reduces redundancy and prevents long prompts from slowing down inference in LLMs.

Summarization Methods:

  • Extractive Compressors: Segment and rank sentences based on importance.
  • Generative Compressors: Synthesize and rephrase information from multiple documents.

Evaluated Methods:

  • Recomp: Uses both extractive and generative compressors to select and synthesize important information.
  • LongLLMLingua: Focuses on key query-relevant information for improved summarization.
  • Selective Context: Enhances LLM efficiency by removing redundant information.
Comparison between different summarization methods

Evaluation: Methods tested on NQ, TriviaQA, and HotpotQA datasets. Recomp is preferred, with LongLLMLingua as an alternative.
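
Recomp and LongLLMLingua are full systems; to show the extractive idea in miniature, the sketch below simply keeps the sentences most similar to the query. It is a hypothetical stand-in, not the Recomp implementation, and reuses a sentence-transformers embedder.

```python
import re
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en")

def extractive_compress(query: str, documents: list, keep: int = 5) -> str:
    """Keep only the `keep` sentences most similar to the query, preserving original order."""
    sentences = [s for doc in documents
                 for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    sentence_vecs = embedder.encode(sentences, normalize_embeddings=True)
    query_vec = embedder.encode([query], normalize_embeddings=True)
    scores = (query_vec @ sentence_vecs.T)[0]
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:keep]
    return " ".join(sentences[i] for i in sorted(top))
```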

Generator Fine-tuning

The paper investigates how fine-tuning with different contexts affects the generator’s performance, leaving retriever fine-tuning to future work.

Training Context Variants:

  • Dg: Only query-relevant documents.
  • Dr: Only randomly sampled documents.
  • Dgr: Mix of one relevant and one random document.
  • Dgg: Two copies of a relevant document.

Model Variants:

  • Mb: Base LM generator (not fine-tuned).
  • Mg, Mr, Mgr, Mgg: Models fine-tuned with contexts Dg, Dr, Dgr, and Dgg, respectively.
Results of generator fine-tuning.
  • Training with Mixed Documents: The model trained with a mix of relevant and random documents (Mgr) shows the best performance with gold or mixed context.
  • Key Insight: Mixing relevant and random context during training enhances the generator’s robustness to irrelevant information while maintaining effective use of relevant context.
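
To make the context variants concrete, here is a hypothetical sketch of how the four training mixes could be assembled. The field names (query, answer, relevant_doc) and prompt format are assumptions for illustration, not the paper's exact setup.

```python
import random

def build_finetuning_examples(qa_pairs, corpus, variant="Dgr"):
    """Assemble generator fine-tuning contexts for the Dg / Dr / Dgr / Dgg variants.
    qa_pairs: dicts with 'query', 'answer', 'relevant_doc'; corpus: pool of distractor documents."""
    examples = []
    for item in qa_pairs:
        relevant = item["relevant_doc"]
        random_doc = random.choice(corpus)
        if variant == "Dg":        # only the query-relevant document
            context = [relevant]
        elif variant == "Dr":      # only a randomly sampled document
            context = [random_doc]
        elif variant == "Dgr":     # one relevant + one random (best in the paper's results)
            context = [relevant, random_doc]
        elif variant == "Dgg":     # the relevant document twice
            context = [relevant, relevant]
        else:
            raise ValueError(f"unknown variant: {variant}")
        prompt = "\n\n".join(context) + f"\n\nQuestion: {item['query']}\nAnswer:"
        examples.append({"prompt": prompt, "completion": item["answer"]})
    return examples
```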

Searching for Best RAG Practices (Comprehensive Evaluation)

Results of the search for optimal RAG practices. Modules enclosed in a boxed module are under investigation to determine the best method. The underlined method represents the selected implementation. The “Avg” (average score) is calculated based on the Acc, EM, and RAG scores for all tasks, while the average latency is measured in seconds per query. The best scores are highlighted in bold.

Query Classification Module:

  • Improvement: Increases effectiveness and efficiency.
  • Performance Metrics: Average score increased from 0.428 to 0.443.
  • Latency: Reduced from 16.41 seconds to 11.58 seconds.

Retrieval Module:

  • Best Performance: “Hybrid with HyDE” achieved the highest RAG score (0.58) but has high computational cost (11.71 seconds per query).
  • Recommendation: Use “Hybrid” or “Original” methods to balance performance and latency.

Re-Ranking Module:

  • Impact: Absence leads to significant performance drop.
  • Best Method: MonoT5 achieved the highest average score, highlighting the importance of re-ranking for relevance.

Re-Packing Module:

  • Best Configuration: Reverse configuration achieved a RAG score of 0.560.
  • Insight: Placing more relevant context closer to the query yields better results.

Summarization Module:

  • Best Method: Recomp demonstrated superior performance.
  • Alternative: Removing the summary module can yield comparable results with lower latency, but Recomp remains preferred for handling the generator’s maximum length limitation.

Follow us here; your feedback in the form of comments and claps encourages us to create better content for the community.

Can you give multiple claps? Yes, you can!

--


Mastering LLM (Large Language Model)

MasteringLLM is an AI-first EdTech company that simplifies learning LLMs through visual content. Look out for our LLM Interview Prep & AgenticRAG courses.