Best Practices for RAG Pipeline

--

Over the past few years, RAG has matured, and multiple studies have been conducted to understand the patterns and behaviors that deliver high accuracy at low cost. One such study is the paper Searching for Best Practices in Retrieval-Augmented Generation.

Typical RAG Workflow

Credit: https://arxiv.org/pdf/2407.01219

A typical RAG (Retrieval-Augmented Generation) workflow has several steps:

  • Query Classification: Check if the user’s question needs document retrieval.
  • Retrieval: Find and get the most relevant documents quickly.
  • Re-ranking: Arrange the retrieved documents in order of relevance.
  • Re-packing: Organize the documents into a structured format.
  • Summarization: Extract key points to generate clear, concise answers and avoid repetition.

Implementing RAG also involves deciding how to break down documents into chunks, choosing the right embeddings for understanding the text’s meaning, selecting a suitable vector database for storing features efficiently, and finding ways to fine-tune language models, as shown in Figure 1.
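
Stitched together, the stages look roughly like the sketch below. All component names (classifier, retriever, reranker, repacker, summarizer, llm) are placeholders for whatever implementations you choose, not a specific library's API.

```python
def rag_answer(query, classifier, retriever, reranker, repacker, summarizer, llm, k=5):
    """Minimal RAG pipeline following the workflow above (all components are placeholders)."""
    # 1. Query classification: skip retrieval when the LLM's own knowledge suffices.
    if not classifier.needs_retrieval(query):
        return llm.generate(query)

    # 2. Retrieval: fetch the top-k candidate documents from the corpus.
    candidates = retriever.search(query, k=k)

    # 3. Re-ranking: reorder candidates by estimated relevance to the query.
    ranked = reranker.rerank(query, candidates)

    # 4. Re-packing: arrange the ranked documents into a prompt-friendly order.
    packed = repacker.pack(ranked)

    # 5. Summarization: compress the packed context to cut redundancy and length.
    context = summarizer.compress(query, packed)

    # 6. Generation: answer the query conditioned on the compressed context.
    return llm.generate(f"Context:\n{context}\n\nQuestion: {query}")
```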

Let’s evaluate each step in detail:

Want correct and accurate answers to questions like these? Check out our LLM Interview Course:

  • 100+ Questions spanning 14 categories
  • Curated 100+ assessments for each category
  • Well-researched real-world interview questions based on FAANG & Fortune 500 companies
  • Focus on Visual learning
  • Real Case Studies & Certification

50% off Coupon Code — LLM50

Link for the course

Query Classification

Why is query classification important?

  • Not all questions require additional retrieval because LLMs have built-in knowledge.
  • While RAG (Retrieval-Augmented Generation) can enhance accuracy and reduce errors, frequent document retrieval can slow down response times.
  • To optimize performance, we classify queries to determine if retrieval is necessary.

When is retrieval recommended?

  • Retrieval is usually needed when the answer requires information not contained within the model itself.

How are tasks categorized?

  • The paper groups tasks into 15 types and checks whether each one provides sufficient information on its own.
  • Tasks that can be answered with only the provided user information are marked as “sufficient” and don’t need retrieval.
  • Tasks that need more information are marked as “insufficient” and may require document retrieval.
Classification of retrieval requirements for different tasks. In cases where information is not provided, we differentiate tasks based on the functions of the model. Image credit — https://arxiv.org/pdf/2407.01219

This classification process is automated by training a classifier.
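
The paper trains a dedicated classifier for this step. As a lightweight stand-in, the sketch below trains a tiny scikit-learn model on invented, purely illustrative examples (label 1 = retrieval needed, 0 = not needed); it is not the paper's classifier or training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative training data: 1 = "insufficient" (retrieval needed), 0 = "sufficient".
train_queries = [
    "Translate 'good morning' into French",                          # sufficient
    "Summarize the paragraph I pasted above",                        # sufficient
    "What did the company announce in its Q3 2023 earnings call?",   # insufficient
    "Who won the most recent Nobel Prize in Physics?",               # insufficient
]
train_labels = [0, 0, 1, 1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_queries, train_labels)

def needs_retrieval(query: str) -> bool:
    """Return True when the classifier predicts the query needs document retrieval."""
    return bool(clf.predict([query])[0])
```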

Results of the Query Classifier

Chunking

Why is document chunking important?

  • Breaking down a document into smaller chunks helps improve retrieval accuracy and prevents issues related to document length when using LLMs (Large Language Models).

What are the levels of chunking?

  • Token-level chunking: Splits text by a set number of tokens. It’s simple but can break sentences, which might lower retrieval quality.
  • Semantic-level chunking: Uses LLMs to identify natural breakpoints, keeping the context intact but requiring more processing time.
  • Sentence-level chunking: Divides text at sentence boundaries, balancing the preservation of meaning with efficiency and simplicity.

Which chunking method is preferred?

  • Sentence-level chunking is often used as it strikes a good balance between maintaining the text’s meaning and being easy to implement.
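
As a reference point, here is a minimal sentence-level chunker. The regex sentence splitter and whitespace token count are simplifications; a proper tokenizer or sentence splitter can be dropped in.

```python
import re

def sentence_chunks(text: str, max_tokens: int = 175, overlap_sentences: int = 1):
    """Greedy sentence-level chunking: pack whole sentences until a token budget is reached.
    Token counting is a simple whitespace split here; swap in a real tokenizer if needed."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]   # carry a little context forward
            current_len = sum(len(s.split()) for s in current)
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```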

Chunk Size

Importance of Chunk Size: The size of each chunk in a document greatly impacts performance.

  • Larger chunks provide more context, which helps in understanding the content better but can slow down processing.
  • Smaller chunks are processed faster and improve recall rates but might not provide enough context for accurate comprehension.
Comparison of different chunk sizes.

Key Metrics for Evaluation

As shown in the figure below, chunk sizes are evaluated on two metrics:

  • Faithfulness: Assesses whether the response is accurate or deviates (hallucinates) from the retrieved text.
  • Relevancy: Checks whether the retrieved text and the response are closely related to the query.
Comparison of different chunk sizes.

How are Chunks Organized?

The analysis in Figure 5 compares different chunk sizes:

  • Smaller chunk size: 175 tokens.
  • Larger chunk size: 512 tokens.
  • Chunk overlap: 20 tokens, used to ensure some continuity between chunks.
Comparison of different chunk sizes
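
To make these numbers concrete, below is a minimal token-window splitter using the sizes and overlap quoted above. The use of tiktoken for token counting is an assumption; any tokenizer with encode/decode works.

```python
import tiktoken  # assumed tokenizer; any tokenizer with encode/decode works

def token_chunks(text: str, chunk_size: int = 512, overlap: int = 20):
    """Fixed-size token windows with a small overlap (e.g. 175 or 512 tokens, 20-token overlap)."""
    enc = tiktoken.get_encoding("cl100k_base")
    token_ids = enc.encode(text)
    chunks, start = [], 0
    while start < len(token_ids):
        window = token_ids[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(token_ids):
            break
        start += chunk_size - overlap
    return chunks
```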

Embedding Model

  • The selection of an embedding model is crucial for balancing performance and resource usage.
  • According to Figure 6, LLM-Embedder offers performance similar to BAAI/bge-large-en but is only about one-third the size.
  • Due to its smaller size and comparable effectiveness, LLM-Embedder is chosen as the preferred model to optimize both performance and efficiency.
Results for different embedding models on namespace-Pt/msmarco
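
As a minimal sketch of how the embedding model plugs in, the snippet below encodes chunks and a query with the sentence-transformers library (an assumption, not the paper's code). BAAI/bge-large-en is one of the compared models; LLM-Embedder can be swapped in the same way.

```python
from sentence_transformers import SentenceTransformer

# One of the models from the comparison; LLM-Embedder is a drop-in alternative.
model = SentenceTransformer("BAAI/bge-large-en")

chunks = [
    "Query classification decides whether retrieval is needed at all.",
    "Sentence-level chunking balances meaning preservation and simplicity.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)       # one vector per chunk
query_vector = model.encode(["Which chunking method is preferred?"],
                            normalize_embeddings=True)
scores = query_vector @ chunk_vectors.T                               # cosine similarity
print(scores)
```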

Metadata Addition

Enhancing chunk blocks with metadata like titles, keywords, and hypothetical questions can improve retrieval.

The paper does not include specific experiments but leaves them for future work.

Vector Databases

  • Vector Databases Comparison: Includes Weaviate, Faiss, Chroma, Qdrant, and Milvus.
  • Top Choice: Milvus excels in performance and meets all basic criteria better than other options.
Comparison of Various Vector Databases
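
Milvus is the paper's pick, but the store-and-search flow is similar across all of these databases. For a self-contained illustration, here is a minimal sketch using Faiss (also part of the comparison), which runs in-process without a server; the dimension and vectors are placeholders.

```python
import faiss
import numpy as np

dim = 1024                                     # must match the embedding model's output size
index = faiss.IndexFlatIP(dim)                 # inner product == cosine on normalized vectors

chunk_vectors = np.random.rand(100, dim).astype("float32")   # stand-in for real embeddings
faiss.normalize_L2(chunk_vectors)
index.add(chunk_vectors)                       # store the chunk vectors

query_vector = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vector)
scores, ids = index.search(query_vector, 5)    # retrieve the top-5 nearest chunks
```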

Retrieval

For user queries, the retrieval module selects the top k documents most relevant to the query from a pre-constructed corpus, based on their similarity.

Retrieval Techniques:

  • Query Rewriting: Improves queries for better document matching using LLMs.
  • Query Decomposition: Retrieves documents based on sub-questions from the original query.
  • Pseudo-Document Generation: Uses hypothetical documents to retrieve similar real documents; a notable example is HyDE.
Results for different retrieval methods on TREC DL19/20

Evaluation Results (Figure above):

  • Supervised methods outperform unsupervised methods.
  • HyDE + Hybrid Search: Achieved the highest performance score.
  • Hybrid Search: Combines BM25 (sparse retrieval) and LLM embeddings (dense retrieval) for optimal performance and low latency.

Recommendation: Use HyDE + hybrid search as the default retrieval method.
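
The sketch below is a rough illustration of HyDE + hybrid search, assuming the rank_bm25 and sentence-transformers libraries. `llm_generate` is a placeholder for any LLM call, and the mixing weight alpha is illustrative rather than a value from the paper.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en")

def hyde_hybrid_search(query, corpus, llm_generate, k=5, alpha=0.5):
    """HyDE + hybrid search sketch: embed an LLM-written hypothetical answer and
    blend its dense scores with BM25 scores. `llm_generate` is any text-in/text-out LLM call."""
    # HyDE: retrieve with the embedding of a hypothetical answer instead of the raw query.
    hypothetical_doc = llm_generate(f"Write a short passage that answers: {query}")
    dense_query = embedder.encode([hypothetical_doc], normalize_embeddings=True)
    dense_docs = embedder.encode(corpus, normalize_embeddings=True)
    dense_scores = (dense_query @ dense_docs.T)[0]

    # Sparse side: BM25 over whitespace-tokenized text.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    sparse_scores = np.array(bm25.get_scores(query.split()))

    # Min-max normalize each score list, then mix with weight alpha.
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    hybrid = alpha * norm(dense_scores) + (1 - alpha) * norm(sparse_scores)
    top = np.argsort(-hybrid)[:k]
    return [corpus[i] for i in top]
```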

Re-ranking

Re-Ranking Phase: Enhances document relevance after initial retrieval.

Re-Ranking Methods:

  • DLM Re-Ranking: Uses Deep Language Models (DLMs) to classify document relevance; documents are ranked based on the probability of being “true.”
  • TILDE Re-Ranking: Scores documents by summing log probabilities of query terms, with TILDEv2 improving efficiency by indexing only relevant terms and using NCE loss.
Results of different reranking methods on the dev set of the MS MARCO Passage ranking dataset

Evaluation Results (Figure above):

  • monoT5: Recommended for a balance of performance and efficiency.
  • RankLLaMA: Best for optimal performance.
  • TILDEv2: Suitable for quick experiments with fixed sets.

Recommendation: Use monoT5 for comprehensive performance and efficiency.
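
For reference, here is a minimal monoT5-style re-ranking sketch. It assumes the Hugging Face transformers library and a published monoT5 checkpoint name; scoring follows the “Query: … Document: … Relevant:” prompt with the probability of the token “true”, as described above.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed checkpoint name (monoT5 checkpoints are published by the castorini group).
MODEL = "castorini/monot5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(MODEL)
model = T5ForConditionalGeneration.from_pretrained(MODEL).eval()

def monot5_score(query: str, doc: str) -> float:
    """Return the probability that monoT5 judges `doc` relevant ('true') to `query`."""
    prompt = f"Query: {query} Document: {doc} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    # Feed only the decoder start token and read the logits of the first generated token.
    decoder_input = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input).logits[0, -1]
    true_id = tokenizer.encode("true")[0]    # id of the 'true' token
    false_id = tokenizer.encode("false")[0]  # id of the 'false' token
    probs = torch.softmax(logits[[false_id, true_id]], dim=0)
    return probs[1].item()

# Rank retrieved documents by their monoT5 relevance score.
docs = ["Passage A ...", "Passage B ..."]
reranked = sorted(docs, key=lambda d: monot5_score("what is RAG?", d), reverse=True)
```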

Re-packing

  • Re-Packing Module: Ensures effective LLM response generation by optimizing document order after re-ranking.

Re-Packing Methods:

  • Forward: Orders documents in descending relevance.
  • Reverse: Orders documents in ascending relevance.
  • Sides: Places relevant information at the beginning or end, based on the “Lost in the Middle” concept.

Evaluation: A detailed assessment of these methods appears in the comprehensive evaluation section below.
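
The three orderings are straightforward to implement. The sketch below assumes the documents arrive from the re-ranker sorted most-relevant-first.

```python
def repack(docs_ranked_desc, mode="reverse"):
    """Reorder re-ranked documents. `docs_ranked_desc` is sorted most-relevant-first."""
    if mode == "forward":        # descending relevance
        return list(docs_ranked_desc)
    if mode == "reverse":        # ascending relevance: best documents end up nearest the query
        return list(reversed(docs_ranked_desc))
    if mode == "sides":          # "Lost in the Middle": best documents placed at both ends
        ordered = [None] * len(docs_ranked_desc)
        left, right = 0, len(docs_ranked_desc) - 1
        for i, doc in enumerate(docs_ranked_desc):
            if i % 2 == 0:
                ordered[left] = doc
                left += 1
            else:
                ordered[right] = doc
                right -= 1
        return ordered
    raise ValueError(f"unknown mode: {mode}")
```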

Summarization

  • Importance of Summarization: Reduces redundancy and prevents long prompts from slowing down inference in LLMs.

Summarization Methods:

  • Extractive Compressors: Segment and rank sentences based on importance.
  • Generative Compressors: Synthesize and rephrase information from multiple documents.

Evaluated Methods:

  • Recomp: Uses both extractive and generative compressors to select and synthesize important information.
  • LongLLMLingua: Focuses on key query-relevant information for improved summarization.
  • Selective Context: Enhances LLM efficiency by removing redundant information.
Comparison between different summarization methods

Evaluation: Methods tested on NQ, TriviaQA, and HotpotQA datasets. Recomp is preferred, with LongLLMLingua as an alternative.
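
Recomp and LongLLMLingua are full systems; to show the extractive idea in miniature, the sketch below simply keeps the sentences most similar to the query. It is a hypothetical stand-in, not the Recomp implementation, and reuses a sentence-transformers embedder.

```python
import re
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en")

def extractive_compress(query: str, documents: list, keep: int = 5) -> str:
    """Keep only the `keep` sentences most similar to the query, preserving original order."""
    sentences = [s for doc in documents
                 for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    sentence_vecs = embedder.encode(sentences, normalize_embeddings=True)
    query_vec = embedder.encode([query], normalize_embeddings=True)
    scores = (query_vec @ sentence_vecs.T)[0]
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:keep]
    return " ".join(sentences[i] for i in sorted(top))
```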

Generator Fine-tuning

The paper investigates how fine-tuning with different contexts affects the generator’s performance, leaving retriever fine-tuning to future work.

Training Context Variants:

  • Dg: Only query-relevant documents.
  • Dr: Only randomly sampled documents.
  • Dgr: Mix of one relevant and one random document.
  • Dgg: Two copies of a relevant document.

Model Variants:

  • Mb: Base LM generator (not fine-tuned).
  • Mg, Mr, Mgr, Mgg: Models fine-tuned with contexts Dg, Dr, Dgr, and Dgg, respectively.
Results of generator fine-tuning.
  • Training with Mixed Documents: The model trained with a mix of relevant and random documents (Mgr) shows the best performance with gold or mixed context.
  • Key Insight: Mixing relevant and random context during training enhances the generator’s robustness to irrelevant information while maintaining effective use of relevant context.
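
To make the context variants concrete, here is a hypothetical sketch of how the four training mixes could be assembled. The field names (query, answer, relevant_doc) and prompt format are assumptions for illustration, not the paper's exact setup.

```python
import random

def build_finetuning_examples(qa_pairs, corpus, variant="Dgr"):
    """Assemble generator fine-tuning contexts for the Dg / Dr / Dgr / Dgg variants.
    qa_pairs: dicts with 'query', 'answer', 'relevant_doc'; corpus: pool of distractor documents."""
    examples = []
    for item in qa_pairs:
        relevant = item["relevant_doc"]
        random_doc = random.choice(corpus)
        if variant == "Dg":        # only the query-relevant document
            context = [relevant]
        elif variant == "Dr":      # only a randomly sampled document
            context = [random_doc]
        elif variant == "Dgr":     # one relevant + one random (best in the paper's results)
            context = [relevant, random_doc]
        elif variant == "Dgg":     # the relevant document twice
            context = [relevant, relevant]
        else:
            raise ValueError(f"unknown variant: {variant}")
        prompt = "\n\n".join(context) + f"\n\nQuestion: {item['query']}\nAnswer:"
        examples.append({"prompt": prompt, "completion": item["answer"]})
    return examples
```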

Searching for Best RAG Practices (Comprehensive Evaluation)

Results of the search for optimal RAG practices. Modules enclosed in a boxed module are under investigation to determine the best method. The underlined method represents the selected implementation. The “Avg” (average score) is calculated based on the Acc, EM, and RAG scores for all tasks, while the average latency is measured in seconds per query. The best scores are highlighted in bold.

Query Classification Module:

  • Improvement: Increases effectiveness and efficiency.
  • Performance Metrics: Average score increased from 0.428 to 0.443.
  • Latency: Reduced from 16.41 seconds to 11.58 seconds.

Retrieval Module:

  • Best Performance: “Hybrid with HyDE” achieved the highest RAG score (0.58) but has high computational cost (11.71 seconds per query).
  • Recommendation: Use “Hybrid” or “Original” methods to balance performance and latency.

Re-Ranking Module:

  • Impact: Absence leads to significant performance drop.
  • Best Method: MonoT5 achieved the highest average score, highlighting the importance of re-ranking for relevance.

Re-Packing Module:

  • Best Configuration: Reverse configuration achieved a RAG score of 0.560.
  • Insight: Placing more relevant context closer to the query yields better results.

Summarization Module:

  • Best Method: Recomp demonstrated superior performance.
  • Alternative: Removing the summary module can yield comparable results with lower latency, but Recomp remains preferred for handling the generator’s maximum length limitation.

Follow us here; your feedback in the form of comments and claps encourages us to create better content for the community.

Can you give multiple claps? Yes, you can!

--


Mastering LLM (Large Language Model)

MasteringLLM is an AI-first EdTech company that simplifies learning LLMs through visual content. Look out for our LLM Interview Prep & AgenticRAG courses.