Author(s): Nilesh Raghuvanshi

Originally published on Towards AI.

Improving Retrieval Augmented Generation (RAG) Systematically

Choosing the right option — AI generated image

Introduction

Through my experience building an extractive question-answering system with Google's QANet and BERT back in 2018, I quickly realized the significant impact that high-quality retrieval has on the overall performance of the system. With the advent of generative models (LLMs), the importance of effective retrieval has only grown. Generative models are prone to "hallucination": they can produce incorrect or misleading information if they lack the correct context or are fed noisy data. Simply put, the retrieval component (the "R" in RAG) is the backbone of Retrieval Augmented Generation. It is also one of the most challenging aspects to get right.

Achieving high-quality retrieval requires constant iteration and refinement. To improve retrieval, it is essential to focus on the individual components within your retrieval pipeline. Just as important is a clear methodology for evaluating their performance, both individually and as part of the larger system, because that is what drives improvement.

This series is not intended to be an exhaustive guide to improving RAG-based applications. Rather, it is a reflection on key insights I have gained while working on real-world projects, such as the importance of iterative evaluation and the role of high-quality retrieval. I hope these insights resonate with you and provide valuable perspectives for your own RAG endeavors.

Case Study: Code Generation for SimTalk

The project aimed to generate code for a proprietary programming language called SimTalk. SimTalk is the scripting language used in Siemens' Tecnomatix Plant Simulation software, a tool for modeling, simulating, and optimizing manufacturing systems and processes. With SimTalk, users can customize and extend the behavior of standard simulation objects, enabling more realistic and complex system models.

Because SimTalk is proprietary and has little publicly available training data, it is unfamiliar to LLMs, and out-of-the-box code generation quality is quite poor compared to popular languages like Python, which benefit from extensive public datasets and broad community support. However, when provided with the right context through a well-augmented prompt (relevant code examples, detailed descriptions of SimTalk functions, and explanations of expected behavior), the generated code becomes acceptable and useful, even if not perfect. This significantly enhances user productivity, which aligns well with our business objectives.

Our only knowledge source is high-quality SimTalk documentation: approximately 10,000 pages covering language syntax, functions, use cases, and best practices, along with some code snippets. This documentation serves as the foundational knowledge base for code generation, providing the LLM with the context it needs to understand and generate SimTalk code.

There are several critical components in our pipeline, each designed to provide the LLM with precise context. For instance, we use query rewriting techniques such as expansion, relaxation, and segmentation, and we extract metadata from queries to dynamically build filters for more targeted searches; a rough sketch of this idea is shown below.
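To make query rewriting and metadata-driven filtering a bit more concrete, here is a minimal, hypothetical sketch in Python. The article does not show the actual implementation; the function names, the filter fields, and the heuristic rewrites below are illustrative stand-ins (in the real pipeline an LLM would perform these steps before the search request is sent to the index).

```python
# Hypothetical sketch: rewrite the user query and derive metadata filters
# before querying the index. Simple heuristics stand in for the LLM calls.
from dataclasses import dataclass, field


@dataclass
class SearchRequest:
    queries: list[str]                                      # rewritten/expanded query variants
    filters: dict[str, str] = field(default_factory=dict)   # metadata filters for targeted search


def rewrite_query(user_query: str) -> list[str]:
    """Placeholder for expansion/relaxation: emit a few variants of the query."""
    relaxed = user_query.lower().replace("how do i", "").strip()
    return [user_query, relaxed, f"SimTalk example: {relaxed}"]


def extract_filters(user_query: str) -> dict[str, str]:
    """Placeholder for metadata extraction: map query hints to index fields."""
    filters = {}
    if "method" in user_query.lower():
        filters["doc_type"] = "method_reference"   # illustrative field and value names
    return filters


def build_request(user_query: str) -> SearchRequest:
    return SearchRequest(queries=rewrite_query(user_query), filters=extract_filters(user_query))


if __name__ == "__main__":
    print(build_request("How do I create a method that moves parts between stations?"))
```

The point of the sketch is the shape of the request, not the heuristics: several query variants plus structured filters give the retriever more chances to surface the right documentation passages.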
Instead of diving into all these specific components (query rewriting, metadata extraction, and dynamic filtering), I will focus on the general aspects that apply to any RAG-based project. In this series, we will cover:

- How to evaluate the performance of multiple embedding models on your custom domain data
- How to fine-tune an embedding model on your custom domain data
- How to evaluate the retrieval pipeline
- How to evaluate the generation pipeline

In general, the goal is to make data-driven decisions based on evaluation results, such as precision, recall, and relevance metrics, to optimize your RAG applications rather than relying on intuition or assumptions.

Evaluating Embedding Models for Domain-Specific Retrieval

Embedding models are a critical component of any RAG application today, as they enable semantic search: understanding the meaning behind user queries to find the most relevant information. This is valuable in the context of RAG because it ensures that the generative model has access to high-quality, contextually appropriate information. However, not all applications require semantic search; full-text search can often be sufficient, or at least a good starting point. Establishing a solid baseline with full-text search is often a practical first step in improving retrieval.

The embedding model landscape is as dynamic and competitive as the LLM space, with numerous options from a wide range of vendors. Key differentiators among these models include embedding dimensions, maximum token limit, model size, memory requirements, model architecture, fine-tuning capabilities, multilingual support, and task-specific optimization. Here, we will focus on enterprise-friendly choices like Azure OpenAI, AWS Bedrock, and open-source models from Hugging Face 🤗. It is essential to evaluate and identify the most suitable embedding model for your application in order to optimize accuracy, latency, storage, memory, and cost.

To effectively evaluate and compare the performance of multiple embedding models, you need a benchmarking dataset. If such a dataset is not readily available, a scalable solution is to use LLMs to create one from your domain-specific data. For example, LLMs can generate a variety of realistic queries and corresponding relevant content from existing domain-specific documents, and these query-document pairs can then serve as the benchmarking dataset.

Generating a Synthetic Dataset Based on Domain-Specific Data

Generating a synthetic dataset presented a unique challenge, especially with the goal of keeping costs low. We aimed to create a diverse and effective dataset using a practical, resource-efficient approach. To achieve this, we used quantized small language models (SLMs) running locally on a desktop with a consumer-grade GPU. We wanted a certain level of variety and randomness in the dataset and did not want to spend excessive time selecting the 'right' LLM. Therefore, we used a diverse set of SLMs, including Phi, Gemma, Mistral, Llama, Qwen, and DeepSeek, along with a mix of code-specific and language-specific models. Since we wanted the solution to be general-purpose, we developed a custom implementation that allows potential users to specify a list of LLMs they wish to use (including those provided by Azure OpenAI and […]
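As a rough illustration of the generation loop described above, here is a minimal sketch, assuming the quantized SLMs are served locally via Ollama. The article does not show its actual implementation; the model tags, the prompt, and the output format below are my own illustrative choices, not the project's.

```python
# Minimal sketch: generate synthetic (query, passage) pairs from documentation
# chunks using a rotating pool of locally served SLMs (assumed: Ollama runtime).
import json
import random

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODELS = ["phi3", "gemma2", "mistral", "llama3.1", "qwen2.5", "deepseek-coder"]  # assumed local tags

PROMPT = (
    "You are helping build a retrieval benchmark for SimTalk documentation.\n"
    "Write one realistic user question that the following passage answers.\n"
    "Return only the question.\n\nPassage:\n{chunk}"
)


def generate_pair(chunk: str) -> dict:
    """Ask a randomly chosen local model for a question answered by this chunk."""
    model = random.choice(MODELS)  # vary the generator model for diversity
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": PROMPT.format(chunk=chunk), "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return {"model": model, "query": resp.json()["response"].strip(), "positive_passage": chunk}


if __name__ == "__main__":
    chunks = ["..."]  # documentation chunks would be loaded here
    with open("synthetic_benchmark.jsonl", "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(generate_pair(chunk), ensure_ascii=False) + "\n")
```

Each resulting pair of a generated query and the passage it came from can later be used to score candidate embedding models with standard retrieval metrics such as recall@k.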