Author(s): Chirag Agrawal

Originally published on Towards AI.

Photo by Alvaro Reyes on Unsplash

Discover how Google’s Data Gemma leverages the Data Commons knowledge graph to tackle AI hallucinations. In this blog post, we’ll explore how Data Gemma aims to improve the factual accuracy of Large Language Models (LLMs), set up a Retrieval Augmented Generation (RAG) pipeline, test its capabilities, and compare it with other leading models. Whether you’re an AI enthusiast or a developer looking to enhance your applications, this deep dive into Data Gemma will provide valuable insights into the evolving landscape of AI technology.

To make this exploration hands-on, I’ve created a GitHub repository demonstrating the setup and implementation: Hands-On with Data Gemma. Feel free to follow along!

Introduction

Ever since Google unveiled their new language model called Data Gemma, I’ve been eager to dive in and see what makes it tick. Data Gemma promises to revolutionize how AI models interact with data, aiming to reduce a common issue known as hallucinations — when AI confidently provides inaccurate information. As someone who frequently tinkers with Large Language Models (LLMs) and grapples with the quirks of Retrieval Augmented Generation (RAG), I was particularly intrigued by Data Gemma’s innovative approach. After poring over their research paper, I decided to get my hands dirty. This blog post chronicles my journey of setting up a RAG pipeline with Data Gemma, testing its capabilities, and comparing it with other models to understand how it addresses these common AI challenges.

Understanding the Problem Space

LLMs are getting impressively sophisticated — they can summarize text, brainstorm creative ideas, and even crank out code. But let’s be real: sometimes they confidently spout inaccuracies — a phenomenon we lovingly call hallucination. Google’s research aims to tackle this head-on by addressing three major challenges:

1. Teaching the LLM when to fetch data from external sources versus relying on its own knowledge.
2. Helping the LLM decide which external sources to query.
3. Guiding the LLM to generate queries that fetch the data needed to answer the original question.

Typically, we tackle these problems with Tool Use + Retrieval Augmented Generation. Here’s the playbook:

- Tool Use: The LLM is trained — either through fine-tuning or In-Context Learning — to decide which API to call, when to call it, and what arguments to pass.
- RAG: Once the data is fetched, it’s augmented into the instruction, and the LLM generates an answer.

Introducing Data Commons

To streamline the process of fetching data, Google introduced an open-source knowledge graph called Data Commons. Think of Data Commons as a massive, well-organized library. Instead of wandering through countless aisles (APIs) to find a book (data), you have a friendly librarian (Natural Language API) who understands exactly what you need and fetches it for you.

Google claims that Data Commons brings two key innovations:

- A Unified Knowledge Graph: A massive collection of publicly available datasets.
- Natural Language API: An API that accepts natural language queries to interact with the knowledge graph — no LLMs required.

Google’s research suggests that relying on the LLM to choose between multiple APIs and determine the right arguments is too error-prone at scale. Replacing that with a single knowledge graph and a natural language API significantly reduces the chances of hallucinations during query inference.
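To make the librarian analogy concrete, here is a minimal Python sketch contrasting the two ways of pulling a statistic out of Data Commons. The structured call uses the public datacommons client, with illustrative place and variable identifiers; the natural-language request is only a placeholder to show the shape of the interaction, so the endpoint and payload below are my assumptions, not the documented interface.

```python
# pip install datacommons requests
import datacommons as dc
import requests

# Structured lookup: you must already know the place DCID and the statistical
# variable name. "country/USA" and "UnemploymentRate_Person" are illustrative;
# check the Data Commons browser for the exact identifiers you need.
value = dc.get_stat_value("country/USA", "UnemploymentRate_Person")
print("US unemployment rate (latest observation):", value)

# Natural-language lookup: the point of the NL API is that you skip the
# DCID/variable hunt and just ask in plain English. The URL and JSON shape
# below are placeholders, not the official endpoint; consult the Data Commons
# documentation for the current NL interface.
NL_ENDPOINT = "https://datacommons.org/api/nl/query"  # hypothetical endpoint
resp = requests.post(
    NL_ENDPOINT,
    json={"query": "What is the unemployment rate in the United States?"},
)
print(resp.json())
```

The rest of this post leans entirely on the natural-language side: Data Gemma’s job is to produce those plain-English queries, and Data Commons answers them.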
Exploring Retrieval Interleaved Generation (RIG)

While traditional RAG systems retrieve relevant information before generating a response, Google’s approach introduces a new method called Retrieval Interleaved Generation (RIG). Think of it like having a conversation where you pause mid-sentence to check a fact before continuing.

In RIG, the model starts generating a response and, when it realizes it needs specific data (like a statistic or factual detail), it produces a natural language query that can be executed against an external database (in this case, Data Commons). This interleaving of retrieval and generation aims to minimize hallucinations by grounding the AI’s responses in verified data from Data Commons. By fetching information on the fly, the model keeps its answers anchored to current, verified figures rather than relying solely on what it memorized during training.

Data Gemma’s Two Approaches

Google released two versions of Data Gemma to explore these concepts:

- RIG Version: This model is fine-tuned to produce answers to statistical questions while also generating natural language queries for Data Commons. Imagine you’re writing a report and, as you type, you note that you need the latest unemployment rate. The model not only provides an answer but also crafts a query to fetch the exact statistic from Data Commons.
- RAG Version: This model focuses on generating a list of natural language queries relevant to the user’s original question. Instead of attempting to provide the answer directly, it expands the user’s question into multiple, more specific queries that can be answered using reliable data sources.

Personally, I found the second approach — using the LLM to expand the user query — more intriguing. According to the research paper, human evaluators also preferred the answers from the RAG pipeline over those from the RIG pipeline. So, I decided to build a RAG pipeline myself using Data Gemma and Data Commons to see how it performs.

RAG with Google’s Data Gemma

Getting My Hands Dirty

You can follow along with my code available on GitHub: Hands-On Data Gemma. Let’s set up the environment together.

Setting Up the Environment

Setting up the model wasn’t without its hurdles. Google hasn’t published a 7B version of the model on HuggingFace — or at least I couldn’t find it — and the 27B version is too large for my machine. So, I had to get creative with quantized models. Luckily, I found several quantized versions and decided to go with the most downloaded one: bartowski/datagemma-rag-27b-it-GGUF. I used the 2-bit quantized version of the model. With llama-cpp-python, hosting these models for inference is a breeze. Here’s how I set up the Data Gemma model (a sketch of the setup follows at the end of this section).

Testing the Model

With the model up and running, I wanted to see how well it performed. I used the example query: “Has the use of renewables increased in the world?” Data Gemma […]
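For reference, here is a minimal sketch of the setup and test described above, using llama-cpp-python’s from_pretrained helper to pull the quantized GGUF weights from the Hugging Face Hub. The exact GGUF filename pattern, prompt wording, and decoding settings are my assumptions rather than the fine-tuned prompt template the model ships with; the actual implementation lives in the Hands-On Data Gemma repository.

```python
# pip install llama-cpp-python huggingface_hub
from llama_cpp import Llama

# Download and load the 2-bit quantized RAG model. The glob below is an
# assumption about how the GGUF files in bartowski/datagemma-rag-27b-it-GGUF
# are named (Q2_K is the usual 2-bit quant); adjust it to the actual filename.
llm = Llama.from_pretrained(
    repo_id="bartowski/datagemma-rag-27b-it-GGUF",
    filename="*Q2_K*.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload as many layers as possible to the GPU
    verbose=False,
)

# The RAG-tuned model is meant to expand a user question into Data Commons
# queries, not to answer it directly. The prompt wording here is illustrative,
# not the official fine-tuning template.
question = "Has the use of renewables increased in the world?"
response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": (
            "Generate a list of statistical questions for Data Commons that "
            f"would help answer the following question: {question}"
        ),
    }],
    max_tokens=256,
    temperature=0.1,
)
print(response["choices"][0]["message"]["content"])
```

The idea is that the model returns a handful of specific statistical questions (for example, about solar and wind generation over time), each of which can then be answered through the Data Commons natural language API and stitched into a final, grounded response.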