DataGemma

DataGemma is a set of open models designed to address the problem of hallucination by grounding LLMs in the vast store of real-world statistical data available through Google's Data Commons. Data Commons already offers a natural language interface, and DataGemma takes advantage of it, querying the data in natural language rather than through bespoke code. By going through Data Commons, we avoid having to work with data spread across many different schemas and APIs: from an application's point of view, the LLM presents a single, ‘universal’ natural language interface to the external data.

Data Commons is Google's publicly accessible knowledge graph. It draws on more than 250 billion global facts across hundreds of thousands of statistical variables, sourced from authoritative organizations such as the United Nations, the World Health Organization, national health ministries, and census bureaus, and it covers a vast range of topics including economics, climate change, health, and demographics.
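
To make that scale a little more concrete, here is a minimal sketch of fetching a statistic directly with the public `datacommons` Python client. The specific variable and place identifiers are only illustrative examples; DataGemma itself goes through Data Commons' natural language interface rather than hand-written calls like these.

```python
# Minimal sketch: pulling a statistic directly from Data Commons with the public
# `datacommons` Python client (pip install datacommons). The identifiers are just
# examples; depending on the endpoint, an API key (dc.set_api_key(...)) may be needed.
import datacommons as dc

# "Count_Person" is the Data Commons statistical variable for total population;
# "country/IND" is the Data Commons ID (DCID) for India.
population = dc.get_stat_value("country/IND", "Count_Person")
print(f"Latest population of India in Data Commons: {population}")

# The same variable as a time series, keyed by observation date.
series = dc.get_stat_series("country/IND", "Count_Person")
for date, value in sorted(series.items())[-3:]:
    print(date, value)
```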

DataGemma: Retrieval-Interleaved Generation (RIG)

The RIG approach fine-tunes Gemma 2 to recognize statistics in its own responses and, alongside each generated number, to write out a natural language Data Commons query that can be used to check it. Here's how RIG works:

  • User query: The user submits a query to the LLM.
  • Initial response & Data Commons query: The DataGemma model (based on the 27 billion parameter Gemma 2 model and fine-tuned specifically for this RIG task) generates a draft response that also contains a natural language query aimed at Data Commons' existing natural language interface, to retrieve the relevant data.
  • Data retrieval & correction: The query is run against Data Commons and the relevant statistics are retrieved. These retrieved values, together with their source information and links, then replace the potentially inaccurate numbers in the initial response (a minimal sketch of this correction step follows the list).
  • Final response with source link: The final response is returned as an answer to the user's question, along with a link to the source data and metadata in Data Commons so the figures can be verified.
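
A rough sketch of that correction step is shown below. The inline annotation format, the helper names, and the stubbed numbers are all assumptions made for illustration; DataGemma's actual fine-tuned output format and the Data Commons endpoint it calls are not reproduced here.

```python
# Illustrative sketch of the RIG correction step. The inline annotation format
# [DC("<natural language query>") -> "<model value>"] is hypothetical, chosen only
# to show the mechanics; DataGemma's real fine-tuned output format may differ.
import re

def query_data_commons(nl_query: str) -> tuple[str, str]:
    """Placeholder for a call to Data Commons' natural language interface.

    Returns a (value, source_url) pair. A real implementation would call the
    Data Commons API; the numbers below are stand-ins, not retrieved data.
    """
    stub_answers = {
        "what was the unemployment rate in California in 2020":
            ("10.1%", "https://datacommons.org/place/geoId/06"),
    }
    return stub_answers.get(nl_query, ("[no data found]", "https://datacommons.org"))

# Matches the annotations the fine-tuned model is assumed to emit next to each statistic.
ANNOTATION = re.compile(r'\[DC\("(?P<query>[^"]+)"\)\s*->\s*"(?P<value>[^"]+)"\]')

def correct_response(draft: str) -> str:
    """Replace each annotated model-generated number with the retrieved value and its source."""
    def substitute(match: re.Match) -> str:
        value, source = query_data_commons(match.group("query"))
        return f"{value} (source: {source})"
    return ANNOTATION.sub(substitute, draft)

draft = ('In 2020 the unemployment rate in California reached '
         '[DC("what was the unemployment rate in California in 2020") -> "8.5%"].')
print(correct_response(draft))
# -> In 2020 the unemployment rate in California reached 10.1% (source: ...)
```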

DataGemma: Retrieval-Augmented Generation (RAG)

In this setup, pertinent information is retrieved from Data Commons before the LLM generates its text, so the response is grounded in facts from the start. The implementation relies on Gemini 1.5 Pro's long context window, which lets us append extensive Data Commons data to the user query.

Here’s how RAG works:

  • User query: The user submits a query to the LLM.
  • Query analysis & Data Commons query generation: The DataGemma model (derived from the Gemma 2 (27B) model and fine-tuned specifically for this RAG task) analyzes the user's query and produces the natural language query (or queries) to send to Data Commons' existing natural language interface.
  • Data retrieval from Data Commons: The natural language queries are run against Data Commons, returning the relevant data tables along with source information and links.
  • Augmented prompt: The retrieved information is combined with the original user query to form an augmented prompt.
  • Final response generation: A larger, long-context LLM (here, Gemini 1.5 Pro) uses this augmented prompt, including the retrieved data, to generate a comprehensive, well-grounded response (a sketch of the full flow follows this list).
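
The sketch below walks through the same flow end to end. Both model calls are stubbed out, and the function names and prompt template are assumptions made for this example: in a real pipeline, generate_dc_queries would be the fine-tuned DataGemma model and generate_final_answer a long-context model such as Gemini 1.5 Pro.

```python
# Illustrative sketch of the RAG flow described above. The helpers are stubs and
# their names are assumptions; only the shape of the pipeline is meant to be accurate.

def generate_dc_queries(user_query: str) -> list[str]:
    """Stub for DataGemma: turn the user query into Data Commons natural language queries."""
    return ["What is the literacy rate in India?",
            "How has India's literacy rate changed over time?"]

def retrieve_from_data_commons(nl_query: str) -> str:
    """Stub for retrieval: return a small serialized data table with its source link."""
    return (f"Query: {nl_query}\n"
            "Table: <rows of statistics returned by Data Commons>\n"
            "Source: https://datacommons.org\n")

def generate_final_answer(prompt: str) -> str:
    """Stub for the long-context LLM that writes the grounded answer."""
    return "<final answer grounded in the retrieved tables>"

def answer(user_query: str) -> str:
    # 1. Query analysis: DataGemma proposes natural language Data Commons queries.
    dc_queries = generate_dc_queries(user_query)
    # 2. Data retrieval: run each query against Data Commons.
    retrieved = "\n".join(retrieve_from_data_commons(q) for q in dc_queries)
    # 3. Augmented prompt: retrieved tables are placed ahead of the original question.
    augmented_prompt = ("Answer using only the statistics below, and cite their sources.\n\n"
                        f"{retrieved}\nQuestion: {user_query}\nAnswer:")
    # 4. Final response: a long-context LLM answers from the augmented prompt.
    return generate_final_answer(augmented_prompt)

print(answer("How literate is India's population?"))
```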

DataGemma is only an early step, and grounded AI is still in its infancy, something we readily acknowledge. We look forward to researchers, developers, and everyone interested in the responsible use of AI reading about DataGemma and joining us in this effort. It has long been our vision that by rooting LLMs in the real-world data of Data Commons, we can open up new possibilities for AI and an environment in which information is not only intelligent but also factual.

By Aisha Singh

Aisha Singh plays a multifaceted role at AyuTechno, where she is responsible for drafting, publishing, and editing articles. As a dedicated researcher, she meticulously analyzes and verifies content to ensure its accuracy and relevance. Aisha not only writes insightful articles for the website but also conducts thorough searches to enrich the content. Additionally, she manages AyuTechno’s social media accounts, enhancing the platform’s online presence. Aisha is deeply engaged with AI tools such as ChatGPT, Meta AI, and Gemini, which she uses daily to stay at the forefront of technological advancements. She also analyzes emerging AI features in devices, striving to present them in a user-friendly and accessible manner. Her goal is to simplify the understanding and application of AI technologies, making them more approachable for users and ensuring they can seamlessly integrate these innovations into their lives.
