Laurent Giraid stands as a leading voice in the evolution of artificial intelligence, particularly in how machines interpret the complex, human-centric structures of the modern web. With a deep background in natural language processing and machine learning, he has long championed the idea that for AI to truly understand our world, it must perceive information the same way we do—not just as strings of text, but as visual experiences. Today, we sit down with him to discuss a groundbreaking shift in Retrieval-Augmented Generation (RAG) that moves away from traditional HTML parsing in favor of a vision-based approach.
In our conversation, we explore the fundamental flaws inherent in converting rich web pages into flat text files and how this “parser loss” accounts for a significant portion of AI hallucinations. We delve into the mechanics of PixelRAG, a system that utilizes screenshots and vision-language models to retain visual hierarchy and context. We also discuss the striking economic advantages of this technology, including a tenfold reduction in token costs for AI agents, and the remaining hurdles that developers must clear to achieve a truly seamless visual retrieval pipeline.
For years, developers have relied on HTML-to-text parsers as the foundation of any RAG pipeline, yet we are seeing a shift toward bypassing them entirely. What is fundamentally wrong with the way we currently extract information from the web for AI?
The core issue is that traditional parsers are essentially trying to translate a two-dimensional visual layout into a one-dimensional string of characters. When you strip away the HTML and convert a page to plain text, you are effectively destroying the visual hierarchy, typography, and layout that provide critical context to the reader. Think about a complex table or a sidebar; a parser often flattens these into a jumbled mess where the relationship between data points is lost. According to recent research, this conversion step is actually the culprit behind the majority of incorrect answers in enterprise AI systems. We’ve reached a point where improving parsers is a game of diminishing returns because every website is a unique snowflake requiring custom engineering, whereas a visual approach handles every page the same way.
When we look at why these AI systems fail to provide accurate answers, your research points toward specific breakdown points in the data pipeline. Could you break down where the information actually gets lost?
We have identified three primary failure points in a text-based pipeline. The first is what we call parser loss, which accounts for about 36.6% of failures; this is where the HTML-to-text conversion is so destructive that the answer simply doesn’t exist in any of the resulting text chunks. Then there is rank loss, the biggest offender at 55.2%, where the correct answer is in the database but gets buried under keyword-heavy infoboxes that rank higher but lack the specific detail needed. Finally, reader loss makes up the remaining 8.2%, where the model actually sees the right text but can’t attribute it correctly because the flattened structure has removed the original context. By seeing the page as a screenshot, we can virtually eliminate these gaps because the visual cues, like bold text or specific layout placements, remain intact for the model to interpret.
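One way to read this taxonomy is as a sequence of checks applied to a failed query: was the answer lost in parsing, in ranking, or at the reader? The sketch below is illustrative only. The function name and its two boolean inputs are assumptions, not the attribution procedure used in the research; only the three percentages come from the interview.

```python
def classify_failure(answer_in_parsed_chunks: bool, answer_in_retrieved: bool) -> str:
    """Attribute a wrong answer to the earliest pipeline stage that lost it.

    Both flags are hypothetical inputs that would come from inspecting
    a failed query by hand or with an automated checker.
    """
    if not answer_in_parsed_chunks:
        # The parsed text never contained the answer at all.
        return "parser_loss"
    if not answer_in_retrieved:
        # The answer was indexed but did not rank into the retrieved set.
        return "rank_loss"
    # The reader saw the right text and still answered incorrectly.
    return "reader_loss"


# Shares of failures quoted in the interview, in percent.
reported_shares = {"parser_loss": 36.6, "rank_loss": 55.2, "reader_loss": 8.2}
total_share = sum(reported_shares.values())

example = classify_failure(answer_in_parsed_chunks=True, answer_in_retrieved=False)
```

Checking the stages in this order means each failure is counted once, which is consistent with the three quoted shares adding up to 100%.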
If we are moving away from text and toward images, how does a system like PixelRAG actually handle the massive amount of data found on a site like Wikipedia?
The process begins with rendering the pages using a tool like Playwright, which captures each page in an 875-pixel viewport to ensure a consistent visual format. The rendered pages are then sliced into 1024-pixel-tall tiles; for a massive corpus like Wikipedia, which contains 7 million articles, this results in approximately 30 million screenshot tiles. These tiles are encoded into 2048-dimensional vectors using a specialized embedding model, like Qwen3-VL-Embedding-2B, and stored in a FAISS index. Even though the raw screenshots would take up a massive 5.6 TB of storage, we can use a “render-on-demand” strategy where we store only the 120 GB vector index and re-render images when they are needed for a query. This makes the system incredibly efficient, allowing for incremental updates without having to re-index the entire multi-terabyte library.
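The tiling step and the storage figures can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, not PixelRAG's code: the helper names and the 2500-pixel example page are hypothetical, and storing each vector component in 2 bytes (16-bit floats) is an assumption that happens to reproduce the quoted index size of roughly 120 GB.

```python
import math

TILE_HEIGHT_PX = 1024  # tile height quoted in the interview


def tile_boxes(page_height_px: int, tile_height_px: int = TILE_HEIGHT_PX):
    """Return (top, bottom) pixel ranges covering a full-page screenshot."""
    n_tiles = math.ceil(page_height_px / tile_height_px)
    return [
        (i * tile_height_px, min((i + 1) * tile_height_px, page_height_px))
        for i in range(n_tiles)
    ]


def index_size_gb(n_vectors: int, dim: int, bytes_per_value: int) -> float:
    """Raw size of a flat vector index in decimal gigabytes."""
    return n_vectors * dim * bytes_per_value / 1e9


boxes = tile_boxes(2500)  # hypothetical 2500-pixel-tall page; last tile is shorter

# 30 million tiles and 2048 dimensions are from the interview;
# 2 bytes per value is an assumption.
size_gb = index_size_gb(30_000_000, 2048, 2)

# Implied averages from the quoted totals.
tiles_per_article = 30_000_000 / 7_000_000
avg_tile_kb = 5.6e12 / 30_000_000 / 1e3
```

Under these assumptions the index comes to about 123 GB, and the 5.6 TB of raw screenshots works out to a little under 190 KB per tile. Storing the vectors alone takes roughly one forty-fifth of that space, which is what makes re-rendering on demand attractive.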
There is a lot of talk about the “token tax” in AI, where the cost of processing large amounts of data becomes prohibitive for many companies. How does a visual retrieval system compare to traditional text-based methods in terms of cost and efficiency?
This is perhaps the most compelling near-term argument for switching to a visual-first approach like PixelRAG. In our benchmark testing, an AI agent using this visual search backend required only 3.6 million prompt tokens to complete its tasks, compared to a staggering 37.5 million tokens for the traditional text retrieval method. That is a ten-fold reduction in token usage, which translates directly to a cost that is 2 to 4 times lower than current alternatives, including high-end search APIs from major tech giants. When you factor in the ability to further compress these images, you can actually cut that token budget by another third without sacrificing accuracy. It turns the economics of high-fidelity RAG on its head, making deep, context-aware retrieval affordable for smaller-scale enterprise applications.
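The token figures can be worked through directly. This is a back-of-the-envelope sketch using only the numbers quoted above; the variable names are illustrative. The interview gives no per-token prices, so the calculation stops at token counts and does not try to reproduce the 2 to 4 times cost figure.

```python
text_prompt_tokens = 37_500_000    # traditional text retrieval, from the interview
visual_prompt_tokens = 3_600_000   # visual search backend, from the interview

# Ratio of text to visual prompt tokens (the "ten-fold" reduction).
reduction_factor = text_prompt_tokens / visual_prompt_tokens

# Cutting the visual budget "by another third" through image compression.
compressed_tokens = visual_prompt_tokens * (1 - 1 / 3)
overall_factor = text_prompt_tokens / compressed_tokens
```

The ratio is about 10.4, and with the extra compression the visual budget falls to roughly 2.4 million tokens, about one fifteenth of the text baseline.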
While the accuracy and cost benefits are clear, there are always trade-offs when implementing a new architecture. What are the current limitations or “unsolved problems” that keep visual RAG from being a total replacement for text today?
The most significant hurdle we face right now is what I call the visual chunking problem. In the text-based RAG world, we’ve spent a decade perfecting how to split documents into semantic chunks based on paragraphs, headers, or specific topics. Currently, PixelRAG simply slices pages at a fixed pixel height, which means a vital table or a crucial paragraph might get cut in half right in the middle of a tile. This lack of awareness regarding content boundaries is a major area for future research, as the visual retrieval community is still playing catch-up to the years of study devoted to text chunking. Until we can intelligently slice images based on their semantic content rather than just pixel counts, the system will always have a slight “blind spot” at the edges of its tiles.
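The effect of fixed-height slicing is easy to demonstrate. The sketch below assumes you already have the vertical extent of each content block on the rendered page, for example from the browser's layout information. The function name, the block names, and the pixel values are all hypothetical.

```python
def blocks_cut_by_tiles(blocks, tile_height_px=1024):
    """Return the names of content blocks that straddle a tile boundary.

    `blocks` maps a block name to its (top, bottom) pixel extent on the
    page, with `bottom` exclusive.
    """
    cut = []
    for name, (top, bottom) in blocks.items():
        first_tile = top // tile_height_px
        last_tile = (bottom - 1) // tile_height_px
        if last_tile > first_tile:
            # The block starts in one tile and ends in a later one.
            cut.append(name)
    return cut


# Hypothetical page layout: the table spans the boundary at pixel 1024.
page_blocks = {"intro": (0, 400), "table": (900, 1300), "notes": (1400, 2048)}
cut_blocks = blocks_cut_by_tiles(page_blocks)
```

Here only the table is split, with its header in the first tile and most of its rows in the second. A content-aware slicer would move the cut to a gap between blocks instead, which is the open problem described above.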
Looking at the benchmarks, PixelRAG seems to excel specifically in areas where text parsers struggle. Which types of queries or data structures see the most dramatic improvement when processed visually?
We see the most impressive gains in structured information extraction, particularly with tables and complex layouts. On the SimpleQA benchmark, PixelRAG achieves an accuracy of 78.8%, which is a significant jump over the 71.6% managed by the strongest text-based parsers. When you move to queries that are specifically based on structured tables, the gap widens in relative terms, with the visual system hitting 48.8% accuracy compared to just 42.5% for text; that is a lead of roughly 15% over the text baseline, versus about 10% on SimpleQA. It is important to note, however, that you need a certain level of model “muscle” to see these benefits; in our experience that means at least a Qwen3-VL-4B class model. Smaller models actually struggle with visual data and can trail behind text retrieval by more than 12.5 percentage points because they lack the reasoning capacity to interpret the layout.
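The two comparisons can be put side by side in absolute and relative terms. This sketch uses only the four accuracy figures quoted above; the helper name is illustrative.

```python
def accuracy_gap(visual_pct: float, text_pct: float):
    """Return (absolute gap in percentage points, gap relative to text)."""
    absolute = visual_pct - text_pct
    relative = absolute / text_pct
    return absolute, relative


simpleqa_abs, simpleqa_rel = accuracy_gap(78.8, 71.6)
tables_abs, tables_rel = accuracy_gap(48.8, 42.5)
```

In percentage points the lead is slightly smaller on table queries (about 6.3 versus 7.2). Measured against the text baseline it is larger, at roughly 15% versus 10%, because the text baseline itself is much lower on tables.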
For a CTO or a lead developer who already has a functional RAG pipeline, a complete rebuild seems like a daunting task. Is there a way to integrate these visual capabilities without starting from scratch?
The most practical path forward isn’t a total replacement but rather the adoption of a hybrid retrieval layer. We are seeing a massive trend in the market toward this, with enterprise intent to adopt hybrid systems tripling from roughly 10% in January to over 33% by March of this year. You can layer PixelRAG on top of your existing text systems, using visual retrieval to handle the complex tables and highly formatted pages where your current parser is likely failing. This “enhancement layer” approach is straightforward to implement and allows teams to see immediate accuracy improvements of up to 18.1% without the risk of a ground-up rebuild. It’s about using the right tool for the right job—letting text handle the simple prose while the vision model tackles the visually dense content.
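A hybrid layer of this kind can start as a simple router that decides, page by page, which retriever should handle the content. The sketch below is one possible heuristic under stated assumptions: the `PageStats` fields, the threshold, and the function name are all hypothetical, and a production system would tune or learn the rule rather than hard-code it.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical layout features collected when a page is indexed."""
    n_tables: int
    n_images: int
    text_chars: int


def choose_retriever(stats: PageStats, min_chars_per_image: int = 2000) -> str:
    """Route a page to 'visual' or 'text' retrieval with a simple heuristic."""
    if stats.n_tables > 0:
        # Tables are where flattening to text loses the most structure.
        return "visual"
    if stats.n_images and stats.text_chars / stats.n_images < min_chars_per_image:
        # Image-dense pages carry much of their meaning in the layout.
        return "visual"
    # Plain prose stays on the existing text pipeline.
    return "text"


routes = [
    choose_retriever(PageStats(n_tables=2, n_images=0, text_chars=5000)),
    choose_retriever(PageStats(n_tables=0, n_images=0, text_chars=8000)),
    choose_retriever(PageStats(n_tables=0, n_images=5, text_chars=3000)),
]
```

Because the router sits in front of both retrievers, the existing text index stays untouched and the visual index only needs to cover the pages routed to it.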
What is your forecast for the role of vision-language models in the enterprise over the next few years?
I believe we are entering an era where the concept of “parsing” will eventually become an archaic relic of the early AI age. Within the next three to five years, I expect vision-language models to become the default standard for all web-scale retrieval because they eliminate the need for the fragile, site-specific engineering that plagues our current systems. As training costs for these models continue to drop—evidenced by our ability to fine-tune PixelRAG in under three hours on a single H100—the barrier to entry will vanish. We will move toward a “what you see is what you retrieve” model, where AI agents navigate the digital world with the same visual fluidity as humans, leading to a massive surge in reliability for autonomous systems across every industry.
