Gemini's data-analysis capabilities are not as good as Google claims

One of the selling points of Google's flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can allegedly process and analyze. In press briefings and demos, Google has repeatedly claimed that the models can accomplish previously impossible tasks thanks to their “long context,” such as summarizing documents hundreds of pages long or searching across scenes in film footage.

But new research shows that these models actually aren't very good at these things.

Two separate studies examined how well Google's Gemini models and other models make sense of huge amounts of data — think “War and Peace”-length works. Both found that Gemini 1.5 Pro and 1.5 Flash struggled to correctly answer questions about large datasets; in a series of document-based tests, the models gave the correct answer only 40% to 50% of the time.

“While models like Gemini 1.5 Pro can technically process longer contexts, we have seen several cases that demonstrate that the models do not truly ‘understand’ the content,” Marzena Karpinska, a postdoctoral fellow at UMass Amherst and a study co-author, told TechCrunch.

Gemini's context window falls short

The model's context, or context window, refers to the input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question – “Who won the 2020 US presidential election?” – can serve as context, as can a movie script, show, or audio clip. And as context windows grow, so does the size of the documents that fit into them.

The latest versions of Gemini can accept upwards of 2 million tokens as context. (“Tokens” are subdivided bits of raw data, such as the “fan”, “tas” and “tic” syllables in the word “fantastic”.) That's equivalent to about 1.4 million words, two hours of video or 22 hours of audio — the largest context window of any commercially available model.
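As a rough back-of-the-envelope illustration of those figures (the exact words-per-token ratio is an assumption implied by the article's own numbers; real tokenizers vary by model and language):

```python
# Rough conversion between a token budget and English word count.
# Assumption: ~0.7 words per token, the ratio implied by the article's
# "2 million tokens is about 1.4 million words" figure.
WORDS_PER_TOKEN = 0.7

def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)

context_window = 2_000_000  # Gemini 1.5's advertised token limit
print(tokens_to_words(context_window))  # -> 1400000
```

The same budget shrinks quickly for other media: video and audio are tokenized far more densely per minute than prose.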

At a briefing earlier this year, Google played several pre-recorded demos meant to illustrate the potential of Gemini's long-context capabilities. In one demo, Gemini 1.5 Pro searched the transcript of the Apollo 11 moon landing telecast — about 402 pages — for quotes containing jokes, and then found a scene in the telecast that resembled a pencil sketch.

Oriol Vinyals, vice president of research at Google DeepMind, who led the briefing, described the model as “magical.”

“[1.5 Pro] does this kind of reasoning over every single page, every single word,” he said.

This might be an exaggeration.

In one of the above-mentioned studies benchmarking these capabilities, Karpinska, together with researchers from the Allen Institute for AI and Princeton, asked the models to evaluate true/false statements about fiction books written in English. The researchers chose recently written works so that the models could not “cheat” by relying on prior knowledge, and they filled the statements with references to specific details and plot points that would be impossible to understand without reading the books in their entirety.

Given a statement such as “Using his skills as an Apoth, Nusis is able to reverse engineer the type of portal opened by the key to the reagents found in Rona's wooden chest”, Gemini 1.5 Pro and 1.5 Flash — after ingesting the relevant book — had to say whether the statement was true or false and explain their reasoning.

[Image: an example true/false statement from the benchmark. Image Credits: UMass Amherst]

When tested on a book of about 260,000 words (~520 pages), the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. That means a coin flip would answer questions about the book significantly more accurately than Google's latest machine learning models. Averaged across all the benchmark results, neither model managed to exceed random chance in question-answering accuracy.
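To see why those accuracy figures are damning, compare them against a random-guess baseline on the same kind of balanced true/false questions (a sketch; the question counts here are illustrative, only the reported accuracies come from the study):

```python
import random

def coin_flip_accuracy(labels: list[bool], seed: int = 0) -> float:
    """Answer true/false uniformly at random and report accuracy."""
    rng = random.Random(seed)
    correct = sum(rng.choice([True, False]) == label for label in labels)
    return correct / len(labels)

# Illustrative ground-truth labels: 1,000 balanced true/false statements.
labels = [True, False] * 500

baseline = coin_flip_accuracy(labels)
print(f"coin flip: {baseline:.1%}")   # clusters around 50%
print("1.5 Pro : 46.7%  (reported)")
print("Flash   : 20.0%  (reported)")
```

On a balanced true/false set, any score below ~50% means the model would have done better by guessing blindly.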

“We observed that the models had more difficulty verifying claims that required considering large parts of the book, or the entire book, compared with claims that could be resolved by obtaining sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models struggled to verify claims about implicit information that is obvious to a human reader but not explicitly stated in the text.”

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason over” videos — that is, to search through and answer questions about their content.

The co-authors created a dataset of images (e.g., a picture of a birthday cake) that was paired with questions for the models to answer about the objects depicted in the images (e.g., “What cartoon character is on this cake?”). To evaluate the models, they randomly chose one of the images and inserted “distractor” images before and after it to create slideshow-like footage.
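The construction the co-authors describe can be sketched roughly like this (the filenames and helper are hypothetical; only the idea of burying a target image among distractors comes from the study):

```python
import random

def build_slideshow(target_image: str, distractors: list[str],
                    length: int, seed: int = 0) -> tuple[list[str], int]:
    """Place the target image at a random position within a sequence of
    distractor images, returning the 'slideshow' and the target's index."""
    rng = random.Random(seed)
    pool = rng.sample(distractors, length - 1)   # pick unique distractors
    position = rng.randrange(length)             # where the target lands
    slideshow = pool[:position] + [target_image] + pool[position:]
    return slideshow, position

# Hypothetical filenames standing in for the study's image dataset.
distractor_pool = [f"distractor_{i}.png" for i in range(100)]
frames, idx = build_slideshow("birthday_cake.png", distractor_pool, length=25)

assert len(frames) == 25 and frames[idx] == "birthday_cake.png"
```

The model is then shown the full 25-frame sequence and asked a question that only the target frame can answer, so success requires both finding the frame and reading its content.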

Flash did not perform well. In a test in which the model had to transcribe six handwritten digits from a “slideshow” of 25 images, Flash got about 50% of the transcriptions right. Accuracy dropped to about 30% with eight digits.

“On actual question-answering tasks on images, this appears to be particularly difficult for all of the models we tested,” Michael Saxon, a PhD student at UC Santa Barbara and one of the study’s co-authors, told TechCrunch. “That little amount of reasoning — recognizing that a number is in a frame and reading it — may be the thing that breaks the model.”

Google is promising too much with Gemini

Neither of these studies is peer-reviewed, nor do they test the releases of Gemini 1.5 Pro and 1.5 Flash with 2 million token contexts. (Both tested the 1 million token context release.) And Flash is not as capable as Pro in terms of performance; Google advertises it as a lower-cost alternative.

Still, both fuel the notion that Google has been overpromising — and underdelivering — with Gemini from the start. None of the models tested by the researchers, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, performed well. But Google is the only model provider to give the context window top billing in its advertising.

“There’s nothing wrong with the simple claim that ‘our model can take X number of tokens’ based on objective technical details,” Saxon said. “But the question is, what useful thing can you do with it?”

More broadly, generative AI is coming under greater scrutiny, as businesses (and investors) become frustrated with the technology’s limitations.

In recent surveys by the Boston Consulting Group, nearly half of respondents – all C-suite executives – said they don’t expect generative AI to lead to any significant productivity boost and are concerned about the potential for mistakes and data compromises from generative AI-powered tools. PitchBook recently reported that, for two consecutive quarters, generative AI dealmaking in the early stages has declined, down 76% from its Q3 2023 peak.

Faced with meeting-summarizing chatbots that offer fictitious details about people and AI search platforms that are basically the equivalent of plagiarism generators, customers are looking for promising differentiators. Google — which has sometimes clumsily raced to catch up to its generative AI rivals — was desperate to make Gemini's context one of those differentiators.

But it appears that this bet was placed prematurely.

“We haven't really figured out a way to show that 'reasoning' or 'understanding' is happening over long documents, and basically every group releasing these models is piecing together their own ad-hoc evaluations to make these claims,” Karpinska said. “Without information on how long-context processing is implemented — and companies don't share these details — it's hard to say how realistic these claims are.”

Google did not respond to a request for comment.

Both Saxon and Karpinska believe that the antidote to exaggerated claims about generative AI is better benchmarks and, in the same vein, a greater emphasis on third-party critique. Saxon notes that one of the more common tests for long context (cited liberally by Google in its marketing materials), the “Needle in the Haystack,” only measures a model’s ability to retrieve specific information, like names and numbers, from a dataset — not to answer complex questions about that information.
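Saxon's complaint is concrete: “Needle in the Haystack” only checks verbatim retrieval. A minimal version of the test looks something like this (the harness and filler text are assumptions for illustration, not Google's actual evaluation):

```python
def make_haystack(filler: str, needle: str, depth: float) -> str:
    """Bury a 'needle' sentence at a relative depth within filler text."""
    assert 0.0 <= depth <= 1.0
    cut = int(len(filler) * depth)
    return filler[:cut] + " " + needle + " " + filler[cut:]

def passed(model_answer: str, secret: str) -> bool:
    """Scoring is pure string retrieval: did the secret come back out?"""
    return secret in model_answer

filler = "The sky was grey over the harbor. " * 1000
needle = "The secret passcode is 7403."
prompt = make_haystack(filler, needle, depth=0.5)

# The needle survives verbatim in the long prompt, but nothing in this
# test requires reasoning over the surrounding ~34,000 characters.
assert needle in prompt
print(passed("The passcode mentioned is 7403.", "7403"))  # True
```

A model can ace this retrieval check while still failing the kind of whole-book claim verification the UMass Amherst study measured.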

“Essentially all the scientists and most engineers who use these models agree that our current benchmark culture is broken, so it’s important that the public understands that these huge reports with numbers like ‘general intelligence in benchmarks’ should be taken with a great deal of skepticism,” Saxon said.
