
Google’s AI Overviews have transformed search results by delivering quick summaries at the top of the page, but a recent analysis exposed persistent inaccuracies that multiply dramatically at the company’s vast scale. Commissioned by The New York Times and conducted by AI startup Oumi, the study tested thousands of queries and found error rates that could lead to millions of flawed responses each hour.[1][2] Critics warn these mistakes extend beyond trivia to potential dangers in health and safety information, prompting calls for greater caution among users.
One in Ten Queries Goes Wrong – Even After Upgrades
A striking finding emerged from the benchmark tests: Google’s AI Overviews faltered on roughly one in ten questions, despite recent model improvements. The Oumi analysis evaluated 4,326 searches using the SimpleQA benchmark, an industry-standard set of challenging, verifiable questions originally developed by OpenAI.[1] In October, when powered by Gemini 2, the feature achieved 85 percent accuracy. By February, following the shift to Gemini 3, that figure rose to 91 percent.
Progress appeared evident, yet the sheer volume of daily Google searches – billions worldwide – amplified the issue. Analysts calculated that a nine percent error rate could produce millions of incorrect summaries per hour and tens of millions each day.[3] This scale turned incremental flaws into a systemic concern, as users increasingly relied on these prominent AI-generated answers over traditional links.
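To see how a single-digit error rate translates into those volumes, a rough back-of-envelope calculation helps. The sketch below is illustrative only: the daily search volume and the share of searches that trigger an AI Overview are assumptions for the sake of the arithmetic, not figures from the Oumi study.

```python
# Back-of-envelope estimate of how a 9% error rate scales.
# Assumptions (not from the study): ~8.5 billion Google searches per day,
# with AI Overviews appearing on ~10% of them. Both figures are illustrative.
SEARCHES_PER_DAY = 8_500_000_000
OVERVIEW_SHARE = 0.10   # hypothetical fraction of searches showing an Overview
ERROR_RATE = 0.09       # 91% accuracy under Gemini 3 -> 9% errors

overviews_per_day = SEARCHES_PER_DAY * OVERVIEW_SHARE
errors_per_day = overviews_per_day * ERROR_RATE
errors_per_hour = errors_per_day / 24

print(f"Overviews/day: {overviews_per_day:,.0f}")
print(f"Errors/day:    {errors_per_day:,.0f}")   # ~77 million under these assumptions
print(f"Errors/hour:   {errors_per_hour:,.0f}")  # ~3.2 million
```

Even with deliberately conservative inputs, the count lands in the millions of flawed summaries per hour, which is the crux of the scale argument.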
Behind the Numbers: Flawed Sources and Hallucinations
The study delved deeper into why errors persisted, revealing problems with source quality and verifiability. Even among correct responses, a growing share lacked solid grounding in cited links – jumping from 37 percent under Gemini 2 to 56 percent with Gemini 3.[1] Social platforms like Facebook and Reddit frequently appeared in citations, accounting for five to seven percent of references in both accurate and inaccurate overviews.
- AI Overviews sometimes pulled from conflicting web data, selecting the wrong detail amid discrepancies.
- Verification tools like Oumi’s HallOumi AI judge confirmed answers but highlighted inconsistencies across repeated tests.
- Responses varied even for identical queries run seconds apart, underscoring generative AI’s inherent unpredictability.
These patterns suggested that while the technology advanced, it struggled with nuanced interpretation and reliable sourcing, issues compounded by the web’s mix of high- and low-quality content.
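The run-to-run variation, at least, is straightforward to probe. Below is a minimal sketch of a repeat-query consistency check in the spirit of the study's methodology; `fetch_overview` is a hypothetical placeholder for retrieving an AI Overview's answer text, and this is not Oumi's actual harness or the HallOumi API.

```python
from collections import Counter

def fetch_overview(query: str) -> str:
    """Hypothetical stand-in for fetching an AI Overview's answer text.
    A real harness would issue the search and extract the summary."""
    raise NotImplementedError

def consistency_check(query: str, runs: int = 5) -> float:
    """Issue the same query several times and report how often the
    most common (normalized) answer appears. 1.0 means fully stable."""
    answers = [fetch_overview(query).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs

# A stability score well below 1.0 flags the run-to-run variation
# the study observed for identical queries issued seconds apart, e.g.:
# score = consistency_check("what year did the bob marley museum open")
```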
From Trivial Blunders to Tangible Risks
Concrete examples illustrated the pitfalls. One query about the Bob Marley Museum cited a Facebook post, a travel blog, and Wikipedia, yet gave an incorrect opening year of 1987 instead of 1986.[1] Another asked whether cellist Yo-Yo Ma belonged to a Classical Music Hall of Fame; the overview cited relevant sites but denied any record of his induction. A geography question misidentified a river near Goldsboro, North Carolina, as the Neuse instead of the Little River, despite finding the right tourism page.
Health-related searches raised the stakes. A Guardian investigation earlier in the year uncovered AI Overviews offering misleading guidance on liver function tests, listing numbers without context for age, sex, or ethnicity – potentially reassuring patients who in fact had serious conditions.[4] Experts labeled such outputs “dangerous and alarming,” warning they could delay critical care. Google removed the summaries for specific liver test queries but left variations intact.
Google Pushes Back on the Critique
Google disputed the study’s premises. Spokesperson Ned Adriance described it as having “serious holes,” arguing that SimpleQA’s tricky, non-real-world questions inflated failure rates and that the benchmark’s own data contained errors.[1] The company emphasized internal testing with validated benchmarks and noted that Gemini 3.1 Pro reduced hallucinations by 38 percentage points over prior versions.
Officials pointed out that AI Overviews carry a disclaimer urging users to double-check responses. They also said the feature is limited to high-confidence queries and filters out satire and harmful content, with past reports putting problematic overviews at fewer than one in seven million queries. Still, the feature’s expansion raised questions about balancing speed, utility, and precision.
| Model Version | Accuracy Rate | Ungrounded Correct Answers |
|---|---|---|
| Gemini 2 | 85% | 37% |
| Gemini 3 | 91% | 56% |
This comparison from the Oumi tests highlights gains in raw accuracy but setbacks in reliable sourcing.[1]
Key Takeaways
- 91% accuracy still yields millions of daily errors at Google’s volume.
- Source verifiability lags, with social media often cited.
- Health misinformation persists in some queries, demanding user vigilance.
As AI integrates further into daily information-seeking, the tension between convenience and correctness grows. Google continues refining its models, but experts urge skepticism toward automated summaries. What steps should search engines take next to minimize risks? Share your thoughts in the comments.