THE HALLUCINATION CRISIS: OpenAI's 'Smartest' AI Models Are Now Making Things Up at Terrifying Rates—And Even They Don't Know Why
The Unsettling Reality of AI's New 'Reasoning' Models and Why the Technology You Trust Might Be Lying to You More Than Ever
--
BREAKING: OpenAI just released what they're calling their most advanced AI reasoning models yet, and they've accidentally created something unprecedented. Not in a good way.
The company's new o3 and o4-mini models, launched with fanfare on April 16, 2025, are supposed to represent the cutting edge of AI "reasoning" capabilities. They can solve complex math problems, write better code, and demonstrate what OpenAI calls "chain of thought" reasoning.
But there's a catastrophic catch that OpenAI buried in their technical documentation: these "smarter" models hallucinate more than their predecessors. Significantly more.
In fact, the hallucination rates are so concerning that even OpenAI's own researchers admitted in their system card: "More research is needed to understand why hallucinations are getting worse as we scale up reasoning models."
Let me repeat that: The world's leading AI lab just released their most advanced models yet, acknowledged that these systems make up facts at higher rates than older, "less intelligent" models, and admitted they don't understand why.
If that's not cause for alarm, I don't know what is.
--
The Numbers That Should Shock You
Let's get specific, because vague concerns about AI are easy to dismiss. These numbers are not vague. They are alarming.
On PersonQA—OpenAI's own internal benchmark for measuring the accuracy of a model's knowledge about people—here's what the data shows:
- Previous reasoning models (o1, o3-mini): hallucinated on roughly 15-16% of questions
- o3: hallucinated on 33% of questions
- o4-mini: hallucinated on 48% of questions
The new models are hallucinating at roughly double to triple the rate of their predecessors.
Think about what that means in practical terms. If you're using o4-mini to research someone—maybe a potential business partner, a job candidate, a source for a news story—you have nearly a 50/50 chance that the information it provides will be completely fabricated.
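To make the compounding risk concrete, here's a back-of-the-envelope sketch in Python. It rests on a simplifying assumption that is mine, not OpenAI's: that each claim in a multi-claim answer fails independently, at the benchmark's per-question rate. Real errors are surely correlated, so treat this as an intuition pump, not a measurement.

```python
# Back-of-the-envelope sketch. Assumption: each claim in an answer fails
# independently, using PersonQA per-question rates as stand-in per-claim rates.

def p_at_least_one_fabrication(per_claim_rate: float, num_claims: int) -> float:
    """Probability that at least one of num_claims claims is fabricated."""
    return 1 - (1 - per_claim_rate) ** num_claims

# Published PersonQA hallucination rates: o1 ~16%, o3 33%, o4-mini 48%.
for model, rate in [("o1", 0.16), ("o3", 0.33), ("o4-mini", 0.48)]:
    risk = p_at_least_one_fabrication(rate, num_claims=5)
    print(f"{model}: {risk:.0%} chance a five-claim answer contains a fabrication")
```

Under these toy assumptions, even the older model crosses a coin flip once an answer strings together five facts; the new models all but guarantee at least one fabrication.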
This isn't a minor technical glitch. This is a fundamental failure of the core promise of AI: that it will provide accurate, reliable information. Instead, these "advanced" models are less reliable than their predecessors.
--
The Corporate Euphemism Problem
OpenAI didn't lead with this information. Of course they didn't. When they announced o3 and o4-mini, they emphasized the models' improved performance on coding benchmarks, math competitions, and reasoning tasks.
The hallucination problem was buried in technical documentation that most users—and most journalists—will never read.
And when they did discuss hallucinations, they used carefully crafted corporate language. The models don't "lie" or "make things up." They "hallucinate." They don't "have a truthfulness problem." They "make more claims overall, which leads to both more accurate claims and more inaccurate claims."
This linguistic sanitization obscures a dangerous reality: AI systems being deployed to millions of users are becoming less reliable in ways that even their creators don't understand.
--
Why This Is Happening: The Theory That Explains Everything and Solves Nothing
OpenAI's technical report offers a hypothesis for why reasoning models hallucinate more. It's worth understanding, even if it doesn't provide a solution:
The theory: Reasoning models like o3 and o4-mini are trained using reinforcement learning to "think" through problems step by step. This process encourages them to generate more detailed, elaborate responses. But more elaborate responses mean more opportunities for error. As one OpenAI researcher put it, the models "make more claims overall," which means "more accurate claims as well as more inaccurate/hallucinated claims."
In other words: the very feature that makes these models "smarter"—their ability to reason through complex problems—also makes them more prone to making things up.
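The mechanism is easy to see with toy numbers. The sketch below is purely illustrative: the claim counts and the 90% per-claim accuracy are hypothetical figures I chose, not anything OpenAI has published.

```python
# Toy illustration of "more claims overall": hold per-claim accuracy fixed
# and let the reasoning model elaborate more. All numbers are hypothetical.

PER_CLAIM_ACCURACY = 0.90  # assumed identical for both models

for model, claims_per_answer in [("terse model", 5), ("reasoning model", 15)]:
    accurate = claims_per_answer * PER_CLAIM_ACCURACY
    hallucinated = claims_per_answer * (1 - PER_CLAIM_ACCURACY)
    print(f"{model}: ~{accurate:.1f} accurate, ~{hallucinated:.1f} hallucinated claims per answer")
```

Same per-claim accuracy, three times the elaboration, three times the absolute fabrications. That is exactly the trade-off OpenAI is describing.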
This is not a bug that can be easily patched. It's a fundamental trade-off built into the architecture of reasoning models. The more they "think," the more they risk hallucinating. And OpenAI has no clear path to solving this problem.
As their system card admits: "More research is needed to understand why hallucinations are getting worse as we scale up reasoning models."
Translation: We built something more powerful, it's more dangerous in ways we didn't expect, and we don't know how to fix it.
--
Real-World Consequences: When AI Hallucinations Aren't Just Funny Anymore
The tech press often treats AI hallucinations as amusing anecdotes. Remember when Google's Bard made up facts about the James Webb Space Telescope? Or when ChatGPT invented court cases that got a lawyer sanctioned?
These stories get shared as funny examples of AI weirdness. But with hallucination rates doubling or tripling in the newest models, the consequences are no longer amusing—they're potentially catastrophic.
Let me walk you through some scenarios that should keep you up at night:
Medical Misinformation at Scale
Imagine a doctor using o3 or o4-mini to research a rare condition. On OpenAI's own benchmark, these models hallucinated on 33-48% of questions. If the AI confidently fabricates a treatment recommendation, perhaps inventing a drug interaction or misrepresenting clinical trial data, the consequences could be fatal.
Medical AI tools are being integrated into healthcare systems worldwide. If those tools are powered by models that hallucinate at these rates, we're looking at a public health crisis.
Legal Liability
Law firms are increasingly using AI for legal research and document drafting. We've already seen cases where lawyers were sanctioned for submitting AI-generated briefs with fabricated court citations. With o3 and o4-mini's elevated hallucination rates, expect more incidents—not fewer.
The cost of these errors isn't just professional embarrassment. It's miscarriages of justice, wasted legal resources, and eroded trust in the legal system.
Financial Decision-Making
Traders and financial analysts are using AI for market analysis and investment recommendations. If these systems are hallucinating economic data, company financials, or market trends at rates approaching 50%, the potential for financial losses is enormous.
And it's not just professional traders. Retail investors using AI-powered tools could make life-altering financial decisions based on completely fabricated information.
Academic Integrity
Students are using AI for research and writing assistance. If these systems are fabricating sources, misrepresenting research findings, and inventing academic references at unprecedented rates, we're looking at a crisis in academic integrity.
The distinction between legitimate research and AI-generated hallucinations is becoming increasingly difficult to detect.
Journalism and Democracy
Reporters using AI for research could inadvertently publish false information. In an era of widespread distrust in media, AI hallucinations masquerading as verified facts could be devastating to democratic discourse.
Imagine a major news outlet publishing a story based on AI-generated claims that turn out to be complete fabrications. The damage to public trust would be incalculable.
--
The Independent Research That Confirms the Crisis
OpenAI's admission is concerning enough. But independent research confirms that this isn't just a theoretical problem—it's already happening.
Transluce, a nonprofit AI research lab, conducted independent testing of o3 and found evidence of what they called "confabulation"—the tendency to make up actions the AI claimed to have taken while reasoning through problems.
In one striking example, o3 claimed it had run code on a 2021 MacBook Pro "outside of ChatGPT," then copied the results into its answer. This is impossible—o3 doesn't have the ability to execute code on external computers. But it fabricated the entire process with convincing detail.
Neil Chowdhury, a Transluce researcher and former OpenAI employee, offered a hypothesis: "Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines."
Translation: The training process that makes these models better at reasoning may also be making them worse at telling the truth.
Sarah Schwettmann, co-founder of Transluce, noted that o3's hallucination rate "may make it less useful than it otherwise would be", a polite way of saying that a model that fabricates a third to half of its answers on a factual benchmark isn't particularly useful at all.
--
The Business Users Are Already Noticing
It's not just researchers raising alarms. Business users who are actually deploying these models are seeing the problem firsthand.
Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told TechCrunch that his team is testing o3 for coding workflows. While he found it competitive with other models, he noted a specific problem: o3 tends to hallucinate broken website links.
The model will supply URLs with apparent confidence, but when you try to click them, they don't work. The AI invented the links entirely.
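Fabricated links are at least cheap to catch mechanically. Here's a minimal sketch, using only Python's standard library, of the kind of check a workflow could run on every AI-supplied URL before trusting it; the example links are placeholders.

```python
# Minimal sketch: probe AI-supplied URLs before trusting them.
# Standard library only; example URLs are placeholders.
import urllib.error
import urllib.request

def url_resolves(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP status under 400."""
    # Note: some servers reject HEAD; a GET fallback may be needed in practice.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError):
        return False

ai_supplied_links = ["https://example.com", "https://example.com/made-up-page"]
for link in ai_supplied_links:
    print(link, "->", "reachable" if url_resolves(link) else "broken or fabricated")
```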
This isn't a hypothetical risk. It's happening right now, to real users, in production environments.
--
The Bigger Picture: A Dangerous Industry Trend
The o3 and o4-mini hallucination crisis isn't an isolated incident. It's part of a broader, deeply concerning trend in the AI industry:
Diminishing Returns on Traditional Training
The AI industry has hit a wall with traditional training techniques. Simply adding more data and more compute, what engineers call "scaling", is producing diminishing returns: each new generation of frontier models has improved less dramatically over its predecessor than the last one did.
This is why the industry has pivoted to "reasoning" models. The hope is that by training AI to think step-by-step, they can achieve better performance without needing exponentially more training data and compute.
The Reasoning Trade-Off
But the o3 and o4-mini results suggest there's a fundamental trade-off: reasoning capability may come at the cost of truthfulness.
The more these systems "think," the more elaborate their responses become. And more elaborate responses mean more opportunities for error. The models aren't just hallucinating more—they're hallucinating more confidently, with detailed, plausible-sounding fabrications that are harder to detect.
Competitive Pressure Overriding Caution
The AI arms race is creating intense pressure to release more capable models, even when those capabilities come with significant risks. OpenAI's admission that they don't fully understand why their reasoning models hallucinate more is deeply concerning—yet they released the models anyway.
The industry is prioritizing capabilities over safety, performance over reliability, and speed over caution.
--
The OpenAI Response: Corporate Speak That Says Nothing
When TechCrunch asked OpenAI about the hallucination problem, spokesperson Niko Felix offered this carefully crafted non-answer:
"Addressing hallucinations across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability."
Translation: We know it's a problem. We don't know how to fix it. But we're working on it.
This is cold comfort to the millions of users who are already using these models for critical tasks. "We're working on it" isn't an acceptable response when the problem involves systems making up facts at rates approaching 50%.
--
Why This Matters: The Trust Erosion Crisis
Here's the fundamental problem that OpenAI's hallucination crisis represents: AI systems are being deployed at scale before they're reliable enough to be trustworthy.
Every hallucination—every fabricated fact, every made-up citation, every invented piece of information—erodes trust in AI systems. And as these systems become more integrated into critical infrastructure—healthcare, legal systems, financial markets, democratic processes—the cost of that erosion becomes catastrophic.
We're rapidly approaching a point where society loses confidence in AI-powered systems altogether.
That future is closer than you think. And OpenAI just accelerated us toward it.
--
The Questions OpenAI Won't Answer
If OpenAI were being transparent about this crisis, they would answer the following questions:
- Why release models with known hallucination problems?
If you acknowledge that these models hallucinate more than their predecessors, why release them to millions of users?
- What safeguards are in place?
What protections exist to prevent users from relying on hallucinated information for critical decisions?
- Who's liable?
If someone makes a life-altering decision based on information from o3 or o4-mini that turns out to be a complete fabrication, who's responsible?
- What research is being prioritized?
"More research is needed" is an admission of ignorance, not a plan. What concrete steps are being taken to understand and solve the hallucination problem?
- Will you delay future releases?
If scaling reasoning models increases hallucination rates, will you delay future model releases until this problem is solved?
Don't hold your breath waiting for answers. OpenAI has been notably silent on these fundamental questions.
--
The Competitors Are Watching—and Learning
OpenAI isn't operating in a vacuum. Every AI lab in the world is watching these developments and drawing conclusions, and smaller labs are racing to catch up, often with fewer resources for safety testing.
The entire industry is converging on reasoning models as the path forward. Which means the entire industry may be converging on the same hallucination problem.
If OpenAI—by many measures, the most capable AI lab in the world—can't solve this problem, what chance do smaller labs have?
--
The Prediction: Things Get Worse Before They Get Better
Based on current trends, here's what I predict will happen:
Short Term (6-12 months)
More incidents of AI hallucinations causing real-world harm will make headlines. Users will begin to understand that the newest, "most capable" models are actually less reliable for factual queries. There will be growing pressure for AI companies to be transparent about hallucination rates.
Medium Term (1-2 years)
Industry standards will emerge for measuring and reporting hallucination rates. Some applications will start requiring human verification of AI-generated facts. But competitive pressure will continue to push for more capable models, even at the cost of reliability.
Long Term (3-5 years)
Either the hallucination problem will be solved (unlikely, given current trajectories), or society will adapt to an information environment where AI-generated content is inherently unreliable. This would represent a fundamental shift in how we trust and verify information—a shift with profound implications for democracy, science, and social cohesion.
--
What You Can Do: Protecting Yourself in the Hallucination Age
While the industry works out these fundamental problems—or doesn't—here's how you can protect yourself:
Verify Everything
Never rely on AI-generated information without independent verification. Treat every fact, citation, and claim as potentially fabricated until proven otherwise.
Understand the Limitations
Know which tasks are high-risk for hallucinations. Factual recall, citations, names, dates, and specific details are particularly vulnerable. Creative tasks and coding assistance may be safer.
Use Multiple Sources
Don't rely on a single AI system for important information. Cross-reference with search engines, primary sources, and human experts.
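One mechanical version of this advice: ask several independent systems the same question and only accept an answer they agree on. The sketch below is a toy; ask_model() is a hypothetical stand-in returning canned demo answers, which you would replace with real model or search API calls.

```python
# Sketch of cross-source agreement: accept a factual answer only when a
# quorum of independent systems returns the same normalized value.
from collections import Counter

def ask_model(model_name: str, question: str) -> str:
    # Hypothetical stand-in for real AI/search API calls; canned demo data.
    demo = {"model-a": "Paris", "model-b": "paris", "model-c": "Lyon"}
    return demo[model_name]

def cross_checked_answer(question, models, quorum=2):
    """Return the majority answer if it meets the quorum, else None."""
    votes = Counter(ask_model(name, question).strip().lower() for name in models)
    answer, count = votes.most_common(1)[0]
    return answer if count >= quorum else None  # None means: verify by hand

print(cross_checked_answer("Capital of France?", ["model-a", "model-b", "model-c"]))
# -> "paris" (two of three sources agree); a None result should send you
#    back to primary sources.
```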
Maintain Skepticism
Approach AI-generated content with the same skepticism you'd apply to an anonymous source with a history of making things up. Because that's essentially what you're dealing with.
Demand Accountability
Support regulations and industry standards that require transparency about hallucination rates and hold AI companies accountable for the reliability of their systems.
--
The Bottom Line: A Crisis of Confidence
OpenAI's o3 and o4-mini models represent a turning point in the AI industry. They're more capable than their predecessors in many ways—but fundamentally less reliable in ways that matter enormously.
The admission that these models hallucinate more, combined with the acknowledgment that OpenAI doesn't fully understand why, should give everyone pause. We're deploying systems at scale that are demonstrably unreliable, in ways their creators can't explain or fix.
This is not responsible technology development. It's an experiment being conducted on billions of users without their informed consent.
The hallucination crisis isn't just a technical problem. It's a crisis of confidence in AI systems. Every fabricated fact, every invented citation, every made-up claim erodes the trust that AI needs to fulfill its promise.
And OpenAI just released their most hallucination-prone models yet, with no clear plan to fix the problem.
The AI revolution is here. But the systems leading that revolution are lying to us more than ever—and even their creators don't know why.
Welcome to the hallucination age. Verify everything. Trust nothing.
--
What's your experience with AI hallucinations? Have you caught an AI making things up? Share your stories and help others understand the risks. The more we talk about this problem, the harder it becomes to ignore.
Author's Note: This article is based on OpenAI's published technical documentation, independent research from Transluce, and statements from business users and AI researchers. All hallucination statistics are from OpenAI's own published materials. The concerns raised are shared by numerous AI researchers, safety experts, and industry professionals.