Quick Take
  • DeepSeek-R1, the flagship reasoning model from Chinese lab DeepSeek, hallucinates at 14.3% according to Vectara’s HHEM 2.1 benchmark.
  • That is nearly four times higher than its non-reasoning predecessor DeepSeek-V3, which scored 3.9%.
  • The gap raises hard questions for the crypto sector.
  • A fast-growing class of AI agent tokens now leans on reasoning-style LLMs for autonomous trading, signals, and on-chain execution.

What Happened

The gap raises hard questions for the crypto sector. A fast-growing class of AI agent tokens now leans on reasoning-style LLMs for autonomous trading, signals, and on-chain execution.

When the underlying model fabricates a price level, a partnership, or a contract address, the consequences can land on-chain.

Market Context

The crypto market now hosts hundreds of AI agent tokens, led by Virtuals Protocol (VIRTUAL), ai16z (AI16Z), and aixbt (AIXBT).

The category has posted roughly 39.4% growth over a recent 30-day window. Virtuals alone has surpassed $576 million in market capitalization.

Most of these agents wrap a large language model in tooling. That tooling lets the agent post on social media, route trades, mint tokens, or generate market commentary.

Yann LeCun, Meta’s chief AI scientist, has long argued that autoregressive LLMs cannot fully escape hallucination. In his view, the architecture itself lacks any grounded model of the world.

Why It Matters

The risk surface scales with autonomy. Read-only agents that summarize sentiment differ in stakes from agents that hold treasury keys.

Details

DeepSeek-R1, the flagship reasoning model from Chinese lab DeepSeek, hallucinates at 14.3% according to Vectara’s HHEM 2.1 benchmark. That is nearly four times higher than its non-reasoning predecessor DeepSeek-V3, which scored 3.9%.
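
As a quick arithmetic check on the "nearly four times" framing, the short sketch below uses only the two rates reported above. The variable names are illustrative, and the calculation assumes the two percentages are directly comparable, which the shared benchmark suggests but this article does not independently verify.

```python
# Rates as reported in the article (Vectara HHEM 2.1), in percent.
r1_rate = 14.3   # DeepSeek-R1
v3_rate = 3.9    # DeepSeek-V3

ratio = r1_rate / v3_rate        # R1's rate as a multiple of V3's
gap_points = r1_rate - v3_rate   # absolute gap in percentage points

print(f"R1 / V3 ratio: {ratio:.2f}x")
print(f"Absolute gap: {gap_points:.1f} percentage points")
```

The ratio works out to about 3.7, so "nearly four times" is a rounded-up reading. The absolute gap is about 10.4 percentage points.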

Vectara Data Shows R1 ‘Overhelps’ With False Facts

Vectara ran both DeepSeek models through HHEM 2.1, its dedicated hallucination evaluation framework. The team also cross-checked the results using Google’s FACTS methodology. R1 produced more false or unsupported statements than V3 in every test configuration.

The cause was not reasoning depth alone. Vectara’s analysts found that R1 tends to “overhelp.” The model adds information that does not appear in the source text.

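That added detail can be factually correct on its own and still count as a hallucination, because HHEM scores whether a response is supported by the source text rather than whether it is true in general. In practice, the behavior slips unsupported context into otherwise sound answers.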

Vectara stated the finding directly in a public post on X.

“DeepSeek-R1 shows a 14.3% hallucination rate, nearly 4x higher than DeepSeek-V3,” Vectara noted.

The pattern is not unique to DeepSeek. Industry trackers note the same trade-off across reasoning-trained models from other labs. Reinforcement learning that sharpens chain-of-thought also rewards bolder and more confident generation.

Why Crypto AI Tokens Sit on This Trade-Off

One BeInCrypto analysis of AIXBT showed the agent had shilled 416 tokens with a 19% average return. The same mechanic that produces those calls, however, exposes followers to losses when the model fails.

Reasoning models are especially attractive for agents that plan across multiple steps. That is also the use case where Vectara’s 14.3% figure bites hardest.

A single hallucinated fact early in a chain of thought can propagate through every downstream action.
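
To give a rough sense of how that propagation risk could compound, the sketch below computes the chance that at least one step in a multi-step plan contains a hallucination. It rests on two loud assumptions that the source does not make: that each step fails independently, and that the HHEM summarization rate can stand in for a per-step error rate in an agent workflow. The function name `chance_of_any_error` is hypothetical, and the output is an illustration, not a measurement.

```python
def chance_of_any_error(per_step_rate: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps goes wrong.

    ASSUMPTION: steps fail independently and share the same error rate.
    Real agent chains may behave better or worse than this toy model.
    """
    if not 0.0 <= per_step_rate <= 1.0:
        raise ValueError("per_step_rate must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Complement rule: 1 minus the chance that every step is clean.
    return 1.0 - (1.0 - per_step_rate) ** steps


# Benchmark rates from the article, reused here as hypothetical per-step rates.
rates = {"DeepSeek-V3": 0.039, "DeepSeek-R1": 0.143}

for model, rate in rates.items():
    for steps in (1, 5, 10):
        risk = chance_of_any_error(rate, steps)
        print(f"{model}: {steps:>2} steps -> {risk:.1%} chance of at least one error")
```

Under these assumptions, a five-step plan at a 14.3% per-step rate has a better-than-even chance (roughly 54%) of containing at least one error. The same plan at 3.9% sits near 18%.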

LeCun Argues the Problem Is Architectural

On that view, reinforcement learning on chain-of-thought can paper over the issue inside narrow domains like math and coding. The root cause, however, stays in place.

Other frontier labs disagree. They point to steady progress on benchmark hallucination rates through retrieval augmentation, post-training fine-tunes, and verifier models. Reports from developers, however, often line up with the leaderboard data.

AI researcher xlr8harder, writing on X about a debugging session with R1, summed up the daily experience.

“Deepseek R1 has an interesting unintegrated understanding of its thought traces. … so it defaults to gaslighting me with hallucinations,” they stated.