
Benchmark study released — How leading LLMs perform at finding errors in AI-generated documents

Written by Erik Ross

Average F1 scores across 12 LLM configurations and Rosey, on the task of finding errors in an AI-generated earnings summary. Rosey scores 0.76 vs. a best-LLM score of 0.41.

Headline finding

The best-performing LLM configuration found only 28% of errors in an AI-generated financial document. Rosey, our purpose-built independent review system, found nearly twice as many.

Measured by F1 score, the harmonic mean of precision (the share of flagged items that are real errors) and recall (the share of real errors that get flagged), the best LLM scored 0.41. Rosey scored 0.76.

What we tested

We generated a Q2 FY2026 earnings summary for Cracker Barrel (NASDAQ: CBRL) using AI, then manually identified 51 numerical errors in the output — claims that were either unsupported or directly contradicted by Cracker Barrel’s official SEC earnings materials.

We then ran 12 leading LLM configurations on the same task: given the same AI-generated summary and the same source materials, find the errors. The 12 configurations span Anthropic, OpenAI, and Google models, with and without extended reasoning enabled.

We compared each configuration’s findings against the human-verified error set and computed F1 scores. The chart above shows the results.
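For readers who want to reproduce the scoring step, here is a minimal sketch of how precision, recall, and F1 could be computed from a set of flagged claims and a human-verified error set. It assumes each finding can be matched to a ground-truth error by a shared claim ID; the function name, the IDs, and the toy data are hypothetical and are not taken from the benchmark kit, whose matching rules may differ.

```python
def score_findings(flagged: set[str], truth: set[str]) -> dict[str, float]:
    """Score one configuration's findings against a verified error set.

    flagged: claim IDs the reviewer reported as errors (hypothetical IDs)
    truth:   claim IDs in the human-verified error set
    """
    true_positives = len(flagged & truth)
    # Precision: of everything flagged, how much was a real error?
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: of all real errors, how many were flagged?
    recall = true_positives / len(truth) if truth else 0.0
    # F1: harmonic mean of precision and recall (0 if both are 0).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data for illustration only; these are not benchmark results.
toy_truth = {"c1", "c2", "c3", "c4"}
toy_flagged = {"c1", "c2", "c9"}  # two real errors, one false alarm
toy_scores = score_findings(toy_flagged, toy_truth)
print({name: round(value, 3) for name, value in toy_scores.items()})
```

In this toy case two of the three flagged claims are real and two of the four real errors are found, so precision is 2/3, recall is 1/2, and F1 is 4/7, or about 0.57. Because F1 is a harmonic mean, a reviewer cannot score well by flagging everything or by flagging almost nothing.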

Why this matters

It’s intuitive to assume one LLM can verify the work of another. The data says otherwise — and the gap is large.

Verification is a fundamentally different problem from generation, and it demands a different architecture: ground truth, citations, traceability, independent math and logic calculations, and a goal of proof rather than plausibility. General-purpose LLMs are optimized for plausibility, not proof.

This is why we built Rosey as a purpose-built review system rather than a generation system pointed inward. Rosey verifies claims against source documents, simulates expert reviewers from multiple perspectives, and surfaces errors with citations and rationale — delivered as tracked changes inside Microsoft Word.

Materials

  • Whitepaper, claim viewer, and full results: insummary.com/prf-cbrl

  • Benchmark kit: the source documents, AI-generated summary, and human-verified error set are available for download at the link above, so you can run the benchmark yourself.

Try Rosey

Rosey is currently available in limited-access beta. Request access at insummary.com/request-access.
