Average F1 scores across 12 LLM configurations and Rosey, on the task of finding errors in an AI-generated earnings summary. Rosey scores 0.76 vs. a best-LLM score of 0.41.
Headline finding
The best-performing LLM configuration found only 28% of errors in an AI-generated financial document. Rosey, our purpose-built independent review system, found nearly twice as many.
Measured by F1 score, the harmonic mean of precision (how many flagged items are real errors) and recall (how many real errors get flagged), the best LLM scored 0.41. Rosey scored 0.76.
What we tested
We generated a Q2 FY2026 earnings summary for Cracker Barrel (NASDAQ: CBRL) using AI, then manually identified 51 numerical errors in the output — claims that were either unsupported or directly contradicted by Cracker Barrel’s official SEC earnings materials.
We then ran 12 leading LLM configurations on the same task: given the same AI-generated summary and the same source materials, find the errors. The 12 configurations span Anthropic, OpenAI, and Google models, with and without extended reasoning enabled.
We compared each configuration’s findings against the human-verified error set and computed F1 scores. The chart above shows the results.
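For readers who want to reproduce the scoring step, here is a minimal sketch of how F1 could be computed from a set of flagged claims and a human-verified error set. It assumes each claim has a stable ID and that a finding counts as correct only when its ID appears in the verified set; the function name `score_findings` and the claim IDs are hypothetical, and the actual matching rule used in the benchmark may differ.

```python
def score_findings(flagged: set[str], verified: set[str]) -> dict[str, float]:
    """Score one reviewer's flagged claims against the verified error set.

    Assumption: exact ID match decides whether a flagged claim is a true error.
    """
    true_positives = len(flagged & verified)

    # Precision: share of flagged claims that are real errors.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: share of real errors that were flagged.
    recall = true_positives / len(verified) if verified else 0.0

    # F1 is the harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical claim IDs for illustration only, not benchmark data.
verified = {"c01", "c02", "c03", "c04"}
flagged = {"c01", "c02", "c09"}
scores = score_findings(flagged, verified)
print(scores)
```

Because F1 penalizes both missed errors and false alarms, a reviewer cannot score well simply by flagging every claim in the summary.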
Why this matters
It’s intuitive to assume one LLM can verify the work of another. The data says otherwise — and the gap is large.
Verification is a fundamentally different problem from generation, and it demands a different architecture: ground truth, citations, traceability, independent math and logic calculations, and a goal of proof rather than plausibility. General-purpose LLMs are optimized for the latter, not the former.
This is why we built Rosey as a purpose-built review system rather than a generation system pointed inward. Rosey verifies claims against source documents, simulates expert reviewers from multiple perspectives, and surfaces errors with citations and rationale — delivered as tracked changes inside Microsoft Word.
Materials
Whitepaper, claim viewer, and full results: insummary.com/prf-cbrl
Benchmark kit: the source documents, AI-generated summary, and human-verified error set are all available for download at the link above, so you can run the benchmark yourself.
Try Rosey
Rosey is currently available in a limited-access beta. Request access at insummary.com/request-access.
