Comparative Empirical Evaluation of Hallucination Mitigation Strategies in LLM-Based Text Generation
Keywords:
large language models, hallucination mitigation, retrieval-augmented generation, factuality evaluation
Abstract
Large language models (LLMs) have achieved remarkable performance across natural language tasks, yet their tendency to generate factually incorrect content (commonly termed hallucination) remains a critical barrier to deployment in high-stakes domains. Two dominant families of mitigation strategies have emerged: retrieval-augmented generation (RAG) approaches that ground outputs in external knowledge, and prompting-based approaches that leverage self-verification without external retrieval. While both families have demonstrated promising results individually, no systematic comparative evaluation exists across standardized benchmarks under unified conditions. This paper presents a comparative empirical analysis of hallucination mitigation strategies spanning four RAG variants (Naive RAG, Self-RAG, Corrective RAG, FLARE) and three prompting-based methods (Chain-of-Verification, self-consistency decoding, self-contradiction detection) evaluated on five public benchmarks: TruthfulQA, HaluEval, FActScore, FELM, and RAGBench. Drawing exclusively from published experimental results, the analysis reveals that advanced RAG strategies achieve 10–25 percentage-point improvements in factual precision over naive baselines, while prompting-based methods offer competitive performance on reasoning-intensive tasks without retrieval infrastructure. Task-dependent performance patterns emerge: knowledge-intensive factoid tasks favor retrieval augmentation, whereas logical consistency tasks benefit from self-verification prompting. A practical decision matrix is derived to guide practitioners in selecting appropriate strategies based on task characteristics and resource constraints.
Published
2026-05-06