Comparative Empirical Evaluation of Hallucination Mitigation Strategies in LLM-Based Text Generation
Keywords:
large language models, hallucination mitigation, retrieval-augmented generation, factuality evaluation
Abstract
Large language models (LLMs) have achieved remarkable performance across natural language tasks, yet their tendency to generate factually incorrect content (commonly termed hallucination) remains a critical barrier to deployment in high-stakes domains. Two dominant families of mitigation strategies have emerged: retrieval-augmented generation (RAG) approaches that ground outputs in external knowledge, and prompting-based approaches that leverage self-verification without external retrieval. While both families have demonstrated promising results individually, no systematic comparative evaluation exists across standardized benchmarks under unified conditions. This paper presents a comparative empirical analysis of hallucination mitigation strategies spanning four RAG variants (Naive RAG, Self-RAG, Corrective RAG, FLARE) and three prompting-based methods (Chain-of-Verification, self-consistency decoding, self-contradiction detection) evaluated on five public benchmarks: TruthfulQA, HaluEval, FActScore, FELM, and RAGBench. Drawing exclusively from published experimental results, the analysis reveals that advanced RAG strategies achieve 10–25 percentage-point improvements in factual precision over naive baselines, while prompting-based methods offer competitive performance on reasoning-intensive tasks without retrieval infrastructure. Task-dependent performance patterns emerge: knowledge-intensive factoid tasks favor retrieval augmentation, whereas logical consistency tasks benefit from self-verification prompting. A practical decision matrix is derived to guide practitioners in selecting appropriate strategies based on task characteristics and resource constraints.
Published
2026-05-06