Sparse, Dense, or Hybrid? Comparing Retrieval Strategies for Biomedical Question Answering with Retrieval-Augmented Generation
Keywords:
retrieval-augmented generation, biomedical question answering, dense retrieval, hybrid retrieval
Abstract
Retrieval-augmented generation (RAG) has emerged as a dominant paradigm for grounding large language models in external knowledge, yet the choice of retrieval strategy remains underexplored in the biomedical domain. This study presents an empirical comparison of four retrieval strategies within a standardized RAG pipeline for biomedical question answering: BM25 (sparse), Contriever (general-purpose dense), MedCPT (domain-specific dense), and a reciprocal rank fusion hybrid combining BM25 with MedCPT. Experiments are conducted on three established benchmarks: PubMedQA, MedQA, and BioASQ Task B. Evaluation spans retrieval quality (Recall@10, Recall@20, MRR@10), end-to-end QA accuracy, and answer faithfulness measured with RAGAS. Results indicate that the hybrid strategy achieves the highest Recall@10 on all three datasets, reaching 0.761 on PubMedQA, 0.697 on MedQA, and 0.768 on BioASQ. The domain-specific MedCPT retriever consistently outperforms the general-purpose Contriever, while BM25 remains a competitive baseline that surpasses Contriever on two of the three benchmarks. End-to-end QA accuracy follows a similar pattern, with the hybrid strategy yielding the best performance at 0.741 on PubMedQA and 0.613 on MedQA. Faithfulness analysis reveals that domain-specific retrieval reduces hallucination rates by providing more topically relevant context. These findings offer practical guidance for practitioners selecting retrieval strategies when deploying biomedical RAG applications.
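To make the hybrid strategy and the retrieval metrics named in the abstract concrete, the following is a minimal sketch of reciprocal rank fusion together with Recall@k and the per-query reciprocal rank that is averaged to obtain MRR@k. It is illustrative only and not the study's implementation: the smoothing constant `k=60` is a commonly used default assumed here, the document IDs and ranked lists are hypothetical, and Recall@k is taken to be the fraction of relevant documents found in the top k, which may differ from the definition used in the experiments.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Set


def reciprocal_rank_fusion(rankings: Iterable[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs; each list adds 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant document in the top k, or 0 if none is found."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked lists for a single query (assumed data, not from the study).
bm25_ranking = ["d1", "d2", "d3"]      # stand-in for a sparse (BM25) result list
medcpt_ranking = ["d3", "d1", "d4"]    # stand-in for a dense (MedCPT) result list
relevant_docs = {"d3", "d4"}

fused = reciprocal_rank_fusion([bm25_ranking, medcpt_ranking])
recall_2 = recall_at_k(fused, relevant_docs, k=2)
rr_10 = reciprocal_rank_at_k(fused, relevant_docs, k=10)
print(fused, recall_2, rr_10)
```

Because the fusion uses only rank positions, the BM25 and dense-retriever scores never need to be calibrated against each other. MRR@k over a benchmark would be the mean of `reciprocal_rank_at_k` across all queries.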

