An Empirical Comparison of ReAct, Reflexion, Plan-and-Solve, and Tree-of-Thought Planning Strategies on Financial Question Answering and Numerical Reasoning Tasks

Xuanyi Fu; Tianxing Tang; Chuankai Luo

Authors

Xuanyi Fu, M.S.E. in Computer Science, Johns Hopkins University, Baltimore, MD, USA
Tianxing Tang, Translation and Localization Management, Middlebury Institute of International Studies, Monterey, CA, USA
Chuankai Luo, Department of Electronic Engineering, Tsinghua University, Beijing, China

Keywords:

LLM agents, planning strategies, financial question answering, empirical evaluation

Abstract

Large language model agents increasingly automate reasoning- and decision-intensive financial workflows, yet the comparative effectiveness of competing planning strategies on finance-specific tasks remains unclear. We conduct a controlled empirical comparison of four widely adopted planning strategies (ReAct, Reflexion, Plan-and-Solve, and Tree-of-Thought) on four public financial benchmarks spanning multi-step numerical reasoning (FinQA), multi-turn numerical dialogue (ConvFinQA), hybrid tabular-textual question answering (TAT-QA), and long-document question answering (DocFinQA). Using a shared GPT-4o backbone, a common tool set, and a unified evaluation protocol, we measure execution accuracy, exact-match correctness, per-task-type performance, and per-query token cost across three random seeds. Plan-and-Solve offers the best accuracy per dollar on purely numerical tasks: it improves on ReAct by a moderate 2.8 points on FinQA at roughly one-seventh the token budget of Tree-of-Thought. ReAct with retrieval leads on long-document DocFinQA, outperforming Plan-and-Solve by 4.1 points. Tree-of-Thought attains the single highest accuracy on the compound-arithmetic subset of TAT-QA (71.4%) but uses 7.2× as many tokens per query as Plan-and-Solve. A manual error typology of 400 failures confirms that each strategy repairs a distinct failure class and that no single strategy dominates all four financial task types. The findings clarify an existing design-space question rather than propose new methodology.
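The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the relative tolerance used for execution accuracy, and the string normalization used for exact match are all assumptions chosen for illustration, and the values in the usage note are toy inputs rather than results from the study.

```python
import math
from statistics import mean


def exact_match(pred: str, gold: str) -> bool:
    """String-level match after trimming whitespace and lowercasing (assumed normalization)."""
    return pred.strip().lower() == gold.strip().lower()


def execution_match(pred: float, gold: float, rel_tol: float = 1e-4) -> bool:
    """Numeric match within a relative tolerance; the tolerance value is an assumption."""
    return math.isclose(pred, gold, rel_tol=rel_tol)


def accuracy(flags: list[bool]) -> float:
    """Percentage of correct answers in one evaluation run."""
    return 100.0 * sum(flags) / len(flags)


def mean_over_seeds(per_seed_accuracy: list[float]) -> float:
    """Average a metric over independent random seeds."""
    return mean(per_seed_accuracy)


def token_cost_ratio(tokens_a: float, tokens_b: float) -> float:
    """How many times more tokens per query strategy A uses than strategy B."""
    return tokens_a / tokens_b
```

With toy inputs, `accuracy([True, False, True, True])` gives 75.0, and `token_cost_ratio(7200, 1000)` gives 7.2, which is the form of the per-query token comparison reported between Tree-of-Thought and Plan-and-Solve.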
