Explainable Remote Sensing Image Captioning with Uncertainty-Aware Vision–Language Feature Fusion for SMB Decision Support
DOI:
https://doi.org/10.71222/zwk4j370
Keywords:
remote sensing captioning, uncertainty quantification, vision-language fusion, Explainable AI (XAI), Bayesian Deep Learning, decision support systems
Abstract
The democratization of remote sensing data presents a transformative opportunity for Small and Medium Businesses (SMBs), yet the adoption of automated interpretation tools is hindered by the "black box" nature of current Vision-Language Models (VLMs). Standard models frequently exhibit overconfidence in ambiguous scenarios, posing financial risks for applications in precision agriculture and logistics. This paper introduces SentiMap, an uncertainty-aware image captioning framework that disentangles aleatoric and epistemic uncertainty through a dual-stream Bayesian architecture. We propose a novel Adaptive Fusion Mechanism that dynamically re-weights visual representations based on spatial variance maps, prioritizing semantic priors when image quality degrades. Extensive experiments on the RSICD dataset and a curated "SMB-Risk" benchmark demonstrate that SentiMap achieves state-of-the-art calibration (ECE: 0.05) without compromising captioning accuracy. User studies confirm that providing interpretable "Trust Scores" and uncertainty heatmaps significantly enhances human decision confidence, bridging the gap between raw pixel data and actionable business intelligence.
References
1. K. Zhang, P. Li, and J. Wang, "A review of deep learning-based remote sensing image caption: Methods, models, comparisons and future directions," Remote Sensing, vol. 16, no. 21, p. 4113, 2024. doi: 10.3390/rs16214113
2. Y. Wang, Q. Song, D. Wasif, M. Shahzad, C. Koller, J. Bamber, and X. X. Zhu, "How certain are uncertainty estimates? Three novel Earth observation datasets for benchmarking uncertainty quantification in machine learning," arXiv preprint arXiv:2412.06451, 2024.
3. Q. Bai and X. Wang, "Cross-Temporal Remote Sensing Image Change Captioning: A Manifold Mapping and Bayesian Diffusion Approach for Land Use Monitoring," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025. doi: 10.1109/jstars.2025.3575807
4. Y. Wang, X. Tang, J. Ma, X. Zhang, F. Liu, and L. Jiao, "Cross-modal remote sensing image-text retrieval via context and uncertainty-aware prompt," IEEE Transactions on Neural Networks and Learning Systems, 2024. doi: 10.1109/tnnls.2024.3458898
5. R. Ricci, F. Melgani, J. M. Junior, and W. N. Goncalves, "NLP-based fusion approach to robust image captioning," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 11809-11822, 2024.
6. G. Franchi, N. Belkhir, D. N. Trong, G. Xia, and A. Pilzer, "Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 8062-8072. doi: 10.1109/cvpr52734.2025.00755
7. X. Yu, Y. Li, J. Ma, C. Li, and H. Wu, "Diffusion-RSCC: Diffusion probabilistic model for change captioning in remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, 2025. doi: 10.1109/tgrs.2025.3554360
8. C. Yang, Z. Li, and L. Zhang, "Bootstrapping interactive image-text alignment for remote sensing image captioning," IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-12, 2024. doi: 10.1109/tgrs.2024.3359316
License
Copyright (c) 2026 Qikun Zuo (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.

