Research on Cross-Modal Semantic Alignment Methods for Low-Resource Languages
DOI:
https://doi.org/10.71222/tvftmg24
Keywords:
low-resource languages, cross-modal semantic alignment, contrastive learning, transfer enhancement
Abstract
This study addresses the challenge of cross-modal semantic alignment in low-resource languages, a critical problem for enabling inclusive and equitable AI-driven multimodal applications. We propose a novel framework that synergistically integrates multi-level textual embeddings, visual Transformer modeling, and the construction of a unified cross-modal projection space. To enhance alignment quality, the approach incorporates advanced mechanisms including contrastive learning, distributed semantic constraints, and fine-grained local alignment strategies. Furthermore, to mitigate data scarcity inherent in low-resource settings, we leverage transfer enhancement techniques such as cross-lingual knowledge distillation, pseudo-pair augmentation, and multi-task training. Comprehensive experiments on the FLORES-200 dataset demonstrate that our method consistently surpasses state-of-the-art models such as CLIP and ALIGN across multiple metrics. Specifically, significant gains are observed in Recall@1 and Mean Rank for languages including Swahili and Sinhala, underscoring the method's effectiveness, robustness, and generalizability in low-resource scenarios. These findings highlight the potential of the proposed approach for advancing cross-lingual multimodal understanding and bridging the performance gap for underrepresented languages.
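The abstract names contrastive learning over a unified cross-modal projection space as a core alignment mechanism. As a minimal illustrative sketch only, not the paper's actual implementation, the widely used CLIP-style symmetric InfoNCE objective over paired text and image embeddings can be written as below; the function name, temperature value, and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a shared projection space (sketch).

    text_emb, image_emb: (batch, dim) tensors of paired text/image
    features already projected into the unified cross-modal space.
    """
    # L2-normalize so the dot product becomes cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds true pairs
    logits = text_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Align both retrieval directions (text->image and image->text)
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2
```

A loss of this form treats the diagonal of the in-batch similarity matrix as positive pairs and all off-diagonal entries as negatives, which is why larger batches generally yield stronger cross-modal alignment signals.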
License
Copyright (c) 2025 Zhizhi Yu (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.

