Cross-Modal Attack Detection and Adaptive Reconstruction Method Based on Uncertainty Estimation

Authors

  • Jiayu Fu, University of Chicago, Chicago, Illinois, United States
  • Shaochun Liu, Beijing JoinQuant Investment Management Co., Ltd, Beijing, China

Keywords

multimodal learning, attack detection, uncertainty estimation, adaptive reconstruction, Transformer fusion, cross-modal defense, system security

Abstract

As multimodal fusion applications integrating visual, speech, and language models become widespread in critical domains such as healthcare, transportation, and national defense, the vulnerability of these models to cross-modal adversarial attacks poses a significant threat to system security. Traditional detection methods are typically confined to single-modal signal analysis and struggle to capture subtle inconsistencies across multi-source information. This paper proposes an uncertainty-based cross-modal attack detection and adaptive reconstruction method that achieves real-time detection and repair through joint modeling of multimodal consistency. The approach embeds a Bayesian inference module within the Transformer fusion layer to estimate joint uncertainty across modalities, enabling dynamic monitoring of semantic consistency. Upon detecting an anomalous uncertainty distribution, the system automatically activates a lightweight reconstruction subnetwork that regenerates perturbed features from cross-modal correlations, thereby repairing the compromised regions. Experiments on the COCO-Multimodal QA and AVSpeech datasets show that the method improves detection accuracy by 34% and 29% against FGSM and PGD attacks, respectively, while post-attack repair raises model accuracy by 22% with less than a 6% increase in inference latency. These findings demonstrate that uncertainty-driven modal consistency estimation effectively enhances the security and reliability of multimodal learning systems in real-world scenarios. The research provides a deployable defense mechanism for multimodal AI systems, applicable to defense surveillance, autonomous driving, and medical image analysis. It aligns with the technical direction of the U.S. Department of Defense's AI Security Assurance Program and holds practical significance for strengthening the security of critical national AI infrastructure.
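The detect-then-repair loop described above can be illustrated with a minimal NumPy sketch: uncertainty is approximated by Monte Carlo dropout (repeated stochastic forward passes through a toy fusion map), and when the mean predictive variance exceeds a threshold, a perturbed modality's features are regenerated from the clean peer modality. All names here (`mc_dropout_uncertainty`, `W_cross`, the threshold `TAU`) are illustrative assumptions, not the paper's actual Bayesian module or reconstruction subnetwork.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_uncertainty(fuse, x, n_samples=20, p=0.5):
    """Approximate predictive uncertainty via Monte Carlo dropout:
    run the fusion function n_samples times under random feature
    dropout; return the mean prediction and the mean per-dim variance."""
    outs = []
    for _ in range(n_samples):
        mask = (rng.random(x.shape) > p) / (1 - p)  # inverted dropout
        outs.append(fuse(x * mask))
    outs = np.stack(outs)
    return outs.mean(axis=0), float(outs.var(axis=0).mean())

def reconstruct_from_peer(peer_feat, W):
    """Regenerate a perturbed modality's features from the clean peer
    modality via a (hypothetical) learned linear cross-modal map W."""
    return peer_feat @ W

# Toy stand-ins for the fusion layer and modality features.
W_fuse = rng.standard_normal((8, 8)) * 0.1
fuse = lambda v: v @ W_fuse

img_feat = rng.standard_normal(8)            # e.g. visual features
txt_feat = rng.standard_normal(8)            # e.g. language features
W_cross = rng.standard_normal((8, 8)) * 0.1  # assumed cross-modal map

_, u = mc_dropout_uncertainty(fuse, img_feat)
TAU = 0.05  # anomaly threshold on mean variance (assumed)
if u > TAU:  # anomalous uncertainty -> activate reconstruction
    img_feat = reconstruct_from_peer(txt_feat, W_cross)
```

In the paper's setting the threshold would be calibrated on clean validation data, and the linear map would be replaced by the lightweight reconstruction subnetwork trained on cross-modal correlations; the sketch only shows the gating logic.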

Published

2026-01-07

Section

Articles