The rapid advancement of generative models has increased the demand for detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations that mix task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), which limits generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features that are robust to distribution shifts. Extensive experiments across various generation models and datasets demonstrate that CausalCLIP significantly improves generalization, achieving gains of 4.06% in average precision and 6.82% in accuracy compared to existing state-of-the-art methods. The source code will be publicly available upon publication.
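The abstract names two building blocks, Gumbel-Softmax feature masking and an HSIC independence penalty, without giving implementation details. The NumPy sketch below illustrates both in their generic textbook form; it is not the authors' code. The function names, the RBF kernel and its bandwidth `sigma`, the biased HSIC estimator, and the two-logit (keep/drop) parameterization per feature dimension are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    # Pairwise RBF (Gaussian) kernel matrix for rows of x, shape (n, d).
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    # Biased empirical HSIC: trace(K H L H) / (n - 1)^2,
    # where H = I - (1/n) 11^T centers the kernel matrices.
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    k = rbf_kernel(x, sigma)
    l = rbf_kernel(y, sigma)
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2

def gumbel_softmax_mask(logits, tau, rng):
    # logits: shape (d, 2), column 0 = "keep", column 1 = "drop" (assumed layout).
    # Returns a relaxed (soft) keep-probability in [0, 1] per feature dimension.
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=logits.shape)
    g = -np.log(-np.log(u))                      # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z = z - z.max(axis=1, keepdims=True)         # numerically stable softmax
    p = np.exp(z)
    p = p / p.sum(axis=1, keepdims=True)
    return p[:, 0]

# Hypothetical usage: split features with a soft mask and measure the
# dependence between the two parts, which a training loss could penalize.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 8))              # stand-in for image embeddings
mask = gumbel_softmax_mask(np.zeros((8, 2)), tau=0.5, rng=rng)
causal_part = features * mask
non_causal_part = features * (1.0 - mask)
penalty = hsic(causal_part, non_causal_part, sigma=2.0)
```

In a setup like this, lowering `tau` pushes the soft mask toward a hard 0/1 selection, and minimizing `penalty` encourages the two feature groups to be statistically independent. How CausalCLIP actually combines these terms is not stated in the abstract.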