United States

Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes like memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted.
Existing methods primarily model perceptual information while neglecting the reasoning process facilitated by language; ranking cues are a key outcome of this process and provide practical guidance for saliency prediction.
In this paper, we propose CaRDiff ($\textbf{Ca}$ption, $\textbf{R}$ank, and generate with $\textbf{Diff}$usion), a framework that imitates this process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model to enhance video saliency prediction. Specifically, we introduce a novel prompting method, VSOR-CoT ($\textbf{V}$ideo $\textbf{S}$alient $\textbf{O}$bject $\textbf{R}$anking $\textbf{C}$hain $\textbf{o}$f $\textbf{T}$hought), which uses an MLLM with a grounding module to caption video content and infer salient objects along with their rankings and positions. This process derives ranking maps that the diffusion model can leverage to accurately decode the saliency maps for the given video.
Extensive experiments showcase the effectiveness of VSOR-CoT in improving the performance of video saliency prediction.
The proposed CaRDiff performs better than state-of-the-art models on the MVS dataset and demonstrates cross-dataset capabilities on the DHF1k dataset through zero-shot evaluation.

AAAI 2025

CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion

video understanding activity analysis

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)




Many studies have concentrated on constructing supervised models utilizing paired datasets for image denoising, which proves to be expensive and time-consuming. Current self-supervised and unsupervised approaches typically rely on blind-spot networks or sub-image pair sampling, resulting in pixel information loss and destruction of detailed structural information, thereby significantly constraining the efficacy of such methods. In this paper, we introduce Prompt-SID, a prompt-learning-based single image denoising framework that emphasizes the preservation of structural details. This approach is trained in a self-supervised manner using downsampled image pairs. It captures original-scale image information through structural encoding and integrates this prompt into the denoiser. To achieve this, we propose a structural representation generation model based on the latent diffusion process and design a structural attention module within the transformer-based denoiser architecture to decode the prompt. Additionally, we introduce a scale replay training mechanism, which effectively mitigates the scale gap between images of different resolutions. We conduct comprehensive experiments on synthetic, real-world, and fluorescence imaging datasets, showcasing the remarkable effectiveness of Prompt-SID. Our code will be released soon.

Prompt-SID: Learning Structural Representation Prompt via Latent Diffusion for Single Image Denoising

Achieving a balance between accuracy and efficiency is a critical challenge in facial landmark detection (FLD). This paper introduces the Parallel Optimal Position Search (POPoS), a high-precision encoding-decoding framework designed to address the fundamental limitations of traditional FLD methods. POPoS employs three key innovations: (1) Pseudo-range multilateration is utilized to correct heatmap errors, enhancing the precision of landmark localization. By integrating multiple anchor points, this approach minimizes the impact of individual heatmap inaccuracies, leading to robust overall positioning. (2) To improve the pseudo-range accuracy of selected anchor points, the multilateration anchor loss function is proposed. This loss function effectively enhances the accuracy of the distance map, mitigates the risk of local optima, and ensures optimal solutions. (3) A single-step parallel computation algorithm is introduced, significantly enhancing computational efficiency and reducing processing time. Comprehensive evaluations across five benchmark datasets demonstrate that POPoS consistently outperforms existing methods, particularly excelling in low-resolution scenarios with minimal computational overhead. These features establish POPoS as a highly efficient and accurate tool for FLD, with broad applicability in real-world scenarios.

POPoS: Improving Efficient and Robust Facial Landmark Detection with Parallel Optimal Position Search
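To make the multilateration idea in the POPoS abstract concrete, here is a minimal sketch of plain least-squares multilateration in 2D: given several anchor points and an estimated distance from each anchor to an unknown point, it recovers the point. This is not the POPoS algorithm. It ignores the common offset that distinguishes pseudo-ranges from true ranges, and the function name `multilaterate` and its inputs are hypothetical, chosen only for illustration.

```python
import math


def multilaterate(anchors, distances):
    """Estimate a 2D point from anchor positions and distances to them.

    Illustrative only: subtracting the first range equation from the
    others gives a linear system, which is solved in the least-squares
    sense through its 2x2 normal equations.
    """
    if len(anchors) != len(distances) or len(anchors) < 3:
        raise ValueError("need at least three anchors with matching distances")

    (x0, y0), d0 = anchors[0], distances[0]
    rows, rhs = [], []
    for (xi, yi), di in zip(anchors[1:], distances[1:]):
        # 2(xi - x0) x + 2(yi - y0) y = d0^2 - di^2 + xi^2 - x0^2 + yi^2 - y0^2
        rows.append((2.0 * (xi - x0), 2.0 * (yi - y0)))
        rhs.append(d0**2 - di**2 + xi**2 - x0**2 + yi**2 - y0**2)

    # Normal equations (A^T A) p = A^T b for the two unknowns.
    s_aa = sum(a * a for a, _ in rows)
    s_ab = sum(a * b for a, b in rows)
    s_bb = sum(b * b for _, b in rows)
    t_a = sum(a * r for (a, _), r in zip(rows, rhs))
    t_b = sum(b * r for (_, b), r in zip(rows, rhs))

    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; the position is not unique")

    # Cramer's rule on the 2x2 system.
    x = (t_a * s_bb - t_b * s_ab) / det
    y = (s_aa * t_b - s_ab * t_a) / det
    return x, y


anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
true_point = (1.0, 2.0)
distances = [math.dist(a, true_point) for a in anchors]
estimate = multilaterate(anchors, distances)
```

With exact distances the estimate coincides with the true point up to rounding. Because every anchor contributes one equation, an error in a single distance is averaged out rather than passed straight through, which is the intuition behind using multiple anchors.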

Massive open online courses (MOOCs) recommendation provides online courses tailored to learners' individual preferences. Existing literature is limited by: 1) Ignoring the interrelations among courses, knowledge concepts, and videos, which leads to suboptimal recommendation performance; 2) Neglecting the hierarchical interactions between learners and components like courses, knowledge concepts, and videos, which makes it difficult to capture learners' intentions accurately. To address these limitations, we propose a novel multi-type MOOCs recommendation framework, which enables multi-type educational content recommendations. This framework includes two important components: multi-relational representation and hierarchical reasoning. Regarding multi-relational representation, we first create two static course-relational and knowledge concept-relational graphs based on domain knowledge and construct a dynamic video-relational graph using learners' historical browsing sequences. Then, we capture the interactions among different components by learning the corresponding embeddings via graph neural networks. Regarding hierarchical reasoning, we implement a hierarchical beam search strategy to narrow down the candidate courses, knowledge concepts, and videos by calculating joint probability. Finally, we introduce an optional layer to increase the diversity and reasonableness of video recommendations by estimating learners' intentions. Extensive experiments are conducted to show the effectiveness, robustness, and interpretability of our method.

Multi-type MOOCs Recommendation: Leveraging Deep Multi-Relational Representation and Hierarchical Reasoning

Class incremental semantic segmentation (CISS) aims to segment new classes during continual steps while preventing the forgetting of old knowledge. Existing methods alleviate catastrophic forgetting by replaying distributions of previously learned classes using stored prototypes or features. However, they overlook a critical issue: in CISS, the representation of class knowledge is updated continuously through incremental learning, whereas prototype replay methods maintain fixed prototypes. This mismatch between the updated representation and the fixed prototypes limits the effectiveness of the prototype replay strategy. To address this issue, we propose the Adaptive prototype replay (Adapter) for CISS in this paper. Adapter comprises an adaptive deviation compensation (ADC) strategy and an uncertainty-aware constraint (UAC) loss. Specifically, the ADC strategy dynamically updates the stored prototypes based on the estimated representation shift distance to match the updated representation of old classes. The UAC loss reduces prediction uncertainty, aggregating discriminative features to aid in generating compact prototypes. Additionally, we introduce a compensation-based prototype similarity discriminative (CPD) loss to ensure adequate differentiation between similar prototypes, thereby enhancing the efficiency of the adaptive prototype replay strategy. Extensive experiments on Pascal VOC and ADE20K datasets demonstrate that Adapter achieves state-of-the-art results and proves effective across various CISS tasks, particularly in challenging multi-step scenarios. The source code will be made available online.

Adaptive Prototype Replay for Class Incremental Semantic Segmentation

Recent advances in text-to-image diffusion models have spurred significant interest in continuous story image generation. In this paper, we introduce Storynizor, a model capable of generating coherent stories with strong inter-frame character consistency, effective foreground-background separation, and diverse pose variation. The core innovation of Storynizor lies in its key modules: ID-Synchronizer and ID-Injector. The ID-Synchronizer employs an auto-mask self-attention module and a mask perceptual loss across inter-frame images to improve the consistency of character generation, vividly representing their postures and backgrounds. The ID-Injector utilizes a Shuffling Reference Strategy (SRS) to integrate ID features into specific locations, enhancing ID-based consistent character generation. Additionally, to facilitate the training of Storynizor, we have curated a novel dataset called StoryDB comprising 100,000 images. This dataset contains single and multiple-character sets in diverse environments, layouts, and gestures with detailed descriptions. Experimental results indicate that Storynizor demonstrates superior coherent story generation with high-fidelity character consistency, flexible postures, and vivid backgrounds compared to other character-specific methods.

Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection

Video moment retrieval (VMR) aims to locate the most likely video moment(s) corresponding to a text query in untrimmed videos. Training of existing methods is limited by the lack of diverse and generalisable VMR datasets, hindering their ability to generalise moment-text associations to queries containing novel semantic concepts (unseen both visually and textually in a training source domain). For model generalisation to novel semantics, existing methods rely heavily on the assumption of access to both video and text sentence pairs from a target domain in addition to the source domain pair-wise training data. This is neither practical nor scalable. In this work, we introduce a more generalisable approach by assuming only text sentences describing new semantics are available in model training without having seen any videos from a target domain. To that end, we propose a Fine-grained Video Editing framework, termed FVE, that explores generative video diffusion to facilitate fine-grained video editing from the seen source concepts to the unseen target sentences consisting of new concepts. This enables generative hypotheses of unseen video moments corresponding to the novel concepts in the target domain. This fine-grained generative video diffusion retains the original video structure and subject specifics from the source domain while introducing semantic distinctions of unseen novel vocabularies in the target domain. A critical challenge is how to enable this generative fine-grained diffusion process to be meaningful in optimising VMR, more than just synthesising visually pleasing videos. We solve this problem by introducing a hybrid selection mechanism that integrates three quantitative metrics to selectively incorporate synthetic video moments (novel video hypotheses) as enlarged additions to the original source training data, whilst minimising potential detrimental noise or unnecessary repetitions in the novel synthetic videos harmful to VMR learning. Experiments on three datasets demonstrate the effectiveness of FVE on unseen novel semantic video moment retrieval tasks.

Generative Video Diffusion for Unseen Novel Semantic Video Moment Retrieval

Model immunization is an emerging direction that aims to mitigate the potential risk of misuse associated with open-sourced models and advancing adaptation methods. The idea is to make the released models' weights difficult to fine-tune on certain harmful applications, hence the name "immunized". Recent work on model immunization focuses on the single-concept setting. However, in real-world situations, models need to be immunized against multiple concepts. To address this gap, we propose an immunization algorithm that simultaneously learns a single "difficult initialization" for adaptation methods over a set of concepts. We achieve this by incorporating a differentiable merging layer that combines a set of model weights adapted over multiple concepts.
In our experiments, we demonstrate the effectiveness of multi-concept immunization by generalizing prior work's experiment setup of re-learning and personalization adaptation to multiple concepts.

Multi-concept Model Immunization through Differentiable Model Merging

Automatic Radiology Report Generation (RRG) is an important topic for alleviating the substantial workload of radiologists. Existing RRG approaches rely on supervised regression based on different architectures or additional knowledge injection, while the generated report may not align optimally with radiologists’ preferences. This is especially problematic because the preferences of radiologists are inherently heterogeneous and multi-dimensional, e.g., some may prioritize report fluency, while others emphasize clinical accuracy. To address this problem, we propose a new RRG method via Multi-objective Preference Optimization (MPO) to align the pre-trained RRG model with multiple human preferences, which can be formulated by multi-dimensional reward functions and optimized by multi-objective reinforcement learning (RL). Specifically, we use a preference vector to represent the weight of preferences and use it as a condition for the RRG model. Then, a linearly weighted reward is obtained via a dot product between the preference vector and the multi-dimensional reward. Next, the RRG model is optimized to align with the preference vector by optimizing such a reward via RL. In the training stage, we randomly sample diverse preference vectors from the preference space and align the model by optimizing the weighted multi-objective rewards, which leads to an optimal policy on the entire preference space. At inference time, our model can generate reports aligned with specific preferences without further fine-tuning. Extensive experiments on two public datasets show the proposed method can generate reports that cater to different preferences in a single model and achieve state-of-the-art performance.

Radiology Report Generation via Multi-objective Preference Optimization
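The reward scalarization described in the MPO abstract can be sketched in a few lines: a preference vector is sampled, and the scalar reward is its dot product with the multi-dimensional reward. The sketch below is only an illustration under stated assumptions. It assumes preference vectors are non-negative and sum to 1, and the helper names and the two reward dimensions (fluency, clinical accuracy) with their values are hypothetical; the actual sampling scheme and reward functions of the paper are not given in the abstract.

```python
import random


def sample_preference(num_objectives, rng):
    """Draw a random preference vector with non-negative entries summing to 1.

    Assumption: the preference space is the probability simplex.
    """
    raw = [rng.random() + 1e-12 for _ in range(num_objectives)]
    total = sum(raw)
    return [value / total for value in raw]


def weighted_reward(preference, rewards):
    """Scalarize a multi-dimensional reward as a dot product with the preference."""
    if len(preference) != len(rewards):
        raise ValueError("preference and rewards must have the same length")
    return sum(p * r for p, r in zip(preference, rewards))


rng = random.Random(0)
preference = sample_preference(2, rng)

# Hypothetical per-report scores: [fluency, clinical accuracy].
rewards = [0.8, 0.4]
scalar = weighted_reward(preference, rewards)

# A fixed preference that favours clinical accuracy over fluency.
fixed = weighted_reward([0.25, 0.75], rewards)  # 0.25 * 0.8 + 0.75 * 0.4
```

Because the weights sum to 1, the scalar reward always lies between the smallest and largest component reward, so changing the preference vector shifts the optimization target between objectives without changing its scale.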

Blind Image Quality Assessment (BIQA) aims to evaluate image quality in line with human perception, without reference benchmarks. Currently, deep learning BIQA methods typically depend on using features from high-level tasks for transfer learning. However, the inherent differences between BIQA and these high-level tasks inevitably introduce noise into the quality-aware features.
In this paper, we take an initial step towards exploring the diffusion model for feature denoising in BIQA, namely Perceptual Feature Diffusion for IQA (PFD-IQA), which aims to remove noise from quality-aware features. Specifically, 1) we propose a Perceptual Prior Discovery and Aggregation module to establish two auxiliary tasks to discover potential low-level features in images that are used to aggregate perceptual text conditions for the diffusion model; 2) we propose a Perceptual Conditional Feature Refinement strategy, which matches noisy features to predefined denoising trajectories and then performs exact feature denoising based on text conditions. By incorporating a lightweight denoiser and requiring only a few feature denoising steps (e.g., just five iterations), our method demonstrates superior performance across eight standard BIQA datasets, outperforming state-of-the-art BIQA approaches.

Feature Denoising Diffusion Model for Blind Image Quality Assessment

Contemporary face recognition systems use feature templates extracted from face images to identify persons. To enhance privacy, face template protection techniques are widely employed to conceal sensitive identity and appearance information stored in the template. This paper identifies an emerging form of privacy attack utilizing diffusion models that could nullify prior protection. The attack can synthesize high-quality, identity-preserving face images from templates, revealing persons' appearance. Based on studies of the diffusion model's generative capability, this paper proposes a defense by rotating templates to a noise-like distribution. This is achieved efficiently by spherically and linearly interpolating templates on the hypersphere where they are located. This paper further proposes to divide templates' feature dimensions into groups and apply dropout group-wise, to enhance the irreversibility of rotated templates. The proposed techniques are concretized as a novel face template protection technique, SlerpFace. Extensive experiments show that SlerpFace provides satisfactory recognition accuracy and comprehensive protection against inversion and other attack forms, superior to prior art.
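For readers unfamiliar with spherical linear interpolation (slerp), the following is a minimal sketch of the standard formula applied to unit vectors. It is not the SlerpFace procedure: how templates are paired with noise, which interpolation weight is used, and how group-wise dropout is applied are not specified in the abstract. The function names and the example vectors are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vector]


def slerp(start, end, t):
    """Spherical linear interpolation between two vectors, for t in [0, 1].

    Both inputs are normalized first; the result stays on the unit hypersphere.
    """
    a, b = normalize(start), normalize(end)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two unit vectors

    if omega < 1e-8:
        # Nearly parallel: linear interpolation is numerically safer.
        return normalize([(1.0 - t) * x + t * y for x, y in zip(a, b)])
    if math.pi - omega < 1e-8:
        raise ValueError("slerp is not unique for antipodal vectors")

    sin_omega = math.sin(omega)
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


template = [1.0, 0.0]   # stand-in for a (very low-dimensional) face template
noise = [0.0, 1.0]      # stand-in for a random direction on the hypersphere
halfway = slerp(template, noise, 0.5)
halfway_norm = math.sqrt(sum(x * x for x in halfway))
```

In this example the two vectors are orthogonal, so the halfway point has both coordinates equal to the square root of one half and unit norm. Larger values of `t` move the result further from the template and closer to the noise direction while keeping it on the sphere.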
