AAAI 2025

MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

Large visual-language models (LVLMs) have achieved great success in multiple applications. However, they still encounter challenges in complex scenes, especially those involving camouflaged objects. This is primarily due to the lack of samples related to camouflaged scenes in the training dataset. To mitigate this issue, we construct the MM-CamObj dataset for the first time, comprising two subsets: CamObj-Align and CamObj-Instruct. Specifically, CamObj-Align contains 11,363 image-text pairs and is designed for VL alignment and for injecting rich knowledge of camouflaged scenes into LVLMs. CamObj-Instruct is collected for fine-tuning LVLMs to improve their instruction-following capabilities, and it includes 11,363 images and 68,849 conversations with diverse instructions. Based on the MM-CamObj dataset, we propose CamObj-Llava, an LVLM specifically designed for tasks in camouflaged scenes. To help our model acquire knowledge about camouflaged objects and scenes effectively, we introduce a curriculum learning strategy with six distinct modes. Additionally, we construct CamObj-Bench to evaluate existing LVLMs' capabilities in understanding, recognition, localization, and counting in camouflaged scenes. This benchmark includes 600 images and 7 tasks, with a total of 9,449 questions. Extensive experiments are conducted on CamObj-Bench with CamObj-Llava, 8 existing open-source LVLMs, and 3 closed-source LVLMs. Surprisingly, the results indicate that our model achieves a 25.84% improvement in 4 out of 7 tasks compared to GPT-4o. Code and data samples are available in the supplementary materials.

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.




Recent advances in diffusion-based generative models have demonstrated superior performance in subject-driven image generation. Identity (ID) preserving image generation, a subtask of subject-driven image generation, aims to generate customized images for a specific human identity and has broad application potential. However, this task remains challenging because it requires high ID fidelity and precise detail preservation. Additionally, generating high-quality context alongside the human ID presents another challenge, as existing methods struggle to achieve both high ID fidelity and satisfactory context simultaneously. To address insufficient ID fidelity, we introduce a simple yet effective test-time fine-tuning approach. Specifically, we propose an attribute-driven training method that establishes global-level and local-level tasks to learn the global face feature and fine-grained attribute features, respectively. Furthermore, we introduce a novel ID-context decoupling framework that decouples image context generation from human ID generation, ensuring the quality of contextual content and facilitating the learning of ID information. Through extensive experiments, we demonstrate the effectiveness of the proposed method and showcase its capabilities across various applications.

FaceA-Net: Facial Attribute-driven ID Preserving Image Generation Network

Incomplete multi-view multi-label classification aims to accurately predict labels for each sample when some views are missing. It also faces problems caused by redundant views. In this paper, we make the first attempt to use diffusion models to address the missing-view problem, and we design a strategy to identify and remove redundant views. Specifically, we train a diffusion model conditioned on pseudo-labels to recover the information of missing views. The learned diffusion model can transfer knowledge of the data distribution in the training split to the data. The redundancy identification strategy considers both the additional information that each view provides and the classification difficulty of each sample. We conduct extensive experiments on five datasets, and the proposed method achieves favorable performance against several state-of-the-art methods on the multi-view multi-label classification task.

Incomplete Multi-View Multi-Label Classification via Diffusion-Guided Redundancy Removal

Sometimes we use a neural network to learn predictions, which are then used to compute a downstream quantity of interest. For instance, we could learn a step-by-step dynamics model and then use it to infer a utility function for decision making. Given limited training data, neither the neural network's prediction nor the downstream quantity of interest will be exact. Quantifying the overall epistemic uncertainty in the downstream quantity of interest can be helpful. For instance, we could use the uncertainty of the utility function from the example above to incorporate a safety margin or to perform active learning. We show that the epistemic uncertainty of such quantities of interest can be estimated conveniently using gradients. Compared with popular related approaches, such as ensemble methods, our method achieves similar results while requiring fewer computational resources.

General Uncertainty Estimation with Delta Variances
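
The abstract above does not spell out its gradient-based estimator on this page. As a rough illustration only, the sketch below assumes the classical delta method, a first-order approximation in which the variance of a downstream quantity f(θ) is taken as ∇f(θ)ᵀ Σ ∇f(θ) for a parameter covariance Σ. The function names, the finite-difference gradient, and the toy covariance are all hypothetical and are not taken from the paper.

```python
def numerical_grad(f, theta, eps=1e-6):
    """Central-difference gradient of a scalar function f at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def delta_variance(f, theta, cov):
    """First-order (delta-method) variance estimate: g^T cov g."""
    g = numerical_grad(f, theta)
    n = len(theta)
    return sum(g[i] * cov[i][j] * g[j] for i in range(n) for j in range(n))


def quantity(theta):
    """Hypothetical downstream quantity: a linear function of two parameters."""
    return 2.0 * theta[0] + 3.0 * theta[1]


# Assumed (made-up) parameter covariance, used purely for illustration.
cov = [[0.1, 0.0], [0.0, 0.2]]
variance = delta_variance(quantity, [1.0, 1.0], cov)
```

For a linear quantity the first-order approximation is exact, which makes this toy case easy to check by hand: 2² · 0.1 + 3² · 0.2 = 2.2.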

Despite the advancements of Video Large Language Models (VideoLLMs) in various tasks, they struggle with fine-grained temporal understanding tasks, such as Dense Video Captioning (DVC). DVC is a complicated task that requires describing all events within a video while also temporally locating each event, and it integrates multiple fine-grained tasks, including video segmentation, video captioning, and temporal video grounding.
Previous VideoLLMs attempt to solve DVC in a single step, failing to utilize their reasoning capability.
Moreover, the loss previously used for training VideoLLMs does not fully reflect the evaluation metrics, so the supervision it provides is not directly aligned with the target tasks.
To address these problems, we propose a novel framework named VidChain, which comprises Chain-of-Tasks (CoTasks) and Metric-based Direct Preference Optimization (M-DPO).
CoTasks decomposes a complex task into a sequence of sub-tasks, allowing VideoLLMs to leverage their reasoning capabilities more effectively.
M-DPO aligns a VideoLLM with evaluation metrics, providing fine-grained supervision for each task that is well aligned with the metrics.
Applied to two different VideoLLMs, VidChain consistently improves their fine-grained video understanding, thereby outperforming previous VideoLLMs on two different DVC benchmarks and also on the temporal video grounding task.

VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning
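
The abstract above names Metric-based Direct Preference Optimization but gives no formula on this page. The sketch below shows only the standard DPO loss for a single preference pair, plus a hypothetical helper that orders two candidates by a metric score. Choosing the preferred output by metric score is an assumption made here for illustration; the function names, the `beta` default, and the log-probability inputs are likewise illustrative rather than VidChain's actual implementation.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair of sequence log-probabilities.

    logp_w / logp_l: policy log-probs of the preferred / rejected output.
    ref_logp_w / ref_logp_l: the same under a frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def rank_pair(cand_a, cand_b, score):
    """Hypothetical helper: return (preferred, rejected) by metric score."""
    if score(cand_a) >= score(cand_b):
        return cand_a, cand_b
    return cand_b, cand_a
```

The loss equals log 2 when the policy and the reference agree, and it shrinks as the policy raises the preferred output's likelihood relative to the rejected one.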

Federated learning (FL) enables decentralized clients to collaboratively train a global model under the orchestration of a central server. However, the iterative exchange of model parameters between the server and clients imposes heavy communication burdens, risks privacy leakage, and even precludes collaboration among heterogeneous clients. Distillation-based FL tackles these challenges by exchanging low-dimensional model outputs rather than model parameters, yet it relies heavily on a task-relevant auxiliary dataset that is often not available in practice. Data-free FL attempts to overcome this limitation by training a server-side generator to directly synthesize task-specific data samples for knowledge transfer. However, the update rule of the generator requires clients to share on-device models for white-box access, which greatly compromises the advantages of distillation-based FL. This motivates us to explore a data-free and black-box FL framework via Zeroth-order Gradient Estimation (FedZGE). FedZGE estimates the gradients after they flow through on-device models in a black-box optimization manner, which completes the training of the generator in terms of fidelity, transferability, diversity, and equilibrium without involving any auxiliary data or sharing any model parameters. It thus combines the advantages of both distillation-based FL and data-free FL. Experiments on large-scale image classification datasets and network architectures demonstrate the superiority of FedZGE in terms of data heterogeneity, model heterogeneity, communication efficiency, and privacy protection. Code is available at: https://anonymous.4open.science/r/FedZGE-B6A3.

Data-Free Black-Box Federated Learning via Zeroth-Order Gradient Estimation
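
To make "zeroth-order gradient estimation" from the abstract above concrete, here is a minimal sketch of a generic two-point random-direction estimator, which needs only function evaluations of a black-box objective. It is not FedZGE's estimator, whose details are not given on this page; the function name, the Gaussian directions, and all default values are assumptions chosen for illustration.

```python
import random


def zeroth_order_grad(f, x, num_dirs=100, mu=1e-3, seed=0):
    """Estimate the gradient of a black-box scalar function f at x.

    Averages directional finite differences along random Gaussian
    directions u: ((f(x + mu*u) - f(x - mu*u)) / (2*mu)) * u.
    """
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_dirs
    return grad


def black_box(x):
    """Toy objective queried only through its outputs; true gradient is (1, -2)."""
    return x[0] - 2.0 * x[1]


estimate = zeroth_order_grad(black_box, [0.0, 0.0], num_dirs=20000)
```

The estimate is noisy for small `num_dirs` and approaches the true gradient as more directions are averaged, which is the usual trade-off between query cost and accuracy in black-box settings.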

Posters serve an essential function in marketing and advertising by improving visual communication and brand visibility, thus significantly contributing to industrial design. With the latest developments in controllable T2I diffusion models, research interest has surged in text rendering within synthesized images. Although text rendering accuracy has seen advancements, automatic poster generation remains a relatively untapped area. This paper presents an automatic poster generation framework featuring text rendering capabilities through the use of LLMs. Our framework employs a triple-cross attention mechanism based on alignment learning to achieve precise text placement within detailed contextual backgrounds. Moreover, it supports adjustable fonts, varying image resolutions, and poster rendering with textual prompts in both English and Chinese. Additionally, we present a comprehensive bilingual image-text dataset, GlyphDraw-3M, comprising 3 million image-text pairs, each with OCR annotations and resolutions exceeding 1024. Our method utilizes the SDXL architecture, and extensive experiments confirm its ability to generate posters with intricate and context-rich backgrounds.

GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models

Concept Bottleneck Models (CBMs) offer inherent interpretability by first translating images into human-comprehensible concepts and then classifying with a linear combination of these concepts. However, annotating concepts for visual recognition tasks requires extensive expert knowledge and labor, which constrains the broad adoption of CBMs. Recent approaches have leveraged the knowledge of large language models to construct concept bottlenecks, with multimodal models like CLIP subsequently mapping image features into the concept feature space for classification. Despite this, the concepts produced by language models can be verbose and may introduce non-visual attributes, which hurts accuracy and interpretability. In this study, we investigate how to avoid these issues by constructing CBMs directly from multimodal models. To this end, we adopt common words as the base concept vocabulary and leverage auxiliary unlabeled images to construct a Vision-to-Concept (V2C) tokenizer that can explicitly quantize images into their most relevant visual concepts, thus creating a vision-oriented concept bottleneck tightly coupled with the multimodal model. This leads to our V2C-CBM, which is training-efficient and interpretable with high accuracy. Our V2C-CBM matches or outperforms LLM-supervised CBMs on various visual classification benchmarks, validating the efficacy of our approach. The associated code will be released soon.

V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer

Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by the use of CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping lets us seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. We will release the code and weights after review.

More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding

Existing sequential recommendation models are mostly based on sequential models, which can be misled by inconsistent items in the local sequence. This study proposes GlobalDiff, a plug-and-play framework that enhances the performance of sequential models by using a diffusion model to restore the global non-sequential data structure of the item universe and compensate for the local sequential context. Several novel techniques are proposed, including training construction, a guided reverse approximator, and an inference ensemble, to seamlessly integrate the diffusion model with the sequential model. Extensive experiments on various datasets demonstrate that GlobalDiff can enhance advanced sequential models by an average improvement of 4.8% to 21.8%.

Enhancing Sequential Recommendation with Global Diffusion

We tackle the general differentiable meta-learning problem that is ubiquitous in modern deep learning, including hyperparameter optimization, loss function learning, few-shot learning, and more. These problems are often formalized as Bi-Level Optimization (BLO). We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution and the outer loss becomes an expected loss over the inner distribution. To solve this stochastic optimization, we adopt Stochastic Gradient Langevin Dynamics (SGLD) MCMC to sample the inner distribution, and we propose a recurrent algorithm to compute the MC-estimated hypergradient. Our derivation is similar to forward-mode differentiation, but we introduce a new first-order approximation that makes it feasible for large models without needing to store huge Jacobian matrices. The main benefits are twofold: i) our stochastic formulation takes uncertainty into account, which makes the method robust to suboptimal inner optimization or non-unique inner minima due to overparametrization; ii) compared to existing methods that often exhibit unstable behavior and hyperparameter sensitivity in practice, our method leads to considerably more reliable solutions. We demonstrate that the new approach achieves promising results on diverse meta-learning problems and easily scales to learning 87M hyperparameters in the case of Vision Transformers.
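
The stochastic view in the abstract above, an inner distribution shaped by the inner loss and an outer loss averaged over it, can be illustrated with a small sketch. It assumes the inner distribution is proportional to exp(-inner loss) and uses plain Langevin dynamics with an exact gradient on a one-dimensional toy problem (SGLD proper would use minibatch gradients). The hypergradient recursion from the paper is not reproduced; every name and constant here is hypothetical.

```python
import math
import random


def langevin_samples(grad_inner, theta0, step=0.1, num_steps=20000,
                     burn_in=1000, seed=0):
    """Draw approximate samples from p(theta) proportional to exp(-inner_loss).

    Update: theta <- theta - step * grad + sqrt(2 * step) * N(0, 1).
    """
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for t in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        theta = theta - step * grad_inner(theta) + math.sqrt(2.0 * step) * noise
        if t >= burn_in:
            samples.append(theta)
    return samples


def expected_outer_loss(outer, samples):
    """Monte Carlo estimate of the outer loss averaged over inner samples."""
    return sum(outer(s) for s in samples) / len(samples)


# Toy inner loss (theta - 3)^2 / 2, so the inner distribution is roughly N(3, 1).
samples = langevin_samples(lambda th: th - 3.0, theta0=0.0)
mean_theta = sum(samples) / len(samples)
outer_estimate = expected_outer_loss(lambda th: th * th, samples)
```

Because the outer loss is averaged over samples instead of being evaluated at a single inner minimizer, the estimate reflects the spread of the inner solutions, which is the robustness argument the abstract makes.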
