United States

Image compression models have long struggled with adaptability and generalization, as the decoded bitstream typically serves only human or machine needs and fails to preserve information for unseen visual tasks. Therefore, this paper introduces supervision obtained from multimodal pre-training models and incorporates adaptive multi-objective optimization tailored to support both human visual perception and machine vision simultaneously with a single bitstream, denoted as Unified and Generalized Image Coding for Machine (UG-ICM). Specifically, to remove the reliance of compression models on downstream task supervision, we introduce Contrastive Language-Image Pre-training (CLIP) models into the training constraint for improved generalization. Global-to-instance-wise CLIP supervision is applied to help obtain hierarchical semantics that make models more generalizable for tasks relying on information of different granularity. Furthermore, to support both human and machine vision with a single unified bitstream, we incorporate a conditional decoding strategy that takes human or machine preferences as conditions, enabling the bitstream to be decoded into different versions for the corresponding preferences. As such, our proposed UG-ICM is fully trained in a self-supervised manner, i.e., without awareness of any specific downstream models and tasks. Extensive experiments show that the proposed UG-ICM achieves remarkable improvements in various unseen machine analytics tasks, while simultaneously providing perceptually satisfying images.

AAAI 2025

Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision

low level physics based vision

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)





Diffusion models have achieved remarkable success in sequential decision-making by leveraging their highly expressive modeling capabilities in policy learning. A central problem in learning diffusion policies is aligning the policy output with human intents in various tasks. To achieve this, previous methods conduct return-conditioned policy generation or Reinforcement Learning (RL)-based policy optimization, but both rely on pre-defined reward functions. In this work, we propose a novel framework, Forward KL regularized Preference optimization for aligning Diffusion policies, to align the diffusion policy with preferences directly. We first train a diffusion policy from the offline dataset without considering the preference, and then align the policy to the preference data via direct preference optimization. During the alignment phase, we formulate direct preference learning in a diffusion policy, where forward KL regularization is employed in preference optimization to avoid generating out-of-distribution actions. We conduct extensive experiments on MetaWorld manipulation and D4RL tasks. The results show that our method exhibits superior alignment with preferences and outperforms previous state-of-the-art algorithms.

Forward KL Regularized Preference Optimization for Aligning Diffusion Policies
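
As a rough illustration of the preference-alignment step described in the abstract above, the following is a minimal sketch of the generic direct preference optimization (DPO) loss, not the paper's objective. The function name `dpo_loss`, the default `beta`, and the log-probability inputs are assumptions made for illustration. The abstract does not say how likelihoods are obtained for a diffusion policy, and the forward KL regularizer is omitted because its form is not given here.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss over a batch of (preferred, rejected) pairs.

    logp_w / logp_l: log-probabilities of the preferred / rejected sample
    under the policy being aligned (hypothetical inputs).
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference policy trained before alignment.
    """
    # How much more the policy favors the preferred sample than the reference does.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(beta * margin)), written stably as log(1 + exp(-beta * margin)).
    return float(np.mean(np.logaddexp(0.0, -beta * margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward preferred samples relative to the reference.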

Foundation models, serving as pretrained bases for a variety of downstream tasks, aim to learn versatile, rich, and generalizable representations that can be quickly adapted through fine-tuning or even in a zero-shot manner for specific applications. Foundation models for molecular representation are no exception. Various pretext tasks have been proposed for pretraining molecular representations, but these approaches have focused on only single or partial properties. Molecules are complicated and require different perspectives depending on the purpose: insights from the local or global level, 2D topology or 3D spatial arrangement, and low- or high-level semantics. We propose Multi-level mOlecule gRaph prE-train (MORE) to consider these multiple aspects of molecules simultaneously. Experimental results demonstrate that our proposed method effectively learns comprehensive representations by showing outstanding performance in both linear probing and full fine-tuning. Notably, in experiments quantifying forgetting in the pretrained models, MORE consistently exhibits minimal and stable parameter changes with the smallest performance gap, whereas other methods show substantial and inconsistent fluctuations with larger gaps. The effectiveness of individual pretext tasks varies depending on the problems being solved, which again highlights the need for a multi-level perspective.

MORE: Molecule Pretraining with Multi-Level Pretext Task

Advancements in neural implicit representations and differentiable rendering have markedly improved the ability to learn animatable 3D avatars from sparse multi-view RGB videos. However, current methods that map observation space to canonical space often face challenges in capturing pose-dependent details and generalizing to novel poses. While diffusion models have demonstrated remarkable zero-shot capabilities in 2D image generation, their potential for creating animatable 3D avatars from 2D inputs remains underexplored. In this work, we introduce 3D$^2$-Actor, a novel approach featuring a pose-conditioned 3D-aware human modeling pipeline that integrates iterative 2D denoising and 3D rectifying steps. The 2D denoiser, guided by pose cues, generates detailed multi-view images that provide the rich feature set necessary for high-fidelity 3D reconstruction and pose rendering. Complementing this, our Gaussian-based 3D rectifier renders images with enhanced 3D consistency through a two-stage projection strategy and a novel local coordinate representation. Additionally, we propose an innovative sampling strategy to ensure smooth temporal continuity across frames in video synthesis. Our method effectively addresses the limitations of traditional numerical solutions in handling ill-posed mappings, producing realistic and animatable 3D human avatars. Experimental results demonstrate that 3D$^2$-Actor excels in high-fidelity avatar modeling and robustly generalizes to novel poses.

3D$^2$-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling

The text summarization task extracts salient information from a large amount of text for productivity enhancement.
However, most existing methods rely heavily on training models from ample and centrally stored data, which is infeasible to collect in practice due to privacy concerns and the data-scarce nature of several settings (e.g., edge computing or cold starting).
The main challenge lies in constructing a privacy-preserving and well-behaved summarization model under the data scarcity scenario, where data scarcity leads to a knowledge shortage in the model while magnifying the impact of data bias, causing performance degeneration.
To tackle this challenge, previous studies attempt to complement samples or improve the efficiency of data.
The former is usually associated with high computing costs or depends heavily on empirical settings, while the latter might not be effective due to the lack of consideration of data bias.
In this work, we propose FedSum, which extends the standard FL framework in depth and breadth to further extract prime and diversified knowledge from limited resources for text summarization.
For depth extension, we introduce a Data Partition method to cooperatively recognize the prime samples that are more significant and unbiased, and a Data Skip mechanism to help the model further focus on those prime samples during the local training process.
For breadth extension, FedSum extends the source of knowledge and develops the summarization model by extracting knowledge from the data samples, hidden spaces, and globally received parameters.
Extensive experiments on four benchmark datasets verify the promising improvement of FedSum compared to baselines, and show its generalizability, scalability, and robustness.

FedSum: Data-Efficient Federated Learning under Data Scarcity Scenario for Text Summarization

Diffusion models, as a type of generative model, have achieved impressive results in generating images and videos conditioned on text. However, the generation process of diffusion models involves dozens of denoising steps to produce photorealistic images/videos, which is computationally expensive. Unlike previous methods that design "one-size-fits-all" approaches for speed-up, we argue that the number of denoising steps should be sample-specific, conditioned on the richness of the input text. To this end, we introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies, which are then used by the diffusion model for generation. AdaDiff is optimized using a policy gradient method to maximize a carefully designed reward function, balancing inference time and generation quality. We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves similar visual quality compared to the baseline using a fixed 50 denoising steps while reducing inference time by at least 33%, going as high as 40%.
Furthermore, our method can be used on top of other acceleration methods to provide further speed benefits.
Lastly, qualitative analysis shows that AdaDiff allocates more steps to more informative prompts and fewer steps to simpler prompts.

AdaDiff: Adaptive Step Selection for Fast Diffusion Models

Bayesian Optimization (BO) is a sample-efficient black-box optimizer commonly used in search spaces where hyperparameters are independent. However, in many practical AutoML scenarios, there are dependencies among hyperparameters, forming a conditional search space that can be partitioned into structurally distinct subspaces. The structure and dimensionality of hyperparameter configurations vary across these subspaces, challenging the application of BO. Some previous BO works have proposed building multiple Gaussian Process (GP) models, one for each of these subspaces. However, these approaches tend to be inefficient, as they require a substantial number of observations to guarantee each GP's performance and cannot capture relationships between hyperparameters across different subspaces. To address these issues, this paper proposes a novel approach that models the response surfaces of all subspaces in one and captures the relationships between hyperparameters via a self-attention mechanism. Concretely, we design a structure-aware hyperparameter embedding to preserve the structural information. Then, we introduce an attention-based deep feature extractor, capable of projecting configurations with different structures from various subspaces into a unified feature space, where the response surfaces can be formulated using a single standard Gaussian Process. The empirical results on a simulation function, various real-world tasks, and the HPO-B benchmark demonstrate that our proposed approach improves the efficacy and efficiency of BO within conditional search spaces.

Modeling All Response Surfaces in One for Conditional Search Spaces

The human brain is a complex system, and understanding its mechanisms has been a long-standing challenge in neuroscience. The study of the functional connectome, which maps the functional connections between different brain regions, has provided valuable insights through various advanced analysis techniques developed over the years. Similarly, neural networks, inspired by the brain's architecture, have achieved notable success in diverse applications but are often noted for their lack of interpretability. In this paper, we propose a novel approach that bridges neural networks and human brain functions by leveraging brain-inspired techniques. Our approach, grounded in insights from the functional connectome, offers scalable ways to characterize the topology of large neural networks using stable statistical and machine learning techniques. Our empirical analysis demonstrates its capability to enhance the interpretability of neural networks, providing a deeper understanding of their underlying mechanisms.

Functional Connectomes of Neural Networks

As language models grow in size, fine-tuning them becomes more challenging: fine-tuning with first-order optimizers (e.g., SGD and Adam) requires high memory consumption, while fine-tuning with a memory-efficient zeroth-order optimizer (MeZO) suffers a significant accuracy drop and a slower convergence rate. In this work, we propose a Low-order Hybrid Optimizer (LoHO), which merges zeroth-order (ZO) and first-order (FO) optimizers for fine-tuning. LoHO combines inter-layer hybrid optimization and intra-layer hybrid optimization, which boosts the accuracy of MeZO while keeping memory usage within a budget. The inter-layer hybrid optimization uses the FO optimizer in deep layers and the ZO optimizer in shallow ones, thereby avoiding unnecessary gradient propagation to improve memory efficiency. The intra-layer hybrid optimization updates a proportion of the parameters in a layer with the ZO optimizer and the rest with the FO optimizer, taking advantage of gradient sparsity for a high-efficiency implementation. Our experimental results across common datasets on different pre-trained backbones (i.e., RoBERTa-large, OPT-13B and OPT-30B) demonstrate that LoHO can significantly improve the predictive accuracy and convergence rate of MeZO, while controlling the memory footprint during fine-tuning. Moreover, LoHO can achieve performance comparable to first-order fine-tuning using substantially fewer memory resources.

Towards Efficient Low-Order Hybrid Optimizer for Language Model Fine-Tuning
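
To make the ZO/FO split in the abstract above more concrete, here is a minimal sketch of an intra-layer hybrid update on a toy least-squares problem. It is not the LoHO implementation: the function names, the learning rates, the fixed mask, and the quadratic loss are all assumptions, and the ZO estimate is the usual two-point (SPSA-style) finite difference. The inter-layer split and any memory savings are not modeled.

```python
import numpy as np

def loss(w, A, b):
    """Toy least-squares loss 0.5 * ||A w - b||^2 (stand-in for a model loss)."""
    r = A @ w - b
    return 0.5 * float(r @ r)

def fo_grad(w, A, b):
    """Exact first-order gradient of the toy loss."""
    return A.T @ (A @ w - b)

def zo_grad(w, A, b, mask, rng, eps=1e-3):
    """Two-point zeroth-order estimate, perturbing only the masked coordinates."""
    z = rng.standard_normal(w.shape) * mask
    # Directional derivative along z, estimated from two loss evaluations.
    g_scalar = (loss(w + eps * z, A, b) - loss(w - eps * z, A, b)) / (2 * eps)
    return g_scalar * z

def hybrid_step(w, A, b, zo_mask, rng, lr_fo=0.05, lr_zo=0.01):
    """Update masked coordinates with the ZO estimate and the rest with the FO gradient."""
    # Note: the full FO gradient is computed and then masked for clarity,
    # so this sketch does not reproduce any memory benefit.
    g_fo = fo_grad(w, A, b) * (1.0 - zo_mask)
    g_zo = zo_grad(w, A, b, zo_mask, rng)
    return w - lr_fo * g_fo - lr_zo * g_zo
```

Calling `hybrid_step` repeatedly with a 0/1 `zo_mask` and a `numpy.random.Generator` drives both coordinate groups toward the minimizer, with the ZO group typically converging more slowly because its estimate is noisy.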

Cross-Domain Few-Shot Learning (CDFSL) methods typically parameterize models with task-agnostic and task-specific parameters. To adapt task-specific parameters, recent approaches have utilized fixed optimization strategies, despite their potential sub-optimality across varying domains or target tasks. To address this issue, we propose a novel adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP). Our method first meta-learns Domain-Specific Preconditioners (DSPs) that capture the characteristics of each meta-training domain, which are then linearly combined using task-coefficients to form the Task-Specific Preconditioner. The preconditioner is applied to gradient descent, making the optimization adaptive to the target task. We constrain our preconditioners to be positive definite, guiding the preconditioned gradient toward the direction of steepest descent. Empirical evaluations on the Meta-Dataset show that TSP achieves state-of-the-art performance across diverse experimental scenarios.

Task-Specific Preconditioner for Cross-Domain Few-Shot Learning
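
The combination step in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the helper names, the `L @ L.T + eps * I` construction of each positive definite matrix, and the restriction to nonnegative task coefficients are all hypothetical choices made so that the combined preconditioner is guaranteed to stay positive definite.

```python
import numpy as np

def make_pd(L, eps=1e-3):
    """Build a symmetric positive definite matrix as L L^T + eps * I."""
    return L @ L.T + eps * np.eye(L.shape[0])

def task_preconditioner(dsps, coeffs):
    """Linearly combine domain-specific preconditioners with task coefficients.

    Assumes nonnegative coefficients with a positive sum, so the result
    is again positive definite.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    if np.any(coeffs < 0) or coeffs.sum() <= 0:
        raise ValueError("coefficients must be nonnegative with a positive sum")
    return sum(c * P for c, P in zip(coeffs, dsps))

def preconditioned_step(theta, grad, P, lr=0.1):
    """One step of preconditioned gradient descent: theta - lr * P @ grad."""
    return theta - lr * (P @ grad)
```

Because `P` is positive definite, `grad @ (P @ grad)` is positive for any nonzero gradient, so `-P @ grad` is a descent direction and a sufficiently small step reduces a smooth loss.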

In this paper, we explore how to develop salient object detection (SOD) models using adder neural networks (ANNs), which are more energy efficient than convolutional neural networks (CNNs), especially for real-world applications. Based on our empirical studies, we show that directly replacing the convolutions in CNN-based models with adder layers leads to a substantial loss of activations in the decoder part. This makes the feature maps learned in the decoder lack pattern diversity and hence results in a significant performance drop. To alleviate this issue, by investigating the statistics of the feature maps produced by adder layers, we introduce a simple yet effective differential merging strategy to augment the feature representations learned by adder layers and present a simple baseline for SOD using ANNs. Experiments on popular salient object detection benchmarks demonstrate that our proposed method with a simple feature pyramid network (FPN) architecture achieves comparable performance to previous state-of-the-art CNN-based models and consumes much less energy. We hope this work could facilitate the development of ANNs in binary segmentation tasks.
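
For readers unfamiliar with adder layers, the sketch below contrasts a plain convolution (cross-correlation) with an adder operation, which replaces multiply-accumulate with a negative L1 distance between each input patch and the filter. It is a single-channel, loop-based illustration with hypothetical function names; the differential merging strategy from the abstract is not shown because its details are not given here.

```python
import numpy as np

def conv2d_valid(x, f):
    """Standard 'valid' cross-correlation: sum of elementwise products."""
    kh, kw = f.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * f)
    return out

def adder2d_valid(x, f):
    """Adder operation: negative L1 distance between each patch and the filter."""
    kh, kw = f.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Only subtractions, absolute values, and additions are needed.
            out[i, j] = -np.sum(np.abs(x[i:i + kh, j:j + kw] - f))
    return out
```

One property worth noticing is that the raw adder response is never positive and reaches zero only when a patch matches the filter exactly, so its output statistics differ from those of a convolution.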
