
Non-semantic features, also called semantic-agnostic features, are irrelevant to image context but sensitive to image manipulations, and are recognized as evidence for Image Manipulation Localization (IML). Since manual labels are impossible, existing works rely on handcrafted methods to extract non-semantic features. Handcrafted non-semantic features jeopardize an IML model's generalization ability in unseen or complex scenarios. Therefore, for IML, the elephant in the room is: how can non-semantic features be extracted adaptively? Non-semantic features are context-irrelevant and manipulation-sensitive. That is, within an image, they are consistent across patches unless manipulation occurs. Sparse and discrete interactions among image patches are therefore sufficient for extracting non-semantic features. In contrast, image semantics vary drastically across patches, so learning semantic representations requires dense and continuous interactions among image patches. Hence, in this paper, we propose a Sparse Vision Transformer (SparseViT), which reformulates the dense, global self-attention in ViT into a sparse, discrete form. Such sparse self-attention breaks image semantics and forces SparseViT to adaptively extract non-semantic features from images. Besides, compared with existing IML models, the sparse self-attention mechanism substantially reduces model cost (up to 80% in FLOPs), achieving notable parameter efficiency and computation reduction. Extensive experiments demonstrate that, without any handcrafted feature extractors, SparseViT is superior in both generalization and efficiency across benchmark datasets. Code will be available.

AAAI 2025

Can We Get Rid of Handcrafted Feature Extractors? SparseViT: Nonsemantics-Centered, Parameter-Efficient Image Manipulation Localization Through Spare-Coding Transformer
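
The idea of replacing dense, global self-attention with sparse, discrete interactions among patches can be illustrated with a small NumPy sketch. This is not the SparseViT implementation: the grouping rule (patches interact only if they share the same `(row % stride, col % stride)` position), the identity query/key/value projections, and the function names `sparse_group_ids` and `sparse_self_attention` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_group_ids(height, width, stride):
    """Assign each patch (row, col) to the group (row % stride, col % stride)."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return ((rows % stride) * stride + (cols % stride)).reshape(-1)

def sparse_self_attention(tokens, height, width, stride):
    """Single-head self-attention restricted to patches in the same strided group.

    tokens: (height * width, dim) patch embeddings in row-major order.
    Identity Q/K/V projections are used to keep the sketch short (an assumption).
    """
    n, dim = tokens.shape
    assert n == height * width
    groups = sparse_group_ids(height, width, stride)
    mask = groups[:, None] == groups[None, :]        # True where interaction is allowed
    scores = tokens @ tokens.T / np.sqrt(dim)
    scores = np.where(mask, scores, -np.inf)         # block all other interactions
    weights = softmax(scores, axis=-1)
    return weights @ tokens, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))                # a 4 x 4 grid of 8-dim patch tokens
    out, weights = sparse_self_attention(tokens, height=4, width=4, stride=2)
    print(out.shape, (weights > 0).sum(axis=1))
```

With `stride=1` every patch falls into one group and the sketch reduces to ordinary dense attention; larger strides shrink each attention group, which is where the computational saving in such a scheme would come from.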


poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Open-vocabulary semantic segmentation (OVSS) aims to segment images of arbitrary categories specified by class labels. While previous approaches relied on extensive image-text pairs or dense semantic annotations, recent training-free methods attempted to overcome these limitations by constructing semantic prototypes in the construction stage and image-to-image matching (i.e., prototype matching) during testing. However, these methods often struggle to effectively capture the visual characteristics of categories and fail to utilize local features during prototype matching. To deal with these problems, we propose a novel training-free framework for OVSS that constructs diverse prototypes and performs fine-grained sub-region matching. Specifically, our method leverages Large Language Models (LLMs) to guide support image generation by descriptions of different attributes of categories and employs coarse-fine clustering to obtain diverse and robust part-level prototypes in the construction stage. During testing, we propose a sub-region matching method, which assigns part-level prototypes to sub-regions utilizing optimal transport, to fully utilize local image features among part-level prototypes. Extensive experiments demonstrate the effectiveness of our method and show that our method achieves state-of-the-art performance, outperforming previous methods across five datasets.

Training-free Open-Vocabulary Semantic Segmentation via Diverse Prototype Construction and Sub-region Matching
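
Assigning part-level prototypes to sub-regions with optimal transport, as the abstract above describes, can be sketched with entropy-regularized Sinkhorn iterations. This is only an illustration under stated assumptions: the cosine-distance cost, the uniform mass on regions and prototypes, the `epsilon` value, and the names `sinkhorn_plan` and `match_subregions` are hypothetical and not taken from the paper.

```python
import numpy as np

def sinkhorn_plan(cost, row_mass, col_mass, epsilon=0.1, n_iters=200):
    """Entropy-regularized optimal transport plan via Sinkhorn scaling."""
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(row_mass)
    for _ in range(n_iters):
        v = col_mass / (kernel.T @ u)   # rescale to match column marginals
        u = row_mass / (kernel @ v)     # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]

def match_subregions(region_feats, prototype_feats, epsilon=0.1):
    """Softly assign sub-region features to part-level prototypes.

    Cost is cosine distance; both sides carry uniform mass (assumptions).
    Returns the transport plan and the most likely prototype per region.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    p = prototype_feats / np.linalg.norm(prototype_feats, axis=1, keepdims=True)
    cost = 1.0 - r @ p.T
    row_mass = np.full(len(r), 1.0 / len(r))
    col_mass = np.full(len(p), 1.0 / len(p))
    plan = sinkhorn_plan(cost, row_mass, col_mass, epsilon)
    return plan, plan.argmax(axis=1)

if __name__ == "__main__":
    prototypes = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    regions = np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.0],
                        [0.0, 1.0, 0.1], [0.1, 1.0, 0.0]])
    plan, assignment = match_subregions(regions, prototypes)
    print(assignment)
```

Compared with a plain nearest-prototype lookup, the marginal constraints force every prototype to receive some mass, which is one reason transport-based matching is attractive when local features should all be used.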

Given a graph representing the workspace, Multi-Agent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective start vertices to their respective goal vertices while minimizing path costs. Although many MAPF algorithms have been developed and can handle up to thousands of agents, they usually rely on the assumption that each action of an agent takes one time unit and that the actions of all agents are synchronized, in the sense that they start at the same discrete time step, which may limit their use in practice. Only a few algorithms have been developed to address asynchronous actions, and they all lie on one end of the spectrum, focusing on finding optimal solutions with limited scalability. This paper develops new planners that lie on the other end of the spectrum, trading off solution quality for scalability, by finding an unbounded sub-optimal solution for many agents. Our method leverages both search-based methods in handling asynchronous actions and techniques in rule-based planning for MAPF. We analyze the properties of our method and test it against several baselines with up to a thousand agents with asynchronous actions in various maps. Given a runtime limit, our method can handle an order of magnitude more agents than the existing methods with about 25% longer makespan.

Loosely Synchronized Rule-Based Planning for Multi-Agent Path Finding with Asynchronous Actions

Multimodal Relation Extraction (MRE) aims to predict relations between head and tail entities based on the context of sentence-image pairs. Most existing MRE methods progressively incorporate textual and visual inputs to dominate the learning process, assuming both contribute significantly to the task. However, diverse visual appearances and text with ambiguous semantics provide less-informative context for the corresponding relation. To tackle these challenges, we highlight the importance of semantically invariant entity attributes that encompass fine-grained categories. To this end, we propose a novel Prototype-Guided Multimodal Relation Extraction (PG-MRE) framework based on Entity Attributes. Specifically, we first generate detailed entity explanations using Large Language Models (LLMs) to supplement the attribute semantics. Then, the Attribute Prototype Module (APM) refines attribute categories and condenses scattered entity attribute features into cluster-level prototypes. Furthermore, prototype-aligned attribute features guide diverse visual appearance features to produce compact and distinctive multimodal representations in the Relation Prototype Module (RPM). Extensive experiments demonstrate that our method gains superior relation classification capability (especially in scenarios involving various unseen entities), achieving new state-of-the-art performance on the MNRE dataset. Our code will be available soon.

Prototype-Guided Multimodal Relation Extraction based on Entity Attributes

Federated learning has become a promising solution for collaboration among medical institutions. However, the data owned by each institution is highly heterogeneous and not independent and identically distributed (non-IID), resulting in client drift and unsatisfactory performance. Although existing federated learning methods attempt to address the non-IID problem, they show only marginal advantages and rely on frequent communication, which incurs high costs and privacy concerns. In this paper, we propose a novel federated learning method: Federated learning via Valuable Condensed Knowledge (FedVCK). We enhance the quality of condensed knowledge and select the most necessary knowledge guided by models, to tackle the non-IID problem effectively within limited communication budgets. Specifically, on the client side, we condense the knowledge of each client into a small dataset and further enhance the condensation procedure with latent distribution constraints, facilitating the effective capture of high-quality knowledge. During each round, we specifically target and condense knowledge that has not been assimilated by the current model, thereby preventing unnecessary repetition of homogeneous knowledge and minimizing the frequency of communications required. On the server side, we propose relational supervised contrastive learning to provide more supervision signals to aid the global model updating. Comprehensive experiments across various medical tasks show that FedVCK can outperform state-of-the-art methods, demonstrating that it is robust to non-IID data and communication-efficient.

FedVCK: Non-IID Robust and Communication-Efficient Federated Learning via Valuable Condensed Knowledge for Medical Image Analysis

We introduce TCAM-Diff, a novel 3D medical image generation model that reduces the memory requirements to encode and generate high-resolution 3D data. This model utilizes a decoder-only autoencoder method to learn triplane representation from dense volume and leverages generalization operations to prevent overfitting. Subsequently, it uses a triplane-aware cross-attention diffusion model to learn and integrate these features effectively. Furthermore, the features generated by the diffusion model can be rapidly transformed into 3D volumes using a pre-trained decoder module. Our experiments on three different scales of medical datasets, BrainTumour $128\times128\times128$, Pancreas $256\times256\times256$, and Colon $512\times512\times512$, demonstrated outstanding results. We utilized MSE and SSIM to evaluate reconstruction quality and leveraged the Wasserstein Generative Adversarial Network (W-GAN) critic to assess generative quality. Comparisons to existing approaches show that our method gives better reconstruction and generation results than other encoder-decoder methods with similar-sized latent spaces.

TCAM-Diff: Triplane-Aware Cross-Attention Medical Diffusion Model

In this paper, we study the challenging problem of simultaneously removing haze and estimating depth from real monocular hazy videos. These tasks are inherently complementary: enhanced depth estimation improves dehazing via the atmospheric scattering model (ASM), while superior dehazing contributes to more accurate depth estimation through the brightness consistency constraint (BCC). To tackle these tasks, we propose a novel depth-centric learning framework that integrates the ASM with the BCC. Our key idea is that both ASM and BCC rely on a shared depth estimation network. This network simultaneously leverages adjacent dehazed frames to enhance depth estimation using BCC and employs the refined depth cues to more effectively remove haze using ASM. Additionally, we leverage a non-aligned clear video and its estimated depth to independently regularize the dehazing and depth estimation networks. This is achieved by designing two discriminator networks: $D_\text{MFIR}$, which enhances high-frequency details in dehazed videos, and $D_\text{MDR}$, which reduces the occurrence of black holes in low-texture regions. Extensive experiments demonstrate that the proposed method outperforms current state-of-the-art techniques in both video dehazing and depth estimation tasks, especially in real-world hazy scenes. Project page: https://github.com/hello2377/DCL.

Depth-Centric Dehazing and Depth-Estimation from Real-World Hazy Driving Video
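
The link between depth and dehazing that the abstract above attributes to the atmospheric scattering model can be made concrete with the standard ASM formulation, I = J·t + A·(1 − t) with transmission t = exp(−β·d). The sketch below only illustrates that textbook relation, not the paper's networks; the scalar airlight, the `beta` value, the `t_min` clamp, and the function names are assumptions.

```python
import numpy as np

def transmission_from_depth(depth, beta=1.0):
    """Transmission map t = exp(-beta * depth); beta is the scattering coefficient."""
    return np.exp(-beta * depth)

def add_haze(clear, depth, airlight=0.8, beta=1.0):
    """Synthesize a hazy image with the ASM: I = J * t + A * (1 - t)."""
    t = transmission_from_depth(depth, beta)[..., None]   # broadcast over color channels
    return clear * t + airlight * (1.0 - t)

def remove_haze(hazy, depth, airlight=0.8, beta=1.0, t_min=0.05):
    """Invert the ASM given depth: J = (I - A * (1 - t)) / t.

    t is clamped at t_min to avoid amplifying noise at large depth (an assumption).
    """
    t = np.maximum(transmission_from_depth(depth, beta), t_min)[..., None]
    return (hazy - airlight * (1.0 - t)) / t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clear = rng.uniform(size=(4, 5, 3))          # toy RGB frame with values in [0, 1]
    depth = rng.uniform(0.0, 2.0, size=(4, 5))   # toy depth map
    hazy = add_haze(clear, depth)
    restored = remove_haze(hazy, depth)
    print(np.abs(restored - clear).max())
```

Because the inversion divides by the transmission, errors in estimated depth translate directly into dehazing errors, which is consistent with the abstract's claim that better depth improves dehazing.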

In a real-world RAG system, the current query often involves spoken ellipses and ambiguous references from dialogue contexts, necessitating query rewriting to better describe the user's information needs. However, traditional context-based rewriting yields minimal improvement on downstream generation tasks due to the lengthy process from query rewriting to response generation. Some researchers try to utilize reinforcement learning with generation feedback to assist the rewriter, but these sparse rewards provide little guidance in most cases, leading to unstable training and generation results. We find that users' needs are also reflected in the gold documents, retrieved documents, and ground truth. Therefore, by feeding back these multi-aspect dense rewards to query rewriting, more stable and satisfactory responses can be achieved. In this paper, we propose a novel query rewriting method, MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and generated results. Specifically, we first use manual data to train a T5 model for the rewriter initialization. Next, we design three metrics as reinforcement learning feedback: the similarity between the rewritten query and the gold document, the ranking metrics, and ROUGE between the generation and the ground truth. Inspired by RLAIF, we train three kinds of reward models for the above metrics to achieve more efficient training. Finally, we combine the scores of these reward models as feedback, and use the PPO algorithm to explore the optimal query rewriting strategy. Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.

MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models
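
The three feedback signals named in the abstract above (query-to-gold-document similarity, a ranking metric, and ROUGE against the ground truth) and their combination into one scalar can be sketched as follows. This is a rough illustration only: the paper trains reward models for these metrics, whereas the sketch computes simple stand-ins directly. The bag-of-words cosine, reciprocal rank as the ranking metric, unigram ROUGE-1 F1, the equal weights, and all function names are assumptions.

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors (stand-in for a learned similarity)."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def reciprocal_rank(ranked_ids, gold_id):
    """1 / rank of the gold document in the retrieved list, or 0 if it is missing."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a generated answer and the ground truth."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def combined_reward(similarity, rank_score, rouge, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three signals; the weights are hypothetical."""
    w_sim, w_rank, w_rouge = weights
    total = w_sim * similarity + w_rank * rank_score + w_rouge * rouge
    return total / (w_sim + w_rank + w_rouge)

if __name__ == "__main__":
    sim = bow_cosine("who wrote the novel dune", "dune is a novel by frank herbert")
    rank = reciprocal_rank(["d3", "d1", "d2"], gold_id="d1")
    rouge = rouge1_f1("frank herbert wrote dune", "dune was written by frank herbert")
    print(combined_reward(sim, rank, rouge))
```

In a reinforcement learning loop, a scalar of this kind would be computed for every rewritten query, which is what makes the feedback dense compared with a single end-of-pipeline generation score.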

Recently, both closed-source LLMs and open-source communities have made significant strides, outperforming humans in various general domains. However, their performance in specific professional domains such as medicine, especially within the open-source community, remains suboptimal due to the complexity of medical knowledge. In this paper, we propose CareBot, a bilingual medical LLM, which leverages a comprehensive approach integrating continuous pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with human feedback (RLHF). Our novel two-stage CPT method, comprising Stable CPT and Boost CPT, effectively bridges the gap between general and domain-specific data, facilitating a smooth transition from pre-training to fine-tuning and enhancing domain knowledge progressively. We also introduce DataRater, a model designed to assess data quality during CPT, ensuring that the training data is both accurate and relevant. For SFT, we develop a large and diverse bilingual dataset, along with ConFilter, a metric to enhance multi-turn dialogue quality, which is crucial to improving the model's ability to handle more complex dialogues. The combination of high-quality data sources and innovative techniques significantly improves CareBot's performance across a range of medical applications. Our rigorous evaluations on Chinese and English benchmarks confirm CareBot's effectiveness in medical consultation and education. These advancements not only address current limitations in medical LLMs but also set a new standard for developing effective and reliable open-source models in the medical domain. We will open-source the datasets and models later, contributing valuable resources to the research community.

CareBot: A Pioneering Full-Process Open-Source Medical Language Model

Tactile sensation plays a crucial role in the development of multi-modal large models and embodied intelligence. To collect tactile data at the lowest possible cost, a series of studies have attempted to generate tactile images by vision-to-touch image translation. However, compared to the text modality, visual modality-driven tactile generation cannot accurately depict human tactile sensation. In this work, we analyze the characteristics of tactile images in detail at two granularities: object-level (tactile texture, tactile shape) and sensor-level (gel status). We model these granularities of information through text descriptions and propose a fine-grained Text-to-Touch generation method (TextToucher) to generate high-quality tactile samples. Specifically, we introduce a multimodal large language model to build the text sentences about object-level tactile information and employ a set of learnable text prompts to represent the sensor-level tactile information. To better guide the tactile generation process with the built text information, we fuse the dual grains of text information and explore various dual-grain text conditioning methods within the diffusion transformer architecture. Furthermore, we propose a Contrastive Text-Touch Pre-training (CTTP) metric to precisely evaluate the quality of text-driven generated tactile data. Extensive experiments demonstrate the superiority of our TextToucher method.

TextToucher: Fine-Grained Text-to-Touch Generation

Phase imaging is gaining importance due to its applications in fields like biomedical imaging and material characterization. In biomedical applications, it can provide quantitative information missing in label-free microscopy modalities. One of the most prominent methods in phase quantification is the Transport-of-Intensity Equation (TIE). TIE often requires multiple acquisitions at different defocus distances, which is not always feasible in a clinical setting due to hardware constraints. To address this issue, we propose the use of chromatic aberrations to induce the required through-focus images with a single exposure, effectively generating a through-focus stack. Since the defocus distance induced by the aberrations is small, conventional TIE solvers are insufficient to address the resulting artifacts. We propose Zero-Mean Diffusion, a modified version of diffusion models designed for quantitative image prediction, and train it with synthetic data to ensure robust phase retrieval. Our contributions offer an alternative TIE approach that leverages chromatic aberrations, achieving accurate single-exposure phase measurement with white light and thus improving the efficiency of phase imaging. Additionally, we present a new class of diffusion models that are well-suited for quantitative data and have a sound theoretical basis. To validate our approach, we employ a widespread brightfield microscope equipped with a commercially available color camera. We apply our model to clinical microscopy of patients' urine, obtaining accurate phase measurements.
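
For reference, the Transport-of-Intensity Equation mentioned in the abstract above is usually written in the following standard form, which relates the axial intensity derivative to the phase. The symbols are the conventional ones and are assumptions here, since the abstract does not define them: $I$ is the in-focus intensity, $\phi$ the phase, $\lambda$ the wavelength, $z$ the optical axis, and $\Delta z$ the defocus distance.

```latex
% Transport-of-Intensity Equation (paraxial form)
\nabla_\perp \cdot \bigl( I(\mathbf{r}) \, \nabla_\perp \phi(\mathbf{r}) \bigr)
  = -k \, \frac{\partial I(\mathbf{r}, z)}{\partial z},
\qquad k = \frac{2\pi}{\lambda}

% Finite-difference estimate of the axial derivative from two defocused images
\frac{\partial I}{\partial z} \approx
  \frac{I(\mathbf{r}, z + \Delta z) - I(\mathbf{r}, z - \Delta z)}{2 \, \Delta z}

% Special case of nearly uniform intensity I_0: a Poisson equation for the phase
\nabla_\perp^{2} \phi(\mathbf{r}) = -\frac{k}{I_0} \, \frac{\partial I}{\partial z}
```

The finite-difference line shows why a small $\Delta z$ is problematic: the intensity difference in the numerator shrinks toward the noise level, which is consistent with the abstract's remark that conventional solvers struggle when the aberration-induced defocus is small.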
