Singapore

Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from $\textit{segment anything}$ to $\textit{any segmentation}$. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we devise a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding.

AAAI 2026

X-SAM: From Segment Anything to Any Segmentation

mllms

image segmentaiton

unified model

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Sequential recommendation has emerged as a fundamental task in various domains, aiming to predict a user's next interaction based on historical behavior. Recent advances in deep sequence models, particularly Transformer-based architectures and the more recent Mamba, have substantially pushed the boundaries of sequential modeling performance. However, existing methods still face two critical challenges. First, many current approaches overlook the hierarchical structures and high-order dependencies among items, typically restricting representation learning to conventional Euclidean spaces, which limits their capacity to capture complex relational information. Second, although Mamba excels at long-range dependency modeling, its reliance on static Feed-Forward Networks (FFNs) hinders its ability to dynamically adapt to evolving user preferences across diverse contexts. To address these limitations, we propose a Hyperbolic-Enhanced Mixture-of-Experts Mamba recommender (HM2Rec) for sequential recommendation. HM2Rec first encodes user-item relationships through hyperbolic graph convolution to exploit hierarchical structure more effectively. Then, a Variational Graph Auto-Encoder (VGAE) is employed to reconstruct node embeddings, improving structural robustness. To further enhance sequential modeling, we integrate Rotary Positional Encoding (RoPE) into Mamba to better capture relative position dependencies, and replace the FFN with Mixture-of-Expert (MOE) module, enabling dynamic and personalized expert selection for each token. Our extensive experiments on four widely-used public datasets demonstrate that HM2Rec outperforms several advanced baseline models.

Hyperbolic-Enhanced Mixture-of-Experts Mamba for Sequential Recommendation

Product posters blend striking visuals with informative text to highlight the product and capture customer attention. However, crafting appealing posters and manually optimizing them based on online performance is laborious and resource-consuming. To address this, we introduce AutoPP, an automated pipeline for product poster generation and optimization that eliminates the need for human intervention. Specifically, the generator, relying solely on basic product information, first uses a unified design module to integrate the three key elements of a poster (background, text, and layout) into a cohesive output. Then, an element rendering module encodes these elements into condition tokens, efficiently and controllably generating the product poster. Based on the generated poster, the optimizer enhances its Click-Through Rate (CTR) by leveraging online feedback. It systematically replaces elements to gather fine-grained CTR comparisons and utilizes Isolated Direct Preference Optimization (IDPO) to attribute CTR gains to isolated elements. Our work is supported by AutoPP1M, the largest dataset specifically designed for product poster generation and optimization, which contains one million high-quality posters and feedback collected from over one million users. Experiments demonstrate that AutoPP achieves state-of-the-art results in both offline and online settings. Our code and dataset will be released upon the paper's acceptance.

AutoPP: Towards Automated Product Poster Generation and Optimization

Deep neural networks are often over-parameterized, resulting in prohibitive storage and computational costs. A fundamental question is whether a complex network can be re-expressed in terms of a compact set of basis functions without sacrificing accuracy. Motivated by this perspective, we aim to approximate a dense model by decomposing it into a small number of lightweight components that capture the essential functional structure of the network.
To this end, we propose a series expansion framework that rewrites a neural network as a linear combination of low-bit basis models. Within the post-training quantization setting, the full-precision model is expanded hierarchically at the tensor, layer, and model levels into a structured set of basis functions. We theoretically prove that this expansion converges exponentially to the original model. Furthermore, we design AbelianAdd and AbelianMul operations between isomorphic basis models, endowing the expansion with an Abelian group structure that naturally supports commutative and parallel computation. Experimental results across diverse architectures show that our series expansion method leverages a set of ultra-low-bit basis functions, not only preserving full-precision performance without the need for calibration data or fine-tuning, but also featuring a parallel-friendly design that enables efficient and scalable deployment.

FP=XINT: Representing Neural Networks via Low-Bit Series Basis Functions

Multi-instance learning (MIL) has become a powerful paradigm for weakly supervised learning tasks, where instance-level annotations are unavailable or costly. While graph-based MIL methods enhance bag topological structure modeling, they often suffer from high computational costs and limited representation due to rigid graph construction and insufficient integration of bag-level semantics. To address these challenges, we propose GDF-MIL, a novel graph-driven MIL framework, which introduces a dual-path feature fusion mechanism to adaptively balance topological structure modeling and semantic feature preservation. First, the adaptive bag mapping module (ABMM) performs soft clustering to extract compact and informative representations. Subsequently, a dynamic graph structure learning (DGSL) component efficiently learns sparse topological structures via weighted connectivity, aggregating them into a comprehensive graph-level representation. Finally, to balance fast graph construction and bag-level knowledge, dual-path feature fusion (DPFF) employs a dual-path gating mechanism to integrate both types of features, which are then passed to the classification layer for bag label prediction. Extensive experiments on 24 datasets across 4 domains shown that GDF-MIL significantly outperforms 18 state-of-the-art methods on the majority of datasets.

Rethinking Multi-Instance Learning Through Graph-Driven Fusion: A Dual-Path Approach to Adaptive Representation

Continuous sign language recognition (CSLR) technology enables social communication for the hearing-impaired by converting sign language videos into text. However, due to the limited receptive fields of convolutional networks and inefficient long-range dependency modeling in temporal modules, current methods find it difficult to capture cross-regional and high-order dynamic semantics in complex gestures. To address these limitations, we propose a dynamic spatiotemporal hypergraph network named HyperSign, which optimizes feature learning through innovative graph architectures. For single-frame spatial modeling, we propose a saliency-aware spatial graph construction strategy that dynamically quantifies semantic saliency by integrating feature complexity and motion intensity information from patches. This strategy can adaptively adjust node connectivity based on the computed saliency, thereby enabling the graph structure to focus on information-dense regions such as hands and faces. For temporal dependency modeling, we abandon the conventional pairwise frame interactions and propose a temporal hypergraph construction method. This method employs a learnable clustering algorithm to aggregate semantically correlated nodes within temporal windows into hyperedges, thereby explicitly capturing high-order associations within individual gesture actions that span multiple frames. Extensive experiments on the PHOENIX14, PHOENIX14-T, and CSL-Daily datasets demonstrate that HyperSign outperforms the state-of-the-art (SOTA) approaches in CSLR without any additional annotation information, establishing a new feature learning paradigm for the CSLR task.

HyperSign: Saliency-Aware Spatial Graphs and Temporal Hypergraphs for Continuous Sign Language Recognition

Vision-and-Language navigation on websites requires agents to navigate target webpages and answer questions based on human instructions. Current web agents primarily leverage Large Language Models (LLMs) for semantic understanding and reasoning, but still suffer from limited navigation performance and slow inference speed. Constructing a global map across webpages can effectively enhance both navigation accuracy and efficiency, however, this is challenged by the open structure of web navigation graphs and the dynamic nature of web layouts. In this paper, we propose ATLAS: Adaptive Topological Layout And Semantic mapping, a framework that adaptively constructs a time-varying, unbounded topological map across webpages and unifies heterogeneous elements through semantic representation. This enables both global path planning and local element selection for web-based navigation and question answering. As a lightweight approach, ATLAS significantly outperforms existing state-of-the-art methods on the WebVLN benchmark with a 10% improvement in success rate, and achieves the highest average task success rate on both the Mind2Web and WebArena benchmarks.

Lightweight Adaptive Topological Layout and Semantic Mapping in Vision-and-Language Navigation on Websites

Event linking aims to associate event mentions in text with their corresponding entries in a knowledge base (KB). This task can help text understanding to benefit downstream tasks (e.g., question answering and recommendation systems) and expand the KB through new event knowledge mentioned in the text. Existing event linking approaches usually adopt a retrieve-and-rank framework, which suffers from high computational costs and relies on hand-crafted rules, thereby limiting generalization. Additionally, it is found that some entity linking methods can be used to solve this task directly. However, they also perform not well. In this paper, we propose SEFEL, an end-to-end, argument-aware event representation-based event linking framework to unify the modeling of both in-KB and out-of-KB scenarios. To further enhance the linking performance, we propose a contrastive learning module to refine the learned embeddings of events and event mentions. Experimental results demonstrate that SEFEL improves accuracy by at least $3.59$ (in-KB) and $21.5$ (out-of-KB) compared with baselines, while its inference speed is more than $38$ times faster than baselines, showcasing its accuracy and efficiency.

SEFEL: A Simple Yet Effective Framework for Fast Event Linking

Fine-Grained Image Retrieval~(FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, such two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This makes the FGIR task using these adapted features closer to the task solved during the pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA performs well on three fine-grained datasets.

Fine-Grained Image Retrieval via Dual-Vision Adaptation

In multi-instance partial label learning (MIPL), each sample is a bag of multiple instances linked to a candidate label set containing one true and multiple false labels, yielding inexact supervision in both instance features and label space. However, existing works adopt decoupled approaches that focus exclusively on either instance-level feature fusion or label-level disambiguation, failing to fully exploit the intrinsic dependencies between these two spaces. To overcome this limitation, graph-based methods are widely recognized as a powerful paradigm in weakly supervised learning, yet their success hinges on reliable features—precisely what MIPL lacks due to instance-level noise.
To bridge this gap, we propose DualG, a novel framework that simultaneously addresses feature learning and label disambiguation through dual-level graph propagation.
Specifically, we construct dual relevance graphs at both the bag and instance levels. At the bag level, we build a similarity graph based on fused feature representations; at the instance level, we employ attention scores to filter out irrelevant instances and construct a reliable instance-level relevance graph. These complementary graphs enable our joint label disambiguation framework to simultaneously address inexact supervision signals in both instance space and label space.
Experimental results on five benchmark datasets demonstrate that DualG outperforms existing MIPL and partial label learning methods, validating its effectiveness and superiority.

Dual Graph Disambiguation for Multi-Instance Partial-Label Learning

Tactile sensing offers rich and complementary information to vision and language, enabling robots to perceive fine-grained object properties. However, existing tactile sensors lack standardization, leading to redundant features that hinder cross-sensor generalization. Moreover, existing methods fail to fully integrate the intermediate communication among tactile, language, and vision modalities. To address this, we propose TLV-CoRe, a CLIP-based Tactile-Language-Vision Collaborative Representation learning method. TLV-CoRe introduces a Sensor-Aware Modulator to unify tactile features across different sensors and employs tactile-irrelevant decoupled learning to disentangle irrelevant tactile features. Additionally, a Unified Bridging Adapter is introduced to enhance tri-modal interaction within the shared representation space. To fairly evaluate the effectiveness of tactile models, we further propose the RSS evaluation framework, focusing on Robustness, Synergy, and Stability across different methods. Experimental results demonstrate that TLV-CoRe significantly improves sensor-agnostic representation learning and cross-modal alignment, offering a new direction for multimodal tactile representation. The codes, data and pre-trained weights are available at https://anonymous.4open.science/r/TLV-CoRe.

Content not yet available

Next from AAAI 2026

Hyperbolic-Enhanced Mixture-of-Experts Mamba for Sequential Recommendation

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES