Video recommendation systems heavily rely on user watch-time feedback, making accurate watch-time prediction a crucial task. However, this task inherently suffers from bias, as recommendation models tend to favor long-duration videos to maximize watch time. This issue, known as duration bias in the watch-time prediction context, can be explained from a causal perspective, where video duration acts as a confounder. Recent works address this bias using backdoor adjustment, isolating the direct effect of content on watch time from observational data. These methods typically discretize video duration into groups, estimate group-wise effects, and then aggregate them via a unified prediction model. However, this aggregation strategy is prone to model misspecification due to feature distribution shift across groups. In this paper, we reinterpret the problem through the lens of invariant learning and propose a novel framework: Duration-Invariant Feature Learning (DIFL). DIFL employs a kernel-based regularization that enforces representation invariance across duration groups, reducing sensitivity to group design and improving generalization. This enables more accurate modeling of the direct causal effect and supports counterfactual inference. Extensive experiments on both public and real large-scale production datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art performance.
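The abstract does not specify the exact form of DIFL's kernel-based regularizer, but one common way to enforce representation invariance across groups is a pairwise Maximum Mean Discrepancy (MMD) penalty. The sketch below is an illustrative assumption, not the paper's implementation: it computes an RBF-kernel MMD between the learned representations of each pair of duration groups and sums the results, yielding a penalty that is zero when the group-wise representation distributions match.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian (RBF) kernel matrix between rows of X and rows of Y.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2(X, Y, gamma=1.0):
    # Biased estimate of squared Maximum Mean Discrepancy between
    # the empirical distributions of X and Y in the RBF kernel's RKHS.
    return (rbf_kernel(X, X, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())

def invariance_penalty(group_reprs, gamma=1.0):
    # Sum of pairwise MMD^2 over representations of all duration groups.
    # group_reprs: list of (n_i, d) arrays, one per duration group
    # (names here are hypothetical, for illustration only).
    total = 0.0
    for i in range(len(group_reprs)):
        for j in range(i + 1, len(group_reprs)):
            total += mmd2(group_reprs[i], group_reprs[j], gamma)
    return total
```

In training, a penalty of this kind would typically be added to the watch-time prediction loss with a tunable weight, pushing the encoder toward features whose distribution does not depend on which duration group an example falls into.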
