Text-Based Person Retrieval (TBPR) aims to accurately retrieve target individuals from large-scale image databases using only textual descriptions. Existing methods typically assume a ground-truth correspondence between text and images (i.e., strong correlation). In real-world scenarios, however, this assumption may not hold: correlations between textual descriptions and visual content can be weak or even corrupted, a problem referred to as noisy correspondence (NC). Such NC severely disrupts correspondence learning between the visual and semantic modalities. Although prior works have improved single-modal robustness against noisy labels, systematic modeling of both cross-modal and intra-modal geometric structures in TBPR has received limited attention. In this paper, we propose Geometric Structure Consistency Alignment (\textbf{GSCA}) for TBPR, which leverages cross-modal cosine similarity and intra-modal nearest-neighbor affinity to learn visual-semantic consistency under noisy correspondence. To mitigate the structural corruption caused by noisy pairs, we introduce the Structure Refinement and Mining (\textbf{SRAM}) module. By partitioning the training data into clean, ambiguous, and noisy subsets, SRAM strategically refines the cross-modal correspondence by mining reliable pairs, thereby improving the discrimination of positive and negative samples and preserving structural consistency across modalities. Extensive experiments demonstrate that our method achieves state-of-the-art performance on three public datasets. On CUHK-PEDES, it boosts Rank-1 by 1.42\% under noise-free conditions and sustains a robust 74.25\% Rank-1 under a 50\% noise ratio.
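The abstract mentions two computational ingredients: cross-modal cosine similarity with intra-modal nearest-neighbor affinity (GSCA), and a three-way split of training pairs into clean, ambiguous, and noisy subsets (SRAM). The sketch below illustrates plausible forms of these quantities; it is an assumption-laden illustration, not the paper's actual implementation. In particular, the neighbor count `k`, the quantile thresholds `clean_q` and `noisy_q`, and the use of per-pair loss quantiles for the partition are hypothetical choices made here for clarity.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Unit-normalize rows so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cross_modal_cosine(img_feats, txt_feats):
    # N x N cosine similarity between image and text embeddings.
    return l2_normalize(img_feats) @ l2_normalize(txt_feats).T

def intra_modal_knn_affinity(feats, k=2):
    # Intra-modal affinity: keep only each sample's k nearest neighbors
    # (by cosine similarity) within its own modality; zero elsewhere.
    sim = l2_normalize(feats) @ l2_normalize(feats).T
    np.fill_diagonal(sim, -np.inf)          # exclude self-matches
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of k nearest neighbors
    affinity = np.zeros_like(sim)
    rows = np.arange(sim.shape[0])[:, None]
    affinity[rows, topk] = sim[rows, topk]
    return affinity

def partition_by_loss(losses, clean_q=0.3, noisy_q=0.7):
    # Hypothetical SRAM-style split: pairs with low loss are treated as
    # clean, high loss as noisy, and the remainder as ambiguous.
    lo, hi = np.quantile(losses, [clean_q, noisy_q])
    clean = losses <= lo
    noisy = losses >= hi
    ambiguous = ~(clean | noisy)
    return clean, ambiguous, noisy

# Toy usage: 5 samples with 8-dimensional embeddings per modality.
rng = np.random.default_rng(0)
img = rng.normal(size=(5, 8))
txt = rng.normal(size=(5, 8))
S_cross = cross_modal_cosine(img, txt)      # shape (5, 5)
A_img = intra_modal_knn_affinity(img, k=2)  # 2 nonzeros per row
```

A quantile-based split is only one option; noisy-correspondence methods often fit a two-component mixture model to the per-pair loss distribution instead, which adapts the thresholds to the data.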
