Singapore

Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR's unsupervised SoTA with ~1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.

AAAI 2026

Revela: Dense Retriever Learning via Language Modeling

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Large Language Models (LLMs) have shown remarkable capabilities in document processing, but their inability to provide visual grounding without OCR dependencies poses significant challenges in business-critical applications. Current solutions either require model fine-tuning or rely on external OCR services, introducing additional costs, latency, and limitations in handling derived information. This paper presents ViG-LLM, a novel framework that enables closed-box LLMs to generate localization information through a multi-agent system combining U-Net-based layout deconstruction with viewport identification tasks. Evaluated on the FATURA and CORD datasets, our framework achieves perfect accuracy compared with spatial-reasoning-tuned LLMs such as Amazon Nova Pro, while demonstrating superior template-specific consistency. The framework maintains robust performance across LLM architectures while reducing operational costs by 60% compared to Textract-based solutions. In real-world document processing applications, the framework helps retain the system's high reasoning capabilities in document information extraction tasks while improving explainability, reliability, and human interaction for information verification. Through human-in-the-loop learning and closed-box prompt alignment techniques, ViG-LLM provides a robust, adaptable solution for visual grounding tasks in document processing workflows.

ViG-LLM: Enhancing Visual Grounding Capabilities in Closed-Box LLMs for Document Information Extraction without OCR Dependencies

Integration testing for ML systems faces unique challenges due to stochastic behavior and vast input spaces, while Continuous Integration/Continuous Delivery (CI/CD) pipelines have budgets and constraints that limit the number of tests an integration test suite can contain. We present CAT (Coverage-Adaptive Testing), a framework that generates structured test inputs for coding-agent-based integration testing. CAT enables users to define coverage objectives through category taxonomies (e.g., adversarial attacks, compliance checks), then systematically generates high-coverage inputs through iterative refinement. A generator LLM produces candidate inputs targeting coverage gaps, while a judge LLM validates and classifies them. Greedy minimization selects compact test suites within user-specified budgets. In a case study on AWS Bedrock Guardrails, CAT achieves above 80% coverage across 27 adversarial categories for an integration test suite of 8 tests. CAT's structured outputs (natural language inputs with category labels and validation metadata) are designed to enable handoff to coding agents for test implementation. This architecture demonstrates a reusable human-AI collaboration pattern: humans define coverage objectives, CAT systematically explores the input space to create integration test inputs, and coding agents (future work) transform inputs into executable tests.


CAT: Coverage-Adaptive Testing — Structured Test Suite Generation for Coding Agent Handoff
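The abstract does not spell out how greedy minimization works, so the following is only a minimal sketch of one common reading: a budgeted greedy set cover over category labels. The function names `greedy_select` and `coverage`, and the toy category labels, are illustrative assumptions and not part of CAT itself.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[str]], budget: int) -> List[str]:
    """Pick at most `budget` inputs, each time taking the candidate that
    covers the most categories not yet covered (hypothetical helper)."""
    uncovered = set().union(*candidates.values())
    remaining = dict(candidates)
    chosen: List[str] = []
    while remaining and uncovered and len(chosen) < budget:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] & uncovered))
        gain = remaining[best] & uncovered
        if not gain:
            break  # no remaining candidate adds coverage
        chosen.append(best)
        uncovered -= gain
        del remaining[best]
    return chosen


def coverage(candidates: Dict[str, Set[str]], chosen: List[str], taxonomy: Set[str]) -> float:
    """Fraction of taxonomy categories hit by the chosen inputs."""
    covered: Set[str] = set()
    for name in chosen:
        covered |= candidates[name]
    return len(covered & taxonomy) / len(taxonomy)


# Toy data: each candidate input is labeled with the categories it exercises.
candidates = {
    "x": {"prompt_injection", "jailbreak"},
    "y": {"pii_leak"},
    "z": {"prompt_injection"},
}
taxonomy = {"prompt_injection", "jailbreak", "pii_leak", "toxicity"}
suite = greedy_select(candidates, budget=2)
```

With this toy data the suite is `["x", "y"]`, covering three of the four taxonomy categories. Greedy selection is not guaranteed to be optimal, but it is a standard heuristic when the suite size is capped.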

Assessing software test coverage at scale remains a bottleneck in QA pipelines. We present LLM-as-a-Judge (LAJ), a production-ready, rubric-driven framework for evaluating Gherkin acceptance tests with structured JSON outputs. Across 20 model configurations (GPT-4, GPT-5 with varying reasoning effort, and open-weight models) on 100 expert-annotated scripts over 5 runs (500 evaluations), we provide the first comprehensive analysis spanning accuracy, operational reliability, and cost. We introduce the Evaluation Completion Rate (ECR@1) to quantify first-attempt success, revealing reliability from 85.4% to 100.0% with material cost implications via retries. Results show that smaller models can outperform larger ones: GPT-4o Mini attains the best accuracy (6.07 MAAE), high reliability (96.6% ECR@1), and low cost ($1.01 per 1K), yielding a 78× reduction vs. GPT-5 (high reasoning) while improving accuracy. Reasoning effort is model-family dependent: GPT-5 benefits from increased reasoning (with predictable accuracy–cost trade-offs), whereas open-weight models degrade across all dimensions as reasoning increases. Overall, cost spans 175× ($0.45–$78.96 per 1K). We release the dataset, framework, and code to support reproducibility and deployment.

LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost
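The abstract describes ECR@1 only as a measure of first-attempt success. As a rough sketch, one could count a first attempt as successful when the judge returns parseable JSON containing the expected fields; that success criterion, the `score` key, and the function names below are assumptions made for illustration, not the authors' definition.

```python
import json
from typing import Iterable, Sequence


def first_attempt_ok(raw: str, required_keys: Sequence[str] = ("score",)) -> bool:
    """Assumed success criterion: output parses as a JSON object
    and contains every required key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


def ecr_at_1(first_attempts: Iterable[str], required_keys: Sequence[str] = ("score",)) -> float:
    """Share of evaluations whose first attempt met the success criterion."""
    outputs = list(first_attempts)
    if not outputs:
        raise ValueError("need at least one evaluation")
    successes = sum(first_attempt_ok(out, required_keys) for out in outputs)
    return successes / len(outputs)


# Toy first-attempt outputs from a hypothetical judge model.
outputs = ['{"score": 80}', "not json", '{"rating": 3}', "[1, 2]"]
rate = ecr_at_1(outputs)
```

Here only the first output qualifies, so the rate is 0.25. Every failed first attempt implies at least one retry, which is how a lower ECR@1 translates into higher cost.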

We address the challenge of training reliable code-fixing agents in real-world repositories, where complex build systems and shifting dependencies make evaluation unstable. We developed a verifiable pipeline that defines success through post-fix build validation and improves reproducibility across roughly one thousand real issues by pinning dependencies and disabling automatic upgrades. Building on this foundation, we introduced a scalable, simplified pipeline for large-scale reinforcement learning (RL). Using this setup, we applied supervised fine-tuning (SFT) to Qwen3-32B in the full pipeline and then applied RL on top of the SFT model in the simplified environment. The SFT model distilled from GPT-4.1 trajectories performed comparably while being 56× smaller, and RL added 7–20% absolute gains under matched train–test conditions. “Thinking mode” achieved similar or worse results in our experiments. Both SFT and RL models failed to generalize across environments, underscoring the importance of aligning training and evaluation settings when building reliable, real-world code-fixing agents.

Agentic Reinforcement Learning for Real-World Code Repair

Rapid growth in cloud services has generated substantial demand for cybersecurity tool enhancements. Upon receiving these requests, a security assessment process is required, including investigating documentation, examining code repositories, evaluating coverage, and assigning work to tooling teams with detailed metadata. This security assessment is performed manually, consuming valuable expert time. We introduce a multi-agent GenAI system that orchestrates specialized sub-agents to automate the security assessment workflow. Evaluation on historical requests shows the system achieves over 95% accuracy compared to human experts. Early adoption shows the system saves approximately 0.5 to 1 hour of human expert time daily, demonstrating that agent collaboration can maintain quality while reducing manual workload.

Multi-Agent Framework for Automated Cloud Security Assessment

The proliferation of foundation models has given rise to "vibe coding," a paradigm where end-users create software by describing desired outcomes in natural language. While platforms that enable this approach are democratizing software development, they often produce simplistic, fragile applications that lack the formal guarantees required for scalable, reliable systems. The core software engineering challenges of translating ambiguous user intent into formal specifications and managing the lifecycle of AI-generated code remain largely unaddressed. This demo presents a system built on the Software Engineering for Software Makers (SE4SM) framework, which introduces a disciplined, two-pillar approach to AI-assisted development. The first pillar, Intent Engineering, employs a collaborative human-AI dialogue to transform a user's high-level concept into a set of clear, unambiguous, and verifiable specifications while considering both explicitly stated needs and implicitly inferred requirements. The second, Realization Engineering, uses a spec-driven, multi-agent system to autonomously build, test, and verify the software. This methodology preserves the accessibility of vibe coding while embedding professional engineering practices, producing more trustworthy, evolvable systems.

From Vibe to Verifiable Spec-Driven Development: A Demo of Intent and Realization Engineering

Modern agentic frameworks operate under the fallacy of a Stationarity Trap: the dangerous assumption that meaning in underlying knowledge bases is static and user contexts are invariant. In this talk, we present evidence from two distinct research streams to demonstrate that this assumption guarantees system failure. Applying our EchoCodes framework, we prove that “ground truth” content (e.g., data such as medical ontologies, intelligence signals, and software APIs) can undergo rate-independent semantic drift, and that this drift can be detected before it causes agents to act on meanings that have shifted. In the second dimension, as part of our work on the AiVisor system, we illustrate a “Personalization Paradox,” where agents optimizing for user utility incur statistically significant penalties in semantic consistency, effectively decoupling reasoning from lexically grounded baselines. Together, these findings define a non-stationary agent tradespace, where contextual drift (user alignment) and semantic dynamics (meaning discontinuity) converge to produce un-grounded agency. The convergence of these two phenomena, shifting ontological/ground-truth concepts and expanding contextual liabilities, creates a state of semantic insolvency. When such insolvent systems are granted the power to act, the result is un-grounded agency: high-confidence execution untethered from reality. Consequently, agentic deployments risk devolving into blind autonomy. We conclude by proposing a novel Semantic Guidance System to resolve the Stationarity Trap tradespace, moving the field beyond static evaluation toward dynamic semantic grounding for long-lifecycle agents.

Un-Grounded Agency and the Stationarity Trap in Agentic Workflows

We assume a category-theoretical approach to reasoning that is separate from, but integrates with, conventional logical systems and linear algebraic techniques such as those employed by large language models and reinforcement learning. The agenda is motivated by the ability to reason topologically in an expanded vocabulary of types not motivated by conventional ontological needs. One high-value application is reasoning over open-world situations and causally dynamic systems where essential facts are missing, situated influence is substantial, and significant implicit influence is present. We are particularly looking at dynamic, unstable, possible outcomes, and situations where outcome engineering is desired. The program is motivated by collecting open influences as salients, and focusing on type systems to support the concept that can have topological induction. The focus of this paper is on the nature of a type system where salience is a first-class citizen and category-theoretical, topology-centric induction is enabled. While the research is not motivated by a specific use case, we focus on central nervous system modelling where fear memory extinction is modelled, so that we can leverage prior work. But applications are expected to be rather broad, characterised by open-world insights and frangible possibilities. Our group builds systems in Haskell frameworks, so our type system needs to be consistent with what can be supported programmatically. Because our group is particularly interested in human/machine navigation of structures, a consistent visual grammar sensitive to these type definitions is desirable. This paper will include a brief survey of the literature with an emphasis on situation theory as a framework for salience and influence. Potential sensor technologies are considered.

Categorical Type Systems for Salient Influence

Maintaining human oversight and broad stakeholder participation remain key challenges for trust management in human-AI teams. With the increasing deployment of agentic AI and large language models, the risk of unpredictable yet highly consequential outcomes could, if left unaddressed, undermine public and expert confidence in such teams. This study revisits that debate to articulate a framework that secures fairness, legitimacy, and trust. Drawing on theoretical and empirical analyses of citizen participation in fields adjacent or parallel to AI, it shows that the weaknesses of black-box AI have analogous counterparts in the isomorphic processes of black-box value elicitation and implementation decisions. Specifically, this study focuses on circumstances in which experts operate outside their domain of expertise to develop ad-hoc theories of reality. Examples of such “inexpert expert judgments” include situations where judges must evaluate complex statistical evidence in court, and where algorithm specialists must select features to be included in models based not on relevance to the goal but on availability or intuition. In such cases, even something as trivial as including or excluding an interaction term in a regression-based model could lead to highly problematic outcomes. Bypassing domain specialist expertise and substituting it with stand-in knowledge is unnecessary when such expertise could be incorporated with modest effort.

The Hidden Black-Box: Expert Labor and the Case for Participatory AI

Large Language Models (LLMs) excel at single-turn tasks such as instruction following and summarization, yet real-world deployments require sustained multi-turn interactions where user goals and conversational context persist and evolve. A recurring challenge in this setting is context drift: the gradual divergence of a model’s outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. In this work, we present a study of context drift in multi-turn interactions and propose a simple dynamical framework to interpret its behavior. We formalize drift as the turn-wise KL divergence between the token-level predictive distributions of the test model and a goal-consistent reference model, and propose a recurrence model that interprets its evolution as a bounded stochastic process with restoring forces and controllable interventions. We instantiate this framework in both synthetic long-horizon rewriting tasks and realistic user–agent simulations, such as in tau-bench, measuring drift for several open-weight LLMs that are used as user simulators. Our experiments consistently reveal stable, noise-limited equilibria rather than runaway degradation, and demonstrate that simple reminder interventions reliably reduce divergence in line with theoretical predictions. Together, these results suggest that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay, providing a foundation for studying and mitigating context drift in extended interactions.
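To make the drift measure above concrete, here is a minimal sketch that averages a token-level KL divergence over the positions of one turn. The abstract does not state the direction of the divergence or how positions are aggregated, so the choice of KL(test ‖ reference), the per-turn mean, and the function names are all assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two distributions over the same vocabulary.
    `eps` guards against division by zero when q assigns zero mass."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def turn_drift(test_dists: Sequence[Sequence[float]],
               ref_dists: Sequence[Sequence[float]]) -> float:
    """Assumed aggregation: mean KL(test || reference) over the token
    positions generated in a single turn."""
    if len(test_dists) != len(ref_dists) or not test_dists:
        raise ValueError("need matching, non-empty position lists")
    total = sum(kl_divergence(p, q) for p, q in zip(test_dists, ref_dists))
    return total / len(test_dists)


# Toy turn with two token positions over a two-word vocabulary.
test_model = [[1.0, 0.0], [0.5, 0.5]]
reference = [[0.5, 0.5], [0.5, 0.5]]
drift = turn_drift(test_model, reference)
```

In this toy turn the first position contributes log 2 nats and the second contributes zero, so the drift is (log 2) / 2. Tracking this value turn by turn gives the kind of trajectory the abstract describes as settling into an equilibrium.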
