Singapore

Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are provided at URL: https://tinyurl.com/3wv58sbw

AAAI 2026

Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



Predicting Valence-Arousal-Dominance (VAD) dimensions from bodily-expressed emotions in videos remains a fundamentally challenging task in affective computing, requiring models that capture subtle spatiotemporal patterns while balancing computational efficiency and interpretability. We present a comprehensive investigation of VAD prediction approaches on the newly introduced Annotated Bodily Expressed Emotion (ABEE) dataset, which contains approximately 3,200 video clips spanning 8 primary emotion categories and 20 subcategories. We explore two complementary methodologies: a feature-based gradient boosting approach using XGBoost with carefully engineered spatiotemporal features and dimensionality reduction, and deep learning architectures capable of learning hierarchical representations directly from raw video data. Our feature-based approach demonstrates exceptional computational efficiency, with sub-second training times and minimal resource requirements, while our deep models reveal the fundamental difficulty of capturing continuous VAD dimensions from bodily expressions. Through systematic evaluation on the ABEE dataset, we establish baseline performance for the VAD prediction task, achieving $R^2$ scores of -0.090, -0.014, and -0.058 for valence, arousal, and dominance, respectively, with our gradient boosting approach. These results highlight the substantial gap between current methodologies and the inherent complexity of bodily emotion signals, providing benchmarks for future research. We further discuss critical insights regarding feature engineering, temporal dynamics, and the intrinsic challenges of continuous emotion prediction from naturalistic video data, emphasizing the need for dedicated spatiotemporal modeling strategies tailored to bodily expressions.

Spatiotemporal Modeling of Bodily Emotional Expressions for Continuous Valence-Arousal-Dominance Prediction in Video
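The negative $R^2$ scores reported in the abstract above can look surprising, so a small worked example may help. $R^2$ is defined as $1 - SS_{res}/SS_{tot}$, which means a model scores below zero whenever its squared error exceeds that of always predicting the label mean. The sketch below uses made-up labels and predictions (not ABEE data) purely to illustrate the arithmetic.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical valence labels and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5, 2.5, 2.5, 2.5]  # always predict the label mean
weak_model = [3.0, 2.0, 3.0, 2.0]     # larger squared error than the mean baseline

r2_mean = r2_score(y_true, mean_baseline)  # exactly 0: matches the baseline
r2_weak = r2_score(y_true, weak_model)     # negative: worse than the baseline
```

Read this way, scores slightly below zero indicate predictions that are marginally worse than a constant mean predictor on the evaluation split.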

Sketching is a widely used medium for generating and exploring early-stage design concepts. While generative AI (GenAI) chatbots are increasingly used for idea generation, designers often struggle to craft effective prompts and find it difficult to express evolving visual concepts through text alone. In a formative study (N=6), we examined how designers use GenAI during ideation and found that text-based prompting disrupts creative flow. To address these issues, we developed TalkSketch, an embedded multimodal AI sketching system that integrates freehand drawing with real-time speech input. TalkSketch aims to support a more fluid ideation process by capturing verbal descriptions during sketching and generating context-aware AI responses. Our work highlights the potential of GenAI tools to engage with the design process itself rather than focusing on output.

TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech

The capacity for complex, evidence-grounded, and strategically adaptive persuasion remains a formidable grand challenge for artificial intelligence. Prior work, like IBM Project Debater, focused on generating isolated persuasive speeches in highly simplified and shortened debate formats for lay audiences. We introduce a novel autonomous system capable of participating in and winning a full, unmodified two-team competitive policy debate. Our system employs a hierarchical architecture of specialized multi-agent workflows, where teams of LLM-powered agents collaborate and critique one another to perform discrete argumentative tasks. Each workflow utilizes iterative retrieval, synthesis, and self-correction using a massive corpus of policy debate evidence (OpenDebateEvidence) [Roush et al. 2024] and produces complete speech transcripts, cross-examinations, and rebuttals. We introduce a live, interactive end-to-end presentation pipeline that renders debates with AI speech and animation: transcripts are surface realized and synthesized to audio with OpenAI text-to-speech (gpt-4.1-tts), and then displayed as talking-head portrait videos with EchoMimic V1 [Chen et al. 2024, 2025, OpenAI 2025e,c]. Beyond fully autonomous (AI vs. AI) matches, the system supports hybrid human–AI operation: human debaters can intervene at any stage, and humans can optionally serve as opponents against the AI in any speech, enabling AI–human as well as AI–AI rounds. In preliminary evaluations against human-authored cases, our system produces qualitatively superior argumentative components and consistently wins simulated rounds as adjudicated by an independent autonomous judge. Expert human debate coaches also prefer the arguments, evidence, and cases constructed by our system.

A superpersuasive autonomous policy debating system

MOVE-ME is a wearable AI system that functions as a choreographic companion, designed to interrupt and inspire dancers’ improvisational flow in real time. Equipped with an on-body camera and speech synthesis, the system observes the dancer’s environment and responds to visual input and text prompts with spoken suggestions. These responses, ranging from poetic provocations to site-specific directives, create a feedback loop in which human and machine co-compose movement, challenging distinctions between spontaneity and computation, authorship and obedience.
The project explores AI as a relational agent rather than as a tool, an intelligent presence that shapes choreography through co-creation. MOVE-ME was featured in three practice-based research projects between 2024 and 2025, each testing different choreographic configurations and affective dynamics.

MOVE-ME: Dance Choreography with AI

This study presents a cross-cultural implementation of an artistic Brain–Computer Interface (BCI) and generative artificial intelligence (GenAI) system designed for live performance within the Balinese gamelan tradition. The BCI-GenAI system enables real-time translation of neural synchrony between dyads of performers (musician-musician, dancer-dancer, musician-dancer) into the control of culturally-relevant generative visual projections that interact with performers and audience alike.
Nine Balinese artists (six musicians and three dancers) participated in a two-part performance, Janaki Dewi: Sita’s Reverie and Wiwada Manik: Tales of the Brothers. Dyads of artists wore Mobile Brain-Body Imaging (MoBI) technology to capture in real time their brain (electroencephalography, EEG) and ocular (electrooculography, EOG) activities, head motion, and video during rehearsals and a public performance over a period of 3 weeks. All signals were synchronized by hardware. A Brain-Computer Interface (BCI) preprocessed the MoBI signals and computed the inter-brain synchrony between dyads. Synchrony indices derived from EEG bispectra modulated diffusion parameters in a StreamDiffusion-based GenAI, whereas text prompts consisting of culturally-relevant narrative and emotional descriptors reflected the story’s mythological and affective dimensions. Thus, the BCI-GenAI system linked the real-time inter-brain synchronization to dynamic imagery projected live on stage. The system functioned as a creative partner in-the-loop, responsive to both the emotional and rhythmic structure of performance. The multi-institutional, cross-cultural project contributes a methodological framework for BCI-GenAI in artistic settings, emphasizing cross-cultural collaboration, the symbiosis between cultural traditions and emergent technologies, and ethical data governance. It advances a model of responsible human–AI co-creation, where technology supports rather than displaces tradition, thereby preserving the continuity of cultural identity through innovation, team science, and transdisciplinarity.

An Artistic BCI-GenAI System Enabling Real-Time Co-Creation in Balinese Performance

This work-in-progress introduces SightDog, a hybrid framework that equips a robotic quadruped with multimodal artificial intelligence to support blind and visually impaired users. Designed as a robotic guide dog, SightDog integrates Vision–Language Models (VLMs) with schema-driven function calling to interpret natural language instructions and visual perception for navigation and assistance. A central focus is human–robot interaction: SightDog enables natural language dialog, provides contextual feedback, and adapts its responses to user intent in real time. In doing so, the system demonstrates a form of creative communication, where improvisation and adaptive dialog play a crucial role. In a simulation study, we show that SightDog can deliver environmental information and perform reactive navigation toward user-specified goals while managing obstacles and crosswalks. As part of this creative HRI, the system can also engage in informal dialog (e.g., telling casual jokes, which we illustrate in the experiment provided). While primarily an assistive prototype, the system also raises questions for interactive and creative AI, particularly regarding how AI systems respond to improvised human input. By situating SightDog within both accessibility and interactive AI research, this work contributes to discussions on human–robot collaboration and the future of guidance technologies.

SightDog: Function Calling and Creative Dialogue for AI‑Enhanced Guide Dogs

Recent advances in image generative models have enabled rich collaborations between humans and AI systems. Among these, Energy-Based Models (EBMs) learn an energy landscape that guides noised samples toward high-probability regions. Unlike diffusion models that use fixed time schedules, EBMs possess equilibrium properties that enable user feedback during generation without destabilizing the distribution. However, current EBM research primarily optimizes for high-fidelity images, offering little control over the trade-off between semantic realism and fine-grained diversity, an essential feature for interactive creative applications. Artists and creatives thus lack a modality to balance semantic coherence (e.g., “a red apple”) with creative variation (e.g., apples of different shapes or colors). To address this, we introduce a geometry-aware annealing framework for EBMs. We propose a directionally-aware annealing variable that leverages local geometric information to directionally adjust the effective noise level during sampling. Such an annealing feedback mechanism allows users to generate semantically realistic images before progressively exploring higher-diversity, more creative variants. Together, these techniques enable a controllable balance between fidelity and creativity, advancing the use of EBMs for interactive creative AI.

Geometry-Aware Energy-Based Image Modelling

Algorithmic fairness has grown rapidly, yet key concepts remain unsettled in criminal justice. We review group, individual, and process fairness and map the conditions under which they conflict. We then develop a simple modification to standard group fairness. Rather than exact parity across protected groups, we minimize a weighted error loss while keeping differences in false negative rates within a small tolerance. This improves feasibility, raises accuracy, and highlights the ethical choice of error costs. We situate this proposal within three classes of critique: biased and incomplete data, latent affirmative action, and the explosion of subgroup constraints. Finally, we propose a practical framework for deployment in public systems, built on three pillars: need-based decisions, transparency, and narrowly tailored solutions. Together, these elements link technical design to legitimacy and provide actionable guidance for agencies that use risk assessment and related tools.

Alternative Fairness and Accuracy Optimization in Criminal Justice
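The relaxed criterion described in the abstract above (minimize a weighted error loss while keeping false negative rates across groups within a small tolerance) can be sketched as a constrained search over per-group decision thresholds. The code below is only an illustration under stated assumptions: the grid search, the function names `confusion` and `fit_thresholds`, the cost weights, and the toy scores are all hypothetical and are not taken from the paper.

```python
from itertools import product


def confusion(scores, labels, threshold):
    """Return (false negatives, false positives, positives) at a threshold."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return fn, fp, sum(labels)


def fit_thresholds(groups, grid, cost_fn, cost_fp, tol):
    """Grid-search one threshold per group, minimizing weighted error
    subject to the spread of false negative rates staying within tol."""
    names = sorted(groups)
    best = None
    for combo in product(grid, repeat=len(names)):
        stats = [confusion(*groups[g], t) for g, t in zip(names, combo)]
        fnrs = [fn / pos for fn, _, pos in stats]
        if max(fnrs) - min(fnrs) > tol:
            continue  # violates the relaxed parity constraint
        loss = sum(cost_fn * fn + cost_fp * fp for fn, fp, _ in stats)
        if best is None or loss < best[0]:
            best = (loss, dict(zip(names, combo)), fnrs)
    return best


# Toy risk scores and outcomes per group (hypothetical data).
groups = {
    "A": ([0.2, 0.4, 0.6, 0.8], [0, 1, 0, 1]),
    "B": ([0.1, 0.3, 0.5, 0.9], [0, 1, 1, 1]),
}
best_loss, thresholds, fnrs = fit_thresholds(
    groups, grid=[0.25, 0.5, 0.75], cost_fn=2.0, cost_fp=1.0, tol=0.2
)
```

Setting `tol=0` recovers exact parity in false negative rates, which may leave no feasible threshold combination on some data. The ratio of `cost_fn` to `cost_fp` is where the ethical choice of error costs enters.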

Catastrophic forgetting remains a critical challenge in continual learning for large language models (LLMs), where models struggle to retain performance on historical tasks when fine-tuning on new sequential data without access to past datasets. In this paper, we first reveal that the drift of functional directions during the fine-tuning process is a key reason why existing regularization-based methods fail in long-term LLM continual learning. To address this, we propose Dynamic Orthogonal Continual (DOC) fine-tuning, a novel approach that tracks the drift of these functional directions and dynamically updates them during the fine-tuning process. Furthermore, by adjusting the gradients of new task parameters to be orthogonal to the tracked historical function directions, our method mitigates interference between new and old tasks. Extensive experiments on various LLM continual learning benchmarks demonstrate that this approach outperforms prior methods, effectively reducing catastrophic forgetting and providing a robust tool for continuous LLM fine-tuning.

Dynamic Orthogonal Continual Fine-tuning for Mitigating Catastrophic Forgetting of LLMs
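The gradient adjustment mentioned in the abstract above, making new-task gradients orthogonal to tracked historical directions, can be illustrated with a generic projection step. This sketch assumes the historical directions are already available as plain vectors. How DOC tracks and updates them is not specified in the abstract, so that part is omitted. The helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: turn the tracked directions into an orthonormal basis,
    dropping any direction that is (numerically) linearly dependent."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > eps:
            basis.append([x / norm for x in r])
    return basis


def project_orthogonal(grad, directions):
    """Remove from grad every component lying in the span of directions."""
    out = list(grad)
    for b in orthonormalize(directions):
        c = dot(out, b)
        out = [x - c * y for x, y in zip(out, b)]
    return out


# Hypothetical historical directions and a new-task gradient.
historical = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
grad = [3.0, 4.0, 5.0]
projected = project_orthogonal(grad, historical)
```

An update along `projected` leaves the model's response along the historical directions unchanged to first order, which is the usual motivation for this kind of projection.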

Intrinsic self-correction refers to the phenomenon where a language model refines its own outputs purely through prompting, without external feedback or parameter updates. While this approach improves performance across diverse tasks, its internal mechanism remains poorly understood. We analyze intrinsic self-correction through the representation shift induced by prompting. We formalize the notion of a prompt-induced shift, which is the change in hidden representations caused by a self-correction prompt. Across 5 open-source LLMs, prompt-induced shifts in text detoxification and text toxification align with latent directions constructed from contrastive pairs. In detoxification, the shifts align with the non-toxic direction; in toxification, they align with the toxic direction. These results suggest that intrinsic self-correction functions as representation steering along interpretable latent directions. Our analysis highlights that an understanding of model internals can be a direct route to analyzing the mechanisms of prompt-driven LLM behaviors.
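One simple way to picture the alignment described in the abstract above is as a cosine similarity between a prompt-induced shift and a direction built from contrastive pairs. The sketch below uses toy 2-dimensional vectors and a mean-difference direction. Both are assumptions made for illustration, since the abstract does not state how the latent directions or the alignment measure are computed.

```python
import math


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def subtract(u, v):
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Toy hidden states for contrastive pairs (hypothetical values).
toxic = [[1.0, 0.0], [0.8, 0.2]]
non_toxic = [[0.0, 1.0], [0.2, 0.8]]
non_toxic_direction = subtract(mean_vector(non_toxic), mean_vector(toxic))

# Hidden state of one input without and with a self-correction prompt.
h_without_prompt = [0.7, 0.3]
h_with_prompt = [0.3, 0.6]
shift = subtract(h_with_prompt, h_without_prompt)  # prompt-induced shift

alignment = cosine(shift, non_toxic_direction)
```

A value near 1 would indicate that the prompt moves the representation along the non-toxic direction, and a value near -1 would indicate movement toward the toxic side.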
