Singapore

Speaker anonymization aims to modify the speech signal in
order to protect the identity of a speaker while preserving
the linguistic content. Despite the increasing use of
children's voices in educational applications, such as oral
reading fluency (ORF) assessment, there is little work on
anonymization aspects. In this work, we investigate the
effectiveness of available speaker anonymization methods
drawing from traditional speech-production based approaches
and a neural codec based method. We investigate the
trade-off between privacy protection, measured as the
degree of anonymity, and utility preservation, which, in the
current context of ORF assessment, includes the segmental
and suprasegmental features of children’s read speech
utterances. We report objective and subjective evaluations
using two child-speaker datasets: MPS and SpeechOcean. Our
objective evaluation results indicate that the
speech-production based method of vocal tract length
normalization coupled with pitch-transposition achieves the
best balance between privacy and utility. Subjective
listening results indicate that naturalness is achievable
across methods while the neural method fails to preserve
age characteristics, which are more easily controlled by
the speech-production driven methods.
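
To make the speech-production based method named above more concrete, here is a minimal sketch of frequency warping for vocal tract length modification combined with pitch transposition. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which warping function is used, so the first-order all-pass (bilinear) warp is assumed here, and the function names, the warping factor `alpha`, and the semitone shift are hypothetical.

```python
import math


def bilinear_warp(omega: float, alpha: float) -> float:
    """Warp a normalized angular frequency omega in [0, pi].

    Assumed form: the phase response of a first-order all-pass filter,
    a warping often used for vocal tract length modification.
    alpha in (-1, 1) is a hypothetical warping factor; alpha = 0 is the identity.
    """
    return omega + 2.0 * math.atan2(
        alpha * math.sin(omega), 1.0 - alpha * math.cos(omega)
    )


def transpose_pitch(f0_hz: float, semitones: float) -> float:
    """Shift a fundamental frequency by a number of equal-tempered semitones."""
    return f0_hz * 2.0 ** (semitones / 12.0)


if __name__ == "__main__":
    # The endpoints 0 and pi stay fixed; frequencies in between are moved.
    for omega in (0.0, math.pi / 4, math.pi / 2, math.pi):
        print(round(omega, 4), round(bilinear_warp(omega, alpha=0.1), 4))

    # Transposing 220 Hz up by 12 semitones doubles it.
    print(transpose_pitch(220.0, 12))  # 440.0
```

With a positive `alpha`, every frequency strictly between 0 and π is mapped upward, while the band edges stay in place. In an actual anonymization system the warp would be applied to a spectral envelope and the transposition to an extracted pitch contour before resynthesis; both steps are outside the scope of this sketch.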

AAAI 2026

Speaker Anonymization for Children's Oral Reading Assessment

peai

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

Scrap quality directly affects energy use, emissions, and safety in steelmaking. Today, the share of non-metallic inclusions (contamination) is judged visually by inspectors, an approach that is subjective and hazardous due to dust and moving machinery. We present an assistive computer vision pipeline that estimates contamination (in percent) from images captured during railcar unloading and also classifies scrap type. The method formulates contamination assessment as a regression task at the railcar level and leverages sequential data through multi-instance learning (MIL) and multi-task learning (MTL). The best results are MAE 0.27 and R² 0.83 with MIL, while an MTL setup reaches MAE 0.36 with F1 0.79 for scrap class. We also present the system operating in near real time within the acceptance workflow: magnet/railcar detection segments temporal layers, a versioned inference service produces railcar-level estimates with confidence scores, and results are reviewed by operators with structured overrides; corrections and uncertain cases feed an active-learning loop for continual improvement. The pipeline reduces subjective variability, improves human safety, and enables integration into acceptance and melt-planning workflows.
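
For readers unfamiliar with the two regression metrics quoted above, the following minimal sketch shows how MAE and R² are computed from railcar-level estimates. The contamination values are invented for illustration and are not taken from the paper, and the function names are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical contamination percentages for four railcars.
true_pct = [1.0, 2.0, 3.0, 4.0]
pred_pct = [1.5, 2.0, 2.5, 4.0]

if __name__ == "__main__":
    print(mean_absolute_error(true_pct, pred_pct))
    print(r2_score(true_pct, pred_pct))
```

MAE is expressed in the same unit as the target (here, percentage points of contamination), while R² is unitless and measures how much of the variance across railcars the estimates explain.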

From Images to Decisions: Assistive Computer Vision for Non‑Metallic Content Estimation in Scrap Metal

Multimodal Procedural Planning (MPP) facilitates human learning by integrating multiple modalities, such as text and video, to enhance comprehension and execution of procedural tasks. Instructional Videos (IVs) are crucial in MPP as they provide rich visual and auditory cues, making complex procedures more accessible. Large Language Models (LLMs) have shown significant potential for MPP by generating context-aware procedural plans and bridging multimodal information gaps. However, two major limitations remain: the high cost of dataset collection and the under-utilization of the vast number of IVs available on platforms such as YouTube, where procedural content remains largely unstructured. To address this gap, we propose a novel framework, Visually Grounded MPP (VG-MPP), which leverages publicly available raw IVs, thereby eliminating the need for manually curated datasets and enabling the use of a broader range of MPP content. Our method harnesses the zero-shot reasoning capability of LLMs, the video-to-text generation ability of video captioning models, and the text-to-video generation capability of diffusion models, selectively filtering unnecessary information to deliver concise and essential multimodal guidance. Enabling AI-assisted MPP and skill learning in this way is critical for human-centric manufacturing, where multimodal reasoning can assist humans by adapting to human variability and procedural uncertainty. Comprehensive evaluations demonstrate that VG-MPP provides a substantial advantage over baselines, outperforming existing approaches without depending on any specifically curated dataset.

VG-MPP: Visually Grounded Multimodal Procedural Planning from Unstructured Instructional Videos

Reinforcement learning (RL) has demonstrated great potential in robotic operations. However, its data-intensive nature and reliance on the Markov Decision Process (MDP) assumption limit its practical deployment in real-world scenarios involving complex dynamics and long-term temporal dependencies, such as multi-robot manipulation. Decision Transformers (DTs) have emerged as a promising offline alternative by leveraging causal transformers for sequence modeling in RL tasks. However, their application to multi-robot manipulation remains underexplored. To address this gap, we propose a novel framework, Symbolically-Guided Decision Transformer (SGDT), which integrates a neuro-symbolic mechanism with a causal transformer to enable deployable multi-robot collaboration. In the proposed SGDT framework, a neuro-symbolic planner generates a high-level task-oriented plan composed of human-understandable symbolic subgoals. Guided by these subgoals, a goal-conditioned decision transformer (GCDT) performs low-level sequential decision-making for multi-robot manipulation. This hierarchical architecture enables structured, interpretable, and generalizable decision making in complex multi-robot collaborative tasks. We evaluate the performance of SGDT in zero-shot and few-shot scenarios across multiple benchmark tasks analogous to human-centric manufacturing processes. To our knowledge, this is the first work to explore DT-based technology for multi-robot manipulation.

Toward Reliable Multi-Robot Collaboration via a Symbolically-Guided Decision Transformer

Manufacturing planners face complex operational challenges that require seamless collaboration between human expertise and intelligent systems to achieve optimal performance in modern production environments. Traditional approaches to analyzing simulation-based manufacturing data often create barriers between human decision-makers and critical operational insights, limiting effective partnership in manufacturing planning. Our framework establishes a collaborative intelligence system integrating Knowledge Graphs and Large Language Model-based agents to bridge this gap, empowering manufacturing professionals through natural language interfaces for complex operational analysis. The system transforms simulation data into semantically rich representations, enabling planners to interact naturally with operational insights without specialized expertise. A collaborative LLM agent works alongside human decision-makers, employing iterative reasoning that mirrors human analytical thinking while generating precise queries for knowledge extraction and providing transparent validation. This partnership approach to manufacturing bottleneck identification, validated through operational scenarios, demonstrates enhanced performance while maintaining human oversight and decision authority. For operational inquiries, the system achieves near-perfect accuracy through natural language interaction. For investigative scenarios requiring collaborative analysis, we demonstrate the framework's effectiveness in supporting human experts to uncover interconnected operational issues that enhance understanding and decision-making. This work advances collaborative manufacturing by creating intuitive methods for actionable insights, reducing cognitive load while amplifying human analytical capabilities in evolving manufacturing ecosystems.

Intelligent Human-Machine Partnership for Manufacturing: Enhancing Warehouse Planning through Simulation-Driven Knowledge Graphs and LLM Collaboration

Robust object detection in real-world scenarios with out-of-distribution (OOD) shifts is critical for deployment in the manufacturing industry, where defect detection systems face limited data, high visual variability, and real-world corruptions (e.g., lighting changes, sensor noise). While image augmentation (IA) has shown promise in improving model robustness, its application often requires deep learning expertise, limiting adoption in industrial settings. To address this, we propose a human-centric pipeline that enables non-expert users to select and fuse IA strategies for fine-tuning object detectors to improve robustness against potential OOD conditions. We conduct a comprehensive benchmark of IA methods (Stylized, AugMix, and PixMix) for enhancing detection robustness on five public OOD datasets (three corruption benchmarks and two natural-shift datasets), evaluating both CNN-based (Faster R-CNN) and Vision Transformer-based (DINO) detectors. Our results show that, across OOD scenarios, different models and IA strategies lead to different levels of robustness enhancement. Our findings suggest that user-driven selection of models and IA strategies is essential to achieve robust, real-world object detection performance. We also introduce a dataset-agnostic metric, Effectiveness of Robustness Enhancement (ERE), to facilitate cross-dataset comparison. This work provides practical guidance for deploying robust defect detection in manufacturing applications.

A Human-Centric Augmentation Framework for Robust Object Detection

Human-centric manufacturing requires collaboration between humans and machines, yet digital twins largely ignore human cognitive states. We introduce the adaptive cognitive twin—a lightweight model of worker attention, fatigue, and stress. By fusing EEG, eye-tracking, and heart rate variability with machine learning, our system tracks cognitive states in real time, outperforming state-of-the-art methods in classification accuracy. In real-world factory deployment, it maintained 83.2% fatigue detection accuracy while enabling adaptive robot assistance and safety interventions. We present a complete architecture addressing privacy, interpretability, and scalability, establishing a foundation for human-machine performance optimization.

Adaptive Cognitive Twins: Modeling Human Attention and Fatigue in Collaborative Manufacturing

Understanding human activities in manufacturing requires systems that not only perceive visual cues but also reason about procedural intent. Conventional Temporal Action Segmentation (TAS) methods are primarily perception-centric: they are optimized to classify visual frames rather than to reconstruct meaningful task structures. To bridge this gap, we propose Cognitive TAS, a lightweight Perception–Reasoning Integration Framework that unites visual adaptation and language-based reasoning for intent-aligned temporal understanding. Stage 1 (Adaptive Perception) employs few-shot prompt tuning of a Video-CLIP model to align textual labels with video representations, reducing annotation dependence and improving generalization. Stage 2 (Language-based Reasoning) leverages a text-only large language model (LLM) to integrate window-level predictions with data-driven soft procedural priors, enabling reasoning about temporal and semantic coherence. Through uncertainty-aware interpretation and contextual reasoning, it reconstructs temporally consistent, intent-aligned action sequences. Experiments on the 50Salads benchmark show that Cognitive TAS improves temporal consistency and semantic coherence over perception-based baselines, while approaching the quality of supervised models in a lightweight, annotation-efficient manner. These results suggest a practical path toward cognitively grounded, intent-aware TAS for human–machine collaboration.

Perception–Reasoning Integration for Temporal Action Segmentation in Human-Centered Manufacturing

The adoption of AI-powered computer vision in industry is often constrained by the need to balance operational utility with worker privacy. Building on our previously proposed privacy-preserving framework, this paper presents its first comprehensive validation on real-world data collected directly by industrial partners in active production environments. We evaluate the framework across three representative use cases: woodworking production monitoring, human-aware AGV navigation, and multi-camera ergonomic risk assessment. The approach employs learned visual transformations that obscure sensitive or task-irrelevant information while retaining features essential for task performance. Through both quantitative evaluation of the privacy–utility trade-off and qualitative feedback from industrial partners, we assess the framework’s effectiveness, deployment feasibility, and trust implications. Results demonstrate that task-specific obfuscation enables effective monitoring with reduced privacy risks, establishing the framework’s readiness for real-world adoption and providing cross-domain recommendations for responsible, human-centric AI deployment in industry.

Privacy-Preserving Computer Vision for Industry: Three Case Studies in Human-Centric Manufacturing

Embodied AI systems in human-robot collaborations continuously capture and process multimodal sensor data (e.g., audio, visual, and motion cues) to perceive and adapt to human actions in real time. These rich data streams can often expose personally identifiable or proprietary information such as human traits, machine operations, and production processes, posing significant privacy and IP risks during computations. While privacy-enhancing technologies (PETs) such as differential privacy (DP), secure multi-party computation (SMPC), and homomorphic encryption (HE) can protect data during processing, HE incurs high computation overhead, and other PETs such as DP and SMPC are unable to provide runtime protection for active sensor streams in memory. Existing PET solutions primarily focus on securing static data and fail to secure the data-in-use during the sense-to-act loops of embodied AI. The key contributions of this work are thus threefold: (1) defining runtime privacy as a critical dimension of embodied AI security, (2) proposing PRISM, a hybrid LibOS–TEE framework for continuous multimodal protection, and (3) outlining design insights and future directions for practical deployment. By combining runtime isolation, secure I/O, and context-aware data classification, the proposed framework offers a promising, scalable, and hardware-rooted solution for privacy-preserving embodied AI for human-robot collaborations.

PRISM: A Hybrid LibOS–TEE Framework for Continuous Runtime Privacy in Embodied AI

Explainable AI (XAI) is increasingly adopted in manufacturing to support human-in-the-loop decision-making, enabling practitioners to understand, validate, or override AI-driven recommendations. In practice, AI engineers and data scientists often select an XAI method (a.k.a. explainer) based on familiarity and past experience. This is not optimal, because the literature has already shown that the interplay among {dataset, model, explainer} can substantially impact the overall trustworthiness of the generated explanation. This work presents a functionally grounded evaluation framework for multivariate time series (MTS) explanations, organized around three trustworthiness pillars: faithfulness, robustness, and complexity. We adapt state-of-the-art metrics from the image and tabular domains to time-channel attributions and integrate them into an interactive decision support system (DSS) named TRUE-X (TRUstworthy EXplanations). TRUE-X enables practitioners to visualize and balance the trade-off between the trustworthiness of the explanation and the accuracy of the model, fostering more reliable and transparent XAI in human-centric manufacturing.
