AAAI 2026

January 23, 2026

Singapore, Singapore


Understanding human actions in videos requires robust integration of multimodal cues beyond raw pixels. This work introduces a deep self-supervised action recognition framework that jointly predicts action concepts and auxiliary features from RGB video. At test time it hallucinates the missing modalities from RGB alone, which improves recognition without added runtime cost. Two new domain-specific descriptors are proposed: Object Detection Features (ODF), which capture spatial context, and Saliency Detection Features (SDF), which capture motion saliency. Both are integrated with other modalities such as optical flow, skeleton, audio, and improved dense trajectories. The framework incorporates aleatoric uncertainty modeling to handle noisy or unreliable features, along with a robust loss for stable multimodal fusion. The approach is compatible with popular architectures including I3D, AssembleNet, Video Transformer Network, VideoMAE V2, and InternVideo2, and achieves state-of-the-art results on Kinetics-400, Kinetics-600, and Something-Something V2.
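
The abstract does not spell out the uncertainty-aware loss, so the following is only a minimal sketch of one common way aleatoric uncertainty is applied to feature regression: a heteroscedastic Gaussian negative log-likelihood, in which each hallucinated feature dimension carries a predicted log-variance. The function name, the toy values, and the choice of this particular formulation are assumptions made for illustration and are not taken from the paper.

```python
import math

def uncertainty_weighted_loss(pred, target, log_var):
    """Heteroscedastic Gaussian NLL (up to a constant), averaged over elements.

    pred, target, log_var are equal-length sequences of floats.
    A large log_var down-weights the squared error of that element,
    while the +0.5 * log_var term penalizes claiming high uncertainty everywhere.
    This is an illustrative formulation, not the loss from the paper.
    """
    terms = [
        0.5 * math.exp(-s) * (p - t) ** 2 + 0.5 * s
        for p, t, s in zip(pred, target, log_var)
    ]
    return sum(terms) / len(terms)

# Toy "hallucinated" feature vector: three dimensions match the target,
# and the last one is off by 4 (squared error 16), mimicking a noisy feature.
target = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 2.0, 3.0, 8.0]

# Fixed unit variance everywhere (log_var = 0): plain halved squared error.
flat_loss = uncertainty_weighted_loss(pred, target, [0.0, 0.0, 0.0, 0.0])

# Predicted variance matches the squared error on the noisy dimension.
adaptive_log_var = [0.0, 0.0, 0.0, math.log(16.0)]
adaptive_loss = uncertainty_weighted_loss(pred, target, adaptive_log_var)

print(flat_loss, adaptive_loss)
```

In this toy case the unreliable dimension dominates the loss when the variance is fixed, and contributes much less once the predicted variance absorbs it, which is the general intuition behind using uncertainty to cope with noisy auxiliary features.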


