Content not yet available

This lecture has no active video or poster.

AAAI 2026

January 23, 2026

Singapore, Singapore

Would you like to see your presentation here, made available to a global audience of researchers?
Add your own presentation or have us affordably record your next conference.

Temporal reasoning is a fundamental capability for large language models (LLMs) to understand real-world dynamics. Existing research on temporal reasoning has predominantly focused on the Gregorian calendar. However, as many countries and regions concurrently adopt multiple calendar systems, temporal reasoning across calendars becomes crucial for LLMs in global and multicultural contexts. Unfortunately, cross-calendar temporal reasoning remains underexplored, with no dedicated benchmark available to evaluate this capability. To bridge this gap, we introduce SPAN, a cross-calendar temporal reasoning benchmark, which requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features 10 cross-calendar temporal reasoning directions, two reasoning types, and two question formats, involving the Gregorian, Chinese lunar, Shaka, Hebrew, Islamic, and Persian calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven evaluation protocol for dynamic instance generation, which allows assessment on a user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that this task remains challenging. Through in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we identify two key obstacles for LLMs: Future-Date Degradation and Calendar Asymmetry Bias. To strengthen LLMs' cross-calendar temporal reasoning capability, we further develop an LLM-powered Time Agent that leverages tool-augmented code generation. Empirical results show that Time Agent achieves an average accuracy of 95.31%, outperforming several competitive baselines, highlighting the potential of tool-augmented code generation to advance cross-calendar temporal reasoning. We hope this work will inspire further efforts toward more temporally and culturally adaptive LLMs.

Downloads

Paper

Next from AAAI 2026

SD-PSFNet: Sequential and Dynamic Point Spread Function Network for Image Deraining
poster

SD-PSFNet: Sequential and Dynamic Point Spread Function Network for Image Deraining

AAAI 2026

+1Haoran Sun
Haoyu Bian and 3 other authors

23 January 2026

Stay up to date with the latest Underline news!

Select topic of interest (you can select more than one)

PRESENTATIONS

  • All Presentations
  • For Librarians
  • Resource Center
  • Free Trial
Underline Science, Inc.
1216 Broadway, 2nd Floor, New York, NY 10001, USA

© 2025 Underline - All rights reserved