Temporal reasoning is a fundamental capability for large language models (LLMs) to understand real-world dynamics. Existing research on temporal reasoning has predominantly focused on the Gregorian calendar. However, as many countries and regions concurrently adopt multiple calendar systems, temporal reasoning across calendars becomes crucial for LLMs in global and multicultural contexts. Unfortunately, cross-calendar temporal reasoning remains underexplored, with no dedicated benchmark available to evaluate this capability. To bridge this gap, we introduce SPAN, a cross-calendar temporal reasoning benchmark, which requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features 10 cross-calendar temporal reasoning directions, two reasoning types, and two question formats, involving the Gregorian, Chinese lunar, Shaka, Hebrew, Islamic, and Persian calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven evaluation protocol for dynamic instance generation, which allows assessment on a user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that this task remains challenging. Through in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we identify two key obstacles for LLMs: Future-Date Degradation and Calendar Asymmetry Bias. To strengthen LLMs' cross-calendar temporal reasoning capability, we further develop an LLM-powered Time Agent that leverages tool-augmented code generation. 
Empirical results show that Time Agent achieves an average accuracy of 95.31%, outperforming several competitive baselines and highlighting the potential of tool-augmented code generation to advance cross-calendar temporal reasoning. We hope this work will inspire further efforts toward more temporally and culturally adaptive LLMs.
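To make the inter-calendar conversion task concrete, the sketch below converts a Gregorian date to the Islamic calendar by routing through the Julian Day Number. Note the hedge: the observational Islamic calendar depends on moon sighting, so this uses the common arithmetic ("tabular") approximation; SPAN's own conversion rules are defined in the paper, and this example is an illustration only, not the benchmark's reference implementation.

```python
def gregorian_to_jdn(year: int, month: int, day: int) -> int:
    """Gregorian date -> integer (noon-based) Julian Day Number."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045

def islamic_to_jdn(year: int, month: int, day: int) -> int:
    """Tabular (arithmetic) Islamic date -> Julian Day Number."""
    return (day
            + (59 * (month - 1) + 1) // 2        # ceil(29.5 * (month - 1))
            + (year - 1) * 354                   # 354 days per common year
            + (3 + 11 * year) // 30              # leap days in the 30-year cycle
            + 1948439)                           # civil epoch offset

def gregorian_to_islamic(year: int, month: int, day: int) -> tuple[int, int, int]:
    """Convert a Gregorian date to a tabular Islamic (year, month, day)."""
    jdn = gregorian_to_jdn(year, month, day)
    i_year = (30 * (jdn - 1948440) + 10646) // 10631
    n = jdn - 29 - islamic_to_jdn(i_year, 1, 1)
    i_month = min(12, (2 * n + 58) // 59 + 1)    # ceil(n / 29.5) + 1
    i_day = jdn - islamic_to_jdn(i_year, i_month, 1) + 1
    return i_year, i_month, i_day

print(gregorian_to_islamic(2000, 1, 1))  # (1420, 9, 24): 24 Ramadan 1420 AH
```

Going through the Julian Day Number keeps each calendar's rules independent: adding any further calendar only requires a to-JDN and from-JDN pair, rather than pairwise converters for every one of the benchmark's reasoning directions.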