
Monitoring forecasting systems is critical for customer satisfaction, profitability, and operational efficiency in large-scale retail businesses, but relying on human expertise is costly and not scalable.
We propose \texttt{The Forecast Critic}, a system that leverages Large Language Models (LLMs) for automated forecast monitoring, taking advantage of their broad world knowledge and strong reasoning capabilities.
As a prerequisite for this, we systematically evaluate the ability of LLMs to assess time series forecast quality, focusing on three key questions.
(1) Can LLMs be deployed to perform forecast monitoring and identify obviously unreasonable forecasts?
(2) Can LLMs effectively incorporate unstructured exogenous features to assess what a reasonable forecast looks like?
(3) How does performance vary across model sizes and reasoning capabilities, measured across five state-of-the-art LLMs?
We present three experiments covering both synthetic and real-world forecasting data. Our results show that LLMs can reliably detect and critique poor forecasts, such as those plagued by temporal misalignment, trend inconsistencies, and spike errors.
The best-performing model we evaluated achieves an F1 score of $0.88$, somewhat below human-level performance (F1 score: $0.97$). We demonstrate that multi-modal LLMs can effectively incorporate unstructured contextual signals to refine their assessment of the forecast. Models correctly identify missing or spurious promotional spikes when provided with historical context about past promotions (F1 score: $0.84$). Lastly, we demonstrate that these techniques succeed in identifying significantly inaccurate forecasts on the real-world M5 time series dataset, with unreasonable forecasts having an sCRPS at least 10\% higher than that of reasonable forecasts.
These findings suggest that LLMs, even without domain-specific fine-tuning, may provide a viable and scalable option for automated forecast monitoring and evaluation.
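
For readers less familiar with the metric, the F1 scores quoted above can be read as the harmonic mean of precision and recall on a binary task of flagging forecasts as unreasonable. The following is a minimal sketch of that computation, not the authors' evaluation code. The function name `f1_score`, the label convention (1 = unreasonable, 0 = reasonable), and the toy labels are assumptions made purely for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 for a binary task; label 1 is the positive class.

    Assumption for illustration: 1 means "forecast judged unreasonable".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged, but fine
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed bad forecast
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: ground truth vs. a critic's verdicts.
truth = [1, 1, 1, 0, 0]
verdicts = [1, 1, 0, 1, 0]
score = f1_score(truth, verdicts)
```

In this toy case the critic catches two of three bad forecasts and raises one false alarm, so precision and recall are both 2/3 and the F1 score is 2/3 as well.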
