In this paper, we propose a novel benchmark, ChronoBias, for evaluating group bias in retrieval-augmented language models over time. Our benchmark is built on a template-based, semi-automated generation method, which effectively balances the quality-quantity tradeoff in existing benchmark curation methods. We show that for knowledge that changes over time, group bias must be measured by jointly considering the time-varying nature of the knowledge for each group. Specifically, even a majority group with high parametric knowledge is susceptible to severe performance degradation when given incorrect information, especially when the knowledge itself is volatile. Additionally, we show that high-capacity LLMs (e.g., ChatGPT-4o) exhibit knowledge-conflict-like patterns even beyond their knowledge cutoff due to their forecasting capability, a phenomenon we call forecasting conflict. This forecasting conflict is more prominent in majority groups with high knowledge volatility. We then propose a fairness metric conditioned on the time interval, and show that a time-dependent notion of group fairness is necessary to maintain group fairness across all time steps. We conclude that retrieval adaptive to each group and time interval is necessary, both to enhance group fairness and to prevent unnecessary performance degradation.
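To make the idea of a time-interval-conditioned fairness metric concrete, here is a minimal sketch. The function name, the record format, and the specific gap definition (maximum per-interval accuracy difference between groups) are illustrative assumptions, not the paper's actual metric.

```python
# Hypothetical sketch of a group fairness metric conditioned on time interval.
# The exact definition here (max pairwise accuracy gap per interval) is an
# assumption for illustration, not taken from the ChronoBias paper.
from collections import defaultdict

def interval_fairness_gap(records):
    """records: iterable of (group, interval, correct) triples.
    Returns {interval: max accuracy gap between any two groups}."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for group, interval, correct in records:
        counts[(group, interval)] += 1
        hits[(group, interval)] += int(correct)
    gaps = {}
    intervals = {iv for (_, iv) in counts}
    for iv in intervals:
        accs = [hits[key] / counts[key] for key in counts if key[1] == iv]
        gaps[iv] = max(accs) - min(accs)
    return gaps

# Toy example: two groups evaluated over two time intervals.
records = [
    ("majority", "2020-2021", True), ("majority", "2020-2021", True),
    ("minority", "2020-2021", True), ("minority", "2020-2021", False),
    ("majority", "2022-2023", False), ("minority", "2022-2023", True),
]
gaps = interval_fairness_gap(records)
print(gaps)  # per-interval accuracy gaps between groups
```

Conditioning the gap on the interval, rather than averaging over all time, is what lets a system detect intervals where one group's volatile knowledge causes a fairness violation that a global average would mask.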