Understanding the generalization behavior of in-context learning (ICL) in Transformers remains a fundamental challenge: most existing theoretical analyses rely on the assumption that data are independently and identically distributed (i.i.d.), which often fails in practice. Motivated by the theoretical insight that ICL operates similarly to gradient-based optimization, we leverage the concept of gradient stability to establish generalization error bounds for ICL without any distributional assumptions. Our analysis shows that two factors play a central role in ICL generalization: the number of demonstrations in the prompt and their distributional alignment with the query. In particular, increasing the number of demonstrations and improving their alignment with the query distribution both lead to better generalization, even without any parameter tuning. Under mild conditions, we further prove that the generalization error achieves the optimal convergence rate of $O(N^{-\frac{1}{2}})$, where $N$ is the number of demonstrations. Our empirical evaluations corroborate these theoretical findings.
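A minimal runnable sketch of this intuition (a toy illustration, not the paper's experiments or its exact bound): treat ICL as one gradient-descent step computed from the prompt demonstrations, in line with the gradient-based view above, and measure the squared prediction error on a query whose input distribution may be shifted away from the demonstrations. The dimension, step size, noise level, and shift values below are arbitrary choices made for illustration.

```python
import numpy as np

# Toy model: demonstrations (x_i, y_i) follow a linear task; the "in-context
# learner" takes a single gradient step on the prompt's squared loss, starting
# from zero weights. We vary (i) the number of demonstrations N and (ii) a mean
# shift between demonstration inputs and the query input.
rng = np.random.default_rng(0)
d = 10                       # input dimension (arbitrary)
w_star = rng.normal(size=d)  # ground-truth linear task

def query_error(N, shift, n_trials=200, lr=1.0):
    errs = []
    for _ in range(n_trials):
        X = rng.normal(size=(N, d))                 # demonstration inputs
        y = X @ w_star + 0.1 * rng.normal(size=N)   # noisy demonstration labels
        w0 = np.zeros(d)                            # initial weights
        w = w0 + lr * X.T @ (y - X @ w0) / N        # one GD step on the prompt loss
        xq = rng.normal(size=d) + shift             # query input, possibly shifted
        errs.append((xq @ (w - w_star)) ** 2)       # squared prediction error
    return float(np.mean(errs))

for shift in (0.0, 1.0):
    for N in (8, 32, 128, 512):
        print(f"shift={shift:.1f}  N={N:4d}  mean query error={query_error(N, shift):.4f}")
```

When run, the printed query error shrinks as $N$ grows and rises as the demonstration-query shift increases, mirroring the two factors the analysis identifies: the number of demonstrations and their distributional alignment with the query.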