Transformer models have achieved remarkable success across diverse deep learning fields, including natural language processing (NLP) and computer vision (CV). One drawback of these models is that the computational cost of softmax attention, the core component of the transformer, exhibits quadratic complexity in both time and memory with respect to sequence length. As data scales up, various approaches have been proposed to overcome this bottleneck. The objective of this study is to propose a novel attention mechanism, "Cumulant Attention," that systematically balances efficiency and accuracy. This proposal introduces a statistical-mechanics perspective and a reliable approximation based on the cumulant expansion into the attention layer. The low-order variant reduces the computational complexity to linear order, as in linear attention, while preserving the nonlinearity of softmax attention. We evaluate several variants on CV tasks, including image classification with ViT on ImageNet-100 and video classification with ViViT on UCF-101. Experimental results demonstrate that cumulant attention outperforms linear attention and achieves accuracy comparable to softmax attention. These findings validate the effectiveness of our approach and highlight future directions, including scaling to larger models, extending to other modalities, and optimizing implementations for GPU hardware.
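The abstract contrasts quadratic softmax attention with linear-complexity alternatives but does not spell out the cumulant construction itself. As a hedged illustration of the general idea, the sketch below compares standard softmax attention with a generic second-order truncation of exp(q·k) evaluated in linear time via a feature map. This is one well-known way to retain a nonlinear term beyond first-order linear attention, not the paper's actual method; all function names here are hypothetical.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: O(n^2) time and memory in sequence length n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def second_order_features(X):
    """Feature map phi such that phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2,
    i.e. the exponential kernel truncated at second order."""
    n, d = X.shape
    # Outer-product features give the (q.k)^2 / 2 term when dotted.
    quad = np.einsum("ni,nj->nij", X, X).reshape(n, d * d) / np.sqrt(2.0)
    return np.concatenate([np.ones((n, 1)), X, quad], axis=-1)

def truncated_kernel_attention(Q, K, V):
    """Linear-time attention: the n x n weight matrix is never formed.
    Cost is O(n * f) where f is the feature dimension, not O(n^2)."""
    scale = Q.shape[-1] ** 0.25  # split the usual 1/sqrt(d) between Q and K
    phi_q = second_order_features(Q / scale)
    phi_k = second_order_features(K / scale)
    kv = phi_k.T @ V             # (f, d_v) summary of keys and values
    z = phi_k.sum(axis=0)        # (f,) normalizer summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]
```

Note that 1 + x + x²/2 is strictly positive for all real x, so the normalizer never vanishes; for small attention scores the truncated kernel closely tracks the softmax output, which is one way to see how such expansions trade accuracy for linear cost.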
