Retrieval-Augmented Generation (RAG) enhances language models (LMs) by retrieving and incorporating information relevant to the user's request. However, existing embedding-based semantic relevance measures are often ineffective, and neural retrievers require expensive fine-tuning. To address these limitations, we propose You Only Use Reactive Attention slice (YOURA), an attention-based, training-free, fine-tuning-free technique for quantifying the semantic relevance of two sentences. YOURA leverages a novel retrieval heuristic, the reaction score, which measures how the LM's self-attention holistically "reacts" to the appended query; the most reactive sentences are then greedily retrieved. In addition, we propose a sentence extraction algorithm that facilitates context preprocessing by splitting the context token sequence and efficiently mapping the sub-sequences back to sentences. Evaluation on three open-source pre-trained LMs across six task types (single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and needle-in-a-haystack) demonstrates that our framework improves QA task accuracy by up to 15% and inference throughput by up to 30%.
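The paper defines the reaction score precisely; as a rough illustration of the general idea only, the sketch below scores each context sentence by the attention mass that the appended query's tokens place on that sentence's tokens, then greedily keeps the highest-scoring sentences under a token budget. The model choice, the layer/head averaging, and the budgeted greedy selection are assumptions made for illustration, not the paper's exact algorithm.

```python
# Illustrative sketch of attention-based "reaction" retrieval (NOT the paper's
# exact formula): score each context sentence by how much attention the
# appended query pays to it, then greedily keep the most reactive sentences.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM that returns attentions works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def reaction_scores(sentences: list[str], query: str) -> list[float]:
    """Approximate per-sentence reaction: attention mass from query tokens
    to each sentence's token span, averaged over layers and heads."""
    spans, ids = [], []
    for s in sentences:  # tokenize sentences and record their token spans
        t = tok(s, add_special_tokens=False).input_ids
        spans.append((len(ids), len(ids) + len(t)))
        ids.extend(t)
    q_start = len(ids)
    ids.extend(tok(query, add_special_tokens=False).input_ids)
    with torch.no_grad():
        out = model(torch.tensor([ids]), output_attentions=True)
    # out.attentions: tuple(num_layers) of (batch, heads, seq, seq)
    att = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # -> (seq, seq)
    q_rows = att[q_start:]  # attention from query tokens to every position
    return [q_rows[:, a:b].sum().item() for a, b in spans]

def greedy_retrieve(sentences: list[str], query: str, budget: int = 512) -> list[str]:
    """Greedily keep the most reactive sentences within a token budget."""
    scores = reaction_scores(sentences, query)
    kept, used = set(), 0
    for i in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        n = len(tok(sentences[i], add_special_tokens=False).input_ids)
        if used + n <= budget:
            kept.add(i)
            used += n
    return [sentences[i] for i in sorted(kept)]  # preserve document order
```

The span bookkeeping above also gestures at the abstract's sentence extraction step: the context is processed as one token sequence, and each scored slice is mapped back to its originating sentence so retrieval returns whole sentences in their original order.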