VIDEO DOI: https://doi.org/10.48448/avvb-we13

poster

ACL 2024

August 13, 2024

Bangkok, Thailand

∞Bench: Extending Long Context Evaluation Beyond 100K Tokens

Keywords: long context; benchmark; large language model

Processing and reasoning over long contexts is crucial for many practical applications of Large Language Models (LLMs), such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts with more than 100K tokens, there is currently no standardized benchmark for evaluating this long-context capability. Existing public benchmarks typically focus on contexts around 10K tokens, limiting the assessment and comparison of LLMs on longer contexts. In this paper, we propose ∞Bench, the first LLM benchmark featuring an average data length surpassing 100K tokens. ∞Bench comprises synthetic and realistic tasks spanning diverse domains in English and Chinese. The tasks in ∞Bench are designed to require an understanding of long dependencies in the context, so that simply retrieving a limited number of passages from the context is not sufficient to solve them. Based on ∞Bench, we evaluate several state-of-the-art LLMs tailored for processing long contexts. The experimental results indicate that existing long-context LLMs still require significant advancements to process 100K+ contexts effectively. Furthermore, we present three intriguing analyses regarding the behavior of LLMs processing long contexts. Our code and data are released at https://github.com/OpenBMB/InfiniteBench and https://huggingface.co/datasets/xinrongzhang2022/InfiniteBench.
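As a quick-start illustration (not part of the abstract itself), the following is a minimal Python sketch of loading one ∞Bench task from the Hugging Face dataset linked above. The task name "passkey", the split name "test", and the field names "context", "input", and "answer" are assumptions about the repository layout rather than a confirmed schema; consult the repository README for the actual task names and fields.

    # Minimal sketch: load one InfiniteBench task from the Hugging Face Hub.
    # Assumed (not confirmed by the abstract): the config name "passkey",
    # the split "test", and the fields "context", "input", and "answer".
    from datasets import load_dataset

    ds = load_dataset("xinrongzhang2022/InfiniteBench", "passkey", split="test")

    example = ds[0]
    context = example["context"]   # the long document; per the paper, 100K+ tokens on average
    question = example["input"]    # the query posed over that context
    print(len(context), question, example["answer"])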

