poster
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding
keywords:
inference acceleration; semi-autoregressive generation; speculative decoding
This work aims to accelerate inference for large language models (LLMs) with billions of parameters. We propose Smart Parallel Auto-Correct dEcoding (SPACE), an approach for lossless acceleration of LLM inference. By combining semi-autoregressive inference with speculative decoding, SPACE enables an autoregressive LLM to generate and verify tokens in parallel. This is achieved through a specialized semi-autoregressive supervised fine-tuning process that equips an existing LLM to predict multiple tokens simultaneously. In addition, an auto-correct decoding algorithm generates and verifies token sequences within a single model invocation. In extensive experiments on a range of LLMs, SPACE achieves inference speedups of 2.7x to 4.0x on HumanEval-X while maintaining output quality.
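The abstract does not spell out the auto-correct decoding algorithm, so the following is only a minimal Python sketch of the general greedy draft-and-verify idea that speculative decoding builds on, not SPACE itself. All names (`verify_draft`, `decode`, `toy_next_token`, `toy_propose`) are hypothetical, and the "model" is a toy deterministic function. Two simplifications are assumed: verification is written as a sequential loop for readability, whereas a real system would score all draft positions in one batched forward pass, and the draft comes from a separate toy proposer, whereas SPACE is described as producing and checking candidates with the same model in a single invocation.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]
ProposeFn = Callable[[List[Token], int], List[Token]]


def greedy(prompt: List[Token], next_token: NextTokenFn, max_new_tokens: int) -> List[Token]:
    """Baseline: plain autoregressive greedy decoding, one token per model call."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(next_token(out))
    return out


def verify_draft(prefix: List[Token], draft: List[Token], next_token: NextTokenFn) -> List[Token]:
    """Accept the longest draft prefix that matches greedy decoding.

    On the first mismatch, the model's own token replaces the wrong draft token.
    If every draft token matches, one extra model token is appended.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = next_token(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # corrected token; rest of the draft is discarded
            return accepted
        accepted.append(tok)
    accepted.append(next_token(prefix + accepted))
    return accepted


def decode(prompt: List[Token], next_token: NextTokenFn, propose: ProposeFn,
           max_new_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Draft-and-verify loop; returns the output and the number of verify steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = propose(out, k)
        out.extend(verify_draft(out, draft, next_token))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


def toy_next_token(seq: List[Token]) -> Token:
    """Toy stand-in for an LLM's greedy next-token choice (assumption)."""
    return (3 * seq[-1] + 1) % 11


def toy_propose(seq: List[Token], k: int) -> List[Token]:
    """Toy proposer that deliberately gets the third draft token wrong (assumption)."""
    ctx = list(seq)
    draft: List[Token] = []
    for i in range(k):
        tok = toy_next_token(ctx)
        if i == 2:
            tok = (tok + 1) % 11  # injected error to exercise the correction path
        draft.append(tok)
        ctx.append(tok)
    return draft


if __name__ == "__main__":
    baseline = greedy([1], toy_next_token, 12)
    fast, steps = decode([1], toy_next_token, toy_propose, 12, k=4)
    print(fast == baseline, steps)
```

Because only tokens that agree with greedy decoding are kept, the output is identical to the baseline, which is the sense in which such schemes are called lossless. The gain comes from committing several tokens per verification step rather than one.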