With the development of Large Language Models (LLMs), there is growing interest in applying their knowledge to tasks beyond text generation and question answering. Speech processing, a field of interest for decades, has recently seen successful applications of Transformer-based architectures, such as the Whisper models. While significant progress has been made in improving LLM capabilities through Supervised Fine-Tuning (SFT), reasoning, and alignment with Reinforcement Learning (RL), their application to the speech domain remains somewhat underexplored. Recent work has demonstrated that fine-tuning LoRA (Low-Rank Adaptation) adapters for LLMs enables them to perform Automatic Speech Recognition (ASR) natively, leveraging existing LLM capabilities and bypassing the pre-training stage. However, no approach has yet successfully applied LLM knowledge in a similar fashion to other speech processing tasks such as speaker diarisation. Current approaches use LLMs as a post-processing step on the outputs of a dedicated speaker diarisation model, but no LLM-based model can yet perform speaker diarisation natively. This research proposal therefore explores how LoRA adapters can enable LLMs to perform speaker diarisation natively, and how the speech domain's reliance on annotated data can be overcome.
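As a rough illustration of the LoRA idea referenced above, the sketch below shows the core low-rank update in NumPy: the pretrained weight matrix stays frozen while only two small matrices are trained. The dimensions, rank, and scaling factor here are illustrative choices, not values from the proposal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4        # hidden size and LoRA rank (illustrative values)
alpha = 8.0         # LoRA scaling factor (illustrative value)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialised

def lora_forward(x):
    # Frozen path plus low-rank update: only A and B are trained,
    # adding 2*d*r parameters instead of updating all d*d weights.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d))
# With B zero-initialised, the adapted layer initially matches the frozen one.
assert np.allclose(lora_forward(x), x @ W.T)
```

Because the frozen model is untouched, the same base LLM could in principle host separate adapters for ASR and for speaker diarisation, which is the kind of reuse the proposal aims at.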