Speech Language Models (SLMs) enable natural interaction through spoken instructions and can capture user intent more effectively by detecting nuances in speech. This enhanced functionality, however, introduces new security risks: adversaries can bypass safety mechanisms by injecting adversarial noise into the audio input. In this work, we analyze the vulnerability of open-source SLMs to adversarial attacks and evaluate various defense mechanisms. In our experiments, we use standard Projected Gradient Descent (PGD) attacks in a white-box scenario. We find that these models are susceptible to jailbreaks, with attack success rates reaching 100% in some instances. We propose post hoc defense techniques, including activation patching, that improve robustness by up to 99% with negligible impact on utility. Additionally, we evaluate defenses applied at both the audio encoder and the language model components, weighing their impact on adversarial resistance and usability.
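The abstract names the attack but not its implementation. Below is a minimal sketch of a targeted white-box PGD attack on a raw waveform in PyTorch; the `model` wrapper (assumed to accept a perturbed waveform and a `labels` argument and return a cross-entropy loss toward a target response), along with the step size `alpha`, the perturbation budget `eps`, and the iteration count, are illustrative assumptions, not the paper's actual settings.

```python
import torch

def pgd_attack(model, audio, target_ids, eps=0.01, alpha=0.001, steps=100):
    """White-box PGD: perturb the waveform to push the SLM toward a
    target (jailbroken) response, keeping noise in an L-inf ball."""
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(steps):
        # Cross-entropy loss of the target continuation given the
        # perturbed audio; `model` is a hypothetical wrapper.
        loss = model(audio + delta, labels=target_ids).loss
        loss.backward()
        with torch.no_grad():
            # Descend on the loss, then project back into the
            # eps-ball around the clean waveform.
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (audio + delta).detach()
```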
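Similarly, the activation-patching defense is only named, not specified. The sketch below shows the general mechanics of activation patching with a PyTorch forward hook; the choice of `layer`, the source of `clean_activation`, and the policy for when to patch are all assumptions made for illustration, not the paper's method.

```python
import torch

def apply_activation_patch(layer, clean_activation):
    """Register a forward hook that overwrites `layer`'s output with an
    activation cached from a clean (benign) forward pass. Returning a
    non-None value from a forward hook replaces the module's output."""
    def hook(module, inputs, output):
        return clean_activation
    return layer.register_forward_hook(hook)

# Usage sketch: cache activations on a benign reference input, then
# patch them in while processing a potentially adversarial input.
# handle = apply_activation_patch(model.layers[k], cached_clean_acts)
# response = model(adversarial_audio)
# handle.remove()  # restore normal behavior afterward
```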