Rationalization models have recently garnered significant attention for enhancing the interpretability of natural language processing: a generator first selects the most relevant pieces of the input text with respect to the label, and only this selection is passed to the predictor. However, the robustness of rationalization models has not been sufficiently investigated. Specifically, this paper explores the robustness of rationalization models against backdoor attacks, an aspect ignored by previous studies. Surprisingly, we find that conventional backdoor attack techniques fail to inject triggers into the rationalization model because its generator can filter out the injected triggers. Considering this, we further propose a novel backdoor attack method named BadRNL, designed specifically for rationalization models. The core idea of BadRNL is to first search for a personalized trigger for each specific dataset and then manipulate the rationales and labels to conduct the attack. In addition, BadRNL controls the order in which samples are learned through a poison-priority sampling strategy. Experimental results show that our method can successfully manipulate the predictions on samples containing triggers while maintaining the performance of the model on clean data.
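
For readers unfamiliar with the select-then-predict setup the abstract refers to, the sketch below shows a minimal generator-predictor rationalization model. It is an illustrative assumption of the standard architecture (module names, hyperparameters, and the straight-through masking trick are not taken from the paper), intended only to show where the generator's selection step sits and why it can filter out naively inserted trigger tokens.

```python
# Minimal sketch of a select-then-predict rationalization model.
# Class and parameter names are illustrative assumptions, not the
# paper's implementation of BadRNL or of its target models.
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Scores each token and produces a binary rationale mask."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.scorer = nn.Linear(2 * hidden_dim, 1)

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))      # (B, T, 2H)
        probs = torch.sigmoid(self.scorer(h).squeeze(-1))  # (B, T)
        # Straight-through estimator: hard 0/1 mask forward, soft gradient backward.
        hard = (probs > 0.5).float()
        return hard + probs - probs.detach()


class Predictor(nn.Module):
    """Classifies using only the tokens kept by the rationale mask."""

    def __init__(self, vocab_size, num_classes, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, tokens, mask):
        x = self.embed(tokens) * mask.unsqueeze(-1)   # zero out unselected tokens
        h, _ = self.encoder(x)
        return self.classifier(h.mean(dim=1))


class RationalizationModel(nn.Module):
    def __init__(self, vocab_size, num_classes):
        super().__init__()
        self.generator = Generator(vocab_size)
        self.predictor = Predictor(vocab_size, num_classes)

    def forward(self, tokens):
        mask = self.generator(tokens)
        logits = self.predictor(tokens, mask)
        # A sparsity regularizer encourages short rationales; this selection
        # step is what can drop trigger tokens that look irrelevant to the label.
        sparsity = mask.mean()
        return logits, mask, sparsity
```

Under this framework, training jointly optimizes the prediction loss and the sparsity term, so a backdoor trigger only survives if the generator judges it label-relevant, which is the obstacle the abstract says conventional trigger-insertion attacks run into.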
