To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its own prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether models can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. We find a trade-off. When simply asked to generate counterfactual explanations, models typically produce SCEs that are valid but far from minimal, even though minimality is a well-established property of good counterfactuals. Worryingly, when explicitly instructed to provide minimal counterfactual explanations, the resulting SCEs typically fail to change the models' predictions. No model reliably satisfies both criteria. We examine why models fail at this task, arguing that they do not engage in self-modelling: the ability to internally predict how they would behave in alternative situations. We argue that this ability is unlikely to be incentivised by standard training techniques, and suggest that new learning objectives are required for LLMs to reliably explain themselves counterfactually. Our code is available in the anonymous repository: https://anonymous.4open.science/r/SCEs-3747/README.md.
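The two evaluation criteria from the abstract can be sketched in code. This is a hypothetical illustration, not the authors' evaluation pipeline: the toy model, the token-level edit distance as a minimality proxy, and all function names are assumptions made for clarity.

```python
# Hypothetical sketch of the two SCE criteria: validity (the edit flips
# the model's prediction) and minimality (the edit changes as little of
# the input as possible). Token-level edit distance is one simple
# minimality proxy; the toy classifier below is purely illustrative.

def levenshtein(a, b):
    """Edit distance between two token sequences (one-row DP)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j]
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (a[i - 1] != b[j - 1]))  # substitution
    return dp[n]

def evaluate_sce(predict, original, counterfactual):
    """Return (is_valid, edit_fraction) for a candidate counterfactual."""
    is_valid = predict(counterfactual) != predict(original)
    orig_tokens = original.split()
    cf_tokens = counterfactual.split()
    edit_fraction = levenshtein(orig_tokens, cf_tokens) / max(len(orig_tokens), 1)
    return is_valid, edit_fraction

# Toy stand-in for a model: keyword-based sentiment.
toy_predict = lambda text: "positive" if "great" in text else "negative"

valid, frac = evaluate_sce(toy_predict,
                           "the movie was great fun",
                           "the movie was dull fun")
# A good SCE is valid with a small edit_fraction; the trade-off described
# above means models tend to achieve one property at the expense of the other.
```

Under this proxy, the counterfactual above is valid (the toy prediction flips) and minimal (one token out of five changed); the paper's finding is that real models rarely achieve both at once.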