Large language models (LLMs) frequently demonstrate reasoning limitations, often conflating content plausibility with logical validity. This can result in biased inferences, where plausible arguments are incorrectly deemed logically valid or vice versa. This paper investigates how to mitigate content biases in reasoning through activation steering, an inference-time intervention technique that modulates model activations. After localising the layers responsible for formal and material inference through probing, we investigate contrastive activation steering methods using a controlled syllogistic reasoning dataset that covers 24 types of logical argument schemes, designed to disentangle formal validity from content plausibility. An extensive empirical analysis reveals that contrastive steering consistently supports linear control over content biases. However, we observe that a static steering approach is insufficient to achieve improvements across all tested models. We therefore exploit this controllability by dynamically determining the values of the steering parameters via fine-grained conditional methods. We find that conditional steering is effective in reducing biases on unresponsive models, achieving up to 15% absolute improvement in formal reasoning accuracy with a newly introduced kNN-based conditional method. Finally, we find that steering for content effects is robust to prompt variations, incurs minimal side effects on multilingual language modeling capabilities, and can partially generalise to out-of-distribution tasks. Practically, this paper demonstrates that activation-level interventions offer a scalable test-time strategy for enhancing the robustness of LLMs, contributing towards more systematic and unbiased reasoning.
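
The abstract describes contrastive activation steering only at a high level. As an illustration (not the paper's actual setup), the sketch below shows one common way such steering is implemented: a steering direction is computed as the mean activation difference between contrastive prompt pairs at a probed layer, then added to the residual stream at inference time, scaled by a strength parameter. The model name, layer index, steering strength, and example prompts are all placeholder assumptions.

```python
# Minimal sketch of contrastive activation steering with PyTorch forward hooks.
# Model, layer index, strength, and prompts are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates several LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6    # assumed: a layer identified via probing as encoding content effects
ALPHA = 4.0  # assumed steering strength (a static value; conditional methods would set it per input)

def mean_activation(prompts, layer):
    """Mean residual-stream activation at the last token position of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[layer + 1] is the output of transformer block `layer`
        acts.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrastive pairs: arguments matched in logical form but differing in content plausibility
plausible = ["All dogs are mammals. All mammals are animals. Therefore, all dogs are animals."]
implausible = ["All dogs are reptiles. All reptiles are plants. Therefore, all dogs are plants."]
steer = mean_activation(plausible, LAYER) - mean_activation(implausible, LAYER)

def hook(module, inputs, output):
    # Add the scaled contrastive direction to the block's hidden states during generation.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(hook)
prompt = "All squares are circles. All circles are shapes. Is the argument logically valid?"
ids = tok(prompt, return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```

In the conditional variants the abstract refers to, the steering strength (and whether to steer at all) would be chosen per input rather than fixed as above, for example by comparing the current activation against stored activations of labeled examples, which is the intuition behind a kNN-based condition.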
