Understanding word meanings in context is a fundamental capability for Large Language Models (LLMs). Despite extensive evaluation efforts, the extent to which LLMs show evidence of truly grasping word meanings remains underexplored. In this paper, we address this gap by evaluating the Word Sense Disambiguation (WSD) capabilities of instruction-tuned LLMs, comparing their performance to that of state-of-the-art systems specifically designed for the task. Notably, we find that leading models such as GPT-4o and DeepSeek-V3 reach performance on par with specialized WSD systems, while also demonstrating greater robustness across domains and levels of ambiguity. We further assess the top-performing model, GPT-4o, across three generative settings: definition generation, free explanation, and example generation. Our results reveal that GPT-4o consistently achieves over 90% accuracy, with the highest performance observed when the model is allowed to freely explain the meaning of target words in context. We release our code and data at: anonimizedurl.