In this work, we investigate how multimodal large language models (MLLMs) reconcile memorized world knowledge with visual input. Understanding this balance is essential for building reliable models that can correctly choose between conflicting sources of information. To study this, we introduce Visual CounterFact, a dataset of realistic visual counterfactuals targeting familiar attributes such as object color and size. These examples violate learned priors while preserving visual plausibility, enabling precise comparisons between perception and memory. Using this dataset, we find that MLLMs often default to perception, even when prompted to retrieve general knowledge. In these cases, performance on knowledge-based prompts drops significantly, suggesting that models are overly influenced by visual inputs even when the question targets memorized facts. By analyzing the forward pass, we observe that model predictions initially reflect stored priors, then transition to visually grounded answers in mid-to-late layers. This transition is often unstable, with models flipping between the two sources of information across layers. To control this behavior, we introduce Pixels Versus Priors steering vectors, which shift model behavior toward either world knowledge priors or visual input. These activation-level interventions produce significant attention shifts toward or away from the image, depending on the steering direction. Our findings offer a new framework for interpreting and controlling how memory and perception interact in multimodal models.
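To make the activation-level intervention concrete, here is a minimal sketch of one common way such steering vectors are built: a difference-of-means direction between hidden states from the two behavioral conditions (prior-consistent vs. counterfactual-visual answers), added to the residual stream at inference with a signed scaling coefficient. The function names, the layer choice, and the coefficient `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def steering_vector(acts_prior, acts_visual):
    """Difference-of-means direction between two conditions.

    acts_prior, acts_visual: (n_examples, d_model) hidden states
    collected at one layer under each condition (hypothetical setup).
    """
    return acts_visual.mean(axis=0) - acts_prior.mean(axis=0)

def apply_steering(hidden, v, alpha):
    """Shift hidden states along v. With the convention above,
    alpha > 0 steers toward visual input, alpha < 0 toward priors."""
    return hidden + alpha * v

# Toy demonstration with synthetic activations (d_model = 8).
rng = np.random.default_rng(0)
prior_acts = rng.normal(0.0, 1.0, size=(32, 8))   # "memory" condition
visual_acts = rng.normal(0.5, 1.0, size=(32, 8))  # shifted "visual" cluster

v = steering_vector(prior_acts, visual_acts)
h = rng.normal(size=(1, 8))                # a hidden state to intervene on
h_visual = apply_steering(h, v, alpha=2.0)   # push toward perception
h_memory = apply_steering(h, v, alpha=-2.0)  # push toward stored priors
```

In practice the shift would be applied inside the model (e.g., via a forward hook on a transformer layer) rather than on detached arrays, but the arithmetic of the intervention is exactly this one-line addition.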