Multimodal Large Language Models (MLLMs) are increasingly used in Personalized Image Aesthetic Assessment (PIAA), offering a scalable alternative to expert evaluation. However, their outputs may reflect subtle biases shaped by demographic cues such as gender, age, or education. In this work, we introduce AesBiasBench, a benchmark designed to evaluate MLLMs along two complementary axes: (1) the presence of stereotype bias, measured by how aesthetic evaluations vary across demographic groups; and (2) the alignment between model outputs and real human aesthetic preferences. Our benchmark spans three subtasks (Aesthetic Perception, Assessment, and Empathy) and introduces structured metrics (IFD, NRD, AAS) to quantify both bias and alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o, Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL). Results show that smaller models exhibit stronger stereotype bias, while larger models align better with human preferences. Adding identity information often amplifies bias, particularly in emotional judgment. These findings highlight the need for identity-aware evaluation frameworks for subjective vision-language tasks.
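The two evaluation axes can be illustrated with a minimal sketch. Note this is a hypothetical simplification, not the paper's actual IFD, NRD, or AAS definitions: here stereotype bias is proxied by the dispersion of a model's scores for the same image under different demographic identity prompts, and human alignment by the mean absolute error against annotator scores.

```python
# Hypothetical illustration of the two axes; the real IFD/NRD/AAS
# metrics are defined in the AesBiasBench paper.
from statistics import mean, pstdev

def stereotype_bias(scores_by_identity):
    """Dispersion of a model's aesthetic scores for one image across
    identity prompts (higher = more identity-sensitive)."""
    return pstdev(scores_by_identity.values())

def human_alignment(model_scores, human_scores):
    """Mean absolute error between model and human annotator scores
    over a set of images (lower = better aligned)."""
    return mean(abs(m - h) for m, h in zip(model_scores, human_scores))

# Same image scored under different (hypothetical) identity cues:
bias = stereotype_bias({"female_18_25": 4.2, "male_40_60": 2.8, "none": 3.6})
# Model vs. human scores over three images:
align = human_alignment([4.0, 3.1, 2.5], [4.5, 3.0, 2.0])
```

A bias-free model would score the same image identically regardless of the identity prompt (`stereotype_bias` near zero); the paper's finding is that smaller MLLMs deviate from this more strongly.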