The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multimodal domains, particularly in graphical user interface (GUI) agent tasks, remains under-explored. To address this gap, we propose \textbf{UI-R1}, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. UI-R1 introduces a novel rule-based action reward scheme, facilitating model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). To further improve efficiency during inference, we present \textbf{UI-R1-E}fficient, a two-stage training paradigm that both shortens reasoning length and enhances overall performance. Additionally, we construct a compact yet high-quality dataset comprising 2K challenging tasks across five prevalent mobile device action types. Experimental results show that our proposed models (e.g., UI-R1-3B) achieve substantial improvements over the base model (i.e., Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of \textbf{18.3\%} on ScreenSpot, \textbf{6.0\%} on ScreenSpot-Pro, and \textbf{10.9\%} on \textsc{AndroidControl}. Moreover, our efficient versions deliver competitive performance compared to considerably larger state-of-the-art models. These results underscore the potential of reinforcement learning to advance GUI control, paving the way for future research in Human-Computer Interaction (HCI).
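To make the two core ingredients concrete, the sketch below illustrates (1) a rule-based action reward, which scores an agent's predicted GUI action against the ground truth without a learned reward model, and (2) the group-relative advantage normalization at the heart of GRPO. All function names, reward values, and data fields here are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of UI-R1's two ingredients: a rule-based action reward
# and GRPO-style group-relative advantages. Names and values are assumptions.

def action_reward(pred, gold):
    """Rule-based reward: 1 point for the correct action type, plus 1 point
    if a predicted click lands inside the ground-truth bounding box."""
    r = 0.0
    if pred["type"] == gold["type"]:
        r += 1.0
        if pred["type"] == "click":
            x, y = pred["coord"]
            x0, y0, x1, y1 = gold["bbox"]
            if x0 <= x <= x1 and y0 <= y <= y1:
                r += 1.0
    return r

def group_relative_advantages(rewards):
    """GRPO samples a group of responses per prompt and normalizes each
    reward by the group mean and standard deviation, replacing a learned
    critic with this group-relative baseline."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a constant-reward group
    return [(r - mean) / std for r in rewards]
```

A policy update would then weight each sampled response's log-probability by its advantage, so responses that score above the group average (e.g., a click inside the target box) are reinforced and below-average ones are suppressed.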