In collaborative creation tasks, people iteratively steer an artifact toward a target by refining it over multiple rounds of interaction. Generative AI, in contrast, excels at creating artifacts in a single turn but often struggles to refine them in subsequent exchanges. To close this gap, we present mrCAD, a dataset of multi-turn communication games played by pairs of humans using natural language and drawing. In each game, a pair of players created and refined computer-aided designs (CADs) over multiple turns to match a target design. Only the Designer could see the target, and at each turn they had to instruct the Maker using a combination of text and drawing. mrCAD consists of 6,082 communication games and 15,163 instruction-execution rounds, played by 1,092 pairs of human players. Analysis shows that players relied more heavily on text in refinement instructions than in initial generation instructions, and used different linguistic elements for refinement than for generation. We also find that state-of-the-art VLMs follow generation instructions better than refinement instructions. These results lay the foundation for modeling multi-turn, multimodal communication not captured in prior datasets.
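The game protocol described above can be sketched as a simple record layout: a game holds a target visible only to the Designer, and each round pairs a multimodal instruction (text and/or drawing) with the Maker's resulting CAD. This is a minimal illustrative sketch; the class and field names are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical schema for one mrCAD communication game.
# Names and types are illustrative, not the released data format.

Stroke = List[Tuple[float, float]]  # a drawn stroke as (x, y) points


@dataclass
class Instruction:
    """Designer's message for one turn: text, drawing, or both."""
    text: Optional[str] = None
    drawing: Optional[List[Stroke]] = None


@dataclass
class Round:
    """One instruction-execution round."""
    instruction: Instruction  # what the Designer sent
    cad: str                  # placeholder for the Maker's CAD after this round


@dataclass
class Game:
    """A full game: hidden target plus a sequence of rounds."""
    target: str               # visible only to the Designer
    rounds: List[Round] = field(default_factory=list)


# A two-round game: an initial generation turn, then a refinement turn.
game = Game(target="<target CAD>")
game.rounds.append(Round(Instruction(text="draw two concentric circles"), "<cad v1>"))
game.rounds.append(
    Round(Instruction(text="make the inner one smaller",
                      drawing=[[(0.4, 0.5), (0.6, 0.5)]]), "<cad v2>")
)
```

Under this layout, generation vs. refinement instructions are distinguished simply by round index, which mirrors the paper's comparison of first-turn and later-turn language.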