This paper presents ShaLa, a novel generative framework for learning shared latent representations across multimodal data. Advanced multimodal methods often model the multimodal space in its entirety (i.e., capturing all combinations of modality-specific details across inputs), which can inadvertently obscure the high-level semantic concepts that are consistent across modalities. Multimodal VAEs with low-dimensional latent variables are instead designed to capture these shared semantic representations, enabling tasks such as joint multimodal synthesis and flexible cross-modal inference. However, designing an expressive joint variational posterior for such VAEs is difficult, and they often suffer from low-quality synthesis. ShaLa addresses these challenges by integrating a novel architectural inference model with a second-stage expressive diffusion prior, which not only facilitates effective inference of the shared latent representation but also substantially improves the quality of downstream multimodal synthesis. We validate ShaLa extensively across multiple benchmarks, demonstrating superior coherence and synthesis quality compared to state-of-the-art multimodal VAEs. Furthermore, ShaLa scales to highly challenging multi-view settings with many more modalities, where prior multimodal VAEs have fallen short in capturing the increasing complexity of the shared latent space. To the best of our knowledge, ShaLa is the first framework to address multi-view multimodal generation with a shared latent variable generative model.
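To make the two-stage structure described above concrete, here is a minimal, hypothetical sketch of a shared-latent multimodal model with a second-stage diffusion prior over the latents. It is not the authors' implementation: the abstract does not specify the inference architecture, so the fusion rule (`combine_posteriors`), network sizes, and the `DiffusionPrior` denoiser below are illustrative placeholders only.

```python
# Hypothetical sketch, not the paper's code.
# Stage 1: per-modality encoders/decoders share a low-dimensional latent z.
# Stage 2: a diffusion prior is fit to stage-1 latents and used at sampling time.
import torch
import torch.nn as nn

LATENT_DIM = 32  # illustrative choice; the paper's dimensionality is not given here


class Encoder(nn.Module):
    """Maps one modality to Gaussian posterior parameters over the shared latent."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, LATENT_DIM)
        self.logvar = nn.Linear(256, LATENT_DIM)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)


class Decoder(nn.Module):
    """Reconstructs one modality from the shared latent."""
    def __init__(self, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, z):
        return self.net(z)


def combine_posteriors(stats):
    """Placeholder joint posterior: precision-weighted fusion of per-modality
    Gaussians. The paper's inference model is more expressive than this."""
    precisions = [torch.exp(-lv) for _, lv in stats]
    mus = [m for m, _ in stats]
    prec = sum(precisions) + 1.0  # +1.0 acts as a standard-normal "expert"
    mu = sum(p * m for p, m in zip(precisions, mus)) / prec
    return mu, -torch.log(prec)  # joint mean and log-variance


class DiffusionPrior(nn.Module):
    """Tiny denoiser over latents, standing in for the second-stage prior."""
    def __init__(self, steps=100):
        super().__init__()
        self.steps = steps
        self.net = nn.Sequential(nn.Linear(LATENT_DIM + 1, 128), nn.ReLU(),
                                 nn.Linear(128, LATENT_DIM))

    def forward(self, z_noisy, t):
        # Predict the noise added to z at (normalized) timestep t.
        t = t.float().unsqueeze(-1) / self.steps
        return self.net(torch.cat([z_noisy, t], dim=-1))
```

In this reading, joint synthesis would draw a latent from the learned diffusion prior and decode every modality from it, while cross-modal inference would encode the observed modalities, fuse their posteriors, and decode the missing ones; the specifics of how ShaLa realizes each step are left to the full paper.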