

poster
InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model
Keywords: multimodal; large language models; in-context learning
In this work, we present InfiMM, an advanced Multimodal Large Language Model adapted to complex vision-language tasks. Inspired by the Flamingo architecture, InfiMM distinguishes itself through its use of large-scale training data, comprehensive training strategies, and a diverse set of large language models. This approach preserves Flamingo's foundational strengths while introducing additional capabilities. Empirical evaluations across a variety of benchmarks demonstrate InfiMM's strong multimodal understanding. The code can be found at: https://anonymous.4open.science/r/infimm-zephyr-F60C/.
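Since the abstract names the Flamingo design without detailing it, the sketch below illustrates the tanh-gated cross-attention block that characterizes Flamingo-style visual language models. The module name, hidden size, and head count are illustrative assumptions, not the actual InfiMM implementation.

```python
# Minimal sketch of a Flamingo-style gated cross-attention block (assumed design,
# not the released InfiMM code). Text tokens attend to visual tokens, and learned
# tanh gates initialized at zero let training start from the frozen LM's behavior.
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Zero-initialized gates: the block is an identity map at the start of training.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-attention: text queries over visual keys/values from the vision encoder.
        attended, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attended
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x


if __name__ == "__main__":
    block = GatedCrossAttentionBlock()
    text = torch.randn(2, 16, 512)    # (batch, text length, hidden dim)
    vision = torch.randn(2, 64, 512)  # (batch, visual tokens, hidden dim)
    print(block(text, vision).shape)  # torch.Size([2, 16, 512])
```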