- Zambrano Chaves, Juan;
- Huang, Shih-Cheng;
- Xu, Yanbo;
- Xu, Hanwen;
- Usuyama, Naoto;
- Zhang, Sheng;
- Wang, Fei;
- Xie, Yujia;
- Khademi, Mahmoud;
- Yang, Ziyi;
- Awadalla, Hany;
- Gong, Julia;
- Hu, Houdong;
- Yang, Jianwei;
- Li, Chunyuan;
- Gao, Jianfeng;
- Gu, Yu;
- Wong, Cliff;
- Wei, Mu;
- Naumann, Tristan;
- Chen, Muhao;
- Lungren, Matthew;
- Chaudhari, Akshay;
- Yeung-Levy, Serena;
- Langlotz, Curtis;
- Wang, Sheng;
- Poon, Hoifung
Large foundation models show promise in biomedicine but face challenges in clinical use due to performance gaps, accessibility, cost, and lack of scalable evaluation. Here we show that open-source small multimodal models can bridge these gaps in radiology by generating free-text findings from chest X-ray images. Our data-centric approach leverages 697K curated radiology image-text pairs to train a specialized, domain-adapted chest X-ray encoder. We integrate this encoder with pre-trained language models via a lightweight adapter that aligns image and text modalities. To enable robust, clinically relevant evaluation, we develop and validate CheXprompt, a GPT-4-based metric for assessing factual accuracy aligned with radiologists' evaluations. Benchmarked with CheXprompt and other standard factuality metrics, LLaVA-Rad (7B) achieves state-of-the-art performance, outperforming much larger models such as GPT-4V and Med-PaLM M (84B). While not immediately ready for real-time clinical deployment, LLaVA-Rad is a scalable, privacy-preserving, and cost-effective step towards clinically adaptable multimodal AI for radiology.
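The abstract describes coupling a domain-adapted chest X-ray encoder to a pre-trained language model through a lightweight adapter. The sketch below illustrates one common way such an adapter can be realized (an MLP projection from image-encoder features into the language model's embedding space, in the style of LLaVA-like architectures); the class name, dimensions, and layer choices are illustrative assumptions, not the released LLaVA-Rad implementation.

```python
import torch
import torch.nn as nn


class ImageTextAdapter(nn.Module):
    """Hypothetical lightweight adapter: projects chest X-ray encoder
    features into the language model's token-embedding space.
    Names and dimensions are assumptions for illustration only."""

    def __init__(self, image_dim: int = 1024, text_dim: int = 4096):
        super().__init__()
        # Small MLP projection, a common choice for aligning modalities
        # in LLaVA-style multimodal models.
        self.proj = nn.Sequential(
            nn.Linear(image_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, image_dim) from the CXR encoder
        # returns: (batch, num_patches, text_dim) pseudo-token embeddings
        # that can be prepended to the text embeddings of the language model
        return self.proj(image_features)


# Usage sketch with dummy encoder output
adapter = ImageTextAdapter()
image_features = torch.randn(2, 196, 1024)  # 2 images, 196 patches each
image_tokens = adapter(image_features)      # shape: (2, 196, 4096)
```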