This newsletter explores two recent breakthroughs in the world of multimodal image and text foundation models. We'll examine RadFound, a specialized model designed for radiology, and then delve into an intriguing investigation into the ability of VLMs and LLMs to understand sound symbolism, a phenomenon traditionally associated with auditory perception. Both papers highlight the evolving capabilities and exciting potential of these models.
Expert-level vision-language foundation model for real-world radiology and comprehensive evaluation by Xiaohong Liu, Guoxing Yang, Yulin Luo, Jiaji Mao, Xiang Zhang, Ming Gao, Shanghang Zhang, Jun Shen, Guangyu Wang https://arxiv.org/abs/2409.16183
RadFound addresses the complex challenges of radiology, a field demanding precise image interpretation and comprehensive report generation. Existing vision-language (VL) models often fall short due to their reliance on natural image training data and a lack of specialized architectures for medical imagery. RadFound, however, is purpose-built for radiology, achieving expert-level performance across diverse tasks.
Built on the BLIP-2 architecture, RadFound incorporates two key innovations: RadVision, a novel vision encoder, and RadFormer, a vision-language alignment module. RadVision employs contextualized contrastive masked image modeling (CC-MIM), which combines masked image modeling (to capture intra-image features) with context-based contrastive learning (to capture inter-image contextual information), mirroring how radiologists compare and contrast features within and across images. RadFormer uses interleaved image-text data augmentation (ITA) and a multi-image, instruction-aware module to strengthen cross-modal learning, allowing the model to follow complex instructions involving multiple images.
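To make the CC-MIM objective more concrete, here is a minimal PyTorch sketch of how a masked-image-modeling term and a contextual contrastive term could be combined into one loss. The tiny encoder, the SimMIM-style patch zeroing, the pairing of `images_a`/`images_b` as contextually related views, and the `lambda_contrast` and `temperature` values are all illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins -- the real RadVision encoder is far larger; these exist only
# so the sketch runs end to end.
class TinyEncoder(nn.Module):
    def __init__(self, dim=256, patch_pixels=16 * 16):
        super().__init__()
        self.proj = nn.Linear(patch_pixels, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches):                      # (B, N, patch_pixels) -> (B, N, dim)
        return self.blocks(self.proj(patches))

def patchify(images, p=16):
    """Split (B, 1, H, W) images into flattened (B, N, p*p) patches."""
    B, C, H, W = images.shape
    x = images.unfold(2, p, p).unfold(3, p, p)       # (B, C, H/p, W/p, p, p)
    return x.contiguous().view(B, -1, p * p)

def cc_mim_loss(encoder, decoder, images_a, images_b,
                mask_ratio=0.75, lambda_contrast=1.0, temperature=0.07):
    """Masked-image-modeling + contextual contrastive objective (illustrative)."""
    patches = patchify(images_a)                                  # (B, N, p*p)

    # Intra-image term: zero out random patches (a SimMIM-style simplification)
    # and reconstruct their pixel content from the visible context.
    mask = torch.rand(patches.shape[:2], device=patches.device) < mask_ratio
    visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = decoder(encoder(visible))                             # (B, N, p*p)
    mim = F.mse_loss(recon[mask], patches[mask])

    # Inter-image term: pull together global embeddings of contextually related
    # images (images_a[i] <-> images_b[i], e.g. two views of one study) and
    # push apart the rest of the batch.
    z_a = F.normalize(encoder(patches).mean(dim=1), dim=-1)       # (B, dim)
    z_b = F.normalize(encoder(patchify(images_b)).mean(dim=1), dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=logits.device)
    contrast = F.cross_entropy(logits, targets)

    return mim + lambda_contrast * contrast

# Usage on random "images"; the decoder maps the token dimension back to patch pixels.
enc, dec = TinyEncoder(), nn.Linear(256, 16 * 16)
a, b = torch.randn(4, 1, 64, 64), torch.randn(4, 1, 64, 64)
print(cc_mim_loss(enc, dec, a, b))
```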
Trained on the massive RadVLCorpus dataset (over 8.1 million images and 250,000 image-text pairs spanning 19 organ systems and 10 modalities), RadFound's capabilities were rigorously evaluated using RadVLBench. This benchmark encompasses visual question answering (VQA), text generation (captioning and report generation), and real-world radiology tasks across different modalities (2D chest X-rays, multi-view mammograms, and 3D thyroid CT scans). RadFound significantly outperformed state-of-the-art models like Med-Flamingo on RadVLBench-VQA, achieving accuracy improvements ranging from 1.2% to 15.1% on various datasets. It also demonstrated superior performance in generating captions and reports, as measured by ROUGE-L, BLEU-4, and METEOR metrics.
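For readers unfamiliar with these text-generation metrics, the snippet below sketches how a generated report could be scored against a reference using the `rouge-score` and `nltk` packages (METEOR is available via `nltk.translate.meteor_score` but requires WordNet data). The two example sentences are invented, and the paper's exact tokenization and aggregation choices are not reproduced here.

```python
# pip install rouge-score nltk
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "No focal consolidation. Heart size within normal limits."   # made-up example
candidate = "Heart size is normal. No focal consolidation is seen."      # made-up example

# ROUGE-L: longest-common-subsequence overlap between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BLEU-4: modified 4-gram precision with a brevity penalty; smoothing avoids
# zero scores on short clinical sentences.
bleu_4 = sentence_bleu(
    [reference.split()], candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

print(f"ROUGE-L = {rouge_l:.3f}, BLEU-4 = {bleu_4:.3f}")
```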
In real-world scenarios (RadVLBench-RW), RadFound achieved performance comparable to or exceeding Med-Flamingo across all modalities, even with considerably less training data in certain cases. A human evaluation framework further validated its effectiveness, showing that RadFound performed comparably to senior radiologists on chest X-ray and mammography tasks, and comparably to junior radiologists on thyroid CT scans. These results underscore RadFound’s potential for real-world clinical integration, although further research is needed to develop more robust automatic evaluation metrics for report generation and to assess its performance in actual clinical workflows.
With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models by Tyler Loakman, Yucheng Li, Chenghua Lin https://arxiv.org/abs/2409.14917
This research explores a novel question: can multimodal language models (VLMs and LLMs) understand sound-related phenomena, like sound symbolism, even without direct access to audio? Sound symbolism refers to the non-arbitrary connection between sounds and concepts, and this study investigates whether VLMs can perceive these connections through visual and textual information alone.
The researchers focused on three aspects of sound symbolism: Shape Symbolism (the Kiki-Bouba effect), Magnitude Symbolism (the Mil-Mal effect), and Iconicity Rating (judging the similarity between a word's sound and its meaning). Experiments involved prompting VLMs to label images generated by DALL-E 3 with appropriate pseudowords (e.g., Kiki/Bouba for spiky/rounded shapes, Mil/Mal for small/large entities) and asking LLMs to rate the iconicity of English words.
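A rough sketch of what such a labeling loop might look like is shown below. The prompt wording, the extra pseudoword pair, the `query_vlm` callable, and the agreement calculation are placeholders for illustration, not the authors' exact protocol.

```python
import random

# Classic spiky-vs-rounded pseudoword pairs; the second pair is an illustrative addition.
PSEUDOWORD_PAIRS = [("kiki", "bouba"), ("takete", "maluma")]

def build_prompt(word_a: str, word_b: str, with_task_info: bool) -> str:
    prompt = (f'Which of the two made-up words, "{word_a}" or "{word_b}", '
              f"is a better name for the object in this image? Answer with one word.")
    if with_task_info:
        # Explicit task framing, which the paper found often improves agreement.
        prompt = ("This is a sound-symbolism experiment: some word sounds are felt to "
                  "match spiky shapes and others rounded shapes. " + prompt)
    return prompt

def label_image(image_path: str, query_vlm, with_task_info: bool = False) -> str:
    """Ask a VLM (via the caller-supplied `query_vlm`) to pick a pseudoword for one image."""
    word_a, word_b = random.choice(PSEUDOWORD_PAIRS)
    if random.random() < 0.5:                      # randomize presentation order
        word_a, word_b = word_b, word_a
    answer = query_vlm(image_path, build_prompt(word_a, word_b, with_task_info))
    return answer.strip().lower()

def agreement(predictions: list[str], human_labels: list[str]) -> float:
    """Fraction of images where the model's choice matches the human label."""
    matches = sum(p == h for p, h in zip(predictions, human_labels))
    return matches / len(human_labels)
```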
Results were mixed. In Shape Symbolism experiments, agreement with human labels was generally low, although providing explicit task information often improved performance. Magnitude Symbolism proved easier for VLMs, with some models achieving over 90% agreement. For Iconicity Rating, larger LLMs generally aligned better with human judgements, suggesting a correlation between model size and this specific capability.
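Alignment with human iconicity judgements of this kind is typically reported as a rank correlation; the minimal sketch below uses `scipy.stats.spearmanr`. The words and ratings are invented placeholders, not data from the study.

```python
# Comparing model iconicity ratings (e.g. on a 1-7 scale) against human norms.
from scipy.stats import spearmanr

words         = ["click", "buzz", "table", "idea"]
human_ratings = [6.1, 5.8, 2.4, 1.9]    # invented averaged human iconicity judgements
model_ratings = [6.0, 5.2, 3.1, 2.5]    # invented ratings elicited from an LLM

rho, p_value = spearmanr(human_ratings, model_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```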
The study suggests that the emergence of sound symbolism understanding in these models might be attributed to two factors: the indirect encoding of auditory information in orthography and the presence of sound-symbolic language in training data. However, the inconsistent performance highlights the subtlety of these signals and the potential influence of training data biases. Further research, including training models on sound-symbolism-rich datasets and incorporating relevant tasks into training, is needed to fully understand and enhance this capability.
This newsletter showcased two distinct yet interconnected advances in multimodal AI. RadFound demonstrates the potential for highly specialized, expert-level models tailored to complex domains like radiology. The exploration of sound symbolism in VLMs and LLMs, on the other hand, reveals a surprising capacity for these models to infer sound-related concepts from visual and textual cues, opening exciting new avenues for research into the interplay between different modalities in AI. Both studies underscore the rapid progress and expanding possibilities of multimodal foundation models, paving the way for more sophisticated and impactful applications in various fields.