This newsletter explores the cutting edge of multimodal AI, focusing on the exciting developments in image and text foundation models. We'll delve into five recent papers that showcase innovative approaches to leveraging these powerful models for various applications, from biomedical image analysis to poverty prediction and tobacco control. Prepare to be amazed by the rapid progress and vast potential of this field.
Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension In Biomedical Image Analysis by Robinson Umeike, Neil Getty, Fangfang Xia, Rick Stevens https://arxiv.org/abs/2501.15370
Large Language Models (LLMs) have undoubtedly transformed text processing, but their application to scientific data, which often includes images, necessitates a multimodal approach. Vision-Language Models (VLMs) offer this capability by integrating visual data, but they often falter when dealing with specialized domains like biomedicine and are susceptible to hallucinations (fabricating information not present in the source). This paper presents a solution: intelligent assistants fine-tuned from LLaVA models, specifically designed to enhance multimodal understanding in Low-Dose Radiation Therapy (LDRT).
The researchers compiled a dataset from a vast corpus of 42,673 LDRT-related scientific articles. They extracted image-caption pairs and formulated complex reasoning questions and detailed descriptions using Qwen2-72B-Instruct. These image-text pairs were then used to fine-tune LLaVA v1.6-vicuna-13B and LLaVA v1.5-13B on 50,882 instances. To manage the computational demands of training such large models, the team employed advanced techniques like gradient checkpointing, FlashAttention-2, DeepSpeed ZeRO-3, and Low-Rank Adaptation (LoRA). The optimization process aimed to minimize a regularized objective: θ* = argmin_{θ_p, θ_l} [L(θ_p, θ_l) + λR(θ_p, θ_l)], where L represents the primary loss, R the regularization term, and λ the regularization coefficient.
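While the paper's training code isn't reproduced here, the parameter-efficient recipe it describes maps closely onto standard Hugging Face tooling. Below is a minimal sketch of a LoRA setup with gradient checkpointing; the checkpoint name, target modules, and hyperparameters are illustrative assumptions rather than the authors' exact configuration, and DeepSpeed ZeRO-3 would be supplied through a separate launcher config.

```python
# Sketch of a LoRA + gradient-checkpointing fine-tuning setup. The checkpoint
# name, target modules, and hyperparameters are illustrative assumptions.
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-13b-hf",               # assumed Hub checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # FlashAttention-2, if installed
)
model.gradient_checkpointing_enable()          # trade recompute for memory

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()             # only the low-rank adapters train
# DeepSpeed ZeRO-3 would be configured separately (e.g. via an accelerate/Trainer config).
```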
Evaluation involved using external LLMs, Qwen2-72B-Instruct and Llama-3.1-70B-Instruct, as impartial judges to score the responses based on relevance, helpfulness, and accuracy. The fine-tuned models consistently surpassed the base LLaVA models in both detailed description and complex reasoning tasks. For example, the fine-tuned LLaVA v1.5 achieved a mean score of 5.26 ± 3.06, significantly higher than the base model's 3.46 ± 2.53. Furthermore, hallucination analysis, utilizing ROUGE metrics and linguistic uncertainty markers, indicated that the fine-tuned models demonstrated higher confidence and improved factual consistency.
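The exact hallucination-analysis pipeline isn't detailed here, but a ROUGE-based consistency check between a generated answer and its source caption is straightforward to sketch with the rouge-score package; the example strings below are placeholders.

```python
# Minimal ROUGE-based consistency check between a model answer and its source
# caption, using the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "CT planning image for a low-dose radiation therapy treatment field."  # placeholder
candidate = "The image shows a CT planning slice of a low-dose radiation field."   # placeholder

scores = scorer.score(reference, candidate)   # signature: score(target, prediction)
for name, s in scores.items():
    # Low recall against the source text is one signal of unsupported (hallucinated) content.
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```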
This research underscores the potential of fine-tuning general-purpose VLMs for specialized biomedical applications. The significant improvements in LDRT-based visual question answering, particularly in reducing hallucinations and enhancing domain comprehension, are promising. However, the researchers noted a trade-off in LLaVA v1.6-vicuna-13B between verbosity and reasoning depth, suggesting a direction for future research. Expanding the dataset and adapting the models to other biomedical applications are also key areas for future exploration.
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations by Tianming Liang, Kun-Yu Lin, Chaolei Tan, Jianguo Zhang, Wei-Shi Zheng, Jian-Fang Hu https://arxiv.org/abs/2501.14607
Caption: This image contrasts two RVOS architectures. (a) A vanilla baseline uses G-DINO for image feature extraction, a box head for prediction, and SAM2 for mask generation. (b) ReferDINO, the proposed model, incorporates an Efficient G-DINO, a temporal module for inter-frame understanding, a box head, and a mask decoder for enhanced video object segmentation guided by text prompts like "Cat walking forward."
Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video based on textual descriptions. Although the field has progressed, existing models struggle with complex descriptions and dynamic scenes due to limitations in video-language understanding. ReferDINO tackles this by harnessing the power of pretrained visual grounding foundation models like GroundingDINO, moving beyond separate detection and segmentation models.
ReferDINO's innovation lies in three key components. First, an object-consistent temporal enhancer utilizes cross-modal text representations to facilitate inter-frame object interaction. This enhancer comprises a memory-augmented tracker and a cross-modal temporal decoder, enabling effective object tracking across frames even under challenging conditions. The memory component is updated using a momentum-based approach, incorporating text relevance: Mᵗ = (1 − α·c⁺)·Mᵗ⁻¹ + α·c⁺·Ôᵗ, where M is the memory, α the momentum coefficient, and c⁺ the cosine similarity between object and sentence embeddings.
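In code, this update reduces to a text-gated exponential moving average. A minimal sketch, with illustrative tensor shapes and the c⁺ term treated as the non-negative part of the cosine similarity (an assumption about the notation):

```python
# Text-gated momentum update of the object memory:
# Mᵗ = (1 − α·c⁺)·Mᵗ⁻¹ + α·c⁺·Ôᵗ. Shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def update_memory(memory, obj_emb, sent_emb, alpha=0.1):
    """memory, obj_emb: (num_queries, dim); sent_emb: (dim,)."""
    # c⁺: cosine similarity of each object embedding with the sentence embedding,
    # clamped to be non-negative (our reading of the ⁺ superscript).
    c = F.cosine_similarity(obj_emb, sent_emb.unsqueeze(0), dim=-1).clamp(min=0)
    gate = (alpha * c).unsqueeze(-1)             # (num_queries, 1)
    return (1 - gate) * memory + gate * obj_emb  # irrelevant objects barely update

memory = torch.zeros(5, 256)
obj_emb, sent_emb = torch.randn(5, 256), torch.randn(256)
memory = update_memory(memory, obj_emb, sent_emb)
```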
Second, a grounding-guided deformable mask decoder integrates text and grounding conditions. This decoder leverages pretrained representations and grounding predictions from GroundingDINO, iteratively refining mask embeddings for accurate and consistent segmentations. Third, a confidence-aware query pruning strategy addresses the computational bottleneck of numerous query embeddings common in visual grounding models. This strategy discards low-confidence queries at each decoder layer, enhancing efficiency without performance loss.
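A minimal sketch of the pruning idea, keeping only the top-k queries by predicted confidence before the next decoder layer (the keep ratio and the source of the confidence scores are assumptions):

```python
# Confidence-aware query pruning: keep only the highest-confidence queries
# before the next decoder layer. Keep ratio and scores are placeholders.
import torch

def prune_queries(queries, confidence, keep_ratio=0.25):
    """queries: (num_queries, dim); confidence: (num_queries,) scores."""
    k = max(1, int(queries.size(0) * keep_ratio))
    keep = torch.topk(confidence, k).indices
    return queries[keep], confidence[keep]

queries = torch.randn(900, 256)        # a GroundingDINO-scale query set
confidence = torch.rand(900)           # placeholder per-query confidence scores
queries, confidence = prune_queries(queries, confidence)
```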
Rigorous evaluation on five RVOS benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, JHMDB-Sentences, and MeViS) showed ReferDINO consistently outperforming state-of-the-art methods. On Ref-DAVIS17, it achieved a 4.0% improvement in J&F score with a Swin-B backbone, and on Ref-YouTube-VOS, a J&F score of 69.3%, a 2.2% improvement. It also significantly outperformed a baseline combining GroundingDINO and SAM2. Ablation studies validated the contribution of each component. ReferDINO showcases the potential of adapting visual grounding models to the video domain, setting a new performance benchmark.
Brain-Adapter: Enhancing Neurological Disorder Analysis with Adapter-Tuning Multimodal Large Language Models by Jing Zhang, Xiaowei Yu, Yanjun Lyu, Lu Zhang, Tong Chen, Chao Cao, Yan Zhuang, Minheng Chen, Tianming Liu, Dajiang Zhu https://arxiv.org/abs/2501.16282
Analyzing neurological disorders often involves integrating diverse data sources like medical images and clinical reports. Multimodal Large Language Models (MLLMs) offer a promising approach, but challenges persist in processing 3D medical images and efficiently utilizing limited medical datasets. Brain-Adapter tackles these challenges using adapter-tuning to enhance MLLM performance in this domain.
Brain-Adapter introduces a lightweight bottleneck layer to bridge the gap between pretrained image encoders and 3D MRI scans, reducing training data requirements. This adapter, along with a Contrastive Language-Image Pre-training (CLIP) strategy, aligns image and text data in a shared representation space, effectively combining information from clinical reports (demographics, biomarkers, cognitive assessments, and notes) with 3D MRI scans. The model uses M3D, a medical-domain pre-trained MLLM, as its backbone. For efficiency, only the linear projection layers of the image and text encoders are updated during fine-tuning.
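The adapter itself is conceptually simple: a bottleneck that down-projects, applies a nonlinearity, up-projects, and adds a residual connection so the pretrained features pass through unchanged by default. A minimal sketch with illustrative dimensions:

```python
# A lightweight residual bottleneck adapter inserted after a frozen encoder.
# Dimensions are illustrative, not the paper's.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-project
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # up-project

    def forward(self, x):
        # Residual connection preserves the pretrained features by default.
        return x + self.up(self.act(self.down(x)))

features = torch.randn(2, 768)   # pooled 3D-MRI encoder features (assumed shape)
adapted = BottleneckAdapter()(features)
```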
Evaluated on an ADNI dataset for classifying Alzheimer's Disease (AD), cognitively normal (CN), and mild cognitive impairment (MCI), Brain-Adapter outperformed 3D ResNet50 and 3D DenseNet121 baselines. It achieved a macro-averaged F1-score of 0.91 for image-text pairs with trained linear projection layers, compared to 0.77 for 3D DenseNet121 using only images. The loss function combined contrastive and cross-entropy terms: L_cls = λ₁·L_contrastive + λ₂·L_CE, where L_contrastive = (1/(2N)) Σᵢ₌₁ᴺ (L_uv + L_vu), with L_uv and L_vu being the image-to-text and text-to-image losses, respectively. Ablation studies confirmed the contributions of Brain-Adapter and the linear projection layers. Visualization of feature embeddings showed improved separation between diagnostic groups over training epochs. Brain-Adapter offers a promising approach for enhancing neurological disorder analysis by efficiently integrating multimodal data while minimizing computational costs.
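As a rough sketch of that combined objective, here is a symmetric InfoNCE-style contrastive term plus cross-entropy on a classification head; the temperature, weights, and shapes are assumptions, not the paper's values.

```python
# Combined objective: symmetric contrastive alignment plus cross-entropy
# classification. Temperature and loss weights are assumptions.
import torch
import torch.nn.functional as F

def combined_loss(img_emb, txt_emb, logits, labels, lam1=1.0, lam2=1.0, tau=0.07):
    """img_emb, txt_emb: (N, d) paired embeddings; logits: (N, num_classes)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t() / tau                 # (N, N) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    l_uv = F.cross_entropy(sim, targets)              # image-to-text
    l_vu = F.cross_entropy(sim.t(), targets)          # text-to-image
    l_contrastive = 0.5 * (l_uv + l_vu)               # matches (1/(2N)) Σ (L_uv + L_vu)
    l_ce = F.cross_entropy(logits, labels)            # diagnosis classification
    return lam1 * l_contrastive + lam2 * l_ce

loss = combined_loss(torch.randn(8, 512), torch.randn(8, 512),
                     torch.randn(8, 3), torch.randint(0, 3, (8,)))
```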
Leveraging ChatGPT's Multimodal Vision Capabilities to Rank Satellite Images by Poverty Level: Advancing Tools for Social Science Research by Hamid Sarmadi, Ola Hall, Thorsteinn Rögnvaldsson, Mattias Ohlsson https://arxiv.org/abs/2501.14546
This research explores using Large Language Models (LLMs), specifically ChatGPT-4o, to analyze satellite imagery for predicting poverty at the village level. While LLMs are primarily designed for natural language processing, their potential for multimodal tasks like geospatial analysis remains largely untapped. This study investigated whether vision-enabled LLMs can offer interpretable, scalable, and reliable insights into poverty from satellite images, potentially providing a cost-effective alternative to traditional surveys.
The researchers used a pairwise comparison method, presenting ChatGPT with pairs of high-resolution Google Earth images of locations in Tanzania. The model was prompted to identify the wealthier location based on observable features like infrastructure, buildings, greenery, and amenities. Ground truth data came from the 2015/2016 Tanzania Demographic and Health Survey (DHS) wealth index. The pairwise comparisons were aggregated into a wealth ranking using the Iterative Luce Spectral Ranking (I-LSR) algorithm.
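The I-LSR algorithm itself isn't reproduced here, but the same idea, recovering latent wealth scores from pairwise "which looks wealthier" judgments, can be sketched with a simple Bradley-Terry-style iteration (a stand-in, not the authors' exact aggregation method):

```python
# Aggregate pairwise "A is wealthier than B" judgments into scores via a
# simple Bradley-Terry MM iteration (stand-in for I-LSR).
import numpy as np

def bradley_terry(n_items, comparisons, iters=100):
    """comparisons: list of (winner, loser) index pairs.
    Returns a latent score per item; higher means judged wealthier more often."""
    wins = np.zeros((n_items, n_items))
    for w, l in comparisons:
        wins[w, l] += 1
    scores = np.ones(n_items)
    for _ in range(iters):
        pair_counts = wins + wins.T                       # comparisons per pair
        denom = pair_counts / (scores[:, None] + scores[None, :])
        scores = wins.sum(axis=1) / np.maximum(denom.sum(axis=1), 1e-12)
        scores = scores / max(scores.sum(), 1e-12)        # normalize for stability
    return scores

# Toy usage: location 2 judged wealthier than 0 and 1; location 1 wealthier than 0.
scores = bradley_terry(3, [(2, 0), (2, 1), (1, 0)])
ranking = np.argsort(-scores)                             # wealthiest first: [2, 1, 0]
```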
ChatGPT's ranking performance (Spearman's rank correlation ρ = 0.56) was comparable to a Random Forest model trained on expert-extracted features (ρ = 0.59). Although both were outperformed by a CNN model (ρ = 0.78), the LLM achieved human-level performance without domain-specific training, which is noteworthy. Analyzing quintile groupings using the Matthews Correlation Coefficient (MCC) revealed that ChatGPT excelled at distinguishing the poorest quintile, even outperforming the Random Forest model.
Caption: This scatter plot visualizes the correlation between ChatGPT's poverty ranking of Tanzanian villages based on satellite imagery and the ground truth ranking derived from the 2015/2016 Tanzania Demographic and Health Survey (DHS) wealth index. Each point represents a village, with its position determined by its rank according to ChatGPT (y-axis) and the DHS survey (x-axis).
Visual analysis showed that ChatGPT's assessments generally aligned with subjective interpretations of wealth based on visual cues. However, discrepancies with the DHS index highlighted potential limitations in the ground truth data, warranting further investigation into both the LLM's assessments and the DHS data. Despite these limitations, the study demonstrates the potential of LLMs for cost-effective, large-scale poverty monitoring using satellite imagery, opening avenues for future research into more sophisticated prompting and contextual integration.
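Both headline metrics above are standard library calls. Here is a sketch of the evaluation with placeholder rank arrays, treating the quintile analysis as a binary "poorest quintile or not" decision (an assumption about how the MCC was computed):

```python
# Rank-agreement and poorest-quintile evaluation with placeholder data.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
dhs_rank = np.arange(100)                                        # ground-truth DHS wealth rank
model_rank = pd.Series(dhs_rank + rng.integers(-10, 11, size=100)) \
                .rank(method="first").to_numpy()                 # noisy model-derived rank

rho, _ = spearmanr(dhs_rank, model_rank)                         # overall rank agreement

# Bin both rankings into quintiles, then score the binary "poorest quintile" call.
dhs_q = pd.qcut(dhs_rank, 5, labels=False)
model_q = pd.qcut(model_rank, 5, labels=False)
mcc_poorest = matthews_corrcoef(dhs_q == 0, model_q == 0)
print(f"Spearman rho = {rho:.2f}, MCC (poorest quintile) = {mcc_poorest:.2f}")
```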
DEFEND: A Large-scale 1M Dataset and Foundation Model for Tobacco Addiction Prevention by Naga VS Raviteja Chappa, Matthew Shepard, Connor McCurtain, Charlotte McCormick, Page Daniel Dobbs, Khoa Luu https://arxiv.org/abs/2501.13950
The tobacco industry's rapid product innovation, especially with e-cigarettes, has outpaced traditional public health monitoring. Existing datasets for tobacco product detection are limited, hindering robust monitoring systems. This paper introduces Tobacco-1M, a dataset of one million tobacco product images, and DEFEND (Distillation-enabled Enhanced Feature learning for tobacco ENforcement and Discernment), a foundation model designed for this challenge.
Tobacco-1M boasts hierarchical labels across 75 product categories, from broad classifications like "combustible" to specific types like "cigarettes." Each image has detailed annotations describing features, context, and health impacts, providing a rich resource. This dataset is significantly larger (140x) than existing public datasets. DEFEND utilizes this data with a novel teacher-student architecture. A Feature Enhancement Module captures nuanced visual and textual correlations, while a Local-Global Visual Coherence loss (||f_G - AvgPool(f_P)||) ensures consistency between fine-grained and holistic representations. An Enhanced Image-Text Alignment mechanism refines feature-description mapping using a contrastive loss.
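A minimal sketch of the Local-Global Visual Coherence term, penalizing the distance between the global feature and the average-pooled patch features; the norm choice and tensor shapes are assumptions:

```python
# Local-Global Visual Coherence: ||f_G - AvgPool(f_P)|| averaged over the batch.
import torch

def local_global_coherence_loss(global_feat, patch_feats):
    """global_feat: (B, d); patch_feats: (B, num_patches, d)."""
    pooled = patch_feats.mean(dim=1)                      # AvgPool over patches
    return torch.norm(global_feat - pooled, dim=-1).mean()

f_g = torch.randn(4, 512)        # holistic image features (assumed shape)
f_p = torch.randn(4, 196, 512)   # fine-grained patch features (assumed shape)
loss = local_global_coherence_loss(f_g, f_p)
```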
Caption: This image contrasts two approaches to tobacco product image analysis. The left side (red) depicts a basic feature fusion method with single-scale attention, resulting in simple product classification and missing fine details. The right side (green) showcases the DEFEND model, which uses a Feature Enhancement Module (FEM) and multi-scale attention to achieve multi-level semantic understanding, enabling precise product identification, regulatory compliance assessment, and health impact depiction.
Evaluated on product classification, marketing strategy detection, and health impact assessment, DEFEND achieved 78.3% Top-1 and 83.1% Top-5 accuracy on the PHAD dataset, outperforming ImageNet1K-pretrained models and self-supervised learning approaches. Ablation studies validated each component's contribution. DEFEND also demonstrated strong zero-shot learning (45.6% accuracy on novel PHAD categories), surpassing CLIP, CoCa, and MDETR. In visual question answering on Tobacco-1M, it achieved 73.8% accuracy and an F1-score of 0.75, outperforming existing models. DEFEND, powered by Tobacco-1M, represents a significant advancement in tobacco control research, enabling more effective monitoring and targeted interventions. However, limitations remain: the model may struggle with significantly novel products, diverse cultural contexts, and non-English packaging, and it cannot be applied to other health domains without retraining.
This newsletter showcased the remarkable progress in multimodal image and text foundation models. From fine-tuning VLMs for specialized biomedical applications like LDRT to leveraging visual grounding models for precise video object segmentation with ReferDINO, the advancements are significant. Brain-Adapter demonstrates the power of adapter-tuning for efficiently analyzing complex 3D medical data, while the innovative application of ChatGPT for poverty prediction using satellite imagery opens up new possibilities for socioeconomic research. Finally, DEFEND, with its massive Tobacco-1M dataset, exemplifies the potential of AI-powered image analysis for tackling critical public health challenges. These advancements underscore the transformative potential of multimodal AI across diverse domains, paving the way for more sophisticated and impactful applications in the future.