Multimodal foundation models, particularly those processing both image and text, are rapidly evolving. This newsletter summarizes five recent papers spanning critical aspects of this field: training-free adapter transfer, hallucination mitigation, open-vocabulary segmentation, post-training generalization, and multilingual multimodal stance detection. These works highlight both the exciting potential and the persistent challenges in building truly robust and adaptable multimodal models.
LoRA-X: Bridging Foundation Models with Training-Free Cross-Model Adaptation by Farzad Farhadzadeh, Debasmit Das, Shubhankar Borse, Fatih Porikli https://arxiv.org/abs/2501.16559
Parameter-efficient fine-tuning (PEFT) methods, like Low-Rank Adaptation (LoRA), are crucial for adapting large foundation models (LFMs) to specific tasks. LoRA achieves efficiency by representing weight updates with low-rank matrices A and B, reducing trainable parameters. However, these LoRA adapters are tied to their base model, requiring retraining when the base model is updated. This necessitates access to the original training data, which is often unavailable, or generating synthetic data, which may not be representative.
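For readers less familiar with LoRA, the snippet below is a minimal PyTorch sketch of the standard formulation (not LoRA-X itself): the frozen base weight W₀ is augmented with a trainable low-rank product BA, scaled by α/r. The layer and parameter names are illustrative.

```python
# Minimal sketch of a LoRA-style linear layer (illustrative; not the LoRA-X code).
# The frozen base weight W0 is effectively updated as W0 + (alpha / r) * B @ A,
# where the low-rank factors A and B are the only trainable parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # low-rank factor A
        self.B = nn.Parameter(torch.zeros(out_features, r))         # low-rank factor B (init to zero)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W0^T + scale * x A^T B^T  ==  x (W0 + scale * B A)^T
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```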
LoRA-X offers a solution by enabling training-free transfer of LoRA parameters between different model versions. It constrains the adapter within the column-row subspace of the base model's weights, using the singular value decomposition (SVD) of the original weight matrix (W₀): ∆W = Ŭ∆ΣV<sup>T</sup>, where Ŭ and V are truncated left and right singular matrices of W₀, and ∆Σ learns changes in singular values. This ensures adapter compatibility across model versions. LoRA-X is applied only to layers with sufficient subspace similarity, measured by a metric (Φ) based on unweighted similarity: Φ<sub>l</sub>(A, B) = Ψ(U<sub>A</sub>,U<sub>B</sub>) = (Σ<sub>i</sub>Σ<sub>j</sub> (u<sub>A,i</sub><sup>T</sup>u<sub>B,j</sub>)<sup>2</sup>)/n. An Adapter Transferability Cost (ATC) metric, derived from optimal transport, quantifies the transfer difficulty.
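The sketch below illustrates these mechanics under my own naming; the projection-based transfer step is my reading of the approach rather than the paper's released code. The adapter is a learned change of singular values in the truncated SVD of W₀, Φ is the averaged squared inner product between bases, and transfer projects the source update onto the target model's column-row subspace.

```python
# Hedged sketch of the LoRA-X idea: the adapter lives in the column-row subspace of the
# base weight, and moving it to a new base model is a projection, not retraining.
import torch

def lorax_factors(W0: torch.Tensor, r: int):
    """Truncated SVD of the base weight: W0 ~= U[:, :r] @ diag(S[:r]) @ Vh[:r, :]."""
    U, S, Vh = torch.linalg.svd(W0, full_matrices=False)
    return U[:, :r], Vh[:r, :]                      # Ŭ and V^T

def lorax_delta(U_r: torch.Tensor, Vh_r: torch.Tensor, delta_sigma: torch.Tensor):
    """Adapter update ∆W = Ŭ diag(∆Σ) V^T, with ∆Σ the only trainable parameters."""
    return U_r @ torch.diag(delta_sigma) @ Vh_r

def subspace_similarity(U_a: torch.Tensor, U_b: torch.Tensor) -> torch.Tensor:
    """Φ from the paper: average of squared inner products between the two bases."""
    n = U_a.shape[1]
    return (U_a.T @ U_b).pow(2).sum() / n

def transfer_delta(delta_W_src: torch.Tensor, W0_tgt: torch.Tensor, r: int):
    """Training-free transfer (assumed form): project the source update onto the target's subspace."""
    U_t, Vh_t = lorax_factors(W0_tgt, r)
    return U_t @ (U_t.T @ delta_W_src @ Vh_t.T) @ Vh_t
```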
Evaluations on text-to-image generation with Stable Diffusion v1.5, Stable Diffusion XL, and other target models demonstrate LoRA-X's effectiveness. Transferred LoRA-X adapters performed comparably to adapters trained from scratch on the target model, with similar HPSv2 and LPIPS scores, and high DINOv2 scores further confirmed strong correlation between generated samples. LoRA-X offers a promising approach to adaptable LFM fine-tuning, simplifying the process and extending the useful life of fine-tuned adapters across base-model updates.
CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs by Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng https://arxiv.org/abs/2501.16629
Caption: This figure illustrates different preference optimization strategies for multimodal large language models (MLLMs). (a) shows standard DPO, (b) shows multimodal DPO, and (c) depicts CHiP, which incorporates both hierarchical textual preference optimization (left side of dashed box) and visual preference optimization (right side of dashed box) to mitigate hallucinations. The green and pink shapes represent winning and losing outputs, respectively, while the blue arrows indicate preference comparisons.
Multimodal Large Language Models (MLLMs) often suffer from hallucinations, generating outputs contradicting visual inputs. While Direct Preference Optimization (DPO) has shown promise in text-based LLMs, its direct application to multimodal scenarios has been insufficient due to its difficulty in aligning image and text representations and differentiating hallucinated from factual descriptions.
CHiP (Cross-modal Hierarchical DPO) addresses these limitations by incorporating Hierarchical Textual Preference Optimization and Visual Preference Optimization. The former captures preferences at response, segment, and token levels, providing nuanced feedback. The latter introduces visual preference pairs, allowing the MLLM to learn directly from visual preferences, strengthening image-text alignment. The objective function combines these modules: L<sub>CHiP</sub> = L<sub>DPOv</sub> + L<sub>DPOr</sub> + λL<sub>DPOs</sub> + γL<sub>POk</sub>, where L<sub>DPOv</sub> and L<sub>DPOr</sub> are visual and response-level preferences, L<sub>DPOs</sub> and L<sub>POk</sub> are segment and token-level preferences, and λ and γ are weights.
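A hedged sketch of how these terms might be combined in code is shown below. The dpo_term helper is the standard DPO logistic loss on log-probability margins; how the response-, segment-, token-, and visual-level margins are actually constructed follows the paper and is not reproduced here.

```python
# Illustrative sketch of the CHiP objective's structure (not the authors' implementation).
import torch
import torch.nn.functional as F

def dpo_term(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta: float = 0.1):
    """Standard DPO loss: -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)]), batch-averaged."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

def chip_loss(terms: dict, lam: float = 0.5, gamma: float = 0.5) -> torch.Tensor:
    """L_CHiP = L_DPOv + L_DPOr + λ·L_DPOs + γ·L_POk, with each term built via dpo_term."""
    return terms["visual"] + terms["response"] + lam * terms["segment"] + gamma * terms["token"]
```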
Evaluations on four hallucination benchmarks (Object HalBench, MMHal-Bench, HallusionBench, and AMBER) demonstrate CHiP's effectiveness. On Object HalBench, CHiP achieved relative improvements of 52.7% and 55.5% over DPO with Muffin and LLaVA as base models, respectively, and it even outperformed GPT-4V on some benchmarks. Ablation studies confirmed that both hierarchical textual preference optimization and visual preference optimization contribute to these gains. CHiP effectively narrows the semantic gap between image and text representations and improves the model's ability to distinguish hallucinated from factual descriptions.
Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models by Muhammad Atta ur Rahman https://arxiv.org/abs/2501.16769
Caption: This diagram illustrates the Beyond-Labels architecture for open-vocabulary semantic segmentation. It shows how image and text encoders, augmented with Fourier embeddings, feed into a transformer-based fusion module. This module generates upsampled features for the image decoder, which produces the final segmentation mask by comparing them with the text features.
"Beyond-Labels" leverages self-supervised learning for open-vocabulary semantic segmentation, segmenting objects based on textual descriptions. It employs a lightweight transformer-based fusion module, utilizing a small amount of image segmentation data to fuse frozen image representations with language concepts.
Images are encoded using a frozen, self-supervised image model, augmented with Fourier embeddings (zᵢ = xᵢ + f_emb(x, y)) to capture positional information across resolutions. Category names are encoded using a frozen language model. These embeddings are fed into a transformer-based fusion module, using self-attention to refine features through cross-modal interactions. A decoder upsamples the fused image features and compares them with text features using cosine similarity to generate the segmentation mask.
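The following Python sketch, with hypothetical helper names, illustrates the two ingredients just described: sinusoidal Fourier features added per patch coordinate, and a cosine-similarity head comparing per-pixel features with text embeddings to produce class logits.

```python
# Rough sketch under my own naming; a simplification of the components described above.
import torch
import torch.nn.functional as F

def fourier_embedding(xy: torch.Tensor, num_freqs: int = 6) -> torch.Tensor:
    """xy: (N, 2) normalized patch coordinates in [0, 1] -> (N, 4 * num_freqs) sin/cos features."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xy.dtype)              # geometric frequency ladder
    angles = 2 * torch.pi * xy[..., None] * freqs                        # (N, 2, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)    # (N, 4 * num_freqs)

def segmentation_logits(pixel_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """pixel_feats: (H*W, D), text_feats: (C, D) -> cosine-similarity logits of shape (H*W, C)."""
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return pixel_feats @ text_feats.T
```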
Evaluated on PASCAL-5i, Beyond-Labels achieves a mean Intersection over Union (mIoU) of 41.5%, outperforming existing methods like SPNet (18.3%), ZS3Net (38.3%), and OSLSM (40.8%). Ablation studies demonstrate the importance of Fourier embeddings and the fusion module. Removing Fourier embeddings reduces performance to 25.7% mIoU, while removing the fusion module further reduces it to 36.9%. Beyond-Labels offers a simple yet powerful approach to open-vocabulary semantic segmentation, benefiting from computational efficiency and enhanced generalization across image sizes and resolutions.
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training by Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma https://arxiv.org/abs/2501.17161
Caption: This figure compares the in-distribution and out-of-distribution performance of Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) on two tasks, GeneralPoints (a card game) and V-IRL (a navigation task), as a function of training computation. The results show that RL generalizes significantly better than SFT, especially in out-of-distribution settings, across both language-only and vision-language tasks. The dotted line represents the initial performance before any post-training.
This paper compares Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for post-training foundation models, focusing on generalization and memorization in text and visual domains. The study uses GeneralPoints (an arithmetic card game) and V-IRL (a real-world navigation environment) to assess generalization to unseen variants.
The Llama-3.2-Vision-11B model was trained with SFT or RL, focusing on textual rule-based and visual generalization. GeneralPoints involves creating an equation equaling 24 from four cards, with rule variations in face card interpretations. V-IRL involves navigating to a target location, with rule variations in orientation actions. Visual variations included changing card colors and using different city environments. RL employed a sequential revision formulation with outcome-based rewards.
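As a toy illustration of an outcome-based reward (a simplification, not the paper's exact scheme), the function below scores a proposed GeneralPoints equation: reward 1 only if the expression uses each card value exactly once and evaluates to the target of 24.

```python
# Toy outcome-based reward in the spirit of GeneralPoints (my simplification).
import ast
from collections import Counter

def generalpoints_reward(expr: str, cards: list[int], target: int = 24) -> float:
    try:
        tree = ast.parse(expr, mode="eval")
        # Collect the integer literals that appear in the proposed expression.
        used = [n.value for n in ast.walk(tree)
                if isinstance(n, ast.Constant) and isinstance(n.value, int)]
        if Counter(used) != Counter(cards):
            return 0.0                                        # must use each card exactly once
        value = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
        return 1.0 if abs(value - target) < 1e-6 else 0.0
    except Exception:                                          # malformed or unsafe expression
        return 0.0

# e.g. generalpoints_reward("(10 - 4) * (2 + 2)", [10, 4, 2, 2]) -> 1.0
```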
RL, especially with outcome-based rewards, demonstrated superior generalization across both domains. In GeneralPoints, RL improved out-of-distribution performance by +3.5% and +3.0% for the language-only and vision-language variants, respectively, while SFT struggled to generalize, instead memorizing the training data. In V-IRL, RL improved performance by +61.1% under visual variations, while SFT's performance decreased by 5.6%. RL also enhanced underlying visual recognition capabilities, achieving state-of-the-art results on the V-IRL mini benchmark (+33.8%). While SFT remains important for stabilizing the model's outputs so that RL training can proceed effectively, it is RL that drives generalization to these complex, multimodal tasks.
Exploring Vision Language Models for Multimodal and Multilingual Stance Detection by Jake Vasilakes, Carolina Scarton, Zhixue Zhao https://arxiv.org/abs/2501.17654
Caption: This bar chart compares the Macro F1 scores of four Vision Language Models (VLMs) across text-only, image-only, and combined text & image inputs for a multilingual stance detection task. The results highlight the VLMs' current reliance on text, with text-only performance often exceeding or matching multimodal performance, indicating the limited contribution of visual information in the current models. The stars indicate statistically significant differences between multimodal and unimodal performance.
The increasing prevalence of multimodal content on social media necessitates robust stance detection systems capable of handling both text and images across languages. This research evaluates state-of-the-art Vision Language Models (VLMs) on a newly extended dataset covering seven languages and multimodal inputs. Four open-source VLMs (InternVL2, Qwen2-VL, Ovis 1.6, and Llama-Vision) were evaluated in a zero-shot setting, focusing on their use of visual cues, language-specific performance, and cross-modality interactions.
The methodology involved three sets of experiments: one exploring multimodality by comparing VLM performance with various input combinations (text-only, image-only, and combined) and through image content ablation (text blackout, content blackout); one focusing on multilinguality, comparing performance across languages and assessing agreement between predictions using Cohen's kappa (κ); and one investigating the intersection of multimodality and multilinguality.
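As a small illustration of the agreement analysis (using scikit-learn; the label set shown is a placeholder), Cohen's kappa between a model's stance predictions on parallel inputs in two languages can be computed as follows.

```python
# Minimal sketch: agreement between predictions on parallel examples in two languages.
from sklearn.metrics import cohen_kappa_score

preds_en = ["favor", "against", "none", "favor"]   # predictions on English inputs
preds_de = ["favor", "against", "favor", "favor"]  # predictions on the German parallels

kappa = cohen_kappa_score(preds_en, preds_de)
print(f"Cross-language agreement (Cohen's kappa): {kappa:.2f}")
```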
The results revealed a strong reliance on text across all VLMs, with text-only performance often matching or exceeding multimodal performance. This suggests a limited contribution from the visual modality. However, when images were used, VLMs relied heavily on text within the images. This was evidenced by performance drops after text blackouts and logistic regression analysis. In terms of multilinguality, Ovis 1.6 showed the highest consistency across languages, with similar F1 scores and high agreement (κ ≥ 0.7) between language pairs. Llama-Vision, despite officially supporting multiple languages, exhibited the lowest consistency.
This newsletter highlights the ongoing efforts to refine and enhance multimodal foundation models. While LoRA-X addresses practical challenges in adapting large models, CHiP tackles the persistent issue of hallucinations. Beyond-Labels demonstrates progress in open-vocabulary segmentation, the SFT-versus-RL comparison clarifies how post-training choices shape generalization, and the exploration of multilingual stance detection reveals both the complex interplay between modalities and languages and the current limitations of VLMs in effectively utilizing visual information. These diverse approaches underscore the dynamic nature of this field and the exciting possibilities that lie ahead as we strive to build more robust, adaptable, and trustworthy multimodal models.