This newsletter covers cutting-edge research aimed at enhancing the trustworthiness and performance of Multimodal Large Language Models (MLLMs). We'll explore novel approaches to pretraining, alignment, and out-of-distribution detection, focusing on how these techniques address critical challenges like visual hallucinations and improve overall model reliability. Prepare for a deep dive into the latest advancements in this rapidly evolving field.
Modality-Fair Preference Optimization for Trustworthy MLLM Alignment by Songtao Jiang, Yan Zhang, Ruizhe Chen, Yeying Jin, Zuozhu Liu https://arxiv.org/abs/2410.15334
Caption: This figure illustrates the difference between a baseline RLHF approach for MLLMs (a) and the proposed Modality-Fair Preference Optimization (MFPO) (b). MFPO balances text and image preferences by generating automated image preference data (chosen vs. rejected images based on text hallucinations), incorporating a combined loss function (L<sub>total</sub>), and employing a hierarchical alignment strategy based on semantic entropy.
MLLMs hold immense promise, but their tendency towards visual hallucinations, where generated text contradicts image content, poses a significant challenge. Existing alignment methods like Direct Preference Optimization (DPO), while effective for Large Language Models (LLMs), often prioritize text over visual information in MLLMs, leading to unreliable outputs, particularly in tasks like Visual Question Answering (VQA). This bias towards text preferences limits the model's ability to effectively leverage visual cues. Current research often focuses on refining text preference data, neglecting the crucial balance between text and image modalities during preference optimization.
This paper introduces Modality-Fair Preference Optimization (MFPO) to address this imbalance. MFPO balances text and image preferences through three key innovations. First, it generates automated, fine-grained image preference data. This is achieved by identifying hallucination-prone regions in generated text using a multipartite graph, mapping these regions to corresponding image areas with a Segment Anything Model (SAM), and applying diffusion noise to create perturbed versions of these image segments as rejected preferences. Second, MFPO introduces a novel learning objective that combines text preference loss (L<sub>text</sub>), image preference loss (L<sub>image</sub>), and a margin loss (L<sub>margin</sub>) to ensure reward consistency and stable training. The total loss is calculated as: L<sub>total</sub> = L<sub>text</sub> + L<sub>image</sub> + L<sub>margin</sub>. Finally, it utilizes a multi-stage alignment approach, employing an easy-to-hard training paradigm based on semantic entropy. This allows the model to progressively refine both text and image preferences, leading to more robust and balanced learning.
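To make the objective concrete, here is a minimal PyTorch sketch of how such a combined loss might be computed. The per-modality terms follow the standard DPO formulation; the exact form of the margin term, the dictionary layout, and the hyperparameters are illustrative assumptions rather than the paper's precise definitions.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.
    Returns the loss and the implicit reward margin (chosen minus rejected)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards
    return -F.logsigmoid(margin).mean(), margin

def mfpo_total_loss(text_lp, image_lp, beta=0.1):
    """Sketch of L_total = L_text + L_image + L_margin.

    text_lp / image_lp hold policy ("p") and reference ("r") log-probs for
    chosen ("c") and rejected ("r") inputs: keys "pc", "pr", "rc", "rr".
    For the image branch, chosen vs. rejected correspond to the original
    image vs. its noise-perturbed hallucination-prone regions.
    """
    l_text, m_text = dpo_loss(text_lp["pc"], text_lp["pr"],
                              text_lp["rc"], text_lp["rr"], beta)
    l_image, m_image = dpo_loss(image_lp["pc"], image_lp["pr"],
                                image_lp["rc"], image_lp["rr"], beta)
    # Margin term encouraging consistent rewards across modalities
    # (this particular form is an assumption for illustration).
    l_margin = (m_text - m_image).abs().mean()
    return l_text + l_image + l_margin
```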
Evaluations on LLaVA-v1.5 (7B, 13B) and other state-of-the-art models using benchmarks like Object HalBench, MMHalBench, and AMBER demonstrate MFPO's effectiveness. Notably, MFPO achieved state-of-the-art results on Object HalBench with a CHAIR<sub>i</sub> score of 5.1 using the 7B LLaVA-v1.5 model, a nearly 40% improvement over previous methods, outperforming even GPT-4V. It also achieved state-of-the-art performance on AMBER, significantly reducing the hallucination rate. Furthermore, MFPO enabled smaller 7B models to match or surpass the trustworthiness of larger 13B and 34B models, demonstrating its effectiveness in balancing optimization across modalities while maintaining scalability. These results underscore the importance of balanced preference optimization for both text and image modalities in MLLMs.
Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation by Seulbi Lee, Jihyo Kim, Sangheum Hwang https://arxiv.org/abs/2410.14975
Caption: This image illustrates the two stages of the Reflexive Guidance (ReGuide) method for improving out-of-distribution (OOD) detection in vision-language models. Stage 1 involves generating near-OOD and far-OOD class suggestions based on the input image (an airplane). These suggestions are then used in Stage 2 to classify the image among an expanded set of classes, including the original in-distribution classes, the suggested OOD classes, and a rejection class, resulting in a confidence score for each.
While LVLMs demonstrate impressive performance, their real-world trustworthiness, especially concerning Out-of-Distribution (OOD) detection, remains a crucial concern. This paper investigates the OOD detection capabilities of various LVLMs, including proprietary models like GPT-4o and open-source alternatives, using a novel evaluation framework adapted from CLIP's zero-shot OOD detection setup. The framework employs a carefully designed prompt to elicit confidence estimates for in-distribution and rejection classes. Initial findings reveal a performance gap between proprietary and open-source models, with the former generally exhibiting superior OOD detection. Open-source models often display overconfidence, assigning extreme confidence scores (0.0 or 1.0), leading to high false positive rates.
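As a rough illustration of this style of evaluation, the sketch below turns the elicited per-class confidences into an OOD score and computes AUROC. The `query_lvlm` wrapper, the rejection-class label, and the choice of scoring rule are assumptions for illustration, not the paper's exact protocol.

```python
from sklearn.metrics import roc_auc_score

REJECT = "none of these classes"

def ood_score(confidences: dict) -> float:
    """Higher = more likely OOD. Scoring by the rejection-class confidence
    is one plausible choice; the paper's exact rule may differ."""
    return confidences.get(REJECT, 0.0)

def evaluate_ood(images, is_ood_labels, id_classes, query_lvlm):
    """AUROC over ID (label 0) vs. OOD (label 1) images.

    `query_lvlm` is a hypothetical wrapper that sends the image together
    with a prompt listing `id_classes` plus the rejection class, and parses
    the model's per-class confidence estimates into a dict."""
    scores = [ood_score(query_lvlm(img, id_classes + [REJECT]))
              for img in images]
    return roc_auc_score(is_ood_labels, scores)
```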
To address these limitations, the paper introduces Reflexive Guidance (ReGuide), a two-stage prompting approach. In the first stage, the LVLM generates two sets of image-adaptive concept suggestions: one semantically similar (near-OOD) and another dissimilar (far-OOD) to the input image. These suggestions serve as auxiliary OOD classes in the second stage, where the LVLM classifies the image among an expanded set of classes (in-distribution, suggested OOD, and rejection).
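The two stages can be sketched as a simple prompting pipeline. Here `ask_lvlm` is a hypothetical image-plus-text chat wrapper that returns parsed lists or confidence dicts, and the prompt wording is illustrative rather than the paper's exact prompts.

```python
def reguide_classify(image, id_classes, ask_lvlm, n=10):
    """Two-stage ReGuide sketch under the assumptions stated above."""
    # Stage 1: self-guided, image-adaptive concept generation.
    near_ood = ask_lvlm(image,
        f"List {n} class names that are visually or semantically similar to "
        f"this image but not among: {', '.join(id_classes)}.")
    far_ood = ask_lvlm(image,
        f"List {n} class names that are clearly unrelated to this image.")

    # Stage 2: classification over the expanded class set.
    candidates = id_classes + near_ood + far_ood + ["none of these classes"]
    prompt = ("For each class below, give a confidence in [0, 1] that it "
              "describes the image: " + ", ".join(candidates))
    return ask_lvlm(image, prompt)  # dict: class -> confidence
```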
Experiments on ImageNet200 demonstrate that ReGuide significantly improves OOD performance across various LVLMs. It boosts AUROC for near-OOD detection substantially, even benefiting strong performers like GPT-4o. ReGuide also elevates open-source models to a level comparable with single-modal OOD detectors and improves ID classification accuracy. The image-adaptive nature of ReGuide proves crucial, outperforming text-based OOD class generation. ReGuide's success stems from guiding LVLMs to classify OOD inputs into suggested auxiliary classes, enhancing separation between ID and OOD inputs. However, challenges remain, including controlling LVLM behavior through prompting and the potential suboptimality of image-adaptive suggestions. For example, ReGuide can lead to overconfidence in models like GPT-4o, inflating the false positive rate (FPR).
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension by Yin Xie, Kaicheng Yang, Ninghua Yang, Weimo Deng, Xiangzi Dai, Tiancheng Gu, Yumeng Wang, Xiang An, Yongle Zhao, Ziyong Feng, Jiankang Deng https://arxiv.org/abs/2410.14332
Caption: Croc's architecture employs a three-stage training process: cross-modal alignment, cross-modal comprehension, and instruction tuning. The cross-modal comprehension stage utilizes a prompt token pool, mixed attention mechanism, and detailed caption generation to enhance visual understanding. This example shows the first two stages with an image of a Winnie the Pooh birthday setup being encoded and processed by the model to generate the caption "A winnie the pooh birthday."
While LMMs have advanced significantly, their pretraining phase, where joint text and visual processing occurs, requires further refinement. Existing methods often superficially integrate visual features, limiting true visual understanding. This paper introduces Croc, a novel pretraining paradigm featuring a cross-modal comprehension stage to enhance visual understanding in LLMs.
Croc's innovation lies in its three-pronged approach within this stage. First, a dynamically learnable prompt token pool, coupled with the Hungarian algorithm, replaces a subset of visual tokens with the most relevant prompt tokens, fostering deeper visual-textual interaction. Second, a mixed attention mechanism employs bidirectional attention for visual tokens (for richer image context) and unidirectional attention for text tokens (preserving language causality). Third, a detailed caption generation task encourages learning finer-grained visual semantics.
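A rough sketch of the first two mechanisms follows. The cosine-similarity matching cost, the random choice of which visual tokens to replace, and the mask layout (visual tokens preceding text tokens) are assumptions for illustration; Croc's exact formulation may differ.

```python
import torch
from scipy.optimize import linear_sum_assignment

def replace_with_prompt_tokens(visual_tokens, prompt_pool, n_replace):
    """Replace a subset of visual tokens with their best-matching prompt
    tokens via Hungarian matching (visual_tokens: (V, d), prompt_pool: (P, d))."""
    idx = torch.randperm(visual_tokens.size(0))[:n_replace]
    v = torch.nn.functional.normalize(visual_tokens[idx], dim=-1)
    p = torch.nn.functional.normalize(prompt_pool, dim=-1)
    cost = (-v @ p.T).detach().cpu().numpy()       # negate to maximize similarity
    rows, cols = linear_sum_assignment(cost)
    out = visual_tokens.clone()
    out[idx[torch.as_tensor(rows)]] = prompt_pool[torch.as_tensor(cols)]
    return out

def mixed_attention_mask(n_visual, n_text):
    """True = may attend. Visual tokens (first n_visual positions) attend
    bidirectionally; text tokens attend causally to text and to all visuals."""
    n = n_visual + n_text
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_visual, :n_visual] = True              # visual <-> visual
    mask[n_visual:, :n_visual] = True              # text -> all visual
    mask[n_visual:, n_visual:] = torch.tril(       # text -> earlier text only
        torch.ones(n_text, n_text, dtype=torch.bool))
    return mask
```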
Croc's training builds upon LLaVA-1.5's two-stage instruction tuning, incorporating cross-modal alignment, cross-modal comprehension, and instruction tuning stages. The loss function combines visual token reconstruction loss (L<sub>VTR</sub>) and detailed caption generation loss (L<sub>DCG</sub>): L = αL<sub>VTR</sub> + L<sub>DCG</sub>.
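A minimal sketch of that objective, assuming MSE for the visual-token reconstruction term and token-level cross-entropy for the caption term (both plausible but unconfirmed choices, as is the value of α):

```python
import torch.nn.functional as F

def croc_loss(pred_visual, target_visual, caption_logits, caption_ids, alpha=1.0):
    """Sketch of L = alpha * L_VTR + L_DCG under the assumptions above."""
    l_vtr = F.mse_loss(pred_visual, target_visual)
    # caption_logits: (batch, seq, vocab); caption_ids: (batch, seq)
    l_dcg = F.cross_entropy(caption_logits.transpose(1, 2), caption_ids)
    return alpha * l_vtr + l_dcg
```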
Experimental results demonstrate Croc's superior performance across vision-language benchmarks. Croc outperforms LLaVA-1.5 significantly on datasets like VQAv2, GQA, and SciQA-I, showing similar improvements in instruction-following benchmarks. Ablation studies validate the effectiveness of each component of the cross-modal comprehension stage.
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining by Han Huang, Yuqi Huo, Zijia Zhao, Haoyu Lu, Shu Wu, Bingning Wang, Qiang Liu, Weipeng Chen, Liang Wang https://arxiv.org/abs/2410.16166
Caption: This image contrasts two approaches to image-text data preprocessing for MLLM training. (a) shows traditional filtering, where low-scoring image-text pairs are discarded. (b) illustrates the proposed Adaptive Image-Text Quality Enhancer (AITQE), which dynamically enhances low-quality pairs through text rewriting and contrastive learning, preserving data while improving alignment.
The quality of image-text pairs is paramount for effective MLLM training. Current filtering methods discard substantial data, including potentially valuable images with poorly aligned text, impacting efficiency and scalability. This paper introduces the Adaptive Image-Text Quality Enhancer (AITQE), which dynamically assesses and enhances image-text quality, moving beyond filtering.
AITQE features a two-pronged approach. First, a text rewriting mechanism generates higher-quality descriptions for low-quality pairs, minimizing changes to the original text distribution while improving semantic alignment. Second, a contrastive sample learning strategy incorporates deliberately selected low-quality samples during training, strengthening the model's evaluative capabilities.
AITQE's training involves two stages. In the first, AITQE is trained on supervised fine-tuning data generated by GPT-4o according to criteria like text quality and image-text matching. The second stage incorporates contrastive samples—rewritten low-scoring captions and low-quality captions paired with high-scoring images—further refining AITQE's ability to enhance quality.
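The second-stage data construction might look roughly like the sketch below. Here `score_pair`, `rewrite_caption`, and `degrade_caption` are hypothetical helpers (a quality scorer, an LLM-based rewriter, and a caption degrader), and the thresholds are illustrative assumptions.

```python
def build_aitqe_training_data(pairs, score_pair, rewrite_caption,
                              degrade_caption, low_thr=0.3, high_thr=0.8):
    """Assemble second-stage contrastive samples (sketch)."""
    samples = []
    for image, caption in pairs:
        s = score_pair(image, caption)
        if s < low_thr:
            # Low-scoring caption paired with its rewritten, better-aligned version.
            samples.append({"image": image, "input": caption,
                            "target": rewrite_caption(image, caption)})
        elif s > high_thr:
            # High-scoring image paired with a deliberately degraded caption,
            # teaching the model to recognize and repair low-quality text.
            samples.append({"image": image, "input": degrade_caption(caption),
                            "target": caption})
    return samples
```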
Experimental results demonstrate AITQE's superiority over filtering methods. On a fixed subset of LAION-400M, models pretrained on AITQE-enhanced data achieved significantly higher benchmark scores than their filtering-based counterparts. Scaling experiments validated its effectiveness as a data enhancer, consistently producing higher-quality datasets and improving MLLM performance. These findings highlight AITQE's potential for improving data quality and facilitating efficient exploration of scaling laws in MLLM pretraining.
This newsletter highlights a clear trend in MLLM research: a shift towards enhancing trustworthiness and reliability. From optimizing preference learning to improving OOD detection and refining pretraining data quality, the approaches discussed in this newsletter demonstrate a concerted effort to address critical limitations of current MLLMs. The development of techniques like MFPO, ReGuide, Croc, and AITQE marks a significant step towards building more robust, reliable, and practically deployable multimodal models. The focus on balancing modalities, leveraging self-guidance, and enhancing data quality paves the way for future research to unlock the full potential of MLLMs in real-world applications.