Hi Elman,
In this newsletter, we'll delve into the exciting world of multimodal image and text foundation models. We'll explore recent research highlighting advancements in retrieval, temporal reasoning, few-shot learning, and novel architectures. From enhancing retrieval accuracy with simple normalization techniques to questioning the true temporal understanding of current models, and even leveraging the power of LLMs for autonomous driving, this newsletter offers a glimpse into the cutting edge of multimodal AI. Let's dive in!
Nearest Neighbor Normalization Improves Multimodal Retrieval by Neil Chowdhury, Franklin Wang, Sumedh Shenoy, Douwe Kiela, Sarah Schwettmann, Tristan Thrush https://arxiv.org/abs/2410.24114
Caption: This figure illustrates the Nearest Neighbor Normalization (NNN) process. It shows how the bias score, calculated from the average similarity between a retrieval candidate (e.g., a skiing image) and its nearest neighbors (q1, q2, …, qk) in a reference dataset, is subtracted from the original similarity scores to produce debiased scores. This debiasing process, as shown in the tables comparing original and NNN-adjusted scores, leads to more accurate retrieval results.
Multimodal models trained with contrastive learning have revolutionized tasks like image captioning and cross-modal retrieval. However, these models are not without their flaws. They often exhibit biases and struggle with the "hubness" problem, where certain images or captions become overly popular retrieval candidates, leading to incorrect matches. Existing solutions often require computationally expensive retraining or complex normalization schemes. This paper presents a simple yet powerful training-free method called Nearest Neighbor Normalization (NNN) to refine the output of pre-trained contrastive retrieval models.
NNN leverages a reference database of queries to correct retrieval scores. For each retrieval candidate, it identifies the k most similar queries from this database. It then calculates a bias score for the candidate as a constant multiple (α) of the average similarity score between the candidate and its k nearest neighbors. This bias score is subtracted from the original retrieval score, effectively debiasing the retrieval process. The formula for the bias b(r) for a retrieval candidate r is:
b(r) = α(1/k) Σ_{q_j ∈ D_topk(r)} s(q_j, r)
where D_topk(r) represents the set of k nearest neighbors to r in the reference dataset D, and s(q_j, r) is the similarity score between query q_j and candidate r. The debiased score s_d(q, r) is then:
s_d(q, r) = s(q, r) - b(r)
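To make the mechanics concrete, here is a minimal NumPy sketch of the debiasing step, assuming the query–candidate and reference–candidate similarity matrices have already been computed; the α and k defaults are illustrative, not the paper's tuned values.

```python
# Minimal sketch of Nearest Neighbor Normalization (NNN). Assumes similarity
# matrices are precomputed; alpha and k are illustrative defaults, not the
# paper's tuned hyperparameters.
import numpy as np

def nnn_debias(query_cand_sims, ref_cand_sims, alpha=0.75, k=16):
    """Return debiased retrieval scores s_d(q, r) = s(q, r) - b(r).

    query_cand_sims: (num_queries, num_candidates) test-time similarities.
    ref_cand_sims:   (num_ref_queries, num_candidates) similarities between
                     the reference query database D and the same candidates.
    """
    # b(r): alpha times the mean of each candidate's k highest similarities
    # to queries in the reference database.
    topk = np.sort(ref_cand_sims, axis=0)[-k:, :]   # (k, num_candidates)
    bias = alpha * topk.mean(axis=0)                # (num_candidates,)
    return query_cand_sims - bias[None, :]

# Retrieval then simply ranks candidates by the debiased scores:
# ranking = np.argsort(-nnn_debias(q_sims, ref_sims), axis=1)
```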
Evaluations on popular models (CLIP, BLIP, ALBEF, SigLIP, BEIT) and datasets (MS-COCO, Flickr30k) demonstrate NNN’s consistent improvement in retrieval metrics for both image and text retrieval. For instance, on image retrieval using COCO, NNN boosted CLIP's Recall@1 by a significant 7.1%. Importantly, NNN achieves these gains with minimal computational overhead, unlike more complex methods like DBNorm. Furthermore, it effectively mitigates gender bias, improving fairness without sacrificing accuracy. This makes NNN a highly promising post-processing technique for enhancing both the performance and fairness of contrastive multimodal retrieval models.
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models by Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan https://arxiv.org/abs/2410.23266
Caption: This graph shows the accuracy of several Multimodal Foundation Models (MFMs) and human performance on the TOMATO benchmark, measuring accuracy against the number of frames used. It highlights the significant performance gap between MFMs and humans, even when MFMs are given access to more frames, emphasizing the limitations in current MFMs' visual temporal reasoning abilities. The results also suggest the potential benefits of incorporating time-aware positional encoding, as demonstrated by Qwen2-VL's relatively strong performance.
While Multimodal Foundation Models (MFMs) have achieved impressive results on video understanding benchmarks, this paper questions their true visual temporal reasoning abilities. The authors argue that existing benchmarks often allow MFMs to exploit shortcuts by answering questions using single, few, or out-of-order frames, thereby bypassing the need for genuine temporal understanding.
To address this, they introduce three key principles for evaluating temporal reasoning: Multi-Frame Gain (κ), Frame Order Sensitivity (τ), and Frame Information Disparity (ρ). These principles assess the importance of using multiple frames, the sensitivity to frame order, and the distribution of information across frames, respectively. The formulas for these metrics are: κ = Acc(m frames)/Acc(1 frame) - 1, τ = Acc(m frames)/Acc(shuffled m frames) - 1, and ρ = Acc(handpicked 1 frame)/Acc(random-sampled 1 frame) - 1.
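As a quick illustration, all three diagnostics reduce to simple ratios of accuracies; the sketch below uses placeholder numbers rather than results from the paper.

```python
# Sketch of the three TOMATO diagnostics as accuracy ratios; the numbers in
# the example calls are placeholders, not results reported in the paper.
def multi_frame_gain(acc_m_frames, acc_1_frame):
    """kappa: relative gain from seeing all m frames over a single frame."""
    return acc_m_frames / acc_1_frame - 1

def frame_order_sensitivity(acc_ordered, acc_shuffled):
    """tau: relative drop when the m frames are shuffled."""
    return acc_ordered / acc_shuffled - 1

def frame_information_disparity(acc_handpicked, acc_random):
    """rho: gap between a handpicked single frame and a random single frame."""
    return acc_handpicked / acc_random - 1

print(multi_frame_gain(0.62, 0.48))             # ~0.29: multiple frames help
print(frame_order_sensitivity(0.62, 0.55))      # ~0.13: frame order matters
print(frame_information_disparity(0.52, 0.48))  # ~0.08: information is spread across frames
```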
Applying these principles to existing benchmarks reveals significant shortcomings in their ability to truly evaluate temporal reasoning. To provide a more rigorous assessment, the researchers introduce TOMATO (Temporal Reasoning Multimodal EvaluaTiOn). This new benchmark comprises 1,484 human-annotated questions across six tasks targeting various aspects of temporal understanding. These tasks are applied to a diverse set of 1,417 videos, minimizing the reliance on common sense and forcing models to engage in genuine temporal reasoning.
Evaluations on TOMATO reveal a substantial performance gap between humans and even the most advanced MFMs. The best open-source model achieves only 37.9% accuracy, far below human performance of 95.2%. This gap highlights the limitations of current models in interpreting frames as a continuous sequence, even with explicit instructions. The study further reveals that MFMs tend to rely on common sense or hallucinate based on single frames, often failing to connect individual frame understanding to the overall temporal context. The relative success of Qwen2-VL, which incorporates Multimodal Rotary Position Embedding (M-RoPE), suggests that explicitly encoding temporal information might be crucial for improving temporal reasoning capabilities. TOMATO provides a valuable tool for future MFM development, emphasizing the need for models capable of true temporal understanding.
Multimodality Helps Few-Shot 3D Point Cloud Segmentation by Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Min Wu, Ming-Ming Cheng, Ender Konukoglu, Serge Belongie https://arxiv.org/abs/2410.22489
Caption: The MultiModal Few-Shot SegNet (MM-FSS) architecture leverages textual class names and simulated 2D visual features to enhance few-shot 3D point cloud segmentation. Key components include the Multimodal Correlation Fusion (MCF) and Multimodal Semantic Fusion (MSF) modules, which integrate information from multiple modalities, and the Test-time Adaptive Cross-modal Calibration (TACC) module for refining predictions. The model uses a shared 3D backbone with separate heads for intermodal and unimodal feature extraction, guided by text embeddings from a pre-trained text encoder.
Few-shot 3D point cloud segmentation (FS-PCS) faces the challenge of generalizing to novel categories with limited labeled data. This paper introduces a novel approach that leverages readily available multimodal information, specifically textual labels and simulated 2D images, to significantly improve FS-PCS performance. The proposed MultiModal Few-Shot SegNet (MM-FSS) model utilizes a shared 3D backbone with two heads: one for extracting intermodal features aligned with 2D visual features, and another for extracting unimodal (point cloud) features. A pre-trained text encoder processes class names to generate text embeddings.
The key innovation lies in the fusion modules. The Multimodal Correlation Fusion (MCF) module combines correlations from both feature types, and the Multimodal Semantic Fusion (MSF) module refines these correlations using text-aware semantic guidance. Importantly, the 2D modality is only used during pre-training, making this a cost-free multimodal approach. A novel Test-time Adaptive Cross-modal Calibration (TACC) technique further enhances performance by mitigating training bias. This is achieved through a dynamic calibration of predictions during testing, using an adaptive indicator (γ) based on the quality of semantic guidance:
P'_q = γG_q + P_q
where P_q is the model's original prediction, G_q is the semantic guidance derived from the intermodal features and text embeddings, γ is the adaptive indicator, and P'_q is the calibrated prediction.
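A minimal sketch of this calibration step follows, assuming logit-shaped tensors; the shapes and the example γ value are illustrative, since the paper estimates γ at test time from the quality of the semantic guidance.

```python
# Hedged sketch of the TACC calibration; tensor shapes and the gamma value are
# illustrative assumptions, not the paper's exact implementation.
import torch

def tacc_calibrate(pred_logits, guidance_logits, gamma):
    """P'_q = gamma * G_q + P_q: blend the few-shot prediction with
    text-guided semantic scores from the intermodal features."""
    return gamma * guidance_logits + pred_logits

pred = torch.randn(2048, 3)    # P_q: per-point class logits from the few-shot head
guide = torch.randn(2048, 3)   # G_q: guidance from intermodal features + text embeddings
calibrated = tacc_calibrate(pred, guide, gamma=0.4)  # gamma estimated at test time
```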
Evaluations on S3DIS and ScanNet datasets show substantial improvements over the state-of-the-art. MM-FSS achieves average mIoU gains of up to +9.92% on ScanNet and +8.58% on S3DIS in various few-shot settings. These results demonstrate the effectiveness of incorporating multimodal information. Ablation studies further confirm the contributions of the MCF and MSF modules, the interaction between the feature heads, and the TACC technique. This work highlights the potential of multimodality in FS-PCS and sets a new benchmark for performance in this challenging field.
EMMA: End-to-End Multimodal Model for Autonomous Driving by Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, James Guo, Dragomir Anguelov, Mingxing Tan https://arxiv.org/abs/2410.23262
Caption: This diagram illustrates the architecture of Waymo's EMMA, a multimodal LLM for autonomous driving. It shows how EMMA processes visual and textual inputs, leverages chain-of-thought reasoning to generate driving rationales, and produces outputs like future waypoints, relying on underlying capabilities such as spatial reasoning and scene understanding. The input consists of a sequence of camera images, high-level commands, and historical ego vehicle status, while the output is a plan for the ego vehicle's future trajectory.
Waymo introduces EMMA, a groundbreaking approach to autonomous driving based on a multimodal large language model (MLLM). Unlike traditional modular systems, EMMA directly maps raw camera data to driving outputs like trajectories, object detections, and road graph elements. Its key innovation lies in representing all non-sensor inputs and outputs as natural language text, enabling EMMA to leverage the world knowledge and reasoning capabilities of pre-trained LLMs like Gemini. By using task-specific prompts, EMMA effectively frames driving tasks as visual question answering problems.
The model's architecture is remarkably straightforward. It takes camera videos, high-level commands, and ego vehicle status as input. For trajectory generation, the core formulation is O_trajectory = G(T_intent, T_ego, V), where G represents the Gemini MLLM. EMMA also integrates chain-of-thought reasoning, generating a driving rationale (O_rationale) before predicting trajectories, thereby enhancing explainability. The combined formulation becomes (O_rationale, O_trajectory) = G(T_intent, T_ego, V).
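Since the point is the text-in/text-out formulation rather than any specific API, the sketch below uses a generic `mllm` callable as a stand-in for Gemini; the prompt wording, "Waypoints:" output format, and parsing are assumptions for illustration, not the paper's actual templates.

```python
# Illustrative sketch of (O_rationale, O_trajectory) = G(T_intent, T_ego, V).
# `mllm` is a stand-in for the underlying Gemini model; the prompt and output
# format below are assumptions, not EMMA's actual prompt templates.
def plan_trajectory(mllm, camera_frames, intent_command, ego_history):
    prompt = (
        "You are driving the ego vehicle.\n"
        f"High-level command: {intent_command}\n"
        f"Ego status and history: {ego_history}\n"
        "First explain your driving rationale, then write 'Waypoints:' followed "
        "by future (x, y) positions in meters, separated by semicolons."
    )
    # All non-sensor inputs and outputs are plain text; camera frames ride alongside.
    response = mllm(images=camera_frames, text=prompt)
    rationale, waypoint_text = response.split("Waypoints:", 1)
    waypoints = [tuple(map(float, p.strip(" ()").split(",")))
                 for p in waypoint_text.split(";") if p.strip()]
    return rationale.strip(), waypoints

# Expected (hypothetical) model output:
# "... rationale ... Waypoints: (1.2, 0.0); (2.5, 0.1); (3.9, 0.3)"
```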
EMMA achieves state-of-the-art performance in motion planning on nuScenes, surpassing previous self-supervised and some supervised methods. It also shows competitive results on WOMD and in 3D object detection on WOD, significantly improving vehicle precision. Furthermore, co-training EMMA on multiple tasks leads to performance gains across all domains. Despite these promising results, EMMA has limitations, including limited frame processing capacity, lack of LiDAR/radar integration, and high computational cost. Future research directions include incorporating memory modules, integrating 3D sensing, and optimizing for real-time performance.
Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model by Keito Sasagawa, Koki Maeda, Issa Sugiura, Shuhei Kurita, Naoaki Okazaki, Daisuke Kawahara https://arxiv.org/abs/2410.22736
Caption: This diagram illustrates the architecture and training process of VILA-jp, a Japanese Visual Language Model. The architecture combines a vision transformer (ViT), a projector, and a large language model (LLM) to process visual and textual tokens. The training involves three steps: projector initialization, interleaved pre-training with Japanese image-text pairs, and joint vision-text instruction fine-tuning using a combination of existing and newly created Japanese instruction tuning datasets.
The development of high-performing Visual Language Models (VLMs) has primarily focused on English, leaving other languages with a scarcity of resources. This paper introduces VILA-jp, a Japanese VLM built from the ground up using newly constructed datasets. The authors argue against simply translating English datasets, as this fails to capture the linguistic nuances and cultural context of the Japanese language.
Their methodology involves collecting Japanese image-text pairs and interleaved data from web archives, followed by rigorous filtering for quality control. For instruction tuning, they generate Japanese instruction data directly from images using an existing VLM, ensuring a tighter alignment between visual and textual content. VILA-jp integrates a vision encoder (SigLIP), a large language model (llm-jp-3-13b-instruct), and a two-layer MLP projector, trained in three stages: projector initialization, multimodal continual pretraining, and multimodal instruction tuning.
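As an illustration of the lightweight connector between the two pre-trained components, here is a minimal sketch of a two-layer MLP projector mapping SigLIP patch features into the LLM's embedding space; the dimensions and token counts are assumptions, not the published configuration.

```python
# Minimal sketch of a two-layer MLP projector between the SigLIP vision encoder
# and the LLM; the dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1152, llm_dim=5120):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim),
        # so visual tokens can be interleaved with text token embeddings.
        return self.mlp(vision_features)

projector = VisionProjector()
visual_tokens = projector(torch.randn(1, 729, 1152))  # shapes are illustrative
```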
VILA-jp achieves state-of-the-art performance on Japanese benchmarks, including Heron Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA-500, even surpassing GPT-4o on the latter. An ablation study demonstrates the importance of the native Japanese dataset, as using translated instruction data significantly degrades performance. This work highlights the critical role of language-specific data in VLM development and provides valuable resources for Japanese VLM research. The proposed approach can be adapted to other languages, promoting the development of high-performing VLMs across diverse linguistic landscapes.
This newsletter showcased several key advancements in the field of multimodal image and text foundation models. We've seen how simple techniques like Nearest Neighbor Normalization can significantly boost retrieval accuracy, while TOMATO challenges our assumptions about the temporal reasoning capabilities of current models. The introduction of MM-FSS demonstrates the power of incorporating multimodal information in few-shot learning for 3D point cloud segmentation. EMMA pushes the boundaries of autonomous driving by leveraging the power of multimodal LLMs, and VILA-jp underscores the value of building native-language datasets from scratch for Japanese VLM development. These diverse approaches highlight the rapid evolution of this field and offer exciting glimpses into the future of multimodal AI.