Hi Elman,
This newsletter dives into the latest breakthroughs and challenges in the rapidly evolving field of multimodal image and text foundation models. We'll explore novel architectures, training paradigms, applications, and ethical considerations that are shaping the future of AI. From augmenting data with LLMs to unifying all modalities as pixels, and even making images dance to music, this week's selection offers a glimpse into the exciting possibilities and critical challenges that lie ahead.
Image, Text, and Speech Data Augmentation using Multimodal LLMs for Deep Learning: A Survey by Ranjan Sapkota, Shaina Raza, Maged Shoman, Achyut Paudel, Manoj Karkee https://arxiv.org/abs/2501.18648
Caption: This infographic visually represents the survey methodology for analyzing LLM-driven data augmentation across image, text, and speech modalities. The flow chart (a) illustrates the literature review process, while pie charts (b) and (c) depict the distribution of peer-reviewed papers and preprints, respectively, across the three data modalities.
Data augmentation, a crucial technique for improving the generalization and robustness of deep learning models, has been revolutionized by the advent of Large Language Models (LLMs), particularly those with multimodal capabilities. This survey provides a comprehensive overview of this emerging field, exploring how LLMs are transforming data augmentation across image, text, and speech modalities. Traditional augmentation techniques, relying on manual transformations or basic algorithms, often struggle to capture the complexity and diversity of real-world data. LLMs offer a powerful new approach, leveraging their contextual understanding and generative capabilities to create synthetic data that significantly enhances model training. The survey meticulously examines the technical processes involved in LLM-based augmentation, including encoding, prompt generation, transformation execution, and quality assessment across all three modalities.
The survey's methodology involved a rigorous literature review, analyzing 24 studies on image data augmentation, 45 on text data augmentation, and 35 on speech data augmentation, published since 2020. This comprehensive analysis provides a detailed snapshot of the current research landscape, highlighting the diverse applications and emerging trends. The results reveal a wide array of LLM-based techniques. For image augmentation, these include image-to-text synthesis, semantic content transfer, and image captioning. Text augmentation techniques range from paraphrasing and back-translation to noise injection and controlled generation. In speech augmentation, methods like background noise addition, amplitude scaling, pitch shifting, and synthetic speech generation are explored. For example, in image augmentation, LLMs can generate synthetic images from textual descriptions, enriching datasets with diverse visual representations. In text augmentation, LLMs can paraphrase sentences while preserving meaning, enhancing model robustness to varied linguistic formulations. In speech, LLMs can generate synthetic speech samples with varying intonations and background noise, improving the resilience of speech recognition systems.
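To make the text-augmentation idea concrete, here is a minimal sketch of LLM-driven paraphrasing, assuming a generic `paraphrase_with_llm` callable stands in for whatever multimodal LLM a practitioner has access to; the function name, prompt, and filtering choices are illustrative, not taken from the survey:

```python
from typing import Callable, List

def augment_with_paraphrases(samples: List[str],
                             paraphrase_with_llm: Callable[[str], str],
                             n_variants: int = 2) -> List[str]:
    """Expand a labeled text dataset with LLM-generated paraphrases.

    `paraphrase_with_llm` is assumed to wrap an LLM call that returns one
    paraphrase per invocation (e.g., via a chat-completion API); quality
    filtering of the generated variants is left to the caller.
    """
    augmented = []
    for text in samples:
        augmented.append(text)  # keep the original example
        for _ in range(n_variants):
            prompt = f"Paraphrase the following sentence, preserving its meaning:\n{text}"
            augmented.append(paraphrase_with_llm(prompt))
    return augmented
```

The same pattern generalizes to the other modalities the survey covers, with the prompt and generation target swapped for image or speech synthesis.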
However, the survey also critically examines the limitations of current LLM-based approaches. These include challenges related to generating ambiguous or unrealistic outputs, contextual and semantic misalignment, user dependency, over-specialization, and computational costs. For instance, in image augmentation, LLMs might generate unrealistic images due to limitations in understanding complex visual semantics. In text, issues like loss of context and semantic drift can arise. In speech, challenges include temporal distortion, timbre loss, and signal degradation. The survey proposes potential solutions to these limitations, such as refining LLM architectures, incorporating feedback loops, and enhancing training datasets with more diverse and high-quality examples. It also emphasizes the importance of ethical considerations, highlighting the need to address potential biases and promote fairness in generated data. Finally, the survey outlines future research directions, including the development of more refined pipelines for multimodal data generation and the exploration of reinforcement learning-based methodologies for self-augmentation.
Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models by Hao Dong, Moru Liu, Kaiyang Zhou, Eleni Chatzi, Juho Kannala, Cyrill Stachniss, Olga Fink https://arxiv.org/abs/2501.18592
Caption: This figure illustrates the difference between traditional unimodal and multimodal domain adaptation/generalization (DA/DG). Traditional methods use a single modality (sketch, cartoon, art, photo) as input, while multimodal methods combine information from different modalities (e.g., real images, cartoon images, and audio) to improve generalization to unseen target domains, as exemplified by the multimodal adaptation framework leveraging text embeddings and feature augmentation.
Domain adaptation (DA) and domain generalization (DG) are essential for deploying machine learning models in real-world scenarios where data distributions can shift between training and testing environments. This survey provides a comprehensive overview of the field, spanning from traditional methods to the transformative impact of Multimodal Foundation Models (MFMs). It focuses on Multimodal Domain Adaptation (MMDA) and Multimodal Domain Generalization (MMDG), addressing the challenges of adapting models to unseen multimodal distributions. Traditional MMDA leverages labeled source data and unlabeled target data to adapt a model to a new target domain, while MMDG trains models solely on source domains to generalize to unseen target domains without any target data during training. A key challenge in both is effectively leveraging complementary information from diverse modalities.
Traditional MMDA methods often employ techniques like domain-adversarial learning, aligning multimodal features across domains. Contrastive learning is also used, pulling positive sample pairs closer while pushing negative pairs apart in feature space. Cross-modal interaction methods focus on information exchange between modalities to capture complementary relationships. For semantic segmentation, methods like xMUDA and its extensions promote cross-modal prediction consistency through mutual mimicking and minimizing cross-modal divergence, often using the Kullback-Leibler (KL) divergence: $L_{xM} = D_{KL}(P \,\|\, Q) = \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} P(n,c) \log \frac{P(n,c)}{Q(n,c)}$, where $P$ and $Q$ are the target and mimicking predictions over $C$ classes for each of the $N$ points. Multimodal Test-Time Adaptation (MMTTA), a specialized form of MMDA, adapts a pre-trained model online to a target domain without accessing source data during adaptation.
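As a concrete reference, the cross-modal mimicking term is a standard KL divergence between the two branches' class distributions. The PyTorch sketch below assumes the target branch's predictions are treated as a fixed (detached) target, which is a common choice rather than something spelled out here:

```python
import torch
import torch.nn.functional as F

def cross_modal_kl_loss(target_logits: torch.Tensor,
                        mimic_logits: torch.Tensor) -> torch.Tensor:
    """L_xM = (1/N) * sum_n sum_c P(n,c) * log(P(n,c) / Q(n,c)).

    target_logits: predictions P from the branch being imitated (e.g., the 2D branch),
    mimic_logits:  predictions Q from the branch that mimics it (e.g., the 3D branch).
    Both tensors have shape (N, C) for N points and C classes.
    """
    p = F.softmax(target_logits, dim=-1).detach()  # treat P as a fixed target
    log_q = F.log_softmax(mimic_logits, dim=-1)
    # F.kl_div expects (log Q, P); "batchmean" divides the summed divergence by N.
    return F.kl_div(log_q, p, reduction="batchmean")
```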
The advent of MFMs like CLIP has revolutionized DA and DG. These models are leveraged for data augmentation, generating synthetic data to enhance diversity and address domain shifts. Knowledge distillation techniques transfer knowledge from MFMs to smaller models. Prompt-based methods are also gaining traction, adapting MFMs by learning domain-specific prompts. For example, LADS learns transformations of image embeddings to unseen test domains using text descriptions, while DGInStyle leverages diffusion models to generate diverse training data. Adapting MFMs to downstream tasks involves techniques like prompt-based adaptation, adapter-based methods, fine-tuning, and training-free adaptation.
Despite significant progress, open challenges remain, including the need for comprehensive benchmarks and larger-scale multimodal datasets. Addressing open-set scenarios with unknown classes and exploring the theoretical foundations of multimodal adaptation and generalization are also important future directions.
PixelWorld: Towards Perceiving Everything as Pixels by Zhiheng Lyu, Xueguang Ma, Wenhu Chen https://arxiv.org/abs/2501.19339
Caption: This image visualizes the PEAP framework, which processes all modalities as pixels, alongside a plain framework using tokens for text. It also presents key insights regarding performance trends across modalities and task complexity, transferability across model scales, and the similarity of attention patterns between PEAP and token-based approaches, highlighting the potential and challenges of perceiving everything as pixels.
This paper introduces a paradigm shift in multimodal AI, proposing to "Perceive Everything as Pixels" (PEAP). Challenging the conventional approach of processing images as pixels and text as tokens, PEAP unifies all modalities (text, tables, code, diagrams, images) as pixel inputs. To evaluate this approach, the researchers created PIXELWORLD, a novel evaluation suite that converts various modalities into pixel space, allowing for a direct comparison of performance between pixel-based and token-based input for LLMs.
The methodology involved rendering text and structured data into images, varying font sizes and adding noise to test robustness. For multimodal datasets, text inputs came from existing OCR pipelines or the datasets' provided text components. Several Vision-Language Models (VLMs) were then evaluated on PIXELWORLD in a zero-shot setting.
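As a rough illustration of what "rendering text as pixels" involves, the snippet below draws a text prompt onto a white canvas with Pillow; the canvas size, naive word wrapping, default font, and noise injection are assumptions for illustration, not PIXELWORLD's actual rendering pipeline:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text: str, width: int = 768, font_size: int = 18,
                         noise_std: float = 0.0) -> Image.Image:
    """Render a text prompt onto a white canvas so a VLM can consume it as pixels."""
    font = ImageFont.load_default()  # stand-in; a TrueType font would respect font_size
    # Naive character-count wrapping based on an assumed average glyph width.
    chars_per_line = max(1, width // (font_size // 2))
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    height = (len(lines) + 1) * (font_size + 4)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((8, row * (font_size + 4)), line, fill="black", font=font)
    if noise_std > 0:  # optional pixel noise for robustness checks
        arr = np.asarray(img).astype(np.float32)
        arr += np.random.normal(0.0, noise_std, arr.shape)
        img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    return img
```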
The results revealed that PEAP improved performance on intrinsically multimodal tasks like website rendering and slide comprehension. However, it led to performance degradation on complex text-centric tasks like reasoning and code generation. Larger models showed better transferability between pixel-based and token-based performance than smaller models: GPT-4o's performance declined only minimally with PEAP on certain tasks, while smaller models suffered significant drops. Interestingly, the attention patterns of PEAP were highly aligned with those of token-based input, suggesting that vision encoders could serve as universal multimodal tokenizers. However, PEAP also introduced computational overhead, which was addressed by PEAP-Fast, a sparsification algorithm that significantly reduced overhead with minimal accuracy loss. Chain-of-Thought prompting was also found to be more effective with PEAP.
Every Image Listens, Every Image Dances: Music-Driven Image Animation by Zhikang Dong, Weituo Hao, Ju-Chiang Wang, Peng Zhang, Pawel Polak https://arxiv.org/abs/2501.18801
Caption: The MuseDance model architecture depicts a two-stage process for animating still images with music and text. Stage 1 uses a VAE encoder, text encoder, and DensePose features to generate a single frame, while Stage 2 incorporates music and beat information into a denoising U-Net to create a synchronized video sequence. The model takes noise, user prompts, music, and beat as input, and outputs animated dance videos in two stages, with the second stage building upon the first by adding motion dynamics.
This paper introduces MuseDance, a novel end-to-end model that animates still images using both music and text inputs. This approach enables personalized dance video creation without requiring complex motion guidance like pose sequences, making it accessible to a wider audience. MuseDance addresses several key challenges in this area. It introduces a new multimodal dataset of dance videos with corresponding music and text descriptions, enabling the model to learn motion dynamics from both modalities. The model offers flexible control, allowing users to specify motions through text while synchronizing animation with the music. Its diffusion-based approach ensures robust generalization and temporal consistency.
MuseDance's training is a two-stage process. The first stage focuses on single-frame generation, learning visual features and disentangling appearance and motion using DensePose and text prompts. The latent representation of the input image is modified by incorporating DensePose features: $z_0 = E(I_i) + Conv(D_i)$, where $z_0$ is the latent representation, $E(I_i)$ is the encoded input image, and $D_i$ is the DensePose mask. The second stage introduces music, beat, and motion modules to the denoising U-Net, generating video sequences synchronized with the music and guided by the text.
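To ground the $z_0 = E(I_i) + Conv(D_i)$ conditioning step, here is a minimal PyTorch sketch, assuming a 4-channel VAE latent and a DensePose map already resized to the latent resolution; the zero-initialized convolution is a common conditioning trick and an assumption here, not a detail taken from the paper:

```python
import torch
import torch.nn as nn

class PoseConditionedLatent(nn.Module):
    """Sketch of z0 = E(I) + Conv(D): add a learned projection of the
    DensePose features to the VAE latent of the reference image."""
    def __init__(self, densepose_channels: int = 3, latent_channels: int = 4):
        super().__init__()
        self.pose_conv = nn.Conv2d(densepose_channels, latent_channels,
                                   kernel_size=3, padding=1)
        # Zero-init so training starts from the unmodified image latent (assumption).
        nn.init.zeros_(self.pose_conv.weight)
        nn.init.zeros_(self.pose_conv.bias)

    def forward(self, image_latent: torch.Tensor,
                densepose_map: torch.Tensor) -> torch.Tensor:
        # image_latent:  (B, 4, H/8, W/8) from a frozen VAE encoder E(I_i)
        # densepose_map: (B, 3, H/8, W/8) DensePose features at latent resolution
        return image_latent + self.pose_conv(densepose_map)
```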
Quantitative evaluation shows that MuseDance outperforms existing methods adapted to this task, achieving better image quality and temporal consistency. Ablation studies confirm the contribution of each module, particularly the motion module for temporal coherence. Qualitative results showcase the model's ability to generate realistic and diverse dance videos for various objects, adhering to text guidance and music dynamics.
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding by Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou https://arxiv.org/abs/2501.18362
Caption: This infographic details the MedXpertQA benchmark, a new resource for evaluating expert-level medical AI. It showcases the data sources, image types, skills assessed, medical categories covered, and example questions, highlighting the benchmark's complexity compared to previous methods. The graphic also presents the performance of leading large language and multimodal models, demonstrating the challenge MedXpertQA poses and its potential to drive further advancements in medical AI.
This paper introduces MedXpertQA, a new benchmark designed to evaluate expert-level medical knowledge and reasoning in AI. Existing medical AI benchmarks, both text-based and multimodal, have limitations. Text benchmarks lack comprehensive coverage and are no longer sufficiently challenging for advanced models. Multimodal benchmarks often rely on simplified QA pairs, lacking the detailed clinical information and expert-level reasoning required for real-world medical scenarios. MedXpertQA addresses these shortcomings with two subsets: MedXpertQA Text for text-only evaluations and MedXpertQA MM for multimodal assessments. It covers 17 medical specialties and 11 body systems and includes diverse image types and rich clinical information.
The construction of MedXpertQA involved rigorous data collection, filtering, augmentation, and expert review. Questions were sourced from medical exams, textbooks, and image-rich sources. A hierarchical filtering process ensured the selection of challenging and diverse questions. Data augmentation mitigated data leakage risk. Medical experts reviewed all questions for accuracy and clinical relevance.
Evaluation of leading LLMs and LMMs on MedXpertQA revealed that even state-of-the-art models struggle, especially in complex reasoning tasks. This highlights the benchmark's difficulty and its potential to drive advancements in medical AI. A reasoning-oriented subset was also developed to specifically assess reasoning capabilities.
Foundational Models for 3D Point Clouds: A Survey and Outlook by Vishal Thengane, Xiatian Zhu, Salim Bouzerdoum, Son Lam Phung, Yunpeng Li https://arxiv.org/abs/2501.18594
Caption: This figure illustrates three key approaches to leveraging 2D Foundational Models (FMs) for 3D point cloud understanding: direct adaptation of 2D encoders, dual encoder architectures using separate 2D and 3D encoders, and triplet alignment methods that align point cloud, image, and text representations. These approaches utilize pre-trained models like ResNet, ViT, and CLIP to enhance 3D tasks such as classification, segmentation, and object detection.
3D point clouds provide rich geometric information, crucial for applications like robotics and autonomous driving. However, utilizing this data effectively is challenging. This survey explores how Foundational Models (FMs) are transforming 3D point cloud understanding. It categorizes methods for building 3D FMs using 2D FMs into three approaches: direct adaptation (modifying 2D models to process point clouds), dual encoder (using separate encoders for 2D and 3D data), and triplet alignment (aligning point cloud, image, and text representations). The survey details how these models are adapted for downstream tasks like classification, segmentation, and object detection. Examples include PointCLIP and CALIP for classification, PartSLIP and SAM3D for segmentation, and VFMM3D for object detection. Open-vocabulary methods like OpenScene and CLIP-FO3D leverage CLIP for 3D understanding without explicit 3D annotations.
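For the triplet-alignment family in particular, the basic recipe is a CLIP-style contrastive objective that pulls a point cloud's embedding toward its paired image and text embeddings. The following is a generic sketch of that objective; the temperature value and symmetric formulation are assumptions, and individual methods in the survey differ in detail:

```python
import torch
import torch.nn.functional as F

def triplet_alignment_loss(point_emb: torch.Tensor,
                           image_emb: torch.Tensor,
                           text_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Contrastively align point-cloud embeddings with paired image and text embeddings.

    All inputs have shape (B, D); row i of each tensor corresponds to the same sample.
    """
    point_emb = F.normalize(point_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    targets = torch.arange(point_emb.size(0), device=point_emb.device)

    def info_nce(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        logits = a @ b.t() / temperature  # (B, B) similarity matrix
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    # Align point clouds with both the 2D image and the text modality.
    return info_nce(point_emb, image_emb) + info_nce(point_emb, text_emb)
```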
The survey also discusses integrating Large Language Models (LLMs) with 2D FMs. Models like Cap3D enhance object-level captioning, while LL3DA and RegionBLIP improve scene-level understanding. Uni3D-LLM decomposes point clouds into objects and utilizes LLMs for cognitive tasks. Future directions include developing more robust 3D FMs, creating larger 3D datasets, and improving data and resource efficiency.
Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification by Xiangyu Sun, Xiaoguang Zou, Yuanquan Wu, Guotai Wang, Shaoting Zhang https://arxiv.org/abs/2501.19086
Caption: The figure presents an analysis of fairness metrics for different CLIP-based models applied to X-ray image classification. It shows (a) the variance of F1 scores across diseases, (b) F1 scores across demographic groups, and fairness metrics after (c) full fine-tuning and (d) zero-shot application, highlighting disparities across models and demographic attributes like age and gender.
This study investigates the fairness implications of CLIP-like models for X-ray image classification, focusing on demographic attributes. Researchers evaluated several models, including CLIP, GLORIA, MedCLIP, and BioMedCLIP, on a balanced dataset, using zero-shot inference and four fine-tuning strategies: Linear Probing (LP), Multilayer Perceptron (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning (FT). Performance was evaluated using utility and fairness metrics, including the variance of F1 scores across diseases: $\mathrm{Var}_{F1} = \frac{1}{C} \sum_{c=1}^{C} (F1_c - \overline{F1})^2$, where $C$ is the number of disease categories, $F1_c$ is the F1 score for category $c$, and $\overline{F1}$ is the mean F1 score across categories. Fairness across demographics was measured using F1 Gap (F1△), Equalized Odds (EqOdds), and Gap of Expected Calibration Error (ECE△).
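For reference, the F1-variance and group-gap metrics are straightforward to compute from per-class and per-group predictions. The sketch below uses scikit-learn's `f1_score` and assumes binary demographic groups and macro-averaged F1 purely for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score

def variance_of_f1(y_true, y_pred, num_classes: int) -> float:
    """Var_F1 = (1/C) * sum_c (F1_c - mean F1)^2 over the C disease categories."""
    per_class = f1_score(y_true, y_pred, labels=list(range(num_classes)), average=None)
    return float(np.mean((per_class - per_class.mean()) ** 2))

def f1_gap(y_true, y_pred, group) -> float:
    """Absolute F1 gap between two demographic groups (e.g., female vs. male)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    f1_a = f1_score(y_true[group == 0], y_pred[group == 0], average="macro")
    f1_b = f1_score(y_true[group == 1], y_pred[group == 1], average="macro")
    return float(abs(f1_a - f1_b))
```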
Results showed that while fine-tuning improved accuracy, fairness disparities persisted. MedCLIP achieved the highest accuracy but exhibited the largest fairness gaps across gender and age. GLORIA demonstrated the highest overall fairness. Fine-tuning improved fairness for GLORIA but worsened disparities for other models, suggesting inherent model biases. This study highlights the need for fairness interventions in medical AI, emphasizing the importance of mitigating biases to ensure equitable healthcare outcomes.
This newsletter has highlighted the exciting progress and persistent challenges in multimodal AI. We've seen how LLMs are revolutionizing data augmentation, how foundation models are enhancing adaptation and generalization, and how novel paradigms like "Perceive Everything as Pixels" are challenging conventional approaches. The creative application of animating images with music demonstrates the expanding possibilities of multimodal AI, while the development of rigorous benchmarks like MedXpertQA and the exploration of 3D foundational models underscore the ongoing need for robust evaluation and advancement. Finally, the critical examination of fairness in medical AI reminds us of the ethical considerations that must accompany technological progress. These diverse studies collectively paint a picture of a dynamic field brimming with potential, driving towards more versatile, robust, and equitable AI systems.