This newsletter explores the cutting-edge research in multimodal image and text foundation models, showcasing innovative approaches to improve efficiency, accuracy, and applicability across diverse tasks. From revolutionizing salient object detection with minimal manual annotation to developing unified models for complex image generation and grounded video understanding, these papers push the boundaries of what's possible with multimodal AI. We'll delve into new architectures, datasets, and training strategies that are shaping the future of this exciting field.
Boosting Salient Object Detection with Knowledge Distilled from Large Foundation Models by Miaoyang He, Shuyong Gao, Tsui Qin Mok, Weifeng Ge, Wengqiang Zhang https://arxiv.org/abs/2501.04582
Caption: This diagram illustrates a novel weakly supervised Salient Object Detection (SOD) pipeline. It leverages a fine-tuned BLIP model for text generation and GroundingDINO for bounding box creation, feeding into SAM for pseudo-label generation. The lower section details the mask generation process using a novel edge-preserving decoder (DEDecoder) incorporating image and text features.
Salient Object Detection (SOD), the task of identifying and segmenting the most visually prominent objects in an image, traditionally relies on expensive and time-consuming manual annotation. This paper proposes a novel weakly supervised approach that leverages the knowledge distilled from large foundation models to generate high-precision pseudo-labels with minimal manual effort. The method employs a text-guided approach, fine-tuning a pre-trained BLIP model with a small set of manually annotated text descriptions of salient objects. These descriptions are then used with GroundingDINO to generate bounding boxes, which are subsequently fed into SAM to create accurate pseudo-labels. This pipeline drastically reduces the manual annotation burden while maintaining high label quality.
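The paper's exact checkpoints and prompts aren't reproduced here, but the labeling pipeline can be sketched with off-the-shelf Hugging Face components (BLIP caption → GroundingDINO box → SAM mask). Treat the model names and post-processing calls below as assumptions that may differ from the authors' fine-tuned setup and your transformers version:

```python
import torch
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    AutoProcessor, GroundingDinoForObjectDetection,
    SamProcessor, SamModel,
)

image = Image.open("example.jpg").convert("RGB")

# 1) Text description of the salient object (the paper fine-tunes BLIP for this step).
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
caption = blip_proc.decode(
    blip.generate(**blip_proc(image, return_tensors="pt"))[0],
    skip_special_tokens=True,
)

# 2) Ground the description to a bounding box with GroundingDINO.
gd_proc = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
gdino = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")
gd_inputs = gd_proc(images=image, text=caption.lower() + ".", return_tensors="pt")
with torch.no_grad():
    gd_out = gdino(**gd_inputs)
result = gd_proc.post_process_grounded_object_detection(
    gd_out, gd_inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
best_box = result["boxes"][result["scores"].argmax()].tolist()

# 3) Prompt SAM with the box to obtain the pseudo-label mask.
sam_proc = SamProcessor.from_pretrained("facebook/sam-vit-base")
sam = SamModel.from_pretrained("facebook/sam-vit-base")
sam_inputs = sam_proc(image, input_boxes=[[best_box]], return_tensors="pt")
with torch.no_grad():
    sam_out = sam(**sam_inputs)
pseudo_mask = sam_proc.image_processor.post_process_masks(
    sam_out.pred_masks.cpu(),
    sam_inputs["original_sizes"].cpu(),
    sam_inputs["reshaped_input_sizes"].cpu(),
)[0]
```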
Addressing the limitations of existing SOD datasets in terms of scale and diversity, the authors introduce a new dataset, BDS-TR. Expanding on the DUTS-TR dataset, BDS-TR boasts approximately 260,000 images, encompassing over 960 major object categories and over 3,000 subcategories. This larger and more diverse dataset aims to improve the generalization capabilities of SOD models, enabling them to perform effectively across a broader range of real-world scenarios.
Furthermore, the paper introduces DEDecoder, a novel edge-preserving decoder based on dynamic upsampling. Inspired by previous work, the DEDecoder progressively restores feature resolution during the decoding phase. A multi-scale edge-preserving module is incorporated to better recover structural information and reinforce boundary details, further enhancing the accuracy of the generated saliency maps. The combined loss function used is defined as: L<sub>final</sub> = α<sub>1</sub>L<sub>bce</sub> + α<sub>2</sub>L<sub>pbce</sub> + α<sub>3</sub>L<sub>IoU</sub>, where L<sub>bce</sub> is the binary cross-entropy loss, L<sub>pbce</sub> is the partial cross-entropy loss, L<sub>IoU</sub> is the Intersection over Union loss, and α<sub>1</sub>, α<sub>2</sub>, and α<sub>3</sub> are weights (set to 1 in the experiments).
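As a minimal sketch of that combined objective (the partial-BCE term is assumed to be BCE restricted to pixels flagged as reliable in the pseudo-label, and the IoU term a standard soft IoU on sigmoid probabilities):

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, target, eps=1e-6):
    # Soft IoU on sigmoid probabilities.
    prob = torch.sigmoid(pred)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def partial_bce(pred, target, valid_mask, eps=1e-6):
    # BCE computed only over pixels marked as reliable in the pseudo-label.
    loss = F.binary_cross_entropy_with_logits(pred, target, reduction="none")
    return (loss * valid_mask).sum() / (valid_mask.sum() + eps)

def final_loss(pred, pseudo_label, valid_mask, a1=1.0, a2=1.0, a3=1.0):
    # L_final = a1*L_bce + a2*L_pbce + a3*L_IoU, with all weights set to 1 in the paper.
    l_bce = F.binary_cross_entropy_with_logits(pred, pseudo_label)
    l_pbce = partial_bce(pred, pseudo_label, valid_mask)
    l_iou = iou_loss(pred, pseudo_label)
    return a1 * l_bce + a2 * l_pbce + a3 * l_iou
```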
Self-adaptive vision-language model for 3D segmentation of pulmonary artery and vein by Xiaotong Guo, Deqian Yang, Dan Wang, Haochen Zhao, Yuan Li, Zhilin Sui, Tao Zhou, Lijun Zhang, Yanda Meng https://arxiv.org/abs/2501.03722
Caption: This diagram illustrates a novel self-adaptive vision-language model for 3D pulmonary artery and vein segmentation. It leverages pre-trained CLIP and U-Net models, coupled with adapter modules and a cross-attention mechanism, to fuse text embeddings from medical prompts with image features extracted from CT scans. The framework also incorporates a data augmentation strategy using fully and partially labeled scans to enhance performance.
Accurate 3D segmentation of pulmonary arteries and veins from CT scans is vital for diagnosis and treatment of pulmonary vascular diseases. This research introduces a self-adaptive vision-language model that leverages pre-trained models like CLIP to achieve remarkable segmentation accuracy with limited labeled data. The core of the proposed framework is a language-guided self-adaptive cross-attention fusion mechanism. A pre-trained CLIP encoder generates text embeddings from a medical prompt (e.g., "A computerized tomography of a category with small branches"), while a pre-trained U-Net encodes the 3D CT scan into a feature map. Adapter modules fine-tune both CLIP and U-Net embeddings, allowing the model to adapt to the specific characteristics of medical image data. A cross-attention module, represented by the formula: f<sub>CA</sub>(H<sup>t</sup>, H<sup>v</sup>) = softmax(q(H<sup>t</sup>)k(H<sup>t</sup> + H<sup>v</sup>)<sup>T</sup>/√d<sub>k</sub>)v(H<sup>v</sup>), fuses the text and image embeddings, leveraging contextual clues from both modalities.
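A minimal sketch of that fusion step is below. It assumes the adapted text embeddings H<sup>t</sup> and the flattened volumetric features H<sup>v</sup> share the same channel dimension, and that the text context is broadcast onto the voxel tokens before the key projection; the shapes and the pooling choice are assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class TextVisionCrossAttention(nn.Module):
    """Sketch of f_CA(H_t, H_v) = softmax(q(H_t) k(H_t + H_v)^T / sqrt(d_k)) v(H_v)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, h_t: torch.Tensor, h_v: torch.Tensor) -> torch.Tensor:
        # h_t: (B, T, d) adapted CLIP text embeddings (one per prompt/class)
        # h_v: (B, N, d) flattened 3D U-Net features (N = D*H*W patches)
        q = self.q(h_t)                                   # (B, T, d)
        # Broadcast a pooled text context onto every voxel token (assumption).
        k = self.k(h_t.mean(dim=1, keepdim=True) + h_v)   # (B, N, d)
        v = self.v(h_v)                                   # (B, N, d)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, T, N)
        return attn @ v                                   # (B, T, d) text-conditioned features
```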
The researchers also introduce a novel data augmentation strategy. They compiled a dataset of 718 3D CT scans, including fully and half-labeled scans (where only one lung is annotated). This combined dataset, the largest of its kind for pulmonary A/V segmentation, enables the model to learn from a wider range of cases. A label augmentation technique converts the three-class segmentation task (background, artery, vein) into a five-class task, differentiating between left and right vessels, further enhancing accuracy.
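Since the exact label convention is not spelled out above, here is a hedged sketch of the three-to-five-class conversion, assuming integer label maps and a precomputed left-lung mask (the class codes below are hypothetical):

```python
import numpy as np

# Assumed integer codes for the original and augmented label maps.
BG, ARTERY, VEIN = 0, 1, 2
L_ARTERY, R_ARTERY, L_VEIN, R_VEIN = 1, 2, 3, 4

def augment_labels(label: np.ndarray, left_lung: np.ndarray) -> np.ndarray:
    """Convert a 3-class A/V label map into a 5-class left/right map.

    label:     (D, H, W) voxels with values in {BG, ARTERY, VEIN}
    left_lung: (D, H, W) boolean mask of the left lung region
    """
    out = np.zeros_like(label)
    out[(label == ARTERY) & left_lung] = L_ARTERY
    out[(label == ARTERY) & ~left_lung] = R_ARTERY
    out[(label == VEIN) & left_lung] = L_VEIN
    out[(label == VEIN) & ~left_lung] = R_VEIN
    return out
```

For the half-labeled scans, the same left/right split can be used to mask the loss on the unannotated lung, which is one plausible way to combine the fully and partially labeled data.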
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token by Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng https://arxiv.org/abs/2501.03895
Caption: LLaVA-Mini introduces a novel architecture for multimodal models, compressing visual input into a single token for efficient processing. The model utilizes modality pre-fusion to integrate visual information with text instructions before feeding them into the LLM, achieving comparable performance to larger models while significantly reducing computational cost. This diagram illustrates the workflow of LLaVA-Mini, highlighting the compression and pre-fusion stages.
LLaVA-Mini addresses the computational cost of Large Multimodal Models (LMMs) by drastically reducing the number of vision tokens to just one, while maintaining performance comparable to more computationally intensive models. Through layer-wise analysis, the researchers found that vision tokens matter primarily in the early layers of the LLM, where visual information is fused into the text tokens. LLaVA-Mini therefore introduces modality pre-fusion, implemented as Transformer blocks, to fuse visual information directly into the instruction text before it reaches the LLM; this pre-fusion is what allows the subsequent drastic reduction in vision tokens. A compression module then condenses the visual information into a single token Ĥ using learnable queries and cross-attention: Ĥ = A · H', where A = Softmax((Q' + PE(Q')) · (H' + PE(H'))<sup>T</sup>). Here, H' are the vision tokens, Q' are the learnable queries, PE is positional encoding, and Ĥ is the resulting compressed vision token. This single token, along with the pre-fused text tokens, is then fed into the LLM.
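A minimal sketch of the query-based compression is shown below, following the formula above (learnable queries attend over the vision tokens, and the attention-weighted sum of the original tokens becomes the compressed token). The module layout and the absence of extra projections are assumptions:

```python
import torch
import torch.nn as nn

class VisionTokenCompressor(nn.Module):
    """Compress N vision tokens into a single token via learnable-query cross-attention."""
    def __init__(self, dim: int, num_queries: int = 1):
        super().__init__()
        self.query = nn.Parameter(torch.randn(num_queries, dim) * 0.02)  # Q'

    def forward(self, h_v: torch.Tensor, pe_v: torch.Tensor, pe_q: torch.Tensor) -> torch.Tensor:
        # h_v: (B, N, d) vision tokens; pe_v: (N, d) and pe_q: (num_queries, d) positional encodings
        q = (self.query + pe_q).unsqueeze(0)                   # (1, Q, d)
        k = h_v + pe_v                                         # (B, N, d)
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)    # A: (B, Q, N)
        return attn @ h_v                                      # Ĥ = A · H': (B, Q, d); Q=1 gives one token
```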
EditAR: Unified Conditional Generation with Autoregressive Models by Jiteng Mu, Nuno Vasconcelos, Xiaolong Wang https://arxiv.org/abs/2501.04699
Caption: The architecture of EditAR, a unified autoregressive model for diverse conditional image generation tasks. It processes both image and text inputs through separate encoders, combining them in an autoregressive model to generate edited or translated images. A distillation loss from a vision foundation model and classifier-free guidance enhance image quality and alignment with text instructions.
EditAR offers a unified autoregressive model for various conditional image generation tasks, moving away from task-specific diffusion models. Built upon LlamaGen, EditAR incorporates image and text inputs, using a two-stage process: a VQVAE encodes image patches into tokens, and an autoregressive transformer models the probability distribution of output tokens conditioned on text and image inputs. A key innovation is the distillation loss, leveraging knowledge from vision foundation models like DINOv2 to improve visual coherence and text-image alignment. Classifier-free guidance is used during inference to further enhance quality and alignment.
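Two of these ingredients are easy to sketch generically. Below, classifier-free guidance is applied to next-token logits, and the distillation term is written as a cosine distance to frozen teacher features; the `model` callable and both function interfaces are hypothetical stand-ins, not EditAR's actual API:

```python
import torch.nn.functional as F

def cfg_logits(model, prefix_tokens, cond_text, null_text, scale=3.0):
    """Classifier-free guidance for autoregressive decoding (sketch).
    `model` returns next-token logits given the token prefix and a text embedding;
    `null_text` is the embedding of an empty prompt."""
    cond = model(prefix_tokens, cond_text)      # conditional prediction
    uncond = model(prefix_tokens, null_text)    # unconditional prediction
    return uncond + scale * (cond - uncond)

def feature_distillation_loss(pred_feats, teacher_feats):
    """Align features of the generated image with those of a frozen vision
    foundation model (e.g. DINOv2) via cosine distance (sketch)."""
    return 1 - F.cosine_similarity(pred_feats, teacher_feats, dim=-1).mean()
```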
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos by Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang https://arxiv.org/abs/2501.04001
Sa2VA is a unified model for dense grounded understanding of images and videos. It combines SAM-2's segmentation capabilities with LLaVA's multimodal understanding, unifying text, image, and video into a shared LLM token space. The LLM generates instruction tokens that guide SAM-2 in producing precise masks, enabling grounded understanding of visual content. A key innovation is representing tasks like referring segmentation and grounded conversation generation as a single instruction-tuning process: T<sub>o</sub>, M<sub>o</sub> = LLM({I<sub>i</sub>, V<sub>i</sub>, VP<sub>i</sub>}, T<sub>i</sub>). A "[SEG]" token's hidden state serves as a spatial-temporal prompt for SAM-2, guiding its segmentation based on the LLM's textual understanding. The authors also introduce Ref-SAV, a new referring video object segmentation dataset with over 72k object expressions in complex video scenes.
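The "[SEG]"-token mechanism can be sketched as follows: locate the generated "[SEG]" tokens, take their last-layer hidden states, and project them into the prompt space of the mask decoder. The function signature and the projection layer are assumptions for illustration, not Sa2VA's actual interface:

```python
import torch

def seg_prompts_from_hidden_states(hidden_states, output_ids, seg_token_id, proj):
    """Turn generated "[SEG]" tokens into mask-decoder prompts (hedged sketch).

    hidden_states: (T, d_llm) last-layer hidden states of the generated sequence
    output_ids:    (T,) generated token ids
    seg_token_id:  vocabulary id assigned to the special "[SEG]" token
    proj:          projection (e.g. a torch.nn.Linear) from d_llm into the
                   spatial-temporal prompt space expected by SAM-2
    """
    seg_positions = (output_ids == seg_token_id).nonzero(as_tuple=True)[0]
    seg_states = hidden_states[seg_positions]   # (num_seg, d_llm)
    return proj(seg_states)                     # (num_seg, d_prompt) prompt embeddings
```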
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis by Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang https://arxiv.org/abs/2501.04561
Caption: The image illustrates the architecture of OpenOmni, an open-source omnimodal large language model. It showcases three key functionalities: speech-to-text generation, image-to-text generation, and speech generation, all leveraging a central Omni Language Model and specialized encoders/decoders for different modalities. The consistent use of a dog image and example speech/text demonstrates the model's ability to process and generate information across various modalities.
OpenOmni is an open-source omnimodal large language model (OLLM) that addresses the challenges of limited datasets and real-time emotional speech generation. It utilizes a two-stage training method: omnimodal alignment and speech generation. Using language as a pivot, OpenOmni leverages existing speech-text and image-text data for implicit omnimodal alignment. A lightweight streaming speech decoder facilitates real-time, emotion-rich speech generation. Direct Emotional Preference Optimization (DEPO) is employed to enhance emotional expressiveness.
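The paper names Direct Emotional Preference Optimization but its exact formulation is not reproduced here; assuming it follows the standard DPO template with emotion-preferred versus dispreferred speech responses, a minimal loss sketch looks like this:

```python
import torch.nn.functional as F

def depo_style_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style preference loss (sketch, assumed formulation). Inputs are summed
    log-probabilities of the emotion-preferred ("chosen") and dispreferred
    ("rejected") speech-token sequences under the policy and a frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```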
This newsletter has highlighted significant advancements in multimodal image and text foundation models. We've seen a push towards greater efficiency with LLaVA-Mini's single vision token approach, a move towards unified architectures with EditAR handling diverse image generation tasks, and groundbreaking work in grounded video understanding with Sa2VA. The introduction of new datasets like BDS-TR and Ref-SAV provides crucial resources for future research, while OpenOmni's open-source nature and focus on emotional speech generation opens exciting avenues for broader accessibility and application. These diverse yet interconnected advancements underscore the rapid pace of innovation in the field, promising more sophisticated and impactful multimodal AI systems in the near future.