This newsletter dives into the cutting-edge advancements in multimodal image and text foundation models, exploring novel approaches to misalignment detection, disease diagnosis, and controllable image generation. We'll dissect four recent papers that leverage the power of large language models and innovative architectural designs to push the boundaries of what's possible in this exciting field. From zero-shot misalignment detection with CLIP to LLM-powered glottic carcinoma diagnosis, and unified controllable image generation with diffusion transformers, this newsletter offers a comprehensive overview of the latest developments you won't want to miss.
Extract Free Dense Misalignment from CLIP by JeongYeon Nam, Jinbae Im, Wonjae Kim, Taeho Kil https://arxiv.org/abs/2412.18404
Caption: A man is skiing down a snow-covered slope. He wears a teal jacket and gray pants, using ski poles and skis to navigate the descent.
Despite advancements, vision-language models still struggle to perfectly align generated outputs with their inputs. This misalignment can manifest as object hallucination in captioning or prompt misalignment in text-to-image generation. Existing detection methods often rely on resource-intensive large language models or require fine-tuning with human-annotated data, limiting their scalability. This paper introduces CLIP4DM, a novel zero-shot approach leveraging the pre-trained CLIP model for efficient detection of dense misalignments, specifically targeting misaligned words between image and text.
CLIP4DM introduces a revamped gradient-based attribution computation method. Unlike traditional methods that primarily focus on positive gradients, CLIP4DM incorporates negative gradients of individual text tokens as indicators of misalignment. By removing the ReLU operation typically used in gradient-based attribution, negative gradients are allowed to contribute to the explanation of the model's behavior. These attributions are then averaged across layers to improve interpretability. Word-level attributions are calculated by averaging the attributions of their constituent tokens. A word is classified as misaligned if its attribution falls below a predefined threshold (ε).
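To make the idea concrete, here is a minimal sketch of signed, ReLU-free token attribution on top of CLIP. This is a hedged simplification, not the authors' exact pipeline: it uses gradient-times-input on the text token embeddings rather than the paper's layer-averaged attention-gradient attributions, and the checkpoint name and threshold value are illustrative assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper reports results with larger CLIP variants as well.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def token_misalignment(image, text, eps=-0.05):
    """Return (token, attribution, is_misaligned) triples for one image-text pair."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)

    cache = {}
    def grab(module, inp, out):
        out.retain_grad()                      # keep the gradient of this non-leaf tensor
        cache["tok"] = out
    handle = model.text_model.embeddings.register_forward_hook(grab)

    sim = model(**inputs).logits_per_image[0, 0]   # scaled image-text similarity
    sim.backward()
    handle.remove()

    tok, grad = cache["tok"], cache["tok"].grad    # each of shape (1, seq_len, dim)
    attr = (grad * tok).sum(dim=-1)[0]             # signed gradient-x-input; no ReLU, so negatives survive
    tokens = processor.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # Tokens whose attribution falls below the threshold ε are flagged as misaligned.
    return [(t, a.item(), a.item() < eps) for t, a in zip(tokens, attr)]
```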
Furthermore, the paper introduces F-CLIPScore, a novel metric that combines the overall CLIP similarity score with the attributions of misaligned words: F-CLIPScore(v, t) = (1 − score<sub>v,t</sub>) · Σ<sub>j</sub> mis(w<sub>j</sub>) · a<sub>j</sub>, where score<sub>v,t</sub> is the CLIP similarity score between image v and text t, a<sub>j</sub> is the attribution of word w<sub>j</sub>, and mis(w<sub>j</sub>) is an indicator function equal to 1 if the word is misaligned and 0 otherwise.
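In code, the metric as stated above is a one-liner over word-level attributions. The sketch assumes a {word: attribution} mapping (for example, the token attributions from the previous snippet averaged per word) and an illustrative threshold ε.

```python
def f_clipscore(clip_score, word_attrs, eps=-0.05):
    """F-CLIPScore as described above: (1 - score) times the summed attributions
    of words flagged as misaligned (mis(w_j) = 1 when the attribution is below ε)."""
    misaligned_sum = sum(a for a in word_attrs.values() if a < eps)
    return (1.0 - clip_score) * misaligned_sum

# Hypothetical usage: a larger-magnitude result signals stronger dense misalignment.
# f_clipscore(0.31, {"man": 0.12, "teal": -0.09, "skis": 0.05})
```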
Evaluations on various benchmarks like FOIL, nocaps-FOIL, HAT, SeeTRUE-Feedback, and Rich-HF, covering diverse image and text domains, demonstrate CLIP4DM's effectiveness. It achieves state-of-the-art performance among zero-shot models on FOIL and nocaps-FOIL, and competitive performance with fine-tuned models while maintaining superior efficiency. Notably, the ViT-H/14 variant of CLIP4DM shows particularly strong results. Qualitative examples highlight its ability to detect entity-level objects, intangible objects, and attributes, which often pose challenges for existing methods. While the method inherits some limitations from CLIP, such as difficulty with backgrounds or small objects, CLIP4DM offers a promising direction for efficient and interpretable dense misalignment detection.
VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early Detection by Zhaohui Jin, Yi Shuai, Yongcheng Li, Lingcong Cai, Yun Li, Huifen Liu, Xiaomao Fan https://arxiv.org/abs/2412.18124
Caption: The architecture of MMGC-Net, a VisionLLM-powered multimodal fusion network for glottic carcinoma detection, is illustrated. It processes laryngoscopic images and clinical reports using separate encoders (BLIP-2/Q-Former for images, Llama3 for text) before fusing the embeddings for classification. The model outputs a prediction ŷᵢ indicating the likelihood of glottic carcinoma.
Early detection of glottic carcinoma is paramount for successful treatment, but differentiating it from vocal cord dysplasia remains a challenge. This paper introduces MMGC-Net, a VisionLLM-based multimodal fusion network that integrates image and text data for improved glottic carcinoma detection.
MMGC-Net utilizes a three-pronged strategy. A laryngoscopic image encoder, based on a pre-trained BLIP-2 model and a Q-Former, extracts image embeddings. Concurrently, the Llama3 large language model processes clinical reports to generate text embeddings. These embeddings are then fused within a laryngeal feature fusion block, which projects both modalities into a unified representation space and concatenates them. The resulting joint feature vector is fed to a classifier for final prediction. The model is trained using the cross-entropy loss L<sub>CE</sub> = − Σ<sub>i=1</sub><sup>C</sup> yᵢ log(ŷᵢ), where C is the number of classes, ŷᵢ is the predicted probability for class i, and y is the one-hot encoded ground-truth label.
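The project-concatenate-classify pattern described above can be sketched in a few lines of PyTorch. The embedding dimensions, layer names, and two-layer classifier head below are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, img_dim=768, txt_dim=4096, joint_dim=512, num_classes=2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)   # project the Q-Former image embedding
        self.txt_proj = nn.Linear(txt_dim, joint_dim)   # project the Llama3 text embedding
        self.classifier = nn.Sequential(                # illustrative classification head
            nn.Linear(2 * joint_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, num_classes),
        )

    def forward(self, img_emb, txt_emb):
        # Map both modalities into a shared space, concatenate, and classify.
        joint = torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1)
        return self.classifier(joint)                   # logits over the two classes

# Training uses the standard cross-entropy loss shown above.
criterion = nn.CrossEntropyLoss()
```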
The authors also introduce SYSU1H, a new dataset of 5,799 laryngoscopic image-clinical report pairs, each categorized as either vocal cord dysplasia or glottic carcinoma. Evaluation on SYSU1H demonstrates MMGC-Net's superior performance, achieving 76.10% accuracy, 76.70% precision, 76.16% recall, and a 74.41% F1 score. This represents a substantial improvement over existing multimodal models, including an 8.86% increase in accuracy and a 9.90% increase in recall compared to the second-best performing model. Ablation studies confirm the critical role of multimodal fusion, with models using only image or text data performing significantly worse. The introduction of MMGC-Net and the SYSU1H dataset represents a significant advancement in glottic carcinoma detection, paving the way for earlier and more effective intervention.
CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting by Siyu Jiao, Haoye Dong, Yuyang Yin, Zequn Jie, Yinlong Qian, Yao Zhao, Humphrey Shi, Yunchao Wei https://arxiv.org/abs/2412.19142
Caption: CLIP-GS leverages 3D Gaussian Splatting (3DGS) to learn multimodal representations by aligning 3DGS, image, and text embeddings using a novel GS tokenizer and transformer layers. This architecture enables CLIP-GS to achieve state-of-the-art performance in 3D multimodal retrieval and classification tasks, surpassing existing point cloud-based methods. The image details the architecture of CLIP-GS, showcasing the flow from 3DGS input to the final embedding E^G, highlighting the GS Tokenizer and the transformer layers.
While 3D multimodal learning has advanced, existing models primarily rely on point clouds, which struggle to capture texture information vital for 3D reconstruction. CLIP-GS addresses this limitation by leveraging 3D Gaussian Splatting (3DGS), a richer representation that incorporates texture.
CLIP-GS aligns 3DGS embeddings with CLIP's visual and textual embeddings. A novel GS Tokenizer converts 3DGS into serialized gaussian tokens, which are processed by transformer layers (pre-initialized with weights from point cloud models) to generate gaussian features. To handle varying viewpoints in rendered images, CLIP-GS introduces an image voting loss, L<sub>img</sub>, leveraging CLIP's image-text alignment. This voting strategy on the image contrastive loss guides gradient optimization and improves convergence. The loss is defined as: L<sub>img</sub> = (1/N) Σ<sub>i</sub> 2S<sub>i</sub> (Contra(E<sup>S</sup><sub>G</sub>, E<sup>T</sup><sub>i</sub>) + Contra(E<sup>I</sup><sub>i</sub>, E<sup>G</sup><sub>i</sub>)), where N is the batch size, S<sub>i</sub> is the voting score, Contra(·, ·) is the contrastive loss function, and E<sup>S</sup><sub>G</sub>, E<sup>T</sup><sub>i</sub>, E<sup>I</sup><sub>i</sub>, and E<sup>G</sup><sub>i</sub> are the normalized embeddings of the 3DGS, text, image, and gaussian features, respectively.
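At its core, the voting idea amounts to weighting a contrastive loss by a per-sample score. The sketch below is a hedged approximation under that reading: a symmetric InfoNCE loss between gaussian and image embeddings, down-weighted by voting scores. The exact pairing of embeddings, weighting factor, and temperature are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def voting_contrastive_loss(gauss_emb, img_emb, votes, temperature=0.07):
    # gauss_emb, img_emb: (N, D) L2-normalized embeddings; votes: (N,) scores in [0, 1].
    logits = gauss_emb @ img_emb.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(gauss_emb.size(0), device=logits.device)
    # Per-sample symmetric cross-entropy (gaussian -> image and image -> gaussian).
    loss_g2i = F.cross_entropy(logits, targets, reduction="none")
    loss_i2g = F.cross_entropy(logits.t(), targets, reduction="none")
    per_sample = 0.5 * (loss_g2i + loss_i2g)
    # Down-weight rendered views with low voting scores so they contribute less gradient.
    return (votes * per_sample).mean()
```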
Trained on approximately 240K triplets of 3DGS, images, and text generated from the Objaverse dataset, CLIP-GS achieves state-of-the-art performance. Evaluations on multimodal retrieval, zero-shot, and few-shot 3D classification demonstrate significant improvements over point cloud-based models. These results showcase the potential of 3DGS for 3D multimodal learning and establish CLIP-GS as a new state-of-the-art in the field.
UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation by Lunhao Duan, Shanshan Zhao, Wenjun Yan, Yinglun Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Mingming Gong, Gui-Song Xia https://arxiv.org/abs/2412.18928
Caption: The UNIC-Adapter architecture unifies controllable image generation by integrating task instructions (e.g., text prompts) and conditional images (e.g., style, depth) within a single model. It leverages a cross-attention mechanism enhanced by ROPE to inject multi-modal information into the MM-DiT generation process, enabling flexible control over pixel layouts, object appearances, and global styles. The diagram showcases how different conditional inputs are processed and integrated to influence the final generated image.
While text-to-image generation has progressed remarkably, fine-grained control over pixel layouts, object appearances, and styles remains a challenge. Existing methods often require specialized models for different types of conditional image inputs. The unified image-instruction adapter (UNIC-Adapter) addresses this by enabling flexible and controllable generation across diverse conditions within a single framework.
Built upon the Multi-Modal Diffusion Transformer (MM-DiT) architecture, UNIC-Adapter incorporates both conditional images and task instructions, injecting this information into the image generation process through a cross-attention mechanism enhanced by Rotary Position Embedding (ROPE). Task instructions and conditional image features are extracted by text encoders and a VAE, respectively. The adapter processes these features using MM-DiT blocks and integrates them into the generation branch via cross-attention, with ROPE providing spatial awareness for pixel-level control. The core feature injection is: Z<sub>img</sub> = Z<sub>img</sub> + Attn(Q<sub>img</sub>, [K<sub>ist</sub> || K<sub>con</sub>], [V<sub>ist</sub> || V<sub>con</sub>]), where the image features Z<sub>img</sub> are updated by attending to task-instruction features Z<sub>ist</sub> and conditional-image features Z<sub>con</sub>.
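The injection step can be sketched as a residual cross-attention block in which image tokens attend to the concatenation of instruction and condition tokens. The single-head formulation below omits ROPE and multi-head splitting for brevity; the dimensions and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionCrossAttention(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, z_img, z_ist, z_con):
        # z_img: (B, N_img, D) image tokens; z_ist / z_con: instruction / condition tokens.
        q = self.q(z_img)
        kv_in = torch.cat([z_ist, z_con], dim=1)           # [K_ist || K_con] and [V_ist || V_con]
        k, v = self.k(kv_in), self.v(kv_in)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return z_img + attn @ v                            # residual update of the image tokens
```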
Experimental results across various tasks, including pixel-level spatial control, subject-driven image generation, and style-image-based synthesis, demonstrate UNIC-Adapter's effectiveness. It achieves state-of-the-art performance on certain pixel-level tasks and demonstrates strong performance across other tasks, outperforming baseline methods like ControlNet and IPAdapter-Instruct in its unified approach. Ablation studies confirm the importance of design choices, including the use of both K<sub>con</sub> and K<sub>ist</sub> and the inclusion of ROPE. UNIC-Adapter represents a promising step towards truly unified controllable image generation, simplifying training and enhancing flexibility.
This newsletter highlighted significant progress in multimodal image and text foundation models. From CLIP4DM's innovative approach to misalignment detection, leveraging negative gradients and introducing the F-CLIPScore, to MMGC-Net's powerful application of VisionLLM technology in medical diagnosis, and CLIP-GS's exploitation of 3D Gaussian Splatting for enhanced 3D multimodal learning, we see a clear trend towards more sophisticated and nuanced understanding of the interplay between images and text. Furthermore, UNIC-Adapter offers a compelling vision for the future of controllable image generation, unifying diverse conditional inputs within a single framework. These advancements collectively point towards a future where multimodal models are not only more powerful and efficient but also more interpretable and controllable, opening up exciting possibilities across various domains.