Multimodal and Multimodal Large Language Models (MLLMs): Study Notes


Multimodal Basics

Keywords:

  • Early:
    • VLP (Vision-and-Language Pre-training)
    • VLMs (Vision Language Models)
  • Current:
    • Multimodal Pre-training

Architectures:

  • dual-encoder
    • Good for understanding tasks; emphasizes per-modality feature extraction (a contrastive-training sketch follows this list)
    • e.g. CLIP
  • encoder-decoder
    • Good for generation tasks; emphasizes modality interaction
    • e.g. SimVLM, ALBEF, BLIP
  • fusion-encoder
    • e.g. ViLT, VLMo, BEiT-3
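
As a rough illustration of the dual-encoder recipe, here is a minimal CLIP-style contrastive loss sketch (PyTorch; the batch layout, dimensions, and temperature value are arbitrary assumptions, not taken from any specific paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch, dim) tensors produced by two
    separate encoders (e.g. a ViT and a text Transformer) -- the dual-encoder setup.
    """
    # Normalize so that dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the matched pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image classification losses, averaged.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```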

Tasks:

  • Vision
  • Language
  • Vision-Language

Tutorials

Milestones

CLIP (2103)

Learning Transferable Visual Models From Natural Language Supervision
https://arxiv.org/abs/2103.00020

ViLT (2102)

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
https://arxiv.org/abs/2102.03334



ALBEF (2107)

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
https://arxiv.org/abs/2107.07651


VLMo (2111)

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
https://arxiv.org/abs/2111.02358



BLIP (2201)

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
https://arxiv.org/abs/2201.12086


CoCa (2205)

CoCa: Contrastive Captioners are Image-Text Foundation Models
https://arxiv.org/abs/2205.01917



BEiT-3 (2208)

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
https://arxiv.org/abs/2208.10442



Uni-Perceiver Series

Uni-Perceiver, Uni-Perceiver-MoE, Uni-Perceiver-v2

Uni-Perceiver (2112)

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
https://arxiv.org/abs/2112.01522



PaLI Series

PaLI, PaLI-X, PaLI-3

Multimodal Large Language Models | MLLMs

Keywords:

  • LMMs (Large Multimodal Models)
  • MLLMs (Multimodal Large Language Models)

Survey

MM-LLMs (2024.05)

MM-LLMs: Recent Advances in MultiModal Large Language Models




A Survey on MLLMs (2024.11)

A Survey on Multimodal Large Language Models

MME-Survey (2024.12)

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs





Models

LLaVA Series

LLaVA [NeurIPS’23]

paper: Visual Instruction Tuning

  • arXiv:2304.08485v2 [cs.CV] 11 Dec 2023
  • NeurIPS’23 Oral
  • Core idea
    • Earlier LLMs improved zero-shot ability through instruction tuning
    • The paper proposes visual instruction tuning to improve the zero-shot ability of LMMs
  • Performance: mainly benchmarked against GPT-4 (closed-source); surpasses OpenFlamingo (open-source) and BLIP-2 (open-source)
    • LLaVA + GPT-4 (as judge) achieved a new SOTA (92.53) at the time on the multimodal reasoning dataset ScienceQA
  • Model sizes: 13B (default), 7B
  • Training cost: a single node with 8× A100 GPUs, less than one day of training
Related Work
  1. Multimodal Instruction-following Agents

    • Type 1: end-to-end models, e.g. the vision-language navigation task and Habitat (embodied AI); InstructPix2Pix (image editing)
    • Type 2: multi-model coordination (e.g. via LangChain), such as Visual ChatGPT, X-GPT, MM-REACT, VisProg, ViperGPT
  2. Instruction Tuning

    • LLMs: GPT-3, T5, PaLM, OPT
      • Instruction tuning (a simple way to improve zero-shot/few-shot ability): InstructGPT/ChatGPT, FLAN-T5, FLAN-PaLM, OPT-IML
    • LMMs: BLIP-2, FROMAGe, KOSMOS-1, PaLM-E; Flamingo (with its strong zero-shot transfer and in-context learning ability, it can be regarded as the GPT-3 moment of the multimodal field)
    • OpenFlamingo and LLaMA-Adapter enable the then-best open-source LLM, LLaMA, to accept image inputs (but they are not explicitly fine-tuned on vision-language instruction data, and their performance on multimodal tasks often drops)
Contributions
  1. [Synthetic data] Multimodal language-image instruction-following data
    • A key challenge is the lack of vision-language instruction-following data; the paper proposes the first pipeline that uses language-only ChatGPT/GPT-4 to construct such training data from image-text pairs (a toy example of the resulting data format follows this list)
  2. [Model architecture] Large Multimodal Models (LMMs)
    • pretrained Visual Encoder
      • Architecture: CLIP visual encoder ViT-L/14
      • Role: extract visual features from the input image
      • Visual features: the grid features before/after the last Transformer layer
        • features before the last layer: focus more on localized properties and help the model understand specific image details ⭐ (works better)
        • features after the last layer: focus more on global and abstract image properties
    • projection layer
      • Architecture: a single simple linear layer ⭐ (future work could also try the gated cross-attention of Flamingo or the Q-Former of BLIP-2)
      • Role: map visual features into the LLM's word embedding space to obtain visual tokens
      • Note: called the Vision-Language Connector in the later LLaVA-1.5
    • pretrained Language Decoder (LLM)
      • Architecture: Vicuna (the best instruction-following ability on language tasks among open-source models at the time)
      • Role: provide instruction-following and reasoning ability
  3. [New benchmarks] Multimodal instruction-following benchmarks
    • The paper proposes LLaVA-Bench (COCO) and LLaVA-Bench (In-the-Wild) as the first benchmarks for quantitatively evaluating the visual instruction-following ability of LMMs
  4. [Fully open-source] Open-source
    • the generated data
    • the model checkpoints
    • a visual chat demo
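
For intuition, here is a hypothetical example of what one generated visual instruction-following sample could look like; the file path, field names, and dialogue content are illustrative assumptions, not the exact schema released with LLaVA:

```python
# One (hypothetical) sample: an image reference plus a multi-turn conversation
# written by a language-only GPT model from the image's captions / box annotations.
sample = {
    "image": "coco/train2017/000000123456.jpg",   # placeholder path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the photo doing?"},
        {"from": "gpt",   "value": "The person is riding a bicycle along a tree-lined street."},
        {"from": "human", "value": "Describe the scene in more detail."},
        {"from": "gpt",   "value": "It is a sunny day; the cyclist wears a red helmet and ..."},
    ],
}
```
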
Model Training (two-stage)

(The visual encoder is frozen throughout; a minimal architecture/training sketch follows.)
Stage 1: Pre-training for Feature Alignment
Freeze the visual encoder and the LLM and train only the projection layer -> makes the visual encoder compatible with the LLM, so that the visual tokens align with the word embeddings of the pretrained LLM

Stage 2: Fine-tuning End-to-End
Update the pre-trained weights of the projection layer and the LLM
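
A minimal PyTorch-style sketch of the architecture and the two-stage recipe, assuming a frozen CLIP ViT-L/14 encoder (hidden size 1024), a single linear projection, and a Vicuna-style decoder; the class and helper names are made up for illustration and are not the official LLaVA code:

```python
import torch
import torch.nn as nn

class TinyLLaVA(nn.Module):
    """Toy LLaVA-style model: frozen visual encoder -> linear projection -> LLM."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. CLIP ViT-L/14, kept frozen
        self.projection = nn.Linear(vision_dim, llm_dim)  # vision-language connector
        self.llm = llm                                    # e.g. a Vicuna-style decoder

    def forward(self, pixel_values, text_embeds):
        # Grid features (the paper prefers the penultimate transformer layer);
        # here we simply assume the encoder returns them as (B, N, vision_dim).
        with torch.no_grad():
            visual_features = self.vision_encoder(pixel_values)
        visual_tokens = self.projection(visual_features)  # map into the LLM embedding space
        # Prepend visual tokens to the text embeddings and decode autoregressively
        # (assumes an HF-style decoder that accepts inputs_embeds).
        return self.llm(inputs_embeds=torch.cat([visual_tokens, text_embeds], dim=1))

def configure_stage(model: TinyLLaVA, stage: int):
    """Stage 1: train the projection only. Stage 2: train projection + LLM.
    The visual encoder stays frozen in both stages."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.projection.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
```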

Experiment

1. Multimodal Chatbot
Qualitative evaluation

Quantitative evaluation (on LLaVA-Bench)

2. ScienceQA

Ablation
  1. Visual features: features before the last layer (localized) > features after the last layer (global and abstract); see the feature-extraction sketch below
  2. Chain-of-Thought: answer-first (slower convergence, better final performance) > reasoning-first (faster convergence, worse final performance)
  3. Pre-training: with pre-training > without pre-training
  4. Model size: 13B > 7B
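
Regarding ablation 1, with the HuggingFace transformers CLIP implementation the penultimate-layer grid features can be obtained roughly like this (a sketch assuming the standard openai/clip-vit-large-patch14 checkpoint and a placeholder image path, not LLaVA's actual training code):

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")                 # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[-1]: features after the last layer (more global / abstract)
# hidden_states[-2]: features before the last layer (more localized; the better choice here)
grid_features = outputs.hidden_states[-2][:, 1:]  # drop the [CLS] token -> (1, 256, 1024)
```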

LLaVA-1.5 [CVPR’24]

paper: Improved Baselines with Visual Instruction Tuning

  • arXiv:2310.03744v2 [cs.CV] 15 May 2024
  • CVPR’24 (highlight)
  • LLaVA vs. InstructBLIP/Qwen-VL
    • LLaVA pre-trains an MLP cross-modal connector and then fine-tunes both the connector and the LLM
      • Data: academic-task-oriented data, e.g. VQA
      • Weakness: short-form answers (e.g. single-word)
    • InstructBLIP / Qwen-VL pre-train a visual resampler (e.g. Q-Former) and fine-tune only the instruction-aware Q-Former
      • Data: hundreds of millions (129M, InstructBLIP) or even billions (1.4B, Qwen-VL) of image-text pairs
      • Weakness: long-form conversation (overfits to short-form answers)
  • Improved model architecture: simple modifications to LLaVA
    • Visual Encoder: CLIP-ViT-L-336px (the highest resolution available for CLIP)
    • Vision-Language Connector: MLP projection (from a 1-layer linear projection to a 2-layer MLP; see the connector sketch after this list)
  • Some open problems:
    1. Scaling to high-resolution image inputs
      • Uses a "split-encode-merge" strategy (training yields the LLaVA-1.5-HD model)
    2. Data efficiency
      • Improves data efficiency by randomly downsampling the training mixture (sampling ratios of 0.1-0.5 were tried), confirming the same less-is-more benefit seen in other multimodal models
        • With 50% of the original data, the model keeps 98% of its full-dataset performance
        • With 30% of the original data, performance remains stable
    3. Hallucination in LMMs
      • Model hallucination may come from errors or hallucinations in the training data
      • Hallucination decreases as input resolution increases
      • Calls for [more detailed data annotation] and [models that handle the information correctly]
    4. Compositional capabilities (1+1 > 2)
      • Evidence:
        1. ShareGPT data also improves multimodal multilingual capability
        2. academic-task-oriented datasets also improve visual grounding ability
      • Problem: for tasks that require a combination of capabilities, ideal performance is still hard to achieve
        1. E.g. correctly answering a question about an object's attribute in VQA does not guarantee that the attribute is described accurately in a detailed caption of the whole image
        2. In addition, conversational ability in some foreign languages (e.g. Korean) still lags behind
  • Model sizes: 13B (default), 7B
  • Training cost: a single node with 8× A100 GPUs, ~1 day
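
A sketch of the connector change, replacing LLaVA's single linear projection with LLaVA-1.5's 2-layer MLP; the GELU activation and the 1024/5120 dimensions are assumptions based on CLIP ViT-L and a 13B LLM, not copied from the released code:

```python
import torch.nn as nn

# LLaVA: a single linear layer as the vision-language connector.
linear_connector = nn.Linear(1024, 5120)

# LLaVA-1.5: a 2-layer MLP connector with a non-linearity in between.
mlp_connector = nn.Sequential(
    nn.Linear(1024, 5120),
    nn.GELU(),
    nn.Linear(5120, 5120),
)
```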

LLaVA-NeXT/LLaVA-1.6

LLaVA-OneVision

LLaVA-Video

LLaVA-Critic

Qwen-VL Series

Qwen-VL

paper: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen2-VL

paper: Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

Qwen2.5-VL

report: Qwen2.5-VL Technical Report

InternVL Series

InternVL

InternVL1.5

InternVL2.5

InternVL2.5-MPO

InternVL3

DeepSeek-VL Series

DeepSeek-VL

DeepSeek-VL: Towards Real-World Vision-Language Understanding

DeepSeek-VL2

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Benchmark

Leaderboard

OpenCompass (司南)

  • Title: Multimodal and Multimodal Large Language Models (MLLMs): Study Notes
  • Author: LeoJeshua
  • Created at : 2025-05-19 11:31:00
  • Updated at : 2025-07-25 23:46:56
  • Link: https://leojeshua.github.io/Multimodal/Multimodal/
  • License: This work is licensed under CC BY-NC-SA 4.0.