Multimodal and Multimodal Large Language Models (MLLMs): Study Notes


Multimodal Basics

Keywords:

  • Early:
    • VLP (Vision-and-Language Pre-training)
    • VLMs (Vision Language Models)
  • Current:
    • Multimodal Pre-training

Architectures:

  • dual-encoder
    • Good for understanding tasks; emphasizes per-modality feature extraction (a contrastive-training sketch follows this list)
    • e.g. CLIP
  • encoder-decoder
    • Good for generation tasks; emphasizes modality interaction
    • e.g. SimVLM, ALBEF, BLIP
  • fusion-encoder
    • e.g. ViLT, VLMo, BEiT-3
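
As a rough illustration of the dual-encoder recipe, here is a minimal CLIP-style contrastive loss sketch (PyTorch; the batch layout, dimensions, and temperature value are arbitrary assumptions, not taken from any specific paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch, dim) tensors produced by two
    separate encoders (e.g. a ViT and a text Transformer) -- the dual-encoder setup.
    """
    # Normalize so that dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the matched pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image classification losses, averaged.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```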

Tasks:

  • Vision
  • Language
  • Vision-Language

Tutorials

Milestones

CLIP (2103)

Learning Transferable Visual Models From Natural Language Supervision
https://arxiv.org/abs/2103.00020

ViLT (2102)

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
https://arxiv.org/abs/2102.03334



ALBEF (2107)

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
https://arxiv.org/abs/2107.07651


VLMo (2111)

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
https://arxiv.org/abs/2111.02358



BLIP (2201)

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
https://arxiv.org/abs/2201.12086


CoCa (2205)

CoCa: Contrastive Captioners are Image-Text Foundation Models
https://arxiv.org/abs/2205.01917



BEiT-3 (2208)

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
https://arxiv.org/abs/2208.10442



Uni-Perceiver Series

Uni-Perceiver, Uni-Perceiver-MoE, Uni-Perceiver-v2

Uni-Perceiver (2112)

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
https://arxiv.org/abs/2112.01522



PaLI Series

PaLI, PaLI-X, PaLI-3

Multimodal Large Language Models | MLLMs

Keywords:

  • LMMs (Large Multimodal Models)
  • MLLMs (Multimodal Large Language Models)

Survey

MM-LLMs (2024.05)

MM-LLMs: Recent Advances in MultiModal Large Language Models




A Survey on MLLMs (2024.11)

A Survey on Multimodal Large Language Models

MME-Survey (2024.12)

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs





Models

LLaVA Series

LLaVA [NeurIPS’23]

paper: Visual Instruction Tuning

  • arXiv:2304.08485v2 [cs.CV] 11 Dec 2023
  • NeurIPS’23 Oral
  • Core idea
    • Earlier LLMs improved zero-shot ability through instruction tuning
    • The paper proposes visual instruction tuning to improve the zero-shot ability of LMMs
  • Performance: mainly benchmarked against GPT-4 (closed-source); surpasses OpenFlamingo (open-source) and BLIP-2 (open-source)
    • LLaVA + GPT-4 (as judge) achieved a new SOTA (92.53) at the time on the multimodal reasoning dataset ScienceQA
  • Model sizes: 13B (default), 7B
  • Training cost: a single node with 8× A100 GPUs, less than one day of training
Related Work
  1. Multimodal Instruction-following Agents

    • Type 1: end-to-end models, e.g. the vision-language navigation task and Habitat (embodied AI); InstructPix2Pix (image editing)
    • Type 2: multi-model coordination (e.g. via LangChain), such as Visual ChatGPT, X-GPT, MM-REACT, VisProg, ViperGPT
  2. Instruction Tuning

    • LLMs: GPT-3, T5, PaLM, OPT
      • Instruction tuning (a simple way to improve zero-shot/few-shot ability): InstructGPT/ChatGPT, FLAN-T5, FLAN-PaLM, OPT-IML
    • LMMs: BLIP-2, FROMAGe, KOSMOS-1, PaLM-E; Flamingo (with its strong zero-shot transfer and in-context learning ability, it can be regarded as the GPT-3 moment of the multimodal field)
    • OpenFlamingo and LLaMA-Adapter enable the then-best open-source LLM, LLaMA, to accept image inputs (but they are not explicitly fine-tuned on vision-language instruction data, and their performance on multimodal tasks often drops)
Contributions
  1. [Synthetic data] Multimodal language-image instruction-following data
    • A key challenge is the lack of vision-language instruction-following data; the paper proposes the first pipeline that uses language-only ChatGPT/GPT-4 to construct such training data from image-text pairs (a toy example of the resulting data format follows this list)
  2. [Model architecture] Large Multimodal Models (LMMs)
    • pretrained Visual Encoder
      • Architecture: CLIP visual encoder ViT-L/14
      • Role: extract visual features from the input image
      • Visual features: the grid features before/after the last Transformer layer
        • features before the last layer: focus more on localized properties and help the model understand specific image details ⭐ (works better)
        • features after the last layer: focus more on global and abstract image properties
    • projection layer
      • Architecture: a single simple linear layer ⭐ (future work could also try the gated cross-attention of Flamingo or the Q-Former of BLIP-2)
      • Role: map visual features into the LLM's word embedding space to obtain visual tokens
      • Note: called the Vision-Language Connector in the later LLaVA-1.5
    • pretrained Language Decoder (LLM)
      • Architecture: Vicuna (the best instruction-following ability on language tasks among open-source models at the time)
      • Role: provide instruction-following and reasoning ability
  3. [New benchmarks] Multimodal instruction-following benchmarks
    • The paper proposes LLaVA-Bench (COCO) and LLaVA-Bench (In-the-Wild) as the first benchmarks for quantitatively evaluating the visual instruction-following ability of LMMs
  4. [Fully open-source] Open-source
    • the generated data
    • the model checkpoints
    • a visual chat demo
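
For intuition, here is a hypothetical example of what one generated visual instruction-following sample could look like; the file path, field names, and dialogue content are illustrative assumptions, not the exact schema released with LLaVA:

```python
# One (hypothetical) sample: an image reference plus a multi-turn conversation
# written by a language-only GPT model from the image's captions / box annotations.
sample = {
    "image": "coco/train2017/000000123456.jpg",   # placeholder path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the photo doing?"},
        {"from": "gpt",   "value": "The person is riding a bicycle along a tree-lined street."},
        {"from": "human", "value": "Describe the scene in more detail."},
        {"from": "gpt",   "value": "It is a sunny day; the cyclist wears a red helmet and ..."},
    ],
}
```
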
Model Training (two-stage)

(The visual encoder is frozen throughout; a minimal architecture/training sketch follows.)
Stage 1: Pre-training for Feature Alignment
Freeze the visual encoder and the LLM and train only the projection layer -> makes the visual encoder compatible with the LLM, so that the visual tokens align with the word embeddings of the pretrained LLM

Stage 2: Fine-tuning End-to-End
Update the pre-trained weights of the projection layer and the LLM
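
A minimal PyTorch-style sketch of the architecture and the two-stage recipe, assuming a frozen CLIP ViT-L/14 encoder (hidden size 1024), a single linear projection, and a Vicuna-style decoder; the class and helper names are made up for illustration and are not the official LLaVA code:

```python
import torch
import torch.nn as nn

class TinyLLaVA(nn.Module):
    """Toy LLaVA-style model: frozen visual encoder -> linear projection -> LLM."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. CLIP ViT-L/14, kept frozen
        self.projection = nn.Linear(vision_dim, llm_dim)  # vision-language connector
        self.llm = llm                                    # e.g. a Vicuna-style decoder

    def forward(self, pixel_values, text_embeds):
        # Grid features (the paper prefers the penultimate transformer layer);
        # here we simply assume the encoder returns them as (B, N, vision_dim).
        with torch.no_grad():
            visual_features = self.vision_encoder(pixel_values)
        visual_tokens = self.projection(visual_features)  # map into the LLM embedding space
        # Prepend visual tokens to the text embeddings and decode autoregressively
        # (assumes an HF-style decoder that accepts inputs_embeds).
        return self.llm(inputs_embeds=torch.cat([visual_tokens, text_embeds], dim=1))

def configure_stage(model: TinyLLaVA, stage: int):
    """Stage 1: train the projection only. Stage 2: train projection + LLM.
    The visual encoder stays frozen in both stages."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.projection.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
```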

Experiment

1. Multimodal Chatbot
Qualitative evaluation

Quantitative evaluation (on LLaVA-Bench)

2. ScienceQA

Ablation
  1. Visual features: features before the last layer (localized) > features after the last layer (global and abstract); see the feature-extraction sketch below
  2. Chain-of-Thought: answer-first (slower convergence, better final performance) > reasoning-first (faster convergence, worse final performance)
  3. Pre-training: with pre-training > without pre-training
  4. Model size: 13B > 7B
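
Regarding ablation 1, with the HuggingFace transformers CLIP implementation the penultimate-layer grid features can be obtained roughly like this (a sketch assuming the standard openai/clip-vit-large-patch14 checkpoint and a placeholder image path, not LLaVA's actual training code):

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")                 # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[-1]: features after the last layer (more global / abstract)
# hidden_states[-2]: features before the last layer (more localized; the better choice here)
grid_features = outputs.hidden_states[-2][:, 1:]  # drop the [CLS] token -> (1, 256, 1024)
```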

LLaVA-1.5 [CVPR’24]

paper: Improved Baselines with Visual Instruction Tuning

  • arXiv:2310.03744v2 [cs.CV] 15 May 2024
  • CVPR’24 (highlight)
  • LLaVA vs. InstructBLIP/Qwen-VL
    • LLaVA pre-trains an MLP cross-modal connector and then fine-tunes both the connector and the LLM
      • Data: academic-task-oriented data, e.g. VQA
      • Weakness: short-form answers (e.g. single-word)
    • InstructBLIP / Qwen-VL pre-train a visual resampler (e.g. Q-Former) and fine-tune only the instruction-aware Q-Former
      • Data: hundreds of millions (129M, InstructBLIP) or even billions (1.4B, Qwen-VL) of image-text pairs
      • Weakness: long-form conversation (overfits to short-form answers)
  • Improved model architecture: simple modifications to LLaVA
    • Visual Encoder: CLIP-ViT-L-336px (the highest resolution available for CLIP)
    • Vision-Language Connector: MLP projection (from a 1-layer linear projection to a 2-layer MLP; see the connector sketch after this list)
  • Some open problems:
    1. Scaling to high-resolution image inputs
      • Uses a "split-encode-merge" strategy (training yields the LLaVA-1.5-HD model)
    2. Data efficiency
      • Improves data efficiency by randomly downsampling the training mixture (sampling ratios of 0.1-0.5 were tried), confirming the same less-is-more benefit seen in other multimodal models
        • With 50% of the original data, the model keeps 98% of its full-dataset performance
        • With 30% of the original data, performance remains stable
    3. Hallucination in LMMs
      • Model hallucination may come from errors or hallucinations in the training data
      • Hallucination decreases as input resolution increases
      • Calls for [more detailed data annotation] and [models that handle the information correctly]
    4. Compositional capabilities (1+1 > 2)
      • Evidence:
        1. ShareGPT data also improves multimodal multilingual capability
        2. academic-task-oriented datasets also improve visual grounding ability
      • Problem: for tasks that require a combination of capabilities, ideal performance is still hard to achieve
        1. E.g. correctly answering a question about an object's attribute in VQA does not guarantee that the attribute is described accurately in a detailed caption of the whole image
        2. In addition, conversational ability in some foreign languages (e.g. Korean) still lags behind
  • Model sizes: 13B (default), 7B
  • Training cost: a single node with 8× A100 GPUs, ~1 day
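
A sketch of the connector change, replacing LLaVA's single linear projection with LLaVA-1.5's 2-layer MLP; the GELU activation and the 1024/5120 dimensions are assumptions based on CLIP ViT-L and a 13B LLM, not copied from the released code:

```python
import torch.nn as nn

# LLaVA: a single linear layer as the vision-language connector.
linear_connector = nn.Linear(1024, 5120)

# LLaVA-1.5: a 2-layer MLP connector with a non-linearity in between.
mlp_connector = nn.Sequential(
    nn.Linear(1024, 5120),
    nn.GELU(),
    nn.Linear(5120, 5120),
)
```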

LLaVA-NeXT/LLaVA-1.6

LLaVA-OneVision

LLaVA-Video

LLaVA-Critic

Qwen-VL Series

Qwen-VL

paper: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen2-VL

paper: Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

Qwen2.5-VL

report: Qwen2.5-VL Technical Report

InternVL Series

InternVL

InternVL1.5

InternVL2.5

InternVL2.5-MPO

InternVL3

DeepSeek-VL Series

DeepSeek-VL

DeepSeek-VL: Towards Real-World Vision-Language Understanding

DeepSeek-VL2

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Benchmark

Leaderboard

OpenCompass (司南)

  • Title: Multimodal and Multimodal Large Language Models (MLLMs): Study Notes
  • Author: LeoJeshua
  • Created at : 2025-05-19 11:31:00
  • Updated at : 2025-07-25 23:46:56
  • Link: https://leojeshua.github.io/Multimodal/Multimodal/
  • License: This work is licensed under CC BY-NC-SA 4.0.