Vision-language model
Overview
A vision-language model (VLM) is a multimodal architecture that processes both images and text as inputs to perform cross-modal reasoning tasks. Unlike large language models that operate exclusively on text, or computer vision models that handle only images, VLMs learn joint representations bridging visual and linguistic modalities during training. This joint training typically occurs through contrastive learning objectives, masked language modeling on image-text pairs, or causally-masked generation approaches.[1]
The architecture typically comprises three components: an image encoder (often a vision transformer or CNN backbone), a text encoder or decoder, and a fusion mechanism that aligns representations across modalities. The context window accommodates both image tokens (from patch embeddings or region features) and text tokens, so every image consumes part of the same token budget as the prompt, and inference infrastructure must account for the added memory and latency. VLMs are commonly used for image captioning, visual question answering (VQA), image-text retrieval, and visual grounding.
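To make the token-budget point concrete, the following minimal sketch counts the patch tokens a ViT-style encoder would emit and subtracts them from a context window. The function names, the 4096-token window, and the assumption of one token per non-overlapping patch (no class token, no pooling or resampling) are illustrative choices, not properties of any particular model.

```python
def num_patch_tokens(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping square patches for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def remaining_text_budget(context_window: int, image_sizes, patch_size: int) -> int:
    """Tokens left for text after reserving one token per image patch (assumed)."""
    image_tokens = sum(num_patch_tokens(h, w, patch_size) for h, w in image_sizes)
    return context_window - image_tokens


print(num_patch_tokens(224, 224, 16))  # 196
# Hypothetical 4096-token window holding one 224x224 and one 336x336 image
print(remaining_text_budget(4096, [(224, 224), (336, 336)], 14))  # 3264
```

Models that compress visual tokens before fusion would yield smaller counts than this sketch suggests.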
VLMs differ fundamentally from systems that apply separate models sequentially (e.g., image classification followed by LLM processing) because they learn unified representations during training, rather than passing one model's output to another as text. The knowledge cutoff in a VLM may differ across modalities, and hallucination patterns often manifest differently when grounded by image evidence versus pure text. Adversarial robustness of VLMs to imperceptible image perturbations remains an active research area.
How it works
VLMs employ a shared embedding space where image and text representations occupy compatible geometric relationships. The training process typically follows these steps:
- Image encoding: An image is processed through a frozen or trainable vision encoder (e.g., Vision Transformer, ResNet), producing a sequence of visual tokens or patch embeddings.
- Text encoding/decoding: Text is tokenized and processed through a language model component, producing text embeddings or hidden states.
- Alignment objective: The model optimizes a contrastive loss (e.g., CLIP-style similarity matching) or a generative loss (e.g., next-token prediction given image context) to align representations.
- Fine-tuning: Task-specific instruction tuning on image-text data (e.g., (image, question, answer) triples for VQA or (image, caption) pairs for captioning) adjusts the joint representation.
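The contrastive variant of the alignment objective can be sketched as a symmetric cross-entropy over a batch similarity matrix, in the style described for CLIP. This is a minimal pure-Python illustration, not a training implementation: the function names are hypothetical, the toy embeddings are made up, and the temperature of 0.07 is an assumed constant (in practice it is usually a learned parameter).

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the correct match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch where image i matches text i."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy_row([logits[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy 2-D embeddings (made up): matched pairs should score a lower loss
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
print(aligned < shuffled)  # True
```

Minimizing this loss pulls each image toward its own caption and pushes it away from the other captions in the batch, which is what produces the shared embedding space.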
At inference, an image and prompt are jointly encoded, and the model generates or retrieves text outputs. Prompt engineering for VLMs often involves spatial descriptions ("top-left region") or attribute enumeration, as natural language descriptions of visual content influence output quality. In-context learning with a few example image-text pairs can adapt VLM behavior without retraining.[2]
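In-context learning with image-text examples usually amounts to interleaving images and text in a single prompt. The sketch below assembles such a prompt under loud assumptions: the `<image>` placeholder, the `Example` type, the `Question:`/`Answer:` layout, and the file names are all hypothetical, and real models each define their own image markers and prompt templates.

```python
from dataclasses import dataclass

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker; real models define their own


@dataclass
class Example:
    image_path: str
    answer: str


def build_few_shot_prompt(examples, query_image, question):
    """Return (prompt_text, image_paths), with images listed in placeholder order."""
    parts, images = [], []
    for ex in examples:
        # Each solved example shows the model the expected answer format
        parts.append(f"{IMAGE_PLACEHOLDER}\nQuestion: {question}\nAnswer: {ex.answer}")
        images.append(ex.image_path)
    # The query repeats the pattern but leaves the answer open for generation
    parts.append(f"{IMAGE_PLACEHOLDER}\nQuestion: {question}\nAnswer:")
    images.append(query_image)
    return "\n\n".join(parts), images


shots = [Example("cat.jpg", "a cat"), Example("dog.jpg", "a dog")]
prompt, images = build_few_shot_prompt(shots, "query.jpg", "What animal is shown?")
print(prompt)
print(images)
```

The returned image list keeps the same order as the placeholders, so a caller can pair each marker with its encoded image.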
| Term | Distinction |
|---|---|
| Multimodal LLM | Multimodal LLM is the broader category encompassing VLMs and other cross-modal models (audio-text, video-text). Vision-language model specifically denotes image-text alignment. |
| Text-only LLM | LLMs process text exclusively and lack visual grounding. They cannot directly analyze images without separate preprocessing (e.g., dense captions from an external model). |
| Foundation model | Foundation models are large-scale pre-trained models applicable to multiple downstream tasks. VLMs are a category of foundation model specialized for vision-language tasks. |
| Image encoder + LLM pipeline | Sequential stacking of an image encoder and language model differs from joint VLM training; the latter learns unified representations and typically achieves better transfer and in-context learning. |
| Embedding model | Embedding models map inputs (text or images) to fixed-dimensional vectors for retrieval. Contrastive VLMs such as CLIP are themselves dual-encoder embedding models, whereas generative VLMs produce variable-length token sequences rather than a single vector. |
Examples
- CLIP (Contrastive Language-Image Pre-training): Trained on 400M image-text pairs from the web using a contrastive objective aligning image and caption embeddings. Widely used for zero-shot image classification and vision-language retrieval without task-specific fine-tuning.[1]
- GPT-4V / GPT-4 with Vision: A generative vision-language model accepting both images and text prompts, generating captions, answering visual questions, and performing reasoning over image content. Details of its architecture and training data have not been publicly disclosed.
- LLaVA (Large Language and Vision Assistant): An open-source VLM combining a CLIP Vision Transformer encoder with a LLaMA-based language model (Vicuna), fine-tuned on instruction-following image-text data. Publicly available weights support quantized inference on consumer hardware.
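Zero-shot classification with a contrastive model such as CLIP can be sketched as follows: each class name is embedded as a text prompt, the image is embedded once, and a softmax over the cosine similarities gives class scores. The vectors below are made-up stand-ins for encoder outputs, the function names are hypothetical, and the temperature of 0.01 is an assumed value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_classify(image_emb, label_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and each label prompt."""
    sims = {label: cosine(image_emb, emb) / temperature
            for label, emb in label_embs.items()}
    m = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - m) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy vectors standing in for text-encoder and image-encoder outputs
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best)  # a photo of a cat
```

No classifier head is trained here: adding a new class only requires embedding one more text prompt.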
See also
- Multimodal LLM — broader category of models processing multiple input modalities
- Foundation model — large-scale pre-trained models serving as basis for task-specific adaptation
- Fine-tuning — technique for specializing pre-trained VLMs on downstream tasks
- In-context learning — VLM ability to adapt behavior from in-prompt examples without retraining
- Embedding model — related architecture for mapping inputs to fixed vector representations
References
- [1] Radford, A., Kim, J. W., Hallacy, C., et al. "Learning Transferable Visual Models From Natural Language Supervision." International Conference on Machine Learning (ICML). 2021.
- [2] Alayrac, J.-B., Donahue, J., Luc, P., et al. "Flamingo: a Visual Language Model for Few-Shot Learning." arXiv preprint arXiv:2204.14198. 2022.