Vision-language model

Vision-language model — A foundation model jointly trained on paired images and text to perform tasks requiring both visual and linguistic understanding.

Overview

A vision-language model (VLM) is a multimodal model that takes both images and text as input and reasons across the two modalities. Unlike large language models that operate exclusively on text, or computer vision models that handle only images, VLMs learn joint representations that bridge the visual and linguistic modalities during training. This joint training typically uses contrastive objectives, masked language modeling on image-text pairs, or autoregressive (causally masked) text generation conditioned on images.^[1]

The architecture typically comprises three components: an image encoder (often a vision transformer or CNN backbone), a text encoder or decoder, and a fusion mechanism that aligns representations across modalities. In generative VLMs the context window holds both image tokens (from patch embeddings or region features) and text tokens, so image resolution and patch size directly affect sequence length and inference cost. VLMs are commonly used for image captioning, visual question answering (VQA), image-text retrieval, and visual grounding.
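
The effect of image tokens on the context window can be illustrated with a short calculation. The sketch below assumes a ViT-style encoder that splits the image into non-overlapping square patches and emits one token per patch. The function names, the 2,048-token context window, and the resolutions are illustrative assumptions, and real models may add special tokens or resample the patch sequence.

```python
def num_patch_tokens(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping square patches for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def remaining_text_budget(context_window: int, image_tokens: int) -> int:
    """Tokens left for text once the image tokens occupy part of the context."""
    return max(context_window - image_tokens, 0)


tokens_224 = num_patch_tokens(224, 224, 16)  # 14 * 14 patches
tokens_336 = num_patch_tokens(336, 336, 14)  # 24 * 24 patches
print(tokens_224, tokens_336)                # 196 576

# Assumed context window of 2048 tokens, chosen only for illustration.
print(remaining_text_budget(2048, tokens_336))  # 1472
```

Under these assumptions, raising the resolution from 224 to 336 pixels with a smaller patch size roughly triples the number of image tokens, which is why image resolution is a significant factor in inference cost.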

VLMs differ fundamentally from systems that apply separate models sequentially (e.g., image classification followed by LLM processing) because they learn unified representations during joint training rather than exchanging only the intermediate outputs of independently trained models. The knowledge cutoff in a VLM may differ across modalities, and hallucination patterns often differ between outputs grounded in image evidence and outputs based on text alone. Adversarial robustness of VLMs to imperceptible image perturbations remains an active research area.

How it works

VLMs typically map image and text inputs into a shared embedding space in which matching images and texts lie close together. The training process typically follows these steps:

Image encoding: An image is processed through a frozen or trainable vision encoder (e.g., Vision Transformer, ResNet), producing a sequence of visual tokens or patch embeddings.

Text encoding/decoding: Text is tokenized and processed through a language model component, producing text embeddings or hidden states.

Alignment objective: The model optimizes a contrastive loss (e.g., CLIP-style similarity matching) or a generative loss (e.g., next-token prediction given image context) to align representations.

Fine-tuning: Task-specific instruction tuning on image-text data (e.g., (image, caption) pairs for captioning or (image, question, answer) triples for VQA) adjusts the joint representation.
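
The contrastive variant of the alignment objective can be sketched in a few lines. The example below is a minimal illustration of a symmetric CLIP-style loss, not the implementation used by any particular model. It assumes that the encoders have already produced one embedding per image and per caption, that row i of each list forms a matching pair, and that the temperature is fixed at 0.07 (in practice it is usually a learned parameter). The two-dimensional vectors are made up for the example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: row i should put its probability mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same criterion on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[0.9, 0.1], [0.1, 0.9]])   # captions match
shuffled = contrastive_loss(images, [[0.1, 0.9], [0.9, 0.1]])  # captions swapped
print(aligned < shuffled)  # True
```

The loss is low when each image is most similar to its own caption and high when the pairing is swapped, which is the signal that pulls matching pairs together in the shared space.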

At inference, an image and prompt are jointly encoded, and the model generates or retrieves text outputs. Prompt engineering for VLMs often involves spatial descriptions ("top-left region") or attribute enumeration, because how the prompt refers to visual content influences output quality. In-context learning with a few example image-text pairs can adapt VLM behavior without retraining.^[2]
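
A few-shot prompt for in-context learning is usually an interleaved sequence of images and text. The sketch below shows one way to assemble such a sequence. The segment dictionaries, helper names, and file names are hypothetical and do not correspond to the input format of any specific model or API.

```python
def image_segment(name):
    # Hypothetical placeholder: a real system would carry pixels or encoded tokens.
    return {"type": "image", "name": name}


def text_segment(text):
    return {"type": "text", "text": text}


def build_few_shot_prompt(examples, query_image, question):
    """Interleave (image, text) demonstrations, then append the query image and question."""
    segments = []
    for image_name, text in examples:
        segments.append(image_segment(image_name))
        segments.append(text_segment(text))
    segments.append(image_segment(query_image))
    segments.append(text_segment(question))
    return segments


prompt = build_few_shot_prompt(
    examples=[
        ("cat.jpg", "Q: What animal is in the top-left region? A: a cat"),
        ("dog.jpg", "Q: What animal is in the top-left region? A: a dog"),
    ],
    query_image="query.jpg",
    question="Q: What animal is in the top-left region? A:",
)
print([segment["type"] for segment in prompt])
```

Each demonstration pairs an image with a completed answer, and the final question is left open so that the model continues the pattern for the query image.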

Distinction from related terms

Term	Distinction
Multimodal LLM	Multimodal LLM is the broader category encompassing VLMs and other cross-modal models (audio-text, video-text). Vision-language model specifically denotes image-text alignment.
Text-only LLM	LLMs process text exclusively and lack visual grounding. They cannot directly analyze images without separate preprocessing (e.g., dense captions from an external model).
Foundation model	Foundation models are large-scale pre-trained models applicable to multiple downstream tasks. VLMs are a category of foundation model specialized for vision-language tasks.
Image encoder + LLM pipeline	Sequential stacking of an image encoder and language model differs from joint VLM training; the latter learns unified representations and typically achieves better transfer and in-context learning.
Embedding model	Embedding models map inputs (text or images) to fixed-dimensional vectors for retrieval. Contrastive VLMs such as CLIP can serve this role for images and text jointly, whereas generative VLMs produce variable-length token sequences rather than a single vector.

Examples

CLIP (Contrastive Language-Image Pre-training): Trained on 400M image-text pairs from the web using a contrastive objective aligning image and caption embeddings. Widely used for zero-shot image classification and vision-language retrieval without task-specific fine-tuning.^[1]
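
Zero-shot classification with a contrastive model amounts to comparing one image embedding against the embeddings of candidate text prompts. The sketch below illustrates the idea with made-up three-dimensional vectors standing in for encoder outputs. The function names, the prompt wording, and the similarity scale of 100 are assumptions for illustration, not the library's actual interface.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, prompt_embs, scale=100.0):
    """Return the best-matching prompt and a probability for every prompt."""
    labels = list(prompt_embs)
    sims = [scale * cosine(image_emb, prompt_embs[label]) for label in labels]
    probs = softmax(sims)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Toy embeddings, invented for the example; real ones come from the two encoders.
prompt_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
label, probs = zero_shot_classify([0.8, 0.2, 0.1], prompt_embs)
print(label)  # a photo of a cat
```

Because the class names enter only as text prompts, new classes can be added by writing new prompts, with no retraining of either encoder.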

GPT-4V / GPT-4 with Vision: A generative vision-language model that accepts both images and text prompts and can generate captions, answer visual questions, and reason over image content. Details of its architecture and training data have not been publicly disclosed.

LLaVA (Large Language and Vision Assistant): An open-source VLM that connects a CLIP Vision Transformer encoder to a LLaMA-based language model (Vicuna) and is fine-tuned on instruction-following image-text data. Its publicly available weights can be quantized for inference on consumer hardware.

References

[1] Radford, A., Kim, J. W., Hallacy, C., et al. "Learning Transferable Visual Models From Natural Language Supervision." International Conference on Machine Learning (ICML). 2021.
[2] Alayrac, J.-B., Donahue, J., Luc, P., et al. "Flamingo: a Visual Language Model for Few-Shot Learning." arXiv preprint arXiv:2204.14198. 2022.

