Synthetic data

From llmref.wiki
Synthetic data: data generated by a machine learning model or procedural algorithm to augment or replace human-collected training or evaluation datasets.

Overview

Synthetic data is information created by a trained model or procedural algorithm rather than collected from human activity, real-world events, or labeled annotation. In the LLM context, synthetic data serves two primary roles: as training material to improve model performance, and as evaluation material for automated assessment. The practice addresses practical constraints in dataset curation—scarcity of labeled examples, privacy concerns, class imbalance, and cost of human annotation—while introducing distinct quality and representativeness tradeoffs.

Training on synthetic data differs fundamentally from training on naturally occurring text. When a model generates text that is used to train another model (or itself, in iterative refinement), the generated text inherits the generator's statistical properties and potential biases. This creates opportunities for targeted skill amplification, but also risks of compounding errors and distributional collapse; the latter is sometimes termed "model collapse" when synthetic data is reused across repeated training iterations.
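The loss of diversity under repeated synthetic-data iteration can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the "model" is an empirical token distribution refit on samples drawn from the previous generation, and the function name `support_size_over_generations` and the long-tailed vocabulary are hypothetical.

```python
import random
from collections import Counter


def support_size_over_generations(vocab_weights, sample_size, generations, seed=0):
    """Refit a token distribution on its own samples and track surviving tokens.

    Each generation draws `sample_size` tokens from the current distribution,
    then uses the observed counts as the next generation's distribution.
    A token that is never sampled gets zero mass and cannot reappear.
    """
    rng = random.Random(seed)
    tokens = list(vocab_weights)
    weights = [vocab_weights[t] for t in tokens]
    support_sizes = [len(tokens)]
    for _ in range(generations):
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(sample)
        tokens = list(counts)
        weights = [counts[t] for t in tokens]
        support_sizes.append(len(tokens))
    return support_sizes


# A long-tailed vocabulary: rare tokens are the first to disappear.
vocab = {f"tok{i}": 1.0 / (i + 1) for i in range(100)}
sizes = support_size_over_generations(vocab, sample_size=50, generations=20)
print(sizes)
```

The number of distinct tokens can only stay the same or shrink from one generation to the next, which mirrors how rare phenomena in the original data can vanish when each model is trained only on its predecessor's output.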

The ethical and practical implications of synthetic data are substantial. Benchmark contamination can occur when synthetic data inadvertently overlaps with evaluation sets. Factual errors and hallucinations in the generating model propagate into downstream training. Conversely, synthetic data enables experimentation with distribution shifts, rare linguistic phenomena, and adversarial examples without relying on human effort or proprietary corpora.
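One common way to screen for the overlap described above is an n-gram check between synthetic examples and evaluation items. The following is a minimal sketch, assuming whitespace tokenization and exact n-gram matching; the function names and the default window size are illustrative choices, not a standard.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(synthetic_examples, benchmark_items, n=8):
    """Return the synthetic examples that share any n-gram with the benchmark."""
    benchmark_ngrams = set()
    for item in benchmark_items:
        benchmark_ngrams |= ngrams(item, n)
    return [ex for ex in synthetic_examples if ngrams(ex, n) & benchmark_ngrams]


benchmark = ["What is the capital of France and why"]
synthetic = [
    "Tell me what is the capital of France please",
    "Compute two plus two",
]
# A short window (n=4) is used here only because the toy strings are short.
print(flag_contaminated(synthetic, benchmark, n=4))
```

Exact matching misses paraphrased overlap, so a check like this gives a lower bound on contamination rather than a guarantee of a clean dataset.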

How it is used

Synthetic data in LLM workflows operates in distinct modes:

  • Training augmentation: A model generates paraphrases, question-answer pairs, or task-specific examples to expand a training set. Example: generating synthetic math problems to improve mathematical reasoning capabilities.
  • Instruction tuning: Instruction-following models are often trained on synthetically created (instruction, response) pairs. Constitutional AI methods also generate training signal synthetically: the model critiques and revises its own responses against written principles, and AI-generated comparisons supply preference pairs.
  • Evaluation and distillation: Gold-relevance distillation and LLM-as-judge approaches generate synthetic labels or rankings for unlabeled data, enabling training on weak supervision.
  • Adversarial robustness: Synthetic adversarial examples—typos, paraphrases, out-of-distribution inputs—test model robustness without manual construction.
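As an illustration of the adversarial-robustness mode, typo-style perturbations can be produced procedurally. This is a minimal sketch, assuming a single adjacent-character swap is an acceptable stand-in for a typo; `swap_adjacent_typo` and `make_typo_variants` are hypothetical helper names.

```python
import random


def swap_adjacent_typo(text, rng):
    """Swap one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def make_typo_variants(text, k, seed=0):
    """Generate k perturbed copies of `text` for robustness testing."""
    rng = random.Random(seed)
    return [swap_adjacent_typo(text, rng) for _ in range(k)]


for variant in make_typo_variants("robustness", 5):
    print(variant)
```

Each variant keeps the same characters as the input, so a robust model would be expected to treat it much like the original. A swap that lands on two identical neighboring letters leaves the text unchanged, so a real pipeline might deduplicate the output.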

Quality of synthetic data depends on the fidelity of the generating model, diversity constraints applied during generation, and alignment between the synthetic distribution and the target task. Techniques to improve quality include sampling temperature control, rejection sampling, and iterative refinement via human feedback or synthetic critiques.
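Two of the techniques named above, sampling temperature control and rejection sampling, can be sketched in a few lines. This is an illustrative sketch rather than any particular library's API: `generate` and `accept` are hypothetical callables standing in for a generator model and a quality filter.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def rejection_sample(generate, accept, n_wanted, max_tries=1000):
    """Draw candidates from `generate` and keep those that pass `accept`."""
    kept = []
    for _ in range(max_tries):
        if len(kept) >= n_wanted:
            break
        candidate = generate()
        if accept(candidate):
            kept.append(candidate)
    return kept


print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5))
print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=2.0))
```

Higher temperatures spread probability mass across more tokens, which tends to increase diversity at the cost of more low-quality candidates. Pairing higher-temperature generation with a rejection filter is one way to trade the two off.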

Distinction from related terms

Term | Distinction
In-context learning (ICL) | ICL uses existing examples placed in the prompt to guide a model's next prediction, without persistent parameter updates. Synthetic training data changes model parameters through training. The effect of ICL lasts only for the current context; the effect of training on synthetic data persists in the weights.
Fine-tuning | Fine-tuning adapts an existing model using labeled or annotated data, which may be natural or synthetic. Synthetic data is the material; fine-tuning is the method. A model can be fine-tuned on natural data, synthetic data, or a mix of both.
Human evaluation | In human evaluation, people judge model outputs against human standards on specific tasks. Synthetic data may consist of generated labels or examples, but it is not inherently an evaluation methodology. In human evaluation, people do the judging; in synthetic data, a model does the generating.
Benchmark contamination | Contamination occurs when a model has seen evaluation examples during training (intentionally or not). Synthetic data can cause or mitigate contamination depending on its source and overlap with evaluation sets.
Hallucination | A hallucination is a false or unsupported claim in a model's output. Synthetic data generated by a model may contain hallucinations, but synthesis itself is a neutral technique, not an error.

Examples

  • Alpaca dataset (Taori et al., 2023): Stanford researchers generated 52,000 instruction-following examples by prompting text-davinci-003 with 175 human-written seed tasks used as in-context exemplars. The dataset was used to fine-tune LLaMA 7B into Alpaca, demonstrating that an instruction-tuned model could be built cost-effectively from synthetic data rather than from human annotation alone.
  • Constitutional AI (Bai et al., 2022, Anthropic): In a supervised phase, the model critiques its own outputs against a set of constitutional principles and then revises them; the revised responses become fine-tuning data. In a reinforcement learning phase, a model compares pairs of responses against the principles, and these AI-generated preference labels stand in for human harmlessness labels when training a preference model (reinforcement learning from AI feedback, RLAIF). Both phases scale training without proportional human effort.
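  • Synthetic medical data in biomedical LLMs: Developers working with sensitive clinical text may train on synthetically generated patient narratives that preserve the statistical properties of medical language while omitting personally identifiable information. This can expand training data while reducing privacy risk and regulatory burden (for example, under HIPAA), although a generator trained on real records can still reproduce memorized details, so synthetic text does not by itself guarantee privacy.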
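The seed-and-generate pattern behind instruction datasets such as Alpaca can be sketched as a loop that prompts a generator with a few existing instructions and keeps only sufficiently novel outputs. This is a simplified illustration, not the published pipeline: `generate_fn` is a hypothetical stand-in for a call to a language model, and word-level Jaccard similarity is a simple assumed substitute for whatever similarity filter a real pipeline uses.

```python
import random


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def build_prompt(examples):
    """Format sampled instructions as in-context exemplars."""
    lines = ["Write one new task instruction in the style of these examples:"]
    lines += [f"{i}. {ex}" for i, ex in enumerate(examples, 1)]
    return "\n".join(lines)


def expand_instruction_pool(seeds, generate_fn, rounds,
                            k_examples=3, max_similarity=0.7, seed=0):
    """Grow a pool of instructions, rejecting near-duplicates of existing ones."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(rounds):
        examples = rng.sample(pool, min(k_examples, len(pool)))
        candidate = generate_fn(build_prompt(examples)).strip()
        # Keep the candidate only if it is not too similar to anything in the pool.
        if candidate and all(jaccard(candidate, p) <= max_similarity for p in pool):
            pool.append(candidate)
    return pool


# Stub generator for demonstration; a real pipeline would call a model here.
canned = iter(["Translate the sentence into French",
               "Summarize this article",
               "List three prime numbers"])
pool = expand_instruction_pool(
    ["Summarize this article", "Write a haiku about autumn"],
    generate_fn=lambda prompt: next(canned),
    rounds=3,
)
print(pool)
```

In the stub run, the second candidate duplicates a seed and is dropped, while the other two are added. The similarity threshold controls how aggressively the loop trades dataset size for diversity.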

See also

  • Golden dataset – a curated, typically human-created reference set against which synthetic or model outputs are measured
  • Fine-tuning – the training process often applied to datasets containing synthetic examples
  • Instruction tuning – a common application domain for synthetic instruction-response pairs
  • Automated evaluation – methods that use synthetic or model-generated labels to assess performance
  • Benchmark contamination – the risk that synthetic data overlaps with or compromises evaluation integrity

References