Synthetic Data Generation at Scale
Teacher-student distillation, Evol-Instruct, and building proprietary training datasets.
The Best Training Data Doesn't Come From Humans
Synthetic Data Pipeline
Phi-3-mini (3.8B parameters) rivals GPT-3.5 on standard benchmarks, according to Microsoft's technical report. Much of the recipe is data: the model was trained on heavily filtered web data combined with LLM-generated synthetic data. A small model trained on high-quality synthetic data from a stronger model can far exceed what its parameter count would suggest. This is the teacher-student distillation paradigm, which now underpins many leading open-source models.
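The teacher-student loop described above can be sketched in a few lines. This is a minimal illustration, not Microsoft's actual pipeline: `build_distillation_set`, the `teacher` callable, and the `keep` filter are hypothetical names, and the stub teacher stands in for a call to a stronger model.

```python
from typing import Callable

def build_distillation_set(
    prompts: list[str],
    teacher: Callable[[str], str],
    keep: Callable[[str, str], bool],
) -> list[dict]:
    """Collect (prompt, teacher response) pairs for fine-tuning a student model."""
    seen = set()
    dataset = []
    for prompt in prompts:
        key = prompt.strip().lower()
        if key in seen:  # skip prompts that only differ by case or whitespace
            continue
        seen.add(key)
        response = teacher(prompt)
        if keep(prompt, response):  # drop teacher outputs that fail the quality check
            dataset.append({"instruction": prompt, "output": response})
    return dataset

# Stub teacher for illustration only; in practice this would call a stronger model.
def stub_teacher(prompt: str) -> str:
    return "" if "empty" in prompt else f"Answer to: {prompt}"

pairs = build_distillation_set(
    ["Explain recursion", "explain recursion ", "Return empty"],
    teacher=stub_teacher,
    keep=lambda prompt, response: len(response) > 0,
)
print(len(pairs))  # prints 1: one duplicate skipped, one empty response filtered
```

The student never sees the teacher's weights, only its filtered outputs, so the quality of the `keep` filter largely determines the quality of the resulting dataset.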
Evol-Instruct: Evolving Instructions to Higher Complexity
Evol-Instruct (WizardLM's method) takes a seed dataset (like Alpaca's 52K examples) and uses a strong LLM to rewrite each instruction into progressively harder variants (the original WizardLM paper used ChatGPT). It applies operations like "add constraints," "deepen the reasoning required," "switch to a different domain," and "add multiple sub-questions." Repeating this over several rounds turns 52K simple seeds into roughly 250K complex instructions, using only LLMs.
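The growth from seeds to a much larger pool comes from applying the rewrite step over several rounds and keeping every generation. The sketch below shows that loop under stated assumptions: `evolve_rounds` and `stub_rewrite` are hypothetical names, the operation is chosen at random per instruction, and the stub stands in for an LLM call.

```python
import random
from typing import Callable

# Operation names taken from the description above.
OPERATIONS = [
    "add constraints",
    "deepen the reasoning required",
    "switch to a different domain",
    "add multiple sub-questions",
]

def evolve_rounds(
    seeds: list[str],
    rewrite: Callable[[str, str], str],
    rounds: int,
    rng: random.Random,
) -> list[str]:
    """Return the seeds plus every evolved generation."""
    pool = list(seeds)
    current = list(seeds)
    for _ in range(rounds):
        # Each instruction in the latest generation gets one randomly chosen operation.
        current = [rewrite(instr, rng.choice(OPERATIONS)) for instr in current]
        pool.extend(current)
    return pool

# Stub rewriter for illustration; a real pipeline would prompt an LLM here.
def stub_rewrite(instruction: str, operation: str) -> str:
    return f"{instruction} [{operation}]"

pool = evolve_rounds(
    ["Sort a list", "Reverse a string"],
    stub_rewrite,
    rounds=4,
    rng=random.Random(0),
)
print(len(pool))  # prints 10: 2 seeds * (1 + 4 rounds)
```

With 52K seeds and four rounds, this scheme would yield at most 260K instructions before failed rewrites are filtered out, which is consistent with the roughly 250K figure. The round count and the absence of a filtering step are assumptions of this sketch.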
Generating Step-by-Step Reasoning Traces
Models trained only on final answers learn to produce answers, not the intermediate reasoning that leads to them. To teach chain-of-thought reasoning, you generate thinking traces: ask a strong teacher model such as GPT-4o to solve each problem while explicitly showing every reasoning step, then train your student model on these traces. The student learns to work through a problem, not just to recall an answer.
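Before a trace can be used for training, it has to be parsed and laid out so that the reasoning comes before the answer. The following is a minimal sketch of that step; `parse_trace`, `to_training_text`, and the "Problem / Reasoning / Answer" layout are illustrative assumptions, and the teacher reply is hand-written rather than real model output.

```python
import json

def parse_trace(raw: str) -> dict:
    """Extract the JSON object from a teacher reply and check its keys."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in teacher output")
    trace = json.loads(raw[start:end + 1])
    missing = {"thinking", "answer"} - set(trace)
    if missing:
        raise ValueError(f"trace is missing keys: {sorted(missing)}")
    return trace

def to_training_text(problem: str, trace: dict) -> str:
    """Place the reasoning before the answer so the student learns to produce it first."""
    return (
        f"Problem: {problem}\n"
        f"Reasoning: {trace['thinking']}\n"
        f"Answer: {trace['answer']}"
    )

# Hand-written teacher reply for illustration, with extra prose around the JSON.
raw_reply = 'Here is the solution:\n{"thinking": "17 + 25 = 42", "answer": "42"}\nHope that helps.'

trace = parse_trace(raw_reply)
print(to_training_text("What is 17 + 25?", trace))
```

Rejecting replies that lack either key keeps answer-only examples out of the training set, which would otherwise teach the student to skip the reasoning.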
Code Example
Evol-Instruct pipeline: 2 seed tasks expand into 8 synthetic training examples. Each example includes a full reasoning trace. Training on these traces teaches the model to reason through problems step-by-step.
from anthropic import Anthropic
import json

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = Anthropic()

# Step 1: Evol-Instruct - evolve simple tasks into complex ones
def evolve_instruction(seed_instruction: str, evolution_type: str) -> str:
    """Use Claude to evolve a simple instruction into a harder variant."""

    evolution_prompts = {
        "add_constraints": f"Make this task harder by adding 3 specific constraints or requirements: {seed_instruction}",
        "deepen_reasoning": f"Rewrite this task to require multi-step analysis and explicit reasoning: {seed_instruction}",
        "add_edge_cases": f"Extend this task to also handle edge cases, errors, and unusual inputs: {seed_instruction}",
        "change_domain": f"Adapt this task concept to a new domain (medicine, finance, or law): {seed_instruction}"
    }

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        messages=[{"role": "user", "content": evolution_prompts[evolution_type]}]
    )
    return response.content[0].text

# Step 2: Generate Chain-of-Thought reasoning traces
def generate_reasoning_trace(problem: str) -> dict:
    """Generate problem + full reasoning trace for training."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Solve this problem step-by-step. Show ALL reasoning.
Format as JSON: {{"thinking": "step-by-step reasoning", "answer": "final answer"}}

Problem: {problem}"""
        }]
    )

    text = response.content[0].text
    # The reply may wrap the JSON in prose or a code fence; keep only the outermost object
    trace = json.loads(text[text.find("{"):text.rfind("}") + 1])
    return {
        "instruction": problem,
        "thinking": trace["thinking"],
        "output": trace["answer"]
    }

# Generate synthetic dataset
seed_tasks = [
    "Write a function to find duplicate elements in an array",
    "Explain how to use a hash map",
]

synthetic_dataset = []

for task in seed_tasks:
    # Evolve each seed into 4 harder variants
    for evo_type in ["add_constraints", "deepen_reasoning", "add_edge_cases", "change_domain"]:
        harder_task = evolve_instruction(task, evo_type)
        # Generate reasoning trace for each evolved task
        try:
            training_example = generate_reasoning_trace(harder_task)
        except (json.JSONDecodeError, KeyError):
            # The model did not return the requested JSON; skip this example
            continue
        synthetic_dataset.append(training_example)

print(f"Generated {len(synthetic_dataset)} training examples from {len(seed_tasks)} seeds")

# Save to JSONL for fine-tuning
with open("synthetic_train.jsonl", "w") as f:
    for example in synthetic_dataset:
        f.write(json.dumps(example) + "\n")
Use Cases
Common Mistakes
Interview Insight
Relevance
High - A core technique behind WizardLM, Phi-3, and many other strong open-source models.