# markpitt / fine-tuning-data-generator

# Install this skill:
```bash
npx skills add markpitt/claude-skills --skill "fine-tuning-data-generator"
```

Installs the specified skill from a multi-skill repository.

# Description

Generates comprehensive synthetic fine-tuning datasets in ChatML format (JSONL) for use with Unsloth, Axolotl, and similar training frameworks. Gathers requirements, creates datasets with diverse examples, validates quality, and provides framework integration guidance.

# SKILL.md


```yaml
---
name: fine-tuning-data-generator
description: Generates comprehensive synthetic fine-tuning datasets in ChatML format (JSONL) for use with Unsloth, Axolotl, and similar training frameworks. Gathers requirements, creates datasets with diverse examples, validates quality, and provides framework integration guidance.
version: 2.0
allowed-tools: Read, Write, Edit, Bash
---
```


# Fine-Tuning Data Generator

This skill generates high-quality synthetic training data in ChatML format for fine-tuning language models using frameworks like Unsloth, Axolotl, or similar tools.

## What Do I Need?

| Need | Resource |
| --- | --- |
| Planning my dataset: requirements, strategy, quality checklist | resources/dataset-strategy.md |
| How to create diverse examples: variation techniques, multi-turn patterns, format-specific guidance | resources/generation-techniques.md |
| ChatML format details: structure, specification, common issues, framework compatibility | resources/chatml-format.md |
| Example datasets: inspiration across domains, multi-turn samples, edge cases | resources/examples.md |
| Validating quality: validation workflow, analyzing datasets, troubleshooting | resources/quality-validation.md |
| Training & deployment: framework setup, hyperparameters, optimization, deployment | resources/framework-integration.md |

## Workflow

### Phase 1: Gather Requirements

Start with these essential clarifying questions:

Task Definition:
- What is the model being trained to do? (e.g., customer support, code generation, creative writing)
- What specific domain or subject matter? (e.g., legal, medical, e-commerce, software development)
- How many training examples are needed? (Recommend: 100+ for simple tasks, 500-1000+ for complex)

Quality & Diversity:
- Complexity range: simple to complex mix, or focus on specific difficulty level?
- Diversity: edge cases, error handling, unusual scenarios?
- Tone/style: professional, friendly, technical, concise, detailed?
- Response length preferences?
- Any specific formats: code blocks, lists, tables, JSON?

Dataset Composition:
- Distribution across subtopics: evenly distributed or weighted?
- Include negative examples (what NOT to do)?
- Need validation split? (Recommend 10-20% of the total; see the split sketch below)

See resources/dataset-strategy.md for detailed question templates.
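
To produce the recommended 10-20% validation split, a seeded shuffle-and-slice over the generated JSONL is usually enough. A minimal sketch, assuming the full generated set lives in all_examples.jsonl (the filenames are illustrative, not produced by the skill):

```python
import json
import random

# Load every example (one JSON object per line).
with open("all_examples.jsonl") as f:
    examples = [json.loads(line) for line in f if line.strip()]

# Shuffle with a fixed seed so the split is reproducible.
random.seed(42)
random.shuffle(examples)

# Hold out 15% for validation, within the recommended 10-20% band.
cut = int(len(examples) * 0.85)
splits = {"training_data.jsonl": examples[:cut],
          "validation_data.jsonl": examples[cut:]}

for path, subset in splits.items():
    with open(path, "w") as f:
        for ex in subset:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```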

### Phase 2: Create Generation Plan

Present a plan covering:
- Number and distribution of examples across categories
- Key topics/scenarios to cover
- Diversity strategies (phrasing variations, complexity levels, edge cases)
- System prompt approach (consistent vs. varied)
- Quality assurance approach

Get user approval before generating.

### Phase 3: Generate Synthetic Data

Create examples following these quality standards:

Key Principles:
- Realistic scenarios reflecting real-world use cases
- Natural language with varied phrasing and formality levels
- Accurate, helpful responses aligned with desired behavior
- Consistent ChatML formatting throughout
- Balanced difficulty (unless specified)
- Meaningful variety (no repetition)
- Include edge cases and error scenarios

Diversity Techniques:
- Vary query phrasing (questions, commands, statements)
- Include different expertise levels (beginner, intermediate, expert)
- Cover both positive and negative examples
- Mix short and long-form responses
- Include multi-step reasoning when appropriate
- Add context variations

See resources/generation-techniques.md for detailed techniques, domain-specific guidance, and batch generation workflow.
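
However the examples are drafted, each one is ultimately serialized as a single JSONL line. A minimal sketch of that assembly step, using a hypothetical make_example() helper (not part of the skill's scripts):

```python
import json

def make_example(system: str, user: str, assistant: str) -> dict:
    """Assemble one single-turn ChatML example as a messages dict."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}

examples = [
    make_example(
        "You are a helpful Python tutor.",
        "How do I reverse a string in Python?",
        "Use slicing: `text[::-1]` returns a reversed copy of the string.",
    ),
]

# Append each example as one JSON object per line (JSONL).
with open("training_data.jsonl", "a") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```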

### Phase 4: Validate & Document

Run validation tools and checks:

```bash
# Validate JSON formatting and structure
python scripts/validate_chatml.py training_data.jsonl

# Analyze dataset statistics and diversity
python scripts/analyze_dataset.py training_data.jsonl

# Export statistics
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```
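
If the bundled scripts are unavailable, the core structural checks are easy to approximate. A minimal sketch of that kind of check (illustrative only, not the actual validate_chatml.py implementation):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_line(raw: str) -> list[str]:
    """Return a list of problems found in one JSONL line."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: invalid role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: missing or empty content")
    return problems

seen = set()
with open("training_data.jsonl") as f:
    for n, line in enumerate(f, 1):
        if not line.strip():
            continue
        for problem in check_line(line):
            print(f"line {n}: {problem}")
        # Crude exact-duplicate detection on the raw line.
        if line.strip() in seen:
            print(f"line {n}: exact duplicate of an earlier example")
        seen.add(line.strip())
```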

Quality Checklist:
- [ ] JSON validation passed (no errors)
- [ ] Analysis shows good diversity metrics
- [ ] Manual sample review passed
- [ ] No duplicate or near-duplicate examples
- [ ] All required fields present
- [ ] Realistic user queries
- [ ] Accurate, helpful responses
- [ ] Balanced category distribution
- [ ] Dataset metadata documented

See resources/quality-validation.md for validation details, troubleshooting, and documentation templates.

### Phase 5: Integration & Training

Prepare for training with your framework of choice:

Output Files:
- training_data.jsonl - Main training set
- validation_data.jsonl - Optional validation set
- dataset_info.txt - Metadata and statistics

Framework Setup:
- Unsloth: Automatic ChatML detection, efficient 4-bit training
- Axolotl: Specify `type: chat_template` and `chat_template: chatml` in the config
- Hugging Face: Use the tokenizer's `apply_chat_template()` method (see the sketch below)
- Custom: Load from JSONL and handle ChatML formatting yourself

See resources/framework-integration.md for setup code, hyperparameters, deployment options, and best practices.
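
For the Hugging Face route, a minimal sketch of rendering ChatML examples through a tokenizer's chat template (the model name is a placeholder; substitute the model you plan to fine-tune):

```python
import json
from transformers import AutoTokenizer

# Placeholder model; any chat model that ships a chat template works.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

with open("training_data.jsonl") as f:
    for line in f:
        example = json.loads(line)
        # Render the messages into the model's chat format as plain text.
        text = tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
            add_generation_prompt=False,
        )
        print(text)
        break  # show just the first example
```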

## ChatML Format Overview

Each training example is a JSON object with a messages array:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}

Roles:
- system: Sets assistant behavior (optional but recommended)
- user: User's input/query
- assistant: Model's expected response

Multi-turn: Add additional user/assistant message pairs for conversations.
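
For example, a two-turn conversation is still a single JSONL line, with the turns appended in order (the content here is illustrative):

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}, {"role": "user", "content": "Does that work on lists too?"}, {"role": "assistant", "content": "Yes. `my_list[::-1]` returns a reversed copy, while `my_list.reverse()` reverses in place."}]}
```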

See resources/chatml-format.md for detailed specification, validation, common issues, and framework-specific notes.

## Tool Reference

### Scripts in `scripts/`

#### validate_chatml.py

Validates ChatML format JSONL files:

```bash
python scripts/validate_chatml.py training_data.jsonl
python scripts/validate_chatml.py training_data.jsonl --verbose
```

Checks:
- Valid JSON formatting
- Required fields (messages, role, content)
- Valid role values (system, user, assistant)
- Proper message order
- Duplicate detection
- Diversity metrics

#### analyze_dataset.py

Provides comprehensive statistics and analysis:

```bash
python scripts/analyze_dataset.py training_data.jsonl
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```

Provides:
- Dataset overview (total examples, message counts)
- Message length statistics
- System prompt variations
- User query patterns (questions, commands, code-related, length categories)
- Assistant response patterns (code blocks, lists, headers, length categories)
- Quality indicators (diversity score, balance ratio)
- Token estimates and cost projection
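
The token estimates are necessarily rough. For a quick independent sanity check, the common ~4 characters-per-token approximation for English text gives a ballpark figure (a heuristic sketch, not the script's actual method):

```python
import json

total_chars = 0
total_examples = 0
with open("training_data.jsonl") as f:
    for line in f:
        if not line.strip():
            continue
        example = json.loads(line)
        total_examples += 1
        total_chars += sum(len(m["content"]) for m in example["messages"])

# Roughly 4 characters per token for English text.
print(f"{total_examples} examples, ~{total_chars / 4:,.0f} tokens total")
```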

## Common Workflows

### Small Dataset (100-200 examples)

  1. Gather requirements
  2. Create generation plan for 1-2 categories
  3. Generate in single batch, review quality
  4. Validate and document
  5. Ready for training

### Medium Dataset (500-1000 examples)

  1. Gather requirements
  2. Create detailed plan with multiple categories
  3. Generate in 2-3 batches, reviewing after each
  4. Analyze diversity and adjust approach
  5. Fill any gaps
  6. Final validation and documentation

### Large Dataset (2000+ examples)

  1. Gather comprehensive requirements
  2. Create multi-batch generation plan
  3. Batch 1 (50-100): Foundation examples
  4. Batch 2 (100-200): Complexity expansion
  5. Batch 3 (100-200): Coverage filling
  6. Batch 4 (50-100): Polish and validation
  7. Run full validation suite
  8. Generate comprehensive documentation

## Best Practices

### Start Small, Iterate

  1. Generate 10-20 examples first
  2. Review and get feedback
  3. Refine approach based on feedback
  4. Scale up to full dataset

### Quality Over Quantity

- Better to have 500 diverse, high-quality examples than 5,000 repetitive ones
- Each example should teach something new
- Maintain consistent response quality throughout

### Diversify Systematically

- Vary query phrasing (questions, commands, statements)
- Cover different expertise levels
- Mix response complexities
- Include edge cases (typically 20-30% of the dataset)
- Use the batch generation workflow for large datasets

### Test Before Deployment

- Test the dataset with your actual training framework
- Monitor training metrics for issues
- Test fine-tuned model outputs before deployment
- Compare results to the base model

### Document Everything

- Keep notes on generation parameters
- Save different dataset versions
- Document any modifications made
- Record generation strategies used
- Track model performance metrics

## Advanced Features

### Batch Generation Strategy

For datasets of 500+ examples:
- Generate 50-100 examples at a time
- Review distribution and diversity after each batch
- Adjust generation strategy based on identified gaps
- Prevents repetition and maintains creativity
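
When batches are written to separate files, a small merge step with a cross-batch duplicate check catches repetition early. A sketch, assuming batch files named batch_01.jsonl, batch_02.jsonl, and so on (the naming is illustrative):

```python
import glob
import json

seen_queries = set()
merged = []

for path in sorted(glob.glob("batch_*.jsonl")):
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            example = json.loads(line)
            # Normalize the first user turn to spot cross-batch repeats.
            user = next((m["content"] for m in example["messages"]
                         if m["role"] == "user"), "")
            key = " ".join(user.lower().split())
            if key in seen_queries:
                print(f"{path}: repeated user query: {user[:60]!r}")
                continue
            seen_queries.add(key)
            merged.append(example)

with open("training_data.jsonl", "w") as f:
    for ex in merged:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

print(f"Merged {len(merged)} unique examples")
```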

### Common Pitfalls to Avoid

- Over-templating: Creates repetitive patterns (vary naturally)
- Unrealistic Queries: Overly formal/robotic user inputs (use varied phrasing)
- Narrow Coverage: Limited scenarios and phrasing (ensure diversity)
- Inconsistent Quality: Quality degradation over time (use the quality checklist)
- JSON Errors: Invalid formatting breaking training (always validate)
- Missing Context: System prompts without detail (provide clear instructions)
- Response Mismatch: Responses don't address queries (verify relevance)

## Dataset Size Recommendations

| Task Complexity | Recommended Size | Notes |
| --- | --- | --- |
| Simple tasks | 100-500 | Well-defined, limited variation |
| Medium tasks | 500-2,000 | Multiple scenarios, moderate complexity |
| Complex tasks | 2,000-10,000+ | Many edge cases, high variability |
| Domain adaptation | 1,000-5,000 | Specialized knowledge required |

## Resources

- resources/dataset-strategy.md
- resources/generation-techniques.md
- resources/chatml-format.md
- resources/examples.md
- resources/quality-validation.md
- resources/framework-integration.md

Version: 2.0 | Updated: 2024 | Pattern: Modular Orchestration

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.