Install a specific skill from the multi-skill repository:

```bash
npx skills add markpitt/claude-skills --skill "fine-tuning-data-generator"
```
# Description
Generates comprehensive synthetic fine-tuning datasets in ChatML format (JSONL) for use with Unsloth, Axolotl, and similar training frameworks. Gathers requirements, creates datasets with diverse examples, validates quality, and provides framework integration guidance.
# SKILL.md
name: fine-tuning-data-generator
description: Generates comprehensive synthetic fine-tuning datasets in ChatML format (JSONL) for use with Unsloth, Axolotl, and similar training frameworks. Gathers requirements, creates datasets with diverse examples, validates quality, and provides framework integration guidance.
version: 2.0
allowed-tools: Read, Write, Edit, Bash
Fine-Tuning Data Generator
This skill generates high-quality synthetic training data in ChatML format for fine-tuning language models using frameworks like Unsloth, Axolotl, or similar tools.
What Do I Need?
| Need | Resource |
|---|---|
| Planning my dataset - requirements, strategy, quality checklist | resources/dataset-strategy.md |
| How to create diverse examples - variation techniques, multi-turn patterns, format-specific guidance | resources/generation-techniques.md |
| ChatML format details - structure, specification, common issues, framework compatibility | resources/chatml-format.md |
| Example datasets - inspiration across domains, multi-turn samples, edge cases | resources/examples.md |
| Validating quality - validation workflow, analyzing datasets, troubleshooting | resources/quality-validation.md |
| Training & deployment - framework setup, hyperparameters, optimization, deployment | resources/framework-integration.md |
Workflow
Phase 1: Gather Requirements
Start with these essential clarifying questions:
Task Definition:
- What is the model being trained to do? (e.g., customer support, code generation, creative writing)
- What specific domain or subject matter? (e.g., legal, medical, e-commerce, software development)
- How many training examples are needed? (Recommend: 100+ for simple tasks, 500-1000+ for complex)
Quality & Diversity:
- Complexity range: simple to complex mix, or focus on specific difficulty level?
- Diversity: edge cases, error handling, unusual scenarios?
- Tone/style: professional, friendly, technical, concise, detailed?
- Response length preferences?
- Any specific formats: code blocks, lists, tables, JSON?
Dataset Composition:
- Distribution across subtopics: evenly distributed or weighted?
- Include negative examples (what NOT to do)?
- Need validation split? (Recommend 10-20% of total)
See resources/dataset-strategy.md for detailed question templates.
Phase 2: Create Generation Plan
Present a plan covering:
- Number and distribution of examples across categories
- Key topics/scenarios to cover
- Diversity strategies (phrasing variations, complexity levels, edge cases)
- System prompt approach (consistent vs. varied)
- Quality assurance approach
Get user approval before generating.
Phase 3: Generate Synthetic Data
Create examples following these quality standards:
Key Principles:
- Realistic scenarios reflecting real-world use cases
- Natural language with varied phrasing and formality levels
- Accurate, helpful responses aligned with desired behavior
- Consistent ChatML formatting throughout
- Balanced difficulty (unless specified)
- Meaningful variety (no repetition)
- Include edge cases and error scenarios
Diversity Techniques:
- Vary query phrasing (questions, commands, statements)
- Include different expertise levels (beginner, intermediate, expert)
- Cover both positive and negative examples
- Mix short and long-form responses
- Include multi-step reasoning when appropriate
- Add context variations
See resources/generation-techniques.md for detailed techniques, domain-specific guidance, and batch generation workflow.
Phase 4: Validate & Document
Run validation tools and checks:
```bash
# Validate JSON formatting and structure
python scripts/validate_chatml.py training_data.jsonl

# Analyze dataset statistics and diversity
python scripts/analyze_dataset.py training_data.jsonl

# Export statistics
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```
Quality Checklist:
- [ ] JSON validation passed (no errors)
- [ ] Analysis shows good diversity metrics
- [ ] Manual sample review passed
- [ ] No duplicate or near-duplicate examples
- [ ] All required fields present
- [ ] Realistic user queries
- [ ] Accurate, helpful responses
- [ ] Balanced category distribution
- [ ] Dataset metadata documented
See resources/quality-validation.md for validation details, troubleshooting, and documentation templates.
Phase 5: Integration & Training
Prepare for training with your framework of choice:
Output Files:
- training_data.jsonl - Main training set
- validation_data.jsonl - Optional validation set
- dataset_info.txt - Metadata and statistics
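If you opted for the validation split recommended in Phase 1, a minimal sketch of carving it out of the generated examples might look like the following; the input filename `all_examples.jsonl` and the 90/10 ratio are illustrative assumptions, not part of the bundled scripts.

```python
import json
import random

# Minimal sketch (not a bundled script): split generated examples into the
# training and validation files listed above. The input name and 90/10 ratio
# are illustrative; Phase 1 recommends holding out 10-20% for validation.
with open("all_examples.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

random.seed(42)           # reproducible shuffle
random.shuffle(examples)
cut = int(len(examples) * 0.9)

with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples[:cut]:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

with open("validation_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples[cut:]:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```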
Framework Setup:
- Unsloth: Automatic ChatML detection, efficient 4-bit training
- Axolotl: Specify `type: chat_template` and `chat_template: chatml`
- Hugging Face: Use the tokenizer's `apply_chat_template()` method
- Custom: Load from JSONL and handle ChatML formatting yourself
See resources/framework-integration.md for setup code, hyperparameters, deployment options, and best practices.
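As a rough illustration of the Hugging Face bullet above, the sketch below loads one example and renders it with the tokenizer's chat template; the model name is only a placeholder, and the actual training setup is covered in resources/framework-integration.md.

```python
import json
from transformers import AutoTokenizer

# Illustrative only: render a ChatML-style example with a chat template.
# The model name is a placeholder; use whichever base model you fine-tune.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

with open("training_data.jsonl", encoding="utf-8") as f:
    example = json.loads(f.readline())

text = tokenizer.apply_chat_template(
    example["messages"],
    tokenize=False,               # return the formatted string, not token ids
    add_generation_prompt=False,  # the assistant turn is already in the data
)
print(text)
```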
ChatML Format Overview
Each training example is a JSON object with a messages array:
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}
Roles:
- system: Sets assistant behavior (optional but recommended)
- user: User's input/query
- assistant: Model's expected response
Multi-turn: Add additional user/assistant message pairs for conversations.
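For example, a multi-turn record (content invented purely for illustration) could look like this:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What does `git stash` do?"}, {"role": "assistant", "content": "It shelves your uncommitted changes so you can switch tasks, then restores them later with `git stash pop`."}, {"role": "user", "content": "Can it include untracked files?"}, {"role": "assistant", "content": "Yes, run `git stash -u` to stash untracked files as well."}]}
```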
See resources/chatml-format.md for detailed specification, validation, common issues, and framework-specific notes.
Tool Reference
Scripts in scripts/
validate_chatml.py
Validates ChatML format JSONL files:
```bash
python scripts/validate_chatml.py training_data.jsonl
python scripts/validate_chatml.py training_data.jsonl --verbose
```
Checks:
- Valid JSON formatting
- Required fields (messages, role, content)
- Valid role values (system, user, assistant)
- Proper message order
- Duplicate detection
- Diversity metrics
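The actual script ships with the skill under scripts/; as a hedged sketch of the kind of structural checks listed above (valid JSON, required fields, valid roles, exact duplicates), a minimal validator might look like this:

```python
import json
import sys

VALID_ROLES = {"system", "user", "assistant"}

def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty means valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    errors = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"message {i}: not a JSON object")
            continue
        if msg.get("role") not in VALID_ROLES:
            errors.append(f"message {i}: invalid role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            errors.append(f"message {i}: missing or empty content")
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        errors.append("conversation does not end with an assistant message")
    return errors

if __name__ == "__main__":
    seen = set()
    with open(sys.argv[1], encoding="utf-8") as f:
        for n, raw in enumerate(f, 1):
            line = raw.strip()
            if not line:
                continue
            for problem in validate_line(line):
                print(f"line {n}: {problem}")
            if line in seen:  # crude exact-duplicate check
                print(f"line {n}: exact duplicate of an earlier example")
            seen.add(line)
```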
analyze_dataset.py
Provides comprehensive statistics and analysis:
```bash
python scripts/analyze_dataset.py training_data.jsonl
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```
Provides:
- Dataset overview (total examples, message counts)
- Message length statistics
- System prompt variations
- User query patterns (questions, commands, code-related, length categories)
- Assistant response patterns (code blocks, lists, headers, length categories)
- Quality indicators (diversity score, balance ratio)
- Token estimates and cost projection
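Again, the bundled script is the authoritative tool; a rough sketch of computing a few such statistics from a JSONL file (word counts stand in for real token estimates) might look like:

```python
import json
from collections import Counter
from statistics import mean

def basic_stats(path: str) -> dict:
    """Illustrative sketch, not the bundled script: a few dataset statistics."""
    user_words, assistant_words = [], []
    system_prompts = Counter()
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            total += 1
            for msg in json.loads(line)["messages"]:
                if msg["role"] == "system":
                    system_prompts[msg["content"]] += 1
                elif msg["role"] == "user":
                    user_words.append(len(msg["content"].split()))
                elif msg["role"] == "assistant":
                    assistant_words.append(len(msg["content"].split()))
    return {
        "examples": total,
        "avg_user_words": round(mean(user_words), 1) if user_words else 0,
        "avg_assistant_words": round(mean(assistant_words), 1) if assistant_words else 0,
        "distinct_system_prompts": len(system_prompts),
    }

print(basic_stats("training_data.jsonl"))
```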
Common Workflows
Small Dataset (100-200 examples)
- Gather requirements
- Create generation plan for 1-2 categories
- Generate in single batch, review quality
- Validate and document
- Ready for training
Medium Dataset (500-1000 examples)
- Gather requirements
- Create detailed plan with multiple categories
- Generate in 2-3 batches, reviewing after each
- Analyze diversity and adjust approach
- Fill any gaps
- Final validation and documentation
Large Dataset (2000+ examples)
- Gather comprehensive requirements
- Create multi-batch generation plan
- Batch 1 (50-100): Foundation examples
- Batch 2 (100-200): Complexity expansion
- Batch 3 (100-200): Coverage filling
- Batch 4 (50-100): Polish and validation
- Run full validation suite
- Generate comprehensive documentation
Best Practices
Start Small, Iterate
- Generate 10-20 examples first
- Review and get feedback
- Refine approach based on feedback
- Scale up to full dataset
Quality Over Quantity
- Better to have 500 diverse, high-quality examples than 5,000 repetitive ones
- Each example should teach something new
- Maintain consistent response quality throughout
Diversify Systematically
- Vary query phrasing (questions, commands, statements)
- Cover different expertise levels
- Mix response complexities
- Include edge cases (typically 20-30% of dataset)
- Use batch generation workflow for large datasets
Test Before Deployment
- Test dataset with actual training framework
- Monitor training metrics for issues
- Test fine-tuned model outputs before deployment
- Compare results to base model
Document Everything
- Keep notes on generation parameters
- Save different dataset versions
- Document any modifications made
- Record generation strategies used
- Track model performance metrics
Advanced Features
Batch Generation Strategy
For datasets of 500+ examples:
- Generate 50-100 examples at a time
- Review distribution and diversity after each batch
- Adjust generation strategy based on identified gaps
- Prevents repetition and maintains creativity
Common Pitfalls to Avoid
- Over-templating: Creates repetitive patterns (vary naturally)
- Unrealistic Queries: Overly formal/robotic user inputs (use varied phrasing)
- Narrow Coverage: Limited scenarios and phrasing (ensure diversity)
- Inconsistent Quality: Quality degradation over time (use quality checklist)
- JSON Errors: Invalid formatting breaking training (always validate)
- Missing Context: System prompts without detail (provide clear instructions)
- Response Mismatch: Responses don't address queries (verify relevance)
Dataset Size Recommendations
| Task Complexity | Recommended Size | Notes |
|---|---|---|
| Simple tasks | 100-500 | Well-defined, limited variation |
| Medium tasks | 500-2,000 | Multiple scenarios, moderate complexity |
| Complex tasks | 2,000-10,000+ | Many edge cases, high variability |
| Domain adaptation | 1,000-5,000 | Specialized knowledge required |
Resources
- Planning & Strategy:
resources/dataset-strategy.md- Requirements gathering, planning, quality checklists - Generation Techniques:
resources/generation-techniques.md- Diversity techniques, domain-specific guidance, batch workflows - ChatML Specification:
resources/chatml-format.md- Format details, validation, framework notes - Example Datasets:
resources/examples.md- Diverse domain examples, multi-turn patterns - Quality Validation:
resources/quality-validation.md- Validation workflow, analysis, troubleshooting - Framework Integration:
resources/framework-integration.md- Setup for Unsloth, Axolotl, HuggingFace; deployment options
Version: 2.0 | Updated: 2024 | Pattern: Modular Orchestration
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.