# zysilm-ai / ai-video-producer

# Install this skill:
npx skills add zysilm-ai/ai-video-producer-skill

Or install this specific skill: npx add-skill https://github.com/zysilm-ai/ai-video-producer-skill

# Description

> Complete AI video production workflow using WAN 2.1 and Qwen Image Edit 2511 via ComfyUI, running locally on RTX 3080+ (10GB+ VRAM).

# SKILL.md


---
name: ai-video-producer
description: >
  Complete AI video production workflow using WAN 2.1 and Qwen Image Edit 2511 models via ComfyUI.
  Creates any video type: promotional, educational, narrative, social media,
  animations, game trailers, music videos, product demos, and more. Use when
  users want to create videos with AI, need help with video storyboarding,
  keyframe generation, or video prompt writing. Follows a philosophy-first
  approach: establish visual style and production philosophy, then execute
  scene by scene with user feedback at each stage. Supports advanced features
  like layer-based compositing, reference-based generation, and style
  consistency. Runs locally on RTX 3080+ (10GB+ VRAM).
allowed-tools: Bash, Read, Write, Edit, Glob, AskUserQuestion, TodoWrite
---


AI Video Producer

Create professional AI-generated videos through a structured, iterative workflow using local models.

Prerequisites & Auto-Setup

Requires: WAN 2.1 + Qwen Image Edit via ComfyUI (GGUF quantization, ~40GB models)

Setup Commands

python {baseDir}/scripts/setup_comfyui.py         # Full setup (first time)
python {baseDir}/scripts/setup_comfyui.py --check # Verify setup
python {baseDir}/scripts/setup_comfyui.py --start # Start server before generating

System Requirements

| Component | Minimum |
|-----------|---------|
| GPU VRAM | 10GB |
| RAM | 16GB |
| Storage | 40GB free |

See README.md for detailed setup instructions.

MANDATORY WORKFLOW REQUIREMENTS

YOU MUST FOLLOW THESE RULES:

  1. ALWAYS use TodoWrite at the start to create a task list for the entire workflow
  2. ALWAYS ask about Approval Mode at the very start (see below)
  3. NEVER skip phases - complete each phase in order before proceeding
  4. ALWAYS create required files - philosophy.md, style.json, scene-breakdown.md, and pipeline.json are REQUIRED
  5. ALWAYS break videos into multiple scenes - minimum 2 scenes for any video over 5 seconds
  6. NEVER generate without a complete pipeline.json - plan ALL prompts first, execute second
  7. ALWAYS use execute_pipeline.py for generation - deterministic execution, no ad-hoc commands
  8. ALWAYS review generated outputs using VLM - view images after each stage, assess quality

Approval Mode Selection (FIRST STEP)

At the very beginning of the workflow, ask the user to choose an approval mode:

Use AskUserQuestion with these options:
- "Manual approval" - User approves each phase before proceeding (philosophy, scenes, pipeline, assets, keyframes, videos)
- "Automatic approval" - LLM proceeds automatically, user only reviews final output

| Mode | User Interaction | Best For |
|------|------------------|----------|
| Manual | Checkpoint at each phase | First-time projects, precise control, learning the workflow |
| Automatic | Only final review | Trusted workflow, quick generation, batch production |

Store the selected mode and apply it to all checkpoints throughout the workflow.

Pipeline-Based Architecture

This skill uses a two-phase approach:

Phase A: Planning (LLM-Driven)

  • LLM creates philosophy.md, style.json, scene-breakdown.md
  • LLM generates ALL prompts and stores in pipeline.json
  • User reviews and approves the complete plan before any generation

Phase B: Execution (Programmatic)

  • execute_pipeline.py reads pipeline.json and executes deterministically
  • LLM reviews outputs using VLM capability after each stage
  • User approves or requests regeneration

Benefits:
- All prompts visible before ANY generation starts
- Deterministic execution - no LLM deviation during generation
- Reproducible - same pipeline.json = same commands executed
- Traceable - status tracking in pipeline.json

Pipeline Mode: Scene/Segment v3.0

Proper hierarchical structure distinguishing Scenes (narrative/cinematographic units) from Segments (5-second technical chunks).

Key Concepts:
- Scene: A continuous shot from a single camera perspective (e.g., "woman at cafe table", "phone screen close-up")
- Segment: A 5-second video chunk within a scene (due to model limitations)
- Transition: How scenes connect ("cut", "continuous", "fade", "dissolve")

Scene 1 (cut)           Scene 2 (continuous)       Scene 3 (fade)
┌─────────────────┐     ┌─────────────────┐       ┌─────────────────┐
│ KF (generated)  │     │ KF (extracted)  │       │ KF (generated)  │
│ ┌─────────────┐ │     │ ┌─────────────┐ │       │ ┌─────────────┐ │
│ │ Segment A   │ │     │ │ Segment A   │ │       │ │ Segment A   │ │
│ └─────────────┘ │     │ └─────────────┘ │       │ └─────────────┘ │
│ ┌─────────────┐ │     └─────────────────┘       └─────────────────┘
│ │ Segment B   │ │             │                         │
│ └─────────────┘ │             │ continuous              │ fade
└─────────────────┘             ▼                         ▼
        │ cut          ─────────────────          ─────────────────
        ▼              Final merged video with transitions

Pros:
- Semantic clarity (scenes = narrative units, segments = technical units)
- Scene-level keyframes (generated for cuts, extracted for continuous)
- Automatic video merging with transitions (cut, fade, dissolve)
- Hierarchical status tracking

Cons: More complex schema

When to use each transition:
| Transition | Use When | Keyframe |
|------------|----------|----------|
| cut | Camera angle/location changes | Generated (new) |
| continuous | Same shot continues (landscape only) | Extracted (from previous scene) |
| fade | Time skip, dramatic moment | Generated (new) |
| dissolve | Smooth transition between related scenes | Generated (new) |

CRITICAL: Character Scenes and Continuous Transitions

When a scene contains a character (even partially visible, like hands or clothing), do NOT use "extracted" keyframes. Extracted keyframes lose character identity anchoring and cause visual drift (e.g., clothing color changes, style inconsistency).

| Scene Type | Transition | Keyframe Type | Why |
|------------|------------|---------------|-----|
| Landscape (no characters) | continuous | extracted | OK - no character identity to preserve |
| Character visible | continuous | generated | REQUIRED - re-anchor character identity |
| Character visible | cut | generated | Standard - new camera angle |

Rule: If ANY part of a character is visible (hands, clothing, body), use "type": "generated" with character references.
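
For instance, a close-up that shows only the protagonist's hands would still declare a generated keyframe with the character reference. An illustrative excerpt of the pipeline.json v3.0 schema introduced in Phase 3 (IDs, prompt, and paths are placeholders):

"first_keyframe": {
  "type": "generated",
  "prompt": "Close-up of hands holding a folded letter, same jacket sleeves as the protagonist",
  "characters": ["protagonist"],
  "background": "city_street",
  "output": "keyframes/scene-03-start.png",
  "status": "pending"
}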

Standard Checkpoint Format (ALL PHASES)

Checkpoint behavior depends on the selected Approval Mode:

Manual Approval Mode (default)

  1. Show the output to user (file path or display content)
  2. Ask for approval using AskUserQuestion:
     - "Approve" - Proceed to next step
     - "Other" - User specifies what needs to be changed
  3. If user does not approve:
     - User specifies what to change
     - Make the requested adjustments
     - Show updated result → Ask again → Repeat until approved
  4. Do NOT proceed to next phase until approved

Automatic Approval Mode

  1. Show the output to user (file path or display content)
  2. LLM reviews the output using VLM capability (for images/videos)
  3. If LLM assessment is positive β†’ Proceed automatically
  4. If LLM detects issues β†’ Fix and regenerate before proceeding
  5. User only reviews final output at the end

Workflow Phases (MUST COMPLETE IN ORDER)

| Phase | LLM Actions | Required Outputs | Manual Mode | Auto Mode |
|-------|-------------|------------------|-------------|-----------|
| 1. Production Philosophy | Create visual identity & style | philosophy.md, style.json | User approval | LLM proceeds |
| 2. Scene Breakdown | Plan scenes with segments and transitions | scene-breakdown.md | User approval | LLM proceeds |
| 3. Pipeline Generation | Generate prompts with v3.0 schema | pipeline.json | User approval | LLM proceeds |
| 4. Asset Execution | Run execute_pipeline.py --stage assets | assets/ folder | VLM review + user approval | VLM review + proceed |
| 5. Scene Keyframes | Run execute_pipeline.py --stage scene_keyframes | keyframes/scene-*.png | VLM review + user approval | VLM review + proceed |
| 6. Scene Execution | Run execute_pipeline.py --stage scenes | scene-*/merged.mp4 + final/video.mp4 | User approval | LLM proceeds |
| 7. Review & Iterate | Handle regeneration requests | Refinements | User signs off | User reviews final |

Phase 1: Production Philosophy (REQUIRED)

DO NOT PROCEED TO PHASE 2 UNTIL BOTH FILES EXIST:
- {output_dir}/philosophy.md
- {output_dir}/style.json

Step 1.1: Create philosophy.md

Create this file with ALL sections filled in:

# Production Philosophy: [Project Name]

## Visual Identity
- **Art Style**: [e.g., cinematic realistic, stylized animation, painterly]
- **Color Palette**: [primary colors, mood, temperature]
- **Lighting**: [natural, dramatic, soft, high-contrast]
- **Composition**: [rule of thirds, centered, dynamic angles]

## Motion Language
- **Movement Quality**: [smooth/fluid, dynamic/energetic, subtle/minimal]
- **Pacing**: [fast cuts, slow contemplative, rhythmic]
- **Camera Style**: [static, tracking, handheld, cinematic sweeps]

## Subject Consistency
- **Characters/Products**: [detailed descriptions for consistency]
- **Environment**: [setting details that persist across scenes]
- **Props/Elements**: [recurring visual elements]

## Constraints
- **Avoid**: [unwanted elements, styles, or actions]
- **Maintain**: [elements that must stay consistent]

Step 1.2: Create style.json

Create this file for programmatic use with generation scripts:

{
  "project_name": "Project Name Here",
  "visual_style": {
    "art_style": "description",
    "color_palette": "description",
    "lighting": "description",
    "composition": "description"
  },
  "motion_language": {
    "movement_quality": "description",
    "pacing": "description",
    "camera_style": "description"
  },
  "subject_consistency": {
    "main_subject": "detailed description",
    "environment": "detailed description"
  },
  "constraints": {
    "avoid": ["list", "of", "things"],
    "maintain": ["list", "of", "things"]
  }
}

Step 1.3: CHECKPOINT - Get User Approval

  1. Inform user that philosophy.md and style.json have been created
  2. Use AskUserQuestion:
     - "Approve" - Proceed to scene breakdown
     - "Other" - User specifies changes

If user requests changes → make adjustments → ask again → repeat until approved


Phase 2: Scene Breakdown (REQUIRED)

DO NOT PROCEED TO PHASE 3 UNTIL scene-breakdown.md EXISTS AND USER APPROVES

Step 2.1: Analyze Video Requirements

Before creating scenes, determine:
- Total video duration needed
- Number of scenes required (minimum 2 for videos > 5 seconds)
- Key story beats or content moments
- Transitions between scenes

Step 2.2: Create scene-breakdown.md

MANDATORY FORMAT - v3.0 Scene/Segment structure:

# Scene Breakdown: [Project Name]

## Overview
- **Total Duration**: [X seconds]
- **Number of Scenes**: [N]
- **Video Type**: [promotional/narrative/educational/etc.]
- **Pipeline Mode**: Scene/Segment v3.0

---

## Scene Overview

| Scene | Description | Camera | Transition | Duration |
|-------|-------------|--------|------------|----------|
| 1 | [Brief description] | [Camera type] | - | 5s |
| 2 | [Brief description] | [Camera type] | cut/continuous | 5s |

---

## Scenes

### Scene N: [Title]

**Type**: character | landscape
**Duration**: 5 seconds
**Camera**: [static/tracking/pan/zoom]
**Purpose**: [What this scene communicates]

**First Keyframe**: Generated (character scenes) OR Extracted (continuous landscape only)
- Characters: [list IDs - REQUIRED if visible]
- Background: [background ID]

**Segments**:
| ID | Motion | Duration |
|----|--------|----------|
| seg-Na | [Motion description] | 5s |

**Transition to Next**: cut | continuous | fade

[Repeat for all scenes]

Scene Type Explanation:
- character: Scenes with characters - MUST include character references in keyframe
- landscape: Scenes without characters - can use extracted keyframes for continuous transitions

Scene Count Guidelines

WAN generates 5-second clips (81 frames at 16fps)

| Total Video Length | Minimum Scenes | Recommended Scenes |
|--------------------|----------------|--------------------|
| 1-5 seconds | 1 | 1 |
| 6-10 seconds | 2 | 2 |
| 11-15 seconds | 3 | 3 |
| 16-20 seconds | 4 | 4 |
| 20+ seconds | 5+ | Break into 5s beats |

Step 2.3: CHECKPOINT - Get User Approval

  1. Inform user that scene-breakdown.md has been created with [N] scenes
  2. Use AskUserQuestion:
     - "Approve" - Proceed to asset generation
     - "Other" - User specifies changes

If user requests changes → make adjustments → ask again → repeat until approved


Phase 2.5: Asset Generation (REQUIRED)

DO NOT PROCEED TO PHASE 3 UNTIL assets.json EXISTS AND USER APPROVES

This phase creates reusable assets that maintain consistency across all scenes.

Step 2.5.1: Analyze Required Assets

Review scene-breakdown.md and identify all unique:
- Characters (each character that appears in scenes)
- Backgrounds (each unique location/environment)
- Styles (the visual style to apply consistently)
- Objects (any recurring props or items)

Step 2.5.2: Create assets.json

MANDATORY FORMAT:

{
  "characters": {
    "samurai": {
      "description": "Feudal Japanese warrior, red armor, stern expression, dark hair",
      "identity_ref": "assets/characters/samurai.png"
    }
  },
  "backgrounds": {
    "temple_courtyard": {
      "description": "Ancient temple with cherry blossoms, stone paths, morning light",
      "ref_image": "assets/backgrounds/temple_courtyard.png"
    }
  },
  "styles": {
    "ghibli": {
      "description": "Studio Ghibli anime aesthetic, soft colors, painterly",
      "ref_image": "assets/styles/ghibli.png"
    }
  },
  "objects": {
    "katana": {
      "description": "Traditional Japanese sword with black sheath",
      "ref_image": "assets/objects/katana.png"
    }
  }
}

Step 2.5.3: Asset Generation

Assets are generated via execute_pipeline.py --stage assets. This phase defines assets in assets.json for later execution.

Step 2.5.4: CHECKPOINT - Get User Approval

  1. Inform user that assets.json has been created (the asset images themselves are generated later, in Phase 4)
  2. Show the asset definitions to user
  3. Use AskUserQuestion:
     - "Approve" - Proceed to pipeline generation
     - "Other" - User specifies which asset definitions need adjustment

If user requests changes → update the asset definitions → ask again → repeat until approved


Phase 3: Pipeline Generation (REQUIRED)

DO NOT PROCEED TO EXECUTION UNTIL pipeline.json EXISTS AND USER APPROVES

This phase consolidates all prompts into a single structured file that will be executed deterministically.

Step 3.1: Create pipeline.json

Based on philosophy.md, style.json, scene-breakdown.md, and assets.json, create a complete pipeline.json.

Pipeline Schema v3.0

{
  "version": "3.0",
  "project_name": "project-name",
  "metadata": {
    "created_at": "ISO timestamp",
    "philosophy_file": "philosophy.md",
    "style_file": "style.json",
    "scene_breakdown_file": "scene-breakdown.md"
  },
  "assets": {
    "characters": {
      "protagonist": {
        "prompt": "Character sheet description...",
        "output": "assets/characters/protagonist.png",
        "status": "pending"
      }
    },
    "backgrounds": {
      "location": {
        "prompt": "Background description...",
        "output": "assets/backgrounds/location.png",
        "status": "pending"
      }
    }
  },
  "scenes": [
    {
      "id": "scene-01",
      "description": "Scene description",
      "camera": "medium shot",
      "transition_from_previous": null,
      "first_keyframe": {
        "type": "generated",
        "prompt": "Keyframe description...",
        "background": "location",
        "characters": ["protagonist"],
        "output": "keyframes/scene-01-start.png",
        "status": "pending"
      },
      "segments": [
        {
          "id": "seg-01-a",
          "motion_prompt": "Motion description...",
          "output_video": "scene-01/seg-a.mp4",
          "output_keyframe": "keyframes/scene-01-seg-a-end.png",
          "status": "pending"
        }
      ],
      "output_video": "scene-01/merged.mp4",
      "status": "pending"
    }
  ],
  "final_video": {
    "output": "final/video.mp4",
    "status": "pending"
  }
}

Additional scene examples:
- Cut transition: "transition_from_previous": {"type": "cut"}
- Continuous (landscape only): "transition_from_previous": {"type": "continuous"}, "first_keyframe": {"type": "extracted"}

v3.0 Schema Key Concepts:

| Field | Description |
|-------|-------------|
| scenes[].first_keyframe.type | "generated" = create new keyframe with character refs, "extracted" = use previous scene's end (landscape only!) |
| scenes[].first_keyframe.characters | Array of character IDs to reference - REQUIRED if character is visible |
| scenes[].transition_from_previous | null for first scene, or {"type": "cut/continuous/fade/dissolve"} |
| scenes[].segments[] | Array of 5-second video chunks within the scene |
| segments[].output_keyframe | Extracted last frame (required for all but last segment) |
| scenes[].output_video | Merged video of all segments in this scene |
| final_video | All scene videos merged with transitions |

Keyframe Type Selection (CRITICAL for consistency):

| Scene Content | first_keyframe.type | characters array | Notes |
|---------------|---------------------|------------------|-------|
| Character fully visible | "generated" | Required | Include all visible characters |
| Character partially visible (hands, clothing) | "generated" | Required | Include character for clothing/style consistency |
| Landscape only (no characters) | "generated" or "extracted" | Optional | Can use extracted for continuous transitions |

Transition Types:
- cut: Hard cut (instant switch between scenes)
- continuous: Seamless continuation - use "extracted" ONLY for landscape scenes, use "generated" with character refs for character scenes
- fade: Fade through black (duration configurable, default 0.5s)
- dissolve: Cross-dissolve (duration configurable, default 0.5s)
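
As an illustration, a landscape scene that continues seamlessly from the previous one combines a "continuous" transition with an "extracted" first keyframe. A sketch following the v3.0 schema above (IDs, prompts, and paths are placeholders):

{
  "id": "scene-02",
  "description": "Wide shot of the empty temple courtyard at dawn",
  "camera": "slow pan",
  "transition_from_previous": {"type": "continuous"},
  "first_keyframe": {
    "type": "extracted",
    "output": "keyframes/scene-02-start.png",
    "status": "pending"
  },
  "segments": [
    {
      "id": "seg-02-a",
      "motion_prompt": "mist drifting across the courtyard, cherry blossom petals falling, camera panning right slowly",
      "output_video": "scene-02/seg-a.mp4",
      "output_keyframe": "keyframes/scene-02-seg-a-end.png",
      "status": "pending"
    }
  ],
  "output_video": "scene-02/merged.mp4",
  "status": "pending"
}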

Step 3.2: Pipeline Prompt Writing Guidelines

For Character Prompts:
- Include full physical description (hair, eyes, clothing, distinguishing features)
- Mention "anime style character sheet, A-pose, full body, white background"
- Include multiple views: "front view, side view, back view"

For Background Prompts:
- Describe setting, lighting, atmosphere
- Include "no people, establishing shot"
- Match style from philosophy.md

For Keyframe Prompts:
- Use positional language: "On the left:", "On the right:", "In the center:"
- Reference character appearance from assets
- Include background context and lighting
- Match style from philosophy.md

For Video Prompts (I2V Motion Prompts):

I2V models tend to produce static video. Use these rules:

  1. Separate subject motion from camera motion - describe both explicitly
  2. Describe physical body movements - "legs pumping", "arms swinging", not just "running"
  3. Include environmental interaction - "boots splashing through mud", "hair flowing in wind"
  4. Avoid POV/first-person - I2V struggles with perspective-based motion
  5. Use motion verbs - "sprinting" not "in motion"

Motion Prompt Structure:

[SUBJECT] [ACTION VERB] [BODY PART DETAILS], [ENVIRONMENTAL INTERACTION], camera [CAMERA MOVEMENT]

Example: "soldier sprints through trench, legs driving forward, rifle bouncing against chest, mud splashing from boots, camera tracking from behind at shoulder height"

Camera Terms: "tracking from behind", "dollying alongside", "pushing in slowly", "holding steady", "panning left/right", "tilting up/down"

See references/prompt-engineering.md for detailed guidance.
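
Putting this structure to work, a segment entry in pipeline.json might carry the example motion prompt above. An illustrative sketch (IDs and paths are placeholders):

{
  "id": "seg-04-a",
  "motion_prompt": "soldier sprints through trench, legs driving forward, rifle bouncing against chest, mud splashing from boots, camera tracking from behind at shoulder height",
  "output_video": "scene-04/seg-a.mp4",
  "output_keyframe": "keyframes/scene-04-seg-a-end.png",
  "status": "pending"
}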

Step 3.3: CHECKPOINT - Get User Approval

  1. Show the complete pipeline.json to user
  2. Highlight all prompts for review
  3. Use AskUserQuestion:
     - "Approve" - Proceed to execution
     - "Other" - User specifies which prompts need adjustment

If user requests changes → update pipeline.json → ask again → repeat until approved


Phase 4: Asset Execution

Execute asset generation using the pipeline executor.

Step 4.1: Run Asset Stage

python {baseDir}/scripts/execute_pipeline.py {output_dir}/pipeline.json --stage assets

This will:
- Generate all characters, backgrounds defined in pipeline.json
- Automatically use --free-memory for each generation
- Update status in pipeline.json as items complete
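
To confirm what has completed after the run, the per-item status tracked in pipeline.json can be inspected with the documented --status flag, for example:

# Show which pipeline items are still pending
python {baseDir}/scripts/execute_pipeline.py {output_dir}/pipeline.json --status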

Step 4.2: Review Assets with VLM

After execution completes, use the Read tool to view each generated asset:

1. View each character asset - verify appearance matches description
2. View each background asset - verify setting and style

Step 4.3: CHECKPOINT - Get User Approval

  1. Show generated assets to user (use Read tool to display images)
  2. Report on quality and any issues noticed
  3. Use AskUserQuestion:
     - "Approve" - Proceed to keyframes
     - "Other" - User specifies which assets need regeneration

If regeneration needed:
1. Update the prompt in pipeline.json
2. Run: python {baseDir}/scripts/execute_pipeline.py {output_dir}/pipeline.json --regenerate <asset_id>
3. Review again → repeat until approved


Phase 5: Keyframe Execution

Execute keyframe generation using the pipeline executor.

Step 5.1: Run Keyframe Stage

python {baseDir}/scripts/execute_pipeline.py {output_dir}/pipeline.json --stage scene_keyframes

This will:
- Generate all keyframes defined in pipeline.json
- Reference assets by ID (resolved to file paths automatically)
- Use --free-memory for EVERY keyframe (mandatory)
- Update status in pipeline.json as items complete

Step 5.2: Review Keyframes with VLM

After execution completes, use the Read tool to view each keyframe:

1. View each keyframe image
2. Check character consistency with assets
3. Check background consistency
4. Check style matches philosophy

Step 5.3: CHECKPOINT - Get User Approval

  1. Show generated keyframes to user (use Read tool to display images)
  2. Report on quality and consistency
  3. Use AskUserQuestion:
  4. "Approve" - Proceed to videos
  5. User selects "Other" to specify which keyframes need regeneration

If regeneration needed:
1. Update the prompt in pipeline.json
2. Run: python {baseDir}/scripts/execute_pipeline.py {output_dir}/pipeline.json --regenerate <KF-id>
3. Review again → repeat until approved


Phase 6: Video Execution

Execute video generation using the pipeline executor.

Step 6.1: Run Video Stage

python {baseDir}/scripts/execute_pipeline.py {output_dir}/pipeline.json --stage scenes

This will:
- Generate all videos defined in pipeline.json
- Reference keyframes by ID (resolved to file paths automatically)
- Use --free-memory only on first video (switching from image to video models)
- Update status in pipeline.json as items complete

Step 6.2: CHECKPOINT - Get User Approval

  1. Inform user of generated video locations
  2. Use AskUserQuestion:
     - "Approve" - Complete
     - "Other" - User specifies which videos need regeneration

If regeneration needed:
1. Update the prompt in pipeline.json
2. Run: python {baseDir}/scripts/execute_pipeline.py {output_dir}/pipeline.json --regenerate <scene-id>
3. Review again → repeat until approved


Phase 7: Review & Iterate

Handle any final adjustments requested by user.


Reference: Keyframe Generation Details

The following sections provide reference information used by the pipeline executor.

v3.0 Keyframe Strategy

In Scene/Segment v3.0 mode:
- Each scene has ONE first keyframe (generated or extracted)
- Video generation uses I2V: first keyframe + motion prompt → video
- End frames are automatically extracted from generated videos

Scene 1 (cut)         Scene 2 (cut)         Scene 3 (continuous)
┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│ KF generated│       │ KF generated│       │ KF extracted│
│     ↓       │       │     ↓       │       │     ↓       │
│ I2V video   │ ──→   │ I2V video   │ ──→   │ I2V video   │
│     ↓       │       │     ↓       │       │     ↓       │
│ End extract │       │ End extract │       │ End extract │
└─────────────┘       └─────────────┘       └─────────────┘

Scene Type and Keyframe Rules

| Scene Type | Keyframe Type | Character Refs | Notes |
|------------|---------------|----------------|-------|
| character (any character visible) | Generated | REQUIRED | Always re-anchor identity |
| landscape (no characters) | Generated or Extracted | Optional | Can use extracted for continuous |

Reference Chain Rules

These rules ensure consistency across scenes:

| Asset Type | Chain Behavior |
|------------|----------------|
| Character Identity | ALWAYS use original asset from assets/characters/ (never chain) |
| Background | Chain from previous scene's background for continuity |
| Style | ALWAYS apply with style asset reference |

Character Consistency Rules (CRITICAL)

Problem: Without character references, I2V models cause "identity drift" - clothing colors change, styles shift, and characters become unrecognizable across scenes.

Solution: Include character references for ANY scene where the character is visible, even partially.

When to include character references:

| What's Visible | Include Character Reference? | Example |
|----------------|------------------------------|---------|
| Full body | YES | Wide shot of character walking |
| Upper body only | YES | Medium shot conversation |
| Hands only | YES | Close-up of hands holding object |
| Clothing only (no face) | YES | Back view of character running |
| Character's belongings | Optional | Close-up of character's bag |
| No character elements | NO | Landscape, building exterior |

Common Mistakes:
- Close-up of hands without character reference → clothing inconsistency
- Using "type": "extracted" for character scenes → identity drift

The Character Drift Problem: Without character references, each scene accumulates deviations. By Scene 4, character may have different clothing color/style.

Solution: Always use "type": "generated" with "characters": ["id"] for any scene with visible character parts.

Reference Slot Allocation

The keyframe generator uses 3 reference image slots:

| Slot | Without --background | With --background |
|------|----------------------|-------------------|
| image1 | Character 1 | Background |
| image2 | Character 2 | Character 1 |
| image3 | Character 3 | Character 2 |

Note: With --background, maximum 2 characters are supported (3 slots total).

Character Count Decision Matrix

The system has 3 reference image slots. Use this matrix to determine the approach:

| # Characters | Background | Approach |
|--------------|------------|----------|
| 0 | Any | Use landscape type with asset_generator.py background |
| 1 | No | --character A (empty slots auto-filled with fallback) |
| 1 | Yes | --background B --character A (empty slot auto-filled) |
| 2 | No | --character A --character B (slot 3 auto-filled) |
| 2 | Yes | --background B --character A --character B (all slots used) |
| 3 | No | --character A --character B --character C (all slots used) |
| 3+ | Yes | Workaround required - see below |
| 4+ | Any | Workaround required - see below |

Handling 3+ Characters with Background OR 4+ Characters:

When exceeding 3 reference slots:
1. Select 2-3 most important characters for reference slots
2. Describe ALL characters in prompt with positional language
3. Trade-off: Referenced characters have strong identity; others rely on prompt
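
For example, a three-character keyframe without a background reference could be generated like this (a sketch using the documented keyframe_generator.py flags; asset paths, prompt, and output are placeholders):

python {baseDir}/scripts/keyframe_generator.py \
  --free-memory \
  --prompt "On the left: warrior attacking. In the center: monk meditating. On the right: ninja defending." \
  --character assets/characters/warrior.png \
  --character assets/characters/monk.png \
  --character assets/characters/ninja.png \
  --output keyframes/scene-05-start.png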

Keyframe Quality Checklist

Before proceeding to video, verify EACH keyframe:
- Subject appears correctly (no distortion)
- Style matches Production Philosophy
- Composition allows for intended motion
- Characters are consistent with assets
- Background/environment is consistent
- Lighting direction is consistent

Key Rule: NEVER chain keyframes as references - always use original assets from assets/characters/


Output Directory Structure (REQUIRED)

{output_dir}/
├── philosophy.md              # Production philosophy
├── style.json                 # Style configuration
├── scene-breakdown.md         # Scene breakdown with segments
├── pipeline.json              # v3.0 pipeline definition
│
├── assets/                    # Reusable character/background assets
│   ├── characters/
│   │   ├── protagonist.png
│   │   └── sidekick.png
│   └── backgrounds/
│       ├── city_street.png
│       └── rooftop.png
│
├── keyframes/                 # Scene start keyframes + extracted end frames
│   ├── scene-01-start.png     # Generated from assets
│   ├── scene-01-seg-a-end.png # Extracted from video
│   ├── scene-02-start.png     # Generated (cut) or extracted (continuous)
│   └── scene-02-seg-a-end.png
│
├── scene-01/
│   ├── seg-a.mp4              # Segment video
│   └── merged.mp4             # Scene merged video
│
├── scene-02/
│   ├── seg-a.mp4
│   └── merged.mp4
│
└── final/
    └── video.mp4              # All scenes merged

Note: Each scene has one start keyframe. End frames are extracted from generated videos, not pre-generated.


TodoWrite Template

At the START of the workflow, create the appropriate todo list based on selected approval mode:

Manual Approval Mode

1. Ask user to select approval mode (Manual/Automatic)
2. Check ComfyUI setup and start server
3. Create philosophy.md and style.json
4. Get user approval on production philosophy
5. Create scene-breakdown.md
6. Get user approval on scene breakdown
7. Create pipeline.json (v3.0 schema)
8. Get user approval on pipeline.json
9. Execute assets stage, review with VLM, get user approval
10. Execute keyframes stage, review with VLM, get user approval
11. Execute scenes stage, get user approval
12. Provide final summary

Automatic Approval Mode

1. Ask user to select approval mode (Manual/Automatic)
2. Check ComfyUI setup and start server
3. Create philosophy.md and style.json
4. Create scene-breakdown.md
5. Create pipeline.json (v3.0 schema)
6. Execute assets stage, review with VLM
7. Execute keyframes stage, review with VLM
8. Execute scenes stage
9. Present final output to user for review

Key points:
- ALL prompts are written to pipeline.json BEFORE any generation starts
- execute_pipeline.py handles VRAM management automatically
- LLM always reviews outputs with VLM (both modes)
- In Auto mode, LLM proceeds unless VLM detects issues

Setup Verification (Step 1)

python {baseDir}/scripts/setup_comfyui.py --check   # Verify setup
python {baseDir}/scripts/setup_comfyui.py --start   # Start ComfyUI server

VRAM Management

Note: execute_pipeline.py handles VRAM management automatically with --free-memory flags.
The Qwen image model and WAN video model cannot both fit in 10GB VRAM simultaneously - the executor handles this.


Quick Reference

| Script | Purpose | Key Arguments |
|--------|---------|---------------|
| execute_pipeline.py | Execute complete pipeline | --stage, --all, --status, --validate, --regenerate |
| asset_generator.py | Generate reusable assets | character, background, style subcommands |
| keyframe_generator.py | Generate keyframes with character references | --prompt, --character, --background, --output |
| angle_transformer.py | Transform keyframe camera angles | --input, --output, --rotate, --tilt, --zoom |
| wan_video_comfyui.py | Generate videos (WAN 2.1 I2V) | --prompt, --start-frame, --output, --free-memory |
| video_merger.py | Merge videos with transitions | --concat, --output, --transition, --duration |
| setup_comfyui.py | Setup and manage ComfyUI | --check, --start, --models |
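
video_merger.py is normally invoked by execute_pipeline.py when it assembles scenes, but a manual call might look like this. This is only a sketch based on the arguments listed above; it assumes --concat accepts the scene videos in playback order, so verify against the script's --help:

# Assumption: --concat takes the scene videos in playback order
python {baseDir}/scripts/video_merger.py \
  --concat scene-01/merged.mp4 scene-02/merged.mp4 \
  --transition fade \
  --duration 0.5 \
  --output final/video.mp4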

Pipeline Execution

| Stage | Command | Description |
|-------|---------|-------------|
| Assets | --stage assets | Generate character sheets and backgrounds |
| Scene Keyframes | --stage scene_keyframes | Generate keyframes for each scene |
| Scenes | --stage scenes | Generate videos, merge segments, create final video |
| All stages | --all | Run complete pipeline |

Key Features:
- Scene keyframes generated for "cut" transitions, extracted for "continuous" (landscape only)
- Segments within scenes merged automatically
- Final video assembled with transitions (cut/fade/dissolve)

Asset Generation

| Asset Type | Command | Output |
|------------|---------|--------|
| Character | asset_generator.py character --name X --description "..." -o path | Neutral A-pose, white background |
| Background | asset_generator.py background --name X --description "..." -o path | Environment, no people |
| Style | asset_generator.py style --name X --description "..." -o path | Style reference |

Keyframe Generation

IMPORTANT: Always use --free-memory for every keyframe generation to prevent VRAM fragmentation.

| Mode | Command | Description |
|------|---------|-------------|
| Single Character | keyframe_generator.py --free-memory --character X --prompt "..." | Character from reference |
| Multi-Character | keyframe_generator.py --free-memory --character A --character B --prompt "..." | Up to 3 characters |
| With Background | keyframe_generator.py --free-memory --background B --character X --character Y ... | Background + 2 chars |

Camera Angle Transformation

Transform keyframes using angle_transformer.py:
- --rotate: -180 to 180 (horizontal rotation, negative = left)
- --tilt: -90 to 90 (vertical tilt, negative = look up)
- --zoom: wide/normal/close
- Requires Multi-Angle LoRA
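
For example, a low-angle close-up variant of an existing keyframe could be produced like this (a sketch using the flags above; paths are placeholders):

python {baseDir}/scripts/angle_transformer.py \
  --input keyframes/scene-01-start.png \
  --output keyframes/scene-01-start-lowangle.png \
  --tilt -30 \
  --zoom close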

Video Generation (I2V Mode)

This skill uses Image-to-Video (I2V) mode exclusively:
- Input: Start keyframe + Motion prompt
- Output: Video + Extracted end frame

Video Model Selection

| Flag | Time | Quality | When to Use |
|------|------|---------|-------------|
| (none) | ~6 min | Good | Fast generation |
| --moe-fast | ~7 min | Better | RECOMMENDED - Best balance with ALG motion enhancement |
| --moe | ~30 min | Best | Maximum quality when time not critical |

Motion Limitations: I2V preserves the subject's pose from input image. For dynamic poses, keyframe must show subject mid-action. Camera motion works better than body motion.

Technical Specs

| Parameter | Value |
|-----------|-------|
| Video Duration | ~5 seconds (81 frames) |
| Frame Rate | 16 fps |
| Resolution | Up to 832x480 (medium preset) |
| VRAM Required | 10GB (GGUF Q4_K_M quantization) |
| Image Steps | 4 (with Lightning LoRA) |
| Video Steps | 8 (with LightX2V LoRA) |

References

  • references/models.md - Model specifications and sizes
  • references/prompt-engineering.md - Detailed prompt writing guidance
  • references/troubleshooting.md - Common issues and solutions
  • README.md - Installation instructions

# README.md

AI Video Producer Skill

A Claude Code skill for complete AI video production workflows using WAN 2.1/2.2 video generation and Qwen Image Edit 2511 keyframe generation via ComfyUI. Runs entirely locally on consumer GPUs.
If you want to generate with Gemini instead of locally hosted models, see https://github.com/zysilm-ai/gemini-video-producer-skill

Usage with Claude Code

Simply describe what video you want to create. Claude will automatically:
- Start ComfyUI server if needed
- Guide you through the production workflow
- Generate keyframes and videos with your approval at each step

Example conversation:

You: Create a 15-second anime fight scene with two warriors

Claude: I'll help you create that video. Let me start by establishing
a Production Philosophy and breaking down the scenes...

Example with meme:

You: Create a video from meme "I want to save money vs you only live once"

Claude: I'll create a relatable lifestyle video capturing the internal
struggle. Let me plan the story and visual style...

Overview

This skill guides you through creating professional AI-generated videos with a structured, iterative workflow:

  1. Pipeline Mode Selection - Choose Video-First (recommended) or Keyframe-First
  2. Production Philosophy - Define visual style, motion language, and narrative approach
  3. Scene Breakdown - Decompose video into scenes with motion requirements
  4. Asset Generation - Create reusable character and background assets
  5. Keyframe/Video Generation - Execute pipeline deterministically via execute_pipeline.py
  6. Review & Iterate - Refine based on feedback

The philosophy-first approach ensures visual coherence across all scenes, resulting in professional, cohesive videos.

Pipeline Modes

| Mode | Description | Best For |
|------|-------------|----------|
| Video-First (Recommended) | Generate first keyframe only, then videos sequentially. Last frame of each video becomes next scene's start. | Visual continuity between scenes |
| Keyframe-First | Generate all keyframes independently, then videos between them. | Precise control over end frames |

Key Features

  • 100% Local - No cloud APIs needed, runs on your GPU
  • Fast Generation - ~36 seconds per image (warm), ~2 minutes per 5-second video
  • Low VRAM - Works on 10GB GPUs using GGUF quantization
  • High Quality - Lightning LoRA (4-step) for images, LightX2V (8-step) for video
  • Two Video Modes - Image-to-Video (I2V) and First-Last-Frame (FLF2V)
  • Tiled VAE - Memory-efficient decoding for constrained VRAM

Supported Video Types

  • Promotional - Product launches, brand stories, ads
  • Educational - Tutorials, explainers, courses
  • Narrative - Short films, animations, music videos
  • Social Media - Platform-optimized content (TikTok, Reels, Shorts)
  • Corporate - Demos, presentations, training
  • Game Trailers - Action sequences, atmosphere, gameplay hints

System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU VRAM | 10GB | 12GB+ |
| RAM | 16GB | 32GB |
| Storage | 40GB free | 60GB+ |
| OS | Windows/Linux | Windows 10/11, Ubuntu 22.04+ |

Required Software: Python 3.10+, Git, CUDA 12.x


Standalone Generation

For programmatic or manual usage without Claude Code.

1. Setup (First Time Only)

cd ai-video-producer-skill

# Full automatic setup (installs ComfyUI + downloads models ~40GB)
python scripts/setup_comfyui.py

This will:
1. Clone ComfyUI into ./comfyui/
2. Install custom nodes (GGUF, VideoHelperSuite, Manager)
3. Download all required models:
- WAN 2.1 I2V GGUF (~11GB)
- Qwen Image Edit 2511 GGUF (~13GB)
- Text encoders, VAEs, and LoRAs (~14GB)

2. Start ComfyUI Server

# Automatically uses --cache-none for optimal 10GB VRAM performance
python scripts/setup_comfyui.py --start

Keep this running in the background. The server must be running for generation.

Note: The --cache-none flag is automatically used to enable sequential model loading, which is critical for multi-reference keyframe generation on 10GB VRAM systems.

3. Generate!

# Step 1: Generate character asset (neutral A-pose, clean background)
python scripts/asset_generator.py character \
  --name warrior \
  --description "A warrior in dramatic lighting, anime style, red armor" \
  --output outputs/assets/warrior.png

# Step 2: Generate keyframe with character (--free-memory is MANDATORY)
python scripts/keyframe_generator.py \
  --free-memory \
  --prompt "Warrior charging forward, cape flowing, dramatic lighting" \
  --character outputs/assets/warrior.png \
  --output outputs/keyframes/KF-A.png

# Step 3: Generate a video from the keyframe
python scripts/wan_video_comfyui.py \
  --free-memory \
  --prompt "The warrior charges forward, cape flowing" \
  --start-frame outputs/keyframes/KF-A.png \
  --output outputs/video.mp4

Architecture

ComfyUI Pipeline

The skill uses ComfyUI with GGUF quantized models for efficient GPU memory usage:

┌────────────────────────────────────────────────────────────┐
│                   ComfyUI Server (:8188)                   │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌─────────────┐     ┌──────────────────────────────────┐  │
│  │   Prompt    │────▶│  Qwen Image Edit 2511            │  │
│  └─────────────┘     │  ├─ GGUF Q4_K_M (13GB)           │  │
│                      │  ├─ Lightning LoRA (4-step)      │  │
│                      │  └─ Tiled VAE Decode             │  │
│                      └──────────────┬───────────────────┘  │
│                                     │                      │
│                                     ▼                      │
│                              ┌─────────────┐               │
│                              │  Keyframe   │               │
│                              │   (PNG)     │               │
│                              └──────┬──────┘               │
│                                     │                      │
│                                     ▼                      │
│                      ┌──────────────────────────────────┐  │
│                      │  WAN 2.1 I2V                     │  │
│                      │  ├─ GGUF Q4_K_M (11GB)           │  │
│                      │  ├─ LightX2V LoRA (8-step)       │  │
│                      │  └─ VAE Decode                   │  │
│                      └──────────────┬───────────────────┘  │
│                                     │                      │
│                                     ▼                      │
│                              ┌─────────────┐               │
│                              │   Video     │               │
│                              │ (81 frames) │               │
│                              └─────────────┘               │
└────────────────────────────────────────────────────────────┘

Models (GGUF Quantized)

Default Models (~40GB):

| Component | Model | Size | Purpose |
|-----------|-------|------|---------|
| Video Generation | WAN 2.1 I2V Q4_K_M | 11.3GB | 14B video transformer |
| Video LoRA | LightX2V | 0.7GB | 8-step distillation |
| Image Generation | Qwen Image Edit Q4_K_M | 13.1GB | 20B image transformer |
| Image LoRA | Lightning | 0.8GB | 4-step distillation |
| Text Encoders | UMT5-XXL + Qwen VL 7B | 13GB | FP8 quantized |
| VAEs | WAN + Qwen | 0.4GB | Video/Image decoding |

Optional WAN 2.2 MoE Models (+24GB):

| Component | Model | Size | Purpose |
|-----------|-------|------|---------|
| HighNoise Expert | WAN 2.2 I2V Q6_K | 12GB | MoE early denoising |
| LowNoise Expert | WAN 2.2 I2V Q6_K | 12GB | MoE refinement |

Download with: python scripts/setup_comfyui.py --q6k

Total: ~40GB base (+ ~24GB optional for WAN 2.2)

Performance (RTX 3080 10GB)

| Task | Cold Start | Warm |
|------|------------|------|
| Image Generation | ~63s | ~36s |
| Video Generation (81 frames) | ~3 min | ~2 min |

Script Reference

execute_pipeline.py

Execute a complete pipeline.json file deterministically.

# Check pipeline status
python scripts/execute_pipeline.py output/project/pipeline.json --status

# Validate pipeline structure
python scripts/execute_pipeline.py output/project/pipeline.json --validate

# Execute specific stage (video-first mode)
python scripts/execute_pipeline.py output/project/pipeline.json --stage assets
python scripts/execute_pipeline.py output/project/pipeline.json --stage first_keyframe
python scripts/execute_pipeline.py output/project/pipeline.json --stage scenes

# Execute specific stage (keyframe-first mode)
python scripts/execute_pipeline.py output/project/pipeline.json --stage assets
python scripts/execute_pipeline.py output/project/pipeline.json --stage keyframes
python scripts/execute_pipeline.py output/project/pipeline.json --stage videos

# Execute all stages with review pauses
python scripts/execute_pipeline.py output/project/pipeline.json --all

# Regenerate a specific item
python scripts/execute_pipeline.py output/project/pipeline.json --regenerate KF-A

The pipeline executor automatically:
- Detects pipeline mode (video-first or keyframe-first) from schema
- Manages VRAM by using --free-memory appropriately
- Tracks status in pipeline.json
- Extracts last frames for video-first scene continuity

asset_generator.py

Generate reusable assets for keyframe generation.

# Character identity (neutral A-pose, clean white background)
python scripts/asset_generator.py character \
  --name [character_name] \
  --description "[detailed character description]" \
  --output path/to/character.png

# Background (no people, environment only)
python scripts/asset_generator.py background \
  --name [background_name] \
  --description "[environment description]" \
  --output path/to/background.png

# Style reference
python scripts/asset_generator.py style \
  --name [style_name] \
  --description "[style description]" \
  --output path/to/style.png

keyframe_generator.py

Generate keyframes using character reference images.

IMPORTANT: Always use --free-memory for EVERY keyframe generation to prevent VRAM fragmentation.

python scripts/keyframe_generator.py \
  --free-memory \                        # MANDATORY - prevents VRAM fragmentation
  --prompt "Action/scene description" \
  --output path/to/keyframe.png \
  --character path/to/character.png      # Character identity (WHO)
  [--background path/to/background.png]  # Background reference
  [--preset low|medium|high]             # Resolution preset

Multi-Character Generation:

# Two characters (up to 3 supported) - always use --free-memory
python scripts/keyframe_generator.py \
  --free-memory \
  --prompt "On the left: warrior attacking. On the right: ninja defending." \
  --character assets/warrior.png \
  --character assets/ninja.png \
  --output keyframes/KF-battle.png

# Two characters with background reference
python scripts/keyframe_generator.py \
  --free-memory \
  --prompt "Warriors facing off in temple courtyard" \
  --background assets/backgrounds/temple.png \
  --character assets/warrior.png \
  --character assets/ninja.png \
  --output keyframes/KF-standoff.png

Reference Slot Allocation:

| Slot | Without --background | With --background |
|------|----------------------|-------------------|
| image1 | Character 1 | Background |
| image2 | Character 2 | Character 1 |
| image3 | Character 3 | Character 2 |

Note: With --background, maximum 2 characters are supported (3 reference slots total).

wan_video_comfyui.py

Generate videos from keyframes using WAN 2.1/2.2.

python scripts/wan_video_comfyui.py \
  --prompt "Motion description" \
  --output path/to/output.mp4 \
  --start-frame path/to/start.png \
  [--end-frame path/to/end.png]     # For First-Last-Frame mode
  [--preset low|medium|high]        # Resolution preset
  [--steps 8]                       # Sampling steps
  [--seed 0]                        # Random seed
  [--moe]                           # WAN 2.2 MoE (best quality, slow)
  [--moe-fast]                      # WAN 2.2 MoE + ALG (RECOMMENDED)

Video Modes:

| Mode | Arguments | Use Case |
|------|-----------|----------|
| I2V | --start-frame only | Continuous motion from single frame |
| FLF2V | --start-frame + --end-frame | Precise control over motion |

Model Selection:

| Flag | Model | Time | Best For |
|------|-------|------|----------|
| (none) | WAN 2.1 Q4K + LoRA | ~6 min | Default, fastest |
| --moe-fast | WAN 2.2 MoE + ALG | ~7 min | RECOMMENDED - Best balance |
| --moe | WAN 2.2 MoE (20 steps) | ~30 min | Maximum quality |

Note: WAN 2.2 modes require additional models. Download with: python scripts/setup_comfyui.py --q6k

angle_transformer.py

Transform keyframe camera angles without regenerating the base image.

python scripts/angle_transformer.py \
  --input path/to/keyframe.png \
  --output path/to/transformed.png \
  [--rotate -45]                    # Horizontal rotation (-180 to 180)
  [--tilt -30]                      # Vertical tilt (-90 to 90)
  [--zoom wide|normal|close]        # Lens type
  [--prompt "custom angle desc"]    # Override auto-generated description

Examples:

# Low angle dramatic shot
python scripts/angle_transformer.py \
  --input keyframes/KF-A.png \
  --output keyframes/KF-A-lowangle.png \
  --tilt -30

# Rotated wide shot
python scripts/angle_transformer.py \
  --input keyframes/KF-B.png \
  --output keyframes/KF-B-wide.png \
  --rotate 45 \
  --zoom wide

setup_comfyui.py

Setup and manage ComfyUI installation.

python scripts/setup_comfyui.py              # Full setup
python scripts/setup_comfyui.py --check      # Check status
python scripts/setup_comfyui.py --start      # Start server
python scripts/setup_comfyui.py --models     # Download models only
python scripts/setup_comfyui.py --q6k        # Download WAN 2.2 MoE models (optional)

Resolution Presets

| Preset | Resolution | Frames | VRAM Usage |
|--------|------------|--------|------------|
| low | 640x384 | 49 | ~8GB |
| medium | 832x480 | 81 | ~10GB |
| high | 1280x720 | 81 | ~16GB |

Directory Structure

ai-video-producer-skill/
├── SKILL.md                    # Claude Code skill instructions
├── README.md                   # This file
├── scripts/
│   ├── execute_pipeline.py     # Pipeline executor
│   ├── asset_generator.py      # Generate character/background/style assets
│   ├── keyframe_generator.py   # Generate keyframes with character references
│   ├── angle_transformer.py    # Transform keyframe camera angles
│   ├── wan_video_comfyui.py    # Video generation (WAN 2.1/2.2)
│   ├── setup_comfyui.py        # ComfyUI setup and server management
│   ├── core.py                 # Shared generation utilities
│   ├── comfyui_client.py       # ComfyUI API client
│   ├── utils.py                # Shared utilities
│   └── workflows/              # ComfyUI workflow JSON files
│       ├── qwen_*.json         # Image generation workflows
│       ├── wan_i2v.json        # Image-to-Video (WAN 2.1)
│       ├── wan_flf2v.json      # First-Last-Frame-to-Video
│       ├── wan_i2v_moe.json    # WAN 2.2 MoE (20 steps)
│       └── wan_i2v_moe_fast.json    # WAN 2.2 MoE + ALG (RECOMMENDED)
├── comfyui/                    # ComfyUI installation (gitignored)
│   ├── models/                 # All models stored here
│   └── output/                 # ComfyUI output directory
├── references/
│   ├── prompt-engineering.md
│   ├── style-systems.md
│   └── troubleshooting.md
└── outputs/                    # Your generated content

Output Directory Structure (Per Project)

When generating videos, the workflow creates this structure:

outputs/my-project/
├── philosophy.md              # Production philosophy
├── style.json                 # Style configuration
├── scene-breakdown.md         # Scene plan
├── pipeline.json              # Execution pipeline (all prompts)
├── assets/
│   ├── characters/           # Character identity assets
│   ├── backgrounds/          # Environment references
│   └── styles/               # Style references
├── keyframes/                # Generated/extracted keyframes
│   ├── KF-A.png              # First keyframe (generated)
│   ├── KF-B.png              # Extracted from scene-01 (video-first)
│   └── KF-C.png              # Extracted from scene-02 (video-first)
├── scene-01/
│   └── video.mp4
└── scene-02/
    └── video.mp4

Video-First Mode: Only KF-A is generated traditionally. KF-B, KF-C, etc. are automatically extracted from the last frame of each video, ensuring perfect visual continuity between scenes.

Keyframe-First Mode: All keyframes are generated independently, then videos interpolate between them.
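
In keyframe-first mode, each video can be rendered between two pre-generated keyframes using the FLF2V arguments of wan_video_comfyui.py documented above (a sketch; the prompt and paths are placeholders):

python scripts/wan_video_comfyui.py \
  --free-memory \
  --prompt "The warrior lowers the sword and turns toward the gate" \
  --start-frame outputs/my-project/keyframes/KF-A.png \
  --end-frame outputs/my-project/keyframes/KF-B.png \
  --output outputs/my-project/scene-01/video.mp4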


Troubleshooting

ComfyUI Server Not Running

# Check if server is running
curl http://127.0.0.1:8188/system_stats

# Start the server
python scripts/setup_comfyui.py --start

Models Not Found

# Check setup status
python scripts/setup_comfyui.py --check

# Download missing models
python scripts/setup_comfyui.py --models

Out of VRAM / Generation Hanging

The tiled VAE decode should handle most VRAM issues. If problems persist:

  1. Use lower resolution preset: --preset low
  2. Restart ComfyUI server to clear memory
  3. Close other GPU applications

Slow Generation

First run is slower due to model loading (~60s). Subsequent runs with warm models are faster (~36s for images).

Multi-Reference Keyframes Very Slow (30+ minutes)

If multi-reference keyframe generation (background + 2 characters) takes longer than expected:

  1. Ensure --cache-none is enabled: The ComfyUI server must be started with the --cache-none flag.

     ```bash
     # Correct - uses --cache-none automatically
     python scripts/setup_comfyui.py --start

     # Or manually:
     python main.py --listen 0.0.0.0 --port 8188 --cache-none
     ```

  2. Why this happens: On 10GB VRAM, the Qwen VL 7B text encoder (~8GB) and diffusion model (~12GB) cannot both fit in VRAM simultaneously. Without --cache-none, they compete for memory causing constant CPU↔GPU swapping.

  3. What --cache-none does: Allows ComfyUI to unload the text encoder after encoding is complete, freeing ~8GB VRAM for the diffusion sampling stage.

See references/troubleshooting.md for more solutions.


Contributing

Contributions welcome! Areas for improvement:

  • Additional model support (SD3.5, FLUX)
  • ControlNet workflows (depth, canny)
  • Audio generation integration
  • Batch processing tools

License

MIT License - See LICENSE.txt

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.