
culture-batch-generator

by @yha9806 in Tools
# Install this skill:

```shell
npx skills add yha9806/claude-skills-vulca --skill "culture-batch-generator"
```

Installs a specific skill from a multi-skill repository.

# Description

Multi-culture batch critique generation skill, v2.0. Supports batch critique generation for Chinese/Korean/Indian/Mural/Western/Islamic/Japanese/Hermitage cultures. Driven by an incremental task list, with automatic de-duplication checks to ensure critique uniqueness.

# SKILL.md

```yaml
name: culture-batch-generator
description: Multi-culture batch critique generation skill, v2.0. Supports batch critique generation for Chinese/Korean/Indian/Mural/Western/Islamic/Japanese/Hermitage cultures. Driven by an incremental task list, with automatic de-duplication checks to ensure critique uniqueness.
```

Multi-Culture Batch Generator v2.0

v2.0 Updates (2025-12-27)

  • ✅ Driven by an incremental task list (pending_tasks.json)
  • ✅ Checks whether an image already has a critique before processing
  • ✅ Checks generated critiques against existing ones for duplicates
  • ✅ Supports 8 cultures
  • ✅ A single failed image does not abort the whole batch

Core Features

  • Supports 8 cultures: chinese, korean, indian, mural, western, islamic, japanese, hermitage
  • Task-list driven, avoiding duplicate processing
  • Automatically selects the matching culture agent
  • Validates critique uniqueness after generation
  • Automatically saves checkpoints, supporting resume from interruption
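
The checkpoint/resume behavior in the last bullet can be sketched as follows; a minimal illustration only, where the checkpoint path and helper names are assumptions, not part of the skill's actual code:

```python
import json
import os

CKPT = 'experiments/checkpoints/progress.json'  # hypothetical checkpoint path

def load_done_ids(path: str = CKPT) -> set:
    """Resume support: return task_ids completed in a prior run (empty on first run)."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f).get('completed_ids', []))

def save_done_ids(done: set, path: str = CKPT):
    """Persist progress after each batch so an interrupted run can resume."""
    with open(path, 'w') as f:
        json.dump({'completed_ids': sorted(done)}, f, indent=2)
```

A new run would call `load_done_ids()` first and skip any task whose `task_id` is already in the returned set.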

Culture Configuration (v2.1: uniform 10 images/batch)

| Culture | Agent | Batch Size | Min Chinese Chars | Min Dimensions | ID Prefix |
|---|---|---|---|---|---|
| chinese | chinese-painting-critique-agent | 10 | ≥300 | ≥21 | CN_ |
| korean | korean-painting-critique-agent | 10 | ≥200 | ≥18 | KR_ |
| indian | indian-bilingual-agent-v3 | 10 | ≥400 | ≥21 | IN_ |
| mural | mural-bilingual-agent | 10 | ≥200 | ≥21 | MU_ |
| western | western-image-critique-agent | 10 | ≥200 | ≥18 | WE_ |
| islamic | islamic-bilingual-agent | 10 | ≥200 | ≥20 | IS_ |
| japanese | japanese-image-critique-agent | 10 | ≥200 | ≥19 | JP_ |
| hermitage | hermitage-dimension-agent | 10 | ≥200 | ≥21 | WS_ |

Hard-coded constraints (v2.1):
- Per batch: fixed at 10 images
- Max parallelism: 4 agents
- Return protocol: agents return only a summary; full data is written to the checkpoint
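
The table and constraints above can be collected into a single configuration map; a minimal sketch, where the constant names and the `min_requirements` helper are illustrative assumptions rather than the skill's actual code:

```python
BATCH_SIZE = 10    # fixed images per batch (v2.1)
MAX_PARALLEL = 4   # at most 4 agents run concurrently

# culture -> (agent name, min Chinese chars, min dimensions, ID prefix)
CULTURE_CONFIG = {
    'chinese':   ('chinese-painting-critique-agent', 300, 21, 'CN_'),
    'korean':    ('korean-painting-critique-agent',  200, 18, 'KR_'),
    'indian':    ('indian-bilingual-agent-v3',       400, 21, 'IN_'),
    'mural':     ('mural-bilingual-agent',           200, 21, 'MU_'),
    'western':   ('western-image-critique-agent',    200, 18, 'WE_'),
    'islamic':   ('islamic-bilingual-agent',         200, 20, 'IS_'),
    'japanese':  ('japanese-image-critique-agent',   200, 19, 'JP_'),
    'hermitage': ('hermitage-dimension-agent',       200, 21, 'WS_'),
}

def min_requirements(culture: str) -> dict:
    """Return the per-culture generation requirements as a dict."""
    agent, min_zh, min_dims, prefix = CULTURE_CONFIG[culture]
    return {'agent': agent, 'min_zh_chars': min_zh,
            'min_dimensions': min_dims, 'id_prefix': prefix}
```

Keeping the limits in one place makes it easy for a validation step to reject critiques that fall short of the per-culture minimums.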

Quick Start

Step 0: Generate/update the task list (required)

```shell
cd /mnt/i/VULCA\ 2.0/VULCA2.0_Project
source .venv/bin/activate
python scripts/database/incremental_scanner.py
```

Step 1: Inspect pending tasks

```python
import json

with open('experiments/checkpoints/pending_tasks.json') as f:
    tasks = json.load(f)

print("=== Task Summary ===")
for culture, stats in tasks['summary'].items():
    print(f"{culture}: {stats['total_pending']} pending")
```

Step 2: Extract a batch of tasks

```python
import json

def get_batch_tasks(culture: str, batch_size: int = 10) -> list:
    """Extract a batch of tasks for the given culture from the task list."""
    with open('experiments/checkpoints/pending_tasks.json') as f:
        tasks = json.load(f)

    # Filter pending tasks for the given culture
    culture_tasks = [
        t for t in tasks['tasks']
        if t['culture'] == culture and t['status'] == 'pending'
    ]

    return culture_tasks[:batch_size]

# Example: fetch 10 western tasks
batch = get_batch_tasks('western', 10)
print(f"Fetched {len(batch)} tasks")
```

Batch Processing Workflow (v2.0)

Step 1: Pre-check: confirm images are unprocessed

```python
import os

import lancedb

def pre_check(image_paths: list) -> dict:
    """Check whether images already exist in the database."""
    db = lancedb.connect('/home/yhryzy/vulca_lancedb')
    pairs = db.open_table('matched_pairs')
    existing = pairs.search().limit(20000).to_list()

    db_images = set(os.path.basename(r.get('optimized_path') or r.get('image_path', ''))
                    for r in existing)

    results = {'new': [], 'existing': []}
    for path in image_paths:
        basename = os.path.basename(path)
        if basename in db_images:
            results['existing'].append(path)
        else:
            results['new'].append(path)

    return results

# Usage
check = pre_check([t['image_path'] for t in batch])
print(f"New images: {len(check['new'])}, already present: {len(check['existing'])}")

# Only process new images
batch_to_process = [t for t in batch if t['image_path'] in check['new']]
```

Step 2: Invoke the matching agent

```python
import json

# Agent mapping
CULTURE_AGENTS = {
    'chinese': 'chinese-painting-critique-agent',
    'korean': 'korean-painting-critique-agent',
    'indian': 'indian-bilingual-agent-v3',
    'mural': 'mural-bilingual-agent',
    'western': 'western-image-critique-agent',
    'islamic': 'islamic-bilingual-agent',
    'japanese': 'japanese-image-critique-agent',
    'hermitage': 'hermitage-dimension-agent'
}

# Save the batch input
input_file = f'experiments/checkpoints/{culture}_batch_{batch_id:03d}_input.json'
with open(input_file, 'w') as f:
    json.dump(batch_to_process, f, indent=2, ensure_ascii=False)

# Invoke the Task tool
# Task(
#     subagent_type=CULTURE_AGENTS[culture],
#     prompt=f"Process the images listed in {input_file} and generate bilingual critiques."
# )
```

Step 3: Post-validation: check critique uniqueness

```python
import json

import lancedb

def post_validate(new_critiques: list) -> dict:
    """Check newly generated critiques against existing ones for duplicates."""
    db = lancedb.connect('/home/yhryzy/vulca_lancedb')
    pairs = db.open_table('matched_pairs')
    existing = pairs.search().limit(20000).to_list()

    # Build a fingerprint set of existing critiques
    existing_zh_fingerprints = set()
    for r in existing:
        zh = r.get('critique_zh', '')
        if zh and len(zh) > 100:
            existing_zh_fingerprints.add(zh[:200])

    results = {'unique': [], 'duplicate': []}
    for c in new_critiques:
        zh = c.get('critique_zh', '')
        fingerprint = zh[:200] if zh else ''

        if fingerprint in existing_zh_fingerprints:
            results['duplicate'].append({
                'image_path': c.get('filepath') or c.get('image_path'),
                'reason': 'critique_zh already exists'
            })
        else:
            results['unique'].append(c)
            # Add to the fingerprint set to prevent intra-batch duplicates
            existing_zh_fingerprints.add(fingerprint)

    return results

# Usage (output_file is the agent's batch output path, defined elsewhere)
with open(output_file) as f:
    critiques = json.load(f)

validation = post_validate(critiques)
print(f"Unique: {len(validation['unique'])}, duplicates: {len(validation['duplicate'])}")

# Save only the unique critiques
if validation['duplicate']:
    print("Warning: critiques for the following images duplicate existing ones and must be regenerated:")
    for d in validation['duplicate']:
        print(f"  - {d['image_path']}")
```

Step 4: Update task statuses

```python
import json

def update_task_status(task_ids: list, new_status: str):
    """Update the status of the given tasks in the task list."""
    with open('experiments/checkpoints/pending_tasks.json') as f:
        tasks = json.load(f)

    for task in tasks['tasks']:
        if task['task_id'] in task_ids:
            task['status'] = new_status

    with open('experiments/checkpoints/pending_tasks.json', 'w') as f:
        json.dump(tasks, f, indent=2, ensure_ascii=False)

# Mark as completed
completed_ids = [t['task_id'] for t in batch_to_process]
update_task_status(completed_ids, 'completed')
```

Complete Batch Script Template

```python
#!/usr/bin/env python3
"""Enhanced batch processing script template."""
import json
import os

import lancedb

# Configuration
CULTURE = 'western'  # change to the target culture
BATCH_SIZE = 10
BATCH_ID = 1

# Step 1: fetch tasks
with open('experiments/checkpoints/pending_tasks.json') as f:
    all_tasks = json.load(f)

culture_tasks = [
    t for t in all_tasks['tasks']
    if t['culture'] == CULTURE and t['status'] == 'pending'
][:BATCH_SIZE]

print(f"Fetched {len(culture_tasks)} {CULTURE} tasks")

# Step 2: pre-check against the database
db = lancedb.connect('/home/yhryzy/vulca_lancedb')
pairs = db.open_table('matched_pairs')
existing = pairs.search().limit(20000).to_list()
db_images = set(os.path.basename(r.get('optimized_path') or r.get('image_path', ''))
                for r in existing)

tasks_to_process = [
    t for t in culture_tasks
    if os.path.basename(t['image_path']) not in db_images
]
print(f"Passed pre-check: {len(tasks_to_process)}")

# Step 3: save the batch input
input_file = f'experiments/checkpoints/{CULTURE}_batch_{BATCH_ID:03d}_input.json'
with open(input_file, 'w') as f:
    json.dump(tasks_to_process, f, indent=2, ensure_ascii=False)

print(f"Batch input saved: {input_file}")
print("Process this file with the matching agent.")
```

Merge into LanceDB

Merge through the data-governance gateway (which already performs de-duplication):

```python
import json

from scripts.database.data_ingestion import insert

# Read the batch output
with open(f'experiments/checkpoints/{culture}_batch_{batch_id:03d}_output.json') as f:
    records = json.load(f)

# Merge through the governance gateway (automatic de-duplication)
result = insert(records, source=f'{culture}_expansion_batch_{batch_id:03d}')
print(f"Inserted: {result.inserted}, Rejected: {result.rejected_quality}, Duplicates: {result.rejected_duplicate}")
```

Checkpoint Mechanism

Task list: experiments/checkpoints/pending_tasks.json

```jsonc
{
  "metadata": {
    "generated_at": "2025-12-27T15:31:44",
    "db_image_count": 6466,
    "local_image_count": 8502,
    "regen_image_count": 927
  },
  "summary": {
    "western": {"unprocessed": 584, "regen": 339, "total_pending": 629},
    ...
  },
  "tasks": [
    {
      "task_id": "TASK_ABC12345",
      "culture": "western",
      "image_file": "artist-title_hash.jpg",
      "image_path": "/mnt/i/VULCA 2.0/.../image.jpg",
      "task_type": "new",  // or "regen"
      "status": "pending",  // pending -> processing -> completed/failed
      "created_at": "2025-12-27T15:31:44"
    }
  ]
}
```
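
Given the status lifecycle above (pending -> processing -> completed/failed), a quick way to audit a checkpoint file is to tally tasks by status. A small sketch, where the `status_summary` helper is illustrative rather than part of the skill:

```python
from collections import Counter

def status_summary(tasks_doc: dict) -> Counter:
    """Count tasks per status (pending/processing/completed/failed)."""
    return Counter(t['status'] for t in tasks_doc.get('tasks', []))

# Example on an in-memory document shaped like pending_tasks.json:
doc = {'tasks': [{'status': 'pending'}, {'status': 'pending'},
                 {'status': 'completed'}]}
print(status_summary(doc))  # Counter({'pending': 2, 'completed': 1})
```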

Execution Priority

Ordered from smallest to largest workload:
1. Hermitage (16 items): smallest, finishes quickly
2. Japanese (93 items): requires Japanese-language expertise
3. Indian (126 items): medium
4. Islamic (178 items): medium
5. Mural (444 items): larger
6. Western (629 items): larger
7. Chinese (1338 items): largest
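
This ordering can be derived directly from the `summary` block of pending_tasks.json by sorting on `total_pending`; a minimal sketch, where the `priority_order` helper is an assumption for illustration:

```python
def priority_order(summary: dict) -> list:
    """Order cultures from smallest to largest pending workload."""
    return sorted(summary, key=lambda c: summary[c]['total_pending'])

# Example with a subset of the summary shown above:
summary = {
    'western':   {'total_pending': 629},
    'hermitage': {'total_pending': 16},
    'japanese':  {'total_pending': 93},
}
print(priority_order(summary))  # ['hermitage', 'japanese', 'western']
```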

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.