Xsir0

google-gemini-media

# Install this skill:
npx skills add Xsir0/xsir-skills --skill "google-gemini-media"

Installs a specific skill from a multi-skill repository.

# Description

Use the Gemini API (Nano Banana image generation, Veo video, Gemini TTS speech and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".

# SKILL.md


---
name: google-gemini-media
description: Use the Gemini API (Nano Banana image generation, Veo video, Gemini TTS speech and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".
license: MIT
---


Gemini Multimodal Media (Image/Video/Speech) Skill

1. Goals and scope

This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:

  • Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
  • Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API)
  • Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
  • Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
  • Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
  • Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)

Convention: this Skill treats the official Google Gen AI SDK (Node.js/REST) as its primary reference; currently only Node.js/REST examples are provided. If your project already wraps another language or framework, map this Skill's request structure, model selection, and I/O spec onto your wrapper layer.


2. Quick routing (decide which capability to use)

1) Do you need to produce images?
- Need to generate images from scratch or edit based on an image -> use Nano Banana image generation (see Section 5)

2) Do you need to understand images?
- Need recognition, description, Q&A, comparison, or info extraction -> use Image understanding (see Section 6)

3) Do you need to produce video?
- Need to generate an 8-second video (optionally with native audio) -> use Veo 3.1 video generation (see Section 7)

4) Do you need to understand video?
- Need summaries/Q&A/segment extraction with timestamps -> use Video understanding (see Section 8)

5) Do you need to read text aloud?
- Need controllable narration, podcast/audiobook style, etc. -> use Speech generation (TTS) (see Section 9)

6) Do you need to understand audio?
- Need audio descriptions, transcription, time-range transcription, token counting -> use Audio understanding (see Section 10)


3. Unified engineering constraints and I/O spec (must read)

3.0 Prerequisites (dependencies and tools)

  • Node.js 18+ (match your project version)
  • Install SDK (example):
npm install @google/genai
  • The REST examples require only curl; optionally install jq to parse Base64 image output.

3.1 Authentication and environment variables

  • Put your API key in GEMINI_API_KEY
  • REST requests use x-goog-api-key: $GEMINI_API_KEY

3.2 Two file input modes: Inline vs Files API

Inline (embedded bytes/Base64)
- Pros: shorter call chain, good for small files.
- Key constraint: total request size (text prompt + system instructions + embedded bytes) typically has a ~20MB ceiling.

Files API (upload then reference)
- Pros: good for large files, reusing the same file, or multi-turn conversations.
- Typical flow:
1. files.upload(...) (SDK) or POST /upload/v1beta/files (REST resumable)
2. Use file_data / file_uri in generateContent

Engineering suggestion: implement an ensure_file_uri() helper so that files above a threshold (for example, warn at 10-15MB) or files you intend to reuse are routed through the Files API automatically, as sketched below.
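
A minimal Node.js sketch of such a helper (the helper name and the 15MB threshold are illustrative, not part of the API):

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const INLINE_LIMIT = 15 * 1024 * 1024; // illustrative threshold under the ~20MB request ceiling

// Hypothetical helper: inline small files, route large/reused files through the Files API.
async function ensureFilePart(path, mimeType) {
  if (fs.statSync(path).size < INLINE_LIMIT) {
    return { inlineData: { mimeType, data: fs.readFileSync(path).toString("base64") } };
  }
  const uploaded = await ai.files.upload({ file: path });
  return { fileData: { fileUri: uploaded.uri, mimeType: uploaded.mimeType } };
}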

3.3 Unified handling of binary media outputs

  • Images: usually returned as inlineData (Base64) in response parts; in the Node.js SDK, read part.inlineData.data and decode the Base64 payload to PNG/JPG.
  • Speech (TTS): usually returns PCM bytes (Base64); save as .pcm or wrap into .wav (commonly 24kHz, 16-bit, mono; see the sketch after this list).
  • Video (Veo): long-running async task; poll the operation; download the file (or use the returned URI).
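
A small Node.js sketch that wraps raw PCM in a standard 44-byte WAV header (assumes 16-bit mono at 24kHz, matching the typical TTS output above):

import * as fs from "node:fs";

// Prepend a minimal RIFF/WAVE header to raw 16-bit PCM samples.
function pcmToWav(pcm, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const byteRate = (sampleRate * channels * bitsPerSample) / 8;
  const blockAlign = (channels * bitsPerSample) / 8;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // total file size minus 8 bytes
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16); // fmt chunk size
  header.writeUInt16LE(1, 20); // audio format: PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}

fs.writeFileSync("out.wav", pcmToWav(fs.readFileSync("out.pcm")));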

4. Model selection matrix (choose by scenario)

Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.

4.1 Image generation (Nano Banana)

  • gemini-2.5-flash-image: optimized for speed/throughput; good for frequent, low-latency generation/editing.
  • gemini-3-pro-image-preview: stronger instruction following and high-fidelity text rendering; better for professional assets and complex edits.

4.2 General image/video/audio understanding

  • The docs use gemini-3-flash-preview for image, video, and audio understanding; choose a stronger model when quality matters more than cost.

4.3 Video generation (Veo)

  • Example model: veo-3.1-generate-preview (generates 8-second video and can natively generate audio).

4.4 Speech generation (TTS)

  • Example model: gemini-2.5-flash-preview-tts (native TTS, currently in preview).

5. Image generation (Nano Banana)

5.1 Text-to-Image

SDK (Node.js) minimal template

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents:
    "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.text) console.log(part.text);
  if (part.inlineData?.data) {
    fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
  }
}

REST (with imageConfig) minimal template

curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents":[{"parts":[{"text":"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
    "generationConfig": {"imageConfig": {"aspectRatio":"16:9"}}
  }'

REST image parsing (Base64 decode)

curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \
  | jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' \
  | base64 --decode > out.png

# macOS can use: base64 -D > out.png

5.2 Text-and-Image-to-Image

Use case: given an input image, add, remove, or modify elements, change the style, adjust color grading, and so on.

SDK (Node.js) minimal template

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "Add a nano banana on the table, keep lighting consistent, cinematic tone.";
const imageBase64 = fs.readFileSync("input.png").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: [
    { text: prompt },
    { inlineData: { mimeType: "image/png", data: imageBase64 } },
  ],
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.inlineData?.data) {
    fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
  }
}

5.3 Multi-turn image iteration (Multi-turn editing)

Best practice: use a chat session for continuous iteration (for example: generate first, then "edit only a specific region/element", then "make variants in the same style").
To return mixed "text + image" results, set responseModalities to ["TEXT", "IMAGE"], as in the sketch below.
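
A minimal chat-based sketch (assumes your SDK version accepts responseModalities in the chat config; prompts are illustrative):

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const chat = ai.chats.create({
  model: "gemini-2.5-flash-image",
  config: { responseModalities: ["TEXT", "IMAGE"] }, // mixed text + image turns
});

// Turn 1: generate; Turn 2: targeted edit in the same session.
const first = await chat.sendMessage({
  message: "Generate a flat-lay photo of a nano banana breakfast.",
});
const second = await chat.sendMessage({
  message: "Keep the composition, but change the plate to dark ceramic.",
});
// Extract and save image parts from each response as in Section 5.1.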

5.4 ImageConfig

You can set these in generationConfig.imageConfig (REST) or in the SDK config (see the sketch below):
- aspectRatio: e.g. 16:9, 1:1.
- imageSize: e.g. 2K, 4K (higher resolution is usually slower/more expensive and model support can vary).
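
A Node.js sketch of the same config via the SDK (assumes your SDK version exposes imageConfig; imageSize support varies by model):

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-3-pro-image-preview",
  contents: "A poster with the headline NANO BANANA in bold typography",
  config: {
    imageConfig: { aspectRatio: "16:9", imageSize: "2K" }, // verify supported values per model
  },
});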


6. Image understanding

6.1 Two ways to provide input images

  • Inline image data: suitable for small files (total request size < 20MB).
  • Files API upload: better for large files or reuse across multiple requests.

6.2 Inline images (Node.js) minimal template

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const imageBase64 = fs.readFileSync("image.jpg").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: [
    { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
    { text: "Caption this image, and list any visible brands." },
  ],
});

console.log(response.text);

6.3 Upload and reference with Files API (Node.js) minimal template

import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "image.jpg" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Caption this image.",
  ]),
});

console.log(response.text);

6.4 Multi-image prompts

Append multiple images as multiple Part entries in the same contents; you can mix uploaded references and inline bytes, as sketched below.
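
A sketch of a two-image prompt mixing both input modes (uploadedA and imageBase64B are assumed to have been prepared as in 6.3 and 6.2):

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    "Compare these two product shots and list the visual differences.",
    createPartFromUri(uploadedA.uri, uploadedA.mimeType), // Files API reference
    { inlineData: { mimeType: "image/png", data: imageBase64B } }, // inline bytes
  ]),
});
console.log(response.text);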


7. Video generation (Veo 3.1)

7.1 Core features (must know)

  • Generates 8-second high-fidelity video (720p / 1080p / 4k) and supports native audio generation (dialogue, ambience, SFX).
  • Supports:
      • Aspect ratio (16:9 / 9:16)
      • Video extension (extend a previously generated video; typically limited to 720p)
      • First/last frame control (specify the first and/or last frame)
      • Up to 3 reference images (image-guided direction)

7.2 SDK (Node.js) minimal template: async polling + download

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience.";
let operation = await ai.models.generateVideos({
  model: "veo-3.1-generate-preview",
  prompt,
  config: { resolution: "1080p" },
});

while (!operation.done) {
  await new Promise((resolve) => setTimeout(resolve, 10_000));
  operation = await ai.operations.getVideosOperation({ operation });
}

const video = operation.response?.generatedVideos?.[0]?.video;
if (!video) throw new Error("No video returned");
await ai.files.download({ file: video, downloadPath: "out.mp4" });

7.3 REST minimal template: predictLongRunning + poll + download

Key point: Veo's REST flow calls :predictLongRunning, which returns an operation name; poll GET /v1beta/{operation_name} until done, then download the video from the URI in the response.
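
A curl sketch of that flow (the instances/prompt request shape follows the docs pattern and OPERATION_NAME is a placeholder taken from the first response; verify field names before use):

curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/veo-3.1-generate-preview:predictLongRunning" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"instances":[{"prompt":"A cinematic shot of a cat astronaut walking on the moon."}]}'

# Poll until the response contains "done": true, then download the video URI it returns.
curl -s -H "x-goog-api-key: $GEMINI_API_KEY" \
  "https://generativelanguage.googleapis.com/v1beta/OPERATION_NAME"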

7.4 Common controls (recommend a unified wrapper)

  • aspectRatio: "16:9" or "9:16"
  • resolution: "720p" | "1080p" | "4k" (higher resolutions are usually slower/more expensive)
  • When writing prompts: put dialogue in quotes; explicitly call out SFX and ambience; use cinematography language (camera position, movement, composition, lens effects, mood).
  • Negative constraints: if the API supports a negative prompt field, use it; otherwise list elements you do not want to see.

7.5 Important limits (engineering fallback needed)

  • Latency can vary from seconds to minutes; implement timeouts and retries.
  • Generated videos are only retained on the server for a limited time (download promptly).
  • Outputs include a SynthID watermark.

Polling fallback with timeout and backoff (Node.js)

const deadline = Date.now() + 300_000; // 5 min
let sleepMs = 2000;
while (!operation.done && Date.now() < deadline) {
  await new Promise((resolve) => setTimeout(resolve, sleepMs));
  sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000);
  operation = await ai.operations.getVideosOperation({ operation });
}
if (!operation.done) throw new Error("video generation timed out");

8. Video understanding

8.1 Video input options

  • Files API upload: recommended when file > 100MB, video length > ~1 minute, or you need reuse.
  • Inline video data: for smaller files.
  • Direct YouTube URL: can analyze public videos (see the sketch below).
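
A minimal YouTube sketch (assumes a fileData part with a fileUri pointing at a public video; the URL is a placeholder):

import { GoogleGenAI, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    "Summarize this video in three sentences.",
    { fileData: { fileUri: "https://www.youtube.com/watch?v=VIDEO_ID" } }, // placeholder URL
  ]),
});
console.log(response.text);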

8.2 Files API (Node.js) minimal template

import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp4" });
// Large videos are processed asynchronously; poll ai.files.get({ name: uploaded.name })
// until state === "ACTIVE" before referencing the file in generateContent.

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Summarize this video. Provide timestamps for key events.",
  ]),
});

console.log(response.text);

8.3 Timestamp prompting strategy

  • Ask for segmented bullets with "(mm:ss)" timestamps.
  • Require "evidence with specific time ranges" and include downstream structured extraction (JSON) in the same prompt if needed.

9. Speech generation (Text-to-Speech, TTS)

9.1 Positioning

  • Native TTS: for "precise reading + controllable style" (podcasts, audiobooks, ad voiceover, etc.).
  • Distinguish this from the Live API: the Live API targets interactive, free-form audio/multimodal conversation, while TTS focuses on controlled narration.

9.2 Single-speaker TTS (Node.js) minimal template

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [{ parts: [{ text: "Say cheerfully: Have a wonderful day!" }] }],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: "Kore" },
      },
    },
  },
});

const data =
  response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
if (!data) throw new Error("No audio returned");
fs.writeFileSync("out.pcm", Buffer.from(data, "base64"));

9.3 Multi-speaker TTS (max 2 speakers)

Requirements:
- Use multiSpeakerVoiceConfig
- Each speaker name must match the dialogue labels in the prompt (e.g., Joe/Jane), as in the sketch below.
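
A minimal two-speaker sketch based on the single-speaker template above (voice names and dialogue are illustrative; verify the multiSpeakerVoiceConfig field names against the docs):

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [
    {
      parts: [
        {
          text: "TTS the following conversation:\nJoe: How's it going today, Jane?\nJane: Not too bad! How about you?",
        },
      ],
    },
  ],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: [
          { speaker: "Joe", voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" } } },
          { speaker: "Jane", voiceConfig: { prebuiltVoiceConfig: { voiceName: "Puck" } } },
        ],
      },
    },
  },
});
// Decode the returned Base64 PCM and save it as in 9.2.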

9.4 Voice options and language

  • voiceName supports 30 prebuilt voices (e.g., Zephyr, Puck, Charon, Kore).
  • The model can auto-detect input language and supports 24 languages (see docs for the list).

Provide natural-language direction for style, pace, accent, and so on, but avoid over-constraining the voice.


10. Audio understanding

10.1 Typical tasks

  • Describe audio content (including non-speech like birds, alarms, etc.)
  • Generate transcripts
  • Transcribe specific time ranges
  • Count tokens (for cost estimates/segmentation)

10.2 Files API (Node.js) minimal template

import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp3" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    "Describe this audio clip.",
    createPartFromUri(uploaded.uri, uploaded.mimeType),
  ]),
});

console.log(response.text);

10.3 Key limits and engineering tips

  • Supports common formats: WAV/MP3/AIFF/AAC/OGG/FLAC.
  • Audio tokenization: about 32 tokens/second (about 1920 tokens per minute; values may change).
  • Total audio length per prompt is capped at 9.5 hours; multi-channel audio is downmixed; audio is resampled (see docs for exact parameters).
  • If total request size exceeds 20MB, you must use the Files API.
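
A token-counting sketch for cost estimation (assumes the uploaded file from 10.2; the per-second rate above may change):

const tokens = await ai.models.countTokens({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
  ]),
});
console.log(tokens.totalTokens); // ~32 tokens per second of audio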

11. End-to-end examples (composition)

Example A: Image generation -> validation via understanding

1) Generate product images with Nano Banana (require negative space, consistent lighting).
2) Use image understanding as a self-check: verify text clarity and brand spelling, and confirm there are no unsafe elements.
3) If not satisfied, feed the generated image into text+image editing and iterate.

Example B: Video generation -> video understanding -> narration script

1) Generate an 8-second shot with Veo (include dialogue or SFX).
2) Download and save (respect retention window).
3) Upload video to video understanding to produce a storyboard + timestamps + narration copy (then feed to TTS).

Example C: Audio understanding -> time-range transcription -> TTS redub

1) Upload meeting audio and transcribe full content.
2) Transcribe or summarize specific time ranges.
3) Use TTS to generate a "broadcast" version of the summary.


12. Compliance and risk (must follow)

  • Ensure you have the necessary rights to upload images/video/audio; do not generate infringing, deceptive, harassing, or harmful content.
  • Generated images and videos include SynthID watermarking; video generation may also be subject to regional and people-generation constraints.
  • Production systems must implement timeouts, retries, failure fallbacks, and human review/post-processing for generated content.

13. Quick reference (Checklist)

  • [ ] Pick the right model: image generation (Flash Image / Pro Image Preview), video generation (Veo 3.1), TTS (Gemini 2.5 TTS), understanding (Gemini Flash/Pro).
  • [ ] Pick the right input mode: inline for small files; Files API for large/reuse.
  • [ ] Parse binary outputs correctly: decode inlineData for images/audio; for video, poll the operation and download the result.
  • [ ] For video generation: set aspectRatio / resolution, and download promptly (files expire after a retention window).
  • [ ] For TTS: set responseModalities to ["AUDIO"]; max 2 speakers; speaker names must match the prompt labels.
  • [ ] For audio understanding: countTokens when needed; segment long audio or use Files API.

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.