YuniorGlez / ai-cost-optimizer

# Install this skill:
npx skills add YuniorGlez/gemini-elite-core --skill "ai-cost-optimizer"

Installs the specified skill from a multi-skill repository.

# Description

Master of LLM Economic Orchestration, specialized in Google GenAI (Gemini 3), Context Caching, and High-Fidelity Token Engineering.

# SKILL.md

---
name: ai-cost-optimizer
id: ai-cost-optimizer
version: 1.1.0
description: "Master of LLM Economic Orchestration, specialized in Google GenAI (Gemini 3), Context Caching, and High-Fidelity Token Engineering."
last_updated: "2026-01-22"
---


Skill: AI Cost Optimizer (Standard 2026)

Role: The AI Cost Optimizer is a specialized "Token Economist" responsible for maximizing the reasoning output of AI agents while minimizing operational expense. In 2026, this role masters the pricing tiers of the Gemini 3 Flash and Lite models, implementing "Thinking-Level" routing and multi-layered caching to achieve up to 90% cost reduction on high-volume applications.

🎯 Primary Objectives

  1. Economic Orchestration: Dynamically routing prompts between Gemini 3 Pro, Flash, and Lite based on complexity.
  2. Context Caching Mastery: Implementing implicit and explicit caching for system instructions and long documents (v1.35.0+).
  3. Token Engineering: Reducing "noise tokens" through XML tagging and strict response schemas.
  4. Usage Governance: Implementing granular quotas and attribution to prevent runaway API billing.

🏗️ The 2026 Economic Stack

1. Target Models

  • Gemini 3 Pro: Reserved for "Mission Critical" reasoning and deep architecture mapping.
  • Gemini 3 Flash-Preview: The "Workhorse" for most coding and extraction tasks ($0.50/1M input).
  • Gemini Flash-Lite-Latest: The "Utility" agent for real-time validation and short-burst responses. (A tier-routing sketch follows this list.)
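
A minimal tier-routing sketch across these three models, assuming an initialized `genAI` client; the `task` fields and the heuristic itself are illustrative assumptions, not part of any SDK:

```js
// Hypothetical router: start on the cheapest tier and escalate only when justified.
function pickModel(task) {
  if (task.missionCritical || task.deepReasoning) return "gemini-3-pro";
  if (task.kind === "coding" || task.kind === "extraction") return "gemini-3-flash";
  return "gemini-flash-lite-latest"; // default utility tier
}

const routedModel = genAI.getGenerativeModel({ model: pickModel({ kind: "coding" }) });
```

Escalation on validation failure (anti-pattern 3 below) can wrap this router: re-run the same task one tier up when the cheaper tier's output fails schema validation.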

2. Optimization Tools

  • Google GenAI Context Caching: Reducing input fees for stable context blocks.
  • Thinking Level Param: Controlling reasoning depth for cost/latency trade-offs.
  • Prompt Registry: Deduplicating and optimizing recurring system instructions (see the sketch after this list).
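
A minimal sketch of such a registry, hashing instruction text so identical prompts collapse to one entry; the structure and names are illustrative:

```js
import { createHash } from "node:crypto";

// Hypothetical registry: duplicate system instructions share one entry,
// and therefore one cache slot and one optimization pass.
const promptRegistry = new Map();

function registerPrompt(text) {
  const normalized = text.trim();
  const key = createHash("sha256").update(normalized).digest("hex");
  const entry = promptRegistry.get(key) ?? { text: normalized, uses: 0 };
  entry.uses += 1;
  promptRegistry.set(key, entry);
  return key; // stable id, useful for per-prompt usage attribution
}
```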

🛠️ Implementation Patterns

1. The "Thinking Level" Router

Adjusting the model's internal reasoning effort based on the task type.

```js
// 2026 Pattern: Cost-Aware Generation
// Assumes an initialized `genAI` client and a `taskComplexity` value
// ('low' | 'high') classified upstream (see the sketch below).
const model = genAI.getGenerativeModel({
  model: "gemini-3-flash",
  generationConfig: {
    // Spend reasoning tokens only when the task warrants it
    thinkingLevel: taskComplexity === 'high' ? 'standard' : 'low',
    responseMimeType: "application/json", // strict schema output trims noise tokens
  }
});
```
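
The snippet above leaves `taskComplexity` undefined. A minimal upstream classifier, purely illustrative (the keyword list and length threshold are assumptions):

```js
// Hypothetical heuristic: long prompts or architecture-level verbs get deeper reasoning.
const HIGH_COMPLEXITY_HINTS = ["refactor", "architecture", "migrate", "debug"];

function classifyTaskComplexity(prompt) {
  const longPrompt = prompt.length > 2000; // crude proxy for reasoning load
  const hasHint = HIGH_COMPLEXITY_HINTS.some((kw) => prompt.toLowerCase().includes(kw));
  return longPrompt || hasHint ? "high" : "low";
}

const taskComplexity = classifyTaskComplexity("Refactor the billing module to use Property Hooks.");
// "high", so the router above requests thinkingLevel "standard"
```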

2. Explicit Context Caching (v1.35.0+)

Crucial for large codebases or stable documentation.

```js
// Squaads Standard: 1M+ token repository caching
// Assumes `cacheManager` and `model` come from an initialized Google GenAI client
// and `fullRepoData` holds a pre-pruned (e.g. Repomix) dump of the repository.
const codebaseCache = await cacheManager.create({
  model: "gemini-flash-lite-latest",
  contents: [{ role: "user", parts: [{ text: fullRepoData }] }],
  ttlSeconds: 86400, // cache for 24 hours
});

// Subsequent calls reference the cache so repo tokens are not re-billed at the full input rate
const result = await model.generateContent({
  cachedContent: codebaseCache.name,
  contents: [{ role: "user", parts: [{ text: "Explain the auth flow." }] }],
});
```
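
Cached tokens typically trade a discounted input rate against an hourly storage fee, so caching only pays off above a certain reuse rate. A back-of-envelope check, with every price a hypothetical placeholder (substitute your actual rate card):

```js
// All figures below are illustrative assumptions, not real Gemini pricing.
const INPUT_PER_MTOK = 0.50;        // $ per 1M uncached input tokens
const CACHED_INPUT_PER_MTOK = 0.10; // $ per 1M cached input tokens
const STORAGE_PER_MTOK_HOUR = 1.00; // $ per 1M cached tokens per hour

function cachingSavesMoney(contextTokens, callsPerDay, ttlHours = 24) {
  const mTok = contextTokens / 1e6;
  const withoutCache = callsPerDay * mTok * INPUT_PER_MTOK;
  const withCache =
    callsPerDay * mTok * CACHED_INPUT_PER_MTOK + ttlHours * mTok * STORAGE_PER_MTOK_HOUR;
  return withCache < withoutCache;
}

// A 1M-token repo queried 100x/day: $50.00 uncached vs $34.00 cached under these assumptions.
console.log(cachingSavesMoney(1_000_000, 100)); // true
```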

3. XML System Instruction Packing

Using XML tags to reduce instruction drift and token wastage in multi-turn chats.

```xml
<system_instruction>
  <role>Senior Architect</role>
  <constraints>No legacy PHP, use Property Hooks</constraints>
</system_instruction>
```
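
One way to wire the packed block into a session, as a sketch that reuses the client style from the patterns above (the `systemInstruction` parameter shape is an assumption here):

```js
// Hypothetical wiring: the XML block is defined once and attached to the session,
// rather than re-sent inside every user turn.
const SYSTEM_XML = `
<system_instruction>
  <role>Senior Architect</role>
  <constraints>No legacy PHP, use Property Hooks</constraints>
</system_instruction>`.trim();

const architect = genAI.getGenerativeModel({
  model: "gemini-3-flash",
  systemInstruction: SYSTEM_XML, // stable text, a good candidate for implicit caching
});
```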

🚫 The "Do Not List" (Anti-Patterns)

  1. NEVER send a full codebase in every prompt. Use Repomix for pruning and Context Caching for reuse.
  2. NEVER use high-resolution video frames (280 tokens per frame) for tasks that only need low resolution (70 tokens per frame).
  3. NEVER default to Gemini 3 Pro. Always start with Flash-Lite and escalate only if validation fails.
  4. NEVER allow agents to run in an infinite loop without a "Kill Switch" based on token accumulation. (A minimal kill-switch sketch follows this list.)
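
A minimal sketch of such a kill switch, assuming token counts can be read from each response's usage metadata (the field names here are illustrative; adapt them to your SDK):

```js
// Hypothetical guard: abort an agent loop once cumulative spend crosses a hard cap.
class TokenKillSwitch {
  constructor(maxTokens) {
    this.maxTokens = maxTokens;
    this.used = 0;
  }
  record(usage = {}) {
    this.used += (usage.inputTokens ?? 0) + (usage.outputTokens ?? 0);
    if (this.used > this.maxTokens) {
      throw new Error(`Token budget exceeded: ${this.used}/${this.maxTokens}`);
    }
  }
}

const killSwitch = new TokenKillSwitch(500_000); // hard cap per agent run
// Inside the loop: killSwitch.record({ inputTokens, outputTokens }) after every call.
```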

🛠️ Troubleshooting & Usage Audit

| Issue | Likely Cause | 2026 Corrective Action |
| --- | --- | --- |
| Billing spikes | Unoptimized multimodal input | Downsample images/video before sending to the model. |
| Low quality (Lite) | Insufficient reasoning depth | Switch `thinkingLevel` to `standard` or route to Flash-Preview. |
| Cache misses | Context drift in dynamic files | Isolate stable imports/types from volatile business logic. |
| Hallucination | Instruction drift in long context | Use `<system>` tags and explicit "Do Not" lists. |
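
For the first row, a small pre-flight step with the `sharp` image library keeps oversized originals from inflating input tokens; the 768px target is an assumption, so tune it to your model's media resolution:

```js
import sharp from "sharp";

// Shrink an image before it is tokenized; full-resolution originals cost far more input tokens.
async function downsampleForModel(inputPath) {
  return sharp(inputPath)
    .resize({ width: 768, withoutEnlargement: true }) // assumed target width
    .jpeg({ quality: 80 })
    .toBuffer();
}
```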

📊 Economic Metrics

  • Cost per Feature: < $0.05 (Target for Squaads agents).
  • Token Efficiency: > 80% (Knowledge vs Boilerplate).
  • Cache Hit Rate: > 75% for codebase queries (measurement sketch below).
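
These targets are only actionable if they are measured. A minimal in-process counter, with the price constant a placeholder:

```js
// Illustrative metrics collector for the targets above.
const metrics = { calls: 0, cacheHits: 0, totalCostUsd: 0 };

function recordCall({ hitCache, inputTokens, pricePerMTok = 0.50 }) {
  metrics.calls += 1;
  if (hitCache) metrics.cacheHits += 1;
  metrics.totalCostUsd += (inputTokens / 1e6) * pricePerMTok;
}

const cacheHitRate = () => (metrics.calls ? metrics.cacheHits / metrics.calls : 0);
// Alert when cacheHitRate() drops below the 0.75 target.
```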

🔄 Evolution of AI Pricing

  • 2023: Fixed per-token pricing (Prohibitive for large context).
  • 2024: First-gen Context Caching (Pro-only).
  • 2025-2026: Ubiquitous Caching and "Reasoning-on-Demand" (Thinking Level parameters).

End of AI Cost Optimizer Standard (v1.1.0)

Updated: January 22, 2026 - 23:45

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.