cohere-best-practices

by @RSHVR in AI & LLM

# Install this skill:

npx skills add RSHVR/unofficial-cohere-best-practices --skill "cohere-best-practices"

Install specific skill from multi-skill repository

# Description

Production best practices for Cohere AI APIs. Covers model selection, API configuration, error handling, cost optimization, and architectural patterns for chat, RAG, and agentic applications.

# SKILL.md

name: cohere-best-practices
description: Production best practices for Cohere AI APIs. Covers model selection, API configuration, error handling, cost optimization, and architectural patterns for chat, RAG, and agentic applications.

Cohere Best Practices Reference

Official Resources

Docs & Cookbooks: https://github.com/cohere-ai/cohere-developer-experience
API Reference: https://docs.cohere.com/reference/about

Model Selection Guide

| Use Case | Model | Notes |
| --- | --- | --- |
| General chat/reasoning | `command-a-03-2025` | Latest Command A model |
| RAG with citations | `command-r-plus-08-2024` | Excellent grounded generation |
| Cost-sensitive tasks | `command-r-08-2024` | Good balance of quality/cost |
| Embeddings (English) | `embed-english-v3.0` | Best for English-only |
| Embeddings (Multilingual) | `embed-multilingual-v3.0` | 100+ languages |
| Reranking | `rerank-v3.5` | Good balance |
| Reranking (Quality) | `rerank-v4.0-pro` | Best quality, slower |
| Reranking (Speed) | `rerank-v4.0-fast` | Optimized for latency |

API Configuration Best Practices

Use Client V2

import cohere

# Recommended: use ClientV2 for all new projects.
# With no arguments, the client reads the API key from the CO_API_KEY environment variable.
co = cohere.ClientV2()

# Legacy v1 client: avoid in new code
# co = cohere.Client()

Temperature Settings

# `messages` is your conversation: a list of {"role": ..., "content": ...} dicts

# For agents/tool calling: lower temperature for reliability
co.chat(model="command-a-03-2025", messages=messages, temperature=0.3)

# For creative tasks: higher temperature
co.chat(model="command-a-03-2025", messages=messages, temperature=0.7)

# For the most repeatable outputs: zero temperature
co.chat(model="command-a-03-2025", messages=messages, temperature=0)

Embedding Best Practices

Always Specify input_type

# For documents being indexed
doc_embeddings = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",  # Critical!
    embedding_types=["float"]
)

# For search queries
query_embedding = co.embed(
    texts=[query],
    model="embed-english-v3.0",  # Same model that embedded the documents
    input_type="search_query",  # Queries use search_query, not search_document
    embedding_types=["float"]
)

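Critical: Use input_type="search_document" when indexing and input_type="search_query" when querying, with the same model on both sides. Using the wrong input_type on either side will degrade search quality significantly.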
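One way to make the pairing hard to get wrong is to hide input_type behind two small wrappers. The sketch below is illustrative only: `embed_documents`, `embed_query`, and `EMBED_MODEL` are hypothetical names, and `client` is assumed to be a `cohere.ClientV2` instance (or anything with a compatible `embed` method).

```python
from typing import Any, Sequence

# Assumption: one shared model for indexing and querying.
EMBED_MODEL = "embed-english-v3.0"


def embed_documents(client: Any, texts: Sequence[str], model: str = EMBED_MODEL) -> Any:
    """Embed texts that will be stored in the index."""
    return client.embed(
        texts=list(texts),
        model=model,
        input_type="search_document",  # indexing side
        embedding_types=["float"],
    )


def embed_query(client: Any, query: str, model: str = EMBED_MODEL) -> Any:
    """Embed a single search query."""
    return client.embed(
        texts=[query],
        model=model,
        input_type="search_query",  # query side
        embedding_types=["float"],
    )
```

Callers then pick a function by intent (index vs. search) and never pass input_type by hand.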

RAG Best Practices

Two-Stage Retrieval Pattern

# Stage 1: Broad retrieval with embeddings
candidates = vectorstore.similarity_search(query, k=30)

# Stage 2: Precise reranking
reranked = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[doc.page_content for doc in candidates],
    top_n=5
)

# Use reranked results for generation
final_docs = [candidates[r.index] for r in reranked.results]
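The mapping from rerank results back to candidates can also filter out weak matches. The following is a minimal sketch, not part of the original skill: `select_reranked` and its `min_score` threshold are hypothetical, and it assumes each result exposes `index` and `relevance_score`, as the rerank response results do.

```python
from types import SimpleNamespace
from typing import Any, List, Sequence


def select_reranked(
    candidates: Sequence[Any],
    results: Sequence[Any],
    min_score: float = 0.0,
) -> List[Any]:
    """Return candidates in reranked order, dropping results below min_score."""
    return [
        candidates[r.index]
        for r in results
        if r.relevance_score >= min_score
    ]


# Stand-in objects shaped like rerank results, for illustration only.
example_results = [
    SimpleNamespace(index=2, relevance_score=0.91),
    SimpleNamespace(index=0, relevance_score=0.40),
    SimpleNamespace(index=1, relevance_score=0.05),
]
kept = select_reranked(["doc A", "doc B", "doc C"], example_results, min_score=0.3)
```

A suitable threshold depends on the model and the data, so treat any fixed value as something to tune rather than a recommendation.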

Grounded Generation with Citations

response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": question}],
    documents=[
        # final_docs holds retrieved document objects, so pass their text content
        {"id": f"doc_{i}", "data": {"text": doc.page_content}}
        for i, doc in enumerate(final_docs)
    ]
)

# Access citations (citations can be None when the model cites nothing)
for citation in response.message.citations or []:
    print(f"'{citation.text}' from {citation.sources}")
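Citations can also be rendered inline rather than printed separately. The sketch below assumes each citation carries `start` and `end` character offsets into the response text and a `sources` list whose items have an `id`. `annotate_with_citations` is a hypothetical helper, and the stand-in objects are for illustration only.

```python
from types import SimpleNamespace
from typing import Any, Optional, Sequence


def annotate_with_citations(text: str, citations: Optional[Sequence[Any]]) -> str:
    """Insert ' [source ids]' after each cited span of text."""
    annotated = text
    # Work from the end of the text backwards so earlier offsets stay valid.
    for citation in sorted(citations or [], key=lambda c: c.end, reverse=True):
        ids = ", ".join(source.id for source in citation.sources)
        annotated = annotated[: citation.end] + f" [{ids}]" + annotated[citation.end :]
    return annotated


# Stand-in citation objects shaped like the assumed response fields.
example_text = "Paris is the capital of France."
example_citations = [
    SimpleNamespace(start=0, end=5, sources=[SimpleNamespace(id="doc_0")]),
    SimpleNamespace(start=24, end=30, sources=[SimpleNamespace(id="doc_1")]),
]
annotated_text = annotate_with_citations(example_text, example_citations)
```

In a real call, the text would come from the response message content and the citations from `response.message.citations`.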

Error Handling

import time

from cohere.core import ApiError

def safe_chat(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return co.chat(
                model="command-a-03-2025",
                messages=messages
            )
        except ApiError as e:
            if e.status_code == 429:  # Rate limit
                time.sleep(2 ** attempt)
                continue
            elif e.status_code is not None and e.status_code >= 500:  # Server error
                time.sleep(1)
                continue
            else:
                raise
    raise RuntimeError("Max retries exceeded")
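The retry loop can be factored into a reusable helper that adds jitter, so many clients hitting a rate limit do not all retry at the same moment. This is a sketch under stated assumptions: `backoff_delay` and `call_with_retries` are hypothetical names, and the "full jitter" strategy (a random delay up to the exponential bound) is a common choice, not something the Cohere SDK prescribes.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    fn: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to max_retries times, sleeping between retryable failures."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            is_last_attempt = attempt == max_retries - 1
            if is_last_attempt or not is_retryable(exc):
                raise  # Surface the original error to the caller
            sleep(backoff_delay(attempt))
    raise RuntimeError("unreachable")  # Loop always returns or raises
```

With the Cohere client, `fn` could wrap the `co.chat(...)` call, and `is_retryable` could check for an `ApiError` whose `status_code` is 429 or at least 500.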

Cost Optimization

Use appropriate models: Don't use Command A for simple tasks; a cheaper model such as `command-r-08-2024` is often enough
Batch embeddings: Embed multiple texts in one call (up to 96 texts)
Cache embeddings: Store computed embeddings in a vector database
Use reranking wisely: Rerank only when the quality gain justifies the extra API call and latency
Stream for UX: Streaming doesn't cost more but improves perceived latency
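The batching and caching points above can be combined in one helper. The sketch below is illustrative: `batched`, `embed_with_cache`, and the in-memory dict cache are hypothetical, and `embed_batch` stands for any function that takes a list of texts and returns one vector per text (for example, a thin wrapper around `co.embed`).

```python
import hashlib
from typing import Callable, Dict, Iterator, List, Sequence

MAX_TEXTS_PER_CALL = 96  # Per-call text limit noted above

Vector = List[float]


def batched(texts: Sequence[str], size: int = MAX_TEXTS_PER_CALL) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield list(texts[start : start + size])


def _key(text: str) -> str:
    """Stable cache key for a text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    cache: Dict[str, Vector],
) -> List[Vector]:
    """Embed texts, calling embed_batch only for texts not already cached."""
    missing: List[str] = []
    seen = set()
    for text in texts:
        key = _key(text)
        if key not in cache and key not in seen:
            seen.add(key)
            missing.append(text)

    for batch in batched(missing):
        for text, vector in zip(batch, embed_batch(batch)):
            cache[_key(text)] = vector

    # Return vectors in the same order as the input texts.
    return [cache[_key(text)] for text in texts]
```

In practice the cache key should probably also include the model name and input_type, since the same text embeds differently under each, and the dict would be replaced by a persistent store.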

Production Checklist

[ ] Use ClientV2 for all API calls
[ ] Set appropriate temperature for your use case
[ ] Always specify input_type for embeddings
[ ] Implement retry logic with exponential backoff
[ ] Use two-stage retrieval for RAG
[ ] Cache embeddings to reduce API calls
[ ] Monitor token usage and costs
[ ] Handle rate limits gracefully

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf
