ratacat

assemblyai-streaming

# Install this skill:
npx skills add ratacat/claude-skills --skill "assemblyai-streaming"

Installs this specific skill from the multi-skill repository.

# Description

This skill should be used when working with AssemblyAI’s Speech-to-Text and LLM Gateway APIs, especially for streaming/live transcription, meeting notetakers, and voice agents that need low-latency transcripts and audio analysis.

# SKILL.md


---
name: assemblyai-streaming
description: This skill should be used when working with AssemblyAI’s Speech-to-Text and LLM Gateway APIs, especially for streaming/live transcription, meeting notetakers, and voice agents that need low-latency transcripts and audio analysis.
license: MIT
allowed-tools:
- Read
- Write
- Edit
- Grep
- Glob
- Bash
- Python
metadata:
  skill-version: "1.0.0"
  upstream-docs: "https://www.assemblyai.com/docs"
  focus: "streaming-stt, meeting-notetaker, voice-agent, llm-gateway"
---


## AssemblyAI Streaming & Live Transcription Skill

### Overview

Use this skill to build and maintain code that talks to AssemblyAI’s:

  • Streaming Speech-to-Text (STT) via WebSockets (wss://streaming.assemblyai.com/v3/ws)
  • Async / pre-recorded STT via REST (https://api.assemblyai.com/v2/transcript)
  • LLM Gateway for applying Claude/GPT/Gemini-style models to transcripts (https://llm-gateway.assemblyai.com)

The emphasis is on streaming/live transcription, meeting notetakers, and voice agents, while still covering async workflows and post-processing.

This skill assumes a Claude Code environment with access to Python (preferred) and Bash.


### When to Use

Use this skill when:

  • Implementing real-time transcription from a microphone, telephony stream, or audio file.
  • Building a live meeting notetaker (Zoom/Teams/Meet), especially with summaries, action items, and highlights.
  • Implementing a voice agent where latency and natural turn-taking matter.
  • Migrating from other STT providers (OpenAI/Deepgram/Google/AWS/etc.) to AssemblyAI.
  • Applying LLMs to audio via LLM Gateway for summaries, Q&A, topic tagging, or custom prompts.

Do not use this skill when:

  • The task is generic HTTP client usage with no AssemblyAI-specific logic.
  • The request clearly targets a different STT vendor.
  • The environment cannot safely store or use an API key.

### AssemblyAI Mental Model

#### 1. Products to care about

  1. Pre-recorded Speech-to-Text (Async)
     • REST API: POST /v2/transcript, GET /v2/transcript/{id}
     • Designed for files from URLs, uploads, S3, etc.
     • Supports extra models: summarization, topic detection, sentiment, PII redaction, chapters, etc.

  2. Streaming Speech-to-Text
     • WebSocket: wss://streaming.assemblyai.com/v3/ws
     • Low-latency, immutable transcripts (~300 ms).
     • Turn detection built in; fits voice agents and live captioning.

  3. LLM Gateway
     • REST API: POST /v1/chat/completions at https://llm-gateway.assemblyai.com
     • Unified access to multiple LLMs (Claude, GPT, Gemini, etc.).
     • Designed for “LLM over transcripts” workflows.

#### 2. Key model knobs (Async)

  • speech_models: ["slam-1", "universal"], etc.
    • Slam-1: best English accuracy + keyterms_prompt; good for medical/technical conversations.
    • Universal: multilingual coverage; good default if the language is unknown.
  • language_code vs language_detection:
    • Use language_code when the language is known.
    • Use language_detection: true when unknown; optionally set language_confidence_threshold.
  • keyterms_prompt:
    • Domain words/phrases to boost (medical terms, product names, etc.).
  • Extra intelligence: summarization, iab_categories, content_safety, entity_detection, auto_chapters, sentiment_analysis, speaker_labels, auto_highlights, redact_pii, etc.
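
A minimal async request that exercises a couple of these knobs might look like the sketch below. It hits the REST endpoint from section 1 with the requests library; the audio URL is hypothetical and the chosen parameters are only illustrative, so check them against the current docs before relying on them.

```python
# Sketch: submit an async transcript and poll until it finishes.
# The audio URL is hypothetical; parameter names follow the list above.
import os
import time

import requests

API_KEY = os.environ["ASSEMBLYAI_API_KEY"]
BASE_URL = "https://api.assemblyai.com"  # or https://api.eu.assemblyai.com

headers = {"authorization": API_KEY}
payload = {
    "audio_url": "https://example.com/meeting.mp3",  # hypothetical file
    "language_detection": True,
    "speaker_labels": True,
}

resp = requests.post(f"{BASE_URL}/v2/transcript", json=payload, headers=headers)
resp.raise_for_status()
transcript_id = resp.json()["id"]

# Poll /v2/transcript/{id} until the job completes or errors out.
while True:
    job = requests.get(f"{BASE_URL}/v2/transcript/{transcript_id}", headers=headers).json()
    if job["status"] == "completed":
        print(job["text"])
        break
    if job["status"] == "error":
        raise RuntimeError(job["error"])
    time.sleep(3)
```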

#### 3. Key model knobs (Streaming)

Connection URL:

  • US: wss://streaming.assemblyai.com/v3/ws
  • EU: wss://streaming.eu.assemblyai.com/v3/ws

Important query parameters:

  • sample_rate (required): e.g. 16000
  • format_turns (bool): return formatted final transcripts; avoid for low-latency voice agents.
  • speech_model: universal-streaming-english (default) or universal-streaming-multi.
  • keyterms_prompt: JSON-encoded list of terms, e.g. ["AssemblyAI", "Slam-1", "Keanu Reeves"].
  • Turn detection:
    • end_of_turn_confidence_threshold (0.0–1.0, default ~0.4)
    • min_end_of_turn_silence_when_confident (ms, default ~400)
    • max_turn_silence (ms, default ~1280)
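
For a raw WebSocket connection, these query parameters are appended to the connection URL. A small sketch of building that URL by hand (the official SDK normally does this for you):

```python
# Sketch: build the streaming WebSocket URL from the query parameters above.
# Shown only for raw WebSocket usage; the SDK assembles this internally.
import json
from urllib.parse import urlencode

BASE_WS = "wss://streaming.assemblyai.com/v3/ws"  # or the EU host

params = {
    "sample_rate": 16000,                                     # required
    "format_turns": "false",                                  # keep false for low-latency agents
    "speech_model": "universal-streaming-english",
    "keyterms_prompt": json.dumps(["AssemblyAI", "Slam-1"]),  # JSON-encoded list
    "end_of_turn_confidence_threshold": 0.4,
    "min_end_of_turn_silence_when_confident": 400,
    "max_turn_silence": 1280,
}

ws_url = f"{BASE_WS}?{urlencode(params)}"
print(ws_url)
```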

Headers:

  • Use either an Authorization: <API_KEY> header or a token query parameter carrying a short-lived token issued by your backend.

Messages:

  • Client sends:
    • Binary audio chunks (50–1000 ms each).
    • Optional JSON messages: {"type": "UpdateConfig", ...}, {"type": "Terminate"}, {"type": "ForceEndpoint"}.
  • Server sends:
    • Begin event with id, expires_at.
    • Turn events with:
      • transcript (immutable partials/finals),
      • utterance (complete semantic chunk),
      • end_of_turn (bool),
      • turn_is_formatted (bool),
      • words array with timestamps/confidences.
    • Termination event with summary stats.
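
A sketch of handling these messages over a raw WebSocket (it assumes each server message carries a type field matching the event names above, and a ws object with a send() method such as one from the websocket-client package; verify both against the docs):

```python
# Sketch: dispatch on the server message types described above.
# Assumes each JSON message has a "type" field matching the event names.
import json

def handle_server_message(raw: str) -> None:
    event = json.loads(raw)
    if event["type"] == "Begin":
        print("session", event["id"], "expires at", event["expires_at"])
    elif event["type"] == "Turn":
        marker = "END_OF_TURN" if event.get("end_of_turn") else "PARTIAL"
        print(f"[{marker}] {event.get('transcript', '')}")
    elif event["type"] == "Termination":
        print("terminated:", event)

def request_shutdown(ws) -> None:
    # Ask the server to flush outstanding audio and close the session.
    ws.send(json.dumps({"type": "Terminate"}))
```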

#### 4. Regions and data residency

  • Async:
    • US: https://api.assemblyai.com
    • EU: https://api.eu.assemblyai.com
  • Streaming:
    • US: wss://streaming.assemblyai.com/v3/ws
    • EU: wss://streaming.eu.assemblyai.com/v3/ws

Always keep base URLs consistent per project; don’t mix US/EU endpoints for the same data.
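
One simple way to enforce that is a single region switch that derives both base URLs (a sketch; the hostnames are the ones listed above):

```python
# Sketch: derive both base URLs from one region setting so US and EU
# endpoints are never mixed within the same project.
REGION = "us"  # or "eu", chosen once per project

ASYNC_BASE = {
    "us": "https://api.assemblyai.com",
    "eu": "https://api.eu.assemblyai.com",
}[REGION]

STREAMING_WS = {
    "us": "wss://streaming.assemblyai.com/v3/ws",
    "eu": "wss://streaming.eu.assemblyai.com/v3/ws",
}[REGION]
```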


### Security & API Keys

  • Always require an AssemblyAI API key and keep it out of source in Claude Code output:
    • Use environment variables: ASSEMBLYAI_API_KEY.
    • Or placeholders ("<YOUR_API_KEY>") in snippets.
  • For browser/client code:
    • Do not embed the API key.
    • Instruct the user to generate temporary streaming tokens on their backend and pass only the token into the WebSocket connection.
  • Never print real keys in logs or comments.
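
For browser clients, the usual pattern is a small backend helper that mints a short-lived streaming token. The sketch below assumes a token endpoint on the streaming host with an expires_in_seconds parameter; both are assumptions to confirm against the current docs.

```python
# Sketch: backend helper that mints a short-lived streaming token so the
# browser never sees the real API key. The token endpoint path and its
# parameters are assumptions here; verify them against the AssemblyAI docs.
import os

import requests

API_KEY = os.environ["ASSEMBLYAI_API_KEY"]  # stays on the server

def create_streaming_token(expires_in_seconds: int = 60) -> str:
    resp = requests.get(
        "https://streaming.assemblyai.com/v3/token",   # assumed endpoint
        params={"expires_in_seconds": expires_in_seconds},
        headers={"authorization": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()["token"]

# The browser then connects with ?token=<value> instead of an API key.
```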

### High-Level Workflow Patterns

#### Decision tree

  1. Is the audio live?
     • Yes → Use Streaming STT.
     • No → Use Async STT.

  2. Is latency critical (<1 s) for responses?
     • Yes → Streaming with format_turns=false and careful turn detection.
     • No → Async, then Summarization/Chapters/etc.

  3. Do transcripts leave the backend?
     • Yes → Consider redact_pii (and optionally redact_pii_audio) before sharing.
     • No → Use raw transcripts as needed.

  4. Need LLM-based processing (Q&A, structured summaries)?
     • Yes → Pipe transcripts into LLM Gateway via chat/completions (see the sketch below).
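
A sketch of that last step, sending a finished transcript to LLM Gateway. The endpoint follows the chat/completions path named earlier; the model name is a placeholder, and the Authorization header is assumed to take the same API key as the other endpoints.

```python
# Sketch: send a finished transcript to LLM Gateway for a structured summary.
# The model name is a placeholder; the auth header is assumed to be the API key.
import os

import requests

API_KEY = os.environ["ASSEMBLYAI_API_KEY"]

def summarize(transcript_text: str) -> str:
    resp = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": API_KEY},
        json={
            "model": "<MODEL_NAME>",  # placeholder, pick one from the LLM Gateway docs
            "messages": [
                {"role": "system", "content": "Summarize the meeting into decisions and action items."},
                {"role": "user", "content": transcript_text},
            ],
        },
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```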

### How Claude Should Work with This Skill

#### General principles

  • Prefer official AssemblyAI SDKs (Python/JS) when available; fall back to requests/websocket-client only if the SDK cannot be installed.
  • Always:
    • Validate HTTP responses and WebSocket status.
    • Surface useful error messages (status, error fields in transcript JSON).
    • Respect documented min/max chunk sizes (50–1000 ms of audio per binary message).
  • For voice-agent code, optimize for:
    • Immutable partials (transcript) and the utterance field.
    • Minimal latency; avoid extra formatting passes.
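
As a quick sanity check on the chunk-size rule, the per-message byte budget for 16 kHz, 16-bit mono PCM works out as in the sketch below (the 50–1000 ms bounds come from the list above):

```python
# Sketch: bytes per binary audio message for 16 kHz, 16-bit mono PCM,
# given the documented 50–1000 ms per-message bounds.
SAMPLE_RATE = 16_000   # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit PCM, mono

def chunk_bytes(duration_ms: int) -> int:
    return int(SAMPLE_RATE * BYTES_PER_SAMPLE * duration_ms / 1000)

print(chunk_bytes(50))    # 1600 bytes  -> smallest allowed message
print(chunk_bytes(1000))  # 32000 bytes -> largest allowed message
```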

### Recipe 1 – Minimal Streaming from Microphone (Python SDK)

Goal: Stream mic audio to AssemblyAI and print transcripts in real time.

Use this when the environment has Python and assemblyai + pyaudio installed, and the user wants a quick streaming demo.

```python
import os

import assemblyai as aai
from assemblyai.streaming import v3 as aai_stream
import pyaudio

# Never hard-code a real key; read it from the environment or use a placeholder.
API_KEY = os.environ.get("ASSEMBLYAI_API_KEY", "<YOUR_API_KEY>")

aai.settings.api_key = API_KEY

SAMPLE_RATE = 16000
CHUNK_MS = 50
FRAMES_PER_BUFFER = int(SAMPLE_RATE * (CHUNK_MS / 1000.0))


def main():
    client = aai_stream.StreamingClient(
        aai_stream.StreamingClientOptions(
            api_key=API_KEY,
            api_host="streaming.assemblyai.com",  # or "streaming.eu.assemblyai.com"
        )
    )

    def on_begin(_client, event: aai_stream.BeginEvent):
        print(f"Session started: {event.id}, expires at {event.expires_at}")

    def on_turn(_client, event: aai_stream.TurnEvent):
        # Use immutable transcript text
        text = (event.transcript or "").strip()
        if not text:
            return
        # Use formatted finals only for display; keep unformatted for LLMs
        if event.turn_is_formatted:
            print(f"[FINAL] {text}")
        else:
            print(f"[PARTIAL] {text}", end="\r")

    def on_terminated(_client, event: aai_stream.TerminationEvent):
        print(f"\nTerminated. Audio duration={event.audio_duration_seconds}s")

    def on_error(_client, error: aai_stream.StreamingError):
        print(f"\nStreaming error: {error}")

    client.on(aai_stream.StreamingEvents.Begin, on_begin)
    client.on(aai_stream.StreamingEvents.Turn, on_turn)
    client.on(aai_stream.StreamingEvents.Termination, on_terminated)
    client.on(aai_stream.StreamingEvents.Error, on_error)

    client.connect(
        aai_stream.StreamingParameters(
            sample_rate=SAMPLE_RATE,
            format_turns=False,  # better latency for voice agents
        )
    )

    pa = pyaudio.PyAudio()
    stream = pa.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=FRAMES_PER_BUFFER,
    )

    try:
        print("Speak into your microphone (Ctrl+C to stop)...")

        def audio_gen():
            while True:
                yield stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)

        client.stream(audio_gen())
    except KeyboardInterrupt:
        pass
    finally:
        client.disconnect(terminate=True)
        stream.stop_stream()
        stream.close()
        pa.terminate()


if __name__ == "__main__":
    main()
```

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.