# llm-streaming-response-handler

# Install this skill:
npx skills add erichowens/some_claude_skills --skill "llm-streaming-response-handler"

Installs this specific skill from the multi-skill repository.

# Description

Build production LLM streaming UIs with Server-Sent Events, real-time token display, cancellation, error recovery. Handles OpenAI/Anthropic/Claude streaming APIs. Use for chatbots, AI assistants, real-time text generation. Activate on "LLM streaming", "SSE", "token stream", "chat UI", "real-time AI". NOT for batch processing, non-streaming APIs, or WebSocket bidirectional chat.

# SKILL.md


---
name: llm-streaming-response-handler
description: Build production LLM streaming UIs with Server-Sent Events, real-time token display, cancellation, error recovery. Handles OpenAI/Anthropic/Claude streaming APIs. Use for chatbots, AI assistants, real-time text generation. Activate on "LLM streaming", "SSE", "token stream", "chat UI", "real-time AI". NOT for batch processing, non-streaming APIs, or WebSocket bidirectional chat.
allowed-tools: Read,Write,Edit,Bash(npm:*)
---


## LLM Streaming Response Handler

Expert in building production-grade streaming interfaces for LLM responses that feel instant and responsive.

### When to Use

Use for:
- Chat interfaces with typing animation
- Real-time AI assistants
- Code generation with live preview
- Document summarization with progressive display
- Any UI where users expect immediate feedback from LLMs

NOT for:
- Batch document processing (no user watching)
- APIs that don't support streaming
- WebSocket-based bidirectional chat (use Socket.IO)
- Simple request/response (fetch is fine)

### Quick Decision Tree

Does your LLM interaction:
├── Need immediate visual feedback? → Streaming
├── Display long-form content (>100 words)? → Streaming
├── User expects typewriter effect? → Streaming
├── Short response (<50 words)? → Regular fetch
└── Background processing? → Regular fetch

### Technology Selection

Why SSE over WebSockets for LLM streaming:
- Simplicity: HTTP-based, works with existing infrastructure
- Auto-reconnect: Built-in reconnection logic (see the EventSource sketch below)
- Firewall-friendly: Easier than WebSockets through proxies
- One-way perfect: LLMs only stream server → client
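
The auto-reconnect point comes from the browser's built-in EventSource API, which retries dropped connections on its own. A minimal sketch, assuming a hypothetical GET endpoint /api/stream that emits data: {"content":"..."} events; the fetch-based patterns later in this skill exist because EventSource only supports GET requests without a request body:

// Browser-native SSE client; reconnection is handled automatically
const source = new EventSource('/api/stream');

source.onmessage = (event: MessageEvent<string>) => {
  const { content } = JSON.parse(event.data);
  document.querySelector('#output')!.textContent += content;
};

source.onerror = () => {
  // EventSource retries by itself; call source.close() to stop for good
  console.warn('Connection interrupted, the browser will retry');
};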

Timeline:
- 2015-2020: WebSockets for everything
- 2020: SSE adoption for streaming APIs
- 2023+: SSE standard for LLM streaming (OpenAI, Anthropic)
- 2024: Vercel AI SDK popularizes SSE patterns

### Streaming APIs

| Provider | Streaming Method | Response Format |
|---|---|---|
| OpenAI | SSE | `data: {"choices":[{"delta":{"content":"token"}}]}` |
| Anthropic | SSE | `data: {"type":"content_block_delta","delta":{"text":"token"}}` |
| Claude (API) | SSE | `data: {"delta":{"text":"token"}}` |
| Vercel AI SDK | SSE | Normalized across providers |
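
Because each provider wraps tokens in a different JSON envelope, a small helper keeps UI code provider-agnostic. A hedged sketch: the extractToken helper is illustrative, and the field paths follow the table above rather than covering every event type:

// Extract the text delta from one parsed SSE `data:` payload
type Provider = 'openai' | 'anthropic';

function extractToken(provider: Provider, payload: any): string {
  if (provider === 'openai') {
    return payload.choices?.[0]?.delta?.content ?? '';
  }
  // Anthropic Messages API: only content_block_delta events carry text
  if (payload.type === 'content_block_delta') {
    return payload.delta?.text ?? '';
  }
  return '';
}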

### Common Anti-Patterns

#### Anti-Pattern 1: Buffering Before Display

Novice thinking: "Collect all tokens, then show complete response"

Problem: Defeats the entire purpose of streaming.

Wrong approach:

// ❌ Waits for entire response before showing anything
const response = await fetch('/api/chat', { method: 'POST', body: prompt });
const fullText = await response.text();
setMessage(fullText); // User sees nothing until done

Correct approach:

// ✅ Display tokens as they arrive
const response = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ prompt })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const chunk = decoder.decode(value);
  const lines = chunk.split('\n').filter(line => line.trim());

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      setMessage(prev => prev + data.content); // Update immediately
    }
  }
}

Timeline:
- Pre-2023: Many apps buffered entire response
- 2023+: Token-by-token display expected


#### Anti-Pattern 2: No Stream Cancellation

Problem: User can't stop generation, wasting tokens and money.

Symptom: "Stop" button doesn't work or doesn't exist.

Correct approach:

// ✅ AbortController for cancellation
const [abortController, setAbortController] = useState<AbortController | null>(null);

const streamResponse = async () => {
  const controller = new AbortController();
  setAbortController(controller);

  try {
    const response = await fetch('/api/chat', {
      signal: controller.signal,
      method: 'POST',
      body: JSON.stringify({ prompt })
    });

    // Stream handling...
  } catch (error) {
    if (error.name === 'AbortError') {
      console.log('Stream cancelled by user');
    }
  } finally {
    setAbortController(null);
  }
};

const cancelStream = () => {
  abortController?.abort();
};

return (
  <button onClick={cancelStream} disabled={!abortController}>
    Stop Generating
  </button>
);

#### Anti-Pattern 3: No Error Recovery

Problem: Stream fails mid-response, user sees partial text with no indication of failure.

Correct approach:

// ✅ Error states and recovery
const [streamState, setStreamState] = useState<'idle' | 'streaming' | 'error' | 'complete'>('idle');
const [errorMessage, setErrorMessage] = useState<string | null>(null);

try {
  setStreamState('streaming');

  // Streaming logic...

  setStreamState('complete');
} catch (error) {
  setStreamState('error');

  if (error.name === 'AbortError') {
    setErrorMessage('Generation stopped');
  } else if (error.message.includes('429')) {
    setErrorMessage('Rate limit exceeded. Try again in a moment.');
  } else {
    setErrorMessage('Something went wrong. Please retry.');
  }
}

// UI feedback
{streamState === 'error' && (
  <div className="error-banner">
    {errorMessage}
    <button onClick={retryStream}>Retry</button>
  </div>
)}

#### Anti-Pattern 4: Memory Leaks from Unclosed Streams

Problem: Streams not cleaned up, causing memory leaks.

Symptom: Browser slows down after multiple requests.

Correct approach:

// ✅ Cleanup with useEffect
useEffect(() => {
  let reader: ReadableStreamDefaultReader | null = null;

  const streamResponse = async () => {
    const response = await fetch('/api/chat', { ... });
    reader = response.body.getReader();

    // Streaming...
  };

  streamResponse();

  // Cleanup on unmount
  return () => {
    reader?.cancel();
  };
}, [prompt]);

#### Anti-Pattern 5: No Typing Indicator Between Tokens

Problem: UI feels frozen between slow tokens.

Correct approach:

// ✅ Animated cursor during generation
<div className="message">
  {content}
  {isStreaming && <span className="typing-cursor">▊</span>}
</div>

/* Companion CSS */
.typing-cursor {
  animation: blink 1s step-end infinite;
}

@keyframes blink {
  50% { opacity: 0; }
}

### Implementation Patterns

#### Pattern 1: Basic SSE Stream Handler

async function* streamCompletion(prompt: string) {
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt })
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // A network chunk can end mid-line, so buffer until a full line arrives
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the incomplete trailing line for the next chunk

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6));

        if (data.content) {
          yield data.content;
        }

        if (data.done) {
          return;
        }
      }
    }
  }
}

// Usage
for await (const token of streamCompletion('Hello')) {
  console.log(token);
}

#### Pattern 2: React Hook for Streaming

import { useState, useCallback } from 'react';

interface UseStreamingOptions {
  onToken?: (token: string) => void;
  onComplete?: (fullText: string) => void;
  onError?: (error: Error) => void;
}

export function useStreaming(options: UseStreamingOptions = {}) {
  const [content, setContent] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const [error, setError] = useState<Error | null>(null);
  const [abortController, setAbortController] = useState<AbortController | null>(null);

  const stream = useCallback(async (prompt: string) => {
    const controller = new AbortController();
    setAbortController(controller);
    setIsStreaming(true);
    setError(null);
    setContent('');

    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        signal: controller.signal,
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt })
      });

      const reader = response.body!.getReader();
      const decoder = new TextDecoder();

      let accumulated = '';
      let buffer = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // Buffer partial SSE lines that can span network chunks
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? '';

        for (const line of lines) {
          if (line.startsWith('data: ')) {
            const data = JSON.parse(line.slice(6));

            if (data.content) {
              accumulated += data.content;
              setContent(accumulated);
              options.onToken?.(data.content);
            }
          }
        }
      }

      options.onComplete?.(accumulated);
    } catch (err) {
      if ((err as Error).name !== 'AbortError') {
        setError(err as Error);
        options.onError?.(err as Error);
      }
    } finally {
      setIsStreaming(false);
      setAbortController(null);
    }
  }, [options]);

  const cancel = useCallback(() => {
    abortController?.abort();
  }, [abortController]);

  return { content, isStreaming, error, stream, cancel };
}

// Usage in component
function ChatInterface() {
  const { content, isStreaming, stream, cancel } = useStreaming({
    onToken: (token) => console.log('New token:', token),
    onComplete: (text) => console.log('Done:', text)
  });

  return (
    <div>
      <div className="message">
        {content}
        {isStreaming && <span className="cursor">▊</span>}
      </div>

      <button onClick={() => stream('Tell me a story')} disabled={isStreaming}>
        Generate
      </button>

      {isStreaming && <button onClick={cancel}>Stop</button>}
    </div>
  );
}

#### Pattern 3: Server-Side Streaming (Next.js)

// app/api/chat/route.ts
import { OpenAI } from 'openai';

export const runtime = 'edge'; // Edge runtime keeps latency low; Node.js route handlers in the App Router can also stream

export async function POST(req: Request) {
  const { prompt } = await req.json();

  const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY
  });

  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }],
    stream: true
  });

  // Convert OpenAI stream to SSE format
  const encoder = new TextEncoder();

  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          const content = chunk.choices[0]?.delta?.content;

          if (content) {
            const sseMessage = `data: ${JSON.stringify({ content })}\n\n`;
            controller.enqueue(encoder.encode(sseMessage));
          }
        }

        // Send completion signal
        controller.enqueue(encoder.encode('data: {"done":true}\n\n'));
        controller.close();
      } catch (error) {
        controller.error(error);
      }
    }
  });

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive'
    }
  });
}

### Production Checklist

□ AbortController for cancellation
□ Error states with retry capability
□ Typing indicator during generation
□ Cleanup on component unmount
□ Rate limiting on API route
□ Token usage tracking
□ Streaming fallback (if API fails)
□ Accessibility (screen reader announces updates; see the aria-live sketch below this list)
□ Mobile-friendly (touch targets for stop button)
□ Network error recovery (auto-retry on disconnect)
□ Max response length enforcement
□ Cost estimation before generation
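
As a concrete sketch for the accessibility item above: wrapping the message in an aria-live region lets screen readers pick up streamed text. Announcing every token is noisy, so this assumes you keep the region silent while streaming and announce the finished message; the StreamedMessage component name is illustrative:

interface StreamedMessageProps {
  content: string;
  isStreaming: boolean;
}

function StreamedMessage({ content, isStreaming }: StreamedMessageProps) {
  // aria-live="polite" announces new content when the screen reader is idle;
  // keeping it "off" during streaming avoids reading every token aloud
  return (
    <div
      className="message"
      aria-live={isStreaming ? 'off' : 'polite'}
      aria-busy={isStreaming}
    >
      {content}
      {isStreaming && <span className="typing-cursor">▊</span>}
    </div>
  );
}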

### When to Use vs Avoid

| Scenario | Use Streaming? |
|---|---|
| Chat interface | ✅ Yes |
| Long-form content generation | ✅ Yes |
| Code generation with preview | ✅ Yes |
| Short completions (<50 words) | ❌ No - regular fetch |
| Background jobs | ❌ No - use job queue |
| Bidirectional chat | ⚠️ Use WebSockets instead |

### Technology Comparison

| Feature | SSE | WebSockets | Long Polling |
|---|---|---|---|
| Complexity | Low | Medium | High |
| Auto-reconnect | ✅ Built-in | ❌ Manual | ❌ Manual |
| Bidirectional | ❌ Server → client only | ✅ | ❌ |
| Firewall-friendly | ✅ | ⚠️ | ✅ |
| Browser support | ✅ All modern | ✅ All modern | ✅ Universal |
| LLM API support | ✅ Standard | ❌ Rare | ❌ Not used |

### References

  • /references/sse-protocol.md - Server-Sent Events specification details
  • /references/vercel-ai-sdk.md - Vercel AI SDK integration patterns
  • /references/error-recovery.md - Stream error handling strategies

### Scripts

  • scripts/stream_tester.ts - Test SSE endpoints locally
  • scripts/token_counter.ts - Estimate costs before generation

This skill guides: LLM streaming implementation | SSE protocol | Real-time UI updates | Cancellation | Error recovery | Token-by-token display

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.