404kidwiz

websocket-engineer

# Install this skill:
npx skills add 404kidwiz/claude-supercode-skills --skill "websocket-engineer"

Installs a specific skill from a multi-skill repository.

# Description

Expert in real-time communication systems, including WebSockets, Socket.IO, SSE, and WebRTC.

# SKILL.md


---
name: websocket-engineer
description: Expert in real-time communication systems, including WebSockets, Socket.IO, SSE, and WebRTC.
---

WebSocket & Real-Time Engineer

Purpose

Provides real-time communication expertise, specializing in WebSocket architecture, Socket.IO, and event-driven systems. Builds low-latency, bidirectional communication systems that scale to millions of concurrent connections.

When to Use

  • Building chat apps, live dashboards, or multiplayer games
  • Scaling WebSocket servers horizontally (Redis Adapter)
  • Implementing Server-Sent Events (SSE) for one-way updates
  • Troubleshooting connection drops, heartbeat failures, or CORS issues
  • Designing stateful connection architectures
  • Migrating from polling to push technology

Examples

Example 1: Real-Time Chat Application

Scenario: Building a scalable chat platform for enterprise use.

Implementation:
1. Designed WebSocket architecture with Socket.IO
2. Implemented Redis Adapter for horizontal scaling
3. Created room-based message routing
4. Added message persistence and history
5. Implemented presence system (online/offline)

Results:
- Supports 100,000+ concurrent connections
- 50ms average message delivery
- 99.99% connection stability
- Seamless horizontal scaling

Example 2: Live Dashboard System

Scenario: Real-time analytics dashboard with sub-second updates.

Implementation:
1. Implemented WebSocket server with low latency
2. Created efficient message batching strategy
3. Added Redis pub/sub for multi-server support
4. Implemented client-side update coalescing
5. Added compression for large payloads

Results:
- Dashboard updates in under 100ms
- Handles 10,000 concurrent dashboard views
- 80% reduction in server load vs polling
- Zero data loss during reconnections

Example 3: Multiplayer Game Backend

Scenario: Low-latency multiplayer game server.

Implementation:
1. Implemented WebSocket server with binary protocols
2. Created authoritative server architecture
3. Added client-side prediction and reconciliation
4. Implemented lag compensation algorithms
5. Set up server-side physics and collision detection

Results:
- 30ms end-to-end latency
- Supports 1000 concurrent players per server
- Smooth gameplay despite network variations
- Cheat-resistant server authority

Best Practices

Connection Management

  • Heartbeats: Implement ping/pong for connection health (see the sketch after this list)
  • Reconnection: Automatic reconnection with backoff
  • State Cleanup: Proper cleanup on disconnect
  • Connection Limits: Prevent resource exhaustion
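
A minimal heartbeat sketch using the ws library (the interval is illustrative; Socket.IO handles this automatically via its pingInterval/pingTimeout options):

```javascript
// Heartbeat sketch with the `ws` library: terminate peers that stop answering pings
const { WebSocketServer } = require("ws");
const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  ws.isAlive = true;
  ws.on("pong", () => { ws.isAlive = true; });     // client answered our last ping
});

const sweep = setInterval(() => {
  for (const ws of wss.clients) {
    if (!ws.isAlive) { ws.terminate(); continue; } // dead connection: clean it up
    ws.isAlive = false;
    ws.ping();                                     // expect a pong before the next sweep
  }
}, 30000); // 30s; keep this below your load balancer's idle timeout

wss.on("close", () => clearInterval(sweep));
```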

Scaling

  • Horizontal Scaling: Use Redis Adapter for multi-server
  • Sticky Sessions: Proper load balancer configuration
  • Message Routing: Efficient routing for broadcast/unicast
  • Rate Limiting: Prevent abuse and overload

Performance

  • Message Batching: Batch messages where appropriate (see the sketch after this list)
  • Compression: Compress messages (permessage-deflate)
  • Binary Protocols: Use binary for performance-critical data
  • Connection Pooling: Efficient client connection reuse
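
A sketch of the batching idea, assuming a Socket.IO server io with clients in a "dashboard" room; small updates are queued and flushed as a single frame on a timer:

```javascript
// Message batching sketch: coalesce many small updates into one emit per tick
const queue = [];

function pushUpdate(update) {
  queue.push(update);                                // cheap enqueue on every change
}

setInterval(() => {
  if (queue.length === 0) return;
  io.to("dashboard").emit("batch", queue.splice(0)); // one frame instead of N
}, 100); // flush every 100ms; tune to your latency budget
```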

Security

  • Authentication: Validate credentials during the handshake (see the sketch after this list)
  • TLS: Always use WSS
  • Input Validation: Validate all incoming messages
  • Rate Limiting: Limit connection/message rates
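
A common pattern is a Socket.IO middleware that validates a token during the handshake; a sketch, assuming a verifyToken helper you would supply:

```javascript
// Handshake authentication sketch (Socket.IO middleware)
io.use(async (socket, next) => {
  try {
    const token = socket.handshake.auth.token;   // sent by the client in the handshake payload
    socket.data.user = await verifyToken(token); // hypothetical verification helper
    next();                                      // accept the connection
  } catch (err) {
    next(new Error("unauthorized"));             // reject the connection
  }
});
```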

---

2. Decision Framework

Protocol Selection

What is the communication pattern?
│
├─ **Bi-directional (Chat/Game)**
│  ├─ Low Latency needed? → **WebSockets (Raw)**
│  ├─ Fallbacks/Auto-reconnect needed? → **Socket.IO**
│  └─ P2P Video/Audio? → **WebRTC**
│
├─ **One-way (Server → Client)**
│  ├─ Stock Ticker / Notifications? → **Server-Sent Events (SSE)** (sketch below)
│  └─ Large File Download? → **HTTP Stream**
│
└─ **High Frequency (IoT)**
   └─ Constrained device? → **MQTT** (over TCP/WS)
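
For the one-way branch, an SSE endpoint is just a long-lived HTTP response; a minimal Node sketch (the route and payload are illustrative):

```javascript
// Server-Sent Events sketch: one-way push over plain HTTP
const http = require("http");

http.createServer((req, res) => {
  if (req.url !== "/events") {
    res.statusCode = 404;
    res.end();
    return;
  }

  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive"
  });

  const timer = setInterval(() => {
    // each message is "data: ...\n\n"; the browser's EventSource reassembles it
    res.write(`data: ${JSON.stringify({ price: Math.random() })}\n\n`);
  }, 1000);

  req.on("close", () => clearInterval(timer)); // stop pushing when the client disconnects
}).listen(3001);
```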

Scaling Strategy

| Scale       | Architecture  | Backend                              |
|-------------|---------------|--------------------------------------|
| < 10k users | Monolith      | Node.js, single instance             |
| 10k - 100k  | Clustering    | Node.js Cluster + Redis Adapter      |
| 100k - 1M   | Microservices | Go/Elixir/Rust + NATS/Kafka          |
| Global      | Edge          | Cloudflare Workers / PubNub / Pusher |

Load Balancer Config

  • Sticky Sessions: REQUIRED for Socket.IO when the HTTP long-polling fallback is enabled (the handshake spans multiple requests).
  • Timeouts: Increase idle timeouts (e.g., 60s+) so the load balancer does not cut healthy connections.
  • Headers: Forward Upgrade: websocket and Connection: Upgrade.

Red Flags → Escalate to security-engineer:
- Accepting connections from any Origin (*) with credentials
- No Rate Limiting on connection requests (DoS risk)
- Sending JWTs in URL query params (they end up in proxy and access logs); send the token in a cookie or the initial handshake payload instead (see the client sketch below)
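
A client-side sketch of the handshake-payload approach with socket.io-client (the sessionToken variable is assumed to come from your login flow):

```javascript
// Client sketch: the token travels in the handshake payload, never in the query string
const { io } = require("socket.io-client");

const socket = io("https://myapp.com", {
  auth: { token: sessionToken } // assumed to be obtained from your login flow;
                                // read server-side via socket.handshake.auth.token
});
```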

---

3. Core Workflows

Workflow 1: Scalable Socket.IO Server (Node.js)

Goal: Chat server capable of scaling across multiple cores/instances.

Steps:

  1. Install Dependencies
    ```bash
    npm install socket.io redis @socket.io/redis-adapter
    ```

  2. Implementation (server.js)
    ```javascript
    const { Server } = require("socket.io");
    const { createClient } = require("redis");
    const { createAdapter } = require("@socket.io/redis-adapter");

    const pubClient = createClient({ url: "redis://localhost:6379" });
    const subClient = pubClient.duplicate();

    Promise.all([pubClient.connect(), subClient.connect()]).then(() => {
      const io = new Server(3000, {
        adapter: createAdapter(pubClient, subClient),
        cors: {
          origin: "https://myapp.com",
          methods: ["GET", "POST"]
        }
      });

      io.on("connection", (socket) => {
        // User joins a room (e.g., "chat-123")
        socket.on("join", (room) => {
          socket.join(room);
        });

        // Send message to room (propagates via Redis to all nodes)
        socket.on("message", (data) => {
          io.to(data.room).emit("chat", data.text);
        });
      });
    });
    ```
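
A minimal client for this server might look like the following sketch (socket.io-client; the room and event names mirror the server above, while the URL is an assumption):

```javascript
// Client sketch for the chat server above
const { io } = require("socket.io-client");

const socket = io("http://localhost:3000", {
  transports: ["websocket"],   // skip the long-polling upgrade if the fallback is not needed
  reconnectionDelayMax: 10000  // cap the built-in reconnection backoff at 10s
});

socket.on("connect", () => {
  socket.emit("join", "chat-123");                             // join a room
  socket.emit("message", { room: "chat-123", text: "hello" }); // fan-out via Redis
});

socket.on("chat", (text) => console.log("received:", text));
```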

---

Workflow 3: Production Tuning (Linux)

Goal: Handle 50k concurrent connections on a single server.

Steps:

  1. File Descriptors

    • Increase limit: ulimit -n 65535.
    • Edit /etc/security/limits.conf.
  2. Ephemeral Ports

    • Increase range: sysctl -w net.ipv4.ip_local_port_range="1024 65535".
  3. Memory Optimization

    • Use the lighter ws library instead of Socket.IO if its extra features (rooms, acks, fallbacks) are not needed.
    • Disable permessage-deflate compression if CPU usage is high.

---

5. Anti-Patterns & Gotchas

❌ Anti-Pattern 1: Stateful Monolith

What it looks like:
- Storing a users = [] array in a single Node.js process's memory.

Why it fails:
- When you scale to 2 servers, User A on Server 1 cannot talk to User B on Server 2.
- Memory leaks crash the process.

Correct approach:
- Use Redis as the shared state store (the adapter for routing, explicit keys for app state).
- Stateless servers, stateful backend (Redis); see the sketch below.
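
A minimal sketch of the idea, assuming the io server from Workflow 1 and a connected node-redis client named redis; presence lives in a Redis set instead of a per-process array, so any node can read it:

```javascript
// Presence tracked in Redis instead of an in-process array (sketch)
io.on("connection", (socket) => {
  const userId = socket.handshake.auth.userId;     // assumed to be sent by the client

  socket.on("join", async (room) => {
    socket.join(room);
    await redis.sAdd(`online:${room}`, userId);    // visible to every node
  });

  socket.on("disconnecting", async () => {
    for (const room of socket.rooms) {
      if (room === socket.id) continue;            // every socket sits in its own id-room
      await redis.sRem(`online:${room}`, userId);  // clean up presence on disconnect
    }
  });
});
```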

❌ Anti-Pattern 2: The "Thundering Herd"

What it looks like:
- Server restarts. 100,000 clients reconnect instantly.
- Server crashes again due to CPU spike.

Why it fails:
- Connection handshakes are expensive (TLS + Auth).

Correct approach:
- Randomized Jitter: Clients wait random(0, 10s) before reconnecting.
- Exponential Backoff: Wait 1s, then 2s, then 4s... (see the sketch below)
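
A minimal client-side sketch combining both for a raw WebSocket (the URL and caps are illustrative; the Socket.IO client offers similar behaviour via its reconnectionDelay and randomizationFactor options):

```javascript
// Reconnect with exponential backoff + jitter (browser WebSocket, sketch)
function connect(url, attempt = 0) {
  const ws = new WebSocket(url);

  ws.onopen = () => { attempt = 0; };                  // reset backoff after a healthy connection

  ws.onclose = () => {
    const base = Math.min(1000 * 2 ** attempt, 30000); // 1s, 2s, 4s... capped at 30s
    const jitter = Math.random() * base;               // spread the reconnecting herd out
    setTimeout(() => connect(url, attempt + 1), base / 2 + jitter / 2);
  };

  return ws;
}

connect("wss://myapp.com/ws"); // hypothetical endpoint
```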

❌ Anti-Pattern 3: Blocking the Event Loop

What it looks like:
- socket.on('message', () => { heavyCalculation(); })

Why it fails:
- Node.js is single-threaded. One heavy task blocks all 10,000 connections.

Correct approach:
- Offload work to a Worker Thread or message queue (RabbitMQ/Bull), as in the sketch below.
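
A minimal sketch using Node's built-in worker_threads; the worker file name and heavyCalculation are assumptions, and a real system would reuse a worker pool rather than spawning one worker per message:

```javascript
// main.js (sketch): hand heavy work to a worker so the event loop stays responsive
const { Worker } = require("node:worker_threads");

io.on("connection", (socket) => {
  socket.on("message", (data) => {
    const worker = new Worker("./heavy-worker.js", { workerData: data });
    worker.once("message", (result) => socket.emit("result", result));
    worker.once("error", (err) => socket.emit("job-error", err.message));
  });
});

// heavy-worker.js (sketch)
// const { parentPort, workerData } = require("node:worker_threads");
// parentPort.postMessage(heavyCalculation(workerData));
```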

---

7. Quality Checklist

Scalability:
- [ ] Adapter: Redis/NATS adapter configured for multi-node.
- [ ] Load Balancer: Sticky sessions enabled (if using polling fallback).
- [ ] OS Limits: File descriptors limit increased.

Resilience:
- [ ] Reconnection: Exponential backoff + Jitter implemented.
- [ ] Heartbeat: Ping/Pong interval configured (< LB timeout).
- [ ] Fallback: Socket.IO fallbacks (HTTP Long Polling) enabled/tested.

Security:
- [ ] WSS: TLS enabled (Secure WebSockets).
- [ ] Auth: Handshake validates credentials properly.
- [ ] Rate Limit: Connection rate limiting active.

Anti-Patterns

Connection Management Anti-Patterns

  • No Heartbeats: Not detecting dead connections - implement ping/pong
  • Memory Leaks: Not cleaning up closed connections - implement proper cleanup
  • Infinite Reconnects: Reconnect loops without backoff - implement exponential backoff
  • Sticky-Session Dependence: Relying on per-server in-memory state - keep shared state in Redis so any node can serve any client

Scaling Anti-Patterns

  • Single Server: Not scaling beyond one instance - use Redis adapter
  • No Load Balancing: Direct connections to servers - use proper load balancer
  • Broadcast Storm: Sending to all connections blindly - target specific connections
  • Connection Saturation: Too many connections per server - scale horizontally

Performance Anti-Patterns

  • Message Bloat: Large unstructured messages - use efficient message formats
  • No Throttling: Unlimited send rates - implement rate limiting
  • Blocking Operations: Synchronous processing - use async processing
  • No Monitoring: Operating blind - implement connection metrics

Security Anti-Patterns

  • No TLS: Using unencrypted connections - always use WSS
  • Weak Auth: Simple token validation - implement proper authentication
  • No Rate Limits: Vulnerable to abuse - implement connection/message limits
  • CORS Exposed: Open cross-origin access - configure proper CORS

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.