404kidwiz

kafka-engineer

# Install this skill

Install this specific skill from the multi-skill repository:

```bash
npx skills add 404kidwiz/claude-supercode-skills --skill "kafka-engineer"
```

# Description

Expert in Apache Kafka, Event Streaming, and Real-time Data Pipelines. Specializes in Kafka Connect, KSQL, and Schema Registry.

# SKILL.md


---
name: kafka-engineer
description: Expert in Apache Kafka, Event Streaming, and Real-time Data Pipelines. Specializes in Kafka Connect, KSQL, and Schema Registry.
---

Kafka Engineer

Purpose

Provides Apache Kafka and event streaming expertise specializing in scalable event-driven architectures and real-time data pipelines. Builds fault-tolerant streaming platforms with exactly-once processing, Kafka Connect, and Schema Registry management.

When to Use

  • Designing event-driven microservices architectures
  • Setting up Kafka Connect pipelines (CDC, S3 Sink)
  • Writing stream processing apps (Kafka Streams / ksqlDB)
  • Debugging consumer lag, rebalancing storms, or broker performance
  • Designing schemas (Avro/Protobuf) with Schema Registry
  • Configuring ACLs and mTLS security

---

2. Decision Framework

Architecture Selection

What is the use case?
│
├─ **Data Integration (ETL)**
│  ├─ DB to DB/Data Lake? → **Kafka Connect** (Zero code)
│  └─ Complex transformations? → **Kafka Streams**
│
├─ **Real-Time Analytics**
│  ├─ SQL-like queries? → **ksqlDB** (Quick aggregation)
│  └─ Complex stateful logic? → **Kafka Streams / Flink**
│
└─ **Microservices Communication**
   ├─ Event Notification? → **Standard Producer/Consumer**
   └─ Event Sourcing? → **State Stores (RocksDB)**

Config Tuning (The "Big 3")

  1. Throughput: batch.size, linger.ms, compression.type=lz4.
  2. Latency: linger.ms=0, acks=1.
  3. Durability: acks=all, min.insync.replicas=2, replication.factor=3.
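
As a concrete illustration, here is a minimal sketch of how these three profiles map onto Java producer properties. The broker address is a placeholder, and the exact values (batch size, linger time) are assumptions to be tuned against your own benchmarks:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerTuning {
    // Placeholder broker address; adjust for your environment.
    static final String BOOTSTRAP = "broker:9092";

    static Properties base() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP);
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return p;
    }

    // 1. Throughput: batch aggressively and compress.
    static Properties throughputProfile() {
        Properties p = base();
        p.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);   // batch.size
        p.put(ProducerConfig.LINGER_MS_CONFIG, 20);           // linger.ms
        p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // compression.type
        return p;
    }

    // 2. Latency: send immediately, wait for the leader only.
    static Properties latencyProfile() {
        Properties p = base();
        p.put(ProducerConfig.LINGER_MS_CONFIG, 0);
        p.put(ProducerConfig.ACKS_CONFIG, "1");
        return p;
    }

    // 3. Durability: every write must be acknowledged by all in-sync replicas.
    //    (min.insync.replicas=2 and replication.factor=3 are topic/broker settings.)
    static Properties durabilityProfile() {
        Properties p = base();
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        return p;
    }
}
```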

Red Flags → Escalate to sre-engineer:
- "Unclean leader election" enabled (Data loss risk)
- Zookeeper dependency in new clusters (Use KRaft mode)
- Disk usage > 80% on brokers
- Consumer lag constantly increasing (Capacity mismatch)

---

3. Core Workflows

Workflow 1: Kafka Connect (CDC)

Goal: Stream changes from PostgreSQL to S3.

Steps:

  1. Source Config (postgres-source.json)

     ```json
     {
       "name": "postgres-source",
       "config": {
         "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
         "database.hostname": "db-host",
         "database.dbname": "mydb",
         "database.user": "kafka",
         "plugin.name": "pgoutput"
       }
     }
     ```

  2. Sink Config (s3-sink.json)

     ```json
     {
       "name": "s3-sink",
       "config": {
         "connector.class": "io.confluent.connect.s3.S3SinkConnector",
         "s3.bucket.name": "my-datalake",
         "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
         "flush.size": "1000"
       }
     }
     ```

  3. Deploy

     ```bash
     curl -X POST -H "Content-Type: application/json" \
          -d @postgres-source.json http://connect:8083/connectors
     ```
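
The sink connector is registered the same way. As a sanity check, the standard Connect REST API status endpoint should report both connectors as RUNNING (hostnames follow the examples above):

```bash
curl -X POST -H "Content-Type: application/json" \
     -d @s3-sink.json http://connect:8083/connectors

# Verify both connectors are running
curl http://connect:8083/connectors/postgres-source/status
curl http://connect:8083/connectors/s3-sink/status
```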

---

Workflow 3: Schema Registry Integration

Goal: Enforce schema compatibility.

Steps:

  1. Define Schema (user.avsc)

     ```json
     {
       "type": "record",
       "name": "User",
       "fields": [
         {"name": "id", "type": "int"},
         {"name": "name", "type": "string"}
       ]
     }
     ```

  2. Producer (Java)

    • Use KafkaAvroSerializer.
    • Registry URL: http://schema-registry:8081.
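
A minimal sketch of such a producer, assuming the user.avsc schema above; the broker address, registry URL, and the users topic are illustrative placeholders:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // KafkaAvroSerializer registers/validates the schema against Schema Registry.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");

        // Same schema as user.avsc above.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1);
        user.put("name", "Alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "1", user));
        }
    }
}
```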

---

5. Anti-Patterns & Gotchas

❌ Anti-Pattern 1: Large Messages

What it looks like:
- Sending a 10 MB image payload as a Kafka message.

Why it fails:
- Kafka is optimized for small messages (the default message.max.bytes is about 1 MB). Large messages put pressure on broker memory and network threads and slow down replication.

Correct approach (the claim-check pattern):
- Store the image in S3 (or another object store).
- Send only a reference URL in the Kafka message.
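
A minimal sketch of this claim-check pattern, assuming the image has already been uploaded to object storage; the topic, bucket, and field names are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ImageEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The 10 MB image was uploaded to object storage elsewhere;
        // the Kafka event only carries its location plus minimal metadata.
        String imageId = "img-42"; // illustrative
        String event = "{\"imageId\":\"" + imageId + "\","
                     + "\"url\":\"s3://my-bucket/images/" + imageId + ".jpg\","
                     + "\"sizeBytes\":10485760}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("image-events", imageId, event));
        }
    }
}
```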

❌ Anti-Pattern 2: Too Many Partitions

What it looks like:
- Creating 10,000 partitions on a small cluster.

Why it fails:
- Slow leader election (Zookeeper overhead).
- High file handle usage.

Correct approach:
- Limit partitions per broker (~4000). Use fewer topics or larger clusters.

❌ Anti-Pattern 3: Blocking Consumer

What it looks like:
- A consumer making a slow HTTP call (~30 s) for every message inside the poll loop.

Why it fails:
- The consumer exceeds max.poll.interval.ms, is kicked out of the group, and triggers a rebalance storm.

Correct approach:
- Async processing: move the work to a thread pool.
- Pause/Resume: call consumer.pause() while the worker backlog is full, consumer.resume() once it drains.
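
A minimal sketch of the pause/resume approach, assuming a worker pool for the slow HTTP calls; topic, group, and backlog thresholds are illustrative, and offset handling is simplified:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NonBlockingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "slow-http-workers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // 8 worker threads; the backlog sits in an in-memory queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                8, 8, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            boolean paused = false;
            while (true) {
                // poll() is always called promptly, so the consumer never
                // exceeds max.poll.interval.ms and stays in the group.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
                for (ConsumerRecord<String, String> rec : records) {
                    pool.submit(() -> callSlowHttpEndpoint(rec.value())); // ~30 s of work
                }
                int backlog = pool.getQueue().size();
                if (!paused && backlog > 1_000) {
                    consumer.pause(consumer.assignment());   // stop fetching, keep heartbeating
                    paused = true;
                } else if (paused && backlog < 100) {
                    consumer.resume(consumer.assignment());
                    paused = false;
                }
            }
        }
    }

    // Placeholder for the slow downstream call (illustrative).
    static void callSlowHttpEndpoint(String payload) { }
}
```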

---

7. Quality Checklist

Configuration:
- [ ] Replication: Factor 3 for production.
- [ ] Min.ISR: 2 (Prevents data loss).
- [ ] Retention: Configured correctly (Time vs Size).

Observability:
- [ ] Lag: Consumer Lag monitored (Burrow/Prometheus).
- [ ] Under-replicated: Alert on under-replicated partitions (>0).
- [ ] JMX: Metrics exported.

Examples

Example 1: Real-Time Fraud Detection Pipeline

Scenario: A financial services company needs real-time fraud detection using Kafka streaming.

Architecture Implementation:
1. Event Ingestion: Kafka Connect CDC from PostgreSQL transaction database
2. Stream Processing: Kafka Streams application for real-time pattern detection
3. Alert System: Producer to alert topic triggering notifications
4. Storage: S3 sink for historical analysis and compliance

Pipeline Configuration:
| Component | Configuration | Purpose |
|-----------|---------------|---------|
| Topics | 3 (transactions, alerts, enriched) | Data organization |
| Partitions | 12 (3 brokers × 4) | Parallelism |
| Replication | 3 | High availability |
| Compression | LZ4 | Throughput optimization |

Key Logic:
- Detects velocity patterns (5+ transactions in 1 minute)
- Identifies geographic anomalies (impossible travel)
- Flags high-risk merchant categories
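
As an illustration of the first rule only, here is a hedged Kafka Streams sketch that counts transactions per card in a one-minute tumbling window and emits an alert at five or more. Topic names, serdes, and the threshold are assumptions, and the geographic and merchant checks are omitted:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;

public class VelocityFraudDetector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-velocity");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Key = card id, value = raw transaction JSON (illustrative).
        KStream<String, String> transactions =
            builder.stream("transactions", Consumed.with(Serdes.String(), Serdes.String()));

        transactions
            .groupByKey()
            // Tumbling 1-minute window per card.
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
            .count()
            .toStream()
            // Velocity rule: 5 or more transactions in the window.
            .filter((windowedCardId, count) -> count >= 5)
            .map((windowedCardId, count) ->
                KeyValue.pair(windowedCardId.key(),
                    "{\"rule\":\"velocity\",\"count\":" + count + "}"))
            .to("alerts", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```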

Results:
- 99.7% of fraud detected in under 100ms
- False positive rate reduced from 5% to 0.3%
- Compliance audit passed with zero findings

Example 2: E-Commerce Order Processing System

Scenario: Build a resilient order processing system with Kafka for high reliability.

System Design:
1. Order Events: Topic for order lifecycle events
2. Inventory Service: Consumes orders, updates stock
3. Payment Service: Processes payments, publishes results
4. Notification Service: Sends confirmations via email/SMS

Resilience Patterns:
- Dead Letter Queue for failed processing
- Idempotent producers for exactly-once semantics
- Consumer groups with manual offset management
- Retries with exponential backoff

Configuration:

```yaml
# Producer Configuration
acks: all
retries: 3
enable.idempotence: true

# Consumer Configuration
auto.offset.reset: earliest
enable.auto.commit: false
max.poll.records: 500
```
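
A minimal consumer sketch tying these settings to the resilience patterns above: offsets are committed manually only after the batch is processed, and poison messages go to a dead-letter topic. The orders and orders.dlq topic names are illustrative, and retry/backoff logic is omitted for brevity:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties cprops = new Properties();
        cprops.put("bootstrap.servers", "broker:9092");
        cprops.put("group.id", "order-processor");
        cprops.put("enable.auto.commit", "false");   // commit only after success
        cprops.put("auto.offset.reset", "earliest");
        cprops.put("max.poll.records", "500");
        cprops.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cprops.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pprops = new Properties();
        pprops.put("bootstrap.servers", "broker:9092");
        pprops.put("acks", "all");
        pprops.put("enable.idempotence", "true");
        pprops.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pprops.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops);
             KafkaProducer<String, String> dlqProducer = new KafkaProducer<>(pprops)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records) {
                    try {
                        processOrder(rec.value());   // business logic placeholder
                    } catch (Exception e) {
                        // Poison message: park it on the dead-letter topic and move on.
                        dlqProducer.send(new ProducerRecord<>("orders.dlq", rec.key(), rec.value()));
                    }
                }
                consumer.commitSync();               // commit the whole polled batch
            }
        }
    }

    static void processOrder(String payload) { /* illustrative placeholder */ }
}
```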

Results:
- 99.99% message delivery reliability
- Zero duplicate orders in 6 months
- Peak processing: 10,000 orders/second

Example 3: IoT Telemetry Platform

Scenario: Process millions of IoT device telemetry messages with Kafka.

Platform Architecture:
1. Device Gateway: MQTT to Kafka proxy
2. Data Enrichment: Stream processing adds device metadata
3. Time-Series Storage: S3 sink partitioned by device_id/date
4. Real-Time Alerts: Threshold-based alerting for anomalies

Scalability Configuration:
- 50 partitions for parallel processing
- Compression enabled for cost optimization
- Retention: 7 days hot, 1 year cold in S3
- Schema Registry for data contracts

Performance Metrics:
| Metric | Value |
|--------|-------|
| Throughput | 500,000 messages/sec |
| Latency (P99) | 50ms |
| Consumer lag | < 1 second |
| Storage efficiency | 60% reduction with compression |

Best Practices

Topic Design

  • Naming Conventions: Use clear, hierarchical topic names (domain.entity.event)
  • Partition Strategy: Plan for future growth (3x expected throughput)
  • Retention Policies: Match retention to business requirements
  • Cleanup Policies: Use delete for time-based, compact for state
  • Schema Management: Enforce schemas via Schema Registry
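
A hedged AdminClient sketch showing these conventions applied when creating a topic: a hierarchical name, partitions sized with headroom, replication factor 3, and explicit retention/cleanup settings. The topic name and numbers are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateOrderTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // domain.entity.event naming; 12 partitions leaves room for ~3x growth.
            NewTopic topic = new NewTopic("commerce.orders.created", 12, (short) 3)
                .configs(Map.of(
                    TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000), // 7 days
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE,      // time-based
                    TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```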

Producer Optimization

  • Batching: Increase batch.size and linger.ms for throughput
  • Compression: Use LZ4 for balance of speed and size
  • Acks Configuration: Use all for reliability, 1 for latency
  • Retry Strategy: Implement retries with backoff
  • Idempotence: Enable for exactly-once semantics in critical paths

Consumer Best Practices

  • Offset Management: Use manual commit for critical processing
  • Batch Processing: Increase max.poll.records for efficiency
  • Rebalance Handling: Implement graceful shutdown
  • Error Handling: Dead letter queues for poison messages
  • Monitoring: Track consumer lag and processing time
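
A hedged sketch of graceful shutdown and rebalance handling: a shutdown hook calls wakeup() to break out of poll(), and a ConsumerRebalanceListener commits processed offsets before partitions are revoked. Topic and group names are illustrative:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.WakeupException;

public class GracefulConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "payments");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        Thread mainThread = Thread.currentThread();

        // Shutdown hook: wakeup() makes the blocked poll() throw WakeupException.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            consumer.wakeup();
            try { mainThread.join(); } catch (InterruptedException ignored) { }
        }));

        try {
            consumer.subscribe(List.of("payments"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Commit whatever has been processed before losing the partitions.
                    consumer.commitSync();
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) { }
            });
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(rec -> { /* process rec.value() */ });
                consumer.commitSync();
            }
        } catch (WakeupException e) {
            // Expected on shutdown.
        } finally {
            consumer.close();   // leaves the group cleanly, triggering a prompt rebalance
        }
    }
}
```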

Security Configuration

  • Encryption: TLS for all client-broker communication
  • Authentication: SASL/SCRAM or mTLS for production
  • Authorization: ACLs with least privilege principle
  • Quotas: Implement client quotas to prevent abuse
  • Audit Logging: Log all access and configuration changes
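
A minimal sketch of the client-side settings for TLS plus SASL/SCRAM; hostnames, truststore path, and credentials are placeholders, and the same keys apply to producers, consumers, and Streams applications:

```java
import java.util.Properties;

public class SecureClientConfig {
    static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9093");   // TLS listener (placeholder)

        // Encryption in transit and broker certificate verification.
        props.put("security.protocol", "SASL_SSL");
        props.put("ssl.truststore.location", "/etc/kafka/secrets/truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "changeit");                          // placeholder

        // Client authentication via SASL/SCRAM.
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"orders-service\" password=\"<secret>\";");                // placeholders
        return props;
    }
}
```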

Performance Tuning

  • Broker Configuration: Optimize for workload type (throughput vs latency)
  • JVM Tuning: Heap size and garbage collector selection
  • OS Tuning: File descriptor limits, network settings
  • Monitoring: Metrics for throughput, latency, and errors
  • Capacity Planning: Regular review and scaling assessment

Security:
- [ ] Encryption: TLS enabled for Client-Broker and Inter-broker.
- [ ] Auth: SASL/SCRAM or mTLS enabled.
- [ ] ACLs: Principle of least privilege (Topic read/write).

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.