rmyndharis

machine-learning-ops-ml-pipeline

```shell
# Install this skill:
npx skills add rmyndharis/antigravity-skills --skill "machine-learning-ops-ml-pipeline"
```

Install specific skill from multi-skill repository

# Description

Design and implement a complete ML pipeline for: $ARGUMENTS

# SKILL.md


---
name: machine-learning-ops-ml-pipeline
description: "Design and implement a complete ML pipeline for: $ARGUMENTS"
---

Machine Learning Pipeline - Multi-Agent MLOps Orchestration

Design and implement a complete ML pipeline for: $ARGUMENTS

Use this skill when

  • Designing or implementing an end-to-end ML pipeline (data ingestion, training, serving, monitoring)
  • Needing MLOps best practices, checklists, or multi-agent orchestration guidance for such a pipeline

Do not use this skill when

  • The task is unrelated to ML pipelines or MLOps orchestration
  • You need a different domain or tool outside this scope

Instructions

  • Clarify goals, constraints, and required inputs.
  • Apply relevant best practices and validate outcomes.
  • Provide actionable steps and verification.
  • If detailed examples are required, open resources/implementation-playbook.md.

Thinking

This workflow orchestrates multiple specialized agents to build a production-ready ML pipeline following modern MLOps best practices. The approach emphasizes:

  • Phase-based coordination: Each phase builds upon previous outputs, with clear handoffs between agents
  • Modern tooling integration: MLflow/W&B for experiments, Feast/Tecton for features, KServe/Seldon for serving
  • Production-first mindset: Every component designed for scale, monitoring, and reliability
  • Reproducibility: Version control for data, models, and infrastructure
  • Continuous improvement: Automated retraining, A/B testing, and drift detection

The multi-agent approach ensures each aspect is handled by domain experts:
- Data engineers handle ingestion and quality
- Data scientists design features and experiments
- ML engineers implement training pipelines
- MLOps engineers handle production deployment
- Observability engineers ensure monitoring

Phase 1: Data & Requirements Analysis


subagent_type: data-engineer
prompt: |
Analyze and design data pipeline for ML system with requirements: $ARGUMENTS

Deliverables:
1. Data source audit and ingestion strategy:
   - Source systems and connection patterns
   - Schema validation using Pydantic/Great Expectations
   - Data versioning with DVC or lakeFS
   - Incremental loading and CDC strategies
2. Data quality framework:
   - Profiling and statistics generation
   - Anomaly detection rules
   - Data lineage tracking
   - Quality gates and SLAs
3. Storage architecture:
   - Raw/processed/feature layers
   - Partitioning strategy
   - Retention policies
   - Cost optimization

Provide implementation code for critical components and integration patterns.
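The quality-gate idea above can be sketched in a few lines. This is a minimal pure-Python illustration, not the production approach the prompt asks for (which would use Pydantic models or Great Expectations suites); the schema format and 0.1% threshold are illustrative, the latter borrowed from the success criteria below.

```python
# Minimal sketch of a schema check feeding a quality gate.
# Real pipelines would use Pydantic / Great Expectations instead.

def validate_rows(rows, schema, max_error_rate=0.001):
    """Check each row against a {column: type} schema; the gate fails
    when the per-field error rate exceeds the SLA (0.1% here)."""
    errors = []
    for i, row in enumerate(rows):
        for col, expected_type in schema.items():
            if col not in row:
                errors.append((i, col, "missing"))
            elif not isinstance(row[col], expected_type):
                errors.append((i, col, f"expected {expected_type.__name__}"))
    error_rate = len(errors) / max(len(rows), 1)
    return {"passed": error_rate <= max_error_rate,
            "error_rate": error_rate,
            "errors": errors}

schema = {"user_id": int, "amount": float}
rows = [{"user_id": 1, "amount": 9.99},
        {"user_id": "2", "amount": 5.0}]  # user_id has the wrong type
report = validate_rows(rows, schema)
```

A failing report like this one would block promotion of the batch into the processed layer.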


subagent_type: data-scientist
prompt: |
Design feature engineering and model requirements for: $ARGUMENTS
Using data architecture from: {phase1.data-engineer.output}

Deliverables:
1. Feature engineering pipeline:
   - Transformation specifications
   - Feature store schema (Feast/Tecton)
   - Statistical validation rules
   - Handling strategies for missing data/outliers
2. Model requirements:
   - Algorithm selection rationale
   - Performance metrics and baselines
   - Training data requirements
   - Evaluation criteria and thresholds
3. Experiment design:
   - Hypothesis and success metrics
   - A/B testing methodology
   - Sample size calculations
   - Bias detection approach

Include feature transformation code and statistical validation logic.
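As one concrete instance of the missing-data/outlier handling item, here is a hedged pure-Python sketch of median imputation plus quantile clipping; a real feature pipeline would register an equivalent transform in the feature store (e.g. Feast) rather than inline it.

```python
# Sketch: median-impute None values, then clip to empirical quantile bounds.
import statistics

def impute_and_clip(values, lower_q=0.01, upper_q=0.99):
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    # Fill gaps with the median of observed values.
    filled = [v if v is not None else median for v in values]
    # Clip to the 1st/99th-percentile bounds of the observed values.
    ranked = sorted(present)
    lo = ranked[int(lower_q * (len(ranked) - 1))]
    hi = ranked[int(upper_q * (len(ranked) - 1))]
    return [min(max(v, lo), hi) for v in filled]

raw = [1.0, None, 3.0, 100.0]   # one missing value, one outlier
features = impute_and_clip(raw)
```

The quantile bounds themselves should be computed on training data and versioned with the feature definition, so serving-time transforms stay consistent with training.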

Phase 2: Model Development & Training


subagent_type: ml-engineer
prompt: |
Implement training pipeline based on requirements: {phase1.data-scientist.output}
Using data pipeline: {phase1.data-engineer.output}

Build comprehensive training system:
1. Training pipeline implementation:
   - Modular training code with clear interfaces
   - Hyperparameter optimization (Optuna/Ray Tune)
   - Distributed training support (Horovod/PyTorch DDP)
   - Cross-validation and ensemble strategies
2. Experiment tracking setup:
   - MLflow/Weights & Biases integration
   - Metric logging and visualization
   - Artifact management (models, plots, data samples)
   - Experiment comparison and analysis tools
3. Model registry integration:
   - Version control and tagging strategy
   - Model metadata and lineage
   - Promotion workflows (dev -> staging -> prod)
   - Rollback procedures

Provide complete training code with configuration management.
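The hyperparameter-optimization step follows an objective/trial/best-so-far loop that tools like Optuna and Ray Tune automate. The pure-Python random search below only illustrates that pattern; the objective function is a stand-in for a real validation score, and its peak location is invented for the example.

```python
# Illustrative random search; Optuna/Ray Tune add pruning, samplers,
# and persistence on top of this same objective/trial loop.
import random

def objective(lr, depth):
    # Stand-in for a real cross-validation score; peaks at lr=0.1, depth=6.
    return -((lr - 0.1) ** 2) - ((depth - 6) ** 2) * 0.01

def random_search(n_trials, seed=0):
    rng = random.Random(seed)  # seeded for reproducible runs
    best = {"score": float("-inf"), "params": None}
    for _ in range(n_trials):
        params = {"lr": rng.uniform(0.001, 0.3), "depth": rng.randint(2, 10)}
        score = objective(**params)
        if score > best["score"]:
            best = {"score": score, "params": params}
    return best

best = random_search(n_trials=50)
```

In the real pipeline, each trial's params and score would also be logged to the experiment tracker so runs are comparable and reproducible.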


subagent_type: python-pro
prompt: |
Optimize and productionize ML code from: {phase2.ml-engineer.output}

Focus areas:
1. Code quality and structure:
   - Refactor for production standards
   - Add comprehensive error handling
   - Implement proper logging with structured formats
   - Create reusable components and utilities
2. Performance optimization:
   - Profile and optimize bottlenecks
   - Implement caching strategies
   - Optimize data loading and preprocessing
   - Memory management for large-scale training
3. Testing framework:
   - Unit tests for data transformations
   - Integration tests for pipeline components
   - Model quality tests (invariance, directional)
   - Performance regression tests

Deliver production-ready, maintainable code with full test coverage.
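The "model quality tests" item covers checks beyond accuracy: invariance tests (the score must not change when an irrelevant field changes) and directional expectation tests (the score must move the right way when a meaningful field changes). A hedged sketch against a stub scoring function, with invented field names, looks like this; a real suite would load the registered model and run these in CI.

```python
# Invariance and directional-expectation checks against a stub model.

def score(features):
    # Stub model: risk rises with amount, ignores the customer name.
    return min(1.0, features["amount"] / 1000.0)

def test_invariance():
    a = score({"amount": 500.0, "name": "alice"})
    b = score({"amount": 500.0, "name": "bob"})
    assert a == b, "score must not depend on an irrelevant field"

def test_directional():
    low = score({"amount": 100.0, "name": "alice"})
    high = score({"amount": 900.0, "name": "alice"})
    assert high > low, "higher amount should raise the risk score"

test_invariance()
test_directional()
```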

Phase 3: Production Deployment & Serving


subagent_type: mlops-engineer
prompt: |
Design production deployment for models from: {phase2.ml-engineer.output}
With optimized code from: {phase2.python-pro.output}

Implementation requirements:
1. Model serving infrastructure:
   - REST/gRPC APIs with FastAPI/TorchServe
   - Batch prediction pipelines (Airflow/Kubeflow)
   - Stream processing (Kafka/Kinesis integration)
   - Model serving platforms (KServe/Seldon Core)
2. Deployment strategies:
   - Blue-green deployments for zero downtime
   - Canary releases with traffic splitting
   - Shadow deployments for validation
   - A/B testing infrastructure
3. CI/CD pipeline:
   - GitHub Actions/GitLab CI workflows
   - Automated testing gates
   - Model validation before deployment
   - ArgoCD for GitOps deployment
4. Infrastructure as Code:
   - Terraform modules for cloud resources
   - Helm charts for Kubernetes deployments
   - Docker multi-stage builds for optimization
   - Secret management with Vault/Secrets Manager

Provide complete deployment configuration and automation scripts.
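The canary-release item relies on stable traffic splitting: a given user should consistently hit the same model variant so their experience (and the A/B statistics) stay clean. In production this usually lives in the service mesh (e.g. Istio weighted routing), but the core idea is a deterministic hash bucket, sketched here in pure Python:

```python
# Deterministic hash-based traffic splitting for a canary release.
import hashlib

def route(user_id: str, canary_fraction: float = 0.05) -> str:
    """Map a user to 'canary' or 'stable'; the same user always
    lands in the same bucket because the hash is deterministic."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] / 255.0  # stable value in [0, 1]
    return "canary" if bucket < canary_fraction else "stable"
```

Ramping the canary is then just raising `canary_fraction` (5% -> 25% -> 100%) while monitoring watches for degradation, and rollback is setting it to 0.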


subagent_type: kubernetes-architect
prompt: |
Design Kubernetes infrastructure for ML workloads from: {phase3.mlops-engineer.output}

Kubernetes-specific requirements:
1. Workload orchestration:
   - Training job scheduling with Kubeflow
   - GPU resource allocation and sharing
   - Spot/preemptible instance integration
   - Priority classes and resource quotas
2. Serving infrastructure:
   - HPA/VPA for autoscaling
   - KEDA for event-driven scaling
   - Istio service mesh for traffic management
   - Model caching and warm-up strategies
3. Storage and data access:
   - PVC strategies for training data
   - Model artifact storage with CSI drivers
   - Distributed storage for feature stores
   - Cache layers for inference optimization

Provide Kubernetes manifests and Helm charts for entire ML platform.
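For the GPU scheduling and priority-class items above, a training workload typically ships as a Job requesting an extended GPU resource. The manifest below is an illustrative sketch only: the namespace, image, and `ml-batch-low` PriorityClass are assumed names to adapt, while `nvidia.com/gpu` is the conventional resource exposed by the NVIDIA device plugin.

```yaml
# Sketch: one-GPU training Job under a priority class (names are assumptions).
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
  namespace: ml-training
spec:
  backoffLimit: 2                      # retry budget for spot preemptions
  template:
    spec:
      priorityClassName: ml-batch-low  # assumed PriorityClass for batch work
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/ml/trainer:latest
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"      # GPU requests must equal limits
```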

Phase 4: Monitoring & Continuous Improvement


subagent_type: observability-engineer
prompt: |
Implement comprehensive monitoring for ML system deployed in: {phase3.mlops-engineer.output}
Using Kubernetes infrastructure: {phase3.kubernetes-architect.output}

Monitoring framework:
1. Model performance monitoring:
   - Prediction accuracy tracking
   - Latency and throughput metrics
   - Feature importance shifts
   - Business KPI correlation
2. Data and model drift detection:
   - Statistical drift detection (KS test, PSI)
   - Concept drift monitoring
   - Feature distribution tracking
   - Automated drift alerts and reports
3. System observability:
   - Prometheus metrics for all components
   - Grafana dashboards for visualization
   - Distributed tracing with Jaeger/Zipkin
   - Log aggregation with ELK/Loki
4. Alerting and automation:
   - PagerDuty/Opsgenie integration
   - Automated retraining triggers
   - Performance degradation workflows
   - Incident response runbooks
5. Cost tracking:
   - Resource utilization metrics
   - Cost allocation by model/experiment
   - Optimization recommendations
   - Budget alerts and controls

Deliver monitoring configuration, dashboards, and alert rules.
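The PSI (Population Stability Index) check mentioned above compares a feature's binned distribution at training time against production. A common rule of thumb reads PSI < 0.1 as stable, 0.1–0.25 as moderate shift, and > 0.25 as drift worth an alert or retraining trigger; the bin fractions below are made-up example data.

```python
# Pure-Python PSI over pre-binned distributions.
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Sum of (actual - expected) * ln(actual / expected) over shared bins;
    eps guards against empty bins blowing up the log."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time bin fractions
current  = [0.40, 0.30, 0.20, 0.10]   # production bin fractions
drift = psi(baseline, current)         # moderate shift on this example
```

A scheduled job would compute this per feature and per prediction window, emit it as a Prometheus metric, and alert (or trigger retraining) when it crosses the chosen threshold.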

Configuration Options

  • experiment_tracking: mlflow | wandb | neptune | clearml
  • feature_store: feast | tecton | databricks | custom
  • serving_platform: kserve | seldon | torchserve | triton
  • orchestration: kubeflow | airflow | prefect | dagster
  • cloud_provider: aws | azure | gcp | multi-cloud
  • deployment_mode: realtime | batch | streaming | hybrid
  • monitoring_stack: prometheus | datadog | newrelic | custom

Success Criteria

  1. Data Pipeline Success:
     • < 0.1% data quality issues in production
     • Automated data validation passing 99.9% of the time
     • Complete data lineage tracking
     • Sub-second feature serving latency
  2. Model Performance:
     • Meeting or exceeding baseline metrics
     • < 5% performance degradation before retraining
     • Successful A/B tests with statistical significance
     • No undetected model drift for > 24 hours
  3. Operational Excellence:
     • 99.9% uptime for model serving
     • < 200ms p99 inference latency
     • Automated rollback within 5 minutes
     • Complete observability with < 1 minute alert time
  4. Development Velocity:
     • < 1 hour from commit to production
     • Parallel experiment execution
     • Reproducible training runs
     • Self-service model deployment
  5. Cost Efficiency:
     • < 20% infrastructure waste
     • Optimized resource allocation
     • Automatic scaling based on load
     • Spot instance utilization > 60%

Final Deliverables

Upon completion, the orchestrated pipeline will provide:
- End-to-end ML pipeline with full automation
- Comprehensive documentation and runbooks
- Production-ready infrastructure as code
- Complete monitoring and alerting system
- CI/CD pipelines for continuous improvement
- Cost optimization and scaling strategies
- Disaster recovery and rollback procedures

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.