williamzujkowski

MLOps Lifecycle Manager

# Install this skill:
npx skills add williamzujkowski/cognitive-toolworks --skill "MLOps Lifecycle Manager"

Install specific skill from multi-skill repository

# Description

Manage ML model lifecycle from development to deployment with experiment tracking, versioning, monitoring, and automated retraining workflows.

# SKILL.md


---
name: MLOps Lifecycle Manager
slug: mlops-lifecycle-manager
description: Manage ML model lifecycle from development to deployment with experiment tracking, versioning, monitoring, and automated retraining workflows.
capabilities:
- Experiment tracking setup and configuration
- Model versioning and registry management
- Deployment pipeline design (batch/online/edge)
- Model monitoring and drift detection
- Automated retraining triggers and workflows
- Model governance and lineage tracking
- Feature store integration
- A/B testing and canary deployments
inputs:
- model_type: classification, regression, clustering, generative, etc.
- deployment_target: batch, online, edge, embedded
- scale_requirements: requests_per_second, latency_sla, data_volume
- governance_level: experimental, staging, production
- existing_stack: mlflow, kubeflow, vertex_ai, sagemaker, etc.
outputs:
- experiment_config: MLflow/W&B tracking configuration
- model_registry_schema: versioning and metadata structure
- deployment_manifest: K8s/cloud-specific deployment config
- monitoring_dashboard: metrics, alerts, drift detection queries
- retraining_pipeline: automated workflow definition
keywords:
- mlops
- machine-learning
- model-deployment
- experiment-tracking
- model-monitoring
- feature-engineering
- model-versioning
- ml-pipeline
version: 1.0.0
owner: cognitive-toolworks
license: CC0-1.0
security: public; no secrets or PII
links:
- https://mlflow.org/docs/latest/
- https://www.kubeflow.org/docs/
- https://ml-ops.org/
- https://cloud.google.com/vertex-ai/docs
- https://neptune.ai/blog/mlops
- https://docs.seldon.io/
---


Purpose & When-To-Use

Trigger conditions:

  • Transitioning ML model from notebook/experiment to production deployment
  • Model performance degradation detected in production
  • Need to establish experiment tracking for ML team collaboration
  • Designing automated ML pipeline from training to deployment
  • Implementing model governance and compliance requirements
  • Setting up model monitoring and observability
  • Managing multiple model versions across environments

Use this skill when:

  • Building end-to-end MLOps infrastructure from scratch
  • Migrating ad-hoc ML workflows to production-grade systems
  • Implementing MLOps best practices (accessed 2025-10-25T21:30:36-04:00): https://ml-ops.org/content/three-levels-of-ml-software
  • Establishing model lifecycle governance for regulated industries
  • Optimizing model deployment patterns for scale and latency
  • Integrating ML workflows with existing DevOps pipelines

Pre-Checks

Time normalization:

NOW_ET = 2025-10-25T21:30:36-04:00

Input validation:

  • model_type must match supported categories: classification, regression, clustering, ranking, generative, time-series, recommendation, computer-vision, nlp
  • deployment_target must be: batch, online, edge, embedded, or hybrid
  • scale_requirements must include at least one of: rps (requests per second), latency_ms, data_volume_gb, concurrent_users
  • governance_level must be: experimental, staging, production (determines compliance/audit requirements)
  • existing_stack is optional; if provided, must be recognized platform: mlflow, kubeflow, vertex_ai, sagemaker, azure_ml, databricks, neptune, wandb, custom
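
For illustration, the checks above can be collapsed into a small guard function; this is a hypothetical helper (the function and constant names are not part of the skill's interface), shown only to make the allowed values explicit.

```python
# Hypothetical pre-check helper; names and allowed values mirror the validation list above.
ALLOWED_MODEL_TYPES = {
    "classification", "regression", "clustering", "ranking", "generative",
    "time-series", "recommendation", "computer-vision", "nlp",
}
ALLOWED_TARGETS = {"batch", "online", "edge", "embedded", "hybrid"}
ALLOWED_GOVERNANCE = {"experimental", "staging", "production"}
SCALE_KEYS = {"rps", "latency_ms", "data_volume_gb", "concurrent_users"}


def validate_inputs(model_type: str, deployment_target: str,
                    scale_requirements: dict, governance_level: str) -> None:
    """Raise ValueError if any required input falls outside the supported values."""
    if model_type not in ALLOWED_MODEL_TYPES:
        raise ValueError(f"unsupported model_type: {model_type}")
    if deployment_target not in ALLOWED_TARGETS:
        raise ValueError(f"unsupported deployment_target: {deployment_target}")
    if governance_level not in ALLOWED_GOVERNANCE:
        raise ValueError(f"unsupported governance_level: {governance_level}")
    if not SCALE_KEYS & scale_requirements.keys():
        raise ValueError(f"scale_requirements needs at least one of {sorted(SCALE_KEYS)}")
```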

MLOps maturity check:

  • Level 0: Manual process (no automation) - recommend full T3 implementation
  • Level 1: ML pipeline automation - focus on monitoring and governance (T2)
  • Level 2: CI/CD for ML pipelines - focus on advanced patterns (T2-T3)
  • Reference: Google MLOps Maturity Model (accessed 2025-10-25T21:30:36-04:00): https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

Framework/platform compatibility:

  • Kubernetes available: enables Kubeflow, Seldon, KServe deployments
  • Cloud-managed: Vertex AI (GCP), SageMaker (AWS), Azure ML (Azure)
  • Hybrid/on-prem: MLflow + model servers (TorchServe, TensorFlow Serving, Triton)

Procedure

Tier 1: Quick MLOps Assessment and Recommendations (≤2k tokens)

Goal: Provide fast MLOps setup recommendations based on model type and deployment target

Steps:

  1. Analyze requirements
    • Identify the primary model type and deployment pattern
    • Assess scale requirements (batch vs. online, latency constraints)
    • Determine existing infrastructure capabilities
  2. Recommend core stack
    • Experiment tracking: MLflow (open-source, cloud-agnostic) or platform-native (Vertex AI Experiments, SageMaker Experiments)
    • Model registry: MLflow Model Registry or cloud-managed registry
    • Deployment:
      • Batch: Kubernetes CronJobs, Airflow/Prefect orchestration
      • Online: KServe/Seldon (K8s) or cloud endpoints (Vertex AI Prediction, SageMaker Inference)
      • Edge: TensorFlow Lite, ONNX Runtime, TorchScript
    • Monitoring: Prometheus + Grafana for metrics, Evidently AI or Alibi Detect for drift
  3. Generate quick-start checklist
    • Install and configure the experiment tracker (see the sketch after this list)
    • Configure the model registry with a versioning schema
    • Set up a basic deployment pipeline
    • Enable core metrics collection (latency, throughput, error rate)
  4. Output minimal configuration
    • MLflow tracking URI and artifact store location
    • Model registry schema (model name, version, stage: staging/production)
    • Deployment command template
    • Basic monitoring dashboard query examples
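
To make the experiment-tracker step concrete, a minimal MLflow logging sketch might look like the following; the tracking URI, experiment name, and logged values are placeholders rather than values this skill prescribes.

```python
# Minimal MLflow tracking sketch; assumes `mlflow` is installed (pip install mlflow)
# and that a tracking server is reachable at the placeholder URI below.
import mlflow

mlflow.set_tracking_uri("http://mlflow-server:5000")   # placeholder URI
mlflow.set_experiment("fraud-detection")               # created if it does not exist

with mlflow.start_run(run_name="baseline"):
    # Parameters, metrics, and tags mirror the logging checklist above.
    mlflow.log_params({"max_depth": 6, "learning_rate": 0.1})
    mlflow.log_metric("auc", 0.94)
    mlflow.set_tags({"team": "risk", "experiment_type": "baseline"})
```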

Abort conditions:
* Incompatible infrastructure (e.g., edge deployment requested but no ONNX/TFLite tooling)
* Conflicting requirements (e.g., <50ms latency with large ensemble model on CPU)

Tier 2: Production MLOps Pipeline Design (≤6k tokens)

Goal: Design production-ready MLOps pipeline with monitoring and governance

Steps:

  1. Experiment tracking setup
    • MLflow configuration (accessed 2025-10-25T21:30:36-04:00): https://mlflow.org/docs/latest/tracking.html
      • Backend store: PostgreSQL/MySQL for metadata
      • Artifact store: S3/GCS/Azure Blob for models/artifacts
      • Authentication: integrate with LDAP/OAuth
    • Logging best practices:
      • Parameters: hyperparameters, data versioning, feature engineering config
      • Metrics: training/validation loss, accuracy, precision, recall, F1, AUC
      • Artifacts: model binaries, confusion matrices, feature importance plots
      • Tags: team, project, experiment_type for filtering
  2. Model versioning and registry (see the registry sketch after this list)
    • Versioning schema:
      • Semantic versioning: major.minor.patch
      • Git commit SHA for code reproducibility
      • Data version hash for training data lineage
    • Model stages: None → Staging → Production → Archived
    • Metadata requirements:
      • Model type and framework (scikit-learn, TensorFlow, PyTorch, XGBoost)
      • Performance metrics on validation/test sets
      • Hardware requirements (CPU/GPU, memory)
      • Inference latency benchmarks
      • Training data statistics and schema
  3. Deployment pipeline design
    • Online serving (accessed 2025-10-25T21:30:36-04:00): https://www.kubeflow.org/docs/components/kserve/
      • KServe InferenceService for autoscaling and canary deployments
      • Model server selection: TorchServe (PyTorch), TensorFlow Serving, Triton (multi-framework)
      • API design: REST or gRPC with request/response schemas
      • Traffic splitting: 90/10 for canary, 50/50 for A/B testing
    • Batch inference:
      • Kubernetes Jobs or cloud batch services
      • Input data partitioning for parallelism
      • Output schema validation and storage
    • Feature store integration:
      • Feast (open-source) or cloud-native (Vertex AI Feature Store, SageMaker Feature Store)
      • Online features: low-latency key-value store (Redis, DynamoDB)
      • Offline features: data warehouse (BigQuery, Snowflake) for training
  4. Model monitoring configuration
    • Performance monitoring:
      • Latency: p50, p95, p99 percentiles
      • Throughput: requests per second
      • Error rates: 4xx/5xx responses, prediction failures
    • Data drift detection (accessed 2025-10-25T21:30:36-04:00): https://docs.evidentlyai.com/
      • Statistical tests: Kolmogorov-Smirnov, Population Stability Index (PSI)
      • Feature distribution monitoring
      • Alert thresholds: PSI > 0.2 triggers investigation
    • Model drift (concept drift):
      • Prediction distribution shifts
      • Accuracy degradation on a labeled subset
      • Business metric impact (conversion rate, revenue)
  5. Generate production artifacts
    • MLflow project configuration (MLproject file)
    • Kubernetes deployment manifests (Deployment, Service, HPA)
    • Monitoring dashboard (Grafana JSON)
    • CI/CD pipeline config (GitHub Actions, GitLab CI, Jenkins)
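
As a sketch of the registry stage flow above, the MLflow client can record stage transitions and lineage tags; the model name, version, and tag values below are placeholders, and newer MLflow releases also offer registry aliases as an alternative to stages.

```python
# Sketch of the registry stage flow (None -> Staging -> Production) with the MLflow client.
# Model name, version, and tag values are placeholders; adapt to your registry naming scheme.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow-server:5000")

# Promote a registered version to Staging, then to Production once validation passes.
client.transition_model_version_stage(
    name="fraud-detection", version="3", stage="Staging"
)
client.transition_model_version_stage(
    name="fraud-detection", version="3", stage="Production",
    archive_existing_versions=True,   # retire the previous Production version
)

# Attach lineage metadata so the deployment stays traceable to a training run.
client.set_model_version_tag("fraud-detection", "3", "git_commit", "a3f4d9e")
```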

Decision rules:
* If latency SLA < 100ms → use model compression (quantization, pruning) or edge deployment
* If scale > 1000 rps → enable autoscaling (HPA) with GPU node pools
* If governance_level = production → require model approval workflow and audit logging

Tier 3: Comprehensive MLOps Lifecycle with Governance (≤12k tokens)

Goal: Full MLOps implementation with automated retraining, governance, and advanced patterns

Steps:

  1. Advanced experiment tracking
    • Hyperparameter optimization integration:
      • Optuna, Ray Tune, or Katib (Kubeflow)
      • Parallel trial execution and early stopping
      • Auto-promotion of the best model to the registry
    • Distributed training:
      • Horovod, PyTorch DDP, TensorFlow MultiWorkerMirroredStrategy
      • Training cluster setup (K8s with GPU node pools)
      • Checkpoint management and fault tolerance
  2. Model governance and lineage
    • Model cards (accessed 2025-10-25T21:30:36-04:00): https://modelcards.withgoogle.com/about
      • Model details: architecture, training data, intended use
      • Performance metrics: accuracy by demographic subgroup
      • Ethical considerations: fairness, bias mitigation
      • Limitations and recommendations
    • Data lineage tracking:
      • DAG visualization: raw data → features → model
      • Versioned datasets with checksums
      • Feature engineering pipeline capture
    • Compliance and audit:
      • Model approval workflows (JIRA, ServiceNow integration)
      • Access control: RBAC for model deployment
      • Audit logs: who deployed which model, and when
  3. Automated retraining pipelines
    • Trigger conditions:
      • Scheduled: weekly or monthly, depending on data volume
      • Performance-based: accuracy drops below threshold
      • Data drift: PSI > 0.25 or KS test p-value < 0.05
      • Business rules: quarterly regulatory updates
    • Retraining workflow (accessed 2025-10-25T21:30:36-04:00): https://www.kubeflow.org/docs/components/pipelines/
      • Data validation: schema checks, outlier detection (TensorFlow Data Validation)
      • Feature engineering: reproducible transformations
      • Training: distributed, with experiment tracking
      • Evaluation: challenger vs. champion comparison
      • Conditional deployment: only if the challenger outperforms by X% (see the gate sketch after this list)
    • Kubeflow Pipelines or Vertex AI Pipelines:
      • DAG definition with components (data prep, train, eval, deploy)
      • Pipeline versioning and parameterization
      • Artifact caching for efficiency
  4. Advanced deployment patterns
    • Multi-model serving:
      • Model ensemble aggregation (voting, stacking)
      • Conditional routing: a simple model for ~80% of requests, the complex model for the rest
      • Resource optimization: shared GPU for multiple models
    • Shadow deployment:
      • Route traffic to both the old and new models
      • Compare predictions without affecting users
      • Gather real-world performance data before cutover
    • Progressive rollout:
      • Canary: 5% → 25% → 50% → 100%
      • Automated rollback on SLO violation
      • Feature flags for A/B testing (LaunchDarkly, Unleash)
  5. Observability and incident response
    • Distributed tracing:
      • OpenTelemetry integration for request tracing
      • Trace the prediction path: API → feature fetch → inference → response
      • Latency breakdown by component
    • Alerting strategy:
      • P0: model serving down (Opsgenie/PagerDuty)
      • P1: SLO violation (latency, error rate)
      • P2: data drift detected (Slack notification)
      • P3: retraining pipeline failure
    • Incident runbooks:
      • Model rollback procedure
      • Cache warming after deployment
      • Debugging prediction anomalies
  6. Cost optimization
    • Compute optimization:
      • Autoscaling based on traffic patterns
      • Spot/preemptible instances for training
      • CPU vs. GPU cost-benefit analysis
    • Model optimization:
      • Quantization: FP32 → FP16 or INT8 (TensorRT, ONNX Runtime)
      • Pruning: remove low-importance weights
      • Knowledge distillation: train a smaller student model
    • Storage optimization:
      • Artifact lifecycle policies: archive old model versions
      • Compression for model binaries
  7. Generate comprehensive deliverables
    • Complete MLflow project with reproducible runs
    • Kubeflow Pipeline YAML or Python DSL
    • Model monitoring dashboard (Grafana + custom panels)
    • Retraining automation scripts
    • Model governance documentation (model cards, lineage diagrams)
    • Runbooks for common incidents
    • Cost analysis and optimization report
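
The champion/challenger gate referenced in the retraining workflow reduces to a small comparison; this is a generic sketch in which the uplift threshold, latency budget, and promotion hook are assumptions to adapt.

```python
# Champion/challenger gate sketch for the retraining workflow above.
# The thresholds and the promotion/archival actions are placeholders.
from dataclasses import dataclass


@dataclass
class EvalResult:
    f1: float
    latency_p95_ms: float


def should_promote(champion: EvalResult, challenger: EvalResult,
                   min_uplift: float = 0.01, latency_budget_ms: float = 200.0) -> bool:
    """Promote only if the challenger beats the champion by `min_uplift` F1
    without exceeding the latency budget."""
    return (challenger.f1 >= champion.f1 + min_uplift
            and challenger.latency_p95_ms <= latency_budget_ms)


if should_promote(EvalResult(f1=0.88, latency_p95_ms=120),
                  EvalResult(f1=0.90, latency_p95_ms=135)):
    print("promote challenger")   # replace with the actual deployment step
else:
    print("keep champion")        # e.g., archive the challenger run
```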

Research sources for T3:
* MLOps principles (accessed 2025-10-25T21:30:36-04:00): https://ml-ops.org/content/mlops-principles
* Model monitoring patterns (accessed 2025-10-25T21:30:36-04:00): https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/
* Feature stores (accessed 2025-10-25T21:30:36-04:00): https://www.featurestore.org/
* Model deployment strategies (accessed 2025-10-25T21:30:36-04:00): https://docs.seldon.io/projects/seldon-core/en/latest/analytics/routers.html

Decision Rules

Stack selection:
* If existing_stack specified → integrate with existing tools rather than replace
* If cloud-native → prefer managed services (Vertex AI, SageMaker) for faster setup
* If on-prem/hybrid → prefer open-source (MLflow, Kubeflow) for portability
* If team < 5 → start with MLflow + simple deployment (T1-T2); avoid Kubeflow complexity
* If regulated industry → enforce model governance and audit logging (T3)

Deployment pattern selection:
* If latency_sla < 50ms → edge deployment or model optimization required
* If 50ms ≤ latency_sla < 500ms → online serving with optimized model servers
* If latency_sla ≥ 500ms OR batch → batch inference acceptable
* If rps > 1000 → enable autoscaling and consider model caching
* If model_size > 1GB → use model compression or multi-stage loading

Monitoring thresholds:
* Data drift (PSI): 0.1-0.2 = monitor, >0.2 = investigate, >0.25 = trigger retraining
* Model accuracy degradation: >5% drop = investigate, >10% = immediate retraining
* Latency SLO: p95 < target_latency; alert if p95 > 1.5x target for 10+ minutes
* Error rate: >1% = alert, >5% = page on-call engineer
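
A simplified illustration of how the PSI thresholds above could be applied; the binning scheme and helper names are illustrative and not tied to a specific drift library (Evidently AI and Alibi Detect provide production-grade equivalents).

```python
# Illustrative PSI check against the thresholds above (0.1-0.2 monitor, >0.2 investigate,
# >0.25 trigger retraining). Binning and epsilon handling are simplified for clarity.
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10, eps: float = 1e-6) -> float:
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Values outside the baseline range are ignored for simplicity.
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


def drift_action(psi: float) -> str:
    if psi > 0.25:
        return "trigger retraining"
    if psi > 0.2:
        return "investigate"
    if psi >= 0.1:
        return "monitor"
    return "ok"


rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, 10_000)   # training-time feature distribution
live = rng.normal(0.3, 1.0, 10_000)       # shifted production distribution
print(drift_action(population_stability_index(baseline, live)))
```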

Abort conditions:
* Incompatible model format for target deployment (e.g., Keras model to TorchServe)
* Insufficient infrastructure for scale requirements
* Conflicting governance requirements (e.g., model explainability mandate without supported framework)

Output Contract

Experiment tracking configuration:

```json
{
  "tracking_uri": "http://mlflow-server:5000",
  "artifact_location": "s3://mlops-artifacts/experiments",
  "backend_store": "postgresql://mlflow:***@db:5432/mlflow",
  "default_artifact_root": "s3://mlops-artifacts",
  "auth_enabled": true
}
```

Model registry schema:

```json
{
  "model_name": "fraud-detection-v2",
  "version": "2.1.0",
  "stage": "Production",
  "framework": "XGBoost 1.7.0",
  "metrics": {
    "auc": 0.943,
    "precision": 0.89,
    "recall": 0.87,
    "f1": 0.88
  },
  "metadata": {
    "training_data_version": "2024-10-01",
    "git_commit": "a3f4d9e",
    "trained_at": "2025-10-20T14:32:18Z",
    "trained_by": "[email protected]"
  },
  "requirements": {
    "cpu_cores": 2,
    "memory_gb": 4,
    "gpu_required": false
  },
  "signature": {
    "inputs": "[{\"name\": \"transaction_amount\", \"type\": \"double\"}, ...]",
    "outputs": "[{\"name\": \"fraud_probability\", \"type\": \"double\"}]"
  }
}
```
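
One way to produce a registry entry with this shape is to log the model with an explicit signature plus metrics and tags; the sketch below uses a toy scikit-learn model purely for illustration, and registering a model by name assumes a registry-capable MLflow backend.

```python
# Sketch: log a model with a signature and metadata resembling the registry schema above.
# The toy scikit-learn model stands in for real training code; metric values are illustrative.
import mlflow
from mlflow.models import infer_signature
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=200).fit(X, y)
signature = infer_signature(X, model.predict_proba(X)[:, 1])  # input/output schema

with mlflow.start_run():
    mlflow.set_tags({"training_data_version": "2024-10-01", "git_commit": "a3f4d9e"})
    mlflow.log_metrics({"auc": 0.943, "precision": 0.89, "recall": 0.87, "f1": 0.88})
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        signature=signature,
        registered_model_name="fraud-detection-v2",  # requires a registry-backed tracking server
    )
```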

Deployment manifest (KServe example):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
spec:
  predictor:
    model:
      modelFormat:
        name: xgboost
      storageUri: s3://mlops-models/fraud-detection/v2.1.0
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "1"
          memory: 2Gi
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 80
    scaleMetric: concurrency
  canaryTrafficPercent: 10
```

Monitoring dashboard config:

```json
{
  "metrics": [
    {
      "name": "prediction_latency_ms",
      "type": "histogram",
      "aggregations": ["p50", "p95", "p99"],
      "alert_threshold": {"p95": 200}
    },
    {
      "name": "predictions_per_second",
      "type": "counter",
      "aggregations": ["rate"]
    },
    {
      "name": "prediction_error_rate",
      "type": "counter",
      "alert_threshold": {"rate": 0.01}
    },
    {
      "name": "feature_drift_psi",
      "type": "gauge",
      "alert_threshold": {"value": 0.2}
    }
  ],
  "dashboards": [
    "grafana_dashboard_url"
  ]
}
```
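
For the serving side, these metrics have to be published before the dashboard can chart them; a hedged sketch using prometheus_client follows (metric names approximate the config above, and Prometheus derives per-second rates from counters at query time).

```python
# Sketch: publish serving metrics with prometheus_client (pip install prometheus-client).
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

PREDICTION_LATENCY = Histogram("prediction_latency_ms", "Prediction latency in milliseconds",
                               buckets=(5, 10, 25, 50, 100, 200, 500))
PREDICTIONS_TOTAL = Counter("predictions_total", "Predictions served")
PREDICTION_ERRORS = Counter("prediction_errors_total", "Failed predictions")
FEATURE_DRIFT_PSI = Gauge("feature_drift_psi", "Latest PSI value from the drift job")

start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics

for _ in range(100):     # stand-in for the real request-handling loop
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.005, 0.05))  # pretend inference work
        PREDICTIONS_TOTAL.inc()
    except Exception:
        PREDICTION_ERRORS.inc()
    PREDICTION_LATENCY.observe((time.perf_counter() - start) * 1000)

FEATURE_DRIFT_PSI.set(0.12)  # in practice, set by the drift-detection job
```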

Retraining pipeline definition:

```yaml
pipeline_name: fraud-detection-retraining
schedule: "0 2 * * 0"  # Weekly on Sunday 2 AM
triggers:
  - type: performance
    metric: f1_score
    threshold: 0.85
    operator: less_than
  - type: drift
    metric: psi
    threshold: 0.25
    operator: greater_than
steps:
  - name: data-validation
    component: tfx.DataValidator
    config:
      schema_path: s3://schemas/fraud-detection.json
  - name: feature-engineering
    component: custom.FeatureTransformer
    dependencies: [data-validation]
  - name: model-training
    component: xgboost.Trainer
    dependencies: [feature-engineering]
    hyperparameters:
      max_depth: 6
      learning_rate: 0.1
      n_estimators: 100
  - name: model-evaluation
    component: custom.Evaluator
    dependencies: [model-training]
    approval_required: true
  - name: model-deployment
    component: kserve.Deployer
    dependencies: [model-evaluation]
    conditional: "evaluation.f1_score > 0.88"
```
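
Evaluating the triggers block comes down to comparing current metric values against the declared thresholds; a minimal sketch, assuming the metric values are fetched from your monitoring backend rather than hard-coded as below.

```python
# Sketch of evaluating the `triggers` block above; metric values would come from the
# monitoring backend rather than the sample observations shown here.
OPERATORS = {
    "less_than": lambda value, threshold: value < threshold,
    "greater_than": lambda value, threshold: value > threshold,
}

triggers = [
    {"type": "performance", "metric": "f1_score", "threshold": 0.85, "operator": "less_than"},
    {"type": "drift", "metric": "psi", "threshold": 0.25, "operator": "greater_than"},
]

current_metrics = {"f1_score": 0.83, "psi": 0.12}   # sample observations

fired = [
    t["type"] for t in triggers
    if OPERATORS[t["operator"]](current_metrics[t["metric"]], t["threshold"])
]
if fired:
    print(f"retraining triggered by: {fired}")       # e.g., submit the pipeline run here
```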

Examples

```python
# Example: Set up an MLOps pipeline for a fraud detection model
from mlops_lifecycle_manager import MLOpsManager

mgr = MLOpsManager(
    model_type="classification", deployment_target="online",
    scale_requirements={"rps": 500, "latency_ms": 150},
    governance_level="production"
)

# T1: Quick setup - basic stack recommendations
config = mgr.quick_setup(tracking="mlflow", deployment="kserve")
print(config.get_start_commands())

# T2: Production pipeline with monitoring
pipeline = mgr.design_pipeline(
    experiment_tracking=True, model_monitoring=True,
    drift_detection="evidently"
)
pipeline.generate_manifests(output_dir="./mlops-config")

# T3: Full lifecycle with automated retraining
lifecycle = mgr.full_lifecycle(
    retraining_triggers=["drift", "performance"],
    governance={"model_cards": True, "audit_logs": True}
)
lifecycle.deploy(dry_run=False)
```

Quality Gates

Token budgets:
* T1 (quick assessment): ≤2000 tokens - basic stack recommendations only
* T2 (production pipeline): ≤6000 tokens - monitoring and deployment included
* T3 (full lifecycle): ≤12000 tokens - governance, retraining, advanced patterns

Validation checks:
* All recommendations must cite authoritative sources with access dates
* Generated configs must be valid YAML/JSON (validate before output)
* Monitoring thresholds must be specific and actionable
* No hardcoded credentials or secrets in any outputs
* Model registry schema must include versioning and metadata

Safety requirements:
* Never recommend storing secrets in code or configs (use secret managers)
* Always include rollback procedures in deployment plans
* Require approval workflows for production deployments
* Implement gradual rollouts (canary/blue-green) for production models
* Include data privacy checks for regulated industries (GDPR, HIPAA)

Auditability:
* All model deployments must be traceable to training runs
* Model cards required for production governance_level
* Change logs for model versions
* Access control and RBAC for model registry

Determinism:
* Reproducible training: fix random seeds, log dependency versions
* Model signatures must define input/output schemas
* Feature engineering must be versioned and reproducible
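
As a concrete illustration of seed fixing and dependency logging, a minimal sketch follows; the framework-specific seeding calls (PyTorch, TensorFlow) are noted only as comments, since which ones apply depends on your stack.

```python
# Minimal reproducibility sketch: pin seeds and capture the environment with the run.
import random
import sys

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# If applicable: torch.manual_seed(SEED) and tf.random.set_seed(SEED).

run_metadata = {
    "seed": SEED,
    "python": sys.version.split()[0],
    "numpy": np.__version__,
}
print(run_metadata)  # log these as parameters/tags in the experiment tracker
```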

Resources

MLOps frameworks and platforms:
* MLflow documentation (accessed 2025-10-25T21:30:36-04:00): https://mlflow.org/docs/latest/
* Kubeflow documentation (accessed 2025-10-25T21:30:36-04:00): https://www.kubeflow.org/docs/
* ML-Ops.org principles (accessed 2025-10-25T21:30:36-04:00): https://ml-ops.org/
* Google Cloud Vertex AI (accessed 2025-10-25T21:30:36-04:00): https://cloud.google.com/vertex-ai/docs
* AWS SageMaker MLOps (accessed 2025-10-25T21:30:36-04:00): https://docs.aws.amazon.com/sagemaker/latest/dg/mlops.html
* Neptune.ai MLOps guide (accessed 2025-10-25T21:30:36-04:00): https://neptune.ai/blog/mlops

Model serving and deployment:
* KServe documentation (accessed 2025-10-25T21:30:36-04:00): https://kserve.github.io/website/
* Seldon Core patterns (accessed 2025-10-25T21:30:36-04:00): https://docs.seldon.io/
* TorchServe (accessed 2025-10-25T21:30:36-04:00): https://pytorch.org/serve/
* TensorFlow Serving (accessed 2025-10-25T21:30:36-04:00): https://www.tensorflow.org/tfx/guide/serving

Monitoring and observability:
* Evidently AI drift detection (accessed 2025-10-25T21:30:36-04:00): https://docs.evidentlyai.com/
* Model monitoring guide (accessed 2025-10-25T21:30:36-04:00): https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/
* Prometheus for ML metrics (accessed 2025-10-25T21:30:36-04:00): https://prometheus.io/docs/introduction/overview/

Governance and best practices:
* Model cards (accessed 2025-10-25T21:30:36-04:00): https://modelcards.withgoogle.com/about
* Google MLOps maturity model (accessed 2025-10-25T21:30:36-04:00): https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
* Feature store concepts (accessed 2025-10-25T21:30:36-04:00): https://www.featurestore.org/

Additional resources in /skills/mlops-lifecycle-manager/resources/:
* mlflow-tracking-config.yaml - Complete MLflow server configuration
* kserve-inference-service.yaml - Production-ready KServe manifest template
* model-card-template.md - Standardized model documentation template
* monitoring-dashboard.json - Grafana dashboard for ML models
* retraining-pipeline.py - Kubeflow Pipeline example with automated triggers

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.