williamzujkowski

MLOps Lifecycle Manager

# Install this skill:
npx skills add williamzujkowski/cognitive-toolworks --skill "MLOps Lifecycle Manager"

Install specific skill from multi-skill repository

# Description

Manage ML model lifecycle from development to deployment with experiment tracking, versioning, monitoring, and automated retraining workflows.

# SKILL.md


---
name: MLOps Lifecycle Manager
slug: mlops-lifecycle-manager
description: Manage ML model lifecycle from development to deployment with experiment tracking, versioning, monitoring, and automated retraining workflows.
capabilities:
- Experiment tracking setup and configuration
- Model versioning and registry management
- Deployment pipeline design (batch/online/edge)
- Model monitoring and drift detection
- Automated retraining triggers and workflows
- Model governance and lineage tracking
- Feature store integration
- A/B testing and canary deployments
inputs:
- model_type: classification, regression, clustering, generative, etc.
- deployment_target: batch, online, edge, embedded
- scale_requirements: requests_per_second, latency_sla, data_volume
- governance_level: experimental, staging, production
- existing_stack: mlflow, kubeflow, vertex_ai, sagemaker, etc.
outputs:
- experiment_config: MLflow/W&B tracking configuration
- model_registry_schema: versioning and metadata structure
- deployment_manifest: K8s/cloud-specific deployment config
- monitoring_dashboard: metrics, alerts, drift detection queries
- retraining_pipeline: automated workflow definition
keywords:
- mlops
- machine-learning
- model-deployment
- experiment-tracking
- model-monitoring
- feature-engineering
- model-versioning
- ml-pipeline
version: 1.0.0
owner: cognitive-toolworks
license: CC0-1.0
security: public; no secrets or PII
links:
- https://mlflow.org/docs/latest/
- https://www.kubeflow.org/docs/
- https://ml-ops.org/
- https://cloud.google.com/vertex-ai/docs
- https://neptune.ai/blog/mlops
- https://docs.seldon.io/
---


Purpose & When-To-Use

Trigger conditions:

  • Transitioning ML model from notebook/experiment to production deployment
  • Model performance degradation detected in production
  • Need to establish experiment tracking for ML team collaboration
  • Designing automated ML pipeline from training to deployment
  • Implementing model governance and compliance requirements
  • Setting up model monitoring and observability
  • Managing multiple model versions across environments

Use this skill when:

  • Building end-to-end MLOps infrastructure from scratch
  • Migrating ad-hoc ML workflows to production-grade systems
  • Implementing MLOps best practices (accessed 2025-10-25T21:30:36-04:00): https://ml-ops.org/content/three-levels-of-ml-software
  • Establishing model lifecycle governance for regulated industries
  • Optimizing model deployment patterns for scale and latency
  • Integrating ML workflows with existing DevOps pipelines

Pre-Checks

Time normalization:

NOW_ET = 2025-10-25T21:30:36-04:00

Input validation:

  • model_type must match supported categories: classification, regression, clustering, ranking, generative, time-series, recommendation, computer-vision, nlp
  • deployment_target must be: batch, online, edge, embedded, or hybrid
  • scale_requirements must include at least one of: rps (requests per second), latency_ms, data_volume_gb, concurrent_users
  • governance_level must be: experimental, staging, production (determines compliance/audit requirements)
  • existing_stack is optional; if provided, must be recognized platform: mlflow, kubeflow, vertex_ai, sagemaker, azure_ml, databricks, neptune, wandb, custom
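
For illustration, the checks above can be collapsed into a small guard function; this is a hypothetical helper (the function and constant names are not part of the skill's interface), shown only to make the allowed values explicit.

```python
# Hypothetical pre-check helper; names and allowed values mirror the validation list above.
ALLOWED_MODEL_TYPES = {
    "classification", "regression", "clustering", "ranking", "generative",
    "time-series", "recommendation", "computer-vision", "nlp",
}
ALLOWED_TARGETS = {"batch", "online", "edge", "embedded", "hybrid"}
ALLOWED_GOVERNANCE = {"experimental", "staging", "production"}
SCALE_KEYS = {"rps", "latency_ms", "data_volume_gb", "concurrent_users"}


def validate_inputs(model_type: str, deployment_target: str,
                    scale_requirements: dict, governance_level: str) -> None:
    """Raise ValueError if any required input falls outside the supported values."""
    if model_type not in ALLOWED_MODEL_TYPES:
        raise ValueError(f"unsupported model_type: {model_type}")
    if deployment_target not in ALLOWED_TARGETS:
        raise ValueError(f"unsupported deployment_target: {deployment_target}")
    if governance_level not in ALLOWED_GOVERNANCE:
        raise ValueError(f"unsupported governance_level: {governance_level}")
    if not SCALE_KEYS & scale_requirements.keys():
        raise ValueError(f"scale_requirements needs at least one of {sorted(SCALE_KEYS)}")
```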

MLOps maturity check:

  • Level 0: Manual process (no automation) - recommend full T3 implementation
  • Level 1: ML pipeline automation - focus on monitoring and governance (T2)
  • Level 2: CI/CD for ML pipelines - focus on advanced patterns (T2-T3)
  • Reference: Google MLOps Maturity Model (accessed 2025-10-25T21:30:36-04:00): https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

Framework/platform compatibility:

  • Kubernetes available: enables Kubeflow, Seldon, KServe deployments
  • Cloud-managed: Vertex AI (GCP), SageMaker (AWS), Azure ML (Azure)
  • Hybrid/on-prem: MLflow + model servers (TorchServe, TensorFlow Serving, Triton)

Procedure

Tier 1: Quick MLOps Assessment and Recommendations (≤2k tokens)

Goal: Provide fast MLOps setup recommendations based on model type and deployment target

Steps:

  1. Analyze requirements
    • Identify the primary model type and deployment pattern
    • Assess scale requirements (batch vs. online, latency constraints)
    • Determine existing infrastructure capabilities
  2. Recommend core stack
    • Experiment tracking: MLflow (open-source, cloud-agnostic) or platform-native (Vertex AI Experiments, SageMaker Experiments)
    • Model registry: MLflow Model Registry or cloud-managed registry
    • Deployment:
      • Batch: Kubernetes CronJobs, Airflow/Prefect orchestration
      • Online: KServe/Seldon (K8s) or cloud endpoints (Vertex AI Prediction, SageMaker Inference)
      • Edge: TensorFlow Lite, ONNX Runtime, TorchScript
    • Monitoring: Prometheus + Grafana for metrics, Evidently AI or Alibi Detect for drift
  3. Generate quick-start checklist
    • Install and configure the experiment tracker (see the sketch after this list)
    • Configure the model registry with a versioning schema
    • Set up a basic deployment pipeline
    • Enable core metrics collection (latency, throughput, error rate)
  4. Output minimal configuration
    • MLflow tracking URI and artifact store location
    • Model registry schema (model name, version, stage: staging/production)
    • Deployment command template
    • Basic monitoring dashboard query examples
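
To make the experiment-tracker step concrete, a minimal MLflow logging sketch might look like the following; the tracking URI, experiment name, and logged values are placeholders rather than values this skill prescribes.

```python
# Minimal MLflow tracking sketch; assumes `mlflow` is installed (pip install mlflow)
# and that a tracking server is reachable at the placeholder URI below.
import mlflow

mlflow.set_tracking_uri("http://mlflow-server:5000")   # placeholder URI
mlflow.set_experiment("fraud-detection")               # created if it does not exist

with mlflow.start_run(run_name="baseline"):
    # Parameters, metrics, and tags mirror the logging checklist above.
    mlflow.log_params({"max_depth": 6, "learning_rate": 0.1})
    mlflow.log_metric("auc", 0.94)
    mlflow.set_tags({"team": "risk", "experiment_type": "baseline"})
```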

Abort conditions:
* Incompatible infrastructure (e.g., edge deployment requested but no ONNX/TFLite tooling)
* Conflicting requirements (e.g., <50ms latency with large ensemble model on CPU)

Tier 2: Production MLOps Pipeline Design (≤6k tokens)

Goal: Design production-ready MLOps pipeline with monitoring and governance

Steps:

  1. Experiment tracking setup
    • MLflow configuration (accessed 2025-10-25T21:30:36-04:00): https://mlflow.org/docs/latest/tracking.html
      • Backend store: PostgreSQL/MySQL for metadata
      • Artifact store: S3/GCS/Azure Blob for models/artifacts
      • Authentication: integrate with LDAP/OAuth
    • Logging best practices:
      • Parameters: hyperparameters, data versioning, feature engineering config
      • Metrics: training/validation loss, accuracy, precision, recall, F1, AUC
      • Artifacts: model binaries, confusion matrices, feature importance plots
      • Tags: team, project, experiment_type for filtering
  2. Model versioning and registry (see the registry sketch after this list)
    • Versioning schema:
      • Semantic versioning: major.minor.patch
      • Git commit SHA for code reproducibility
      • Data version hash for training data lineage
    • Model stages: None → Staging → Production → Archived
    • Metadata requirements:
      • Model type and framework (scikit-learn, TensorFlow, PyTorch, XGBoost)
      • Performance metrics on validation/test sets
      • Hardware requirements (CPU/GPU, memory)
      • Inference latency benchmarks
      • Training data statistics and schema
  3. Deployment pipeline design
    • Online serving (accessed 2025-10-25T21:30:36-04:00): https://www.kubeflow.org/docs/components/kserve/
      • KServe InferenceService for autoscaling and canary deployments
      • Model server selection: TorchServe (PyTorch), TensorFlow Serving, Triton (multi-framework)
      • API design: REST or gRPC with request/response schemas
      • Traffic splitting: 90/10 for canary, 50/50 for A/B testing
    • Batch inference:
      • Kubernetes Jobs or cloud batch services
      • Input data partitioning for parallelism
      • Output schema validation and storage
    • Feature store integration:
      • Feast (open-source) or cloud-native (Vertex AI Feature Store, SageMaker Feature Store)
      • Online features: low-latency key-value store (Redis, DynamoDB)
      • Offline features: data warehouse (BigQuery, Snowflake) for training
  4. Model monitoring configuration
    • Performance monitoring:
      • Latency: p50, p95, p99 percentiles
      • Throughput: requests per second
      • Error rates: 4xx/5xx responses, prediction failures
    • Data drift detection (accessed 2025-10-25T21:30:36-04:00): https://docs.evidentlyai.com/
      • Statistical tests: Kolmogorov-Smirnov, Population Stability Index (PSI)
      • Feature distribution monitoring
      • Alert thresholds: PSI > 0.2 triggers investigation
    • Model drift (concept drift):
      • Prediction distribution shifts
      • Accuracy degradation on a labeled subset
      • Business metric impact (conversion rate, revenue)
  5. Generate production artifacts
    • MLflow project configuration (MLproject file)
    • Kubernetes deployment manifests (Deployment, Service, HPA)
    • Monitoring dashboard (Grafana JSON)
    • CI/CD pipeline config (GitHub Actions, GitLab CI, Jenkins)
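
As a sketch of the registry stage flow above, the MLflow client can record stage transitions and lineage tags; the model name, version, and tag values below are placeholders, and newer MLflow releases also offer registry aliases as an alternative to stages.

```python
# Sketch of the registry stage flow (None -> Staging -> Production) with the MLflow client.
# Model name, version, and tag values are placeholders; adapt to your registry naming scheme.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow-server:5000")

# Promote a registered version to Staging, then to Production once validation passes.
client.transition_model_version_stage(
    name="fraud-detection", version="3", stage="Staging"
)
client.transition_model_version_stage(
    name="fraud-detection", version="3", stage="Production",
    archive_existing_versions=True,   # retire the previous Production version
)

# Attach lineage metadata so the deployment stays traceable to a training run.
client.set_model_version_tag("fraud-detection", "3", "git_commit", "a3f4d9e")
```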

Decision rules:
* If latency SLA < 100ms → use model compression (quantization, pruning) or edge deployment
* If scale > 1000 rps → enable autoscaling (HPA) with GPU node pools
* If governance_level = production → require model approval workflow and audit logging

Tier 3: Comprehensive MLOps Lifecycle with Governance (≤12k tokens)

Goal: Full MLOps implementation with automated retraining, governance, and advanced patterns

Steps:

  1. Advanced experiment tracking
    • Hyperparameter optimization integration:
      • Optuna, Ray Tune, or Katib (Kubeflow)
      • Parallel trial execution and early stopping
      • Auto-promotion of the best model to the registry
    • Distributed training:
      • Horovod, PyTorch DDP, TensorFlow MultiWorkerMirroredStrategy
      • Training cluster setup (K8s with GPU node pools)
      • Checkpoint management and fault tolerance
  2. Model governance and lineage
    • Model cards (accessed 2025-10-25T21:30:36-04:00): https://modelcards.withgoogle.com/about
      • Model details: architecture, training data, intended use
      • Performance metrics: accuracy by demographic subgroup
      • Ethical considerations: fairness, bias mitigation
      • Limitations and recommendations
    • Data lineage tracking:
      • DAG visualization: raw data → features → model
      • Versioned datasets with checksums
      • Feature engineering pipeline capture
    • Compliance and audit:
      • Model approval workflows (JIRA, ServiceNow integration)
      • Access control: RBAC for model deployment
      • Audit logs: who deployed which model, and when
  3. Automated retraining pipelines
    • Trigger conditions:
      • Scheduled: weekly or monthly, depending on data volume
      • Performance-based: accuracy drops below threshold
      • Data drift: PSI > 0.25 or KS test p-value < 0.05
      • Business rules: quarterly regulatory updates
    • Retraining workflow (accessed 2025-10-25T21:30:36-04:00): https://www.kubeflow.org/docs/components/pipelines/
      • Data validation: schema checks, outlier detection (TensorFlow Data Validation)
      • Feature engineering: reproducible transformations
      • Training: distributed, with experiment tracking
      • Evaluation: challenger vs. champion comparison
      • Conditional deployment: only if the challenger outperforms by X% (see the gate sketch after this list)
    • Kubeflow Pipelines or Vertex AI Pipelines:
      • DAG definition with components (data prep, train, eval, deploy)
      • Pipeline versioning and parameterization
      • Artifact caching for efficiency
  4. Advanced deployment patterns
    • Multi-model serving:
      • Model ensemble aggregation (voting, stacking)
      • Conditional routing: a simple model for ~80% of requests, the complex model for the rest
      • Resource optimization: shared GPU for multiple models
    • Shadow deployment:
      • Route traffic to both the old and new models
      • Compare predictions without affecting users
      • Gather real-world performance data before cutover
    • Progressive rollout:
      • Canary: 5% → 25% → 50% → 100%
      • Automated rollback on SLO violation
      • Feature flags for A/B testing (LaunchDarkly, Unleash)
  5. Observability and incident response
    • Distributed tracing:
      • OpenTelemetry integration for request tracing
      • Trace the prediction path: API → feature fetch → inference → response
      • Latency breakdown by component
    • Alerting strategy:
      • P0: model serving down (Opsgenie/PagerDuty)
      • P1: SLO violation (latency, error rate)
      • P2: data drift detected (Slack notification)
      • P3: retraining pipeline failure
    • Incident runbooks:
      • Model rollback procedure
      • Cache warming after deployment
      • Debugging prediction anomalies
  6. Cost optimization
    • Compute optimization:
      • Autoscaling based on traffic patterns
      • Spot/preemptible instances for training
      • CPU vs. GPU cost-benefit analysis
    • Model optimization:
      • Quantization: FP32 → FP16 or INT8 (TensorRT, ONNX Runtime)
      • Pruning: remove low-importance weights
      • Knowledge distillation: train a smaller student model
    • Storage optimization:
      • Artifact lifecycle policies: archive old model versions
      • Compression for model binaries
  7. Generate comprehensive deliverables
    • Complete MLflow project with reproducible runs
    • Kubeflow Pipeline YAML or Python DSL
    • Model monitoring dashboard (Grafana + custom panels)
    • Retraining automation scripts
    • Model governance documentation (model cards, lineage diagrams)
    • Runbooks for common incidents
    • Cost analysis and optimization report
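
The champion/challenger gate referenced in the retraining workflow reduces to a small comparison; this is a generic sketch in which the uplift threshold, latency budget, and promotion hook are assumptions to adapt.

```python
# Champion/challenger gate sketch for the retraining workflow above.
# The thresholds and the promotion/archival actions are placeholders.
from dataclasses import dataclass


@dataclass
class EvalResult:
    f1: float
    latency_p95_ms: float


def should_promote(champion: EvalResult, challenger: EvalResult,
                   min_uplift: float = 0.01, latency_budget_ms: float = 200.0) -> bool:
    """Promote only if the challenger beats the champion by `min_uplift` F1
    without exceeding the latency budget."""
    return (challenger.f1 >= champion.f1 + min_uplift
            and challenger.latency_p95_ms <= latency_budget_ms)


if should_promote(EvalResult(f1=0.88, latency_p95_ms=120),
                  EvalResult(f1=0.90, latency_p95_ms=135)):
    print("promote challenger")   # replace with the actual deployment step
else:
    print("keep champion")        # e.g., archive the challenger run
```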

Research sources for T3:
* MLOps principles (accessed 2025-10-25T21:30:36-04:00): https://ml-ops.org/content/mlops-principles
* Model monitoring patterns (accessed 2025-10-25T21:30:36-04:00): https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/
* Feature stores (accessed 2025-10-25T21:30:36-04:00): https://www.featurestore.org/
* Model deployment strategies (accessed 2025-10-25T21:30:36-04:00): https://docs.seldon.io/projects/seldon-core/en/latest/analytics/routers.html

Decision Rules

Stack selection:
* If existing_stack specified → integrate with existing tools rather than replace
* If cloud-native → prefer managed services (Vertex AI, SageMaker) for faster setup
* If on-prem/hybrid → prefer open-source (MLflow, Kubeflow) for portability
* If team < 5 → start with MLflow + simple deployment (T1-T2); avoid Kubeflow complexity
* If regulated industry → enforce model governance and audit logging (T3)

Deployment pattern selection:
* If latency_sla < 50ms → edge deployment or model optimization required
* If 50ms ≤ latency_sla < 500ms → online serving with optimized model servers
* If latency_sla ≥ 500ms OR batch → batch inference acceptable
* If rps > 1000 → enable autoscaling and consider model caching
* If model_size > 1GB → use model compression or multi-stage loading

Monitoring thresholds:
* Data drift (PSI): 0.1-0.2 = monitor, >0.2 = investigate, >0.25 = trigger retraining
* Model accuracy degradation: >5% drop = investigate, >10% = immediate retraining
* Latency SLO: p95 < target_latency; alert if p95 > 1.5x target for 10+ minutes
* Error rate: >1% = alert, >5% = page on-call engineer
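
A simplified illustration of how the PSI thresholds above could be applied; the binning scheme and helper names are illustrative and not tied to a specific drift library (Evidently AI and Alibi Detect provide production-grade equivalents).

```python
# Illustrative PSI check against the thresholds above (0.1-0.2 monitor, >0.2 investigate,
# >0.25 trigger retraining). Binning and epsilon handling are simplified for clarity.
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10, eps: float = 1e-6) -> float:
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Values outside the baseline range are ignored for simplicity.
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


def drift_action(psi: float) -> str:
    if psi > 0.25:
        return "trigger retraining"
    if psi > 0.2:
        return "investigate"
    if psi >= 0.1:
        return "monitor"
    return "ok"


rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, 10_000)   # training-time feature distribution
live = rng.normal(0.3, 1.0, 10_000)       # shifted production distribution
print(drift_action(population_stability_index(baseline, live)))
```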

Abort conditions:
* Incompatible model format for target deployment (e.g., Keras model to TorchServe)
* Insufficient infrastructure for scale requirements
* Conflicting governance requirements (e.g., model explainability mandate without supported framework)

Output Contract

Experiment tracking configuration:

```json
{
  "tracking_uri": "http://mlflow-server:5000",
  "artifact_location": "s3://mlops-artifacts/experiments",
  "backend_store": "postgresql://mlflow:***@db:5432/mlflow",
  "default_artifact_root": "s3://mlops-artifacts",
  "auth_enabled": true
}
```

Model registry schema:

```json
{
  "model_name": "fraud-detection-v2",
  "version": "2.1.0",
  "stage": "Production",
  "framework": "XGBoost 1.7.0",
  "metrics": {
    "auc": 0.943,
    "precision": 0.89,
    "recall": 0.87,
    "f1": 0.88
  },
  "metadata": {
    "training_data_version": "2024-10-01",
    "git_commit": "a3f4d9e",
    "trained_at": "2025-10-20T14:32:18Z",
    "trained_by": "[email protected]"
  },
  "requirements": {
    "cpu_cores": 2,
    "memory_gb": 4,
    "gpu_required": false
  },
  "signature": {
    "inputs": "[{\"name\": \"transaction_amount\", \"type\": \"double\"}, ...]",
    "outputs": "[{\"name\": \"fraud_probability\", \"type\": \"double\"}]"
  }
}
```
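
One way to produce a registry entry with this shape is to log the model with an explicit signature plus metrics and tags; the sketch below uses a toy scikit-learn model purely for illustration, and registering a model by name assumes a registry-capable MLflow backend.

```python
# Sketch: log a model with a signature and metadata resembling the registry schema above.
# The toy scikit-learn model stands in for real training code; metric values are illustrative.
import mlflow
from mlflow.models import infer_signature
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=200).fit(X, y)
signature = infer_signature(X, model.predict_proba(X)[:, 1])  # input/output schema

with mlflow.start_run():
    mlflow.set_tags({"training_data_version": "2024-10-01", "git_commit": "a3f4d9e"})
    mlflow.log_metrics({"auc": 0.943, "precision": 0.89, "recall": 0.87, "f1": 0.88})
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        signature=signature,
        registered_model_name="fraud-detection-v2",  # requires a registry-backed tracking server
    )
```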

Deployment manifest (KServe example):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
spec:
  predictor:
    model:
      modelFormat:
        name: xgboost
      storageUri: s3://mlops-models/fraud-detection/v2.1.0
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "1"
          memory: 2Gi
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 80
    scaleMetric: concurrency
  canaryTrafficPercent: 10
```

Monitoring dashboard config:

```json
{
  "metrics": [
    {
      "name": "prediction_latency_ms",
      "type": "histogram",
      "aggregations": ["p50", "p95", "p99"],
      "alert_threshold": {"p95": 200}
    },
    {
      "name": "predictions_per_second",
      "type": "counter",
      "aggregations": ["rate"]
    },
    {
      "name": "prediction_error_rate",
      "type": "counter",
      "alert_threshold": {"rate": 0.01}
    },
    {
      "name": "feature_drift_psi",
      "type": "gauge",
      "alert_threshold": {"value": 0.2}
    }
  ],
  "dashboards": [
    "grafana_dashboard_url"
  ]
}
```
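
For the serving side, these metrics have to be published before the dashboard can chart them; a hedged sketch using prometheus_client follows (metric names approximate the config above, and Prometheus derives per-second rates from counters at query time).

```python
# Sketch: publish serving metrics with prometheus_client (pip install prometheus-client).
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

PREDICTION_LATENCY = Histogram("prediction_latency_ms", "Prediction latency in milliseconds",
                               buckets=(5, 10, 25, 50, 100, 200, 500))
PREDICTIONS_TOTAL = Counter("predictions_total", "Predictions served")
PREDICTION_ERRORS = Counter("prediction_errors_total", "Failed predictions")
FEATURE_DRIFT_PSI = Gauge("feature_drift_psi", "Latest PSI value from the drift job")

start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics

for _ in range(100):     # stand-in for the real request-handling loop
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.005, 0.05))  # pretend inference work
        PREDICTIONS_TOTAL.inc()
    except Exception:
        PREDICTION_ERRORS.inc()
    PREDICTION_LATENCY.observe((time.perf_counter() - start) * 1000)

FEATURE_DRIFT_PSI.set(0.12)  # in practice, set by the drift-detection job
```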

Retraining pipeline definition:

```yaml
pipeline_name: fraud-detection-retraining
schedule: "0 2 * * 0"  # Weekly on Sunday 2 AM
triggers:
  - type: performance
    metric: f1_score
    threshold: 0.85
    operator: less_than
  - type: drift
    metric: psi
    threshold: 0.25
    operator: greater_than
steps:
  - name: data-validation
    component: tfx.DataValidator
    config:
      schema_path: s3://schemas/fraud-detection.json
  - name: feature-engineering
    component: custom.FeatureTransformer
    dependencies: [data-validation]
  - name: model-training
    component: xgboost.Trainer
    dependencies: [feature-engineering]
    hyperparameters:
      max_depth: 6
      learning_rate: 0.1
      n_estimators: 100
  - name: model-evaluation
    component: custom.Evaluator
    dependencies: [model-training]
    approval_required: true
  - name: model-deployment
    component: kserve.Deployer
    dependencies: [model-evaluation]
    conditional: "evaluation.f1_score > 0.88"
```
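
Evaluating the triggers block comes down to comparing current metric values against the declared thresholds; a minimal sketch, assuming the metric values are fetched from your monitoring backend rather than hard-coded as below.

```python
# Sketch of evaluating the `triggers` block above; metric values would come from the
# monitoring backend rather than the sample observations shown here.
OPERATORS = {
    "less_than": lambda value, threshold: value < threshold,
    "greater_than": lambda value, threshold: value > threshold,
}

triggers = [
    {"type": "performance", "metric": "f1_score", "threshold": 0.85, "operator": "less_than"},
    {"type": "drift", "metric": "psi", "threshold": 0.25, "operator": "greater_than"},
]

current_metrics = {"f1_score": 0.83, "psi": 0.12}   # sample observations

fired = [
    t["type"] for t in triggers
    if OPERATORS[t["operator"]](current_metrics[t["metric"]], t["threshold"])
]
if fired:
    print(f"retraining triggered by: {fired}")       # e.g., submit the pipeline run here
```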

Examples

```python
# Example: Set up an MLOps pipeline for a fraud detection model
from mlops_lifecycle_manager import MLOpsManager

mgr = MLOpsManager(
    model_type="classification", deployment_target="online",
    scale_requirements={"rps": 500, "latency_ms": 150},
    governance_level="production"
)

# T1: Quick setup - basic stack recommendations
config = mgr.quick_setup(tracking="mlflow", deployment="kserve")
print(config.get_start_commands())

# T2: Production pipeline with monitoring
pipeline = mgr.design_pipeline(
    experiment_tracking=True, model_monitoring=True,
    drift_detection="evidently"
)
pipeline.generate_manifests(output_dir="./mlops-config")

# T3: Full lifecycle with automated retraining
lifecycle = mgr.full_lifecycle(
    retraining_triggers=["drift", "performance"],
    governance={"model_cards": True, "audit_logs": True}
)
lifecycle.deploy(dry_run=False)
```

Quality Gates

Token budgets:
* T1 (quick assessment): ≤2000 tokens - basic stack recommendations only
* T2 (production pipeline): ≤6000 tokens - monitoring and deployment included
* T3 (full lifecycle): ≤12000 tokens - governance, retraining, advanced patterns

Validation checks:
* All recommendations must cite authoritative sources with access dates
* Generated configs must be valid YAML/JSON (validate before output)
* Monitoring thresholds must be specific and actionable
* No hardcoded credentials or secrets in any outputs
* Model registry schema must include versioning and metadata

Safety requirements:
* Never recommend storing secrets in code or configs (use secret managers)
* Always include rollback procedures in deployment plans
* Require approval workflows for production deployments
* Implement gradual rollouts (canary/blue-green) for production models
* Include data privacy checks for regulated industries (GDPR, HIPAA)

Auditability:
* All model deployments must be traceable to training runs
* Model cards required for production governance_level
* Change logs for model versions
* Access control and RBAC for model registry

Determinism:
* Reproducible training: fix random seeds, log dependency versions
* Model signatures must define input/output schemas
* Feature engineering must be versioned and reproducible
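
As a concrete illustration of seed fixing and dependency logging, a minimal sketch follows; the framework-specific seeding calls (PyTorch, TensorFlow) are noted only as comments, since which ones apply depends on your stack.

```python
# Minimal reproducibility sketch: pin seeds and capture the environment with the run.
import random
import sys

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# If applicable: torch.manual_seed(SEED) and tf.random.set_seed(SEED).

run_metadata = {
    "seed": SEED,
    "python": sys.version.split()[0],
    "numpy": np.__version__,
}
print(run_metadata)  # log these as parameters/tags in the experiment tracker
```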

Resources

MLOps frameworks and platforms:
* MLflow documentation (accessed 2025-10-25T21:30:36-04:00): https://mlflow.org/docs/latest/
* Kubeflow documentation (accessed 2025-10-25T21:30:36-04:00): https://www.kubeflow.org/docs/
* ML-Ops.org principles (accessed 2025-10-25T21:30:36-04:00): https://ml-ops.org/
* Google Cloud Vertex AI (accessed 2025-10-25T21:30:36-04:00): https://cloud.google.com/vertex-ai/docs
* AWS SageMaker MLOps (accessed 2025-10-25T21:30:36-04:00): https://docs.aws.amazon.com/sagemaker/latest/dg/mlops.html
* Neptune.ai MLOps guide (accessed 2025-10-25T21:30:36-04:00): https://neptune.ai/blog/mlops

Model serving and deployment:
* KServe documentation (accessed 2025-10-25T21:30:36-04:00): https://kserve.github.io/website/
* Seldon Core patterns (accessed 2025-10-25T21:30:36-04:00): https://docs.seldon.io/
* TorchServe (accessed 2025-10-25T21:30:36-04:00): https://pytorch.org/serve/
* TensorFlow Serving (accessed 2025-10-25T21:30:36-04:00): https://www.tensorflow.org/tfx/guide/serving

Monitoring and observability:
* Evidently AI drift detection (accessed 2025-10-25T21:30:36-04:00): https://docs.evidentlyai.com/
* Model monitoring guide (accessed 2025-10-25T21:30:36-04:00): https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/
* Prometheus for ML metrics (accessed 2025-10-25T21:30:36-04:00): https://prometheus.io/docs/introduction/overview/

Governance and best practices:
* Model cards (accessed 2025-10-25T21:30:36-04:00): https://modelcards.withgoogle.com/about
* Google MLOps maturity model (accessed 2025-10-25T21:30:36-04:00): https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
* Feature store concepts (accessed 2025-10-25T21:30:36-04:00): https://www.featurestore.org/

Additional resources in /skills/mlops-lifecycle-manager/resources/:
* mlflow-tracking-config.yaml - Complete MLflow server configuration
* kserve-inference-service.yaml - Production-ready KServe manifest template
* model-card-template.md - Standardized model documentation template
* monitoring-dashboard.json - Grafana dashboard for ML models
* retraining-pipeline.py - Kubeflow Pipeline example with automated triggers

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.