# SellTheSun / xgboost-best-practices

# Install this skill

Install this specific skill from the multi-skill repository:

```bash
npx skills add SellTheSun/STS_Skills --skill "xgboost-best-practices"
```

# Description

XGBoost machine learning best practices for training, tuning, and deploying gradient boosted models. Use when writing, reviewing, or implementing XGBoost models for classification, regression, or ranking tasks. Triggers on tasks involving XGBoost training, hyperparameter optimization, data preparation, model evaluation, or deployment.

# SKILL.md

```yaml
name: xgboost-best-practices
description: >-
  XGBoost machine learning best practices for training, tuning, and deploying
  gradient boosted models. Use when writing, reviewing, or implementing XGBoost
  models for classification, regression, or ranking tasks. Triggers on tasks
  involving XGBoost training, hyperparameter optimization, data preparation,
  model evaluation, or deployment.
license: MIT
metadata:
  author: xgboost-community
  version: "1.0.0"
```

## XGBoost Best Practices

Comprehensive optimization and best-practices guide for XGBoost machine learning applications. Contains 60 rules across 10 categories, prioritized by impact to guide automated code generation and model training workflows.

> [!NOTE]
> For the complete guide with all rules and code examples, read AGENTS.md.
> This SKILL.md provides a quick reference; AGENTS.md contains detailed explanations and incorrect/correct code patterns for all 60 rules.

## When to Apply

Reference these guidelines when:
- Preparing data for XGBoost training (DMatrix, categorical features, missing values)
- Configuring hyperparameters (learning rate, tree depth, regularization)
- Training models (early stopping, cross-validation, callbacks)
- Tuning hyperparameters (grid search, Optuna, overfitting diagnosis)
- Evaluating models (feature importance, SHAP, metrics)
- Persisting and deploying models (JSON format, version compatibility)
- Optimizing performance (GPU, distributed, external memory)
- Integrating with scikit-learn pipelines

## Environment Context (Ask First)

> [!IMPORTANT]
> If the training environment is unclear, ask about: LOCAL vs. CLOUD, operating system, GPU vendor (NVIDIA/AMD/none), and XGBoost version (1.x vs. 2.x+).

GPU support guidance (see the device-selection sketch below):
- NVIDIA CUDA: full support (cloud and local)
- AMD ROCm: limited, Linux-only; verify the build
- AMD on Windows: CPU training only (`tree_method="hist"`)
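
A minimal device-selection sketch, assuming XGBoost 2.x (where the `device` parameter replaced 1.x's `gpu_hist` tree method); `use_gpu` is an illustrative flag:

```python
import xgboost as xgb

use_gpu = False  # set True only on a machine with an NVIDIA CUDA build

params = {"tree_method": "hist"}
# XGBoost 2.x selects the accelerator via "device"; NVIDIA CUDA builds accept
# "cuda". AMD-on-Windows and other unsupported setups should stay on "cpu".
params["device"] = "cuda" if use_gpu else "cpu"
# XGBoost 1.x used tree_method="gpu_hist" instead of the "device" parameter.
```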

## Rule Categories by Priority

| Priority | Category | Impact | Prefix |
|---|---|---|---|
| 1 | Data Preparation | CRITICAL | `data-` |
| 2 | General Parameters | CRITICAL | `param-` |
| 3 | Training | HIGH | `train-` |
| 4 | Hyperparameter Tuning | HIGH | `tune-` |
| 5 | Evaluation | MEDIUM | `eval-` |
| 6 | Model Persistence | HIGH | `persist-` |
| 7 | Deployment | MEDIUM | `deploy-` |
| 8 | Performance | MEDIUM | `perf-` |
| 9 | Scikit-Learn API | HIGH | `sklearn-` |
| 10 | Advanced Patterns | LOW | `advanced-` |

## Quick Reference

### 1. Data Preparation (CRITICAL)

- `data-dmatrix` - Use DMatrix for Native API (sketch below)
- `data-missing-values` - Handle Missing Values Explicitly
- `data-feature-engineering` - Feature Scaling Usually Not Required
- `data-categorical` - Enable Native Categorical Support
- `data-timeseries-split` - Use Time-Based Cross-Validation
- `data-purged-cv` - Use Purged/Embargo CV for Finance Labels
- `data-label-encoding` - Encode Target Variables Correctly
- `data-weight-imbalance` - Use Sample Weights for Imbalanced Data
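
A minimal sketch of `data-dmatrix`, `data-missing-values`, and `data-categorical` together, using a small synthetic frame (column names are illustrative):

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# One numeric column with genuine NaNs (not sentinels like -999),
# one pandas "category" column for native categorical support.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0],
    "city": pd.Categorical(["NY", "SF", "NY", "LA"]),
})
y = np.array([0, 1, 1, 0])

# DMatrix is the native API's data container; enable_categorical lets
# XGBoost split on category codes directly, with no one-hot encoding.
dtrain = xgb.DMatrix(df, label=y, missing=np.nan, enable_categorical=True)
```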

### 2. General Parameters (CRITICAL)

- `param-learning-rate` - Set Lower Learning Rate with More Trees (sketch below)
- `param-max-depth` - Control Tree Depth to Prevent Overfitting
- `param-min-child-weight` - Set Minimum Child Weight
- `param-gamma` - Use Gamma for Minimum Split Loss
- `param-subsample` - Enable Row Subsampling
- `param-colsample` - Enable Column Subsampling
- `param-regularization` - Apply L1/L2 Regularization
- `param-objective` - Choose Correct Objective Function
- `param-tree-method` - Select Appropriate Tree Method
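
Several of these rules combined into a hedged starting configuration for binary classification; the values are common tuning starting points, not recommendations for any particular dataset:

```python
params = {
    "objective": "binary:logistic",  # param-objective: match the task
    "tree_method": "hist",           # param-tree-method: fast histogram method
    "eta": 0.05,                     # param-learning-rate: low rate, more rounds
    "max_depth": 6,                  # param-max-depth: cap complexity
    "min_child_weight": 1,           # param-min-child-weight
    "gamma": 0.0,                    # param-gamma: min loss reduction per split
    "subsample": 0.8,                # param-subsample: row subsampling
    "colsample_bytree": 0.8,         # param-colsample: column subsampling
    "lambda": 1.0,                   # param-regularization: L2 term
    "alpha": 0.0,                    # param-regularization: L1 term
}
```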

### 3. Training (HIGH)

- `train-early-stopping` - Always Use Early Stopping (sketch below)
- `train-cross-validation` - Use Built-in Cross-Validation
- `train-evaluation-metric` - Monitor Multiple Metrics
- `train-callbacks` - Use Callbacks for Logging
- `train-watchlist` - Monitor Training and Validation Loss
- `train-seed` - Set Random Seed for Reproducibility
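
A training sketch combining `train-early-stopping`, `train-watchlist`, and `train-seed`; it assumes `dtrain` and `dval` are DMatrix objects built as in the data sketch above:

```python
import xgboost as xgb

params = {"objective": "binary:logistic", "eval_metric": "auc", "seed": 42}

bst = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,                      # upper bound; early stopping trims it
    evals=[(dtrain, "train"), (dval, "val")],  # watchlist: track both losses
    early_stopping_rounds=50,                  # stop when "val" stalls
    verbose_eval=100,
)
print("best iteration:", bst.best_iteration)
```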

### 4. Hyperparameter Tuning (HIGH)

- `tune-grid-search` - Start with Coarse Grid Search
- `tune-random-search` - Use Random Search for Efficiency
- `tune-bayesian-optuna` - Use Optuna for Bayesian Optimization (sketch below)
- `tune-overfitting` - Diagnose Overfitting vs Underfitting
- `tune-parameter-order` - Tune Parameters in Optimal Order
- `tune-scale-pos-weight` - Handle Class Imbalance
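
A minimal `tune-bayesian-optuna` sketch, assuming `dtrain` from the earlier examples; the search ranges are illustrative:

```python
import optuna
import xgboost as xgb

def objective(trial):
    params = {
        "objective": "binary:logistic",
        "tree_method": "hist",
        "eta": trial.suggest_float("eta", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    # xgb.cv returns a DataFrame of per-round fold means/stds.
    cv = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                metrics="auc", early_stopping_rounds=30, seed=42)
    return cv["test-auc-mean"].iloc[-1]

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```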

### 5. Evaluation (MEDIUM)

- `eval-feature-importance` - Use Gain-Based Feature Importance (sketch below)
- `eval-shap` - Use SHAP for Model Interpretability
- `eval-confusion-matrix` - Generate Confusion Matrix for Classification
- `eval-roc-auc` - Evaluate with ROC-AUC for Binary
- `eval-regression-metrics` - Use RMSE, MAE, R² for Regression
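
A short sketch of `eval-feature-importance` and `eval-shap` with the native API, continuing from the training sketch (`bst`, `dval`):

```python
# Prefer "gain" over the default "weight", which only counts splits.
importance = bst.get_score(importance_type="gain")
for name, gain in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(gain, 3))

# XGBoost computes SHAP-style contributions natively: one column per feature
# plus a bias column; each row sums to that row's raw (margin) prediction.
contribs = bst.predict(dval, pred_contribs=True)
```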

### 6. Model Persistence (HIGH)

- `persist-save-model` - Save Models in JSON/UBJSON Format (sketch below)
- `persist-load-model` - Load Models Correctly
- `persist-version-compat` - Check Version Compatibility
- `persist-feature-names` - Preserve Feature Names
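
A save/load round trip covering `persist-save-model`, `persist-load-model`, and `persist-feature-names`, continuing from the training sketch:

```python
import xgboost as xgb

# JSON (or the compact UBJSON, "model.ubj") is the stable, portable format;
# avoid the legacy binary format and avoid pickling the Booster.
bst.save_model("model.json")

loaded = xgb.Booster()
loaded.load_model("model.json")

# Feature names round-trip with the file, so column order can be validated
# at prediction time.
print(loaded.feature_names)
```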

### 7. Deployment (MEDIUM)

- `deploy-prediction` - Use Correct Prediction Methods
- `deploy-batch-inference` - Optimize Batch Predictions
- `deploy-iteration-range` - Use `iteration_range` for Best Model (sketch below)
- `deploy-inference-config` - Configure for Inference Performance
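
A `deploy-iteration-range` sketch, assuming `bst` was trained with early stopping and `dval` is a DMatrix:

```python
# Predict with only the trees up to the best iteration; rounds after it were
# kept for bookkeeping but degrade validation performance.
preds = bst.predict(dval, iteration_range=(0, bst.best_iteration + 1))

# deploy-batch-inference: prefer one call over many row-by-row calls; the
# per-call overhead is amortized across the whole batch.
```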

### 8. Performance (MEDIUM)

- `perf-gpu-training` - Enable GPU Training (NVIDIA CUDA only)
- `perf-amd-cpu-optimization` - Optimize for AMD Ryzen CPUs
- `perf-local-vs-cloud` - Ask User About Training Environment
- `perf-distributed` - Use Distributed Training for Large Data
- `perf-external-memory` - Use External Memory for Very Large Data
- `perf-nthreads` - Configure Thread Count
- `perf-quantile-dmatrix` - Use `QuantileDMatrix` for `hist` (sketch below)
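
A `perf-quantile-dmatrix` and `perf-nthreads` sketch, assuming NumPy arrays `X_train`, `y_train`, `X_val`, `y_val`:

```python
import xgboost as xgb

# With tree_method="hist", QuantileDMatrix pre-bins the features and can cut
# training memory substantially compared with a plain DMatrix.
dtrain = xgb.QuantileDMatrix(X_train, y_train)
dval = xgb.QuantileDMatrix(X_val, y_val, ref=dtrain)  # reuse dtrain's bin edges

# Leave nthread unset to use all cores, or pin it when sharing a machine.
params = {"objective": "binary:logistic", "tree_method": "hist", "nthread": 8}
bst = xgb.train(params, dtrain, num_boost_round=200, evals=[(dval, "val")])
```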

### 9. Scikit-Learn API (HIGH)

- `sklearn-xgbclassifier` - Use XGBClassifier Correctly
- `sklearn-xgbregressor` - Use XGBRegressor Correctly
- `sklearn-pipeline` - Integrate with Sklearn Pipelines
- `sklearn-gridsearchcv` - Use with GridSearchCV
- `sklearn-early-stopping` - Enable Early Stopping in Sklearn API (sketch below)
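
A `sklearn-early-stopping` sketch; in recent XGBoost releases (1.6+) `early_stopping_rounds` belongs in the constructor, while `eval_set` still goes to `fit()`. Arrays `X_train`, `y_train`, `X_val`, `y_val` are assumed:

```python
from xgboost import XGBClassifier

clf = XGBClassifier(
    n_estimators=1000,         # upper bound; early stopping picks the best
    learning_rate=0.05,
    eval_metric="auc",
    early_stopping_rounds=50,  # constructor argument in recent versions
    random_state=42,
)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("best iteration:", clf.best_iteration)
```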

### 10. Advanced Patterns (LOW)

- `advanced-custom-objective` - Implement Custom Objective Functions
- `advanced-custom-metric` - Implement Custom Evaluation Metrics
- `advanced-monotonic` - Apply Monotonic Constraints (sketch below)
- `advanced-feature-interaction` - Apply Feature Interaction Constraints
- `advanced-dart` - Use DART Booster
- `advanced-random-forest` - Use XGBoost as Random Forest
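
An `advanced-monotonic` sketch; the constraint tuple is positional (one entry per feature column), and a three-feature `dtrain` DMatrix is assumed:

```python
import xgboost as xgb

# Force predictions to be non-decreasing in feature 0, non-increasing in
# feature 1, and unconstrained in feature 2.
params = {
    "objective": "reg:squarederror",
    "tree_method": "hist",
    "monotone_constraints": "(1,-1,0)",
}
bst = xgb.train(params, dtrain, num_boost_round=200)
```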

## How to Use

Read individual rule files for detailed explanations and code examples:

- `rules/data-timeseries-split.md`
- `rules/param-learning-rate.md`
- `rules/train-early-stopping.md`

Each rule file contains:
- A brief explanation of why the rule matters
- An incorrect code example with explanation
- A correct code example with explanation
- Additional context and references

## Full Compiled Document

For the complete guide with all rules expanded, see AGENTS.md.

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.