Install this specific skill from the multi-skill repository:

```bash
npx skills add SellTheSun/STS_Skills --skill "xgboost-best-practices"
```
# Description
XGBoost machine learning best practices for training, tuning, and deploying gradient boosted models. Use when writing, reviewing, or implementing XGBoost models for classification, regression, or ranking tasks. Triggers on tasks involving XGBoost training, hyperparameter optimization, data preparation, model evaluation, or deployment.
# SKILL.md
```yaml
---
name: xgboost-best-practices
description: XGBoost machine learning best practices for training, tuning, and deploying gradient boosted models. Use when writing, reviewing, or implementing XGBoost models for classification, regression, or ranking tasks. Triggers on tasks involving XGBoost training, hyperparameter optimization, data preparation, model evaluation, or deployment.
license: MIT
metadata:
  author: xgboost-community
  version: "1.0.0"
---
```
# XGBoost Best Practices
Comprehensive optimization and best practices guide for XGBoost machine learning applications. Contains 60+ rules across 10 categories, prioritized by impact to guide automated code generation and model training workflows.
> [!NOTE]
> For the complete guide with all rules and code examples, read AGENTS.md.
> This SKILL.md provides a quick reference; AGENTS.md contains detailed explanations and incorrect/correct code patterns for all 60 rules.
## When to Apply
Reference these guidelines when:
- Preparing data for XGBoost training (DMatrix, categorical features, missing values)
- Configuring hyperparameters (learning rate, tree depth, regularization)
- Training models (early stopping, cross-validation, callbacks)
- Tuning hyperparameters (grid search, Optuna, overfitting diagnosis)
- Evaluating models (feature importance, SHAP, metrics)
- Persisting and deploying models (JSON format, version compatibility)
- Optimizing performance (GPU, distributed, external memory)
- Integrating with scikit-learn pipelines
## Environment Context (Ask First)
> [!IMPORTANT]
> If the training environment is unclear, ask: LOCAL vs CLOUD, OS, GPU vendor (NVIDIA/AMD/none), and XGBoost version (1.x vs 2.x+).
GPU support guidance:
- NVIDIA CUDA: Full support (cloud and local)
- AMD ROCm: Limited, Linux-only, verify build
- AMD Windows: CPU training only (tree_method='hist')
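As an illustration, a minimal parameter sketch for device selection (this assumes XGBoost 2.x, where the `device` parameter replaced the 1.x `gpu_hist` tree method):

```python
import xgboost as xgb

# XGBoost 2.x+: select the device explicitly; "hist" is the recommended tree method.
params_gpu = {"tree_method": "hist", "device": "cuda"}  # NVIDIA CUDA (cloud or local)
params_cpu = {"tree_method": "hist", "device": "cpu"}   # CPU-only fallback (e.g. AMD on Windows)

# XGBoost 1.x used tree_method="gpu_hist" instead of the device parameter.
```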
## Rule Categories by Priority
| Priority | Category | Impact | Prefix |
|---|---|---|---|
| 1 | Data Preparation | CRITICAL | data- |
| 2 | General Parameters | CRITICAL | param- |
| 3 | Training | HIGH | train- |
| 4 | Hyperparameter Tuning | HIGH | tune- |
| 5 | Evaluation | MEDIUM | eval- |
| 6 | Model Persistence | HIGH | persist- |
| 7 | Deployment | MEDIUM | deploy- |
| 8 | Performance | MEDIUM | perf- |
| 9 | Scikit-Learn API | HIGH | sklearn- |
| 10 | Advanced Patterns | LOW | advanced- |
## Quick Reference
### 1. Data Preparation (CRITICAL)
- `data-dmatrix` - Use DMatrix for Native API
- `data-missing-values` - Handle Missing Values Explicitly
- `data-feature-engineering` - Feature Scaling Usually Not Required
- `data-categorical` - Enable Native Categorical Support
- `data-timeseries-split` - Use Time-Based Cross-Validation
- `data-purged-cv` - Use Purged/Embargo CV for Finance Labels
- `data-label-encoding` - Encode Target Variables Correctly
- `data-weight-imbalance` - Use Sample Weights for Imbalanced Data
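A minimal sketch combining several of these rules (native categorical support plus explicit missing values; the frame and column names are illustrative):

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# Illustrative data: NaN marks a missing value, "city" is a pandas category dtype.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "city": pd.Categorical(["NY", "SF", "NY", "LA"]),
    "label": [0, 1, 0, 1],
})

# enable_categorical lets XGBoost split on category columns without one-hot encoding;
# NaN entries are routed down a learned default branch, so no imputation is needed.
dtrain = xgb.DMatrix(df[["age", "city"]], label=df["label"], enable_categorical=True)
```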
### 2. General Parameters (CRITICAL)
- `param-learning-rate` - Set Lower Learning Rate with More Trees
- `param-max-depth` - Control Tree Depth to Prevent Overfitting
- `param-min-child-weight` - Set Minimum Child Weight
- `param-gamma` - Use Gamma for Minimum Split Loss
- `param-subsample` - Enable Row Subsampling
- `param-colsample` - Enable Column Subsampling
- `param-regularization` - Apply L1/L2 Regularization
- `param-objective` - Choose Correct Objective Function
- `param-tree-method` - Select Appropriate Tree Method
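As a rough starting point touching most of these knobs (values are common defaults to tune from, not recommendations):

```python
params = {
    "objective": "binary:logistic",  # match the objective to the task
    "eta": 0.05,                     # lower learning rate, paired with more boosting rounds
    "max_depth": 6,                  # cap tree depth to limit overfitting
    "min_child_weight": 1,
    "gamma": 0.0,                    # minimum loss reduction required to split
    "subsample": 0.8,                # row subsampling per tree
    "colsample_bytree": 0.8,         # column subsampling per tree
    "alpha": 0.0,                    # L1 regularization
    "lambda": 1.0,                   # L2 regularization
    "tree_method": "hist",
}
```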
### 3. Training (HIGH)
- `train-early-stopping` - Always Use Early Stopping
- `train-cross-validation` - Use Built-in Cross-Validation
- `train-evaluation-metric` - Monitor Multiple Metrics
- `train-callbacks` - Use Callbacks for Logging
- `train-watchlist` - Monitor Training and Validation Loss
- `train-seed` - Set Random Seed for Reproducibility
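A sketch combining a watchlist, early stopping, and a fixed seed (reusing the `params` dictionary above; `dtrain` and `dvalid` are assumed DMatrix objects):

```python
import xgboost as xgb

params["seed"] = 42  # reproducibility
evals = [(dtrain, "train"), (dvalid, "valid")]  # watchlist: monitor both losses

booster = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    evals=evals,
    early_stopping_rounds=50,  # stop once the validation metric stalls
    verbose_eval=100,
)
print("best iteration:", booster.best_iteration)
```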
### 4. Hyperparameter Tuning (HIGH)
- `tune-grid-search` - Start with Coarse Grid Search
- `tune-random-search` - Use Random Search for Efficiency
- `tune-bayesian-optuna` - Use Optuna for Bayesian Optimization
- `tune-overfitting` - Diagnose Overfitting vs Underfitting
- `tune-parameter-order` - Tune Parameters in Optimal Order
- `tune-scale-pos-weight` - Handle Class Imbalance
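For instance, a minimal Optuna sketch wrapping `xgb.cv` (the search ranges are illustrative, and `dtrain` is assumed from earlier):

```python
import optuna
import xgboost as xgb

def objective(trial):
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "eta": trial.suggest_float("eta", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "tree_method": "hist",
    }
    cv = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                early_stopping_rounds=30, seed=42)
    return cv["test-auc-mean"].iloc[-1]  # AUC at the early-stopped round

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```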
### 5. Evaluation (MEDIUM)
- `eval-feature-importance` - Use gain-based Feature Importance
- `eval-shap` - Use SHAP for Model Interpretability
- `eval-confusion-matrix` - Generate Confusion Matrix for Classification
- `eval-roc-auc` - Evaluate with ROC-AUC for Binary
- `eval-regression-metrics` - Use RMSE, MAE, R² for Regression
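A short sketch of gain-based importance and SHAP attribution (assumes the `shap` package, the trained `booster` from earlier, and a validation frame `X_valid`):

```python
import shap

# Gain-based importance: average loss reduction per split that uses the feature.
importance = booster.get_score(importance_type="gain")

# SHAP values give per-prediction attributions rather than one global ranking.
explainer = shap.TreeExplainer(booster)
shap_values = explainer.shap_values(X_valid)
shap.summary_plot(shap_values, X_valid)
```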
### 6. Model Persistence (HIGH)
- `persist-save-model` - Save Models in JSON/UBJSON Format
- `persist-load-model` - Load Models Correctly
- `persist-version-compat` - Check Version Compatibility
- `persist-feature-names` - Preserve Feature Names
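For example (JSON keeps models readable and version-stable, unlike pickling the Python object):

```python
import xgboost as xgb

booster.save_model("model.json")  # or "model.ubj" for the binary UBJSON variant

loaded = xgb.Booster()
loaded.load_model("model.json")
print(loaded.feature_names)  # feature names round-trip with the JSON format
```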
### 7. Deployment (MEDIUM)
- `deploy-prediction` - Use Correct Prediction Methods
- `deploy-batch-inference` - Optimize Batch Predictions
- `deploy-iteration-range` - Use iteration_range for Best Model
- `deploy-inference-config` - Configure for Inference Performance
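A minimal prediction sketch using the early-stopped model (`X_test` is an assumed feature matrix):

```python
dtest = xgb.DMatrix(X_test)

# Restrict prediction to the trees up to and including the best iteration.
preds = booster.predict(dtest, iteration_range=(0, booster.best_iteration + 1))
```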
### 8. Performance (MEDIUM)
- `perf-gpu-training` - Enable GPU Training (NVIDIA CUDA only)
- `perf-amd-cpu-optimization` - Optimize for AMD Ryzen CPUs
- `perf-local-vs-cloud` - Ask User About Training Environment
- `perf-distributed` - Use Distributed Training for Large Data
- `perf-external-memory` - Use External Memory for Very Large Data
- `perf-nthreads` - Configure Thread Count
- `perf-quantile-dmatrix` - Use QuantileDMatrix for hist
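One sketch using QuantileDMatrix and an explicit thread count (QuantileDMatrix assumes XGBoost 1.7+; `X_train`/`y_train` are illustrative):

```python
import xgboost as xgb

# QuantileDMatrix pre-bins features for tree_method="hist", cutting peak memory.
dtrain_q = xgb.QuantileDMatrix(X_train, label=y_train)

# Set nthread to the physical core count rather than relying on the default.
params = {"objective": "binary:logistic", "tree_method": "hist", "nthread": 8}
booster = xgb.train(params, dtrain_q, num_boost_round=500)
```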
### 9. Scikit-Learn API (HIGH)
- `sklearn-xgbclassifier` - Use XGBClassifier Correctly
- `sklearn-xgbregressor` - Use XGBRegressor Correctly
- `sklearn-pipeline` - Integrate with Sklearn Pipelines
- `sklearn-gridsearchcv` - Use with GridSearchCV
- `sklearn-early-stopping` - Enable Early Stopping in Sklearn API
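A sketch of early stopping through the sklearn wrapper (in recent releases, roughly XGBoost 1.6+, `early_stopping_rounds` is passed to the constructor rather than `fit`):

```python
from xgboost import XGBClassifier

clf = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    eval_metric="auc",
    early_stopping_rounds=50,  # constructor argument in recent XGBoost
)
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
```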
### 10. Advanced Patterns (LOW)
- `advanced-custom-objective` - Implement Custom Objective Functions
- `advanced-custom-metric` - Implement Custom Evaluation Metrics
- `advanced-monotonic` - Apply Monotonic Constraints
- `advanced-feature-interaction` - Apply Feature Interaction Constraints
- `advanced-dart` - Use DART Booster
- `advanced-random-forest` - Use XGBoost as Random Forest
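As one illustration, monotonic constraints in the native API (one entry per feature, in column order; the three-feature matrix is assumed):

```python
# +1 forces predictions non-decreasing in a feature, -1 non-increasing, 0 unconstrained.
params = {
    "objective": "reg:squarederror",
    "tree_method": "hist",
    "monotone_constraints": "(1,-1,0)",  # assumes a three-column training matrix
}
booster = xgb.train(params, dtrain, num_boost_round=200)
```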
## How to Use
Read individual rule files for detailed explanations and code examples:
- `rules/data-timeseries-split.md`
- `rules/param-learning-rate.md`
- `rules/train-early-stopping.md`
Each rule file contains:
- Brief explanation of why it matters
- Incorrect code example with explanation
- Correct code example with explanation
- Additional context and references
## Full Compiled Document
For the complete guide with all rules expanded: AGENTS.md
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.