Install this specific skill from the multi-skill repository:

`npx skills add Mindrally/skills --skill "scikit-learn-best-practices"`
# Description
Best practices for scikit-learn machine learning, model development, evaluation, and deployment in Python
# SKILL.md
name: scikit-learn-best-practices
description: Best practices for scikit-learn machine learning, model development, evaluation, and deployment in Python
## Scikit-learn Best Practices
Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.
### Code Style and Structure
- Write concise, technical responses with accurate Python examples
- Prioritize reproducibility in machine learning workflows
- Use functional programming for data pipelines
- Use object-oriented programming for custom estimators
- Prefer vectorized operations over explicit loops
- Follow PEP 8 style guidelines
### Machine Learning Workflow

#### Data Preparation
- Always split data before any preprocessing: train/validation/test
- Use `train_test_split()` with `random_state` for reproducibility
- Stratify splits for imbalanced classification: `stratify=y`
- Keep the test set completely separate until final evaluation
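A minimal sketch of such a split; the synthetic imbalanced dataset from `make_classification` stands in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset standing in for real data
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

# Split before any preprocessing; stratify so class proportions are preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```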
#### Feature Engineering
- Scale features appropriately for distance-based algorithms
- Use `StandardScaler` for normally distributed features
- Use `MinMaxScaler` for bounded features
- Use `RobustScaler` for data with outliers
- Encode categorical variables: `OneHotEncoder`, `OrdinalEncoder`, `LabelEncoder`
- Handle missing values: `SimpleImputer`, `KNNImputer`
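A short sketch of the fit-on-train, transform-on-test pattern that keeps preprocessing leakage-free; the choice of `SimpleImputer` plus `StandardScaler` here is purely illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_test = np.array([[3.0], [np.nan]])

# Fit transformers on training data only, then reuse them on the test set
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

X_train_clean = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_clean = scaler.transform(imputer.transform(X_test))
```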
#### Pipelines

- Always use `Pipeline` to chain preprocessing and modeling
- Prevents data leakage by fitting transformers only on training data
- Makes code cleaner and more reproducible
- Enables easy deployment and serialization
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
```
#### Column Transformers

- Use `ColumnTransformer` for different preprocessing per feature type
- Combine numeric and categorical preprocessing in a single pipeline
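A sketch of combining both preprocessing paths with a `ColumnTransformer`; the column names (`age`, `income`, `city`) are hypothetical placeholders:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

numeric_features = ["age", "income"]       # hypothetical numeric columns
categorical_features = ["city"]            # hypothetical categorical column

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# Preprocessing and model travel together as one estimator
model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42)),
])
```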
### Model Selection and Tuning

#### Cross-Validation
- Use cross-validation for reliable performance estimates
- `cross_val_score()` for quick evaluation
- `cross_validate()` for multiple metrics
- Use an appropriate CV strategy:
  - `KFold` for regression
  - `StratifiedKFold` for classification
  - `TimeSeriesSplit` for temporal data
  - `GroupKFold` for grouped data
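A minimal sketch of stratified cross-validation; the synthetic data and F1 scoring are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=42)
clf = RandomForestClassifier(random_state=42)

# Stratified folds keep class proportions consistent across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```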
#### Hyperparameter Tuning

- Use `GridSearchCV` for exhaustive search
- Use `RandomizedSearchCV` for large parameter spaces
- Always tune on training/validation data, never test data
- Set `n_jobs=-1` for parallel processing
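A sketch of a grid search; the parameter grid shown is an example, not a recommended search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1,  # parallelize across all available cores
)
search.fit(X, y)  # in practice, fit on training data only
print(search.best_params_, search.best_score_)
```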
### Model Evaluation

#### Classification Metrics
- Use appropriate metrics for your problem:
  - `accuracy_score` for balanced classes
  - `precision_score`, `recall_score`, `f1_score` for imbalanced classes
  - `roc_auc_score` for ranking ability
- Use `classification_report()` for a comprehensive overview
- Examine `confusion_matrix()` for error analysis
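For example, assuming `y_test` and `y_pred` hold the true and predicted labels:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder labels; in practice these come from your held-out data and model
y_test = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```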
#### Regression Metrics

- `mean_squared_error` (MSE) for general use
- `mean_absolute_error` (MAE) for interpretability
- `r2_score` for explained variance
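A quick illustration with hypothetical predictions:

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true values and model predictions
y_true = [3.0, 2.5, 4.0, 7.1]
y_pred = [2.8, 2.9, 4.2, 6.5]

print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R2: ", r2_score(y_true, y_pred))
```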
#### Evaluation Best Practices
- Report confidence intervals, not just point estimates
- Use multiple metrics to understand model behavior
- Compare against meaningful baselines
- Evaluate on held-out test set only once, at the end
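One way to report spread alongside the point estimate is the mean and standard deviation of cross-validation scores, sketched here with `cross_validate` and a few illustrative metrics:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, random_state=42)
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    scoring=["accuracy", "f1", "roc_auc"],
)
# Report mean +/- std for each metric rather than a single number
for metric in ["test_accuracy", "test_f1", "test_roc_auc"]:
    print(f"{metric}: {results[metric].mean():.3f} +/- {results[metric].std():.3f}")
```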
### Handling Imbalanced Data
- Use stratified splitting and cross-validation
- Consider class weights: `class_weight='balanced'`
- Use appropriate metrics (F1, AUC-PR), not accuracy
- Adjust decision threshold based on business needs
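A sketch tying these points together; the 0.3 threshold is purely illustrative and should be chosen from validation data and business constraints:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Balanced class weights up-weight the minority class during fitting
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Adjust the decision threshold instead of relying on the default 0.5
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.3).astype(int)  # illustrative threshold favoring recall
print("F1:", f1_score(y_test, y_pred))
```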
### Feature Selection

- Use `SelectKBest` with statistical tests
- Use `RFE` (Recursive Feature Elimination)
- Use model-based selection: `SelectFromModel`
- Examine feature importances from tree-based models
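A brief sketch of univariate and model-based selection; the `k=5` choice and synthetic data are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectFromModel, f_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)

# Univariate selection via an ANOVA F-test
kbest = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Model-based selection using tree feature importances
model_select = SelectFromModel(RandomForestClassifier(random_state=42)).fit(X, y)

print("Univariate kept:", kbest.get_support().sum())
print("Model-based kept:", model_select.get_support().sum())
```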
### Model Persistence

- Use `joblib` for saving and loading models
- Save entire pipelines, not just models
- Version control model artifacts
- Document model metadata
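A minimal persistence sketch; the filename `model_pipeline.joblib` is an arbitrary choice:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=42)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Persist the whole fitted pipeline, not just the model
joblib.dump(pipeline, "model_pipeline.joblib")
loaded = joblib.load("model_pipeline.joblib")
print(loaded.predict(X[:5]))
```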
### Performance Optimization

- Use `n_jobs=-1` for parallel processing where available
- Consider `warm_start=True` for iterative training
- Use sparse matrices for high-dimensional sparse data
- Consider incremental learning with `partial_fit()` for large data
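A sketch of incremental learning with `SGDClassifier.partial_fit`, simulating batches of streaming data rather than loading everything into memory at once:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=42)
classes = np.array([0, 1])  # all classes must be declared on the first call

# Simulated data stream; each batch stands in for a chunk read from disk
rng = np.random.default_rng(42)
for _ in range(10):
    X_batch = rng.normal(size=(100, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)
```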
### Key Conventions

- Import from submodules: `from sklearn.ensemble import RandomForestClassifier`
- Set `random_state` for reproducibility
- Use pipelines to prevent data leakage
- Document model choices and hyperparameters
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.