mindrally

scikit-learn-best-practices

# Install this skill:
npx skills add Mindrally/skills --skill "scikit-learn-best-practices"

Installs a specific skill from a multi-skill repository.

# Description

Best practices for scikit-learn machine learning, model development, evaluation, and deployment in Python

# SKILL.md


---
name: scikit-learn-best-practices
description: Best practices for scikit-learn machine learning, model development, evaluation, and deployment in Python
---


Scikit-learn Best Practices

Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.

Code Style and Structure

  • Write concise, technical responses with accurate Python examples
  • Prioritize reproducibility in machine learning workflows
  • Use functional programming for data pipelines
  • Use object-oriented programming for custom estimators
  • Prefer vectorized operations over explicit loops
  • Follow PEP 8 style guidelines

Machine Learning Workflow

Data Preparation

  • Always split data before any preprocessing: train/validation/test
  • Use train_test_split() with random_state for reproducibility
  • Stratify splits for imbalanced classification: stratify=y
  • Keep test set completely separate until final evaluation
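
A minimal sketch of a stratified hold-out split; the synthetic dataset and split size are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=42)

# Hold out the test set before any preprocessing; stratify preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```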

Feature Engineering

  • Scale features appropriately for distance-based algorithms
  • Use StandardScaler for normally distributed features
  • Use MinMaxScaler for bounded features
  • Use RobustScaler for data with outliers
  • Encode categorical features with OneHotEncoder or OrdinalEncoder; reserve LabelEncoder for target labels
  • Handle missing values: SimpleImputer, KNNImputer
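
A rough sketch of these transformers on toy arrays; in practice they belong inside a Pipeline or ColumnTransformer (see the next sections) so they are fit on training data only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

X_num = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0]])   # toy numeric features
X_cat = np.array([['red'], ['blue'], ['red']])                   # toy categorical feature

X_num_filled = SimpleImputer(strategy='median').fit_transform(X_num)
X_num_scaled = StandardScaler().fit_transform(X_num_filled)
X_cat_onehot = OneHotEncoder(handle_unknown='ignore').fit_transform(X_cat)
```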

Pipelines

  • Always use Pipeline to chain preprocessing and modeling steps
  • Pipelines prevent data leakage by fitting transformers on the training data only
  • They make code cleaner and more reproducible
  • They enable easy deployment and serialization

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
```
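
Assuming the train/test split from earlier, the fitted pipeline applies the same learned scaling at prediction time:

```python
pipeline.fit(X_train, y_train)        # scaler is fit on training data only
y_pred = pipeline.predict(X_test)     # test data is transformed with the learned scaling
```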

Column Transformers

  • Use ColumnTransformer for different preprocessing per feature type
  • Combine numeric and categorical preprocessing in single pipeline
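
A sketch with hypothetical column names for a pandas DataFrame; the preprocessing choices are illustrative:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['age', 'income']          # hypothetical numeric columns
categorical_features = ['city', 'plan']       # hypothetical categorical columns

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

model = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])
```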

Model Selection and Tuning

Cross-Validation

  • Use cross-validation for reliable performance estimates
  • Use cross_val_score() for quick evaluation and cross_validate() for multiple metrics
  • Choose an appropriate CV strategy:
      • KFold for regression
      • StratifiedKFold for classification
      • TimeSeriesSplit for temporal data
      • GroupKFold for grouped data
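
For example, a stratified 5-fold evaluation on synthetic data (the model and scoring choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv, scoring='f1')
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```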

Hyperparameter Tuning

  • Use GridSearchCV for exhaustive search
  • Use RandomizedSearchCV for large parameter spaces
  • Always tune on training/validation data, never test data
  • Set n_jobs=-1 for parallel processing
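
A minimal grid-search sketch; the parameter grid is illustrative. When tuning a pipeline, prefix parameter names with the step name (e.g. classifier__n_estimators):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=500, random_state=42)

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 30],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,                    # use all cores
)
search.fit(X_train, y_train)      # tune on training data only, never the test set
print(search.best_params_, search.best_score_)
```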

Model Evaluation

Classification Metrics

  • Use metrics appropriate to your problem:
      • accuracy_score for balanced classes
      • precision_score, recall_score, and f1_score for imbalanced classes
      • roc_auc_score for ranking ability
  • Use classification_report() for a comprehensive overview
  • Examine confusion_matrix() for error analysis
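
For instance, with placeholder labels and predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_test = [0, 0, 1, 1, 1, 0]       # placeholder true labels
y_pred = [0, 1, 1, 1, 0, 0]       # placeholder predictions from a fitted model

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```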

Regression Metrics

  • mean_squared_error (MSE) for general use
  • mean_absolute_error (MAE) for interpretability
  • r2_score for explained variance
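
With placeholder targets and predictions:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]    # placeholder true values
y_pred = [2.5, 0.0, 2.0, 8.0]     # placeholder predictions

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```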

Evaluation Best Practices

  • Report confidence intervals, not just point estimates
  • Use multiple metrics to understand model behavior
  • Compare against meaningful baselines
  • Evaluate on held-out test set only once, at the end
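
As a rough sketch, fold-score spread (mean ± 2·std, an approximation rather than a formal confidence interval) reported alongside a trivial baseline:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

models = [
    ('baseline', DummyClassifier(strategy='most_frequent')),
    ('forest', RandomForestClassifier(random_state=42)),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} +/- {2 * scores.std():.3f}")
```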

Handling Imbalanced Data

  • Use stratified splitting and cross-validation
  • Consider class weights: class_weight='balanced'
  • Use appropriate metrics (F1, AUC-PR, not accuracy)
  • Adjust decision threshold based on business needs
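
A sketch combining class weights with a manually adjusted threshold; the 0.3 cutoff is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' reweights classes inversely to their frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1_000)
clf.fit(X_train, y_train)

# Move the decision threshold instead of relying on the default 0.5
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.3).astype(int)
```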

Feature Selection

  • Use SelectKBest with statistical tests
  • Use RFE (Recursive Feature Elimination)
  • Use model-based selection: SelectFromModel
  • Examine feature importances from tree-based models
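
For example, model-based selection that keeps features above the median importance (the threshold choice is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=42)

selector = SelectFromModel(RandomForestClassifier(random_state=42), threshold='median')
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)               # roughly half the features remain
```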

Model Persistence

  • Use joblib for saving and loading models
  • Save entire pipelines, not just models
  • Version control model artifacts
  • Document model metadata
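
Assuming the fitted pipeline from the earlier example; the filename is illustrative:

```python
import joblib

# Persist the whole pipeline (preprocessing + model) as one artifact
joblib.dump(pipeline, 'model_v1.joblib')

# Later: load it and predict with the exact same preprocessing
loaded = joblib.load('model_v1.joblib')
y_pred = loaded.predict(X_test)
```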

Performance Optimization

  • Use n_jobs=-1 for parallel processing where available
  • Consider warm_start=True for iterative training
  • Use sparse matrices for high-dimensional sparse data
  • Consider incremental learning with partial_fit() for large data
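
A sketch of incremental learning with an estimator that supports partial_fit(); chunking an in-memory array stands in for streaming a dataset that does not fit in memory:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, random_state=42)

clf = SGDClassifier(random_state=42)
classes = np.unique(y)                 # partial_fit needs every class declared up front

for start in range(0, len(X), 1_000):  # feed the data in chunks
    clf.partial_fit(X[start:start + 1_000], y[start:start + 1_000], classes=classes)
```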

Key Conventions

  • Import from submodules: from sklearn.ensemble import RandomForestClassifier
  • Set random_state for reproducibility
  • Use pipelines to prevent data leakage
  • Document model choices and hyperparameters

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.