mindrally

scikit-learn-best-practices

# Install this skill:
npx skills add Mindrally/skills --skill "scikit-learn-best-practices"

Installs a specific skill from a multi-skill repository.

# Description

Best practices for scikit-learn machine learning, model development, evaluation, and deployment in Python

# SKILL.md


---
name: scikit-learn-best-practices
description: Best practices for scikit-learn machine learning, model development, evaluation, and deployment in Python
---


Scikit-learn Best Practices

Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.

Code Style and Structure

  • Write concise, technical responses with accurate Python examples
  • Prioritize reproducibility in machine learning workflows
  • Use functional programming for data pipelines
  • Use object-oriented programming for custom estimators
  • Prefer vectorized operations over explicit loops
  • Follow PEP 8 style guidelines

Machine Learning Workflow

Data Preparation

  • Always split data before any preprocessing: train/validation/test
  • Use train_test_split() with random_state for reproducibility
  • Stratify splits for imbalanced classification: stratify=y
  • Keep test set completely separate until final evaluation
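
A minimal sketch of a stratified hold-out split; the synthetic dataset and split size are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=42)

# Hold out the test set before any preprocessing; stratify preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```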

Feature Engineering

  • Scale features appropriately for distance-based algorithms
  • Use StandardScaler for normally distributed features
  • Use MinMaxScaler for bounded features
  • Use RobustScaler for data with outliers
  • Encode categorical features with OneHotEncoder or OrdinalEncoder; reserve LabelEncoder for target labels
  • Handle missing values: SimpleImputer, KNNImputer
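
A rough sketch of these transformers on toy arrays; in practice they belong inside a Pipeline or ColumnTransformer (see the next sections) so they are fit on training data only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

X_num = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0]])   # toy numeric features
X_cat = np.array([['red'], ['blue'], ['red']])                   # toy categorical feature

X_num_filled = SimpleImputer(strategy='median').fit_transform(X_num)
X_num_scaled = StandardScaler().fit_transform(X_num_filled)
X_cat_onehot = OneHotEncoder(handle_unknown='ignore').fit_transform(X_cat)
```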

Pipelines

  • Always use Pipeline to chain preprocessing and modeling steps
  • Pipelines prevent data leakage by fitting transformers on the training data only
  • They make code cleaner and more reproducible
  • They enable easy deployment and serialization

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
```
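
Assuming the train/test split from earlier, the fitted pipeline applies the same learned scaling at prediction time:

```python
pipeline.fit(X_train, y_train)        # scaler is fit on training data only
y_pred = pipeline.predict(X_test)     # test data is transformed with the learned scaling
```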

Column Transformers

  • Use ColumnTransformer for different preprocessing per feature type
  • Combine numeric and categorical preprocessing in single pipeline
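
A sketch with hypothetical column names for a pandas DataFrame; the preprocessing choices are illustrative:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['age', 'income']          # hypothetical numeric columns
categorical_features = ['city', 'plan']       # hypothetical categorical columns

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

model = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])
```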

Model Selection and Tuning

Cross-Validation

  • Use cross-validation for reliable performance estimates
  • Use cross_val_score() for quick evaluation and cross_validate() for multiple metrics
  • Choose an appropriate CV strategy:
      • KFold for regression
      • StratifiedKFold for classification
      • TimeSeriesSplit for temporal data
      • GroupKFold for grouped data
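
For example, a stratified 5-fold evaluation on synthetic data (the model and scoring choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv, scoring='f1')
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```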

Hyperparameter Tuning

  • Use GridSearchCV for exhaustive search
  • Use RandomizedSearchCV for large parameter spaces
  • Always tune on training/validation data, never test data
  • Set n_jobs=-1 for parallel processing
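
A minimal grid-search sketch; the parameter grid is illustrative. When tuning a pipeline, prefix parameter names with the step name (e.g. classifier__n_estimators):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=500, random_state=42)

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 30],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,                    # use all cores
)
search.fit(X_train, y_train)      # tune on training data only, never the test set
print(search.best_params_, search.best_score_)
```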

Model Evaluation

Classification Metrics

  • Use metrics appropriate to your problem:
      • accuracy_score for balanced classes
      • precision_score, recall_score, and f1_score for imbalanced classes
      • roc_auc_score for ranking ability
  • Use classification_report() for a comprehensive overview
  • Examine confusion_matrix() for error analysis
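
For instance, with placeholder labels and predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_test = [0, 0, 1, 1, 1, 0]       # placeholder true labels
y_pred = [0, 1, 1, 1, 0, 0]       # placeholder predictions from a fitted model

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```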

Regression Metrics

  • mean_squared_error (MSE) for general use
  • mean_absolute_error (MAE) for interpretability
  • r2_score for explained variance
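
With placeholder targets and predictions:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]    # placeholder true values
y_pred = [2.5, 0.0, 2.0, 8.0]     # placeholder predictions

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```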

Evaluation Best Practices

  • Report confidence intervals, not just point estimates
  • Use multiple metrics to understand model behavior
  • Compare against meaningful baselines
  • Evaluate on held-out test set only once, at the end
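
As a rough sketch, fold-score spread (mean ± 2·std, an approximation rather than a formal confidence interval) reported alongside a trivial baseline:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

models = [
    ('baseline', DummyClassifier(strategy='most_frequent')),
    ('forest', RandomForestClassifier(random_state=42)),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} +/- {2 * scores.std():.3f}")
```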

Handling Imbalanced Data

  • Use stratified splitting and cross-validation
  • Consider class weights: class_weight='balanced'
  • Use appropriate metrics (F1, AUC-PR, not accuracy)
  • Adjust decision threshold based on business needs
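
A sketch combining class weights with a manually adjusted threshold; the 0.3 cutoff is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' reweights classes inversely to their frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1_000)
clf.fit(X_train, y_train)

# Move the decision threshold instead of relying on the default 0.5
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.3).astype(int)
```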

Feature Selection

  • Use SelectKBest with statistical tests
  • Use RFE (Recursive Feature Elimination)
  • Use model-based selection: SelectFromModel
  • Examine feature importances from tree-based models
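
For example, model-based selection that keeps features above the median importance (the threshold choice is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=42)

selector = SelectFromModel(RandomForestClassifier(random_state=42), threshold='median')
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)               # roughly half the features remain
```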

Model Persistence

  • Use joblib for saving and loading models
  • Save entire pipelines, not just models
  • Version control model artifacts
  • Document model metadata
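
Assuming the fitted pipeline from the earlier example; the filename is illustrative:

```python
import joblib

# Persist the whole pipeline (preprocessing + model) as one artifact
joblib.dump(pipeline, 'model_v1.joblib')

# Later: load it and predict with the exact same preprocessing
loaded = joblib.load('model_v1.joblib')
y_pred = loaded.predict(X_test)
```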

Performance Optimization

  • Use n_jobs=-1 for parallel processing where available
  • Consider warm_start=True for iterative training
  • Use sparse matrices for high-dimensional sparse data
  • Consider incremental learning with partial_fit() for large data
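
A sketch of incremental learning with an estimator that supports partial_fit(); chunking an in-memory array stands in for streaming a dataset that does not fit in memory:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, random_state=42)

clf = SGDClassifier(random_state=42)
classes = np.unique(y)                 # partial_fit needs every class declared up front

for start in range(0, len(X), 1_000):  # feed the data in chunks
    clf.partial_fit(X[start:start + 1_000], y[start:start + 1_000], classes=classes)
```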

Key Conventions

  • Import from submodules: from sklearn.ensemble import RandomForestClassifier
  • Set random_state for reproducibility
  • Use pipelines to prevent data leakage
  • Document model choices and hyperparameters

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.