Install this specific skill from the multi-skill repository:

```sh
npx skills add indiosmo/skills --skill "using-pandas"
```
# Description
Idiomatic pandas usage patterns and performance best practices. Use when writing or reviewing pandas code to ensure: (1) Modern API usage (loc/iloc, method chaining, pipe), (2) Performance optimization (vectorization, dtypes, avoiding apply), (3) Proper data reshaping (tidy data, melt/pivot), (4) Correct handling of Copy-on-Write, categoricals, time series, (5) Avoiding common gotchas and antipatterns.
# SKILL.md

```yaml
---
name: using-pandas
description: "Idiomatic pandas usage patterns and performance best practices. Use when writing or reviewing pandas code to ensure: (1) Modern API usage (loc/iloc, method chaining, pipe), (2) Performance optimization (vectorization, dtypes, avoiding apply), (3) Proper data reshaping (tidy data, melt/pivot), (4) Correct handling of Copy-on-Write, categoricals, time series, (5) Avoiding common gotchas and antipatterns."
---
```
## Pandas Best Practices

Guidelines for writing idiomatic, performant pandas code.
### Core Principles

#### 1. Tidy Data
Structure data so that:
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table
```python
# Tidy: one observation per row
# date       | city | temperature
# 2024-01-01 | NYC  | 32
# 2024-01-01 | LA   | 68

# Not tidy: cities as columns
# date       | NYC | LA
# 2024-01-01 | 32  | 68

# Convert wide to tidy
df_tidy = df.melt(id_vars=['date'], var_name='city', value_name='temperature')
```
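A runnable version of the example above, built from the data in the comments:

```python
import pandas as pd

# Wide: cities as columns (not tidy)
df = pd.DataFrame({
    'date': ['2024-01-01'],
    'NYC': [32],
    'LA': [68],
})

# Tidy: one observation per row
df_tidy = df.melt(id_vars=['date'], var_name='city', value_name='temperature')
# -> rows: ('2024-01-01', 'NYC', 32) and ('2024-01-01', 'LA', 68)
```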
#### 2. Method Chaining

Chain operations for readable, debuggable code:
```python
result = (
    df
    .query('age > 25')
    .assign(income_bracket=lambda x: pd.cut(x['income'], bins=5))
    .groupby('income_bracket')
    .agg(count=('id', 'size'), avg_age=('age', 'mean'))
    .reset_index()
)
```
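One way to inspect intermediate state without breaking a chain is a passthrough helper piped into the middle; a sketch (`peek` is a hypothetical helper, not part of the skill):

```python
def peek(df, label=''):
    # Print the shape at this point in the chain, then pass the data through unchanged
    print(f'{label}: shape={df.shape}')
    return df

result = (
    df
    .query('age > 25')
    .pipe(peek, 'after query')  # inspect without breaking the chain
    .assign(income_bracket=lambda x: pd.cut(x['income'], bins=5))
)
```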
#### 3. Vectorization Over Iteration

Never iterate rows when vectorized operations exist:
```python
# Bad: Row iteration
for idx, row in df.iterrows():
    df.loc[idx, 'result'] = row['a'] + row['b']

# Good: Vectorized
df['result'] = df['a'] + df['b']
```
#### 4. Copy-on-Write (pandas 3.0 Default)

Copy-on-Write prevents accidental mutations:
```python
# Pre-CoW (problematic)
df2 = df[df['a'] > 0]
df2['b'] = 1  # May or may not modify df

# With CoW (safe)
df2 = df[df['a'] > 0]
df2['b'] = 1  # Never modifies df

# Enable CoW explicitly (pandas < 3.0)
pd.options.mode.copy_on_write = True
```
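A minimal self-contained check of this behavior (assumes pandas 2.x with the option enabled, or pandas 3.0 where CoW is always on):

```python
import pandas as pd

df = pd.DataFrame({'a': [-1, 1, 2], 'b': [0, 0, 0]})
df2 = df[df['a'] > 0]
df2['b'] = 1

# Under CoW the original is untouched
print(df['b'].tolist())   # [0, 0, 0]
print(df2['b'].tolist())  # [1, 1]
```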
### Indexing Best Practices

#### Always Use `.loc` and `.iloc`
| Method | Use For | Slice Behavior |
|---|---|---|
| `.loc[]` | Labels | Inclusive on both ends |
| `.iloc[]` | Positions | Exclusive end |
```python
# Label-based (inclusive)
df.loc['2024-01':'2024-06']  # Includes both Jan and Jun
df.loc[df['col'] > 5, 'target']

# Position-based (exclusive end)
df.iloc[0:5]     # Rows 0-4
df.iloc[:, 0:3]  # Columns 0-2
```
#### Never Use Chained Indexing
```python
# Bad: Chained indexing (unpredictable behavior)
df[df['a'] > 0]['b'] = 1  # May silently do nothing; raises SettingWithCopyWarning

# Good: Single .loc
df.loc[df['a'] > 0, 'b'] = 1
```
#### MultiIndex Slicing
```python
# Use pd.IndexSlice for complex MultiIndex selection
idx = pd.IndexSlice
df.loc[idx['2024', :], :]              # All second-level indices for '2024'
df.loc[idx[:, 'category_a'], 'value']  # Specific second level
```
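A self-contained sketch, using an illustrative `(year, category)` index (the names are made up for demonstration):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['2023', '2024'], ['category_a', 'category_b']],
    names=['year', 'category'],
)
df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)

idx = pd.IndexSlice
print(df.loc[idx['2024', :], :])              # both categories for 2024
print(df.loc[idx[:, 'category_a'], 'value'])  # category_a across both years
```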
### Performance Patterns

#### Avoid DataFrame.apply(axis=1)

`apply(axis=1)` iterates row by row in Python, so it is extremely slow:
```python
# Bad: Row-wise apply
df['result'] = df.apply(lambda row: row['a'] + row['b'] * 2, axis=1)

# Good: Vectorized
df['result'] = df['a'] + df['b'] * 2
```
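Conditional logic is a common reason for reaching for `apply(axis=1)`, and it vectorizes too. A sketch using NumPy's `select`/`where` (column and label names are illustrative):

```python
import numpy as np

# Vectorized multi-branch conditional instead of a row-wise apply
conditions = [df['a'] > 10, df['a'] > 5]
choices = ['high', 'medium']
df['bucket'] = np.select(conditions, choices, default='low')

# Single condition: np.where
df['flag'] = np.where(df['b'] < 0, 'neg', 'non-neg')
```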
#### Build DataFrames Efficiently
```python
# Bad: Iterative building (O(n^2) -- each concat copies all prior rows)
df = pd.DataFrame()
for item in items:
    df = pd.concat([df, pd.DataFrame([item])])

# Good: Collect then create (O(n))
rows = list(items)
df = pd.DataFrame(rows)
```
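A concrete version of the pattern, assuming rows are produced one at a time as plain dicts:

```python
# Accumulate plain dicts in a list, build the DataFrame once at the end
records = []
for i in range(3):
    records.append({'a': i, 'b': i * 2})

df = pd.DataFrame.from_records(records)
```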
#### Use pd.concat(), Not append

```python
# DataFrame.append was removed in pandas 2.0; combine with concat
df = pd.concat([df1, df2, df3], ignore_index=True)
```
#### Prefer Built-in GroupBy Methods
```python
# Slow: Custom function
df.groupby('key')['value'].apply(lambda x: x.max() - x.min())

# Fast: Built-in
g = df.groupby('key')['value']
g.max() - g.min()
```
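The same group-wise range can also come from a single pass with named built-ins:

```python
# One grouped pass computes both extremes; subtract the resulting columns
stats = df.groupby('key')['value'].agg(['min', 'max'])
value_range = stats['max'] - stats['min']
```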
#### Choose Appropriate dtypes
```python
# Low-cardinality strings -> category
df['status'] = df['status'].astype('category')

# PyArrow strings (pandas 2.0+)
df['name'] = df['name'].astype('string[pyarrow]')

# Downcast numerics
df['small_int'] = pd.to_numeric(df['small_int'], downcast='integer')
```
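To verify what a conversion buys, compare memory before and after (a sketch using the `status` column from above):

```python
# Measure the memory impact of a categorical conversion
before = df['status'].memory_usage(deep=True)
after = df['status'].astype('category').memory_usage(deep=True)
print(f'{before:,} bytes -> {after:,} bytes')
```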
See `references/performance.md` for detailed optimization patterns.
### Data Reshaping

#### Melt: Wide to Long
```text
BEFORE (wide):              AFTER (long):
id | name  | 2022 | 2023    id | name  | year | value
---+-------+------+------   ---+-------+------+------
1  | Alice | 100  | 120     1  | Alice | 2022 | 100
2  | Bob   | 200  | 250     1  | Alice | 2023 | 120
                            2  | Bob   | 2022 | 200
                            2  | Bob   | 2023 | 250
```
```python
df_long = pd.melt(
    df,
    id_vars=['id', 'name'],       # Keep as columns
    value_vars=['2022', '2023'],  # Convert to rows
    var_name='year',
    value_name='value'
)
```
#### Pivot: Long to Wide
```text
BEFORE (long):         AFTER (wide):
id | year | value      id | 2022 | 2023
---+------+------      ---+------+------
1  | 2022 | 100        1  | 100  | 120
1  | 2023 | 120        2  | 200  | 250
2  | 2022 | 200
2  | 2023 | 250
```
```python
df_wide = df.pivot(
    index='id',
    columns='year',
    values='value'
)

# With aggregation (handles duplicates)
df_wide = df.pivot_table(
    index='id',
    columns='year',
    values='value',
    aggfunc='sum'
)
```
#### Adding Columns in Chains
```python
# Use assign() for method chaining
df = (
    df
    .assign(
        total=lambda x: x['a'] + x['b'],
        ratio=lambda x: x['a'] / x['total']
    )
)
```
### Common Gotchas

#### Truth Value Ambiguity
```python
# Error: Ambiguous truth value
if df['col'] > 5:  # Raises ValueError: a Series has multiple values
    pass

# Solution: Use .any() or .all()
if (df['col'] > 5).any():
    pass

# For boolean operations, use bitwise operators; parentheses are required
# because & and | bind tighter than comparison operators
df[(df['a'] > 5) & (df['b'] < 10)]  # Not 'and'
df[(df['a'] > 5) | (df['b'] < 10)]  # Not 'or'
```
#### The `in` Operator Tests the Index
```python
# This checks the INDEX, not the values
'value' in df['col']  # Wrong!

# Check values with .isin() or .values
'value' in df['col'].values  # Correct
df['col'].isin(['value'])    # For multiple values
```
#### Integer Coercion with NaN
```python
# Integers become float when NaN is present
df['int_col'] = [1, 2, None, 4]  # dtype: float64

# Use the nullable integer type
df['int_col'] = pd.array([1, 2, None, 4], dtype='Int64')  # Capital I
```
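An existing float column can be converted the same way; `NaN` becomes `pd.NA`:

```python
# Convert a float column (with NaN) to nullable integers in place
df['int_col'] = df['int_col'].astype('Int64')
```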
#### Chained Assignment
```python
# May fail silently
df[df['a'] > 0]['b'] = 1

# Always use a single .loc
df.loc[df['a'] > 0, 'b'] = 1
```
#### Comparing with None
```python
# Wrong: Comparison with None
df[df['col'] == None]  # Doesn't work as expected

# Correct: Use isna/notna
df[df['col'].isna()]
df[df['col'].notna()]
```
### Method Chaining with `pipe()`

For custom functions in chains:
```python
import numpy as np

def add_features(df, multiplier=2):
    return df.assign(
        doubled=df['value'] * multiplier,
        log_value=np.log1p(df['value'])
    )

result = (
    df
    .query('active == True')
    .pipe(add_features, multiplier=3)
    .groupby('category')
    .agg({'doubled': 'mean'})
)
```
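`pipe` passes the DataFrame as the first positional argument; when a function expects it under a keyword instead, pandas accepts a `(callable, keyword)` tuple. A sketch with a hypothetical `fit` function:

```python
def fit(formula, data):
    # Hypothetical: expects the DataFrame as the 'data' keyword, not the first arg
    return data.assign(fitted=True)

result = df.pipe((fit, 'data'), formula='y ~ x')
```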
### Quick Reference

#### Selection Patterns
| Task | Code |
|---|---|
| Filter rows | `df.query('col > 5')` or `df[df['col'] > 5]` |
| Select columns | `df[['a', 'b']]` or `df.loc[:, ['a', 'b']]` |
| Filter + select | `df.loc[df['a'] > 5, ['b', 'c']]` |
| By dtype | `df.select_dtypes(include=['number'])` |
#### Aggregation Patterns
| Task | Code |
|---|---|
| Group aggregate | `df.groupby('key').agg(total=('val', 'sum'))` |
| Group transform | `df.groupby('key')['val'].transform('mean')` |
| Rolling | `df['val'].rolling(7).mean()` |
| Expanding | `df['val'].expanding().sum()` |
#### Reshaping Patterns
| Task | Code |
|---|---|
| Wide to long | `pd.melt(df, id_vars=['id'], value_vars=['a', 'b'])` |
| Long to wide | `df.pivot(index='id', columns='key', values='val')` |
| Add column | `df.assign(new=lambda x: x['a'] + x['b'])` |
| Concatenate | `pd.concat([df1, df2], ignore_index=True)` |
### Reference Files

| File | Contents | When to Consult |
|---|---|---|
| `references/performance.md` | Optimization patterns | Code running slowly |
| `references/io-formats.md` | File format selection | Reading/writing data |
| `references/timeseries.md` | Time series patterns | Working with dates |
| `references/groupby-window.md` | GroupBy and windows | Split-apply-combine |
# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.