Install this specific skill from the multi-skill repository:
npx skills add defi-naly/skillbank --skill "understanding-deep-learning"
# Description
Simon Prince's comprehensive deep learning framework for understanding neural networks, architectures, and training.
# SKILL.md
name: understanding-deep-learning
description: "Simon Prince's comprehensive deep learning framework for understanding neural networks, architectures, and training."
dimensions:
  domain: [machine-learning, AI, data-science, engineering]
  phase: [model-design, training, debugging, architecture-selection]
  problem_type: [neural-networks, optimization, model-selection, technical-implementation]
contexts:
  - situation: "building a machine learning model"
    use_when: "need to choose architecture, loss function, or optimization approach"
  - situation: "model isn't training well"
    use_when: "debugging training issues (loss not decreasing, NaN, overfitting)"
  - situation: "choosing between architectures"
    use_when: "selecting CNN vs RNN vs Transformer for different data types"
  - situation: "understanding a paper or implementation"
    use_when: "need reference for attention, backprop, normalization concepts"
  - situation: "optimizing model performance"
    use_when: "tuning hyperparameters, regularization, learning rate schedules"
combines_with:
  - thinking-fast-and-slow  # meta: human vs machine cognition parallels
  - lean-startup            # experiment design for ML projects
  - hidden-potential        # learning systems and skill development
contrast_with:
  - skill: thinking-fast-and-slow
    distinction: "Deep Learning is MACHINE cognition; TF&S is HUMAN cognition"
## Understanding Deep Learning
### Core Concepts
Deep learning is function approximation using compositions of simple nonlinear functions. Neural networks learn hierarchical representations from data.
INPUT ──► [Layer 1] ──► [Layer 2] ──► ... ──► [Layer N] ──► OUTPUT
              │             │                     │
          Low-level     Mid-level             High-level
          features      features              features
Why deep? Depth enables hierarchical representations. Each layer builds on the previous, creating increasingly abstract features.
## The Supervised Learning Framework
### The Setup
Given: Training data {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
Goal: Learn function f(x, θ) that predicts y from x
Method: Minimize loss L(θ) over parameters θ
### Loss Functions
| Task | Loss Function | Formula |
|---|---|---|
| Regression | Mean Squared Error | L = (1/n)Σ(y - ŷ)² |
| Binary Classification | Binary Cross-Entropy | L = -Σ[y·log(ŷ) + (1-y)·log(1-ŷ)] |
| Multi-class | Cross-Entropy | L = -Σ y·log(ŷ) |
Key insight: The loss function defines what "good" means. Choose it carefully: the model optimizes exactly what you specify.
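To make the table concrete, here is a minimal NumPy sketch of these three losses (mean-reduced rather than summed; all function and variable names are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error for regression."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy; p_pred holds predicted probabilities in (0, 1)."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def cross_entropy(y_onehot, p_pred, eps=1e-12):
    """Multi-class cross-entropy; each row of p_pred sums to 1."""
    return -np.mean(np.sum(y_onehot * np.log(p_pred + eps), axis=1))

# toy check
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
print(mse(y, p), binary_cross_entropy(y, p))
```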
### Gradient Descent
Repeat:
1. Compute gradient: ∇L(θ)
2. Update parameters: θ ← θ - η·∇L(θ)
Until convergence
Learning rate (η): Too high = unstable, too low = slow. This is the most important hyperparameter.
### Stochastic Gradient Descent (SGD)
Use mini-batches instead of full dataset:
- Faster: Don't need full dataset for each update
- Regularizing: Noise helps escape local minima
- Memory efficient: Fits in GPU memory
Batch size tradeoffs:
- Smaller → more noise, better generalization, slower per epoch
- Larger → less noise, faster per epoch, tends to generalize worse
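A minimal NumPy sketch of mini-batch SGD on a synthetic linear-regression problem; the data, learning rate, and batch size are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

theta, lr, batch_size = np.zeros(3), 0.1, 32
for epoch in range(20):
    perm = rng.permutation(len(X))                       # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ theta - yb) / len(idx)   # gradient of mean squared error
        theta -= lr * grad                               # θ ← θ - η·∇L(θ)
print(theta)   # ≈ [2.0, -1.0, 0.5]
```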
## Neural Network Building Blocks
### The Neuron (Perceptron)
x₁ ──w₁──┐
x₂ ──w₂──┼──► Σ (+ b) ──► activation ──► output
x₃ ──w₃──┘
output = activation(w₁x₁ + w₂x₂ + w₃x₃ + b)
### Activation Functions
| Function | Formula | Use Case |
|---|---|---|
| ReLU | max(0, x) | Default for hidden layers |
| Sigmoid | 1/(1+e⁻ˣ) | Binary output, gates |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | Centered output [-1,1] |
| Softmax | eˣⁱ / Σ eˣʲ | Multi-class probabilities |
| GELU | x·Φ(x) | Transformers |
| SiLU/Swish | x·σ(x) | Modern architectures |
Why nonlinearity? Without it, stacked layers collapse to a single linear transformation. Nonlinearity enables learning complex functions.
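A NumPy sketch of a single neuron together with a few of the activations from the table (names and numbers are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())        # subtract max for numerical stability
    return e / e.sum()

def neuron(x, w, b, activation=relu):
    """output = activation(w·x + b)"""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron(x, w, b=0.3))                 # single ReLU neuron
print(softmax(np.array([1.0, 2.0, 3.0])))  # probabilities summing to 1
```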
### Layers
Dense (Fully Connected):
Every neuron connects to every input. output = activation(Wx + b)
Convolutional:
Local connectivity with shared weights. Translation invariant.
Recurrent:
Connections through time. Maintains hidden state.
Attention:
Dynamic weighting of inputs. Learns what to focus on.
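Assuming PyTorch is available, these four layer types look like this in code (shapes are chosen arbitrarily for illustration):

```python
import torch
import torch.nn as nn

x_vec = torch.randn(8, 64)            # batch of 8 feature vectors
x_img = torch.randn(8, 3, 32, 32)     # batch of 8 RGB images
x_seq = torch.randn(8, 20, 64)        # batch of 8 sequences, length 20

dense = nn.Linear(64, 128)                               # fully connected
conv  = nn.Conv2d(3, 16, kernel_size=3, padding=1)       # local connectivity, shared weights
rnn   = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
attn  = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

print(dense(x_vec).shape)                  # [8, 128]
print(conv(x_img).shape)                   # [8, 16, 32, 32]
print(rnn(x_seq)[0].shape)                 # [8, 20, 128]
print(attn(x_seq, x_seq, x_seq)[0].shape)  # [8, 20, 64]  (self-attention)
```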
## Backpropagation
The algorithm for computing gradients efficiently through the chain rule.
Forward pass: Compute outputs layer by layer
Backward pass: Compute gradients layer by layer (in reverse)
### Chain Rule Application
For loss L with intermediate computations:
∂L/∂w = ∂L/∂output · ∂output/∂w
Gradients flow backward through the computation graph.
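A worked NumPy sketch of the forward and backward passes for a tiny one-hidden-layer network with MSE loss; each gradient line below is one application of the chain rule (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 inputs
y = rng.normal(size=(4, 1))          # targets
W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros(1)

# forward pass: compute outputs layer by layer
z1 = x @ W1 + b1
a1 = np.maximum(0.0, z1)             # ReLU
y_hat = a1 @ W2 + b2
loss = np.mean((y_hat - y) ** 2)

# backward pass: apply the chain rule layer by layer, in reverse
dL_dyhat = 2 * (y_hat - y) / y.size          # ∂L/∂ŷ
dL_dW2   = a1.T @ dL_dyhat                   # ∂L/∂W2 = ∂ŷ/∂W2 · ∂L/∂ŷ
dL_db2   = dL_dyhat.sum(axis=0)
dL_da1   = dL_dyhat @ W2.T
dL_dz1   = dL_da1 * (z1 > 0)                 # ReLU gradient is 0 or 1
dL_dW1   = x.T @ dL_dz1
dL_db1   = dL_dz1.sum(axis=0)
```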
### Gradient Problems
Vanishing gradients:
- Gradients shrink exponentially through layers
- Caused by: saturating activations, deep networks
- Solutions: ReLU, residual connections, careful initialization
Exploding gradients:
- Gradients grow exponentially
- Solutions: gradient clipping, normalization, careful initialization
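For exploding gradients, clipping is usually a one-line addition to the training loop. A PyTorch sketch with a toy model and random data, purely for illustration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(16, 10), torch.randn(16, 1)

loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
# rescale gradients so their global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```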
## Regularization
Techniques to prevent overfitting (memorizing training data).
### Weight Decay (L2 Regularization)
Add penalty for large weights: L_total = L_data + λ·||w||²
Encourages smaller, more distributed weights.
### Dropout
Randomly zero out neurons during training (typically p=0.5 for hidden, p=0.2 for input).
Effect: Forces redundancy, prevents co-adaptation.
At inference: Use all neurons but scale by (1-p).
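A PyTorch sketch of dropout's train/eval behavior. Note that PyTorch uses inverted dropout: survivors are scaled by 1/(1-p) during training so inference needs no rescaling, which is equivalent in expectation to the scale-by-(1-p)-at-inference convention described above.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the entries zeroed, survivors scaled by 1/(1-p)

drop.eval()
print(drop(x))   # identity at inference: all ones
```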
### Data Augmentation
Create variations of training data:
- Images: rotation, flip, crop, color jitter
- Text: synonym replacement, back-translation
- Audio: pitch shift, time stretch, noise
Philosophy: More diverse training data > more complex model.
### Early Stopping
Stop training when validation loss stops improving.
Training loss: keeps decreasing
Validation loss: decreases, then increases ← stop here
### Batch Normalization
Normalize activations within each mini-batch:
x̂ = (x - μ_batch) / σ_batch
output = γx̂ + β (learned scale and shift)
Benefits: Faster training, higher learning rates, some regularization.
### Layer Normalization
Normalize across features (not batch). Preferred for transformers and RNNs.
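A PyTorch sketch contrasting the two normalization axes (toy shapes; the learned scale and shift are left at their default initialization):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64)        # (batch, features)

bn = nn.BatchNorm1d(64)        # normalizes each feature across the batch
ln = nn.LayerNorm(64)          # normalizes each sample across its features

print(bn(x).mean(dim=0)[:3])   # ≈ 0 per feature (training mode, before learned shift)
print(ln(x).mean(dim=1)[:3])   # ≈ 0 per sample
```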
## Optimization
### Optimizers
| Optimizer | Key Idea | When to Use |
|---|---|---|
| SGD | Basic gradient descent | With momentum, still competitive |
| SGD + Momentum | Accumulate velocity | Faster convergence |
| Adam | Adaptive learning rates + momentum | Default choice |
| AdamW | Adam with decoupled weight decay | Large models, transformers |
### Learning Rate Schedules
Warmup: Start low, increase gradually. Helps stability early in training.
Decay: Reduce over time.
- Step decay: Drop by factor every N epochs
- Cosine: Smooth decrease following cosine curve
- Linear: Steady decrease
Cyclical: Oscillate between bounds. Can help escape local minima.
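A small sketch of a common combination, linear warmup followed by cosine decay; all constants here are arbitrary placeholders:

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=1000, total_steps=100_000, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_step(0), lr_at_step(1000), lr_at_step(100_000))
```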
## Convolutional Neural Networks (CNNs)
For grid-structured data (images, audio spectrograms).
### Key Components
Convolutional layer:
- Slides filter/kernel across input
- Detects local patterns
- Parameters: kernel size, stride, padding, channels
Pooling layer:
- Downsamples spatially
- Max pooling (most common) or average pooling
- Provides translation invariance
### Typical Architecture
INPUT ──► [Conv + ReLU + Pool] × N ──► Flatten ──► [Dense] × M ──► OUTPUT
          (feature extraction)                     (classification)
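Assuming PyTorch, a minimal version of this pattern with two conv/pool stages and a small classification head (channel counts and image sizes are arbitrary):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    # feature extraction: [Conv + ReLU + Pool] × 2
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # classification head
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

x = torch.randn(4, 3, 32, 32)       # e.g. CIFAR-sized images
print(cnn(x).shape)                 # [4, 10] class logits
```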
### Famous Architectures
| Architecture | Key Innovation |
|---|---|
| LeNet | First successful CNN |
| AlexNet | Deep CNNs work, ReLU, dropout |
| VGG | Smaller filters (3×3), deeper |
| ResNet | Residual connections |
| Inception | Multiple filter sizes in parallel |
| EfficientNet | Compound scaling |
### Residual Connections
x ──────────────────────────┐
│                           │
└──► [Conv] ──► [Conv] ─────┴──► x + F(x)
Why it works: Gradients flow directly through skip connections. Enables very deep networks (100+ layers).
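A PyTorch sketch of a residual block matching the diagram; real ResNet blocks also include batch normalization, omitted here for brevity:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x), where F is two 3×3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + f)   # skip connection: gradients flow through the identity path

x = torch.randn(2, 16, 32, 32)
print(ResidualBlock(16)(x).shape)   # [2, 16, 32, 32]
```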
## Recurrent Neural Networks (RNNs)
For sequential data (text, time series).
### Basic RNN
h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)
Hidden state h carries information through time.
Problem: Vanishing/exploding gradients over long sequences.
### LSTM (Long Short-Term Memory)
Gated architecture that controls information flow:
Forget gate: f_t = σ(W_f · [h_{t-1}, x_t]) → what to forget
Input gate: i_t = σ(W_i · [h_{t-1}, x_t]) → what to add
Output gate: o_t = σ(W_o · [h_{t-1}, x_t]) → what to output
Cell state: c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · [h_{t-1}, x_t])
Hidden: h_t = o_t ⊙ tanh(c_t)
Key insight: Cell state provides highway for gradients. Gates learn what to remember/forget.
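A NumPy sketch of a single LSTM step that follows the gate equations above directly (bias terms added for completeness; all shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    """One LSTM step following the gate equations above."""
    hx = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f = sigmoid(W_f @ hx + b_f)                  # forget gate
    i = sigmoid(W_i @ hx + b_i)                  # input gate
    o = sigmoid(W_o @ hx + b_o)                  # output gate
    c = f * c_prev + i * np.tanh(W_c @ hx + b_c) # cell state update (⊙ is elementwise *)
    h = o * np.tanh(c)                           # new hidden state
    return h, c

hidden, inp = 4, 3
rng = np.random.default_rng(0)
W = lambda: rng.normal(size=(hidden, hidden + inp)) * 0.1
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inp), h, c, W(), W(), W(), W(),
                 *(np.zeros(hidden) for _ in range(4)))
print(h.shape, c.shape)   # (4,) (4,)
```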
### GRU (Gated Recurrent Unit)
Simplified LSTM with two gates (reset, update). Fewer parameters, similar performance.
## Transformers
The dominant architecture for sequences (and increasingly everything else).
### Self-Attention
Compute weighted combination of all positions:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
- Q (Query): What am I looking for?
- K (Key): What do I contain?
- V (Value): What do I provide?
Scaled dot-product: Divide by √d_k to prevent softmax saturation.
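A NumPy sketch of scaled dot-product attention exactly as written above (toy shapes; in practice Q, K, V are linear projections of the input):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / √d_k) · V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))    # 5 query positions, d_k = 8
K = rng.normal(size=(7, 8))    # 7 key positions
V = rng.normal(size=(7, 16))   # values carry d_v = 16 features
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 16)
```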
### Multi-Head Attention
Multiple attention operations in parallel:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Why multiple heads? Different heads can attend to different types of relationships.
### Transformer Block
x ──► LayerNorm ──► Multi-Head Attention ──► + ──► LayerNorm ──► FFN ──► + ──► output
│                                            ▲    │                      ▲
└────────────────────────────────────────────┘    └──────────────────────┘
                 (residual)                             (residual)
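Assuming PyTorch, a minimal pre-norm transformer block matching the diagram (dimensions are arbitrary; dropout omitted for brevity):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    def __init__(self, d_model=64, n_heads=8, d_ff=256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        a = self.ln1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]   # residual around self-attention
        x = x + self.ffn(self.ln2(x))                        # residual around the FFN
        return x

x = torch.randn(2, 10, 64)           # (batch, sequence, d_model)
print(TransformerBlock()(x).shape)   # [2, 10, 64]
```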
### Positional Encoding
Attention is permutation-invariant, so position information must be injected.
Sinusoidal: Fixed pattern of sines/cosines at different frequencies.
Learned: Trainable embedding per position.
Rotary (RoPE): Encode relative positions through rotation.
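A NumPy sketch of the sinusoidal variant (an even model dimension is assumed):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

print(sinusoidal_positions(seq_len=50, d_model=16).shape)   # (50, 16)
```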
### Encoder vs Decoder
Encoder: Bidirectional attention (sees all positions). BERT-style.
Decoder: Causal attention (only sees previous positions). GPT-style.
Encoder-Decoder: Encoder for input, decoder for output. T5-style.
## Generative Models
### Autoencoders
Input ──► Encoder ──► Latent Space (z) ──► Decoder ──► Reconstruction
Learn compressed representation. Loss = reconstruction error.
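Assuming PyTorch, a minimal fully connected autoencoder trained with reconstruction (MSE) loss; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)       # compressed latent representation
        return self.decoder(z)    # reconstruction

model = Autoencoder()
x = torch.rand(16, 784)                        # e.g. flattened 28×28 images
loss = nn.functional.mse_loss(model(x), x)     # loss = reconstruction error
```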
### Variational Autoencoders (VAEs)
Latent space is probability distribution, not point.
Loss = reconstruction + KL divergence (regularizes latent space).
Enables generation by sampling from latent space.
### GANs (Generative Adversarial Networks)
Two networks competing:
- Generator: Creates fake samples
- Discriminator: Distinguishes real from fake
Training is adversarial game. Generator improves to fool discriminator.
### Diffusion Models
Learn to reverse a noise-adding process:
1. Forward: Gradually add noise until pure noise
2. Reverse: Learn to denoise step by step
3. Generate: Start from noise, denoise to sample
State-of-the-art for image generation (DALL-E, Stable Diffusion, Midjourney).
## Practical Training Guide
### Debugging Checklist
| Symptom | Possible Causes | Solutions |
|---|---|---|
| Loss not decreasing | LR too low/high, bug in code | Check gradients, try different LR |
| Loss NaN/Inf | LR too high, numerical issues | Lower LR, gradient clipping, check data |
| Overfitting | Model too complex, not enough data | Regularization, more data, simpler model |
| Underfitting | Model too simple, LR too low | Bigger model, higher LR, train longer |
| Slow convergence | LR too low, bad initialization | Increase LR, use standard init |
### Hyperparameter Priority
1. Learning rate (most important)
2. Batch size
3. Architecture (depth, width)
4. Regularization (dropout, weight decay)
5. Optimizer settings
### Training Recipe
1. Start simple: Small model, verify it can overfit a small batch (see the sketch after this list)
2. Scale up: Gradually increase model size and data
3. Tune LR: Find the highest stable learning rate
4. Add regularization: Only when overfitting
5. Extend training: More epochs if still improving
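A PyTorch sketch of step 1, the overfit-a-single-batch sanity check (toy model and random data; the point is only that the loss should collapse toward zero if the training loop is wired correctly):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))   # one fixed small batch

for step in range(500):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())   # should fall toward 0; if not, suspect a bug before scaling up
```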
### Model Selection Guide
| Data Type | Recommended Architecture |
|---|---|
| Tabular | Gradient boosting, then MLP |
| Images | CNN (ResNet, EfficientNet) or ViT |
| Text | Transformer (BERT, GPT) |
| Sequences | Transformer or LSTM |
| Graphs | GNN (GCN, GAT) |
| Generation (images) | Diffusion models |
| Generation (text) | Autoregressive transformers |
## Key Equations Reference
| Concept | Equation |
|---|---|
| Linear layer | y = Wx + b |
| ReLU | f(x) = max(0, x) |
| Softmax | p_i = e^{x_i} / Σ e^{x_j} |
| Cross-entropy | L = -Σ y_i log(ŷ_i) |
| SGD update | θ ← θ - η∇L |
| Attention | softmax(QK^T/√d)V |
| BatchNorm | (x - μ)/σ · γ + β |
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.