# defi-naly / understanding-deep-learning
# Install this skill:
```bash
npx skills add defi-naly/skillbank --skill "understanding-deep-learning"
```

Installs a specific skill from a multi-skill repository.

# Description

Simon Prince's comprehensive deep learning framework for understanding neural networks, architectures, and training.

# SKILL.md


---
name: understanding-deep-learning
description: "Simon Prince's comprehensive deep learning framework for understanding neural networks, architectures, and training."

dimensions:
  domain: [machine-learning, AI, data-science, engineering]
  phase: [model-design, training, debugging, architecture-selection]
  problem_type: [neural-networks, optimization, model-selection, technical-implementation]

contexts:
  - situation: "building a machine learning model"
    use_when: "need to choose architecture, loss function, or optimization approach"
  - situation: "model isn't training well"
    use_when: "debugging training issues (loss not decreasing, NaN, overfitting)"
  - situation: "choosing between architectures"
    use_when: "selecting CNN vs RNN vs Transformer for different data types"
  - situation: "understanding a paper or implementation"
    use_when: "need reference for attention, backprop, normalization concepts"
  - situation: "optimizing model performance"
    use_when: "tuning hyperparameters, regularization, learning rate schedules"

combines_with:
  - thinking-fast-and-slow   # meta: human vs machine cognition parallels
  - lean-startup             # experiment design for ML projects
  - hidden-potential         # learning systems and skill development

contrast_with:
  - skill: thinking-fast-and-slow
    distinction: "Deep Learning is MACHINE cognition; TF&S is HUMAN cognition"
---

Understanding Deep Learning

Core Concepts

Deep learning is function approximation using compositions of simple nonlinear functions. Neural networks learn hierarchical representations from data.

INPUT ──► [Layer 1] ──► [Layer 2] ──► ... ──► [Layer N] ──► OUTPUT
              │              │                    │
         Low-level      Mid-level            High-level
         features       features             features

Why deep? Depth enables hierarchical representations. Each layer builds on the previous, creating increasingly abstract features.

The Supervised Learning Framework

The Setup

Given: Training data {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
Goal: Learn function f(x, θ) that predicts y from x
Method: Minimize loss L(θ) over parameters θ

Loss Functions

| Task | Loss Function | Formula |
|------|---------------|---------|
| Regression | Mean Squared Error | L = (1/n)Σ(y - ŷ)² |
| Binary classification | Binary Cross-Entropy | L = -Σ[y·log(ŷ) + (1-y)·log(1-ŷ)] |
| Multi-class classification | Cross-Entropy | L = -Σ y·log(ŷ) |

Key insight: The loss function defines what "good" means. Choose it carefully—the model optimizes exactly what you specify.
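
A minimal NumPy sketch of the three losses above; y and ŷ are arrays, and ŷ must lie in (0, 1) for the cross-entropy variants. The code averages over samples where the table writes a plain sum.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat):
    # mean over samples; the table's Σ form differs only by the 1/n factor
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cross_entropy(y, y_hat):
    # y is one-hot, y_hat a probability vector (e.g. a softmax output)
    return -np.sum(y * np.log(y_hat))
```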

Gradient Descent

Repeat:
    1. Compute gradient: ∇L(θ)
    2. Update parameters: θ ← θ - η·∇L(θ)
Until convergence

Learning rate (η): Too high = unstable, too low = slow. This is the most important hyperparameter.
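
A minimal sketch of the update loop in NumPy; the linear-regression problem and its sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # noisy targets

w = np.zeros(3)                               # parameters θ
eta = 0.1                                     # learning rate η
for step in range(200):
    grad = 2 / len(y) * X.T @ (X @ w - y)     # ∇L for mean squared error
    w -= eta * grad                           # θ ← θ - η·∇L(θ)
print(w)                                      # approaches true_w
```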

Stochastic Gradient Descent (SGD)

Use mini-batches instead of full dataset:
- Faster: Don't need full dataset for each update
- Regularizing: Noise helps escape local minima
- Memory efficient: Fits in GPU memory

Batch size tradeoffs:
- Smaller → more gradient noise, often better generalization, slower per epoch
- Larger → less noise, faster per epoch, may generalize worse
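
A self-contained mini-batch SGD sketch, same illustrative linear-regression setup as above but updating on a random subset each step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

w, eta, batch_size = np.zeros(3), 0.1, 16
for step in range(500):
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2 / batch_size * Xb.T @ (Xb @ w - yb)   # noisy gradient estimate
    w -= eta * grad
```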

Neural Network Building Blocks

The Neuron (Perceptron)

        x₁ ──w₁──┐
        x₂ ──w₂──┼──► Σ ──► activation ──► output
        x₃ ──w₃──┘    + b

output = activation(w₁x₁ + w₂x₂ + w₃x₃ + b)

Activation Functions

| Function | Formula | Use Case |
|----------|---------|----------|
| ReLU | max(0, x) | Default for hidden layers |
| Sigmoid | 1/(1+e⁻ˣ) | Binary output, gates |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | Centered output in [-1, 1] |
| Softmax | e^{x_i} / Σe^{x_j} | Multi-class probabilities |
| GELU | x·Φ(x) | Transformers |
| SiLU/Swish | x·σ(x) | Modern architectures |

Why nonlinearity? Without it, stacked layers collapse to a single linear transformation. Nonlinearity enables learning complex functions.
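
NumPy versions of the most common activations; subtracting the max inside softmax is the standard numerical-stability trick.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # shift by max to avoid overflow
    return e / e.sum()
```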

Layers

Dense (Fully Connected):
Every neuron connects to every input. output = activation(Wx + b)

Convolutional:
Local connectivity with shared weights. Translation invariant.

Recurrent:
Connections through time. Maintains hidden state.

Attention:
Dynamic weighting of inputs. Learns what to focus on.

Backpropagation

The algorithm for computing gradients efficiently through the chain rule.

Forward pass: Compute outputs layer by layer
Backward pass: Compute gradients layer by layer (in reverse)

Chain Rule Application

For loss L with intermediate computations:

∂L/∂w = ∂L/∂output · ∂output/∂w

Gradients flow backward through the computation graph.
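
A hand-rolled forward and backward pass for a one-hidden-layer ReLU network with squared-error loss; each backward line is one application of the chain rule. The shapes at the bottom are illustrative.

```python
import numpy as np

def forward_backward(x, y, W1, W2):
    # forward pass
    h_pre = W1 @ x                      # pre-activation
    h = np.maximum(0.0, h_pre)          # ReLU
    y_hat = W2 @ h
    loss = np.sum((y_hat - y) ** 2)
    # backward pass (reverse order of the forward pass)
    d_yhat = 2 * (y_hat - y)            # ∂L/∂ŷ
    dW2 = np.outer(d_yhat, h)           # ∂L/∂W2
    d_h = W2.T @ d_yhat                 # ∂L/∂h
    d_hpre = d_h * (h_pre > 0)          # ReLU gradient gate
    dW1 = np.outer(d_hpre, x)           # ∂L/∂W1
    return loss, dW1, dW2

rng = np.random.default_rng(0)
loss, dW1, dW2 = forward_backward(
    rng.normal(size=4), rng.normal(size=2),
    rng.normal(size=(8, 4)), rng.normal(size=(2, 8)))
```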

Gradient Problems

Vanishing gradients:
- Gradients shrink exponentially through layers
- Caused by: saturating activations, deep networks
- Solutions: ReLU, residual connections, careful initialization

Exploding gradients:
- Gradients grow exponentially
- Solutions: gradient clipping, normalization, careful initialization

Regularization

Techniques to prevent overfitting (memorizing training data).

Weight Decay (L2 Regularization)

Add penalty for large weights: L_total = L_data + λ·||w||²

Encourages smaller, more distributed weights.

Dropout

Randomly zero out neurons during training (typically p=0.5 for hidden, p=0.2 for input).

Effect: Forces redundancy, prevents co-adaptation.

At inference: Use all neurons but scale by (1-p).
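
A sketch of inverted dropout, the variant most libraries implement: it rescales by 1/(1-p) during training, which matches the (1-p) test-time scaling above in expectation while leaving inference unchanged.

```python
import numpy as np

def dropout(x, p, training, rng=np.random.default_rng()):
    if not training or p == 0.0:
        return x                         # inference: use activations as-is
    mask = rng.random(x.shape) >= p      # keep each unit with probability 1-p
    return x * mask / (1.0 - p)          # rescale so expected value is unchanged
```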

Data Augmentation

Create variations of training data:
- Images: rotation, flip, crop, color jitter
- Text: synonym replacement, back-translation
- Audio: pitch shift, time stretch, noise

Philosophy: More diverse training data > more complex model.

Early Stopping

Stop training when validation loss stops improving.

Training loss: keeps decreasing
Validation loss: decreases, then increases ← stop here
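
A patience-based early-stopping skeleton over a toy validation-loss curve; the numbers are made up to show the mechanics.

```python
val_losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.51, 0.53, 0.56]  # toy curve
best, patience, bad = float("inf"), 2, 0
for epoch, v in enumerate(val_losses):
    if v < best:
        best, bad = v, 0        # improved: reset counter (checkpoint here)
    else:
        bad += 1
        if bad >= patience:
            print(f"stopping at epoch {epoch}; best val loss {best}")
            break
```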

Batch Normalization

Normalize activations within each mini-batch:

x̂ = (x - μ_batch) / σ_batch
output = γx̂ + β  (learned scale and shift)

Benefits: Faster training, higher learning rates, some regularization.
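
The training-mode forward pass in NumPy; γ and β are the learned scale and shift, and eps guards against division by zero. (At inference, running averages replace the batch statistics.)

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                   # per-feature mean over the batch
    var = x.var(axis=0)                   # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```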

Layer Normalization

Normalize across features (not batch). Preferred for transformers and RNNs.

Optimization

Optimizers

| Optimizer | Key Idea | When to Use |
|-----------|----------|-------------|
| SGD | Basic gradient descent | With momentum, still competitive |
| SGD + Momentum | Accumulate velocity | Faster convergence |
| Adam | Adaptive learning rates + momentum | Default choice |
| AdamW | Adam with decoupled weight decay | Large models, transformers |

Learning Rate Schedules

Warmup: Start low, increase gradually. Helps stability early in training.

Decay: Reduce over time.
- Step decay: Drop by factor every N epochs
- Cosine: Smooth decrease following cosine curve
- Linear: Steady decrease

Cyclical: Oscillate between bounds. Can help escape local minima.
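
A sketch of linear warmup followed by cosine decay, expressed as a function of the current step.

```python
import math

def lr_at(step, base_lr, warmup_steps, total_steps):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps            # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```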

Convolutional Neural Networks (CNNs)

For grid-structured data (images, audio spectrograms).

Key Components

Convolutional layer:
- Slides filter/kernel across input
- Detects local patterns
- Parameters: kernel size, stride, padding, channels

Pooling layer:
- Downsamples spatially
- Max pooling (most common) or average pooling
- Provides translation invariance

Typical Architecture

INPUT ──► [Conv + ReLU + Pool] × N ──► Flatten ──► [Dense] × M ──► OUTPUT
          (feature extraction)                     (classification)
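
The pattern above as a small PyTorch model; the channel counts and the 10-class head are illustrative.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(),   # assumes 32×32 inputs (32→16→8)
    nn.Linear(128, 10),                      # classification head
)
```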

Famous Architectures

| Architecture | Key Innovation |
|--------------|----------------|
| LeNet | First successful CNN |
| AlexNet | Showed deep CNNs work; ReLU, dropout |
| VGG | Smaller filters (3×3), deeper |
| ResNet | Residual connections |
| Inception | Multiple filter sizes in parallel |
| EfficientNet | Compound scaling |

Residual Connections

x ──────────────────────┐
│                       │
└──► [Conv] ──► [Conv] ─┴──► x + F(x)

Why it works: Gradients flow directly through skip connections. Enables very deep networks (100+ layers).
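
A minimal residual-block sketch in PyTorch; channel counts are kept equal so x + F(x) is shape-valid.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + out)   # skip connection: gradients flow through x
```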

Recurrent Neural Networks (RNNs)

For sequential data (text, time series).

Basic RNN

h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)

Hidden state h carries information through time.

Problem: Vanishing/exploding gradients over long sequences.

LSTM (Long Short-Term Memory)

Gated architecture that controls information flow:

Forget gate: f_t = σ(W_f · [h_{t-1}, x_t])     ← what to forget
Input gate:  i_t = σ(W_i · [h_{t-1}, x_t])     ← what to add
Output gate: o_t = σ(W_o · [h_{t-1}, x_t])     ← what to output

Cell state:  c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · [h_{t-1}, x_t])
Hidden:      h_t = o_t ⊙ tanh(c_t)

Key insight: Cell state provides highway for gradients. Gates learn what to remember/forget.
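
One LSTM step in NumPy, directly mirroring the gate equations above (biases omitted, as in the equations).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z)                     # forget gate
    i = sigmoid(W_i @ z)                     # input gate
    o = sigmoid(W_o @ z)                     # output gate
    c = f * c_prev + i * np.tanh(W_c @ z)    # new cell state
    h = o * np.tanh(c)                       # new hidden state
    return h, c
```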

GRU (Gated Recurrent Unit)

Simplified LSTM with two gates (reset, update). Fewer parameters, similar performance.

Transformers

The dominant architecture for sequences (and increasingly everything else).

Self-Attention

Compute weighted combination of all positions:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V
  • Q (Query): What am I looking for?
  • K (Key): What do I contain?
  • V (Value): What do I provide?

Scaled dot-product: Divide by √d_k to prevent softmax saturation.
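
Scaled dot-product attention for a single sequence in NumPy; each row of the softmax output is that position's attention weights over all positions.

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values
```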

Multi-Head Attention

Multiple attention operations in parallel:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Why multiple heads? Different heads can attend to different types of relationships.

Transformer Block

x ──► LayerNorm ──► Multi-Head Attention ──► + ──► LayerNorm ──► FFN ──► + ──► output
│                                           │ │                         │
└───────────────────────────────────────────┘ └─────────────────────────┘
                  (residual)                        (residual)
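
A sketch of the block above in PyTorch with the pre-norm arrangement shown in the diagram; d_model, n_heads, and d_ff are the usual size hyperparameters.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        a = self.ln1(x)
        x = x + self.attn(a, a, a)[0]     # residual around attention
        return x + self.ffn(self.ln2(x))  # residual around FFN
```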

Positional Encoding

Attention is permutation-invariant—needs position information injected.

Sinusoidal: Fixed pattern of sines/cosines at different frequencies.
Learned: Trainable embedding per position.
Rotary (RoPE): Encode relative positions through rotation.

Encoder vs Decoder

Encoder: Bidirectional attention (sees all positions). BERT-style.
Decoder: Causal attention (only sees previous positions). GPT-style.
Encoder-Decoder: Encoder for input, decoder for output. T5-style.

Generative Models

Autoencoders

Input ──► Encoder ──► Latent Space ──► Decoder ──► Reconstruction
                         z

Learn compressed representation. Loss = reconstruction error.
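
A minimal dense autoencoder sketch in PyTorch; the 784→64 sizes are illustrative (e.g. flattened 28×28 images).

```python
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),       # encoder: compress to a 64-dim latent z
    nn.Linear(64, 784), nn.Sigmoid(),    # decoder: reconstruct the input
)
# objective: reconstruction error, e.g. nn.MSELoss()(autoencoder(x), x)
```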

Variational Autoencoders (VAEs)

Latent space is probability distribution, not point.
Loss = reconstruction + KL divergence (regularizes latent space).

Enables generation by sampling from latent space.

GANs (Generative Adversarial Networks)

Two networks competing:
- Generator: Creates fake samples
- Discriminator: Distinguishes real from fake

Training is an adversarial game: the generator improves until it can fool the discriminator.

Diffusion Models

Learn to reverse a noise-adding process:
1. Forward: Gradually add noise until pure noise
2. Reverse: Learn to denoise step by step
3. Generate: Start from noise, denoise to sample

State-of-the-art for image generation (DALL-E, Stable Diffusion, Midjourney).
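
A sketch of one jump of the DDPM-style forward (noising) process: blend the clean sample with Gaussian noise according to the cumulative schedule term, often written ᾱ_t.

```python
import numpy as np

def add_noise(x0, alpha_bar_t, rng=np.random.default_rng()):
    eps = rng.normal(size=x0.shape)   # the noise the model learns to predict
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```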

Practical Training Guide

Debugging Checklist

| Symptom | Possible Causes | Solutions |
|---------|-----------------|-----------|
| Loss not decreasing | LR too low/high, bug in code | Check gradients, try different LR |
| Loss NaN/Inf | LR too high, numerical issues | Lower LR, gradient clipping, check data |
| Overfitting | Model too complex, not enough data | Regularization, more data, simpler model |
| Underfitting | Model too simple, LR too low | Bigger model, higher LR, train longer |
| Slow convergence | LR too low, bad initialization | Increase LR, use standard init |

Hyperparameter Priority

  1. Learning rate (most important)
  2. Batch size
  3. Architecture (depth, width)
  4. Regularization (dropout, weight decay)
  5. Optimizer settings

Training Recipe

  1. Start simple: Small model, verify it can overfit small batch
  2. Scale up: Gradually increase model size and data
  3. Tune LR: Find highest stable learning rate
  4. Add regularization: Only when overfitting
  5. Extend training: More epochs if still improving

Model Selection Guide

| Data Type | Recommended Architecture |
|-----------|--------------------------|
| Tabular | Gradient boosting, then MLP |
| Images | CNN (ResNet, EfficientNet) or ViT |
| Text | Transformer (BERT, GPT) |
| Sequences | Transformer or LSTM |
| Graphs | GNN (GCN, GAT) |
| Generation (images) | Diffusion models |
| Generation (text) | Autoregressive transformers |

Key Equations Reference

| Concept | Equation |
|---------|----------|
| Linear layer | y = Wx + b |
| ReLU | f(x) = max(0, x) |
| Softmax | p_i = e^{x_i} / Σe^{x_j} |
| Cross-entropy | L = -Σ y_i log(ŷ_i) |
| SGD update | θ ← θ - η∇L |
| Attention | softmax(QK^T/√d_k)V |
| BatchNorm | (x - μ)/σ · γ + β |

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.