trotsky1997

codebase-reading

# Install this skill:
npx skills add trotsky1997/My-Claude-Agent-Skills --skill "codebase-reading"

Install specific skill from multi-skill repository

# Description

Systematic methodology for reading and understanding large codebases efficiently. Use when (1) Understanding a new or unfamiliar codebase quickly, (2) Preparing to modify or extend existing code safely, (3) Debugging complex issues requiring deep code understanding, (4) Onboarding new team members to a codebase, (5) Performing code audits or security reviews, (6) Refactoring legacy code with confidence, (7) Creating documentation for existing systems, (8) Tracing execution flows and data transformations

# SKILL.md


---
name: codebase-reading
description: Systematic methodology for reading and understanding large codebases efficiently. Use when (1) Understanding a new or unfamiliar codebase quickly, (2) Preparing to modify or extend existing code safely, (3) Debugging complex issues requiring deep code understanding, (4) Onboarding new team members to a codebase, (5) Performing code audits or security reviews, (6) Refactoring legacy code with confidence, (7) Creating documentation for existing systems, (8) Tracing execution flows and data transformations
metadata:
  short-description: Efficiently read and understand large codebases
---


Large Codebase Reading Methodology

A systematic approach to understanding large codebases efficiently: read to modify safely, not to memorize everything.

Core Principle

Goal-oriented reading: Always start with a concrete objective (fix bug, add feature, debug issue, security audit). Without a goal, you'll get lost in details.

Success criteria: You can explain the execution flow and modify code confidently, not that you've read every file.

Documentation Structure

When documenting your understanding, use this structure:

code-reading/
├── README.md              # Index and progress tracking
├── code-reading.md        # Main framework (methodology + findings + terminology)
├── architecture.md        # C4 architecture map (Level 1-4)
├── api-flow.md            # Execution flow tracing
└── key-modules.md         # Detailed module analysis

Progressive disclosure:
- README: Overview and navigation
- Main doc: Methodology and high-level findings
- Specialized docs: Detailed analysis (loaded only when needed)

Terminology section:
- Include a dedicated "Terminology" or "Key Concepts and Terminology" section in code-reading.md (e.g., section 9)
- Build incrementally from day 1, don't wait until everything is understood
- Cross-reference terms throughout documentation (architecture, flows, modules)
- Maintain as a living document - update as understanding deepens

Quick Start

Step 1: Define Your Goal

Start with a concrete, one-sentence objective:

  • ✅ Good: "Trace HTTP request from route handler to database and back"
  • ✅ Good: "Understand how authentication tokens are validated and refreshed"
  • ❌ Bad: "Understand the entire codebase"

Step 2: Run the Project First

First hour priority:

  1. Read README - project purpose, setup, minimal example
  2. Read CONTRIBUTING/development guide - testing, structure, workflow
  3. Run a minimal path - local build, one test, or demo

Why: Running code transforms reading from guessing to verification.

Step 3: Map the Architecture (C4 Thinking)

Draw a rough map from far to near (no need to be perfect):

Level 1: System Context - Who uses it? What external dependencies?
Level 2: Containers - What are the deployable units? (services, databases, workers)
Level 3: Components - What are the key components within one container?

You don't need beautiful diagrams - just enough to navigate: where's the entry point, how does data flow, where are the boundaries?

Step 4: Trace a Real Request Path

Instead of reading modules, read one execution path:

  • Web: route → controller → service → repository → external API
  • CLI: main → command parser → execution path
  • Async: consumer → handler → processing → ack/retry

Strongly recommended: Walk through the path once with a debugger (set breakpoints at each layer) or with temporary logging.

Step 5: Treat Tests as Executable Documentation

If code is old/hard to test, write characterization tests first - record current behavior, then refactor under protection.

Core Workflow

1. Goal-Oriented Setup

Define your reading task:

Write a one-sentence goal that describes what you want to achieve:
- "I can explain the request flow from HTTP to database"
- "I can safely modify the authentication module"
- "I can trace the OCR pipeline from image input to result output"

Avoid: Reading without purpose or starting from directory trees.

2. Project Bootstrap

Priority order (first hour):

  1. README.md - What does it do? How to start? Minimal example?
  2. CONTRIBUTING.md / docs/ - How to test? Code structure? Branch strategy?
  3. Run minimal path - Can you build it? Run one test? Execute a demo?

Verify: Required assets exist (for example, model files in an ML project), tests run, examples work.

3. Architecture Mapping (C4 Model)

Level 1: System Context
- Users/applications that interact with the system
- External dependencies (databases, APIs, services)
- System boundaries

Level 2: Containers
- Deployable units (frontend, API server, workers, databases, queues)
- Communication between containers

Level 3: Components
- Key components within a container (auth service, domain service, repository, adapter)
- Component relationships and dependencies

Output: A rough map that answers:
- Where's the entry point?
- How does data flow?
- Where are the boundaries?
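
For a Python codebase, one cheap way to get a first Level 3 picture is to list which top-level modules each file imports. The sketch below uses the standard-library `ast` module. The function name `imported_modules` and the sample source are hypothetical, and dynamic imports are not detected.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b` -> record "a"
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # `from a.b import c` -> record "a"; relative imports are skipped
            modules.add(node.module.split(".")[0])
    return modules


# Hypothetical file contents, standing in for one module of the codebase.
SAMPLE = """
import os
import app.services.auth
from app.repository import UserRepo
from . import helpers
"""

print(sorted(imported_modules(SAMPLE)))
```

Running this over every file and grouping the results by package gives a rough dependency list, which is usually enough to sketch component boundaries by hand.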

4. Entry Point + Request Tracing

Find the entry point:
- Web: main(), route handlers, controllers
- CLI: main(), command parsers
- Library: public API functions, constructors

Trace one complete path:
- Use debugger to step through
- Add logging at key points
- Set breakpoints to verify understanding

Document the flow: Write down the execution path as you trace it.
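
When a debugger is inconvenient, temporary logging can show the same path. The following is a minimal sketch in Python, assuming a hypothetical three-layer path (`handle_request` → `get_profile` → `find_user`); the `traced` decorator is an illustration, not part of any library.

```python
import functools
import logging

logger = logging.getLogger("trace")


def traced(func):
    """Log entry and exit of a function so one request path can be followed in the logs."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("enter %s args=%r kwargs=%r", func.__qualname__, args, kwargs)
        result = func(*args, **kwargs)
        logger.info("exit  %s -> %r", func.__qualname__, result)
        return result

    return wrapper


# Hypothetical layers: repository, service, handler.
@traced
def find_user(user_id):
    return {"id": user_id, "name": "example"}


@traced
def get_profile(user_id):
    return find_user(user_id)["name"]


@traced
def handle_request(path):
    return get_profile(int(path.rsplit("/", 1)[-1]))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    handle_request("/users/42")
```

The nested enter/exit lines in the log are the execution path; copy them into `api-flow.md` and remove the decorators once the flow is documented.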

5. Tests as Documentation

Existing tests:
- Read tests to understand expected behavior
- Tests show how components are used
- Tests document edge cases and error handling

Missing tests:
- Write characterization tests (record current behavior)
- Use tests as a safety net before modifications
- Tests become regression suite
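
A characterization test asserts what the code does today, not what it should do. The sketch below assumes a hypothetical legacy function, `legacy_format_price`, with a formatting quirk; the tests are plain functions that a runner such as pytest would collect.

```python
def legacy_format_price(cents):
    """Hypothetical legacy function whose behavior we want to pin down before refactoring."""
    dollars = cents // 100
    remainder = cents % 100
    return "$" + str(dollars) + "." + str(remainder)


# Each test records current behavior, including output that looks like a bug.
def test_whole_dollars_have_single_zero():
    assert legacy_format_price(1200) == "$12.0"  # not "$12.00"


def test_single_digit_cents_are_not_padded():
    assert legacy_format_price(1205) == "$12.5"  # not "$12.05"


def test_two_digit_cents():
    assert legacy_format_price(1250) == "$12.50"
```

Whether the unpadded cents are a bug or something callers rely on is a separate question. The tests only make sure a refactor does not change the answer by accident.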

6. Git Archaeology

When you see "weird code":

Don't dismiss it immediately - ask: What historical problem is this solving?

Commands:

git blame -w -- path/to/file          # Who changed this and when?
git log -p -- path/to/file            # Full change history
git log --grep="keyword"              # Find related commits
git show <commit-hash>                # View specific change

What to look for:
- Commit messages explaining why
- PR discussions showing trade-offs
- Major refactors showing architecture evolution

7. Use Tools Efficiently

Code search:

rg "keyword" -n .                     # Ripgrep (faster than grep)
git grep "keyword"                    # Git-optimized search
rg "main\(" .                         # Find entry points
rg "TODO|FIXME" .                     # Find todos

Git tools:

git log --graph --oneline --all       # Visual history
git log --follow -- path/to/file      # File rename tracking
git diff HEAD~5 HEAD                  # Compare versions

Language-specific tools:
- Rust: cargo tree, cargo clippy, cargo doc
- Python: pytest, mypy, pylint
- JavaScript: npm list, eslint, tsc

8. Build and Maintain Terminology Glossary

Critical for understanding: Build a living terminology glossary from day 1, not as an afterthought.

Why it matters:
- Domain-specific terms (OCR: detection, recognition, CTC, NMS)
- Acronyms that need expansion (CLS, DB, IOU)
- Project-specific abstractions (custom types, internal concepts)
- Confusion between similar concepts (Orientation vs CLS, Detection vs Recognition)

When to build:
- Initial collection: During first reading pass (Steps 1-4) - collect terms as you encounter them
- Deep dive enrichment: During detailed module analysis (Step 10 of the full methodology in methodology.md) - add technical details and context
- Continuous maintenance: Update as understanding deepens, add cross-references, refine definitions

What to include:
- Definition: Clear explanation of what the term means
- Context: Where and how it's used in the codebase
- Code references: File paths, function names, line numbers
- Relationships: Related terms, parent/child concepts, synonyms
- Examples: Usage examples, code snippets when helpful

Organization:
- Classify by domain (OCR terms, ML terms, project-specific)
- Group by abstraction level (concepts, algorithms, data structures, implementation)
- Cross-reference to architecture diagrams, API flows, module analysis

Example structure:

## Terminology Glossary

### Domain-Specific Terms
- **Detection**: Locating text regions in images
  - **Implementation**: `src/det.rs`
  - **Related**: Recognition, NMS, DB
  - **See**: `api-flow.md` section 4.1

### Data Structures
- **`Mat`**: Image matrix abstraction
  - **Implementation**: `src/image_impl.rs`
  - **Purpose**: Unified image representation for Pure Rust and OpenCV backends

### Algorithms
- **NMS (Non-Maximum Suppression)**: Algorithm for filtering overlapping boxes
  - **Implementation**: `src/geometry.rs::nms()`
  - **Related**: IOU (Intersection over Union)

Tools for term extraction:

# Extract acronyms (all caps, 2-5 chars); -o prints only the match, -I omits file names
rg -o -I "\b[A-Z]{2,5}\b" README.md docs/ | sort -u

# Extract struct/enum/trait names
rg -o -I "^(pub )?(struct|enum|trait|type) \w+" src/ | sort -u

# Find config-related terms
rg "Config.*\{|struct \w+Config" src/
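
If ripgrep is not available, or you want frequency counts to decide which terms to define first, the acronym search can be approximated in a few lines of Python. This is a sketch; the sample text is a hypothetical README excerpt, and the pattern will also match ordinary all-caps words such as `TODO`.

```python
import re
from collections import Counter

# All-caps tokens of 2-5 letters, the same pattern as the rg acronym search.
ACRONYM = re.compile(r"\b[A-Z]{2,5}\b")


def acronym_counts(text: str) -> Counter:
    """Count how often each candidate acronym appears in the text."""
    return Counter(ACRONYM.findall(text))


# Hypothetical README excerpt.
SAMPLE = "Detection uses DB, then NMS filters boxes by IOU. NMS runs before CTC decoding."

for term, count in acronym_counts(SAMPLE).most_common():
    print(f"{term}: {count}")
```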

Maintenance checklist:
- [ ] All acronyms have expansions
- [ ] All domain terms have clear definitions
- [ ] All key data structures are documented
- [ ] Cross-references to code are accurate
- [ ] Related terms are linked
- [ ] Examples provided for complex terms
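
The first checklist item can be partly automated. The sketch below assumes glossary entries open with a bold term, as in the example structure earlier in this section; `undefined_acronyms` and both sample strings are hypothetical.

```python
import re

ACRONYM = re.compile(r"\b[A-Z]{2,5}\b")
# Assumes entries open with a bold term, e.g. "- **NMS (Non-Maximum Suppression)**: ..."
BOLD_TERM = re.compile(r"\*\*`?([A-Za-z]+)")


def undefined_acronyms(docs_text: str, glossary_md: str) -> set[str]:
    """Acronyms used in the docs that never appear as a bold glossary term."""
    used = set(ACRONYM.findall(docs_text))
    defined = set(BOLD_TERM.findall(glossary_md))
    return used - defined


GLOSSARY = """\
- **NMS (Non-Maximum Suppression)**: Algorithm for filtering overlapping boxes
  - **Related**: IOU (Intersection over Union)
"""
DOCS = "Detection runs DB, then NMS drops boxes with high IOU."

print(sorted(undefined_acronyms(DOCS, GLOSSARY)))
```

Here `IOU` is reported even though it is mentioned under "Related", because it has no entry of its own. That is a deliberate choice in this sketch: a passing mention is not a definition.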

Common Patterns

Pattern 1: Reading Without a Goal

Problem: Reading files alphabetically or by directory structure.

Solution: Start with a concrete task. Even if it's "understand how X works," make it specific.

Pattern 2: Getting Lost in Details

Problem: Deep-diving into every module before understanding the flow.

Solution: First trace one complete execution path. Then dive into specific modules as needed.

Pattern 3: Ignoring Tests

Problem: Treating tests as optional or skipping them.

Solution: Tests are the most reliable documentation. Read them first. If missing, write characterization tests.

Pattern 4: Not Using Git History

Problem: Seeing "weird code" and assuming it's wrong.

Solution: Use git blame and git log to understand context. Code exists for reasons, even if not obvious.

Pattern 5: Ignoring Terminology (Acronyms and Domain Terms)

Problem: Encountering acronyms (CTC, CLS, DB, NMS) or domain terms (Detection vs Recognition) and assuming you'll remember what they mean later. Repeatedly looking up the same terms.

Solution: Build a terminology glossary from day 1. Collect terms as you encounter them, enrich with context as understanding deepens. Cross-reference terms throughout documentation. Treat it as a living document, not a one-time task.

Success Indicators

You've successfully understood a codebase when:

✅ You can explain the main execution flow from entry to exit
✅ You can locate where specific functionality is implemented
✅ You can trace data transformations through the system
✅ You can identify where to make changes for your goal
✅ You can explain architectural decisions (even if you'd do differently)
✅ You have a comprehensive terminology glossary that helps navigate the codebase
✅ You can explain any domain-specific term in one sentence
✅ You rarely need to re-lookup the same term
✅ You feel confident modifying code without breaking things

Not required:
- ❌ Having read every file
- ❌ Memorizing every function
- ❌ Understanding every detail

Troubleshooting

"I don't know where to start"
→ Define a concrete goal. Even "understand how user authentication works" is better than "understand everything."

"The codebase is too large"
→ Focus on your goal. Trace one execution path. Ignore unrelated modules.

"I can't find the entry point"
→ Look for main(), route definitions, or public API functions. Use code search tools.

"The code doesn't make sense"
→ Use Git history to understand why it's written this way. Check tests for usage examples.

"There are no tests"
→ Write characterization tests. Record current behavior before modifying.

"I keep forgetting what CTC/CLS/DB means"
→ Build a terminology glossary. Start early, collect terms as you encounter them. Include definitions, context, code references, and relationships. Cross-reference throughout documentation.

"The domain terms are confusing"
→ Don't just look them up once. Document them with context, usage examples, and relationships to other terms. Classify by domain and abstraction level. Update as understanding deepens.

References

For detailed methodology and examples, see:

  • methodology.md - Complete 11-step methodology with detailed explanations
  • tool-checklist.md - Comprehensive tool checklist by category (search, Git, language-specific, debugging)
  • terminology-building.md - Complete guide to building and maintaining terminology glossaries (5-step process, classification methods, best practices, common patterns)
