# context-ingestion

A skill by miles-knowbl.

# Install this skill:
```sh
npx skills add miles-knowbl/orchestrator --skill "context-ingestion"
```

This command installs a single skill from a multi-skill repository.

# Description

Gather and organize context from multiple sources into a structured format. Handles documents, URLs, conversations, and notes. Produces a verified source registry and compiled context corpus ready for downstream analysis.

# SKILL.md


```yaml
---
name: context-ingestion
description: "Gather and organize context from multiple sources into a structured format. Handles documents, URLs, conversations, and notes. Produces a verified source registry and compiled context corpus ready for downstream analysis."
phase: INIT
category: specialized
version: "2.0.0"
depends_on: []
tags: [planning, research, intake, sources, information-gathering]
---
```

# Context Ingestion

Gather and organize context from multiple sources into a structured format.

## When to Use

- **New proposal or project kickoff** -- Need to collect and structure background information before analysis begins
- **Client or domain onboarding** -- Entering unfamiliar territory and needing to build a knowledge base quickly
- **Research intake** -- Multiple documents, links, or conversations need to be cataloged and made searchable
- **Scattered source material** -- Information exists across different formats and locations and needs consolidation
- **Pre-analysis preparation** -- Downstream skills (context-cultivation, priority-matrix) need a clean, structured input
- **When you say:** "gather context", "collect sources", "ingest this", "pull together the background", "what do we know?"

## Reference Requirements

MUST read before applying this skill:

| Reference | Why Required |
|-----------|--------------|
| source-evaluation.md | Criteria for assessing source reliability and relevance |
| extraction-patterns.md | Standard patterns for pulling content from different source types |

Read if applicable:

| Reference | When Needed |
|-----------|-------------|
| metadata-schema.md | When extending default metadata fields for a domain |
| deduplication-rules.md | When sources overlap or repeat across formats |
| url-fetch-guidelines.md | When ingesting web content at scale |

**Verification:** Ensure CONTEXT-SOURCES.md contains at least one entry per source provided, each with a completed metadata block and reliability rating.

## Required Deliverables

| Deliverable | Location | Condition |
|-------------|----------|-----------|
| CONTEXT-SOURCES.md | Project root | Always -- registry of all sources with metadata |
| RAW-CONTEXT.md | Project root | Always -- extracted content organized by source |
| INGESTION-LOG.md | Project root | When 5+ sources -- processing notes and decisions |

## Core Concept

Context Ingestion answers: "What do we know, where did it come from, and how reliable is it?"

Context ingestion is:
- Systematic -- Every source is processed through the same evaluation and extraction pipeline
- Traceable -- Every fact in the compiled output links back to a specific source
- Evaluative -- Sources are rated for reliability, recency, and relevance, not treated as equally valid
- Comprehensive -- Actively seeks breadth across source types to avoid blind spots
- Non-interpretive -- Captures what sources say without adding analysis (that is context-cultivation's job)

Context ingestion is NOT:
- Analysis or synthesis (that is context-cultivation)
- Priority setting or ranking (that is priority-matrix)
- Proposal writing (that is proposal-builder)
- Making recommendations based on what was found
- Summarizing to the point of losing source fidelity

## The Context Ingestion Process

```text
┌──────────────────────────────────────────────────────────────────┐
│                    CONTEXT INGESTION PROCESS                     │
│                                                                  │
│  1. SOURCE DISCOVERY                                             │
│     └─> Inventory all available and findable sources             │
│                                                                  │
│  2. SOURCE TRIAGE                                                │
│     └─> Evaluate relevance and reliability, prioritize intake    │
│                                                                  │
│  3. CONTENT EXTRACTION                                           │
│     └─> Pull structured content from each source type            │
│                                                                  │
│  4. METADATA TAGGING                                             │
│     └─> Attach provenance, dates, authors, reliability scores    │
│                                                                  │
│  5. DEDUPLICATION & CROSS-REFERENCING                            │
│     └─> Identify overlaps, flag contradictions                   │
│                                                                  │
│  6. REGISTRY ASSEMBLY                                            │
│     └─> Build CONTEXT-SOURCES.md with full metadata              │
│                                                                  │
│  7. CORPUS COMPILATION                                           │
│     └─> Build RAW-CONTEXT.md organized by topic and source       │
│                                                                  │
│  8. COMPLETENESS CHECK                                           │
│     └─> Verify coverage, flag gaps, log decisions                │
└──────────────────────────────────────────────────────────────────┘
```

## Step 1: Source Discovery

Identify every available source before processing any of them. Cast a wide net.

### Source Inventory Checklist

- [ ] User-provided documents (PDFs, Word docs, slides, spreadsheets)
- [ ] URLs and web pages explicitly shared
- [ ] Conversation transcripts or chat logs
- [ ] Meeting notes or recordings
- [ ] Freeform notes or braindumps
- [ ] Existing project files (README, specs, prior proposals)
- [ ] Codebase context (if technical project)
- [ ] Email threads or correspondence
- [ ] Competitor or market materials
- [ ] Regulatory or compliance documents

### Discovery Strategies

| Strategy | Description | When to Apply |
|----------|-------------|---------------|
| Explicit collection | Gather everything the user has directly provided | Always -- first pass |
| Adjacency search | Look for related files near provided sources | When sources reference other documents |
| Gap-driven search | Identify missing perspectives and ask for them | After initial inventory reveals blind spots |
| Domain scan | Search for standard artifacts in the domain | When onboarding to a new industry or codebase |
| Stakeholder mapping | Identify who else might have relevant information | When building proposals or strategies |

### Source Discovery Output

```markdown
### Source Inventory

| # | Source Name | Type | Status | Notes |
|---|-------------|------|--------|-------|
| 1 | Client brief.pdf | Document | Pending extraction | Primary input |
| 2 | https://example.com/about | URL | Pending fetch | Company background |
| 3 | Kickoff call notes | Conversation | Pending extraction | Key decisions made |
| 4 | Competitor analysis spreadsheet | Document | Pending extraction | Market context |
| 5 | [GAP] Technical requirements | Unknown | Not yet provided | Need to request |
```

## Step 2: Source Triage

Not all sources deserve equal attention. Evaluate before extracting.

### Reliability Assessment

Rate each source on a 1-5 scale across three dimensions:

| Dimension | 1 (Low) | 3 (Medium) | 5 (High) |
|-----------|---------|------------|----------|
| Authority | Unknown author, no credentials | Known author, some expertise | Domain expert, official source |
| Recency | Over 2 years old | 6-24 months old | Less than 6 months old |
| Specificity | Generic, tangentially related | Partially relevant | Directly addresses the topic |

### Composite Reliability Score

```text
Reliability = (Authority + Recency + Specificity) / 3

  5.0 - 4.0  =  HIGH    -->  Extract fully, high confidence
  3.9 - 2.5  =  MEDIUM  -->  Extract selectively, note caveats
  2.4 - 1.0  =  LOW     -->  Extract key claims only, flag uncertainty
```
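
A minimal sketch of this scoring in code (assuming Python; the `SourceRating` type and function names are illustrative, not part of the skill):

```python
from dataclasses import dataclass

@dataclass
class SourceRating:
    authority: int    # 1-5, per the Reliability Assessment table
    recency: int      # 1-5
    specificity: int  # 1-5

def composite_reliability(r: SourceRating) -> float:
    """Average the three dimensions and round to one decimal."""
    return round((r.authority + r.recency + r.specificity) / 3, 1)

def reliability_band(score: float) -> str:
    """Map a composite score to the bands defined above."""
    if score >= 4.0:
        return "HIGH"    # extract fully, high confidence
    if score >= 2.5:
        return "MEDIUM"  # extract selectively, note caveats
    return "LOW"         # extract key claims only, flag uncertainty

# Example: authority 5, recency 4, specificity 4 -> 4.3, HIGH
score = composite_reliability(SourceRating(5, 4, 4))
print(score, reliability_band(score))
```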

### Triage Priority Matrix

| Reliability | High Relevance | Medium Relevance | Low Relevance |
|-------------|----------------|------------------|---------------|
| HIGH | Extract first, full depth | Extract second, full depth | Extract key points only |
| MEDIUM | Extract second, selective | Extract third, selective | Skip or skim |
| LOW | Extract with caveats | Skim for unique claims | Skip |

### Red Flags During Triage

Watch for sources that require extra scrutiny:

- **Undated material** -- Cannot assess recency; flag and note
- **Promotional content** -- May overstate capabilities; cross-reference claims
- **Single-source claims** -- Important facts backed by only one source; note as unverified
- **Contradicting sources** -- Two sources disagree on facts; capture both, flag for resolution
- **Stale technical content** -- Technology references that may be outdated; verify currency

## Step 3: Content Extraction

Apply source-type-specific extraction patterns to pull structured content.

### Extraction by Source Type

| Source Type | Extraction Method | Key Elements to Capture |
|-------------|-------------------|--------------------------|
| Documents (PDF, DOCX) | Section-by-section extraction | Headings, key paragraphs, data tables, figures, conclusions |
| URLs / Web pages | Main content extraction, ignore navigation | Article body, author, date, key data points |
| Conversations | Decision and action extraction | Decisions made, action items, open questions, participants |
| Meeting notes | Structured summary | Attendees, agenda items, decisions, follow-ups |
| Freeform notes | Topic clustering | Group by theme, preserve original phrasing for key ideas |
| Spreadsheets | Data characterization | Column meanings, row counts, key metrics, date ranges |
| Code / Repos | Structure and pattern extraction | Architecture, tech stack, dependencies, conventions |
| Email threads | Chronological decision tracking | Thread of decisions, final positions, open items |
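
As a sketch of how per-type extraction might be dispatched in code (assuming Python; the handlers are simplified stand-ins for the real patterns in extraction-patterns.md):

```python
def extract_decisions(raw: str) -> dict:
    """Conversation handler: keep lines that record decisions or actions."""
    keywords = ("decided", "decision", "action", "agreed")
    hits = [line.strip() for line in raw.splitlines()
            if any(k in line.lower() for k in keywords)]
    return {"decisions_and_actions": hits}

def extract_generic(raw: str) -> dict:
    """Fallback handler: keep the text whole, tagged as unclassified."""
    return {"unclassified": raw}

HANDLERS = {
    "conversation": extract_decisions,
    # "document", "url", "spreadsheet", ... would map to their own handlers
}

def extract(source_type: str, raw: str) -> dict:
    """Route a source to its type-specific extraction pattern."""
    return HANDLERS.get(source_type, extract_generic)(raw)

print(extract("conversation", "Intro chatter.\nWe decided to ship Friday."))
# {'decisions_and_actions': ['We decided to ship Friday.']}
```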

### Extraction Template

For each source, produce a block following this format:

```markdown
### [Source Name]

**Source ID:** SRC-001
**Type:** Document | URL | Conversation | Notes | Code | Data
**Reliability:** HIGH | MEDIUM | LOW (score: X.X)
**Extracted:** [date]

#### Key Content

[Extracted content organized by topic. Use direct quotes for important
statements. Paraphrase for general context. Always attribute.]

#### Notable Claims

- [Specific factual claim from the source]
- [Another claim worth tracking]

#### Open Questions

- [Questions raised by this source]
- [Ambiguities that need clarification]
```

### Extraction Quality Rules

| Rule | Rationale |
|------|-----------|
| Preserve original language for key claims | Paraphrasing can shift meaning; quote when precision matters |
| Note page/section numbers | Enables verification without re-reading the entire source |
| Separate fact from opinion | Mark subjective assessments as such (e.g., "Author claims...") |
| Flag quantitative data | Numbers, dates, and metrics are high-value; always capture them precisely |
| Capture what is NOT said | Notable omissions (e.g., no mention of budget) are information too |

## Step 4: Metadata Tagging

Every source entry gets a complete metadata block. Consistent metadata enables filtering, sorting, and tracing.

### Standard Metadata Schema

```yaml
---
source_id: SRC-001
title: "Client Requirements Brief"
type: document
format: pdf
author: "Jane Smith, VP Product"
organization: "Acme Corp"
date_created: 2025-11-15
date_accessed: 2026-01-25
reliability: 4.3
authority: 5
recency: 4
specificity: 4
word_count: 3200
topics: [requirements, timeline, budget, integrations]
related_sources: [SRC-003, SRC-007]
contradicts: []
status: extracted
notes: "Primary input document. Contains both requirements and constraints."
---
```

### Required vs Optional Metadata

| Field | Required | Default if Missing |
|-------|----------|--------------------|
| source_id | Yes | Auto-generated (SRC-NNN) |
| title | Yes | Filename or URL |
| type | Yes | -- |
| reliability | Yes | -- |
| date_accessed | Yes | Current date |
| topics | Yes | -- |
| author | No | "Unknown" |
| date_created | No | "Unknown" |
| organization | No | "Unknown" |
| related_sources | No | [] |
| contradicts | No | [] |
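
A hedged way to encode the schema and its defaults, assuming Python dataclasses (field names follow the schema above; `next_source_id` is an illustrative helper):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class SourceMetadata:
    # Required fields (no sensible default)
    source_id: str        # e.g. "SRC-001"
    title: str            # falls back to filename or URL
    type: str             # document | url | conversation | notes | code | data
    reliability: float    # composite score from Step 2
    topics: list[str]
    # Required, but defaultable to the current date
    date_accessed: str = field(default_factory=lambda: date.today().isoformat())
    # Optional fields with the documented defaults
    author: str = "Unknown"
    date_created: str = "Unknown"
    organization: str = "Unknown"
    related_sources: list[str] = field(default_factory=list)
    contradicts: list[str] = field(default_factory=list)

def next_source_id(existing_ids: list[str]) -> str:
    """Auto-generate the next sequential SRC-NNN identifier."""
    return f"SRC-{len(existing_ids) + 1:03d}"
```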

## Step 5: Deduplication and Cross-Referencing

After extraction, identify overlaps and build connections between sources.

### Deduplication Rules

| Scenario | Action |
|----------|--------|
| Exact duplicate | Keep the more authoritative or more recent version; note the duplicate |
| Overlapping content | Merge into a single entry, cite both sources |
| Same topic, different angles | Keep both; link via related_sources |
| Contradicting claims | Keep both; flag in contradicts field; note in INGESTION-LOG.md |
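
A sketch of the exact-duplicate and overlap rules, reusing the `SourceMetadata` sketch from Step 4 (`content_hash` and the returned action strings are illustrative):

```python
import hashlib

def content_hash(text: str) -> str:
    """Fingerprint normalized content for exact-duplicate detection."""
    return hashlib.sha256(text.strip().lower().encode()).hexdigest()

def dedup_action(a: SourceMetadata, b: SourceMetadata,
                 a_text: str, b_text: str) -> str:
    """Decide what to do with a pair of sources, per the rules above."""
    if content_hash(a_text) == content_hash(b_text):
        return "exact duplicate: keep the more authoritative/recent, note the other"
    if not set(a.topics) & set(b.topics):
        return "no overlap: keep both as-is"
    # Same topic, different angles: keep both and link them.
    # Contradictions are flagged manually in the contradicts field.
    a.related_sources.append(b.source_id)
    b.related_sources.append(a.source_id)
    return "overlap: kept both, linked via related_sources"
```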

### Cross-Reference Matrix

Build a topic-to-source matrix to visualize coverage:

```markdown
### Cross-Reference Matrix

| Topic | SRC-001 | SRC-002 | SRC-003 | SRC-004 | Coverage |
|-------|---------|---------|---------|---------|----------|
| Budget | X | | X | | 2 sources |
| Timeline | X | X | | | 2 sources |
| Tech stack | | | X | X | 2 sources |
| User needs | X | X | X | | 3 sources |
| Competitors | | | | X | 1 source |
| Compliance | | | | | 0 -- GAP |
```

Topics with 0-1 sources should be flagged as potential gaps.
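
The matrix and the gap rule fall out mechanically from the metadata. A sketch, again reusing the `SourceMetadata` type from Step 4:

```python
from collections import defaultdict

def coverage_map(sources: list[SourceMetadata]) -> dict[str, list[str]]:
    """Build the topic -> source-ID map behind the matrix above."""
    topics: dict[str, list[str]] = defaultdict(list)
    for src in sources:
        for topic in src.topics:
            topics[topic].append(src.source_id)
    return dict(topics)

def flag_gaps(coverage: dict[str, list[str]], core_topics: list[str]) -> list[str]:
    """Topics with 0-1 sources are flagged as potential gaps."""
    return [t for t in core_topics if len(coverage.get(t, [])) <= 1]
```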

## Step 6: Registry Assembly

Compile CONTEXT-SOURCES.md as the authoritative source registry.

### CONTEXT-SOURCES.md Template

```markdown
# Context Sources Registry

**Project:** [Project name]
**Ingestion Date:** [Date]
**Total Sources:** [Count]
**Source Types:** [Breakdown by type]

## Summary Statistics

| Metric | Value |
|--------|-------|
| Total sources | [N] |
| High reliability | [N] |
| Medium reliability | [N] |
| Low reliability | [N] |
| Unique topics covered | [N] |
| Identified gaps | [N] |

## Source Registry

### SRC-001: [Title]

- **Type:** [type]
- **Author:** [author]
- **Date:** [date]
- **Reliability:** [score] ([HIGH/MEDIUM/LOW])
- **Topics:** [topic1, topic2, topic3]
- **Related:** [SRC-NNN, SRC-NNN]
- **Summary:** [One-line description of what this source contributes]

### SRC-002: [Title]

[Same format...]

## Coverage Map

[Cross-reference matrix from Step 5]

## Identified Gaps

| Gap | Impact | Suggested Action |
|-----|--------|------------------|
| [Missing topic] | [How this affects downstream work] | [How to fill it] |
```

## Step 7: Corpus Compilation

Build RAW-CONTEXT.md as the organized content corpus, structured for consumption by context-cultivation.

### RAW-CONTEXT.md Template

```markdown
# Raw Context Corpus

**Project:** [Project name]
**Compiled:** [Date]
**Sources:** [Count] (see CONTEXT-SOURCES.md for full registry)

## How to Read This Document

Content is organized by topic. Each section draws from one or more
sources, identified by source ID (e.g., SRC-001). Direct quotes are
in blockquotes. Paraphrased content is in plain text.

---

## Topic: [Topic Name]

**Sources:** SRC-001, SRC-003, SRC-007

### From SRC-001 (Client Brief, reliability: HIGH)

[Extracted content relevant to this topic]

> "Direct quote from the source when precision matters" (p.12)

### From SRC-003 (Meeting Notes, reliability: MEDIUM)

[Extracted content relevant to this topic]

**Note:** This partially contradicts SRC-001 regarding [specific point].
See INGESTION-LOG.md for details.

---

## Topic: [Next Topic]

[Same structure...]

---

## Unclassified Content

[Content that does not fit neatly into a topic but may be relevant.
Tag with source ID for traceability.]
```

### Organization Principles

| Principle | Application |
|-----------|-------------|
| Topic-first, not source-first | Group by what the content is about, not where it came from |
| Highest reliability first | Within each topic, lead with the most authoritative source |
| Contradictions are visible | When sources disagree, show both and flag the conflict |
| Gaps are explicit | Empty topics or thin coverage are noted, not hidden |
| Original language preserved | Use blockquotes for critical statements; paraphrase for general context |

## Step 8: Completeness Check

Before marking ingestion as complete, verify coverage and quality.

### Completeness Checklist

```markdown
## Ingestion Completeness Review

### Source Coverage
- [ ] All provided sources have been processed
- [ ] Each source has a complete metadata block in CONTEXT-SOURCES.md
- [ ] Each source has extracted content in RAW-CONTEXT.md
- [ ] Source IDs are consistent across all documents

### Quality Checks
- [ ] Reliability scores assigned to every source
- [ ] No placeholder or stub entries remain
- [ ] Contradictions between sources are flagged
- [ ] Direct quotes are accurately attributed
- [ ] Quantitative data (dates, numbers, metrics) verified against source

### Coverage Analysis
- [ ] Cross-reference matrix is complete
- [ ] Gaps are identified and documented
- [ ] Gap impact and suggested actions provided
- [ ] Topics with single-source coverage flagged for attention

### Downstream Readiness
- [ ] RAW-CONTEXT.md is organized by topic (not by source)
- [ ] Content is sufficient for context-cultivation to begin
- [ ] INGESTION-LOG.md documents any decisions or anomalies (if applicable)
```

### Coverage Threshold

| Metric | Target | Minimum |
|--------|--------|---------|
| Sources processed | 100% of provided | 90% of provided |
| Metadata completeness | All required fields | source_id + title + type + reliability |
| Topic coverage | No gaps in core topics | Gaps documented with impact |
| Cross-references | All relationships mapped | Major relationships identified |
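
A sketch of checking the minimum thresholds programmatically, reusing the `SourceMetadata` sketch from Step 4 (the report shape is illustrative):

```python
def completeness_report(provided: list[str],
                        processed: list[SourceMetadata]) -> dict:
    """Evaluate the Coverage Threshold minimums above."""
    required_fields = ("source_id", "title", "type", "reliability")
    processed_pct = 100 * len(processed) / max(len(provided), 1)
    incomplete = [s.source_id for s in processed
                  if not all(getattr(s, f) for f in required_fields)]
    return {
        "sources_processed_pct": processed_pct,  # target 100, minimum 90
        "incomplete_metadata": incomplete,       # must be empty
        "meets_minimum": processed_pct >= 90 and not incomplete,
    }
```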

## Output Formats

### Quick Format (3 or fewer sources)

```markdown
# Context Sources

**Project:** [Name]
**Date:** [Date]

## Sources

### SRC-001: [Title]
- **Type:** [type] | **Reliability:** [score]
- **Key content:** [2-3 sentence summary]

### SRC-002: [Title]
- **Type:** [type] | **Reliability:** [score]
- **Key content:** [2-3 sentence summary]

## Compiled Context

### [Topic 1]
[Combined content from sources, attributed by SRC-ID]

### [Topic 2]
[Combined content from sources, attributed by SRC-ID]

## Gaps
- [Any missing information noted]
```

### Full Format (4+ sources)

Use the complete CONTEXT-SOURCES.md and RAW-CONTEXT.md templates from Steps 6 and 7, with:
- Full metadata blocks for every source
- Cross-reference matrix
- INGESTION-LOG.md for processing decisions
- Completeness checklist executed and documented

## Common Patterns

### The Deep Dive

Process a small number of highly detailed sources with maximum extraction depth. Every section, every data point, every claim is captured and tagged.

**Use when:** Working with 1-3 dense, authoritative documents (e.g., an RFP, a technical specification, a regulatory filing).

### The Wide Sweep

Process many sources at moderate depth, prioritizing breadth of coverage over extraction completeness. Focus on key claims and unique contributions from each source.

**Use when:** Onboarding to a new domain with 10+ heterogeneous sources (e.g., market research, competitor sites, internal docs, meeting notes).

### The Conversation Harvest

Extract structured information from unstructured dialogue -- meetings, chat logs, interviews. Focus on decisions, action items, stated preferences, and unanswered questions.

**Use when:** Primary inputs are conversations, calls, or informal exchanges rather than polished documents.

### The Incremental Build

Start with an initial source set, process it, then add new sources as they arrive. Each addition triggers a targeted update to the registry and corpus rather than a full reprocessing.

**Use when:** Sources arrive over time rather than all at once (e.g., ongoing client engagement, rolling research).

## Relationship to Other Skills

| Skill | Relationship |
|-------|--------------|
| context-cultivation | Receives CONTEXT-SOURCES.md and RAW-CONTEXT.md as primary inputs; transforms raw context into synthesized insights |
| priority-matrix | Uses cultivated context (which depends on ingested context) to establish priorities |
| proposal-builder | Final consumer in the proposal-loop chain; traces claims back to source IDs |
| metadata-extraction | Can be invoked as a sub-process during Step 4 for automated metadata tagging |
| content-analysis | Complementary skill for deeper analysis of individual complex documents |
| architect | In engineering contexts, architecture decisions benefit from ingested technical context |

## Key Principles

**Provenance is non-negotiable.** Every piece of extracted content must trace back to a specific source via its SRC-ID. Untraceable claims are unreliable claims.

**Evaluate before you extract.** Triage saves time. A low-relevance, low-reliability source processed at full depth is wasted effort. Assess first, then calibrate extraction depth.

**Structure enables downstream work.** The value of ingestion is not in the reading -- it is in organizing content so that context-cultivation, priority-matrix, and proposal-builder can work efficiently.

**Gaps are findings, not failures.** Discovering what is missing is as valuable as capturing what is present. Always document gaps with their impact and a suggested path to fill them.

**Preserve fidelity, defer interpretation.** Capture what sources actually say, in their own words when it matters. Interpretation, synthesis, and judgment belong to downstream skills.

**Contradictions are signals.** When sources disagree, do not resolve the conflict -- surface it. Contradictions often point to the most important areas for further investigation.

## References

- references/source-evaluation.md: Detailed criteria for assessing source authority, recency, and specificity
- references/extraction-patterns.md: Source-type-specific extraction templates and techniques
- references/metadata-schema.md: Extended metadata fields for domain-specific ingestion
- references/deduplication-rules.md: Rules for handling overlapping and duplicate content
- references/url-fetch-guidelines.md: Best practices for web content extraction at scale

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.