# metadata-extraction
# Install this skill

```bash
npx skills add miles-knowbl/orchestrator --skill "metadata-extraction"
```

Installs a specific skill from a multi-skill repository.

# Description

Extract, parse, and normalize metadata from diverse content sources including URLs, HTTP headers, HTML meta tags, structured data formats, and document structural elements. Produces standardized metadata records conforming to Dublin Core, OpenGraph, JSON-LD, and schema.org vocabularies. Supports the inbox/second-brain pipeline by ensuring every ingested item has rich, consistent, machine-readable metadata.

# SKILL.md


```yaml
---
name: metadata-extraction
description: "Extract, parse, and normalize metadata from diverse content sources including URLs, HTTP headers, HTML meta tags, structured data formats, and document structural elements. Produces standardized metadata records conforming to Dublin Core, OpenGraph, JSON-LD, and schema.org vocabularies. Supports the inbox/second-brain pipeline by ensuring every ingested item has rich, consistent, machine-readable metadata."
phase: IMPLEMENT
category: meta
version: "2.0.0"
depends_on: []
tags: [meta, analysis, utilities, parsing]
---
```


# Metadata Extraction

Extract and normalize metadata from diverse content sources into standardized, machine-readable records.

## When to Use

- Content enters the inbox pipeline --- A new URL, document, or media item arrives and needs cataloging before downstream processing
- Source identification required --- You need to determine what a piece of content is, where it came from, and who created it
- Structured data harvesting --- A page or document contains embedded metadata (OpenGraph, JSON-LD, microdata) that should be captured
- Cross-platform normalization --- Content from GitHub, YouTube, documentation sites, blogs, and other platforms needs a uniform metadata format
- Cataloging and classification --- Building a knowledge base or second brain that requires consistent metadata across heterogeneous sources
- Attribution and provenance tracking --- Need to establish authorship, publication dates, licensing, and source chains
- Quality assessment of sources --- Evaluating metadata completeness and reliability before committing to deeper analysis
- When you say: "extract metadata", "catalog this", "what is this source", "parse this URL", "normalize these records", "tag this content"

## Reference Requirements

MUST read before applying this skill:

| Reference | Why Required |
|---|---|
| metadata-schemas.md | Canonical field definitions for Dublin Core, OpenGraph, JSON-LD, and schema.org mappings |
| platform-extractors.md | Platform-specific extraction patterns for GitHub, YouTube, docs sites, blogs, and APIs |

Read if applicable:

| Reference | When Needed |
|---|---|
| normalization-rules.md | When merging metadata from multiple vocabularies or resolving conflicts between embedded schemas |
| structured-data-formats.md | When parsing RDFa, Microdata, JSON-LD, or other embedded structured data from HTML documents |

Verification: Ensure every extracted record contains at minimum: title, source_url (or source_path), content_type, date_accessed, and at least one topic tag. Records missing required fields must be flagged as incomplete.

## Required Deliverables

| Deliverable | Location | Condition |
|---|---|---|
| Metadata record(s) | Inline or appended to source entry | Always --- at least one record per input source |
| Extraction report | Inline summary | When 3+ sources processed --- completeness statistics and quality flags |
| Normalization log | Inline notes | When schema conflicts detected --- decisions on field mapping and precedence |
| Quality assessment | Appended to record | When metadata completeness falls below 70% --- flags and remediation suggestions |

## Core Concept

Metadata Extraction answers: "What is this content, where did it come from, and how should it be cataloged?"

Good metadata extraction is:
- Comprehensive --- Pulls from every available signal: URL structure, HTTP headers, HTML meta tags, structured data, visible content, and platform conventions
- Normalized --- Maps heterogeneous source metadata into a single consistent schema regardless of origin platform
- Standards-aware --- Leverages Dublin Core, OpenGraph, JSON-LD, and schema.org vocabularies rather than inventing ad-hoc fields
- Resilient --- Degrades gracefully when metadata is sparse, falling back through extraction layers until something useful is found
- Honest --- Reports confidence levels and flags incomplete or ambiguous metadata rather than guessing

Metadata Extraction is NOT:
- Content analysis or summarization (that is content-analysis)
- Source evaluation or reliability scoring (that is context-ingestion)
- Semantic understanding or theme extraction (that is context-cultivation)
- Full-text indexing or search optimization
- Data transformation or ETL processing

## The Metadata Extraction Process

```
+-----------------------------------------------------------------+
|               METADATA EXTRACTION PROCESS                        |
|                                                                  |
|  1. SOURCE IDENTIFICATION                                        |
|     +---> Parse URL, detect file type, identify platform         |
|                                                                  |
|  2. STRUCTURAL METADATA EXTRACTION                               |
|     +---> Titles, headings, dates, authors, descriptions         |
|                                                                  |
|  3. EMBEDDED SCHEMA EXTRACTION                                   |
|     +---> OpenGraph, JSON-LD, Microdata, Dublin Core, RDFa       |
|                                                                  |
|  4. PLATFORM-SPECIFIC EXTRACTION                                 |
|     +---> GitHub, YouTube, docs sites, blogs, APIs               |
|                                                                  |
|  5. SEMANTIC METADATA INFERENCE                                  |
|     +---> Topics, categories, language, content class            |
|                                                                  |
|  6. SCHEMA NORMALIZATION                                         |
|     +---> Map all fields to canonical schema, resolve conflicts  |
|                                                                  |
|  7. QUALITY ASSESSMENT                                           |
|     +---> Completeness score, reliability flags, gap report      |
|                                                                  |
|  8. OUTPUT FORMATTING                                            |
|     +---> Produce standardized metadata record                   |
+-----------------------------------------------------------------+
```

## Step 1: Source Identification

Before extracting metadata, determine what you are looking at. The source type dictates which extraction strategies apply.

### URL Parsing

Decompose URLs to extract embedded metadata signals:

| URL Component | Metadata Signal | Example |
|---|---|---|
| Protocol | Security, access type | https = secure; file = local |
| Domain | Publisher, platform | github.com = code repository |
| Subdomain | Content division | docs.example.com = documentation |
| Path segments | Content hierarchy, type | /blog/2025/01/title = blog post, dated |
| Path extensions | File format | .pdf, .md, .html |
| Query parameters | View state, filters | ?v=xxxxx = YouTube video ID |
| Fragment | Section reference | #installation = specific section |

### URL Pattern Recognition

```markdown
## Common URL Patterns

### GitHub
- `github.com/{owner}/{repo}` --> Repository root
- `github.com/{owner}/{repo}/blob/{branch}/{path}` --> File view
- `github.com/{owner}/{repo}/issues/{n}` --> Issue
- `github.com/{owner}/{repo}/pull/{n}` --> Pull request
- `github.com/{owner}/{repo}/releases/tag/{tag}` --> Release

### YouTube
- `youtube.com/watch?v={id}` --> Video
- `youtube.com/playlist?list={id}` --> Playlist
- `youtube.com/@{handle}` --> Channel
- `youtu.be/{id}` --> Short video link

### Documentation Sites
- `docs.{product}.com/{path}` --> Product documentation
- `{product}.readthedocs.io/{path}` --> ReadTheDocs project
- `developer.{company}.com/{path}` --> Developer portal
- `{domain}/api/{version}/{path}` --> API documentation

### Blogs and Articles
- `{domain}/blog/{year}/{month}/{slug}` --> Dated blog post
- `{domain}/posts/{slug}` --> Undated post
- `medium.com/@{author}/{slug}` --> Medium article
- `dev.to/{author}/{slug}` --> Dev.to article
```
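The component table and pattern list above translate directly into code built on Python's standard `urllib.parse`. A minimal sketch; the `identify_source` helper and the pattern table are illustrative, not the full extractor set from platform-extractors.md:

```python
# Minimal URL decomposition and platform detection (illustrative patterns only).
import re
from urllib.parse import urlparse, parse_qs

PLATFORM_PATTERNS = [
    ("github",  re.compile(r"(^|\.)github\.com$")),
    ("youtube", re.compile(r"(^|\.)(youtube\.com|youtu\.be)$")),
    ("medium",  re.compile(r"(^|\.)medium\.com$")),
    ("devto",   re.compile(r"(^|\.)dev\.to$")),
]

def identify_source(url: str) -> dict:
    parts = urlparse(url)
    platform = next(
        (name for name, pat in PLATFORM_PATTERNS if pat.search(parts.hostname or "")),
        "generic",
    )
    return {
        "protocol": parts.scheme,            # https = secure; file = local
        "domain": parts.hostname,            # publisher / platform signal
        "path_segments": [s for s in parts.path.split("/") if s],
        "query": parse_qs(parts.query),      # e.g. {"v": ["<video id>"]} on YouTube
        "fragment": parts.fragment or None,  # section reference
        "platform": platform,
    }

print(identify_source("https://github.com/octocat/hello-world/issues/42"))
```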

### Content Type Detection

Determine content type through multiple signals, in priority order:

| Priority | Signal | Method |
|---|---|---|
| 1 | HTTP Content-Type header | Direct server declaration; most authoritative |
| 2 | File extension | .pdf, .html, .json, .md, .mp4 |
| 3 | URL pattern | Platform-specific patterns (see above) |
| 4 | Content sniffing | First bytes / magic numbers of file content |
| 5 | Heuristic | Presence of HTML tags, JSON structure, markdown syntax |
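Priorities 4 and 5 in the table above can be approximated with a few well-known magic numbers and simple heuristics. A hedged sketch; the signatures shown are standard file signatures, but real sniffers check many more:

```python
# Content sniffing via magic numbers, with a crude heuristic fallback (priority 4-5).
def sniff_content_type(data: bytes) -> str:
    if data.startswith(b"%PDF"):
        return "application/pdf"
    if data.startswith(b"PK\x03\x04"):            # ZIP container (docx, xlsx, epub, ...)
        return "application/zip"
    if data.startswith((b"\x89PNG", b"\xff\xd8\xff", b"GIF8")):
        return "image"
    head = data[:512].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    if head.startswith((b"{", b"[")):
        return "application/json"                 # heuristic, not definitive
    return "application/octet-stream"
```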

### Content Type Taxonomy

```markdown
## Content Classes

- **document**: PDF, Word, slides, spreadsheets
- **webpage**: HTML pages, blog posts, articles
- **code**: Source files, repositories, gists, notebooks
- **media**: Video, audio, images, podcasts
- **data**: JSON, CSV, XML, YAML data files
- **api**: API documentation, endpoint references, schemas
- **conversation**: Forum threads, Q&A, chat logs, comments
- **reference**: Wiki pages, encyclopedic entries, glossaries
```

### Source Identification Checklist

- [ ] URL parsed into components (protocol, domain, path, query, fragment)
- [ ] Platform identified (GitHub, YouTube, docs site, blog, generic)
- [ ] Content type determined (document, webpage, code, media, data, api)
- [ ] File format identified if applicable (PDF, HTML, MD, JSON)
- [ ] Access method confirmed (public, authenticated, local)

## Step 2: Structural Metadata Extraction

Extract metadata from the visible and structural elements of the content.

### HTML Document Metadata

| Element | Extraction Target | Fallback |
|---|---|---|
| `<title>` | Page title | First `<h1>`, then `<h2>` |
| `<meta name="description">` | Page description | First `<p>` of main content |
| `<meta name="author">` | Author name | Byline element, schema.org author |
| `<meta name="keywords">` | Topic keywords | None (often absent or unreliable) |
| `<meta name="date">` / `<time>` | Publication date | URL date pattern, last-modified header |
| `<link rel="canonical">` | Canonical URL | Current URL |
| `<html lang="...">` | Content language | Heuristic detection |

### Date Extraction Priority

Dates are critical metadata but notoriously inconsistent. Extract in this priority order:

| Priority | Source | Reliability | Format Variants |
|---|---|---|---|
| 1 | `<meta property="article:published_time">` | High | ISO 8601 |
| 2 | `<time datetime="...">` element | High | ISO 8601 |
| 3 | JSON-LD datePublished | High | ISO 8601 |
| 4 | HTTP Last-Modified header | Medium | RFC 7231 |
| 5 | URL path date pattern (/2025/01/15/) | Medium | Year/month/day |
| 6 | Visible text pattern ("Published January 15, 2025") | Low | Natural language |
| 7 | Copyright year in footer | Low | Year only |

### Date Normalization

All extracted dates must be normalized to ISO 8601 format:

```
Full:       2025-01-15T14:30:00Z
Date only:  2025-01-15
Year-month: 2025-01
Year only:  2025
Unknown:    null (with flag: "date_unknown": true)
```
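A minimal normalization sketch using only the standard library. The `normalize_date` helper is hypothetical and covers the three highest-priority formats from the table above (ISO 8601, RFC 7231, URL path dates); anything else falls through to the unknown flag:

```python
# Normalize heterogeneous date strings to ISO 8601, flagging unknowns (a sketch).
import re
from datetime import datetime
from email.utils import parsedate_to_datetime

def normalize_date(raw: str | None) -> dict:
    if not raw:
        return {"date": None, "date_unknown": True}
    raw = raw.strip()
    # 1. ISO 8601 (JSON-LD, <time datetime>, article:published_time)
    try:
        dt = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        return {"date": dt.isoformat(), "date_unknown": False}
    except ValueError:
        pass
    # 2. RFC 7231 (HTTP Last-Modified: "Wed, 15 Jan 2025 14:30:00 GMT")
    try:
        return {"date": parsedate_to_datetime(raw).isoformat(), "date_unknown": False}
    except (TypeError, ValueError):
        pass
    # 3. URL path pattern: /2025/01/15/ -> reduced-precision date
    m = re.search(r"/(\d{4})/(\d{2})(?:/(\d{2}))?/", raw)
    if m:
        y, mo, d = m.group(1), m.group(2), m.group(3)
        return {"date": f"{y}-{mo}-{d}" if d else f"{y}-{mo}", "date_unknown": False}
    return {"date": None, "date_unknown": True}
```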

### Author Extraction

| Source | Pattern | Confidence |
|---|---|---|
| `<meta name="author">` | Direct declaration | High |
| Schema.org author property | Structured data | High |
| Byline element (class="author", class="byline") | Convention-based | Medium |
| GitHub commit / profile | Platform API | High |
| YouTube channel name | Platform convention | High |
| Text pattern ("By John Smith") | Regex heuristic | Low |

### Structural Elements Checklist

- [ ] Title extracted (with source noted: <title>, <h1>, og:title, etc.)
- [ ] Description extracted or generated from first paragraph
- [ ] Author identified (or marked "Unknown" with confidence: low)
- [ ] Publication date extracted and normalized to ISO 8601
- [ ] Language identified (ISO 639-1 code)
- [ ] Canonical URL determined
- [ ] Word count or content length estimated

## Step 3: Embedded Schema Extraction

Modern web pages embed structured metadata in multiple overlapping formats. Extract from all available schemas, then normalize in Step 6.

### OpenGraph Protocol (og:)

OpenGraph tags are the most widely deployed structured metadata on the web. Extract from `<meta property="og:...">` tags.

| Property | Maps To | Notes |
|---|---|---|
| og:title | title | Often more descriptive than `<title>` |
| og:description | description | Social-sharing optimized description |
| og:type | content_type | article, video.other, website, profile |
| og:url | canonical_url | Authoritative URL for the content |
| og:image | thumbnail_url | Preview image URL |
| og:site_name | publisher | Site or brand name |
| og:locale | language | Format: en_US |
| article:published_time | date_published | ISO 8601 datetime |
| article:modified_time | date_modified | ISO 8601 datetime |
| article:author | author_url | URL to author profile |
| article:section | category | Content section/category |
| article:tag | topics[] | Content tags (may repeat) |
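A sketch of the harvesting step, assuming BeautifulSoup (`bs4`) is available; `extract_opengraph` is an illustrative helper, not a library API. Note the special handling for `article:tag`, which may repeat:

```python
# Harvest OpenGraph and article:* meta tags into a flat dict (assumes bs4 is installed).
import re
from bs4 import BeautifulSoup

def extract_opengraph(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    og: dict[str, object] = {}
    for tag in soup.find_all("meta", attrs={"property": re.compile(r"^(og|article):")}):
        prop, content = tag.get("property"), tag.get("content")
        if not content:
            continue
        if prop == "article:tag":                # may repeat: collect into a list
            og.setdefault("article:tag", []).append(content)
        else:
            og[prop] = content
    return og
```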

### JSON-LD (Linked Data)

JSON-LD is the most machine-readable format, embedded in `<script type="application/ld+json">`. Extract and parse the JSON structure.

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example Article Title",
  "author": {
    "@type": "Person",
    "name": "Jane Smith",
    "url": "https://example.com/authors/jane"
  },
  "datePublished": "2025-01-15T10:00:00Z",
  "dateModified": "2025-01-20T14:30:00Z",
  "publisher": {
    "@type": "Organization",
    "name": "Example Publisher",
    "logo": { "@type": "ImageObject", "url": "https://example.com/logo.png" }
  },
  "description": "A description of the article content.",
  "image": "https://example.com/article-image.jpg",
  "mainEntityOfPage": "https://example.com/article-url"
}
```
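Parsing these blocks is mostly standard-library work once the HTML is parsed. A sketch, again assuming `bs4`; a single block may contain one object, a list, or an `@graph` container, and malformed JSON should be skipped rather than fatal:

```python
# Parse every JSON-LD block on a page; tolerate malformed JSON and @graph wrappers.
import json
from bs4 import BeautifulSoup

def extract_jsonld(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # flag in extraction_notes instead of failing the pipeline
        # A block may hold one object, a list, or an @graph container.
        items = data if isinstance(data, list) else data.get("@graph", [data])
        records.extend(i for i in items if isinstance(i, dict))
    return records
```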

### Key schema.org Types

| @type | Expected Properties | Content Class |
|---|---|---|
| Article / NewsArticle / BlogPosting | headline, author, datePublished, publisher | webpage |
| WebPage / WebSite | name, url, description | webpage |
| SoftwareSourceCode / SoftwareApplication | name, author, programmingLanguage, codeRepository | code |
| VideoObject | name, description, duration, uploadDate, thumbnailUrl | media |
| HowTo / TechArticle | name, step[], proficiencyLevel | reference |
| Person / Organization | name, url, sameAs[] | entity |
| Dataset | name, description, distribution, license | data |
| APIReference | name, description, programmingModel | api |

### Dublin Core Metadata

Dublin Core elements appear as `<meta name="DC.*" content="...">` or `<meta name="dcterms.*">` tags. This is common in academic, government, and library contexts.

| DC Element | Maps To | Notes |
|---|---|---|
| DC.title | title | Formal title |
| DC.creator | author | Creator/author name |
| DC.subject | topics[] | Subject keywords |
| DC.description | description | Content description |
| DC.publisher | publisher | Publishing entity |
| DC.date | date_published | Publication date |
| DC.type | content_type | DCMI Type Vocabulary |
| DC.format | format | MIME type |
| DC.identifier | identifier | DOI, ISBN, URI |
| DC.language | language | ISO 639 code |
| DC.rights | license | Rights statement |
| DC.source | source_url | Derived-from source |
| DC.relation | related[] | Related resources |
| DC.coverage | coverage | Spatial/temporal scope |

### Microdata and RDFa

These formats embed metadata directly in HTML attributes:

```html
<!-- Microdata -->
<div itemscope itemtype="https://schema.org/Article">
  <h1 itemprop="headline">Article Title</h1>
  <span itemprop="author">Jane Smith</span>
  <time itemprop="datePublished" datetime="2025-01-15">Jan 15, 2025</time>
</div>

<!-- RDFa -->
<div vocab="https://schema.org/" typeof="Article">
  <h1 property="headline">Article Title</h1>
  <span property="author">Jane Smith</span>
  <time property="datePublished" datetime="2025-01-15">Jan 15, 2025</time>
</div>
```
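A deliberately simplified flat scan for Microdata, assuming `bs4`; a conforming parser would honor `itemscope` nesting, which this sketch ignores:

```python
# Flat scan of Microdata itemprop values (real parsers honor itemscope nesting).
from bs4 import BeautifulSoup

def extract_microdata_flat(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    props = {}
    for el in soup.find_all(attrs={"itemprop": True}):
        name = el["itemprop"]
        # <time> carries its value in datetime=; meta-like tags in content=;
        # everything else in visible text.
        props[name] = el.get("datetime") or el.get("content") or el.get_text(strip=True)
    return props
```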

### Schema Extraction Checklist

- [ ] OpenGraph tags extracted (og:title, og:description, og:type, og:url, og:image)
- [ ] JSON-LD blocks parsed (all <script type="application/ld+json"> elements)
- [ ] Dublin Core elements checked (DC.* and dcterms.* meta tags)
- [ ] Microdata scanned (itemscope/itemprop attributes)
- [ ] RDFa scanned (vocab/typeof/property attributes)
- [ ] Twitter Card tags checked (twitter:title, twitter:description, twitter:image)
- [ ] Schema conflicts noted (e.g., og:title differs from JSON-LD headline)

## Step 4: Platform-Specific Extraction

Different platforms expose metadata through platform-specific conventions, APIs, and page structures. Apply targeted extractors based on the platform identified in Step 1.

### GitHub

| Metadata Field | Extraction Source |
|---|---|
| Repository name | URL path: /{owner}/{repo} |
| Owner / organization | URL path: /{owner}/ |
| Description | `<meta property="og:description">` or repo about section |
| Primary language | Language bar / linguist data |
| Topics | Repository topics (tags below description) |
| Stars / forks | Social proof metrics |
| License | LICENSE file or API license field |
| Last updated | Most recent commit date or push date |
| README content | README.md at repository root |
| Contributors | Contributor count from page or API |
| Open issues | Issue count for activity assessment |
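Where the page is rate-limited or rendered client-side, the GitHub REST API exposes most of these fields directly. A sketch using `requests` against the public `GET /repos/{owner}/{repo}` endpoint (unauthenticated calls are heavily rate-limited; the output keys mirror the canonical schema in Step 6):

```python
# Pull repository metadata from the GitHub REST API (unauthenticated; rate-limited).
import requests

def github_repo_metadata(owner: str, repo: str) -> dict:
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "title": data["full_name"],
        "description": data.get("description"),
        "author": data["owner"]["login"],
        "topics": data.get("topics", []),
        "language": data.get("language"),
        "license": (data.get("license") or {}).get("spdx_id"),
        "date_modified": data.get("pushed_at"),  # last push, already ISO 8601
        "platform": "github",
        "content_type": "code",
    }
```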

### YouTube

| Metadata Field | Extraction Source |
|---|---|
| Video title | og:title or JSON-LD name |
| Channel name | og:site_name or channel link text |
| Upload date | JSON-LD uploadDate or datePublished |
| Duration | JSON-LD duration (ISO 8601 duration) |
| Description | og:description (often truncated) or JSON-LD |
| View count | JSON-LD interactionCount |
| Thumbnail | og:image or JSON-LD thumbnailUrl |
| Tags | `<meta name="keywords">` (comma-separated) |
| Category | JSON-LD genre or visible category link |
| Captions available | Accessibility metadata or API |

### Documentation Sites

| Metadata Field | Extraction Source |
|---|---|
| Product / project name | Site header, og:site_name, breadcrumb root |
| Doc version | URL segment (/v2/, /latest/), version selector |
| Section / category | Breadcrumb trail, sidebar navigation context |
| Page title | `<h1>` or og:title (strip product prefix) |
| Last updated | Footer date, git blame date, meta tag |
| Navigation context | Previous/next links, breadcrumb hierarchy |
| API version | Path segment or header declaration |

### Blog / Article Platforms

| Platform | Author Source | Date Source | Tags Source |
|---|---|---|---|
| Medium | Author card, @{handle} in URL | `<time>` element | Bottom-of-article tags |
| Dev.to | URL /{author}/, profile card | `<time>` element | Visible tag list |
| WordPress | `<meta name="author">`, byline | `<time class="entry-date">` | `<a rel="tag">` elements |
| Ghost | JSON-LD author, byline | JSON-LD datePublished | JSON-LD keywords |
| Substack | Publication name + author | `<time>` or JSON-LD | Categories if present |
| Hugo / Jekyll | YAML frontmatter author | Frontmatter date | Frontmatter tags |

### Platform Extraction Checklist

- [ ] Platform identified from URL pattern
- [ ] Platform-specific extractor applied
- [ ] Platform-unique fields captured (stars, views, duration, etc.)
- [ ] Author profile URL captured (not just name)
- [ ] Platform-specific content type mapped to canonical type

## Step 5: Semantic Metadata Inference

When explicit metadata is absent or sparse, infer semantic properties from content analysis. These inferred fields carry lower confidence than extracted fields.

### Topic Inference

Derive topics from available signals, in priority order:

| Priority | Signal | Method | Confidence |
|---|---|---|---|
| 1 | Explicit tags / keywords | Direct extraction | High |
| 2 | Article section / category | Platform categorization | High |
| 3 | Title keywords | Key noun phrases from title | Medium |
| 4 | Heading structure | H2/H3 topics across the document | Medium |
| 5 | URL path segments | Meaningful path words | Low |
| 6 | Content body analysis | Frequent terms and phrases | Low |

Content Classification

Assign a primary content class based on structural signals:

Content Class Signals
tutorial Step-numbered headings, "how to" in title, code blocks interspersed with prose
reference Table-heavy, alphabetical ordering, definition lists, API signatures
conceptual Explanatory prose, minimal code, "what is" / "understanding" in title
news Dateline, recent date, event-driven topic, publication attribution
opinion First-person voice, evaluative language, "I think" / "my experience"
announcement "Introducing", "announcing", "releasing", version numbers, changelog format
discussion Multiple voices, thread structure, Q&A format
specification Formal language, MUST/SHALL/MAY keywords (RFC 2119), numbered requirements

### Language Detection

| Method | Reliability | Application |
|---|---|---|
| `<html lang="...">` attribute | High | Declared by page author |
| Content-Language HTTP header | High | Server-declared |
| og:locale property | High | OpenGraph declaration |
| Heuristic character analysis | Medium | Script detection (Latin, CJK, Arabic, Cyrillic) |
| n-gram frequency analysis | Medium | Statistical language identification |
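Heuristic character analysis can be as simple as counting code points per Unicode block. A rough sketch; the block ranges are abbreviated and real detectors cover far more scripts, so treat this as a medium-confidence signal at best:

```python
# Crude script detection by Unicode block counts (a heuristic, not a language model).
def detect_script(text: str) -> str:
    ranges = {
        "cjk":      (0x4E00, 0x9FFF),
        "cyrillic": (0x0400, 0x04FF),
        "arabic":   (0x0600, 0x06FF),
    }
    counts = {name: 0 for name in ranges}
    counts["latin"] = 0
    for ch in text:
        cp = ord(ch)
        if ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        else:
            for name, (lo, hi) in ranges.items():
                if lo <= cp <= hi:
                    counts[name] += 1
    return max(counts, key=counts.get) if any(counts.values()) else "unknown"
```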

### Semantic Inference Checklist

- [ ] Topics assigned (with source: explicit, inferred, or both)
- [ ] Content class determined (tutorial, reference, conceptual, etc.)
- [ ] Language identified (ISO 639-1 code)
- [ ] Reading level estimated if applicable (technical depth)
- [ ] Inferred fields marked with confidence: medium or confidence: low

## Step 6: Schema Normalization

Map all extracted metadata --- from structural elements, embedded schemas, platform-specific fields, and inferred properties --- into a single canonical record.

### Canonical Metadata Schema

```yaml
# Required fields
title: "string"              # Content title
source_url: "string"         # URL or file path of the source
content_type: "string"       # Canonical type: webpage, document, code, media, data, api, conversation, reference
date_accessed: "ISO 8601"    # When this metadata was extracted
topics: ["string"]           # At least one topic tag

# Strongly recommended fields
description: "string"        # Content summary (1-3 sentences)
author: "string"             # Primary author name
author_url: "string"         # URL to author profile
date_published: "ISO 8601"   # Publication or creation date
date_modified: "ISO 8601"    # Last modification date
language: "string"           # ISO 639-1 language code
publisher: "string"          # Publishing platform or organization
canonical_url: "string"      # Authoritative URL (may differ from source_url)

# Optional enrichment fields
license: "string"            # License identifier (SPDX or description)
thumbnail_url: "string"      # Preview image URL
word_count: "number"         # Approximate content length
reading_time_minutes: "number"  # Estimated reading time
content_class: "string"      # tutorial, reference, conceptual, news, opinion, etc.
platform: "string"           # github, youtube, medium, docs, blog, etc.
format: "string"             # MIME type or file format
identifier: "string"         # DOI, ISBN, or other formal identifier
related: ["string"]          # URLs of related content
tags_explicit: ["string"]    # Tags declared by the source
tags_inferred: ["string"]    # Tags inferred during extraction

# Quality fields
metadata_completeness: "number"  # 0.0-1.0 score
extraction_confidence: "string"  # high, medium, low
extraction_notes: ["string"]     # Flags, warnings, or decisions made during extraction
```

### Field Precedence Rules

When multiple sources provide the same field, apply this precedence:

| Field | First Choice | Second Choice | Third Choice | Last Resort |
|---|---|---|---|---|
| title | JSON-LD headline | og:title | `<title>` | First `<h1>` |
| description | JSON-LD description | og:description | `<meta name="description">` | First paragraph |
| author | JSON-LD author.name | `<meta name="author">` | Byline element | Platform profile |
| date_published | JSON-LD datePublished | article:published_time | `<time datetime>` | URL date pattern |
| topics | Explicit tags + article:tag | DC.subject | `<meta name="keywords">` | Inferred from title |
| content_type | JSON-LD @type mapped | og:type mapped | Platform convention | Heuristic |
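In code, precedence reduces to "first non-empty value wins, and remember where it came from". A sketch with a hypothetical `resolve_field` helper; the recorded source feeds the provenance requirement in Step 8:

```python
# First non-empty candidate wins; provenance is kept for extraction_notes.
def resolve_field(candidates: list[tuple[str, str | None]]) -> dict:
    """candidates are (source_name, value) pairs, highest precedence first."""
    for source, value in candidates:
        if value and value.strip():
            return {"value": value.strip(), "source": source}
    return {"value": None, "source": None}

title = resolve_field([
    ("json-ld:headline", "Example Article Title"),
    ("og:title", "Example Article Title | Example Site"),
    ("html:title", None),
])
# -> {"value": "Example Article Title", "source": "json-ld:headline"}
```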

### Conflict Resolution

When extracted values conflict across schemas:

| Conflict Type | Resolution Strategy |
|---|---|
| Minor wording difference | Prefer the more specific or complete version |
| Substantive disagreement | Prefer JSON-LD > OpenGraph > meta tags > inferred |
| Date discrepancy | Prefer the earliest plausible date as published; note discrepancy |
| Author mismatch | Prefer structured data (JSON-LD) over unstructured (byline text) |
| Multiple types | Use most specific type; note alternative classification |

### Normalization Checklist

- [ ] All extracted fields mapped to canonical schema
- [ ] Field precedence applied where conflicts exist
- [ ] Dates normalized to ISO 8601
- [ ] Language normalized to ISO 639-1
- [ ] Content type mapped to canonical taxonomy
- [ ] Conflict resolution decisions documented in extraction_notes
- [ ] Required fields present (title, source_url, content_type, date_accessed, topics)

## Step 7: Quality Assessment

Assess the completeness and reliability of the extracted metadata record before finalizing.

### Completeness Scoring

Calculate metadata completeness as a ratio of populated fields to total applicable fields (the canonical schema defines 5 required and 8 recommended fields):

```
Completeness = (populated required fields / 5) * 0.5
             + (populated recommended fields / 8) * 0.3
             + (populated optional fields / total optional) * 0.2
```

Score ranges:

```
0.90 - 1.00  =  EXCELLENT  -->  Rich metadata, high confidence
0.70 - 0.89  =  GOOD       -->  Sufficient for cataloging and discovery
0.50 - 0.69  =  FAIR       -->  Usable but with notable gaps
0.30 - 0.49  =  POOR       -->  Significant gaps, flag for manual review
0.00 - 0.29  =  MINIMAL    -->  Only basic identification possible
```
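Applied to the canonical schema above (5 required, 8 recommended, 11 optional fields), the formula is a few lines of code. A sketch:

```python
# Compute the weighted completeness score defined above.
REQUIRED    = ["title", "source_url", "content_type", "date_accessed", "topics"]
RECOMMENDED = ["description", "author", "author_url", "date_published",
               "date_modified", "language", "publisher", "canonical_url"]
OPTIONAL    = ["license", "thumbnail_url", "word_count", "reading_time_minutes",
               "content_class", "platform", "format", "identifier", "related",
               "tags_explicit", "tags_inferred"]

def completeness(record: dict) -> float:
    def filled(fields):
        # A field counts as populated unless it is None, empty string, or empty list.
        return sum(1 for f in fields if record.get(f) not in (None, "", []))
    return round(
        (filled(REQUIRED) / len(REQUIRED)) * 0.5
        + (filled(RECOMMENDED) / len(RECOMMENDED)) * 0.3
        + (filled(OPTIONAL) / len(OPTIONAL)) * 0.2,
        2,
    )
```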

### Reliability Indicators

| Indicator | Positive Signal | Negative Signal |
|---|---|---|
| Source authority | Known publisher, verified author | Anonymous, unverifiable source |
| Schema richness | JSON-LD + OpenGraph + meta tags | No structured data at all |
| Date availability | Published date with timezone | No dates anywhere |
| Cross-schema consistency | All schemas agree on key fields | Contradictions between schemas |
| Platform recognition | Known platform with established patterns | Unknown site with no conventions |
| Content structure | Clear headings, semantic HTML | Flat text, no structure |

### Quality Flags

Flag any of the following conditions in extraction_notes:

```markdown
## Quality Flags

- MISSING_DATE: No publication date found from any source
- MISSING_AUTHOR: No author identified; attribution not possible
- INFERRED_ONLY: Key fields populated only through inference (no explicit metadata)
- SCHEMA_CONFLICT: Conflicting values found across metadata schemas
- STALE_CONTENT: Content appears outdated (date > 2 years old)
- TRUNCATED_DESCRIPTION: Description was cut off or incomplete
- NO_STRUCTURED_DATA: No JSON-LD, OpenGraph, or Dublin Core found
- REDIRECT_DETECTED: Source URL redirected; canonical may differ
- PAYWALL_SUSPECTED: Full content may not be accessible
- GENERATED_CONTENT: Content appears AI-generated (metadata pattern)
```

### Quality Assessment Checklist

- [ ] Completeness score calculated
- [ ] Reliability indicators assessed
- [ ] Quality flags applied where applicable
- [ ] Missing required fields explicitly noted
- [ ] Confidence level assigned (high / medium / low)
- [ ] Remediation suggestions provided for POOR or MINIMAL scores

## Step 8: Output Formatting

Produce the final metadata record in the standardized format.

### Record Assembly

Combine all extraction, normalization, and quality assessment into the final output. Include provenance information showing which extraction step produced each field.

### Final Record Validation

Before outputting, verify:

- [ ] All required fields present and non-empty
- [ ] All dates in ISO 8601 format
- [ ] All URLs are absolute (not relative)
- [ ] Topics array has at least one entry
- [ ] content_type is from canonical taxonomy
- [ ] metadata_completeness score is calculated
- [ ] extraction_confidence is set
- [ ] extraction_notes captures any flags or decisions
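A sketch of these checks as a validator returning a list of problems; `validate_record` is a hypothetical helper, the regex accepts the reduced-precision ISO forms from Step 2, and the absolute-URL test is deliberately crude (source_url is exempt because it may be a local path):

```python
# Validate an assembled record against the checklist; returns problems found.
import re

ISO_DATE = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}(:\d{2})?(Z|[+-]\d{2}:\d{2})?)?)?)?$"
)
CANONICAL_TYPES = {"webpage", "document", "code", "media", "data",
                   "api", "conversation", "reference"}
REQUIRED = ("title", "source_url", "content_type", "date_accessed", "topics")

def validate_record(record: dict) -> list[str]:
    problems = []
    for field in REQUIRED:
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    ctype = record.get("content_type")
    if ctype and ctype not in CANONICAL_TYPES:
        problems.append(f"content_type not in canonical taxonomy: {ctype}")
    for field in ("date_published", "date_modified", "date_accessed"):
        value = record.get(field)
        if value and not ISO_DATE.match(value):
            problems.append(f"{field} not ISO 8601: {value!r}")
    for field in ("canonical_url", "author_url", "thumbnail_url"):
        value = record.get(field)
        if value and "://" not in value:
            problems.append(f"{field} is not an absolute URL: {value!r}")
    return problems
```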

### Output Formats

#### Quick Format (single source, inline use)

```markdown
**Metadata: [Title]**
- **Source:** [URL or path]
- **Type:** [content_type] | **Platform:** [platform]
- **Author:** [author] | **Published:** [date_published]
- **Language:** [language] | **License:** [license]
- **Topics:** [topic1, topic2, topic3]
- **Description:** [description]
- **Completeness:** [score] ([EXCELLENT/GOOD/FAIR/POOR/MINIMAL])
- **Flags:** [any quality flags]
```

#### Full Format (detailed record, cataloging use)

```yaml
---
# Metadata Record
title: "[Title]"
source_url: "[URL]"
canonical_url: "[Canonical URL]"
content_type: "[type]"
content_class: "[class]"
platform: "[platform]"
format: "[MIME type or format]"

# Attribution
author: "[Author Name]"
author_url: "[Author Profile URL]"
publisher: "[Publisher Name]"

# Temporal
date_published: "[ISO 8601]"
date_modified: "[ISO 8601]"
date_accessed: "[ISO 8601]"

# Classification
language: "[ISO 639-1]"
topics:
  - "[topic1]"
  - "[topic2]"
tags_explicit:
  - "[tag1]"
tags_inferred:
  - "[tag2]"

# Enrichment
description: "[1-3 sentence description]"
thumbnail_url: "[URL]"
word_count: [number]
reading_time_minutes: [number]
license: "[SPDX or description]"
identifier: "[DOI, ISBN, etc.]"
related:
  - "[URL]"

# Quality
metadata_completeness: [0.0-1.0]
extraction_confidence: "[high/medium/low]"
extraction_notes:
  - "[Note or flag]"

# Provenance
extracted_by: "metadata-extraction v2.0.0"
extraction_date: "[ISO 8601]"
schemas_found: ["json-ld", "opengraph", "dublin-core"]
---
```

## Common Patterns

### Web Page Extraction

Process a standard web page (blog post, article, documentation page) through the full pipeline. Start with URL parsing to identify the platform, then extract HTML structural metadata, harvest all embedded schemas (OpenGraph, JSON-LD, Dublin Core), apply platform-specific extractors, infer any missing semantic fields, normalize, and assess quality.

Use when: The source is an HTML page accessible via URL. This is the most common extraction scenario and exercises all pipeline steps.

### API Documentation Extraction

Extract metadata from API reference pages, focusing on technical specificity. Capture the API product name, version, endpoint paths, authentication requirements, and rate limits. Map to schema.org APIReference or TechArticle types. Pay special attention to version information embedded in URLs (/v2/, /latest/) and documentation framework metadata (Swagger/OpenAPI, Redoc, Slate).

Use when: The source is API documentation, SDK reference, or developer portal content. These sources have distinctive structural patterns and specialized metadata needs.

### Video and Media Extraction

Extract metadata from video platforms (YouTube, Vimeo) and podcast hosts. Focus on duration, upload date, creator/channel, view metrics, and associated text content (descriptions, transcripts, show notes). JSON-LD on video platforms is typically rich; prefer it over OpenGraph for structured fields like duration.

Use when: The source is a video, podcast episode, or other media content. These sources carry temporal metadata (duration) and engagement metrics not found in text content.

### Code Repository Extraction

Extract metadata from code hosting platforms (GitHub, GitLab, Bitbucket). Capture repository name, owner, description, primary language, topics/tags, license, star/fork counts, contributor count, and last activity date. README content provides the richest description source. API endpoints (when accessible) provide the most structured data.

Use when: The source is a code repository, gist, or code-sharing platform. These sources have a unique metadata vocabulary (stars, forks, languages) that must be mapped to canonical fields.

## Relationship to Other Skills

| Skill | Relationship |
|---|---|
| context-ingestion | Ingestion invokes metadata-extraction as a sub-process during source registration; extracted metadata populates the CONTEXT-SOURCES.md entries |
| content-analysis | Content analysis operates on the content body; metadata-extraction provides the structural envelope (title, author, date, type) that frames the analysis |
| context-cultivation | Cultivation uses metadata fields (topics, dates, authors) to group and cross-reference sources during thematic coding |
| priority-matrix | Source metadata (recency, authority, content type) informs priority weighting and relevance scoring |
| proposal-builder | Extracted metadata provides attribution data for source citations in proposals |

## Key Principles

Extract before you infer. Always attempt to find explicitly declared metadata before falling back to inference. Declared metadata from the content author is more reliable than algorithmically derived values. Only infer when explicit sources are exhausted.

Standards over invention. Use Dublin Core, OpenGraph, JSON-LD, and schema.org vocabularies. These standards exist because metadata interoperability is a solved problem at the vocabulary level. Custom fields should be exceptional, not default.

Degrade gracefully. Not every source has rich metadata. A PDF with no embedded metadata still has a filename, a file size, and a creation date. The extraction pipeline must produce a usable record from even the sparsest inputs, clearly marking what was found versus inferred.

Provenance is non-negotiable. Every field in the output record should be traceable to a specific extraction step and source. When downstream consumers question a metadata value, the extraction notes must answer "where did this come from?"

Completeness is measurable. Do not hand-wave about metadata quality. Calculate a completeness score. Flag gaps explicitly. A record that honestly reports 40% completeness is more useful than one that silently omits the assessment.

Normalize relentlessly. The entire value of metadata extraction collapses if dates are in five formats, content types use three taxonomies, and languages are sometimes codes and sometimes names. Every field must conform to its canonical format, every time, without exception.

## References

- references/metadata-schemas.md: Canonical field definitions, Dublin Core element set, OpenGraph property catalog, schema.org type mappings, and cross-schema equivalence tables
- references/platform-extractors.md: Platform-specific extraction patterns for GitHub, YouTube, Medium, Dev.to, WordPress, ReadTheDocs, and common documentation frameworks
- references/normalization-rules.md: Date format normalization, language code mapping, content type taxonomy, and conflict resolution procedures
- references/structured-data-formats.md: Parsing guides for JSON-LD, RDFa, Microdata, and embedded metadata in PDF, DOCX, and image EXIF headers

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.