Install this specific skill from the multi-skill repository:

`npx skills add miles-knowbl/orchestrator --skill "metadata-extraction"`
# Description
Extract, parse, and normalize metadata from diverse content sources including URLs, HTTP headers, HTML meta tags, structured data formats, and document structural elements. Produces standardized metadata records conforming to Dublin Core, OpenGraph, JSON-LD, and schema.org vocabularies. Supports the inbox/second-brain pipeline by ensuring every ingested item has rich, consistent, machine-readable metadata.
# SKILL.md
name: metadata-extraction
description: "Extract, parse, and normalize metadata from diverse content sources including URLs, HTTP headers, HTML meta tags, structured data formats, and document structural elements. Produces standardized metadata records conforming to Dublin Core, OpenGraph, JSON-LD, and schema.org vocabularies. Supports the inbox/second-brain pipeline by ensuring every ingested item has rich, consistent, machine-readable metadata."
phase: IMPLEMENT
category: meta
version: "2.0.0"
depends_on: []
tags: [meta, analysis, utilities, parsing]
Metadata Extraction
Extract and normalize metadata from diverse content sources into standardized, machine-readable records.
When to Use
- Content enters the inbox pipeline --- A new URL, document, or media item arrives and needs cataloging before downstream processing
- Source identification required --- You need to determine what a piece of content is, where it came from, and who created it
- Structured data harvesting --- A page or document contains embedded metadata (OpenGraph, JSON-LD, microdata) that should be captured
- Cross-platform normalization --- Content from GitHub, YouTube, documentation sites, blogs, and other platforms needs a uniform metadata format
- Cataloging and classification --- Building a knowledge base or second brain that requires consistent metadata across heterogeneous sources
- Attribution and provenance tracking --- Need to establish authorship, publication dates, licensing, and source chains
- Quality assessment of sources --- Evaluating metadata completeness and reliability before committing to deeper analysis
- When you say: "extract metadata", "catalog this", "what is this source", "parse this URL", "normalize these records", "tag this content"
Reference Requirements
MUST read before applying this skill:
| Reference | Why Required |
|---|---|
| `metadata-schemas.md` | Canonical field definitions for Dublin Core, OpenGraph, JSON-LD, and schema.org mappings |
| `platform-extractors.md` | Platform-specific extraction patterns for GitHub, YouTube, docs sites, blogs, and APIs |
Read if applicable:
| Reference | When Needed |
|---|---|
| `normalization-rules.md` | When merging metadata from multiple vocabularies or resolving conflicts between embedded schemas |
| `structured-data-formats.md` | When parsing RDFa, Microdata, JSON-LD, or other embedded structured data from HTML documents |
Verification: Ensure every extracted record contains at minimum: title, source_url (or source_path), content_type, date_accessed, and at least one topic tag. Records missing required fields must be flagged as incomplete.
Required Deliverables
| Deliverable | Location | Condition |
|---|---|---|
| Metadata record(s) | Inline or appended to source entry | Always --- at least one record per input source |
| Extraction report | Inline summary | When 3+ sources processed --- completeness statistics and quality flags |
| Normalization log | Inline notes | When schema conflicts detected --- decisions on field mapping and precedence |
| Quality assessment | Appended to record | When metadata completeness falls below 70% --- flags and remediation suggestions |
Core Concept
Metadata Extraction answers: "What is this content, where did it come from, and how should it be cataloged?"
Good metadata extraction is:
- Comprehensive --- Pulls from every available signal: URL structure, HTTP headers, HTML meta tags, structured data, visible content, and platform conventions
- Normalized --- Maps heterogeneous source metadata into a single consistent schema regardless of origin platform
- Standards-aware --- Leverages Dublin Core, OpenGraph, JSON-LD, and schema.org vocabularies rather than inventing ad-hoc fields
- Resilient --- Degrades gracefully when metadata is sparse, falling back through extraction layers until something useful is found
- Honest --- Reports confidence levels and flags incomplete or ambiguous metadata rather than guessing
Metadata Extraction is NOT:
- Content analysis or summarization (that is content-analysis)
- Source evaluation or reliability scoring (that is context-ingestion)
- Semantic understanding or theme extraction (that is context-cultivation)
- Full-text indexing or search optimization
- Data transformation or ETL processing
The Metadata Extraction Process
+-----------------------------------------------------------------+
| METADATA EXTRACTION PROCESS |
| |
| 1. SOURCE IDENTIFICATION |
| +---> Parse URL, detect file type, identify platform |
| |
| 2. STRUCTURAL METADATA EXTRACTION |
| +---> Titles, headings, dates, authors, descriptions |
| |
| 3. EMBEDDED SCHEMA EXTRACTION |
| +---> OpenGraph, JSON-LD, Microdata, Dublin Core, RDFa |
| |
| 4. PLATFORM-SPECIFIC EXTRACTION |
| +---> GitHub, YouTube, docs sites, blogs, APIs |
| |
| 5. SEMANTIC METADATA INFERENCE |
| +---> Topics, categories, language, content class |
| |
| 6. SCHEMA NORMALIZATION |
| +---> Map all fields to canonical schema, resolve conflicts |
| |
| 7. QUALITY ASSESSMENT |
| +---> Completeness score, reliability flags, gap report |
| |
| 8. OUTPUT FORMATTING |
| +---> Produce standardized metadata record |
+-----------------------------------------------------------------+
Step 1: Source Identification
Before extracting metadata, determine what you are looking at. The source type dictates which extraction strategies apply.
URL Parsing
Decompose URLs to extract embedded metadata signals:
| URL Component | Metadata Signal | Example |
|---|---|---|
| Protocol | Security, access type | https = secure; file = local |
| Domain | Publisher, platform | github.com = code repository |
| Subdomain | Content division | docs.example.com = documentation |
| Path segments | Content hierarchy, type | /blog/2025/01/title = blog post, dated |
| Path extensions | File format | .pdf, .md, .html |
| Query parameters | View state, filters | ?v=xxxxx = YouTube video ID |
| Fragment | Section reference | #installation = specific section |
URL Pattern Recognition
## Common URL Patterns
### GitHub
- `github.com/{owner}/{repo}` --> Repository root
- `github.com/{owner}/{repo}/blob/{branch}/{path}` --> File view
- `github.com/{owner}/{repo}/issues/{n}` --> Issue
- `github.com/{owner}/{repo}/pull/{n}` --> Pull request
- `github.com/{owner}/{repo}/releases/tag/{tag}` --> Release
### YouTube
- `youtube.com/watch?v={id}` --> Video
- `youtube.com/playlist?list={id}` --> Playlist
- `youtube.com/@{handle}` --> Channel
- `youtu.be/{id}` --> Short video link
### Documentation Sites
- `docs.{product}.com/{path}` --> Product documentation
- `{product}.readthedocs.io/{path}` --> ReadTheDocs project
- `developer.{company}.com/{path}` --> Developer portal
- `{domain}/api/{version}/{path}` --> API documentation
### Blogs and Articles
- `{domain}/blog/{year}/{month}/{slug}` --> Dated blog post
- `{domain}/posts/{slug}` --> Undated post
- `medium.com/@{author}/{slug}` --> Medium article
- `dev.to/{author}/{slug}` --> Dev.to article
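A minimal Python sketch of this identification step, built only on the standard-library `urllib.parse`; the platform rule list and the returned field names are illustrative assumptions, not a fixed interface:

```python
from urllib.parse import urlparse, parse_qs

# Ordered (domain substring -> platform) rules; extend per platform-extractors.md
PLATFORM_RULES = [
    ("github.com", "github"),
    ("youtube.com", "youtube"),
    ("youtu.be", "youtube"),
    ("medium.com", "medium"),
    ("dev.to", "devto"),
    ("readthedocs.io", "docs"),
]

def identify_source(url: str) -> dict:
    """Decompose a URL into the metadata signals listed in the tables above."""
    parts = urlparse(url)
    domain = parts.netloc.lower()
    platform = next((name for pattern, name in PLATFORM_RULES if pattern in domain), "generic")
    last_segment = parts.path.rsplit("/", 1)[-1]
    return {
        "protocol": parts.scheme,                      # https = secure; file = local
        "domain": domain,                              # publisher / platform signal
        "path_segments": [s for s in parts.path.split("/") if s],
        "extension": last_segment.rsplit(".", 1)[-1].lower() if "." in last_segment else None,
        "query": parse_qs(parts.query),                # e.g. {"v": ["xxxxx"]} on YouTube
        "fragment": parts.fragment or None,            # section reference
        "platform": platform,
    }

print(identify_source("https://www.youtube.com/watch?v=xxxxx#installation"))
```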
Content Type Detection
Determine content type through multiple signals, in priority order:
| Priority | Signal | Method |
|---|---|---|
| 1 | HTTP Content-Type header | Direct server declaration; most authoritative |
| 2 | File extension | .pdf, .html, .json, .md, .mp4 |
| 3 | URL pattern | Platform-specific patterns (see above) |
| 4 | Content sniffing | First bytes / magic numbers of file content |
| 5 | Heuristic | Presence of HTML tags, JSON structure, markdown syntax |
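A hedged sketch of the detection cascade using the standard-library `mimetypes` module plus a few well-known magic numbers; the URL-pattern check from Step 1 would slot in at priority 3, and the function shape is an assumption:

```python
import mimetypes

# A few widely documented magic numbers for content sniffing (priority 4)
MAGIC_NUMBERS = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",   # also the container for DOCX/XLSX
    b"\x89PNG": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}

def detect_content_type(url: str, header_type: str | None = None,
                        first_bytes: bytes = b"") -> str:
    """Apply the priority order: header > extension > sniffing > heuristic."""
    if header_type:                                    # 1. HTTP Content-Type header
        return header_type.split(";")[0].strip()
    guessed, _ = mimetypes.guess_type(url)             # 2. File extension
    if guessed:
        return guessed
    for magic, mime in MAGIC_NUMBERS.items():          # 4. Content sniffing
        if first_bytes.startswith(magic):
            return mime
    if first_bytes.lstrip()[:1] in (b"{", b"["):       # 5. Heuristic: looks like JSON
        return "application/json"
    return "text/html"                                 # default assumption for web sources
```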
Content Type Taxonomy
## Content Classes
- **document**: PDF, Word, slides, spreadsheets
- **webpage**: HTML pages, blog posts, articles
- **code**: Source files, repositories, gists, notebooks
- **media**: Video, audio, images, podcasts
- **data**: JSON, CSV, XML, YAML data files
- **api**: API documentation, endpoint references, schemas
- **conversation**: Forum threads, Q&A, chat logs, comments
- **reference**: Wiki pages, encyclopedic entries, glossaries
Source Identification Checklist
- [ ] URL parsed into components (protocol, domain, path, query, fragment)
- [ ] Platform identified (GitHub, YouTube, docs site, blog, generic)
- [ ] Content type determined (document, webpage, code, media, data, api)
- [ ] File format identified if applicable (PDF, HTML, MD, JSON)
- [ ] Access method confirmed (public, authenticated, local)
Step 2: Structural Metadata Extraction
Extract metadata from the visible and structural elements of the content.
HTML Document Metadata
| Element | Extraction Target | Fallback |
|---|---|---|
| `<title>` | Page title | First `<h1>`, then `<h2>` |
| `<meta name="description">` | Page description | First `<p>` of main content |
| `<meta name="author">` | Author name | Byline element, schema.org author |
| `<meta name="keywords">` | Topic keywords | None (often absent or unreliable) |
| `<meta name="date">` / `<time>` | Publication date | URL date pattern, last-modified header |
| `<link rel="canonical">` | Canonical URL | Current URL |
| `<html lang="...">` | Content language | Heuristic detection |
Date Extraction Priority
Dates are critical metadata but notoriously inconsistent. Extract in this priority order:
| Priority | Source | Reliability | Format Variants |
|---|---|---|---|
| 1 | `<meta property="article:published_time">` | High | ISO 8601 |
| 2 | `<time datetime="...">` element | High | ISO 8601 |
| 3 | JSON-LD `datePublished` | High | ISO 8601 |
| 4 | HTTP `Last-Modified` header | Medium | RFC 7231 |
| 5 | URL path date pattern (`/2025/01/15/`) | Medium | Year/month/day |
| 6 | Visible text pattern ("Published January 15, 2025") | Low | Natural language |
| 7 | Copyright year in footer | Low | Year only |
Date Normalization
All extracted dates must be normalized to ISO 8601 format:
Full: 2025-01-15T14:30:00Z
Date only: 2025-01-15
Year-month: 2025-01
Year only: 2025
Unknown: null (with flag: "date_unknown": true)
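A small Python sketch of this normalization, standard library only; the way partial dates degrade (year-month, year-only) and the URL-path fallback follow the priority table above, but the function name and return shape are assumptions:

```python
import re
from datetime import datetime, timezone

DATE_PATTERNS = [
    ("%Y-%m-%dT%H:%M:%S%z", "full"),   # 2025-01-15T14:30:00Z handled via the Z replacement below
    ("%Y-%m-%d", "date"),
    ("%Y-%m", "year_month"),
    ("%Y", "year"),
]

def normalize_date(raw: str | None) -> tuple[str | None, bool]:
    """Return (ISO 8601 string or None, date_unknown flag)."""
    if not raw:
        return None, True
    value = raw.strip().replace("Z", "+00:00")       # strptime has no literal 'Z' token
    for fmt, kind in DATE_PATTERNS:
        try:
            dt = datetime.strptime(value, fmt)
            if kind == "full":
                return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"), False
            return value, False                       # already in the reduced ISO form
        except ValueError:
            continue
    # Fallback: pull a /YYYY/MM/DD/ pattern out of a URL path (priority 5)
    m = re.search(r"/(\d{4})/(\d{2})/(\d{2})/", raw)
    if m:
        return "-".join(m.groups()), False
    return None, True

print(normalize_date("2025-01-15T14:30:00Z"))   # ('2025-01-15T14:30:00Z', False)
print(normalize_date("2025-01"))                # ('2025-01', False)
```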
Author Extraction
| Source | Pattern | Confidence |
|---|---|---|
| `<meta name="author">` | Direct declaration | High |
| Schema.org `author` property | Structured data | High |
| Byline element (`class="author"`, `class="byline"`) | Convention-based | Medium |
| GitHub commit / profile | Platform API | High |
| YouTube channel name | Platform convention | High |
| Text pattern ("By John Smith") | Regex heuristic | Low |
Structural Elements Checklist
- [ ] Title extracted (with source noted: <title>, <h1>, og:title, etc.)
- [ ] Description extracted or generated from first paragraph
- [ ] Author identified (or marked "Unknown" with confidence: low)
- [ ] Publication date extracted and normalized to ISO 8601
- [ ] Language identified (ISO 639-1 code)
- [ ] Canonical URL determined
- [ ] Word count or content length estimated
Step 3: Embedded Schema Extraction
Modern web pages embed structured metadata in multiple overlapping formats. Extract from all available schemas, then normalize in Step 6.
OpenGraph Protocol (og:)
OpenGraph tags are the most widely deployed structured metadata on the web. Extract from <meta property="og:..."> tags.
| Property | Maps To | Notes |
|---|---|---|
| `og:title` | title | Often more descriptive than `<title>` |
| `og:description` | description | Social-sharing optimized description |
| `og:type` | content_type | article, video.other, website, profile |
| `og:url` | canonical_url | Authoritative URL for the content |
| `og:image` | thumbnail_url | Preview image URL |
| `og:site_name` | publisher | Site or brand name |
| `og:locale` | language | Format: en_US |
| `article:published_time` | date_published | ISO 8601 datetime |
| `article:modified_time` | date_modified | ISO 8601 datetime |
| `article:author` | author_url | URL to author profile |
| `article:section` | category | Content section/category |
| `article:tag` | topics[] | Content tags (may repeat) |
JSON-LD (Linked Data)
JSON-LD is the most machine-readable format, embedded in <script type="application/ld+json">. Extract and parse the JSON structure.
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Example Article Title",
"author": {
"@type": "Person",
"name": "Jane Smith",
"url": "https://example.com/authors/jane"
},
"datePublished": "2025-01-15T10:00:00Z",
"dateModified": "2025-01-20T14:30:00Z",
"publisher": {
"@type": "Organization",
"name": "Example Publisher",
"logo": { "@type": "ImageObject", "url": "https://example.com/logo.png" }
},
"description": "A description of the article content.",
"image": "https://example.com/article-image.jpg",
"mainEntityOfPage": "https://example.com/article-url"
}
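A hedged sketch of harvesting OpenGraph and JSON-LD from raw HTML. It assumes the `beautifulsoup4` package is available; the returned dictionary shape is illustrative and error handling is deliberately thin:

```python
import json
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def extract_embedded_schemas(html: str) -> dict:
    """Collect OpenGraph meta tags and JSON-LD blocks for normalization in Step 6."""
    soup = BeautifulSoup(html, "html.parser")

    opengraph = {
        tag["property"]: tag.get("content", "")
        for tag in soup.find_all("meta", attrs={"property": True})
        if tag["property"].startswith(("og:", "article:"))
    }

    json_ld = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
            # Some pages wrap multiple entities in a list; keep them all
            json_ld.extend(data if isinstance(data, list) else [data])
        except json.JSONDecodeError:
            continue   # a fuller pipeline would flag malformed JSON-LD in extraction_notes

    return {"opengraph": opengraph, "json_ld": json_ld}
```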
Key schema.org Types
| @type | Expected Properties | Content Class |
|---|---|---|
| `Article` / `NewsArticle` / `BlogPosting` | headline, author, datePublished, publisher | webpage |
| `WebPage` / `WebSite` | name, url, description | webpage |
| `SoftwareSourceCode` / `SoftwareApplication` | name, author, programmingLanguage, codeRepository | code |
| `VideoObject` | name, description, duration, uploadDate, thumbnailUrl | media |
| `HowTo` / `TechArticle` | name, step[], proficiencyLevel | reference |
| `Person` / `Organization` | name, url, sameAs[] | entity |
| `Dataset` | name, description, distribution, license | data |
| `APIReference` | name, description, programmingModel | api |
Dublin Core Metadata
Dublin Core elements appear as `<meta name="DC.*" content="...">` or `<meta name="dcterms.*" content="...">` tags. This is common in academic, government, and library contexts.
| DC Element | Maps To | Notes |
|---|---|---|
| `DC.title` | title | Formal title |
| `DC.creator` | author | Creator/author name |
| `DC.subject` | topics[] | Subject keywords |
| `DC.description` | description | Content description |
| `DC.publisher` | publisher | Publishing entity |
| `DC.date` | date_published | Publication date |
| `DC.type` | content_type | DCMI Type Vocabulary |
| `DC.format` | format | MIME type |
| `DC.identifier` | identifier | DOI, ISBN, URI |
| `DC.language` | language | ISO 639 code |
| `DC.rights` | license | Rights statement |
| `DC.source` | source_url | Derived-from source |
| `DC.relation` | related[] | Related resources |
| `DC.coverage` | coverage | Spatial/temporal scope |
Microdata and RDFa
These formats embed metadata directly in HTML attributes:
<!-- Microdata -->
<div itemscope itemtype="https://schema.org/Article">
<h1 itemprop="headline">Article Title</h1>
<span itemprop="author">Jane Smith</span>
<time itemprop="datePublished" datetime="2025-01-15">Jan 15, 2025</time>
</div>
<!-- RDFa -->
<div vocab="https://schema.org/" typeof="Article">
<h1 property="headline">Article Title</h1>
<span property="author">Jane Smith</span>
<time property="datePublished" datetime="2025-01-15">Jan 15, 2025</time>
</div>
Schema Extraction Checklist
- [ ] OpenGraph tags extracted (og:title, og:description, og:type, og:url, og:image)
- [ ] JSON-LD blocks parsed (all <script type="application/ld+json"> elements)
- [ ] Dublin Core elements checked (DC.* and dcterms.* meta tags)
- [ ] Microdata scanned (itemscope/itemprop attributes)
- [ ] RDFa scanned (vocab/typeof/property attributes)
- [ ] Twitter Card tags checked (twitter:title, twitter:description, twitter:image)
- [ ] Schema conflicts noted (e.g., og:title differs from JSON-LD headline)
Step 4: Platform-Specific Extraction
Different platforms expose metadata through platform-specific conventions, APIs, and page structures. Apply targeted extractors based on the platform identified in Step 1.
GitHub
| Metadata Field | Extraction Source |
|---|---|
| Repository name | URL path: /{owner}/{repo} |
| Owner / organization | URL path: /{owner}/ |
| Description | <meta property="og:description"> or repo about section |
| Primary language | Language bar / linguist data |
| Topics | Repository topics (tags below description) |
| Stars / forks | Social proof metrics |
| License | LICENSE file or API license field |
| Last updated | Most recent commit date or push date |
| README content | README.md at repository root |
| Contributors | Contributor count from page or API |
| Open issues | Issue count for activity assessment |
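A minimal sketch of a GitHub extractor that reads the public REST endpoint `https://api.github.com/repos/{owner}/{repo}`; the mapping onto canonical fields is illustrative, and unauthenticated requests are rate-limited:

```python
import json
from urllib.request import Request, urlopen

def extract_github_repo(owner: str, repo: str) -> dict:
    """Map GitHub repository API fields to the canonical metadata schema."""
    req = Request(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(req) as resp:
        data = json.load(resp)
    return {
        "title": data["full_name"],
        "source_url": data["html_url"],
        "content_type": "code",
        "platform": "github",
        "description": data.get("description"),
        "author": data["owner"]["login"],
        "author_url": data["owner"]["html_url"],
        "topics": data.get("topics", []),
        "license": (data.get("license") or {}).get("spdx_id"),
        "date_modified": data.get("pushed_at"),          # already ISO 8601
        "extraction_notes": [
            f"primary_language={data.get('language')}",
            f"stars={data.get('stargazers_count')}",
            f"forks={data.get('forks_count')}",
        ],
    }
```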
YouTube
| Metadata Field | Extraction Source |
|---|---|
| Video title | og:title or JSON-LD name |
| Channel name | og:site_name or channel link text |
| Upload date | JSON-LD uploadDate or datePublished |
| Duration | JSON-LD duration (ISO 8601 duration) |
| Description | og:description (often truncated) or JSON-LD |
| View count | JSON-LD interactionCount |
| Thumbnail | og:image or JSON-LD thumbnailUrl |
| Tags | <meta name="keywords"> (comma-separated) |
| Category | JSON-LD genre or visible category link |
| Captions available | Accessibility metadata or API |
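JSON-LD `duration` values arrive as ISO 8601 durations (e.g. `PT1H23M45S`). A small sketch of converting them to seconds for the canonical record; the regex covers only the common hours/minutes/seconds form:

```python
import re

DURATION_RE = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")

def iso8601_duration_to_seconds(value: str) -> int | None:
    """Convert an ISO 8601 duration like 'PT1H23M45S' to total seconds."""
    m = DURATION_RE.fullmatch(value.strip())
    if not m:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds

print(iso8601_duration_to_seconds("PT1H23M45S"))  # 5025
print(iso8601_duration_to_seconds("PT4M13S"))     # 253
```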
Documentation Sites
| Metadata Field | Extraction Source |
|---|---|
| Product / project name | Site header, og:site_name, breadcrumb root |
| Doc version | URL segment (/v2/, /latest/), version selector |
| Section / category | Breadcrumb trail, sidebar navigation context |
| Page title | <h1> or og:title (strip product prefix) |
| Last updated | Footer date, git blame date, meta tag |
| Navigation context | Previous/next links, breadcrumb hierarchy |
| API version | Path segment or header declaration |
Blog / Article Platforms
| Platform | Author Source | Date Source | Tags Source |
|---|---|---|---|
| Medium | Author card, `@{handle}` in URL | `<time>` element | Bottom-of-article tags |
| Dev.to | URL `/{author}/`, profile card | `<time>` element | Visible tag list |
| WordPress | `<meta name="author">`, byline | `<time class="entry-date">` | `<a rel="tag">` elements |
| Ghost | JSON-LD author, byline | JSON-LD datePublished | JSON-LD keywords |
| Substack | Publication name + author | `<time>` or JSON-LD | Categories if present |
| Hugo / Jekyll | YAML frontmatter author | Frontmatter date | Frontmatter tags |
Platform Extraction Checklist
- [ ] Platform identified from URL pattern
- [ ] Platform-specific extractor applied
- [ ] Platform-unique fields captured (stars, views, duration, etc.)
- [ ] Author profile URL captured (not just name)
- [ ] Platform-specific content type mapped to canonical type
Step 5: Semantic Metadata Inference
When explicit metadata is absent or sparse, infer semantic properties from content analysis. These inferred fields carry lower confidence than extracted fields.
Topic Inference
Derive topics from available signals, in priority order:
| Priority | Signal | Method | Confidence |
|---|---|---|---|
| 1 | Explicit tags / keywords | Direct extraction | High |
| 2 | Article section / category | Platform categorization | High |
| 3 | Title keywords | Key noun phrases from title | Medium |
| 4 | Heading structure | H2/H3 topics across the document | Medium |
| 5 | URL path segments | Meaningful path words | Low |
| 6 | Content body analysis | Frequent terms and phrases | Low |
Content Classification
Assign a primary content class based on structural signals:
| Content Class | Signals |
|---|---|
| tutorial | Step-numbered headings, "how to" in title, code blocks interspersed with prose |
| reference | Table-heavy, alphabetical ordering, definition lists, API signatures |
| conceptual | Explanatory prose, minimal code, "what is" / "understanding" in title |
| news | Dateline, recent date, event-driven topic, publication attribution |
| opinion | First-person voice, evaluative language, "I think" / "my experience" |
| announcement | "Introducing", "announcing", "releasing", version numbers, changelog format |
| discussion | Multiple voices, thread structure, Q&A format |
| specification | Formal language, MUST/SHALL/MAY keywords (RFC 2119), numbered requirements |
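A rough sketch of signal-based classification from title and headings; the keyword patterns and the fallback class are illustrative assumptions, and inferred classes are capped at medium confidence per the inference checklist later in this step:

```python
import re

CLASS_SIGNALS = {
    "tutorial":      [r"\bhow to\b", r"\bstep \d+\b", r"\bgetting started\b"],
    "announcement":  [r"\bintroducing\b", r"\bannouncing\b", r"\breleas(?:e|es|ing)\b", r"\bv?\d+\.\d+\.\d+\b"],
    "opinion":       [r"\bi think\b", r"\bmy experience\b", r"\bin my opinion\b"],
    "conceptual":    [r"\bwhat is\b", r"\bunderstanding\b", r"\bexplained\b"],
    "specification": [r"\bMUST\b", r"\bSHALL\b", r"\bRFC \d+\b"],
}

def classify_content(title: str, headings: list[str]) -> tuple[str, str]:
    """Return (content_class, confidence) from title and heading text."""
    original = " ".join([title] + headings)       # keep case for RFC 2119 keywords
    lowered = original.lower()
    scores = {
        cls: sum(bool(re.search(p, original if cls == "specification" else lowered))
                 for p in patterns)
        for cls, patterns in CLASS_SIGNALS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "reference", "low"                 # fall back to the most neutral class
    return best, "medium"                         # inferred fields never exceed medium confidence

print(classify_content("How to Parse JSON-LD", ["Step 1: Fetch the page", "Step 2: Extract"]))
# ('tutorial', 'medium')
```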
Language Detection
| Method | Reliability | Application |
|---|---|---|
| `<html lang="...">` attribute | High | Declared by page author |
| `Content-Language` HTTP header | High | Server-declared |
| `og:locale` property | High | OpenGraph declaration |
| Heuristic character analysis | Medium | Script detection (Latin, CJK, Arabic, Cyrillic) |
| n-gram frequency analysis | Medium | Statistical language identification |
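A rough sketch of the character-analysis fallback (the medium-reliability rows above) using only `unicodedata`; it identifies the dominant writing script rather than a specific language, and the prefix list is an assumption:

```python
import unicodedata
from collections import Counter

SCRIPT_PREFIXES = ["LATIN", "CJK", "HIRAGANA", "KATAKANA", "HANGUL", "ARABIC", "CYRILLIC", "DEVANAGARI"]

def detect_script(text: str, sample_size: int = 2000) -> str | None:
    """Guess the dominant writing script from Unicode character names."""
    counts = Counter()
    for ch in text[:sample_size]:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for prefix in SCRIPT_PREFIXES:
            if name.startswith(prefix):
                counts[prefix] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]   # e.g. "LATIN" narrows the candidate languages

print(detect_script("Extraction de métadonnées à partir de pages web"))  # LATIN
```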
Semantic Inference Checklist
- [ ] Topics assigned (with source: explicit, inferred, or both)
- [ ] Content class determined (tutorial, reference, conceptual, etc.)
- [ ] Language identified (ISO 639-1 code)
- [ ] Reading level estimated if applicable (technical depth)
- [ ] Inferred fields marked with confidence: medium or confidence: low
Step 6: Schema Normalization
Map all extracted metadata --- from structural elements, embedded schemas, platform-specific fields, and inferred properties --- into a single canonical record.
Canonical Metadata Schema
# Required fields
title: "string" # Content title
source_url: "string" # URL or file path of the source
content_type: "string" # Canonical type: webpage, document, code, media, data, api, conversation, reference
date_accessed: "ISO 8601" # When this metadata was extracted
topics: ["string"] # At least one topic tag
# Strongly recommended fields
description: "string" # Content summary (1-3 sentences)
author: "string" # Primary author name
author_url: "string" # URL to author profile
date_published: "ISO 8601" # Publication or creation date
date_modified: "ISO 8601" # Last modification date
language: "string" # ISO 639-1 language code
publisher: "string" # Publishing platform or organization
canonical_url: "string" # Authoritative URL (may differ from source_url)
# Optional enrichment fields
license: "string" # License identifier (SPDX or description)
thumbnail_url: "string" # Preview image URL
word_count: "number" # Approximate content length
reading_time_minutes: "number" # Estimated reading time
content_class: "string" # tutorial, reference, conceptual, news, opinion, etc.
platform: "string" # github, youtube, medium, docs, blog, etc.
format: "string" # MIME type or file format
identifier: "string" # DOI, ISBN, or other formal identifier
related: ["string"] # URLs of related content
tags_explicit: ["string"] # Tags declared by the source
tags_inferred: ["string"] # Tags inferred during extraction
# Quality fields
metadata_completeness: "number" # 0.0-1.0 score
extraction_confidence: "string" # high, medium, low
extraction_notes: ["string"] # Flags, warnings, or decisions made during extraction
Field Precedence Rules
When multiple sources provide the same field, apply this precedence:
| Field | First Choice | Second Choice | Third Choice | Last Resort |
|---|---|---|---|---|
| title | JSON-LD `headline` | `og:title` | `<title>` | First `<h1>` |
| description | JSON-LD `description` | `og:description` | `<meta name="description">` | First paragraph |
| author | JSON-LD `author.name` | `<meta name="author">` | Byline element | Platform profile |
| date_published | JSON-LD `datePublished` | `article:published_time` | `<time datetime>` | URL date pattern |
| topics | Explicit tags + article:tag | `DC.subject` | `<meta name="keywords">` | Inferred from title |
| content_type | JSON-LD `@type` mapped | `og:type` mapped | Platform convention | Heuristic |
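A compact sketch of applying these precedence rules when merging candidates gathered in Steps 2-5; the candidate keys naming each extraction source are an assumed hand-off convention, not part of the schema:

```python
# Precedence lists mirror the table above; keys name the extraction source of each candidate.
FIELD_PRECEDENCE = {
    "title":          ["json_ld.headline", "og:title", "html.title", "html.h1"],
    "description":    ["json_ld.description", "og:description", "meta.description", "body.first_paragraph"],
    "author":         ["json_ld.author.name", "meta.author", "html.byline", "platform.profile"],
    "date_published": ["json_ld.datePublished", "og:article:published_time", "html.time", "url.date"],
}

def resolve_field(field: str, candidates: dict) -> tuple:
    """Pick the highest-precedence non-empty candidate; return (value, winning_source)."""
    for source in FIELD_PRECEDENCE.get(field, []):
        value = candidates.get(source)
        if value:
            return value, source
    return None, None

candidates = {
    "og:title": "Parsing JSON-LD in Practice",
    "html.title": "Parsing JSON-LD in Practice | Example Blog",
}
print(resolve_field("title", candidates))
# ('Parsing JSON-LD in Practice', 'og:title') -- no JSON-LD headline, so OpenGraph wins
```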
Conflict Resolution
When extracted values conflict across schemas:
| Conflict Type | Resolution Strategy |
|---|---|
| Minor wording difference | Prefer the more specific or complete version |
| Substantive disagreement | Prefer JSON-LD > OpenGraph > meta tags > inferred |
| Date discrepancy | Prefer the earliest plausible date as published; note discrepancy |
| Author mismatch | Prefer structured data (JSON-LD) over unstructured (byline text) |
| Multiple types | Use most specific type; note alternative classification |
Normalization Checklist
- [ ] All extracted fields mapped to canonical schema
- [ ] Field precedence applied where conflicts exist
- [ ] Dates normalized to ISO 8601
- [ ] Language normalized to ISO 639-1
- [ ] Content type mapped to canonical taxonomy
- [ ] Conflict resolution decisions documented in extraction_notes
- [ ] Required fields present (title, source_url, content_type, date_accessed, topics)
Step 7: Quality Assessment
Assess the completeness and reliability of the extracted metadata record before finalizing.
Completeness Scoring
Calculate metadata completeness as a ratio of populated fields to total applicable fields:
Completeness = (populated required fields / 5) * 0.5
+ (populated recommended fields / 7) * 0.3
+ (populated optional fields / total optional) * 0.2
Score ranges:
0.90 - 1.00 = EXCELLENT --> Rich metadata, high confidence
0.70 - 0.89 = GOOD --> Sufficient for cataloging and discovery
0.50 - 0.69 = FAIR --> Usable but with notable gaps
0.30 - 0.49 = POOR --> Significant gaps, flag for manual review
0.00 - 0.29 = MINIMAL --> Only basic identification possible
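A direct translation of the scoring formula into Python. The field lists come from the canonical schema in Step 6; note that the schema lists eight strongly recommended fields, so this sketch divides by the list length rather than the literal 7 above:

```python
REQUIRED = ["title", "source_url", "content_type", "date_accessed", "topics"]
RECOMMENDED = ["description", "author", "author_url", "date_published",
               "date_modified", "language", "publisher", "canonical_url"]
OPTIONAL = ["license", "thumbnail_url", "word_count", "reading_time_minutes",
            "content_class", "platform", "format", "identifier", "related",
            "tags_explicit", "tags_inferred"]

BANDS = [(0.90, "EXCELLENT"), (0.70, "GOOD"), (0.50, "FAIR"), (0.30, "POOR"), (0.0, "MINIMAL")]

def completeness(record: dict) -> tuple:
    """Weighted ratio of populated fields, per the formula above."""
    def ratio(fields):
        populated = sum(1 for f in fields if record.get(f) not in (None, "", []))
        return populated / len(fields)
    score = ratio(REQUIRED) * 0.5 + ratio(RECOMMENDED) * 0.3 + ratio(OPTIONAL) * 0.2
    band = next(label for floor, label in BANDS if score >= floor)
    return round(score, 2), band

record = {"title": "Example", "source_url": "https://example.com", "content_type": "webpage",
          "date_accessed": "2025-01-15", "topics": ["metadata"],
          "description": "A short summary.", "language": "en"}
print(completeness(record))  # all required fields, 2 of 8 recommended -> about (0.57, 'FAIR')
```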
Reliability Indicators
| Indicator | Positive Signal | Negative Signal |
|---|---|---|
| Source authority | Known publisher, verified author | Anonymous, unverifiable source |
| Schema richness | JSON-LD + OpenGraph + meta tags | No structured data at all |
| Date availability | Published date with timezone | No dates anywhere |
| Cross-schema consistency | All schemas agree on key fields | Contradictions between schemas |
| Platform recognition | Known platform with established patterns | Unknown site with no conventions |
| Content structure | Clear headings, semantic HTML | Flat text, no structure |
Quality Flags
Flag any of the following conditions in extraction_notes:
## Quality Flags
- MISSING_DATE: No publication date found from any source
- MISSING_AUTHOR: No author identified; attribution not possible
- INFERRED_ONLY: Key fields populated only through inference (no explicit metadata)
- SCHEMA_CONFLICT: Conflicting values found across metadata schemas
- STALE_CONTENT: Content appears outdated (date > 2 years old)
- TRUNCATED_DESCRIPTION: Description was cut off or incomplete
- NO_STRUCTURED_DATA: No JSON-LD, OpenGraph, or Dublin Core found
- REDIRECT_DETECTED: Source URL redirected; canonical may differ
- PAYWALL_SUSPECTED: Full content may not be accessible
- GENERATED_CONTENT: Content appears AI-generated (metadata pattern)
Quality Assessment Checklist
- [ ] Completeness score calculated
- [ ] Reliability indicators assessed
- [ ] Quality flags applied where applicable
- [ ] Missing required fields explicitly noted
- [ ] Confidence level assigned (high / medium / low)
- [ ] Remediation suggestions provided for POOR or MINIMAL scores
Step 8: Output Formatting
Produce the final metadata record in the standardized format.
Record Assembly
Combine all extraction, normalization, and quality assessment into the final output. Include provenance information showing which extraction step produced each field.
Final Record Validation
Before outputting, verify:
- [ ] All required fields present and non-empty
- [ ] All dates in ISO 8601 format
- [ ] All URLs are absolute (not relative)
- [ ] Topics array has at least one entry
- [ ] content_type is from canonical taxonomy
- [ ] metadata_completeness score is calculated
- [ ] extraction_confidence is set
- [ ] extraction_notes captures any flags or decisions
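A minimal sketch of the validation pass implied by this checklist; the taxonomy and required-field list come from Step 6, and the date and URL checks are deliberately simple:

```python
import re

REQUIRED_FIELDS = ["title", "source_url", "content_type", "date_accessed", "topics"]
CONTENT_TYPES = {"webpage", "document", "code", "media", "data", "api", "conversation", "reference"}
ISO_DATE_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2}([T ]\d{2}:\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})?)?)?)?$")

def validate_record(record: dict) -> list:
    """Return a list of validation problems; an empty list means the record can be emitted."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, "", []):
            problems.append(f"missing required field: {field}")
    if record.get("content_type") not in CONTENT_TYPES:
        problems.append("content_type not in canonical taxonomy")
    for field in ("date_published", "date_modified", "date_accessed"):
        value = record.get(field)
        if value and not ISO_DATE_RE.match(value):
            problems.append(f"{field} is not ISO 8601: {value}")
    for field in ("source_url", "canonical_url", "thumbnail_url"):
        value = record.get(field)
        if value and not value.startswith(("http://", "https://", "file://", "/")):
            problems.append(f"{field} does not look absolute: {value}")
    return problems
```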
Output Formats
Quick Format (single source, inline use)
**Metadata: [Title]**
- **Source:** [URL or path]
- **Type:** [content_type] | **Platform:** [platform]
- **Author:** [author] | **Published:** [date_published]
- **Language:** [language] | **License:** [license]
- **Topics:** [topic1, topic2, topic3]
- **Description:** [description]
- **Completeness:** [score] ([EXCELLENT/GOOD/FAIR/POOR/MINIMAL])
- **Flags:** [any quality flags]
Full Format (detailed record, cataloging use)
---
# Metadata Record
title: "[Title]"
source_url: "[URL]"
canonical_url: "[Canonical URL]"
content_type: "[type]"
content_class: "[class]"
platform: "[platform]"
format: "[MIME type or format]"
# Attribution
author: "[Author Name]"
author_url: "[Author Profile URL]"
publisher: "[Publisher Name]"
# Temporal
date_published: "[ISO 8601]"
date_modified: "[ISO 8601]"
date_accessed: "[ISO 8601]"
# Classification
language: "[ISO 639-1]"
topics:
- "[topic1]"
- "[topic2]"
tags_explicit:
- "[tag1]"
tags_inferred:
- "[tag2]"
# Enrichment
description: "[1-3 sentence description]"
thumbnail_url: "[URL]"
word_count: [number]
reading_time_minutes: [number]
license: "[SPDX or description]"
identifier: "[DOI, ISBN, etc.]"
related:
- "[URL]"
# Quality
metadata_completeness: [0.0-1.0]
extraction_confidence: "[high/medium/low]"
extraction_notes:
- "[Note or flag]"
# Provenance
extracted_by: "metadata-extraction v2.0.0"
extraction_date: "[ISO 8601]"
schemas_found: ["json-ld", "opengraph", "dublin-core"]
---
Common Patterns
Web Page Extraction
Process a standard web page (blog post, article, documentation page) through the full pipeline. Start with URL parsing to identify the platform, then extract HTML structural metadata, harvest all embedded schemas (OpenGraph, JSON-LD, Dublin Core), apply platform-specific extractors, infer any missing semantic fields, normalize, and assess quality.
Use when: The source is an HTML page accessible via URL. This is the most common extraction scenario and exercises all pipeline steps.
API Documentation Extraction
Extract metadata from API reference pages, focusing on technical specificity. Capture the API product name, version, endpoint paths, authentication requirements, and rate limits. Map to schema.org APIReference or TechArticle types. Pay special attention to version information embedded in URLs (/v2/, /latest/) and documentation framework metadata (Swagger/OpenAPI, Redoc, Slate).
Use when: The source is API documentation, SDK reference, or developer portal content. These sources have distinctive structural patterns and specialized metadata needs.
Video and Media Extraction
Extract metadata from video platforms (YouTube, Vimeo) and podcast hosts. Focus on duration, upload date, creator/channel, view metrics, and associated text content (descriptions, transcripts, show notes). JSON-LD on video platforms is typically rich; prefer it over OpenGraph for structured fields like duration.
Use when: The source is a video, podcast episode, or other media content. These sources carry temporal metadata (duration) and engagement metrics not found in text content.
Code Repository Extraction
Extract metadata from code hosting platforms (GitHub, GitLab, Bitbucket). Capture repository name, owner, description, primary language, topics/tags, license, star/fork counts, contributor count, and last activity date. README content provides the richest description source. API endpoints (when accessible) provide the most structured data.
Use when: The source is a code repository, gist, or code-sharing platform. These sources have a unique metadata vocabulary (stars, forks, languages) that must be mapped to canonical fields.
Relationship to Other Skills
| Skill | Relationship |
|---|---|
| `context-ingestion` | Ingestion invokes metadata-extraction as a sub-process during source registration; extracted metadata populates the CONTEXT-SOURCES.md entries |
| `content-analysis` | Content analysis operates on the content body; metadata-extraction provides the structural envelope (title, author, date, type) that frames the analysis |
| `context-cultivation` | Cultivation uses metadata fields (topics, dates, authors) to group and cross-reference sources during thematic coding |
| `priority-matrix` | Source metadata (recency, authority, content type) informs priority weighting and relevance scoring |
| `proposal-builder` | Extracted metadata provides attribution data for source citations in proposals |
Key Principles
Extract before you infer. Always attempt to find explicitly declared metadata before falling back to inference. Declared metadata from the content author is more reliable than algorithmically derived values. Only infer when explicit sources are exhausted.
Standards over invention. Use Dublin Core, OpenGraph, JSON-LD, and schema.org vocabularies. These standards exist because metadata interoperability is a solved problem at the vocabulary level. Custom fields should be exceptional, not default.
Degrade gracefully. Not every source has rich metadata. A PDF with no embedded metadata still has a filename, a file size, and a creation date. The extraction pipeline must produce a usable record from even the sparsest inputs, clearly marking what was found versus inferred.
Provenance is non-negotiable. Every field in the output record should be traceable to a specific extraction step and source. When downstream consumers question a metadata value, the extraction notes must answer "where did this come from?"
Completeness is measurable. Do not hand-wave about metadata quality. Calculate a completeness score. Flag gaps explicitly. A record that honestly reports 40% completeness is more useful than one that silently omits the assessment.
Normalize relentlessly. The entire value of metadata extraction collapses if dates are in five formats, content types use three taxonomies, and languages are sometimes codes and sometimes names. Every field must conform to its canonical format, every time, without exception.
References
- `references/metadata-schemas.md`: Canonical field definitions, Dublin Core element set, OpenGraph property catalog, schema.org type mappings, and cross-schema equivalence tables
- `references/platform-extractors.md`: Platform-specific extraction patterns for GitHub, YouTube, Medium, Dev.to, WordPress, ReadTheDocs, and common documentation frameworks
- `references/normalization-rules.md`: Date format normalization, language code mapping, content type taxonomy, and conflict resolution procedures
- `references/structured-data-formats.md`: Parsing guides for JSON-LD, RDFa, Microdata, and embedded metadata in PDF, DOCX, and image EXIF headers
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.