javiermontano-sofka

sofka-analytics-engineering

0
0
# Install this skill:
npx skills add javiermontano-sofka/sdf --skill "sofka-analytics-engineering"

Install specific skill from multi-skill repository

# Description

>

# SKILL.md


name: sofka-analytics-engineering
description: >
Analytics pipeline design — dbt-style transformations, data modeling, testing, documentation.
Use when the user asks to "design analytics models", "set up dbt project", "plan data transformations",
"define data contracts", "model star schema", or mentions staging models, marts, incremental strategies, or materializations.
model: opus
context: fork
allowed-tools:
- Read
- Write
- Edit
- Glob
- Grep
- Bash


Analytics Engineering: Transformation Pipeline Design & Data Modeling

Analytics engineering defines how raw data is transformed into reliable, documented, and tested analytical models — source-to-target mapping, modeling patterns, transformation frameworks, testing, and documentation. This skill produces analytics engineering documentation that enables teams to build maintainable, trustworthy data transformation pipelines.

Principio Rector

Un modelo analítico sin tests es una opinión con formato de tabla. Los tests son ciudadanos de primera clase — no un afterthought. La documentación es parte del modelo, no un artefacto separado. Incremental sobre full-refresh siempre que el volumen lo justifique. Cada modelo tiene grain explícito, owner identificado, y contract enforced en CI.

Inputs

The user provides a system or project name as $ARGUMENTS. Parse $1 as the system/project name used throughout all output artifacts.

Parameters:
- {MODO}: piloto-auto (default) | desatendido | supervisado | paso-a-paso
- piloto-auto: Auto para staging y naming conventions, HITL para modeling patterns y data contracts.
- desatendido: Cero interrupciones. Pipeline completo documentado automáticamente. Supuestos documentados.
- supervisado: Autónomo con checkpoint en modeling pattern selection y testing strategy.
- paso-a-paso: Confirma cada modelo, materialization, test y exposure.
- {FORMATO}: markdown (default) | html | dual
- {VARIANTE}: ejecutiva (~40% — S1 source-to-target + S2 modeling patterns + S4 testing strategy) | técnica (full 6 sections, default)

Before generating architecture, detect the project context:

!find . -name "*.sql" -o -name "*.yml" -o -name "dbt_project.yml" -o -name "profiles.yml" -o -name "*.py" | head -30

Use detected tools (dbt, SQLMesh, Dataform, stored procedures, etc.) to tailor recommendations.

If reference materials exist, load them:

Read ${CLAUDE_SKILL_DIR}/references/analytics-patterns.md

When to Use

  • Designing source-to-target data transformation pipelines
  • Selecting data modeling patterns (star schema, OBT, activity schema)
  • Setting up dbt or similar transformation framework projects
  • Defining testing strategies and data contracts for analytical models
  • Planning documentation and data discovery infrastructure
  • Optimizing warehouse performance and controlling compute costs

When NOT to Use

  • Data ingestion and orchestration pipelines → use data-engineering skill
  • Dashboard design and KPI frameworks → use bi-architecture skill
  • ML feature engineering and model serving → use data-science-architecture skill
  • Data profiling and anomaly detection → use data-quality skill

Delivery Structure: 6 Sections

S1: Source-to-Target Mapping

Maps the journey from raw sources through staging to consumption-ready marts.

dbt project structure conventions:

models/
  staging/           # 1:1 source mappings — stg_{source}_{entity}.sql
    salesforce/
    stripe/
  intermediate/      # Business logic joins — int_{entity}_{verb}.sql
    finance/
    marketing/
  marts/             # Consumption models
    finance/         # fct_{event}.sql, dim_{entity}.sql
    marketing/
    core/            # Shared dimensions — dim_customer, dim_date

Naming conventions (enforce via CI linting):
- stg_ — staging: rename, cast, deduplicate; one model per source table
- int_ — intermediate: joins, pivots, aggregations that simplify mart logic
- fct_ — fact: event/transaction grain, numeric measures, foreign keys
- dim_ — dimension: descriptive attributes, surrogate keys, SCD tracking
- mrt_ — mart-level aggregations when pre-aggregating for BI performance

Includes:
- Layer architecture (raw/landing → staging → intermediate → marts → metrics)
- Source inventory (systems, tables, extraction method, freshness SLA)
- Staging model design (1:1 source mapping, renaming, type casting, deduplication)
- Intermediate model patterns (joins, business logic aggregation, pivoting)
- Mart design (domain-specific consumption models, access patterns)

Key decisions:
- Layer count: 3 layers (stg/int/mart) for most teams; add metrics layer when semantic layer is needed
- Schema strategy: one schema per layer (staging, intermediate, marts) for simplicity; one per domain at mart level for access control
- Naming rigidity: enforce prefixes (stg_, int_, fct_, dim_) and source identifiers (stg_salesforce_, stg_stripe_) — enables automated lineage and CI rules

S2: Data Modeling Patterns

Selects the modeling approach based on query patterns and data characteristics.

Includes:
- Star schema design (fact tables, dimension tables, grain definition, surrogate keys)
- Wide/denormalized tables (One Big Table pattern for simple analytics)
- Activity schema (event-centric modeling for behavioral analytics)
- Slowly changing dimensions (SCD Type 1: overwrite; Type 2: history tracking with effective dates; Type 3: previous + current columns — Type 2 is default for anything auditable)
- Bridge tables for many-to-many relationships

Key decisions:
- Star vs OBT: star for multiple consumption patterns and BI tools; OBT for single-use analytics — common mistake: over-normalizing for analytics creates 10-15 join queries that break easily
- Surrogate vs natural keys: surrogate preferred for SCDs and warehouse-internal joins; natural keys at staging layer
- Grain selection: too fine = expensive queries; too coarse = lost detail — always document grain explicitly in model YAML description
- Model complexity rule: keep individual model files to ~100 lines of SQL; models exceeding this need decomposition into intermediate steps

S3: Transformation Framework

Documents tool configuration, model organization, and materialization strategies.

Incremental strategy comparison:

Strategy Mechanism Best For Watch Out
append Insert new rows only Immutable event streams (logs, clicks) Cannot handle late-arriving updates
merge Upsert on unique_key Mutable entities (orders, users) Requires stable unique key; expensive on large tables
delete+insert Delete partition, re-insert Late-arriving data in known partitions Partition key must be deterministic
insert_overwrite Overwrite entire partition Cost-efficient on BigQuery/Hive Not supported on all warehouses
microbatch Process in time-windowed batches Very large event tables (1B+ rows) Newer dbt feature; requires event_time column

Acceptance criteria for incremental models: test incremental runs against full refresh monthly to catch drift; always define unique_key and updated_at; set on_schema_change: 'append_new_columns' as default.

ref() vs source() conventions:
- source() only in staging models — never reference raw tables in intermediate or mart layers
- ref() everywhere else to maintain DAG integrity and enable state-aware builds
- Exposures for downstream consumers (BI dashboards, ML pipelines, reverse ETL)

Includes:
- Project structure (models/, tests/, macros/, seeds/, snapshots/, analyses/)
- Materialization strategy per layer (staging: view; intermediate: ephemeral or view; marts: table or incremental)
- Macro library (reusable SQL logic, cross-database compatibility)
- Snapshot strategy (SCD Type 2 tracking via timestamps or check columns)

Key decisions:
- Incremental vs full refresh: data volume and freshness drive this — incremental for >10M rows, full refresh for everything else
- Macro abstraction level: too many macros = unreadable SQL; too few = duplication — extract when logic is used in 3+ models
- Environment parity: dev/staging/prod must produce identical results on same data

S4: Testing & Data Contracts

Defines testing strategy and contract enforcement for data reliability.

Testing pyramid for data (invest effort bottom-up):

Level What Tools Blocks Deploy?
Source freshness Data arrived on time dbt source freshness Warn at 2x SLA, error at 4x
Schema tests not_null, unique, accepted_values, relationships dbt generic tests in YAML Yes — mart layer always, staging for critical
Custom data tests Business rule validation, cross-model consistency dbt singular tests (.sql files) Yes for mart layer
Unit tests Macro logic, complex SQL transformations dbt unit tests (v1.8+), SQLMesh audits Yes — CI blocks merge
Contract tests Column names, types, constraints between teams dbt model contracts (contract: {enforced: true}) Yes — breaking changes blocked

CI/CD for dbt:
- Slim CI: dbt build --select state:modified+ on PRs — test only changed models and downstream dependencies
- Deferred execution: --defer --state prod-artifacts/ — reference production tables for unchanged models, avoiding full rebuilds
- Block merge on test failure for mart-layer models; warn-only for staging
- Run dbt source freshness as pre-build step; skip stale sources with --exclude source:stale
- PR comment bot: post model changes, test results, and warehouse cost estimate

Includes:
- Data contract specification (schema, types, constraints, SLAs between teams)
- Contract enforcement (breaking change detection, versioned interfaces)
- Test severity levels (warn vs error; which tests block deployment)

Key decisions:
- Test coverage target: mart models = 100% schema tests + custom tests; staging = not_null + unique on primary key minimum
- Breaking change policy: strict (block) for marts consumed by external teams; lenient (warn) for internal intermediate models
- Test execution: pre-merge CI + post-deploy monitoring — both, not either/or

S5: Documentation & Discovery

Plans auto-generated and manually enriched documentation for data discovery.

Exposure and metric definitions (connecting to BI):

exposures:
  - name: weekly_revenue_dashboard
    type: dashboard
    maturity: high
    url: https://bi-tool.company.com/dashboard/123
    depends_on:
      - ref('fct_orders')
      - ref('dim_customer')
    owner:
      name: Finance Analytics
      email: [email protected]

metrics:
  - name: monthly_recurring_revenue
    label: MRR
    type: derived
    description: Sum of active subscription revenue, normalized to monthly
    calculation_method: derived
    expression: "sum(amount) where status = 'active'"
    time_grains: [day, week, month]
    dimensions: [plan_type, region, customer_segment]

Exposures create accountability: when a model breaks, the owner of every downstream exposure is notified. Define exposures for every L1-L2 dashboard and every ML pipeline consuming marts.

Includes:
- Auto-documentation setup (dbt docs generate, column descriptions, model descriptions)
- Column-level documentation (business meaning, calculation logic, example values)
- Lineage visualization (source-to-mart traceability, impact analysis)
- Data catalog integration (Atlan, DataHub, OpenMetadata — choose based on existing stack)
- Tagging and classification (PII, financial, experimental, certified)
- Ownership mapping (model owner, domain owner, SLA contact)

Key decisions:
- Documentation completeness: all columns described for mart models; business-critical columns for staging
- Catalog tool: dbt docs suffice for <50 models; invest in dedicated catalog at 100+ models
- Lineage depth: model-level default; column-level for regulated industries

S6: Performance & Cost Optimization

Optimizes warehouse performance and controls transformation costs.

Includes:
- Query profiling (execution plans, scan volume, spill-to-disk analysis)
- Clustering and sort keys (most-filtered columns, cardinality-aware selection)
- Partition pruning (date-based partitioning covers 80% of use cases)
- Materialization tuning (when to materialize vs compute on read)
- Cost attribution (per-model cost tracking via query tags, budget alerts, chargeback)
- Warehouse sizing (auto-scaling, multi-cluster, dedicated vs shared)

Key decisions:
- Cluster key selection: choose columns with medium cardinality that appear in WHERE/JOIN clauses
- Cost visibility: tag every dbt model query (query_tag in profiles.yml) to enable per-model cost tracking
- Warehouse isolation: heavy transforms on dedicated warehouse; ad-hoc queries on separate warehouse with auto-suspend


Trade-off Matrix

Decision Enables Constrains Threshold
Star Schema Fast queries, intuitive for BI, clear grain More joins, ETL complexity Multiple consumption patterns, 3+ BI consumers
One Big Table No joins, fast development Redundancy, update complexity Single-use analytics, <100M rows
Incremental Models Fast builds, cost efficient Harder debugging, late-arriving data risk Fact tables >10M rows, frequent builds
Full Refresh Simple, deterministic Expensive at scale, slow Dimension tables, prototyping, <10M rows
Strict Data Contracts Reliability, breaking change prevention Slower iteration Production-critical marts, multi-team
Column-Level Lineage Precise impact analysis Tooling cost, maintenance Regulated industries, 100+ models

Assumptions

  • Data warehouse or lakehouse is provisioned and accessible
  • Source data is being ingested (or ingestion is designed in parallel)
  • Team has SQL proficiency and familiarity with transformation tools
  • Version control (git) is used for transformation code
  • CI/CD pipeline exists or is planned for automated deployment

Limits

  • Focuses on transformation and modeling, not data ingestion
  • Does not design BI consumption layer (dashboards, KPIs)
  • Does not address data quality monitoring beyond transformation tests
  • Performance optimization is warehouse-specific; recommendations require knowing the platform

Edge Cases

Legacy Stored Procedures Migration:
Map existing logic to dbt models, preserve business rules, run parallel validation. Expect 20-30% of stored procedure logic to be obsolete or duplicated.

Multi-Warehouse Environment:
Models consumed across Snowflake, BigQuery, and Redshift. Use cross-database macros, abstract warehouse-specific SQL, test on each target platform.

Real-Time Transformation Needs:
dbt is batch-oriented. For streaming transformations, consider Materialize, RisingWave, or SQLMesh with streaming support. Hybrid architecture: batch marts enriched by streaming aggregates.

Massive Scale (10B+ Rows):
Incremental models mandatory. Microbatch strategy, partition pruning, and clustering are critical. Profile query plans before and after optimization.

Single Analytics Engineer:
Skip intermediate layers initially. Start with staging + marts. Add layers as complexity grows. Documentation is critical for bus-factor mitigation.


Validation Gate

Before finalizing delivery, verify:

  • [ ] Layer architecture defined with enforced naming conventions (stg_, int_, fct_, dim_)
  • [ ] Source-to-mart lineage traceable for every consumption model
  • [ ] Modeling pattern matches query patterns and data characteristics
  • [ ] Materialization strategy balances freshness, cost, and complexity
  • [ ] Mart models have 100% schema tests (not_null, unique, relationships)
  • [ ] Data contracts enforced for cross-team interfaces
  • [ ] Exposures defined for every downstream dashboard and ML pipeline
  • [ ] Slim CI configured (state:modified+, deferred execution)
  • [ ] Cost per model measurable via query tagging
  • [ ] Documentation covers column descriptions for all mart-layer models

Output Format Protocol

Format Default Description
markdown Yes Markdown con Mermaid embebido (DAG, star schema diagrams).
html On demand Branded HTML (Design System). Visual impact.
dual On demand Both formats.

Default output is Markdown with embedded Mermaid diagrams. HTML generation requires explicit {FORMATO}=html parameter.

Output Artifact

Primary: A-01_Analytics_Engineering.html — Source-to-target mapping, modeling patterns, transformation framework, testing strategy, documentation plan, performance optimization.

Secondary: Model dependency DAG, naming convention guide, test coverage report template, cost attribution dashboard spec.


Autor: Javier Montaño | Última actualización: 12 de marzo de 2026

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.