Manage Apple Reminders via the `remindctl` CLI on macOS (list, add, edit, complete, delete)....
npx skills add javiermontano-sofka/sdf --skill "sofka-analytics-engineering"
Install specific skill from multi-skill repository
# Description
>
# SKILL.md
name: sofka-analytics-engineering
description: >
Analytics pipeline design — dbt-style transformations, data modeling, testing, documentation.
Use when the user asks to "design analytics models", "set up dbt project", "plan data transformations",
"define data contracts", "model star schema", or mentions staging models, marts, incremental strategies, or materializations.
model: opus
context: fork
allowed-tools:
- Read
- Write
- Edit
- Glob
- Grep
- Bash
Analytics Engineering: Transformation Pipeline Design & Data Modeling
Analytics engineering defines how raw data is transformed into reliable, documented, and tested analytical models — source-to-target mapping, modeling patterns, transformation frameworks, testing, and documentation. This skill produces analytics engineering documentation that enables teams to build maintainable, trustworthy data transformation pipelines.
Principio Rector
Un modelo analítico sin tests es una opinión con formato de tabla. Los tests son ciudadanos de primera clase — no un afterthought. La documentación es parte del modelo, no un artefacto separado. Incremental sobre full-refresh siempre que el volumen lo justifique. Cada modelo tiene grain explícito, owner identificado, y contract enforced en CI.
Inputs
The user provides a system or project name as $ARGUMENTS. Parse $1 as the system/project name used throughout all output artifacts.
Parameters:
- {MODO}: piloto-auto (default) | desatendido | supervisado | paso-a-paso
- piloto-auto: Auto para staging y naming conventions, HITL para modeling patterns y data contracts.
- desatendido: Cero interrupciones. Pipeline completo documentado automáticamente. Supuestos documentados.
- supervisado: Autónomo con checkpoint en modeling pattern selection y testing strategy.
- paso-a-paso: Confirma cada modelo, materialization, test y exposure.
- {FORMATO}: markdown (default) | html | dual
- {VARIANTE}: ejecutiva (~40% — S1 source-to-target + S2 modeling patterns + S4 testing strategy) | técnica (full 6 sections, default)
Before generating architecture, detect the project context:
!find . -name "*.sql" -o -name "*.yml" -o -name "dbt_project.yml" -o -name "profiles.yml" -o -name "*.py" | head -30
Use detected tools (dbt, SQLMesh, Dataform, stored procedures, etc.) to tailor recommendations.
If reference materials exist, load them:
Read ${CLAUDE_SKILL_DIR}/references/analytics-patterns.md
When to Use
- Designing source-to-target data transformation pipelines
- Selecting data modeling patterns (star schema, OBT, activity schema)
- Setting up dbt or similar transformation framework projects
- Defining testing strategies and data contracts for analytical models
- Planning documentation and data discovery infrastructure
- Optimizing warehouse performance and controlling compute costs
When NOT to Use
- Data ingestion and orchestration pipelines → use data-engineering skill
- Dashboard design and KPI frameworks → use bi-architecture skill
- ML feature engineering and model serving → use data-science-architecture skill
- Data profiling and anomaly detection → use data-quality skill
Delivery Structure: 6 Sections
S1: Source-to-Target Mapping
Maps the journey from raw sources through staging to consumption-ready marts.
dbt project structure conventions:
models/
staging/ # 1:1 source mappings — stg_{source}_{entity}.sql
salesforce/
stripe/
intermediate/ # Business logic joins — int_{entity}_{verb}.sql
finance/
marketing/
marts/ # Consumption models
finance/ # fct_{event}.sql, dim_{entity}.sql
marketing/
core/ # Shared dimensions — dim_customer, dim_date
Naming conventions (enforce via CI linting):
- stg_ — staging: rename, cast, deduplicate; one model per source table
- int_ — intermediate: joins, pivots, aggregations that simplify mart logic
- fct_ — fact: event/transaction grain, numeric measures, foreign keys
- dim_ — dimension: descriptive attributes, surrogate keys, SCD tracking
- mrt_ — mart-level aggregations when pre-aggregating for BI performance
Includes:
- Layer architecture (raw/landing → staging → intermediate → marts → metrics)
- Source inventory (systems, tables, extraction method, freshness SLA)
- Staging model design (1:1 source mapping, renaming, type casting, deduplication)
- Intermediate model patterns (joins, business logic aggregation, pivoting)
- Mart design (domain-specific consumption models, access patterns)
Key decisions:
- Layer count: 3 layers (stg/int/mart) for most teams; add metrics layer when semantic layer is needed
- Schema strategy: one schema per layer (staging, intermediate, marts) for simplicity; one per domain at mart level for access control
- Naming rigidity: enforce prefixes (stg_, int_, fct_, dim_) and source identifiers (stg_salesforce_, stg_stripe_) — enables automated lineage and CI rules
S2: Data Modeling Patterns
Selects the modeling approach based on query patterns and data characteristics.
Includes:
- Star schema design (fact tables, dimension tables, grain definition, surrogate keys)
- Wide/denormalized tables (One Big Table pattern for simple analytics)
- Activity schema (event-centric modeling for behavioral analytics)
- Slowly changing dimensions (SCD Type 1: overwrite; Type 2: history tracking with effective dates; Type 3: previous + current columns — Type 2 is default for anything auditable)
- Bridge tables for many-to-many relationships
Key decisions:
- Star vs OBT: star for multiple consumption patterns and BI tools; OBT for single-use analytics — common mistake: over-normalizing for analytics creates 10-15 join queries that break easily
- Surrogate vs natural keys: surrogate preferred for SCDs and warehouse-internal joins; natural keys at staging layer
- Grain selection: too fine = expensive queries; too coarse = lost detail — always document grain explicitly in model YAML description
- Model complexity rule: keep individual model files to ~100 lines of SQL; models exceeding this need decomposition into intermediate steps
S3: Transformation Framework
Documents tool configuration, model organization, and materialization strategies.
Incremental strategy comparison:
| Strategy | Mechanism | Best For | Watch Out |
|---|---|---|---|
| append | Insert new rows only | Immutable event streams (logs, clicks) | Cannot handle late-arriving updates |
| merge | Upsert on unique_key |
Mutable entities (orders, users) | Requires stable unique key; expensive on large tables |
| delete+insert | Delete partition, re-insert | Late-arriving data in known partitions | Partition key must be deterministic |
| insert_overwrite | Overwrite entire partition | Cost-efficient on BigQuery/Hive | Not supported on all warehouses |
| microbatch | Process in time-windowed batches | Very large event tables (1B+ rows) | Newer dbt feature; requires event_time column |
Acceptance criteria for incremental models: test incremental runs against full refresh monthly to catch drift; always define unique_key and updated_at; set on_schema_change: 'append_new_columns' as default.
ref() vs source() conventions:
- source() only in staging models — never reference raw tables in intermediate or mart layers
- ref() everywhere else to maintain DAG integrity and enable state-aware builds
- Exposures for downstream consumers (BI dashboards, ML pipelines, reverse ETL)
Includes:
- Project structure (models/, tests/, macros/, seeds/, snapshots/, analyses/)
- Materialization strategy per layer (staging: view; intermediate: ephemeral or view; marts: table or incremental)
- Macro library (reusable SQL logic, cross-database compatibility)
- Snapshot strategy (SCD Type 2 tracking via timestamps or check columns)
Key decisions:
- Incremental vs full refresh: data volume and freshness drive this — incremental for >10M rows, full refresh for everything else
- Macro abstraction level: too many macros = unreadable SQL; too few = duplication — extract when logic is used in 3+ models
- Environment parity: dev/staging/prod must produce identical results on same data
S4: Testing & Data Contracts
Defines testing strategy and contract enforcement for data reliability.
Testing pyramid for data (invest effort bottom-up):
| Level | What | Tools | Blocks Deploy? |
|---|---|---|---|
| Source freshness | Data arrived on time | dbt source freshness |
Warn at 2x SLA, error at 4x |
| Schema tests | not_null, unique, accepted_values, relationships | dbt generic tests in YAML | Yes — mart layer always, staging for critical |
| Custom data tests | Business rule validation, cross-model consistency | dbt singular tests (.sql files) | Yes for mart layer |
| Unit tests | Macro logic, complex SQL transformations | dbt unit tests (v1.8+), SQLMesh audits | Yes — CI blocks merge |
| Contract tests | Column names, types, constraints between teams | dbt model contracts (contract: {enforced: true}) |
Yes — breaking changes blocked |
CI/CD for dbt:
- Slim CI: dbt build --select state:modified+ on PRs — test only changed models and downstream dependencies
- Deferred execution: --defer --state prod-artifacts/ — reference production tables for unchanged models, avoiding full rebuilds
- Block merge on test failure for mart-layer models; warn-only for staging
- Run dbt source freshness as pre-build step; skip stale sources with --exclude source:stale
- PR comment bot: post model changes, test results, and warehouse cost estimate
Includes:
- Data contract specification (schema, types, constraints, SLAs between teams)
- Contract enforcement (breaking change detection, versioned interfaces)
- Test severity levels (warn vs error; which tests block deployment)
Key decisions:
- Test coverage target: mart models = 100% schema tests + custom tests; staging = not_null + unique on primary key minimum
- Breaking change policy: strict (block) for marts consumed by external teams; lenient (warn) for internal intermediate models
- Test execution: pre-merge CI + post-deploy monitoring — both, not either/or
S5: Documentation & Discovery
Plans auto-generated and manually enriched documentation for data discovery.
Exposure and metric definitions (connecting to BI):
exposures:
- name: weekly_revenue_dashboard
type: dashboard
maturity: high
url: https://bi-tool.company.com/dashboard/123
depends_on:
- ref('fct_orders')
- ref('dim_customer')
owner:
name: Finance Analytics
email: [email protected]
metrics:
- name: monthly_recurring_revenue
label: MRR
type: derived
description: Sum of active subscription revenue, normalized to monthly
calculation_method: derived
expression: "sum(amount) where status = 'active'"
time_grains: [day, week, month]
dimensions: [plan_type, region, customer_segment]
Exposures create accountability: when a model breaks, the owner of every downstream exposure is notified. Define exposures for every L1-L2 dashboard and every ML pipeline consuming marts.
Includes:
- Auto-documentation setup (dbt docs generate, column descriptions, model descriptions)
- Column-level documentation (business meaning, calculation logic, example values)
- Lineage visualization (source-to-mart traceability, impact analysis)
- Data catalog integration (Atlan, DataHub, OpenMetadata — choose based on existing stack)
- Tagging and classification (PII, financial, experimental, certified)
- Ownership mapping (model owner, domain owner, SLA contact)
Key decisions:
- Documentation completeness: all columns described for mart models; business-critical columns for staging
- Catalog tool: dbt docs suffice for <50 models; invest in dedicated catalog at 100+ models
- Lineage depth: model-level default; column-level for regulated industries
S6: Performance & Cost Optimization
Optimizes warehouse performance and controls transformation costs.
Includes:
- Query profiling (execution plans, scan volume, spill-to-disk analysis)
- Clustering and sort keys (most-filtered columns, cardinality-aware selection)
- Partition pruning (date-based partitioning covers 80% of use cases)
- Materialization tuning (when to materialize vs compute on read)
- Cost attribution (per-model cost tracking via query tags, budget alerts, chargeback)
- Warehouse sizing (auto-scaling, multi-cluster, dedicated vs shared)
Key decisions:
- Cluster key selection: choose columns with medium cardinality that appear in WHERE/JOIN clauses
- Cost visibility: tag every dbt model query (query_tag in profiles.yml) to enable per-model cost tracking
- Warehouse isolation: heavy transforms on dedicated warehouse; ad-hoc queries on separate warehouse with auto-suspend
Trade-off Matrix
| Decision | Enables | Constrains | Threshold |
|---|---|---|---|
| Star Schema | Fast queries, intuitive for BI, clear grain | More joins, ETL complexity | Multiple consumption patterns, 3+ BI consumers |
| One Big Table | No joins, fast development | Redundancy, update complexity | Single-use analytics, <100M rows |
| Incremental Models | Fast builds, cost efficient | Harder debugging, late-arriving data risk | Fact tables >10M rows, frequent builds |
| Full Refresh | Simple, deterministic | Expensive at scale, slow | Dimension tables, prototyping, <10M rows |
| Strict Data Contracts | Reliability, breaking change prevention | Slower iteration | Production-critical marts, multi-team |
| Column-Level Lineage | Precise impact analysis | Tooling cost, maintenance | Regulated industries, 100+ models |
Assumptions
- Data warehouse or lakehouse is provisioned and accessible
- Source data is being ingested (or ingestion is designed in parallel)
- Team has SQL proficiency and familiarity with transformation tools
- Version control (git) is used for transformation code
- CI/CD pipeline exists or is planned for automated deployment
Limits
- Focuses on transformation and modeling, not data ingestion
- Does not design BI consumption layer (dashboards, KPIs)
- Does not address data quality monitoring beyond transformation tests
- Performance optimization is warehouse-specific; recommendations require knowing the platform
Edge Cases
Legacy Stored Procedures Migration:
Map existing logic to dbt models, preserve business rules, run parallel validation. Expect 20-30% of stored procedure logic to be obsolete or duplicated.
Multi-Warehouse Environment:
Models consumed across Snowflake, BigQuery, and Redshift. Use cross-database macros, abstract warehouse-specific SQL, test on each target platform.
Real-Time Transformation Needs:
dbt is batch-oriented. For streaming transformations, consider Materialize, RisingWave, or SQLMesh with streaming support. Hybrid architecture: batch marts enriched by streaming aggregates.
Massive Scale (10B+ Rows):
Incremental models mandatory. Microbatch strategy, partition pruning, and clustering are critical. Profile query plans before and after optimization.
Single Analytics Engineer:
Skip intermediate layers initially. Start with staging + marts. Add layers as complexity grows. Documentation is critical for bus-factor mitigation.
Validation Gate
Before finalizing delivery, verify:
- [ ] Layer architecture defined with enforced naming conventions (stg_, int_, fct_, dim_)
- [ ] Source-to-mart lineage traceable for every consumption model
- [ ] Modeling pattern matches query patterns and data characteristics
- [ ] Materialization strategy balances freshness, cost, and complexity
- [ ] Mart models have 100% schema tests (not_null, unique, relationships)
- [ ] Data contracts enforced for cross-team interfaces
- [ ] Exposures defined for every downstream dashboard and ML pipeline
- [ ] Slim CI configured (
state:modified+, deferred execution) - [ ] Cost per model measurable via query tagging
- [ ] Documentation covers column descriptions for all mart-layer models
Output Format Protocol
| Format | Default | Description |
|---|---|---|
markdown |
Yes | Markdown con Mermaid embebido (DAG, star schema diagrams). |
html |
On demand | Branded HTML (Design System). Visual impact. |
dual |
On demand | Both formats. |
Default output is Markdown with embedded Mermaid diagrams. HTML generation requires explicit {FORMATO}=html parameter.
Output Artifact
Primary: A-01_Analytics_Engineering.html — Source-to-target mapping, modeling patterns, transformation framework, testing strategy, documentation plan, performance optimization.
Secondary: Model dependency DAG, naming convention guide, test coverage report template, cost attribution dashboard spec.
Autor: Javier Montaño | Última actualización: 12 de marzo de 2026
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents:
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.