# greenplum-to-databricks

by dey-abhishek
# Install this skill

Install this specific skill from the multi-skill repository:

    npx skills add dey-abhishek/genie-code-migration-skills --skill "greenplum-to-databricks"

# Description

Migrate Greenplum (GPDB) databases to Databricks Lakehouse. Use when planning or executing Greenplum migrations, converting GP SQL to Spark SQL, mapping data types, migrating stored procedures, external tables, distribution strategies, or ETL pipelines from Greenplum to Databricks Unity Catalog and Spark Declarative Pipelines.

# SKILL.md

---
name: greenplum-to-databricks
description: Migrate Greenplum (GPDB) databases to Databricks Lakehouse. Use when planning or executing Greenplum migrations, converting GP SQL to Spark SQL, mapping data types, migrating stored procedures, external tables, distribution strategies, or ETL pipelines from Greenplum to Databricks Unity Catalog and Spark Declarative Pipelines.
---

## Greenplum to Databricks Migration

Expert guidance for migrating Greenplum (GPDB/Pivotal) data warehouses to Databricks Lakehouse Platform, covering schema conversion, SQL dialect translation, performance optimization, and pipeline modernization.

## Capabilities

  • Convert Greenplum DDL to Databricks SQL with Delta Lake
  • Map Greenplum data types to Databricks/Spark types
  • Translate GP-specific SQL (window functions, arrays, regex) to Spark SQL
  • Migrate distribution/partitioning strategies to Liquid Clustering
  • Convert external tables (gpfdist, PXF, S3) to Unity Catalog external locations
  • Migrate PL/pgSQL stored procedures to Databricks SQL scripting (DBR 16.3+), stored procedures (DBR 17.0+), Python UDFs, or notebooks
  • Modernize ETL pipelines from GP workflows to Spark Declarative Pipelines (SDP)
  • Migrate Greenplum transactions to Databricks transaction modes (non-interactive and interactive)
  • Convert temporary tables from GP to Databricks temporary tables (SQL warehouse)
  • Plan phased migration with validation checkpoints

## Workflow

### Phase 1: Assessment

  1. Inventory GP schemas, tables, views, functions, external tables
  2. Identify data volumes, distribution keys, partition strategies
  3. Catalog PL/pgSQL procedures and their complexity
  4. Map dependencies and execution order
  5. Reference: 1-assessment-checklist.md
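Much of steps 1–2 can be scripted against the Greenplum system catalogs. A hedged sketch (catalog column names vary between GP versions, and the schema filter list is illustrative; verify against your release):

```sql
-- Inventory user tables with their distribution policy (Greenplum 6-style catalog).
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       p.policytype,   -- 'p' = hash/randomly distributed, 'r' = replicated
       p.distkey       -- attnums of the distribution key columns
FROM   pg_class c
JOIN   pg_namespace n ON n.oid = c.relnamespace
LEFT   JOIN gp_distribution_policy p ON p.localoid = c.oid
WHERE  c.relkind = 'r'
  AND  n.nspname NOT IN ('pg_catalog', 'information_schema', 'gp_toolkit');
```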

### Phase 2: Schema & DDL Migration

  1. Convert GP DDL to Databricks SQL (Delta tables in Unity Catalog)
  2. Apply data type mappings (see reference)
  3. Replace DISTRIBUTED BY with CLUSTER BY (Liquid Clustering)
  4. Convert range/list partitions to Delta partitioning or clustering
  5. Reference: 2-schema-ddl-migration.md
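A before/after sketch of steps 1–4 (table, column, and catalog names are illustrative; GP partition clauses simplified):

```sql
-- Greenplum original
CREATE TABLE sales.orders (
    order_id  bigserial,
    customer  text,
    amount    numeric(12,2),
    order_ts  timestamp with time zone
)
DISTRIBUTED BY (order_id)
PARTITION BY RANGE (order_ts)
  (START (date '2023-01-01') END (date '2024-01-01') EVERY (INTERVAL '1 month'));

-- Databricks equivalent: Delta table in Unity Catalog.
-- The distribution key is dropped (Databricks manages data placement);
-- cluster on the columns queries actually filter on.
CREATE TABLE main.sales.orders (
    order_id  BIGINT GENERATED ALWAYS AS IDENTITY,
    customer  STRING,
    amount    DECIMAL(12,2),
    order_ts  TIMESTAMP
)
CLUSTER BY (order_ts);
```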

### Phase 3: SQL Dialect Conversion

  1. Identify target runtime (DBSQL, Spark SQL, or SDP); see the runtime matrix in the reference
  2. Translate GP-specific functions to Databricks equivalents
  3. Convert :: casting to CAST() syntax (or keep β€” Databricks supports ::)
  4. Rewrite GP array operations, regex, string functions
  5. Handle ILIKE, SIMILAR TO, interval arithmetic
  6. Watch for behavioral differences (NULL ordering, integer division, concat with NULL)
  7. Convert GP transactions (BEGIN ... COMMIT) to Databricks transaction modes (see 10-transactions.md)
  8. Migrate temporary tables (CREATE TEMP TABLE) to Databricks temporary tables (see 11-temporary-tables.md)
  9. Reference: 3-sql-dialect-conversion.md
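A small before/after sketch of typical rewrites (table and column names are illustrative; `::` and `ILIKE` also work natively on Databricks, so these rewrites are optional there):

```sql
-- Greenplum
SELECT id::text AS id, name
FROM   customers
WHERE  name ILIKE '%smith%'
  AND  created_at > now() - interval '30 days';

-- Spark SQL
SELECT CAST(id AS STRING) AS id, name
FROM   customers
WHERE  name ILIKE '%smith%'
  AND  created_at > current_timestamp() - INTERVAL 30 DAYS;

-- generate_series(1, 5)  becomes:
SELECT explode(sequence(1, 5)) AS n;
```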

### Phase 4: Data Type Mapping

  1. Map all GP types to Spark/Delta types
  2. Handle GP-specific types (money, inet, cidr, macaddr, geometric)
  3. Convert array types and composite types
  4. Reference: 4-data-type-mapping.md
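For GP-specific types with no direct Spark equivalent, the usual fallback is an exact numeric or string representation. A sketch (table and column names are hypothetical):

```sql
-- Greenplum
CREATE TABLE net_log (price money, client inet, net cidr, mac macaddr);

-- Databricks
CREATE TABLE main.default.net_log (
    price  DECIMAL(19,4),  -- money: use DECIMAL to avoid float rounding
    client STRING,         -- inet: store as text; validate with regexp functions
    net    STRING,         -- cidr
    mac    STRING          -- macaddr
);
```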

### Phase 5: Stored Procedure Migration

  1. Classify procedures by complexity (simple SQL, cursor-based, dynamic SQL)
  2. Convert simple SQL procedures to Databricks SQL functions
  3. Migrate complex PL/pgSQL to SQL scripting (DBR 16.3+), stored procedures (DBR 17.0+), or Python UDFs
  4. Migrate triggers to SDP expectations or notebook workflows
  5. Handle transactions within procedures (see 10-transactions.md)
  6. Migrate temporary tables used in procedures (see 11-temporary-tables.md)
  7. Reference: 5-stored-procedure-migration.md
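A sketch of migrating a simple PL/pgSQL procedure to a Databricks SQL procedure (DBR 17.0+). Names are illustrative, and the exact procedure syntax should be verified against your DBR release:

```sql
-- Greenplum PL/pgSQL (simplified)
CREATE FUNCTION purge_old_orders(days int) RETURNS void AS $$
BEGIN
    DELETE FROM orders WHERE order_ts < now() - (days || ' days')::interval;
END;
$$ LANGUAGE plpgsql;

-- Databricks SQL procedure sketch
CREATE OR REPLACE PROCEDURE main.sales.purge_old_orders(days INT)
LANGUAGE SQL
AS BEGIN
    DELETE FROM main.sales.orders
    WHERE order_ts < current_timestamp() - make_interval(0, 0, 0, days);
END;
```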

### Phase 6: External Table & Data Loading

  1. Map gpfdist/PXF external tables to Unity Catalog external locations
  2. Convert CREATE EXTERNAL TABLE to read_files() or Auto Loader
  3. Migrate S3/GCS/ADLS references to Volumes or external locations
  4. Reference: 6-external-tables-data-loading.md
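A sketch of steps 1–2: a gpfdist-backed external table becomes a direct read from cloud storage (bucket paths and names are illustrative, and the storage location must be registered as a Unity Catalog external location or volume):

```sql
-- Greenplum external table served by gpfdist
CREATE EXTERNAL TABLE staging.clicks_ext (user_id bigint, url text)
LOCATION ('gpfdist://etl-host:8081/clicks*.csv')
FORMAT 'CSV' (HEADER);

-- Databricks: query the files directly
SELECT *
FROM read_files(
    's3://my-bucket/landing/clicks/',
    format => 'csv',
    header => true
);
```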

### Phase 7: ETL Pipeline Modernization

  1. Map GP ETL patterns to SDP medallion architecture
  2. Convert INSERT INTO ... SELECT chains to streaming tables
  3. Modernize scheduling from cron/GP orchestration to Databricks Jobs
  4. Reference: 7-etl-pipeline-modernization.md
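Step 2 as a sketch: a GP batch insert chain becomes an incrementally refreshed streaming table in SDP (paths and names are illustrative):

```sql
-- GP nightly batch:
--   INSERT INTO silver.clicks SELECT ... FROM staging.clicks_ext;

-- SDP equivalent: a streaming table that ingests new files incrementally
CREATE OR REFRESH STREAMING TABLE silver_clicks AS
SELECT user_id, url, current_timestamp() AS ingested_at
FROM STREAM read_files('s3://my-bucket/landing/clicks/', format => 'csv');
```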

### Phase 8: Performance Tuning

  1. Replace distribution keys with Liquid Clustering columns
  2. Convert GP indexes to Z-ORDER or clustering strategies
  3. Optimize joins (broadcast hints, AQE)
  4. Tune file sizes and compaction
  5. Reference: 8-performance-tuning.md
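A sketch of steps 1–3 (table and column names are illustrative):

```sql
-- Liquid Clustering on the columns most queries filter or join on
ALTER TABLE main.sales.orders CLUSTER BY (order_ts, customer_id);

-- Trigger incremental re-clustering and file compaction
OPTIMIZE main.sales.orders;

-- A broadcast hint can stand in for GP's co-located joins on small dimensions
SELECT /*+ BROADCAST(d) */ o.*, d.region
FROM main.sales.orders o
JOIN main.sales.dim_customer d USING (customer_id);
```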

### Phase 9: Validation & Testing

  1. Row count and checksum validation
  2. Query result comparison
  3. Performance benchmarking
  4. Reference: 9-validation-testing.md
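For steps 1–2, run the same portable aggregates on both systems and diff the results. Built-in hash functions differ between engines, so prefer engine-neutral aggregates (or per-row `md5()`, which both support). Names are illustrative:

```sql
-- Run on Greenplum and on Databricks, then compare
SELECT COUNT(*)                           AS row_cnt,
       COUNT(DISTINCT order_id)           AS distinct_ids,
       SUM(CAST(amount AS DECIMAL(38,2))) AS amount_sum
FROM   sales.orders;
```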

## Quick Reference: Common Conversions

| Greenplum | Databricks |
| --- | --- |
| `DISTRIBUTED BY (col)` | `CLUSTER BY (col)` (Liquid Clustering) |
| `PARTITION BY RANGE (date)` | `CLUSTER BY (date)` or Delta partitioning |
| `CREATE EXTERNAL TABLE ... LOCATION` | `read_files()` / external location |
| `serial` / `bigserial` | `BIGINT GENERATED ALWAYS AS IDENTITY` |
| `text` / `varchar` | `STRING` |
| `numeric(p,s)` | `DECIMAL(p,s)` |
| `boolean` | `BOOLEAN` |
| `timestamp with time zone` | `TIMESTAMP` (UTC-normalized) |
| `jsonb` / `json` | `VARIANT` (preferred), or `STRING` parsed with `:` or `from_json()` |
| `array[]` | `ARRAY<>` |
| `col::type` | `CAST(col AS type)` (`::` is also supported) |
| `col::type` (unsafe) | `try_cast(col AS type)` (`NULL` on failure) |
| `ILIKE` | `ILIKE` (natively supported) |
| `IS DISTINCT FROM` | `IS DISTINCT FROM` (natively supported) |
| `NOW()` | `current_timestamp()` |
| `EXTRACT(EPOCH FROM ts)` | `unix_timestamp(ts)` |
| `string_agg()` | `string_agg()` / `listagg()` (native) |
| `generate_series()` | `explode(sequence())` |
| `hstore` | `MAP<STRING, STRING>` |
| `age(ts1, ts2)` | `FLOOR(MONTHS_BETWEEN(ts1, ts2) / 12)` |
| `ENCODE` / `DECODE` | `base64()` / `unbase64()` / `hex()` |
| `aes_encrypt` / `aes_decrypt` | `aes_encrypt()` / `aes_decrypt()` (native) |
| `ST_GeomFromText` etc. | native `GEOMETRY`/`GEOGRAPHY` types + H3 functions |
| `xpath()` | `xpath()` / `xpath_string()` (native) |
| `to_tsvector` / `to_tsquery` | `RLIKE` or Vector Search |
| `postgres_fdw` | Lakehouse Federation (`CREATE CONNECTION`) |
| `INHERITS` | single table + type column, or `UNION ALL` view |
| PL/pgSQL function | SQL scripting (DBR 16.3+), Python UDF, or SQL function |
| `CREATE PROCEDURE` | `CREATE PROCEDURE` / `CALL` (DBR 17.0+, Unity Catalog) |
| PL/pgSQL `CURSOR` | `DECLARE CURSOR` / `OPEN` / `FETCH` / `CLOSE` (DBR 18.1+) |
| `BEGIN...END` block | SQL compound statements (`IF`, `LOOP`, `WHILE`, `FOR`, `CASE`) (DBR 16.3+) |
| `EXECUTE query` (dynamic SQL) | `EXECUTE IMMEDIATE` (DBR 16.3+) |
| `RAISE EXCEPTION` | `SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'msg'` (DBR 18.1+) |
| `EXCEPTION WHEN ... THEN` | `DECLARE HANDLER FOR ...` (DBR 18.1+) |
| `array_to_string()` | `array_join()` |
| `json_object_keys()` | `json_object_keys()` (native) |
| `xmltable()` | `from_xml()` / `schema_of_xml()` (native) |
| `percentile_cont` / `percentile_disc` | `percentile_cont()` / `percentile_disc()` (native) |
| `mode()` / `median()` | `mode()` / `median()` (native) |
| `gpfdist` | Auto Loader / `read_files()` |
| `BEGIN; ... COMMIT;` | `BEGIN ATOMIC ... END;` (non-interactive) or `BEGIN TRANSACTION; ... COMMIT;` (interactive) |
| `CREATE TEMP TABLE` | `CREATE TEMPORARY TABLE` (SQL warehouse only) |
| `ALTER TABLE temp_table ADD COLUMN` | drop and recreate (temp-table schema changes not supported) |

## Examples

### Example: Migrate a GP table

User says: "Convert this Greenplum CREATE TABLE to Databricks"
Result: Converts DDL with proper type mapping, replaces DISTRIBUTED BY with CLUSTER BY, adds to Unity Catalog

### Example: Convert a GP SQL query

User says: "Rewrite this Greenplum query for Spark SQL"
Result: Translates GP-specific syntax (::, ILIKE, arrays, intervals) to Spark SQL equivalents

### Example: Plan a full migration

User says: "Help me migrate our Greenplum warehouse to Databricks"
Result: Generates phased migration plan with assessment, schema conversion, data loading, ETL modernization, and validation steps

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.