OCI Best Practices and Architecture

by @acedergren in DevOps & Cloud

# Install this skill:

npx skills add acedergren/oci-agent-skills --skill "OCI Best Practices and Architecture"

Install specific skill from multi-skill repository

# Description

Use when architecting OCI solutions, migrating from AWS/Azure, designing multi-AD deployments, or avoiding common OCI anti-patterns. Covers VCN sizing mistakes, Cloud Guard gotchas, free tier specifics, OCI terminology confusion, and multi-AD patterns. Keywords: VCN CIDR too small, compartment strategy, security list vs NSG, regional subnet, always-free limits.

# SKILL.md

name: OCI Best Practices and Architecture
description: Use when architecting OCI solutions, migrating from AWS/Azure, designing multi-AD deployments, or avoiding common OCI anti-patterns. Covers VCN sizing mistakes, Cloud Guard gotchas, free tier specifics, OCI terminology confusion, and multi-AD patterns. Keywords: VCN CIDR too small, compartment strategy, security list vs NSG, regional subnet, always-free limits.
version: 2.0.0

OCI Best Practices - Expert Knowledge

🏗️ Use OCI Landing Zone Terraform Modules

Don't reinvent the wheel. Use oracle-terraform-modules/landing-zone for OCI architecture.

Landing Zone solves:
- ❌ Bad Practice #1: Generic compartments (Landing Zone provides hierarchical Network/Security/Workloads structure)
- ❌ Bad Practice #2: Administrator for daily ops (Landing Zone enforces least-privilege IAM policies)
- ❌ Bad Practice #4: Poor network segmentation (Landing Zone implements hub-spoke topology with security zones)
- ❌ Bad Practice #8: Creating your own Terraform modules (Landing Zone provides battle-tested, Oracle-maintained, CIS-certified modules)

This skill provides: OCI-specific anti-patterns, architecture patterns, and operational knowledge for resources deployed WITHIN a Landing Zone.

⚠️ OCI CLI/API Knowledge Gap

You don't know OCI CLI commands or OCI API structure.

Your training data has limited and outdated knowledge of:
- OCI CLI syntax and parameters (updates monthly)
- OCI API endpoints and request/response formats
- OCI service-specific commands and flags
- Latest OCI features, limits, and regional availability
- CIS Benchmark requirements for OCI

When OCI operations are needed:
1. Use exact CLI commands from skill references
2. Do NOT guess OCI CLI syntax or parameters
3. Do NOT assume API endpoint structures
4. Reference landing-zones skill for Terraform modules

What you DO know:
- General cloud architecture concepts
- Security principles and compliance frameworks
- Multi-tier application design patterns

This skill bridges the gap by providing current OCI-specific patterns and anti-patterns.

You are an OCI architecture expert. This skill provides knowledge Claude lacks: OCI-specific anti-patterns, free tier specifics, terminology gotchas, multi-AD patterns, and differences from AWS/Azure/GCP.

NEVER Do This

❌ NEVER use /24 or smaller VCN CIDR (cannot expand)

# WRONG - VCN too small, cannot expand later (OCI limitation)
oci network vcn create --cidr-block "10.0.0.0/24"
# Only 256 IPs total, exhausted quickly

# WRONG - copying AWS habit (/16 VPC default)
# OCI supports larger: /16 to /30

# RIGHT - start with /16, plan for growth
oci network vcn create --cidr-block "10.0.0.0/16"
# 65,536 IPs, room for 256 /24 subnets

# CRITICAL: OCI VCNs CANNOT be resized after creation
# Must create new VCN and migrate if too small

Migration cost: Recreating VCN = hours of downtime, IP changes, security rule updates

❌ NEVER use AD-specific subnets (breaks multi-AD HA)

# WRONG - subnet tied to single AD
oci network subnet create \
  --vcn-id <vcn-ocid> \
  --cidr-block "10.0.1.0/24" \
  --availability-domain "fMgC:US-ASHBURN-AD-1"  # AD-specific!

# Problem: Can't launch instances in other ADs, no HA

# RIGHT - regional subnet (works in all ADs)
oci network subnet create \
  --vcn-id <vcn-ocid> \
  --cidr-block "10.0.1.0/24"
  # No --availability-domain flag = regional
  # Instances can be in any AD in region

Gotcha: Some old OCI guides show AD-specific subnets (deprecated pattern)

❌ NEVER confuse Security Lists vs NSGs (different use cases)

OCI has TWO network security mechanisms:

Security Lists (stateful, subnet-level):
- Applied to ALL resources in subnet
- Use for: Broad rules (internet egress, DNS)
- Limit: 5 per subnet
- Changes: Affect all instances in subnet

Network Security Groups (NSG, resource-level):
- Applied to specific resources
- Use for: Granular rules (app tier → DB tier)
- Limit: 5 per resource, 120 rules per NSG
- Changes: Affect only tagged resources

# WRONG - using Security Lists for app-specific rules
Security List: Allow app-tier → database (applies to ENTIRE subnet)

# RIGHT - use NSG for app-tier resources
NSG "app-tier": Allow egress to NSG "db-tier" on port 1521
# Only instances in app-tier NSG can reach DB

Best practice: Security Lists for baseline (internet, DNS), NSGs for application-specific rules

❌ NEVER assume single-AD deployment is acceptable (no SLA)

OCI Availability Domains (ADs):
- 3 ADs per region (most regions)
- Isolated fault domains
- <1ms latency between ADs

# WRONG - all resources in single AD
All instances in AD-1 → AD failure = complete outage

# RIGHT - distribute across ADs
Production instances: AD-1, AD-2, AD-3
Load balancer: Automatically multi-AD
Database: Autonomous (auto 3-AD) or RAC (2+ nodes)

SLA impact:
Single-AD: NO SLA (OCI doesn't guarantee)
Multi-AD: 99.95% SLA

Critical: Oracle refuses SLA claims for single-AD deployments in regions with 3 ADs

❌ NEVER hardcode AD names (tenant-specific)

# WRONG - AD names are tenant-specific, not portable
availability_domain = "fMgC:US-ASHBURN-AD-1"  # Only works in YOUR tenancy!

# Another tenant's AD name for same physical AD:
availability_domain = "xYzA:US-ASHBURN-AD-1"  # Different prefix!

# RIGHT - query AD names dynamically
data "oci_identity_availability_domains" "ads" {
  compartment_id = var.tenancy_ocid
}

resource "oci_core_instance" "web" {
  availability_domain = data.oci_identity_availability_domains.ads.availability_domains[0].name
}

Why: OCI generates unique AD prefixes per tenant for security isolation

❌ NEVER enable Cloud Guard auto-remediation without testing

Cloud Guard = OCI threat detection + auto-response

# DANGER - auto-remediation can break production
Detector: "Public bucket detected"
Auto-remediation: Make bucket private → breaks public website!

Detector: "Security list allows 0.0.0.0/0"
Auto-remediation: Remove rule → breaks internet access!

# SAFER approach:
1. Enable detectors (read-only mode first)
2. Review findings for 1-2 weeks
3. Tune responders to avoid false positives
4. Enable auto-remediation for trusted patterns only

Gotcha: Cloud Guard enabled by default in some tenancies, can auto-break things

❌ NEVER assume you need Oracle Linux (common misconception)

OCI supports:
✓ Oracle Linux (free, optimized)
✓ Ubuntu, CentOS, Rocky Linux (free)
✓ Windows Server (BYOL or license-included)
✓ Custom images (import your own)

# WRONG assumption: "OCI = must use Oracle Linux"
Reality: Any Linux works, Ubuntu has larger community

# Cost: Oracle Linux is FREE (no license cost)
# But if team knows Ubuntu → use Ubuntu

Marketing confusion: Oracle pushes Oracle Linux, but it's not required

OCI Always-Free Tier (Exact Limits)

Generous permanent free tier (no credit card trial, no expiration):

Compute

2 AMD VMs: VM.Standard.E2.1.Micro (1/8 OCPU, 1 GB RAM each)
4 Arm OCPUs: VM.Standard.A1.Flex (allocate as 1×4 OCPU or 4×1 OCPU)
Up to 24 GB total RAM (6 GB per OCPU)
Example: Run 4× 1OCPU/6GB Arm instances free forever

Database

2 Autonomous Databases: 1 OCPU each, 20 GB storage per ADB
Can be ATP or ADW
Limit: 2 total per tenancy across all regions

Storage

Block volumes: 200 GB total (2× 100 GB boot volumes + custom)
Object storage: 10 GB Standard tier
Archive storage: 10 GB Archive tier
Block volume backups: 10 GB

Networking

Load balancer: 1 flexible LB, 10 Mbps bandwidth
VCN: 2 VCNs per region (free, no OCID cost)
Public IPv4: 1 reserved public IP free per region

Observability

Monitoring: 1 billion data points ingested
Logging: 10 GB ingested per month
Notifications: 1 million emails per month

Always-Free Gotchas

CRITICAL limits often missed:

# Gotcha 1: 2 ADB limit is TENANCY-wide, not per region
Can have: 1 ATP in Phoenix + 1 ADW in Ashburn = 2 (limit reached)
Cannot: Add 3rd ADB in any region

# Gotcha 2: Arm instances must be VM.Standard.A1.Flex only
Cannot: Use newer A2 shapes (paid only)

# Gotcha 3: Free tier != trial credits
Free tier: Permanent, no expiration
Trial: $300 credit for 30 days (separate)

# Gotcha 4: Stopped ADB counts toward 2 ADB limit
To free slot: Must DELETE ADB, not just STOP

OCI Terminology vs AWS/Azure

Migrating from AWS/Azure? Terminology traps:

OCI Term	AWS Equivalent	Azure Equivalent
VCN	VPC	Virtual Network
Subnet	Subnet	Subnet
Security List	VPC Security Group	NSG (network-level)
NSG	Security Group	Application Security Group
DRG	Virtual Private Gateway	VPN Gateway
Compartment	Resource Group / OU	Resource Group
Tenancy	Account	Subscription
Region	Region	Region
AD (Availability Domain)	Availability Zone	Availability Zone
Fault Domain	(within AZ)	Availability Set
Dynamic Group	IAM Role (for instances)	Managed Identity
Instance Principal	EC2 Instance Profile	Managed Identity
OCIR	ECR	Container Registry
OKE	EKS	AKS

Critical difference: OCI has both Security Lists (subnet) and NSGs (resource). AWS only has Security Groups (resource-level).

Multi-AD Architecture Patterns

OCI multi-AD specifics:

AD Distribution Strategy

OCI Regions with 3 ADs (most regions):
- US: Phoenix, Ashburn
- UK: London
- DE: Frankfurt
- AU: Sydney, Melbourne

Pattern: Distribute instances across all 3 ADs

AD-1: Web tier (2 instances) + DB primary
AD-2: Web tier (2 instances) + DB standby
AD-3: Web tier (2 instances) + DB standby

Load Balancer: Automatically spans ADs

Gotcha: Some shapes only available in specific ADs (check first)

# Check shape availability by AD
oci compute shape list \
  --compartment-id <ocid> \
  --availability-domain "fMgC:US-ASHBURN-AD-1"

Fault Domain Additional Layer

Within each AD, OCI has Fault Domains (FD):
- 3 FDs per AD
- Separate power, cooling, network
- <1ms latency within AD

Best practice: Spread instances across ADs AND FDs

AD-1:
  FD-1: Web instance 1
  FD-2: Web instance 2
  FD-3: Web instance 3

AD-2:
  FD-1: Web instance 4
  (repeat pattern)

Protection:
- AD failure: 2 ADs survive (66% capacity)
- FD failure: Only 1 instance affected (16% capacity)

When to use FDs: Only for extra-critical apps (adds complexity)

Compartment Strategy Best Practices

Compartment hierarchy (OCI-specific IAM boundary):

Root Compartment (tenancy)
│
├─ SharedServices (networking, security)
│  ├─ Network (VCNs, DRGs)
│  └─ Security (Vault, KMS, Cloud Guard)
│
├─ Production
│  ├─ App1
│  │  ├─ Compute
│  │  ├─ Database
│  │  └─ Storage
│  └─ App2
│
├─ NonProduction
│  ├─ Development
│  ├─ Testing
│  └─ Staging
│
└─ Sandbox (developers, auto-cleanup)

Key principles:
1. Billing separation: Compartment tags enable cost reporting by environment
2. IAM boundaries: Policies scoped to compartments (least privilege)
3. Quota isolation: Separate limits per compartment
4. Lifecycle: Delete entire compartment = deletes all resources inside

Anti-pattern: Flat structure with no hierarchy (AWS account-per-env habit)

Cost Optimization OCI-Specific

Flex Shape Savings (Unique to OCI)

Fixed shapes (legacy):
VM.Standard2.4: 4 OCPUs, 60 GB RAM, $218/month

Flex shapes (right-size RAM independently):
VM.Standard.E4.Flex: 4 OCPUs, 16 GB RAM, $109/month (50% savings)

Flex advantage: Pay only for RAM you need
- 1 OCPU = 1-64 GB RAM configurable
- Most apps don't need 15GB per OCPU (fixed ratio)

Migration: Replace fixed shapes with Flex for 30-50% savings

Arm Instance Savings (Generous Free Tier)

AMD instance: VM.Standard.E4.Flex (1 OCPU) = $0.03/hr
Arm instance: VM.Standard.A1.Flex (1 OCPU) = $0.01/hr (67% cheaper)

Always-Free Arm: 4 OCPUs free forever!

Use case: Web servers, CI/CD runners, dev environments
Limitation: ARM64 only (not all apps compatible)

Gotcha: Free tier is A1 shapes only, newer A2 shapes are paid

Storage Tiering (Exact Prices)

Tier	Cost/GB/Month	Use Case	Retrieval
Standard	$0.0255	Active data, frequent access	Instant, free
Infrequent Access	$0.0125 (51% off)	Backups, logs (accessed monthly)	Instant, $0.01/GB
Archive	$0.0024 (91% off)	Compliance, long-term retention	1 hour, $0.01/GB

Lifecycle policy example:

Day 0-30: Standard ($0.0255/GB/mo)
Day 31-90: Infrequent ($0.0125/GB/mo)
Day 91+: Archive ($0.0024/GB/mo)

1 TB data for 1 year:
Without tiering: $0.0255 × 1000 × 12 = $306/year
With tiering: $0.0255 × 1000 × 1 + $0.0125 × 1000 × 2 + $0.0024 × 1000 × 9 = $72/year
Savings: $234/year (76%)

Security Zones (OCI-Unique)

OCI Security Zones = Infrastructure-level policy enforcement:

Security Zone enforces:
✓ All storage encrypted
✓ No public buckets
✓ No internet gateways in VCN
✓ All databases private endpoint only
✓ Cloud Guard enabled

Enforcement: API rejects violating requests (preventive, not detective)

Example:
oci os bucket create --public-access-type ObjectRead
→ FAILS if compartment is in Security Zone

Use case: Production, PCI-DSS, healthcare (mandatory controls)

Gotcha: Security Zones can break existing automation (test in dev first)

Progressive Loading References

OCI Well-Architected Checklist

WHEN TO LOAD oci-well-architected-checklist.md:
- Running compliance checks against OCI tenancy
- Preparing for CIS OCI Foundations Benchmark audit
- Implementing automated security scanning
- Creating remediation scripts for common findings
- Setting up monitoring for drift detection

Do NOT load for:
- Quick anti-pattern reference (NEVER list above covers it)
- Architecture decisions (covered in this skill)
- Understanding OCI terminology (tables above)

When to Use This Skill

Architecture design: Multi-AD patterns, compartment strategy, VCN sizing
Migration from AWS/Azure: Terminology mapping, anti-pattern avoidance
Cost optimization: Free tier planning, Flex shapes, storage tiering
Security: Cloud Guard tuning, Security Zones, NSG vs Security Lists
Production readiness: SLA requirements, HA patterns, fault tolerance

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.