dkyazzentwatwa

data-anonymizer

# Install this skill:
npx skills add dkyazzentwatwa/chatgpt-skills --skill "data-anonymizer"

This installs only the data-anonymizer skill from the multi-skill repository.

# Description

Detect and mask PII (names, emails, phones, SSN, addresses) in text and CSV files. Multiple masking strategies with reversible tokenization option.

# SKILL.md


---
name: data-anonymizer
description: Detect and mask PII (names, emails, phones, SSN, addresses) in text and CSV files. Multiple masking strategies with reversible tokenization option.
---


Data Anonymizer

Detect and mask personally identifiable information (PII) in text documents and structured data. Supports multiple masking strategies and can process CSV files at scale.

Quick Start

from scripts.data_anonymizer import DataAnonymizer

# Anonymize text
anonymizer = DataAnonymizer()
result = anonymizer.anonymize("Contact John Smith at john.smith@example.com or 555-123-4567")
print(result)
# "Contact [NAME] at [EMAIL] or [PHONE]"

# Anonymize CSV
anonymizer.anonymize_csv("customers.csv", "customers_anon.csv")

Features

  • PII Detection: Names, emails, phones, SSN, addresses, credit cards, dates
  • Multiple Strategies: Mask, redact, hash, fake data replacement
  • CSV Processing: Anonymize specific columns or auto-detect
  • Reversible Tokens: Optional mapping for de-anonymization
  • Custom Patterns: Add your own PII patterns
  • Audit Report: List all detected PII with locations

API Reference

Initialization

anonymizer = DataAnonymizer(
    strategy="mask",      # mask, redact, hash, fake
    reversible=False      # Enable token mapping
)

Text Anonymization

# Basic anonymization
result = anonymizer.anonymize(text)

# With specific PII types
result = anonymizer.anonymize(text, pii_types=["email", "phone"])

# Get detected PII report
result, report = anonymizer.anonymize(text, return_report=True)
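
The shape of the report is not documented here. As a rough illustration of what a per-match audit report could contain, the sketch below uses only the standard library. The names `PATTERNS`, `find_pii`, and `mask_matches`, the simplified regexes, and the report fields are all assumptions made for this example, not the skill's actual API.

```python
import re

# Hypothetical, simplified patterns; the skill's real detectors are not shown in this document.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_pii(text, pii_types=None):
    """Return one report entry per match, sorted by position in the text."""
    selected = pii_types or list(PATTERNS)
    report = []
    for pii_type in selected:
        for match in PATTERNS[pii_type].finditer(text):
            report.append({
                "type": pii_type,
                "start": match.start(),
                "end": match.end(),
                "value": match.group(),
            })
    return sorted(report, key=lambda entry: entry["start"])


def mask_matches(text, report):
    """Replace each reported span with a type label such as [EMAIL]."""
    # Work from the end of the text so earlier offsets stay valid.
    for entry in sorted(report, key=lambda e: e["start"], reverse=True):
        label = f"[{entry['type'].upper()}]"
        text = text[:entry["start"]] + label + text[entry["end"]:]
    return text


text = "Contact jane@example.com or 555-123-4567"
report = find_pii(text)
masked = mask_matches(text, report)
emails_only = find_pii(text, pii_types=["email"])
```

This sketch does not handle overlapping matches between patterns; a fuller implementation would need a rule for which detector wins.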

Masking Strategies

text = "Email joe@example.com, call 555-1234"

# Mask (default) - replace with type labels
anonymizer.strategy = "mask"
print(anonymizer.anonymize(text))
# "Email [EMAIL], call [PHONE]"

# Redact - replace with asterisks
anonymizer.strategy = "redact"
print(anonymizer.anonymize(text))
# "Email ***************, call ********"

# Hash - replace with a short hash (values shown are illustrative)
anonymizer.strategy = "hash"
print(anonymizer.anonymize(text))
# "Email a1b2c3d4, call e5f6a7b8"

# Fake - replace with realistic fake data (values vary per run)
anonymizer.strategy = "fake"
print(anonymizer.anonymize(text))
# "Email fake.user@example.net, call 555-9876"
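
How each strategy transforms a single detected value can be sketched in a few lines. This is a minimal illustration, assuming a SHA-256 digest truncated to 8 hex characters for the hash strategy; the function name `apply_strategy` and the truncation length are assumptions, and the skill may implement these differently. The fake strategy is left out because it needs a data generator such as the Faker package listed under Dependencies.

```python
import hashlib


def apply_strategy(value, pii_type, strategy):
    """Transform one detected PII value (hypothetical helper, not the skill's API)."""
    if strategy == "mask":
        # Replace with a type label, e.g. [PHONE].
        return f"[{pii_type.upper()}]"
    if strategy == "redact":
        # Keep the length, hide every character.
        return "*" * len(value)
    if strategy == "hash":
        # Deterministic: the same input always gives the same short digest.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    raise ValueError(f"unsupported strategy: {strategy}")


masked = apply_strategy("555-1234", "phone", "mask")
redacted = apply_strategy("555-1234", "phone", "redact")
hashed = apply_strategy("555-1234", "phone", "hash")
```

Mask discards everything but the type, redact keeps the length, and hash keeps equality between values, which is what makes joins on hashed columns possible.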

CSV Processing

# Auto-detect PII columns
anonymizer.anonymize_csv("input.csv", "output.csv")

# Specify columns
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    columns=["name", "email", "phone"]
)

# Different strategies per column
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    column_strategies={
        "name": "fake",
        "email": "hash",
        "ssn": "redact"
    }
)
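
Per-column strategies amount to a lookup from column name to transformation. The sketch below shows that idea with the standard `csv` module and in-memory files so that it runs without dependencies; the skill itself lists pandas, and `anonymize_rows`, `STRATEGIES`, and the 12-character digest are assumptions made for this example.

```python
import csv
import hashlib
import io


def redact(value):
    return "*" * len(value)


def short_hash(value):
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Hypothetical strategy table; only two strategies are sketched here.
STRATEGIES = {"redact": redact, "hash": short_hash}


def anonymize_rows(rows, column_strategies):
    """Yield copies of each row with the chosen columns transformed."""
    for row in rows:
        out = dict(row)
        for column, strategy in column_strategies.items():
            if column in out:
                out[column] = STRATEGIES[strategy](out[column])
        yield out


source = io.StringIO("id,email,ssn\n1,jane@example.com,123-45-6789\n")
reader = csv.DictReader(source)
target = io.StringIO()
writer = csv.DictWriter(target, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(anonymize_rows(reader, {"email": "hash", "ssn": "redact"}))
output = target.getvalue()
```

Columns not named in the mapping pass through unchanged, which keeps identifiers such as `id` usable as join keys.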

Reversible Anonymization

anonymizer = DataAnonymizer(reversible=True)

# Anonymize with token mapping
result = anonymizer.anonymize("John Smith: john.smith@example.com")
mapping = anonymizer.get_mapping()

# Save mapping securely
anonymizer.save_mapping("mapping.json", encrypt=True, password="secret")

# Later, de-anonymize
anonymizer.load_mapping("mapping.json", password="secret")
original = anonymizer.deanonymize(result)
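
The idea behind reversible tokens is a two-way mapping between original values and placeholder tokens. The sketch below illustrates it for email addresses only; the `TokenVault` class, the `<EMAIL_n>` token format, and the regex are assumptions for this example and do not describe how the skill stores or encrypts its mapping.

```python
import re

# Simplified email pattern, assumed for illustration.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class TokenVault:
    """Hypothetical two-way mapping between values and tokens."""

    def __init__(self):
        self.value_to_token = {}
        self.token_to_value = {}

    def tokenize(self, text):
        def swap(match):
            value = match.group()
            # Reuse the token if this value has been seen before.
            if value not in self.value_to_token:
                token = f"<EMAIL_{len(self.value_to_token) + 1}>"
                self.value_to_token[value] = token
                self.token_to_value[token] = value
            return self.value_to_token[value]

        return EMAIL.sub(swap, text)

    def detokenize(self, text):
        # Unknown tokens are left in place rather than raising.
        return re.sub(
            r"<EMAIL_\d+>",
            lambda m: self.token_to_value.get(m.group(), m.group()),
            text,
        )


vault = TokenVault()
original = "Write to jane@example.com or bob@example.org; jane@example.com replies fastest."
tokenized = vault.tokenize(original)
restored = vault.detokenize(tokenized)
```

Whoever holds the mapping can recover the original values, so it needs the same protection as the source data.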

Custom Patterns

# Add custom PII pattern
anonymizer.add_pattern(
    name="employee_id",
    pattern=r"EMP-\d{6}",
    label="[EMPLOYEE_ID]"
)
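
A custom pattern is essentially a named regex paired with a replacement label. The following sketch shows one way such a registry could work; `PatternRegistry` and its methods are hypothetical and only mirror the argument names used above.

```python
import re


class PatternRegistry:
    """Hypothetical registry of (name, regex, label) entries."""

    def __init__(self):
        self._patterns = []

    def add_pattern(self, name, pattern, label):
        self._patterns.append((name, re.compile(pattern), label))

    def apply(self, text):
        for _name, regex, label in self._patterns:
            # A function replacement avoids any escape processing of the label.
            text = regex.sub(lambda match, label=label: label, text)
        return text


registry = PatternRegistry()
registry.add_pattern(name="employee_id", pattern=r"\bEMP-\d{6}\b", label="[EMPLOYEE_ID]")
result = registry.apply("Badge EMP-004217 was issued; EMP-12 is not a valid ID.")
```

The word boundaries (`\b`) are a deliberate addition in this sketch: without them, `EMP-\d{6}` would also match the first six digits of a longer number such as `EMP-0042170`.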

CLI Usage

# Anonymize text file
python data_anonymizer.py --input document.txt --output document_anon.txt

# Anonymize CSV
python data_anonymizer.py --input customers.csv --output customers_anon.csv

# Specific strategy
python data_anonymizer.py --input data.csv --output anon.csv --strategy fake

# Generate audit report
python data_anonymizer.py --input document.txt --report audit.json

# Specific PII types only
python data_anonymizer.py --input doc.txt --types email phone ssn

CLI Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| --input | Input file | Required |
| --output | Output file | Required |
| --strategy | Masking strategy | mask |
| --types | PII types to detect | all |
| --columns | CSV columns to process | auto |
| --report | Generate audit report | - |
| --reversible | Enable token mapping | False |
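
For readers who want to see how the arguments in this table could be declared, here is a sketch using `argparse`. It follows the table literally, so `--output` is marked required; the parser in the actual script may differ, and `build_parser` is a name invented for this example.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the CLI Arguments table."""
    parser = argparse.ArgumentParser(description="Anonymize PII in text or CSV files.")
    parser.add_argument("--input", required=True, help="Input file")
    parser.add_argument("--output", required=True, help="Output file")
    parser.add_argument(
        "--strategy",
        default="mask",
        choices=["mask", "redact", "hash", "fake"],
        help="Masking strategy",
    )
    # None stands for "all types" / "auto-detect columns" in this sketch.
    parser.add_argument("--types", nargs="+", default=None, help="PII types to detect")
    parser.add_argument("--columns", nargs="+", default=None, help="CSV columns to process")
    parser.add_argument("--report", default=None, metavar="PATH", help="Write an audit report")
    parser.add_argument("--reversible", action="store_true", help="Enable token mapping")
    return parser


args = build_parser().parse_args(
    ["--input", "doc.txt", "--output", "doc_anon.txt", "--types", "email", "phone"]
)
```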

Supported PII Types

| Type | Examples | Pattern |
|------|----------|---------|
| name | John Smith, Mary Johnson | NLP-based |
| email | user@example.com | Regex |
| phone | 555-123-4567, (555) 123-4567 | Regex |
| ssn | 123-45-6789 | Regex |
| credit_card | 4111-1111-1111-1111 | Regex + Luhn |
| address | 123 Main St, City, ST 12345 | NLP + Regex |
| date_of_birth | 01/15/1990, January 15, 1990 | Regex |
| ip_address | 192.168.1.1 | Regex |
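
The "Regex + Luhn" entry for credit cards refers to the Luhn checksum, which filters out digit runs that merely look like card numbers. Below is a standalone sketch of the standard algorithm; the function name `luhn_valid` is chosen for this example and is not taken from the skill.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


valid = luhn_valid("4111-1111-1111-1111")
invalid = luhn_valid("4111-1111-1111-1112")
```

Running the check after the regex match reduces false positives such as order numbers or timestamps that happen to contain 16 digits.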

Examples

Anonymize Customer Support Logs

anonymizer = DataAnonymizer(strategy="mask")

log = """
Ticket #1234: Customer John Doe (john.doe@example.com) called about
billing issue. SSN on file: 123-45-6789. Callback number: 555-867-5309.
Address: 123 Oak Street, Springfield, IL 62701.
"""

result = anonymizer.anonymize(log)
print(result)
# Ticket #1234: Customer [NAME] ([EMAIL]) called about
# billing issue. SSN on file: [SSN]. Callback number: [PHONE].
# Address: [ADDRESS].

GDPR Compliance for Database Export

anonymizer = DataAnonymizer(strategy="hash")

# Consistent hashing for joins
anonymizer.anonymize_csv(
    "users.csv",
    "users_anon.csv",
    columns=["email", "name", "phone"]
)

anonymizer.anonymize_csv(
    "orders.csv",
    "orders_anon.csv",
    columns=["customer_email"]  # Same hash as users.email
)
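
Joins keep working after hashing only if the same input always produces the same output in every file. The sketch below illustrates that property with a keyed hash (HMAC-SHA256) from the standard library. Using a secret key, normalizing the value to lowercase, and truncating to 16 characters are all assumptions of this example; the document does not say how the skill computes its hashes.

```python
import hashlib
import hmac


def pseudonymize(value, key):
    """Keyed, deterministic hash of a value (hypothetical helper)."""
    # Normalize so "Jane@Example.com" and "jane@example.com" hash alike.
    normalized = value.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


key = b"replace-with-a-secret-key"  # placeholder; load a real key from secure storage

users = [{"email": "Jane@Example.com", "plan": "pro"}]
orders = [{"customer_email": "jane@example.com", "total": "42.00"}]

users_anon = [{**u, "email": pseudonymize(u["email"], key)} for u in users]
orders_anon = [
    {**o, "customer_email": pseudonymize(o["customer_email"], key)} for o in orders
]
```

A keyed hash is used here because plain hashes of low-entropy values such as phone numbers can be reversed by hashing every candidate. This is pseudonymization rather than full anonymization, since whoever holds the key can re-link records.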

Generate Test Data from Production

anonymizer = DataAnonymizer(strategy="fake")

# Replace real PII with realistic fake data
anonymizer.anonymize_csv(
    "production_data.csv",
    "test_data.csv"
)

# The test data keeps the same structure, with fake values in place of real PII

Dependencies

pandas>=2.0.0
faker>=18.0.0

Limitations

  • Name detection may miss unusual names
  • Address detection works best for US formats
  • Custom patterns may be needed for domain-specific PII
  • Fake data replacement doesn't preserve exact format
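
Where the exact format matters, for example when downstream validation expects `NNN-NNN-NNNN`, one workaround is to randomize characters class by class while leaving separators alone. The sketch below is an independent illustration of that idea, not a feature of the skill; `scramble_preserving_format` is a hypothetical helper.

```python
import random
import string


def scramble_preserving_format(value, rng):
    """Replace digits with digits and letters with letters of the same case."""
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        else:
            # Separators and punctuation are kept so the shape survives.
            out.append(ch)
    return "".join(out)


rng = random.Random(0)  # seeded only to make the example repeatable
fake_phone = scramble_preserving_format("555-123-4567", rng)
fake_id = scramble_preserving_format("EMP-004217", rng)
```

The result keeps length and punctuation but not semantic validity: a scrambled card number will usually fail a Luhn check, and a scrambled value can occasionally coincide with a real one.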
