SHIP AGAINST PRODUCTION-SHAPED DATA
STEM provisions an isolated, PII-anonymized clone of your production database for every pull request. This guide covers installation, configuration, the anonymization engine, and the AWS architecture underneath it.
OVERVIEW
Testing against an empty seed database hides the bugs that matter: slow queries on real data volumes, migrations that lock hot tables, and edge cases that only exist in production rows. Testing against production itself exposes customer PII to every reviewer on the pull request.
STEM removes that trade-off. When a pull request opens, STEM creates a copy-on-write clone of your Aurora PostgreSQL cluster, runs its anonymization engine over every column classified as sensitive, and posts the connection string to the PR — typically in under 30 seconds.
- Branch
- An isolated database clone tied to one pull request. Created on open, destroyed on merge or close.
- Warm pool
- Pre-provisioned Aurora clone capacity that lets branches activate in seconds instead of minutes.
- Masking pass
- The anonymization run that replaces PII with format-preserving synthetic values before any credentials are issued.
QUICKSTART
STEM is a GitHub App plus a control plane that reaches your AWS account through a scoped IAM role. No CLI, no workflow files — three steps from zero to first branch:
Step 1 — Sign in with GitHub. Go to /login and authenticate. STEM reads your public profile only — repository access is granted in the next step, never through OAuth scopes.
Step 2 — Install the STEM App. From the Connect page, install the GitHub App on the repositories that should get database branches. The App requests Pull requests: write (to post the branch comment) and Contents: read — nothing else. Webhooks fire automatically; there is no workflow file to add.
Step 3 — Connect AWS (two parts). Still on the Connect page: first open AWS CloudShell, paste the single command shown there, and copy back the Role ARN it prints. Then name your source Aurora cluster (cluster ID, subnet group, security group, region) so STEM knows what to clone. PR clones are provisioned in your account and billed to you. The command deploys a CloudFormation stack containing one least-privilege IAM role:
describe / list    account-wide (RDS has no resource scoping here)
clone create       RestoreDBClusterToPointInTime, CreateDBInstance
modify / delete    ONLY resources named stem-pr-*; your source cluster is read-only to STEM, never modified
trust              STEM's account only, gated by your ExternalId
Open a pull request on a connected repo. stem-ci posts a comment with the branch endpoint when the masking pass completes, typically in under 30 seconds. Watch it live on the dashboard.
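The trust relationship described in step 3 follows the standard cross-account pattern in IAM. The sketch below is illustrative only and is not STEM's actual CloudFormation template: the function name, the account ID, and the external ID value are hypothetical placeholders, while `sts:AssumeRole` and the `sts:ExternalId` condition key are standard IAM.

```python
import json


def build_trust_policy(stem_account_id: str, external_id: str) -> dict:
    """Return a trust policy that lets one AWS account assume the role,
    but only when the caller presents the expected ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{stem_account_id}:root"},
                "Action": "sts:AssumeRole",
                # Without a matching ExternalId the AssumeRole call is denied.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    policy = build_trust_policy("111122223333", "example-external-id")
    print(json.dumps(policy, indent=2))
```

The ExternalId condition is what prevents another customer of the same service from pointing it at your role, which is why the value should be unique per account connection.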
YOUR FIRST BRANCH
Once GitHub and AWS are connected, STEM reacts to pull requests automatically. Walk through the full loop once to see it work end to end:
1 — Make a change. In a repository where you installed the STEM App, create a branch and edit any file (a single README line is enough). Commit and push it.
2 — Open a pull request against the default branch. STEM's webhook fires the instant the PR opens — no workflow file, no CI step to add.
3 — Watch the dashboard. Within roughly 30 seconds a card appears and advances through the states below. On the Vercel free plan, the pipeline advances only on a daily cron schedule, so during a demo use the Advance Pipeline button in the operator console to move it forward immediately.
QUEUED           clone requested, waiting for the pipeline
PROVISIONING     Aurora clone attaching, PII masking running
ACTIVE           credentials issued, endpoint posted to the PR
NEEDS MASKING    no PII columns detected; clone withheld (fail closed)
4 — Use the branch. Expand the card for the masked database endpoint and the list of anonymized columns, or read them from the comment STEM posts on the PR. Point your preview deployment or local app at that connection string.
5 — Close or merge the PR. STEM destroys the clone and every associated AWS resource automatically — there is nothing to clean up and no lingering cost.
If the card shows NEEDS MASKING, STEM scanned the schema and recognized no PII columns, so it withheld the clone by design. Align your column names with the recognized patterns (see Anonymization) or add explicit rules, then reopen the PR.
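The fail-closed behavior can be sketched as a small decision function. This is a simplified illustration, not STEM's classifier: the name patterns and function names below are assumptions, and the real engine also considers PostgreSQL types and a sampling heuristic.

```python
import re

# Hypothetical name patterns, for illustration only.
PII_NAME_PATTERNS = [
    r"email",
    r"phone",
    r"(first|last|full)_?name",
    r"address",
    r"card",
    r"iban",
]


def detect_pii_columns(columns: list[str]) -> list[str]:
    """Return the column names that match any PII name pattern."""
    return [
        col
        for col in columns
        if any(re.search(p, col, re.IGNORECASE) for p in PII_NAME_PATTERNS)
    ]


def decide_state(columns: list[str]) -> str:
    """Fail closed: if nothing is recognized as PII, withhold the clone."""
    return "ACTIVE" if detect_pii_columns(columns) else "NEEDS MASKING"
```

A schema with only `id` and `created_at` columns yields `NEEDS MASKING` under this sketch, because an empty detection result is treated as "the classifier may have missed something" rather than "there is nothing to mask".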
CONFIGURATION
Roadmap: per-repo configuration will ship as a stem.config.json file at the repository root. In the current release the file is not yet read, and every branch uses the production-safe defaults shown below.
{
  "cluster": "prod-aurora-pg",
  "region": "us-east-1",
  "ttl_hours": 72,
  "warm_pool_size": 3,
  "masking_profile": "default",
  "destroy_on": ["merged", "closed"],
  "comment_template": "compact"
}
- ttl_hours
- Hard expiry for a branch even if the PR stays open. Prevents forgotten clones from accruing cost.
- warm_pool_size
- Number of pre-warmed clones held ready. Higher values reduce p99 provision time at higher standing cost.
- destroy_on
- PR events that trigger teardown. Branch storage is reclaimed within minutes of destruction.
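To make the defaults concrete, here is a minimal sketch of how a stem.config.json could be parsed and merged over them. The `StemConfig` class, the `parse_config` function, and the validation rules are assumptions for illustration; STEM does not read this file in the current release.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StemConfig:
    # Defaults mirror the sample configuration above.
    cluster: str = "prod-aurora-pg"
    region: str = "us-east-1"
    ttl_hours: int = 72
    warm_pool_size: int = 3
    masking_profile: str = "default"
    destroy_on: tuple = ("merged", "closed")
    comment_template: str = "compact"


def parse_config(text: str) -> StemConfig:
    """Parse JSON text, reject unknown keys, and apply defaults for missing ones."""
    raw = json.loads(text)
    known = {f.name for f in fields(StemConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    if "destroy_on" in raw:
        raw["destroy_on"] = tuple(raw["destroy_on"])
    cfg = StemConfig(**raw)
    # Hypothetical sanity checks; the real limits are not documented.
    if cfg.ttl_hours <= 0:
        raise ValueError("ttl_hours must be positive")
    if cfg.warm_pool_size < 0:
        raise ValueError("warm_pool_size must not be negative")
    return cfg
```

Rejecting unknown keys is a design choice in this sketch: a typo such as `ttl_hour` fails loudly instead of silently falling back to the 72-hour default.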
ANONYMIZATION
The masking pass runs inside the clone before any credentials exist, so raw PII is never reachable from a branch. Columns are classified by name patterns, PostgreSQL type, and a sampling heuristic, then masked with format-preserving strategies:
- email
- Replaced with deterministic synthetic addresses. Uniqueness constraints survive; the same source value maps to the same masked value within a branch.
- card / iban
- Replaced with checksum-valid test numbers. Luhn passes, payment processors reject.
- name / address
- Replaced from synthetic dictionaries, preserving length distribution and character set.
- free text
- Columns flagged as free-form are shredded with token-level replacement to defeat re-identification.
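Two of the strategies above can be illustrated in a few lines. This is a sketch under stated assumptions, not STEM's engine: the per-branch key, the `example.com` domain, and the function names are hypothetical, and keyed HMAC-SHA256 is one common way to get a deterministic mapping. The Luhn functions implement the standard checksum.

```python
import hashlib
import hmac


def mask_email(source: str, branch_key: bytes) -> str:
    """Deterministically map an email to a synthetic address.

    The same source value and key always produce the same output, so
    joins on the column still line up within a branch.
    """
    digest = hmac.new(branch_key, source.encode("utf-8"), hashlib.sha256).hexdigest()
    # 16 hex characters (64 bits) makes collisions between distinct inputs negligible.
    return f"user_{digest[:16]}@example.com"


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit to append to a string of digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Return True if the full number (including check digit) passes Luhn."""
    return luhn_check_digit(number[:-1]) == number[-1]
```

Passing the Luhn check only makes a number well-formed. Whether processors reject it depends on drawing the leading digits from ranges that are never issued, which this sketch does not attempt.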
Overrides live in the masking profile. Mark columns as passthrough (never masked) or drop (nulled entirely):
version: 1
rules:
  - table: users
    column: email
    strategy: synthetic_email
  - table: support_tickets
    column: body
    strategy: token_shred
  - table: feature_flags
    column: "*"
    strategy: passthrough
ARCHITECTURE
STEM is built on two AWS database services with deliberately separated responsibilities:
Aurora PostgreSQL — the data plane. Branches are Aurora copy-on-write clones. A clone shares unchanged pages with the source cluster, so creating one moves no data and costs storage only for pages the branch actually modifies. This is what makes sub-30-second provisioning physically possible on multi-terabyte databases.
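In AWS API terms, a copy-on-write clone is a `RestoreDBClusterToPointInTime` call with `RestoreType` set to `copy-on-write`, followed by `CreateDBInstance` to give the new cluster a compute instance. The sketch below builds the keyword arguments that boto3's RDS client accepts for those two calls. The helper names, the `stem-pr-<number>` naming format, and the instance class are assumptions; only the `stem-pr-` prefix comes from this guide.

```python
def clone_request(source_cluster: str, pr_number: int, subnet_group: str,
                  security_group_ids: list[str]) -> dict:
    """Keyword arguments for rds.restore_db_cluster_to_point_in_time
    that request a copy-on-write clone of the source cluster."""
    return {
        "DBClusterIdentifier": f"stem-pr-{pr_number}",
        "SourceDBClusterIdentifier": source_cluster,
        "RestoreType": "copy-on-write",      # share pages instead of copying data
        "UseLatestRestorableTime": True,
        "DBSubnetGroupName": subnet_group,
        "VpcSecurityGroupIds": security_group_ids,
    }


def instance_request(pr_number: int, instance_class: str = "db.r6g.large") -> dict:
    """Keyword arguments for rds.create_db_instance; a clone has no
    compute until an instance is added to it."""
    return {
        "DBInstanceIdentifier": f"stem-pr-{pr_number}-1",
        "DBClusterIdentifier": f"stem-pr-{pr_number}",
        "Engine": "aurora-postgresql",
        "DBInstanceClass": instance_class,
    }


# Usage with boto3 (not executed here; requires AWS credentials):
#   import boto3
#   rds = boto3.client("rds", region_name="us-east-1")
#   rds.restore_db_cluster_to_point_in_time(**clone_request("prod-aurora-pg", 42, "my-subnets", ["sg-0123"]))
#   rds.create_db_instance(**instance_request(42))
```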
Aurora DSQL — the control plane. Branch metadata — state machines, TTL clocks, masking manifests, audit events — lives in Aurora DSQL. Its serverless, multi-region consistency means the control plane has no instances to manage and survives regional failure without losing track of a single branch.
The two planes never share credentials. A compromise of the control plane cannot read branch data; the data plane has no knowledge of GitHub tokens.
BRANCH LIFECYCLE
Every branch moves through a strict state machine, visible live on the dashboard:
queued          PR opened, waiting for warm-pool capacity
provisioning    Aurora clone attaching, masking pass running
active          credentials issued, posted to the PR
destroyed       PR merged/closed or TTL expired; storage reclaimed
Transitions are recorded in the audit log with actor, timestamp, and cause. There is no manual state: a branch cannot be kept alive past its TTL without a new PR event.
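The lifecycle can be modeled as a small state machine that refuses illegal transitions and appends an audit record for each legal one. This is a sketch, not STEM's implementation: the class names are hypothetical, and allowing `queued` and `provisioning` to move straight to `destroyed` (for a PR closed early) is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    DESTROYED = "destroyed"


# Legal next states for each state; destroyed is terminal.
ALLOWED = {
    State.QUEUED: {State.PROVISIONING, State.DESTROYED},
    State.PROVISIONING: {State.ACTIVE, State.DESTROYED},
    State.ACTIVE: {State.DESTROYED},
    State.DESTROYED: set(),
}


@dataclass
class Branch:
    pr_number: int
    state: State = State.QUEUED
    audit_log: list = field(default_factory=list)

    def transition(self, new_state: State, actor: str, cause: str) -> None:
        """Move to new_state and record actor, timestamp, and cause."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.audit_log.append({
            "from": self.state.value,
            "to": new_state.value,
            "actor": actor,
            "cause": cause,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        self.state = new_state
```

Because `destroyed` has no outgoing transitions, a destroyed branch can never be revived in this model; a new PR event would create a new `Branch`.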
LIMITS & REGIONS
- Clone limit
- Aurora supports up to 15 clones per source cluster. STEM queues PRs beyond that and activates them FIFO as branches are destroyed.
- Regions
- Available in every region offering both Aurora PostgreSQL and Aurora DSQL. The control plane is multi-region by default.
- Engine support
- Aurora PostgreSQL 13 and newer. MySQL support is on the public roadmap.
- Branch size
- No practical limit — copy-on-write means a branch of a 10 TB cluster starts at near-zero incremental storage.
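The clone-limit queueing described above amounts to a bounded set of active branches plus a FIFO waiting line. A minimal sketch follows; the `CloneScheduler` class and its method names are hypothetical, and the limit of 15 is taken from the Clone limit entry.

```python
from collections import deque

MAX_CLONES = 15  # per-source-cluster clone limit stated under "Clone limit"


class CloneScheduler:
    def __init__(self, limit: int = MAX_CLONES) -> None:
        self.limit = limit
        self.active: set[int] = set()
        self.waiting: deque[int] = deque()

    def open_pr(self, pr: int) -> bool:
        """Activate the PR if a slot is free, otherwise queue it.
        Returns True when the branch was activated immediately."""
        if len(self.active) < self.limit:
            self.active.add(pr)
            return True
        self.waiting.append(pr)
        return False

    def destroy(self, pr: int):
        """Release the PR's slot and promote the oldest waiting PR, if any.
        Returns the promoted PR number, or None if nothing was promoted."""
        if pr in self.waiting:
            # PR closed while still queued: it never held a slot.
            self.waiting.remove(pr)
            return None
        self.active.discard(pr)
        if self.waiting and len(self.active) < self.limit:
            promoted = self.waiting.popleft()
            self.active.add(promoted)
            return promoted
        return None
```

For example, with `limit=2`, opening PRs 1 through 4 activates 1 and 2 and queues 3 and 4; destroying PR 1 then promotes PR 3, the oldest waiter.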