Support

DIAGNOSE IT LIKE WE WOULD

These are the same runbooks our team uses. Most branch issues resolve in under a minute — start with the symptom, follow the resolution, escalate if it persists.

01 / Runbooks

TROUBLESHOOTING

RB-01

Branch stuck in QUEUED

Likely cause — The source cluster has hit Aurora's 15-clone limit, or the warm pool is exhausted.

Resolution — Check the dashboard's warm pool panel. Destroy stale branches (merged PRs with TTL remaining) or raise warm_pool_size in stem.config.json. Queued branches activate FIFO automatically.
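The queueing behaviour described in this runbook can be pictured with a small toy model. This is only a sketch under assumptions: the class and method names are hypothetical, the PROVISIONING step is skipped for brevity, and the real control plane may schedule differently. The `size` field stands in for warm_pool_size.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WarmPool:
    """Toy model of a fixed-size pool with a FIFO wait queue (illustrative only)."""

    size: int                                      # stands in for warm_pool_size
    active: set = field(default_factory=set)       # branches currently holding a slot
    queued: deque = field(default_factory=deque)   # waiting branches, oldest first

    def request(self, branch: str) -> str:
        """Give the branch a slot if one is free, otherwise put it in the queue."""
        if len(self.active) < self.size:
            self.active.add(branch)
            return "ACTIVE"
        self.queued.append(branch)
        return "QUEUED"

    def destroy(self, branch: str) -> Optional[str]:
        """Free the branch's slot and promote the longest-waiting branch, if any."""
        self.active.discard(branch)
        if self.queued and len(self.active) < self.size:
            promoted = self.queued.popleft()
            self.active.add(promoted)
            return promoted
        return None
```

In this model, destroying one stale branch is enough to activate the oldest queued branch, which is why clearing merged PRs is usually the quickest fix.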

RB-02

stem-ci never comments on the PR

Likely cause — The STEM GitHub App is not installed on the repository, so the pull_request webhook never reaches the control plane.

Resolution — Open Connect from the dashboard account menu and check repository access. Install (or extend) the App on the repo, then reopen the PR — no workflow file is needed; the App's webhook fires automatically.

RB-03

Provision time above 30s

Likely cause — Cold start — no warm clone was available, so a fresh Aurora clone had to attach from scratch.

Resolution — Increase warm_pool_size. Each warm slot adds standing cost but removes the cold-attach penalty. The provision-time chart on the dashboard shows which runs were cold.

RB-04

Masked values break my tests

Likely cause — Tests assert against specific production values, which the masking pass replaced.

Resolution — Masking is deterministic within a branch — assert on shape, not values. For fixture columns that are genuinely non-sensitive, add a passthrough rule to your masking profile.
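As an illustration of asserting on shape rather than values, here is a minimal sketch. The `mask_email` function is a hypothetical stand-in, not STEM's masking engine: it assumes a keyed hash per branch purely to mimic "deterministic within a branch". The regex and the example.com domain are likewise assumptions.

```python
import hashlib
import hmac
import re

# Shape of a masked e-mail in this sketch (assumed format, not STEM's).
EMAIL_SHAPE = re.compile(r"^[a-z0-9]+@example\.com$")


def mask_email(value: str, branch_key: bytes) -> str:
    """Hypothetical masking rule: the same input and key always give the same output."""
    digest = hmac.new(branch_key, value.encode(), hashlib.sha256).hexdigest()
    return f"user{digest[:12]}@example.com"


def test_masked_email_keeps_its_shape() -> None:
    key = b"branch-123-key"  # made-up per-branch key
    masked = mask_email("jane.doe@corp.example", key)

    # Brittle: comparing against the production value fails once masking has run.
    assert masked != "jane.doe@corp.example"

    # Robust: the value still looks like an e-mail address.
    assert EMAIL_SHAPE.match(masked)

    # Deterministic within a branch: the same row masks to the same value,
    # so joins and equality checks between masked columns keep working.
    assert masked == mask_email("jane.doe@corp.example", key)
```

Tests written this way pass against any branch, because they depend on the format of the data and not on a particular production row.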

RB-05

Connection refused on branch endpoint

Likely cause — The branch was destroyed — its PR merged/closed, or TTL expired.

Resolution — Branch credentials die with the branch by design. Push a new commit or reopen the PR to get a fresh branch. Check the audit log for the destruction cause.

RB-06

IAM role deployment fails

Likely cause — The deploying principal lacks iam:CreateRole, or SCP guardrails block role creation.

Resolution — Run the CloudShell command once as a principal with IAM admin rights. The role itself stays minimal: describe and clone-create permissions apply account-wide, while delete permissions are locked to stem-pr-* resources. Review the role's trust policy against your org's SCPs.
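To make "describe + clone create account-wide, delete locked to stem-pr-*" concrete, the sketch below builds a policy document of that shape. It is an assumption about what such a policy could look like, not the policy the CloudShell command actually deploys: the action list, statement IDs, and helper name are illustrative. Aurora clones are created through `rds:RestoreDBClusterToPointInTime`, which is why that action appears here.

```python
import json


def build_role_policy(resource_prefix: str = "stem-pr-") -> dict:
    """Build an illustrative least-privilege policy (hypothetical helper, assumed actions)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read and clone anywhere in the account.
                "Sid": "DescribeAndCloneAccountWide",
                "Effect": "Allow",
                "Action": [
                    "rds:DescribeDBClusters",
                    "rds:DescribeDBInstances",
                    "rds:RestoreDBClusterToPointInTime",
                    "rds:CreateDBInstance",
                ],
                "Resource": "*",
            },
            {
                # Delete only resources whose identifier starts with the prefix.
                "Sid": "DeleteOnlyStemBranches",
                "Effect": "Allow",
                "Action": ["rds:DeleteDBCluster", "rds:DeleteDBInstance"],
                "Resource": [
                    f"arn:aws:rds:*:*:cluster:{resource_prefix}*",
                    f"arn:aws:rds:*:*:db:{resource_prefix}*",
                ],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_role_policy(), indent=2))
```

Comparing the deployed role against a document like this is a quick way to see which statement an SCP would have to permit.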

RB-07

AWS connect says it can't assume the role

Likely cause — IAM is eventually consistent — a just-created role can take ~30 seconds to become assumable. Otherwise the ExternalId in the stack doesn't match yours, or the stack deployed to a different account.

Resolution — Wait 30 seconds and click Verify & Connect again. Still failing? Re-run the CloudShell command from the Connect page (it carries your current ExternalId) and confirm you're signed into the intended AWS account.
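If you script the connection check yourself, the "wait and try again" advice amounts to a bounded retry. The helper below is a generic sketch, not part of STEM: its name and defaults are assumptions, and `attempt` is whatever callable performs the role assumption (for example a wrapper around an STS AssumeRole call that passes your ExternalId).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_until_consistent(
    attempt: Callable[[], T],
    retries: int = 7,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call attempt() until it succeeds or the retry budget is used up.

    With the defaults this waits at most 6 x 5 = 30 seconds in total,
    roughly the propagation delay mentioned above.
    """
    last_error: Optional[Exception] = None
    for i in range(retries):
        try:
            return attempt()
        except Exception as error:  # a real script should catch the specific access error
            last_error = error
            if i < retries - 1:
                sleep(delay_seconds)
    raise RuntimeError(f"still failing after {retries} attempts") from last_error
```

If the call still fails once the budget is spent, propagation delay is unlikely to be the cause, and the ExternalId or account mismatch becomes the more probable explanation.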

RB-08

GitHub sign-in fails with a state-mismatch error

Likely cause — The OAuth state cookie was lost — usually a stale tab that sat on GitHub's authorize page past the 10-minute window, or cookies blocked for the site.

Resolution — Go back to /login and start again in the same tab. If it persists, allow cookies for the dashboard origin — the state cookie is the CSRF proof; sign-in is impossible without it.
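For readers wondering why a missing cookie makes sign-in impossible, the following sketch shows the general OAuth state check. It illustrates the standard pattern under assumptions and is not the dashboard's actual code: the function names and the cookie layout are hypothetical, and only the 10-minute window comes from the runbook.

```python
import hmac
import secrets
import time
from typing import Optional

STATE_TTL_SECONDS = 600  # the 10-minute window from the runbook


def new_state(now: Optional[float] = None) -> dict:
    """Create the value stored in the state cookie before redirecting to GitHub."""
    issued_at = time.time() if now is None else now
    return {"state": secrets.token_urlsafe(32), "issued_at": issued_at}


def verify_state(cookie: Optional[dict], returned_state: str, now: Optional[float] = None) -> bool:
    """Accept the callback only if the cookie exists, is fresh, and matches."""
    if cookie is None:
        return False  # cookie blocked or lost: nothing to compare against
    current = time.time() if now is None else now
    if current - cookie["issued_at"] > STATE_TTL_SECONDS:
        return False  # stale tab: the window has passed
    # Constant-time comparison of the two values.
    return hmac.compare_digest(cookie["state"].encode(), returned_state.encode())
```

Both failure modes in the likely cause map onto the first two checks: a blocked cookie arrives as `None`, and a stale tab fails the age test.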

02 / FAQ

FREQUENTLY ASKED

That's expected: STEM only creates branches in response to pull requests. Open a PR on a repo where you installed the STEM App: make a change on a new branch, open the PR against the default branch, and within ~30 seconds a card appears and moves QUEUED → PROVISIONING → ACTIVE. On the Vercel free plan the pipeline advances only on a daily cron, so use the Advance Pipeline button in the operator console to move the branch forward immediately. The dashboard's empty state and the Your First Branch section in the docs walk through every step.
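The lifecycle named above can be read as a simple one-way state machine. The sketch below is an assumption about how one pipeline step moves a branch forward. Only the three state names come from this page, and the `advance` helper is hypothetical.

```python
from enum import Enum


class BranchState(Enum):
    QUEUED = "QUEUED"
    PROVISIONING = "PROVISIONING"
    ACTIVE = "ACTIVE"


# One pipeline step moves a branch to the next state in this table.
_NEXT = {
    BranchState.QUEUED: BranchState.PROVISIONING,
    BranchState.PROVISIONING: BranchState.ACTIVE,
}


def advance(state: BranchState) -> BranchState:
    """Return the state after one pipeline step; ACTIVE stays ACTIVE."""
    return _NEXT.get(state, state)
```

Under this reading, a freshly queued branch needs two steps to become ACTIVE, whether those steps come from the scheduled cron or from a manual advance.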

On the Connect page, after installing the GitHub App: (1) open AWS CloudShell and paste the one command shown there — it deploys a least-privilege IAM role and prints its ARN; (2) paste the ARN back; (3) enter your Aurora source cluster ID, DB subnet group, and VPC security group (copy these from the RDS console). STEM verifies it can describe the cluster before saving. No long-lived AWS keys ever leave your account.

No. The masking pass runs inside the clone before any branch credentials are generated. There is no time window in which issued credentials can read unmasked rows.

Aurora clones are copy-on-write: an idle branch pays only for pages it has modified plus its compute when queried. A branch that is never queried after provisioning costs near zero in storage.

The control plane (Aurora DSQL) stores only metadata: branch states, timestamps, masking manifests, and audit events. Row data never leaves your AWS account — the masking pass executes inside your VPC.

Today STEM requires Aurora PostgreSQL 13+ as the source, because sub-30-second provisioning depends on Aurora's copy-on-write clone primitive. RDS and self-hosted PostgreSQL are on the roadmap via snapshot-based branching with slower provision times.

Each branch is fully isolated — run migrations inside it freely. A merged migration reaches future branches automatically since each new branch clones from current production state.

Yes. The control plane, GitHub Action, and masking engine are open source. Self-host the entire stack in your own AWS account using the deployment guide in the docs.