Building a test pyramid that survives AI features

AI adds nondeterminism. Your pyramid should absorb it—not collapse into manual-only QA.

Keep the base deterministic

Unit and integration tests for tools, permissions, billing, and data paths should stay strict. AI sits on top of plumbing that must still be boring and green in CI.

Add an evaluation layer, not “more E2E”

Golden prompts with expected properties (contains, JSON schema, refusal) run on every build. Store scores over time so you detect drift when models or prompts change.

Separate product tests from model tests

Test your wrappers: timeouts, caching, PII redaction, fallbacks. Test the model separately with evaluation harnesses—mixing them creates flaky suites nobody trusts.

Human review is a sampling layer

Review 1–5% of production traffic weekly. Automate everything else. Founders who skip this layer learn about failures from angry customers.

Work with me