Can your AI agent find business-logic bugs?
ShopPay Audit Benchmark is a compact open-source target for testing whether coding agents can read product rules, inspect code, and catch defects that happy-path tests miss.
What it tests
- Refund amount and lifecycle rules
- Order ownership and admin authorization
- Webhook signature trust
- Wallet atomicity and double-spend prevention
- Tax-after-discount calculation
- Profile privilege boundaries
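To make one of these categories concrete, here is a minimal sketch of a tax-after-discount defect. It is not code from the repository: the function names, the integer-cent amounts, and the 10% rate are all hypothetical, and the sketch assumes the written rule is that tax applies to the discounted subtotal.

```javascript
// Hypothetical illustration; not taken from the ShopPay repository.
// Amounts are integer cents to avoid floating-point drift.

// Defective variant: taxes the full subtotal, then subtracts the discount.
function totalTaxBeforeDiscount(subtotalCents, discountCents, taxRate) {
  const tax = Math.round(subtotalCents * taxRate);
  return subtotalCents + tax - discountCents;
}

// Intended variant (assumed rule): tax applies to the discounted subtotal.
function totalTaxAfterDiscount(subtotalCents, discountCents, taxRate) {
  const taxable = Math.max(0, subtotalCents - discountCents);
  const tax = Math.round(taxable * taxRate);
  return taxable + tax;
}

// A $100.00 order with a $20.00 discount at a 10% tax rate.
const buggy = totalTaxBeforeDiscount(10000, 2000, 0.1);
const intended = totalTaxAfterDiscount(10000, 2000, 0.1);
console.log({ buggy, intended });
```

With no discount, both variants return the same total, so a happy-path test that never applies a discount would pass against the defective version. That is the kind of gap an auditing agent is expected to notice by comparing the code with the written rule.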
- 9 baseline tests documenting seeded defects
- 100-point scoring rubric for comparing audit reports
- 0 dependencies: runs with Node.js built-ins only
Try it
git clone https://github.com/Dmatut7/shoppay-audit-benchmark.git
cd shoppay-audit-benchmark
npm test
Then ask an AI coding agent to read SPEC.md, audit src/, and report every implementation behavior that violates the written business rules.
Why it matters
Many evaluations reward agents for fixing syntax errors or passing obvious tests. ShopPay focuses on a harder product-engineering skill: mapping code behavior back to business intent.