Can your AI agent find business-logic bugs?

ShopPay Audit Benchmark is a compact open-source target for testing whether coding agents can read product rules, inspect code, and catch defects that happy-path tests miss.


What it tests

Refund amount and lifecycle rules
Order ownership and admin authorization
Webhook signature trust
Wallet atomicity and double-spend
Tax-after-discount calculation
Profile privilege boundaries
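
To make one of these categories concrete, here is a minimal sketch of what a tax-after-discount defect can look like. It is an illustration only: the function names, the integer-cents convention, and the exact rule ("tax is charged on the discounted subtotal") are assumptions, not code or wording taken from the ShopPay repository or its SPEC.md.

```javascript
// Hypothetical illustration; these names are not taken from the ShopPay source.
// Assumed rule: tax is charged on the subtotal AFTER the discount is applied.

// Defective variant: taxes the full subtotal, then subtracts the discount.
function totalTaxBeforeDiscount(subtotalCents, discountCents, taxRate) {
  const tax = Math.round(subtotalCents * taxRate);
  return subtotalCents + tax - discountCents;
}

// Variant that follows the assumed rule: discount first, then tax.
function totalTaxAfterDiscount(subtotalCents, discountCents, taxRate) {
  const taxable = Math.max(0, subtotalCents - discountCents);
  const tax = Math.round(taxable * taxRate);
  return taxable + tax;
}

// With no discount the two variants agree, so a happy-path test
// cannot tell them apart.
const noDiscountBuggy = totalTaxBeforeDiscount(10000, 0, 0.1);
const noDiscountCorrect = totalTaxAfterDiscount(10000, 0, 0.1);

// With a discount they diverge: the defective variant overcharges tax.
const discountedBuggy = totalTaxBeforeDiscount(10000, 2000, 0.1);
const discountedCorrect = totalTaxAfterDiscount(10000, 2000, 0.1);

console.log({ noDiscountBuggy, noDiscountCorrect, discountedBuggy, discountedCorrect });
```

A test that only checks an undiscounted order passes for both variants. The difference shows up only when the code is read against the written rule.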

Baseline tests documenting seeded defects.

100-point scoring rubric for comparing audit reports.

Zero dependencies: runs with Node.js built-ins only.

Try it

git clone https://github.com/Dmatut7/shoppay-audit-benchmark.git
cd shoppay-audit-benchmark
npm test

Then ask an AI coding agent to read SPEC.md, audit src/, and report every implementation behavior that violates the written business rules.
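
As a rough illustration of the kind of violation such an audit looks for, the sketch below shows a refund check that passes a single-refund test but breaks a plausible lifecycle rule. Everything here is hypothetical: the order shape, the function names, and the assumed rule ("total refunds must never exceed the amount paid") are invented for the example and are not drawn from the benchmark's src/ or SPEC.md.

```javascript
// Hypothetical illustration; not code from the ShopPay repository.
// Assumed rule: the sum of all refunds must never exceed the amount paid.

function refundBuggy(order, amountCents) {
  // Defect: compares against the original payment only, ignoring earlier refunds.
  if (amountCents <= 0 || amountCents > order.paidCents) {
    throw new Error("invalid refund amount");
  }
  order.refundedCents += amountCents;
  return order.refundedCents;
}

function refundChecked(order, amountCents) {
  // Compares against what is still refundable after earlier refunds.
  const remaining = order.paidCents - order.refundedCents;
  if (amountCents <= 0 || amountCents > remaining) {
    throw new Error("invalid refund amount");
  }
  order.refundedCents += amountCents;
  return order.refundedCents;
}

// Audit-style scenario: two refunds that together exceed the payment.
// The first refund alone is the happy path and succeeds in both variants.
function refundTwice(refundFn) {
  const order = { paidCents: 5000, refundedCents: 0 };
  refundFn(order, 3000);
  try {
    refundFn(order, 3000);
  } catch {
    return { refundedCents: order.refundedCents, secondRejected: true };
  }
  return { refundedCents: order.refundedCents, secondRejected: false };
}

const buggyResult = refundTwice(refundBuggy);     // refunds more than was paid
const checkedResult = refundTwice(refundChecked); // rejects the second refund

console.log(buggyResult, checkedResult);
```

An audit report would describe the first variant as a rule violation and name the sequence of calls that exposes it, even though a one-refund test suite stays green.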

Why it matters

Many evaluations reward agents for fixing syntax errors or passing obvious tests. ShopPay focuses on a harder product-engineering skill: mapping code behavior back to business intent.