Can your AI agent find business-logic bugs?

ShopPay Audit Benchmark is a compact open-source target for testing whether coding agents can read product rules, inspect code, and catch defects that happy-path tests miss.

What it tests

  • Refund amount and lifecycle rules
  • Order ownership and admin authorization
  • Webhook signature trust
  • Wallet atomicity and double-spend
  • Tax-after-discount calculation
  • Profile privilege boundaries
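
To make one of these categories concrete, here is a minimal sketch of a tax-after-discount defect. It is a hypothetical illustration, not code from the repository: the function names, the use of integer cents, and the exact wording of the rule (tax is charged on the discounted subtotal) are all assumptions.

```javascript
// Hypothetical illustration; not code from the ShopPay repository.
// Assumed rule: tax is charged on the subtotal AFTER the discount is applied.
// Amounts are integer cents to avoid floating-point drift in totals.

function totalBuggy(subtotalCents, discountCents, taxRate) {
  // Defect: tax is computed on the pre-discount subtotal.
  const tax = Math.round(subtotalCents * taxRate);
  return subtotalCents - discountCents + tax;
}

function totalPerSpec(subtotalCents, discountCents, taxRate) {
  // Apply the discount first, then tax the discounted amount.
  const discounted = Math.max(0, subtotalCents - discountCents);
  const tax = Math.round(discounted * taxRate);
  return discounted + tax;
}

// With no discount, the two functions agree, so a happy-path test passes.
console.log(totalBuggy(10000, 0, 0.1), totalPerSpec(10000, 0, 0.1));

// With a discount, the results diverge and the defect becomes visible.
console.log(totalBuggy(10000, 2000, 0.1), totalPerSpec(10000, 2000, 0.1));
```

The point of the sketch is that the two implementations are indistinguishable unless a test exercises the discount path, which is the kind of gap an auditing agent is expected to find by reading the rule first.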

The benchmark ships with 9 baseline tests documenting seeded defects, a 100-point scoring rubric for comparing audit reports, and 0 dependencies: it runs with Node.js built-ins only.

Try it

git clone https://github.com/Dmatut7/shoppay-audit-benchmark.git
cd shoppay-audit-benchmark
npm test

Then ask an AI coding agent to read SPEC.md, audit src/, and report every implementation behavior that violates the written business rules.

Why it matters

Many evaluations reward agents for fixing syntax errors or passing obvious tests. ShopPay focuses on a harder product-engineering skill: mapping code behavior back to business intent.