How we test, what we measure, and why we measure it that way. If you think our approach is flawed, tell us. We'll fix it or explain why not.
We give every platform the same API specification: a Notes API with CRUD, search, filtering, and aggregation. The spec defines two tables: notes (9 columns) and tags (3 columns, foreign key to notes).

We chose this design deliberately. No auth means we're testing infrastructure, not each platform's auth implementation. No frontend means we're testing backend performance, not CDN caching. The list endpoint requires a JOIN between notes and tags, filtered by category and sorted by timestamp. The stats endpoint requires GROUP BY and COUNT aggregations across the full dataset.
The same spec is given to each platform as a prompt. They implement it using whatever database and runtime their platform provides. We measure the result, not the implementation.
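For concreteness, here is the endpoint surface the spec implies, sketched as a TypeScript table. Only /api/ping, the filtered list query, the stats endpoint, and the POST write path are named elsewhere on this page; the remaining CRUD routes and their shapes are our assumptions about a typical implementation, not quotes from SPEC.md.

```typescript
// Endpoint surface implied by the spec. Routes marked "assumed" are not named
// in this methodology and are shown only to illustrate the CRUD surface.
type Route = { method: 'GET' | 'POST' | 'PUT' | 'DELETE'; path: string; purpose: string };

const notesApi: Route[] = [
  { method: 'GET',    path: '/api/ping',                       purpose: 'health check, no database work (cold-start probe)' },
  { method: 'GET',    path: '/api/notes?category=X&limit=100', purpose: 'filtered list: JOIN notes + tags, sorted by timestamp' },
  { method: 'GET',    path: '/api/notes/stats',                purpose: 'GROUP BY + COUNT + AVG over the full dataset' },
  { method: 'POST',   path: '/api/notes',                      purpose: 'write path used in the load mix (assumed URL)' },
  { method: 'GET',    path: '/api/notes/:id',                  purpose: 'single-note read (assumed CRUD route)' },
  { method: 'PUT',    path: '/api/notes/:id',                  purpose: 'update (assumed CRUD route)' },
  { method: 'DELETE', path: '/api/notes/:id',                  purpose: 'delete (assumed CRUD route)' },
];
```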
Load tests run from a dedicated EC2 instance in us-east-1 (m5.xlarge, 4 vCPU, 16GB RAM). We use k6 as the load testing tool because it handles high concurrency without becoming the bottleneck itself.
All platforms are tested from the same origin, at the same time of day (Tuesday and Wednesday, 10am-4pm EST, avoiding weekends and known maintenance windows). We run each test 200 times so the reported percentiles rest on a sample large enough to be meaningful, not on a single lucky or unlucky run.
Time from request to first byte (TTFB) after a 20-minute idle period. We let the platform go completely idle — no keep-alive pings, no background requests — then send a single GET /api/ping request (no database work) and measure the response time. Repeated 200 times per platform.
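For illustration, the cold-start probe can be as small as the k6 script below (written in TypeScript; recent k6 releases can run TypeScript directly, older ones need a bundling step). This is a sketch, not the exact script in the repo: the 20-minute idle wait and the 200 repetitions are assumed to be handled by an outer orchestration script, and BASE_URL is a hypothetical environment variable.

```typescript
// Single cold-start probe (sketch, not the exact script in the repo).
// An outer runner is assumed to wait 20 minutes of full idle, invoke this
// script once, and repeat that pair 200 times per platform.
import http from 'k6/http';
import { check } from 'k6';
import { Trend } from 'k6/metrics';

const ttfb = new Trend('cold_start_ttfb', true); // time-valued custom metric

export const options = {
  vus: 1,
  iterations: 1, // exactly one request per invocation
  summaryTrendStats: ['med', 'p(95)', 'p(99)', 'min', 'max'], // the stats we report
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/ping`); // no database work
  ttfb.add(res.timings.waiting); // timings.waiting is k6's time-to-first-byte
  check(res, { 'ping returned 200': (r) => r.status === 200 });
}
```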
We report p50 (median), p95, p99, min, and max. The median tells you the common case. The p99 tells you what your unluckiest users experience. Both matter.
We test at six concurrency levels: 10, 50, 100, 200, 500, and 1,000 simultaneous users. Each level runs as a separate k6 scenario for 5 minutes with no ramp contamination between levels. The traffic mix is 70% filtered list queries (GET with JOIN) and 30% writes (POST with database insert). Each virtual user waits 100ms between requests.
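One way to express those six levels is as separate k6 scenarios with staggered start times, so a level never ramps into the next one. The sketch below is illustrative rather than the repo's actual script; the write URL, payload shape, placeholder category, and the one-minute quiet gap between levels are our assumptions.

```typescript
// Sustained-load test: one scenario per concurrency level (sketch only).
import http from 'k6/http';
import { sleep } from 'k6';

const LEVELS = [10, 50, 100, 200, 500, 1000];

// Each level runs constant VUs for 5 minutes. Start times are staggered with a
// one-minute quiet gap (our assumption) so no level ramps into the next.
const scenarios: Record<string, object> = {};
LEVELS.forEach((vus, i) => {
  scenarios[`level_${vus}`] = {
    executor: 'constant-vus',
    vus,
    duration: '5m',
    startTime: `${i * 6}m`,
    exec: 'mixedTraffic',
  };
});

export const options = { scenarios };

export function mixedTraffic() {
  // 70% filtered list reads (JOIN), 30% writes (insert).
  if (Math.random() < 0.7) {
    http.get(`${__ENV.BASE_URL}/api/notes?category=work&limit=100`); // placeholder category
  } else {
    http.post(
      `${__ENV.BASE_URL}/api/notes`, // assumed write URL
      JSON.stringify({ title: 'load test note', category: 'work' }), // assumed payload shape
      { headers: { 'Content-Type': 'application/json' } },
    );
  }
  sleep(0.1); // 100ms think time per virtual user
}
```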
We measure error rate (percentage of 5xx responses, timeouts, and connection refused), p95 response time, and throughput (successful requests per second). The "breaking point" is the concurrency level where error rate exceeds 25%.
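The breaking-point rule is simple enough to state as code. Here is a sketch of the post-processing step, with a hypothetical LevelResult shape standing in for our aggregated per-level counts:

```typescript
// Post-processing sketch: derive the breaking point from per-level results.
// LevelResult is a hypothetical shape for the aggregated counts.
interface LevelResult {
  concurrency: number; // 10, 50, 100, 200, 500, or 1000
  requests: number;    // total requests attempted at this level
  errors: number;      // 5xx responses + timeouts + connection refused
}

// Returns the lowest concurrency level whose error rate exceeds 25%,
// or null if the platform never broke within the tested range.
function breakingPoint(results: LevelResult[]): number | null {
  const byLevel = [...results].sort((a, b) => a.concurrency - b.concurrency);
  for (const level of byLevel) {
    if (level.errors / level.requests > 0.25) return level.concurrency;
  }
  return null;
}
```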
Two endpoints are hit simultaneously: GET /api/notes?category=X&limit=100 (a filtered query with JOIN across notes and tags) and GET /api/notes/stats (GROUP BY + COUNT + AVG aggregation). The mix is 70% list queries, 30% stats queries, with no think time — the goal is to saturate the database connection pool. We test at 1, 10, 50, and 100 concurrent virtual users and measure p50 and p95 latency.
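A sketch of what that saturation scenario can look like in k6 (again TypeScript, again illustrative): the per-level duration and the quiet gap between levels are assumptions, since the page does not state them, and the placeholder category and BASE_URL variable are ours.

```typescript
// Database pool saturation test (sketch; see the repo's k6 scripts for the real thing).
import http from 'k6/http';

const LEVELS = [1, 10, 50, 100];

// Per-level duration and the gap between levels are assumptions;
// the methodology does not state them.
const scenarios: Record<string, object> = {};
LEVELS.forEach((vus, i) => {
  scenarios[`pool_${vus}`] = {
    executor: 'constant-vus',
    vus,
    duration: '2m',
    startTime: `${i * 3}m`,
    exec: 'saturatePool',
  };
});

export const options = {
  scenarios,
  summaryTrendStats: ['med', 'p(95)'], // p50 and p95 are what we report here
};

export function saturatePool() {
  // 70% filtered list (JOIN), 30% stats (GROUP BY + COUNT + AVG); no think time,
  // so each VU issues back-to-back requests and keeps the pool saturated.
  if (Math.random() < 0.7) {
    http.get(`${__ENV.BASE_URL}/api/notes?category=work&limit=100`); // placeholder category
  } else {
    http.get(`${__ENV.BASE_URL}/api/notes/stats`);
  }
}
```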
Platforms using SQLite get the same test. If a platform's database can't handle concurrent access (SQLite, for example, serializes writes), the benchmark records the errors and timeouts as-is. We don't special-case any platform — the benchmark measures the real consequence of each platform's database choice.
The overall score (0–100) is a weighted composite of all benchmark categories. Current weights:
We chose these weights based on what matters most in production. Concurrency is weighted highest because it's the first thing that breaks when real users show up. Infrastructure gets a baseline weight because features like CDN and real-time are table stakes for production apps.
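Mechanically, the composite is an ordinary weighted average: each category yields a 0–100 score, each score is multiplied by its weight, and the products are summed. A minimal sketch (function and parameter names are ours; no real weight values appear here):

```typescript
// Weighted composite (sketch; names are illustrative, weight values are not shown).
function compositeScore(
  weights: Record<string, number>, // per-category weights, expected to sum to 1
  scores: Record<string, number>,  // per-category scores, each on a 0-100 scale
): number {
  const totalWeight = Object.values(weights).reduce((sum, w) => sum + w, 0);
  let overall = 0;
  for (const [category, weight] of Object.entries(weights)) {
    overall += (weight / totalWeight) * (scores[category] ?? 0); // normalize defensively
  }
  return overall; // overall score on the same 0-100 scale
}
```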
We are transparent about the limitations of this benchmark:
The full benchmark suite is open: the API specification (SPEC.md), k6 scripts, orchestration scripts, and aggregation tooling. Give the SPEC.md to any vibe coding platform and ask it to implement the Notes API. Then run ./benchmark/scripts/run-all.sh <platform> <base_url> against the deployed endpoint. If you get different results, we want to know. We'd rather be wrong in public than right in private.