How we test, what we measure, and why we measure it that way. If you think our approach is flawed, tell us. We'll fix it or explain why not.
We give every platform the same API specification: a Notes API with CRUD, search, filtering, and aggregation. The spec defines two tables: notes (9 columns) and tags (3 columns, foreign key to notes).

We chose this design deliberately. No auth means we're testing infrastructure, not each platform's auth implementation. No frontend means we're testing backend performance, not CDN caching. The list endpoint requires a JOIN between notes and tags, filtered by category and sorted by timestamp. The stats endpoint requires GROUP BY and COUNT aggregations across the full dataset.
The same spec is given to each platform as a prompt. They implement it using whatever database and runtime their platform provides. We measure the result, not the implementation.
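For concreteness, here is the endpoint surface the spec implies, sketched as a TypeScript table. Only /api/ping, the filtered list query, the stats endpoint, and the POST write path are named elsewhere on this page; the remaining CRUD routes and their shapes are our assumptions about a typical implementation, not quotes from SPEC.md.

```typescript
// Endpoint surface implied by the spec. Routes marked "assumed" are not named
// in this methodology and are shown only to illustrate the CRUD surface.
type Route = { method: 'GET' | 'POST' | 'PUT' | 'DELETE'; path: string; purpose: string };

const notesApi: Route[] = [
  { method: 'GET',    path: '/api/ping',                       purpose: 'health check, no database work (cold-start probe)' },
  { method: 'GET',    path: '/api/notes?category=X&limit=100', purpose: 'filtered list: JOIN notes + tags, sorted by timestamp' },
  { method: 'GET',    path: '/api/notes/stats',                purpose: 'GROUP BY + COUNT + AVG over the full dataset' },
  { method: 'POST',   path: '/api/notes',                      purpose: 'write path used in the load mix (assumed URL)' },
  { method: 'GET',    path: '/api/notes/:id',                  purpose: 'single-note read (assumed CRUD route)' },
  { method: 'PUT',    path: '/api/notes/:id',                  purpose: 'update (assumed CRUD route)' },
  { method: 'DELETE', path: '/api/notes/:id',                  purpose: 'delete (assumed CRUD route)' },
];
```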
Load tests run from a dedicated EC2 instance in us-east-1 (m5.xlarge, 4 vCPU, 16GB RAM). We use k6 as the load testing tool because it handles high concurrency without becoming the bottleneck itself.
All platforms are tested from the same origin, at the same time of day (Tuesday and Wednesday, 10am-4pm EST, avoiding weekends and known maintenance windows). We run each test 200 times so the reported percentiles rest on a sample large enough to be meaningful, not on a single lucky or unlucky run.
Time from request to first byte (TTFB) after a 20-minute idle period. We let the platform go completely idle — no keep-alive pings, no background requests — then send a single GET /api/ping request (no database work) and measure the response time. Repeated 200 times per platform.
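For illustration, the cold-start probe can be as small as the k6 script below (written in TypeScript; recent k6 releases can run TypeScript directly, older ones need a bundling step). This is a sketch, not the exact script in the repo: the 20-minute idle wait and the 200 repetitions are assumed to be handled by an outer orchestration script, and BASE_URL is a hypothetical environment variable.

```typescript
// Single cold-start probe (sketch, not the exact script in the repo).
// An outer runner is assumed to wait 20 minutes of full idle, invoke this
// script once, and repeat that pair 200 times per platform.
import http from 'k6/http';
import { check } from 'k6';
import { Trend } from 'k6/metrics';

const ttfb = new Trend('cold_start_ttfb', true); // time-valued custom metric

export const options = {
  vus: 1,
  iterations: 1, // exactly one request per invocation
  summaryTrendStats: ['med', 'p(95)', 'p(99)', 'min', 'max'], // the stats we report
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/ping`); // no database work
  ttfb.add(res.timings.waiting); // timings.waiting is k6's time-to-first-byte
  check(res, { 'ping returned 200': (r) => r.status === 200 });
}
```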
We report p50 (median), p95, p99, min, and max. The median tells you the common case. The p99 tells you what your unluckiest users experience. Both matter.
We test at six concurrency levels: 10, 50, 100, 200, 500, and 1,000 simultaneous users. Each level runs as a separate k6 scenario for 5 minutes with no ramp contamination between levels. The traffic mix is 70% filtered list queries (GET with JOIN) and 30% writes (POST with database insert). Each virtual user waits 100ms between requests.
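One way to express those six levels is as separate k6 scenarios with staggered start times, so a level never ramps into the next one. The sketch below is illustrative rather than the repo's actual script; the write URL, payload shape, placeholder category, and the one-minute quiet gap between levels are our assumptions.

```typescript
// Sustained-load test: one scenario per concurrency level (sketch only).
import http from 'k6/http';
import { sleep } from 'k6';

const LEVELS = [10, 50, 100, 200, 500, 1000];

// Each level runs constant VUs for 5 minutes. Start times are staggered with a
// one-minute quiet gap (our assumption) so no level ramps into the next.
const scenarios: Record<string, object> = {};
LEVELS.forEach((vus, i) => {
  scenarios[`level_${vus}`] = {
    executor: 'constant-vus',
    vus,
    duration: '5m',
    startTime: `${i * 6}m`,
    exec: 'mixedTraffic',
  };
});

export const options = { scenarios };

export function mixedTraffic() {
  // 70% filtered list reads (JOIN), 30% writes (insert).
  if (Math.random() < 0.7) {
    http.get(`${__ENV.BASE_URL}/api/notes?category=work&limit=100`); // placeholder category
  } else {
    http.post(
      `${__ENV.BASE_URL}/api/notes`, // assumed write URL
      JSON.stringify({ title: 'load test note', category: 'work' }), // assumed payload shape
      { headers: { 'Content-Type': 'application/json' } },
    );
  }
  sleep(0.1); // 100ms think time per virtual user
}
```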
We measure error rate (percentage of 5xx responses, timeouts, and connection refused), p95 response time, and throughput (successful requests per second). The "breaking point" is the concurrency level where error rate exceeds 25%.
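The breaking-point rule is simple enough to state as code. Here is a sketch of the post-processing step, with a hypothetical LevelResult shape standing in for our aggregated per-level counts:

```typescript
// Post-processing sketch: derive the breaking point from per-level results.
// LevelResult is a hypothetical shape for the aggregated counts.
interface LevelResult {
  concurrency: number; // 10, 50, 100, 200, 500, or 1000
  requests: number;    // total requests attempted at this level
  errors: number;      // 5xx responses + timeouts + connection refused
}

// Returns the lowest concurrency level whose error rate exceeds 25%,
// or null if the platform never broke within the tested range.
function breakingPoint(results: LevelResult[]): number | null {
  const byLevel = [...results].sort((a, b) => a.concurrency - b.concurrency);
  for (const level of byLevel) {
    if (level.errors / level.requests > 0.25) return level.concurrency;
  }
  return null;
}
```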
Two endpoints are hit simultaneously: GET /api/notes?category=X&limit=100 (a filtered query with JOIN across notes and tags) and GET /api/notes/stats (GROUP BY + COUNT + AVG aggregation). The mix is 70% list queries, 30% stats queries, with no think time — the goal is to saturate the database connection pool. We test at 1, 10, 50, and 100 concurrent virtual users and measure p50 and p95 latency.
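A sketch of what that saturation scenario can look like in k6 (again TypeScript, again illustrative): the per-level duration and the quiet gap between levels are assumptions, since the page does not state them, and the placeholder category and BASE_URL variable are ours.

```typescript
// Database pool saturation test (sketch; see the repo's k6 scripts for the real thing).
import http from 'k6/http';

const LEVELS = [1, 10, 50, 100];

// Per-level duration and the gap between levels are assumptions;
// the methodology does not state them.
const scenarios: Record<string, object> = {};
LEVELS.forEach((vus, i) => {
  scenarios[`pool_${vus}`] = {
    executor: 'constant-vus',
    vus,
    duration: '2m',
    startTime: `${i * 3}m`,
    exec: 'saturatePool',
  };
});

export const options = {
  scenarios,
  summaryTrendStats: ['med', 'p(95)'], // p50 and p95 are what we report here
};

export function saturatePool() {
  // 70% filtered list (JOIN), 30% stats (GROUP BY + COUNT + AVG); no think time,
  // so each VU issues back-to-back requests and keeps the pool saturated.
  if (Math.random() < 0.7) {
    http.get(`${__ENV.BASE_URL}/api/notes?category=work&limit=100`); // placeholder category
  } else {
    http.get(`${__ENV.BASE_URL}/api/notes/stats`);
  }
}
```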
Platforms using SQLite get the same test. If a platform's database can't handle concurrent access (SQLite, for example, serializes writes), the benchmark records the errors and timeouts as-is. We don't special-case any platform — the benchmark measures the real consequence of each platform's database choice.
The overall score (0–100) is a weighted composite of all benchmark categories. Current weights:
We chose these weights based on what matters most in production. Concurrency is weighted highest because it's the first thing that breaks when real users show up. Infrastructure gets a baseline weight because features like CDN and real-time are table stakes for production apps.
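Mechanically, the composite is an ordinary weighted average: each category yields a 0–100 score, each score is multiplied by its weight, and the products are summed. A minimal sketch (function and parameter names are ours; no real weight values appear here):

```typescript
// Weighted composite (sketch; names are illustrative, weight values are not shown).
function compositeScore(
  weights: Record<string, number>, // per-category weights, expected to sum to 1
  scores: Record<string, number>,  // per-category scores, each on a 0-100 scale
): number {
  const totalWeight = Object.values(weights).reduce((sum, w) => sum + w, 0);
  let overall = 0;
  for (const [category, weight] of Object.entries(weights)) {
    overall += (weight / totalWeight) * (scores[category] ?? 0); // normalize defensively
  }
  return overall; // overall score on the same 0-100 scale
}
```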
We are transparent about the limitations of this benchmark:
The full benchmark suite is open: the API specification (SPEC.md), k6 scripts, orchestration scripts, and aggregation tooling. Give the SPEC.md to any vibe coding platform and ask it to implement the Notes API. Then run ./benchmark/scripts/run-all.sh <platform> <base_url> against the deployed endpoint. If you get different results, we want to know. We'd rather be wrong in public than right in private.