EvoCode-Bench — per-task results

repo · single-shot, 12 models

26 stateful coding tasks. Each row is one task, and its score is the dataset's per-task score (passed rounds / total rounds, 1 attempt). Click a task to see per-round, per-model test-case pass rates, which cases each model failed and why, and a difficulty / performance-gap analysis.

Round reward is binary: a round scores 1 only when all key requirements pass, and 0 otherwise. Case-level pass rates reveal partial progress that the binary round score hides. Grey / "✗" marks chains that aborted before the final round.
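
To make the difference between the two views concrete, here is a minimal sketch of how a binary round reward, a case-level pass rate, and a per-task score could be computed. It is an illustration, not the benchmark's actual scoring code: the function names are hypothetical, every test case is assumed to be a key requirement, and the sample results are made up.

```python
from typing import List


def round_reward(case_results: List[bool]) -> int:
    """Binary round reward: 1 only if every case passed, else 0.

    Assumption: every case in the list is a key requirement.
    """
    return int(bool(case_results) and all(case_results))


def case_pass_rate(case_results: List[bool]) -> float:
    """Fraction of cases that passed in one round."""
    if not case_results:
        return 0.0
    return sum(case_results) / len(case_results)


def task_score(rounds: List[List[bool]]) -> float:
    """Per-task score: passed rounds / total rounds."""
    if not rounds:
        return 0.0
    return sum(round_reward(r) for r in rounds) / len(rounds)


# Made-up results for one model on one three-round task.
rounds = [
    [True, True, True, True],     # all cases pass
    [True, True, True, False],    # one case fails
    [True, False, False, False],  # most cases fail
]

for i, r in enumerate(rounds, start=1):
    print(f"round {i}: reward={round_reward(r)} pass_rate={case_pass_rate(r):.2f}")
print(f"task score: {task_score(rounds):.2f}")
```

In this made-up example, rounds 2 and 3 both receive a reward of 0, yet their case-level pass rates (0.75 and 0.25) show that the model came much closer in round 2.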