Benchmark cases

Task-level reliability across all models and runs.

Case                               Score  Passed  Failed  Errors  Models
find-file                          81.7%      49      11       0       6
read-exact-file                    83.3%      50      10       0       6
read-exact-file-with-at-reference  86.7%      52       8       0       6
read-file                          63.3%      38      22       0       6
use-skill                          55.0%      33      27       0       6
use-skill-with-refs                55.0%      33      27       0       6
use-skill-with-scripts             46.7%      28      32       0       6
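
The Score column is consistent with the share of passed runs out of all runs for a case (for example, 49 of 60 for find-file). The sketch below recomputes it from the Passed, Failed and Errors counts in the table. The names `ROWS`, `score` and `scores` are hypothetical, and counting errored runs in the denominator is an assumption: every case here reports 0 errors, so the table cannot confirm how errors are treated.

```python
# Counts copied from the table: (case, passed, failed, errors).
ROWS = [
    ("find-file", 49, 11, 0),
    ("read-exact-file", 50, 10, 0),
    ("read-exact-file-with-at-reference", 52, 8, 0),
    ("read-file", 38, 22, 0),
    ("use-skill", 33, 27, 0),
    ("use-skill-with-refs", 33, 27, 0),
    ("use-skill-with-scripts", 28, 32, 0),
]


def score(passed: int, failed: int, errors: int = 0) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place.

    Assumption: errored runs count toward the total. The table has no
    non-zero error counts, so this choice is not verified by the data.
    """
    total = passed + failed + errors
    if total == 0:
        raise ValueError("a case needs at least one run")
    return round(100.0 * passed / total, 1)


def scores(rows=ROWS) -> dict:
    """Map each case name to its recomputed score."""
    return {case: score(p, f, e) for case, p, f, e in rows}


if __name__ == "__main__":
    # Print cases from least to most reliable.
    for case, value in sorted(scores().items(), key=lambda kv: kv[1]):
        print(f"{case:<35}{value:>5.1f}%")
```

Run as a script, this lists the cases from lowest to highest score, which makes it easy to see that the three use-skill cases fall below all of the file-oriented cases.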