Benchmark cases
Per-case reliability, aggregated across all models and runs. Score is the percentage of runs that passed.
| Case | Score | Passed | Failed | Errors | Models |
|---|---|---|---|---|---|
| find-file | 81.7% | 49 | 11 | 0 | 6 |
| read-exact-file | 83.3% | 50 | 10 | 0 | 6 |
| read-exact-file-with-at-reference | 86.7% | 52 | 8 | 0 | 6 |
| read-file | 63.3% | 38 | 22 | 0 | 6 |
| use-skill | 55.0% | 33 | 27 | 0 | 6 |
| use-skill-with-refs | 55.0% | 33 | 27 | 0 | 6 |
| use-skill-with-scripts | 46.7% | 28 | 32 | 0 | 6 |
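
The Score column can be reproduced from the Passed and Failed counts. The following is a minimal sketch, not the benchmark's own code: the `pass_rate` helper and the `ROWS` layout are hypothetical names introduced for illustration. It also assumes that errored runs would count toward the denominator, which the table cannot confirm because every Errors value is 0.

```python
# Rows copied from the table: (case, passed, failed, errors).
ROWS = [
    ("find-file", 49, 11, 0),
    ("read-exact-file", 50, 10, 0),
    ("read-exact-file-with-at-reference", 52, 8, 0),
    ("read-file", 38, 22, 0),
    ("use-skill", 33, 27, 0),
    ("use-skill-with-refs", 33, 27, 0),
    ("use-skill-with-scripts", 28, 32, 0),
]


def pass_rate(passed: int, failed: int, errors: int = 0) -> float:
    """Return the percentage of runs that passed.

    Assumption: errored runs are part of the total. The published
    table has no errors, so it does not confirm this either way.
    """
    total = passed + failed + errors
    if total == 0:
        raise ValueError("no runs recorded for this case")
    return 100.0 * passed / total


# Round to one decimal place, matching the precision shown in the table.
scores = {case: round(pass_rate(p, f, e), 1) for case, p, f, e in ROWS}

for case, score in scores.items():
    print(f"{case}: {score:.1f}%")
```

Each case totals 60 runs (Passed + Failed), so a single run moves the score by roughly 1.7 percentage points. Small gaps between cases, such as find-file versus read-exact-file, come down to one run.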