Leaderboard

Model rankings

Each model is ranked by its best run. The table is updated when new runs are published.
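The "best run per model" rule could be sketched as below. This is only an illustration and not the leaderboard's actual code: the `Run` record, its `model` and `score` fields, and the function `best_run_per_model` are hypothetical names, and it assumes "best" means the highest overall score.

```python
from typing import Dict, Iterable, List, TypedDict


class Run(TypedDict):
    """Hypothetical record for one published run."""
    model: str
    score: float  # overall score in percent (assumed ranking key)


def best_run_per_model(runs: Iterable[Run]) -> List[Run]:
    """Keep the highest-scoring run for each model, then rank by score."""
    best: Dict[str, Run] = {}
    for run in runs:
        current = best.get(run["model"])
        # Replace the stored run only when the new one scores strictly higher,
        # so the earliest run wins a tie.
        if current is None or run["score"] > current["score"]:
            best[run["model"]] = run
    # Highest score first, matching the rank order of a leaderboard.
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


if __name__ == "__main__":
    # Placeholder data, not taken from the leaderboard.
    example_runs: List[Run] = [
        {"model": "model-a", "score": 80.0},
        {"model": "model-a", "score": 92.5},
        {"model": "model-b", "score": 85.0},
    ]
    for rank, run in enumerate(best_run_per_model(example_runs), start=1):
        print(rank, run["model"], run["score"])
```

Ties between models keep their input order here because Python's sort is stable; a real leaderboard might break ties on another column such as average tokens.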

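| # | Model | Score | Basic File Reading | Basic Skills | Avg turns | Avg tools | Avg tokens | Context avg / max |
|---|-------|-------|--------------------|--------------|-----------|-----------|------------|-------------------|
| 1 | qwen/qwen3.6-35b-a3b | 100.0% | 100.0% | 100.0% | 3.8 | 3.0 | 6,079 | 5.2% / 6.0% |
| 2 | qwen/qwen3.5-9b | 92.9% | 100.0% | 83.3% | 7.6 | 6.6 | 54,744 | 6.3% / 52.6% |
| 3 | granite-4.1-8b | 85.7% | 100.0% | 66.7% | 4.0 | 3.0 | 5,459 | 4.4% / 5.3% |
| 4 | google/gemma-4-e4b | 81.4% | 95.0% | 63.3% | 4.2 | 3.3 | 6,776 | 2.8% / 6.0% |
| 5 | google/gemma-4-e2b | 41.4% | 72.5% | 0.0% | 2.8 | 1.8 | 3,989 | 4.7% / 7.2% |
| 6 | lfm2.5-350m | 2.9% | 5.0% | 0.0% | 1.7 | 0.8 | 2,150 | 3.9% / 4.8% |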