Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview

Overview

The Hacker News community picked up this story, which collected 104 points and 32 comments within a few hours, making it one of the most prominent AI stories of the day. Original source: github.com.

In this article we review the highlights of the story, look at it from an Arab perspective, and consider what it means for Arab users interested in AI tools.

Details

Scored 65.2% vs. Google's official 47.8% and the existing top closed-source model Junie CLI's 64.3%.

Since there have been a lot of reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), I would also like to clarify a few things:

1. Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever.
2. The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts).
3. The full TerminalBench run was done using the fully open-source version of the agent; there is no difference between what is on GitHub and what was run.

I was originally going to wait for it to land on the leaderboard, but it has been 8 days and the maintainers unfortunately have not responded (there is a large backlog of pull requests on their HF), so I decided to post anyway.

HF PR: https://huggingface.co/datasets/harborframework/terminal-bench-2-leaderboard/discussions/145

It is astounding how much the harness matters, based on this and other experiments I have done.
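To put the quoted scores side by side, here is a minimal sketch that computes the gaps in percentage points. The three figures come from the post; the dictionary labels and the helper name `gap_in_points` are illustrative assumptions, not anything taken from the agent or the leaderboard tooling.

```python
# Scores quoted in the post (percent of TerminalBench 2.0 tasks solved).
# The labels are illustrative; only the numbers come from the post.
scores = {
    "OSS agent (this post)": 65.2,
    "Google official": 47.8,
    "Junie CLI": 64.3,
}


def gap_in_points(a: float, b: float) -> float:
    """Difference a - b in percentage points, rounded to one decimal."""
    return round(a - b, 1)


baseline = "OSS agent (this post)"

# Gap between the posted score and each of the other quoted scores.
gaps = {
    name: gap_in_points(scores[baseline], value)
    for name, value in scores.items()
    if name != baseline
}

for name, gap in gaps.items():
    print(f"{name}: +{gap} points")
```

With the same underlying model, the gap over the official Google figure is far larger than the gap over the previous top entry, which is the comparison behind the remark about the harness.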

Original source

This story is taken from Hacker News, the most widely followed tech community in the world.

