نظرة عامة
رصد مجتمع Hacker News هذا الخبر الذي حصد 104 نقطة و32 تعليق خلال ساعات قليلة، مما يجعله من أبرز أخبار الذكاء الاصطناعي اليوم. المصدر الأصلي: github.com.
في هذا المقال نستعرض أبرز ما جاء في هذا الخبر، تحليله من منظور عربي، وما يعنيه للمستخدمين العرب المهتمين بأدوات الذكاء الاصطناعي.
التفاصيل
Scored 65.2% vs google's official 47.8%, and the existing top closed source model Junie CLI's 64.3%.<p>Since there are a lot of reports of deliberate cheating on TerminalBench 2.0 lately (<a href="https://debugml.github.io/cheating-agents/" rel="nofollow">https://debugml.github.io/cheating-agents/</a>), I would like to also clarify a few things<p>1. Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever<p>2. The cli agent was run in leaderboard compliant way (no modification of resources or timeouts)<p>3. The full terminal bench run was done using the fully open source version of the agent, no difference between what is on github and what was run.<p>I was originally going to wait for it to land on the leaderboard, but it has been 8 days and the maintainers do not respond unfortunately (there is a large backlog of the pull requests on their HF) so I decided to post anyways.<p>HF PR: <a href="https://huggingface.co/datasets/harborframework/terminal-bench-2-leaderboard/discussions/145" rel="nofollow">https://huggingface.co/datasets/harborframework/terminal-ben...</a><p>It is astounding how much the harness matters, based on this and other experiments I have done.
المصدر الأصلي
هذا الخبر مأخوذ من منصة Hacker News — المجتمع التقني الأكثر متابعة في العالم.