Overview

The Hacker News community picked up this story, which earned 104 points and 32 comments within a few hours. That makes it one of today's most discussed AI stories. Original source: github.com.

This article covers four things: the key points of the story, an analysis from an Arab perspective, and what it means for Arab users interested in AI tools.

Details

It scored 65.2% on TerminalBench 2.0. Google's official result is 47.8%, and Junie CLI, the existing top closed-source agent, scored 64.3%.

There have been many reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), so I would also like to clarify a few things:

1. Absolutely no agents.md or skills.md files were inserted at any point. There were no cheating mechanisms whatsoever.

2. The CLI agent was run in a leaderboard-compliant way, with no modification of resources or timeouts.

3. The full Terminal Bench run used the fully open-source version of the agent. There is no difference between what is on GitHub and what was run.

I originally planned to wait for the result to land on the leaderboard. However, it has been 8 days and the maintainers have unfortunately not responded. There is a large backlog of pull requests on their Hugging Face dataset. So I decided to post anyway.

HF PR: https://huggingface.co/datasets/harborframework/terminal-bench-2-leaderboard/discussions/145

Based on this and other experiments I have done, it is astounding how much the harness matters.

Original Source

This story comes from Hacker News, one of the most widely followed tech communities in the world.