
    Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

    By Sophia Ahmed Wilson, November 8, 2025

    The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving and optimizing AI agents in containerized environments.

    The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

    With a more difficult and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

    Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

    “Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

    Higher Bar, Cleaner Data

    Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating agent performance across the field of AI-powered agents operating in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.

    However, its broad scope came with inconsistencies. Several tasks were identified by the community as poorly specified or unstable due to external service changes.

    Version 2.0 addresses those issues directly. The updated suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

    A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

    “Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is substantially higher in the new benchmark.”

    Harbor: Unified Rollouts at Scale

    Alongside the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

    Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

    Designed to generalize across agent architectures, Harbor supports:

    • Evaluation of any container-installable agent

    • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

    • Custom benchmark creation and deployment

    • Full integration with Terminal-Bench 2.0

    Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

    Early Results: GPT-5 Leads in Task Success

    Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command line interface), a GPT-5 powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far.

    Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

    Top 5 Agent Results (Terminal-Bench 2.0):

    1. Codex CLI (GPT-5) — 49.6%

    2. Codex CLI (GPT-5-Codex) — 44.3%

    3. OpenHands (GPT-5) — 43.8%

    4. Terminus 2 (GPT-5-Codex) — 43.4%

    5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

    The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
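To make the clustering concrete, here is a minimal Python sketch that takes the five success rates listed above and computes the gap between first and fifth place, the gap between second and fifth place, and whether any agent clears 50%. The variable names are illustrative; the numbers are copied from the list in this article, not fetched from the live leaderboard.

```python
# Success rates (percent) copied from the top-five list in this article.
results = {
    "Codex CLI (GPT-5)": 49.6,
    "Codex CLI (GPT-5-Codex)": 44.3,
    "OpenHands (GPT-5)": 43.8,
    "Terminus 2 (GPT-5-Codex)": 43.4,
    "Terminus 2 (Claude Sonnet 4.5)": 42.8,
}

ranked = sorted(results.values(), reverse=True)

# Gap in percentage points between first and fifth place.
top_to_fifth = round(ranked[0] - ranked[-1], 1)

# Gap between second and fifth place, i.e. the cluster behind the leader.
second_to_fifth = round(ranked[1] - ranked[-1], 1)

# True when no listed agent solves more than half the tasks.
all_below_half = all(rate < 50.0 for rate in ranked)

if __name__ == "__main__":
    print(f"first to fifth: {top_to_fifth} points")
    print(f"second to fifth: {second_to_fifth} points")
    print(f"all below 50%: {all_below_half}")
```

Under these figures, places two through five sit within about a point and a half of each other, while the leader stands several points clear of the pack.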

    Submission and Use

    To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

    harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
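For scripted runs, the command above can be assembled programmatically. The following is a minimal Python sketch, not part of Harbor itself: the helper name `build_harbor_command` is hypothetical, and the only flags it uses are the ones quoted in this article, with the long options written using two ASCII hyphens. That `-m` and `-a` take the model and agent names is inferred from the placeholders in the quoted command. The sketch only builds and prints the command; it does not execute it.

```python
import shlex


def build_harbor_command(model: str, agent: str, jobs_dir: str,
                         n_attempts: int = 5) -> list[str]:
    """Assemble the argv for the harbor invocation quoted in the article.

    Hypothetical helper: it only mirrors the flags shown in the article
    and makes no claim about other options the CLI may accept.
    """
    return [
        "harbor", "run",
        "-d", "terminal-bench@2.0",        # dataset identifier from the article
        "-m", model,                        # model placeholder in the article
        "-a", agent,                        # agent placeholder in the article
        "--n-attempts", str(n_attempts),    # five runs are required for submission
        "--jobs-dir", jobs_dir,             # output directory for job results
    ]


if __name__ == "__main__":
    argv = build_harbor_command("<model>", "<agent>", "path/to/output")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Keeping the command as a list of arguments rather than a single string avoids quoting mistakes if it is later passed to a process launcher.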

    Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

    Aiming for Standardization

    The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

    These tools offer a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
