    Stanford Researchers Launched MedAgentBench: A Real-World Benchmark for Healthcare AI Agents

    By Naveed Ahmad, September 16, 2025


    A team of Stanford University researchers has released MedAgentBench, a new benchmark suite designed to evaluate large language model (LLM) agents in healthcare contexts. Unlike prior question-answering datasets, MedAgentBench provides a virtual electronic health record (EHR) environment where AI systems must interact, plan, and execute multi-step clinical tasks. This marks a significant shift from testing static reasoning to assessing agentic capabilities in live, tool-based medical workflows.

    https://ai.nejm.org/doi/full/10.1056/AIdbp2500144

    Why Do We Need Agentic Benchmarks in Healthcare?

    Recent LLMs have moved beyond static chat-based interactions toward agentic behavior: interpreting high-level instructions, calling APIs, integrating patient data, and automating complex processes. In medicine, this evolution could help address staff shortages, documentation burden, and administrative inefficiencies.

    While general-purpose agent benchmarks (e.g., AgentBench, AgentBoard, tau-bench) exist, healthcare lacked a standardized benchmark that captures the complexity of medical data, FHIR interoperability, and longitudinal patient records. MedAgentBench fills this gap by offering a reproducible, clinically relevant evaluation framework.

    What Does MedAgentBench Contain?

    How Are the Tasks Structured?

    MedAgentBench consists of 300 tasks across 10 categories, written by licensed physicians. These tasks include patient information retrieval, lab result tracking, documentation, test ordering, referrals, and medication management. Tasks average 2–3 steps and mirror workflows encountered in inpatient and outpatient care.

    What Patient Data Supports the Benchmark?

    The benchmark leverages 100 realistic patient profiles extracted from Stanford’s STARR data repository, comprising over 700,000 records including labs, vitals, diagnoses, procedures, and medication orders. Data was de-identified and jittered for privacy while preserving clinical validity.
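
The article does not say how the records were jittered. One common approach to this kind of de-identification is to shift all of a patient's timestamps by the same random offset, so that absolute dates are hidden while the intervals between events are preserved. The Python sketch below illustrates that idea only; the offset range and the function name are assumptions, not the STARR procedure.

```python
import random
from datetime import datetime, timedelta
from typing import List


def jitter_patient_dates(dates: List[datetime], rng: random.Random,
                         max_shift_days: int = 30) -> List[datetime]:
    """Shift every date for one patient by the same random number of days.

    Hypothetical helper: the 30-day default is an illustrative assumption.
    """
    shift = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [d + shift for d in dates]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    visits = [datetime(2023, 1, 5), datetime(2023, 1, 12), datetime(2023, 3, 1)]
    shifted = jitter_patient_dates(visits, rng)
    # Gaps between events are unchanged even though the absolute dates moved.
    print([later - earlier for earlier, later in zip(shifted, shifted[1:])])
```

Using one offset per patient, rather than one per record, is what keeps the sequence and spacing of labs, vitals, and orders clinically meaningful.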

    How Is the Environment Built?

    The environment is FHIR-compliant, supporting both retrieval (GET) and modification (POST) of EHR data. AI systems can simulate realistic clinical interactions such as documenting vitals or placing medication orders. This design makes the benchmark directly translatable to live EHR systems.
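
To make the GET/POST distinction concrete, here is a minimal Python sketch of what the two kinds of request might look like against a FHIR server. The base URL, patient ID, and helper names are hypothetical and not taken from MedAgentBench; the resource shape follows the standard FHIR Observation resource, with heart rate identified by LOINC code 8867-4. The sketch only builds the requests and does not send them.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; the benchmark's actual server address is not given in the article.
FHIR_BASE = "http://localhost:8080/fhir"


def build_observation_query(patient_id: str, loinc_code: str) -> str:
    """Build a GET URL that searches a patient's Observations by code."""
    params = {"patient": patient_id, "code": loinc_code}
    return f"{FHIR_BASE}/Observation?{urlencode(params)}"


def build_heart_rate_observation(patient_id: str, bpm: int, when: str) -> dict:
    """Build the JSON body for a POST that records a heart-rate vital sign."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/observation-category",
                "code": "vital-signs",
            }]
        }],
        "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4",
                             "display": "Heart rate"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": when,
        "valueQuantity": {"value": bpm, "unit": "beats/minute",
                          "system": "http://unitsofmeasure.org", "code": "/min"},
    }


if __name__ == "__main__":
    # Retrieval: the URL an agent would GET.
    print(build_observation_query("example-patient", "8867-4"))
    # Modification: the body an agent would POST to FHIR_BASE + "/Observation".
    body = build_heart_rate_observation("example-patient", 72, "2025-01-01T08:00:00Z")
    print(json.dumps(body, indent=2))
```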

    How Are Models Evaluated?

    • Metric: Task success rate (SR), measured with strict pass@1 to reflect real-world safety requirements.
    • Models Tested: 12 leading LLMs including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek-V3, Qwen2.5, and Llama 3.3.
    • Agent Orchestrator: A baseline orchestration setup with 9 FHIR functions, limited to 8 interaction rounds per task.
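
The evaluation setup above can be sketched roughly as follows. This is an illustrative loop, not the benchmark's actual orchestrator: the `agent_step` callable, the `("finish", answer)` action convention, and the exact-match grader are all assumptions made for the example. Only the 8-round cap and the single-attempt (pass@1) scoring come from the description above.

```python
from typing import Callable, List, Optional, Tuple

MAX_ROUNDS = 8  # interaction-round cap per task, as described in the article

# Hypothetical action convention: ("call", payload) continues, ("finish", answer) stops.
Action = Tuple[str, str]


def run_task(agent_step: Callable[[List[str]], Action]) -> Optional[str]:
    """Run one task attempt; return the final answer, or None if the cap is hit."""
    history: List[str] = []
    for _ in range(MAX_ROUNDS):
        kind, payload = agent_step(history)
        if kind == "finish":
            return payload
        # A real orchestrator would execute the FHIR call here and record the response.
        history.append(payload)
    return None  # ran out of rounds: counted as a failure


def success_rate(answers: List[Optional[str]], expected: List[str]) -> float:
    """pass@1 success rate: one attempt per task, graded here by exact match."""
    if not expected or len(answers) != len(expected):
        raise ValueError("answers and expected must be non-empty and the same length")
    correct = sum(a == e for a, e in zip(answers, expected))
    return correct / len(expected)
```

Under pass@1 there is no retry: a task that hits the round cap, or returns a wrong answer on its only attempt, simply counts against the success rate.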

    Which Models Performed Best?

    • Claude 3.5 Sonnet v2: Best overall with 69.67% success, especially strong in retrieval tasks (85.33%).
    • GPT-4o: 64.0% success, showing balanced retrieval and action performance.
    • DeepSeek-V3: 62.67% success, leading among open-weight models.
    • Observation: Most models excelled at query tasks but struggled with action-based tasks requiring safe multi-step execution.
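
Because the benchmark has 300 tasks and each is attempted once, the reported percentages correspond to whole numbers of solved tasks. The short Python check below back-computes those counts; the counts are derived here from the percentages and are not quoted from the paper.

```python
TOTAL_TASKS = 300  # benchmark size stated in the article

# Overall success rates (percent) as reported in the article.
reported = {
    "Claude 3.5 Sonnet v2": 69.67,
    "GPT-4o": 64.0,
    "DeepSeek-V3": 62.67,
}


def implied_solved(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of solved tasks consistent with a reported percentage."""
    return round(rate_percent / 100 * total)


for model, rate in reported.items():
    solved = implied_solved(rate)
    print(f"{model}: about {solved}/{TOTAL_TASKS} tasks ({solved / TOTAL_TASKS:.2%})")
```

Read this way, the best-performing model still failed roughly 90 of the 300 tasks on its single attempt.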

    What Errors Did Models Make?

    Two dominant failure patterns emerged:

    1. Instruction adherence failures: invalid API calls or incorrect JSON formatting.
    2. Output mismatch: providing full sentences when structured numerical values were required.

    These errors highlight gaps in precision and reliability, both critical in clinical deployment.
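
Both failure modes are cheap to detect before a response is graded or acted on. The Python sketch below is one hypothetical way to do that, assuming the agent's tool call arrives as a JSON string and that a numeric task expects a bare number. The function names and the required-keys convention are illustrative and not part of MedAgentBench.

```python
import json
import math
from typing import Optional, Tuple


def parse_tool_call(raw: str,
                    required_keys: Tuple[str, ...] = ("method", "url")) -> Optional[dict]:
    """Return the parsed call if it is a JSON object with the required keys, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # instruction-adherence failure: not valid JSON
    if not isinstance(call, dict) or any(key not in call for key in required_keys):
        return None  # valid JSON, but not a well-formed call
    return call


def parse_numeric_answer(raw: str) -> Optional[float]:
    """Return the answer as a float only if it is a bare finite number, else None."""
    try:
        value = float(raw.strip())
    except ValueError:
        return None  # output mismatch: e.g. a full sentence instead of a value
    return value if math.isfinite(value) else None


if __name__ == "__main__":
    print(parse_tool_call('{"method": "GET", "url": "/Observation"}'))
    print(parse_tool_call("{method: GET}"))
    print(parse_numeric_answer("7.2"))
    print(parse_numeric_answer("The value is 7.2 mg/dL."))
```

A strict check like this mirrors the benchmark's stance: a clinically correct answer wrapped in the wrong format is still treated as a failure.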

    Summary

    MedAgentBench establishes the first large-scale benchmark for evaluating LLM agents in realistic EHR settings, pairing 300 clinician-authored tasks with a FHIR-compliant environment and 100 patient profiles. Results show strong potential but limited reliability (Claude 3.5 Sonnet v2 leads at 69.67%), highlighting the gap between query success and safe action execution. While constrained by single-institution data and an EHR-focused scope, MedAgentBench provides an open, reproducible framework to drive the next generation of trustworthy healthcare AI agents.




    Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


