Tag: benchmark

49 articles available

Rethinking Medical LLM Evaluation for Clinical Reasoning

hardware

Source | Jul 12, 2026

A survey argues medical LLMs should be judged by clinical reasoning capacity, not just benchmark accuracy.

Do Higher LLM Scores Really Signal Approaching AGI

agi

Community | Jul 11, 2026

Public research suggests rising LLM scores reflect tools, memory, and planning systems, not a simple march toward AGI.

When LLM Agreement Fails as a Reliability Signal

hardware

Source | Jul 11, 2026

Why LLM agreement can mislead evaluation: correlated errors, shared wrong answers, and safer judging protocols.

Three Axes for Comparing Korean LLM Performance

hardware

Community | Jul 10, 2026

Korean LLMs are better judged by naturalness, pragmatic understanding, and instruction following than by one rank.

PCBWorld Redefines Evaluation for Engine-Grounded PCB Routing AI

agi

Source | Jul 9, 2026

An overview of PCBWorld, a KiCad-based environment for evaluating PCB routing AI with native actions and DRC feedback.

How Question AIs Shift Search Toward Accuracy

hardware

Community | Jul 7, 2026

Question-based AI speeds research, but answer accuracy and source verification remain critical for reliable work.

Why Coding Leads LLM Positioning And Evaluation Today

llm

Community | Jul 6, 2026

Why LLM firms foreground coding as a core benchmark, and how that bias helps developers but raises barriers for nondevelopers.

Medical AI Beyond Tests to Clinical Reasoning

llm

Source | Jul 4, 2026

As multiple-choice medical benchmarks saturate, open-ended clinical reasoning and safety are becoming key measures.

PACE Tests Cheap Proxies For Agent Benchmark Performance

llm

Source | Jul 4, 2026

PACE examines whether low-cost non-agent benchmarks can predict expensive agent benchmark performance.

How To Compare Code Models Beyond Benchmark Scores

hardware

Community | Jul 3, 2026

Code model evaluation should weigh real task success, retries, latency, and token cost, not benchmark scores alone.

Do Language Models Really Build Stable World Models

llm

Community | Jun 29, 2026

Strong language performance may not imply a stable world model. A reassessment of LLMs through their failures in time, space, and physics.

MMG-Pop Rethinks Social Popularity Prediction Across Platforms

llm

Source | Jun 29, 2026

MMG-Pop uses multimodal and temporal graph signals from Bluesky and Reddit to reassess social popularity prediction.

Why Benchmarks Miss Much of LLM Performance

llm

Source | Jun 28, 2026

How single-run LLM benchmarks can miss usable performance, and why model choice, retries, and cost matter.

OpenFinGym Reframes How Financial AI Systems Are Evaluated

agi

Source | Jun 27, 2026

OpenFinGym shifts financial AI evaluation from single-task accuracy to workflow-level testing across prediction, trading, and risk.

Temporal Validity Challenges in RAG and Evolving Knowledge

agi

Source | Jun 26, 2026

How RAG mixes past and current facts, causing stale-fact errors, and why temporal validity matters in retrieval.

Automating Benchmarks for Neural Relational Reasoning Generalization

hardware

Source | Jun 25, 2026

Why automated LLM-built benchmarks for relational reasoning need difficulty control, reliable answers, and bias checks.

Beyond RAG for Domain-Specific LLM Decision Tasks

llm

Community | Jun 25, 2026

RAGBench and LegalBench show why enterprise LLM evaluation must separate retrieval quality from domain-specific judgment.

HOLMES Challenges LLMs With Higher-Order Logic Reasoning

llm

Source | Jun 24, 2026

HOLMES probes higher-order logic reasoning beyond final answers, exposing limits in LLM rule, predicate, and constraint handling.

IV-CoT Separates Structure Planning From Visual Rendering

llm

Source | Jun 24, 2026

IV-CoT targets structural prompt fidelity in text-to-image generation by separating layout planning from appearance rendering.

How Close Chinese LLMs Are to Frontier Models

agi

Community | Jun 19, 2026

Chinese LLM progress is best judged by benchmarks, independent evaluations, and cost efficiency rather than executive claims.

Why LLM Reasoning Needs More Than Correct Answers

hardware

Community | Jun 18, 2026

LLM reasoning should be judged not only by accuracy, but also by consistency, constraint tracking, and self-checking.

Rethinking Protein AI Evaluation With TadA-Bench Replay

agi

Source | Jun 3, 2026

TadA-Bench shifts protein AI evaluation from static prediction scores to experiment selection and chronology-preserving replay.

CodeGolf Bench Tests Concise Code Beyond Correctness Metrics

hardware

Source | Jun 1, 2026

CodeGolf Bench measures concise code generation across 60 languages, but its scores should not be read as real-world engineering productivity.

SCALE and the Shift Toward Self-Exploring Web Agents

llm

Source | Jun 1, 2026

SCALE examines whether web agents can reduce reliance on expert demonstrations and learn through self-exploration.