Evaluating AI Agents for E-Commerce Dispute Resolution Tasks
CyberJurors evaluates agent systems on multi-round, multimodal evidence handling and platform rule adaptation in e-commerce disputes.
MOV-Bench highlights evaluation gaps in multi-hop audio-visual reasoning and shows consistent gains from agentic search.
How under-specified applied ML papers can become executable benchmarks through agentic workflows and slot-based reporting.
A neuroimaging benchmark comparing vision-enabled LLMs on MRI and CT, focusing on clinical reasoning, errors, and safety tradeoffs.
View LLM agents as runtime-adaptive computation graphs to optimize accuracy, cost, latency, debugging, and control.
A LatAm-focused QA set (26k+) links Wikidata and Wikipedia to measure LLM gaps by country and cultural context.
Don’t equate tokens/sec or speedups with research automation; pin down success criteria, time budget, retries, and verification before forecasting.
RM-R1 proposes reward models that reason before scoring, reporting up to 4.9% gains on public RM benchmarks and highlighting safety evaluation gaps.
Why tiny benchmark gaps mislead: evaluation settings, reproducible logs, and multi-metric, roadmap-driven model selection.
Explain why 4-bit quantized models can show lower PPL than FP16, and outline a reproducible evaluation protocol.
A 3.5B-token combustion knowledgebase and CombustionQA benchmark unify knowledge injection and evaluation into one pipeline.
EVMbench evaluates agent smart-contract security across detection, patching with tests, and exploit attempts in a sandboxed EVM.
AgentSelect defines narrative-query to end-to-end agent configuration recommendation, proposing a benchmark with queries, agents, and interactions.
How LLMs create difficulty illusions, and how to design evaluation gates with scenarios, protocols, and multi-metric reporting.
Separate humanlike mimicry from self-consistency in LLMs, and evaluate long-term memory and persona drift with benchmarks and protocols.
Resizing, tiling, and tokenization can shift what models see, turning map/geography misreads into repeatable product risk.
How to turn AGI arrival-year claims into testable forecasts by specifying definitions, metrics, probabilities, and scoring rules.
Run MLX mxfp4 local LLMs with identical commands and prompts, logging tokens-per-sec and peak memory for reproducible comparisons.
A decision memo separating reasoning, long-term memory, and continual learning into testable metrics to reduce AGI narrative confusion.
How small prompt shifts can amplify into risky robot actions, and why alignment alone can’t guarantee physical safety.
Static benchmark gains may not translate to real work quality. Covers contamination risks and a practical evaluation framework.
Tight leaderboard scores can hide uncertainty and evaluation drift. Public data alone rarely confirms 3–6 month trend slowdowns.
Break coding agent latency into output, prefill, tool time, and network overhead to measure end-to-end duration.
Explore why METR metrics for autonomous capability are more crucial than simple benchmark scores for evaluating AI models.