Designing Long-Form LLM Workflows Beyond Large Context Windows
For long policy reports, context-window and upload limits favor chunked workflows that separate evidence retrieval from drafting, improving traceability and quality.
A shift from IDE plugins to terminal-native CLI coding agents, highlighting AGENTS.md and context pipelines that shape reliability and verification loops.
Study compares six post-training 2-bit methods on a Polish 11B LLM, highlighting gaps between benchmarks and generation stability.
Why single success rates fail for long-running agents, and how to measure goal drift, consistency, and governance stability in HAT.
arXiv:2603.04407v1 reports EM can be semantically contained: near 0% without triggers, but 12.2–22.8% with them.
Interpret continual learning forgetting via structural collapse and loss of plasticity, monitoring effective rank to catch early warning signals.
VANGUARD estimates GSD from monocular UAV video using small vehicles as anchors to recover metric scale without GPS or telemetry.
AgentSelect defines narrative-query to end-to-end agent configuration recommendation, proposing a benchmark with queries, agents, and interactions.
A curated link roundup from recently collected official updates and tech news.
CoT perturbations can sharply reduce accuracy. Unit conversion remains hard at scale; isolate checks and use self-consistency.
Retiring legacy ChatGPT models may shift tone, refusals, and creativity, reshaping the balance between expression and safety guardrails.
Examines multi-rater 3D lesion segmentation, limits of vanilla diffusion, and VDD anchored to consensus priors improving GED/CI.
How LLMs create difficulty illusions, and how to design evaluation gates with scenarios, protocols, and multi-metric reporting.
GIPO targets scarce, stale interaction data by replacing hard importance-ratio clipping with log-ratio Gaussian trust weights for stable reuse.
Reframes agentic AI failures as governance issues, proposing dual-helix governance with a Knowledge/Behavior/Skills architecture.
How LLM signals can shape belief in partially observable TAMP, and why calibration, uncertainty, and safety filters matter for reliability.
How to use LLM agents for research formalization with guardrails: log everything, run continuous evaluation, and score tool selection and argument precision.
How ambiguity detection, clarification, and sycophancy control shape managerial AI advice quality, risk, and evaluation metrics.
MASS trains LLMs to synthesize per-problem data and self-update at test time, raising auditability, integrity, and reproducibility needs.
Optimize AI subscriptions by checking usage limits, terms restrictions, and uptime transparency to minimize workflow disruption risk.
LLM-based conversational recommenders may infer sensitive triggers from dialogue, risking personalized safety violations unless constraints are enforced.
PlugMem externalizes long-term memory as a plug-in to reduce retrieval bloat and relevance loss, while highlighting persistent injection risks.
Tool-free visual puzzle claims depend on fixed constraints: lock tools, image preprocessing, prompts, and logs for reproducibility.
NVML, DCGM, and nvidia-smi report window-averaged power and utilization. Learn how sampling affects LLM inference graphs.