Tag: inference

11 articles available

News · Jul 10, 2026

Meta’s September AI Chip Push Signals Infrastructure Control

Meta’s planned AI chip production from September highlights tighter control over training and inference infrastructure, not just models.

agi

Source · Jul 8, 2026

FreqDepthKV for Robust KV Cache Compression in Long Contexts

A concise look at FreqDepthKV, a method targeting KV cache bottlenecks in long-context LLM inference.

hardware

Community · Jun 24, 2026

OpenAI And Broadcom Plan 10GW Inference Infrastructure Rollout

OpenAI and Broadcom's 10GW rollout highlights a shift toward inference-first AI infrastructure and system-level optimization.

agi

Community · May 29, 2026

Reading AI Pricing Through Limits and Infrastructure Costs

AI pricing is better understood through usage caps, fallback rules, and inference infrastructure efficiency, not subscription fees alone.

agi

Source · Mar 6, 2026

Extreme 2-Bit Quantization Can Break LLM Generation

Study compares six post-training 2-bit methods on a Polish 11B LLM, highlighting gaps between benchmarks and generation stability.

hardware

Community · Mar 3, 2026

When LLM Inference Becomes Memory-Bound Under Roofline

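Use the Roofline model (a kernel is memory-bound when its arithmetic intensity I ≤ π/β, the ratio of peak compute to memory bandwidth) to classify LLM inference kernels as memory- or compute-bound, and to guide bandwidth, cache, and interconnect decisions.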

agi

Guide · Feb 15, 2026

Decomposing LLM Inference Latency for Better Serving Performance

Break down LLM latency into queue/compute and prefill/decode, then tune batching, KV cache limits, scheduling, and quantization.

hardware

News · Feb 14, 2026

OpenAI Codex Spark Runs on Cerebras WSE-3 Chips

TechCrunch says Codex Spark inference runs on Cerebras WSE-3, highlighting serving bottlenecks and PoC latency metrics.

llm

Community · Feb 1, 2026

LLM Inference Acceleration Techniques for Enhanced Efficiency and Throughput

Explore key LLM inference acceleration techniques like FlashAttention and PagedAttention to overcome memory bottlenecks and optimize system performance.

openai

News · Jan 27, 2026

OpenAI and Cerebras Partner for $10 Billion Inference Acceleration

OpenAI signs a $10 billion deal with Cerebras to use WSE-3, boosting inference speeds by up to 15x for AI models.

Community · Jan 10, 2026

AI Inference Scaling: The Next Exponential Curve in 1-2 Years

Many worry that AI model progress has plateaued, but OpenAI and other experts predict a revolution in inference. We analyze the sharp rise expected over the next 1-2 years.