Why Your AI Is Getting More Expensive — and Smarter
Site Owner
Published 2026-05-26
The arms race in AI has shifted from pre-training to inference time. Test-time compute scaling lets models think harder on hard problems, but introduces new cost and latency tradeoffs. Chinese labs like Moonshot AI, the maker of Kimi, are pushing token efficiency as a first-class concern.

The compute war has moved. Pre-training was just the warm-up.
In 2020, scaling laws told us one thing: more parameters, more data, more compute at training time, and the model gets better. Simple. Brutal. True.
By 2025, every major lab had absorbed this lesson. Training a GPT-4-class model reportedly costs nine figures, and the top labs are reportedly spending $1B+ on a single training run. The energy behind one large training run is often compared to a small town's annual usage.
And yet — the models keep improving. Not just from bigger pre-training runs.
Something else is happening. Something more interesting.
The New Frontier Lives at Inference Time
The arms race has shifted. The battle is now fought in the seconds between your prompt hitting the API and the response arriving. This is test-time compute: the idea that you can make a model think harder when it needs to, rather than baking all intelligence into weights at training time.
Think of it like a math student. You can memorize 10,000 solved equations during a semester (pre-training). Or you can learn how to think through a new problem during the exam (test-time compute). Same student, same knowledge base, wildly different outputs depending on how much "thinking" budget you give them.
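To make the "thinking budget" idea concrete, here is a toy sketch of one simple way to spend extra compute at inference: sample several independent attempts and take a majority vote. Everything in it is an assumption for illustration. `noisy_solver` is a hypothetical stand-in for one sampled reasoning path, the 40% per-attempt success rate is made up, and this is not a description of how o1 or any specific model works internally.

```python
import random
from collections import Counter


def noisy_solver(correct_answer: int, p_correct: float, rng: random.Random) -> int:
    """Hypothetical stand-in for a single sampled reasoning path."""
    if rng.random() < p_correct:
        return correct_answer
    # Wrong attempts scatter across five different distractor answers.
    return correct_answer + rng.randint(1, 5)


def majority_vote(correct_answer: int, n_samples: int, p_correct: float,
                  rng: random.Random) -> int:
    """Spend n_samples attempts on one question and return the most common answer."""
    votes = Counter(
        noisy_solver(correct_answer, p_correct, rng) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 2000, p_correct: float = 0.4,
             seed: int = 0) -> float:
    """Fraction of trials where the voted answer is right, for a given budget."""
    rng = random.Random(seed)
    hits = sum(
        majority_vote(42, n_samples, p_correct, rng) == 42 for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for budget in (1, 5, 15):
        print(f"samples per question: {budget:2d}  accuracy: {accuracy(budget):.2f}")
```

Same solver, same per-attempt skill. The only thing that changes is how many attempts it gets per question, and each extra attempt is extra tokens.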
OpenAI's o1 and o3 models are the most visible examples. When you ask o1 a hard problem, it doesn't just generate a response. It thinks. For seconds. Sometimes minutes. It runs a long internal chain of thought, exploring multiple reasoning paths, self-correcting, backtracking. The tokens it burns during this process aren't free — each additional "thinking" token costs money and latency.
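A back-of-the-envelope sketch of that tradeoff, assuming thinking tokens are billed at the same rate as output tokens and generated at a steady decode speed. The `Pricing` class, the price, and the tokens-per-second figure are illustrative placeholders, not quotes from any provider.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    usd_per_million_output_tokens: float  # assumed price, not a real quote
    tokens_per_second: float              # assumed steady decode speed


def reasoning_overhead(thinking_tokens: int, answer_tokens: int,
                       pricing: Pricing) -> tuple[float, float]:
    """Return (cost in USD, latency in seconds) for one request.

    Assumes thinking tokens are billed and decoded like ordinary output tokens.
    """
    total_tokens = thinking_tokens + answer_tokens
    cost = total_tokens * pricing.usd_per_million_output_tokens / 1_000_000
    latency = total_tokens / pricing.tokens_per_second
    return cost, latency


if __name__ == "__main__":
    pricing = Pricing(usd_per_million_output_tokens=10.0, tokens_per_second=50.0)
    for thinking in (0, 2_000, 8_000):
        cost, latency = reasoning_overhead(thinking, 500, pricing)
        print(f"thinking={thinking:5d}  cost=${cost:.4f}  latency={latency:.0f}s")
```

Under these assumptions both cost and latency grow linearly with the thinking budget, which is why a visible answer of a few hundred tokens can still take minutes to arrive.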
But the results are categorically different. In its high-compute configuration, o3 scored 87.5% on the ARC-AGI semi-private evaluation set, a benchmark designed to resist rote memorization. That's not a marginal improvement; it's a qualitative leap that previous models couldn't touch.
The Chinese Labs Are Paying Attention
While the Western narrative fixates on OpenAI, Chinese labs are running their own race with striking results.
DeepSeek-R1 demonstrated that reasoning capability can be distilled into smaller models, and its "thinking" approach spread rapidly across the open-source ecosystem. Moonshot AI's Kimi K2.5 model pushed further on token efficiency: getting the same reasoning quality with fewer thinking tokens. This matters enormously at scale. If a million users are running reasoning-heavy tasks, halving the average thinking-token count cuts your inference bill nearly in half, because thinking tokens dominate the cost of those requests.
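The arithmetic at scale can be sketched as follows. Every number here is an assumption chosen for illustration: the per-token prices, the request volume, and the token counts do not come from any lab's published pricing, and `monthly_bill` is a hypothetical helper.

```python
def monthly_bill(users: int, requests_per_user: int, prompt_tokens: int,
                 thinking_tokens: int, answer_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Monthly inference cost in USD.

    Assumes prompt tokens are billed at the input rate and that thinking
    and answer tokens are both billed at the output rate.
    """
    per_request = (
        prompt_tokens * usd_per_m_input
        + (thinking_tokens + answer_tokens) * usd_per_m_output
    ) / 1_000_000
    return users * requests_per_user * per_request


if __name__ == "__main__":
    # Shared workload: a million users, 30 reasoning-heavy requests each.
    workload = dict(users=1_000_000, requests_per_user=30, prompt_tokens=1_000,
                    answer_tokens=500, usd_per_m_input=1.0, usd_per_m_output=10.0)
    baseline = monthly_bill(thinking_tokens=4_000, **workload)
    efficient = monthly_bill(thinking_tokens=2_000, **workload)
    print(f"baseline:  ${baseline:,.0f}")
    print(f"efficient: ${efficient:,.0f}")
    print(f"saving:    {1 - efficient / baseline:.0%}")
```

With these placeholder numbers the saving lands somewhat below 50 percent, because prompt and answer tokens are unchanged. The more a workload is dominated by thinking tokens, the closer halving them comes to halving the bill.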