The Token Is Dead, Long Live The Vector: Why LLMs Might Ditch Discrete Text Forever
How Tencent Is Rethinking Tokens.
LLMs predict one token at a time. Always have. It’s the foundation everything’s built on.
A team from Tencent just proposed we scrap the whole thing.
Their paper introduces CALM, Continuous Autoregressive Language Models, and it’s not an incremental improvement. It’s a different way of thinking about what language models do.
Instead of predicting the next token, predict the next vector. Compress 4 tokens into one continuous representation, generate that, decompress back to text. Rinse, repeat.
Sounds simple. The implications are anything but.
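To make the loop concrete, here's a toy sketch of the control flow. Everything in it — `encode`, `decode`, `predict_next_vector`, the chunk size of 4 — is a hypothetical stand-in, not the paper's actual architecture; the real CALM trains a neural autoencoder and a vector-predicting Transformer. This only shows the rhythm: chunk → vector → predict → decompress → repeat.

```python
CHUNK = 4  # tokens compressed into one continuous vector

def encode(tokens):
    # Stand-in for the trained autoencoder's encoder.
    return [float(t) for t in tokens]

def decode(vector):
    # Stand-in for the autoencoder's decoder: vector -> tokens.
    return [int(x) for x in vector]

def predict_next_vector(history):
    # Stand-in for the Transformer backbone: given all previous
    # vectors, emit the next one. Here: a dummy continuation.
    return [x + CHUNK for x in history[-1]]

def generate(prompt_tokens, steps):
    # Compress the prompt into one vector per CHUNK tokens.
    history = [encode(prompt_tokens[i:i + CHUNK])
               for i in range(0, len(prompt_tokens), CHUNK)]
    out = list(prompt_tokens)
    for _ in range(steps):
        v = predict_next_vector(history)  # one step = CHUNK tokens
        history.append(v)
        out.extend(decode(v))             # decompress back to text
    return out

print(generate([1, 2, 3, 4], steps=2))
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

The point of the sketch: each autoregressive step now commits to four tokens at once, so a model that matches the per-step quality of a token-level LLM does a quarter of the sequential work.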
Why Tokens Are Holding Us Back
Every token in a modern LLM carries 15-18 bits of information. That’s it.
A 32K vocabulary token? log₂(32768) = 15 bits. A 256K vocabulary? 18 bits.
Want to increase that? You’d need to grow the vocabulary exponentially. And the softmax layer that computes probabilities over that vocabulary becomes the bottleneck. You can’t just keep scaling vocabulary size to pack more information per prediction.
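The arithmetic is easy to check yourself — plain Python, nothing model-specific:

```python
import math

# A single token prediction carries at most log2(vocab_size) bits:
# the bits needed to pick one token out of the vocabulary.
def bits_per_token(vocab_size: int) -> float:
    return math.log2(vocab_size)

print(bits_per_token(32_768))    # 32K vocab  -> 15.0 bits
print(bits_per_token(262_144))   # 256K vocab -> 18.0 bits

# Each extra bit per step DOUBLES the required vocabulary.
# Merely doubling information density (15 -> 30 bits) would need:
print(2 ** 30)                   # ~1.07 billion entries in the softmax
```

That last number is the wall: the softmax has to score every one of those entries at every step.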
Meanwhile, models have scaled to hundreds of billions of parameters. Massive representational power, all focused on laboriously predicting these tiny 15-bit units, one at a time.
The mismatch is obvious once you see it. We’re deploying nuclear reactors to power lightbulbs.
The historical path makes sense: characters made sequences too long, so subwords became the compromise. But that compromise has limits baked in.
CALM’s bet: continuous vectors can scale information density in ways discrete tokens never could.