Accepted to NeurIPS 2025!
Many people view large language models (LLMs) as next token prediction machines.
...
That's actually a misconception.
When an LLM processes text, it doesn't output a token. It outputs a probability distribution over its entire vocabulary, like the one you see above.
LLMs are actually deterministic!
If we run the model on the same input text, the resulting probability distribution will be exactly the same.
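If you're curious how that distribution comes about in code, here's a rough sketch (assuming PyTorch and the Hugging Face transformers library, with GPT-2 small as in the interactive demo later on): the model produces a vector of logits, and a softmax turns them into probabilities.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 small (the same model used in the interactive demo below).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Encode a prompt and run a single forward pass.
inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# The logits at the last position describe the *next* token.
# Softmax turns them into a probability distribution over the vocabulary.
probs = torch.softmax(logits[0, -1], dim=-1)

# Run this twice with the same prompt and you'll get the same distribution.
```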
But when you use ChatGPT, you never see this probability distribution!
You only see the model continually appending tokens to its response. So how do LLMs pick a single token from this distribution?
The easiest way: pick the token with the highest probability!
This is called greedy decoding.
Greedy decoding makes next token selection deterministic.
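In code, greedy decoding is just an argmax over the distribution. A toy sketch (the probabilities below are made up for illustration):

```python
import torch

# A made-up next-token distribution over five candidate tokens.
probs = torch.tensor([0.40, 0.25, 0.20, 0.10, 0.05])

# Greedy decoding: always take the most probable token.
next_token = torch.argmax(probs).item()  # always index 0 for this distribution
```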
But this is likely in stark contrast to your experience with tools like ChatGPT: the same prompt definitely doesn't lead to the same output. So why isn't ChatGPT using greedy decoding?
In practice, greedy decoding is hardly ever used.
Empirically, it doesn't lead to great performance from language models, as it tends to produce boring and repetitive outputs.
So, what's the alternative?
Sampling!
Instead of always picking the token with the highest probability, we intentionally introduce stochasticity into the process of selecting a token!
Think of sampling like drawing a colored marble from a bag with a variety of colored marbles.
Each color represents a different token, and the number of marbles of each color corresponds to that token's probability. We randomly draw one marble (token) from the bag based on these proportions.
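Here's what that marble draw looks like in code, again sketched with made-up probabilities: torch.multinomial picks an index in proportion to each token's probability.

```python
import torch

# Toy next-token candidates and their (made-up) probabilities.
tokens = ["A", "B", "C", "D", "E"]
probs = torch.tensor([0.40, 0.25, 0.20, 0.10, 0.05])

# Draw one "marble": each index is chosen with probability proportional
# to its entry in `probs`, so repeated runs can give different tokens.
idx = torch.multinomial(probs, num_samples=1).item()
print("sampled token:", tokens[idx])
```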
But sampling isn't without issues!
Occasionally, the very nature of sampling can work to our detriment: we might end up picking a token that was assigned a very low probability. In the worst cases, this can lead to incoherent or nonsensical text.
We need to make some adjustments to the sampling process.
Introducing... sampling parameters!
By introducing parameters that modify the underlying probability distribution before sampling actually takes place, we can ensure that the model generates more relevant and coherent text.
Temperature
Let's introduce our first sampling parameter: temperature. Temperature controls the sharpness of the probability distribution. A lower temperature value accentuates the differences between token probabilities, while a higher value makes the distribution more uniform.
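Mechanically, temperature is usually applied by dividing the logits by T before the softmax (a common convention, and an assumption about this particular demo's internals). A quick sketch:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])  # toy next-token logits

for T in (0.5, 1.0, 2.0):
    # T < 1 sharpens the distribution; T > 1 flattens it toward uniform.
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
```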
But this doesn't completely solve our problem.
Low-probability tokens can still be sampled occasionally; it's just that now, they're sampled far less frequently.
Why not restrict our sampling to the top K tokens?
If we do this (and subsequently renormalize token probabilities), we can ensure that only the most likely candidates are considered. This is yet another sampling parameter, one that's aptly named top-k.
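As a sketch (with illustrative probabilities), top-k keeps the K most probable tokens, renormalizes them, and samples from that reduced set:

```python
import torch

probs = torch.tensor([0.30, 0.25, 0.20, 0.15, 0.06, 0.04])  # toy distribution
k = 3

# Keep only the k most probable tokens and renormalize so they sum to 1.
top_probs, top_indices = torch.topk(probs, k)
top_probs = top_probs / top_probs.sum()

# Sample among the surviving candidates only.
choice = top_indices[torch.multinomial(top_probs, num_samples=1)].item()
print("sampled token index:", choice)
```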
But can you see where top-k might also lead to suboptimal results?
Top-k pays no attention to the shape of the underlying probability distribution: it always keeps exactly K tokens, however the probability mass is spread among them.
[Bar chart: next-token probabilities (%) for tokens A through G]
This means that using too small of a top-k value can be overly restrictive. In the graphic above, you can see that tokens D and E are excluded from sampling, even though the model assigns relatively high probabilities to them.
In many cases, what we really want is to sample only from the tokens that make up the top X% of the probability mass.
Enter top-p!
Top-p sampling (also called nucleus sampling) selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p. This maintains diversity while avoiding unlikely tokens.
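A sketch of top-p in code (illustrative probabilities again): sort the tokens, keep the smallest prefix whose cumulative probability reaches p, renormalize, and sample.

```python
import torch

probs = torch.tensor([0.30, 0.25, 0.20, 0.15, 0.06, 0.04])  # toy distribution
p = 0.80

# Sort tokens from most to least probable and accumulate their probabilities.
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)

# Keep tokens up to and including the first one that pushes the
# cumulative probability past p (here: the top 4 tokens, 0.90 >= 0.80).
cutoff = int((cumulative < p).sum().item()) + 1
nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()

# Sample from the nucleus only.
choice = sorted_indices[torch.multinomial(nucleus_probs, num_samples=1)].item()
print("sampled token index:", choice)
```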
Now, it's your turn to explore!
On the next slide, you can generate and sample tokens using the GPT-2 small model.
Explore sampling with GPT-2!
Modify the above input text to your liking, and click View Distribution to visualize the next token probability distribution.

The Art of Picking the Next Token

How do large language models select the next token in a sequence? This interactive "scrollytelling" experience teaches you about the wonderful world of sampling!