By The DDH Team · Digital Dashboard Hub

How LLMs Actually Work — for Prompt Writers (2026)

Just enough of how large language models really work — tokens, context windows, sampling, training vs inference, and hallucinations — to make you measurably better at writing prompts.

A large language model is a next-token predictor: given the text so far, it produces a probability distribution over the next token and samples one, repeatedly, until it stops. Everything that feels like understanding — reasoning, style, refusal, hallucination — falls out of that one loop plus how the model was trained. You don't need the math to prompt well, but you do need the mechanics, because each one has a direct, practical implication for how you write prompts.

This guide explains tokens, context windows, sampling controls (temperature and top_p), the difference between training and inference, and why models hallucinate — and after each, what it means for your prompts. A useful anchor to start: 1 token ≈ 4 characters ≈ 0.75 words in English (per OpenAI and Anthropic tokenization docs). To put any of this into practice, our ChatGPT Prompt Generator and Code Prompt Builder bake the implications in.

LLM mechanics and what each means for your prompts

| Feature | What it is | Prompting implication |
| --- | --- | --- |
| Token | Sub-word unit; ~4 chars ≈ 0.75 words | Budget in tokens; keep context lean |
| Context window | Max tokens considered at once | Key instructions first/last; retrieve relevant chunks |
| Prediction loop | Repeated next-token sampling | Reason before answering; fix issues upstream |
| Temperature | Randomness of token choice | Low for factual, higher for creative |
| Top_p | Nucleus: smallest set summing to p | Adjust one dial, not both |
| Training vs inference | Frozen weights at call time | Supply current facts; few-shot is temporary |
| Hallucination | Confident, unsupported output | Ground in sources; require 'I don't know' |

Mechanics summarized from foundational research and provider docs: [Wei et al. 2022 (CoT)](https://arxiv.org/abs/2201.11903), [Brown et al. 2020 (few-shot)](https://arxiv.org/abs/2005.14165), [Yao et al. 2023 (ReAct)](https://arxiv.org/abs/2210.03629), and sampling docs in the [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat). Token rule of thumb per OpenAI/Anthropic tokenization docs. Current as of June 2026.

What's in this guide

Each section explains one mechanism, then the prompting takeaway. Sections:

1. Tokens — the unit the model actually reads.

2. Context windows — the model's working memory.

3. The prediction loop — why models are next-token predictors.

4. Sampling: temperature and top_p — the randomness dials.

5. Training vs inference — what the model knows and when.

6. Why hallucinations happen — and how prompting reduces them.

7. What all of this means for writing prompts (the summary).

8. Sources & further reading.


Tokens: the unit the model actually reads

Models don't see words or characters; they see tokens, sub-word chunks produced by a tokenizer. Common words are often one token; rare words, long words, and unusual strings split into several. As a rule of thumb, 1 token ≈ 4 characters ≈ 0.75 words in English (per OpenAI and Anthropic docs). So ~1,000 tokens is about 750 words, and a 10-page document (assuming roughly 375-450 words per page) is roughly 5,000-6,000 tokens.

Why prompt writers care: (1) cost and limits are measured in tokens, not words — see our Cost Per Token Across All Major AI Models for the pricing side. (2) Tokenization is language- and content-dependent: non-English text, code, and unusual formatting can cost far more tokens per 'word' than plain English. (3) The model's sense of structure is token-level, which is why consistent formatting and clear delimiters help — you're shaping the token stream the model predicts from.

Practical takeaway: budget prompts in tokens, not words; keep context lean because every token is read (and paid for) on every call; and don't be surprised when a short snippet of dense code or a non-English passage uses more tokens than its length suggests.
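The rule of thumb above can be turned into a quick budgeting helper. This is a minimal sketch: `tokens_from_chars` and `tokens_from_words` are hypothetical names, and the ratios are the English-language approximations quoted in this section, not a real tokenizer. Code and non-English text will usually come out higher than these estimates.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text, not a tokenizer
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text, not a tokenizer


def tokens_from_chars(text: str) -> int:
    """Approximate token count from character length (~4 characters per token)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def tokens_from_words(word_count: int) -> int:
    """Approximate token count from an English word count (~0.75 words per token)."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


print(tokens_from_words(750))          # 1000
print(tokens_from_chars("a" * 4000))   # 1000
```

For billing or hard limits, count with the provider's own tokenizer; an estimate like this is only good enough for rough planning.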


Context windows: the model's working memory

The context window is the maximum number of tokens the model can consider at once — your prompt, the conversation history, any attached documents, and the output it's generating all share that budget. In 2026, windows are large: Anthropic includes a 1M-token context window at standard pricing on its Opus 4.6+, Sonnet 4.6, and Fable 5 models, for example.

Two facts matter for prompting. First, everything outside the window doesn't exist to the model — in a long conversation, early turns can fall out of context, and the model genuinely cannot 'remember' them. Second, even within the window, position matters: models tend to attend most reliably to the beginning and end of the context, so burying a critical instruction in the middle of a huge prompt is risky.
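One way to act on the position effect is to wrap long material between the instruction and a restatement of it. The sketch below is illustrative only: `sandwich_prompt` and the `--- document ---` delimiters are assumptions made for this example, not a provider convention.

```python
def sandwich_prompt(instruction: str, document: str) -> str:
    """Place the instruction before the document and restate it after."""
    return (
        f"{instruction}\n\n"
        "--- document start ---\n"
        f"{document}\n"
        "--- document end ---\n\n"
        f"Reminder: {instruction}"
    )


prompt = sandwich_prompt(
    "Summarize the contract in five bullet points.",
    "(long contract text goes here)",
)
print(prompt)
```

The long text sits in the middle, where attention is least reliable, while the constraint you care about occupies both the start and the end.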

Practical takeaways: put your most important instructions at the start (and optionally restate the key constraint at the end); for long documents, retrieve and include only the relevant chunks rather than pasting everything; and in long chats, periodically restate critical context because old turns may have scrolled out of the window. A bigger window is a capacity, not a reason to fill it — lean context generally produces sharper, cheaper output.
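The shared budget can be sketched as simple arithmetic: window minus reserved output minus system prompt leaves what is available for history. In this hypothetical helper, `fit_history` and `rough_tokens` are invented names, the 4-characters-per-token estimate is the heuristic from the tokens section, and the window size is whatever your model documents.

```python
import math
from typing import List


def rough_tokens(text: str) -> int:
    """Heuristic only: ~4 characters per token for English text."""
    return math.ceil(len(text) / 4)


def fit_history(system: str, turns: List[str], window: int, reserve_output: int) -> List[str]:
    """Keep the newest turns that fit after the system prompt and output reserve."""
    budget = window - reserve_output - rough_tokens(system)
    kept: List[str] = []
    for turn in reversed(turns):      # walk from newest to oldest
        cost = rough_tokens(turn)
        if cost > budget:
            break                     # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    return list(reversed(kept))       # restore chronological order
```

Anything this function drops is invisible to the model on the next call, which is why critical context needs to be restated rather than assumed.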


The prediction loop: why models are next-token predictors

At inference, the model repeats one step: read all tokens so far, compute a probability distribution over the next token, pick one, append it, repeat — until it emits a stop token or hits a length limit. There is no separate 'planning' phase; the apparent reasoning is the model generating tokens that, statistically, tend to follow good reasoning in its training data.

This explains several behaviors. Chain-of-thought works because writing the reasoning steps as tokens conditions the later answer tokens on that reasoning — the model literally does better when it 'thinks out loud,' as shown in Wei et al., 2022 (arXiv:2201.11903). It also explains why models can paint themselves into a corner: an early wrong token shifts the probabilities for everything after it.
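The loop can be sketched with a toy model. Everything here is hypothetical: `TOY_MODEL` is a hand-written probability table, and it conditions only on the previous token, whereas a real LLM conditions on all tokens so far. The control flow (get a distribution, sample one token, append, repeat until a stop token or length limit) is the part that carries over.

```python
import random
from typing import Dict, List

# Hypothetical toy "model": next-token probabilities given the previous token only.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "[start]": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"[stop]": 1.0},
    "ran": {"[stop]": 1.0},
}


def generate(rng: random.Random, max_tokens: int = 10) -> List[str]:
    """Sample tokens one at a time until the stop token or the length limit."""
    tokens = ["[start]"]
    for _ in range(max_tokens):                       # length limit
        dist = TOY_MODEL[tokens[-1]]                  # distribution over the next token
        choice = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if choice == "[stop]":                        # stop token ends generation
            break
        tokens.append(choice)                         # append, then repeat
    return tokens[1:]                                 # drop the start marker


print(" ".join(generate(random.Random())))
```

Note how the second token changes what follows: after "cat" the table favors "sat", after "dog" it favors "ran". That is the small-scale version of an early token shifting everything downstream.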

Practical takeaways: ask for reasoning before the answer on hard tasks (the order matters — reasoning must come first to condition the answer); and when output goes off the rails, the fix is often earlier in the prompt, because everything downstream is conditioned on what came before. For agent loops that interleave reasoning with actions, see ReAct (Yao et al., 2023, arXiv:2210.03629).
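A reasoning-first prompt can be as simple as fixing the output order and then parsing the final line. This is a sketch under assumptions: the `Reasoning:` and `Answer:` markers and both function names are invented for this example, and a model may not always follow the format.

```python
def reasoning_first_prompt(question: str) -> str:
    """Ask for the reasoning before the answer so the answer is conditioned on it."""
    return (
        "Solve the problem below.\n"
        "First write your reasoning under the heading 'Reasoning:'.\n"
        "Then give the final result on a new line starting with 'Answer:'.\n\n"
        f"Problem: {question}"
    )


def extract_answer(completion: str) -> str:
    """Return the text after the last 'Answer:' marker, or '' if it is missing."""
    marker = "Answer:"
    idx = completion.rfind(marker)
    if idx == -1:
        return ""
    return completion[idx + len(marker):].strip()


print(extract_answer("Reasoning: 12 boxes of 3 is 36.\nAnswer: 36"))   # 36
```

Asking for the answer first and the explanation second would reverse the conditioning: the explanation would then be generated to fit an answer already committed to.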


Sampling: temperature and top_p

The model outputs a probability distribution over the next token, but how it picks from that distribution is controlled by sampling parameters — chiefly temperature and top_p (documented in the OpenAI API reference).

Temperature rescales the model's raw scores (logits) before they are turned into probabilities, which sharpens or flattens the distribution. Low temperature (near 0) concentrates probability on the highest-scoring tokens, producing more deterministic, focused, repeatable output. High temperature flattens the distribution, making lower-probability tokens more likely, which gives more varied, creative, and unpredictable output. Top_p (nucleus sampling) instead restricts choices to the smallest set of most-likely tokens whose probabilities sum to at least p; a low top_p keeps only the most likely options.
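Both dials can be sketched in a few lines. This is the textbook formulation (softmax over logits divided by temperature, then a nucleus cutoff), not any provider's actual implementation; the function names and example logits are made up, and this sketch requires a temperature above 0.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())                        # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of most-likely tokens whose mass reaches p, renormalized."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: prob / total for tok, prob in kept.items()}


logits = {"Paris": 2.0, "Lyon": 1.0, "Rome": 0.0}      # hypothetical scores
print(apply_temperature(logits, 0.5))                  # sharper: mass concentrates on "Paris"
print(apply_temperature(logits, 2.0))                  # flatter: alternatives gain mass
print(top_p_filter({"Paris": 0.5, "Lyon": 0.3, "Rome": 0.2}, 0.7))   # keeps "Paris" and "Lyon"
```

Because both functions reshape the same distribution, changing them together makes it hard to tell which one caused a change in output.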

Practical takeaways: for factual extraction, classification, structured output, and anything you need to be repeatable, use a low temperature (often 0 or near it). For brainstorming, creative copy, and varied alternatives, raise it. General guidance is to adjust one of temperature or top_p, not both at once. Note that low temperature reduces variability — it does not make the model correct, and it does not stop hallucination. If a prompt only works at temperature 0, the prompt is fragile; fix the prompt, don't just pin the dial.


Training vs inference: what the model knows and when

There are two distinct phases. Training is when the model learns its weights from large text corpora (pretraining) and is then aligned to be helpful and safe (fine-tuning / RLHF). Inference is when you call the model: the weights are frozen, and the model uses only those fixed weights plus whatever is in your prompt's context window. Your prompt does not teach the model anything permanent.

This distinction resolves a lot of confusion. The model's 'knowledge' is whatever was in its training data up to its cutoff — it has no live awareness of events after that, and it cannot look anything up unless you give it tools or retrieved context. In-context learning (few-shot examples) is not training; it's the model conditioning on examples within the prompt, as described in Brown et al., 2020 (arXiv:2005.14165). The effect vanishes when the context ends.
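In code terms, few-shot examples are just text that gets rebuilt and resent on every call. The sketch below makes that visible; `few_shot_prompt` and the `Input:`/`Output:` layout are illustrative assumptions, not a required format.

```python
from typing import List, Tuple


def few_shot_prompt(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Build a prompt whose examples live only in this one request's context."""
    lines = [task, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


prompt = few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("Loved it", "positive"), ("Total waste of money", "negative")],
    "Arrived broken",
)
print(prompt)
```

Nothing about the model changes after this call. Drop the examples from the next request and the behavior they induced goes with them.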

Practical takeaways: never assume the model knows current facts — supply them in context or via retrieval/tools. Treat few-shot examples as temporary instructions, not permanent learning. And when you need authoritative, up-to-date information, ground the model in sources you provide rather than trusting recalled facts (the next section explains why).


Why hallucinations happen

A hallucination is fluent, confident output that is factually wrong or unsupported. It's a direct consequence of the prediction loop: the model is optimized to produce plausible-sounding next tokens, and plausibility is not the same as truth. When the model lacks the relevant fact, it doesn't know it lacks it — it generates the most probable-looking continuation, which can be a confident fabrication.

Contributing factors: the fact wasn't in training data (or was rare/contradictory); the question is outside the model's knowledge cutoff; the prompt invites speculation without permitting 'I don't know'; or sampling at high temperature surfaces a low-probability, wrong token. Crucially, the model has no built-in signal that distinguishes 'I'm recalling a fact' from 'I'm generating a plausible guess' — both come out equally fluent.

Prompting reduces hallucination but cannot fully eliminate it. The high-leverage moves: (1) ground the model in supplied context and instruct it to use only that context; (2) explicitly permit and require 'not specified / I don't know' rather than guessing; (3) lower temperature for factual tasks; and (4) for anything high-stakes, keep a human in the loop and cite real sources. Retrieval-grounded prompts with a strict uncertainty rule are the single most effective pattern — see the negative-constraint pattern in our 12 Prompt Patterns That Convert.
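A retrieval-grounded prompt with a strict uncertainty rule might look like the following. Treat it as a sketch: `grounded_prompt`, the `FALLBACK` wording, and the numbered-source layout are assumptions for illustration, and the sources would come from your own retrieval step.

```python
from typing import List

# Hypothetical fallback phrase; fixing the exact wording makes refusals easy to detect.
FALLBACK = "Not specified in the provided sources."


def grounded_prompt(question: str, sources: List[str]) -> str:
    """Restrict the model to supplied sources and give it an explicit way to decline."""
    numbered = "\n".join(f"[{i}] {source}" for i, source in enumerate(sources, start=1))
    return (
        "Answer using ONLY the sources below and cite them by number.\n"
        f"If the sources do not contain the answer, reply exactly: {FALLBACK}\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}"
    )


print(grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of delivery.", "Shipping is free over $50."],
))
```

A fixed fallback string also lets downstream code check for it and route the question to a human instead of passing a guess along.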


What all of this means for writing prompts

Pulling the mechanics together into prompting rules:

**Tokens →** budget in tokens; keep context lean; expect code and non-English to cost more per word.

**Context window →** put key instructions at the start, restate at the end, retrieve only relevant chunks, and refresh context in long chats.

**Prediction loop →** ask for reasoning before the answer on hard tasks; fix problems upstream in the prompt, since everything downstream is conditioned on it.

**Sampling →** low temperature for factual/repeatable work, higher for creative; adjust one dial, not both; don't mistake temperature 0 for correctness.

**Training vs inference →** supply current facts in context; treat few-shot as temporary; never assume live knowledge.

**Hallucination →** ground in sources, require 'I don't know,' lower temperature, and keep humans in the loop for high-stakes output.

These rules are why the techniques in our Complete Guide to Prompt Engineering work the way they do. Understanding the mechanism turns prompting from trial-and-error into something you can reason about. Start applying it with the ChatGPT Prompt Generator or Code Prompt Builder.


Sources & further reading

References for the mechanics above (as of June 2026):

Chain-of-Thought / why reasoning-first helps (Wei et al., 2022): https://arxiv.org/abs/2201.11903

In-context / few-shot learning (Brown et al., 2020): https://arxiv.org/abs/2005.14165

ReAct, reasoning interleaved with actions (Yao et al., 2023): https://arxiv.org/abs/2210.03629

Tree of Thoughts (Yao et al., 2023): https://arxiv.org/abs/2305.10601

Sampling parameters (temperature, top_p) — OpenAI API reference: https://platform.openai.com/docs/api-reference/chat ; provider prompting guidance: https://platform.openai.com/docs/guides/prompt-engineering , https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/overview , https://ai.google.dev/gemini-api/docs/prompting-strategies

Token economics (context budgeting): see our Cost Per Token guide and the live provider pricing pages it links.

Token rule of thumb (1 token ≈ 4 characters ≈ 0.75 words): per OpenAI and Anthropic tokenization documentation.

Frequently Asked Questions

What is a token in an LLM?

A token is the sub-word chunk the model actually reads — common words are often one token, while rare or long words split into several. The rule of thumb is 1 token ≈ 4 characters ≈ 0.75 words in English (per OpenAI and Anthropic docs), so ~1,000 tokens is about 750 words. Cost and context limits are measured in tokens, not words, and dense code or non-English text uses more tokens per word than plain English.

What is a context window and why does it matter for prompts?

The context window is the maximum number of tokens the model can consider at once — your prompt, history, attached documents, and the generated output all share it. Anything outside it effectively doesn't exist to the model. In practice: put key instructions at the start (models attend most reliably to the beginning and end), retrieve only relevant chunks of long documents, and restate critical context in long conversations because early turns can scroll out of the window.

What does temperature do, and should I set it to 0?

Temperature controls how randomly the model picks the next token. Low temperature (near 0) gives focused, repeatable output; high temperature gives varied, creative output. Use low for factual extraction, classification, and structured output; raise it for brainstorming. But temperature 0 makes output more repeatable, not correct, and it does not stop hallucination. If a prompt only works at 0, the prompt is fragile and should be fixed. See the OpenAI API reference.

Why do LLMs hallucinate?

Because they're optimized to produce plausible next tokens, and plausibility isn't truth. When a model lacks a fact, it doesn't know it lacks it — it generates the most probable-looking continuation, which can be a confident fabrication, with no internal signal separating recall from guessing. Prompting reduces this: ground the model in supplied context, require it to say 'not specified' rather than guess, lower temperature for factual tasks, and keep a human in the loop for high-stakes output.

Does my prompt teach the model anything permanently?

No. Training (learning weights) and inference (calling the model) are separate phases. At inference the weights are frozen, and the model uses only those plus what's in your context window. Few-shot examples are in-context learning — temporary conditioning that vanishes when the context ends, per Brown et al. 2020 — not permanent learning. The model also has no live knowledge past its training cutoff unless you supply current facts via context or tools.

Why does asking the model to 'think step by step' improve answers?

Because the model is a next-token predictor: the tokens it writes condition the tokens that follow. When it writes out reasoning first, the final answer is conditioned on that reasoning, which measurably improves accuracy on multi-step problems (the chain-of-thought effect from Wei et al. 2022). The order matters: reasoning must come before the answer to have any effect. Modern reasoning-tuned models often do this internally, so an explicit 'think step by step' instruction helps less on those models.
