Foundations: LLMs as the Agent Engine

The whole bootcamp builds agents. Before we wire up tools, memory, RAG (Retrieval-Augmented Generation), and multi-agent systems, you need a precise mental model of the one component at the center of all of it: the LLM (Large Language Model). Get this right and everything later (the ReAct — Reasoning and Acting — loop, context windows, prompt caching, injection defense) follows from first principles instead of feeling like trivia.

Learning objectives

By the end you can, cold, in an interview:

Describe what an LLM actually is for engineering purposes: a stateless, autoregressive next-token predictor over tokens, bounded by a finite context window, steered by sampling (temperature/top-p) and by the system / user / assistant role structure.
Explain why “multi-turn chat” is an illusion the client maintains by re-sending the full history every call — the model remembers nothing.
Define an agent: an LLM in a loop, with tools and a goal, choosing its own next step from feedback — and articulate why a loop is the necessary shape.
Place any system on the agent ⇄ workflow spectrum (Anthropic’s distinction) and run the “should I build an agent?” checklist (complexity, value, viability, cost-of-error) — including the compounding error math that argues against unnecessary autonomy.

Run alongside this lesson:

python 00_foundations/demo.py          # offline, deterministic
python 00_foundations/demo.py --live   # optional: real model via OpenRouter
pytest 00_foundations/ -q              # the worked code is tested

1. What an LLM actually is (for agent purposes)

Strip away the marketing. A large language model is a function:

            ┌─────────────────────────────────────────────┐
  tokens →  │  f(context) -> probability over next token  │  → one token
            └─────────────────────────────────────────────┘
                         (then append, repeat)

Everything below is a consequence of that single fact.

1.1 Next-token predictor (autoregressive)

The model does not “answer questions.” It predicts a probability distribution over the next token, given everything so far. The runtime samples one token, appends it, and feeds the longer sequence back in — autoregression. A paragraph is just this loop run a few hundred times. “Reasoning,” “tool calls,” and “JSON” (JavaScript Object Notation) are all the same machinery: text that was statistically likely given the context and the training distribution.

Why this matters for agents: there is no separate “planning module.” When an agent reasons (ReAct’s Thought:) or decides to call a tool, that is generated text, sampled the same way as any other token. This is exactly why prompt format, examples, and tool descriptions move the needle so hard — you are shaping a probability distribution, not configuring a program.

1.2 Tokens (not words, not characters)

Text is chopped into tokens (subword chunks ≈ 3–4 characters / ~0.75 words of English on average). Tokens are the unit of:

the context window (the limit is in tokens, not characters),
billing (you pay per input + output token), and
latency (output tokens are generated one at a time — long outputs are slow).

Consequences engineers trip on: token budgets are approximate and model-specific; non-English and code tokenize less efficiently; “just dump the whole document in” has a real, metered cost; and you cannot reason about cost or truncation without thinking in tokens.

1.3 The context window = the only working memory

The context window is the maximum number of tokens the model can attend to in a single call (mid-2026: roughly ~128K for GPT-4o-class models, ~200K for Claude, with 1M-token variants available). Inside that window, everything competes for the same finite budget: system instructions, the conversation history, tool definitions, retrieved documents, and tool outputs.

This is the single most load-bearing constraint in agent engineering:

   [ system prompt | tool specs | history | retrieved docs | tool outputs ]
   └──────────────── one finite token budget per call ────────────────────┘

Whole later modules exist because of this limit: memory (what to keep vs. externalize), RAG (fetch only the relevant slice), context engineering / compaction (summarize history before it overflows), prompt caching (don’t re-pay to process a static prefix). Attention cost also grows ~quadratically with length, so quality and speed degrade non-linearly as the window fills — more context is not free and not always better (“context rot”).

1.4 Stateless — the model remembers nothing between calls

Each API (Application Programming Interface) call is independent. The model holds no session, no memory of the previous turn. Its entire knowledge of “this conversation” is the messages list you send this call.

So how does ChatGPT “remember” your name? The client does: it stores the running transcript and re-sends the whole history on every call.

Turn 2 request you must send:
  [ {user: "My name is Ada."},
    {assistant: "Nice to meet you, Ada."},
    {user: "What is my name?"} ]      ← turn-1 history re-sent, or it's gone

demo.py makes this concrete: drop the history and the model literally cannot recall the name (it isn’t in the request); re-send it and it “remembers.” It was never memory — it was context reconstruction.

Interview one-liner: “LLMs are stateless. ‘Memory’ is the client re-sending the full transcript every call; long-term memory is just deciding what to put back into that context and what to store externally.” This single sentence is the seed of the entire memory module.

1.5 Sampling: temperature and top-p

The model emits a distribution; sampling picks the actual token.

temperature — flattens (high, e.g. 0.8–1.0: more random/creative/varied) or sharpens (low, e.g. 0: pick the most likely token, “greedy”) the distribution.
top-p (nucleus) — sample only from the smallest set of tokens whose probabilities sum to p (e.g. 0.9), truncating the long tail.

Rules of thumb: low temperature for extraction / classification / tool-arg generation (you want consistency); higher for brainstorming / drafting.

Gotcha (great interview catch): temperature=0 is not truly deterministic. Floating-point non-associativity in parallel GPU (Graphics Processing Unit) attention can change which token wins a near-tie, so identical inputs can still diverge. For real determinism you constrain the format (structured/JSON mode) and, in tests, stub the model by input hash — you don’t rely on temperature=0.

1.6 Roles: system / user / assistant

Chat models take a list of messages, each with a role:

Role	Who / what it is	Use it for
`system`	Operator instructions, persona, rules, tool policy	Behavior, guardrails, “how to act”
`user`	The human’s (or upstream caller’s) input	The task / question
`assistant`	The model’s prior replies (and tool-call requests)	History; what the model “said”

Roles are a soft priority signal the model was trained to honor — system generally outranks user. Two consequences you must internalize early:

system is not a security boundary. A determined user turn — or, worse, malicious text smuggled in via a tool result / retrieved document (indirect prompt injection) — can override it. Treat every role’s content as potentially adversarial (the security module goes deep here).
In agentkit, system is passed separately: llm.complete(messages, system="...", tools=[...]). Providers differ on the wire (Anthropic has a top-level system; OpenAI uses a system/developer message) — agentkit normalizes this so your loop never cares.

1.7 Prompting, in one breath

A prompt is just the assembled context (system + messages [+ tools]) you hand the predictor. Because output is sampled text conditioned on that context, the levers are: clear instructions, few-shot examples (show the output grammar you want), role structure, and putting the relevant material in context (retrieval). Prompting is programming the distribution — there is no other API surface to the model’s behavior.

2. From a model to an agent

A single .complete(...) call does exactly one step: read the context, produce the next chunk of text (possibly a request to call a tool). That’s it. Most real goals need many steps whose number and shape aren’t known up front — you can’t tell in advance how many web searches a research question needs, or how many edits a bug fix takes.

So you wrap the one-step model in a loop:

agentic loop (the shape everything later refines):

  context = [system, goal]
  for step in range(MAX_STEPS):          # <-- the guard is non-negotiable
      response = llm.complete(context)   # model decides the NEXT step
      if response.is_final:              # model says "I'm done"
          return response.answer
      result = run(response.tool_call)   # act on the world
      context += [response, result]      # feed reality back in (observe)
  raise RanOutOfSteps                    # safety net if it never finishes

Definition (say it exactly like this): An agent is an LLM in a loop with tools and a goal, where the model — not hard-coded control flow — decides the next action from feedback, until it judges the goal met (or a guard stops it). Anthropic’s gloss: “agents are typically just LLMs using tools in a loop based on environmental feedback.”

Why a loop? (make this intuition land)

Three reasons, each worth stating out loud:

One call = one step. The model is a single-step function. Multi-step work requires re-invoking it. The loop is the only way to get step 2.
The path isn’t predetermined. If you knew the steps ahead of time you’d write a script (a workflow, §3). You reach for a loop precisely when the model must choose what to do next based on what it just learned.
Grounding beats guessing. Each iteration feeds a real observation (a tool result) back into the context, so the next step builds on verified data instead of an earlier guess. (This is exactly why the ReAct loop reduces hallucination vs. pure chain-of-thought — a closed reasoning loop propagates a fabricated fact; an open loop replaces speculation with a real result after each step. We prove this in the ReAct module.)

The litmus test you can recite: “If you find yourself writing while not done: response = call_llm(...) — where the model decides done and what to do next — you have an agent. If the control flow is fixed and you decide the order, you have a workflow.”

demo.py Part 4 shows the bare skeleton: a scripted “model” takes two steps and then emits DONE — the model decided when to stop, not the code — and a max_steps guard catches a model that never stops. (Removing that guard is the real-world incident where an agent looped for 11 days and burned ~$47K; always bound the loop.)

3. Agent vs. workflow, and “should I build an agent?”

Anthropic (“Building Effective Agents”) draws the load-bearing line:

Workflow = LLMs and tools orchestrated through predefined code paths. The programmer decides the order. Deterministic, debuggable, cheap.
Agent = the LLM dynamically directs its own process and tool use. The model decides the order. Flexible, open-ended, harder to bound.

It’s a spectrum, and both are “agentic systems.” The single differentiator is autonomy: who decides what’s next — your code, or the model?

 less autonomy ─────────────────────────────────────────────► more autonomy
 single call → prompt chain → routing → orchestrator-workers → autonomous agent
 (workflows: you wire the path)        |        (agent: the model wires the path)

A practical ladder (climb only as far as the task forces you):

One optimized call (+ retrieval, + few-shot) — try this first, always.
A workflow — a predefined multi-step pipeline (chain / route / parallel / orchestrator-workers / evaluator-optimizer) when steps are knowable.
An agent — only when the steps cannot be predetermined and the model genuinely needs to decide its own path at runtime.

The “should I build an agent?” checklist

Don’t reach for an agent by default. Run four questions:

Question	Build an agent when…	Prefer a workflow / single call when…
Complexity	Steps/paths can’t be enumerated ahead of time; needs runtime decisions	The flow is fixed and predictable
Value	The task is valuable enough to justify higher latency + token cost	Cheap/simple; agent overhead isn’t worth it
Viability	The model can actually do it reliably and you can give it good, low-overlap tools (a clean ACI — Agent-Computer Interface)	Capability or tooling is shaky → it’ll thrash
Cost of error	Errors are cheap, reversible, and discoverable (or gated by a human)	Mistakes are costly/irreversible → bound it, add approval gates

Interview angle — “workflow vs. agent, when?” Frame the answer around autonomy and four tradeoffs: predictability of the steps, trust in the model’s judgment for the task, the cost/latency budget, and how hard errors are to recover from. Workflows win on reliability, cost, and debuggability; agents win on open-endedness. “Start at the lowest complexity that works and only add autonomy when the steps genuinely can’t be predetermined.” Strong candidates volunteer that agents are not automatically better — they’re a cost and a risk you take on deliberately.

Compounding error — the math that argues against over-engineering

Multi-step autonomy multiplies per-step reliability:

  90% correct per step, 5 sequential steps:  0.9 ^ 5 ≈ 0.59   (~59% end-to-end)
  95% per step, 10 steps:                    0.95 ^ 10 ≈ 0.60

Every step you let the model take is a place it can go wrong, and errors compound. That’s the quantitative core of “don’t add autonomy you don’t need,” and it’s why workflows (fewer model-decided branches) are more reliable. It also motivates the rest of the bootcamp: guardrails, verifiers, deterministic checks before model-judged ones, and human-in-the-loop on irreversible actions.

Common pitfalls / gotchas

Assuming the model has memory. It doesn’t. Forgetting to re-send history is the #1 “why did it forget?” bug. The client owns state.
Thinking temperature=0 is deterministic. It isn’t (FP — Floating-Point — non-associativity). Constrain format; stub by input hash in tests.
Treating system as a trust boundary. It’s a soft priority, not security. User turns and (especially) tool/retrieval content can override it → prompt injection.
Confusing tokens with words/characters. Budgets, billing, and truncation are all in tokens; non-English/code tokenize less efficiently.
“More context is always better.” Finite budget + quadratic attention + lost-in-the-middle → relevant context beats more context.
Reaching for an agent by default. Most tasks are a single call or a workflow. Autonomy multiplies error (0.9^5 ≈ 59%) and cost.
A loop with no max_steps guard. Verifier stalls / non-terminating models become runaway cost. Always bound the loop.
Conflating “agentic” with “fully autonomous.” It’s a spectrum; most production “agents” are bounded workflows with a little model-directed flex.

Key takeaways

An LLM is a stateless, autoregressive next-token predictor over tokens, bounded by a finite context window, steered by sampling and roles.
Context is the only memory. Multi-turn chat = the client re-sending the full history every call. This seeds memory, RAG, caching, and context engineering.
An agent = an LLM in a loop with tools + a goal, where the model chooses the next step from feedback. The loop exists because one call is one step and the path isn’t predetermined; feeding real observations back is what grounds it.
Agent vs. workflow is a spectrum defined by who decides what’s next. Use the complexity / value / viability / cost-of-error checklist, respect compounding error (0.9^5 ≈ 59%), and climb the complexity ladder only as far as the task forces you.

Next module builds the loop for real: the ReAct Thought → Action → Observation cycle, parsed from the model’s own output.