Foundations: LLMs as the Agent Engine
The whole bootcamp builds agents. Before we wire up tools, memory, RAG (Retrieval-Augmented Generation), and multi-agent systems, you need a precise mental model of the one component at the center of all of it: the LLM (Large Language Model). Get this right and everything later (the ReAct — Reasoning and Acting — loop, context windows, prompt caching, injection defense) follows from first principles instead of feeling like trivia.
Learning objectives
Section titled “Learning objectives”By the end you can, cold, in an interview:
- Describe what an LLM actually is for engineering purposes: a stateless, autoregressive next-token predictor over tokens, bounded by a finite context window, steered by sampling (temperature/top-p) and by the system / user / assistant role structure.
- Explain why “multi-turn chat” is an illusion the client maintains by re-sending the full history every call — the model remembers nothing.
- Define an agent: an LLM in a loop, with tools and a goal, choosing its own next step from feedback — and articulate why a loop is the necessary shape.
- Place any system on the agent ⇄ workflow spectrum (Anthropic’s distinction) and run the “should I build an agent?” checklist (complexity, value, viability, cost-of-error) — including the compounding error math that argues against unnecessary autonomy.
Run alongside this lesson:
python 00_foundations/demo.py # offline, deterministicpython 00_foundations/demo.py --live # optional: real model via OpenRouterpytest 00_foundations/ -q # the worked code is tested1. What an LLM actually is (for agent purposes)
Section titled “1. What an LLM actually is (for agent purposes)”Strip away the marketing. A large language model is a function:
┌─────────────────────────────────────────────┐ tokens → │ f(context) -> probability over next token │ → one token └─────────────────────────────────────────────┘ (then append, repeat)Everything below is a consequence of that single fact.
1.1 Next-token predictor (autoregressive)
Section titled “1.1 Next-token predictor (autoregressive)”The model does not “answer questions.” It predicts a probability distribution over the next token, given everything so far. The runtime samples one token, appends it, and feeds the longer sequence back in — autoregression. A paragraph is just this loop run a few hundred times. “Reasoning,” “tool calls,” and “JSON” (JavaScript Object Notation) are all the same machinery: text that was statistically likely given the context and the training distribution.
Why this matters for agents: there is no separate “planning module.” When an agent reasons (ReAct’s
Thought:) or decides to call a tool, that is generated text, sampled the same way as any other token. This is exactly why prompt format, examples, and tool descriptions move the needle so hard — you are shaping a probability distribution, not configuring a program.
1.2 Tokens (not words, not characters)
Section titled “1.2 Tokens (not words, not characters)”Text is chopped into tokens (subword chunks ≈ 3–4 characters / ~0.75 words of English on average). Tokens are the unit of:
- the context window (the limit is in tokens, not characters),
- billing (you pay per input + output token), and
- latency (output tokens are generated one at a time — long outputs are slow).
Consequences engineers trip on: token budgets are approximate and model-specific; non-English and code tokenize less efficiently; “just dump the whole document in” has a real, metered cost; and you cannot reason about cost or truncation without thinking in tokens.
1.3 The context window = the only working memory
Section titled “1.3 The context window = the only working memory”The context window is the maximum number of tokens the model can attend to in a single call (mid-2026: roughly ~128K for GPT-4o-class models, ~200K for Claude, with 1M-token variants available). Inside that window, everything competes for the same finite budget: system instructions, the conversation history, tool definitions, retrieved documents, and tool outputs.
This is the single most load-bearing constraint in agent engineering:
[ system prompt | tool specs | history | retrieved docs | tool outputs ] └──────────────── one finite token budget per call ────────────────────┘Whole later modules exist because of this limit: memory (what to keep vs. externalize), RAG (fetch only the relevant slice), context engineering / compaction (summarize history before it overflows), prompt caching (don’t re-pay to process a static prefix). Attention cost also grows ~quadratically with length, so quality and speed degrade non-linearly as the window fills — more context is not free and not always better (“context rot”).
1.4 Stateless — the model remembers nothing between calls
Section titled “1.4 Stateless — the model remembers nothing between calls”Each API (Application Programming Interface) call is independent. The model holds no session, no memory of the
previous turn. Its entire knowledge of “this conversation” is the messages
list you send this call.
So how does ChatGPT “remember” your name? The client does: it stores the running transcript and re-sends the whole history on every call.
Turn 2 request you must send: [ {user: "My name is Ada."}, {assistant: "Nice to meet you, Ada."}, {user: "What is my name?"} ] ← turn-1 history re-sent, or it's gonedemo.py makes this concrete: drop the history and the model literally cannot
recall the name (it isn’t in the request); re-send it and it “remembers.” It was
never memory — it was context reconstruction.
Interview one-liner: “LLMs are stateless. ‘Memory’ is the client re-sending the full transcript every call; long-term memory is just deciding what to put back into that context and what to store externally.” This single sentence is the seed of the entire memory module.
1.5 Sampling: temperature and top-p
Section titled “1.5 Sampling: temperature and top-p”The model emits a distribution; sampling picks the actual token.
- temperature — flattens (high, e.g. 0.8–1.0: more random/creative/varied) or sharpens (low, e.g. 0: pick the most likely token, “greedy”) the distribution.
- top-p (nucleus) — sample only from the smallest set of tokens whose probabilities sum to p (e.g. 0.9), truncating the long tail.
Rules of thumb: low temperature for extraction / classification / tool-arg generation (you want consistency); higher for brainstorming / drafting.
Gotcha (great interview catch):
temperature=0is not truly deterministic. Floating-point non-associativity in parallel GPU (Graphics Processing Unit) attention can change which token wins a near-tie, so identical inputs can still diverge. For real determinism you constrain the format (structured/JSON mode) and, in tests, stub the model by input hash — you don’t rely ontemperature=0.
1.6 Roles: system / user / assistant
Section titled “1.6 Roles: system / user / assistant”Chat models take a list of messages, each with a role:
| Role | Who / what it is | Use it for |
|---|---|---|
system | Operator instructions, persona, rules, tool policy | Behavior, guardrails, “how to act” |
user | The human’s (or upstream caller’s) input | The task / question |
assistant | The model’s prior replies (and tool-call requests) | History; what the model “said” |
Roles are a soft priority signal the model was trained to honor — system
generally outranks user. Two consequences you must internalize early:
systemis not a security boundary. A determined user turn — or, worse, malicious text smuggled in via a tool result / retrieved document (indirect prompt injection) — can override it. Treat every role’s content as potentially adversarial (the security module goes deep here).- In agentkit,
systemis passed separately:llm.complete(messages, system="...", tools=[...]). Providers differ on the wire (Anthropic has a top-levelsystem; OpenAI uses asystem/developermessage) — agentkit normalizes this so your loop never cares.
1.7 Prompting, in one breath
Section titled “1.7 Prompting, in one breath”A prompt is just the assembled context (system + messages [+ tools]) you hand the predictor. Because output is sampled text conditioned on that context, the levers are: clear instructions, few-shot examples (show the output grammar you want), role structure, and putting the relevant material in context (retrieval). Prompting is programming the distribution — there is no other API surface to the model’s behavior.
2. From a model to an agent
Section titled “2. From a model to an agent”A single .complete(...) call does exactly one step: read the context,
produce the next chunk of text (possibly a request to call a tool). That’s it.
Most real goals need many steps whose number and shape aren’t known up
front — you can’t tell in advance how many web searches a research question
needs, or how many edits a bug fix takes.
So you wrap the one-step model in a loop:
agentic loop (the shape everything later refines):
context = [system, goal] for step in range(MAX_STEPS): # <-- the guard is non-negotiable response = llm.complete(context) # model decides the NEXT step if response.is_final: # model says "I'm done" return response.answer result = run(response.tool_call) # act on the world context += [response, result] # feed reality back in (observe) raise RanOutOfSteps # safety net if it never finishesDefinition (say it exactly like this): An agent is an LLM in a loop with tools and a goal, where the model — not hard-coded control flow — decides the next action from feedback, until it judges the goal met (or a guard stops it). Anthropic’s gloss: “agents are typically just LLMs using tools in a loop based on environmental feedback.”
Why a loop? (make this intuition land)
Section titled “Why a loop? (make this intuition land)”Three reasons, each worth stating out loud:
- One call = one step. The model is a single-step function. Multi-step work requires re-invoking it. The loop is the only way to get step 2.
- The path isn’t predetermined. If you knew the steps ahead of time you’d write a script (a workflow, §3). You reach for a loop precisely when the model must choose what to do next based on what it just learned.
- Grounding beats guessing. Each iteration feeds a real observation (a tool result) back into the context, so the next step builds on verified data instead of an earlier guess. (This is exactly why the ReAct loop reduces hallucination vs. pure chain-of-thought — a closed reasoning loop propagates a fabricated fact; an open loop replaces speculation with a real result after each step. We prove this in the ReAct module.)
The litmus test you can recite: “If you find yourself writing
while not done: response = call_llm(...)— where the model decidesdoneand what to do next — you have an agent. If the control flow is fixed and you decide the order, you have a workflow.”
demo.py Part 4 shows the bare skeleton: a scripted “model” takes two steps and
then emits DONE — the model decided when to stop, not the code — and a
max_steps guard catches a model that never stops. (Removing that guard is the
real-world incident where an agent looped for 11 days and burned ~$47K; always
bound the loop.)
3. Agent vs. workflow, and “should I build an agent?”
Section titled “3. Agent vs. workflow, and “should I build an agent?””Anthropic (“Building Effective Agents”) draws the load-bearing line:
- Workflow = LLMs and tools orchestrated through predefined code paths. The programmer decides the order. Deterministic, debuggable, cheap.
- Agent = the LLM dynamically directs its own process and tool use. The model decides the order. Flexible, open-ended, harder to bound.
It’s a spectrum, and both are “agentic systems.” The single differentiator is autonomy: who decides what’s next — your code, or the model?
less autonomy ─────────────────────────────────────────────► more autonomy single call → prompt chain → routing → orchestrator-workers → autonomous agent (workflows: you wire the path) | (agent: the model wires the path)A practical ladder (climb only as far as the task forces you):
- One optimized call (+ retrieval, + few-shot) — try this first, always.
- A workflow — a predefined multi-step pipeline (chain / route / parallel / orchestrator-workers / evaluator-optimizer) when steps are knowable.
- An agent — only when the steps cannot be predetermined and the model genuinely needs to decide its own path at runtime.
The “should I build an agent?” checklist
Section titled “The “should I build an agent?” checklist”Don’t reach for an agent by default. Run four questions:
| Question | Build an agent when… | Prefer a workflow / single call when… |
|---|---|---|
| Complexity | Steps/paths can’t be enumerated ahead of time; needs runtime decisions | The flow is fixed and predictable |
| Value | The task is valuable enough to justify higher latency + token cost | Cheap/simple; agent overhead isn’t worth it |
| Viability | The model can actually do it reliably and you can give it good, low-overlap tools (a clean ACI — Agent-Computer Interface) | Capability or tooling is shaky → it’ll thrash |
| Cost of error | Errors are cheap, reversible, and discoverable (or gated by a human) | Mistakes are costly/irreversible → bound it, add approval gates |
Interview angle — “workflow vs. agent, when?” Frame the answer around autonomy and four tradeoffs: predictability of the steps, trust in the model’s judgment for the task, the cost/latency budget, and how hard errors are to recover from. Workflows win on reliability, cost, and debuggability; agents win on open-endedness. “Start at the lowest complexity that works and only add autonomy when the steps genuinely can’t be predetermined.” Strong candidates volunteer that agents are not automatically better — they’re a cost and a risk you take on deliberately.
Compounding error — the math that argues against over-engineering
Section titled “Compounding error — the math that argues against over-engineering”Multi-step autonomy multiplies per-step reliability:
90% correct per step, 5 sequential steps: 0.9 ^ 5 ≈ 0.59 (~59% end-to-end) 95% per step, 10 steps: 0.95 ^ 10 ≈ 0.60Every step you let the model take is a place it can go wrong, and errors compound. That’s the quantitative core of “don’t add autonomy you don’t need,” and it’s why workflows (fewer model-decided branches) are more reliable. It also motivates the rest of the bootcamp: guardrails, verifiers, deterministic checks before model-judged ones, and human-in-the-loop on irreversible actions.
Common pitfalls / gotchas
Section titled “Common pitfalls / gotchas”- Assuming the model has memory. It doesn’t. Forgetting to re-send history is the #1 “why did it forget?” bug. The client owns state.
- Thinking
temperature=0is deterministic. It isn’t (FP — Floating-Point — non-associativity). Constrain format; stub by input hash in tests. - Treating
systemas a trust boundary. It’s a soft priority, not security. User turns and (especially) tool/retrieval content can override it → prompt injection. - Confusing tokens with words/characters. Budgets, billing, and truncation are all in tokens; non-English/code tokenize less efficiently.
- “More context is always better.” Finite budget + quadratic attention + lost-in-the-middle → relevant context beats more context.
- Reaching for an agent by default. Most tasks are a single call or a workflow. Autonomy multiplies error (0.9^5 ≈ 59%) and cost.
- A loop with no
max_stepsguard. Verifier stalls / non-terminating models become runaway cost. Always bound the loop. - Conflating “agentic” with “fully autonomous.” It’s a spectrum; most production “agents” are bounded workflows with a little model-directed flex.
Key takeaways
Section titled “Key takeaways”- An LLM is a stateless, autoregressive next-token predictor over tokens, bounded by a finite context window, steered by sampling and roles.
- Context is the only memory. Multi-turn chat = the client re-sending the full history every call. This seeds memory, RAG, caching, and context engineering.
- An agent = an LLM in a loop with tools + a goal, where the model chooses the next step from feedback. The loop exists because one call is one step and the path isn’t predetermined; feeding real observations back is what grounds it.
- Agent vs. workflow is a spectrum defined by who decides what’s next. Use the complexity / value / viability / cost-of-error checklist, respect compounding error (0.9^5 ≈ 59%), and climb the complexity ladder only as far as the task forces you.
Next module builds the loop for real: the ReAct Thought → Action → Observation cycle, parsed from the model’s own output.