In partnership with

Intro

AI architecture conversations are moving fast, and the vocabulary shifts every quarter. Senior engineers drop terms like context window, eval set, and agent loop the way they used to drop eventual consistency and sharding. Most of the engineering leaders I talk to want the same thing: a tighter, working mental model of what an LLM actually is. Not a course or a paper. Enough to understand a design review with confidence and ask the right questions.

This kicks off Engineering Leader's Foundations, a recurring series on the AI and CS concepts behind the decisions your team is already making. We start with the foundation: five things about LLMs, designed to fit in your head and pay off the next time a real decision is on the table.

Why a Mental Model Helps Here

LLMs are unusually easy to use without forcing comprehension. You can prompt one productively for months and never need to know what's happening underneath. That's actually a credit to the tools, and it's a different curve than Postgres or Kubernetes, both of which teach you their internals by punishing you when you skip them. LLMs are friendlier teachers, which is great for adoption and quietly means the underlying model is something you have to go get on purpose.

The payoff for going to get it is real. Architecture choices, cost estimates, reliability claims: these all rest on properties of the model that don't reveal themselves through prompting. A small, correct mental model lets you read those conversations the way you read a system design doc: knowing where to slow down, where to ask, and where to trust.

The good news is that the model is small. You don't need to train one. You don't need transformer math. These five concepts will carry you a long way.

The Five-Thing Mental Model

1. An LLM is a stateless function. Every call starts from zero. The model has no memory of the previous request, the previous user, or the previous turn of a conversation. When it appears to "remember," that's because something in your stack re-sent the relevant history as part of the next request. Memory is engineering, not magic. So when your team talks about "building memory," what they're really designing is a system that decides what to re-send.

2. The unit of everything is the token. Tokens are roughly word-fragments, about four characters of English on average. The model sees tokens, predicts tokens, charges per token, and is bounded by tokens. Cost, latency, quality, and limits are all token conversations underneath. When you hear "100K context," that's tokens. When you see "$0.003 per 1K," that's tokens. Train your ear to translate everything into this unit.

3. The context window is a fixed-size workspace. Everything the model considers (system prompt, conversation history, retrieved documents, tool definitions, the user's current message) has to fit inside this window. It's not infinite, and it's not free even when it's large. Two consequences worth holding onto: packing more in costs more and slows things down, and models pay disproportionate attention to the start and end of the window. The middle gets fuzzy. Researchers call this the lost-in-the-middle effect, and it's why retrieval often beats "stuff the whole document in" even when the bigger window technically fits.

4. The output is sampled, not computed. The model produces a probability distribution over possible next tokens, then samples from it. The same input can produce different outputs. Temperature controls how adventurous the sampling is. The takeaway: an AI feature is a stochastic system, not a deterministic one. A demo that worked once is encouraging, not conclusive. This single fact reshapes how to think about testing, evals, and reliability.

5. Cost and latency scale with tokens, but asymmetrically. Input tokens are cheap and fast (the model reads them in parallel). Output tokens are expensive and slow (the model generates them one at a time). A response that's 10x longer is roughly 10x slower for the user, regardless of how clever the prompt is. When an AI feature feels sluggish, the first place to look is rarely the model itself. It's how many tokens we're asking it to produce.

That's the model. Stateless function. Tokens in, tokens out. Fixed-size workspace with a soft middle. Stochastic sampling. Asymmetric cost.

What This Changes in How You Show Up

A few concrete ways the model pays off in your week.

Scope features around what the model actually is. If a feature needs the model to "remember" something across sessions, the real work is the memory system, not the model call. Worth confirming with your team that that's what's been scoped.

Read cost estimates with the right unit in mind. A useful question: "What's the average input and output token count per request, and what's our requests-per-day at scale?" If those numbers aren't ready yet, the estimate isn't ready yet, and that's a useful thing to surface early, before commitments harden.

Treat "let's just use a longer context window" as a design choice, not a default. Sometimes it's exactly right. Other times retrieval would do the job for less cost and better quality. A good follow-up is: "What did we try retrieval-wise first, and what did the eval show?" The lost-in-the-middle effect is well-documented, and it's worth a minute of conversation.

Treat output length as a UX decision. When something feels slow, the most actionable lever is often "produce less." Streaming improves perceived latency, constraining output length improves both perceived and real latency, and either is usually faster than chasing a faster model.

Make space for evals. Because outputs are sampled, "it worked when I tried it" is a starting point, not a conclusion. The most credible reliability statement is one backed by a repeatable evaluation set with measured pass rates. If your team is still building toward that, supporting the work to get there is one of the highest-leverage things a leader can do this quarter.

The work of an engineering leader has always included staying conversant with the substance of what your team is building. The substance is just moving faster than usual right now. A small, sturdy mental model is how you keep up without burning your evenings, and it's the foundation everything else in this series will build on.

Leadership Action Item of the Week

A decision question: Which of the five concepts above would feel best to be able to explain crisply in your next design review? Pick one, spend an hour with it this week, and notice the difference in the next architecture conversation.

An action step: in a current AI feature your team is building, ask three friendly questions.

1) What's the average token count per request, in and out?

2) What does the eval set look like?

3) Where does state live?

The point isn't to test anyone. It's to make those three things explicit, because once they're explicit they're easier to plan around together.

A reflection prompt: Of these five concepts, which one would I most like to be the person on my team who can explain it simply to a peer?

What’s Next?

RAG vs Fine-Tuning vs Long Context: A Decision Tree for Leaders
What "Eval" Actually Means: The Smallest Version That Pays Off
Agent Architectures Explained Through Org Design
Embeddings for Engineering Managers
Want something covered? Hit reply and tell me. I love hearing what you’re dealing with.

Work With Me

Resume Review

A detailed review of your resume with specific, actionable feedback to strengthen your story, highlight impact, and position you for Engineering IC or Leadership roles.

Mock Interviews

A practice session tailored to Engineering IC or Leadership roles. You’ll get structured feedback, real scenarios, and clarity on what interviewers actually look for.

1:1 Mentorship

A session focused on your career growth, navigating leadership challenges and building a roadmap toward your next role.

📬 Reply back to this email to book a 30 min session (free for subscribers!)

Meme of The Week

We've all been on both sides of this. 😅

Your prompts are leaving out 80% of what you're thinking.

When you type a prompt, you summarize. When you speak one, you explain. Wispr Flow captures your full reasoning — constraints, edge cases, examples, tone — and turns it into clean, structured text you paste into ChatGPT, Claude, or any AI tool. The difference shows up immediately. More context in, fewer follow-ups out.

89% of messages sent with zero edits. Used by teams at OpenAI, Vercel, and Clay. Try Wispr Flow free — works on Mac, Windows, and iPhone.

Start flowing free

That’s a wrap for this week’s issue of CodingBeenz! 👩‍💻

The vocabulary around AI is going to keep moving. The underlying mental model moves much more slowly. These five concepts will carry you through a lot of design reviews this year, and the same five things will still be true when the next round of frameworks rolls in. 🚀

Until next time,

Sabeen 🐝

P.S. In the next article we're putting the mental model to work on the first real architecture question it unlocks: RAG vs fine-tuning vs long context. If your team has had any version of this debate in the last quarter, the decision tree should make the next one a lot shorter. 👋

The Engineering Leader's Mental Model of LLMs