P1-C4 · Neural Networks / LLM Intuition (No Math)¶

Core Takeaway

If you don't understand how LLMs work, you can't invest in the AI industry — because you won't be able to answer "why do we need GPU / HBM / nuclear power?"

AI Industry Knowledge — History → Technology → Supply Chain → Business → Applications → Geopolitics

P1-C4 (Part 1, Chapter 4). After this chapter, you'll be able to explain LLM training / inference / token / context window without math — laying the foundation for the next chapter (hardware reverse engineering).

1. The Problem: You've Heard "H100 trains LLMs" / "Context window 200K" — But What Do They Mean?¶

You write a thesis: "NVDA bull because hyperscalers need more compute to train LLMs" — but you can't explain:

Why does LLM training need so much compute? Can't you train with CPUs?
What is a context window? Why is the gap between Claude 200K vs ChatGPT 32K so significant?
What is a token? Why does OpenAI charge by token?
Training vs. Inference — which consumes more compute?

If you can't answer = your thesis has numbers floating on the surface, you don't know how NVDA's products are used by customers → you can't see changes in customer demand.

2. The Solution: 3 Analogies to Demystify LLMs¶

Analogy	Corresponding LLM Concept
A child reading billions of books to learn to speak	Training — finding statistical relationships between words
A word chain game	Inference — given a start, generate one word at a time
Looking back at previous text while writing an essay	Attention — selecting which prior words are important

Once you grasp these 3 analogies, LLMs become demystified — you know they are fundamentally statistical models, not magic.

3. How It Works: Training / Inference / Key Terms¶

3.1 Training = A Child Reading Books to Learn to Speak¶

Imagine a child. You have them read all the text on the entire internet (~10 trillion tokens): - They see "cat" and "animal" together 100 million times → they learn "cats are animals" - They see "Apple Q4 revenue $124B" 10 million times → they learn financial report sentence structures - They see Python code 1 million times → they learn how to write functions

LLM training is this process — but the child uses a brain, the LLM uses parameters (billions to trillions of numbers in a neural network). More parameters → more detailed patterns they can remember → smarter.

Compute: Training GPT-4 estimated at ~$100M (~25,000 A100 GPUs ~3 months; H100 was not yet in mass production when training completed) — ⚠️ GPT-4 model size / hardware / training compute / dataset / cost was NOT disclosed by OpenAI (per GPT-4 Technical Report: "this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method"); $100M / 25K A100 are external estimates (industry estimates, not OpenAI official). This is where half of the hyperscaler's $725B capex goes.

3.2 Inference = The Word Chain Game¶

After training, you give the LLM a starting sentence ("The weather today is"), and it generates one token at a time in a chain: - "The weather today is" → "nice" (highest probability) - "The weather today is nice" → "," (highest probability) - "The weather today is nice," → "I" (highest probability) - ...

Every time it generates 1 token, it has to recalculate the entire context (this is why a larger context window → slower + more expensive inference).

Compute: Inference is much cheaper (single Q&A < $0.01), but with many users, the scale is huge — assuming DAU ~135-225M (based on OpenAI 2026/03/31 official >900M WAU × typical 0.15-0.25 DAU/WAU ratio; OpenAI does not disclose DAU), at 10 queries/day per DAU ≈ 1.35-2.25 billion inferences/day (rough estimate). This is where the other half of capex is spent.

3.3 5 Key Terms¶

Term	Intuitive Explanation	Investment Significance
Token	A word or word fragment (Chinese ~1 character = 1 token, English ~0.75 words = 1 token)	OpenAI / Anthropic API charges by token; 1B tokens ≈ $0.5-30 (varies by model)
Parameter	Model size. 175B (GPT-3 disclosed), ~1.7T (GPT-4 external estimate — OpenAI did NOT disclose) ⚠️	More parameters → more expensive training (scaling laws), but also more expensive inference
Context window	How many tokens the model can see at once (Claude 200K-1M, ChatGPT 32K-128K)	Large context = can process entire books, but inference compute scales quadratically (KV cache)
Training compute	Total compute to train a model once	GPT-4 ~$100M (external estimate, OpenAI did NOT disclose GPT-4 training compute / cost), hyperscalers spend the bulk here
Inference compute	Compute to run a single user request	Proportional to user count. ChatGPT's total inference compute > training (due to scale)

3.4 Reasoning Models (o1 / o3) — The Next Generation¶

2024+ OpenAI o1, Anthropic Claude (extended thinking), DeepSeek R1 — reasoning models.

Difference from standard LLMs: They don't give a direct answer. Instead, they internally think for thousands of tokens, then provide the answer. This pushes inference compute from $0.01 per query to $1+ (100x).

→ Investment Significance: Inference compute suddenly becomes a new growth curve. Hyperscaler capex is no longer just for training, but also for massive inference. NVDA Blackwell optimizes for inference, capturing the inference market.

4. vs. What You Already Know from C3¶

Dimension	C3 Gives You	C4 Gives You More
NVDA's historical position	✓	Doesn't explain product internals
LLM internal mechanisms	✗	Training / inference / token / context / parameters
Investment significance	Knows NVDA's moat	Knows where hyperscaler capex is spent (2 buckets) (training + inference) — analyzing capex split between train vs. infer is a new dimension

C3 = The company. C4 = Product internals. Without C4, you can't tell if hyperscaler capex is short-term (training) or long-term (inference infrastructure) — these are different theses.

5. Try It: Verify Your LLM Intuition with ChatGPT¶

Task (15 minutes): Open ChatGPT or Claude, ask 3 questions:

"What is your context window size? Tell me in tokens."
"What is your training data cutoff date? What events does this mean you don't know about?"
"Explain Transformer attention in one sentence, for a 5-year-old."

Then observe: - It can answer 1 / 2 (told during training) - It will use an analogy for 3 — see if it resonates with your §3 analogies

Self-check (3 items met → proceed to P1-C5):

You can explain in one sentence why a larger context window → more expensive inference (KV cache quadratic scaling)
You can distinguish the cost drivers of training compute vs. inference compute
You can explain "why reasoning models (o1) are 100x more expensive than standard LLMs"

6. What's Next¶

You now understand LLM internals. Now, reverse the logic: How does the way LLMs work → reverse-engineer why we need GPU / HBM / NVLink / liquid cooling / nuclear power?

Each piece of hardware addresses one bottleneck of LLMs.

→ P1-C5 · Why GPU / HBM / Liquid Cooling / Nuclear Power Reverse-engineer the entire hardware stack from LLM compute demands.

7. Deep Dive (optional): RLHF / Temperature / Sampling / Agentic Loop¶

Click to see 4 advanced LLM concepts

RLHF (Reinforcement Learning from Human Feedback): The key step from GPT-3 → ChatGPT. A standard LLM is trained only to "continue text"; after RLHF, it learns to "follow instructions." → The InstructGPT (2022) paper was the start. Anthropic's Constitutional AI is another variant.

Temperature: A parameter controlling the "creativity" of LLM output. 0 = deterministic (same input → same output every time). 0.7 = balanced. 1.5 = creative. → Required in API calls. You need to know this if your thesis involves LLM applications (e.g., code generation → low temperature, poetry → high temperature).

Sampling: Given the next-token probability distribution, how to choose one? Greedy (pick highest) / Top-k / Top-p / Beam search. Different sampling methods produce vastly different quality.

Agentic loop: LLM + tool-calling loop (your input → LLM thinks → calls a tool → gets result → continues → answers you). → Claude Code / Cursor / Devin all follow this pattern. Inference compute is 10-100x more than a single turn — this is a new growth curve (compounding inference compute).

8. Further Reading (this chapter — neural network / LLM intuition)¶

All free sources, aligned with P5 0-paid policy

Classic papers / primary sources:

Vaswani et al. "Attention Is All You Need" (2017) — The 8-page Transformer paper
Ouyang et al. "InstructGPT / RLHF" (OpenAI 2022) — The alignment method behind ChatGPT
Anthropic "Toy Models of Superposition" (2022) — Interpretability intro: see what's actually packed inside the model
Anthropic "Mapping the Mind of a Large Language Model" (2024) — Feature extraction inside Claude 3 Sonnet

Wikipedia (3-10 min):

"Artificial neural network" — Basic neural network concepts
"Large language model" — Full LLM coverage + scaling curves
"Reinforcement learning from human feedback" — RLHF methods + history

Videos / public lectures (~1-3 hr each):

3Blue1Brown "Neural networks" 4-video series — Visualized neural networks (~1 hr)
3Blue1Brown "But what is a GPT?" (~30 min) — Transformer intuition, visualized
Andrej Karpathy "Let's build GPT from scratch" (2 hr) — Hand-coded nano-GPT
Andrej Karpathy "Neural Networks: Zero to Hero" series — From micrograd to GPT, full set
Andrej Karpathy "Deep Dive into LLMs like ChatGPT" (3.5 hr, 2025) — Full training / inference / RLHF pipeline

Blogs (classic authors):

Lilian Weng "Prompt Engineering" — Concepts + practice
Lilian Weng "LLM Powered Autonomous Agents" — Agentic loop survey
Sebastian Raschka "Magazine: Ahead of AI" — Monthly LLM technical review, free subscription

Podcasts (1-3 hr each):

Lex Fridman #333 — Andrej Karpathy — 2.5 hr, LLM training intuition

Books (library):

Sebastian Raschka "Build a Large Language Model (From Scratch)" (2024) — Build an LLM line-by-line
Michael Nielsen "Neural Networks and Deep Learning" (free online at neuralnetworksanddeeplearning.com) — Textbook-level intro

Pair with this chapter's self-check:

After 3Blue1Brown's 4 episodes + Karpathy's "Intro to LLM" + Lilian Weng's "LLM Powered Agents," you should be able to answer "the 3 LLM analogies" and "why the agentic loop changes the compute curve."