P1-C4 · Neural Networks / LLM Intuition (No Math)¶
Core Takeaway
If you don't understand how LLMs work, you can't invest in the AI industry — because you won't be able to answer "why do we need GPU / HBM / nuclear power?"
AI Industry Knowledge — History → Technology → Supply Chain → Business → Applications → Geopolitics
P1-C4 (Part 1, Chapter 4). After this chapter, you'll be able to explain LLM training / inference / token / context window without math — laying the foundation for the next chapter (hardware reverse engineering).
1. The Problem: You've Heard "H100 trains LLMs" / "Context window 200K" — But What Do They Mean?¶
You write a thesis: "NVDA bull because hyperscalers need more compute to train LLMs" — but you can't explain:
- Why does LLM training need so much compute? Can't you train with CPUs?
- What is a context window? Why is the gap between Claude 200K vs ChatGPT 32K so significant?
- What is a token? Why does OpenAI charge by token?
- Training vs. Inference — which consumes more compute?
If you can't answer = your thesis has numbers floating on the surface, you don't know how NVDA's products are used by customers → you can't see changes in customer demand.
2. The Solution: 3 Analogies to Demystify LLMs¶
| Analogy | Corresponding LLM Concept |
|---|---|
| A child reading billions of books to learn to speak | Training — finding statistical relationships between words |
| A word chain game | Inference — given a start, generate one word at a time |
| Looking back at previous text while writing an essay | Attention — selecting which prior words are important |
Once you grasp these 3 analogies, LLMs become demystified — you know they are fundamentally statistical models, not magic.
3. How It Works: Training / Inference / Key Terms¶
3.1 Training = A Child Reading Books to Learn to Speak¶
Imagine a child. You have them read all the text on the entire internet (~10 trillion tokens): - They see "cat" and "animal" together 100 million times → they learn "cats are animals" - They see "Apple Q4 revenue $124B" 10 million times → they learn financial report sentence structures - They see Python code 1 million times → they learn how to write functions
LLM training is this process — but the child uses a brain, the LLM uses parameters (billions to trillions of numbers in a neural network). More parameters → more detailed patterns they can remember → smarter.
Compute: Training GPT-4 is estimated to cost ~$100M (10,000+ H100 GPUs running for 6 months). This is where half of the hyperscaler's $725B capex goes.
3.2 Inference = The Word Chain Game¶
After training, you give the LLM a starting sentence ("The weather today is"), and it generates one token at a time in a chain: - "The weather today is" → "nice" (highest probability) - "The weather today is nice" → "," (highest probability) - "The weather today is nice," → "I" (highest probability) - ...
Every time it generates 1 token, it has to recalculate the entire context (this is why a larger context window → slower + more expensive inference).
Compute: Inference is much cheaper (single Q&A < $0.01), but with many users, the scale is huge — ChatGPT 300M MAU × 10 times/day = 3 billion inferences/day. This is where the other half of capex is spent.
3.3 5 Key Terms¶
| Term | Intuitive Explanation | Investment Significance |
|---|---|---|
| Token | A word or word fragment (Chinese ~1 character = 1 token, English ~0.75 words = 1 token) | OpenAI / Anthropic API charges by token; 1B tokens ≈ $0.5-30 (varies by model) |
| Parameter | Model size. 175B (GPT-3), ~1.7T (GPT-4 est.) | More parameters → more expensive training (scaling laws), but also more expensive inference |
| Context window | How many tokens the model can see at once (Claude 200K-1M, ChatGPT 32K-128K) | Large context = can process entire books, but inference compute scales quadratically (KV cache) |
| Training compute | Total compute to train a model once | GPT-4 ~$100M, hyperscalers spend the bulk here |
| Inference compute | Compute to run a single user request | Proportional to user count. ChatGPT's total inference compute > training (due to scale) |
3.4 Reasoning Models (o1 / o3) — The Next Generation¶
2024+ OpenAI o1, Anthropic Claude (extended thinking), DeepSeek R1 — reasoning models.
Difference from standard LLMs: They don't give a direct answer. Instead, they internally think for thousands of tokens, then provide the answer. This pushes inference compute from $0.01 per query to $1+ (100x).
→ Investment Significance: Inference compute suddenly becomes a new growth curve. Hyperscaler capex is no longer just for training, but also for massive inference. NVDA Blackwell optimizes for inference, capturing the inference market.
4. vs. What You Already Know from C3¶
| Dimension | C3 Gives You | C4 Gives You More |
|---|---|---|
| NVDA's historical position | ✓ | Doesn't explain product internals |
| LLM internal mechanisms | ✗ | Training / inference / token / context / parameters |
| Investment significance | Knows NVDA's moat | Knows where hyperscaler capex is spent (2 buckets) (training + inference) — analyzing capex split between train vs. infer is a new dimension |
C3 = The company. C4 = Product internals. Without C4, you can't tell if hyperscaler capex is short-term (training) or long-term (inference infrastructure) — these are different theses.
5. Try It: Verify Your LLM Intuition with ChatGPT¶
Task (15 minutes): Open ChatGPT or Claude, ask 3 questions:
- "What is your context window size? Tell me in tokens."
- "What is your training data cutoff date? What events does this mean you don't know about?"
- "Explain Transformer attention in one sentence, for a 5-year-old."
Then observe: - It can answer 1 / 2 (told during training) - It will use an analogy for 3 — see if it resonates with your §3 analogies
Self-check (3 items met → proceed to P1-C5):
- You can explain in one sentence why a larger context window → more expensive inference (KV cache quadratic scaling)
- You can distinguish the cost drivers of training compute vs. inference compute
- You can explain "why reasoning models (o1) are 100x more expensive than standard LLMs"
6. What's Next¶
You now understand LLM internals. Now, reverse the logic: How does the way LLMs work → reverse-engineer why we need GPU / HBM / NVLink / liquid cooling / nuclear power?
Each piece of hardware addresses one bottleneck of LLMs.
→ P1-C5 · Why GPU / HBM / Liquid Cooling / Nuclear Power Reverse-engineer the entire hardware stack from LLM compute demands.
7. Deep Dive (optional): RLHF / Temperature / Sampling / Agentic Loop¶
Click to see 4 advanced LLM concepts
RLHF (Reinforcement Learning from Human Feedback): The key step from GPT-3 → ChatGPT. A standard LLM is trained only to "continue text"; after RLHF, it learns to "follow instructions." → The InstructGPT (2022) paper was the start. Anthropic's Constitutional AI is another variant.
Temperature: A parameter controlling the "creativity" of LLM output. 0 = deterministic (same input → same output every time). 0.7 = balanced. 1.5 = creative. → Required in API calls. You need to know this if your thesis involves LLM applications (e.g., code generation → low temperature, poetry → high temperature).
Sampling: Given the next-token probability distribution, how to choose one? Greedy (pick highest) / Top-k / Top-p / Beam search. Different sampling methods produce vastly different quality.
Agentic loop: LLM + tool-calling loop (your input → LLM thinks → calls a tool → gets result → continues → answers you). → Claude Code / Cursor / Devin all follow this pattern. Inference compute is 10-100x more than a single turn — this is a new growth curve (compounding inference compute).