P1-C5 · Why GPU / HBM / Liquid Cooling / Nuclear Power

Core Takeaway

Each piece of hardware in the stack addresses one LLM bottleneck. Whichever bottleneck gets unblocked, the stock of the company that unblocks it moves.

AI Industry Knowledge — History → Technology → Supply Chain → Business → Application → Geopolitics

P1-C5 (Part 1, Chapter 5). After this chapter, you can reverse-engineer why the entire hardware stack is designed this way from LLM working principles, without memorizing supply chain tickers.


1. The Problem: Why Can't You Train LLMs with CPUs?

You see hyperscalers spending $725B on GPUs, but never ask why they can't use cheaper CPUs. You see SK Hynix's stock soaring, but don't know how HBM differs from regular DRAM. You see Vertiv up 200% and think it's an air conditioning company.

How LLMs work (learned in C4) dictates every hardware requirement — you can derive the entire hardware stack from first principles, and then you won't need to memorize 60 supply chain tickers.


2. The Solution: LLM's 4 Core Requirements → 4 Hardware Categories

| LLM Needs | Physical Bottleneck | Solving Hardware | Key Players |
|---|---|---|---|
| Massive parallel matrix multiplication (training) | CPU serial is slow | GPU / ASIC | NVDA · AMD · Google TPU |
| Fast data feeding (don't let GPU wait) | DRAM bandwidth insufficient | HBM high-bandwidth memory | SK Hynix · Micron · Samsung |
| GPU-to-GPU communication (1000+ card clusters) | Standard Ethernet is slow | NVLink / InfiniBand / Optical modules | NVDA Mellanox · ANET · COHR |
| Cooling + stable high power | Air cooling can't handle 800W+ chips | Liquid cooling + Nuclear / Gas | VRT · CEG · VST · ETN |

Each link follows the same chain: "physical bottleneck → solving hardware → key company". Every hardware category therefore maps directly to a small group of companies.


3. How It Works: 4 Bottlenecks Explained in Detail

3.1 GPU vs CPU — Parallel Matrix Multiplication

LLM training spends roughly 99% of its compute time on matrix multiplication, because a neural network's weights are essentially large matrices.

  • CPU: 8-128 cores, each core handles complex tasks independently (like 100 PhDs)
  • GPU: 10,000+ cores, each core does simple arithmetic (like 10,000 elementary students doing addition/subtraction)
  • Matrix multiplication: 10,000 elementary students doing arithmetic is 100x faster than 100 PhDs
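The analogy works because every cell of a matrix product is an independent dot product, so the work can be split across as many simple workers as there are output cells. A minimal pure-Python sketch of that idea (the function names `dot` and `matmul` are illustrative, not from any library, and nothing here runs on a GPU):

```python
def dot(row, col):
    """One 'elementary student' task: multiply pairs, then add them up."""
    return sum(a * b for a, b in zip(row, col))


def matmul(A, B):
    """Multiply A (m x k) by B (k x n) as m * n independent dot products."""
    cols = list(zip(*B))  # columns of B
    # No output cell depends on any other, so in principle all of them
    # could be computed at the same time by separate workers.
    return [[dot(row, col) for col in cols] for row in A]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
independent_tasks = len(A) * len(B[0])  # one task per output cell
print(C, independent_tasks)
```

A 4096 × 4096 output has about 16.8 million such cells. A CPU works through them a few dozen at a time, while a GPU keeps thousands in flight at once.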

**NVDA H100**: 1 card with 80GB HBM, 700W power, $30K-40K. A training cluster has 1024-8192 cards.

**AMD MI300X / Google TPU / AWS Trainium**: Same concept, different implementations. The CUDA ecosystem (NVDA's 20-year moat) keeps NVDA at 80%+ of the training market.

3.2 HBM vs Regular Memory — Data Throughput

GPUs compute fast, but before computing, data must be read from memory into the GPU. Regular DRAM bandwidth is insufficient → GPU spends 80% of time waiting for data → wasted.

HBM (High Bandwidth Memory): 3D stacked memory, bandwidth 10x that of DDR5.
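A rough way to see why bandwidth matters is to compare the time a chip needs to do the arithmetic with the time it needs to read the data. The sketch below does that for a single matrix-vector product. The peak compute and bandwidth figures are assumed round numbers chosen for illustration, not the specifications of any particular chip, and `limiting_factor` is a hypothetical helper.

```python
def limiting_factor(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Return (compute seconds, memory seconds, which one dominates)."""
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_moved / bandwidth_bytes_per_s
    bound = "memory" if memory_s > compute_s else "compute"
    return compute_s, memory_s, bound


# Assumed round numbers, for illustration only.
PEAK_FLOPS = 1e15  # about 1 PFLOP/s of low-precision math
HBM_BW = 3e12      # about 3 TB/s, HBM-class bandwidth
DDR_BW = 3e11      # 10x lower, matching the ratio quoted above

# Matrix-vector product with an n x n weight matrix stored in 2-byte precision.
n = 8192
flops = 2 * n * n        # one multiply and one add per weight
bytes_moved = 2 * n * n  # every weight is read from memory once

for name, bw in [("HBM", HBM_BW), ("DDR", DDR_BW)]:
    compute_s, memory_s, bound = limiting_factor(flops, bytes_moved, PEAK_FLOPS, bw)
    print(f"{name}: compute {compute_s * 1e6:.2f} us, "
          f"memory {memory_s * 1e6:.2f} us -> {bound}-bound")
```

Under these assumptions the arithmetic finishes far sooner than the data arrives in both cases. Faster memory does not remove the wait, it shortens it in proportion to the bandwidth gain.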

  • SK Hynix: Primary HBM3e supplier, NVDA uses 70%+ from SK Hynix
  • Micron: Ramped in 2024, gaining share
  • Samsung: Slow to qualify with NVDA (a triple whammy of technology issues, low yield, and a labor strike), losing market share

HBM shortage is NVDA's shipment ceiling. Monitoring HBM capacity is monitoring NVDA's revenue ceiling.

3.3 NVLink / InfiniBand / Optical Modules — GPU-to-GPU Communication

One LLM is too large for a single GPU, so it is distributed across 1000+ GPUs. Those GPUs need high-speed links to synchronize gradients at every training step.

  • NVLink: Between NVDA's own GPUs, 1.8TB/s (Blackwell)
  • InfiniBand: Between servers in a cluster (NVDA announced the Mellanox acquisition in 2019 and closed it in 2020 to secure this)
  • Optical modules: Data center cabling, speeds from 400G → 800G → 1.6T → CPO (Co-Packaged Optics)
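To get a feel for why link speed matters for gradient synchronization, the sketch below estimates the bandwidth-only time of one ring all-reduce, a common way to sum gradients across GPUs. In a ring all-reduce each GPU sends and receives about 2 × (N − 1) / N times the gradient size. The 70B-parameter model, the 2-byte gradients, and the assumption that every GPU sees the same link speed are illustrative simplifications. Real clusters mix link types and overlap communication with compute.

```python
def ring_allreduce_seconds(grad_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth-only time for one ring all-reduce (latency ignored)."""
    if num_gpus < 2:
        return 0.0  # nothing to synchronize
    # Each GPU transfers 2 * (N - 1) / N times the gradient size.
    traffic = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return traffic / link_bytes_per_s


grad_bytes = 70e9 * 2  # hypothetical 70B-parameter model, 2 bytes per gradient

links = [
    ("1.8 TB/s (NVLink-class)", 1.8e12),
    ("400 Gb/s (one 400G port)", 400e9 / 8),  # bits -> bytes
]
for name, bw in links:
    t = ring_allreduce_seconds(grad_bytes, 1024, bw)
    print(f"{name}: {t:.2f} s per synchronization")
```

The gap between the two results is simply the ratio of the link speeds, which is why each step up in optical module speed translates directly into less idle GPU time.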

**COHR / LITE / AAOI**: Optical modules. NVDA invested $2B strategically in COHR / LITE to lock supply.

**ANET**: Network switching (META's primary supplier, used for east-west fabric).

Optical module price increases = leading indicator of AI capex acceleration (as clusters scale up, optical module demand grows faster than linearly, because larger clusters need additional switching tiers and therefore more links per GPU).

3.4 Liquid Cooling + Nuclear Power — Cooling + Sustained High Power

H100: 700W. Blackwell B200: roughly 1,000 to 1,200W depending on configuration. Air cooling can't handle it → liquid cooling is a must.

1 Stargate data center = 1 GW. That's roughly the output of one large nuclear reactor.
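For scale, the sketch below converts a site's power budget into a rough chip count. The `overhead` multiplier of 1.5 (covering cooling, CPUs, networking, and power-conversion losses) is an assumed illustrative value, not a measured figure, and `chips_supported` is a hypothetical helper.

```python
def chips_supported(site_watts, chip_watts, overhead=1.5):
    """Chips a site can power, given an assumed overhead multiplier."""
    return int(site_watts / (chip_watts * overhead))


for chip_watts in (700, 1200):
    count = chips_supported(1e9, chip_watts)
    print(f"{chip_watts} W chips on a 1 GW site: about {count:,}")
```

Even with generous overhead, a 1 GW site powers hundreds of thousands of accelerators, which is why site power rather than chip supply becomes the planning constraint.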

  • VRT (Vertiv): Liquid cooling + data center electrical king
  • CEG (Constellation): MSFT's 20-year nuclear PPA (Three Mile Island restart)
  • VST (Vistra): Natural gas + nuclear
  • ETN (Eaton) / HUBB: Power distribution
  • GEV (GE Vernova): Gas turbines (backup + peak)

Energy is the real bottleneck for 2026+. You can buy GPUs, but you can't buy electricity (building a nuclear plant takes 10 years). That's why CEG / VST / GEV stocks soared in 2024+.


4. vs C4 — What You Already Know

| Dimension | C4 Gives You | C5 Adds |
|---|---|---|
| LLM working principles | Doesn't explain hardware | |
| Hardware stack | | LLM → 4 bottlenecks → 4 hardware categories → key companies |
| Investment significance | Knows training vs inference compute | Knows which bottleneck unblocking moves which company's stock; monitoring HBM / optical modules / power is a leading indicator |

C4 = Software. C5 = Hardware + Physics. Without C5, you don't know the true physical logic behind each link in the 60-ticker supply chain.


5. Try It: Estimate GPT-4's Electricity Usage for One Training Run

Task (10 minutes):

GPT-4 training estimate:
- 10,000 H100s, each 700W = 7 MW (peak)
- Train for 6 months = 4380 hours
- Compute utilization ~50% average
- Total electricity = 7 MW × 4380 × 0.5 = 15.3 GWh

Reference:
- 1 US household annual electricity ~10 MWh = 0.01 GWh
- 15.3 GWh = 1530 household-years

But this is one run. GPT-4 was trained multiple times (experiments + failures + final), total electricity estimated ~50 GWh = 5000 household-years.
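The estimate above can be reproduced in a few lines, which makes it easy to swap in different assumptions. This sketch uses only the figures from the task and counts GPU power alone. Cooling and other data center overhead are left out, as in the original estimate. `training_energy_gwh` is a hypothetical helper.

```python
def training_energy_gwh(num_gpus, watts_per_gpu, hours, utilization):
    """Energy in GWh for one training run, counting GPU power only."""
    peak_mw = num_gpus * watts_per_gpu / 1e6
    return peak_mw * hours * utilization / 1e3  # MWh -> GWh


HOUSEHOLD_GWH_PER_YEAR = 0.01  # about 10 MWh per US household, as above

one_run = training_energy_gwh(10_000, 700, 4380, 0.5)
household_years = one_run / HOUSEHOLD_GWH_PER_YEAR
print(f"One run: {one_run:.2f} GWh, about {household_years:.0f} household-years")
```

The unrounded result is 15.33 GWh, or about 1,533 household-years, which the estimate above rounds to 15.3 GWh and 1530.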

Self-check (3 items met → proceed to P1-C6):

  • You can explain why SK Hynix's stock moves in sync with NVDA
  • You can explain why CEG (nuclear) surged 200%+ in 2024+
  • You can predict which link will rally next from hardware bottlenecks: HBM4 (2026)? Liquid cooling penetration (2026-27)? 1.6T optical modules?

6. What's Next

You can now reverse-engineer the hardware stack from LLMs. Now map the hardware stack to specific companies — which role each of the 60 tickers plays, and what they depend on.

→ P1-C6 · Supply Chain 5 Roles + 60 Ticker Map
Upgrade the existing supply chain diagram; with C1-C5 as foundation, you're no longer learning in isolation.


5 Hardware Trends

CPO (Co-Packaged Optics), 2025+: Optical modules move from pluggable units to being packaged together with the switch chip. Power consumption drops 50% and bandwidth doubles, but CPO yield is difficult and mass production is slow. Key players: TSM (packaging), AVGO (switch), Coherent (optical). → If CPO truly reaches mass production in 2026, the entire optical module paradigm shifts, reshuffling existing players.

NVLink vs InfiniBand vs Ethernet: NVDA pushes NVLink (between its own GPUs) + InfiniBand (between servers in a cluster). But the Ultra Ethernet Consortium (Cisco/Arista/Intel/AMD/MSFT) is jointly promoting standard Ethernet for AI fabric. Long term, NVDA's networking advantage may be diluted.

TPU Economics (Google internal): TPU v5p performance is comparable to H100, but Google uses it internally and does not sell the chips externally. This diverts 30-50% of Google's demand from NVDA, but total compute demand is unchanged (Google uses the same compute even without buying NVDA).

Inference Hardware Divergence: Training and inference hardware will separate in the future.

  • Training: Massive clusters (NVDA Blackwell dominates)
  • Inference: Single card / edge / small chips (Groq / Cerebras / SambaNova / Apple NPU)

NVDA Blackwell also optimizes for inference, but competitors have a chance.

HBM4 (2026), next generation: Tied to SK Hynix's mass production timeline; bandwidth doubles again. NVDA Rubin (2026 H2) uses HBM4. This is the starting point for the next HBM shortage cycle.