The Package Became the Computer.
Original analysis · Not investment advice
Dojo was not just a chip. It was Tesla’s attempt to make a training computer out of packaging, power delivery, cooling, interconnect, SRAM, and software. In 2026, the dedicated Dojo story looks weaker because Tesla shifted toward AI5 and AI6. But the core insight aged well: AI scaling is moving from chip-level performance to package-level and rack-level integration. The package is becoming the computer.
In 2021, Tesla revealed Dojo. The easy headline was: Tesla built an AI chip. That was not the real story. The real story was stranger. Tesla built a training surface.
A D1 chip mattered, but the chip was not the unit of scale. The 2021 SemiAnalysis article made this clear: Dojo’s real unit of scale was the training tile, a 25-chip fan-out wafer package designed to behave like one giant compute plane.1
That is why Dojo mattered. Not because Tesla made another accelerator. Because Tesla tried to make packaging, power delivery, cooling, and software part of the accelerator itself.
Tesla Dojo was an early, extreme version of the problem every AI hardware company now faces: scaling AI compute is not just about faster chips. It is about packaging, interconnect, memory locality, power delivery, cooling, software, and workload fit. The dedicated Dojo path looks less central after Tesla shifted toward AI5 and AI6, but the system-level lesson aged well. AI hardware is becoming package-scale and rack-scale.
I. The 2021 thesis was about the system
In August 2021, Dylan Patel published a SemiAnalysis piece written after Tesla AI Day. The piece did not only describe the D1 chip. It described the training tile, the power delivery, the cabinet, the ExaPOD, and the software stack. It argued that Tesla had designed Dojo because GPU-cluster scaling was not enough, and that Dojo’s distributed compute plane needed high bandwidth, low latency, spatial and temporal locality, and a mesh of compute units connected by fabric. The deeper insight was that the package and system were the unit of scale.1
Dojo was not just an accelerator chip. It was Tesla’s attempt to make packaging, bandwidth, power, cooling, and software into one training architecture.
II. D1 was built to be networked
The 2021 piece described D1 as a chip designed for movement, not only for compute. Each training node combined roughly 1.25MB of SRAM with CPU-like flexibility: SIMD, matrix multiply, and ML-focused custom instructions. The die delivered 362 TFLOPS in BF16 / CFP8 across 354 functional units, used 50B transistors on a roughly 645mm² die, and drew about 400W TDP. It supported 10 TB/s of on-chip bandwidth per direction and exposed 576 SerDes at 112 GT/s for roughly 8 TB/s of total off-chip bandwidth.1
Most AI chips are described by compute. Dojo needs to be described by movement.
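The movement framing can be checked with simple arithmetic. The sketch below uses only the figures quoted above and assumes that 112 GT/s means 112 Gbit/s of raw signaling per SerDes lane, with encoding overhead ignored. The function and constant names are illustrative, not anything from Tesla or SemiAnalysis.

```python
# Back-of-envelope check on the D1 figures quoted from the 2021 piece.
# Assumption: 112 GT/s is treated as 112 Gbit/s raw per lane, no encoding overhead.
SERDES_LANES = 576
GBIT_PER_LANE = 112
FUNCTIONAL_NODES = 354
SRAM_MB_PER_NODE = 1.25
DIE_TFLOPS = 362


def d1_summary() -> dict:
    """Derive aggregate die-level numbers from the per-lane and per-node figures."""
    off_chip_tb_s = SERDES_LANES * GBIT_PER_LANE / 8 / 1000  # Gbit/s -> GB/s -> TB/s
    sram_mb = FUNCTIONAL_NODES * SRAM_MB_PER_NODE
    tflops_per_node = DIE_TFLOPS / FUNCTIONAL_NODES
    return {
        "off_chip_tb_s": off_chip_tb_s,
        "sram_mb": sram_mb,
        "tflops_per_node": tflops_per_node,
    }


if __name__ == "__main__":
    for name, value in d1_summary().items():
        print(f"{name}: {value:.3f}")
```

Under that assumption the lane arithmetic lands at about 8.06 TB/s, consistent with the roughly 8 TB/s quoted, and the per-node SRAM adds up to about 442.5MB per die.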
III. The tile was the breakthrough
Twenty-five D1 chips were packaged in a fan-out wafer process into one training tile delivering roughly 9 PFLOPS of BF16 / CFP8 compute and 36 TB/s of off-tile bandwidth.1 Cadence’s technical summary corroborates the framing: 25 known-good D1 dies on a fanout-wafer process that preserves bandwidth between adjacent chips.2 In a normal accelerator story, the chip is the product. In Dojo, the tile was the product.
The package became the computer.
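The tile figures follow from the die figures, and the comparison is instructive. A rough sketch, assuming the quoted per-die numbers simply add up across 25 dies. Summing die-level off-chip bandwidth counts each die-to-die link from both ends, so the ratio at the end is only an indication of how much traffic is meant to stay inside the tile.

```python
# Rough tile-level aggregation from the per-die figures quoted above.
# Assumption: per-die numbers add linearly; link double-counting is ignored.
DIES_PER_TILE = 25
DIE_TFLOPS = 362
DIE_OFF_CHIP_TB_S = 8.0   # "roughly 8 TB/s" per die
OFF_TILE_TB_S = 36.0


def tile_summary() -> dict:
    """Aggregate compute and compare die-level bandwidth with off-tile bandwidth."""
    tile_pflops = DIES_PER_TILE * DIE_TFLOPS / 1000
    summed_die_tb_s = DIES_PER_TILE * DIE_OFF_CHIP_TB_S
    share_leaving_tile = OFF_TILE_TB_S / summed_die_tb_s
    return {
        "tile_pflops": tile_pflops,
        "summed_die_tb_s": summed_die_tb_s,
        "share_leaving_tile": share_leaving_tile,
    }


if __name__ == "__main__":
    print(tile_summary())
```

The compute total comes out at 9.05 PFLOPS, matching the roughly 9 PFLOPS quoted. On this crude accounting, less than a fifth of the summed die-level bandwidth crosses the tile edge, which is one way to read "the tile was the product".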
IV. Fan-out wafer packaging, simply
A normal package connects chips through a substrate or interposer. Fan-out wafer packaging redistributes chip connections through a wafer-like structure, allowing many known-good dies to be connected close together with dense wiring.
Why Dojo chose it
- Shorter chip-to-chip paths.
- Lower latency and higher bandwidth density.
- Better scale-up inside the tile.
- Dense interconnect without monolithic die yield risk.1
What it costs
- Yield complexity at the tile level.
- Thermal and serviceability challenges.
- Power-delivery difficulty at scale.
- Software must understand the topology.
Dojo was not trying to make one impossible die. It was trying to make many dies behave like one training surface.
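The yield argument behind known-good dies can be illustrated with the textbook Poisson yield model, Y = exp(-A × D0). This is a sketch only: the defect density is a hypothetical placeholder, not a foundry figure, and a 25-die monolithic part could not be built as one reticle-limited die anyway. The model also ignores the tile-level assembly yield listed under costs above.

```python
import math

D1_AREA_MM2 = 645      # die area quoted in the 2021 piece
DIES_PER_TILE = 25


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Textbook Poisson yield: probability that a die of this area has zero defects."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


def compare_strategies(defects_per_cm2: float) -> dict:
    """Compare one hypothetical giant die with 25 separately tested dies."""
    die_yield = poisson_yield(D1_AREA_MM2, defects_per_cm2)
    monolithic_yield = poisson_yield(D1_AREA_MM2 * DIES_PER_TILE, defects_per_cm2)
    # With known-good-die testing, bad dies are discarded before packaging,
    # so the cost is extra dies fabricated rather than a scrapped tile.
    dies_fabricated_per_tile = DIES_PER_TILE / die_yield
    return {
        "die_yield": die_yield,
        "monolithic_yield": monolithic_yield,
        "dies_fabricated_per_tile": dies_fabricated_per_tile,
    }


if __name__ == "__main__":
    # 0.1 defects per cm^2 is an illustrative assumption, not a reported value.
    print(compare_strategies(defects_per_cm2=0.1))
```

At the assumed defect density, roughly half of individual dies come out clean, while the chance of a defect-free monolithic equivalent is effectively zero. That gap is why testing dies before integration matters, whatever the real defect density turns out to be.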
V. Power delivery was architecture
The 2021 SemiAnalysis piece described a tile consuming over 10kW at the package level and about 15kW when power delivery, IO, and wafer wiring are included, with power entering vertically from the bottom, heat leaving from the top, and custom VRMs reflowed directly onto the fan-out wafer.1 Cadence’s summary corroborates: the tile took 52V DC, drew 18,000A, dissipated 15kW of heat, and delivered 9 PFLOPS in less than one cubic foot.2
At Dojo scale, the question is not just “how fast is the chip?” It is “can the system feed enough power and remove enough heat to keep the fabric alive?”
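The quoted electrical figures deserve a sanity check, because 52V and 18,000A cannot both describe the same point in the circuit. The sketch below applies P = V × I to the quoted numbers. The interpretation in the comments, that 18,000A is the aggregate low-voltage current after the on-wafer VRMs, is an inference from the arithmetic and not something the cited sources state here.

```python
# Sanity check on the tile's quoted electrical figures using P = V * I.
INPUT_VOLTS = 52        # DC input quoted in the Cadence summary
QUOTED_AMPS = 18_000    # current quoted in the Cadence summary
TILE_WATTS = 15_000     # heat dissipation quoted for the tile


def power_check() -> dict:
    """Show what each pairing of the quoted figures would imply."""
    naive_watts = INPUT_VOLTS * QUOTED_AMPS          # if both applied at the input
    input_amps = TILE_WATTS / INPUT_VOLTS            # input current for a 15kW tile
    implied_rail_volts = TILE_WATTS / QUOTED_AMPS    # voltage at which 18,000A is 15kW
    return {
        "naive_watts": naive_watts,
        "input_amps": input_amps,
        "implied_rail_volts": implied_rail_volts,
    }


if __name__ == "__main__":
    print(power_check())
```

Multiplying the two quoted figures directly gives about 936kW, far above 15kW. Dividing 15kW by 18,000A gives about 0.83V, which is in the range of a logic supply rail. One plausible reading is that the tile takes a few hundred amps at 52V and the VRMs convert that to very high current at low voltage, which is exactly why power delivery had to be part of the package.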
VI. SRAM locality was the memory bet
Each training node carried 1.25MB of SRAM, and Dojo emphasised spatial and temporal locality rather than the HBM-heavy approach most accelerators took.1 HBM-heavy systems bring enormous memory bandwidth, but depend on expensive memory stacks and interposers. Dojo emphasised many local SRAM pools, a tightly connected compute fabric, and data movement through the tile.
HBM-heavy
High-bandwidth memory stacks beside the accelerator. Broad workload fit, deep ecosystem, dominant for general-purpose training.
SRAM locality
Many local SRAM pools, tight inter-die fabric. Narrower workload fit, but lower memory-movement penalty if the workload maps cleanly to the tile.1
Dojo was not a generic GPU clone. It was a locality and interconnect bet.
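The locality bet has a capacity side that is easy to estimate. A minimal sketch, assuming all 354 nodes on each of 25 dies contribute 1.25MB and that a model's weights are the only thing that must fit. Real training also needs activations, gradients, and optimizer state, so this is an optimistic upper bound, and the example model sizes are hypothetical.

```python
# Aggregate on-tile SRAM versus a model's weight footprint.
# Assumption: weights only; activations and optimizer state are ignored.
DIES_PER_TILE = 25
NODES_PER_DIE = 354
SRAM_MB_PER_NODE = 1.25


def tile_sram_gb() -> float:
    """Total SRAM across one tile, in decimal gigabytes."""
    return DIES_PER_TILE * NODES_PER_DIE * SRAM_MB_PER_NODE / 1000


def weights_fit_on_tile(param_count: float, bytes_per_param: int) -> bool:
    """True if the raw weights would fit in the tile's aggregate SRAM."""
    weight_gb = param_count * bytes_per_param / 1e9
    return weight_gb <= tile_sram_gb()


if __name__ == "__main__":
    print(f"tile SRAM: {tile_sram_gb():.2f} GB")
    # Hypothetical model sizes at 2 bytes per parameter (BF16).
    print("1B params fit:", weights_fit_on_tile(1e9, 2))
    print("10B params fit:", weights_fit_on_tile(10e9, 2))
```

On these assumptions a tile holds about 11GB of SRAM. That is enough for the weights of a small model and not enough for a large one, which is why the approach depends on the workload mapping cleanly across tiles.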
VII. Dojo was also a software bet
A beautiful tile is useless if the software stack cannot map real models onto it. The 2021 SemiAnalysis piece described a Dojo software stack with a PyTorch extension at the top, a Dojo compiler engine in the middle, and an LLVM backend at the bottom, with multi-host and multi-partition support, model and data graph parallelism, and the ability to scale work across chip and tile boundaries.1
Custom AI hardware only matters if the software stack makes the hardware usable.
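To see why the software has to understand the topology, consider placing a pipeline of model stages onto a mesh of dies. The sketch below is a toy illustration, not Tesla's compiler. It assumes the 25 dies sit in a 5 × 5 grid with nearest-neighbour links, and every name in it is hypothetical. A naive row-major placement forces some consecutive stages to cross the whole row, while a snake (boustrophedon) placement keeps every consecutive pair adjacent.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def row_major_order(rows: int, cols: int) -> List[Coord]:
    """Visit the grid left to right on every row."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def snake_order(rows: int, cols: int) -> List[Coord]:
    """Visit the grid left to right, then right to left, alternating each row."""
    order: List[Coord] = []
    for r in range(rows):
        cols_iter = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cols_iter)
    return order


def hops(a: Coord, b: Coord) -> int:
    """Manhattan distance: link hops between two dies on a mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def max_stage_hops(placement: List[Coord]) -> int:
    """Worst-case hop count between consecutive pipeline stages."""
    return max(hops(a, b) for a, b in zip(placement, placement[1:]))


if __name__ == "__main__":
    # Assumed 5 x 5 grid of dies, one pipeline stage per die.
    print("row-major worst hop:", max_stage_hops(row_major_order(5, 5)))
    print("snake worst hop:", max_stage_hops(snake_order(5, 5)))
```

The same 25 stages on the same hardware see a worst-case distance of five hops in one placement and one hop in the other. A real compiler faces a far harder version of this problem, with uneven stage costs and multiple tiles, but the dependence on topology is the same.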
VIII. The 2021 claim was massive
Tesla’s headline framing in 2021 was bold: roughly 4× performance, 1.3× performance per watt, 5× smaller footprint, and nearly an order-of-magnitude TCO advantage versus an Nvidia AI solution. The SemiAnalysis author was enthusiastic but explicitly cautioned that the real test would be production deployment.1
The claim was not obviously impossible. But the burden of proof was enormous: stable hardware, working compiler, high utilisation, reliable cooling, strong yield, software migration, and production-scale deployment that beat the economics of Nvidia clusters. None of those land from a slide.1
The claim was not obviously impossible. But the burden of proof was enormous.
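The headline multipliers also imply things that the slide did not spell out. A short sketch, assuming all three ratios are measured against the same baseline system, which the source does not confirm:

```python
# What the 2021 headline ratios imply if they share one baseline (an assumption).
PERF_RATIO = 4.0            # "roughly 4x performance"
PERF_PER_WATT_RATIO = 1.3   # "1.3x performance per watt"
FOOTPRINT_SHRINK = 5.0      # "5x smaller footprint"


def implied_ratios() -> dict:
    """Derive power and density ratios from the three quoted multipliers."""
    power_ratio = PERF_RATIO / PERF_PER_WATT_RATIO        # total power vs baseline
    perf_density_ratio = PERF_RATIO * FOOTPRINT_SHRINK    # performance per unit footprint
    power_density_ratio = power_ratio * FOOTPRINT_SHRINK  # watts per unit footprint
    return {
        "power_ratio": power_ratio,
        "perf_density_ratio": perf_density_ratio,
        "power_density_ratio": power_density_ratio,
    }


if __name__ == "__main__":
    for name, value in implied_ratios().items():
        print(f"{name}: {value:.2f}x")
```

Read this way, the claim implies roughly three times the power in a fifth of the space, or about fifteen times the power density. Whether or not the baseline assumption holds exactly, it shows why cooling and power delivery sat on the burden-of-proof list.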
IX. The 2026 update is not a clean victory lap
Reuters, citing Bloomberg, reported in August 2025 that Tesla was disbanding its Dojo supercomputer team. Musk said it did not make sense for Tesla to divide resources across two very different AI chip designs, and that Tesla’s effort was focused on AI5, AI6, and subsequent chips, which he framed as excellent for inference and at least pretty good for training.3
Dojo’s reorganisation is not a verdict on packaging.
The reporting describes a streamlining of Tesla’s AI chip teams toward AI5 / AI6, not a rejection of fan-out wafer-scale ideas industry-wide. It tells you about Tesla’s priorities. It does not tell you that the package-as-system thesis is wrong; TSMC’s SoW-X roadmap suggests the opposite direction.7
Dojo was technically fascinating, but Tesla appears to have chosen a more unified inference-first chip roadmap.
X. AI5 and AI6 are the new center
Reuters reported that Tesla signed a ~$16.5B supply deal with Samsung, with Musk saying Samsung’s Taylor (Texas) factory would make Tesla’s next-generation AI6 chip. Samsung currently makes Tesla’s AI4 chips. According to Musk, TSMC is slated to make AI5, first in Taiwan and then in Arizona. The chips are intended for self-driving vehicles, Optimus robots, and broader AI applications.4 Reuters also reported Musk saying Tesla may tape out AI6 in December 2026, with a Samsung executive saying Tesla chips based on Samsung’s advanced 2nm process were planned for production in the second half of 2027.5
Cost / priority test
Dojo was the training-supercomputer bet. AI5 and AI6 are the deployment-scale AI-chip bets.
XI. Why the packaging thesis aged better than the Dojo thesis
TSMC’s 2026 North America Technology Symposium materials describe a packaging roadmap with 5.5-reticle CoWoS today, 14-reticle CoWoS by 2028 supporting roughly 10 large compute dies and 20 HBM stacks, and a 40-reticle SoW-X System-on-Wafer targeted for 2029, alongside SoIC 3D stacking and COUPE co-packaged optics.7
Dojo itself became complicated. The Dojo problem became universal.
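The reticle multiples become more concrete when converted to area. A rough sketch, assuming the commonly cited 26mm × 33mm lithography field (about 858mm²) as one "reticle". TSMC's own definition for packaging purposes may differ slightly, so treat the outputs as approximate. The D1 comparison counts only die silicon, not the full fan-out wafer.

```python
# Convert reticle multiples to approximate area and compare with Dojo tile silicon.
# Assumption: one reticle is the standard 26 mm x 33 mm exposure field.
RETICLE_MM2 = 26 * 33
D1_AREA_MM2 = 645
DIES_PER_TILE = 25

ROADMAP_RETICLES = {
    "CoWoS today": 5.5,
    "CoWoS 2028": 14,
    "SoW-X 2029": 40,
}


def roadmap_areas_mm2() -> dict:
    """Approximate package area for each roadmap step."""
    return {name: mult * RETICLE_MM2 for name, mult in ROADMAP_RETICLES.items()}


def tile_silicon_in_reticles() -> float:
    """Dojo tile die area expressed as a multiple of one reticle."""
    return D1_AREA_MM2 * DIES_PER_TILE / RETICLE_MM2


if __name__ == "__main__":
    for name, area in roadmap_areas_mm2().items():
        print(f"{name}: ~{area:.0f} mm^2")
    print(f"Dojo tile silicon: ~{tile_silicon_in_reticles():.1f} reticles")
```

On this accounting the 25 D1 dies amount to roughly 19 reticles of silicon, which places the 2021 tile between the 14-reticle and 40-reticle steps of the 2026 roadmap.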
XII. The package is becoming the computer
AI scaling used to look like: faster chip → faster model training. Now it looks like: die + package + HBM / SRAM + interconnect + power + cooling + rack + software partitioning. The bottlenecks are chip-to-chip bandwidth, HBM capacity, HBM bandwidth, SRAM locality, package size, reticle limits, substrate / interposer complexity, power delivery, cooling, optical I/O, rack networking, compiler / runtime, and workload partitioning.
Chip → server → cluster
- Bottleneck · single-chip performance.
- Scaling unit · one accelerator.
- Constraint · transistor count.
- Result · servers full of independent chips.
Die → tile → rack → data center
- Bottleneck · chip-to-chip bandwidth, power, cooling.
- Scaling unit · package + tile + rack.7
- Constraint · interconnect, memory, energy.
- Result · AI factory designed as one system.
The package is becoming the unit of AI scaling.
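The shift from compute-bound to movement-bound scaling can be sketched with a roofline-style bound: attainable throughput is the smaller of peak compute and the product of arithmetic intensity and bandwidth. The example plugs in the Dojo tile figures quoted earlier (about 9.05 PFLOPS and 36 TB/s off-tile) purely as an illustration. Treating off-tile bandwidth as the single limiting link is a simplifying assumption, and the intensity values are hypothetical.

```python
# Roofline-style bound: throughput = min(peak compute, intensity * bandwidth).
# Assumption: one bandwidth figure is the only data-movement limit.


def attainable_pflops(intensity_flops_per_byte: float,
                      peak_pflops: float = 9.05,
                      bandwidth_tb_s: float = 36.0) -> float:
    """Upper bound on throughput for a workload with the given FLOPs per byte moved."""
    bandwidth_bound_pflops = intensity_flops_per_byte * bandwidth_tb_s / 1000
    return min(peak_pflops, bandwidth_bound_pflops)


def balance_point(peak_pflops: float = 9.05, bandwidth_tb_s: float = 36.0) -> float:
    """FLOPs per byte at which compute and bandwidth limits meet."""
    return peak_pflops * 1000 / bandwidth_tb_s


if __name__ == "__main__":
    print(f"balance point: ~{balance_point():.0f} FLOPs per byte")
    for intensity in (50, 100, 500, 1000):  # hypothetical workloads
        print(intensity, "FLOPs/byte ->", round(attainable_pflops(intensity), 2), "PFLOPS")
```

Below the balance point, a faster chip changes nothing and only more bandwidth helps. That is the sense in which the scaling unit moves from the accelerator to the package and the rack.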
XIII. Dojo vs Nvidia was vertical integration vs ecosystem
The Dojo-vs-Nvidia comparison is not just "custom chip vs GPU." It is platform vs vertical integration. Nvidia wins by ecosystem leverage. Dojo tried to win by workload-specific integration.
Nvidia sells a platform. Dojo was Tesla trying to build a machine for one company’s data loop.
XIV. The software problem
A custom accelerator fails quietly when the software team cannot make it convenient. To make Dojo work, Tesla needed a compiler, runtime, training-framework support, model partitioning, debugging tools, scheduling, fault tolerance, reliability management, data pipeline integration, developer productivity, and a migration path from existing GPU workflows. None of those are easy. None of them get cheaper at custom-hardware scale.
The hardware can be brilliant and still lose if the software path is too painful.
XV. The business lesson
Hardware ambition is not enough. The architecture has to map to business leverage. Dojo had a clear technical reason: Tesla-specific training from fleet video. But the business test is whether it reduced cost enough, sped training enough, justified a separate team, kept up with Nvidia’s roadmap, justified custom software, scaled reliably, and helped cars and robots ship faster. Those are a lot of bars to clear in parallel.
Dojo’s enemy was not only Nvidia. Dojo’s enemy was the cost of becoming Nvidia, a packaging company, a compiler company, and a data-center operator at the same time.
XVI. What could break the thesis?
Dojo showed technical ambition, but not enough proven business leverage to keep the dedicated training path central.
- Custom too narrow. Dojo may have been too custom for Tesla’s evolving AI roadmap.3
- GPU still wins on cost. External GPUs and AI5 / AI6 may be better uses of Tesla resources.
- Training ↔ inference shift. Dedicated training chips can become obsolete if inference dominates.3
- Schedule risk. Samsung 2nm or TSMC AI5 schedules may slip.5
- Nvidia ecosystem. Nvidia’s software / network / cloud stack remains very hard to beat.
- Software burden. Custom stacks are expensive to maintain.
- Fan-out hard problems. Yield, thermal, and serviceability challenges scale with package size.1
- Compiler ceiling. A beautiful package is wasted if the compiler / runtime is not productive.
- Autonomy ≠ compute-bound. Tesla’s autonomy progress may not be bottlenecked by training hardware alone.
XVII. What could break the bear case?
Even if Dojo changes form, the lesson survives.
- Early lesson advantage. Tesla learned package-scale AI hardware earlier than most.1
- Reusable IP. Dojo lessons feed AI5, AI6, robotics, autonomy, and future infrastructure.
- Data + workload control. Tesla has unique data and workload control.
- Training ↔ inference convergence. Future chips may serve both well.3
- Package-scale mainstreams. SoW-X / CoWoS 14R / SoIC validate the direction.7
- Power / cooling reusable. Lessons travel even if the tile architecture changes form.2
- Robotics scale. Optimus and physical AI add new training-compute demand.
- Vertical integration still rare. Few companies can co-design data, model, and silicon.
Even if Dojo changes form, the lesson survives: AI compute is constrained by bandwidth, power, cooling, packaging, and software.
XVIII. What to watch
- Whether Tesla continues using Dojo systems internally.3
- AI5 tape-out and production timing.4
- AI6 tape-out timing.5
- Samsung 2nm yield and schedule.5
- TSMC AI5 production in Taiwan / Arizona.4
- Tesla’s actual training-compute spending.
- Nvidia usage inside Tesla.
- AMD usage inside Tesla, if any.
- Whether Dojo software survives in AI5 / AI6 workflows.
- Optimus training requirements.
- Autonomy model size and data growth.
- Tesla’s FSD training cadence.
- Package-scale AI hardware trends.7
- TSMC SoW-X roadmap.7
- CoWoS capacity and package size evolution.7
- HBM dependence vs SRAM-heavy architectures.1
- Power and cooling architecture for AI clusters.2
- Whether custom accelerators beat GPU clusters in narrow workloads.
Glossary
A short reference for the vocabulary used above. Definitions are simplified.
- Dojo: Tesla’s custom AI training supercomputer project.
- D1: Tesla’s Dojo training chip.
- Training node: Local compute block inside D1.
- Training tile: Package-level unit containing 25 D1 dies.
- Known-good die: A chip die tested before integration.
- Fan-out wafer: Packaging process that redistributes connections outward from dies to enable dense integration.
- SRAM: Fast on-chip memory.
- HBM: High-bandwidth memory used near AI accelerators.
- BF16: bfloat16, a low-precision format often used in AI training.
- CFP8: Configurable 8-bit floating point format referenced in Dojo coverage.
- SerDes: Serializer / deserializer links used for high-speed data movement.
- TDP: Thermal design power.
- VRM: Voltage regulator module.
- ExaPOD: Tesla’s larger Dojo system concept built from multiple tiles / cabinets.
- CoWoS: TSMC advanced packaging technology often used for AI / HPC chips.
- SoIC: TSMC 3D stacking technology.
- SoW-X: TSMC System-on-Wafer roadmap technology.
- COUPE: TSMC co-packaged optics technology.
- Tape-out: Final chip design handoff before manufacturing.
- Rack-scale AI: AI compute designed at rack or data-center scale rather than single-chip scale.
XIX. The package became the computer
Dojo was not just a chip.
It was Tesla’s attempt to make a training computer out of packaging, power delivery, cooling, interconnect, SRAM, and software.
The 2021 SemiAnalysis piece was bold about Tesla’s ~4× / 1.3× / 5× / nearly-order-of-magnitude TCO claim, and explicit that production deployment was the real test. The 2026 reality is more complicated than that claim implied and more aligned with that caution than either side of the bull-bear debate likes to admit. Tesla learned an enormous amount about package-scale AI hardware. Tesla also decided that a dedicated Dojo training-supercomputer team was not how it wanted to spend that learning.
The Dojo team was streamlined toward AI5 and AI6. TSMC’s CoWoS, SoIC, COUPE, and SoW-X roadmap is moving the industry toward exactly the kind of package-scale and rack-scale integration Dojo previewed in 2021. The companies that win the next AI cycle will be the ones that treat the package, the rack, and the software stack as one design problem. Tesla saw that early. Others are catching up now.
That is how the order-of-magnitude bet aged. Not into a clean victory for any one company. Into the AI infrastructure playbook.
That is how the package became the computer.
1 Patel, D. (Aug 2021). Tesla’s Dojo, 1 Order Of Magnitude Better Cost, Performance, Scale Than Nvidia Solutions. SemiAnalysis. Historical anchor for the Dojo system framing, including D1 specs (354 functional units, 362 TFLOPS BF16/CFP8, 50B transistors, 645mm², 400W TDP, 1.25MB SRAM per training node, 10 TB/s on-chip directional bandwidth, 576 SerDes at 112 GT/s, ~8 TB/s off-chip bandwidth), the 25-die training tile (9 PFLOPS, 36 TB/s off-tile bandwidth), the ~10kW / ~15kW tile power framing, vertical power delivery, custom VRMs reflowed onto the fan-out wafer, the cabinet / ExaPOD architecture, the PyTorch / Dojo compiler / LLVM software stack, and Tesla’s ~4x / 1.3x / 5x / ~10x TCO claim versus Nvidia AI solutions with the explicit caution that production deployment was the real test. Used as inspiration only. No content, structure, or charts reproduced.
2 Cadence. Not chips: Tesla’s Dojo. Independent technical summary corroborating 25 known-good D1 dies on a fanout-wafer process, ~9 PFLOPS / tile, ~36 TB/s off-tile bandwidth, ~52V DC, ~18,000A current draw, ~15kW heat dissipation, and the tile / cabinet / ExaPOD framing.
3 Reuters (Aug 2025). Tesla to streamline its AI chip design work, Musk says. Bloomberg-via-Reuters reporting that Tesla was disbanding its Dojo supercomputer team, with Musk saying it did not make sense to divide resources across two very different AI chip designs and that Tesla’s effort was focused on AI5, AI6, and subsequent chips, described as excellent for inference and at least pretty good for training.
4 Reuters (Jul 2025). Tesla / Samsung $16.5B chip supply deal. Samsung’s Taylor (Texas) factory framed to make Tesla’s next-generation AI6 chip per Musk; Samsung currently making AI4; TSMC slated to make AI5 first in Taiwan and then Arizona per Musk; self-driving / Optimus / broader AI context.
5 Reuters (Mar 2026). Musk says Tesla may tape out AI6 in December 2026. Samsung executive cited saying Tesla chips based on Samsung’s advanced 2nm process were planned for production in the second half of 2027.
6 Tesla AI Day 2021 official material and subsequent Tesla technical presentations remain the original sources for D1 chip, training tile, cabinet, ExaPOD, and software stack framing. This essay relies on the SemiAnalysis 2021 framing (fn1) and the Cadence technical summary (fn2) for the specific numbers used; no Tesla material is reproduced.
7 TSMC (2026). 2026 North America Technology Symposium. CoWoS expansion (5.5-reticle today, 14-reticle by 2028 with ~10 large compute dies and 20 HBM stacks), SoIC 3D stacking, COUPE co-packaged optics, and a 40-reticle SoW-X System-on-Wafer technology targeted for 2029.
8 Public Hot Chips 34 Dojo System material and other credible Dojo technical coverage is referenced in this essay only at the level the SemiAnalysis 2021 framing (fn1) and Cadence summary (fn2) already disclose. Specific microarchitecture claims are made only where verifiable.
9 Public Nvidia developer / DGX materials are referenced only as comparative platform context. This essay does not reproduce Nvidia visuals or marketing.
10 TSMC packaging materials (CoWoS, InFO, SoIC, SoW) provide additional context for the system-level packaging direction. The essay uses the 2026 symposium framing (fn7) as the primary citation rather than restating individual product pages.
- The Wafer-Scale Training Bet. Companion essay built around the 2021 SemiAnalysis Tesla AI Day teaser and InFO_SoW speculation — reads the package-as-architecture thesis from a different 2021 anchor.
- The Wafer-Scale Latency Bet. Cerebras and the case for removing chip boundaries (a useful Cerebras vs Dojo contrast).
- The Foundry Toll Road. Why TSMC’s pricing power got stronger in the AI era.
- The GAA Credibility Test. Samsung Foundry’s 2nm comeback as a trust test, not a transistor story.
- When AI Runs Out of Copper. Optical I/O, co-packaged optics, and the race to replace copper with light.
- Nvidia Built the AI Factory Anyway. Vertical system integration as the new moat.
- The Bubble That Became Infrastructure. Why Nvidia’s 2021 overvaluation story turned into the AI factory thesis.
- The Custom Silicon Flywheel. Hyperscalers turning their biggest workloads into chips.
- The Back-End Bottleneck. Wire bonding, TCB, hybrid bonding, and power assembly as AI packaging.
- The Power Efficiency Layer. Power Integrations and the hidden power-conversion stack.
- The Networked AI Bet. Tenstorrent’s open, Ethernet-native attack on the AI compute stack.
- The AI Chip Software Wall. Why specialised silicon alone was not enough to beat Nvidia.
- The AI Field Manual. Reference layer for the AI stack: hardware, memory, models, agents, safety, economics.
This is Essay No. 040. The topics: intelligence, AI, systems, knowledge, and the questions underneath the questions everyone else is asking. If you read this far and disagreed with any part of it, write to me. I read everything.