Essay No. 040  ·  AI Infrastructure  ·  Melbourne, Australia
AI Infrastructure · Tesla Dojo · D1 · Advanced Packaging · Fan-Out Wafer · TSMC · AI5 · AI6 · Samsung · CoWoS · SoIC · SoW-X · SRAM · Power Delivery · Cooling

The Package Became the Computer.
Original analysis · Not investment advice

How Tesla Dojo’s order-of-magnitude training bet aged into the AI infrastructure playbook.
Pugalenthi Magendran
April 2026  ·  Melbourne, Australia
12 min read

Dojo was not just a chip. It was Tesla’s attempt to make a training computer out of packaging, power delivery, cooling, interconnect, SRAM, and software. In 2026, the dedicated Dojo story looks weaker because Tesla shifted toward AI5 and AI6. But the core insight aged well: AI scaling is moving from chip-level performance to package-level and rack-level integration. The package is becoming the computer.

In 2021, Tesla revealed Dojo. The easy headline was: Tesla built an AI chip. That was not the real story. The real story was stranger. Tesla built a training surface.

A D1 chip mattered, but the chip was not the unit of scale. The 2021 SemiAnalysis article made this clear: Dojo’s real unit of scale was the training tile, a 25-chip fan-out wafer package designed to behave like one giant compute plane. [1]

That is why Dojo mattered. Not because Tesla made another accelerator. Because Tesla tried to make packaging, power delivery, cooling, and software part of the accelerator itself.

Key idea

Tesla Dojo was an early, extreme version of the problem every AI hardware company now faces: scaling AI compute is not just about faster chips. It is about packaging, interconnect, memory locality, power delivery, cooling, software, and workload fit. The dedicated Dojo path looks less central after Tesla shifted toward AI5 and AI6, but the system-level lesson aged well. AI hardware is becoming package-scale and rack-scale.


I. The 2021 thesis was about the system

In August 2021, after Tesla AI Day, Dylan Patel published a SemiAnalysis piece that went well beyond the D1 chip. It described the training tile, the power delivery, the cabinet, the ExaPOD, and the software stack. Its argument was that Tesla had designed Dojo because GPU-cluster scaling was not enough, and that Dojo’s distributed compute plane needed high bandwidth, low latency, spatial and temporal locality, and a mesh of compute units connected by fabric. The deeper insight was that the package and system were the unit of scale. [1]

2021 thesis

Dojo was not just an accelerator chip. It was Tesla’s attempt to make packaging, bandwidth, power, cooling, and software into one training architecture.


II. D1 was built to be networked

The 2021 piece described D1 as a chip designed for movement, not only for compute. Each training node carried roughly 1.25MB of SRAM with CPU-like flexibility, SIMD, matrix multiply, and ML-focused custom instructions. The die delivered 362 TFLOPS in BF16 / CFP8 across 354 functional units, packed 50B transistors into roughly 645mm², and drew about 400W TDP. For data movement, it supported 10 TB/s of on-chip bandwidth per direction and exposed 576 SerDes at 112 GT/s for roughly 8 TB/s of total off-chip bandwidth. [1]

Card · D1, simplified
D1 chip specs
354 · functional units [1]
362 TF · BF16 / CFP8 [1]
50B · transistors [1]
645mm² · die area [1]
400W · TDP [1]
10 TB/s · on-chip BW (directional) [1]
576 · SerDes @ 112 GT/s [1]
~8 TB/s · off-chip BW [1]
A simplified, original card. Figures as reported by Tesla via SemiAnalysis and corroborated by Cadence’s technical summary. [1][2]
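
The headline figures are easier to compare once they are turned into ratios. The sketch below derives a few from the numbers quoted above. It assumes each of the 354 functional units is a training node carrying its own 1.25MB of SRAM, which the cited coverage implies but this essay does not state outright; the dictionary keys and function name are illustrative.

```python
# Headline D1 figures as quoted in this essay (Tesla via SemiAnalysis).
D1 = {
    "functional_units": 354,
    "tflops_bf16_cfp8": 362.0,
    "transistors": 50e9,
    "die_area_mm2": 645.0,
    "tdp_watts": 400.0,
    "sram_mb_per_node": 1.25,  # assumption: one SRAM pool per functional unit
}


def derived_metrics(chip):
    """Return simple ratios computed from the quoted headline figures."""
    return {
        "tflops_per_unit": chip["tflops_bf16_cfp8"] / chip["functional_units"],
        "tflops_per_watt": chip["tflops_bf16_cfp8"] / chip["tdp_watts"],
        "sram_mb_per_die": chip["sram_mb_per_node"] * chip["functional_units"],
        "mtransistors_per_mm2": chip["transistors"] / chip["die_area_mm2"] / 1e6,
    }


metrics = derived_metrics(D1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions the die works out to roughly 1 TFLOPS per functional unit, about 0.9 TFLOPS per watt, and about 440MB of SRAM spread across the die. These are back-of-envelope ratios, not measured results.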

Most AI chips are described by compute. Dojo needs to be described by movement.


III. The tile was the breakthrough

Twenty-five D1 chips were packaged in a fan-out wafer process into one training tile delivering roughly 9 PFLOPS of BF16 / CFP8 compute and 36 TB/s of off-tile bandwidth. [1] Cadence’s technical summary corroborates the framing: 25 known-good D1 dies on a fan-out wafer process that preserves bandwidth between adjacent chips. [2] In a normal accelerator story, the chip is the product. In Dojo, the tile was the product.

Diagram · 25-die training tile, schematic
Training tile · package as compute surface (grid of 25 D1 cells inside one fan-out tile frame)
25 known-good D1 dies on a fan-out wafer process, ~9 PFLOPS, ~36 TB/s off-tile bandwidth. [1][2]
A schematic, original visual. The Tesla AI Day image is not reproduced; cell counts and frame labels are stylised.
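
The tile-level numbers follow from the die-level ones by simple multiplication. A minimal sketch, using only the figures quoted in this essay and assuming 354 SRAM-bearing nodes per die (an assumption, as is every constant name here):

```python
DIES_PER_TILE = 25
D1_TFLOPS = 362.0        # BF16 / CFP8, per die
D1_TDP_W = 400.0         # per die
D1_SRAM_MB = 354 * 1.25  # assumption: 354 nodes per die, 1.25MB each

tile_pflops = DIES_PER_TILE * D1_TFLOPS / 1000         # TFLOPS -> PFLOPS
tile_die_power_kw = DIES_PER_TILE * D1_TDP_W / 1000    # W -> kW, dies only
tile_sram_gb = DIES_PER_TILE * D1_SRAM_MB / 1000       # MB -> GB (decimal)

print(f"compute:   {tile_pflops:.2f} PFLOPS")
print(f"die power: {tile_die_power_kw:.1f} kW")
print(f"SRAM:      {tile_sram_gb:.2f} GB")
```

The compute result lines up with the roughly 9 PFLOPS quoted for the tile, and the die power of 10kW sits at the low end of the package-level power figure reported later in the essay, before power delivery and IO overheads are added.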

The package became the computer.


IV. Fan-out wafer packaging, simply

A normal package connects chips through a substrate or interposer. Fan-out wafer packaging redistributes chip connections through a wafer-like structure, allowing many known-good dies to be connected close together with dense wiring.

Diagram · Fan-out wafer — advantages vs tradeoffs
Advantages

Why Dojo chose it

  • Shorter chip-to-chip paths.
  • Lower latency and higher bandwidth density.
  • Better scale-up inside the tile.
  • Dense interconnect without monolithic die yield risk. [1]
Tradeoffs

What it costs

  • Yield complexity at the tile level.
  • Thermal and serviceability challenges.
  • Power-delivery difficulty at scale.
  • Software must understand the topology.
A simplified, original split.

Dojo was not trying to make one impossible die. It was trying to make many dies behave like one training surface.
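
The yield argument behind "many dies, one surface" can be illustrated with the textbook Poisson yield model, in which the chance that a die has zero defects falls exponentially with its area. This is a sketch, not Tesla's or any foundry's yield data: the defect density `D0` is a hypothetical value chosen only for illustration, and the monolithic die is a thought experiment, since a die that size could not be exposed in a single reticle anyway.

```python
import math


def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of dies expected to be defect-free under a Poisson model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


D0 = 0.1  # hypothetical defect density, defects per cm^2

d1_yield = poisson_yield(645.0, D0)             # one D1-sized die
monolithic_yield = poisson_yield(25 * 645.0, D0)  # one die with 25x the area

print(f"single 645mm^2 die:       {d1_yield:.1%}")
print(f"hypothetical 25x monolith: {monolithic_yield:.2e}")
```

With known-good dies, silicon defects are paid for one die at a time: a bad die is discarded before it reaches the tile. What remains is assembly yield at the tile level, which is the tradeoff listed above and is not captured by this model.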


V. Power delivery was architecture

The 2021 SemiAnalysis piece described a tile consuming over 10kW at the package level and about 15kW when power delivery, IO, and wafer wiring are included, with power entering vertically from the bottom, heat leaving from the top, and custom VRMs reflowed directly onto the fan-out wafer. [1] Cadence’s summary corroborates: the tile took 52V DC, drew 18,000A, dissipated 15kW of heat, and delivered 9 PFLOPS in less than one cubic foot. [2]

Diagram · Power and cooling, simplified
01 · Input · 52V DC [2]
02 · Custom VRMs · reflowed on tile [1]
03 · D1 fabric · 18,000 A draw [2]
04 · Heat path · top extraction [1]
05 · Cold plate · 15kW removed [2]
06 · Cabinet · ExaPOD scaling [1]
A simplified, original 6-step flow. Power, cooling, and the tile are the same architecture.
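
The three quoted figures (52V, 18,000A, 15kW) cannot all describe the same point in the circuit, because 52V times 18,000A is far more than 15kW. One reading, which is an assumption here and not something the cited sources spell out, is that 52V is the input to the tile and 18,000A is the current delivered to the dies after the on-tile VRMs step the voltage down. The arithmetic below checks that reading using P = V × I.

```python
TILE_POWER_W = 15_000.0      # quoted heat dissipation
INPUT_VOLTAGE_V = 52.0       # quoted DC input
QUOTED_CURRENT_A = 18_000.0  # quoted current draw

# Current needed at the 52V input to deliver 15kW.
input_current_a = TILE_POWER_W / INPUT_VOLTAGE_V

# Voltage at which 18,000A corresponds to 15kW.
implied_rail_v = TILE_POWER_W / QUOTED_CURRENT_A

# What 52V x 18,000A would be if both applied at the same node.
naive_power_kw = INPUT_VOLTAGE_V * QUOTED_CURRENT_A / 1000

# Resistive loss scales with current squared (P_loss = I^2 * R), so for
# the same conductor, low-voltage distribution loses this many times more.
loss_ratio = (QUOTED_CURRENT_A / input_current_a) ** 2

print(f"input current at 52V: {input_current_a:.0f} A")
print(f"implied die rail:     {implied_rail_v:.2f} V")
print(f"naive 52V x 18kA:     {naive_power_kw:.0f} kW")
print(f"I^2R loss ratio:      {loss_ratio:.0f}x")
```

Under that assumption the numbers are consistent: roughly 290A enters at 52V, and the dies see a rail a little under 1V. The loss ratio is the practical reason the conversion happens on the tile itself, as close to the silicon as possible.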

At Dojo scale, the question is not just “how fast is the chip?” It is “can the system feed enough power and remove enough heat to keep the fabric alive?”


VI. SRAM locality was the memory bet

Each training node carried 1.25MB of SRAM, and Dojo emphasised spatial and temporal locality rather than the HBM-heavy approach most accelerators took. [1] HBM-heavy systems bring enormous memory bandwidth, but depend on expensive memory stacks and interposers. Dojo instead relied on many local SRAM pools, a tightly connected compute fabric, and data movement through the tile.

Diagram · SRAM locality vs HBM-heavy memory bet
Mainstream · HBM-heavy: High-bandwidth memory stacks beside the accelerator. Broad workload fit, deep ecosystem, dominant for general-purpose training.
Dojo emphasis · SRAM locality: Many local SRAM pools, tight inter-die fabric. Narrower workload fit, but lower memory-movement penalty if the workload maps cleanly to the tile. [1]
A simplified, original split. Architectural emphasis, not absolute absence of external memory.

Dojo was not a generic GPU clone. It was a locality and interconnect bet.
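
Spatial locality has a simple geometric meaning on a mesh: data that lives far away costs more hops to reach. The sketch below assumes the 25 dies sit in a 5 × 5 grid with nearest-neighbour links only, which is a simplification made for illustration and not a description of the actual tile routing.

```python
from itertools import product

SIDE = 5  # assumption: 25 dies arranged as a 5 x 5 nearest-neighbour mesh


def mesh_hops(a, b):
    """Manhattan distance between two (row, col) positions on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


dies = list(product(range(SIDE), repeat=2))
pairs = [(a, b) for a in dies for b in dies if a != b]

avg_hops = sum(mesh_hops(a, b) for a, b in pairs) / len(pairs)
max_hops = max(mesh_hops(a, b) for a, b in pairs)

print(f"average hops between two dies: {avg_hops:.2f}")
print(f"worst-case hops (corner to corner): {max_hops}")
```

In this toy model a neighbouring die is one hop away, a random die is a little over three, and the far corner is eight. A workload that keeps communicating partitions adjacent pays the one-hop price; one that scatters them pays several times more, which is why the mapping matters as much as the raw bandwidth.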


VII. Dojo was also a software bet

A beautiful tile is useless if the software stack cannot map real models onto it. The 2021 SemiAnalysis piece described a Dojo software stack with a PyTorch extension at the top, a Dojo compiler engine in the middle, and an LLVM backend at the bottom, with multi-host and multi-partition support, model and data graph parallelism, and the ability to scale work across chip and tile boundaries. [1]

Diagram · Dojo software stack, simplified
01 · PyTorch model · developer entry [1]
02 · Dojo compiler · graph partition [1]
03 · LLVM backend · code generation [1]
04 · Tile fabric · multi-host runtime [1]
05 · Cabinet / ExaPOD · training scale [1]
A simplified, original visual based on the public 2021 Dojo software-stack framing. [1]

Custom AI hardware only matters if the software stack makes the hardware usable.
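
To make "graph partition" concrete, here is a toy version of one piece of the problem: splitting a sequence of layers into contiguous groups of similar cost, one group per die or tile. This is a hypothetical greedy heuristic written for this essay, not the Dojo compiler's algorithm; `partition_layers` and the example costs are invented for illustration.

```python
def partition_layers(costs, num_parts):
    """Split per-layer costs into contiguous groups of similar total cost.

    Greedy heuristic: close a group once it reaches the ideal share of
    the total, while keeping at least one layer for every later group.
    Returns a list of groups, each a list of layer indices.
    """
    if not 1 <= num_parts <= len(costs):
        raise ValueError("num_parts must be between 1 and len(costs)")

    target = sum(costs) / num_parts
    parts, current, running = [], [], 0.0
    for index, cost in enumerate(costs):
        current.append(index)
        running += cost
        groups_left = num_parts - len(parts) - 1  # groups after this one
        layers_left = len(costs) - index - 1
        if groups_left > 0 and (running >= target or layers_left == groups_left):
            parts.append(current)
            current, running = [], 0.0
    parts.append(current)
    return parts


layer_costs = [4, 2, 2, 4, 6, 6]  # hypothetical relative compute per layer
stages = partition_layers(layer_costs, 3)
print(stages)
print([sum(layer_costs[i] for i in stage) for stage in stages])
```

Even this toy leaves the stages unevenly loaded, and it ignores communication volume, memory limits, and topology entirely. A production compiler has to balance all of those at once, which is part of why the software bet was as large as the hardware one.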


VIII. The 2021 claim was massive

Tesla’s headline framing in 2021 was bold: roughly 4× performance, 1.3× performance per watt, 5× smaller footprint, and nearly an order-of-magnitude TCO advantage versus an Nvidia AI solution. The SemiAnalysis author was enthusiastic but explicitly cautioned that the real test would be production deployment. [1]

Tesla 2021 claim · vs proof required
~4× · performance vs Nvidia [1]
~1.3× · performance per watt [1]
~5× · smaller footprint [1]
~10× · TCO advantage (claim) [1]
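
Two of these ratios imply a third. If the 4× performance and 1.3× performance-per-watt figures describe the same system comparison (an assumption, since the claim does not say so explicitly), then the power drawn must also have risen. The function name below is illustrative.

```python
def implied_power_ratio(perf_ratio, perf_per_watt_ratio):
    """Power ratio implied by a performance ratio and a perf/W ratio.

    perf_per_watt = perf / power, so power = perf / perf_per_watt.
    """
    return perf_ratio / perf_per_watt_ratio


power_ratio = implied_power_ratio(4.0, 1.3)
print(f"implied power vs baseline: {power_ratio:.2f}x")
```

Read that way, the claimed system would draw about three times the power of the baseline it was compared against, in one fifth of the footprint. That is another way of seeing why power delivery and cooling were part of the architecture and not an afterthought.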

The claim was not obviously impossible. But the burden of proof was enormous: stable hardware, working compiler, high utilisation, reliable cooling, strong yield, software migration, and production-scale deployment that beat the economics of Nvidia clusters. None of those land from a slide. [1]

The claim was not obviously impossible. But the burden of proof was enormous.


IX. The 2026 update is not a clean victory lap

In August 2025, Reuters, citing Bloomberg, reported that Tesla was disbanding its Dojo supercomputer team. Musk said it did not make sense for Tesla to divide resources across two very different AI chip designs, and that the company’s effort was focused on AI5, AI6, and subsequent chips, which he framed as excellent for inference and at least pretty good for training. [3]

Reading the shift

Dojo’s reorganisation is not a verdict on packaging.

The reporting describes a streamlining of Tesla’s AI chip teams toward AI5 / AI6, not a rejection of fan-out wafer-scale ideas industry-wide. It tells you about Tesla’s priorities. It does not tell you that the package-as-system thesis is wrong. If anything, TSMC’s SoW-X roadmap points the other way, toward larger package-scale systems. [7]

Dojo was technically fascinating, but Tesla appears to have chosen a more unified inference-first chip roadmap.


X. AI5 and AI6 are the new center

Reuters reported that Tesla signed a ~$16.5B supply deal with Samsung, with Musk saying Samsung’s Taylor (Texas) factory would make Tesla’s next-generation AI6 chip. Samsung currently makes Tesla’s AI4 chips. According to Musk, TSMC is slated to make AI5, first in Taiwan and then in Arizona. The chips are intended for self-driving vehicles, Optimus robots, and broader AI applications. [4] Reuters also reported Musk saying Tesla may tape out AI6 in December 2026, with a Samsung executive saying Tesla chips based on Samsung’s advanced 2nm process were planned for production in the second half of 2027. [5]

Diagram · Dojo → AI5 / AI6 strategy shift
2021 · Dojo AI Day · D1, training tile, ExaPOD, custom software. [1]
2022–24 · Dojo build-out · 25 D1 dies, 9 PF tile, fan-out wafer. [2]
2024–25 · Cost / priority test · Custom training path competes with GPU spend.
2025 · Dojo streamlined · Resources move to AI5 / AI6. [3]
2026–27 · AI5 / AI6 · TSMC + Samsung; inference-first chips. [4][5]
A simplified, original timeline. Years are approximate; key milestones per cited reporting.

Dojo was the training-supercomputer bet. AI5 and AI6 are the deployment-scale AI-chip bets.


XI. Why the packaging thesis aged better than the Dojo thesis

TSMC’s 2026 North America Technology Symposium materials describe a packaging roadmap with 5.5-reticle CoWoS today, 14-reticle CoWoS by 2028 supporting roughly 10 large compute dies and 20 HBM stacks, and a 40-reticle SoW-X System-on-Wafer targeted for 2029, alongside SoIC 3D stacking and COUPE co-packaged optics. [7]

Diagram · TSMC packaging roadmap, simplified
Today · CoWoS 5.5R · compute + HBM [7]
2028 · CoWoS 14R · ~10 dies + 20 HBM [7]
Cross-cut · SoIC + COUPE · 3D stacking, optics [7]
2029 · SoW-X 40R · system-on-wafer [7]
A simplified, original visual of TSMC’s public roadmap framing.
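
Reticle multiples are easier to picture as area. The sketch below converts the roadmap steps using the standard 26mm × 33mm lithography field as the reticle size, and compares them with the silicon in a 25-die Dojo tile. Treat the results as nominal: actual interposer and wafer dimensions are not given in the cited material, and the variable names are illustrative.

```python
RETICLE_MM2 = 26 * 33  # standard full lithography field, 858 mm^2

roadmap_reticles = {
    "CoWoS today": 5.5,
    "CoWoS 2028": 14,
    "SoW-X 2029": 40,
}

# Nominal package area implied by each reticle multiple.
roadmap_area_mm2 = {
    name: count * RETICLE_MM2 for name, count in roadmap_reticles.items()
}

# Silicon in one Dojo training tile: 25 dies of roughly 645 mm^2 each.
dojo_tile_silicon_mm2 = 25 * 645
dojo_tile_reticles = dojo_tile_silicon_mm2 / RETICLE_MM2

for name, area in roadmap_area_mm2.items():
    print(f"{name}: {area:,.0f} mm^2")
print(f"Dojo tile silicon: {dojo_tile_silicon_mm2:,} mm^2 "
      f"(~{dojo_tile_reticles:.1f} reticles)")
```

On this rough measure, the compute silicon in a 2021 Dojo tile already sat between the 14-reticle and 40-reticle steps of the roadmap, which is one way to read the claim that the mainstream is converging on the scale Dojo attempted.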

Dojo itself became complicated. The Dojo problem became universal.


XII. The package is becoming the computer

AI scaling used to look like: faster chip → faster model training. Now it looks like: die + package + HBM / SRAM + interconnect + power + cooling + rack + software partitioning. The bottlenecks are chip-to-chip bandwidth, HBM capacity, HBM bandwidth, SRAM locality, package size, reticle limits, substrate / interposer complexity, power delivery, cooling, optical I/O, rack networking, compiler / runtime, and workload partitioning.

Diagram · Old vs new scaling lens
Old lens

Chip → server → cluster

  • Bottleneck · single-chip performance.
  • Scaling unit · one accelerator.
  • Constraint · transistor count.
  • Result · servers full of independent chips.
New lens

Die → tile → rack → data center

  • Bottleneck · chip-to-chip bandwidth, power, cooling.
  • Scaling unit · package + tile + rack. [7]
  • Constraint · interconnect, memory, energy.
  • Result · AI factory designed as one system.
A simplified, original split. The unit of AI scaling moved from chip to package to rack.

The package is becoming the unit of AI scaling.


XIII. Dojo vs Nvidia was vertical integration vs ecosystem

The Dojo-vs-Nvidia comparison is not just "custom chip vs GPU." It is platform vs vertical integration. Nvidia wins by ecosystem leverage. Dojo tried to win by workload-specific integration.

| Dimension | Nvidia path | Tesla Dojo path |
| --- | --- | --- |
| Hardware | Mature GPU + HBM + NVLink + DGX rack-scale systems. | D1 dies + fan-out training tile + custom cabinet. [2] |
| Memory | HBM-heavy. | SRAM locality + tile fabric. [1] |
| Networking | NVLink + InfiniBand + Spectrum-X. | On-tile fabric + custom interconnect. [1] |
| Software | CUDA + libraries + ecosystem. | Tesla PyTorch ext + Dojo compiler + LLVM. [1] |
| Workload fit | Broad, flexible. | Optimised for Tesla’s vision data loop. |
| Iteration | Through external supplier scale. | Through internal control. |

Nvidia sells a platform. Dojo was Tesla trying to build a machine for one company’s data loop.


XIV. The software problem

A custom accelerator fails quietly when the software team cannot make it convenient. To make Dojo work, Tesla needed a compiler, runtime, training-framework support, model partitioning, debugging tools, scheduling, fault tolerance, reliability management, data pipeline integration, developer productivity, and a migration path from existing GPU workflows. None of those are easy. None of them get cheaper at custom-hardware scale.

The hardware can be brilliant and still lose if the software path is too painful.


XV. The business lesson

Hardware ambition is not enough. The architecture has to map to business leverage. Dojo had a clear technical reason: Tesla-specific training from fleet video. But the business test is whether it reduced cost enough, sped training enough, justified a separate team, kept up with Nvidia’s roadmap, justified custom software, scaled reliably, and helped cars and robots ship faster. Those are a lot of bars to clear in parallel.

Dojo’s enemy was not only Nvidia. Dojo’s enemy was the cost of becoming Nvidia, a packaging company, a compiler company, and a data-center operator at the same time.


XVI. What could break the thesis?

Dojo showed technical ambition, but not enough proven business leverage to keep the dedicated training path central.

Bear case · what could break the thesis
  1. Custom too narrow. Dojo may have been too custom for Tesla’s evolving AI roadmap. [3]
  2. GPU still wins on cost. External GPUs and AI5 / AI6 may be better uses of Tesla resources.
  3. Training ↔ inference shift. Dedicated training chips can become obsolete if inference dominates. [3]
  4. Schedule risk. Samsung 2nm or TSMC AI5 schedules may slip. [5]
  5. Nvidia ecosystem. Nvidia’s software / network / cloud stack remains very hard to beat.
  6. Software burden. Custom stacks are expensive to maintain.
  7. Fan-out hard problems. Yield, thermal, and serviceability challenges scale with package size. [1]
  8. Compiler ceiling. A beautiful package is wasted if the compiler / runtime is not productive.
  9. Autonomy ≠ compute-bound. Tesla’s autonomy progress may not be bottlenecked by training hardware alone.

XVII. What could break the bear case?

Even if Dojo changes form, the lesson survives.

Bull case · what could break the bear
  1. Early lesson advantage. Tesla learned package-scale AI hardware earlier than most. [1]
  2. Reusable IP. Dojo lessons feed AI5, AI6, robotics, autonomy, and future infrastructure.
  3. Data + workload control. Tesla has unique data and workload control.
  4. Training ↔ inference convergence. Future chips may serve both well. [3]
  5. Package-scale mainstreams. SoW-X / CoWoS 14R / SoIC validate the direction. [7]
  6. Power / cooling reusable. Lessons travel even if the tile architecture changes form. [2]
  7. Robotics scale. Optimus and physical AI add new training-compute demand.
  8. Vertical integration still rare. Few companies can co-design data, model, and silicon.

Even if Dojo changes form, the lesson survives: AI compute is constrained by bandwidth, power, cooling, packaging, and software.


XVIII. What to watch

  • Whether Tesla continues using Dojo systems internally. [3]
  • AI5 tape-out and production timing. [4]
  • AI6 tape-out timing. [5]
  • Samsung 2nm yield and schedule. [5]
  • TSMC AI5 production in Taiwan / Arizona. [4]
  • Tesla’s actual training-compute spending.
  • Nvidia usage inside Tesla.
  • AMD usage inside Tesla, if any.
  • Whether Dojo software survives in AI5 / AI6 workflows.
  • Optimus training requirements.
  • Autonomy model size and data growth.
  • Tesla’s FSD training cadence.
  • Package-scale AI hardware trends. [7]
  • TSMC SoW-X roadmap. [7]
  • CoWoS capacity and package size evolution. [7]
  • HBM dependence vs SRAM-heavy architectures. [1]
  • Power and cooling architecture for AI clusters. [2]
  • Whether custom accelerators beat GPU clusters in narrow workloads.

Glossary

A short reference for the vocabulary used above. Definitions are simplified.

Dojo: Tesla’s custom AI training supercomputer project.
D1: Tesla’s Dojo training chip.
Training node: Local compute block inside D1.
Training tile: Package-level unit containing 25 D1 dies.
Known-good die: A chip die tested before integration.
Fan-out wafer: Packaging process that redistributes connections outward from dies to enable dense integration.
SRAM: Fast on-chip memory.
HBM: High-bandwidth memory used near AI accelerators.
BF16: bfloat16, a low-precision format often used in AI training.
CFP8: Configurable 8-bit floating point format referenced in Dojo coverage.
SerDes: Serializer / deserializer links used for high-speed data movement.
TDP: Thermal design power.
VRM: Voltage regulator module.
ExaPOD: Tesla’s larger Dojo system concept built from multiple tiles / cabinets.
CoWoS: TSMC advanced packaging technology often used for AI / HPC chips.
SoIC: TSMC 3D stacking technology.
SoW-X: TSMC System-on-Wafer roadmap technology.
COUPE: TSMC co-packaged optics technology.
Tape-out: Final chip design handoff before manufacturing.
Rack-scale AI: AI compute designed at rack or data-center scale rather than single-chip scale.

XIX. The package became the computer

Dojo was not just a chip.

It was Tesla’s attempt to make a training computer out of packaging, power delivery, cooling, interconnect, SRAM, and software.

In 2026, the dedicated Dojo story looks weaker because Tesla shifted toward AI5 and AI6. But the core insight aged well: AI scaling is moving from chip-level performance to package-level and rack-level integration. The package is becoming the computer.

The 2021 SemiAnalysis piece was bold about Tesla’s ~4× / 1.3× / 5× / nearly-order-of-magnitude TCO claim, and explicit that production deployment was the real test. The 2026 reality is more complicated than that claim implied and more aligned with that caution than either side of the bull-bear debate likes to admit. Tesla learned an enormous amount about package-scale AI hardware. Tesla also decided that a dedicated Dojo training-supercomputer team was not how it wanted to spend that learning.

The Dojo team was streamlined toward AI5 and AI6. TSMC’s CoWoS, SoIC, COUPE, and SoW-X roadmap is moving the industry toward exactly the kind of package-scale and rack-scale integration Dojo previewed in 2021. The companies that win the next AI cycle will be the ones that treat the package, the rack, and the software stack as one design problem. Tesla saw that early. Others are catching up now.

That is how the order-of-magnitude bet aged. Not into a clean victory for any one company. Into the AI infrastructure playbook.

That is how the package became the computer.


1 Patel, D. (Aug 2021). Tesla’s Dojo, 1 Order Of Magnitude Better Cost, Performance, Scale Than Nvidia Solutions. SemiAnalysis. Historical anchor for the Dojo system framing, including D1 specs (354 functional units, 362 TFLOPS BF16/CFP8, 50B transistors, 645mm², 400W TDP, 1.25MB SRAM per training node, 10 TB/s on-chip directional bandwidth, 576 SerDes at 112 GT/s, ~8 TB/s off-chip bandwidth), the 25-die training tile (9 PFLOPS, 36 TB/s off-tile bandwidth), the ~10kW / ~15kW tile power framing, vertical power delivery, custom VRMs reflowed onto the fan-out wafer, the cabinet / ExaPOD architecture, the PyTorch / Dojo compiler / LLVM software stack, and Tesla’s ~4x / 1.3x / 5x / ~10x TCO claim versus Nvidia AI solutions with the explicit caution that production deployment was the real test. Used as inspiration only. No content, structure, or charts reproduced.

2 Cadence. Not chips: Tesla’s Dojo. Independent technical summary corroborating 25 known-good D1 dies on a fanout-wafer process, ~9 PFLOPS / tile, ~36 TB/s off-tile bandwidth, ~52V DC, ~18,000A current draw, ~15kW heat dissipation, and the tile / cabinet / ExaPOD framing.

3 Reuters (Aug 2025). Tesla to streamline its AI chip design work, Musk says. Bloomberg-via-Reuters reporting that Tesla was disbanding its Dojo supercomputer team, with Musk saying it did not make sense to divide resources across two very different AI chip designs and that Tesla’s effort was focused on AI5, AI6, and subsequent chips, described as excellent for inference and at least pretty good for training.

4 Reuters (Jul 2025). Tesla / Samsung $16.5B chip supply deal. Samsung’s Taylor (Texas) factory framed to make Tesla’s next-generation AI6 chip per Musk; Samsung currently making AI4; TSMC slated to make AI5 first in Taiwan and then Arizona per Musk; self-driving / Optimus / broader AI context.

5 Reuters (Mar 2026). Musk says Tesla may tape out AI6 in December 2026. Samsung executive cited saying Tesla chips based on Samsung’s advanced 2nm process were planned for production in the second half of 2027.

6 Tesla AI Day 2021 official material and subsequent Tesla technical presentations remain the original sources for D1 chip, training tile, cabinet, ExaPOD, and software stack framing. This essay relies on the SemiAnalysis 2021 framing (fn1) and the Cadence technical summary (fn2) for the specific numbers used; no Tesla material is reproduced.

7 TSMC (2026). 2026 North America Technology Symposium. CoWoS expansion (5.5-reticle today, 14-reticle by 2028 with ~10 large compute dies and 20 HBM stacks), SoIC 3D stacking, COUPE co-packaged optics, and a 40-reticle SoW-X System-on-Wafer technology targeted for 2029.

8 Public Hot Chips 34 Dojo System material and other credible Dojo technical coverage is referenced in this essay only at the level the SemiAnalysis 2021 framing (fn1) and Cadence summary (fn2) already disclose. Specific microarchitecture claims are made only where verifiable.

9 Public Nvidia developer / DGX materials are referenced only as comparative platform context. This essay does not reproduce Nvidia visuals or marketing.

10 TSMC packaging materials (CoWoS, InFO, SoIC, SoW) provide additional context for the system-level packaging direction. The essay uses the 2026 symposium framing (fn7) as the primary citation rather than restating individual product pages.

*   *   *

This is Essay No. 040. The topics: intelligence, AI, systems, knowledge, and the questions underneath the questions everyone else is asking. If you read this far and disagreed with any part of it, write to me. I read everything.

Pugalenthi Magendran