
Essay No. 040 · AI Infrastructure · Melbourne, Australia


The Package Became the Computer.

Original analysis · Not investment advice

How Tesla Dojo’s order-of-magnitude training bet aged into the AI infrastructure playbook.

Pugalenthi Magendran

April 2026 · Melbourne, Australia

12 min read

Dojo was not just a chip. It was Tesla’s attempt to make a training computer out of packaging, power delivery, cooling, interconnect, SRAM, and software. In 2026, the dedicated Dojo story looks weaker because Tesla shifted toward AI5 and AI6. But the core insight aged well: AI scaling is moving from chip-level performance to package-level and rack-level integration. The package is becoming the computer.

In 2021, Tesla revealed Dojo. The easy headline was: Tesla built an AI chip. That was not the real story. The real story was stranger. Tesla built a training surface.

The D1 chip mattered, but the chip was not the unit of scale. The 2021 SemiAnalysis article made this clear: Dojo’s real unit of scale was the training tile, a 25-chip fan-out wafer package designed to behave like one giant compute plane.¹

That is why Dojo mattered. Not because Tesla made another accelerator. Because Tesla tried to make packaging, power delivery, cooling, and software part of the accelerator itself.

Key idea

Tesla Dojo was an early, extreme version of the problem every AI hardware company now faces: scaling AI compute is not just about faster chips. It is about packaging, interconnect, memory locality, power delivery, cooling, software, and workload fit. The dedicated Dojo path looks less central after Tesla shifted toward AI5 and AI6, but the system-level lesson aged well. AI hardware is becoming package-scale and rack-scale.

I. The 2021 thesis was about the system

In August 2021, Dylan Patel published a SemiAnalysis piece written after Tesla AI Day. It described more than the D1 chip: it covered the training tile, the power delivery, the cabinet, the ExaPOD, and the software stack. It argued that Tesla had designed Dojo because GPU-cluster scaling was not enough, and that Dojo’s distributed compute plane needed high bandwidth, low latency, spatial and temporal locality, and a mesh of compute units connected by fabric. The deeper insight was that the package and system were the unit of scale.¹

2021 thesis

Dojo was not just an accelerator chip. It was Tesla’s attempt to make packaging, bandwidth, power, cooling, and software into one training architecture.

II. D1 was built to be networked

The 2021 piece described D1 as a chip designed for movement, not only for compute. Each training node carried roughly 1.25MB of SRAM and combined CPU-like flexibility with SIMD, matrix-multiply, and ML-focused custom instructions. The die delivered 362 TFLOPS in BF16 / CFP8 across 354 functional units, used 50B transistors on a roughly 645mm² die, and drew about 400W TDP. It supported 10 TB/s of on-chip bandwidth per direction and exposed 576 SerDes at 112 GT/s for roughly 8 TB/s of total off-chip bandwidth.¹

Card · D1, simplified

D1 chip specs

354 · functional units¹
362 TF · BF16 / CFP8¹
50B · transistors¹
645mm² · die area¹
400W · TDP¹
10 TB/s · on-chip BW (directional)¹
576 · SerDes @ 112 GT/s¹
~8 TB/s · off-chip BW¹

A simplified, original card. Figures as reported by Tesla via SemiAnalysis and corroborated by Cadence’s technical summary.¹²
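The off-chip figure on the card can be cross-checked from the lane count and lane rate. The sketch below is only arithmetic on the numbers quoted above. It assumes that "112 GT/s" corresponds to roughly 112 Gbit/s of raw signalling per lane and ignores encoding overhead, neither of which the source spells out. The function and constant names are illustrative.

```python
# Cross-check of the quoted off-chip bandwidth from lane count and lane rate.
SERDES_LANES = 576        # from the D1 card above
LANE_RATE_GBIT_S = 112    # assumption: 112 GT/s treated as ~112 Gbit/s raw per lane


def aggregate_bandwidth_tb_s(lanes: int, gbit_per_lane: float) -> float:
    """Return raw aggregate bandwidth in terabytes per second (decimal units)."""
    total_gbit_s = lanes * gbit_per_lane
    return total_gbit_s / 8 / 1000  # bits -> bytes, then giga -> tera


off_chip_tb_s = aggregate_bandwidth_tb_s(SERDES_LANES, LANE_RATE_GBIT_S)
print(f"{off_chip_tb_s:.3f} TB/s")  # 8.064 TB/s
```

Under those assumptions the lanes add up to about 8 TB/s, which is consistent with the "~8 TB/s" on the card being a figure in bytes rather than bits.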

Most AI chips are described by compute. Dojo needs to be described by movement.

III. The tile was the breakthrough

Twenty-five D1 chips were packaged in a fan-out wafer process into one training tile delivering roughly 9 PFLOPS of BF16 / CFP8 compute and 36 TB/s of off-tile bandwidth.¹ Cadence’s technical summary corroborates the framing: 25 known-good D1 dies on a fanout-wafer process that preserves bandwidth between adjacent chips.² In a normal accelerator story, the chip is the product. In Dojo, the tile was the product.

Diagram · 25-die training tile, schematic

Training tile · package as compute surface

9 PFLOPS · fan-out tile · 36 TB/s off-tile

25 known-good D1 dies on a fanout-wafer process, ~9 PFLOPS, ~36 TB/s off-tile bandwidth.¹²

A schematic, original visual. The Tesla AI Day image is not reproduced; cell counts and frame labels are stylised.
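The tile numbers follow from the per-die numbers by simple multiplication, which is a useful sanity check. The sketch below uses only figures quoted in this essay. The "share of I/O leaving the tile" ratio is an illustrative derived quantity, not a number from the sources, and it assumes the per-die and per-tile bandwidth figures use the same counting convention.

```python
# Sanity check: aggregate 25 D1 dies into one training tile.
DIES_PER_TILE = 25
DIE_TFLOPS = 362            # BF16 / CFP8, per die
DIE_OFF_CHIP_TB_S = 8       # approximate off-chip bandwidth per die
TILE_OFF_TILE_TB_S = 36     # quoted off-tile bandwidth

tile_pflops = DIES_PER_TILE * DIE_TFLOPS / 1000
summed_die_io_tb_s = DIES_PER_TILE * DIE_OFF_CHIP_TB_S
# Illustrative ratio (assumption: both bandwidth figures are counted the same way).
off_tile_share = TILE_OFF_TILE_TB_S / summed_die_io_tb_s

print(f"tile compute: {tile_pflops:.2f} PFLOPS")            # 9.05 PFLOPS
print(f"summed die I/O: {summed_die_io_tb_s} TB/s")          # 200 TB/s
print(f"share leaving the tile: {off_tile_share:.0%}")       # 18%
```

If the figures are comparable, most of each die's I/O capacity faces its neighbours inside the tile rather than the outside world, which fits the description of the tile as one compute plane.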

The package became the computer.

IV. Fan-out wafer packaging, simply

A normal package connects chips through a substrate or interposer. Fan-out wafer packaging redistributes chip connections through a wafer-like structure, allowing many known-good dies to be connected close together with dense wiring.

Diagram · Fan-out wafer — advantages vs tradeoffs

Advantages

Why Dojo chose it

Shorter chip-to-chip paths.
Lower latency and higher bandwidth density.
Better scale-up inside the tile.
Dense interconnect without monolithic die yield risk.¹

Tradeoffs

What it costs

Yield complexity at the tile level.
Thermal and serviceability challenges.
Power-delivery difficulty at scale.
Software must understand the topology.

A simplified, original split. Dojo was not trying to make one impossible die; it was trying to make many dies behave like one training surface.


V. Power delivery was architecture

The 2021 SemiAnalysis piece described a tile consuming over 10kW at the package level and about 15kW once power delivery, IO, and wafer wiring are included. Power entered vertically from the bottom, heat left from the top, and custom VRMs were reflowed directly onto the fan-out wafer.¹ Cadence’s summary corroborates the scale: the tile took 52V DC, drew 18,000A, dissipated 15kW of heat, and delivered 9 PFLOPS in less than one cubic foot.²

Diagram · Power and cooling, simplified

Input · 52V DC²
Custom VRMs · reflowed on tile¹
D1 fabric · 18,000 A draw²
Heat path · top extraction¹
Cold plate · 15kW removed²
Cabinet · ExaPOD scaling¹

A simplified, original 6-step flow. Power, cooling, and the tile are the same architecture.
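The quoted power figures are easier to read with a little arithmetic. 52V and 18,000A cannot both describe the same point in a 15kW system, since their product would be far larger than 15kW. One plausible reading, which is an assumption here and not something the cited sources state, is that 52V is the tile input and 18,000A is the current delivered to the dies after the VRMs step the voltage down. The sketch below works through that reading using only numbers quoted above.

```python
# Back-of-envelope power arithmetic for one training tile.
TILE_POWER_W = 15_000        # quoted total, including delivery and IO
INPUT_VOLTAGE_V = 52         # quoted DC input
DIE_SIDE_CURRENT_A = 18_000  # assumption: quoted current is on the die side of the VRMs
DIES_PER_TILE = 25
DIE_TDP_W = 400              # quoted per-die TDP

# Current needed at the 52V input to supply the whole tile.
input_current_a = TILE_POWER_W / INPUT_VOLTAGE_V

# Voltage implied if all 15kW were delivered at 18,000A (an upper bound,
# since some of the 15kW is conversion loss and IO rather than die power).
implied_die_voltage_v = TILE_POWER_W / DIE_SIDE_CURRENT_A

summed_die_tdp_w = DIES_PER_TILE * DIE_TDP_W
non_die_power_w = TILE_POWER_W - summed_die_tdp_w

print(f"input current at 52V: {input_current_a:.0f} A")            # 288 A
print(f"implied die-side voltage: {implied_die_voltage_v:.2f} V")   # 0.83 V
print(f"summed die TDP: {summed_die_tdp_w} W")                      # 10000 W
print(f"delivery, IO, and wiring: {non_die_power_w} W")             # 5000 W
```

On this reading, the summed die TDP lines up with the "over 10kW at the package level" figure, and the sub-1V result is why regulators sitting directly on the fan-out wafer matter: carrying thousands of amps over any distance at that voltage is a packaging problem, not a board problem.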

At Dojo scale, the question is not just “how fast is the chip?” It is “can the system feed enough power and remove enough heat to keep the fabric alive?”

VI. SRAM locality was the memory bet

Each training node carried 1.25MB of SRAM, and Dojo emphasised spatial and temporal locality rather than the HBM-heavy approach most accelerators took.¹ HBM-heavy systems bring enormous memory bandwidth, but depend on expensive memory stacks and interposers. Dojo emphasised many local SRAM pools, a tightly connected compute fabric, and data movement through the tile.

Diagram · SRAM locality vs HBM-heavy memory bet

Mainstream

HBM-heavy

High-bandwidth memory stacks beside the accelerator. Broad workload fit, deep ecosystem, dominant for general-purpose training.

Dojo emphasis

SRAM locality

Many local SRAM pools, tight inter-die fabric. Narrower workload fit, but lower memory-movement penalty if the workload maps cleanly to the tile.¹

A simplified, original split. Architectural emphasis, not absolute absence of external memory.
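The scale of the SRAM bet can be estimated from the figures already quoted. The sketch below assumes that each of the 354 functional units is a training node carrying the full 1.25MB, which the text implies but does not state outright, and it uses decimal megabytes and gigabytes.

```python
# Rough on-die and on-tile SRAM totals from the quoted per-node figure.
NODES_PER_DIE = 354        # assumption: every functional unit is a training node
SRAM_PER_NODE_MB = 1.25    # quoted per-node SRAM
DIES_PER_TILE = 25

sram_per_die_mb = NODES_PER_DIE * SRAM_PER_NODE_MB
sram_per_tile_gb = sram_per_die_mb * DIES_PER_TILE / 1000

print(f"SRAM per die: {sram_per_die_mb:.1f} MB")     # 442.5 MB
print(f"SRAM per tile: {sram_per_tile_gb:.2f} GB")   # 11.06 GB
```

Roughly 11 GB of SRAM across an entire tile is small next to what an HBM-based accelerator carries, which is why the design only pays off when the workload can be partitioned so that most data stays local to the nodes that use it.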

Dojo was not a generic GPU clone. It was a locality and interconnect bet.

VII. Dojo was also a software bet

A beautiful tile is useless if the software stack cannot map real models onto it. The 2021 SemiAnalysis piece described a Dojo software stack with a PyTorch extension at the top, a Dojo compiler engine in the middle, and an LLVM backend at the bottom, with multi-host and multi-partition support, model and data graph parallelism, and the ability to scale work across chip and tile boundaries.¹

Diagram · Dojo software stack, simplified

PyTorch model · developer entry¹
Dojo compiler · graph partition¹
LLVM backend · code generation¹
Tile fabric · multi-host runtime¹
Cabinet / ExaPOD · training scale¹

A simplified, original visual based on the public 2021 Dojo software-stack framing.¹
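Why the compiler has to understand topology can be shown with a toy placement problem. The sketch below is not Tesla's compiler or any real Dojo API. It assumes the 25 dies sit on a 5 × 5 grid with nearest-neighbour links, places a 25-stage pipeline on them in two different orders, and counts the mesh hops between consecutive stages. All names are hypothetical.

```python
# Toy illustration: placement order changes communication distance on a mesh.
from typing import List, Tuple

Coord = Tuple[int, int]
GRID = 5  # assumption: 25 dies arranged as a 5 x 5 nearest-neighbour mesh


def row_major(n: int) -> List[Coord]:
    """Place stages left to right on every row (naive ordering)."""
    return [(r, c) for r in range(n) for c in range(n)]


def snake(n: int) -> List[Coord]:
    """Place stages in a back-and-forth order so neighbours stay adjacent."""
    order: List[Coord] = []
    for r in range(n):
        cols = range(n) if r % 2 == 0 else range(n - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order


def total_hops(placement: List[Coord]) -> int:
    """Sum of Manhattan distances between consecutive pipeline stages."""
    return sum(
        abs(a[0] - b[0]) + abs(a[1] - b[1])
        for a, b in zip(placement, placement[1:])
    )


row_major_hops = total_hops(row_major(GRID))
snake_hops = total_hops(snake(GRID))
print(row_major_hops, snake_hops)  # 40 24
```

The same 25 stages cost 40 hops in the naive order and 24 in the topology-aware one. A real partitioner faces a far harder version of this problem, but the point carries: on a fabric like this, placement is part of performance.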

Custom AI hardware only matters if the software stack makes the hardware usable.

VIII. The 2021 claim was massive

Tesla’s headline framing in 2021 was bold: roughly 4× performance, 1.3× performance per watt, 5× smaller footprint, and nearly an order-of-magnitude TCO advantage versus an Nvidia AI solution. The SemiAnalysis author was enthusiastic but explicitly cautioned that the real test would be production deployment.¹

Tesla 2021 claim · vs proof required

~4×

performance vs Nvidia¹

~1.3×

performance per watt¹

~5×

smaller footprint¹

~10×

TCO advantage (claim)¹

The claim was not obviously impossible. But the burden of proof was enormous: stable hardware, working compiler, high utilisation, reliable cooling, strong yield, software migration, and production-scale deployment that beat the economics of Nvidia clusters. None of those land from a slide.¹


IX. The 2026 update is not a clean victory lap

In August 2025, Reuters, citing Bloomberg, reported that Tesla was disbanding its Dojo supercomputer team. Musk said Tesla should not divide resources across two different AI chip designs and that its effort was focused on AI5, AI6, and subsequent chips, which he framed as excellent for inference and at least pretty good for training.³

Reading the shift

Dojo’s reorganisation is not a verdict on packaging.

The reporting describes a streamlining of Tesla’s AI chip teams toward AI5 / AI6, not a rejection of fan-out wafer-scale ideas industry-wide. It tells you about Tesla’s priorities. It does not tell you that the package-as-system thesis is wrong; TSMC’s SoW-X roadmap suggests the opposite direction.⁷

Dojo was technically fascinating, but Tesla appears to have chosen a more unified inference-first chip roadmap.

X. AI5 and AI6 are the new center

Reuters reported that Tesla signed a ~$16.5B supply deal with Samsung, with Musk saying Samsung’s Taylor (Texas) factory would make Tesla’s next-generation AI6 chip. Samsung currently makes Tesla’s AI4 chips, and according to Musk, TSMC is slated to make AI5 first in Taiwan and then in Arizona. The chips are intended for self-driving vehicles, Optimus robots, and broader AI applications.⁴ Reuters also reported Musk saying Tesla may tape out AI6 in December 2026, with a Samsung executive saying Tesla chips based on Samsung’s advanced 2nm process were planned for production in the second half of 2027.⁵

Diagram · Dojo → AI5 / AI6 strategy shift

2021 · Dojo AI Day · D1, training tile, ExaPOD, custom software.¹
2022–24 · Dojo build-out · 25 D1 dies, 9 PF tile, fan-out wafer.²
2024–25 · Cost / priority test · Custom training path competes with GPU spend.
2025 · Dojo streamlined · Resources move to AI5 / AI6.³
2026–27 · AI5 / AI6 · TSMC + Samsung; inference-first chips.⁴⁵

A simplified, original timeline. Years are approximate; key milestones per cited reporting.

Dojo was the training-supercomputer bet. AI5 and AI6 are the deployment-scale AI-chip bets.

XI. Why the packaging thesis aged better than the Dojo thesis

TSMC’s 2026 North America Technology Symposium materials describe a packaging roadmap with 5.5-reticle CoWoS today, 14-reticle CoWoS by 2028 supporting roughly 10 large compute dies and 20 HBM stacks, and a 40-reticle SoW-X System-on-Wafer targeted for 2029, alongside SoIC 3D stacking and COUPE co-packaged optics.⁷

Diagram · TSMC packaging roadmap, simplified

Today · CoWoS 5.5R · compute + HBM⁷
2028 · CoWoS 14R · ~10 dies + 20 HBM⁷
Cross-cut · SoIC + COUPE · 3D stacking, optics⁷
2029 · SoW-X 40R · system-on-wafer⁷

A simplified, original visual of TSMC’s public roadmap framing. Dojo became complicated; the Dojo problem became universal.
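The reticle multiples on the roadmap become more concrete when converted to area. The sketch below assumes the commonly cited lithography field limit of 26mm × 33mm (858mm²) as one "reticle", a figure that is not taken from the cited symposium material. It also compares the result with the raw silicon in a Dojo tile, counting die area only and ignoring the spacing and fan-out area around the dies.

```python
# Convert reticle multiples to approximate area and compare with a Dojo tile.
RETICLE_MM2 = 26 * 33  # assumption: standard maximum exposure field, 858 mm^2

roadmap_reticles = {
    "CoWoS 5.5R (today)": 5.5,
    "CoWoS 14R (2028)": 14,
    "SoW-X 40R (2029)": 40,
}
roadmap_area_mm2 = {
    name: multiple * RETICLE_MM2 for name, multiple in roadmap_reticles.items()
}

# Dojo tile: 25 dies at roughly 645 mm^2 each (die silicon only).
dojo_tile_silicon_mm2 = 25 * 645
dojo_tile_reticles = dojo_tile_silicon_mm2 / RETICLE_MM2

for name, area in roadmap_area_mm2.items():
    print(f"{name}: ~{area:,.0f} mm^2")
print(f"Dojo tile silicon: {dojo_tile_silicon_mm2:,} mm^2 "
      f"(~{dojo_tile_reticles:.1f} reticles)")
```

By this rough measure, the die area in a 2021 Dojo tile already sat between the 14-reticle and 40-reticle points on the roadmap, which is one way to see why the tile reads as an early preview of system-on-wafer packaging.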

Dojo itself became complicated. The Dojo problem became universal.

XII. The package is becoming the computer

AI scaling used to look like: faster chip → faster model training. Now it looks like: die + package + HBM / SRAM + interconnect + power + cooling + rack + software partitioning. The bottlenecks are chip-to-chip bandwidth, HBM capacity, HBM bandwidth, SRAM locality, package size, reticle limits, substrate / interposer complexity, power delivery, cooling, optical I/O, rack networking, compiler / runtime, and workload partitioning.

Diagram · Old vs new scaling lens

Old lens

Chip → server → cluster

Bottleneck · single-chip performance.
Scaling unit · one accelerator.
Constraint · transistor count.
Result · servers full of independent chips.

New lens

Die → tile → rack → data center

Bottleneck · chip-to-chip bandwidth, power, cooling.
Scaling unit · package + tile + rack.⁷
Constraint · interconnect, memory, energy.
Result · AI factory designed as one system.

A simplified, original split. The unit of AI scaling moved from chip to package to rack.

The package is becoming the unit of AI scaling.

XIII. Dojo vs Nvidia was vertical integration vs ecosystem

The Dojo-vs-Nvidia comparison is not just “custom chip vs GPU.” It is ecosystem platform versus vertical integration. Nvidia wins by ecosystem leverage. Dojo tried to win by workload-specific integration.

Dimension · Nvidia path · Tesla Dojo path
Hardware · Mature GPU + HBM + NVLink + DGX rack-scale systems. · D1 dies + fan-out training tile + custom cabinet.²
Memory · HBM-heavy. · SRAM locality + tile fabric.¹
Networking · NVLink + InfiniBand + Spectrum-X. · On-tile fabric + custom interconnect.¹
Software · CUDA + libraries + ecosystem. · Tesla PyTorch ext + Dojo compiler + LLVM.¹
Workload fit · Broad, flexible. · Optimised for Tesla’s vision data loop.
Iteration · Through external supplier scale. · Through internal control.

Nvidia sells a platform. Dojo was Tesla trying to build a machine for one company’s data loop.

XIV. The software problem

A custom accelerator fails quietly when the software team cannot make it convenient. To make Dojo work, Tesla needed a compiler, runtime, training-framework support, model partitioning, debugging tools, scheduling, fault tolerance, reliability management, data pipeline integration, developer productivity, and a migration path from existing GPU workflows. None of those are easy. None of them get cheaper at custom-hardware scale.

The hardware can be brilliant and still lose if the software path is too painful.

XV. The business lesson

Hardware ambition is not enough. The architecture has to map to business leverage. Dojo had a clear technical reason: Tesla-specific training from fleet video. But the business test is whether it reduced cost enough, sped training enough, justified a separate team, kept up with Nvidia’s roadmap, justified custom software, scaled reliably, and helped cars and robots ship faster. Those are a lot of bars to clear in parallel.

Dojo’s enemy was not only Nvidia. Dojo’s enemy was the cost of becoming Nvidia, a packaging company, a compiler company, and a data-center operator at the same time.

XVI. What could break the thesis?

Dojo showed technical ambition, but not enough proven business leverage to keep the dedicated training path central.

Bear case · what could break the thesis

Custom too narrow. Dojo may have been too custom for Tesla’s evolving AI roadmap.³
GPU still wins on cost. External GPUs and AI5 / AI6 may be better uses of Tesla resources.
Training ↔ inference shift. Dedicated training chips can become obsolete if inference dominates.³
Schedule risk. Samsung 2nm or TSMC AI5 schedules may slip.⁵
Nvidia ecosystem. Nvidia’s software / network / cloud stack remains very hard to beat.
Software burden. Custom stacks are expensive to maintain.
Fan-out hard problems. Yield, thermal, and serviceability challenges scale with package size.¹
Compiler ceiling. A beautiful package is wasted if the compiler / runtime is not productive.
Autonomy ≠ compute-bound. Tesla’s autonomy progress may not be bottlenecked by training hardware alone.

XVII. What could break the bear case?

Even if Dojo changes form, the lesson survives.

Bull case · what could break the bear

Early lesson advantage. Tesla learned package-scale AI hardware earlier than most.¹
Reusable IP. Dojo lessons feed AI5, AI6, robotics, autonomy, and future infrastructure.
Data + workload control. Tesla has unique data and workload control.
Training ↔ inference convergence. Future chips may serve both well.³
Package-scale mainstreams. SoW-X / CoWoS 14R / SoIC validate the direction.⁷
Power / cooling reusable. Lessons travel even if the tile architecture changes form.²
Robotics scale. Optimus and physical AI add new training-compute demand.
Vertical integration still rare. Few companies can co-design data, model, and silicon.

Even if Dojo changes form, the lesson survives: AI compute is constrained by bandwidth, power, cooling, packaging, and software.

XVIII. What to watch


Whether Tesla continues using Dojo systems internally.³
AI5 tape-out and production timing.⁴
AI6 tape-out timing.⁵
Samsung 2nm yield and schedule.⁵
TSMC AI5 production in Taiwan / Arizona.⁴
Tesla’s actual training-compute spending.
Nvidia usage inside Tesla.
AMD usage inside Tesla, if any.
Whether Dojo software survives in AI5 / AI6 workflows.
Optimus training requirements.
Autonomy model size and data growth.
Tesla’s FSD training cadence.
Package-scale AI hardware trends.⁷
TSMC SoW-X roadmap.⁷
CoWoS capacity and package size evolution.⁷
HBM dependence vs SRAM-heavy architectures.¹
Power and cooling architecture for AI clusters.²
Whether custom accelerators beat GPU clusters in narrow workloads.

Glossary

A short reference for the vocabulary used above. Definitions are simplified.


Dojo: Tesla’s custom AI training supercomputer project.
D1: Tesla’s Dojo training chip.
Training node: Local compute block inside D1.
Training tile: Package-level unit containing 25 D1 dies.
Known-good die: A chip die tested before integration.
Fan-out wafer: Packaging process that redistributes connections outward from dies to enable dense integration.
SRAM: Fast on-chip memory.
HBM: High-bandwidth memory used near AI accelerators.
BF16: bfloat16, a low-precision format often used in AI training.
CFP8: Configurable 8-bit floating point format referenced in Dojo coverage.
SerDes: Serializer / deserializer links used for high-speed data movement.
TDP: Thermal design power.
VRM: Voltage regulator module.
ExaPOD: Tesla’s larger Dojo system concept built from multiple tiles / cabinets.
CoWoS: TSMC advanced packaging technology often used for AI / HPC chips.
SoIC: TSMC 3D stacking technology.
SoW-X: TSMC System-on-Wafer roadmap technology.
COUPE: TSMC co-packaged optics technology.
Tape-out: Final chip design handoff before manufacturing.
Rack-scale AI: AI compute designed at rack or data-center scale rather than single-chip scale.

XIX. The package became the computer


The 2021 SemiAnalysis piece was bold about Tesla’s ~4× / 1.3× / 5× / nearly-order-of-magnitude TCO claim, and explicit that production deployment was the real test. The 2026 reality is more complicated than that claim implied and more aligned with that caution than either side of the bull-bear debate likes to admit. Tesla learned an enormous amount about package-scale AI hardware. Tesla also decided that a dedicated Dojo training-supercomputer team was not how it wanted to spend that learning.

The Dojo team was streamlined toward AI5 and AI6. TSMC’s CoWoS, SoIC, COUPE, and SoW-X roadmap is moving the industry toward exactly the kind of package-scale and rack-scale integration Dojo previewed in 2021. The companies that win the next AI cycle will be the ones that treat the package, the rack, and the software stack as one design problem. Tesla saw that early. Others are catching up now.

That is how the order-of-magnitude bet aged. Not into a clean victory for any one company. Into the AI infrastructure playbook.

That is how the package became the computer.

¹ Patel, D. (Aug 2021). Tesla’s Dojo, 1 Order Of Magnitude Better Cost, Performance, Scale Than Nvidia Solutions. SemiAnalysis. Historical anchor for the Dojo system framing, including D1 specs (354 functional units, 362 TFLOPS BF16/CFP8, 50B transistors, 645mm², 400W TDP, 1.25MB SRAM per training node, 10 TB/s on-chip directional bandwidth, 576 SerDes at 112 GT/s, ~8 TB/s off-chip bandwidth), the 25-die training tile (9 PFLOPS, 36 TB/s off-tile bandwidth), the ~10kW / ~15kW tile power framing, vertical power delivery, custom VRMs reflowed onto the fan-out wafer, the cabinet / ExaPOD architecture, the PyTorch / Dojo compiler / LLVM software stack, and Tesla’s ~4x / 1.3x / 5x / ~10x TCO claim versus Nvidia AI solutions with the explicit caution that production deployment was the real test. Used as inspiration only. No content, structure, or charts reproduced.

² Cadence. Not chips: Tesla’s Dojo. Independent technical summary corroborating 25 known-good D1 dies on a fanout-wafer process, ~9 PFLOPS / tile, ~36 TB/s off-tile bandwidth, ~52V DC, ~18,000A current draw, ~15kW heat dissipation, and the tile / cabinet / ExaPOD framing.

³ Reuters (Aug 2025). Tesla to streamline its AI chip design work, Musk says. Bloomberg-via-Reuters reporting that Tesla was disbanding its Dojo supercomputer team, with Musk saying it did not make sense to divide resources across two very different AI chip designs and that Tesla’s effort was focused on AI5, AI6, and subsequent chips, described as excellent for inference and at least pretty good for training.

⁴ Reuters (Jul 2025). Tesla / Samsung $16.5B chip supply deal. Samsung’s Taylor (Texas) factory framed to make Tesla’s next-generation AI6 chip per Musk; Samsung currently making AI4; TSMC slated to make AI5 first in Taiwan and then Arizona per Musk; self-driving / Optimus / broader AI context.

⁵ Reuters (Mar 2026). Musk says Tesla may tape out AI6 in December 2026. Samsung executive cited saying Tesla chips based on Samsung’s advanced 2nm process were planned for production in the second half of 2027.

⁶ Tesla AI Day 2021 official material and subsequent Tesla technical presentations remain the original sources for D1 chip, training tile, cabinet, ExaPOD, and software stack framing. This essay relies on the SemiAnalysis 2021 framing (fn1) and the Cadence technical summary (fn2) for the specific numbers used; no Tesla material is reproduced.

⁷ TSMC (2026). 2026 North America Technology Symposium. CoWoS expansion (5.5-reticle today, 14-reticle by 2028 with ~10 large compute dies and 20 HBM stacks), SoIC 3D stacking, COUPE co-packaged optics, and a 40-reticle SoW-X System-on-Wafer technology targeted for 2029.

⁸ Public Hot Chips 34 Dojo System material and other credible Dojo technical coverage is referenced in this essay only at the level the SemiAnalysis 2021 framing (fn1) and Cadence summary (fn2) already disclose. Specific microarchitecture claims are made only where verifiable.

⁹ Public Nvidia developer / DGX materials are referenced only as comparative platform context. This essay does not reproduce Nvidia visuals or marketing.

¹⁰ TSMC packaging materials (CoWoS, InFO, SoIC, SoW) provide additional context for the system-level packaging direction. The essay uses the 2026 symposium framing (fn7) as the primary citation rather than restating individual product pages.