The Custom AI Hardware Trap
Original analysis · Not investment advice
Dojo was impressive, but the uploaded 2021 SemiAnalysis critique aged well. Tesla’s real challenge was never building a beautiful D1 chip or one powerful training tile. It was making memory, interconnect, software, power, cooling, and economics work as a full production training system. In 2026, Tesla’s move toward AI5 and AI6 suggests the standalone Dojo path became less central. But Dojo still matters because it revealed the future of AI hardware: the chip is no longer enough. The system wins.
Tesla Dojo was impressive. That was never the question.
The D1 chip had huge bandwidth. The training tile was exotic. The power density was wild. The system vision was bold.
But the uploaded 2021 SemiAnalysis article asked the question that mattered more.1 Can this become a usable production training system? Not whether Tesla could build an impressive chip. Whether Tesla could solve memory, interconnect, software, packaging, cooling, power delivery, and economics at the same time.
In 2026, that question looks even more important. Tesla has shifted attention from a standalone Dojo training-chip path toward AI5 and AI6 [3][4][5]. But the lesson is not “Dojo was stupid.” The lesson is sharper. Custom AI hardware is not won at the chip-spec level. It is won at the system level.
The chip-spec sheet is the easy part. The production system is the trap.
1. The 2021 critique was about the system
The uploaded SemiAnalysis piece on Dojo was not a hit-piece on Tesla.1 It did not deny that the D1 chip was technically interesting. It did not argue that the training tile was a bad idea. It argued something more disciplined.
It argued that neither peak TFLOPS nor a glossy reveal video validates a custom AI chip. Only a working production system does. The article identified the hard walls Tesla would have to cross. It was system-level discipline, not anti-Tesla noise.
A custom AI chip is not validated by peak TFLOPS. It is validated when memory, interconnect, compiler, power, cooling, and economics work together in production.
Five years on, the framing still holds. Dojo’s silicon was the visible artefact. The invisible artefacts were the harder ones: memory balance, software stack, exotic packaging, custom interconnect, cooling design, and a business case narrow enough that only autonomy could justify the spend.
2. The four walls of custom AI hardware
The cleanest way to read the 2021 critique in 2026 is to organise it as four walls. Each wall is independent. A custom AI training system has to cross all four. Failure on any one is enough to keep the whole effort from reaching production scale.
- Memory
- Interconnect & packaging
- Software
- Economics
Peak TFLOPS get you to wall one. The rest of the project is the walls.
3. The memory wall came first
The uploaded article’s sharpest early observation was about memory per unit of compute.1 Each Dojo functional unit had around 1.25 MB of SRAM and roughly 1 TFLOP of FP16 / CFP8 compute. D1 had 354 such units. The article estimated that the full ExaPOD would have only around 1.33 TB of total SRAM behind more than an exaflop of FP16-class compute.
That is a lot of compute. It is not a lot of memory.
The article’s argument was not that this number is wrong. It was that this ratio is uncomfortable. CFP8 helped stretch memory by reducing precision per value, but it did not change the underlying balance. Dojo was compute-rich and memory-tight relative to the size of the models the industry was already aiming at.1
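The ratio is easy to check. The sketch below redoes the arithmetic from the per-unit figures quoted above. It assumes roughly 3,000 D1 dies at ExaPOD scale (the die count the article cites in its economics discussion) and decimal units; the constant names are illustrative, not Tesla's.

```python
# Back-of-envelope check of Dojo's SRAM-to-compute balance.
# Inputs are the approximate figures quoted in the text.
# Assumption: ~3,000 D1 dies per ExaPOD; 1 TB = 1e12 bytes.
UNITS_PER_D1 = 354        # functional units per D1 die
SRAM_MB_PER_UNIT = 1.25   # SRAM per functional unit
TFLOPS_PER_UNIT = 1.0     # FP16 / CFP8 compute per functional unit
D1_PER_EXAPOD = 3_000     # assumption, see lead-in

sram_mb_per_die = UNITS_PER_D1 * SRAM_MB_PER_UNIT
exapod_sram_tb = sram_mb_per_die * D1_PER_EXAPOD / 1e6
exapod_eflops = UNITS_PER_D1 * TFLOPS_PER_UNIT * D1_PER_EXAPOD / 1e6

# Bytes of local SRAM available per FLOP/s of peak compute.
sram_bytes_per_flops = (SRAM_MB_PER_UNIT * 1e6) / (TFLOPS_PER_UNIT * 1e12)

print(f"SRAM per D1 die:   {sram_mb_per_die:.1f} MB")
print(f"SRAM per ExaPOD:   {exapod_sram_tb:.2f} TB")
print(f"Compute per ExaPOD: {exapod_eflops:.2f} EFLOPS")
print(f"SRAM bytes per FLOP/s: {sram_bytes_per_flops:.2e}")
```

Under these assumptions the totals land close to the article's estimate: about 1.33 TB of SRAM against just over one exaflop of peak compute.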
Dojo’s first wall was not compute. It was memory per unit of compute.
4. Why memory balance matters
Training large neural networks does not just need compute. It needs the system to feed that compute. The memory side of the budget is broad, and most of it grows with model size.
- Model parameters
- Forward activations for backprop
- Gradients
- Optimizer state (momentum, second-moment estimates)
- Intermediate tensors and reductions
- Communication buffers (all-reduce, all-gather)
- KV cache when used during training-time evaluation
- Slack for fragmentation and out-of-place ops
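To put rough numbers on the first few items, here is a minimal sketch of model-state accounting. It assumes mixed-precision Adam at 16 bytes per parameter, a common rule of thumb that does not come from the 2021 article or from Tesla; a CFP8-based recipe would differ. Activations, buffers, and slack are deliberately left out, so real budgets are larger.

```python
# Rough "model state" memory for training: weights, gradients, optimizer state.
# Assumption (not from the source): mixed-precision Adam, 16 bytes/parameter.
BYTES_WEIGHTS = 2      # 16-bit working weights
BYTES_GRADIENTS = 2    # 16-bit gradients
BYTES_OPTIMIZER = 12   # FP32 master weights + momentum + second moment

BYTES_PER_PARAM = BYTES_WEIGHTS + BYTES_GRADIENTS + BYTES_OPTIMIZER


def model_state_bytes(num_params: float) -> float:
    """Bytes needed for weights, gradients, and optimizer state."""
    return num_params * BYTES_PER_PARAM


def max_params(capacity_bytes: float, usable_fraction: float = 1.0) -> float:
    """Largest parameter count whose model state fits in the given capacity.

    usable_fraction is a hypothetical knob for reserving room for
    activations, communication buffers, and fragmentation.
    """
    return capacity_bytes * usable_fraction / BYTES_PER_PARAM


EXAPOD_SRAM_BYTES = 1.33e12  # the ~1.33 TB ExaPOD SRAM estimate cited above

print(f"10B-parameter model state: {model_state_bytes(10e9) / 1e9:.0f} GB")
print(f"Params fitting in ExaPOD SRAM (no reserve): "
      f"{max_params(EXAPOD_SRAM_BYTES) / 1e9:.1f}B")
print(f"Params fitting with half reserved: "
      f"{max_params(EXAPOD_SRAM_BYTES, 0.5) / 1e9:.1f}B")
```

The point is the shape of the budget, not the exact figures: model state alone scales linearly with parameter count, before any of the other items on the list are counted.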
When memory per chip is tight, the system has to either keep computation local, partition the model carefully, or move data between chips and nodes. Each option has a cost.
If model partitioning is forced, the compiler has to place operations near their data. If data movement increases, interconnect bandwidth becomes the binding constraint. If utilisation falls, the headline TFLOPS becomes marketing instead of throughput. None of this means a compute-rich, SRAM-heavy design cannot work. It means the software, the compiler, and the engineering team have to do more work to extract real performance.2
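A simple roofline estimate shows how quickly headline compute turns into unused compute. The sketch below is a generic roofline model, not Tesla's performance model. It takes the per-die figures quoted in this piece (about 354 TFLOPS of peak compute and about 8 TB/s of off-die I/O) and asks what is attainable when operands have to come from off-die; the arithmetic intensities are hypothetical.

```python
# Generic roofline estimate: attainable throughput is capped either by peak
# compute or by how fast data can be delivered.
# Assumptions: per-die figures are the approximate ones quoted in the text;
# the example arithmetic intensities are hypothetical.

def attainable_tflops(peak_tflops: float,
                      bandwidth_tb_s: float,
                      flops_per_byte: float) -> float:
    """TB/s times FLOP/byte gives TFLOP/s, capped at the compute peak."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)


D1_PEAK_TFLOPS = 354.0      # ~354 units x ~1 TFLOP each
D1_OFF_DIE_TB_S = 8.0       # ~8 TB/s of off-die I/O

# Intensity at which the die stops being bandwidth-bound.
ridge_flops_per_byte = D1_PEAK_TFLOPS / D1_OFF_DIE_TB_S

for intensity in (10.0, 44.25, 100.0):
    tflops = attainable_tflops(D1_PEAK_TFLOPS, D1_OFF_DIE_TB_S, intensity)
    share = tflops / D1_PEAK_TFLOPS
    print(f"{intensity:6.2f} FLOP/byte -> {tflops:6.1f} TFLOPS ({share:.0%} of peak)")
```

Work that stays in local SRAM sits far to the right of this ridge point. Work that spills off-die does not, which is why placement and partitioning carry so much of the burden.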
Peak compute is easy to market. Feeding that compute is the hard part.
5. The bandwidth wall became a packaging problem
The uploaded article was equally pointed on I/O.1 It said D1 used 112G SerDes lanes. It said D1 had roughly 576 SerDes lanes. It said the chip reached around 8 TB/s of off-die I/O. It argued that normal organic substrates could not expose this much I/O cleanly. The escape valve, the article said, was exotic packaging: TSMC’s Integrated Fan-Out System-on-Wafer (InFO_SoW).
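The lane count and the bandwidth figure are consistent with each other. A quick check, assuming the 112G figure is a raw per-lane line rate in Gbit/s and ignoring encoding and protocol overhead:

```python
# Raw aggregate SerDes bandwidth for one D1 die.
# Assumption: 112 Gbit/s raw per lane, no encoding or protocol overhead.
SERDES_LANES = 576
GBIT_PER_S_PER_LANE = 112

raw_gbit_per_s = SERDES_LANES * GBIT_PER_S_PER_LANE
raw_tbyte_per_s = raw_gbit_per_s / 8 / 1000  # bits -> bytes, giga -> tera

print(f"Raw off-die I/O: {raw_tbyte_per_s:.3f} TB/s")
```

That comes to about 8.06 TB/s, in line with the roughly 8 TB/s the article cites.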
That framing aged well. Cadence’s independent technical summary later described how Tesla’s training tile uses 25 known-good D1 dies on a fan-out wafer process at the package level, with 9 PFLOPS of BF16 / CFP8 compute per tile and 36 TB/s of off-tile bandwidth.2 Whatever you call the package, the architectural point is the same. Dojo solved chip-to-chip bandwidth by making the package part of the architecture.
Conventional package: bandwidth is bounded by ball-out
- Limited pin count on the package balls
- Long, high-power off-package links
- PCB and connector loss
- Cooling per chip, not per tile
- Serviceability is per socket
Fan-out wafer tile: bandwidth becomes a package property
- Dense interconnect on the wafer-like carrier
- Many short, lower-energy die-to-die links
- Bandwidth scales with package area, not pins
- Cooling and power delivery designed for the tile
- Serviceability becomes tile-level, not chip-level
The architectural elegance is real. The tradeoff is real too. Exotic packaging means tighter coupling with one foundry, harder yield management, more capital tied to specialist tools, and more complex repair stories. Tesla’s answer was a bet that the architectural payoff was worth the manufacturing complexity, at least for a workload it controlled.
Dojo attacked the bandwidth wall by making the package exotic.
6. The tile was beautiful, but the tile was not enough
The training tile is the photograph that travels. It is also the unit that tells the truth about why custom AI hardware is hard.
Tesla’s Hot Chips 34 materials describe the tile as the unit of scale, with around 9 PFLOPS BF16 / CFP8 of compute, around 36 TB/s of off-tile bandwidth, and around 11 GB of high-speed ECC SRAM at tile level.6 Cadence’s summary reaches similar numbers from outside Tesla.2 Tesla materials also describe Dojo Interface Processors with 32 GB of HBM per DIP, 800 GB/s of memory bandwidth, 160 GB of DRAM at the tile edge, and on the order of 13 TB of high-bandwidth DRAM at ExaPOD scale.6
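The tile-level figures can be cross-checked against the per-die figures from the 2021 article. The sketch below is only a consistency check on numbers already quoted in this piece; the implied DIP count is derived from the 160 GB and 32 GB figures, not stated by Tesla here.

```python
# Consistency check on the tile-level numbers quoted in the text.
# Assumption: the 354-unit, 1.25 MB-per-unit figures from the 2021 article
# apply unchanged to each of the 25 dies on a tile.
DIES_PER_TILE = 25
UNITS_PER_D1 = 354
SRAM_MB_PER_UNIT = 1.25

TILE_PFLOPS = 9.0          # ~9 PFLOPS BF16 / CFP8 per tile
OFF_TILE_TB_S = 36.0       # ~36 TB/s off-tile bandwidth
HBM_GB_PER_DIP = 32.0      # HBM per Dojo Interface Processor
EDGE_DRAM_GB = 160.0       # DRAM at the tile edge

tile_sram_gb = DIES_PER_TILE * UNITS_PER_D1 * SRAM_MB_PER_UNIT / 1000
implied_tflops_per_die = TILE_PFLOPS * 1000 / DIES_PER_TILE
implied_dips = EDGE_DRAM_GB / HBM_GB_PER_DIP

# Off-tile bytes available per FLOP of peak tile compute.
off_tile_bytes_per_flop = (OFF_TILE_TB_S * 1e12) / (TILE_PFLOPS * 1e15)

print(f"SRAM per tile:          {tile_sram_gb:.2f} GB")
print(f"Implied compute per die: {implied_tflops_per_die:.0f} TFLOPS")
print(f"Implied DIPs at the edge: {implied_dips:.0f}")
print(f"Off-tile bytes per FLOP: {off_tile_bytes_per_flop:.4f}")
```

The SRAM total matches the roughly 11 GB figure, and the ratio of edge DRAM to on-tile SRAM shows how much of the memory story sits outside the tile itself.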
This is the part of the 2021 critique that aged most cleanly. The tile is beautiful. The tile is not a computer. Around the tile sit interface processors, HBM, edge DRAM, host integration, custom protocols, fault management, and the software that holds it together. The uploaded article was correct to focus on the system, not the photogenic surface.
The tile was the centre. The system made it usable.
7. The software wall was the real test
If the tile is the part you see, software is the part you trip on. The uploaded article was openly skeptical here.1 It said Tesla had not convincingly shown automatic placement and routing of mini-tensor operations across the architecture. It warned that AI hardware companies often struggle with software for years after the silicon exists. It was not arguing that Tesla had no engineers. It was arguing that custom-hardware software stacks have a long, brutal middle.
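The list of things a custom training accelerator’s software has to do is long: compile model operations to the hardware, manage execution at runtime, partition graphs across hardware units, and support both model and data parallelism.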
None of these are exotic problems. They are the daily problems of running a training cluster. The uploaded article’s deeper point was that the silicon is the easy half, even when the silicon is hard. The software is where most custom AI hardware projects either pay a long, expensive tax or quietly stop being used.
A beautiful chip is useless if researchers cannot easily make models run on it.
8. Nvidia’s moat was the default path
Dojo should not be compared only against Nvidia GPU silicon. The fair comparison is against Nvidia’s system. CUDA, cuDNN, NCCL, TensorRT, profilers, DGX systems, NVLink, networking, distributed-training recipes, hyperscaler integration, framework compatibility, and developer habit are all part of what you buy when you buy an Nvidia GPU.8
Nvidia’s rack-scale systems make the contrast even clearer. Nvidia says its GB200 NVL72 connects 36 Grace CPUs and 72 Blackwell GPUs into a single 72-GPU NVLink domain with around 130 TB/s of low-latency GPU communication.8 Whether or not those numbers are independently verified for any specific workload, they exist as a credible default. A team can put a PyTorch model on an Nvidia cluster on Monday and have it training on Tuesday.
Nvidia wins through ecosystem-scale integration. Dojo tried to win through workload-specific vertical integration. Both can be legitimate strategies. They are not the same strategy. The custom path only wins if the workload is narrow enough and the team is deep enough to outweigh the years of compounding the default path has already done.
Nvidia is not only a GPU vendor. It is the default software path.
9. The economics only worked if autonomy worked
The uploaded article was explicit on the business case.1 It described around 3,000 large 645 mm² 7 nm dies committed for deployment. By the standards of normal chip economics, that is a small volume for the kind of NRE Dojo needed. Exotic packaging added cost. Custom interconnect added cost. A custom software stack added cost. The article framed the only credible payoff as a meaningful acceleration of Tesla’s autonomy and robotaxi programmes.
That framing is sharp and worth repeating. Dojo did not need to become a profitable standalone chip business. It needed to make autonomy arrive faster than it otherwise would have. Anything less than that was an expensive science project.
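The volume problem can be made concrete with two pieces of arithmetic. The sketch below uses a standard gross-dies-per-wafer approximation and a simple amortisation function. It assumes 300 mm wafers and ignores yield, edge exclusion, and scribe lines. The NRE figure is a hypothetical placeholder, not a number from the article or from Tesla.

```python
import math

# How few wafers ~3,000 large dies represent, and how fixed cost spreads
# over volume.
# Assumptions: 300 mm wafers, no yield loss, no edge exclusion.
# HYPOTHETICAL_NRE_USD is a made-up placeholder for illustration only.

def gross_dies_per_wafer(die_area_mm2: float,
                         wafer_diameter_mm: float = 300.0) -> int:
    """Common approximation: wafer area over die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    by_area = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return math.floor(by_area - edge_loss)


def nre_per_die(nre_usd: float, dies: int) -> float:
    """Fixed engineering cost carried by each die at a given volume."""
    return nre_usd / dies


DIE_AREA_MM2 = 645
DIES_COMMITTED = 3_000
HYPOTHETICAL_NRE_USD = 100e6  # placeholder, not a sourced figure

gross = gross_dies_per_wafer(DIE_AREA_MM2)
wafers = math.ceil(DIES_COMMITTED / gross)

print(f"Gross dies per 300 mm wafer: {gross}")
print(f"Wafers for {DIES_COMMITTED} dies before yield: {wafers}")
for volume in (DIES_COMMITTED, 100_000, 1_000_000):
    cost = nre_per_die(HYPOTHETICAL_NRE_USD, volume)
    print(f"NRE per die at {volume:>9,} dies: ${cost:,.0f}")
```

Whatever the true fixed cost was, the shape is the same: at a few thousand dies, almost all of it lands on each unit, which is why the payoff had to come from autonomy rather than from chip economics.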
Dojo’s business case did not need a standalone chip P&L. It needed autonomy and robotaxi capability to arrive faster than it would have without it.1
Dojo’s economics only made sense if it made autonomy arrive faster.
10. The 2026 update: Dojo became less central
Reuters, citing Bloomberg, reported in August 2025 that Tesla was streamlining its AI chip design work and disbanding the Dojo supercomputer team.3 The reporting included that Peter Bannon, who had led Dojo work, was leaving. Elon Musk’s response was framed by Reuters as a decision not to divide Tesla’s resources across two very different AI chip designs, and to focus on AI5, AI6, and subsequent chips. Musk described the new chips, per Reuters, as excellent for inference and at least pretty good for training.
Read this as a strategy shift, not a verdict on the silicon.
The Reuters / Bloomberg framing is about resource allocation and team structure, not about Dojo silicon failing. Tesla has not publicly retracted any specific D1 or tile claim. The shift is best read as a decision that the standalone Dojo training path was no longer the best use of Tesla’s AI hardware effort.3
The right reading is not “Dojo was a failure.” The right reading is that the standalone Dojo training path became too expensive to keep central. The 2021 critique was about whether memory, software, interconnect, packaging, cooling, and economics could work as one system. The 2025 reorganisation is the first publicly visible answer. The system path Tesla chose was AI5 and AI6, not a continued bet on Dojo as a separate training cluster.
Dojo taught Tesla the system problem. AI5 and AI6 became the product path.
11. AI5 and AI6 are the new centre of gravity
The 2026 picture of Tesla’s AI hardware programme is built on two threads.
First, Reuters reported in early 2026 that Musk said Tesla may tape out AI6 in December, with Samsung executives saying Tesla chips based on Samsung’s 2 nm process are planned for production in the second half of 2027.4 AI6, per the Reuters framing, is likely to be used in self-driving cars and humanoid robots.
Second, Reuters reported in July 2025 that Tesla signed a roughly USD 16.5 billion supply deal with Samsung.5 Musk was quoted saying Samsung’s Taylor, Texas factory would manufacture AI6, while TSMC was slated to manufacture AI5 first in Taiwan and then in Arizona.5 Samsung currently makes AI4, the chip in Tesla’s in-car generation, and AI5 sits between AI4 and AI6 in this roadmap.5
Tesla did not abandon custom AI silicon. The shift is from a separate training-cluster bet to deployment-scale AI chips that go into the products themselves.
Dojo was the training-cluster bet. AI5 and AI6 are the deployment-scale AI-chip bets.
12. The industry moved toward Dojo’s problem
The most interesting move is not what happened inside Tesla. It is what happened around it. Even as Dojo as a standalone training cluster became less central, the rest of the AI hardware world moved toward the same set of problems Dojo had been trying to solve.
TSMC’s 2026 North America Technology Symposium press materials describe its CoWoS roadmap reaching a 5.5-reticle interposer in production today, a 14-reticle interposer planned for 2028 capable of integrating around 10 large compute dies and roughly 20 HBM stacks, and a 40-reticle System-on-Wafer technology (SoW-X) expected in 2029.7 The same materials describe SoIC for 3D stacking and COUPE for co-packaged optics.7
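Reticle multiples are easier to picture as area. The sketch below converts the roadmap steps into square millimetres, assuming one reticle is the standard 26 mm by 33 mm lithography exposure field. TSMC's own definition of a reticle multiple may differ, so treat the results as scale, not specification.

```python
import math

# Convert reticle multiples from the roadmap into approximate silicon area.
# Assumption: one reticle = 26 mm x 33 mm exposure field (858 mm^2).
RETICLE_MM2 = 26 * 33

ROADMAP = {
    "CoWoS (5.5 reticles, today)": 5.5,
    "CoWoS (14 reticles, 2028 plan)": 14,
    "SoW-X (40 reticles, 2029 plan)": 40,
}

# Area of a full 300 mm wafer, for comparison.
WAFER_300MM_MM2 = math.pi * 150 ** 2

areas_mm2 = {name: multiple * RETICLE_MM2 for name, multiple in ROADMAP.items()}

for name, area in areas_mm2.items():
    share = area / WAFER_300MM_MM2
    print(f"{name}: {area:,.0f} mm^2 ({share:.0%} of a 300 mm wafer)")
```

On this assumption the 40-reticle step covers close to half of a 300 mm wafer, which is the territory Dojo's training tile was already working in.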
This is the Dojo problem at industry scale. Memory, interconnect, packaging, power, cooling, optics, and rack systems are converging into a single design surface. Dojo may have become less central inside Tesla. The class of problem Dojo was trying to solve became universal.
Dojo became less central. The Dojo problem became universal.
13. Custom AI hardware is a system trap
The trap is not that custom AI hardware is a bad idea. The trap is that the chip-spec sheet looks easier than it is, and the production system is harder than it looks. Many internal chip teams and AI hardware startups have fallen into the same failure mode.
The spec-sheet half
- Peak TFLOPS at a target precision
- Die-to-die bandwidth on paper
- Power per die
- Process node and area
- One impressive demo workload
The system half
- Memory balance and locality at scale
- Compiler that does not need a PhD per model
- Tile-to-tile and rack-to-rack interconnect
- Cooling, power delivery, serviceability
- Researcher productivity and debugging
- Ecosystem migration cost
- Economics over actual deployment volume
Dojo did not fall into every part of this trap. Tesla genuinely advanced the state of the art on packaging, power delivery, and on what a training tile could look like. The trap is the gap between the two halves. The chip-spec sheet is the easy part. The production system is what makes or breaks the project.
The chip-spec sheet is the easy part. The production system is the trap.
14. What could break the thesis
The thesis here is that the 2021 critique of Dojo aged well, and that custom AI hardware is decided at the system level. There are honest reasons that reading could be wrong.
- Memory balance. SRAM-heavy architectures can struggle to absorb large parameter counts without painful partitioning.1
- Compiler debt. Custom compilers take years to mature, and most teams underestimate the work.
- Interconnect cost. Tile-to-tile fabrics are expensive to design, expensive to maintain, and hard to extend.
- Exotic packaging. Fan-out wafer and SoW-class packaging create yield, capacity, and serviceability risks.7
- Default path strength. Nvidia’s software and rack-scale system are extremely hard to beat.8
- Workload bottlenecks. Autonomy progress may be bottlenecked by data and methods, not by training compute alone.
- Team continuity. A reorganisation like the August 2025 Dojo streamlining can reduce institutional memory.3
- Roadmap dilution. Splitting effort across training-cluster and product-chip programmes can slow both.
- Cost of being early. Being early on package-scale compute does not always convert into product leverage.
15. What could break the bear case
There are equally honest reasons the bear case here could be too dark.
- System literacy. Dojo forced Tesla to learn package-scale AI hardware before most of the industry.
- Unique data. Tesla controls a fleet-video workload that no other AI lab has at comparable scale.
- Hardware-software co-design. Tesla can tune AI5 / AI6 around its own data and autonomy loops.
- Inference-first chips. Reuters reporting describes Musk framing AI5 / AI6 as excellent for inference and at least pretty good for training.3
- Inherited lessons. AI5 and AI6 may inherit Dojo’s lessons on packaging, power, and software.
- Industry tailwind. Package-scale and rack-scale AI are now mainstream design surfaces.7
- Power and cooling reuse. Power-delivery and thermal lessons translate to AI factories more broadly.
- Workload growth. Robotics and autonomy workloads may grow large enough to justify custom silicon again.
Even if Dojo changes form, the lesson survives: AI compute is constrained by bandwidth, memory, power, cooling, packaging, and software.
16. What to watch
The most honest way to read Dojo in 2026 is as an unfinished experiment whose outcome will be readable over the next 18 to 36 months. These are the signals worth tracking.
- Whether Tesla continues using Dojo systems internally
- Dojo software survival inside AI5 / AI6 workflows
- AI5 tape-out and production timing5
- AI6 tape-out timing4
- Samsung 2 nm yield and schedule4
- TSMC AI5 production in Taiwan and Arizona5
- Tesla’s actual training-compute spending
- Tesla’s ongoing Nvidia usage
- Tesla’s AMD usage, if any
- Optimus training requirements
- FSD training cadence
- Model size and data growth at Tesla
- Package-scale AI hardware trends at peers
- TSMC SoW-X roadmap and customer alignment7
- CoWoS capacity and package size cadence7
- HBM-heavy versus SRAM-local architecture choices
- AI data-centre power and cooling architecture
- Whether custom accelerators beat GPU clusters on narrow workloads
17. The custom AI hardware trap
Dojo was impressive, but the uploaded article’s caution aged well. Tesla’s real challenge was not building a beautiful D1 chip or one powerful training tile. It was making memory, interconnect, software, power, cooling, and economics work as a full production training system.
In 2026, Tesla’s move toward AI5 and AI6 suggests the standalone Dojo path became less central [3][4][5]. But Dojo still matters because it revealed the future of AI hardware. The chip is no longer enough. The system wins.
“Custom AI hardware is not won at the spec-sheet level. It is won at the system level.”
18. Glossary
- Dojo. Tesla’s custom AI training supercomputer project.
- D1. Tesla’s Dojo training chip.
- Training tile. Package-level unit containing multiple D1 dies.
- SRAM. Fast on-chip memory.
- HBM. High-bandwidth memory stack used near AI accelerators.
- SerDes. Serializer / deserializer links used for high-speed data movement.
- InFO_SoW. TSMC Integrated Fan-Out System-on-Wafer packaging.
- Fan-out wafer. Packaging method that redistributes chip connections through wafer-like structures.
- Compiler. Software that maps model operations to hardware.
- Runtime. Software that manages execution on hardware.
- Graph partitioning. Splitting neural network computation across hardware units.
- Model parallelism. Splitting a single model across multiple compute devices.
- Data parallelism. Splitting batches of data across compute devices.
- TCO. Total cost of ownership of a system over its useful life.
- ExaPOD. Tesla’s larger Dojo system concept.
- CoWoS. TSMC advanced packaging technology for AI / HPC chips.
- SoIC. TSMC 3D stacking technology.
- SoW-X. TSMC System-on-Wafer roadmap technology.
- NVLink. Nvidia high-speed GPU interconnect.
- Rack-scale AI. AI compute designed at rack or data-centre scale rather than single-chip scale.
This piece is original 2026 analysis. It uses the uploaded 2021 SemiAnalysis article only as a cited historical anchor for the 2021 critique. It uses Hot Chips and Cadence materials as a public technical baseline. It uses Reuters reporting as a frame around what Tesla and Musk have said publicly. It is not investment advice. No specific Tesla, Nvidia, Samsung, TSMC, or supplier security is being recommended.
1 Uploaded SemiAnalysis PDF, Dylan Patel (SemiAnalysis), 2021. Skeptical Dojo analysis, framed in this essay as the 2021 anchor. Used only as historical thesis / inspiration, not reproduced. The 2021 piece argued Dojo had a memory problem (~1.25 MB SRAM and ~1 TFLOP FP16 / CFP8 per functional unit, 354 units in D1, an estimated ~1.33 TB of total SRAM at ExaPOD scale behind >1 EFLOP of FP16-class compute), that D1 used 112G SerDes (576 lanes, ~8 TB/s I/O) and needed exotic fan-out wafer packaging (TSMC InFO_SoW), that Tesla’s software claims (placement / routing of mini-tensor operations) were unproven, and that economics relied on ~3,000 large 645 mm² 7 nm dies being justified by the autonomy / robotaxi payoff.
2 Cadence Breakfast Bytes (Paul McLellan). Not chips: Tesla’s Dojo. Independent technical summary of the D1 chip and training tile; the essay uses the Cadence framing for the 25-die training tile on a fan-out wafer process, ~9 PFLOPS, ~36 TB/s of off-tile bandwidth, and power-delivery / cooling design points.
3 Reuters (Aug 2025). Tesla to streamline its AI chip design work, Musk says. Bloomberg-via-Reuters reporting that Tesla was disbanding its Dojo supercomputer team, with Peter Bannon described as leaving, and Musk saying it did not make sense to divide resources across two very different AI chip designs and that Tesla’s effort was focused on AI5, AI6, and subsequent chips, described as excellent for inference and at least pretty good for training. Framed in this essay as a strategy shift, not a verdict on Dojo silicon.
4 Reuters (Mar 2026). Musk says Tesla may tape out next-generation AI6 chips in December. Reuters reporting Musk saying Tesla may tape out AI6 in December, with Samsung executives saying Tesla chips based on Samsung’s 2 nm process are planned for production in the second half of 2027. AI6 likely to be used in self-driving cars and humanoid robots, per the Reuters framing.
5 Reuters (Jul 2025). Tesla / Samsung ~USD 16.5B supply deal. Reuters reporting that Tesla signed a roughly USD 16.5 billion supply deal with Samsung; Musk was quoted saying Samsung’s Taylor, Texas factory would manufacture AI6, while TSMC was slated to make AI5 first in Taiwan and then in Arizona; Samsung currently makes AI4.
6 Hot Chips 34, Dojo System materials. Hot Chips 34 conference deck, Dojo System. Used for training tile as unit of scale, ~9 PFLOPS BF16 / CFP8 if verified, ~36 TB/s off-tile bandwidth, ~11 GB high-speed ECC SRAM, Dojo Interface Processor with 32 GB HBM, ~800 GB/s memory bandwidth, ~160 GB DRAM per tile edge, and ~13 TB high-bandwidth DRAM at ExaPOD scale. Conference materials cited as a public technical baseline; no Tesla AI Day or Hot Chips images are reproduced.
7 TSMC (Apr 2026). TSMC 2026 North America Technology Symposium press release. Used for the CoWoS 5.5-reticle / 14-reticle (2028 plan) / SoW-X 40-reticle (2029 plan) roadmap framing, the ~10 large compute dies + ~20 HBM stacks figure for 14-reticle CoWoS if verified, and the SoIC + COUPE co-packaged optics positioning. No TSMC diagrams reproduced.
8 Nvidia, GB200 NVL72 platform page. GB200 NVL72. Used for the comparative description of 36 Grace CPUs + 72 Blackwell GPUs forming a 72-GPU NVLink domain with ~130 TB/s of low-latency GPU communication, framed as Nvidia’s rack-scale “default path” system in this essay; numbers are Nvidia’s and are not independent benchmarks.