Essay No. 027 · AI Infrastructure · Melbourne, Australia

The Networked AI Bet

Original analysis · Not investment advice

Why Tenstorrent is trying to beat Nvidia with openness, Ethernet, and cost-per-token, not bigger GPUs.

Pugalenthi Magendran

April 2026 · Melbourne, Australia

Tenstorrent’s 2021 Wormhole idea was not really about one chip. It was about making scale-out AI look like one programmable mesh. In 2026, that idea has become the Networked AI bet: open software, RISC-V IP, Ethernet-native scaling, and switch-light AI systems.

In 2021, Tenstorrent Wormhole looked like one of the more interesting AI-chip ideas outside Nvidia. Not because it had the biggest die. Not because it used the most advanced node. Not because it copied the GPU.

The idea was different.

Tenstorrent wanted to make AI compute look like a mesh.

The 2021 SemiAnalysis article described Wormhole as a scale-out architecture built around Tensix cores, packetized mini-tensors, an internal network-on-chip, GDDR6 memory, and 16 ports of 100Gb Ethernet.¹ The real claim was that Tenstorrent could extend the chip’s internal fabric across chips, servers, and racks so software could see one large mesh of cores instead of a painful hierarchy of GPUs, NICs, switches, and manually partitioned models.

That was the 2021 bet.

In 2026, the same idea has a clearer name: Networked AI.

Key idea

The correct thesis is not “Tenstorrent will kill Nvidia.” The correct thesis is that Tenstorrent is one of the more interesting attacks on Nvidia because it is not trying to copy Nvidia. It is attacking the system differently: open software, RISC-V IP, Ethernet-native scale-out, switch-light fabrics, and cost-efficient AI serving.

I. The 2021 thesis

In June 2021, Dylan Patel published a SemiAnalysis piece arguing that Wormhole was not interesting for its FLOPS. It was interesting for its scale-out architecture. The chip integrated compute, memory, network-on-chip, and 16×100GbE links into one die, and Tensix cores carried not only compute but also routing and packet-management logic. Data moved as packetized mini-tensors. Nebula and Galaxy were the early system-level expressions of this idea, and the goal was to make communication across cores, chips, servers, and racks look uniform to software.¹

The 2021 essay was excited but explicitly skeptical about whether the compiler could really place and route work efficiently across the mesh without congestion. That skepticism has aged well: five years later, compiler and software maturity is still the open question.

2021 thesis

Wormhole was not just an AI chip. It was a bet that scale-out AI could be made easier by turning chips, servers, and racks into one programmable mesh of Tensix cores.

II. The network is the architecture

Most AI systems are hierarchical. Inside the chip, data moves one way. Across accelerators, it moves another way. Across servers, it moves another way. Across racks, it crosses expensive networking. The software has to understand all of that.

Tenstorrent’s idea is to flatten the hierarchy. The SemiAnalysis 2021 description was that mini-tensor packets move through the mesh, cores include router and packet-manager logic, and sending data between cores on the same chip should look similar to sending data across chips when the network-on-chip extends over Ethernet. The compiler then maps work across that mesh.¹
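A toy model may make the flattening concrete. The sketch below is not Tenstorrent's routing logic. It assumes a hypothetical 2D grid of chips, each holding a hypothetical 8×8 grid of cores, and uses plain dimension-ordered (XY) routing. The names `Core`, `xy_route`, and `count_hops` are illustrative only. The point is that one global coordinate space lets software address any core the same way, even though each hop is physically either on-chip or over Ethernet.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical dimension chosen for illustration only; not a Tenstorrent spec.
CORES_PER_CHIP_SIDE = 8  # each chip is modelled as an 8x8 grid of cores


@dataclass(frozen=True)
class Core:
    x: int  # global column across the whole mesh
    y: int  # global row across the whole mesh

    @property
    def chip(self) -> tuple[int, int]:
        """Which chip this global coordinate lands on."""
        return (self.x // CORES_PER_CHIP_SIDE, self.y // CORES_PER_CHIP_SIDE)


def xy_route(src: Core, dst: Core) -> list[Core]:
    """Dimension-ordered route: walk along x first, then along y."""
    path = [src]
    x, y = src.x, src.y
    while x != dst.x:
        x += 1 if dst.x > x else -1
        path.append(Core(x, y))
    while y != dst.y:
        y += 1 if dst.y > y else -1
        path.append(Core(x, y))
    return path


def count_hops(src: Core, dst: Core) -> dict[str, int]:
    """Split a route into on-chip hops and chip-to-chip (Ethernet) hops."""
    path = xy_route(src, dst)
    hops = {"on_chip": 0, "ethernet": 0}
    for a, b in zip(path, path[1:]):
        hops["ethernet" if a.chip != b.chip else "on_chip"] += 1
    return hops


# The call has the same shape whether the destination is local or remote.
same_chip = count_hops(Core(1, 1), Core(6, 3))
cross_chip = count_hops(Core(1, 1), Core(18, 3))
print("same chip: ", same_chip)
print("cross chip:", cross_chip)
```

The caller never names a NIC or a switch. What the model hides is exactly what the 2021 skepticism was about: the Ethernet hops are slower and scarcer than the on-chip ones, so a compiler still has to place work to keep them rare and uncongested.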

Diagram · Traditional hierarchy vs Tenstorrent-style mesh

Traditional · Hierarchy

01 GPU cores
02 NVLink / PCIe inside server
03 NIC + switch fabric
04 Server & rack boundaries
05 Cluster / pod / DC

Tenstorrent-style · Mesh

01 Tensix mesh inside chip
02 Chip mesh over Ethernet
03 Server mesh
04 Rack mesh
05 One programmable substrate

A simplified, original visual; not a Tenstorrent or Nvidia chart. Levels generalised from the 2021 SemiAnalysis description.¹

The network was not an accessory. The network was the architecture.

III. Why this matters more in 2026

AI changed after 2021. The market is no longer only about training bigger models. It is about serving them.

Production AI cares about tokens per dollar, power per token, memory per user, latency, time to first token, video-generation throughput, model bring-up speed, model churn, deployment control, private AI, sovereign AI, and avoiding vendor lock-in. ASML’s 2025 Annual Report describes AI as requiring leading-edge processors and a significant increase in DRAM relative to traditional compute.¹³ TSMC describes AI demand as a key driver of advanced-node and advanced-packaging usage.¹⁴

That demand is enormous, and it is starting to feel concentrated.

AI infrastructure is becoming too expensive, too closed, and too dependent on one vendor. That is Tenstorrent’s opening.

IV. Blackhole is the first real test

Many AI-chip startups never escape slides. Tenstorrent has shipped. In April 2025, the company announced its Blackhole developer products at Tenstorrent Dev Day, including the Blackhole p100 card starting at $999, the p150 card starting at $1,399, and a TT-QuietBox workstation with four Blackhole processors starting at $11,999. Tenstorrent describes Blackhole as a second-generation Tensix architecture built on a 6nm-class process with a faster NoC, higher memory density, integrated RISC-V cores, and an open-source software stack.²

Diagram · Wormhole → Blackhole → Galaxy

2021 · Wormhole · Tensix cores, NoC, GDDR6, 16×100GbE, mini-tensor mesh.¹
2025 · Blackhole dev kit · p100 from $999, p150 from $1,399, TT-QuietBox from $11,999.²
2025–26 · Galaxy Blackhole · 32 Blackhole ASICs, 23 PFLOPS FP8, 1TB GDDR6, 56×800GbE.³
2026 · Networked AI · Open software, RISC-V IP, sovereign AI partnerships, custom silicon.

Tenstorrent’s rough product arc from the 2021 Wormhole concept to 2026’s Networked AI positioning. Years and prices per the cited SemiAnalysis piece and Tenstorrent announcements.¹ ² ³

The first job of an AI-chip startup is not beating Nvidia. The first job is shipping.

V. Galaxy is Wormhole turned into a system

The clearest 2026 expression of the Networked AI idea is Galaxy. Tenstorrent describes Galaxy Blackhole as a system with 32 Blackhole ASICs, 23 PFLOPS Block FP8 compute, 6.2GB of SRAM at 2.9PB/s, 1TB of GDDR6 at 16TB/s, 10×400GbE links per ASIC, and up to 56×800GbE QSFP-DD scale-out ports. The company lists Galaxy Blackhole at $110,000 and a four-Galaxy supercluster starting around $440,000.³

That product is what Wormhole was always pointing at: compute, memory, and networking integrated as one fabric, not assembled out of separate proprietary parts. Tenstorrent’s TT-Deploy framing describes Galaxy as production hardware engineered for AI inference at scale, with Ethernet-native interconnect designed to keep switching simple.⁴
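The headline figures are easier to weigh once they are divided out. The sketch below uses only the numbers quoted above and assumes even division across the 32 ASICs and decimal units (1TB = 1000GB; a binary reading would give 32GB per ASIC). The derived values are arithmetic, not measurements, and the variable names are illustrative.

```python
# Figures as quoted in this essay from Tenstorrent's Galaxy page.
ASICS = 32
PFLOPS_BLOCK_FP8 = 23.0
SRAM_GB, SRAM_PBPS = 6.2, 2.9
GDDR6_TB, GDDR6_TBPS = 1.0, 16.0
LINKS_PER_ASIC, LINK_GBPS = 10, 400
SCALE_OUT_PORTS, PORT_GBPS = 56, 800
PRICE_USD = 110_000

# Even division across ASICs and decimal units are assumptions.
per_asic = {
    "tflops_block_fp8": PFLOPS_BLOCK_FP8 * 1000 / ASICS,
    "sram_mb": SRAM_GB * 1000 / ASICS,
    "gddr6_gb": GDDR6_TB * 1000 / ASICS,
    "gddr6_gbps": GDDR6_TBPS * 1000 / ASICS,
    "ethernet_tbps": LINKS_PER_ASIC * LINK_GBPS / 1000,
    "list_price_usd": PRICE_USD / ASICS,
}

# Aggregate scale-out bandwidth if every QSFP-DD port is populated.
scale_out_tbps = SCALE_OUT_PORTS * PORT_GBPS / 1000

# FLOPs the system can issue per byte it can fetch: a rough roofline ratio.
flops_per_gddr6_byte = (PFLOPS_BLOCK_FP8 * 1e15) / (GDDR6_TBPS * 1e12)
flops_per_sram_byte = PFLOPS_BLOCK_FP8 / SRAM_PBPS

for name, value in per_asic.items():
    print(f"{name:>18}: {value:,.2f}")
print(f"scale-out Tb/s: {scale_out_tbps:.1f}")
print(f"FLOPs per GDDR6 byte: {flops_per_gddr6_byte:,.1f}")
print(f"FLOPs per SRAM byte: {flops_per_sram_byte:.1f}")
print(f"four Galaxies at list: ${4 * PRICE_USD:,}")
```

The last two ratios are the interesting ones. Peak compute outruns GDDR6 bandwidth by three orders of magnitude, so whether real workloads get near the headline number depends on how much traffic stays in SRAM and on the fabric.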

Diagram · Galaxy system stack

Compute · Blackhole ASICs · 32 per Galaxy
Memory · SRAM · 6.2GB on-die at 2.9PB/s
Memory · GDDR6 · 1TB at 16TB/s
Network · Ethernet fabric · 10×400GbE per ASIC, up to 56×800GbE QSFP-DD
Software · Open software · TT-Metalium, TT-NN, TT-Forge
Workload · Production AI serving · inference, video gen, private & sovereign AI

Galaxy specs and pricing as published by Tenstorrent. The diagram is original; not a Tenstorrent visual.³ ⁴ ⁶

Tenstorrent is not trying to win by building the biggest GPU. It is trying to build a cheaper, open, Ethernet-native AI fabric.

VI. The benchmark claims show direction, not verdict

Tenstorrent’s performance announcement says Galaxy can reach 350+ tokens/sec/user on DeepSeek-R1-0528 671B at 100K context with sub-4-second time-to-first-token, plus a roughly 10× video-generation speedup with Prodia, including 720p 81-frame video generated in about 2.4 seconds.⁵

Caution

These are Tenstorrent claims, not independent proof.

Treat every number above as a vendor benchmark. Independent reproduction with comparable models, contexts, batch sizes, networking, and software releases is what would turn these into infrastructure-grade evidence. Until then, they show the workloads Tenstorrent wants to compete on, not the workloads it has demonstrably won.
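It still helps to translate the claims into units a serving team would feel. The sketch below takes the vendor figures at face value ("350+" as 350, "sub-4-second" as 4.0, "about 2.4 seconds" as 2.4) and derives per-token latency, frames per second, and a rough wall time for one reply. The helper `response_seconds` is illustrative, and it ignores queueing, batching effects, and network latency.

```python
# Vendor claims as quoted above, taken at face value (not independently verified).
TOKENS_PER_SEC_PER_USER = 350        # "350+" tokens/sec/user
TTFT_SECONDS = 4.0                   # "sub-4-second" time to first token
FRAMES, VIDEO_GEN_SECONDS = 81, 2.4  # 720p clip, "about 2.4 seconds"

ms_per_token = 1000 / TOKENS_PER_SEC_PER_USER
frames_per_sec = FRAMES / VIDEO_GEN_SECONDS


def response_seconds(output_tokens: int) -> float:
    """Rough wall time for one reply: TTFT plus steady-state decode."""
    return TTFT_SECONDS + output_tokens / TOKENS_PER_SEC_PER_USER


# Reply length at which decode time equals the time to first token.
break_even_tokens = TTFT_SECONDS * TOKENS_PER_SEC_PER_USER

print(f"{ms_per_token:.2f} ms per token")
print(f"{frames_per_sec:.2f} generated frames per second")
print(f"{response_seconds(500):.2f} s for a 500-token reply")
print(f"decode overtakes TTFT after {break_even_tokens:.0f} tokens")
```

If the claims hold, a long-context reply of modest length spends more time waiting for the first token than decoding the rest. That is one reason time to first token under real load belongs on the watch list.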

The strategic point is not the exact benchmark number. It is where Tenstorrent wants to compete: production inference, video generation, latency-sensitive serving, and cost per token, not necessarily the hardest training workloads.

The proof will be independent benchmarks, customer deployments, uptime, model coverage, and real cost per token.
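"Real cost per token" has a simple shape, even if the inputs are hard to pin down. The sketch below is a minimal model, not Tenstorrent's or anyone's pricing method: hardware amortisation plus electricity, divided by tokens produced. Only the $110,000 list price and the 350 tokens/sec/user claim come from the sources above. The power draw, electricity price, concurrency, utilisation, and amortisation period are placeholder assumptions, and the model leaves out hosting, cooling, networking, staff, and margin.

```python
def cost_per_million_tokens(
    system_price_usd: float,
    amortization_years: float,
    power_kw: float,
    usd_per_kwh: float,
    tokens_per_sec_per_user: float,
    concurrent_users: int,
    utilization: float,
) -> float:
    """Hardware amortisation plus electricity, per million output tokens."""
    hours = amortization_years * 365 * 24
    capex_per_hour = system_price_usd / hours
    energy_per_hour = power_kw * usd_per_kwh
    tokens_per_hour = (
        tokens_per_sec_per_user * concurrent_users * utilization * 3600
    )
    return (capex_per_hour + energy_per_hour) / tokens_per_hour * 1_000_000


example = cost_per_million_tokens(
    system_price_usd=110_000,     # Galaxy Blackhole list price, as cited
    amortization_years=3,         # placeholder assumption
    power_kw=10,                  # placeholder assumption, not a published spec
    usd_per_kwh=0.10,             # placeholder assumption
    tokens_per_sec_per_user=350,  # vendor claim, not independently verified
    concurrent_users=8,           # placeholder assumption
    utilization=0.5,              # placeholder: busy half of the time
)
print(f"~${example:.2f} per million output tokens under these assumptions")
```

The output is only as good as the placeholders. The useful part is the sensitivity: concurrency and utilisation sit in the denominator, so halving either one doubles the cost, which is why independent numbers under real load matter more than peak figures.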

VII. Open software is the anti-CUDA argument

Tenstorrent’s software docs describe a layered open-source stack.⁷ TT-Metalium is the low-level programming model and SDK for Tensix hardware.⁸ TT-NN is a Python/C++ neural-network operator library built on top of Metalium.⁷ TT-Forge is an MLIR-based compiler stack with frontends including TT-Torch, TT-XLA, and TT-Forge-ONNX, plus a shared TT-MLIR backend that lowers into Metalium.⁹

Diagram · Tenstorrent open software stack

TT-Forge · MLIR-based compiler with TT-Torch, TT-XLA, and TT-Forge-ONNX frontends.⁹
TT-NN · Python and C++ neural-network operator library.⁷
TT-Metalium · Low-level SDK and programming model for Tensix hardware.⁸
TT-LLK / kernels · Low-level kernels and hardware abstractions exposed under Metalium.⁸

A simplified, original visual based on Tenstorrent’s public software-stack documentation.⁷

Nvidia’s moat is CUDA. But CUDA is not just syntax. CUDA is libraries, debugging, profiling, cloud support, enterprise trust, documentation, production experience, and a decade of ecosystem memory.

Nvidia / CUDA moat

What you actually get with CUDA

Libraries · cuDNN, cuBLAS, NCCL, TensorRT, Triton ecosystem.
Tooling · Nsight, profilers, debuggers, observability.
Cloud · first-class availability on every major cloud.
Enterprise · trusted procurement and support paths.
Ecosystem memory · a decade of production battle-testing.

Tenstorrent counterpunch

What the open stack offers instead

Open source · readable, inspectable, modifiable stack.
Architecture simplicity · mesh + Ethernet, fewer hidden layers.
Lower friction · affordable dev kits, public docs, public repos.
Less lock-in · portable across deployments and partners.
Custom silicon path · IP licensing for those who want their own chip.

Nvidia’s moat is CUDA. Tenstorrent’s counterargument is open software plus architecture-level simplicity.

VIII. The IP business may matter as much as the boxes

Tenstorrent’s December 2024 Series D announcement says the company raised over US$693M at a US$2B pre-money valuation, with strategic investors including Samsung Securities, AFW Partners, LG Technology Ventures, Hyundai Motor Group, Fidelity, Baillie Gifford, and Bezos Expeditions. The same materials describe Tenstorrent’s product line as both AI computers and licensable AI/RISC-V IP.¹⁰

EE Times has reported separately that Tenstorrent is productising its RISC-V CPU and AI cores as licensable IP, that LG and Hyundai are IP licensees, and that the majority of bookings to date came from IP deals rather than systems sales.¹¹

Diagram · Tenstorrent IP business surface

CPU IP · RISC-V cores · Licensable RISC-V CPU IP for custom SoCs.¹¹
AI IP · Tensix cores · AI compute and mesh-fabric IP for partner designs.¹¹
Customers · LG, Hyundai, partners · Public IP licensees in mobility and consumer SoCs.¹¹

A simplified, original IP-business framing based on Tenstorrent’s public Series D positioning and EE Times reporting.¹⁰ ¹¹

Tenstorrent is not only competing with Nvidia boxes. It is also competing for the future of custom AI silicon.

IX. Sovereign AI makes the story stronger

Reuters reported in November 2024 that Japan partnered with Tenstorrent on a $50M program to train up to 200 Japanese chip designers over five years, connected to the country’s Rapidus ecosystem and broader semiconductor revival.¹²

That program is small in dollar terms and large in symbolic terms. Countries are increasingly explicit about wanting control over AI infrastructure: local chip-design skills, custom silicon, supply-chain optionality, open architectures, fewer black boxes, alternatives to Nvidia dependency, and domestic capability.

Diagram · What sovereign AI buyers actually want

Local skills · Chip-design talent and tooling inside the country, not only via foreign vendors.
Custom silicon · IP licensing and reference designs that allow domestic SoCs to exist.
Open stack · Inspectable software so the AI runtime is not a black box.
Supply-chain control · Multiple foundries, multiple memory suppliers, fewer single points of failure.

A simplified, original framing of the sovereign AI demand pattern that Tenstorrent’s open + RISC-V + IP-licensing strategy speaks to.

Sovereign AI is not only about models. It is about who controls the chips, tools, and skills underneath the models.

X. The execution risk is real

None of this matters if the boring product realities do not hold. The clearest example sits in Tenstorrent’s own firmware release notes. Starting January 2026, Blackhole p150 cards ship with 120 Tensix cores instead of the original 140, and firmware v19.5.0 changes existing cards to expose 120 cores to unify the developer interface. Tenstorrent says typical workloads see only a 1–2% performance difference, but developers using grid-size-dependent code may need to update their applications.¹⁵
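What "grid-size-dependent code" means in practice can be shown without any Tenstorrent API. The sketch below is a generic work-splitting example. `split_tiles` and `cores_exposed_by_device` are hypothetical names, and the second is only a stand-in for whatever runtime query the real SDK provides. Only the 140 and 120 core counts come from the release notes.

```python
def split_tiles(num_tiles: int, num_cores: int) -> list[range]:
    """Assign contiguous tile ranges to cores as evenly as possible."""
    base, extra = divmod(num_tiles, num_cores)
    ranges, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


LEGACY_CORES = 140   # original p150 core count, per the release notes
CURRENT_CORES = 120  # count exposed from firmware v19.5.0


def cores_exposed_by_device() -> int:
    """Hypothetical stand-in for a runtime query of the device's core grid."""
    return CURRENT_CORES


num_tiles = 4200  # arbitrary workload size for illustration

# Fragile: the grid size is baked into the application.
hard_coded = split_tiles(num_tiles, LEGACY_CORES)
# Tiles assigned to core indices the updated device no longer exposes.
stranded = sum(len(r) for r in hard_coded[CURRENT_CORES:])

# Robust: ask the device, then partition.
queried = split_tiles(num_tiles, cores_exposed_by_device())

print(f"hard-coded 140: {stranded} tiles land on cores that are not there")
print(f"queried 120: {len(queried[0])} tiles per core, nothing stranded")
```

In this toy split, each remaining core takes on about 17% more tiles. That is not a contradiction of the 1–2% figure Tenstorrent quotes for typical workloads, since many workloads are not bound by per-core compute, but it is a reason to measure rather than assume.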

Execution risk

The boring product realities of AI hardware.

Yield. Firmware. Driver compatibility. Application updates. Documentation freshness. Thermal reliability. Procurement confidence. Long-term roadmap trust. Support quality. These are the layers that decide whether a smart architecture turns into a deployable platform. They do not appear in benchmark slides.

Open, developer-friendly hardware still has to survive boring product realities.

XI. Where Nvidia still wins

Be fair to Nvidia. The advantages are real and they are not only chips. Nvidia owns CUDA maturity, developer trust, training dominance, inference maturity, the NVLink / InfiniBand / Spectrum-X stack, cloud availability, enterprise procurement trust, model compatibility, ecosystem tooling, support, performance on the hardest workloads, and a decade of software optimisation depth. ASML and TSMC both describe AI as a major source of demand for advanced logic, memory, and packaging,¹³ ¹⁴ and that pull lands most heavily on Nvidia silicon today.

Dimension | Nvidia today | Tenstorrent today
Software | CUDA, mature libraries, tooling, observability. | Open TT-Metalium / TT-NN / TT-Forge, immature relative to CUDA.⁷
Training | Default for frontier model training at scale. | Limited public training deployments; not the focus.
Inference | Strong, with Triton and Dynamo ecosystem. | Targeted at production inference and video generation.⁵
Networking | NVLink, InfiniBand, Spectrum-X. | Ethernet-native, switch-light scale-out by design.³
Business model | Systems, software, services. | Systems + open software + RISC-V/AI IP licensing.¹¹
Strategic fit | Default AI operating environment. | Cost-per-token, openness, sovereign AI, custom silicon.

Tenstorrent may have a smart architecture, but Nvidia has the most proven AI infrastructure machine in the world.

XII. What could break the thesis?

The strongest bear case is that Nvidia is not just a chip company. It is the default AI operating environment.

Bear case · what could break Tenstorrent

CUDA stays too strong. A decade of libraries, tooling, and developer habit does not unwind quickly.
Software immaturity. TT-Metalium, TT-NN, and TT-Forge need to keep up with frontier model churn.⁷
Benchmark non-replication. Company tokens/sec and TTFT figures need independent reproduction.⁵
Model coverage lag. New open and closed models appear faster than ports.
Safety preference. Customers may pay more for Nvidia simply because it is safer to defend internally.
Networking gap. Ethernet-native scale-out may not match NVLink/IB at the hardest training scale.
Hyperscaler in-house. Cloud vendors increasingly prefer their own silicon.
AMD wakes up. A stronger ROCm and MI roadmap could absorb the “Nvidia alternative” slot.
Reliability. Support quality and uptime are where many startups die.
Yield and revisions. p150 120-core firmware adjustments are minor but symbolic of the risk.¹⁵
Porting friction. Open source does not erase the real cost of moving production workloads.
Mindshare. Developer attention is concentrated, and concentration compounds.
IP company outcome. Tenstorrent may end up valued more as an IP business than a systems company.¹¹

XIII. What could break the bear case?

The strongest bull case is that AI is becoming too large, too expensive, and too politically important for one closed stack to satisfy every customer.

Bull case · what could break the bear

Workloads keep changing. Inference, agentic systems, and video generation reward flexibility over peak training throughput.
Cost per token wins. As AI scales, every unit of inference cost compounds.
Customers want alternatives. Procurement teams do not like single-vendor risk.
Sovereign AI grows. National buyers want control they cannot get from a single foreign vendor.¹²
Open software compounds. Inspectable stacks are more valuable as agents and regulations mature.
RISC-V adoption rises. The base of open IP grows across CPUs, NPUs, and SoCs.¹¹
AI coding tools reduce porting friction. Model porting becomes cheaper to attempt.
Ethernet-native systems integrate well. Most data centers already speak Ethernet.³
Niche wins are enough. Inference, video, private AI, and custom silicon are large markets.
IP business is high-leverage. A licensing engine pays even if systems do not displace Nvidia.¹¹

XIV. What to watch

If the Networked AI bet is real, certain signals should keep showing up across customer announcements, benchmarks, and roadmap notes. If it is fragile, the cracks will appear in the same places first.

Independent Galaxy benchmarks.
Real customer deployments at scale.
DeepSeek, Llama, Qwen, and open-model coverage.
Cost per token vs Nvidia and AMD.
Time-to-first-token under real load.
Video-generation throughput in production.
Uptime and reliability across long runs.
Firmware stability and release cadence.
Blackhole p150 120-core transition impact.¹⁵
TT-Metalium, TT-NN, TT-Forge maturity.
vLLM and serving-stack support.
Cloud availability of Tenstorrent systems.
Developer community growth.
IP licensing revenue trajectory.¹¹
LG, Hyundai, and automotive traction.
Sovereign AI partnerships beyond Japan.¹²
Rapidus and broader RISC-V design ecosystem progress.
Samsung Foundry and TSMC manufacturing roadmap.
Nvidia’s software and pricing response.
AMD’s software progress.

Glossary

A short reference for the vocabulary used above. Definitions are simplified.

Tensix core: Tenstorrent’s AI compute core combining compute, SRAM, routing, and packet-management logic.
Mini-tensor: A smaller packetised tensor unit used inside Tenstorrent’s mesh architecture.
NoC: Network-on-chip; a communication fabric inside a chip.
Ethernet scale-out: Using Ethernet links to connect many accelerators or systems together.
SRAM: Fast memory placed close to compute on the die.
GDDR6: Graphics memory used by some AI accelerators for bandwidth.
FP8 / Block FP8: Low-precision numerical formats used for AI compute throughput.
CUDA: Nvidia’s software platform for GPU computing and AI development.
RISC-V: An open instruction-set architecture used for CPUs and custom silicon.
IP licensing: Selling reusable CPU or AI core designs to other chip designers.
TT-Metalium: Tenstorrent’s low-level programming model and SDK.
TT-NN: Tenstorrent’s neural-network operator library.
TT-Forge: Tenstorrent’s MLIR-based compiler stack.
Sovereign AI: Nationally controlled AI infrastructure, skills, and supply chains.
Cost per token: The all-in cost of serving a model (hardware, power, operations) divided by the number of tokens it produces.
Time to first token: The latency before a model starts generating output.

XV. The Networked AI bet

Nvidia still owns the default AI stack. That is not changing overnight. CUDA, libraries, networking, cloud availability, developer trust, and a decade of production battle-testing are not things you replace with a slide.

The bet is not that Tenstorrent kills Nvidia. The bet is that AI infrastructure becomes too large, too diverse, and too politically important for one stack to own everything.

If customers keep wanting lower cost, more openness, more deployment control, and alternatives to a single closed ecosystem, Tenstorrent becomes worth watching. If sovereign AI buyers keep wanting local capability and inspectable stacks, the IP side of the business may matter as much as the box side. If open-source AI hardware can hold up under boring product realities, including yield, firmware, and support, then the Networked AI bet stops being a story about one company and becomes a story about how AI compute gets organised in the next decade.

The proof is still ahead. Independent benchmarks. Real deployments. Reliability. Roadmap discipline. Software maturity. Customer growth.

But the direction is clear. AI hardware is no longer only about who builds the biggest chip. It is about who builds the most useful network.

¹ Patel, D. (Jun 2021). Tenstorrent Wormhole Analysis — A Scale Out Architecture for Machine Learning That Could Put Nvidia On Their Back Foot. SemiAnalysis. Historical anchor for the 2021 Wormhole thesis, including Tensix cores, mini-tensor packets, network-on-chip, GDDR6 memory, 16×100GbE links, Nebula and Galaxy topology, and the uniform-mesh software model. Used as inspiration only. No content, structure, or charts reproduced.

² Tenstorrent (Apr 2025). Tenstorrent launches Blackhole developer products at Tenstorrent Dev Day. Blackhole p100 from $999, p150 from $1,399, TT-QuietBox from $11,999, plus framing of the 6nm-class node, NoC, memory, RISC-V cores, and open software stack.

³ Tenstorrent. Galaxy. Galaxy Blackhole with 32 Blackhole ASICs, 23 PFLOPS Block FP8, 6.2GB SRAM at 2.9PB/s, 1TB GDDR6 at 16TB/s, 10×400GbE per ASIC, up to 56×800GbE QSFP-DD scale-out ports, with pricing of $110,000 for one Galaxy Blackhole and from $440,000 for the four-Galaxy supercluster.

⁴ Tenstorrent. TT-Deploy. Galaxy production framing, integrated compute / SRAM / DRAM / networking story, and Tenstorrent’s positioning around AI inference at scale.

⁵ Tenstorrent. Tenstorrent enables AI at scale with industry-leading performance. Company claims of 350+ tokens/sec/user on DeepSeek-R1-0528 671B, 100K context, sub-4-second time-to-first-token, and a roughly 10× Prodia video-generation speedup with 720p 81-frame video in about 2.4 seconds. Treated in this essay as vendor benchmarks, not independent results.

⁶ Tenstorrent. Tenstorrent software stack overview. TT-Metalium, TT-NN, and TT-Forge described as the main stack components.

⁷ Tenstorrent. Software stack getting-started docs. Layered model with TT-Metalium at the bottom, TT-NN as the neural-network library, and TT-Forge as the compiler-level entry point.

⁸ Tenstorrent. TT-Metalium documentation. Low-level programming model and SDK for Tensix hardware.

⁹ Tenstorrent. TT-Forge documentation. MLIR-based compiler stack with TT-Torch, TT-XLA, and TT-Forge-ONNX frontends.

¹⁰ Tenstorrent (Dec 2024). Tenstorrent closes $693M Series D. Over US$693M raised at a US$2B pre-money valuation, with strategic investors including Samsung Securities, AFW Partners, LG Technology Ventures, Hyundai Motor Group, Fidelity, Baillie Gifford, and Bezos Expeditions, plus AI and RISC-V IP licensing framing.

¹¹ Brown, S. Tenstorrent productises RISC-V CPU and AI IP. EE Times. IP-business framing including LG and Hyundai licensees and the reported share of bookings coming from IP deals.

¹² Mukherjee, S. and Mukherjee, S. (Nov 2024). Japan taps US chip startup Tenstorrent to help train new wave of engineers. Reuters. $50M program, up to 200 Japanese chip designers, five-year horizon, Rapidus and RISC-V context.

¹³ ASML (2025). 2025 Annual Report, strategic report section. AI requires leading-edge high-performance processors and a significant increase in DRAM relative to traditional compute architectures.

¹⁴ TSMC. 2025 Annual Report. Robust AI-related demand, advanced packaging and 3D stacking investment, and the role of advanced logic and packaging for AI/HPC.

¹⁵ Tenstorrent. tt-zephyr-platforms release notes 19.5. Starting January 2026, Blackhole p150 cards ship with 120 Tensix cores instead of 140. Firmware v19.5.0 exposes 120 cores on existing cards for a unified interface, with typical workloads seeing a 1–2% performance difference and possible application updates for grid-size-dependent code.