Why AI Chips Are Splitting by Workload
AI chips are not just “faster computers”. They are specialised machines for turning electricity into
matrix multiplication at trillion-scale. CPUs, GPUs, and TPUs exist because different workloads need different
hardware. As AI matures, training and inference are splitting into different chip-design
problems.
CPU
General-purpose control
A handful of sophisticated cores with deep control logic, large caches, branch prediction, and flexibility.
Excellent for operating systems, orchestration, databases, application logic, and serial tasks. Not enough
parallel arithmetic units to drive trillion-scale neural-network matrix multiplication.
GPU
Parallel computation
Thousands of simpler cores that apply the same operation across many pieces of data at once. Originally
built for graphics, where millions of pixels can be processed in parallel. Neural networks rely on the same
parallel matrix math, which made GPUs the accidental engine of deep learning.
TPU / AI accelerator
Specialised matrix engine
Sacrifices general-purpose flexibility for efficiency on tensor operations. Systolic-array designs push
data through grids of simple compute units, doing matrix multiplication with very little wasted control
overhead. Higher performance per watt on AI workloads; less useful for arbitrary code.
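To make the systolic-array idea concrete, here is a minimal Python sketch that simulates a small array cycle by cycle. It is an illustration under stated assumptions, not a description of any particular TPU: the function name `systolic_matmul`, the output-stationary layout (each cell keeps one output value), and the skewed input schedule are all choices made for this example.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    a is n x k, b is k x m. Cell (i, j) accumulates out[i][j].
    Values of a flow left-to-right, values of b flow top-to-bottom,
    and each cell does one multiply-add per cycle. Hypothetical sketch.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]      # running sums, one per cell
    a_reg = [[0] * m for _ in range(n)]    # a-value currently held by each cell
    b_reg = [[0] * m for _ in range(n)]    # b-value currently held by each cell

    # The last useful product reaches the far corner after n + m + k - 2 cycles.
    for t in range(n + m + k - 2):
        # Walk from the far corner backwards so every cell reads its
        # neighbour's value from the previous cycle before it is overwritten.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j > 0:
                    a_in = a_reg[i][j - 1]             # from the left neighbour
                else:
                    # Row i enters the array i cycles late (skewed feed).
                    a_in = a[i][t - i] if 0 <= t - i < k else 0
                if i > 0:
                    b_in = b_reg[i - 1][j]             # from the neighbour above
                else:
                    # Column j enters the array j cycles late.
                    b_in = b[t - j][j] if 0 <= t - j < k else 0
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in               # the only arithmetic a cell does
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Note that no cell ever decodes an instruction or fetches from a shared cache: it multiplies, adds, and hands its inputs to the next cell. That is the "very little wasted control overhead" described above.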
Why GPUs won AI
Graphics and neural networks look unrelated, but they share the same deeper pattern: huge amounts of
parallel math. Graphics applies the same transformations across millions of pixels and vertices. Neural
networks apply multiply-add operations across enormous matrices. Hardware built for one turned out to be ideal
for the other.
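A small sketch of that shared pattern, in plain Python. The same hypothetical `matmul` helper rotates a set of 2D vertices (a graphics transform) and pushes inputs through a tiny linear layer (a neural-network step). The matrices and values are made up for illustration.

```python
import math


def matmul(a, b):
    """Multiply a (m x k) by b (k x n) using nothing but multiply-adds."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]   # one multiply-add
            out[i][j] = acc
    return out


# Graphics: rotate three 2D vertices by 90 degrees (one vertex per row).
theta = math.pi / 2
rotation = [[math.cos(theta), math.sin(theta)],
            [-math.sin(theta), math.cos(theta)]]
vertices = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = matmul(vertices, rotation)

# Neural network: send three 2-feature inputs through a 2 -> 3 linear layer.
weights = [[0.5, -1.0, 2.0],
           [1.5, 0.0, -0.5]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
activations = matmul(inputs, weights)

print(rotated)
print(activations)
```

Every output element is independent of the others, so the three nested loops can be spread across as many arithmetic units as the hardware provides. Only the meaning of the numbers differs between the two calls.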
Analogy · CPU
Jumbo jet
Fast, flexible, many routes, many cargo types. The right vehicle when each trip is different and the
itinerary matters more than the volume moved.
Analogy · GPU
Cargo ship
Less flexible. Slower for any single trip. But it moves an enormous volume of the same cargo at once. The right
vehicle when you have huge amounts of similar work to do in parallel.
What is actually inside a modern GPU
A GPU is not a faster CPU. It trades flexibility for parallel throughput by packing in thousands of simple
arithmetic units and feeding them data through very wide memory interfaces.
- Many simple arithmetic cores. Each one is much less capable than a CPU core. The workhorse operation is fused multiply-add — multiplying two numbers and adding a third in a single step.
- Tensor cores. Specialised units that perform matrix multiplication and addition directly — the central operation of neural networks. This is what made modern GPUs especially good for AI rather than just generally parallel.
- Same instruction, many data points (SIMD / SIMT). The thousands of cores execute the same operation across different pieces of data at the same time. Graphics, cryptocurrency mining, and deep learning all fit this shape, which is why one chip family serves all three.
- Very high memory bandwidth. Thousands of cores starve without data. The AI chip race is partly a memory-bandwidth race: raw compute is wasted if the chip cannot be fed fast enough.
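The last point can be made quantitative with a roofline-style estimate: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The sketch below is a back-of-the-envelope model. `PEAK_FLOPS` and `BANDWIDTH` are hypothetical round numbers rather than the specs of any real chip, and the byte count assumes each matrix is read or written exactly once.

```python
PEAK_FLOPS = 1.0e15   # hypothetical peak: 1,000 TFLOP/s
BANDWIDTH = 3.0e12    # hypothetical memory bandwidth: 3 TB/s


def matmul_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, idealised traffic."""
    flops = 2 * m * k * n                               # one multiply + one add per term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Roofline bound: limited by compute or by how fast data arrives."""
    return min(peak_flops, intensity * bandwidth)


# One input vector against a 4096 x 4096 weight matrix (batch size 1).
small = attainable_flops(matmul_intensity(1, 4096, 4096))
# A large square matmul, as seen with big batches.
large = attainable_flops(matmul_intensity(4096, 4096, 4096))

print(f"batch-1 matvec: {small / PEAK_FLOPS:.1%} of peak")
print(f"large matmul:   {large / PEAK_FLOPS:.1%} of peak")
```

Under these assumptions the batch-1 case reaches well under 1% of peak because almost every weight is loaded to do just two operations, while the large matmul hits the compute ceiling. The arithmetic units are the same; only the feeding rate changes.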
Training and inference are different workloads
The same silicon family runs both, but the optimisations diverge. Training builds the model.
Inference runs it. Two different bottlenecks, two different chip-design problems.
Training
Build the model
Compute-bound and interconnect-heavy. Needs raw floating-point throughput, large batch processing,
high-bandwidth memory, and very fast chip-to-chip communication so gradients can be shared across many
accelerators each step. Rewards scale and cluster networking.
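A rough way to see why interconnect matters: estimate how much gradient traffic each accelerator sends per training step. The sketch below assumes a ring all-reduce, a common collective in which each device sends about 2(N-1)/N times the gradient payload. The source does not say which collective is used, so treat that choice, the function names, and the example numbers as assumptions. Latency and overlap with compute are ignored.

```python
def ring_allreduce_bytes_per_device(num_params, bytes_per_grad, num_devices):
    """Bytes each device sends in one ring all-reduce of the full gradient."""
    if num_devices < 2:
        return 0.0                       # nothing to share on a single device
    payload = num_params * bytes_per_grad
    return 2 * (num_devices - 1) / num_devices * payload


def allreduce_seconds(num_params, bytes_per_grad, num_devices, link_bytes_per_s):
    """Bandwidth-only lower bound on the time for one gradient exchange."""
    sent = ring_allreduce_bytes_per_device(num_params, bytes_per_grad, num_devices)
    return sent / link_bytes_per_s


# Hypothetical setup: 1B parameters, 2-byte gradients, 8 devices, 100 GB/s links.
sent = ring_allreduce_bytes_per_device(1_000_000_000, 2, 8)
secs = allreduce_seconds(1_000_000_000, 2, 8, 100e9)
print(f"{sent / 1e9:.2f} GB sent per device, at least {secs * 1e3:.0f} ms per step")
```

Because this cost recurs every step, a slow link leaves expensive arithmetic units idle, which is why training hardware is sold as clusters with dedicated networking rather than as isolated chips.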
Inference
Run the model
Often memory-bound and latency-sensitive. The system must load weights, manage the key-value (KV) cache that
stores attention state for the context, generate tokens quickly, and minimise cost per token. Agentic AI raises
the stakes: one user task can trigger many sequential model and tool calls, so inference efficiency drives the unit economics.
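The memory-bound nature of inference shows up in two quick estimates: how large the KV cache grows with context, and how fast a token can be generated if every weight must be read once per token. The sketch below uses the standard size formula for a transformer KV cache. The model shape (32 layers, 8 KV heads, head dimension 128), the 7-billion-parameter weight count, and the 2 TB/s bandwidth are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_element):
    """Size of the key and value tensors cached across all layers."""
    # Leading 2 = one tensor for keys plus one for values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_element


def decode_seconds_per_token(weight_bytes, cache_bytes, bandwidth_bytes_per_s):
    """Bandwidth-only lower bound: weights and cache are read once per token."""
    return (weight_bytes + cache_bytes) / bandwidth_bytes_per_s


# Hypothetical model and hardware, 16-bit values throughout.
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                       seq_len=8192, batch=1, bytes_per_element=2)
weights = 7_000_000_000 * 2
per_token = decode_seconds_per_token(weights, cache, 2e12)

print(f"KV cache: {cache / 2**30:.0f} GiB for one 8192-token sequence")
print(f"at most {1 / per_token:.0f} tokens/s for a single sequence")
```

In this model the token rate is set by memory bandwidth, not by peak FLOPS, and the cache grows linearly with both context length and the number of concurrent sequences. Those are the two pressures inference-oriented designs try to relieve.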
Strategic read
AI hardware is moving from one-chip-fits-all to workload-specific design. The frontier training cluster
wants maximum throughput and interconnect. The production inference fleet wants low latency, memory
efficiency, reliability, and cost control. This is why the AI chip market is splitting into training
accelerators, inference accelerators, hyperscaler custom silicon, edge NPUs, and specialised AI processors.
- Chips are no longer just about peak FLOPS: sustained utilisation decides real throughput.
- Memory bandwidth and chip-to-chip interconnect are becoming as important as raw compute.
- Training rewards scale, throughput, and cluster networking.
- Inference rewards latency, utilisation, KV-cache efficiency, and cost per generated token.
- Agentic AI increases demand for efficient inference because one task may require many model and tool calls.
- Hyperscalers design custom silicon because workload control can become a structural cost advantage.
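The cost-per-token and agentic points above reduce to simple arithmetic. The sketch below is a toy model: the hourly accelerator cost, the sustained throughput, and the number of calls per task are hypothetical inputs, and it ignores idle capacity, prompt caching, and tool-execution time.

```python
def cost_per_million_tokens(hourly_cost, tokens_per_second):
    """Serving cost per million generated tokens at full utilisation."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


def agentic_task_cost(model_calls, tokens_per_call, price_per_million):
    """Cost of one user task that fans out into several model calls."""
    return model_calls * tokens_per_call / 1_000_000 * price_per_million


# Hypothetical fleet: $3.60 per accelerator-hour, 1,000 tokens/s sustained.
price = cost_per_million_tokens(3.60, 1000)

single_chat = agentic_task_cost(model_calls=1, tokens_per_call=1500, price_per_million=price)
agent_task = agentic_task_cost(model_calls=20, tokens_per_call=1500, price_per_million=price)

print(f"${price:.2f} per million tokens")
print(f"single reply: ${single_chat:.4f}, 20-call agent task: ${agent_task:.4f}")
```

Doubling sustained throughput on the same hardware halves the price per token in this model, and a task that chains twenty calls multiplies whatever that price is by twenty. That is the sense in which inference efficiency drives unit economics.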
Beyond Nvidia: the rise of custom AI silicon
AI compute is no longer a single market. Nvidia GPUs became the default engine because they are flexible,
programmable, and excellent at large-scale parallel math — and that flexibility still matters, especially for
frontier training. But production AI is increasingly an inference problem: serving tokens quickly,
reliably, and cheaply across millions of requests. As that side of the workload grows, specialised silicon
becomes attractive — a buyer who knows the workload can trade general-purpose flexibility for better latency,
utilisation, power efficiency, and cost per token.
General-purpose AI GPU
Flexible training and inference
Best ecosystem, broad programmability, deep software moat — CUDA, kernels, libraries, model code — and useful
across many AI and non-AI workloads. Examples: Nvidia data-centre GPUs. The default platform when
the workload is varied or not yet stable.
Hyperscaler ASIC
Custom cloud-scale AI compute
Useful when a cloud provider controls the workload end-to-end and wants lower cost, better performance
per watt, and less dependence on external chip supply. Examples: Google TPU, AWS Trainium and
Inferentia, Microsoft Maia, Meta MTIA.
Wafer-scale AI chip
Very-large-chip compute
Pushes more compute and memory onto one very large piece of silicon to cut chip-to-chip communication
overhead. Targets training and inference where interconnect is the binding constraint. Examples:
Cerebras-style wafer-scale engines.
Inference accelerator
Fast token generation
Optimised for low-latency inference, high throughput, and cost-efficient serving. Trades training
generality for serving speed and tokens-per-dollar. Examples: Groq-style LPUs, SambaNova,
d-Matrix, and other inference-focused ASICs.
Edge NPU
Local inference on device
Moves smaller AI workloads onto phones, laptops, cameras, and embedded devices — lower latency, better
privacy, offline use, and reduced cloud cost. Examples: Apple Neural Engine, Qualcomm Hexagon,
Intel / AMD / Arm NPUs.
Read it as workload fragmentation, not Nvidia versus everyone
The durable lesson is not “Nvidia is losing” or that any single challenger will win. The durable lesson is
that AI compute is fragmenting by workload. Frontier training clusters, cloud inference fleets, enterprise
agents, consumer-device AI, and specialised scientific workloads will not all use the same hardware forever.
Even Nvidia has shown interest in inference-specific technology through major licensing, partnership, and
talent-related moves — the incumbent itself treats inference as its own category.
- Nvidia remains the default platform for flexible AI compute, especially for training.
- Custom ASICs become more attractive when the workload is predictable and high-volume.
- Inference may become the largest economic battleground, because models have to be served continuously and agents call them in long chains.
- Cost per token, latency, and utilisation become as important as peak FLOPS.
- Memory bandwidth, networking, and software tooling matter as much as the silicon.
- Hyperscalers have a structural advantage because they control both the demand and the infrastructure.
- Startups compete by specialising for a workload, not by copying the Nvidia stack head-on.
Chip types — what does each one actually do?
CPUs, GPUs, TPUs, custom hyperscaler silicon, AMD Instinct, Intel Gaudi, wafer-scale, LPUs and edge NPUs — nine cards, each with its best use, its weakness, and a concrete example.
Which chip for which job?
A practical first cut, not a benchmark. Always validate on your workload, your software stack and your latency target.
Misconceptions to drop
The fastest way to think clearly about AI chips is to stop carrying these around.
Sources & methodology
Hardware specs are taken from official vendor product pages, datasheets and architecture briefs (NVIDIA,
AMD, Google Cloud TPU docs, AWS Neuron docs, Intel Gaudi, Cerebras, Groq, Apple). Geopolitical and
supply-chain reads draw on TSMC and ASML disclosures, SIA / CSIS reports, Reuters coverage, and the
official Singapore EDB / Malaysia MIDA sector pages. "Latest available" labels apply throughout.