Maia 200 is Microsoft’s new in-house AI accelerator designed for inference in Azure datacenters. It targets the cost of token generation for large language models and other reasoning workloads by combining narrow-precision compute, a dense on-chip memory hierarchy, and an Ethernet-based scale-up fabric.
Why did Microsoft build a dedicated inference chip?
Training and inference stress hardware in different ways. Training needs very large all-to-all communication and long-running jobs. Inference cares about tokens per second, latency, and tokens per dollar. Microsoft positions Maia 200 as its most efficient inference system, with about 30 percent better performance per dollar than the latest hardware in its fleet.
Maia 200 is part of a heterogeneous Azure stack. It will serve multiple models, including the latest GPT 5.2 models from OpenAI, and will power workloads in Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team will use the chip for synthetic data generation and reinforcement learning to improve in-house models.
Core silicon and numeric specs
Each Maia 200 die is fabricated on TSMC’s 3 nanometer process. The chip integrates more than 140 billion transistors.
The compute pipeline is built around native FP8 and FP4 tensor cores. A single chip delivers more than 10 petaFLOPS of FP4 compute and more than 5 petaFLOPS of FP8, within a 750 W SoC TDP envelope.
Memory is split between stacked HBM and on-die SRAM. Maia 200 provides 216 GB of HBM3e with about 7 TB per second of bandwidth and 272 MB of on-die SRAM. The SRAM is organized into tile-level SRAM and cluster-level SRAM and is fully software managed. Compilers and runtimes can place working sets explicitly to keep attention and GEMM kernels close to compute.
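As a back-of-envelope illustration (not a figure from Microsoft), the published peak FLOPS and HBM bandwidth imply a roofline break-even arithmetic intensity, which is one way to see why so much on-die SRAM is paired with the tensor cores. The short Python sketch below computes it from the stated peak numbers, assuming HBM traffic is the only bottleneck.

```python
# Back-of-envelope roofline estimate from the published Maia 200 figures.
# Assumption: peak numbers are achievable and HBM bandwidth is the only
# limit; real kernels will land below these peaks.

PEAK_FP8_FLOPS = 5e15      # > 5 petaFLOPS FP8 per chip, as reported
PEAK_FP4_FLOPS = 10e15     # > 10 petaFLOPS FP4 per chip, as reported
HBM_BW_BYTES   = 7e12      # ~7 TB/s HBM3e bandwidth
SOC_TDP_WATTS  = 750       # SoC TDP envelope

# Arithmetic intensity (FLOPs per byte of HBM traffic) a kernel needs
# before it becomes compute-bound rather than bandwidth-bound.
break_even_fp8 = PEAK_FP8_FLOPS / HBM_BW_BYTES   # ~714 FLOPs/byte
break_even_fp4 = PEAK_FP4_FLOPS / HBM_BW_BYTES   # ~1,429 FLOPs/byte

# Nominal efficiency at peak.
fp4_tflops_per_watt = PEAK_FP4_FLOPS / 1e12 / SOC_TDP_WATTS  # ~13.3 TFLOPS/W

print(f"FP8 break-even intensity: {break_even_fp8:.0f} FLOPs/byte")
print(f"FP4 break-even intensity: {break_even_fp4:.0f} FLOPs/byte")
print(f"FP4 peak efficiency: {fp4_tflops_per_watt:.1f} TFLOPS/W")
```

Kernels that fall below these intensities, such as decode-phase attention, are exactly the workloads that benefit from keeping working sets in the software-managed SRAM tiers, since on-chip reuse raises their effective intensity.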
Tile-based microarchitecture and memory hierarchy
The Maia 200 microarchitecture is hierarchical. The base unit is the tile, the smallest autonomous compute and storage unit on the chip. Each tile includes a Tile Tensor Unit for high-throughput matrix operations and a Tile Vector Processor as a programmable SIMD engine. Tile SRAM feeds both units, and tile DMA engines move data in and out of SRAM without stalling compute. A Tile Control Processor orchestrates the sequence of tensor and DMA work.
Multiple tiles form a cluster. Each cluster exposes a larger, multi-banked Cluster SRAM that is shared across the tiles in that cluster. Cluster-level DMA engines move data between Cluster SRAM and the co-packaged HBM stacks. A cluster core coordinates multi-tile execution and uses redundancy schemes for tiles and SRAM to improve yield while keeping the same programming model.
This hierarchy lets the software stack pin different parts of the model in different tiers. For example, attention kernels can keep Q, K, and V tensors in tile SRAM, while collective communication kernels can stage payloads in cluster SRAM and reduce HBM pressure. The design goal is sustained high utilization as models grow in size and sequence length.
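To make software-managed placement concrete, here is a minimal, hypothetical sketch of a greedy tier-placement policy. The per-tier capacities and the policy itself are illustrative assumptions; Microsoft has not published the per-tile or per-cluster split of the 272 MB of SRAM, nor its compiler’s placement algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class Tier:
    """One level of the (hypothetical) placement hierarchy: tile SRAM, cluster SRAM, or HBM."""
    name: str
    capacity_bytes: int
    used_bytes: int = 0
    tensors: list = field(default_factory=list)

    def try_place(self, tensor_name: str, size: int) -> bool:
        if self.used_bytes + size <= self.capacity_bytes:
            self.used_bytes += size
            self.tensors.append(tensor_name)
            return True
        return False

def place(tensors: dict[str, int], tiers: list[Tier]) -> dict[str, str]:
    """Greedy placement: first-listed (hottest) tensors go to the fastest tier
    with room, spilling toward HBM."""
    placement = {}
    for name, size in tensors.items():
        for tier in tiers:
            if tier.try_place(name, size):
                placement[name] = tier.name
                break
    return placement

# Example with made-up capacities: FP8 Q/K/V blocks for one attention tile
# plus a staged collective buffer.
tiers = [Tier("tile_sram", 3 << 20), Tier("cluster_sram", 24 << 20), Tier("hbm", 216 << 30)]
tensors = {"q_block": 2 << 20, "k_block": 2 << 20, "v_block": 2 << 20, "allreduce_buf": 8 << 20}
print(place(tensors, tiers))
```

In a real toolchain, the compiler and runtime Microsoft describes would make this decision with knowledge of reuse distance and kernel scheduling rather than a simple first-fit pass.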
On-chip data movement and the Ethernet scale-up fabric
Inference is often limited by data movement, not peak compute. Maia 200 uses a custom Network on Chip together with a hierarchy of DMA engines. The Network on Chip spans tiles, clusters, memory controllers, and I/O units. It has separate planes for large tensor traffic and for small control messages. This separation keeps synchronization messages and small outputs from being blocked behind large transfers.
Beyond the chip boundary, Maia 200 integrates its own NIC and an Ethernet-based scale-up network that runs the AI Transport Layer protocol. The on-die NIC exposes about 1.4 TB per second in each direction, or 2.8 TB per second of bidirectional bandwidth, and scales to 6,144 accelerators in a two-tier domain.
Within each tray, four Maia accelerators form a Fully Connected Quad. These four devices have direct, non-switched links to one another. Most tensor-parallel traffic stays within this group, while only lighter collective traffic goes out to the switches. This improves latency and reduces switch port count for typical inference collectives.
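As a rough, hypothetical model of why keeping tensor-parallel collectives inside the quad matters, the sketch below estimates the bandwidth term of an all-reduce across the four directly linked accelerators. It treats the reported ~1.4 TB per second per direction as usable per-accelerator injection bandwidth and ignores latency, protocol overhead, and link-level details, none of which are published.

```python
# Rough lower-bound estimate of the bandwidth term for an all-reduce inside a
# Fully Connected Quad. Assumptions (not vendor-published): the full ~1.4 TB/s
# per-direction NIC bandwidth is usable for the collective, and a
# reduce-scatter + all-gather schedule moves 2*(N-1)/N of the buffer per device.

NIC_BW_PER_DIR = 1.4e12   # bytes/s, ~1.4 TB/s each direction, as reported
QUAD_SIZE = 4             # accelerators per Fully Connected Quad

def allreduce_bw_time(buffer_bytes: float, n: int = QUAD_SIZE,
                      bw: float = NIC_BW_PER_DIR) -> float:
    """Bandwidth-only time estimate for an all-reduce of buffer_bytes per device."""
    bytes_moved_per_device = 2 * (n - 1) / n * buffer_bytes
    return bytes_moved_per_device / bw

# Example: activations for one decode step of a hypothetical model,
# batch 64 x hidden 16384 at 1 byte per element (FP8) -> 1 MiB per device.
buf = 64 * 16384 * 1
print(f"{allreduce_bw_time(buf) * 1e6:.2f} microseconds (bandwidth term only)")
```

At megabyte-scale buffers the fixed latency of links and synchronization usually dominates this bandwidth term, which is one reason direct, non-switched links inside the quad are attractive for small decode-time collectives.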
Azure system integration and cooling
At the system level, Maia 200 follows the same rack, power, and mechanical standards as Azure GPU servers. It supports air-cooled and liquid-cooled configurations and uses a second-generation closed-loop liquid-cooling Heat Exchanger Unit for high-density racks. This allows mixed deployments of GPUs and Maia accelerators in the same datacenter footprint.
The accelerator integrates with the Azure control plane. Firmware management, health monitoring, and telemetry use the same workflows as other Azure compute services. This enables fleet-wide rollouts and maintenance without disrupting running AI workloads.
Key Takeaways
Here are the key technical takeaways:
- Inference-first design: Maia 200 is Microsoft’s first silicon and system platform built solely for AI inference, optimized for large-scale token generation in modern reasoning models and large language models.
- Numeric specs and memory hierarchy: The chip is fabricated on TSMC’s 3 nm process, integrates about 140 billion transistors, and delivers more than 10 PFLOPS of FP4 and more than 5 PFLOPS of FP8, with 216 GB of HBM3e at about 7 TB per second alongside 272 MB of on-chip SRAM split into tile SRAM and cluster SRAM and managed in software.
- Performance versus other cloud accelerators: Microsoft reports about 30 percent better performance per dollar than the latest Azure inference systems, and claims 3 times the FP4 performance of third-generation Amazon Trainium and higher FP8 performance than Google TPU v7 at the accelerator level.
- Tile-based architecture and Ethernet fabric: Maia 200 organizes compute into tiles and clusters with local SRAM, DMA engines, and a Network on Chip, and exposes an integrated NIC with about 1.4 TB per second per direction of Ethernet bandwidth, scaling to 6,144 accelerators with Fully Connected Quad groups as the local tensor-parallel domain.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

