Majestic AIU Dashboard


Majestic AIU — Inference Latency Simulator v0.3.0

192 Minds · 48 HEX Clusters · 3 MB SRAM/Mind · 11 Model Sizes · 10 Batch Sizes
Overview
Tile Simulator
Heatmaps
Charts
HW Benchmarks
H100 Comparison
Algorithm
Config Chooser
Model Space
Theoretical Model
Theory vs Measured
TUNABLES v0.3.0
Apply recomputes heatmaps, KPIs, charts, H100 comparison, and the config chooser.
CONFIGURATION
Excel sheet: Latency Optimized
derived: 48 HEX · 192 Mind
(48 HEX × 4 Minds = 192 Minds per AIU)
Block Time Heatmap (µs)
Execution Mode Map
Total Time Heatmap (µs)
Model Configuration (M × D)
Memory Breakdown (3 MB SRAM Limit)
All Configurations for This Model × Batch
Wave Count
Optimal tile_M Selection
Memory Usage (MB)

Block Time vs Batch Size

Total Time vs Batch Size

Est. Tile Time vs Batch Size

Waves vs Batch Size

Block Latency vs eff_tN — Pivot Heatmaps by tile_M
Latency Scaling — K vs tile_M

Latency vs K (eff_tN = 1)

Latency vs K (eff_tN = 64)

All HW Benchmark Entries (218 configs)
Reference Hardware: H100 (measured)
Total Time — AIU vs Best H100 (µs)
Speedup Heatmap — Best H100 / AIU (> 1 = AIU faster)
Latency Scaling — AIU vs H100

Total Time vs Batch Size (Small Models)

Total Time vs Batch Size (Large Models)

Speedup (H100 / AIU) vs Batch Size

Crossover Analysis — Batch Size where AIU wins

Detailed Comparison Table
Tiled GEMM Execution Model — Partition → Split-K → Accumulate
⚡ Key Insight
In HEX mode, each C-tile is owned end-to-end by one HEX. Inside that HEX the K dimension is split 4-way across its 4 Minds: each Mind computes a partial over K/4 and atomic_adds into one shared region in the HEX's 12 MB SRAM. No cross-HEX reduction. With 48 HEX total, the AIU processes 48 tiles per wave → waves = ⌈tiles / 48⌉ with +2.0 µs overhead (includes Split-K atomic tax). In Mind mode each of the 192 Minds owns a full tile end-to-end — no Split-K, no atomic — so waves = ⌈tiles / 192⌉ with +1.0 µs overhead. Block time is est × waves + ov(mode); the chooser picks whichever mode yields lower total D × block.
Mode-dependent per-block overhead — HEX mode pays 2.0 µs (includes the ~1 µs Split-K atomic tax: partial-sum atomic_add into HEX SRAM + intra-HEX drain barrier). Mind mode skips Split-K entirely (1 Mind owns the full K), so it pays only 1.0 µs (DMA setup + kernel launch + HexMem writeback). Provenance of the 2.0 value: base 1.0 µs + atomic 0.5 µs × 2 GEMMs = 2.0 µs; the ~1 µs atomic delta (0.5 µs × 2 GEMMs) is the incremental cost of cooperative reduction and is charged only when Split-K is active. Total time = D × (est · waves + ov(mode)).
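The wave and overhead arithmetic above can be sketched in a few lines. This is an illustrative restatement of the formulas in the text, not the dashboard's actual source; function and constant names are invented for clarity.

```python
# Per-block timing model from the text: block = est * waves + ov(mode),
# total = D * block. HEX owns 48 tiles/wave, Mind owns 192.
import math

HEX_WORKERS, MIND_WORKERS = 48, 192   # tile owners per wave
OV_US = {"hex": 2.0, "mind": 1.0}     # HEX pays the ~1 us Split-K atomic tax

def block_time_us(mode: str, tiles: int, est_us: float) -> float:
    """est_us is the per-wave tile time for the given mode."""
    workers = HEX_WORKERS if mode == "hex" else MIND_WORKERS
    waves = math.ceil(tiles / workers)
    return est_us * waves + OV_US[mode]

def total_time_us(mode: str, tiles: int, est_us: float, depth: int) -> float:
    # One block per layer of depth D.
    return depth * block_time_us(mode, tiles, est_us)
```

For example, 96 tiles at est = 3.0 µs take two HEX waves (3.0 × 2 + 2.0 = 8.0 µs per block) but only one Mind wave (3.0 × 1 + 1.0 = 4.0 µs), before accounting for Mind mode's larger per-wave est.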
Config Chooser — Pick the Optimal Tile Shape for Your Workload
Convention: AIU GEMM is C[B,M] = X[B,K] · W[K,M] — aligned with the Algorithm tab's A·B=C, where A↔X[B,K], B↔W[K,M], C↔output[B,M]. The tile grid is gN = ⌈B / eff_tN⌉ along the batch axis × gM = ⌈M / tile_M⌉ along the output axis. There is no separate N dimension here.
Pick from Preset Heatmap — Block Time (µs) (click any cell to load B/M/K/D)
Note: Cell values come from the Excel V1.1 MODE_OVERRIDE table (frozen Excel mode choice). The Chooser re-runs the block-time formula and may select a different mode than the cell shows; when compared at the same mode, the HEX/Mind values in the Decision Flow still match.
Decision Flow
Candidate Rankings
Rank · Mode · eff_tN · tile_M · grid gN×gM · tiles · w/wave · waves · SRAM (MB) · C-tile (KB) · Atomic · Est (µs) · +OV (µs) · Block (µs) · Total (µs) · Status
🧭 How it picks
For each candidate (mode, eff_tN ∈ {64,128}, tile_M ∈ {64,128,256}) the chooser computes the SRAM footprint In·4 + Out·2 + W1·(2 or 2/4) and rejects any candidate over 3 MB/Mind. Grid dims are gN=⌈B/eff_tN⌉, gM=⌈M/tile_M⌉; in HEX mode each tile is owned by one HEX (4 Minds split K and atomic_add into one HEX-SRAM region), so waves = ⌈tiles/48⌉. In Mind mode each Mind owns a full tile, so waves = ⌈tiles/192⌉. Block time = est·waves + ov(mode) where ov = 2.0 µs for HEX (Split-K atomic tax included) or 1.0 µs for Mind (no atomic). Total = D·block. The candidate with the lowest total wins; because Mind mode saves ~1 µs of overhead per block, it often edges out HEX on small workloads even though each Mind sweeps the full K, trading 4× fewer waves for roughly 4× more work per wave.
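The selection loop above can be sketched as follows. This is a hedged sketch under stated assumptions: the `est` timing function is supplied by the caller (the dashboard calibrates it from benchmarks), and the SRAM byte counts are a literal reading of the In·4 + Out·2 + W·(2 or 2/4) footprint in the text, not the exact production formula.

```python
# Candidate enumeration and selection, as described in "How it picks".
import math

SRAM_LIMIT_MB = 3.0   # per-Mind budget

def sram_mb(mode: str, eff_tN: int, tile_M: int, K: int) -> float:
    # Assumed footprint per Mind: In*4 + Out*2 + W*(2 or 2/4) bytes.
    # In HEX mode the K-slice of W per Mind is K/4, hence the 2/4 factor.
    w_bytes = 2.0 if mode == "mind" else 2.0 / 4
    total = eff_tN * K * 4 + eff_tN * tile_M * 2 + K * tile_M * w_bytes
    return total / 2**20

def choose(B: int, M: int, K: int, D: int, est):
    """est(mode, eff_tN, tile_M, K) -> per-wave tile time in us."""
    best = None
    for mode, workers, ov in (("hex", 48, 2.0), ("mind", 192, 1.0)):
        for eff_tN in (64, 128):
            for tile_M in (64, 128, 256):
                if sram_mb(mode, eff_tN, tile_M, K) > SRAM_LIMIT_MB:
                    continue  # over the 3 MB/Mind budget
                tiles = math.ceil(B / eff_tN) * math.ceil(M / tile_M)
                waves = math.ceil(tiles / workers)
                block = est(mode, eff_tN, tile_M, K) * waves + ov
                cand = (D * block, mode, eff_tN, tile_M)
                if best is None or cand[0] < best[0]:
                    best = cand
    return best  # (total_us, mode, eff_tN, tile_M)
```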
Per-block overhead depends on mode — paid exactly once per block, regardless of wave count:
  • Mind mode — 1.0 µs: DMA setup + prefetch, kernel launch / barrier, HexMem writeback. No Split-K, no atomic.
  • HEX mode — 2.0 µs: same base + ~1 µs Split-K atomic tax (0.5 µs × 2 GEMMs) (4 Minds atomic_add into a shared HEX-SRAM region + intra-HEX drain barrier before writeback).
So block = est · waves + ov(mode). The ~1 µs delta is exactly the cost of intra-HEX cooperative reduction; Mind mode avoids it by keeping one Mind per tile. Small workloads are dominated by this overhead term, which is why Mind mode often wins even though each Mind sweeps the full K and its waves run roughly 4× longer.
Model Parameter Space
Each curve plots the set of (M, D) pairs that satisfy P = 2 · D · M² for a fixed parameter count P (two weight matrices per block). Along any one curve D = P / (2·M²), so doubling M quarters D. X-axis = hidden size M; Y-axis = depth D.
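The curve relation can be verified directly. A minimal sketch of the D = P / (2·M²) relationship stated above; the example parameter count is illustrative.

```python
# Each curve fixes P = 2 * D * M**2 (two weight matrices per block),
# so along a curve D = P / (2 * M**2).
def depth_for(P: float, M: float) -> float:
    return P / (2 * M**2)

P = 7e9   # example parameter count, illustrative only
# Doubling M quarters D, as stated in the text:
assert depth_for(P, 2048) == 4 * depth_for(P, 4096)
```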

Model Space — D vs M (log-log)

Model Space — D vs M (linear)

Theoretical Model Generator
Enter a parameter count P (in millions) and one or more depths D (comma-separated). For each D the dashboard derives M = √(P / (2·D)) with K = M / 4, snaps K to the nearest measured value in the HW benchmarks, and computes the four latency curves across batch size using the same chooser logic as the Config Chooser tab.
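The derivation above can be sketched as follows. This is a hedged illustration: the list of measured K values is a placeholder, not the dashboard's actual HW benchmark set.

```python
# Given P (millions of params) and depth D, derive M = sqrt(P / (2*D)),
# K = M / 4, then snap K to the nearest measured benchmark value.
import math

MEASURED_K = [256, 512, 1024, 2048, 4096]   # placeholder benchmark set

def derive(P_millions: float, D: int):
    P = P_millions * 1e6
    M = math.sqrt(P / (2 * D))
    K = M / 4
    K_snapped = min(MEASURED_K, key=lambda k: abs(k - K))
    return M, K_snapped

M, K = derive(7000, 32)   # e.g. a 7 B-parameter model at depth 32
```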

Block Time vs Batch Size

Total Time vs Batch Size

Est. Tile Time vs Batch Size

Waves vs Batch Size

Theoretical Peak vs Simulated Block Time
Roofline per Mind — theory charges compute OR memory (whichever is bigger), plus a store step:
  • Per-Mind matrix engine: 2048 BF16 FLOPs/cycle @ 2 GHz ⇒ 4.096 TFLOPs/s peak, with MAC utilization: 90% when K/worker ≥ 1024, linearly ramping 50%→90% below.
  • Per-Mind vector engine: 256 FLOPs/cycle @ 2 GHz ⇒ 0.512 TFLOPs/s.
  • Per-Mind CLP load port: 128 GB/s from HexMem. Weights/activations assumed pre-staged in HexMem.
  • HEX mode → 4 Minds split K per tile ⇒ each Mind does K/4 of the GEMM + loads its slice. Mind mode → 1 Mind per tile ⇒ full K.
per_tile = max(compute, CLP_load) + CLP_store, block = waves × per_tile + ov(mode) (ov = 2 µs HEX / 1 µs Mind). Ratio = sim / theory.
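The roofline charge above can be sketched per Mind. This is a simplified illustration of the max(compute, load) + store rule using the stated peak numbers; the byte counts for loads and stores are assumptions (BF16 operands pre-staged in HexMem), not the dashboard's exact model.

```python
# Per-Mind roofline: charge compute OR load (whichever is larger), plus a store.
PEAK_TFLOPS = 4.096   # 2048 BF16 FLOPs/cycle @ 2 GHz
CLP_GBPS = 128.0      # per-Mind CLP load/store bandwidth from HexMem

def utilization(k_per_worker: float) -> float:
    # 90% MAC utilization when K/worker >= 1024, linearly ramping 50%->90% below.
    if k_per_worker >= 1024:
        return 0.90
    return 0.50 + 0.40 * (k_per_worker / 1024)

def per_tile_us(eff_tN: int, tile_M: int, K: int, mode: str) -> float:
    k_w = K / 4 if mode == "hex" else K              # HEX splits K across 4 Minds
    flops = 2 * eff_tN * tile_M * k_w                # each MAC = 2 FLOPs
    compute = flops / (PEAK_TFLOPS * 1e12 * utilization(k_w)) * 1e6
    # Assumed traffic: activation slice + weight slice in, C-tile out, BF16 = 2 B.
    load = (eff_tN * k_w + k_w * tile_M) * 2 / (CLP_GBPS * 1e9) * 1e6
    store = eff_tN * tile_M * 2 / (CLP_GBPS * 1e9) * 1e6
    return max(compute, load) + store
```

A usage note: block time then follows the same shape as the simulator, block = waves × per_tile + ov(mode), and the Simulated / Theory ratio is sim ÷ this theoretical block.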
Theoretical Block Time Heatmap (µs)
Theoretical Total Run Time Heatmap (µs · D depth)
Simulated / Theory Ratio — Total Time (1.0 = at peak)
Side-by-side Table