Majestic AIU Dashboard


Majestic AIU — Inference Latency Simulator v0.3.0

192 Minds · 48 HEX Clusters · 3 MB SRAM/Mind · 11 Model Sizes · 10 Batch Sizes
Overview
Tile Simulator
Heatmaps
Charts
HW Benchmarks
H100 Comparison
Algorithm
Config Chooser
Model Space
Theoretical Model
Theory vs Measured
TUNABLES v0.3.0
Apply recomputes heatmaps, KPIs, charts, H100 comparison, and the config chooser.
CONFIGURATION
Excel sheet: Latency Optimized
derived: 48 HEX · 192 Mind
(48 HEX × 4 Minds = 192 Minds per AIU)
Block Time Heatmap (µs)
Execution Mode Map
Total Time Heatmap (µs)
Model Configuration (M × D)
Memory Breakdown (3 MB SRAM Limit)
All Configurations for This Model × Batch
Wave Count
Optimal tile_M Selection
Memory Usage (MB)

Block Time vs Batch Size

Total Time vs Batch Size

Est. Tile Time vs Batch Size

Waves vs Batch Size

Block Latency vs eff_tN — Pivot Heatmaps by tile_M
Latency Scaling — K vs tile_M

Latency vs K (eff_tN = 1)

Latency vs K (eff_tN = 64)

All HW Benchmark Entries (218 configs)
Reference Hardware: H100 (measured)
Total Time — AIU vs Best H100 (µs)
Speedup Heatmap — Best H100 / AIU (> 1 = AIU faster)
Latency Scaling — AIU vs H100

Total Time vs Batch Size (Small Models)

Total Time vs Batch Size (Large Models)

Speedup (H100 / AIU) vs Batch Size

Crossover Analysis — Batch Size where AIU wins

Detailed Comparison Table
Tiled GEMM Execution Model — Partition → Split-K → Accumulate
⚡ Key Insight
In HEX mode, each C-tile is owned end-to-end by one HEX. Inside that HEX the K dimension is split 4-way across its 4 Minds: each Mind computes a partial over K/4 and atomic_adds into one shared region in the HEX's 12 MB SRAM. No cross-HEX reduction. With 48 HEX total, the AIU processes 48 tiles per wave → waves = ⌈tiles / 48⌉ with +2.0 µs overhead (includes Split-K atomic tax). In Mind mode each of the 192 Minds owns a full tile end-to-end — no Split-K, no atomic — so waves = ⌈tiles / 192⌉ with +1.0 µs overhead. Block time is est × waves + ov(mode); the chooser picks whichever mode yields lower total D × block.
Mode-dependent per-block overhead — HEX mode pays 2.0 µs (includes the ~1 µs Split-K atomic tax: partial-sum atomic_add into HEX SRAM + intra-HEX drain barrier). Mind mode skips Split-K entirely (1 Mind owns the full K), so it pays only 1.0 µs (DMA setup + kernel launch + HexMem writeback). Provenance of the 2.0 value: base 1.0 µs + atomic 0.5 µs × 2 GEMMs = 2.0 µs; the ~1 µs atomic delta (0.5 µs × 2 GEMMs) is the incremental cost of cooperative reduction and is charged only when Split-K is active. Total time = D × (est · waves + ov(mode)).
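The wave and overhead arithmetic above can be sketched in a few lines. This is an illustrative restatement of the formulas in the text, not the dashboard's actual source; function and constant names are invented for clarity.

```python
# Per-block timing model from the text: block = est * waves + ov(mode),
# total = D * block. HEX owns 48 tiles/wave, Mind owns 192.
import math

HEX_WORKERS, MIND_WORKERS = 48, 192   # tile owners per wave
OV_US = {"hex": 2.0, "mind": 1.0}     # HEX pays the ~1 us Split-K atomic tax

def block_time_us(mode: str, tiles: int, est_us: float) -> float:
    """est_us is the per-wave tile time for the given mode."""
    workers = HEX_WORKERS if mode == "hex" else MIND_WORKERS
    waves = math.ceil(tiles / workers)
    return est_us * waves + OV_US[mode]

def total_time_us(mode: str, tiles: int, est_us: float, depth: int) -> float:
    # One block per layer of depth D.
    return depth * block_time_us(mode, tiles, est_us)
```

For example, 96 tiles at est = 3.0 µs take two HEX waves (3.0 × 2 + 2.0 = 8.0 µs per block) but only one Mind wave (3.0 × 1 + 1.0 = 4.0 µs), before accounting for Mind mode's larger per-wave est.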
Config Chooser — Pick the Optimal Tile Shape for Your Workload
Convention: AIU GEMM is C[B,M] = X[B,K] · W[K,M] — aligned with the Algorithm tab's A·B=C, where A↔X[B,K], B↔W[K,M], C↔output[B,M]. The tile grid is gN = ⌈B / eff_tN⌉ along the batch axis × gM = ⌈M / tile_M⌉ along the output axis. There is no separate N dimension here.
Pick from Preset Heatmap — Block Time (µs) (click any cell to load B/M/K/D)
Note: Cell values come from the Excel V1.1 MODE_OVERRIDE table (frozen Excel mode choice). The Chooser re-runs the block-time formula and may select a different mode than the cell shows; when compared at the same mode, the HEX/Mind values in the Decision Flow still match.
Decision Flow
Candidate Rankings
Rank · Mode · eff_tN · tile_M · grid gN×gM · tiles · w/wave · waves · SRAM (MB) · C-tile (KB) · Atomic · Est (µs) · +OV (µs) · Block (µs) · Total (µs) · Status
🧭 How it picks
For each candidate (mode, eff_tN ∈ {64,128}, tile_M ∈ {64,128,256}) the chooser computes the SRAM footprint In·4 + Out·2 + W1·(2 or 2/4) and rejects any candidate over 3 MB/Mind. Grid dims are gN=⌈B/eff_tN⌉, gM=⌈M/tile_M⌉; in HEX mode each tile is owned by one HEX (4 Minds split K and atomic_add into one HEX-SRAM region), so waves = ⌈tiles/48⌉. In Mind mode each Mind owns a full tile, so waves = ⌈tiles/192⌉. Block time = est·waves + ov(mode) where ov = 2.0 µs for HEX (Split-K atomic tax included) or 1.0 µs for Mind (no atomic). Total = D·block. The candidate with the lowest total wins; because Mind mode saves ~1 µs of overhead per block, it often edges out HEX on small workloads even though each Mind sweeps the full K, trading 4× fewer waves for roughly 4× more work per wave.
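The selection loop above can be sketched as follows. This is a hedged sketch under stated assumptions: the `est` timing function is supplied by the caller (the dashboard calibrates it from benchmarks), and the SRAM byte counts are a literal reading of the In·4 + Out·2 + W·(2 or 2/4) footprint in the text, not the exact production formula.

```python
# Candidate enumeration and selection, as described in "How it picks".
import math

SRAM_LIMIT_MB = 3.0   # per-Mind budget

def sram_mb(mode: str, eff_tN: int, tile_M: int, K: int) -> float:
    # Assumed footprint per Mind: In*4 + Out*2 + W*(2 or 2/4) bytes.
    # In HEX mode the K-slice of W per Mind is K/4, hence the 2/4 factor.
    w_bytes = 2.0 if mode == "mind" else 2.0 / 4
    total = eff_tN * K * 4 + eff_tN * tile_M * 2 + K * tile_M * w_bytes
    return total / 2**20

def choose(B: int, M: int, K: int, D: int, est):
    """est(mode, eff_tN, tile_M, K) -> per-wave tile time in us."""
    best = None
    for mode, workers, ov in (("hex", 48, 2.0), ("mind", 192, 1.0)):
        for eff_tN in (64, 128):
            for tile_M in (64, 128, 256):
                if sram_mb(mode, eff_tN, tile_M, K) > SRAM_LIMIT_MB:
                    continue  # over the 3 MB/Mind budget
                tiles = math.ceil(B / eff_tN) * math.ceil(M / tile_M)
                waves = math.ceil(tiles / workers)
                block = est(mode, eff_tN, tile_M, K) * waves + ov
                cand = (D * block, mode, eff_tN, tile_M)
                if best is None or cand[0] < best[0]:
                    best = cand
    return best  # (total_us, mode, eff_tN, tile_M)
```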
Per-block overhead depends on mode — paid exactly once per block, regardless of wave count:
  • Mind mode — 1.0 µs: DMA setup + prefetch, kernel launch / barrier, HexMem writeback. No Split-K, no atomic.
  • HEX mode — 2.0 µs: same base + ~1 µs Split-K atomic tax (0.5 µs × 2 GEMMs) (4 Minds atomic_add into a shared HEX-SRAM region + intra-HEX drain barrier before writeback).
So block = est · waves + ov(mode). The ~1 µs delta is exactly the cost of intra-HEX cooperative reduction; Mind mode avoids it by keeping one Mind per tile. Small workloads are dominated by this overhead term, which is why Mind mode often wins even though each Mind sweeps the full K and its waves run roughly 4× longer.
Model Parameter Space
Each curve plots the set of (M, D) pairs that satisfy P = 2 · D · M² for a fixed parameter count P (two weight matrices per block). Along any one curve D = P / (2·M²), so doubling M quarters D. X-axis = hidden size M; Y-axis = depth D.
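The curve relation can be verified directly. A minimal sketch of the D = P / (2·M²) relationship stated above; the example parameter count is illustrative.

```python
# Each curve fixes P = 2 * D * M**2 (two weight matrices per block),
# so along a curve D = P / (2 * M**2).
def depth_for(P: float, M: float) -> float:
    return P / (2 * M**2)

P = 7e9   # example parameter count, illustrative only
# Doubling M quarters D, as stated in the text:
assert depth_for(P, 2048) == 4 * depth_for(P, 4096)
```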

Model Space — D vs M (log-log)

Model Space — D vs M (linear)

Theoretical Model Generator
Enter a parameter count P (in millions) and one or more depths D (comma-separated). For each D the dashboard derives M = √(P / (2·D)) with K = M / 4, snaps K to the nearest measured value in the HW benchmarks, and computes the four latency curves across batch size using the same chooser logic as the Config Chooser tab.
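The derivation above can be sketched as follows. This is a hedged illustration: the list of measured K values is a placeholder, not the dashboard's actual HW benchmark set.

```python
# Given P (millions of params) and depth D, derive M = sqrt(P / (2*D)),
# K = M / 4, then snap K to the nearest measured benchmark value.
import math

MEASURED_K = [256, 512, 1024, 2048, 4096]   # placeholder benchmark set

def derive(P_millions: float, D: int):
    P = P_millions * 1e6
    M = math.sqrt(P / (2 * D))
    K = M / 4
    K_snapped = min(MEASURED_K, key=lambda k: abs(k - K))
    return M, K_snapped

M, K = derive(7000, 32)   # e.g. a 7 B-parameter model at depth 32
```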

Block Time vs Batch Size

Total Time vs Batch Size

Est. Tile Time vs Batch Size

Waves vs Batch Size

Theoretical Peak vs Simulated Block Time
Roofline per Mind — theory charges compute OR memory (whichever is bigger), plus a store step:
  • Per-Mind matrix engine: 2048 BF16 FLOPs/cycle @ 2 GHz ⇒ 4.096 TFLOPs/s peak, with MAC utilization: 90% when K/worker ≥ 1024, linearly ramping 50%→90% below.
  • Per-Mind vector engine: 256 FLOPs/cycle @ 2 GHz ⇒ 0.512 TFLOPs/s.
  • Per-Mind CLP load port: 128 GB/s from HexMem. Weights/activations assumed pre-staged in HexMem.
  • HEX mode → 4 Minds split K per tile ⇒ each Mind does K/4 of the GEMM + loads its slice. Mind mode → 1 Mind per tile ⇒ full K.
per_tile = max(compute, CLP_load) + CLP_store, block = waves × per_tile + ov(mode) (ov = 2 µs HEX / 1 µs Mind). Ratio = sim / theory.
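The roofline charge above can be sketched per Mind. This is a simplified illustration of the max(compute, load) + store rule using the stated peak numbers; the byte counts for loads and stores are assumptions (BF16 operands pre-staged in HexMem), not the dashboard's exact model.

```python
# Per-Mind roofline: charge compute OR load (whichever is larger), plus a store.
PEAK_TFLOPS = 4.096   # 2048 BF16 FLOPs/cycle @ 2 GHz
CLP_GBPS = 128.0      # per-Mind CLP load/store bandwidth from HexMem

def utilization(k_per_worker: float) -> float:
    # 90% MAC utilization when K/worker >= 1024, linearly ramping 50%->90% below.
    if k_per_worker >= 1024:
        return 0.90
    return 0.50 + 0.40 * (k_per_worker / 1024)

def per_tile_us(eff_tN: int, tile_M: int, K: int, mode: str) -> float:
    k_w = K / 4 if mode == "hex" else K              # HEX splits K across 4 Minds
    flops = 2 * eff_tN * tile_M * k_w                # each MAC = 2 FLOPs
    compute = flops / (PEAK_TFLOPS * 1e12 * utilization(k_w)) * 1e6
    # Assumed traffic: activation slice + weight slice in, C-tile out, BF16 = 2 B.
    load = (eff_tN * k_w + k_w * tile_M) * 2 / (CLP_GBPS * 1e9) * 1e6
    store = eff_tN * tile_M * 2 / (CLP_GBPS * 1e9) * 1e6
    return max(compute, load) + store
```

A usage note: block time then follows the same shape as the simulator, block = waves × per_tile + ov(mode), and the Simulated / Theory ratio is sim ÷ this theoretical block.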
Theoretical Block Time Heatmap (µs)
Theoretical Total Run Time Heatmap (µs · D depth)
Simulated / Theory Ratio — Total Time (1.0 = at peak)
Side-by-side Table