Enter password to continue
waves = ⌈tiles / 48⌉ with +2.0 µs overhead (includes Split-K atomic tax). In Mind mode each of the 192 Minds owns a full tile end-to-end — no Split-K, no atomic — so waves = ⌈tiles / 192⌉ with +1.0 µs overhead. Block time is est × waves + ov(mode); the chooser picks whichever mode yields lower total D × block.
D × (est · waves + ov(mode)).
C[B,M] = X[B,K] · W[K,M] — aligned with the Algorithm tab's A·B=C, where A↔X[B,K], B↔W[K,M], C↔output[B,M]. The tile grid is gN = ⌈B / eff_tN⌉ along the batch axis × gM = ⌈M / tile_M⌉ along the output axis. There is no separate N dimension here.
MODE_OVERRIDE table (frozen Excel mode choice). The Chooser re-runs the block-time formula and may select a different mode than the cell — the HEX/Mind values in the Decision Flow still match when compared at the same mode.
| Rank | Mode | eff_tN | tile_M | grid gN×gM | tiles | w/wave | waves | SRAM (MB) | C-tile (KB) | Atomic | Est (µs) | +OV (µs) | Block (µs) | Total (µs) | Status |
|---|
In·4 + Out·2 + W1·(2 or 2/4)
and rejects any candidate over 3 MB/Mind. Grid dims are gN=⌈B/eff_tN⌉, gM=⌈M/tile_M⌉; in HEX mode each tile is owned by one HEX (4 Minds split K and atomic_add into one HEX-SRAM region), so waves = ⌈tiles/48⌉. In Mind mode each Mind owns a full tile, so waves = ⌈tiles/192⌉.
Block time = est·waves + ov(mode) where ov = 2.0 µs for HEX (Split-K atomic tax included) or 1.0 µs for Mind (no atomic). Total = D·block. The candidate with the lowest total wins — and because Mind mode saves ~1 µs per block, it rarely beats HEX (which has 4× fewer waves).
est · waves + ov(mode). The ~1 µs delta is exactly the cost of intra-HEX cooperative reduction; Mind mode avoids it by keeping one Mind per tile. Small workloads are dominated by this term — which is why Mind mode often wins despite having 4× more waves.
P = 2 · D · M²
for a fixed parameter count P (two weight matrices per block).
Along any one curve D = P / (2·M²), so doubling M quarters D.
X-axis = hidden size M; Y-axis = depth D.
M = √(P / (2·D)) with K = M / 4,
snaps K to the nearest measured value in the HW benchmarks, and computes the four latency curves
across batch size using the same chooser logic as the Config Chooser tab.
per_tile = max(compute, CLP_load) + CLP_store,
block = waves × per_tile + ov(mode)
(ov = 2 µs HEX / 1 µs Mind).
Ratio = sim / theory.