On Subquadratic Architectures: xLSTM vs Mamba-2 vs Gated DeltaNet
A JKU Linz study runs xLSTM, Mamba-2, and Gated DeltaNet through code, time-series, and synthetic tasks, then traces xLSTM's lead to two primitives: counting-style accumulation and finite-state tracking.
Quick answer
xLSTM posts the strongest aggregate results across code pre-training, distillation, time-series forecasting, and synthetic state-tracking, and the paper argues the gap is not luck. It comes from two capabilities that its gating supports and that the other two backbones support only partially: counting-style accumulation that survives past the training length, and finite-state tracking. On the hardest synthetic checks, an xLSTM variant with recurrent layers stays at 1.000 on parity and the symmetric group S3 across lengths up to 2048, while Mamba-2 never exceeds 0.352 and a pure matrix-state xLSTM fails the state-tracking tasks outright. That last contrast is the point: even within the xLSTM family, no single configuration wins everything.
The three architectures, in one frame
The paper rewrites all three subquadratic backbones in one notation built from an input gate, a forget gate, a matrix state, and an output read. That shared frame is what lets the comparison be fair rather than a feature-list contest.
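To make the four ingredients concrete, here is a minimal sketch of a gated matrix-state recurrence in plain Python. It assumes the common gated linear-attention form S_t = f_t S_{t-1} + i_t v_t k_t^T with read y_t = S_t q_t. The function names `write` and `read`, the scalar gates, and the absence of normalizers or stabilization are simplifications for illustration, not the paper's exact notation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def write(state: Matrix, i_gate: float, f_gate: float, k: Vector, v: Vector) -> Matrix:
    """One recurrent step: S_t = f_t * S_{t-1} + i_t * (v_t k_t^T)."""
    return [
        [f_gate * s + i_gate * v_r * k_c for s, k_c in zip(row, k)]
        for row, v_r in zip(state, v)
    ]


def read(state: Matrix, q: Vector) -> Vector:
    """Output read: y_t = S_t q_t."""
    return [sum(s * q_c for s, q_c in zip(row, q)) for row in state]


# Store one key/value pair, then query the state with the same key.
state = [[0.0, 0.0], [0.0, 0.0]]
state = write(state, i_gate=1.0, f_gate=1.0, k=[1.0, 0.0], v=[3.0, 4.0])
recalled = read(state, q=[1.0, 0.0])  # recovers the stored value [3.0, 4.0]

# A forget gate below 1 shrinks what was stored before the next write.
faded = write(state, i_gate=0.0, f_gate=0.5, k=[0.0, 0.0], v=[0.0, 0.0])
```

In this picture the three backbones differ in how the gates are computed and coupled, and in whether the write is purely additive or also removes old content.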
- xLSTM mixes matrix-state linear-attention layers (mLSTM) with recurrent sLSTM layers. Its input gate is exponential, which gives it softmax-like overwriting over time, and its input and forget gates are separate, so it can correct memory flexibly. Variants are written as mLSTM-to-sLSTM layer ratios: xLSTM[7:1] has seven mLSTM layers per sLSTM layer, and xLSTM[1:0] is the pure matrix-state model with no sLSTM layers.
- Mamba-2 is a state-space-derived linear attention with tied input and forget gates, closer to a GRU. The tying reduces how much it can re-weight earlier content.
- Gated DeltaNet adds a fast-weight mechanism with an orthogonal projection that explicitly overwrites old state in one direction when its gate fires. That overwrite helps some tasks and hurts counting.
All three reach O(T) cost through chunkwise parallelism: parallel matrix ops inside a chunk, sequential state carryover between chunks.
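The chunkwise idea can be checked on a toy recurrence. The sketch below assumes a scalar-state simplification, s_t = f_t s_{t-1} + v_t k_t with read y_t = s_t q_t, where the input gate is folded into v_t. The function names and the `chunk` parameter are hypothetical. Inside each chunk, every output is computed from decay-weighted pairwise terms that a real kernel would batch as matrix products (the Python loops are for readability only), and only one state value crosses each chunk boundary.

```python
from math import prod
from typing import List


def sequential(f: List[float], k: List[float], v: List[float], q: List[float]) -> List[float]:
    """Reference: step through the recurrence one token at a time."""
    state, outputs = 0.0, []
    for f_t, k_t, v_t, q_t in zip(f, k, v, q):
        state = f_t * state + v_t * k_t
        outputs.append(state * q_t)
    return outputs


def chunkwise(f: List[float], k: List[float], v: List[float], q: List[float],
              chunk: int = 4) -> List[float]:
    """Same outputs, computed chunk by chunk with one carried state."""
    state, outputs = 0.0, []
    for start in range(0, len(f), chunk):
        fc, kc = f[start:start + chunk], k[start:start + chunk]
        vc, qc = v[start:start + chunk], q[start:start + chunk]
        n = len(fc)
        for t in range(n):
            # Contribution of everything before this chunk, decayed up to t.
            inter = prod(fc[: t + 1]) * state * qc[t]
            # Contribution of positions j <= t inside the chunk.
            intra = sum(
                prod(fc[j + 1: t + 1]) * qc[t] * kc[j] * vc[j] for j in range(t + 1)
            )
            outputs.append(inter + intra)
        # Sequential carryover: fold the whole chunk into the state once.
        state = prod(fc) * state + sum(
            prod(fc[j + 1:]) * vc[j] * kc[j] for j in range(n)
        )
    return outputs
```

With chunk length C, each chunk does on the order of C^2 pairwise work and there are T/C chunks, so total work grows linearly in T for fixed C.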
Key results
On code-focused pre-training at 400M parameters, xLSTM[7:1] leads HumanEval pass@64 by 1.43 points at 20B code tokens, and the lead narrows to 0.90 points at 100B tokens. On a mixed code-plus-FineWeb-Edu corpus the lead is 1.81 points. On reasoning and commonsense the same variant has the best aggregate, but the margins there fall below 0.1 points, so that part of the win is thin.
On code distillation from Qwen3-4B, xLSTM[1:0] averages 0.768 across HumanEval, HumanEval+, MBPP, and MBPP+, versus 0.755 for Gated DeltaNet. The split is informative: xLSTM[1:0] leads HumanEval (0.831 vs 0.802) and HumanEval+ (0.764 vs 0.739), while Gated DeltaNet edges MBPP+ (0.802 vs 0.788). So the distillation win is real but task-dependent, not a sweep.
On time-series foundation-model pre-training scored on GIFT-Eval, xLSTM[3:1] leads from 1M to 40M parameters; at 10M it reaches MASE 0.733 and CRPS 0.508 against Mamba-2’s 0.767 and 0.525 (lower is better on both metrics). By 80M the models converge and Mamba-2 takes CRPS by 0.005. The advantage is a small-scale effect that washes out as capacity grows.
The synthetic tasks are where the mechanism shows. Models are trained at length 128 and tested at 128, 512, and 2048. On counting (A^nB^n, A^nB^nC^n, majority), xLSTM[1:0] extrapolates to 2048 at 0.763 majority accuracy; Gated DeltaNet drops to 0.268 and Mamba-2 collapses to 0.241. On state-tracking (parity, modular arithmetic over Z5, the symmetric group S3), xLSTM[1:1] is perfect at every length, Gated DeltaNet[-1,1] gets partial credit (0.472 parity, 0.667 S3 at 2048), and Mamba-2 never solves them.
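The targets for these tasks are simple to state in code, which is what makes them clean probes. The sketch below gives reference labelers for each family. The token encodings (bits, integers, permutation tuples, strings of "a" and "b") are assumptions for illustration, and the paper's exact input formats and per-position labeling are not reproduced.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

Perm = Tuple[int, int, int]


def parity(bits: Iterable[int]) -> int:
    """1 if the number of ones is odd, else 0."""
    return sum(bits) % 2


def mod_sum(xs: Iterable[int], modulus: int = 5) -> int:
    """Running sum in Z_5 (or any modulus)."""
    return sum(xs) % modulus


def compose(p: Perm, q: Perm) -> Perm:
    """Apply q first, then p."""
    return (p[q[0]], p[q[1]], p[q[2]])


def s3_product(perms: Iterable[Perm]) -> Perm:
    """Compose a sequence of S3 elements, left to right in time."""
    acc: Perm = (0, 1, 2)
    for p in perms:
        acc = compose(p, acc)
    return acc


def majority(tokens: Sequence[str]) -> str:
    """Most frequent token (ties go to the one seen first)."""
    return Counter(tokens).most_common(1)[0][0]


def is_anbn(s: str) -> bool:
    """Membership in the language a^n b^n."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n
```

Majority and a^n b^n only need unbounded tallies. Parity, Z5, and S3 need a small state that is updated exactly at every step, and S3 is also order-sensitive because its elements do not commute.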
Why gating decides accumulation and state tracking
The accumulation result tracks the gating math. Counting needs a state that adds the same way regardless of position; xLSTM’s separate, exponential gates let it keep a stable running tally past the training length. Mamba-2’s tied gates (a shared 1 minus sigmoid term) cannot re-weight earlier content the same way, so its count drifts and then collapses at length 2048. Gated DeltaNet’s orthogonal overwrite is the wrong move for counting because it removes accumulated value in a state direction.
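A toy scalar recurrence shows the intuition. The sketch below assumes the simplest possible forms: separate gates as c_t = f c_{t-1} + i x_t with f = i = 1, GRU-style tying as i = 1 - f, and a delta-style write that moves the state toward the new value along a fixed direction. These are illustrative stand-ins, not the models' actual parameterizations, and the function names are hypothetical.

```python
from typing import Iterable


def run(xs: Iterable[float], forget: float, inp: float) -> float:
    """Generic scalar gated recurrence: c_t = forget * c_{t-1} + inp * x_t."""
    c = 0.0
    for x in xs:
        c = forget * c + inp * x
    return c


def separate_gates(xs: Iterable[float]) -> float:
    """Independent gates can both sit at 1, giving an exact running sum."""
    return run(xs, forget=1.0, inp=1.0)


def tied_gates(xs: Iterable[float], forget: float = 0.5) -> float:
    """Tied gates (inp = 1 - forget) give a weighted average, not a sum."""
    return run(xs, forget=forget, inp=1.0 - forget)


def delta_overwrite(xs: Iterable[float], beta: float = 1.0) -> float:
    """Delta-style write along one fixed direction: s <- s + beta * (x - s)."""
    s = 0.0
    for x in xs:
        s = s + beta * (x - s)
    return s


ones = [1.0] * 2048
total = separate_gates(ones)       # grows with length: 2048.0
average = tied_gates(ones)         # stays bounded by the largest input
overwritten = delta_overwrite(ones)  # holds the latest value, not the count
```

Under these assumptions only the separate-gate form carries a tally that keeps growing past the lengths seen in training. The tied form saturates and the overwrite form discards what was accumulated.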
State tracking is a separate primitive, and the surprising detail is that the matrix-state xLSTM[1:0] fails it even though it counts well. Solving permutation composition like S3 needs the recurrent sLSTM layer; the pure linear-attention variant cannot do it. In the paper’s framing, Mamba-2 and Gated DeltaNet inherit the TC0 expressivity ceiling that limits Transformers on hard state-tracking, while negative eigenvalues from the [-1,1] parameterization (following Grazzi et al., 2025) only partly relieve it for Gated DeltaNet.
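The role of negative eigenvalues is easiest to see on parity. The sketch below is a hand-built illustration of the intuition, not a learned solution from the paper: if the state transition may take the value -1, a one-dimensional state can flip sign on every 1 and therefore track parity at any length. A transition confined to [0, 1] can only shrink the state and never flips it.

```python
from itertools import product
from typing import Iterable


def parity_by_sign_flip(bits: Iterable[int]) -> int:
    """Track parity with a scalar state and transitions in {-1, +1}."""
    h = 1
    for b in bits:
        h *= 1 - 2 * b  # multiply by -1 on a one, by +1 on a zero
    return (1 - h) // 2  # h = +1 means even (0), h = -1 means odd (1)


def parity_reference(bits: Iterable[int]) -> int:
    """Ground truth for comparison."""
    return sum(bits) % 2


# Exhaustive check on all bit strings up to length 8.
agrees = all(
    parity_by_sign_flip(bits) == parity_reference(bits)
    for n in range(9)
    for bits in product((0, 1), repeat=n)
)
```

Parity is a two-state, commutative problem, so a single sign suffices. S3 has six states and its elements do not commute, which is consistent with the [-1,1] parameterization giving Gated DeltaNet only partial credit there.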
Misread guardrail
The headline “xLSTM is strongest” hides that it is a family, not a single model, and the winning member changes by task. xLSTM[1:0] counts but cannot track state; xLSTM[1:1] tracks state; the m:s ratio is a design knob, not a free lunch. The code and time-series margins are also small or fading at scale (sub-0.1 points on reasoning, a 0.005 reversal at 80M time-series), so a reader should not generalize the synthetic-task gap into a blanket production claim. The strong evidence is mechanistic, on toy tasks where the primitive is isolated.
Limits and open questions
The clearest gap is scale. Code runs cap at 400M parameters and time-series at 80M, where the gaps already shrink or flip. Whether xLSTM’s accumulation and state-tracking edge holds at multi-billion-parameter scale, with the wall-clock and memory cost of recurrent sLSTM layers, is untested here.
The conclusion is also tied to the hybrid m:s ratio search. The paper picks ratios like [7:1] and [3:1] per task, so part of the win is configuration choice, and there is no single ratio that wins everywhere. A practitioner inheriting one fixed ratio may not see the reported margins.
Finally, the state-tracking claim rests on synthetic tasks chosen to isolate the primitive. Those tasks are clean evidence that the capability exists, but the paper does not show how often real code or time-series workloads actually need S3-style composition rather than the counting that several backbones already handle.
FAQ
Why does xLSTM beat Mamba-2 on counting and state-tracking tasks in On Subquadratic Architectures?
Counting needs position-independent accumulation, which xLSTM’s separate exponential gates support and Mamba-2’s tied gates do not, so Mamba-2 collapses to 0.241 majority accuracy at length 2048 while xLSTM[1:0] holds 0.763. State tracking needs the recurrent sLSTM layer; xLSTM[1:1] is perfect on parity and S3 at all tested lengths, and Mamba-2 never solves them.
What does the TC0 ceiling mean for Mamba-2 and Gated DeltaNet?
The paper places Mamba-2 and Gated DeltaNet inside the TC0 expressivity class that also limits Transformers, which means they cannot natively solve hard state-tracking like permutation composition. Gated DeltaNet’s [-1,1] negative-eigenvalue parameterization (per Grazzi et al., 2025) only partly relaxes this, reaching 0.472 parity at length 2048 rather than a full solution.
How large is xLSTM’s advantage on HumanEval code pre-training?
At 400M parameters, xLSTM[7:1] leads HumanEval pass@64 by 1.43 points at 20B code tokens and by 0.90 points at 100B tokens, and by 1.81 points on a mixed code-plus-FineWeb-Edu corpus. The lead narrows as token budget grows, and on reasoning and commonsense the margin falls under 0.1 points.
Should builders switch their subquadratic backbone based on this paper?
Not yet at production scale. The evidence is strong and mechanistic on synthetic tasks and small models (400M for code, 80M for time series), but the time-series gap already reverses by 80M, and xLSTM’s win depends on choosing the right m:s ratio per task. Treat it as a guide to which primitive your workload needs, then test at your own scale.
One line: On Subquadratic Architectures shows xLSTM’s lead comes from supporting both accumulation and finite-state tracking, but the win is family-specific and tested at small scale. Read the original paper on arXiv.