Mamba: Selective State Spaces for Linear-Time Sequence Modeling

Quick answer

Mamba is a sequence model that replaces attention with a selective state space model (SSM): its recurrence parameters are functions of the current input, so it decides per-token what to keep or discard. The payoff is concrete — linear scaling in sequence length, 5x higher inference throughput than Transformers, performance that keeps improving on real data up to million-length sequences, and a Mamba-3B model that outperforms same-size Transformers while matching Transformers twice its size in both pretraining and downstream evals.

Why prior efficient architectures kept losing to attention

Most attempts to escape attention’s quadratic cost — linear attention, gated convolutions, recurrent nets, classic structured SSMs — scale better on long inputs but lose to attention on discrete modalities like language. The authors pin the failure on one thing: these models cannot do content-based reasoning. Their dynamics are fixed regardless of what token arrives, so they cannot selectively focus on a relevant word or ignore a filler one. Attention can, because every token compares against every other. That comparison is exactly what costs quadratic time — and what prior efficient models threw away to get speed.

Selective state spaces

A structured SSM maps a sequence through a hidden state using matrices that, in earlier work, were constant across time. Mamba’s core move is to let those parameters (the step size Δ and the input and output projections B and C) be functions of the input. Now the model can, at each position, propagate or forget information along the sequence based on the current token. This is content-based reasoning inside a recurrence.
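The recurrence described above can be sketched in a few lines of NumPy, written as a plain sequential loop for clarity rather than speed. The names `selective_ssm`, `W_delta`, `b_delta`, `W_B`, and `W_C` are illustrative, not the authors' code. The sketch assumes a diagonal state matrix per channel, a softplus to keep the step size positive, and the simplified discretization B̄ = ΔB.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the step size positive.
    return np.logaddexp(0.0, x)

def selective_ssm(x, A, W_delta, b_delta, W_B, W_C):
    """Sequential selective SSM (illustrative sketch).

    x: (L, D) inputs; A: (D, N) negative diagonal state entries per channel.
    W_delta: (D, D), b_delta: (D,), W_B: (D, N), W_C: (D, N) are hypothetical
    projections that make the step size, B, and C depend on the current token.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                      # one N-dim state per channel
    y = np.empty((L, D))
    for t in range(L):
        delta = softplus(x[t] @ W_delta + b_delta)   # (D,) input-dependent step
        B = x[t] @ W_B                               # (N,) input-dependent
        C = x[t] @ W_C                               # (N,) input-dependent
        A_bar = np.exp(delta[:, None] * A)           # (D, N) decay in (0, 1)
        B_bar = delta[:, None] * B[None, :]          # (D, N) simplified B-bar
        h = A_bar * h + B_bar * x[t][:, None]        # keep vs. write, per token
        y[t] = h @ C                                 # (D,) read out the state
    return y

rng = np.random.default_rng(0)
L, D, N = 6, 4, 3
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))      # negative so the state decays
W_delta = 0.1 * rng.standard_normal((D, D))
b_delta = np.zeros(D)
W_B = rng.standard_normal((D, N))
W_C = rng.standard_normal((D, N))
y = selective_ssm(x, A, W_delta, b_delta, W_B, W_C)
print(y.shape)  # (6, 4)
```

In this sketch a step size near zero carries the state forward almost unchanged and ignores the current token, while a large step size decays the old state and lets the current token dominate. That is the keep-or-discard behavior the section describes.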

This breaks the trick that made prior SSMs fast. Time-invariant SSMs can be computed as a global convolution because the same kernel applies at every position; once parameters vary per token, no single kernel exists and the convolution no longer applies. The authors’ answer is a hardware-aware parallel scan: compute the recurrence in scan form, but keep the expanded state in fast on-chip SRAM and avoid materializing it in slower GPU high-bandwidth memory (HBM). That is what keeps a per-token-varying recurrence both expressive and fast.
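A scan applies here because a recurrence of the form h_t = a_t * h_{t-1} + b_t can be expressed with an associative combine on (a, b) pairs, so it no longer has to run strictly left to right. The NumPy sketch below checks that property with a simple log-step (Hillis-Steele style) scan. It is only an illustration of the idea: it does not model the SRAM/HBM placement that the paper's fused GPU kernel relies on, and the function names are hypothetical.

```python
import numpy as np

def sequential_scan(a, b):
    # Reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0.
    h = np.zeros_like(b[0])
    out = np.empty_like(b)
    for t in range(len(a)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def parallel_scan(a, b):
    # Inclusive scan over pairs (a, b) with the associative combine
    #   (a1, b1) then (a2, b2)  ->  (a1 * a2, a2 * b1 + b2).
    # Each pass doubles the span every position has absorbed, so about
    # log2(L) passes suffice; each pass is one vectorized update.
    a, b = a.copy(), b.copy()
    L = len(a)
    shift = 1
    while shift < L:
        a_new, b_new = a.copy(), b.copy()
        b_new[shift:] = a[shift:] * b[:-shift] + b[shift:]
        a_new[shift:] = a[shift:] * a[:-shift]
        a, b = a_new, b_new
        shift *= 2
    return b

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 0.99, size=(13, 4))   # per-token decay factors
b = rng.standard_normal((13, 4))           # per-token inputs
print(np.allclose(sequential_scan(a, b), parallel_scan(a, b)))  # True
```

The log-step variant does more total arithmetic than the sequential loop; it is used here only because it is short. Production kernels typically use a work-efficient scan.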

The architecture is also deliberately stripped down. Mamba folds the selective SSM into a single homogeneous block and drops attention and the standard MLP block entirely — the whole network is a stack of these blocks. Simplicity here is a feature: fewer component types, one repeated primitive.

Key results

5x higher inference throughput than Transformers, because generation is a constant-memory recurrence rather than a growing attention cache.
Linear scaling in sequence length, with quality that improves on real data out to million-length sequences — the regime where attention becomes impractical.
Mamba-3B outperforms same-size Transformers and matches Transformers twice its size, in both pretraining perplexity and downstream evaluation.
Strong results as a general backbone across language, audio, and genomics, not a single-domain trick.

The honest read: the headline comparisons are at the ~3B scale, not at frontier scale, and “matches 2x-size Transformers” is the kind of claim that gets harder to defend as models grow.

Limits and open questions

Mamba is not a drop-in attention replacement. Its evidence tops out around the 3B range, and the field’s open question is whether the selective-SSM advantage survives at the tens-of-billions scale where most production models live. There is also a structural tradeoff: a recurrent state is a fixed-size summary, so tasks needing exact recall of an arbitrary earlier token (copying, precise retrieval) can favor explicit attention, which keeps every token addressable. This is why much subsequent work went hybrid — interleaving Mamba and attention layers — rather than going pure SSM. Tooling maturity, training recipes, and stability at scale were all unsettled at publication and partly remain so.

FAQ

What makes Mamba different from a Transformer?

Mamba has no attention. Instead of comparing every token against every other, it carries a recurrent state and uses input-dependent parameters to decide what that state keeps or forgets. This gives linear-time scaling and a constant-size generation footprint, whereas attention’s per-token cost and memory grow with context length.

Why is Mamba faster at inference than Transformers?

Transformer generation re-attends over a key-value cache that grows with the sequence, so per-token cost rises with context. Mamba generates by advancing a fixed-size recurrent state, yielding roughly 5x higher throughput and memory that does not balloon with length.
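A back-of-the-envelope calculation makes the memory contrast concrete. The model shape below is hypothetical and chosen only for illustration, not taken from the paper. The formulas assume standard multi-head attention that stores full-width keys and values per layer, 2-byte elements, and an SSM whose only per-sequence memory is one state matrix per layer (a real implementation also keeps a small convolution buffer).

```python
def kv_cache_bytes(seq_len, n_layers, d_model, bytes_per_elem=2):
    # Keys and values (factor 2): one d_model-wide vector each,
    # per layer, per token already seen.
    return 2 * n_layers * d_model * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    # One (d_inner, d_state) state per layer, independent of sequence length.
    return n_layers * d_inner * d_state * bytes_per_elem

# Hypothetical model shape, for illustration only.
cfg = dict(n_layers=32, d_model=2560, d_inner=5120, d_state=16)

for seq_len in (1_024, 32_768, 1_048_576):
    kv = kv_cache_bytes(seq_len, cfg["n_layers"], cfg["d_model"])
    ssm = ssm_state_bytes(cfg["n_layers"], cfg["d_inner"], cfg["d_state"])
    print(f"{seq_len:>9} tokens: KV cache {kv / 2**20:10.1f} MiB, "
          f"SSM state {ssm / 2**20:6.1f} MiB")
```

The cache term grows linearly with `seq_len` while the state term does not depend on it at all. Attention variants that share keys and values across heads shrink the cache by a constant factor but keep the linear growth.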

Does Mamba beat Transformers on language modeling?

At the sizes tested, yes, within its weight class: Mamba-3B outperforms same-size Transformers and matches ones twice its size in pretraining and downstream tasks. Whether that edge holds at frontier scale is still open, which is why many systems now combine Mamba layers with attention.

What is a selective state space model?

It is a structured SSM whose recurrence parameters depend on the current input rather than being fixed across time. That input dependence lets the model selectively propagate or forget information per token — the content-based reasoning that earlier SSMs lacked — computed efficiently via a hardware-aware parallel scan.

Mamba’s lasting contribution is reframing recurrence as a serious, hardware-aware competitor to attention rather than an obsolete idea — read it at arXiv:2312.00752.