MinT: Infrastructure for Training and Serving Millions of LoRA LLMs

MinT keeps one frontier base model resident and swaps only LoRA adapters, cutting the model-handoff step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while addressing million-scale adapter catalogs.

Quick answer

MinT is Mind Lab’s serving-and-training infrastructure that keeps a single frontier base model resident in GPU memory and treats each fine-tuned LLM as a swappable LoRA adapter rather than a full set of weights. Because an adapter can be under 1% of base-model size in rank-1 settings, MinT replaces the expensive “load a new model” step with an “attach a new adapter” step, measured at 18.3x faster on a 4B dense model and 2.85x faster on a 30B MoE. It is built to address catalogs of roughly 10^6 adapters and is validated on architectures past 1 trillion parameters.

The problem: millions of models, one set of GPUs

If you want to serve a personalized or task-specialized LLM to many tenants, the naive answer — one full model per use case — does not scale. A frontier dense or MoE model is hundreds of gigabytes to over a terabyte of weights; swapping that in and out of GPU memory per request dominates latency, and storing a million variants is a non-starter. The bet behind MinT is that most of those “different models” are the same base model plus a small LoRA delta, so the base should stay put and only the deltas should move. That reframes the whole serving problem from “manage millions of models” to “manage millions of small adapters over one shared backbone.”
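To make the "small LoRA delta" concrete, here is a minimal arithmetic sketch using the standard LoRA parameterization, in which a d_out x d_in weight matrix gets a low-rank update B·A with B of shape d_out x rank and A of shape rank x d_in. The layer shape below is a hypothetical chosen for illustration, not a figure from the paper.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in one LoRA pair: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def lora_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter parameters as a fraction of the full d_out x d_in matrix."""
    return lora_param_count(d_out, d_in, rank) / (d_out * d_in)


# Hypothetical layer shape, used only to show how the fraction scales with rank.
D_OUT, D_IN = 4096, 4096

for rank in (1, 8, 64):
    count = lora_param_count(D_OUT, D_IN, rank)
    frac = lora_fraction(D_OUT, D_IN, rank)
    print(f"rank={rank:>2}: {count:>7} adapter params, {frac:.4%} of the dense matrix")
```

This is a per-matrix fraction. The whole-model fraction depends on which matrices carry adapters and on the rank, which is why the "under 1%" figure is tied to rank-1 settings.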

How MinT actually works

MinT scales the shared-base path along three axes. First, adapter-only handoff: instead of tearing down one model and loading another, a live engine keeps the base resident and switches the active LoRA, so the per-switch cost collapses to moving the adapter tensors. Second, packed MoE LoRA loading: MoE models multiply the bookkeeping because every expert can carry its own adapter slice, and MinT packs those tensors so a live engine loads them 8.5-8.7x faster than the unpacked path. Third, concurrent multi-policy optimization: MinT can train many LoRA policies against the shared base at once — running GRPO-style updates across a wave of adapters — rather than serializing one job per model.
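The adapter-only handoff idea can be sketched with a toy engine. This is not MinT's API: the class name SharedBaseEngine, its methods, and the tenant names are all hypothetical, and a single small matrix stands in for the base model. The sketch only shows the shape of the technique: the base weight is loaded once, and a switch changes which low-rank delta is applied on top of it.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class SharedBaseEngine:
    """Toy engine (hypothetical): one resident base weight, many LoRA adapters."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base_weight = base_weight  # loaded once, never swapped
        self.adapters: Dict[str, Tuple[Matrix, Matrix, float]] = {}
        self.active: Optional[str] = None

    def register(self, name: str, a: Matrix, b: Matrix, scale: float = 1.0) -> None:
        # a has shape rank x d_in, b has shape d_out x rank.
        self.adapters[name] = (a, b, scale)

    def switch(self, name: Optional[str]) -> None:
        # The handoff: only the active-adapter reference changes.
        if name is not None and name not in self.adapters:
            raise KeyError(name)
        self.active = name

    def forward(self, x: List[float]) -> List[float]:
        y = matvec(self.base_weight, x)
        if self.active is None:
            return y
        a, b, scale = self.adapters[self.active]
        delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
        return [yi + scale * di for yi, di in zip(y, delta)]


# Usage: two rank-1 adapters over the same 2x2 base.
engine = SharedBaseEngine([[1.0, 0.0], [0.0, 1.0]])
engine.register("tenant_a", a=[[1.0, 1.0]], b=[[1.0], [0.0]])
engine.register("tenant_b", a=[[1.0, -1.0]], b=[[0.0], [2.0]])

x = [3.0, 4.0]
for name in (None, "tenant_a", "tenant_b"):
    engine.switch(name)
    print(name, engine.forward(x))
```

In a real engine the switch would also involve moving adapter tensors into device memory. The point of the sketch is that nothing about the base changes between tenants.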

The design also spans real attention variants (the paper covers MLA/DSA-style paths and tensor-parallel deployment), which matters because a system that only works on one attention layout would not survive contact with frontier MoE stacks.
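The summary above does not describe how MinT lays out its packed MoE LoRA tensors, so the following is only a sketch of the general idea under an assumed layout: per-expert adapter slices are concatenated into one contiguous buffer with an offset index, so a load becomes one bulk copy plus cheap views instead of one small transfer per expert. The function names and the flat float layout are assumptions made for illustration.

```python
from array import array
from typing import Dict, List, Tuple

Index = Dict[int, Tuple[int, int]]


def pack_expert_slices(slices: Dict[int, List[float]]) -> Tuple[array, Index]:
    """Concatenate per-expert adapter values into one buffer plus an offset index."""
    packed = array("f")
    index: Index = {}
    for expert_id in sorted(slices):
        start = len(packed)
        packed.extend(slices[expert_id])
        index[expert_id] = (start, len(packed))
    return packed, index


def load_packed(packed: array, index: Index) -> Dict[int, memoryview]:
    """One bulk copy, then a zero-copy view per expert."""
    resident = array("f", packed)  # stands in for a single host-to-device transfer
    view = memoryview(resident)
    return {expert_id: view[start:end] for expert_id, (start, end) in index.items()}


# Usage with three hypothetical experts and tiny adapter slices.
expert_slices = {0: [1.0, 2.0], 2: [0.5], 1: [3.0, 4.0, 5.0]}
packed, index = pack_expert_slices(expert_slices)
loaded = load_packed(packed, index)

for expert_id in sorted(loaded):
    print(expert_id, index[expert_id], loaded[expert_id].tolist())
```

The unpacked alternative would issue one copy per expert slice. With many experts and many adapted matrices, the number of small transfers is what a packed layout of this kind is meant to cut.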

Key results

  • Adapter-only handoff is 18.3x faster on a 4B dense model and 2.85x faster on a 30B MoE versus the full model-reload baseline — the headline number, and the reason the rest of the system is worth building.
  • Packed MoE LoRA tensors load 8.5-8.7x faster into a live engine than the unpacked layout.
  • Concurrent multi-policy GRPO cuts wall-clock time by 1.77x on a dense model and 1.45x on an MoE compared with running policies one at a time.
  • MinT addresses ~10^6-scale adapter catalogs, backed by 100,000 single-engine measurement sweeps and runs with 1,000 concurrent adapter waves at cluster scale.
  • It is validated on frontier-scale architectures exceeding 1 trillion parameters, with adapters kept under 1% of base-model size in rank-1 configurations.
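For the concurrent multi-policy GRPO result, here is a minimal sketch of the group-relative advantage step that gives GRPO its name, applied to a wave of adapters in one pass. It is not MinT's training loop: the wave structure, the adapter names, and the use of the population standard deviation are assumptions, and rollouts and gradient updates are omitted.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def advantages_for_wave(wave: Dict[str, List[List[float]]]) -> Dict[str, List[List[float]]]:
    """wave maps a (hypothetical) adapter name to its reward groups, one group per prompt."""
    return {
        name: [group_relative_advantages(group) for group in groups]
        for name, groups in wave.items()
    }


# Usage: two adapters trained against the same base, each with its own reward groups.
wave = {
    "adapter_a": [[1.0, 0.0, 1.0, 0.0]],
    "adapter_b": [[2.0, 2.0], [0.0, 1.0]],
}
result = advantages_for_wave(wave)
for name, groups in result.items():
    print(name, [[round(a, 3) for a in group] for group in groups])
```

Each adapter's advantages depend only on its own samples, so the policies stay independent even when their work is batched over one shared base.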

Why this matters now

The dense-vs-MoE speedup gap is the most honest signal in the paper. 18.3x on a 4B dense model versus 2.85x on a 30B MoE tells you where the win comes from: the larger the base-reload cost that adapter-only handoff removes, relative to the adapter movement that remains, the bigger the saving, and on MoE the routing and per-expert adapter slices make the adapter side heavier and eat into that. The same pattern shows in training: 1.77x dense versus 1.45x MoE. So MinT is most compelling for the personalization economy: many tenants, many narrow fine-tunes, one expensive base you cannot afford to reload per request. That is the regime where multi-tenant LoRA serving has been heading, and MinT is an infrastructure answer to it at a scale (million adapters, trillion-parameter base) most prior systems did not target.
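One way to read the dense-versus-MoE gap is through a deliberately simple cost model. It assumes a full reload costs the base-side work plus the adapter-side work, while an adapter-only handoff costs just the adapter-side work. This model is an assumption for illustration; the paper's baseline may include other costs, and the ratios printed below are derived from the reported speedups under that assumption, not measured.

```python
def handoff_speedup(base_cost: float, adapter_cost: float) -> float:
    """Speedup of adapter-only handoff over a full reload in the toy model."""
    if adapter_cost <= 0:
        raise ValueError("adapter_cost must be positive")
    return (base_cost + adapter_cost) / adapter_cost


def implied_cost_ratio(speedup: float) -> float:
    """Base-to-adapter cost ratio that the toy model needs to produce a speedup."""
    return speedup - 1.0


# Reported handoff speedups from the article; the ratios are toy-model readings.
for label, speedup in (("4B dense", 18.3), ("30B MoE", 2.85)):
    ratio = implied_cost_ratio(speedup)
    print(f"{label}: speedup {speedup}x implies base/adapter cost ratio of about {ratio:.2f}")
```

Under this reading, the speedup grows with the share of per-switch cost that belongs to the base, which is why the saving shrinks when the adapter side of the handoff gets heavier.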

Limits and open questions

The biggest caveat is what “millions of LLMs” actually means: these are LoRA adapters over a shared base, not a million independently trained models. If two use cases genuinely need different base weights, MinT’s core trick does not apply; the speedup is conditional on the shared-base assumption holding. The quality of those million adapters is also out of scope here: this is an infrastructure paper, so it measures throughput and handoff latency, not whether a rank-1 adapter is good enough for a given task. The MoE numbers (2.85x handoff, 1.45x training) are real but materially smaller than the dense headline, so buyers running large MoE backbones should size their expectations to the MoE figures, not the 18.3x. Finally, the headline scale figures (10^6 catalog, 1,000 concurrent waves, 1T+ parameters) describe what the system is built to address and is validated against; treat them as engineering capacity claims rather than a guarantee that every workload hits the best-case speedup.

FAQ

What is MinT and what problem does it solve?

MinT is Mind Lab’s infrastructure for training and serving millions of LLMs by keeping one base model resident and swapping LoRA adapters. It solves the cost of serving many specialized models, since reloading a full frontier model per request is far slower than attaching a small adapter.

How much faster is MinT’s adapter handoff?

MinT’s adapter-only handoff is measured at 18.3x faster on a 4B dense model and 2.85x faster on a 30B MoE than reloading a full model. The dense gain is larger because the base-switch cost it removes is proportionally bigger there.

Does MinT really train a million separate models?

No. MinT addresses ~10^6-scale catalogs of LoRA adapters over a shared base model, not a million independently trained sets of weights. The “millions of LLMs” framing refers to adapter variants, each typically under 1% of base-model size.

Does MinT work on MoE and trillion-parameter models?

Yes. MinT is validated on frontier dense and MoE architectures exceeding 1 trillion parameters and packs MoE LoRA tensors to load 8.5-8.7x faster, though its handoff and training speedups are smaller on MoE (2.85x and 1.45x) than on dense models.

One line: keep the base resident, move only the adapter — and serving a million fine-tuned LLMs becomes an adapter-management problem. Read the original paper on arXiv.