Scaling Latent Reasoning via Looped Language Models

1ByteDance Seed   2UC Santa Cruz   3Princeton University
4Mila - Quebec AI Institute   5University of Montreal   6Carnegie Mellon University
*Full author list in paper

📢 News

2025-10-28

Integration with vLLM and SGLang is coming soon!

Introducing Ouro

Modern LLMs are trained to 'think' primarily via explicit text generation, such as chain-of-thought, which defers reasoning to post-training and under-leverages pre-training data. We present Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens.

Through controlled experiments, we show that this advantage stems not from increased knowledge storage but from superior knowledge manipulation capabilities. We also show that Ouro's latent reasoning is more faithful to the model's underlying computation than the explicit chain-of-thought of standard LLMs. The resulting 1.4B and 2.6B models match the performance of SOTA standard LLMs of up to 12B parameters across a wide range of benchmarks, demonstrating the potential of looped computation as a scaling direction in a data-constrained era.

Ouro Looped Language Model Architecture and Performance

Key Features

🔄 Looped Architecture

Ouro uses a parameter-shared looped architecture where the same transformer blocks are applied recurrently. This allows the model to perform iterative computation in latent space, enabling deeper reasoning without proportionally increasing parameter count. Our models use 4 recurrent steps (R4) to achieve exceptional parameter efficiency.
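
Below is a minimal PyTorch sketch of the weight-sharing idea: one stack of blocks is reused for a fixed number of recurrent steps. Module names, shapes, and the omission of attention masks and caches are simplifications for illustration, not Ouro's released implementation.

```python
import torch
import torch.nn as nn

class LoopedDecoder(nn.Module):
    """Sketch of a parameter-shared looped decoder.

    The same stack of transformer blocks is applied `num_loops` times,
    so effective depth grows without adding parameters (R4 = 4 loops).
    Illustrative only; not Ouro's released code.
    """

    def __init__(self, blocks: nn.ModuleList, num_loops: int = 4):
        super().__init__()
        self.shared_blocks = blocks  # reused on every recurrent step
        self.num_loops = num_loops   # "R4" in the paper's notation

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model)
        for _ in range(self.num_loops):
            for block in self.shared_blocks:
                hidden = block(hidden)  # same weights on every iteration
        return hidden

# Toy usage (real blocks are full attention + MLP layers):
# decoder = LoopedDecoder(nn.ModuleList([nn.Sequential(nn.Linear(64, 64), nn.GELU())]))
```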

🎯 Exceptional Parameter Efficiency

Through 7.7T token pre-training, our models demonstrate remarkable parameter efficiency:

  • Ouro-1.4B: Matches the performance of 4B standard transformer models
  • Ouro-2.6B: Rivals 8B standard transformer models
  • Achieves 2-3× parameter efficiency improvements across diverse benchmarks

🧠 Superior Knowledge Manipulation

Through controlled experiments on synthetic tasks, we demonstrate that the looped architecture's advantage stems not from increased knowledge storage, but from superior knowledge manipulation capabilities on tasks requiring fact composition and multi-hop reasoning.

📊 Entropy-Regularized Adaptive Computation

We develop a novel training objective with entropy regularization that enables dynamic depth allocation. Simple inputs can exit after fewer recurrent steps, while complex problems automatically allocate more iterations, matching computational depth to input difficulty.
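
As an illustration of how such an objective can look, the sketch below computes the expected task loss under a learned exit distribution over recurrent steps and adds an entropy bonus so the distribution does not collapse onto a single depth. The function name, tensor layout, and coefficient are hypothetical; the exact objective used by Ouro may differ.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_depth_loss(step_losses: torch.Tensor,
                                   exit_logits: torch.Tensor,
                                   entropy_coef: float = 0.01) -> torch.Tensor:
    """Sketch of an entropy-regularized depth-allocation objective.

    step_losses: (batch, num_steps) task loss if the model exits after
                 each recurrent step.
    exit_logits: (batch, num_steps) learned scores for exiting at each step.
    """
    exit_probs = F.softmax(exit_logits, dim=-1)          # q(exit step | input)
    expected_loss = (exit_probs * step_losses).sum(-1)   # E_q[task loss]
    entropy = -(exit_probs * exit_probs.clamp_min(1e-9).log()).sum(-1)
    # Subtracting the entropy term rewards spreading probability over depths,
    # so the exit head does not collapse while easy inputs learn to exit early.
    return (expected_loss - entropy_coef * entropy).mean()
```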

Training Pipeline

Our training pipeline is a carefully designed multi-stage process totaling 7.7T tokens of training data (a schematic token-budget summary follows the list):

  1. Warmup: Initial model warmup phase
  2. Stable Training Phase 1: 3T tokens on standard pre-training data
  3. Model Branching: Creating 1.4B and 2.6B variants (via upcycling)
  4. Stable Training Phase 2: Additional 3T tokens for both model sizes
  5. CT Annealing: 1.4T tokens with chain-of-thought annealing
  6. LongCT: 20B tokens of long-context chain-of-thought training
  7. Mid-Training: 300B tokens of targeted mid-training
  8. Reasoning SFT: Supervised fine-tuning for reasoning-focused models (Ouro-Thinking variants)
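
The stage list above can be restated as a simple token-budget schedule. The snippet below only summarizes the numbers already listed (warmup and SFT token counts are not given above) and is not an actual training configuration.

```python
# Token budget per stage, in billions of tokens (illustrative restatement).
STAGE_TOKENS_B = {
    "stable_phase_1": 3000,  # 3T tokens on standard pre-training data
    "stable_phase_2": 3000,  # 3T additional tokens after branching
    "ct_annealing":   1400,  # 1.4T tokens
    "long_ct":          20,  # 20B tokens
    "mid_training":    300,  # 300B tokens
}

# The listed stages sum to roughly 7.7T tokens.
assert abs(sum(STAGE_TOKENS_B.values()) / 1000 - 7.7) < 0.1
```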

The architecture is a standard decoder-only Transformer with Rotary Position Embeddings (RoPE), SwiGLU activations, and sandwich normalization for improved training stability under deep recurrent computation.
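
For readers unfamiliar with sandwich normalization: each sublayer is wrapped in both a pre-norm and an extra post-norm before the residual add, which helps keep activations bounded when the same block is applied many times. The sketch below uses plain LayerNorm and placeholder attention/MLP modules; the actual model's norm type and exact placement may differ.

```python
import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """Sketch of a sandwich-normalized decoder block (illustrative only)."""

    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp              # e.g. RoPE attention, SwiGLU MLP
        self.pre_attn_norm = nn.LayerNorm(d_model)   # pre-norm
        self.post_attn_norm = nn.LayerNorm(d_model)  # extra post-norm ("sandwich")
        self.pre_mlp_norm = nn.LayerNorm(d_model)
        self.post_mlp_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.post_attn_norm(self.attn(self.pre_attn_norm(x)))
        x = x + self.post_mlp_norm(self.mlp(self.pre_mlp_norm(x)))
        return x
```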