```mermaid
flowchart TD
    subgraph base["Base Model Training"]
        S1["Stage 1: Pretraining (5.9T tokens)<br/>Dolma 3 Mix (Web, PDFs, Code, etc.)"]
        S2["Stage 2: Midtraining (100B tokens)<br/>Dolma 3 Dolmino Mix (Math, Code, QA, etc.)"]
        S3["Stage 3: Long-context Extension (50-100B tokens)<br/>Dolma 3 Longmino Mix (Long PDFs + Midtrain data)"]
        S1 --> S2 --> S3
    end
    S3 --> BASE["Olmo 3 Base"]
    subgraph post["Post-training"]
        P1["Path 1: Olmo 3 Think<br/>SFT → DPO (Delta Learning) → RLVR (OlmoRL)"]
        P2["Path 2: Olmo 3 Instruct<br/>SFT → DPO → RLVR"]
        P3["Path 3: Olmo 3 RL-Zero<br/>Base → RLVR (from scratch)"]
    end
    BASE --> P1
    BASE --> P2
    BASE --> P3
```
# Olmo 3 Technical Report Summary
## Overview
Olmo 3 is a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales developed by the Allen Institute for AI (AI2). This release includes the entire Model Flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it.
Key features:
- Fully open: All training data, code, and intermediate checkpoints are publicly released
- Diverse capabilities: Long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall
- Flagship model: Olmo 3.1 Think 32B is the strongest fully-open reasoning model ever released
Model variants:
- Olmo 3 Base: Foundation model (7B, 32B)
- Olmo 3 Think: Reasoning model that performs step-by-step reasoning
- Olmo 3 Instruct: Model that generates concise and direct responses
- Olmo 3 RL-Zero: Model trained with RL directly from the Base model
Paper: arXiv:2512.13961
## Model Flow
The development of Olmo 3 is divided into two major stages: Base Model Training and Post-training.
### Base Model Training
#### Stage 1: Pretraining
Olmo 3 Base is pretrained on Dolma 3 Mix, a diverse dataset of approximately 5.9 trillion tokens.
Details: Dolma 3 Dataset
Key innovations:
- Fast and scalable global deduplication: A new tool for deduplication at the trillion-token scale
  - Details: Deduplication
- olmOCR science PDFs: A new data source converting academic PDFs to linearized plain text
  - Details: olmOCR Science PDFs
- New data mixing methods: Token-constrained mixing and Quality-aware upsampling
  - Details: Data Mixing Methods
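The report's deduplication tool (duplodocus) is only named here, not specified. As a rough illustration of one building block of any trillion-token pipeline — exact-match deduplication via content hashing — a minimal single-process sketch (a real system shards these hashes across machines):

```python
import hashlib

def dedup_exact(documents):
    """Drop exact duplicates by hashing each document's text.

    Illustrative only: a trillion-token pipeline distributes this hash
    set; here everything fits in one process.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["the cat sat", "a dog ran", "the cat sat"]
print(dedup_exact(docs))  # the repeated "the cat sat" is dropped
```

Fuzzy (near-duplicate) detection, which such tools also typically perform, would replace the exact hash with document signatures such as MinHash.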
Data sources:
- Web pages
- Academic PDFs (olmOCR science PDFs)
- Code repositories
- Mathematical data
- Other diverse sources
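Neither mixing method is defined in detail in this summary, so the following is a hypothetical sketch of how the two ideas could combine: allocate the token budget in proportion to a per-source quality score (quality-aware upsampling), capped by each source's available tokens times a repetition limit (token-constrained mixing). The allocation rule and inputs are invented for illustration:

```python
def quality_aware_mix(sources, budget):
    """Allocate a token budget across sources in proportion to a quality
    score, capped at available_tokens * max_epochs per source.

    `sources` maps name -> (available_tokens, quality_score, max_epochs).
    Illustrative rule only, not the report's exact method.
    """
    total_q = sum(q for _, q, _ in sources.values())
    alloc = {}
    for name, (avail, q, max_epochs) in sources.items():
        want = budget * q / total_q               # quality-proportional share
        alloc[name] = min(want, avail * max_epochs)  # token constraint
    return alloc

mix = quality_aware_mix(
    {"web": (5000, 1.0, 1), "math": (500, 3.0, 4), "code": (1500, 2.0, 2)},
    budget=6000,
)
# The high-quality "math" source is upsampled (repeated) but still capped
# by its 4-epoch limit; "web" gets only its quality-proportional share.
```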
#### Stage 2: Midtraining
Midtraining is conducted on 100 billion tokens of Dolma 3 Dolmino Mix. The purpose of this stage is to enhance critical capabilities such as code, math, general knowledge QA, and more.
Details: Midtraining
Innovative methods:
- Two-part framework:
- Lightweight distributed feedback loops on individual data sources
- Centralized integration tests to evaluate candidate mixes
- Priming for post-training: Deliberately including instruction data and thinking traces to lay the groundwork for post-training
Evaluation suite: OlmoBaseEval
Details: OlmoBaseEval
#### Stage 3: Long-context Extension
Olmo 3 supports context lengths of up to 65K tokens. The 7B model is trained on 50B tokens, and the 32B model on 100B tokens, of Dolma 3 Longmino Mix.
Details: Long-context Extension
Key techniques:
- RoPE extension: Extending positional encoding using YaRN
- Document packing: Efficient placement of long documents using best-fit packing
- Intra-document masking: Attention only to tokens within the same document
- Model souping: Averaging multiple checkpoints
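The report does not spell these techniques out here; a minimal sketch of best-fit packing and the corresponding intra-document causal mask, under the simplifying assumption that over-length documents are skipped rather than chunked:

```python
def best_fit_pack(doc_lengths, context_len):
    """Best-fit packing: put each document into the open training sequence
    with the least remaining room that still fits it; open a new sequence
    otherwise. Documents longer than the context are skipped here for
    simplicity (in practice they would be chunked)."""
    bins = []        # remaining capacity per packed sequence
    assignment = []  # sequence index per document (None if skipped)
    for length in doc_lengths:
        if length > context_len:
            assignment.append(None)
            continue
        candidates = [i for i, rem in enumerate(bins) if rem >= length]
        if candidates:
            best = min(candidates, key=lambda i: bins[i])
            bins[best] -= length
            assignment.append(best)
        else:
            bins.append(context_len - length)
            assignment.append(len(bins) - 1)
    return assignment, bins

def intra_document_mask(doc_ids):
    """Boolean attention mask for one packed sequence: token i may attend
    to token j only if j is not later (causal) and both tokens belong to
    the same document."""
    n = len(doc_ids)
    return [[doc_ids[i] == doc_ids[j] and j <= i for j in range(n)]
            for i in range(n)]
```

Best-fit packing wastes fewer context tokens than naive concatenation, and the block-diagonal causal mask keeps unrelated packed documents from attending to each other.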
Data source scale:
- 8K+ tokens: 22.3M documents (640B tokens)
- 32K+ tokens: 4.5M documents (380B tokens)
This is the largest openly available collection for long-context research.
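Model souping, the last technique listed above, is the simplest to illustrate: a uniform average of parameters across checkpoints. A toy sketch using plain lists in place of weight tensors:

```python
def soup(checkpoints):
    """Model souping: uniform average of parameters across checkpoints.

    Each checkpoint is a dict of parameter name -> list of floats; real
    implementations average framework tensors instead.
    """
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }

averaged = soup([{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}])
# averaged["w"] == [2.0, 4.0]
```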
### Base Model Results
Olmo 3 Base is the strongest fully-open model at the 32B parameter scale:
- Fully-open models: Outperforms Stanford Marin 32B and Apertus 70B
- Math and code: Double-digit improvements over other fully-open 32B models
- Long-context performance: Comparable to Qwen 2.5 32B, Mistral Small 3.1 24B, and Gemma 3 27B
### Post-training
Three variants are developed from the Base model.
#### Olmo 3 Think: Reasoning Model
Olmo 3 Think is trained to perform step-by-step reasoning, generating intermediate thinking traces before producing the final answer.
Training pipeline:
1. SFT (Supervised Finetuning): Learning thinking traces with Dolci Think SFT
2. DPO (Direct Preference Optimization): Preference alignment via Delta Learning
   - Details: Delta Learning
3. RLVR (Reinforcement Learning with Verifiable Rewards): Reinforcement learning via OlmoRL
   - Details: OlmoRL / GRPO

Details: Dolci Dataset
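GRPO itself is only named in this summary. Its core idea is to compute advantages by normalizing each rollout's reward against the other rollouts sampled for the same prompt, with no learned value network. A sketch of that standard formulation (OlmoRL's exact recipe may differ in details not covered here):

```python
def grpo_advantages(rewards):
    """GRPO-style advantages for one group of rollouts from the same
    prompt: each reward is normalized by the group mean and standard
    deviation. No value function is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        return [0.0] * n  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]

# With binary verifiable rewards (1.0 = answer verified, 0.0 = not),
# correct rollouts get positive advantage and incorrect ones negative:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```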
Results:
- Olmo 3.1 Think 32B: The strongest fully-open reasoning model
- Outperforms Qwen 2.5 32B, Gemma 2/3 27B, and DeepSeek R1 32B
- Approaches the performance of Qwen 3 32B (with 1/6 of the training tokens)
Key benchmark results (Olmo 3.1 Think 32B):
| Category | Benchmark | Score |
|---|---|---|
| Math | MATH | 96.2 |
| Math | AIME 2024 | 80.6 |
| Reasoning | BigBenchHard | 88.6 |
| Reasoning | ZebraLogic | 80.1 |
| Coding | HumanEvalPlus | 91.5 |
| Coding | LiveCodeBench v3 | 83.3 |
| IF | IFEval | 93.8 |
| Knowledge | MMLU | 86.4 |
#### Olmo 3 Instruct: Instruction-following Model
Olmo 3 Instruct is trained to generate direct, helpful responses without producing internal thinking traces.
Features:
- Concise and direct responses
- Optimized for function calling
- Low latency (no thinking traces)
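The summary does not reproduce Olmo 3's actual tool-calling format. Purely to illustrate the general function-calling pattern — a tool schema given to the model, and a structured tool call parsed from its output — here is a hypothetical example (the tool name, schema shape, and output format are invented, not Olmo's):

```python
import json

# Hypothetical tool definition, in the JSON-schema style common to
# function-calling APIs. Not Olmo 3's released format.
tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Pretend this JSON string is what the model emitted as its tool call.
model_output = '{"name": "get_weather", "arguments": {"city": "Seattle"}}'
call = json.loads(model_output)
assert call["name"] == tool["name"]
args = call["arguments"]  # the runtime would now execute the tool with args
```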
Training pipeline:
- SFT: Dolci Instruct SFT (including function-calling data)
- DPO: Multi-turn preference data and response length optimization
- RLVR: Further improvement of core capabilities
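The "verifiable" in RLVR means rewards come from programmatic checkers (math answer verifiers, code test runners, instruction-constraint checkers) rather than a learned reward model. A deliberately minimal example of such a checker for math answers; real verifiers are far more robust:

```python
import re

def verifiable_math_reward(completion, gold_answer):
    """Binary verifiable reward: 1.0 if the last number in the completion
    matches the reference answer, else 0.0. A toy stand-in for the much
    more robust verifiers used in practice."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == str(gold_answer) else 0.0

verifiable_math_reward("So the answer is 42.", 42)  # 1.0
verifiable_math_reward("I think it's 7.", 42)       # 0.0
```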
Results:
- Outperforms Qwen 2.5, Gemma 3, IBM Granite 3.3, and Llama 3 at comparable scales
- Narrows the performance gap with Qwen 3
#### Olmo 3 RL-Zero: RL from Base
Olmo 3 RL-Zero is a model trained with RL directly from the Base model.
Purpose:
- Enables studying the impact of pretraining data on RL performance
- Provides a fully open RL benchmark
Domains:
- Math
- Code
- Precise IF (Instruction Following)
- General Mix
Significance:
- Existing open-weight models do not release their pretraining data, limiting RL research
- Olmo 3 RL-Zero enables clean benchmarking by eliminating the confound of unknown data leakage between pretraining and RL evaluation
## Training Cost and Timeline
Training Olmo 3 Think 32B required approximately 56 days using 1,024 H100 GPUs.
Breakdown:
- Pretraining: ~47 days (including midtraining and long-context extension)
- Post-training: ~9 days (SFT, DPO, RL)
Estimated cost: ~$2.75M at $2/H100 hour
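The $2.75M figure follows directly from the stated GPU count, duration, and hourly rate:

```python
# Sanity-check of the cost estimate: 1,024 H100s for ~56 days at $2/GPU-hour.
gpus = 1024
days = 56
rate = 2.0  # USD per H100-hour
gpu_hours = gpus * days * 24   # 1,376,256 GPU-hours
cost = gpu_hours * rate
print(f"{gpu_hours:,} GPU-hours -> ${cost / 1e6:.2f}M")  # -> $2.75M
```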
## Open Artifacts
Olmo 3 releases all intermediate checkpoints and final models.
Released artifacts:
- Models:
- Intermediate checkpoints at every stage
- Final models (Base, Think, Instruct, RL-Zero)
- Data:
- Data mixes: The actual tokens used for training
- Source data pools: Complete source data for each stage
- Pretraining: 9T tokens of clean data
- Midtraining: 2T tokens of specialized data
- Long-context: 640B tokens of long-document data
- Sample mixes: For experimentation with fewer compute resources
- Pretraining: 150B tokens
- Midtraining: 10B tokens
- Code:
- Training code: OLMo-core (pretraining), Open Instruct (post-training)
- Data code: datamap-rs, duplodocus (deduplication), dolma3
- Evaluation code: OLMES, decon (evaluation data decontamination)
## Key Contributions
- Fully open Model Flow: All stages, data, and code are released
- Strongest fully-open model: Best performance in both Base and Think
- New datasets: Dolma 3 (pretraining) and Dolci (post-training)
- New methods:
- OlmoBaseEval (efficient Base model evaluation)
- OlmoRL (efficient reinforcement learning framework)
- Delta Learning (high-quality preference data creation)
  - Long-context extension techniques (RoPE extension via YaRN, document packing, intra-document masking)
- Reproducibility: Thinking chains can be traced back to original training data
## Summary
Olmo 3 is a comprehensive release designed to advance fully open AI research and development. It makes transparent not only the final model weights but the entire development process, enabling researchers to intervene and customize at every stage of model development.
Core philosophy: To truly advance open-source AI, it is necessary to make not just the final model but the entire “path” to it transparent and accessible.
Flagship model: Olmo 3.1 Think 32B approaches Qwen 3 32B on the reasoning benchmark suite while achieving this with 1/6 of the training tokens, with all training data and thinking chains fully traceable.