```mermaid
flowchart LR
    D[Dolci Data Suite]
    D --- T[Dolci Think]
    T --> T1[Think SFT<br>1.94M samples]
    T1 --> T2[Think DPO<br>187K pairs]
    T2 --> T3[Think RL<br>40K prompts]
    D --- I[Dolci Instruct]
    I --> I1[Instruct SFT<br>544K samples]
    I1 --> I2[Instruct DPO<br>105K pairs]
    I2 --> I3[Instruct RL<br>128K prompts]
    D --- R[Dolci RL-Zero]
    R --> R1[Math<br>30K prompts]
    R --> R2[Code<br>30K prompts]
    R --> R3[IF<br>30K prompts]
    R --> R4[General Mix<br>30K prompts]
```
Dolci: Post-training Data Suite
Dolci (Dolma Instruct) is a comprehensive data suite used for the post-training of Olmo 3. It consists of multiple subsets designed to support three distinct model variations: Think (reasoning), Instruct (instruction-following), and RL-Zero (RL directly from the base model).
Overview and Purpose
Dolci is a high-quality dataset for specializing Olmo 3 base models toward specific tasks. Its purpose is to transform the broad knowledge acquired during pretraining into practical capabilities such as mathematical reasoning, coding, instruction following, and chat.
Key characteristics of Dolci:
- Fully open: All data sources and curation pipelines are publicly available
- High quality: Rigorous filtering and validation of model-generated data
- Diverse domains: Covers Math, Code, Chat, Instruction Following, and Safety
- Stage-wise training: Data optimized for each stage of SFT, DPO, and RL
Three Subsets
Dolci consists of three major subsets, each corresponding to a different model variation’s training pipeline.
- Dolci Think: For reasoning models that perform step-by-step thinking
- Dolci Instruct: For models that generate concise and direct responses
- Dolci RL-Zero: For models trained with RL directly from the base model
Dolci Think: Data for Reasoning Models
Dolci Think is a dataset for training reasoning models (Olmo 3 Think) that perform step-by-step reasoning before generating a final answer.
Dolci Think SFT: Synthetic Thinking Traces
Scale: Approximately 1.94 million samples
Purpose: Teach the model the ability to generate thinking traces
Data source composition:
| Category | Main Sources | Sample Count |
|---|---|---|
| Math | OpenThoughts3+, SYNTHETIC-2-Verified | ~850K |
| Code | OpenThoughts3+, Dolci Think Python Algorithms | ~550K |
| Chat & IF | WildChat, Persona IF, OpenAssistant | ~450K |
| Safety | CoCoNot, WildGuardMix, WildJailbreak | ~90K |
| Other | Aya, TableGPT | ~100K |
Data generation approach:
Dolci Think SFT data is created by generating thinking traces with powerful models for existing prompts.
Models used:
- Math / Code: QwQ-32B (reasoning model)
- Chat / Safety: DeepSeek R1 (reasoning-specialized model)
Filtering criteria:
- Remove incomplete thinking traces (truncated mid-stream)
- Remove domain-specific errors (mathematical mistakes, code syntax errors)
- Remove excessive repetition and verbosity
- Topic filtering using the OpenAI taxonomy
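As an illustration, the first and third criteria above (truncated traces, runaway repetition) can be approximated with simple heuristics. The `<think>` delimiter, the 5-gram window, and the repetition threshold here are our assumptions for the sketch, not the actual Dolci implementation:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def keep_trace(sample: str, max_ngram_repeats: int = 10) -> bool:
    """Illustrative filter: drop truncated thinking traces and
    traces dominated by repetition."""
    match = THINK_RE.search(sample)
    # A trace that opens <think> but never closes it was cut off mid-stream.
    if "<think>" in sample and match is None:
        return False
    trace = match.group(1) if match else sample
    # Flag excessive repetition: any 5-gram recurring too often.
    words = trace.split()
    grams = [" ".join(words[i:i + 5]) for i in range(len(words) - 4)]
    if grams and max(grams.count(g) for g in set(grams)) > max_ngram_repeats:
        return False
    return True
```

Real pipelines also check for domain-specific errors (e.g., running the code in a trace), which requires execution rather than pattern matching.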
Dolci Think DPO: Preference Data via Delta Learning
Scale: Approximately 187K pairs
Purpose: Improve the quality of thinking traces
The principle of Delta Learning:
What matters in DPO (Direct Preference Optimization) is the “quality gap (delta)” between the chosen and rejected responses. In Dolci Think DPO, a strong model and a weak model are paired to create preference data with a clear quality gap.
Data generation setup:
| Role | Model |
|---|---|
| Chosen | Qwen 3 32B (thinking mode) |
| Rejected | Qwen 3 0.6B (thinking mode) |
Key findings:
Data that does not improve performance through SFT can yield significant improvements through DPO.
- SFT on Qwen3-32B outputs leads to performance degradation
- Using the same data paired with a weak model for DPO leads to substantial improvement
This demonstrates that “relative quality differences” matter more for learning than “absolute quality.”
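The underlying objective makes this concrete. The sketch below is the standard per-pair DPO loss, shown for intuition; it is not Dolci's training code, and the scalar log-probability inputs are a simplification:

```python
import math

def dpo_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard per-pair DPO loss: push the policy to widen the
    chosen-vs-rejected log-probability margin relative to a frozen
    reference model. Inputs are sequence log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

Because the loss depends only on the margin, a strong/weak pairing guarantees a consistent gradient direction even when the chosen responses alone would make poor SFT targets.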
Dolci Think RL: Challenging Prompts
Scale: Approximately 40K prompts
Purpose: Performance improvement via RLVR (Reinforcement Learning with Verifiable Rewards)
Characteristics:
Dolci Think RL is a dataset of challenging prompts that reasoning models struggle with.
Domains:
- Math: High-difficulty problems at the AIME (American Invitational Mathematics Examination) level
- Code: Practical programming challenges from LiveCodeBench and similar sources
- Reasoning: Logic puzzles from ZebraLogic and similar sources
Dolci Instruct: Data for Instruction-Following Models
Dolci Instruct is a dataset for training models (Olmo 3 Instruct) that generate concise and direct responses without producing thinking traces.
Dolci Instruct SFT: Function-calling Enabled Data
Scale: Approximately 544K samples
Purpose: Teach the ability to generate efficient and helpful responses
Main data sources:
- Tulu 3 SFT: Diverse instruction-following tasks
- Function-calling data: Tool use and API call samples
- Flan: Task format learning data
Differences from Dolci Think SFT:
| Aspect | Dolci Think SFT | Dolci Instruct SFT |
|---|---|---|
| Thinking traces | Yes | No |
| Response style | Step-by-step reasoning | Concise and direct |
| Function-calling | No | Yes |
| Sample count | 1.94M | 544K |
Dolci Instruct DPO: Response Length Optimization
Scale: Approximately 105K pairs
Purpose: Improve conciseness and usability
Key improvements:
Multi-turn preferences:
Synthetic conversations are generated to teach consistent responses across multiple turns.
Length control:
The length difference between chosen and rejected responses is capped at 100 tokens to suppress verbosity.
- Prevents the model from simply learning to produce “longer responses”
- Promotes concise responses focused on usability
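A minimal sketch of that cap, using whitespace tokens as a stand-in for a real tokenizer (the 100-token figure comes from the text; the function names are ours):

```python
def within_length_cap(chosen: str, rejected: str, cap: int = 100) -> bool:
    """Keep a preference pair only when the chosen/rejected length gap
    is at most `cap` tokens, so DPO cannot reward sheer verbosity."""
    gap = abs(len(chosen.split()) - len(rejected.split()))
    return gap <= cap

def filter_pairs(pairs, cap=100):
    """Drop pairs whose length gap exceeds the cap."""
    return [(c, r) for c, r in pairs if within_length_cap(c, r, cap)]
```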
Dolci Instruct RL: Improving Core Capabilities
Scale: Approximately 128K prompts
Purpose: Further improvement of core capabilities via RLVR
Domain distribution:
- Instruction Following: Accurate compliance with complex instructions
- Chat: Handling diverse conversational scenarios
- Function-calling: Improving tool use accuracy
- Knowledge Recall: Accurate retrieval of factual information
Dolci RL-Zero: RL Directly from Base
Dolci RL-Zero is a specialized dataset for training models with RL directly from the base model, bypassing SFT and DPO.
Purpose and Significance
Research value:
- Enables studying the impact of pretraining data on RL performance
- Provides a fully open RL benchmark
Existing challenges:
Previous open-weight models (Llama 3, Qwen 2.5, etc.) did not release their pretraining data, limiting RL research.
Dolci RL-Zero enables clear benchmarking that eliminates the confounding effects of data leakage.
Four Domains
Dolci RL-Zero consists of four distinct domains.
Math:
- Scale: 30K prompts
- Tasks: Mathematical reasoning problems from GSM8K, MATH, and similar benchmarks
- Verification: Symbolic comparison using SymPy
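The text names SymPy for symbolic comparison; as a dependency-free stand-in, a verifier for exact numeric answers can look like the sketch below (the function name and string-match fallback are our assumptions):

```python
from fractions import Fraction

def math_reward(predicted: str, reference: str) -> float:
    """Verifiable-reward sketch: 1.0 if the answers agree exactly as
    rationals (so "1/2" matches "0.5"), else fall back to string match.
    The actual verifier uses SymPy for full symbolic equivalence."""
    try:
        equal = Fraction(predicted) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        equal = predicted.strip() == reference.strip()
    return 1.0 if equal else 0.0
```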
Code:
- Scale: 30K prompts
- Tasks: Programming challenges from HumanEvalPlus, LiveCodeBench, and similar sources
- Verification: Test case execution and validation
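For the code domain, the reward comes from executing tests against the model's program. A minimal (unsandboxed) sketch with hypothetical names:

```python
def code_reward(candidate, test_cases) -> float:
    """Reward sketch: 1.0 only if the candidate function passes every
    (args, expected) test case. A real pipeline sandboxes execution
    and enforces timeouts; this sketch does neither."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            return 0.0
    return 1.0
```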
IF (Instruction Following):
- Scale: 30K prompts
- Tasks: Precise instruction following from IFEval, IFBench, and similar sources
- Verification: Constraint-checking functions
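IFEval-style constraints are programmatically checkable. A sketch with a few hypothetical constraint types (the registry and its names are illustrative, not the actual checker set):

```python
# Hypothetical constraint registry in the spirit of IFEval-style checkers.
CHECKS = {
    "max_words": lambda resp, n: len(resp.split()) <= n,
    "contains": lambda resp, kw: kw.lower() in resp.lower(),
    "ends_with": lambda resp, suffix: resp.rstrip().endswith(suffix),
}

def if_reward(response: str, constraints) -> float:
    """Reward sketch: 1.0 only if every (name, argument) constraint
    attached to the prompt holds for the response."""
    ok = all(CHECKS[name](response, arg) for name, arg in constraints)
    return 1.0 if ok else 0.0
```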
General Mix:
- Scale: 30K prompts
- Tasks: A mixture of the above three domains plus Chat
- Verification: Domain-specific verification methods
Decontamination
All data in Dolci RL-Zero undergoes rigorous decontamination to eliminate overlap with evaluation benchmarks.
Method: Two-phase processing using the decon package
- Detection phase: 8-gram matching to detect overlap (50% threshold)
- Cluster expansion phase: Removal of entire clusters of similar samples
This ensures complete separation between RL training data and benchmark data.
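The detection phase can be sketched as follows; the set construction and overlap statistic are simplifications of what the decon package actually does, and cluster expansion is not shown:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lowercased whitespace-token n-grams of a text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample: str, benchmark_grams: set,
                    threshold: float = 0.5) -> bool:
    """Flag a sample when at least `threshold` of its 8-grams also
    appear in the pooled benchmark n-grams (detection phase only)."""
    grams = ngrams(sample)
    if not grams:
        return False
    return len(grams & benchmark_grams) / len(grams) >= threshold
```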
Data Curation Pipeline
Dolci’s data curation is carried out through the following pipeline.
```mermaid
flowchart TD
    S1["Step 1: Source Selection<br>Public datasets (OpenThoughts, WildChat, etc.)<br>Model generation (QwQ-32B, DeepSeek R1)"]
    S2["Step 2: Heuristic Filtering<br>Remove incomplete traces<br>Remove domain-specific errors<br>Remove excessive repetition"]
    S3["Step 3: Topic Filtering<br>OpenAI taxonomy classification<br>Remove off-topic samples"]
    S4["Step 4: Difficulty Filtering<br>Select challenging prompts for RL<br>Balance difficulty distribution"]
    S5["Step 5: Data Mixing<br>Balance domain distribution<br>Optimize mix for target tasks"]
    S6["Step 6: Decontamination<br>8-gram matching (50% threshold)<br>Cluster expansion<br>Final Dolci datasets"]
    S1 --> S2 --> S3 --> S4 --> S5 --> S6
```
Key Steps in the Pipeline
Step 1: Source Selection:
Collect data from public datasets and generate data using powerful models.
Step 2: Heuristic Filtering:
Remove obviously low-quality data.
- Incomplete thinking traces
- Domain-specific errors (mathematical mistakes, code syntax errors)
- Excessive repetition
Step 3: Topic Filtering:
Remove off-topic samples using the OpenAI taxonomy.
Step 4: Difficulty Filtering:
Select challenging prompts for RL and balance the difficulty distribution.
Step 5: Data Mixing:
Balance the domain distribution and create a mix optimized for target tasks.
Step 6: Decontamination:
Completely eliminate overlap with evaluation benchmarks.
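Conceptually, the pipeline composes these steps as successive dataset transforms. A toy sketch, where the stage functions are hypothetical stand-ins for the real filtering and decontamination stages:

```python
def run_pipeline(samples, stages):
    """Apply each curation stage in order; every stage maps a list of
    samples to a (usually smaller) list of samples."""
    for stage in stages:
        samples = stage(samples)
    return samples

# Toy stages standing in for heuristic filtering and deduplication.
drop_empty = lambda xs: [x for x in xs if x.strip()]
dedupe = lambda xs: list(dict.fromkeys(xs))  # order-preserving
```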
Key Features
The Dolci data suite has the following features.
Fully Open
All data sources, curation pipelines, and processing code are publicly available.
Published materials:
- References to original data sources
- Curation scripts
- Filtering criteria
- Data mixing ratios
High Quality
High quality is achieved through generation by powerful models and rigorous filtering.
Quality assurance mechanisms:
- Model generation: Uses state-of-the-art models such as QwQ-32B and DeepSeek R1
- Multi-stage filtering: Heuristic, topic, and difficulty filtering
- Decontamination: Complete separation from evaluation data
Broad Domain Coverage
Covers a wide range of domains including Math, Code, Chat, Instruction Following, and Safety.
Domain distribution:
| Domain | Think SFT | Instruct SFT | RL-Zero |
|---|---|---|---|
| Math | ~850K | Included | 30K |
| Code | ~550K | Included | 30K |
| Chat & IF | ~450K (combined) | Majority | IF: 30K; Chat: in General Mix |
| Safety | ~90K | Included | - |
Stage-wise Training Support
Provides data optimized for each training stage: SFT, DPO, and RL.
Training pipeline:
```mermaid
flowchart TD
    A[Base Model] --> B[SFT<br>Dolci Think/Instruct SFT]
    B --> C[DPO<br>Dolci Think/Instruct DPO]
    C --> D[RL<br>Dolci Think/Instruct RL]
    D --> E[Final Model]
```
Each stage uses different types of data.
- SFT: High-quality input-output pairs
- DPO: Preference pairs with quality differences
- RL: Prompts with verifiable rewards
Delta Learning
Delta Learning is an important technique used in creating the Dolci DPO datasets.
Core insight:
What matters in DPO is the “quality gap (delta)” between chosen and rejected responses. Relative quality differences are more important for learning than absolute quality.
Experimental results:
| Setup | MATH Score | Change |
|---|---|---|
| Base model | 45.2 | - |
| SFT with Qwen3-32B | 43.8 | -1.4 |
| DPO with Qwen3-32B (chosen) vs Qwen3-0.6B (rejected) | 52.3 | +7.1 |
Even with the same Qwen3-32B outputs, SFT leads to performance degradation while pairing with a weak model for DPO yields substantial improvement.
Applications:
This finding is leveraged in both Dolci Think DPO and Dolci Instruct DPO.
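Pair construction under Delta Learning reduces to routing each prompt through both models. In the sketch below, `strong_generate` and `weak_generate` are hypothetical stand-ins for Qwen 3 32B and Qwen 3 0.6B inference:

```python
def build_delta_pairs(prompts, strong_generate, weak_generate):
    """Delta-learning DPO pairs: the chosen response always comes from
    the strong model and the rejected one from the weak model, fixing
    a clear quality gap in every pair."""
    return [
        {"prompt": p, "chosen": strong_generate(p), "rejected": weak_generate(p)}
        for p in prompts
    ]
```

No per-pair human or model judging is needed: the model gap itself supplies the preference signal.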
Summary
Dolci is a comprehensive data suite used for the post-training of Olmo 3. It supports three distinct model variations – Think, Instruct, and RL-Zero – and provides data optimized for each training stage: SFT, DPO, and RL.
Key contributions:
- Fully open: All data sources and pipelines are publicly available
- High quality: Generation by powerful models with rigorous filtering
- Diversity: Covers Math, Code, Chat, IF, and Safety
- Stage-wise training: Optimized for SFT, DPO, and RL
- Research value: Dolci RL-Zero provides a fully open RL benchmark
Dolci, as a fully open post-training dataset, enables researchers to conduct highly reproducible research and to intervene in or customize any stage of the training pipeline.