```mermaid
flowchart LR
    D[Dolci Data Suite]
    D --- T[Dolci Think]
    T --> T1[Think SFT<br>1.94M samples]
    T1 --> T2[Think DPO<br>187K pairs]
    T2 --> T3[Think RL<br>40K prompts]
    D --- I[Dolci Instruct]
    I --> I1[Instruct SFT<br>544K samples]
    I1 --> I2[Instruct DPO<br>105K pairs]
    I2 --> I3[Instruct RL<br>128K prompts]
    D --- R[Dolci RL-Zero]
    R --> R1[Math<br>30K prompts]
    R --> R2[Code<br>30K prompts]
    R --> R3[IF<br>30K prompts]
    R --> R4[General Mix<br>30K prompts]
```
Dolci: Post-training Data Suite
Dolci (Dolma Instruct) is a comprehensive data suite used for the post-training of Olmo 3. It consists of multiple subsets designed to support three distinct model variations: Think (reasoning), Instruct (instruction-following), and RL-Zero (RL directly from the base model).
Overview and Purpose
Dolci is a high-quality dataset for specializing Olmo 3 base models toward specific tasks. Its purpose is to transform the broad knowledge acquired during pretraining into practical capabilities such as mathematical reasoning, coding, instruction following, and chat.
Key characteristics of Dolci:
- Fully open: All data sources and curation pipelines are publicly available
- High quality: Rigorous filtering and validation of model-generated data
- Diverse domains: Covers Math, Code, Chat, Instruction Following, and Safety
- Stage-wise training: Data optimized for each stage of SFT, DPO, and RL
Three Subsets
Dolci consists of three major subsets, each corresponding to a different model variation’s training pipeline.
- Dolci Think: For reasoning models that perform step-by-step thinking
- Dolci Instruct: For models that generate concise and direct responses
- Dolci RL-Zero: For models trained with RL directly from the base model
Dolci Think: Data for Reasoning Models
Dolci Think is a dataset for training reasoning models (Olmo 3 Think) that perform step-by-step reasoning before generating a final answer.
Dolci Think SFT: Synthetic Thinking Traces
Scale: Approximately 1.94 million samples
Purpose: Teach the model the ability to generate thinking traces
Data source composition:
| Category | Main Sources | Sample Count |
|---|---|---|
| Math | OpenThoughts3+, SYNTHETIC-2-Verified | ~850K |
| Code | OpenThoughts3+, Dolci Think Python Algorithms | ~550K |
| Chat & IF | WildChat, Persona IF, OpenAssistant | ~450K |
| Safety | CoCoNot, WildGuardMix, WildJailbreak | ~90K |
| Other | Aya, TableGPT | ~100K |
Data generation approach:
Dolci Think SFT data is created by generating thinking traces with powerful models for existing prompts.
Models used:
- Math / Code: QwQ-32B (reasoning model)
- Chat / Safety: DeepSeek R1 (reasoning-specialized model)
Filtering criteria:
- Remove incomplete thinking traces (truncated mid-stream)
- Remove domain-specific errors (mathematical mistakes, code syntax errors)
- Remove excessive repetition and verbosity
- Topic filtering using the OpenAI taxonomy
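As an illustration, the first and third criteria above (truncated traces, runaway repetition) can be approximated with simple heuristics. The `<think>` delimiter, the 5-gram window, and the repetition threshold here are our assumptions for the sketch, not the actual Dolci implementation:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def keep_trace(sample: str, max_ngram_repeats: int = 10) -> bool:
    """Illustrative filter: drop truncated thinking traces and
    traces dominated by repetition."""
    match = THINK_RE.search(sample)
    # A trace that opens <think> but never closes it was cut off mid-stream.
    if "<think>" in sample and match is None:
        return False
    trace = match.group(1) if match else sample
    # Flag excessive repetition: any 5-gram recurring too often.
    words = trace.split()
    grams = [" ".join(words[i:i + 5]) for i in range(len(words) - 4)]
    if grams and max(grams.count(g) for g in set(grams)) > max_ngram_repeats:
        return False
    return True
```

Real pipelines also check for domain-specific errors (e.g., running the code in a trace), which requires execution rather than pattern matching.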
Dolci Think DPO: Preference Data via Delta Learning
Scale: Approximately 187K pairs
Purpose: Improve the quality of thinking traces
The principle of Delta Learning:
What matters in DPO (Direct Preference Optimization) is the “quality gap (delta)” between the chosen and rejected responses. In Dolci Think DPO, a strong model and a weak model are paired to create preference data with a clear quality gap.
Data generation setup:
| Role | Model |
|---|---|
| Chosen | Qwen 3 32B (thinking mode) |
| Rejected | Qwen 3 0.6B (thinking mode) |
Key findings:
Data that does not improve performance through SFT can yield significant improvements through DPO.
- SFT on Qwen3-32B outputs leads to performance degradation
- Using the same data paired with a weak model for DPO leads to substantial improvement
This demonstrates that “relative quality differences” matter more for learning than “absolute quality.”
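The underlying objective makes this concrete. The sketch below is the standard per-pair DPO loss, shown for intuition; it is not Dolci's training code, and the scalar log-probability inputs are a simplification:

```python
import math

def dpo_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard per-pair DPO loss: push the policy to widen the
    chosen-vs-rejected log-probability margin relative to a frozen
    reference model. Inputs are sequence log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

Because the loss depends only on the margin, a strong/weak pairing guarantees a consistent gradient direction even when the chosen responses alone would make poor SFT targets.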
Dolci Think RL: Challenging Prompts
Scale: Approximately 40K prompts
Purpose: Performance improvement via RLVR (Reinforcement Learning with Verifiable Rewards)
Characteristics:
Dolci Think RL is a dataset of challenging prompts that reasoning models struggle with.
Domains:
- Math: High-difficulty problems at the AIME (American Invitational Mathematics Examination) level
- Code: Practical programming challenges from LiveCodeBench and similar sources
- Reasoning: Logic puzzles from ZebraLogic and similar sources
Dolci Instruct: Data for Instruction-Following Models
Dolci Instruct is a dataset for training models (Olmo 3 Instruct) that generate concise and direct responses without producing thinking traces.
Dolci Instruct SFT: Function-calling Enabled Data
Scale: Approximately 544K samples
Purpose: Teach the ability to generate efficient and helpful responses
Main data sources:
- Tulu 3 SFT: Diverse instruction-following tasks
- Function-calling data: Tool use and API call samples
- Flan: Task format learning data
Differences from Dolci Think SFT:
| Aspect | Dolci Think SFT | Dolci Instruct SFT |
|---|---|---|
| Thinking traces | Yes | No |
| Response style | Step-by-step reasoning | Concise and direct |
| Function-calling | No | Yes |
| Sample count | 1.94M | 544K |
Dolci Instruct DPO: Response Length Optimization
Scale: Approximately 105K pairs
Purpose: Improve conciseness and usability
Key improvements:
Multi-turn preferences:
Synthetic conversations are generated to teach consistent responses across multiple turns.
Length control:
The length difference between chosen and rejected responses is capped at 100 tokens to suppress verbosity.
- Prevents the model from simply learning to produce “longer responses”
- Promotes concise responses focused on usability
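A minimal sketch of that cap, using whitespace tokens as a stand-in for a real tokenizer (the 100-token figure comes from the text; the function names are ours):

```python
def within_length_cap(chosen: str, rejected: str, cap: int = 100) -> bool:
    """Keep a preference pair only when the chosen/rejected length gap
    is at most `cap` tokens, so DPO cannot reward sheer verbosity."""
    gap = abs(len(chosen.split()) - len(rejected.split()))
    return gap <= cap

def filter_pairs(pairs, cap=100):
    """Drop pairs whose length gap exceeds the cap."""
    return [(c, r) for c, r in pairs if within_length_cap(c, r, cap)]
```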
Dolci Instruct RL: Improving Core Capabilities
Scale: Approximately 128K prompts
Purpose: Further improvement of core capabilities via RLVR
Domain distribution:
- Instruction Following: Accurate compliance with complex instructions
- Chat: Handling diverse conversational scenarios
- Function-calling: Improving tool use accuracy
- Knowledge Recall: Accurate retrieval of factual information
Dolci RL-Zero: RL Directly from Base
Dolci RL-Zero is a specialized dataset for training models with RL directly from the base model, bypassing SFT and DPO.
Purpose and Significance
Research value:
- Enables studying the impact of pretraining data on RL performance
- Provides a fully open RL benchmark
Existing challenges:
Previous open-weight models (Llama 3, Qwen 2.5, etc.) did not release their pretraining data, limiting RL research.
Dolci RL-Zero enables clear benchmarking that eliminates the confounding effects of data leakage.
Four Domains
Dolci RL-Zero consists of four distinct domains.
Math:
- Scale: 30K prompts
- Tasks: Mathematical reasoning problems from GSM8K, MATH, and similar benchmarks
- Verification: Symbolic comparison using SymPy
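The text names SymPy for symbolic comparison; as a dependency-free stand-in, a verifier for exact numeric answers can look like the sketch below (the function name and string-match fallback are our assumptions):

```python
from fractions import Fraction

def math_reward(predicted: str, reference: str) -> float:
    """Verifiable-reward sketch: 1.0 if the answers agree exactly as
    rationals (so "1/2" matches "0.5"), else fall back to string match.
    The actual verifier uses SymPy for full symbolic equivalence."""
    try:
        equal = Fraction(predicted) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        equal = predicted.strip() == reference.strip()
    return 1.0 if equal else 0.0
```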
Code:
- Scale: 30K prompts
- Tasks: Programming challenges from HumanEvalPlus, LiveCodeBench, and similar sources
- Verification: Test case execution and validation
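For the code domain, the reward comes from executing tests against the model's program. A minimal (unsandboxed) sketch with hypothetical names:

```python
def code_reward(candidate, test_cases) -> float:
    """Reward sketch: 1.0 only if the candidate function passes every
    (args, expected) test case. A real pipeline sandboxes execution
    and enforces timeouts; this sketch does neither."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            return 0.0
    return 1.0
```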
IF (Instruction Following):
- Scale: 30K prompts
- Tasks: Precise instruction following from IFEval, IFBench, and similar sources
- Verification: Constraint-checking functions
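IFEval-style constraints are programmatically checkable. A sketch with a few hypothetical constraint types (the registry and its names are illustrative, not the actual checker set):

```python
# Hypothetical constraint registry in the spirit of IFEval-style checkers.
CHECKS = {
    "max_words": lambda resp, n: len(resp.split()) <= n,
    "contains": lambda resp, kw: kw.lower() in resp.lower(),
    "ends_with": lambda resp, suffix: resp.rstrip().endswith(suffix),
}

def if_reward(response: str, constraints) -> float:
    """Reward sketch: 1.0 only if every (name, argument) constraint
    attached to the prompt holds for the response."""
    ok = all(CHECKS[name](response, arg) for name, arg in constraints)
    return 1.0 if ok else 0.0
```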
General Mix:
- Scale: 30K prompts
- Tasks: A mixture of the above three domains plus Chat
- Verification: Domain-specific verification methods
Decontamination
All data in Dolci RL-Zero undergoes rigorous decontamination to eliminate overlap with evaluation benchmarks.
Method: Two-phase processing using the decon package
- Detection phase: 8-gram matching to detect overlap (50% threshold)
- Cluster expansion phase: Removal of entire clusters of similar samples
This ensures complete separation between RL training data and benchmark data.
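The detection phase can be sketched as follows; the set construction and overlap statistic are simplifications of what the decon package actually does, and cluster expansion is not shown:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lowercased whitespace-token n-grams of a text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample: str, benchmark_grams: set,
                    threshold: float = 0.5) -> bool:
    """Flag a sample when at least `threshold` of its 8-grams also
    appear in the pooled benchmark n-grams (detection phase only)."""
    grams = ngrams(sample)
    if not grams:
        return False
    return len(grams & benchmark_grams) / len(grams) >= threshold
```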
Data Curation Pipeline
Dolci’s data curation is carried out through the following pipeline.
```mermaid
flowchart TD
    S1["Step 1: Source Selection<br>Public datasets (OpenThoughts, WildChat, etc.)<br>Model generation (QwQ-32B, DeepSeek R1)"]
    S2["Step 2: Heuristic Filtering<br>Remove incomplete traces<br>Remove domain-specific errors<br>Remove excessive repetition"]
    S3["Step 3: Topic Filtering<br>OpenAI taxonomy classification<br>Remove off-topic samples"]
    S4["Step 4: Difficulty Filtering<br>Select challenging prompts for RL<br>Balance difficulty distribution"]
    S5["Step 5: Data Mixing<br>Balance domain distribution<br>Optimize mix for target tasks"]
    S6["Step 6: Decontamination<br>8-gram matching (50% threshold)<br>Cluster expansion<br>Final Dolci datasets"]
    S1 --> S2 --> S3 --> S4 --> S5 --> S6
```
Key Steps in the Pipeline
Step 1: Source Selection:
Collect data from public datasets and generate data using powerful models.
Step 2: Heuristic Filtering:
Remove obviously low-quality data.
- Incomplete thinking traces
- Domain-specific errors (mathematical mistakes, code syntax errors)
- Excessive repetition
Step 3: Topic Filtering:
Remove off-topic samples using the OpenAI taxonomy.
Step 4: Difficulty Filtering:
Select challenging prompts for RL and balance the difficulty distribution.
Step 5: Data Mixing:
Balance the domain distribution and create a mix optimized for target tasks.
Step 6: Decontamination:
Completely eliminate overlap with evaluation benchmarks.
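Conceptually, the pipeline composes these steps as successive dataset transforms. A toy sketch, where the stage functions are hypothetical stand-ins for the real filtering and decontamination stages:

```python
def run_pipeline(samples, stages):
    """Apply each curation stage in order; every stage maps a list of
    samples to a (usually smaller) list of samples."""
    for stage in stages:
        samples = stage(samples)
    return samples

# Toy stages standing in for heuristic filtering and deduplication.
drop_empty = lambda xs: [x for x in xs if x.strip()]
dedupe = lambda xs: list(dict.fromkeys(xs))  # order-preserving
```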
Key Features
The Dolci data suite has the following features.
Fully Open
All data sources, curation pipelines, and processing code are publicly available.
Published materials:
- References to original data sources
- Curation scripts
- Filtering criteria
- Data mixing ratios
High Quality
High quality is achieved through generation by powerful models and rigorous filtering.
Quality assurance mechanisms:
- Model generation: Uses state-of-the-art models such as QwQ-32B and DeepSeek R1
- Multi-stage filtering: Heuristic, topic, and difficulty filtering
- Decontamination: Complete separation from evaluation data
Broad Domain Coverage
Covers a wide range of domains including Math, Code, Chat, Instruction Following, and Safety.
Domain distribution:
| Domain | Think SFT | Instruct SFT | RL-Zero |
|---|---|---|---|
| Math | ~850K | Included | 30K |
| Code | ~550K | Included | 30K |
| Chat & IF | ~450K (combined) | Majority | IF: 30K; Chat: in General Mix |
| Safety | ~90K | Included | - |
Stage-wise Training Support
Provides data optimized for each training stage: SFT, DPO, and RL.
Training pipeline:
```mermaid
flowchart TD
    A[Base Model] --> B[SFT<br>Dolci Think/Instruct SFT]
    B --> C[DPO<br>Dolci Think/Instruct DPO]
    C --> D[RL<br>Dolci Think/Instruct RL]
    D --> E[Final Model]
```
Each stage uses different types of data.
- SFT: High-quality input-output pairs
- DPO: Preference pairs with quality differences
- RL: Prompts with verifiable rewards
Delta Learning
Delta Learning is an important technique used in creating the Dolci DPO datasets.
Core insight:
What matters in DPO is the “quality gap (delta)” between chosen and rejected responses. Relative quality differences are more important for learning than absolute quality.
Experimental results:
| Setup | MATH Score | Change |
|---|---|---|
| Base model | 45.2 | - |
| SFT with Qwen3-32B | 43.8 | -1.4 |
| DPO with Qwen3-32B (chosen) vs Qwen3-0.6B (rejected) | 52.3 | +7.1 |
Even with the same Qwen3-32B outputs, SFT leads to performance degradation while pairing with a weak model for DPO yields substantial improvement.
Applications:
This finding is leveraged in both Dolci Think DPO and Dolci Instruct DPO.
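Pair construction under Delta Learning reduces to routing each prompt through both models. In the sketch below, `strong_generate` and `weak_generate` are hypothetical stand-ins for Qwen 3 32B and Qwen 3 0.6B inference:

```python
def build_delta_pairs(prompts, strong_generate, weak_generate):
    """Delta-learning DPO pairs: the chosen response always comes from
    the strong model and the rejected one from the weak model, fixing
    a clear quality gap in every pair."""
    return [
        {"prompt": p, "chosen": strong_generate(p), "rejected": weak_generate(p)}
        for p in prompts
    ]
```

No per-pair human or model judging is needed: the model gap itself supplies the preference signal.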
Summary
Dolci is a comprehensive data suite used for the post-training of Olmo 3. It supports three distinct model variations – Think, Instruct, and RL-Zero – and provides data optimized for each training stage: SFT, DPO, and RL.
Key contributions:
- Fully open: All data sources and pipelines are publicly available
- High quality: Generation by powerful models with rigorous filtering
- Diversity: Covers Math, Code, Chat, IF, and Safety
- Stage-wise training: Optimized for SFT, DPO, and RL
- Research value: Dolci RL-Zero provides a fully open RL benchmark
Dolci, as a fully open post-training dataset, enables researchers to conduct highly reproducible research and to intervene in or customize any stage of the training pipeline.