Dolci: Post-training Data Suite

Dolci (Dolma Instruct) is a comprehensive data suite used for the post-training of Olmo 3. It consists of multiple subsets designed to support three distinct model variations: Think (reasoning), Instruct (instruction-following), and RL-Zero (RL directly from the base model).

Overview and Purpose

Dolci is a high-quality dataset for specializing Olmo 3 base models toward specific tasks. Its purpose is to transform the broad knowledge acquired during pretraining into practical capabilities such as mathematical reasoning, coding, instruction following, and chat.

Key characteristics of Dolci:

  • Fully open: All data sources and curation pipelines are publicly available
  • High quality: Rigorous filtering and validation of model-generated data
  • Diverse domains: Covers Math, Code, Chat, Instruction Following, and Safety
  • Stage-wise training: Data optimized for each stage of SFT, DPO, and RL

Three Subsets

Dolci consists of three major subsets.

flowchart LR
    D[Dolci Data Suite]

    D --- T[Dolci Think]
    T --> T1[Think SFT<br>1.94M samples]
    T1 --> T2[Think DPO<br>187K pairs]
    T2 --> T3[Think RL<br>40K prompts]

    D --- I[Dolci Instruct]
    I --> I1[Instruct SFT<br>544K samples]
    I1 --> I2[Instruct DPO<br>105K pairs]
    I2 --> I3[Instruct RL<br>128K prompts]

    D --- R[Dolci RL-Zero]
    R --> R1[Math<br>30K prompts]
    R --> R2[Code<br>30K prompts]
    R --> R3[IF<br>30K prompts]
    R --> R4[General Mix<br>30K prompts]
Figure 1: Dolci Data Suite composition

Each subset corresponds to a different model variation’s training pipeline.

Dolci Think: For reasoning models that perform step-by-step thinking

Dolci Instruct: For models that generate concise and direct responses

Dolci RL-Zero: For models trained with RL directly from the base model

Dolci Think: Data for Reasoning Models

Dolci Think is a dataset for training reasoning models (Olmo 3 Think) that perform step-by-step reasoning before generating a final answer.

Dolci Think SFT: Synthetic Thinking Traces

Scale: Approximately 1.94 million samples

Purpose: Teach the model the ability to generate thinking traces

Data source composition:

Table 1: Dolci Think SFT data source composition

  Category  | Main Sources                                  | Sample Count
  ----------|-----------------------------------------------|-------------
  Math      | OpenThoughts3+, SYNTHETIC-2-Verified          | ~850K
  Code      | OpenThoughts3+, Dolci Think Python Algorithms | ~550K
  Chat & IF | WildChat, Persona IF, OpenAssistant           | ~450K
  Safety    | CoCoNot, WildGuardMix, WildJailbreak          | ~90K
  Other     | Aya, TableGPT                                 | ~100K

Data generation approach:

Dolci Think SFT data is created by generating thinking traces with powerful models for existing prompts.

Models used:

  • Math / Code: QwQ-32B (reasoning model)
  • Chat / Safety: DeepSeek R1 (reasoning-specialized model)

Filtering criteria:

  • Remove incomplete thinking traces (truncated mid-stream)
  • Remove domain-specific errors (mathematical mistakes, code syntax errors)
  • Remove excessive repetition and verbosity
  • Topic filtering using the OpenAI taxonomy
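
The first three filtering criteria can be sketched as a single pass over each thinking trace. This is an illustrative sketch, not the actual Dolci pipeline code: the `</think>` delimiter, the 5-gram window, and the repetition threshold are all assumptions.

```python
def passes_heuristic_filters(trace, max_ngram_repeats=10):
    """Illustrative heuristic filter for one thinking trace; the delimiter,
    window size, and threshold are assumptions, not the actual Dolci values."""
    # Incomplete trace: require the closing delimiter of the thinking span.
    if "</think>" not in trace:
        return False
    # Excessive repetition: reject if any 5-gram recurs too many times.
    words = trace.split()
    counts = {}
    for i in range(len(words) - 4):
        gram = " ".join(words[i:i + 5])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_ngram_repeats:
            return False
    return True

print(passes_heuristic_filters("<think>Two plus two is four.</think> 4"))  # True
```

Domain-specific error checks (mathematical mistakes, code syntax errors) would be layered on top of such generic filters.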

Dolci Think DPO: Preference Data via Delta Learning

Scale: Approximately 187K pairs

Purpose: Improve the quality of thinking traces

The principle of Delta Learning:

What matters in DPO (Direct Preference Optimization) is the “quality gap (delta)” between the chosen and rejected responses. In Dolci Think DPO, a strong model and a weak model are paired to create preference data with a clear quality gap.

Data generation setup:

Table 2: Model pair for Delta Learning

  Role     | Model
  ---------|----------------------------
  Chosen   | Qwen 3 32B (thinking mode)
  Rejected | Qwen 3 0.6B (thinking mode)

Key findings:

Data that does not improve performance through SFT can yield significant improvements through DPO.

  • SFT on Qwen3-32B outputs leads to performance degradation
  • Using the same data paired with a weak model for DPO leads to substantial improvement

This demonstrates that “relative quality differences” matter more for learning than “absolute quality.”
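
The pairing scheme from Table 2 can be sketched as follows. The function and the stub models are hypothetical; in the real pipeline the two callables would be inference endpoints for Qwen 3 32B and Qwen 3 0.6B in thinking mode.

```python
def build_delta_pairs(prompts, strong_model, weak_model):
    """Pair each prompt's strong-model response (chosen) with the weak-model
    response (rejected), so every pair carries a clear quality delta."""
    return [
        {
            "prompt": prompt,
            "chosen": strong_model(prompt),   # e.g. Qwen 3 32B, thinking mode
            "rejected": weak_model(prompt),   # e.g. Qwen 3 0.6B, thinking mode
        }
        for prompt in prompts
    ]

# Stub models stand in for real inference endpoints.
pairs = build_delta_pairs(["What is 2+2?"], lambda p: "4", lambda p: "5")
print(pairs[0]["chosen"], pairs[0]["rejected"])  # 4 5
```

Because the delta comes from the model gap rather than from per-sample annotation, no human preference labeling is needed.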

Dolci Think RL: Challenging Prompts

Scale: Approximately 40K prompts

Purpose: Performance improvement via RLVR (Reinforcement Learning with Verifiable Rewards)

Characteristics:

Dolci Think RL is a dataset of challenging prompts that reasoning models struggle with.

Domains:

  • Math: High-difficulty problems at the AIME (American Invitational Mathematics Examination) level
  • Code: Practical programming challenges from LiveCodeBench and similar sources
  • Reasoning: Logic puzzles from ZebraLogic and similar sources

Dolci Instruct: Data for Instruction-Following Models

Dolci Instruct is a dataset for training models (Olmo 3 Instruct) that generate concise and direct responses without producing thinking traces.

Dolci Instruct SFT: Function-calling Enabled Data

Scale: Approximately 544K samples

Purpose: Teach the ability to generate efficient and helpful responses

Main data sources:

  • Tulu 3 SFT: Diverse instruction-following tasks
  • Function-calling data: Tool use and API call samples
  • Flan: Task format learning data

Differences from Dolci Think SFT:

Table 3: Comparison of Think and Instruct SFT data

  Aspect           | Dolci Think SFT        | Dolci Instruct SFT
  -----------------|------------------------|-------------------
  Thinking traces  | Yes                    | No
  Response style   | Step-by-step reasoning | Concise and direct
  Function-calling | No                     | Yes
  Sample count     | 1.94M                  | 544K

Dolci Instruct DPO: Response Length Optimization

Scale: Approximately 105K pairs

Purpose: Improve conciseness and usability

Key improvements:

Multi-turn preferences:

Synthetic conversations are generated to teach consistent responses across multiple turns.

Length control:

The length difference between chosen and rejected responses is capped at 100 tokens to suppress verbosity.

  • Prevents the model from simply learning to produce “longer responses”
  • Promotes concise responses focused on usability
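
The length cap amounts to a simple pair-level filter. A minimal sketch, in which whitespace splitting stands in for the real tokenizer:

```python
def within_length_cap(chosen, rejected, max_token_gap=100):
    """Keep a DPO pair only if the chosen/rejected length gap is small, so the
    model cannot learn 'longer is better' as a shortcut. Whitespace splitting
    stands in for the real tokenizer here."""
    return abs(len(chosen.split()) - len(rejected.split())) <= max_token_gap

print(within_length_cap("a concise answer", "word " * 200))  # False
```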

Dolci Instruct RL: Improving Core Capabilities

Scale: Approximately 128K prompts

Purpose: Further improvement of core capabilities via RLVR

Domain distribution:

  • Instruction Following: Accurate compliance with complex instructions
  • Chat: Handling diverse conversational scenarios
  • Function-calling: Improving tool use accuracy
  • Knowledge Recall: Accurate recall of knowledge

Dolci RL-Zero: RL Directly from Base

Dolci RL-Zero is a specialized dataset for training models with RL directly from the base model, bypassing SFT and DPO.

Purpose and Significance

Research value:

  • Enables studying the impact of pretraining data on RL performance
  • Provides a fully open RL benchmark

Existing challenges:

Previous open-weight models (Llama 3, Qwen 2.5, etc.) did not release their pretraining data, limiting RL research.

Dolci RL-Zero enables clear benchmarking that eliminates the confounding effects of data leakage.

Four Domains

Dolci RL-Zero consists of four distinct domains.

Math:

  • Scale: 30K prompts
  • Tasks: Mathematical reasoning problems from GSM8K, MATH, and similar benchmarks
  • Verification: Symbolic comparison using SymPy
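
A minimal sketch of SymPy-based answer checking, assuming the final answer has already been extracted from the response; the real verifier would also handle LaTeX parsing and numeric tolerance.

```python
import sympy

def verify_math_answer(model_answer, reference):
    """Reward 1 only if the model's final answer is symbolically equal to
    the reference, so algebraically equivalent forms all count as correct."""
    try:
        diff = sympy.simplify(sympy.sympify(model_answer) - sympy.sympify(reference))
        return diff == 0
    except Exception:
        # Unparseable answers earn no reward.
        return False

print(verify_math_answer("2*(x + 1)", "2*x + 2"))  # True
```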

Code:

  • Scale: 30K prompts
  • Tasks: Programming challenges from HumanEvalPlus, LiveCodeBench, and similar sources
  • Verification: Test case execution and validation
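
Test-case verification can be sketched as executing the candidate solution and then its asserts. This is an illustrative harness only: a production verifier would sandbox execution and enforce timeouts.

```python
def verify_code(solution, test_cases):
    """Execute a candidate solution against its test cases; any failing
    assert or exception means reward 0."""
    namespace = {}
    try:
        exec(solution, namespace)
        for test in test_cases:
            exec(test, namespace)  # each test is an assert statement
        return True
    except Exception:
        return False

solution = "def add(a, b):\n    return a + b"
print(verify_code(solution, ["assert add(2, 3) == 5"]))  # True
```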

IF (Instruction Following):

  • Scale: 30K prompts
  • Tasks: Precise instruction following from IFEval, IFBench, and similar sources
  • Verification: Constraint-checking functions
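
Constraint checking reduces to composing per-constraint predicate functions. The checkers below are illustrative stand-ins for the IFEval-style verifiers, not the actual Dolci code.

```python
def check_word_limit(response, max_words):
    """One IFEval-style constraint: 'answer in at most N words'."""
    return len(response.split()) <= max_words

def verify_instruction(response, constraints):
    """Reward 1 only if every constraint attached to the prompt holds."""
    return all(check(response) for check in constraints)

constraints = [lambda r: check_word_limit(r, 10),
               lambda r: r.strip().endswith(".")]
print(verify_instruction("Paris is the capital of France.", constraints))  # True
```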

General Mix:

  • Scale: 30K prompts
  • Tasks: A mixture of the above three domains plus Chat
  • Verification: Domain-specific verification methods

Decontamination

All data in Dolci RL-Zero undergoes rigorous decontamination to eliminate overlap with evaluation benchmarks.

Method: Two-phase processing using the decon package

  1. Detection phase: 8-gram matching to detect overlap (50% threshold)
  2. Cluster expansion phase: Removal of entire clusters of similar samples

This ensures complete separation between RL training data and benchmark data.
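
The detection phase can be sketched with word-level 8-grams. This is a simplified stand-in for the decon package: the real tool also normalizes text and performs the cluster-expansion phase over near-duplicates.

```python
def ngrams(text, n=8):
    """All word-level n-grams of a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(sample, benchmark_item, threshold=0.5):
    """Phase 1 sketch: flag a training sample whose 8-grams overlap a
    benchmark item above the 50% threshold."""
    sample_grams = ngrams(sample)
    if not sample_grams:
        return False
    overlap = len(sample_grams & ngrams(benchmark_item)) / len(sample_grams)
    return overlap >= threshold

bench = "a farmer has seventeen sheep and all but nine run away how many remain"
print(is_contaminated(bench, bench))  # True
```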

Data Curation Pipeline

Dolci’s data curation is carried out through the following pipeline.

flowchart TD
    S1["Step 1: Source Selection<br>Public datasets (OpenThoughts, WildChat, etc.)<br>Model generation (QwQ-32B, DeepSeek R1)"]
    S2["Step 2: Heuristic Filtering<br>Remove incomplete traces<br>Remove domain-specific errors<br>Remove excessive repetition"]
    S3["Step 3: Topic Filtering<br>OpenAI taxonomy classification<br>Remove off-topic samples"]
    S4["Step 4: Difficulty Filtering<br>Select challenging prompts for RL<br>Balance difficulty distribution"]
    S5["Step 5: Data Mixing<br>Balance domain distribution<br>Optimize mix for target tasks"]
    S6["Step 6: Decontamination<br>8-gram matching (50% threshold)<br>Cluster expansion<br>Final Dolci datasets"]

    S1 --> S2 --> S3 --> S4 --> S5 --> S6
Figure 2: Dolci data curation pipeline

Key Steps in the Pipeline

Step 1: Source Selection:

Collect data from public datasets and generate data using powerful models.

Step 2: Heuristic Filtering:

Remove obviously low-quality data.

  • Incomplete thinking traces
  • Domain-specific errors (mathematical mistakes, code syntax errors)
  • Excessive repetition

Step 3: Topic Filtering:

Remove off-topic samples using the OpenAI taxonomy.

Step 4: Difficulty Filtering:

Select challenging prompts for RL and balance the difficulty distribution.
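
One common way to operationalize this, sketched below under assumed thresholds (the band endpoints are illustrative, not Dolci's actual values), is to keep prompts whose reference-model solve rate falls in a middle band: already-solved prompts give no learning signal, and unsolvable ones give no reward at all.

```python
def select_challenging(prompts, solve_rates, low=0.1, high=0.6):
    """Keep prompts whose reference-model solve rate falls in a target band:
    not already solved (too easy for RL) and not hopeless (no reward signal)."""
    return [p for p, rate in zip(prompts, solve_rates) if low <= rate <= high]

prompts = ["easy", "medium", "hard", "impossible"]
rates = [0.95, 0.4, 0.15, 0.0]
print(select_challenging(prompts, rates))  # ['medium', 'hard']
```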

Step 5: Data Mixing:

Balance the domain distribution and create a mix optimized for target tasks.
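
Mechanically, mixing amounts to sampling each domain pool at a target ratio and shuffling the result. The ratios below are illustrative, not the published Dolci mix.

```python
import random

def mix_domains(pools, ratios, total, seed=0):
    """Sample each domain pool according to its target ratio, then shuffle
    so training batches interleave domains."""
    rng = random.Random(seed)
    mixed = []
    for domain, ratio in ratios.items():
        k = int(total * ratio)
        mixed.extend(rng.sample(pools[domain], k))
    rng.shuffle(mixed)
    return mixed

pools = {"math": [f"m{i}" for i in range(100)],
         "code": [f"c{i}" for i in range(100)]}
mix = mix_domains(pools, {"math": 0.6, "code": 0.4}, total=10)
print(len(mix))  # 10
```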

Step 6: Decontamination:

Completely eliminate overlap with evaluation benchmarks.

Key Features

The Dolci data suite has the following features.

Fully Open

All data sources, curation pipelines, and processing code are publicly available.

Published materials:

  • References to original data sources
  • Curation scripts
  • Filtering criteria
  • Data mixing ratios

High Quality

High quality is achieved through generation by powerful models and rigorous filtering.

Quality assurance mechanisms:

  • Model generation: Uses state-of-the-art models such as QwQ-32B and DeepSeek R1
  • Multi-stage filtering: Heuristic, topic, and difficulty filtering
  • Decontamination: Complete separation from evaluation data

Broad Domain Coverage

Covers a wide range of domains including Math, Code, Chat, Instruction Following, and Safety.

Domain distribution:

Table 4: Dolci domain coverage

  Domain | Think SFT | Instruct SFT | RL-Zero
  -------|-----------|--------------|---------
  Math   | 850K      | Included     | 30K
  Code   | 550K      | Included     | 30K
  Chat   | 450K      | Majority     | Included
  IF     | 450K      | Majority     | 30K
  Safety | 90K       | Included     | -

Stage-wise Training Support

Provides data optimized for each training stage: SFT, DPO, and RL.

Training pipeline:

flowchart TD
    A[Base Model] --> B[SFT<br>Dolci Think/Instruct SFT]
    B --> C[DPO<br>Dolci Think/Instruct DPO]
    C --> D[RL<br>Dolci Think/Instruct RL]
    D --> E[Final Model]
Figure 3: Dolci stage-wise training pipeline

Each stage uses different types of data.

  • SFT: High-quality input-output pairs
  • DPO: Preference pairs with quality differences
  • RL: Prompts with verifiable rewards

Delta Learning

Delta Learning is an important technique used in creating Dolci DPO datasets.

Core insight:

What matters in DPO is the “quality gap (delta)” between chosen and rejected responses. Relative quality differences are more important for learning than absolute quality.

Experimental results:

Table 5: Delta Learning results on MATH

  Setup                                                | MATH Score | Change
  -----------------------------------------------------|------------|-------
  Base model                                           | 45.2       | -
  SFT with Qwen3-32B                                   | 43.8       | -1.4
  DPO with Qwen3-32B (chosen) vs Qwen3-0.6B (rejected) | 52.3       | +7.1

Even with the same Qwen3-32B outputs, SFT leads to performance degradation while pairing with a weak model for DPO yields substantial improvement.

Applications:

This finding is leveraged in both Dolci Think DPO and Dolci Instruct DPO.

Summary

Dolci is a comprehensive data suite used for the post-training of Olmo 3. It supports three distinct model variations – Think, Instruct, and RL-Zero – and provides data optimized for each training stage: SFT, DPO, and RL.

Key contributions:

  • Fully open: All data sources and pipelines are publicly available
  • High quality: Generation by powerful models with rigorous filtering
  • Diversity: Covers Math, Code, Chat, IF, and Safety
  • Stage-wise training: Optimized for SFT, DPO, and RL
  • Research value: Dolci RL-Zero provides a fully open RL benchmark

Dolci, as a fully open post-training dataset, enables researchers to conduct highly reproducible research and to intervene in or customize any stage of the training pipeline.