Dolma 3: Pretraining Dataset

Dolma 3 is the large-scale dataset used to pretrain the Olmo 3 Base model. The pretraining mix drawn from it, known as Dolma 3 Mix, comprises approximately 6 trillion tokens of diverse data spanning web pages, academic PDFs, code repositories, and mathematical content.

Overview and Objectives

The primary objective of Dolma 3 is to provide the Olmo 3 Base model with broad knowledge and capabilities. Since the dataset is used in the most compute-intensive pretraining stage (consuming over 90% of total compute), scalability and quality are of paramount importance.

Core data strategy principles:

  • Scale matters: meaningfully influencing pretraining requires data available at the trillion-token scale
  • Task data handling: Structured task data (QA pairs, chat instances, etc.) is reserved for later midtraining and long-context extension stages, and is not used during pretraining

Data Source Composition

Dolma 3 Mix is composed of multiple data sources. The following table shows the token counts and document counts for each source.

Table 1: Dolma 3 Mix data source composition
| Data Source | Type | 9T Pool | 6T Mix | 6T Mix Share |
| --- | --- | --- | --- | --- |
| Common Crawl | Web pages | 8.14T tokens / 9.67B docs | 4.51T tokens / 3.15B docs | 76.1% |
| olmOCR science PDFs | Academic docs | 972B tokens / 101M docs | 805B tokens / 83.8M docs | 13.6% |
| Stack-Edu (Rebalanced) | GitHub code | 137B tokens / 167M docs | 409B tokens / 526M docs | 6.89% |
| arXiv | LaTeX papers | 21.4B tokens / 3.95M docs | 50.8B tokens / 9.10M docs | 0.86% |
| FineMath 3+ | Math web pages | 34.1B tokens / 21.4M docs | 152B tokens / 95.5M docs | 2.56% |
| Wikipedia & Wikibooks | Encyclopedia | 3.69B tokens / 6.67M docs | 2.51B tokens / 4.24M docs | 0.04% |
| Total | | 9.31T tokens / 9.97B docs | 5.93T tokens / 3.87B docs | 100% |

Note that for Stack-Edu, arXiv, and FineMath 3+ the 6T Mix contains more tokens than the 9T Pool: high-quality documents from these sources are repeated during quality-aware upsampling.

Data Source Descriptions

Common Crawl (Web pages):

  • The largest data source by share (76.1%)
  • Text data extracted from diverse web pages
  • Includes data up to December 31, 2024

olmOCR science PDFs (Academic documents):

  • A new data source converting academic PDFs to linearized plain text
  • Extracted from 238 million unique PDF documents
  • Text extraction performed using the olmOCR tool

Stack-Edu (GitHub code):

  • Curated educational programming content from The Stack v2 dataset
  • Split by programming language with an optimized mix applied

arXiv (LaTeX papers):

  • Sourced from the Proof-Pile-2 dataset
  • Retains the original LaTeX notation, enabling the model to learn both mathematical content and proper formatting

FineMath 3+ (Math web pages):

  • A subset of Common Crawl documents containing mathematical educational content
  • Reprocessed to properly preserve mathematical notation

Wikipedia & Wikibooks (Encyclopedia):

  • English and Simple English Wikipedia and Wikibooks
  • A base source for encyclopedic knowledge

Key Innovations

Dolma 3 introduces three major technical innovations.

1. Global Deduplication

For Dolma 3, a new tool called Duplodocus was developed to achieve fast and scalable global deduplication at the trillion-token scale.

Deduplication is performed in three stages:

Stage 1: Exact Deduplication:

  • Global deduplication based on document text hashes
  • Removes all exact copies
  • Identified 67% of data as duplicates, reducing the corpus from 38.7B to 12.8B documents
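The idea behind this stage can be sketched in a few lines. This is a toy illustration, not the Duplodocus implementation, which must scale to billions of documents:

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each exact document text, drop later copies."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

At trillion-token scale the hash set no longer fits in one process, so a production system would shard the hash space across workers; the keep-first logic stays the same.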

Stage 2: Fuzzy Deduplication:

  • MinHash-based deduplication to identify and remove near-identical documents
  • Removes documents that differ only in headers or footers (documents copied across multiple domains)
  • Identified 23% of data as duplicates, reducing the corpus to 9.8B documents
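MinHash estimates the Jaccard similarity between documents' shingle sets without comparing the sets directly. A pure-Python sketch of the core idea (a real pipeline would add LSH banding so candidate pairs can be found without all-pairs comparison):

```python
import hashlib

def shingles(text, k=5):
    """The set of character k-grams in a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(text, num_perm=64):
    """Signature: for each of num_perm salted hash functions, the minimum
    hash value over the document's shingles."""
    grams = shingles(text)
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of agreeing slots estimates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two pages that differ only in a footer share almost all shingles, so their signatures agree on most slots; unrelated pages agree on almost none.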

Stage 3: Substring Deduplication:

  • A new fuzzy suffix-array-based deduplication procedure
  • Removes repetitive content within individual documents (boilerplate text and HTML artifacts)
  • Marks repeated substrings of 500 bytes or more
  • Removed 14% of text bytes, reducing the corpus to 9.7B documents (36.5T bytes)

This three-stage procedure reduced the web corpus from 38.7B to 9.7B documents (a 75% reduction in document count).

Details: Deduplication

2. Token-constrained Mixing and Quality-aware Upsampling

Dolma 3 introduces two new approaches to data mixing.

Token-constrained Mixing:

  • Uses a Swarm-based approach to train and evaluate a large number of small proxy models
  • Uses these results to determine the optimal mix
  • A conditional mixing procedure that accommodates development cycles where data sources are continuously improved and updated

The Token-constrained Mixing procedure:

  1. Swarm construction: Train a large number of small proxy models with different mixing ratios
  2. Per-task regression: Fit a regression for each task that maps mixing weights to predicted task performance
  3. Mix optimization: Find the mixing ratios that minimize predicted average task BPB (bits-per-byte)
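The optimization step can be sketched with invented numbers: assume one linear surrogate per task, fit from the proxy swarm, that predicts BPB from the mixing weights, plus a token-availability cap on one source. All coefficients and the cap below are hypothetical, for illustration only:

```python
import itertools

# Hypothetical per-task surrogates (coefficients invented): each maps
# mixing weights w = (web, code, math) to a predicted BPB for that task.
task_models = {
    "qa":   lambda w: 1.20 - 0.30 * w[0] + 0.10 * w[1] + 0.05 * w[2],
    "code": lambda w: 1.50 + 0.20 * w[0] - 0.70 * w[1] + 0.00 * w[2],
    "math": lambda w: 1.40 + 0.10 * w[0] + 0.05 * w[1] - 0.50 * w[2],
}

def candidate_mixes(step=0.1, code_cap=0.5):
    """Coarse grid over the 3-simplex; the code weight is capped to model a
    token-availability constraint (only so many unique code tokens exist)."""
    ticks = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    for web, code in itertools.product(ticks, ticks):
        math = round(1.0 - web - code, 10)
        if math >= 0 and code <= code_cap:
            yield (web, code, math)

def best_mix():
    """The candidate mix minimizing predicted average BPB over all tasks."""
    return min(candidate_mixes(),
               key=lambda w: sum(f(w) for f in task_models.values()) / len(task_models))
```

Under these made-up surrogates, the optimizer pushes the code weight to its availability cap and spends the rest on math, illustrating how the token constraint shapes the final mix.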

Quality-aware Upsampling:

  • Accounts for quality variation within each topic
  • Selectively repeats high-quality documents, concentrating repetition on high-quality data while minimizing overall repetition
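A greedy sketch of the idea (the budgeting details here are invented): include every document once, then spend the remaining token budget repeating the highest-quality documents first, subject to a per-document repeat cap:

```python
def upsample(docs, token_budget, max_repeats=4):
    """docs: list of (doc_id, num_tokens, quality_score). Returns repeat
    counts per document. Every doc appears at least once; leftover budget
    goes to the highest-quality docs, capped at max_repeats each, so
    repetition is concentrated where quality is highest."""
    counts = {doc_id: 1 for doc_id, _, _ in docs}
    used = sum(num_tokens for _, num_tokens, _ in docs)
    for doc_id, num_tokens, _ in sorted(docs, key=lambda d: d[2], reverse=True):
        while counts[doc_id] < max_repeats and used + num_tokens <= token_budget:
            counts[doc_id] += 1
            used += num_tokens
    return counts
```

With three 100-token documents of quality 0.9, 0.5, and 0.2 and a 500-token budget, only the best document is repeated (three copies) while the other two appear once each.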

Details: Data Mixing Methods

3. olmOCR Science PDFs

olmOCR science PDFs is a new data source that converts academic PDFs to linearized plain text. It was introduced as a replacement for the previous peS2o dataset.

Features:

  • “Polite” crawling identified as AI2Bot
  • Respects robots.txt and does not circumvent paywalls
  • Crawling focused on academic sites and paper repositories
  • Text extraction using olmOCR (versions 0.1.49-0.1.53)

Data scale:

  • 238 million unique PDF documents (cutoff date: December 2024)
  • 160 million PDF documents after text extraction
  • 156 million documents after deduplication

Details: olmOCR Science PDFs

Data Curation Pipeline

Data curation for Dolma 3 Mix follows the pipeline below.

flowchart TD
    subgraph CC["Common Crawl (Web pages)"]
        CC1["HTML text extraction"] --> CC2["Heuristic filtering"]
        CC2 --> CC3["Deduplication"]
        CC3 --> CC4["Topic & quality classification"]
    end

    subgraph PDF["Academic PDFs"]
        PDF1["OCR text extraction"] --> PDF2["Heuristic filtering"]
        PDF2 --> PDF3["Deduplication"]
        PDF3 --> PDF4["Topic & quality classification"]
    end

    subgraph GH["GitHub repos (Stack-Edu)"]
        GH1["Language classification"]
    end

    subgraph OTHER["FineMath, arXiv, Wiki"]
        OTHER1["Preprocessed"]
    end

    CC4 --> MIX["Mixing"]
    PDF4 --> MIX
    GH1 --> MIX
    OTHER1 --> MIX
    MIX --> QU["Quality upsampling"]
    QU --> FINAL["Dolma 3 Mix (6T tokens)"]
Figure 1: Data Curation Pipeline

Key pipeline steps:

  1. Text extraction: Extract text from HTML or PDF
  2. Heuristic filtering: Remove low-quality documents, spam, and adult content
  3. Deduplication: Remove exact duplicates, fuzzy duplicates, and substring duplicates
  4. Topic and quality classification: Classify documents into 24 topics using the WebOrganizer tool and assign quality scores
  5. Mixing: Determine optimal data source ratios via Token-constrained mixing
  6. Quality upsampling: Selectively repeat high-quality documents
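The steps above compose naturally as a chain of list-to-list transformations. A toy sketch with stand-ins for the first stages (the filters here are placeholders, not the production heuristics):

```python
def heuristic_filter(docs, min_words=3):
    """Stand-in heuristic: drop very short documents."""
    return [d for d in docs if len(d.split()) >= min_words]

def exact_dedup(docs):
    """Stand-in dedup: keep the first copy of each exact text."""
    return list(dict.fromkeys(docs))

def run_pipeline(docs, steps):
    """Apply curation steps in order, as in Figure 1."""
    for step in steps:
        docs = step(docs)
    return docs

docs = ["ok", "the cat sat down", "the cat sat down", "a dog ran fast"]
print(run_pipeline(docs, [heuristic_filter, exact_dedup]))
# ['the cat sat down', 'a dog ran fast']
```

Structuring curation this way keeps each stage independently testable and lets per-source pipelines (web, PDF, code) share steps where they overlap.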

Usage Across Three Training Stages

The Dolma 3 dataset is used across three stages of the Olmo 3 training process.

Stage 1: Pretraining

Data used: Dolma 3 Mix (6T tokens)

Objective: Build a foundation model with diverse knowledge and capabilities

Data source breakdown:

  • Common Crawl: 76.1%
  • olmOCR science PDFs: 13.6%
  • Stack-Edu: 6.89%
  • arXiv: 0.86%
  • FineMath 3+: 2.56%
  • Wikipedia & Wikibooks: 0.04%

Stage 2: Midtraining

Data used: Dolma 3 Dolmino Mix (100B tokens)

Objective: Enhance critical capabilities such as code, math, and general knowledge QA

Features: Deliberately includes instruction data and thinking traces as groundwork for post-training

Details: Midtraining

Stage 3: Long-context Extension

Data used: Dolma 3 Longmino Mix (50-100B tokens)

Objective: Acquire long-context capabilities supporting up to 65K tokens

Data scale:

  • 7B model: 50B tokens
  • 32B model: 100B tokens

Source data scale:

  • 8K+ tokens: 22.3M documents (640B tokens)
  • 32K+ tokens: 4.5M documents (380B tokens)

This is the largest openly available collection for long-context research.

Details: Long-context Extension

Data Mixing Results

Token-constrained mixing determined the optimal ratios across data sources.

Web text topic distribution:

  • STEM domains (“Science, Math, and Technology”, “Software Development”) are significantly upweighted
  • Training a 1B parameter model at 5x Chinchilla yielded an average improvement of 0.056 BPB
  • Performance degradation was observed on only 13 of 54 tasks, with a maximum decrease of 0.035 BPB

Stack-Edu programming language distribution:

  • Python is prioritized over Java and Markdown
  • Improvements are achieved on nearly all coding benchmarks

Dolma 3 also releases sample mixes to enable experimentation with fewer compute resources.

Pretraining sample mix:

  • 150B tokens
  • Same data source composition as Dolma 3 Mix

Benefits:

  • Enables small-scale experiments
  • Useful for ablation studies on data mixing
  • Accessible to researchers with limited compute resources

Summary

Dolma 3 is a large-scale, high-quality dataset used for pretraining the Olmo 3 Base model. It achieves optimal data mixing through innovative methods including global deduplication, Token-constrained mixing, and Quality-aware upsampling.

Key features:

  • Diverse data sources: Web, academic PDFs, code, math, encyclopedia, and more
  • Large scale: 6 trillion tokens (Dolma 3 Mix)
  • High quality: Quality control through deduplication and heuristic filtering
  • Optimized mixing: Optimal ratios determined via Swarm-based methods
  • Three-stage usage: Used across Pretraining, Midtraining, and Long-context Extension

Dolma 3 is fully open, with all data sources, processing pipelines, and mixing ratios publicly released to enable reproducible research.