Data Mixing

Overview

Data Mixing is a technique for combining multiple data sources at optimal ratios to maximize model performance. In Dolma 3, two innovative methods—Token-constrained Mixing and Quality-aware Upsampling—were introduced to compose a 6T-token training mix from a 9T-token data pool. These methods achieve an optimal data allocation under a fixed token budget.

Purpose of Data Mixing

Token Budget Constraints

Model training is subject to a token budget constraint:

  • Computational cost: The number of tokens available for training is limited by computational resources
  • Need for optimal allocation: Within a limited budget, the proportion of each data source must be decided
  • Balancing diversity and quality: High-quality data must be prioritized while maintaining data diversity
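
As a toy illustration of the allocation problem, the sketch below scales desired source weights to a fixed token budget while capping each source at its available tokens and redistributing any surplus; all source names and token counts are hypothetical, not Dolma 3's actual pool sizes:

```python
def allocate_budget(desired, available, budget):
    """Scale desired mixing weights to a token budget, capping each
    source at its available tokens and redistributing the surplus
    to the uncapped sources. Illustrative sketch only."""
    alloc = {}
    remaining = dict(desired)
    left = budget
    while remaining:
        total_w = sum(remaining.values())
        # Sources whose proportional share exceeds what they can supply
        capped = {s for s, w in remaining.items()
                  if left * w / total_w >= available[s]}
        if not capped:
            # Everyone fits: allocate proportionally and stop
            for s, w in remaining.items():
                alloc[s] = left * w / total_w
            return alloc
        for s in capped:
            alloc[s] = available[s]      # take everything the source has
            left -= available[s]
            del remaining[s]
    return alloc

# Hypothetical pool (tokens in trillions): desired weights vs. availability
desired = {"web": 0.6, "code": 0.25, "math": 0.15}
available = {"web": 7.0, "code": 1.2, "math": 0.8}
mix = allocate_budget(desired, available, budget=6.0)
# code and math are exhausted; web absorbs the remaining budget
```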

Determining the Optimal Mix Ratio

When composing a 6T-token training mix from a 9T-token data pool, the following factors are considered:

  • Data source characteristics: Features unique to each source—web text, academic PDFs, code, math, etc.
  • Topic balance: Optimal allocation across topics such as STEM, software development, and general knowledge
  • Quality considerations: Preferential selection of high-quality documents

Token-constrained Mixing

Token-constrained Mixing is a method for determining the optimal data mix under a token budget constraint.

Swarm-based Methods

Many small proxy models are trained, and the optimal mix is estimated from their results:

Procedure:

  1. Swarm construction: Train many small proxy models, each with a different mixing ratio
  2. Per-task regression: For each evaluation task, fit a regression model that maps mixing weights to task performance across the proxy runs
  3. Mix optimization: Search for the mix whose predicted average task BPB (bits-per-byte) is lowest

Advantages:

  • Computational efficiency: Experiments with small-scale models allow estimating the optimal mix before training a large-scale model
  • Parallelism: Multiple proxy models can be trained in parallel
  • Iterative refinement: The mix can be improved incrementally based on results

Note: Scale of the Swarm

In Dolma 3, many 1B-parameter models were trained to evaluate performance under different mixing ratios. These proxy models were trained at 5x Chinchilla (five times the Chinchilla-optimal token count) to accurately measure the effect of data mixes.
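
The regression-and-optimization loop above can be sketched as follows. The swarm size, task count, linear surrogate, and synthetic BPB values are illustrative assumptions, not Dolma 3's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical swarm: each proxy run pairs a mixing ratio over 3 sources
# with measured BPB on 5 tasks (synthesized here from a hidden linear model)
n_runs, n_sources, n_tasks = 40, 3, 5
W = rng.dirichlet(np.ones(n_sources), size=n_runs)          # ratios tried
true_coef = rng.uniform(0.5, 1.5, size=(n_sources, n_tasks))
bpb = W @ true_coef + rng.normal(0, 0.01, size=(n_runs, n_tasks))

# Per-task regression: predict BPB from mixing weights (least squares)
coef, *_ = np.linalg.lstsq(W, bpb, rcond=None)

# Mix optimization: among candidate mixes on the simplex, pick the one
# with the lowest average predicted BPB across tasks
candidates = rng.dirichlet(np.ones(n_sources), size=20_000)
avg_bpb = (candidates @ coef).mean(axis=1)
best_mix = candidates[np.argmin(avg_bpb)]
```

The linear surrogate here stands in for whatever regressor the real pipeline uses; the key structure is fitting per-task predictors once and then optimizing over mixes without training more models.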

Conditional Mixing

A conditional mixing procedure is adopted to accommodate continuous improvements to data sources:

Features:

  • Flexibility: When data sources are updated, the entire mix does not need to be recomputed
  • Modularity: Individual data sources can be improved independently
  • Scalability: New data sources can be added easily

Adapting to the development cycle:

  • Continuous improvement of data sources
  • Incremental introduction of new data sources
  • Dynamic adjustment of mix ratios
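
One way to picture conditional mixing is as fixed top-level group weights composed with independently tunable within-group weights, so that refreshing one source re-weights only its own group. This is an illustrative structure, not Dolma 3's exact procedure:

```python
def conditional_mix(group_weights, within_group):
    """Compose a full mix from fixed top-level group weights and
    independently tuned within-group weights. Updating one group's
    internal weights leaves every other group's share untouched."""
    mix = {}
    for group, gw in group_weights.items():
        inner = within_group[group]
        total = sum(inner.values())
        for source, w in inner.items():
            mix[f"{group}/{source}"] = gw * w / total
    return mix

# Hypothetical groups and sources
group_weights = {"web": 0.7, "code": 0.3}
within_group = {
    "web": {"v1": 1.0, "v2": 3.0},   # a refreshed crawl re-weighted alone
    "code": {"stack_edu": 1.0},
}
mix = conditional_mix(group_weights, within_group)
# mix["web/v2"] == 0.7 * 3/4 == 0.525
```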

Quality-aware Upsampling

Quality-aware Upsampling is a method that selectively reintroduces high-quality documents into a deduplicated clean dataset.

Selective Introduction of Duplicates

High-quality documents are selectively restored from data removed during deduplication:

Approach:

  • Deduplication as the foundation: First, build a clean dataset by removing all duplicates
  • Quality assessment: Compute a quality score for each document
  • Selective upsampling: Selectively repeat high-quality documents

Effects:

  • Improved quality: Increasing the proportion of high-quality data improves model performance
  • Efficient repetition: Repetition is concentrated on high-quality data while minimizing overall repetition
  • Token efficiency: The limited token budget is preferentially allocated to high-quality data

Tip: Quality-aware Upsampling Strategy

Among documents removed during deduplication, some are high quality. By selectively restoring these, quality degradation from deduplication is prevented while the overall quality of the dataset is improved.
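
A minimal sketch of the upsampling step, assuming a hypothetical score-to-repeat policy (the actual repeat schedule is not specified here):

```python
def quality_aware_upsample(docs, scores, repeats_for, max_repeat=4):
    """Repeat each document according to its quality score.
    `repeats_for` maps a score to a repeat count; both the policy
    and the cap are illustrative assumptions."""
    out = []
    for doc, score in zip(docs, scores):
        out.extend([doc] * min(repeats_for(score), max_repeat))
    return out

# Hypothetical policy: top-quality docs appear three times, low quality once
policy = lambda s: 3 if s > 0.9 else (2 if s > 0.7 else 1)
docs = ["a", "b", "c"]
scores = [0.95, 0.80, 0.30]
upsampled = quality_aware_upsample(docs, scores, policy)
# → ["a", "a", "a", "b", "b", "c"]
```

Capping the repeat count keeps overall repetition low while still concentrating it on the highest-quality documents, matching the "efficient repetition" goal above.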

Classification by Topic and Quality

In Dolma 3, web text is classified along both topic and quality axes to achieve fine-grained mixing.

24-Topic Classification with WebOrganizer

WebOrganizer is a tool that classifies web text into 24 major topics:

Major topics (examples):

  • Science, Math, and Technology
  • Software Development
  • Arts and Entertainment
  • Business and Finance
  • Health and Medicine
  • Education
  • News and Current Events
  • 17 additional topics

Benefits of classification:

  • Per-topic weighting: Assign optimal weights to each topic
  • STEM reinforcement: Preferentially allocate Science, Math, and Technology topics
  • Balanced mix: Adjust to avoid overrepresentation of any single topic

fastText Quality Classifier

Within each topic, documents are further classified by quality score:

Quality classification:

  • 20 quality tiers: Each topic is divided into 20 quality tiers
  • fastText-based classifier: Fast and accurate quality estimation
  • Objective quality metric: Consistent quality assessment across documents
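
The tiering step can be sketched as quantile bucketing, which splits one topic's documents into 20 roughly equal-sized quality tiers. The quantile scheme is an assumption, since the exact tier boundaries are not specified:

```python
import numpy as np

def quality_tiers(scores, n_tiers=20):
    """Bucket one topic's documents into quality tiers by score
    quantile, so each tier holds roughly the same number of docs.
    Tier 1 is the lowest quality, tier n_tiers the highest."""
    scores = np.asarray(scores)
    # 19 interior quantile edges split the scores into 20 buckets
    edges = np.quantile(scores, np.linspace(0, 1, n_tiers + 1)[1:-1])
    return np.searchsorted(edges, scores, side="right") + 1

rng = np.random.default_rng(0)
tiers = quality_tiers(rng.random(10_000))   # hypothetical quality scores
```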

480 Subsets

24 topics x 20 quality tiers = 480 subsets:

Fine-grained mixing:

  • Per-subset weights: Individual weights are assigned to each subset
  • Quality and topic alignment: High-quality data in important topics is prioritized
  • Flexible tuning: Data allocation optimization at fine granularity

```mermaid
flowchart LR
    A["WebOrganizer<br/>(24 topics)"] --> B["Science, Math,<br/>and Technology"]
    A --> C["Software<br/>Development"]
    A --> D["Arts and<br/>Entertainment"]
    A --> E["... (21 more topics)"]
    B --> B1["Quality tiers (1-20)"]
    C --> C1["Quality tiers (1-20)"]
    D --> D1["Quality tiers (1-20)"]
    E --> E1["Quality tiers (1-20)"]
```

Figure 1: Topic and Quality Classification (480 subsets = 24 topics x 20 quality tiers)
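
The resulting 480-way weighting can be pictured as a 24 x 20 table of subset weights. The boosts below are purely illustrative stand-ins for the optimized weights (which in Dolma 3 come from the swarm-based optimization):

```python
import numpy as np

# Hypothetical per-subset weights over 24 topics x 20 quality tiers.
n_topics, n_tiers = 24, 20
topic_boost = np.ones(n_topics)
topic_boost[[0, 1]] = 3.0                    # e.g. STEM, Software Development
tier_boost = np.linspace(0.2, 2.0, n_tiers)  # higher quality tiers weigh more

weights = np.outer(topic_boost, tier_boost)
weights /= weights.sum()                     # 480 subset weights summing to 1
```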

Mixing Strategy Results

Token-constrained Mixing and Quality-aware Upsampling determined the optimal ratios of data sources.

Per-Topic Weights (Figure 9a)

In the topic distribution of web text, the following trends are observed:

Upweighted topics:

  • Science, Math, and Technology: STEM domains are significantly upweighted
  • Software Development: Programming and software development are reinforced
  • Education: Educational content is emphasized

Downweighted topics:

  • Entertainment-related topics
  • General news and social media content

Results:

  • Training a 1B-parameter model at 5x Chinchilla achieved an average improvement of 0.056 BPB
  • Performance degradation was observed on only 13 out of 54 tasks, with a maximum degradation of 0.035 BPB

Comparison with DCLM Baseline (Figure 9b)

Compared to the DCLM (DataComp for Language Models) Baseline, the following improvements were confirmed:

Improvements:

  • STEM tasks: Substantial performance gains on science, math, and technology tasks
  • Coding tasks: Improved programming ability
  • General knowledge: Performance improvements on a wide range of knowledge tasks

Trade-offs:

  • Slight performance degradation on some tasks
  • Overall, performance gains on important tasks are prioritized

Important: Impact of Data Mixing

Optimization of data mixing has a significant impact on model performance. By prioritizing STEM domains, performance on scientific and technical tasks improves, forming a core strength of Olmo 3.

Programming Language Distribution in Stack-Edu

An optimal mix of programming languages was also determined for code data:

Upweighted languages:

  • Python: Highest weight (importance in machine learning and data science)
  • JavaScript/TypeScript: Primary languages for web development
  • C++/Rust: Systems programming languages

Downweighted languages:

  • Java: Relatively low weight (high proportion of verbose code)
  • Markdown: Documentation files receive a limited weight

Results:

  • Improvements achieved on nearly all coding benchmarks
  • Particularly notable improvements on Python-centric tasks

Summary

Data Mixing is a critical process that determines the quality of Dolma 3. Two innovative methods—Token-constrained Mixing and Quality-aware Upsampling—achieve an optimal data allocation under a fixed token budget.

Key features:

  • Token-constrained Mixing: Optimization via swarm-based methods
  • Quality-aware Upsampling: Selective reintroduction of high-quality data
  • 480 subsets: Fine-grained classification by topic and quality
  • Conditional Mixing: Accommodates continuous improvement of data sources
  • Demonstrated improvement: Average improvement of 0.056 BPB compared to the DCLM Baseline

These methods make Dolma 3 the foundation supporting the high performance of the Olmo 3 Base model.