OlmoBaseEval: Base Model Evaluation Suite
OlmoBaseEval is a benchmark suite for efficiently evaluating base (pretrained) models. It addresses shortcomings of conventional evaluation methods and produces reliable results even for smaller models.
Background and Objectives
During the language model development process, models need to be evaluated frequently throughout training. However, many existing evaluation benchmarks target fully trained or instruction-tuned large models, making them unsuitable for evaluating smaller base models.
OlmoBaseEval was developed with the following objectives:
- Provide meaningful evaluation results even for smaller models
- Accurately track the progress of models during training
- Achieve high evaluation accuracy while keeping computational costs low
- Measure a model’s fundamental capabilities from multiple angles
Challenges with Conventional Methods
There are three main challenges when evaluating smaller base models.
First, smaller models often perform no better than random chance on many tasks. On a 4-choice question, for example, a model that has not yet learned the relevant knowledge hovers around 25% accuracy, so the score carries no meaningful signal about the model's capabilities.
Second, even as training progresses, score changes are minuscule, making it hard to tell whether genuine improvement has occurred; detecting statistically significant differences would require an extremely large number of samples.
Third, some tasks are noisy enough that repeated evaluations of the same model produce fluctuating results, making genuine performance improvements indistinguishable from noise-induced variation.
Three Pillars of the Solution
OlmoBaseEval addresses the above challenges through three approaches.
Task Clustering
By aggregating multiple tasks that evaluate similar capabilities, evaluation reliability is improved.
Even if individual tasks are noisy, integrating results from multiple tasks that measure the same capability yields more stable evaluation metrics. This allows accurate measurement of specific model capabilities (e.g., reasoning ability, language comprehension).
Benefits of task clustering:
- Averages out noise from individual tasks
- Enables multi-faceted assessment of capabilities
- Improves evaluation reproducibility
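The aggregation step can be sketched in a few lines. This is a minimal illustration (the function and cluster names are hypothetical, not the suite's actual API): per-task scores are averaged within a cluster, so the noise of any single task is damped in the cluster-level number.

```python
from statistics import mean, pstdev

def cluster_score(task_scores: dict[str, list[float]]) -> dict[str, float]:
    """Aggregate per-task scores (one list of runs per task) into a single
    cluster-level mean; averaging across tasks damps per-task noise."""
    per_task_means = {task: mean(scores) for task, scores in task_scores.items()}
    return {
        "cluster_mean": mean(per_task_means.values()),
        "task_spread": pstdev(per_task_means.values()),
    }

# Three noisy tasks assumed to measure the same underlying capability.
reasoning_cluster = {
    "simple_logic":        [0.31, 0.29, 0.33],
    "pattern_recognition": [0.35, 0.37, 0.33],
    "basic_math":          [0.28, 0.30, 0.32],
}
print(cluster_score(reasoning_cluster))
```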
Proxy Metrics
Evaluation metrics better suited for smaller models are introduced.
Conventional accuracy-based evaluation fails to capture the capabilities of smaller models. Proxy metrics use continuous indicators that consider the model’s output distribution and confidence, rather than a binary correct/incorrect judgment.
Representative proxy metrics:
- Masked Perplexity: measures the model's prediction probability for specific tokens
- Probability-based metrics: evaluate the probability assigned to the correct answer choice
- Calibration metrics: measure the alignment between the model's confidence and actual accuracy
These metrics can capture training progress even at stages where the model has not yet fully learned.
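A probability-based metric of this kind can be sketched as follows (a toy illustration, not the suite's implementation): the per-choice log-likelihoods are normalised with a softmax, and the score is the probability mass the model puts on the correct choice, which moves continuously during training where 0/1 accuracy would not.

```python
import math

def correct_choice_probability(choice_logprobs: list[float], correct_idx: int) -> float:
    """Normalise per-choice log-likelihoods with a softmax and return the
    probability mass the model places on the correct answer. Unlike binary
    accuracy, this score changes smoothly as training progresses."""
    m = max(choice_logprobs)  # subtract max for numerical stability
    exps = [math.exp(lp - m) for lp in choice_logprobs]
    return exps[correct_idx] / sum(exps)

# A 4-way question: the model slightly prefers the right answer (index 2),
# so the score rises above the 0.25 random-chance baseline.
p = correct_choice_probability([-2.3, -2.5, -1.9, -2.6], correct_idx=2)
print(p)
```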
Signal-to-Noise Ratio Improvement
The signal-to-noise ratio of evaluation tasks is analyzed, and only highly reliable tasks are selected.
Not all tasks are equally useful. OlmoBaseEval measures the signal-to-noise ratio of each task and preferentially adopts tasks that produce stable results even for smaller models.
Methods for analyzing signal-to-noise ratio:
- Confirming consistency of evaluation results across multiple model sizes
- Measuring variance across multiple evaluations of the same model
- Verifying consistency with scaling laws
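One simple way to operationalise the second point above is to compare between-checkpoint variance (the signal worth tracking) against within-checkpoint variance across repeated evaluations (the noise). The sketch below assumes this ratio definition; the suite's exact estimator may differ.

```python
from statistics import mean, pvariance

def task_snr(runs_per_checkpoint: list[list[float]]) -> float:
    """Signal-to-noise ratio of one task: variance of per-checkpoint mean
    scores (the trend we want to track) divided by the average variance
    across repeated evaluations of the same checkpoint (the noise)."""
    means = [mean(runs) for runs in runs_per_checkpoint]
    signal = pvariance(means)
    noise = mean(pvariance(runs) for runs in runs_per_checkpoint)
    return signal / noise if noise > 0 else float("inf")

# A task whose score climbs steadily vs. one dominated by run-to-run noise.
stable = [[0.30, 0.31], [0.40, 0.41], [0.50, 0.51]]
noisy  = [[0.30, 0.40], [0.33, 0.43], [0.36, 0.46]]
print(task_snr(stable), task_snr(noisy))
```

A task like `stable` would be kept; a task like `noisy` would be deprioritised.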
New Benchmarks
OlmoBaseEval includes the following four new benchmarks.
BasicSkills
Six tasks that measure basic language understanding and reasoning capabilities.
- Reading Comprehension: Understanding short texts and extracting information
- Fact Recall: Memorization of basic knowledge
- Simple Logic: Fundamental logical reasoning
- Pattern Recognition: Recognizing and predicting patterns
- Basic Math: Arithmetic-level numerical computation
- Common Sense: Common-sense reasoning
These tasks are designed so that even smaller models can achieve performance above random chance.
Gen2MC
Five tasks that convert generation tasks into multiple-choice format.
Traditionally, generation tasks have been considered unsuitable for evaluating base models. Gen2MC solves this problem by evaluating generation quality in a multiple-choice format.
- Summarization: Judging summarization quality from answer choices
- Translation: Evaluating translation accuracy
- Paraphrasing: Measuring the validity of paraphrases
- Question Generation: Evaluating the quality of generated questions
- Title Generation: Judging the appropriateness of generated titles
This format enables probability-based evaluation even for generation tasks.
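The scoring idea can be sketched as follows. This is an assumed interface, not the benchmark's code: `score_fn(context, choice)` stands in for a model returning the total log-probability of a candidate output, and the highest length-normalised score wins, so no free-form generation is needed.

```python
def pick_choice(score_fn, context: str, choices: list[str]) -> int:
    """Gen2MC-style scoring sketch: rank candidate outputs by their
    length-normalised log-likelihood under the model instead of
    generating free text. Returns the index of the best choice."""
    normalised = [score_fn(context, c) / max(len(c.split()), 1) for c in choices]
    return max(range(len(choices)), key=normalised.__getitem__)

# Toy stand-in for a real model: rewards word overlap with the source,
# penalises length. Purely illustrative.
def toy_score(context, choice):
    overlap = len(set(context.split()) & set(choice.split()))
    return -2.0 * len(choice.split()) + 3.0 * overlap

doc = "the suite evaluates base models during training"
summaries = [
    "a suite that evaluates base models during training",  # faithful
    "an unrelated sentence about cooking pasta at home",   # off-topic
]
print(pick_choice(toy_score, doc, summaries))
```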
MT MBPP
A multilingual programming benchmark supporting 17 languages.
MT MBPP translates MBPP (Mostly Basic Programming Problems) into 17 natural languages. It simultaneously evaluates a model’s multilingual comprehension and coding capabilities.
Examples of supported languages:
- European languages: English, Spanish, French, German
- Asian languages: Japanese, Chinese, Korean, Hindi
- Others: Arabic, Russian, Portuguese
By evaluating the same problem set across each language, cross-lingual performance differences can be analyzed.
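Because every language shares the same problem set, per-language results are directly comparable. A minimal sketch of such an analysis (the function name and rates are illustrative):

```python
def cross_lingual_gap(pass_rates: dict[str, float], reference: str = "English") -> dict[str, float]:
    """Report each language's pass-rate gap relative to a reference
    language; meaningful only because the underlying problems are
    identical translations of the same MBPP set."""
    base = pass_rates[reference]
    return {lang: rate - base for lang, rate in pass_rates.items() if lang != reference}

# Hypothetical pass rates for one model.
rates = {"English": 0.42, "Japanese": 0.35, "Hindi": 0.28}
print(cross_lingual_gap(rates))
```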
Masked Perplexity
An evaluation method that measures prediction accuracy for masked tokens.
Specific important tokens (nouns, verbs, etc.) are masked, and the model is evaluated on how accurately it can predict them. This method provides continuous scores even for smaller models, enabling fine-grained tracking of training progress.
Characteristics of Masked Perplexity:
- Continuous scoring (not binary judgment)
- Direct measurement of contextual understanding
- Low computational cost
- Detects meaningful differences from the early stages of training
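Given per-token probabilities from a model, the metric reduces to an exponentiated mean negative log-probability over the masked positions only. A minimal sketch, assuming the probabilities are already extracted:

```python
import math

def masked_perplexity(token_probs: list[float], masked_positions: set[int]) -> float:
    """Perplexity computed only over the masked (important) token positions:
    exp of the mean negative log-probability assigned to each masked token.
    Lower is better, and the score is continuous, so it moves even before
    the model can answer questions correctly."""
    nlls = [-math.log(token_probs[i]) for i in masked_positions]
    return math.exp(sum(nlls) / len(nlls))

# Per-token probabilities for a short sequence; positions 1 and 3
# (say, a noun and a verb) are the masked targets.
probs = [0.9, 0.2, 0.8, 0.4, 0.7]
print(masked_perplexity(probs, masked_positions={1, 3}))
```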
Evaluation in Practice
Evaluation using OlmoBaseEval follows this workflow:
- Baseline measurement: Evaluate the randomly initialized model before training begins
- Periodic evaluation: Run automated evaluation at regular training steps
- Cluster analysis: Track performance changes for each task cluster
- Scaling prediction: Estimate large-model performance from small-model results
For smaller models (under 1B parameters), evaluation every few hundred steps is recommended. For larger models, the evaluation frequency should be adjusted considering computational costs.
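The checkpoint schedule implied above can be sketched as a small helper. The concrete intervals here are illustrative choices consistent with the recommendation, not values prescribed by the suite:

```python
def eval_steps(total_steps: int, n_params: int) -> list[int]:
    """Pick evaluation checkpoints: every few hundred steps for models
    under 1B parameters, backed off for larger models to save compute.
    Step 0 is the randomly initialised baseline."""
    interval = 500 if n_params < 1_000_000_000 else 2_000
    steps = list(range(0, total_steps + 1, interval))
    if steps[-1] != total_steps:
        steps.append(total_steps)  # always evaluate the final checkpoint
    return steps

print(eval_steps(10_000, n_params=300_000_000)[:4])
```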
Integration with Scaling Analysis
OlmoBaseEval provides further insights when combined with scaling law analysis.
Based on evaluation results from smaller models, the performance of larger models can be predicted. This allows the effectiveness of training strategies to be verified before actually training large-scale models.
The following factors are important for improving prediction accuracy:
- Evaluation data across multiple model sizes
- Scaling curves for each task cluster
- Correlation analysis with training data volume
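The extrapolation step can be sketched as a power-law fit in log-log space, a standard scaling-law form (the exact functional form the suite uses may differ):

```python
import math

def fit_power_law(sizes: list[float], losses: list[float]) -> tuple[float, float]:
    """Fit loss = a * size**(-b) by least squares in log-log space, so
    large-model performance can be extrapolated from small-model runs."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic small-model points generated from loss = 100 * N**-0.3.
sizes = [1e7, 5e7, 1e8, 5e8]
losses = [100 * n ** -0.3 for n in sizes]
a, b = fit_power_law(sizes, losses)
predicted_1b = a * 1e9 ** -b
print(f"predicted 1B-parameter loss: {predicted_1b:.3f}")
```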
Computational Efficiency
OlmoBaseEval significantly reduces computational costs compared to conventional evaluation suites.
+----------------------------------+
| Efficiency Comparison            |
+----------------------------------+
| Traditional:  100 GPU hours      |
| OlmoBaseEval:  10 GPU hours      |
| Reduction:     90%               |
+----------------------------------+
Factors contributing to efficiency:
- Optimized number of tasks (elimination of redundancy)
- Reduced evaluation time through proxy metrics
- Optimized batch processing
Summary
OlmoBaseEval brings the following innovations to base model evaluation:
- Reliability for smaller models: Achieves evaluation beyond random-chance performance
- Efficiency: Significantly reduces computational costs
- Multi-faceted evaluation: Comprehensive capability measurement through task clustering
- Predictability: Integration with scaling analysis
This evaluation suite enables the verification of data-efficient and compute-efficient training strategies from the earliest stages of model development.