```mermaid
flowchart TB
A["<b>Video / Image Input</b><br/>Video: max 128 frames @ 2fps<br/>(384 for long ctx)<br/>Image: 1 crop + up to K=8<br/>overlapping crops"]
B["<b>Vision Transformer (ViT)</b><br/>Extracts patch-level features"]
C["<b>Vision-Language Connector</b><br/>3rd-to-last & 9th-from-last<br/>ViT layer features used<br/>Attention pooling:<br/>2×2 (images) / 3×3 (video)<br/>Shared MLP projection"]
D["<b>LLM (Qwen3 or OLMo)</b><br/>Visual tokens + text timestamps<br/>(video) or image indices<br/>Bi-directional attention<br/>between vision tokens<br/>Output: text + points<br/>(for grounding)"]
A --> B --> C --> D
```
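The frame-sampling budget in the diagram (2 fps, capped at 128 frames, or 384 in the long-context stage) can be sketched in Python. The function name and the uniform-thinning fallback are illustrative assumptions, not the paper's actual preprocessing code:

```python
def sample_frame_indices(num_frames: int, video_fps: float,
                         sample_fps: float = 2.0,
                         max_frames: int = 128) -> list[int]:
    """Pick frame indices at `sample_fps`, thinning uniformly to `max_frames`.

    Illustrative sketch: the real Molmo2 pipeline may handle edge cases
    (very short clips, variable fps) differently.
    """
    step = video_fps / sample_fps  # source frames per sampled frame
    indices = [round(i * step) for i in range(int(num_frames / step))]
    indices = [i for i in indices if i < num_frames]
    if len(indices) > max_frames:
        # Uniformly subsample down to the frame budget.
        stride = len(indices) / max_frames
        indices = [indices[int(k * stride)] for k in range(max_frames)]
    return indices
```

For a 10-second clip at 30 fps (300 frames), this yields 20 sampled frames; passing `max_frames=384` models the long-context setting.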
Molmo2 Technical Report Summary
Overview
Molmo2 (Multimodal Open Language Model 2) is a fully open Vision-Language Model (VLM) family developed by the Allen Institute for AI (AI2) and the University of Washington. Its most notable feature is the introduction of video grounding capabilities.
Previous VLMs could understand and describe the contents of images and videos, but lacked the ability to precisely indicate (“ground”) when and where specific events or objects occurred. Molmo2 achieves spatiotemporal pointing and tracking within videos, reaching state-of-the-art performance among open-source models.
Paper: arXiv:2601.10611
Key contributions:
- 9 novel datasets (built entirely without reliance on proprietary models)
- Video grounding (pointing & tracking)
- Ultra-detailed video captions (average 924 words/video)
- Fully open (models, data, code)
Model sizes:
- Molmo2-4B (Qwen3 LLM backbone)
- Molmo2-8B (Qwen3 LLM backbone)
- Molmo2-O-7B (OLMo LLM backbone, fully open)
Motivation: The Importance of Video Grounding
Currently, the most powerful video-language models are proprietary, with weights, data, and training recipes undisclosed. Moreover, many open-weight models rely on "distillation" (generating synthetic training data from proprietary models) and thus lack a fully independent open foundation.
Furthermore, existing VLMs lack grounding capabilities. Grounding is the ability to output spatiotemporal coordinates with an answer: for example, pointing to each grasping event when asked "How many times did the robot grasp the red block?", or returning the trajectory (track) of a cup when asked "When did the cup fall off the table?"
Image grounding has become a standard capability, but video grounding was only supported in a limited fashion by some proprietary systems, remaining an unexplored area in open source.
Molmo2 was developed to bridge this gap.
Datasets: 9 Novel Datasets
At the core of Molmo2 are 9 novel datasets. All are constructed without any distillation from proprietary models, using human annotation and LLM-based synthetic pipelines.
Many open models (LLaVA-Video, PLM, ShareGPT4Video, etc.) adopt a “distillation” approach that generates synthetic data from proprietary models such as GPT-4V and Gemini.
This approach has the following problems:
- Lack of transparency: Data quality is opaque because it depends on proprietary model capabilities
- Bias inheritance: Biases and errors from proprietary models are directly inherited
- Improvement ceiling: It is difficult to achieve performance that surpasses the source model
Molmo2 instead constructs its datasets using only human annotation, in-house models (Molmo), and text-side LLM assistance (Claude Sonnet 4.5), with no distillation from proprietary VLMs. This gives the open-source community a foundation from which to surpass SOTA.
1. Molmo2-Cap (Human Annotation)
- Contents: 104k video-level captions + 431k clip-level captions
- Features: Ultra-detailed captions averaging 924 words/video
- Comparison with existing datasets: Video Localized Narratives (75 words), ShareGPT4Video (280 words), LLaVA-Video (547 words)
- Pipeline:
- Annotators describe short clips via voice (enables more detailed descriptions than typing)
- Transcription with Whisper-1
- Text refinement with LLM
- Frame-level captions generated with Molmo and integrated
Details: Dense Video Captioning
2. Molmo2-AskModelAnything (Human Annotation)
- Contents: 140k video QA pairs
- Features: Detailed questions and answers by human annotators
- Pipeline:
- Videos clustered into 31 categories to ensure diversity
- Annotators create detailed questions
- Claude Sonnet 4.5 generates initial answers
- Annotators iteratively refine the answers
3. Molmo2-CapQA & Molmo2-SubtitleQA (Synthetic)
- CapQA: 1M QA pairs (200k videos, 5 QA/video)
- Videos segmented into scenes, each scene captioned
- LLM generates QA from captions
- SubtitleQA: 300k QA pairs (100k videos, 3 QA/video)
- Subtitles extracted with Whisper-1
- Reasoning questions generated that require both visual information and subtitles
4. Molmo2-VideoPoint (Human Annotation)
- Contents: 650k video pointing queries (280k videos, average 6 points/video)
- Categories: 8 types
- Objects, Animals, Actions/Events
- Referring expressions, Indirect references
- Spatial references, Comparative references
- Visual artifacts/anomalies (for generated videos)
- Pipeline:
- LLM generates queries from captions
- Annotators click on the correct frame (at 2 fps) and precise location
Details: Video Grounding: Pointing & Tracking
5. Molmo2-VideoTrack (Human Annotation)
- Contents: 3.6k video clips, 15k complex natural language queries (average 2.28 objects/query)
- Features: Complex text queries created for existing tracking annotations
- Pipeline:
- Segmentation or bounding box tracks are displayed
- Annotators create non-trivial queries that apply to subsets of objects
- Verification in a separate round
Details: Video Grounding: Pointing & Tracking
6 & 7. AcademicVideoPoint & AcademicVideoTrack (Curation)
- VideoPoint: 49k pointing/counting QA converted from 6 datasets
- VideoTrack: 7 Ref-VOS datasets + 11 bounding box tracking datasets converted (segmentation masks generated with SAM-2)
8. Molmo2-MultiImageQA (Human Annotation)
- Contents: 45k image sets (96k unique images), 72k QA pairs
- Features: QA over semantically related image sets (2-5 images, average 2.73 images)
- Pipeline:
- Images grouped by caption similarity
- Annotators create questions
- Answers refined through iterative loop with LLM
Details: Multi-Image Understanding
9. Molmo2-MultiImagePoint & Molmo2-SynMultiImageQA (Synthetic)
- MultiImagePoint: 470k pointing/counting examples (generated from PixMo-Points via clustering)
- SynMultiImageQA: 188k synthetic multi-image examples (extending CoSyn; charts, tables, documents, etc.)
Architecture
Molmo2 adopts a standard VLM architecture.
Key design choices:
- Cropping: Up to 24 crops for images (at inference), 2 fps sampling for video
- Bi-directional attention: Image tokens can mutually attend to each other (improves performance)
- Pointing format: Normalized (x, y, timestamp/image_index, object_id) output as plain text
Details: Vision-Language Connector
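To make the plain-text pointing format concrete, here is a minimal encoder/decoder. The tag name, attribute names, and the 0-100 coordinate scale are assumptions for this sketch; the paper's exact serialization is not reproduced here:

```python
import re

def encode_point(x: float, y: float, t: float, object_id: int,
                 width: int, height: int) -> str:
    """Serialize one point as plain text, coordinates normalized to [0, 100].

    Hypothetical format for illustration; not Molmo2's exact syntax.
    """
    return (f'<point x="{100 * x / width:.1f}" y="{100 * y / height:.1f}" '
            f't="{t:.1f}" id="{object_id}"/>')

def decode_point(text: str) -> dict:
    """Parse a point emitted by encode_point back into numeric fields."""
    m = re.fullmatch(
        r'<point x="([\d.]+)" y="([\d.]+)" t="([\d.]+)" id="(\d+)"/>', text)
    return {"x": float(m[1]), "y": float(m[2]),
            "t": float(m[3]), "id": int(m[4])}
```

Because points are emitted as ordinary text tokens, the same LLM head handles captioning, QA, and grounding without a separate detection decoder.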
Training
3-Stage Training Pipeline
Molmo2 is trained in 3 stages.
Stage 1: Pre-training (Images Only)
- Data: PixMo-Cap (captions), PixMo-Points (pointing), Tulu (NLP)
- Mixing ratio: 60% captions, 30% pointing, 10% NLP
- Steps: 32k steps, batch size 128 (approximately 4 epochs)
- Learning rate: Set independently for ViT, Connector, and LLM
Stage 2: Supervised Fine-Tuning (SFT)
- Data: PixMo + Molmo2 datasets + open-source video/image datasets
- Category-based sampling: Manually tuned sampling rates (see Table 1)
- Steps: 30k steps, batch size 128, max sequence length 16,384
| Category | Sampling Rate | # Datasets | # Examples |
|---|---|---|---|
| Captions/Long QA | 13.6% | 6 | 1.2M |
| Image QA | 22.7% | 32 | 2.4M |
| Video QA | 18.2% | 32 | 2.4M |
| Image Pointing | 9.1% | 4 | 1.1M |
| Video Pointing | 13.6% | 7 | 0.37M |
| Video Tracking | 13.6% | 22 | 0.80M |
| NLP | 9.1% | 1 | 0.99M |
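The category-based sampling in the table can be sketched as a simple weighted draw. Rates are normalized, since the listed percentages sum to roughly 99.9%; the category keys are shorthand labels, not the paper's identifiers:

```python
import random

# Sampling rates from the SFT mixture table (percent); normalized before use.
SFT_RATES = {
    "captions_long_qa": 13.6, "image_qa": 22.7, "video_qa": 18.2,
    "image_pointing": 9.1, "video_pointing": 13.6,
    "video_tracking": 13.6, "nlp": 9.1,
}

def sample_category(rng: random.Random) -> str:
    """Draw one training category with probability proportional to its rate."""
    cats, rates = zip(*SFT_RATES.items())
    r = rng.random() * sum(rates)
    for cat, weight in zip(cats, rates):
        r -= weight
        if r <= 0:
            return cat
    return cats[-1]  # guard against floating-point rounding
```

In actual training, each draw would then pull the next example from the chosen category's shuffled dataset stream.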
Stage 3: Long-Context SFT
- Context length: 36,864 (2.25x Stage 2)
- Number of frames: F = 384 (3x Stage 2)
- Steps: 2k steps
- Parallelism: Context Parallelism (CP), processed across 8 GPUs
- Note: Conducted for a short duration only due to high overhead
Details: Long-Context Training
Key Training Techniques
Token Weighting
The data ranges from multiple-choice questions with single-token outputs to long video captions with 4000+ tokens. If long outputs dominate the loss, performance on short-answer tasks degrades.
Solution: Adjust weighting per task
- Video captions: weight 0.1
- Pointing: weight 0.2
- Others: \(\frac{4}{\sqrt{n}}\) (\(n\) = number of answer tokens)
Details: Token Weighting
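The weighting rule above can be written as a small function; the task labels here are illustrative, and real training would multiply each example's answer-token loss by this weight:

```python
import math

def answer_token_weight(task: str, num_answer_tokens: int) -> float:
    """Per-example loss weight: fixed for captions and pointing, else 4/sqrt(n).

    Task names are shorthand labels for this sketch, not Molmo2's identifiers.
    """
    if task == "video_caption":
        return 0.1
    if task == "pointing":
        return 0.2
    return 4.0 / math.sqrt(num_answer_tokens)
```

Note how the 4/sqrt(n) rule keeps short answers influential: a 16-token answer gets weight 1.0, while a 4000-token answer would be downweighted heavily even before the fixed caption weight applies.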
Packing
Since the number of tokens varies greatly across examples (from hundreds to 16k+), packing is used to avoid padding. For vision-language models, both crops for the ViT and tokens for the LLM need to be packed efficiently.
Molmo2 developed an on-the-fly packing algorithm, achieving a 15x training efficiency improvement.
Details: Packing & Message Trees
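A toy version of packing is a greedy first-fit pass over LLM token counts. The paper's on-the-fly packer must also balance ViT crops across devices; that part is omitted here:

```python
def pack_examples(lengths: list[int], max_len: int = 16_384) -> list[list[int]]:
    """Greedy first-fit packing: returns lists of example indices per sequence.

    Simplified stand-in for Molmo2's on-the-fly packer; considers only
    LLM token counts, not ViT crop budgets.
    """
    bins: list[list] = []  # each entry: [used_tokens, [example_indices]]
    # Longest-first ordering tends to fill sequences more tightly.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b in bins:
            if b[0] + lengths[idx] <= max_len:
                b[0] += lengths[idx]
                b[1].append(idx)
                break
        else:
            bins.append([lengths[idx], [idx]])
    return [b[1] for b in bins]
```

Packing replaces padding tokens with real examples, which is where the reported efficiency gain comes from.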
Message Trees
When a single image/video has multiple annotations, they are encoded as a message tree. The visual input becomes the first message, and each annotation becomes a different branch. The tree is linearized into a single sequence, and a custom attention mask prevents cross-attention between branches.
On average, examples in the data have 4 annotations, and packing fits an average of 3.8 examples into a 16,384-token sequence.
Details: Packing & Message Trees
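The branch-isolating attention mask can be sketched as follows. This toy version uses a plain causal mask within each branch and makes the shared visual prefix visible to all branches; it omits the bi-directional attention Molmo2 applies among vision tokens:

```python
def message_tree_mask(prefix_len: int, branch_lens: list[int]) -> list[list[bool]]:
    """Causal attention mask for a linearized message tree.

    The shared visual prefix is visible to every branch; tokens in one
    annotation branch never attend to tokens of another branch.
    Returns mask[q][k] == True when query position q may attend to key k.
    Simplified sketch: real vision tokens attend bi-directionally.
    """
    # Assign each position a branch id: -1 marks the shared prefix.
    branch_of = [-1] * prefix_len
    for b, n in enumerate(branch_lens):
        branch_of += [b] * n
    size = len(branch_of)
    mask = [[False] * size for _ in range(size)]
    for q in range(size):
        for k in range(q + 1):  # causal: keys up to and including q
            if branch_of[k] == -1 or branch_of[k] == branch_of[q]:
                mask[q][k] = True
    return mask
```

With this mask, one forward pass trains on all annotations of a video while each branch behaves as if it were the only conversation following the visual input.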
Evaluation
Overall Results (Short-form Video, Captioning, Counting)
Molmo2 is evaluated on standard video benchmarks and novel captioning/counting benchmarks.
Key results:
- Short-form video understanding: SOTA among open-weight models
- NextQA: 86.2 (Molmo2-8B)
- PerceptionTest: 82.1
- MVBench: 75.9
- MotionBench: 62.2
- Captioning: F1 Score 43.2 on Molmo2-CapTest (Molmo2-8B)
- Behind GPT-5 (50.1) but ahead of Gemini 2.5 Pro (42.1)
- Counting: 35.5% accuracy on Molmo2-VideoCount (Molmo2-8B)
- Significantly outperforms Qwen3-VL-8B (29.6%)
- Long-form video: Does not match the best open-weight models (e.g., Eagle2.5-8B)
- Cause: Insufficient open-source training data for long-form (10+ minute) videos
Long-form video benchmark results (Molmo2-8B, with open-weight baselines for comparison):
- LongVideoBench: 67.5 (Eagle2.5-8B: 66.4, PLM-8B: 56.9)
- MLVU: 60.2 (Eagle2.5-8B: 60.4, PLM-8B: 52.6)
- LVBench: 52.8 (Eagle2.5-8B: 50.9, PLM-8B: 44.5)
Causes:
- Insufficient open data: Lack of high-quality annotations for videos longer than 10 minutes
- Computational constraints: Long-Context Training (Stage 3) was conducted for only 2,000 steps (due to high overhead)
- Trade-offs: Prioritizing caption quality led to slightly lower performance on long-form video tasks
However, Molmo2’s long-form video performance still surpasses many open models, and considering it uses only fully open data, the results are commendable.
Details: Long-Context Training
Human Preference Study:
- Elo score: 1057 (Molmo2-8B)
- Ranked 5th overall, just behind Gemini 2.5 Flash (1084) and Gemini 3 Pro (1082)
- Highest performance among fully open models
Grounding Results (Video Pointing & Tracking)
Molmo2’s greatest strength is video grounding.
Video Pointing:
- F1 Score 38.4 on Molmo2-VP benchmark (novel)
- Significantly outperforms Gemini 3 Pro (20.0)
- Highest performance including proprietary models
Video Tracking:
- Accuracy 56.2 on BURST (test)
- J&F 41.1 on Molmo2-VC (novel benchmark)
- Outperforms Gemini 3 Pro (details vary by benchmark)
Since existing open-weight models (e.g., Qwen3-VL) do not offer video tracking, Molmo2 effectively pioneers this capability in the open.
Details: Video Grounding: Pointing & Tracking
Image Results
Molmo2 maintains strong performance on image tasks as well.
- MMMU: 47.9 (Molmo2-8B)
- MathVista: 63.1
- ChartQA: 79.5
- AI2D: 84.5
Adding video capabilities does not come at the cost of image task performance.
Ablations
The paper examines the impact of the following components:
- Bi-directional attention on vision tokens: Effective (improves performance)
- Token weighting: Effective (improves balance between long and short outputs)
- Packing: 15x efficiency improvement
- Message trees: Efficient learning from multiple annotations
Details: Token Weighting, Packing & Message Trees
Conclusion
Molmo2, as a fully open VLM, achieves the following:
- 9 novel datasets constructed (zero reliance on proprietary models)
- Video grounding (pointing & tracking) realized
- SOTA in short-form video understanding among open models
- Captioning & counting performance approaching proprietary models
- Fully open (models, data, code)
The main remaining challenge is long-form (10+ minute) video, where performance does not yet match the best open-weight models, largely due to the scarcity of open-source long-form training data.
Molmo2 provides a solid foundation for the open-source community to build SOTA VLMs.