graph TD
V["Visual Input<br/>(Root)"]
V --> A1["Annotation 1<br/>(Branch 1)"]
V --> A2["Annotation 2<br/>(Branch 2)"]
V --> A3["Annotation 3<br/>(Branch 3)"]
V --> A4["Annotation 4<br/>(Branch 4)"]
style V fill:#e1f5ff,stroke:#2196F3,stroke-width:2px
style A1 fill:#fff4e1,stroke:#FF9800
style A2 fill:#fff4e1,stroke:#FF9800
style A3 fill:#fff4e1,stroke:#FF9800
style A4 fill:#fff4e1,stroke:#FF9800
Packing & Message Trees
What is Packing?
Packing is a technique that merges multiple short training examples into a single long sequence, so that batches are not filled with wasteful padding. Token counts vary widely across training examples, from a few hundred tokens (pure text or small images) to over 16,000 tokens (videos with subtitles, or long videos during long-context training).
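To make the cost of padding concrete, the following minimal sketch compares the fraction of padded token slots with and without packing. The example lengths, the batch size of 4, the 16,384-token sequence length, and the first-fit heuristic are all illustrative assumptions, not values or methods taken from Molmo2.

```python
from typing import List


def padding_fraction(lengths: List[int], batch_size: int) -> float:
    """Fraction of token slots wasted when each batch is padded to its longest example."""
    padded_slots = 0
    real_tokens = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padded_slots += max(batch) * len(batch)
        real_tokens += sum(batch)
    return 1 - real_tokens / padded_slots


def packed_padding_fraction(lengths: List[int], max_len: int) -> float:
    """Same measure after first-fit packing into sequences of max_len tokens."""
    bins: List[int] = []  # tokens already used in each packed sequence
    for n in lengths:
        if n > max_len:
            raise ValueError(f"example of {n} tokens exceeds max_len={max_len}")
        for i, used in enumerate(bins):
            if used + n <= max_len:
                bins[i] = used + n
                break
        else:
            bins.append(n)  # no existing sequence has room, so open a new one
    return 1 - sum(lengths) / (len(bins) * max_len)


# Hypothetical token counts spanning the range described above.
lengths = [300, 16000, 500, 1200, 8000, 400, 700, 4000]
print(f"padded batches: {padding_fraction(lengths, batch_size=4):.1%} padding")
print(f"packed:         {packed_padding_fraction(lengths, max_len=16384):.1%} padding")
```

With these made-up lengths, padding each batch to its longest example wastes roughly two thirds of the slots, while packing wastes about 5%.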
Challenges in Vision-Language Models
Packing for vision-language models (VLMs) is non-trivial for two reasons:
- Dual packing requirements: Both the ViT crops and the LLM tokens need to be packed efficiently
- Model diversity: The packer must support models that convert images and video into tokens in different ways
Molmo2’s On-the-Fly Packing Algorithm
Molmo2 developed an on-the-fly packing algorithm that constructs maximally efficient packed sequences from a small in-memory pool of examples. This algorithm can be integrated into a standard PyTorch data loader.
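The general idea of packing from a small in-memory pool can be sketched as follows. This is not Molmo2's actual algorithm: the `Example` fields, the `pack_on_the_fly` function, the default pool size, and the greedy largest-first heuristic are all assumptions made for illustration. The sketch enforces two budgets at once, one for LLM tokens and one for ViT crops.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Example:
    num_tokens: int  # LLM tokens this example occupies (hypothetical field)
    num_crops: int   # ViT crops this example needs (hypothetical field)


def pack_on_the_fly(
    stream: Iterable[Example],
    max_tokens: int,
    max_crops: int,
    pool_size: int = 32,
) -> Iterator[List[Example]]:
    """Yield packs that respect both budgets, drawing from a small pool."""
    source = iter(stream)
    pool: List[Example] = []
    exhausted = False
    while True:
        # Refill the in-memory pool from the stream.
        while not exhausted and len(pool) < pool_size:
            try:
                pool.append(next(source))
            except StopIteration:
                exhausted = True
        if not pool:
            return
        # Greedy heuristic: try the largest examples first, keep what fits.
        pool.sort(key=lambda ex: ex.num_tokens, reverse=True)
        pack: List[Example] = []
        leftover: List[Example] = []
        tokens = crops = 0
        for ex in pool:
            fits_tokens = tokens + ex.num_tokens <= max_tokens
            fits_crops = crops + ex.num_crops <= max_crops
            if fits_tokens and fits_crops:
                pack.append(ex)
                tokens += ex.num_tokens
                crops += ex.num_crops
            else:
                leftover.append(ex)
        if not pack:
            raise ValueError("an example exceeds the token or crop budget on its own")
        pool = leftover
        yield pack


examples = [Example(300, 1), Example(16000, 8), Example(500, 2),
            Example(1200, 4), Example(8000, 8), Example(400, 1)]
for pack in pack_on_the_fly(examples, max_tokens=16384, max_crops=16, pool_size=4):
    print([ex.num_tokens for ex in pack])
```

Because the packer is a plain generator over a stream of examples, one plausible way to use it with PyTorch is to call it inside the `__iter__` method of a `torch.utils.data.IterableDataset`; how Molmo2 integrates its packer is not described here.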
What are Message Trees?
Message Trees are a method for encoding videos or images that have multiple annotations. They enable efficient processing of multiple different annotations (question-answer pairs, captions, pointing, etc.) for a single visual input.
Structure of Message Trees
Message Trees represent each example as a tree: the visual input is the root, and every annotation is a separate branch.
Specifically:
- The visual input is encoded as the first message
- Each annotation becomes a different branch
- The tree structure is linearized as a single sequence
- Custom attention masks are used to prevent branches from attending to one another
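The linearization step in the list above can be sketched as follows. The `MessageTree` class, the `linearize` function, and the convention that branch id 0 marks the shared visual root are hypothetical choices for this illustration, and string placeholders stand in for real token ids.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MessageTree:
    visual_tokens: List[str]                                  # root: the encoded image or video
    branches: List[List[str]] = field(default_factory=list)   # one token list per annotation


def linearize(tree: MessageTree) -> Tuple[List[str], List[int]]:
    """Flatten a tree into one sequence plus a branch id for every token.

    Branch id 0 marks the visual root; id k >= 1 marks the k-th annotation.
    """
    tokens = list(tree.visual_tokens)
    branch_ids = [0] * len(tokens)
    for k, branch in enumerate(tree.branches, start=1):
        tokens.extend(branch)
        branch_ids.extend([k] * len(branch))
    return tokens, branch_ids


tree = MessageTree(
    visual_tokens=["<img0>", "<img1>"],
    branches=[["Q1", "A1"], ["caption"]],
)
tokens, branch_ids = linearize(tree)
print(tokens)      # ['<img0>', '<img1>', 'Q1', 'A1', 'caption']
print(branch_ids)  # [0, 0, 1, 1, 2]
```

Carrying the branch ids alongside the tokens keeps enough information to build the attention mask later without storing the tree itself.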
Attention Mask Implementation
Message Trees use a custom attention mask to keep branches independent: tokens in one annotation (branch) cannot attend to tokens in another.
graph TB
V[Vision Input]
V --> A1[Annotation 1<br/>Branch 1]
V --> A2[Annotation 2<br/>Branch 2]
V --> A3[Annotation 3<br/>Branch 3]
V --> A4[Annotation 4<br/>Branch 4]
style V fill:#e1f5ff
style A1 fill:#fff4e1
style A2 fill:#fff4e1
style A3 fill:#fff4e1
style A4 fill:#fff4e1
classDef attention stroke:#2196F3,stroke-width:2px
class V attention
Each branch can attend to the visual input but not to the other branches. For the detailed attention mask pattern, refer to Figure 3 of the original paper.
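As a rough illustration of such a mask, the sketch below builds a boolean matrix from per-token branch ids. It assumes that branch id 0 marks the visual input and that attention is causal everywhere, including within the visual tokens; both are assumptions of this sketch, and the exact pattern Molmo2 uses is the one shown in the paper's figure. The function name `build_tree_mask` is hypothetical.

```python
from typing import List


def build_tree_mask(branch_ids: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when the token at position q may attend to position k.

    Rules in this sketch:
      * causal: a token never attends to later positions
      * every token may attend to the visual root (branch id 0)
      * otherwise a token may only attend within its own branch
    """
    n = len(branch_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # k <= q enforces causality
            key_is_root = branch_ids[k] == 0
            same_branch = branch_ids[k] == branch_ids[q]
            mask[q][k] = key_is_root or same_branch
    return mask


# Two visual tokens, a two-token branch, then a one-token branch.
mask = build_tree_mask([0, 0, 1, 1, 2])
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 0 0 0 0
# 1 1 0 0 0
# 1 1 1 0 0
# 1 1 1 1 0
# 1 1 0 0 1
```

The last row shows the effect: the token in branch 2 sees both visual tokens and itself, but not the two tokens of branch 1. A boolean mask in which True means "may attend" matches the convention of the boolean `attn_mask` argument of `torch.nn.functional.scaled_dot_product_attention`.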
Synergy of Packing and Message Trees
By combining Packing and Message Trees, Molmo2 achieves the following:
- High-density training data utilization: Efficiently leverages multiple annotations for a single visual input
- Minimal padding: Efficiently packs examples of different lengths, making effective use of GPU memory
- Accelerated training: A reported 15x improvement in training efficiency, which speeds up training on large-scale data
These two techniques form a critical foundation for Molmo2’s efficient training.