Packing & Message Trees

What is Packing?

Packing is a technique that merges multiple short training examples into a single long sequence to avoid wasteful padding during batch creation. The token counts of training examples vary widely, ranging from a few hundred tokens (pure text or small images) to over 16,000 tokens (videos with subtitles or long videos during long-context training).
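To make the cost of padding concrete, the short sketch below computes the fraction of token slots wasted when a batch is padded to its longest example. The function name `padding_fraction` and the token counts are hypothetical, chosen only to mirror the range described above.

```python
def padding_fraction(lengths, batch_size):
    """Fraction of token slots that hold padding when every batch
    is padded to the length of its longest example."""
    total_slots = 0
    real_tokens = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total_slots += max(batch) * len(batch)
        real_tokens += sum(batch)
    return 1 - real_tokens / total_slots


# Hypothetical token counts: three short text/image examples and one long video.
lengths = [300, 500, 800, 16000]
waste = padding_fraction(lengths, batch_size=4)
print(f"{waste:.1%} of the batch is padding")
```

With these illustrative numbers, most of the batch is padding, which is the waste that packing is meant to remove.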

Challenges in Vision-Language Models

Packing in VLMs presents non-trivial challenges for the following reasons:

Dual packing requirements: Both ViT crops and LLM tokens need to be efficiently packed
Model diversity: Must support models with different approaches to converting images and video into tokens

Molmo2’s On-the-Fly Packing Algorithm

The Molmo2 authors developed an on-the-fly packing algorithm that builds maximally efficient packed sequences from a small in-memory pool of examples. The algorithm can be integrated into a standard PyTorch data loader.
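The exact algorithm is not described here, so the following is only a minimal sketch of the general idea, assuming a greedy "largest example that still fits" heuristic. The `Example` class, the `pack_stream` function, and the budget values are all hypothetical. It keeps a small pool of examples in memory and respects two budgets at once, one for LLM tokens and one for ViT crops.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Example:
    num_tokens: int  # LLM tokens this example occupies
    num_crops: int   # ViT crops this example occupies


def pack_stream(examples: Iterable[Example], max_tokens: int,
                max_crops: int, pool_size: int) -> Iterator[List[Example]]:
    """Greedy sketch (an assumption, not Molmo2's actual algorithm)."""
    pool: List[Example] = []
    stream = iter(examples)
    exhausted = False
    while True:
        # Refill the in-memory pool from the input stream.
        while not exhausted and len(pool) < pool_size:
            try:
                pool.append(next(stream))
            except StopIteration:
                exhausted = True
        if not pool:
            return
        packed: List[Example] = []
        tokens_left, crops_left = max_tokens, max_crops
        while True:
            fitting = [ex for ex in pool
                       if ex.num_tokens <= tokens_left
                       and ex.num_crops <= crops_left]
            if not fitting:
                break
            # Take the largest example that fits both budgets.
            best = max(fitting, key=lambda ex: ex.num_tokens)
            pool.remove(best)
            packed.append(best)
            tokens_left -= best.num_tokens
            crops_left -= best.num_crops
        if not packed:
            # An oversized example is emitted alone so the loop always progresses.
            packed.append(pool.pop(0))
        yield packed


demo = [Example(9000, 50), Example(300, 1), Example(7000, 40),
        Example(500, 2), Example(16000, 100)]
packs = list(pack_stream(demo, max_tokens=16384, max_crops=128, pool_size=4))
for pack in packs:
    print([ex.num_tokens for ex in pack])
```

Because `pack_stream` is a plain generator, a sketch like this could be wrapped in a PyTorch `IterableDataset`; that wiring is omitted to keep the example free of dependencies.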

Efficiency Improvement

During SFT, an average of 3.8 examples can be packed into a 16,384-token sequence, yielding a 15x improvement in training efficiency.

What are Message Trees?

Message Trees are a method for encoding videos or images that have multiple annotations. They enable efficient processing of multiple different annotations (question-answer pairs, captions, pointing, etc.) for a single visual input.

Structure of Message Trees

Message Trees represent data in a tree structure as follows:

graph TD
    V["Visual Input<br/>(Root)"]
    V --> A1["Annotation 1<br/>(Branch 1)"]
    V --> A2["Annotation 2<br/>(Branch 2)"]
    V --> A3["Annotation 3<br/>(Branch 3)"]
    V --> A4["Annotation 4<br/>(Branch 4)"]

    style V fill:#e1f5ff,stroke:#2196F3,stroke-width:2px
    style A1 fill:#fff4e1,stroke:#FF9800
    style A2 fill:#fff4e1,stroke:#FF9800
    style A3 fill:#fff4e1,stroke:#FF9800
    style A4 fill:#fff4e1,stroke:#FF9800

Figure 1: Tree structure of Message Trees

Specifically:

The visual input is encoded as the first message
Each annotation becomes a different branch
The tree structure is linearized as a single sequence
Custom attention masks prevent tokens in one branch from attending to tokens in another branch
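The linearization step can be sketched as follows. This is an illustrative assumption rather than Molmo2's implementation: the `linearize` function and the placeholder tokens are hypothetical. The root (visual input) comes first, each annotation branch is appended in turn, and a parallel list records which branch every token belongs to, with 0 reserved for the root.

```python
def linearize(visual_tokens, branches):
    """Flatten a message tree into one token sequence.

    Returns the tokens plus a parallel list of branch ids:
    0 for the visual root, 1..N for the annotation branches.
    """
    tokens = list(visual_tokens)
    branch_ids = [0] * len(tokens)
    for branch_id, branch in enumerate(branches, start=1):
        tokens.extend(branch)
        branch_ids.extend([branch_id] * len(branch))
    return tokens, branch_ids


# Placeholder tokens: two visual tokens, one QA pair, one caption.
tokens, branch_ids = linearize(
    ["<img>", "<img>"],
    [["Q1", "A1"], ["caption"]],
)
print(tokens)      # ['<img>', '<img>', 'Q1', 'A1', 'caption']
print(branch_ids)  # [0, 0, 1, 1, 2]
```

The branch ids carry enough information to derive an attention mask that keeps the branches independent of one another.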

Data Statistics

Examples in the training data have an average of 4 annotations.

Attention Mask Implementation

Message Trees use custom attention masks to maintain independence between branches. This prevents different annotations (branches) from attending to each other.

graph TB
    V[Vision Input]

    V --> A1[Annotation 1<br/>Branch 1]
    V --> A2[Annotation 2<br/>Branch 2]
    V --> A3[Annotation 3<br/>Branch 3]
    V --> A4[Annotation 4<br/>Branch 4]

    style V fill:#e1f5ff
    style A1 fill:#fff4e1
    style A2 fill:#fff4e1
    style A3 fill:#fff4e1
    style A4 fill:#fff4e1

    classDef attention stroke:#2196F3,stroke-width:2px
    class V attention

Figure 2: Attention structure of Message Trees

Each branch can attend to the visual input but cannot attend to any other branch. For the detailed attention mask pattern, refer to Figure 3 of the original paper.
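As a rough illustration of such a mask, the sketch below builds a boolean matrix from per-token branch ids, where 0 marks the visual root and 1..N mark the annotation branches. It assumes causal attention throughout, which is a simplification; the function name `tree_attention_mask` is hypothetical, and the exact pattern used by Molmo2 is the one shown in the paper's figure.

```python
def tree_attention_mask(branch_ids):
    """mask[q][k] is True when query token q may attend to key token k.

    Assumed rule: attention is causal (k <= q), and a token may only see
    the visual root (branch id 0) or tokens of its own branch.
    """
    n = len(branch_ids)
    return [
        [k <= q and branch_ids[k] in (0, branch_ids[q]) for k in range(n)]
        for q in range(n)
    ]


# Two root tokens, a two-token branch 1, and a one-token branch 2.
branch_ids = [0, 0, 1, 1, 2]
mask = tree_attention_mask(branch_ids)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# 10000
# 11000
# 11100
# 11110
# 11001
```

The last row shows the key property: the branch 2 token sees both root tokens and itself, but not the two branch 1 tokens that precede it in the linearized sequence.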

Synergy of Packing and Message Trees

By combining Packing and Message Trees, Molmo2 achieves the following:

High-density training data utilization: Efficiently leverages multiple annotations for a single visual input
Minimal padding: Efficiently packs examples of different lengths, making effective use of GPU memory
Accelerated training: The 15x efficiency gain shortens training on large-scale data

These two techniques form a critical foundation for Molmo2’s efficient training.