Video Grounding: Pointing & Tracking

What Is Video Grounding?

Video grounding is the ability of a model to accurately localize objects and events in a video in both space and time. While conventional Vision-Language Models (VLMs) could answer questions like “What is in this video?”, they could not return precise timestamps and positions for queries such as “How many times was the red block grasped, and where was each grasp?”

Molmo2 implements two grounding capabilities to bridge this gap: Video Pointing and Video Tracking.

Motivation

Image-level grounding (pointing) has already become a standard capability, supported by Molmo2’s predecessor Molmo1 as well as GPT-4V, Gemini, and others. However, video-level grounding was only partially supported in a few proprietary systems and remained unexplored in the open-source landscape.

Video grounding is important for a variety of practical use cases:

  • Robotics: Returning the spatiotemporal coordinates of each grasping event for queries like “How many times did the robot grasp the red block?”
  • Video search: Returning the time and trajectory of an object for queries like “When did the cup fall off the table?”
  • Generated video quality assessment: Automatically detecting locations of visual artifacts/anomalies in generated videos

Video Pointing vs Video Tracking

Molmo2 provides two types of video grounding capabilities.

Video Pointing

Video Pointing is the task of indicating the position of a specific object or event with points in particular frames of a video. It may span multiple frames, but each frame is treated independently.

Example:

  • Query: “Point to the waterfall”
  • Response: <points coords="t^1 count_1 x_1 y_1 t^2 count_2 x_2 y_2 ...">waterfall</points>

Characteristics:

  • Often used in combination with object counting
  • Provides spatial information (“where exactly?”) in addition to answering “how many?”
  • Records the position in each frame individually, even if the object moves across frames

Video Tracking

Video Tracking is the task of following (tracking) a specific object through time within a video. When the same object moves across multiple frames, its trajectory is recorded consistently.

Example:

  • Query: “Track the red car”
  • Response: A unique ID is assigned to each object, and the position in each frame is recorded

Characteristics:

  • Consistency of object identity is critical (the same object retains the same ID)
  • Handles complex natural language queries (e.g., “the second player from the left”, “the person wearing the green shirt”)
  • Supports simultaneous tracking of multiple objects (average of 2.28 objects per query)

Note: Pointing vs Tracking

  • Pointing: Independent location information per frame (counting-oriented)
  • Tracking: Temporal consistency of object identity (trajectory-oriented)

In practice, Pointing is suitable when you want to know “when and where something is”, while Tracking is suitable when you want to know “how something moved”.

Molmo2-VideoPoint Dataset

Molmo2-VideoPoint is a human-annotated dataset for pointing to objects and events within videos.

Basic Statistics

  • Number of videos: 280k videos
  • Number of queries: Over 650k
  • Average number of points: 6 points per video
  • Frame rate: Sampled at 2 fps

Eight Categories

Molmo2-VideoPoint covers the following eight diverse categories:

  1. Objects: Common objects (e.g., “car”, “cup”)
  2. Animals: Animal detection and counting
  3. Actions/Events: Temporal events (e.g., “jump”, “throw”)
  4. Referring expressions: Complex descriptions (e.g., “the second person from the left”)
  5. Indirect references: Indirect descriptions (e.g., “the thing he is holding”)
  6. Spatial references: Spatial relationships (e.g., “the thing on the table”)
  7. Comparative references: Comparative descriptions (e.g., “the largest dog”)
  8. Visual artifacts/anomalies: Anomaly detection in generated videos

Tip: Anomaly Detection in Generated Videos

Category 8, Visual artifacts/anomalies, is designed for quality assessment of AI-generated videos. Using 10k videos generated by approximately 25 text-to-video (T2V) models, the model learns to detect anomalies such as Vanishing Subject, Physical Incongruity, and Temporal Dysmorphia.

Data Collection Pipeline

  1. Query generation: An LLM generates pointing queries from video captions produced by Molmo2-Cap
  2. Frame selection: Annotators identify frames where the object appears (sampled at 2 fps)
  3. Position annotation: Annotators click the precise location of the object
  4. Formatting: Timestamp (frame index), count, and normalized (x, y) coordinates are recorded
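
The formatting step above can be sketched as a small serializer. This is a hedged illustration, not the actual pipeline code: the helper name `format_points` and the use of plain numeric frame indices are assumptions.

```python
# Hypothetical sketch of step 4: serialize per-frame annotations
# (frame index, instance count, normalized click position) into the
# plain-text <points> format. Helper name and field layout are assumptions.
def format_points(label, annotations):
    """annotations: list of (frame, count, x, y) with x, y in [0, 1]."""
    coords = " ".join(
        f"{t} {count} {x:.2f} {y:.2f}" for t, count, x, y in annotations
    )
    return f'<points coords="{coords}">{label}</points>'

print(format_points("waterfall", [(1, 1, 0.45, 0.32), (2, 1, 0.48, 0.35)]))
# <points coords="1 1 0.45 0.32 2 1 0.48 0.35">waterfall</points>
```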

Distribution Characteristics

  • Count distribution: Objects with counts of 0–5 dominate (skewed toward low counts)
    • Medium and high count examples are upsampled during training
  • Number of frames: The distribution of annotated frame counts is heavily skewed toward short spans (most examples annotate only a few frames)
  • Categories: Action/Event, Object, and Referring Expression are the most frequent (as these are more challenging to learn)

Molmo2-VideoTrack Dataset

Molmo2-VideoTrack is an object tracking dataset designed for complex natural language queries.

Basic Statistics

  • Number of video clips: 3.6k (training) + 1.3k (evaluation) = approximately 5k total
  • Number of queries: 15k complex natural language queries (training)
  • Average number of objects: 2.28 objects per query (most queries track multiple objects)
  • Average query length: 8.21 words per query
  • Video length: Up to 2 minutes; most are 10–30 seconds
  • Average annotations: 6.08 objects per video

Data Sources

Molmo2-VideoTrack is built on existing segmentation and bounding box tracking datasets, augmented with complex text queries written by human annotators.

Segmentation-based (general object tracking):

  • SAM-V, VIPSeg, MOSE, MOSEv2

Bounding box-based (domain-specific):

  • Sports: TeamTrack, SoccerNet, SportsMOT
  • Autonomous driving: BDD100K
  • Animals: APTv2, AnimalTrack, BFT
  • UAV (drone): UAV-MOTD, SeaDrones
  • People: MOT20, PersonPath, DanceTrack

Important: Converting Bounding Boxes to Segmentation Masks

For bounding box-based datasets, the center point of the box may not lie on the object. SAM 2 was therefore used to convert each bounding box into a segmentation mask.

Conversion process:

  1. Feed the initial bounding box as a prompt to SAM 2
  2. Generate a segmentation mask and propagate it across the entire video
  3. Discard tracks with IoU below 0.5
  4. Sample points near the center of the generated masks

This yields high-confidence point-based tracking annotations.
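
The IoU filter in step 3 can be sketched as follows. It assumes the IoU is computed between the propagated SAM 2 mask and the original box region, which is one plausible reading of the pipeline rather than a confirmed detail.

```python
import numpy as np

# Sketch of the step-3 filter. Assumption: "IoU below 0.5" compares the
# SAM 2 mask against the original box region; the real criterion may differ.
def box_to_mask(box, h, w):
    """Rasterize an (x1, y1, x2, y2) pixel box into a boolean mask."""
    m = np.zeros((h, w), dtype=bool)
    x1, y1, x2, y2 = box
    m[y1:y2, x1:x2] = True
    return m

def mask_box_iou(mask, box):
    box_mask = box_to_mask(box, *mask.shape)
    inter = np.logical_and(mask, box_mask).sum()
    union = np.logical_or(mask, box_mask).sum()
    return inter / union if union else 0.0

def keep_track(masks, boxes, thresh=0.5):
    """Keep a track only if every frame's mask agrees with its box."""
    return all(mask_box_iou(m, b) >= thresh for m, b in zip(masks, boxes))
```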

Data Collection Pipeline

The collection process for Molmo2-VideoTrack follows the Ref-VOS (Referring Video Object Segmentation) approach.

  1. Display existing tracks: Show annotators the segmentation or bounding box tracks
  2. Query authoring: Annotators write non-trivial text queries that apply to a subset of objects
    • Examples: “the second player from the left wearing a green shirt”, “the red cup on the table”
  3. Verification: A separate annotator checks query quality in a verification round
    • After verification, approximately 70% of queries are retained

Category Distribution

Molmo2-VideoTrack covers a diverse range of domains:

  • General objects: Everyday items (from segmentation datasets)
  • Sports: Soccer players, team members, athletes
  • Traffic: Cars, pedestrians, bicycles
  • Animals: Wildlife, pets
  • UAV: Tracking in drone footage
  • People: Pedestrians, dancers

Multi-object tracking is the primary focus: the majority of queries describe multiple objects simultaneously.

Academic Datasets

Molmo2 also uses Academic datasets, which are existing open-source datasets converted into the Pointing and Tracking format.

AcademicVideoPoint

Existing object tracking annotations were converted into 49k pointing and counting QA pairs.

Source datasets (6):

  • MeViS, ReVOS, LV-VIS, OVIS, BURST, Ref-DAVIS17

Conversion process:

  1. Obtain the timestamp of the frame where the object first appears
  2. Randomly sample points within the object mask (Gaussian distribution, centered near the mask center)
  3. Convert to the pointing QA format
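
Step 2 above can be sketched as rejection sampling around the mask centroid. This is a minimal illustration under stated assumptions: the standard deviation, seeding, and function name are all invented for the example.

```python
import numpy as np

# Sketch of step 2 (an assumption-laden illustration, not the released code):
# draw a point biased toward the mask centroid with a Gaussian, rejecting
# draws that fall outside the mask or the frame.
def sample_point_in_mask(mask, std_frac=0.15, rng=None):
    rng = rng or np.random.default_rng(0)  # fixed seed for reproducibility
    ys, xs = np.nonzero(mask)              # pixel coordinates of the object
    cy, cx = ys.mean(), xs.mean()          # mask centroid
    h, w = mask.shape
    while True:                            # rejection sampling
        y = int(rng.normal(cy, std_frac * h))
        x = int(rng.normal(cx, std_frac * w))
        if 0 <= y < h and 0 <= x < w and mask[y, x]:
            return x / w, y / h            # normalized (x, y) in [0, 1)
```

Points sampled this way are guaranteed to lie on the object (unlike box centers) while clustering near its middle.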

AcademicVideoTrack

Existing video object segmentation (VOS) and tracking datasets were converted.

Segmentation-based (7 Ref-VOS datasets):

  • MeViS, ReVOS, LV-VIS, OVIS, BURST, Ref-Youtube-VOS, Ref-DAVIS17

Bounding box-based (11 tracking datasets):

  • TrackingNet, VastTrack, GOT-10k, LaSOT, TNL2K, WebUAV, WebUOT, LVOS V1/V2, UW-COT220, TNLLT, YouTube-VIS, MoCA-Video

SAM 2 was used to convert bounding boxes to segmentation masks and generate point-based tracking tasks.

Note: Scale of AcademicVideoTrack

AcademicVideoTrack constitutes the bulk of the training data, providing 130k queries and 800k training examples. Molmo2-VideoTrack is far smaller by comparison, but its text queries are more complex and diverse.

Evaluation Results: Outperforming Proprietary Models

Molmo2 achieves state-of-the-art performance in video grounding, surpassing even proprietary models.

Video Counting & Pointing

The following table shows performance on BURST-VideoCount (VC), Molmo2-VideoCount (Molmo2-VC), and Molmo2-VideoPoint (Molmo2-VP).

| Model | BURST-VC Acc. | BURST-VC Close Acc. | Molmo2-VC Acc. | Molmo2-VC Close Acc. | Molmo2-VP F1 | Molmo2-VP Recall | Molmo2-VP Precision |
|---|---|---|---|---|---|---|---|
| API only | | | | | | | |
| GPT-5 | 43.1 | 73.7 | 35.8 | 50.3 | 4.1 | 4.4 | 4.2 |
| GPT-5 mini | 46.0 | 73.0 | 29.8 | 49.3 | 2.2 | 2.2 | 2.2 |
| Gemini 3 Pro | 44.0 | 71.7 | 37.1 | 53.1 | 20.0 | 27.4 | 19.8 |
| Gemini 2.5 Pro | 41.6 | 70.0 | 35.8 | 56.5 | 13.0 | 14.5 | 13.6 |
| Gemini 2.5 Flash | 38.7 | 70.0 | 31.9 | 48.2 | 11.1 | 11.2 | 12.2 |
| Claude Sonnet 4.5 | 42.4 | 72.6 | 27.2 | 45.1 | 3.5 | 3.7 | 4.3 |
| Open weights only | | | | | | | |
| Qwen3-VL-4B | 38.9 | 74.7 | 25.3 | 44.3 | 0.0 | 0.0 | 0.0 |
| Qwen3-VL-8B | 42.0 | 74.4 | 29.6 | 47.7 | 1.5 | 1.5 | 1.5 |
| Molmo2 family | | | | | | | |
| Molmo2-4B | 61.5 | 76.1 | 34.3 | 56.1 | 39.9 | 42.7 | 39.4 |
| Molmo2-8B | 60.8 | 75.0 | 35.5 | 53.3 | 38.4 | 39.3 | 38.7 |
| Molmo2-O-7B | 61.6 | 76.0 | 33.2 | 50.5 | 35.8 | 35.8 | 37.9 |

Tip: Key Results

  • BURST-VC: The Molmo2 family achieves the highest accuracy among all models (up to 61.6%)
  • Molmo2-VP: Molmo2-4B achieves an F1 score of 39.9, roughly 2x the performance of Gemini 3 Pro (20.0)
  • Comparison with Qwen3-VL: Qwen3-VL provides virtually no video pointing support (F1 scores of 0.0–1.5)

Molmo2 achieves state-of-the-art video pointing performance not only among open-weight models but also when compared against proprietary models.

Explanation of evaluation metrics:

  • Accuracy: Exact match
  • Close Accuracy: Considered correct if the error is within delta = 1 + floor(0.05 * gt) (the higher the ground-truth count, the larger the tolerance)
  • F1, Recall, Precision: Measure whether generated points fall within the ground-truth mask
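
The Close Accuracy tolerance described above can be written out directly:

```python
import math

# Close Accuracy: a predicted count is accepted if it lies within
# delta = 1 + floor(0.05 * gt) of the ground-truth count gt.
def is_close(pred, gt):
    delta = 1 + math.floor(0.05 * gt)
    return abs(pred - gt) <= delta

print(is_close(4, 3), is_close(37, 40))  # True True: tolerance is 1 and 3
```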

Video Tracking

The following table shows performance on major video tracking benchmarks.

| Model | MeViS valid J&F | MeViS valid-u J&F | Ref-YT-VOS valid J&F | Ref-DAVIS test J&F | ReasonVOS J&F |
|---|---|---|---|---|---|
| API only | | | | | |
| GPT-5 | 23.4 | 26.5 | 30.9 | 25.2 | 24.7 |
| GPT-5 mini | 15.7 | 15.4 | 16.2 | 8.4 | 14.6 |
| Gemini 3 Pro | 42.5 | 51.1 | 55.0 | 66.6 | 52.6 |
| Gemini 2.5 Pro | 40.7 | 52.8 | 45.1 | 45.6 | 44.0 |
| Gemini 2.5 Flash | 27.6 | 31.8 | 36.0 | 31.6 | 26.5 |
| Open weights only | | | | | |
| Qwen3-VL-4B | 29.7 | 30.6 | 32.1 | 44.4 | 26.5 |
| Qwen3-VL-8B | 35.1 | 34.4 | 48.3 | 41.0 | 24.9 |
| Specialized open models | | | | | |
| VideoLISA | 44.4 | 53.2 | 63.7 | 68.8 | 47.5 |
| Molmo2 family | | | | | |
| Molmo2-4B | 56.2 | 62.1 | 67.2 | 65.4 | 56.5 |
| Molmo2-8B | 56.1 | 60.4 | 67.8 | 64.5 | 55.6 |
| Molmo2-O-7B | 54.5 | 59.8 | 64.8 | 62.1 | 51.9 |

Important: Comparison with Specialized Models

VideoLISA is a model specialized for Ref-VOS, and it remains competitive with Molmo2, surpassing it on Ref-DAVIS test (68.8 vs. 65.4 for Molmo2-4B). The key difference, however, is that Molmo2 is a general-purpose video understanding model that supports a wide range of tasks including video QA, captioning, and counting.

Explanation of evaluation metrics:

  • J&F: Segmentation mask quality, the average of the Jaccard index J (region IoU) and the contour accuracy F (boundary F-measure)
  • F1, HOTA: Metrics for multi-object tracking accuracy (not reported in the table above)
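
As a concrete reference, the J term is plain region IoU between predicted and ground-truth masks; a minimal sketch follows (the F term requires boundary extraction and matching, so it is omitted here):

```python
import numpy as np

# J is the Jaccard index (region IoU) between predicted and ground-truth
# masks; F (contour accuracy) needs boundary matching and is omitted.
def jaccard(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # two empty masks agree fully

pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True
gt = np.zeros((8, 8), dtype=bool);   gt[3:7, 3:7] = True
print(round(jaccard(pred, gt), 3))  # 0.391 (9 shared pixels / 23 total)
```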

Key results:

  • MeViS: Molmo2-4B achieves a J&F of 56.2, outperforming Gemini 3 Pro (42.5) by 13.7 points
  • Ref-YT-VOS: Molmo2-8B achieves a J&F of 67.8, the highest among open models (surpassing VideoLISA at 63.7)
  • Comparison with Qwen3-VL: Molmo2 achieves roughly 1.6x the performance of Qwen3-VL-8B (35.1 J&F)

Pointing Format: Plain-Text Coordinates

Molmo2 uses plain-text coordinates for video grounding output. This is an approach that achieves grounding using only the LLM’s text generation capabilities, without special tokens or external tools.

Format Example

<points coords="t^1 count_1 x_1 y_1 t^2 count_2 x_2 y_2 t^3 count_3 x_3 y_3">
object_label
</points>

Element descriptions:

  • t^i: Frame index (or timestamp)
  • count_i: The count of the object in that frame (i.e., which instance)
  • x_i, y_i: Normalized coordinates (0.0–1.0)
  • object_label: The name or label of the object
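
A consumer of this format needs only string parsing. Below is a hedged sketch of such a parser (the regex and function name are illustrative, not Molmo2 tooling); it keeps the frame token as a string, since the document writes it symbolically as t^i:

```python
import re

# Illustrative parser for the plain-text <points> format. The frame token
# is kept as a raw string; count is an int, x and y are normalized floats.
POINTS_RE = re.compile(r'<points coords="([^"]*)">\s*(.*?)\s*</points>', re.S)

def parse_points(text):
    """Return a list of (label, [(frame, count, x, y), ...]) tuples."""
    results = []
    for coords, label in POINTS_RE.findall(text):
        nums = coords.split()
        quads = [
            (nums[i], int(nums[i + 1]), float(nums[i + 2]), float(nums[i + 3]))
            for i in range(0, len(nums), 4)
        ]
        results.append((label, quads))
    return results

out = parse_points('<points coords="t^1 1 0.45 0.32 t^2 1 0.48 0.35">waterfall</points>')
print(out)  # [('waterfall', [('t^1', 1, 0.45, 0.32), ('t^2', 1, 0.48, 0.35)])]
```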

Tracking Format

For tracking, a unique ID (count_i) is assigned to each object and maintained across multiple frames.

<points coords="t^1 1 0.45 0.32 t^2 1 0.48 0.35 t^3 1 0.51 0.38">
red car
</points>
<points coords="t^1 2 0.62 0.55 t^2 2 0.65 0.57 t^3 2 0.68 0.59">
blue car
</points>

In this example, 1 identifies the red car and 2 identifies the blue car, with the position recorded in each frame (t^1, t^2, t^3).
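
Grouping such records by ID yields one trajectory per object. A minimal, self-contained sketch using the red/blue car values above (variable names are illustrative):

```python
# Group per-frame (frame, id, x, y) records into trajectories keyed by
# object ID; the data mirrors the red/blue car example.
records = [
    ("t^1", 1, 0.45, 0.32), ("t^2", 1, 0.48, 0.35), ("t^3", 1, 0.51, 0.38),
    ("t^1", 2, 0.62, 0.55), ("t^2", 2, 0.65, 0.57), ("t^3", 2, 0.68, 0.59),
]

trajectories = {}
for frame, obj_id, x, y in records:
    trajectories.setdefault(obj_id, []).append((frame, x, y))

print(trajectories[1])  # [('t^1', 0.45, 0.32), ('t^2', 0.48, 0.35), ('t^3', 0.51, 0.38)]
```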

Note: Advantages of Plain-Text Coordinates

  • Simple: No special tokens or external tools required
  • Flexible: Directly leverages the LLM’s generation capabilities
  • Scalable: Naturally extends to multiple objects and multiple frames
  • Human-readable: Easy to debug and analyze

On the other hand, coordinate accuracy depends on the LLM’s text generation precision. For applications requiring very high coordinate accuracy, approaches that add a dedicated head (e.g., Grounding-DINO) may be more advantageous.

Summary

Molmo2 delivers Video Grounding as a new capability in a fully open model.

Key achievements:

  1. Two grounding capabilities:
    • Video Pointing: Per-frame location information and counting
    • Video Tracking: Temporal trajectory tracking of objects
  2. Large-scale human-annotated datasets:
    • Molmo2-VideoPoint: 650k queries across 8 diverse categories
    • Molmo2-VideoTrack: 15k queries with an average of 2.28 objects per query
  3. Leveraging Academic datasets:
    • Existing open-source datasets converted to Pointing/Tracking format
    • 49k Pointing QA pairs, 130k Tracking queries
  4. Outperforming proprietary models:
    • Video Pointing F1 Score of 39.9 (roughly 2x Gemini 3 Pro)
    • Video Tracking J&F of 56.2 (13.7 points above Gemini 3 Pro)
  5. Plain-Text Coordinates format:
    • A simple and extensible output format
    • Directly leverages the LLM’s generation capabilities

Molmo2’s video grounding capabilities pave the way for a wide range of practical applications, including robotics, video search, and generated video quality assessment.