NOW LET US – AI RAG SaaS Studio TP.HCM
NOW LET US
Digital Product Studio
Back to news
DEV-TOOLS...4 min read

MolmoMotion: Language-guided 3D motion forecasting

Share
NOW LET US Article – MolmoMotion: Language-guided 3D motion forecasting

MolmoMotion is a groundbreaking AI model that predicts future 3D trajectories of objects based on a single frame and textual instructions, released alongside the massive MolmoMotion-1M dataset and PointMotionBench.

Machines have become remarkably good at perceiving motion. Given a video, modern models can track how objects and points move through a scene with exceptionally high confidence. But perception is inherently retrospective: it explains motion that has already happened. Many of the systems and applications we want to build need to look forward instead. A robot reaching for a cup has to anticipate how the cup will move before it touches it. A video generator has to know what realistic motion comes next if it's going to produce physically plausible frames.

Predicting motion is harder than observing it, but it's also far more useful in many scenarios.

This idea was the motivation behind MolmoMotion, a new motion forecasting model we're releasing today. Given a video frame, 3D points marked on an object, and written instructions describing the intended action (e.g., “Move and rotate the wooden bowl with fruit on the table”), MolmoMotion predicts where those points will move over the next few seconds in 3D space—achieving substantially stronger performance than existing forecasting methods.

Alongside the model, we're publishing MolmoMotion-1M, the largest collection of 3D point trajectories paired with action descriptions, drawn from 1.16M videos. We're also releasing PointMotionBench, a human-validated benchmark designed to measure object-centric 3D motion forecasting accuracy, containing 2.7K video clips.

We find that motion forecasters like MolmoMotion can be useful across a range of downstream tasks, from robot planning to controllable video generation. We're releasing the model weights, the MolmoMotion-1M dataset, and our PointMotionBench benchmark openly for the community to study, improve, and customize.

MolmoMotion represents motion in a deliberate, highly efficient way: as object-attached 3D points in world space, which capture motion without the cost of rendering full video. We chose it because we needed a general motion representation with three properties:

  • Class-agnostic: not tied to templates for human bodies, hands, rigid objects, or any other fixed category.
  • View-stable: the same physical motion should be represented consistently across cameras and viewpoints.
  • Directly usable by downstream systems that need to reason about physical motion.

Among the representations we considered, it was the only one that satisfied all three. A sparse set of surface points can describe rigid, articulated, and (within limits) deformable motion without assuming the type of object being moved. Because the points live in a shared world frame, their trajectories remain stable across camera motion and viewpoint change. And because they're compact explicit trajectories in 3D space, they can be passed directly to systems such as robot policies or video generation models.

To forecast those trajectories, MolmoMotion uses Molmo 2 as its backbone, allowing it to connect language instructions to objects and points in an image. Given a short video history, an action description, and a set of query points with their initial 3D positions, the model first identifies the object being referred to, the query points, and the motion the instruction describes. It then predicts the future 3D trajectory of each point.

We train two variants of MolmoMotion:

  • The autoregressive variant (MolmoMotion-AR) predicts future coordinates step by step. It represents 3D coordinates as structured text, following the coordinate-style prediction used by VLMs, and writes out the future trajectory in temporal order. Because each new coordinate is conditioned on the trajectory already generated, this encourages smooth rollouts and gives the strongest accuracy when the future path is well-defined.
  • The flow-matching variant (MolmoMotion-FM) predicts trajectories in continuous 3D space by transforming noise into motion, which makes it better suited for representing uncertainty when an instruction admits multiple plausible futures.

To train MolmoMotion, we needed data that didn’t yet exist: large-scale videos with 3D point trajectories grounded to specific objects and paired with action descriptions. Existing 3D-track datasets were small and domain-limited, and while internet videos have all the scale and diversity we wanted for a forecaster like MolmoMotion, they didn’t include 3D annotations. So we built an automatic pipeline that extracts object-grounded 3D trajectories from unconstrained video.

Given an input video and its action description, our annotation pipeline produces object-grounded 3D point trajectories in metric world coordinates. The challenging part is that raw tracks from unconstrained video are noisy – with depth and tracking errors that leave points jittering and drifting – and that objects often stay still for much of a video. To make the data more trustworthy, we filter out points that don't move coherently with the rest of the object, smooth the remaining trajectories, and segment each clip to the window where the object actually moves.

Running our pipeline at scale yielded MolmoMotion-1M—to our knowledge the largest corpus of action-described, object-grounded 3D point trajectories assembled to date, spanning 736 motion types and 5.6K distinct objects.

To evaluate MolmoMotion’s forecasting performance, we also built PointMotionBench, a human-validated benchmark of held-out 3D trajectories. It covers 2.7K clips spanning 111 object categories and 61 motion types, including indoor manipulation, egocentric hand-object interaction, and outdoor dynamic scenes. For each clip, models are given the current observation, object query points, and an action description, and are evaluated on how accurately their predicted 3D point trajectories match the object’s actual future motion. This gives us a direct quantitative test of 3D motion forecasting rather than relying on whether a generated point track merely looks plausible.

We evaluate MolmoMotion in three ways. First, we test whether it forecasts future 3D motion more accurately than existing methods. Second, we test whether what it has learned about motion helps a robot carry out manipulation tasks. Third, we test whether that same knowledge can help guide the motion in generated video.

On PointMotionBench, MolmoMotion outperforms all existing 3D motion forecasting methods we tested – including pixel-space video generators, parametric 3D methods, and a simple constant-velocity baseline – across a range of objects.

© 2026 Now Let Us. All rights reserved.

Source: Hugging Face Blog

Advertisement
Ad slot ready: 5887729102

More in this category

NOW LET US Related – Who Owns Your ATProto Identity? Hint: It's Probably Not You

dev-tools

Who Owns Your ATProto Identity? Hint: It's Probably Not You

An in-depth analysis of ATProto's key management system reveals critical security risks, where PDS operators hold absolute control over user identities and actions across the entire ecosystem.

NOW LET US Related – Beyond All Reason (Free Total Annihilation Inspired RTS)

dev-tools

Beyond All Reason (Free Total Annihilation Inspired RTS)

Inspired by the legendary Total Annihilation, Beyond All Reason is a free-to-play RTS game that redefines the genre with unmatched scale and real-time physics simulation.

NOW LET US Related – A 3D voxel game engine written in APL

dev-tools

A 3D voxel game engine written in APL

An experimental project demonstrates the viability of using the APL array programming language to build a 3D voxel game engine, challenging traditional game development workflows.

NOW LET US Related – The 100k Whys of AI

dev-tools

The 100k Whys of AI

While large language models (LLMs) are highly sophisticated, distinguishing AI-generated content remains feasible due to the quasi-deterministic nature of the technology. The flood of repetitive books on Amazon clearly demonstrates how AI tends to replicate itself when faced with similar prompts.

NOW LET US Related – Public Service Announcement: Don't Say You Use AI for Writing

dev-tools

Public Service Announcement: Don't Say You Use AI for Writing

Publicly admitting to using AI for writing can inadvertently ruin your professional reputation. When the line between 'assisting' and 'replacing' becomes blurred, readers and professionals will likely question your actual capabilities.

NOW LET US Related – Building reliable agentic AI systems

dev-tools

Building reliable agentic AI systems

A case study on how Bayer and Thoughtworks built PRINCE, an agentic AI platform that transforms preclinical drug discovery data retrieval using Agentic RAG and Text-to-SQL.

EXPLORE TOPICS

Discover All Categories

Deep dive into the specific technology sectors that matter most to you.