ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

1 NVIDIA     2 National Taiwan University
*Work done during internship
ThinkAct-Teaser

We introduce ThinkAct, a reasoning VLA framework capable of thinking before acting. Through reasoning reinforced by our action-aligned visual feedback, ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction in embodied tasks.

Abstract

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
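
To make the dual-system design concrete, the following is a minimal PyTorch sketch of the inference flow: a reasoning module compresses the observation and instruction features into a visual plan latent, which then conditions an action model that predicts a chunk of actions. The module structures, feature dimensions, and latent size here are illustrative assumptions, not the released ThinkAct implementation.

import torch
import torch.nn as nn

class ReasoningModule(nn.Module):
    # Stand-in for the reasoning MLLM F_theta: maps (observation, instruction)
    # features to a compact visual plan latent c_t.
    def __init__(self, obs_dim=512, text_dim=512, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, obs_feat, text_feat):
        return self.encoder(torch.cat([obs_feat, text_feat], dim=-1))

class ActionModel(nn.Module):
    # Stand-in for the DiT-based action model pi_phi: predicts an action chunk
    # conditioned on the current observation and the plan latent c_t.
    def __init__(self, obs_dim=512, latent_dim=64, action_dim=7, horizon=8):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.head = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim * horizon)
        )

    def forward(self, obs_feat, plan_latent):
        out = self.head(torch.cat([obs_feat, plan_latent], dim=-1))
        return out.view(-1, self.horizon, self.action_dim)

# Slow thinking: compute the plan latent once, then reuse it for fast control.
reasoner, actor = ReasoningModule(), ActionModel()
obs_feat, text_feat = torch.randn(1, 512), torch.randn(1, 512)
with torch.no_grad():
    c_t = reasoner(obs_feat, text_feat)   # reasoning step (infrequent)
    actions = actor(obs_feat, c_t)        # control step (every timestep)
print(actions.shape)  # torch.Size([1, 8, 7])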

Method

ThinkAct-Method

Overview of ThinkAct. (a) Given observation \( o_t \) and instruction \( l \), ThinkAct leverages action-aligned rewards derived from the visual trajectory \( \tau \) to incentivize the embodied reasoning capability of the reasoning MLLM \( F_\theta \). (b) Conditioned on the visual plan latent \( c_t \), the DiT-based action model \( \pi_\phi \) learns to predict executable actions while keeping \( F_\theta \) frozen. Note that, during inference, \( \pi_\phi \) and \( F_\theta \) can operate asynchronously, enabling slow thinking and fast control for VLA reasoning tasks.
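
As a simplified view of the action-aligned reward in (a), the sketch below scores a predicted 2D visual trajectory against a reference one with a goal-completion term and a trajectory-consistency term. The exponential shaping, equal weights, and pointwise distance are assumptions for illustration; they are not the exact reward formulation used in ThinkAct.

import numpy as np

def action_aligned_reward(pred_traj, ref_traj, w_goal=0.5, w_traj=0.5):
    # pred_traj, ref_traj: (T, 2) arrays of 2D waypoints in the image plane.
    # Goal completion: how close the predicted endpoint lands to the reference goal.
    goal_err = np.linalg.norm(pred_traj[-1] - ref_traj[-1])
    r_goal = np.exp(-goal_err)
    # Trajectory consistency: average pointwise deviation from the reference path.
    traj_err = np.linalg.norm(pred_traj - ref_traj, axis=-1).mean()
    r_traj = np.exp(-traj_err)
    return w_goal * r_goal + w_traj * r_traj

pred = np.array([[0.1, 0.1], [0.4, 0.5], [0.8, 0.9]])
ref  = np.array([[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]])
print(action_aligned_reward(pred, ref))  # approaches 1.0 for well-aligned trajectories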

Experiment Results

Robot Manipulation Tasks

Quantitative comparisons on robot manipulation tasks from the SimplerEnv and LIBERO benchmarks. Bold denotes the best result.

ThinkAct-Robot-Table

Embodied Reasoning Tasks

Quantitative comparisons on embodied reasoning tasks from the EgoPlan-Bench2, RoboVQA, and OpenEQA benchmarks. Note that Qwen2.5-VL* indicates fine-tuning the original Qwen2.5-VL on the EgoPlan-IT and RoboVQA datasets. Bold denotes the best result.

ThinkAct-QA-Table

Few-shot Adaptation

Few-shot adaptation results on LIBERO. We use 5 and 10 demonstrations per task for fine-tuning.

ThinkAct-Few-shot
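
As a rough sketch of what this adaptation could look like in code, the loop below fine-tunes only the downstream action model on a handful of demonstrations while the reasoning module stays frozen. The demonstration format, the L2 imitation loss, and the optimizer settings are illustrative assumptions rather than the actual training recipe.

import torch
import torch.nn as nn

def finetune_action_model(actor, demos, epochs=10, lr=1e-4):
    # demos: list of (obs_feat, plan_latent, target_actions) tensors collected
    # from the few demonstrations; the frozen reasoning module only supplies
    # the plan latents.
    optimizer = torch.optim.AdamW(actor.parameters(), lr=lr)
    for _ in range(epochs):
        for obs_feat, plan_latent, target_actions in demos:
            pred_actions = actor(obs_feat, plan_latent)
            loss = nn.functional.mse_loss(pred_actions, target_actions)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return actor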

Visualization

Diverse Scene and Long-horizon Robot Manipulation Tasks

Qualitative results of intermediate reasoning steps and visualized trajectories for robot manipulation tasks on the SimplerEnv and LIBERO benchmarks.

Reflection & Self-correction

Demonstration of the self-reflection and correction capability of ThinkAct. The reasoning MLLM identifies the failure and generates a revised plan that guides the action model to complete the task.
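
Schematically, this reflect-and-replan behavior can be pictured as the loop below: whenever an execution attempt fails, the reasoning module is re-invoked with failure feedback to produce a revised plan. The reasoner, actor, and env interfaces (plan, act, step, describe_failure) are hypothetical placeholders, not the actual ThinkAct API.

def run_with_self_correction(reasoner, actor, env, instruction, max_replans=2):
    # Hypothetical interfaces: reasoner.plan / reasoner.describe_failure,
    # actor.act, env.reset / env.step; used only to sketch the control flow.
    obs = env.reset()
    feedback = None
    for _ in range(max_replans + 1):
        # Slow thinking: (re)plan, optionally conditioned on failure feedback.
        plan_latent = reasoner.plan(obs, instruction, feedback=feedback)
        done, success = False, False
        while not done:
            # Fast control: execute the current plan until the episode ends.
            action = actor.act(obs, plan_latent)
            obs, done, success = env.step(action)
        if success:
            return True
        # Reflection: summarize what went wrong and try a revised plan.
        feedback = reasoner.describe_failure(obs, instruction)
    return False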

Qualitative Comparisons of Manipulation Results

Qualitative comparisons of robot execution results among DiT-Policy, OpenVLA, and ThinkAct.

“Put carrot on plate”
“Close bottom drawer”
“Pick up the black bowl next to the cookie box and place it on the plate”
“Pick up the ketchup and place it in the basket”
“Open the top drawer and put the bowl inside”
“Turn on the stove and put the moka pot on it”

Embodied Reasoning

Visualization of the reasoning process and the derived answers for embodied reasoning tasks on the EgoPlan-Bench2, OpenEQA, and RoboVQA benchmarks.

BibTeX

@article{huang2025thinkact,
  title={ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning},
  author={Huang, Chi-Pin and Wu, Yueh-Hua and Chen, Min-Hung and Wang, Yu-Chiang Frank and Yang, Fu-En},
  journal={arXiv preprint arXiv:2507.16815},
  year={2025}
}