Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans, guided by reinforcement learning with action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution in target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
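To make the reward design concrete, the sketch below illustrates an action-aligned visual reward of the kind the abstract describes, combining a goal-completion term with a trajectory-consistency term. The function name, the exponential/cosine similarity measures, and the equal weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact reward): score a predicted
# visual trajectory against a demonstration with two terms, as in the
# abstract's "goal completion and trajectory consistency".
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors, guarded against zero norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def visual_plan_reward(pred_traj, gt_traj, w_goal=0.5, w_traj=0.5):
    """pred_traj, gt_traj: (T, 2) arrays of predicted / demonstrated 2-D
    visual waypoints (e.g. end-effector positions in image space)."""
    # Goal completion: does the predicted end point reach the demo's end point?
    r_goal = np.exp(-np.linalg.norm(pred_traj[-1] - gt_traj[-1]))
    # Trajectory consistency: step-wise agreement of motion directions.
    d_pred = np.diff(pred_traj, axis=0)
    d_gt = np.diff(gt_traj, axis=0)
    r_traj = np.mean([max(cosine(p, g), 0.0) for p, g in zip(d_pred, d_gt)])
    return w_goal * r_goal + w_traj * r_traj
```

A trajectory identical to the demonstration scores 1.0 under these assumed weights, while a reversed trajectory is penalized on both terms.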
Overview of ThinkAct. (a) Given observation \( o_t \) and instruction \( l \), ThinkAct derives action-aligned rewards from the visual trajectory \( \tau \) to incentivize the embodied reasoning capability of the reasoning MLLM \( F_\theta \). (b) Conditioned on the visual plan latent \( c_t \), the DiT-based Action Model \( \pi_\phi \) learns to predict executable actions while keeping \( F_\theta \) frozen. Note that during inference, \( \pi_\phi \) and \( F_\theta \) can operate asynchronously, enabling slow thinking and fast control for VLA reasoning tasks.
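The asynchronous slow-thinking/fast-control schedule above can be sketched as a simple loop: the reasoning module \( F_\theta \) replans at a low frequency, while the action model \( \pi_\phi \) acts at every control step using the most recent plan latent \( c_t \). All class and function names below are hypothetical placeholders standing in for the actual models, and the replanning interval is an assumed hyperparameter.

```python
# Minimal sketch of a dual-system inference loop in the style of ThinkAct.
# The two "models" are dummy stand-ins; only the scheduling logic is the point.
import numpy as np

LATENT_DIM = 16   # assumed size of the visual plan latent c_t
ACTION_DIM = 7    # assumed action dimensionality (e.g. 7-DoF arm)

class ReasoningMLLM:
    """Stand-in for the frozen reasoning MLLM F_theta (slow thinking)."""
    def plan(self, observation, instruction):
        # Real model: generate a reasoning trace and compress it into c_t.
        # Here: a deterministic pseudo-random latent keyed on the instruction.
        rng = np.random.default_rng(abs(hash(instruction)) % (2**32))
        return rng.standard_normal(LATENT_DIM)

class ActionModel:
    """Stand-in for the DiT-based action policy pi_phi (fast control)."""
    def act(self, observation, plan_latent):
        # Real model: denoise an action chunk conditioned on c_t.
        # Here: a dummy bounded action derived from the latent.
        return np.tanh(plan_latent[:ACTION_DIM])

def run_episode(env_step, instruction, horizon=30, replan_every=10):
    """Asynchronous schedule: F_theta replans every `replan_every` steps;
    pi_phi acts at every control step with the latest latent."""
    mllm, policy = ReasoningMLLM(), ActionModel()
    obs, latent = env_step(None), None   # env_step(None) ~ reset
    for t in range(horizon):
        if t % replan_every == 0:               # slow thinking
            latent = mllm.plan(obs, instruction)
        obs = env_step(policy.act(obs, latent)) # fast control
    return latent
```

Because the action model only reads the latest latent, the control loop never blocks on the slower reasoning pass; in a real system the two would run in separate processes or threads.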
Quantitative comparisons on robot manipulation tasks from the SimplerEnv and LIBERO benchmarks. Bold denotes the best result.
Quantitative comparisons on embodied reasoning tasks from the EgoPlan-Bench2, RoboVQA, and OpenEQA benchmarks. Note that Qwen2.5-VL* denotes the original Qwen2.5-VL fine-tuned on the EgoPlan-IT and RoboVQA datasets. Bold denotes the best result.
Few-shot adaptation results on LIBERO. We use 5 and 10 demonstrations per task for fine-tuning.
Qualitative results showing intermediate reasoning steps and visualized trajectories for robot manipulation tasks on the SimplerEnv and LIBERO benchmarks.
Demonstration of self-reflection and correction capability of ThinkAct. The reasoning MLLM identifies the failure and generates a revised plan that guides the action model to complete the task.
Qualitative comparisons of robot execution results among DiT-Policy, OpenVLA, and ThinkAct.
Visualization of the reasoning process and the derived answers for embodied reasoning tasks on the EgoPlan-Bench2, OpenEQA, and RoboVQA benchmarks.
Question: Considering the progress shown in the video and the observation in the final frame, what should be the next action to complete the task: "prepare baking powder"?
A. close box
B. remove baking powder sachet
C. pour baking powder into basin
D. put salt on table
Answer: C. pour baking powder into basin
<think>
Let me think about this carefully. The person is preparing something in a green bowl. They have already added salt, and now they are dealing with a yellow packet labeled "Mandazi." The action seems to be related to preparing a mixture, possibly for baking.
Hmm, the person is opening the yellow packet and seems to be taking something out. Wait, is it a baking powder? It looks like they are pouring something from the packet into the bowl.
Oh, I see. The person is likely preparing a mixture that requires baking powder. Let's break it down: they have salt and now they are adding something from the packet. It must be the baking powder.
Therefore, the next action should be to pour the baking powder into the basin. This aligns with the process of preparing baking powder as part of the recipe.
</think>
<answer>C</answer>
Question: What shape is the carpet on the floor?
Answer: Rectangular
<think>
Let me think about this for a moment. The image shows a room with a carpet on the floor. I'm trying to figure out its shape.
At first glance, the floral pattern in the center draws attention, which might make the shape seem more complex. But looking closely at the edges, they are straight and meet at right angles. The corners are clearly defined, and the sides are parallel.
So, despite the decorative design, the overall outline of the carpet is rectangular. That seems clear now.
</think>
<answer>The carpet on the floor is rectangular.</answer>
Question: Move the paper to the left. Is it possible right now?
Answer: yes
<think>
Let me think about this for a moment. The image shows a desk with various items on it, and a robotic arm is currently holding a piece of paper above the desk.
The question is whether the paper can be moved to the left right now. Since the paper is already lifted off the surface, the arm has full control of its movement.
Given that, there's nothing preventing it from being moved left. So yes, it's possible.
</think>
<answer>yes</answer>
@article{huang2025thinkact,
title={ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning},
author={Huang, Chi-Pin and Wu, Yueh-Hua and Chen, Min-Hung and Wang, Yu-Chiang Frank and Yang, Fu-En},
journal={arXiv preprint arXiv:2507.16815},
year={2025}
}