Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, such models suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher model, driven by a preference-guided objective that aligns manipulation trajectories and transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3\% lower inference latency than state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
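For intuition only, a preference-guided alignment of this kind is commonly written as a Bradley-Terry-style objective over preferred and dispreferred trajectory pairs; the form below is an illustrative placeholder with assumed notation (reward model \( r_\omega \), preferred trajectory \( \tau^{+} \), dispreferred trajectory \( \tau^{-} \)), not the exact loss used by Fast-ThinkAct.
\[
\mathcal{L}_{\text{pref}} \;=\; -\,\mathbb{E}_{(\tau^{+},\,\tau^{-})}\!\left[ \log \sigma\!\big( r_\omega(\tau^{+}) - r_\omega(\tau^{-}) \big) \right]
\]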
Overview of Fast-ThinkAct. (a) Given observation \( o_t \) and instruction \( l \), we train a teacher model \( \mathcal{F}_\theta^T \) with textual reasoning using action-aligned visual rewards, then distill it into a student model \( \mathcal{F}_\theta \) that performs compact latent reasoning \( \mathbf{z} \). A verbalizer \( \mathcal{V}_\psi \) provides preference-guided supervision by decoding latents into language. (b) The action model \( \pi_\phi \) learns from the student's visual plan representations, achieving efficient reasoning-enhanced manipulation.
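As a rough mental model only, the training flow in (a) and (b) can be sketched in a few lines of PyTorch. Every module name, feature shape, and loss below is a hypothetical placeholder, and the verbalizer's preference-guided supervision is simplified to plain token-level distillation; this is not the authors' implementation.

# Minimal, illustrative sketch of the training flow above; all module names,
# shapes, and losses are hypothetical placeholders, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256          # shared hidden size (assumed)
N_LATENT = 8     # number of latent reasoning tokens z (assumed)
VOCAB = 1000     # verbalizer vocabulary size (assumed)
ACT_DIM = 7      # action dimension, e.g. end-effector pose + gripper (assumed)

class StudentReasoner(nn.Module):
    """F_theta: maps (observation, instruction) features to compact latent CoT z."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * D, D), nn.GELU())
        self.latent_head = nn.Linear(D, N_LATENT * D)
    def forward(self, obs_feat, lang_feat):
        h = self.encoder(torch.cat([obs_feat, lang_feat], dim=-1))
        return self.latent_head(h).view(-1, N_LATENT, D)       # latent reasoning z

class Verbalizer(nn.Module):
    """V_psi: decodes latent reasoning z into language-token logits for supervision."""
    def __init__(self):
        super().__init__()
        self.decoder = nn.Linear(D, VOCAB)
    def forward(self, z):
        return self.decoder(z)                                  # (B, N_LATENT, VOCAB)

class ActionPolicy(nn.Module):
    """pi_phi: predicts an action from the student's plan representation."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, ACT_DIM))
    def forward(self, z, obs_feat):
        plan = z.mean(dim=1) + obs_feat                         # crude plan pooling
        return self.head(plan)

# One illustrative training step; teacher reasoning tokens (from F_theta^T) and
# expert actions are assumed to be precomputed.
student, verbalizer, policy = StudentReasoner(), Verbalizer(), ActionPolicy()
params = list(student.parameters()) + list(verbalizer.parameters()) + list(policy.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)

B = 4
obs_feat = torch.randn(B, D)                             # visual features of o_t (placeholder)
lang_feat = torch.randn(B, D)                            # instruction features of l (placeholder)
teacher_tokens = torch.randint(0, VOCAB, (B, N_LATENT))  # teacher CoT tokens (placeholder)
expert_action = torch.randn(B, ACT_DIM)                  # demonstration action (placeholder)

z = student(obs_feat, lang_feat)                               # compact latent reasoning
distill_loss = F.cross_entropy(verbalizer(z).flatten(0, 1),    # verbalized latents should
                               teacher_tokens.flatten())       # match teacher reasoning
action_loss = F.mse_loss(policy(z, obs_feat), expert_action)   # reasoning-enhanced control
(distill_loss + action_loss).backward()
opt.step()

At inference, a sketch like this would run only the student \( \mathcal{F}_\theta \) and the action model \( \pi_\phi \), skipping textual CoT decoding, which is presumably where the latency savings come from.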
(a)-(e) Quantitative comparisons of robot manipulation tasks on SimplerEnv and LIBERO benchmarks. (f) Latency comparison across 3B and 7B reasoning VLAs.
Quantitative comparisons of bimanual manipulation tasks on RoboTwin2.0. Background colors indicate task length based on expert demonstrations: short (80-100), medium (110-220), and long (270-470) steps.
Quantitative comparisons of embodied reasoning tasks on EgoPlan-Bench2, RoboVQA, and OpenEQA.
Qualitative results of visualized trace and task execution results.
Qualitative comparisons of robot execution results between RDT (the base action model) and Fast-ThinkAct.
"Pick up the black bowl on the stove and place it on the plate"
"Pick up the alphabet soup and place it in the basket"
"Place red block in the middle, then stack green block on it"
"Start stacking the small smooth brown-rimmed bowl from the bottom to the top"
"Put the bowl on the plate"
"Put the white mug on the left plate and put the yellow and white mug on the right plate"
"Use the left arm to press the smooth metal bell's top center"
"Take hold of the microphone with cylindrical handle and hand it over"
The textual reasoning from the teacher model sometimes contains redundant sentences and incorrect content, whereas the verbalized reasoning demonstrates a more compact and accurate reasoning process.
Question: Pick up the glass. Is it satisfied?
GT: yes
Textual reasoning (from teacher model)
Let me think about this. I can see a cluttered table with several objects scattered around. The robot arm moves slowly toward the glass, adjusting its position to avoid bumping into nearby items. The gripper closes gently around the glass and lifts it from the table with a steady motion. The glass is held securely and remains stable in the air. Everything looks correct, so the action is satisfied.
Verbalized reasoning
Let me think. The robot arm moves carefully over the table, aligns with the glass, and grips it. The glass lifts smoothly without tipping or slipping. It looks like the task is successful and satisfied.
Question: Is this room on the ground floor of the building?
GT: no
Textual reasoning (from teacher model)
Let's carefully analyze the image to determine if this room is on the ground floor of the building. First, I notice that there is a window with a view of greenery outside. This suggests we are at a level that allows natural light to filter in, which is typical for ground floor windows. Next, I see a white desk and a chair, which seem like standard office furniture for a ground floor workspace. ... (67 words omitted) ... Let's also consider the lighting. The natural light coming through the window seems to be coming from the side, which is more likely if we are on the ground floor and looking towards a window that is part of the building's exterior. Final thought: ... (11 words omitted) ... it's reasonable to conclude that this room is on the ground floor.
Verbalized reasoning
Let's break this down. ... (40 words omitted) ... I should also consider if there's any sign of elevation. Hmm, it seems like the view outside the window shows that the room is positioned above the ground. Oh, I see, the perspective through the window suggests it's not on the ground floor. Let's confirm that. Yes, it checks out.
(Left) Qualitative examples of corrective guidance for manipulation errors (simulation and real robot). (Right) Quantitative evaluation on RoboFAC-Sim and RoboFAC-Real benchmarks.
Few-shot adaptation results on RoboTwin2.0. We use 10 demonstrations per task for fine-tuning.
@article{huang2026fast,
title={Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning},
author={Huang, Chi-Pin and Man, Yunze and Yu, Zhiding and Chen, Min-Hung and Kautz, Jan and Wang, Yu-Chiang Frank and Yang, Fu-En},
journal={arXiv preprint arXiv:2601.09708},
year={2026}
}