Fast-ThinkAct:
Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

1 NVIDIA   2 National Taiwan University   3 University of Illinois Urbana-Champaign
Fast-ThinkAct-Teaser

We introduce Fast-ThinkAct, an efficient reasoning VLA framework that compresses verbose textual reasoning into compact latent CoTs. Through verbalizable latent reasoning and action-aligned visual plan distillation, Fast-ThinkAct achieves up to 9.3x inference speedup while maintaining strong reasoning capabilities for embodied AI tasks.

Abstract

Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective that aligns manipulation trajectories and transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% lower inference latency than state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
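
As a quick sanity check relating the two efficiency figures quoted on this page, and assuming the latency reduction is measured against the textual-reasoning baseline, the corresponding speedup is

\[
\text{speedup} = \frac{t_{\text{baseline}}}{t_{\text{Fast-ThinkAct}}} = \frac{1}{1 - 0.893} \approx 9.3\times,
\]

which matches the up-to-9.3x speedup stated in the teaser above.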

Method

Fast-ThinkAct-Method

Overview of Fast-ThinkAct. (a) Given observation \( o_t \) and instruction \( l \), we train a teacher model \( \mathcal{F}_\theta^T \) with textual reasoning using action-aligned visual rewards, then distill it into a student model \( \mathcal{F}_\theta \) that performs compact latent reasoning \( \mathbf{z} \). A verbalizer \( \mathcal{V}_\psi \) provides preference-guided supervision by decoding latents into language. (b) The action model \( \pi_\phi \) learns from the student's visual plan representations, achieving efficient reasoning-enhanced manipulation.
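
To make the data flow in (a)-(b) concrete, below is a minimal PyTorch-style sketch, assuming generic transformer components and toy dimensions. The module names, sizes, and the plain distillation/regression losses are illustrative stand-ins, not the paper's preference-guided objective or action-aligned visual rewards.

# Minimal sketch of the latent-reasoning pipeline in (a)-(b); all names,
# shapes, and losses here are hypothetical, not the released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, N_LATENT, VOCAB, ACT_DIM = 512, 8, 32000, 7   # hypothetical sizes

class StudentReasoner(nn.Module):
    """F_theta: compresses (observation, instruction) features into a compact latent CoT z."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.latent_queries = nn.Parameter(torch.randn(N_LATENT, D) * 0.02)

    def forward(self, obs_tokens, instr_tokens):
        # Append learnable latent query slots to the multimodal context.
        b = obs_tokens.size(0)
        queries = self.latent_queries.unsqueeze(0).expand(b, -1, -1)
        h = self.encoder(torch.cat([obs_tokens, instr_tokens, queries], dim=1))
        return h[:, -N_LATENT:]                      # latent CoT z: (B, N_LATENT, D)

class Verbalizer(nn.Module):
    """V_psi: decodes latent CoT slots into language logits for supervision and inspection."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(D, VOCAB)

    def forward(self, z):
        return self.head(z)                          # (B, N_LATENT, VOCAB)

class ActionHead(nn.Module):
    """pi_phi: predicts an action conditioned on the pooled latent plan."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, ACT_DIM))

    def forward(self, z):
        return self.mlp(z.mean(dim=1))               # (B, ACT_DIM)

# One illustrative training step on synthetic tensors.
student, verbalizer, policy = StudentReasoner(), Verbalizer(), ActionHead()
obs, instr = torch.randn(2, 64, D), torch.randn(2, 16, D)     # features of o_t and l
teacher_tokens = torch.randint(0, VOCAB, (2, N_LATENT))       # distilled textual CoT targets
expert_action = torch.randn(2, ACT_DIM)                       # trajectory supervision

z = student(obs, instr)
loss_distill = F.cross_entropy(verbalizer(z).flatten(0, 1), teacher_tokens.flatten())
loss_action = F.mse_loss(policy(z), expert_action)
(loss_distill + loss_action).backward()

In this sketch, the verbalizer supplies language-level supervision while the policy consumes the latent plan directly, which is what keeps the reasoning trace compact.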

Experiment Results

LIBERO, SimplerEnv, and Inference Latency

(a)-(e) Quantitative comparisons of robot manipulation tasks on SimplerEnv and LIBERO benchmarks. (f) Latency comparison across 3B and 7B reasoning VLAs.

Fast-ThinkAct-Simpler-Libero

RoboTwin2.0

Quantitative comparisons of bimanual manipulation tasks on RoboTwin2.0. Background colors indicate task length based on expert demonstrations: short (80-100), medium (110-220), and long (270-470) steps.

Fast-ThinkAct-RoboTwin2.0

Embodied Reasoning Benchmarks

Quantitative comparisons of embodied reasoning tasks on EgoPlan-Bench2, RoboVQA, and OpenEQA.

Fast-ThinkAct-ER

Visualization

Diverse Scene and Long-horizon Robot Manipulation Tasks

Qualitative results of visualized reasoning traces and task execution.

Qualitative Comparisons of Manipulation Results

Qualitative comparisons of robot execution results between RDT (the base action model) and Fast-ThinkAct.

Visualization of Verbalized Reasoning

The textual reasoning from the teacher model sometimes contains redundant sentences and incorrect content, whereas the verbalized latent reasoning is more compact and accurate.
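
A tiny illustration of how such verbalized reasoning can be read out for inspection, assuming a verbalizer that emits token logits over a toy vocabulary; the vocabulary, shapes, and greedy decoding here are hypothetical, not the actual decoding procedure.

import torch

# Hypothetical toy vocabulary; stands in for the language model's tokenizer.
toy_vocab = ["pick", "up", "the", "red", "block", "place", "on", "plate", "<eos>"]

# Stand-in for V_psi(z): logits over the vocabulary for each of 8 latent slots.
latent_logits = torch.randn(8, len(toy_vocab))

# Greedy decoding per latent slot yields a short, human-readable trace,
# while the action model consumes the latent plan directly.
token_ids = latent_logits.argmax(dim=-1)
print(" ".join(toy_vocab[i] for i in token_ids.tolist()))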

Failure Recovery

(Left) Qualitative examples of corrective guidance for manipulation errors (simulation and real robot). (Right) Quantitative evaluation on RoboFAC-Sim and RoboFAC-Real benchmarks.

Failure Recovery

Few-shot Adaptation

Few-shot adaptation results on RoboTwin2.0. We use 10 demonstrations per task for fine-tuning.

Fast-ThinkAct-Few-shot

BibTeX

@article{huang2026fast,
  title={Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning},
  author={Huang, Chi-Pin and Man, Yunze and Yu, Zhiding and Chen, Min-Hung and Kautz, Jan and Wang, Yu-Chiang Frank and Yang, Fu-En},
  journal={arXiv preprint arXiv:2601.09708},
  year={2026}
}