We present Hume, a dual-system vision-language-action model that explores human-like thinking capabilities for dexterous robot control. Equipped with value-guided System-2 thinking and cascaded action denoising, Hume attains strong complex reasoning and control capabilities, achieving state-of-the-art performance across a diverse range of evaluations and significant improvements on complex robot control tasks.
The pipeline of Hume. Hume contains two systems working asynchronously. Given an observation, System 2 first generates \(N\) candidate action chunks at different noise levels; the best-of-N candidate with the highest \(Q\) value is selected as the optimal candidate \(\mathbf{A}_{t}^{\tau^*}\), which is segmented and passed to System 1 for continuous action denoising.
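The following is a minimal sketch of the value-guided best-of-N selection described above. `system2`, `q_critic`, and `system1` are hypothetical stand-ins for Hume's modules; their names, signatures, the noise-level schedule, and the segment length are assumptions for illustration, not the released implementation.

```python
def select_and_refine(observation, system2, q_critic, system1,
                      noise_levels=(0.2, 0.4, 0.6, 0.8), chunk_len=8):
    """Generate N candidate action chunks at different noise levels,
    keep the one with the highest Q value, then hand its first segment
    to System 1 for continuous denoising."""
    candidates = []
    for tau in noise_levels:
        # System 2: sample a candidate action chunk A_t^tau at noise level tau.
        chunk = system2.sample_chunk(observation, noise_level=tau)
        # Score the (observation, chunk) pair with the learned value function.
        q_value = q_critic(observation, chunk)
        candidates.append((q_value, chunk))

    # Best-of-N: the candidate with the highest Q value becomes A_t^{tau*}.
    _, best_chunk = max(candidates, key=lambda c: c[0])

    # Segment the selected chunk and let System 1 denoise it further
    # before execution on the robot.
    segment = best_chunk[:chunk_len]
    return system1.denoise(observation, segment)
```

Because the two systems run asynchronously, a loop of this kind would in practice be invoked by System 2 at a lower frequency, while System 1 refines and emits actions at control rate.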
We evaluate Hume across 3 simulation environments and 3 different real-world robotic platforms, covering 15 robot learning scenarios and 21 real-world manipulation tasks.
put carrot on plate
put cup on white plate
put cup on pink cloth
put eggplant in basket
close microwave
lift red pepper
put banana in basket
put pot on cutting board
push handle aside
put penguin on toy car
put tiger on toy car
put blue cube on toy car
put green cube on toy car
put red cube on toy car
pass water
pour water
restock
fold shorts
When a failure occurs, such as missing the grasp position, other policies remain stuck in the failure state, whereas Hume selects a corrective action through value-guided thinking, recovering from the failure and successfully completing the task.
Put banana in basket
Put handle aside
Put tiger on toy car
Put blue cube on toy car
Put green cube on toy car
Put red cube on toy car
@article{song2025hume,
title={Hume: Introducing System-2 Thinking in Visual-Language-Action Model},
author={Anonymous Authors},
journal={arXiv preprint arXiv:2505.21432},
year={2025}
}