Tong Zhang 张彤

I received my Ph.D. from the Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University, advised by Prof. Yang Gao. I previously obtained my bachelor’s degree from the Department of Electronic Engineering at Tsinghua University.

During my Ph.D., I spent six months as a visiting scholar at UC Berkeley, advised by Koushil Sreenath in the Hybrid Robotics Group (HRG) and the Berkeley Artificial Intelligence Research Lab (BAIR).

My research focuses on Embodied AI, which lies at the intersection of Artificial Intelligence and Robotics.

Email: tongzhangthu [AT] gmail.com

Email  /  Google Scholar  /  GitHub


News

  • [2025.08] One paper (HuB) is accepted at CoRL 2025 (Oral Presentation).
  • [2025.06] One paper (HuB) is accepted at RSS Workshop on Whole-body Control and Bimanual Manipulation, 2025.
  • [2024.09] Three papers (SGRv2, RLFP, and General Flow) are accepted at CoRL 2024.
  • [2023.12] Invited oral presentation at DAI 2023.
  • [2023.10] Invited talk at RLChina.
  • [2023.08] One paper (SGR) is accepted at CoRL 2023.
  • [2023.04] One paper (SGN) is accepted at CVPR 2023 Workshop on 3D Vision and Robotics.

Publications

    OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation
    Yingdong Hu*, Haodong Zhu*, Boyuan Zheng*, Yihang Hu*, Tong Zhang*, Zunhao Chen, Junming Zhao, Ruiqian Nai, Yang Gao
    arXiv, 2026
    project page / arXiv / code

    We present OpenHLM, an open-source recipe for whole-body humanoid loco-manipulation. Through controlled studies on whole-body teleoperation, VLA policy design, and heterogeneous co-training, OpenHLM outperforms GR00T N1.6 and Psi0 with less than half the demonstration time.

    Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
    Haoqi Yuan*, Zhixuan Liang*, Anzhe Chen*, Ye Wang*, Haoyang Li*, Pei Lin*, Yiyang Huang*, Zixing Lei*, Tong Zhang*, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu§, Xiong-Hui Chen§
    Technical Report, 2026
    project page / arXiv / blog

    We present Qwen-RobotManip, a Vision-Language-Action foundation model for robotic manipulation that studies how unified alignment across heterogeneous robot data enables scalable multi-source training. The model is evaluated on a range of robotic benchmarks and ranks 1st on the RoboChallenge Table30 v1 Generalist Track, demonstrating strong generalization across unseen tasks, scenes, and robot platforms.

    Humanoid Manipulation Interface: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations
    Ruiqian Nai*, Boyuan Zheng*, Junming Zhao*, Haodong Zhu, Sicong Dai, Zunhao Chen, Yihang Hu, Yingdong Hu, Tong Zhang, Chuan Wen, Yang Gao
    arXiv, 2026
    project page / arXiv / X summary

    We present HuMI, a portable framework for humanoid whole-body learning via robot-free data collection. HuMI achieves 3x higher efficiency than teleoperation and a 70% success rate across diverse tasks and unseen environments.

    HuB: Learning Extreme Humanoid Balance
    Tong Zhang*, Boyuan Zheng*, Ruiqian Nai, Yingdong Hu, Yen-Jen Wang, Geng Chen, Fanqi Lin, Jiongye Li, Chuye Hong, Koushil Sreenath, Yang Gao
    CoRL, 2025 (Oral Presentation)
    RSS Workshop on Whole-body Control and Bimanual Manipulation, 2025
    project page / arXiv / X summary

    We propose HuB (Humanoid Balance), a framework that enables humanoids to perform challenging quasi-static balance tasks, including extreme single-legged poses such as the Swallow Balance and Bruce Lee's Kick.

    Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation
    Tong Zhang, Yingdong Hu, Jiacheng You, Yang Gao
    CoRL, 2024
    project page / arXiv / code / X summary

    We introduce SGRv2, an imitation learning framework that enhances sample efficiency through improved visual and action representations. Central to the design of SGRv2 is a critical inductive bias, action locality, which posits that a robot's actions are predominantly influenced by the target object and its interactions with the local environment.

    Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own
    Weirui Ye, Yunsheng Zhang, Haoyang Weng, Xianfan Gu, Shengjie Wang, Tong Zhang, Mengchen Wang, Pieter Abbeel, Yang Gao
    CoRL, 2024 (Oral Presentation)
    project page / arXiv / code

    We propose Reinforcement Learning with Foundation Priors (RLFP) to utilize guidance and feedback from policy, value, and success-reward foundation models. Within this framework, we introduce the Foundation-guided Actor-Critic (FAC) algorithm, which enables embodied agents to explore more efficiently with automatic reward functions.

    General Flow as Foundation Affordance for Scalable Robot Learning
    Chengbo Yuan, Chuan Wen, Tong Zhang, Yang Gao
    CoRL, 2024
    project page / arXiv / code

    We build a 3D flow prediction model directly from large-scale RGBD human video datasets. Based on this model, we achieve stable zero-shot human-to-robot skill transfer in the real world.

    Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning
    Yingdong Hu*, Fanqi Lin*, Tong Zhang, Li Yi, Yang Gao
    ICRA Workshop on Vision-Language Models for Navigation and Manipulation, 2024
    project page / arXiv

    We introduce ViLa, a novel approach for long-horizon robotic planning that leverages GPT-4V to generate a sequence of actionable steps. ViLa empowers robots to execute complex tasks with a profound understanding of the visual world.

    A Universal Semantic-Geometric Representation for Robotic Manipulation
    Tong Zhang*, Yingdong Hu*, Hanchen Cui, Hang Zhao, Yang Gao
    CoRL, 2023
    CVPR Workshop on 3D Vision and Robotics, 2023
    project page / arXiv / code

    We present Semantic-Geometric Representation (SGR), a universal perception module for robotics that leverages the rich semantic information of large-scale pre-trained 2D models and inherits the merits of 3D spatial reasoning.