Yongming Rao

I am a fifth-year Ph.D. student in the Department of Automation at Tsinghua University, advised by Prof. Jiwen Lu. In 2018, I obtained my B.Eng. from the Department of Electronic Engineering, Tsinghua University.

I am interested in computer vision and deep learning. My current research focuses on:

  • Computation-efficient architectures that enable efficient and generic modeling for both 2D and 3D visual data.
  • Data-efficient representation learning methods that learn discriminative, robust and generalizable representations with less data or fewer annotations.

Email  /  CV  /  Google Scholar  /  GitHub

    News

  • 2023-07: VPD is accepted to ICCV 2023.
  • 2023-04: The journal versions of GFNet and DynamicViT are accepted to T-PAMI.
  • 2022-09: HorNet and P2P are accepted to NeurIPS 2022.
  • 2022-03: The journal version of PointGLR is accepted to T-PAMI.
  • 2022-03: Check out our work at CVPR 2022 on language-guided dense prediction (DenseCLIP), BERT-style point cloud Transformers (Point-BERT) and our new dataset and method for procedure-aware action quality assessment (FineDiving, oral presentation).
  • 2021-10: Our solution based on PoinTr won 1st place in the MVP Completion Challenge (ICCV 2021 Workshop).
  • 2021-09: GFNet and DynamicViT are accepted to NeurIPS 2021.

    Publications

    * indicates equal contribution

    Unleashing Text-to-Image Diffusion Models for Visual Perception
    Wenliang Zhao*, Yongming Rao*, Zuyan Liu*, Benlin Liu, Jie Zhou, Jiwen Lu
    IEEE International Conference on Computer Vision (ICCV), 2023
    [arXiv] [Code] [Project Page] [Rank 1st on NYUv2 Depth Estimation]

    VPD (Visual Perception with Pre-trained Diffusion Models) is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.

    HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions
    Yongming Rao*, Wenliang Zhao*, Yansong Tang, Jie Zhou, Ser-Nam Lim, Jiwen Lu
    Conference on Neural Information Processing Systems (NeurIPS), 2022
    [arXiv] [Code] [Project Page] [中文解读]

    HorNet is a family of generic vision backbones that perform explicit high-order spatial interactions based on Recursive Gated Convolution.

    P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting
    Ziyi Wang*, Xumin Yu*, Yongming Rao*, Jie Zhou, Jiwen Lu
    Conference on Neural Information Processing Systems (NeurIPS), 2022
    [arXiv] [Code] [Project Page] [中文解读]

    P2P is a framework to leverage large-scale pre-trained image models for 3D point cloud analysis.

    DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
    Yongming Rao*, Wenliang Zhao*, Guangyi Chen, Yansong Tang, Zheng Zhu, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    [arXiv] [Code] [Project Page] [中文解读]

    DenseCLIP is a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.

    Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling
    Xumin Yu*, Lulu Tang*, Yongming Rao*, Tiejun Huang, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    [arXiv] [Code] [Project Page] [中文解读]

    Point-BERT is a new paradigm for learning Transformers in an unsupervised manner by generalizing the concept of BERT onto 3D point cloud data.

    Global Filter Networks for Image Classification
    Yongming Rao*, Wenliang Zhao*, Zheng Zhu, Jiwen Lu, Jie Zhou
    Conference on Neural Information Processing Systems (NeurIPS), 2021
    [arXiv] [Code] [Project Page] [中文解读]

    Global Filter Networks is a transformer-style architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.

    DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh
    Conference on Neural Information Processing Systems (NeurIPS), 2021
    [arXiv] [Code] [Project Page] [Video] [中文解读]

    We present a dynamic token sparsification framework to prune redundant tokens in vision transformers progressively and dynamically based on the input.

    PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers
    Xumin Yu*, Yongming Rao*, Ziyi Wang, Zuyan Liu, Jiwen Lu, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2021
    Oral Presentation
    [arXiv] [Code] [中文解读]

    PoinTr is a transformer-based framework that reformulates point cloud completion as a set-to-set translation problem.

    RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection
    Yongming Rao*, Benlin Liu*, Yi Wei, Jiwen Lu, Cho-Jui Hsieh, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2021
    [arXiv]

    We propose to generate random layouts of a scene by making use of the objects in the synthetic CAD dataset and learn the 3D scene representation by applying object-level contrastive learning on two random scenes generated from the same set of synthetic objects.

    Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds
    Yongming Rao, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
    IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2022
    [arXiv] [Code]

    We present an unsupervised point cloud representation learning method based on global-local bidirectional reasoning, which substantially advances the state of the art in unsupervised point cloud understanding and outperforms recent supervised methods.

    Spherical Fractal Convolution Neural Networks for Point Cloud Recognition
    Yongming Rao, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
    [PDF] [Supplement]

    We designed Spherical Fractal Convolution Neural Networks (SFCNN) for rotation-invariant point cloud feature learning.

    Runtime Network Routing for Efficient Image Classification
    Yongming Rao, Jiwen Lu, Ji Lin, Jie Zhou
    IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2019
    [PDF] [Code] [Conference Version (NeurIPS 2017)]

    We propose a generic Runtime Network Routing (RNR) framework for efficient image classification, which selects an optimal path inside the network. Our method can be applied to off-the-shelf neural network structures and easily extended to various application scenarios.

    NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo
    Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2021
    Oral Presentation
    [arXiv] [Code] [Project page] [Video]

    We present a new multi-view depth estimation method that utilizes both conventional SfM reconstruction and learning-based priors over the recently proposed neural radiance fields (NeRF).

    Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification
    Yongming Rao*, Guangyi Chen*, Jiwen Lu, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2021
    [arXiv] [Code]

    We propose to learn the attention with counterfactual causality, which provides a tool to measure the attention quality and a powerful supervisory signal to guide the learning process.

    Structure-Preserving Image Super-Resolution
    Cheng Ma, Yongming Rao, Jiwen Lu, Jie Zhou
    IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2021
    [arXiv] [Code]

    We propose to learn a neural structure extractor in an unsupervised manner to extract structural patterns in images and use it to supervise SR models.

    Towards Interpretable Deep Metric Learning with Structural Matching
    Wenliang Zhao*, Yongming Rao*, Ziyi Wang, Jiwen Lu, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2021
    [arXiv] [Code]

    We present a deep interpretable metric learning (DIML) method that adopts a structural matching strategy to explicitly align the spatial embeddings by computing an optimal matching flow between feature maps of the two images.

    Group-aware Contrastive Regression for Action Quality Assessment
    Xumin Yu*, Yongming Rao*, Wenliang Zhao, Jiwen Lu, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2021
    [arXiv] [Code]

    We propose a new contrastive regression (CoRe) framework to learn the relative scores by pair-wise comparison, which highlights the differences between videos and guides the models to learn the key hints for action quality assessment.

    PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clouds
    Yi Wei*, Ziyi Wang*, Yongming Rao*, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
    [arXiv] [Code]

    We present point-voxel correlation fields for 3D scene flow estimation, which migrate the high performance of RAFT and provide a solution to build structured all-pairs correlation fields for unstructured point clouds.

    Multi-Proxy Wasserstein Classifier for Image Classification
    Benlin Liu*, Yongming Rao*, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh
    AAAI Conference on Artificial Intelligence (AAAI), 2021
    [PDF]

    We present a new Multi-Proxy Wasserstein Classifier to improve image classification models by calculating a non-uniform matching flow between the elements in the feature map of a sample and multiple proxies of a class using optimal transport theory.

    Temporal Coherence or Temporal Motion: Which is More Critical for Video-based Person Re-identification?
    Guangyi Chen*, Yongming Rao*, Jiwen Lu, Jie Zhou
    European Conference on Computer Vision (ECCV), 2020
    [PDF]

    We show that temporal coherence plays a more critical role than temporal motion for video-based person re-identification and develop an Adversarial Feature Augmentation (AFA) method to highlight temporal coherence.

    MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation
    Benlin Liu, Yongming Rao, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh
    European Conference on Computer Vision (ECCV), 2020
    [arXiv]

    We boost the performance of CNNs by learning soft targets for shallow layers via meta-learning.

    Structure-Preserving Super Resolution with Gradient Guidance
    Cheng Ma, Yongming Rao, Yean Cheng, Ce Chen, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
    [arXiv] [Code]

    We propose to leverage gradient information as an extra supervision signal to restore structures while generating natural SR images.

    Deep Face Super-Resolution with Iterative Collaboration between Attentive Recovery and Landmark Estimation
    Cheng Ma, Zhenyu Jiang, Yongming Rao, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
    [arXiv] [Code]

    We propose a deep face super-resolution method with iterative collaboration between two recurrent networks, which focus on facial image recovery and landmark estimation, respectively.

    COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
    Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
    [arXiv] [Project Page] [Annotation Tool]

    COIN is the largest and most comprehensive instructional video analysis dataset with rich annotations.

    Learning Discriminative Aggregation Network for Video-based Face Recognition and Person Re-identification
    Yongming Rao, Jiwen Lu, Jie Zhou
    International Journal of Computer Vision (IJCV, IF: 6.07), 2019
    [PDF] [Code]

    We propose a discriminative aggregation network (DAN) method for video-based face recognition and person re-identification, which aims to integrate information from video frames for feature representation effectively and efficiently.

    Learning Globally Optimized Object Detector via Policy Gradient
    Yongming Rao, Dahua Lin, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018
    Spotlight Presentation
    [PDF] [Supplement]

    We propose a simple yet effective method to learn a globally optimized detector for object detection by directly optimizing mAP using the REINFORCE algorithm.

    Runtime Neural Pruning
    Ji Lin*, Yongming Rao*, Jiwen Lu, Jie Zhou
    Conference on Neural Information Processing Systems (NeurIPS), 2017
    [PDF] [Code]

    We propose a Runtime Neural Pruning (RNP) framework that prunes the deep neural network dynamically at runtime.

    Learning Discriminative Aggregation Network for Video-Based Face Recognition
    Yongming Rao, Ji Lin, Jiwen Lu, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2017
    Spotlight Presentation
    [PDF] [Code] [Supplement]

    We propose a discriminative aggregation network (DAN) method for video face recognition, which aims to integrate information from video frames effectively and efficiently.

    Attention-aware Deep Reinforcement Learning for Video Face Recognition
    Yongming Rao, Jiwen Lu, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2017
    [PDF]

    We propose an attention-aware deep reinforcement learning (ADRL) method for video face recognition, which aims to discard misleading and confounding frames and find the focus of attention in face videos for person recognition.

    V-tree: Efficient KNN Search on Moving Objects with Road-Network Constraints
    Bilong Shen, Ying Zhao, Guoliang Li, Weimin Zheng, Yue Qin, Bo Yuan, Yongming Rao
    IEEE International Conference on Data Engineering (ICDE), 2017
    [PDF]

    We propose a new tree structure for kNN search over moving objects with road-network constraints, which can be used in many real-world applications such as taxi search.

    Honors and Awards

  • Outstanding Graduate/Doctoral Dissertation of Tsinghua University
  • 2022 Chinese National Scholarship
  • 1st place in the MVP Point Cloud Completion Challenge (ICCV 2021 Workshop)
  • Baidu Top 100 Chinese Rising Stars in AI (百度AI华人新星百强榜)
  • CVPR 2021 Outstanding Reviewer
  • ECCV 2020 Outstanding Reviewer
  • 2019 CCF-CV Academic Emerging Award (CCF-CV 学术新锐奖)
  • 2019 Chinese National Scholarship
  • ICME 2019 Best Reviewer Award
  • 2017 SenseTime Undergraduate Scholarship

    Academic Services

  • Co-organizer: Tutorial on Deep Reinforcement Learning for Computer Vision at CVPR 2019 [website]
  • Conference Reviewer / PC Member: CVPR 2018-2023, ICCV 2019-2023, ECCV 2020-2022, NeurIPS 2019-2023, ICML 2019-2023, ICLR 2021-2023, SIGGRAPH Asia 2022-2023, AAAI 2020-2023, WACV 2020-2023
  • Senior PC Member: IJCAI 2021
  • Journal Reviewer: T-PAMI, IJCV, T-NNLS, T-IP, T-MM, Pattern Recognition



    © Yongming Rao | Last updated: July 14, 2023