Yixiao Ge


I am currently a senior researcher at Tencent ARC Lab and Tencent AI Lab, leading an effort on multimodal foundation models, open-world visual comprehension, and efficient AI. Previously, I received my Ph.D. from the Multimedia Lab (MMLab) at the Chinese University of Hong Kong, advised by Prof. Hongsheng Li and Prof. Xiaogang Wang. We are actively looking for self-motivated interns to work on related research topics. Please feel free to reach out if you are interested.


Check out our SEED project ([Project Page])!

  • [Oct 2023] Excited to unveil SEED-LLaMA, featuring multi-turn in-context emergent capabilities.
  • [Sep 2023] Three papers were accepted to NeurIPS 2023.
  • [Aug 2023] Glad to release ViT-Lens, advancing omni-modal representation learning.
  • [Aug 2023] Glad to release SEED-Bench, the most comprehensive MLLM benchmark to date.
  • [July 2023] Glad to release SEED, an image tokenizer tailored for LLMs.
  • [July 2023] Four papers were accepted to ICCV 2023.
  • [May 2023] One paper was accepted to KDD 2023.
  • [Apr 2023] One paper was accepted to ICML 2023.
  • [Feb 2023] Four papers were accepted to CVPR 2023.
  • [Jan 2023] One paper was accepted to ICLR 2023.

  • [Jan-Nov 2022] 11 papers were accepted by ICLR/CVPR/IJCAI/ECCV 2022 and AAAI 2023, 2 of which were oral.
  • [Mar-Jul 2021] 5 papers were accepted by CVPR/ICCV 2021.
  • [Jan-Sep 2020] 3 papers were accepted by ICLR/ECCV/NeurIPS 2020, 1 of which was spotlight.

Selected Projects

Multimodal Foundation Models:

  • Vision-language: We aim to develop foundational models that unify visual comprehension and generation tasks within one framework.

    Given the great success of Large Language Models (LLMs), we took an initial step toward empowering off-the-shelf LLMs to perform visual tasks via plugins (GPT4Tools @NeurIPS23). Although this is a feasible solution, it falls far short of multimodal emergent abilities.

    We are further devoted to developing an end-to-end framework that facilitates flexible input/output formats, transitioning and reasoning seamlessly between multimodal signals while acquiring knowledge from an inherently multimodal world. Check out our SEED for details.

    Previously, we focused on pre-training vision-language representations and video-text retrieval, e.g., MCQ @CVPR22 (Oral) and All-in-One @CVPR23. We also built applications such as Tune-A-Video @ICCV23.

  • Omni-modal: A real AI agent (e.g., a smart robot) should be capable of sensing all modalities. This is non-trivial, especially for rare modalities. Check out our solution, ViT-Lens. Omni-modal representations have great potential in emergent applications; see our DreamDiffusion.

  • Data-centric: High-quality, large-scale data is a prerequisite for training foundation models. For training data, we collected large-scale TV dramas (PTVD, with Tencent Video authorization) and memes (Sticker820K, with Tencent Search authorization). We also focus on properly evaluating multimodal LLMs and have proposed SEED-Bench ([leaderboard]).

Open-world Visual Comprehension:

Efficient AI:

    We introduced the topic of hot-refresh model upgrades (RACT @ICLR22) for large-scale retrieval systems, which is practical in industry but under-explored in academia. Beyond retrieval, upgrading the foundation models in current AI systems is also costly because all downstream modules need to be retrained to adapt. Check out our TaCA for a solution. We are also interested in model selection (SFDA @ECCV22, PED @ICCV23), binarization (BEBR @KDD23), etc.

    Our algorithms helped Tencent effectively reduce costs and increase efficiency. We won the highest technical award within the company and the SZCCF Science and Technology Award.

Publications [Full List]

(* equal contribution, # corresponding author)

Selected Preprints:

  • Making LLaMA SEE and Draw with SEED Tokenizer
    Offers unified multimodal comprehension and generation, featuring multi-turn in-context emergent capabilities, akin to an AI aide.
    Yuying Ge*, Sijie Zhao*, Ziyun Zeng, Yixiao Ge#, Chen Li, Xintao Wang, Ying Shan
  • Planting a SEED of Vision in Large Language Model
    Empowers Large Language Models (LLMs) with the emergent ability to see and draw.
    Yuying Ge*, Yixiao Ge*#, Ziyun Zeng, Xintao Wang, Ying Shan
  • ViT-Lens: Towards Omni-modal Representations
    Advancing omni-modal representation learning with modality lens.
    Weixian Lei, Yixiao Ge#, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou#
  • SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
    Consists of 19K multiple-choice questions with accurate human annotations and spans 12 evaluation dimensions covering both spatial and temporal comprehension.
    Bohao Li*, Rui Wang*, Guangzhi Wang*, Yuying Ge#, Yixiao Ge#, Ying Shan
  • TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter
    Enables new ViTs to be plugged into a framework (e.g., BLIP-2) with the other modules untouched, while boosting performance.
    Binjie Zhang, Yixiao Ge#, Xuyuan Xu, Ying Shan, Mike Zheng Shou#
  • What Makes for Good Visual Tokenizers for Large Language Models?
    Rather than simply applying CLIP models, we systematically investigate proper pre-training methods to build good visual tokenizers, turning LLMs into powerful multimodal LLMs.
    Guangzhi Wang, Yixiao Ge#, Xiaohan Ding, Mohan Kankanhalli, Ying Shan
  • TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
    Producing general-purpose video features that work out of the box. We surpass InternVideo and ImageBind on zero-shot and linear tasks.
    Ziyun Zeng, Yixiao Ge#, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan

2023:

  • GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
    Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan
    NeurIPS, 2023 [Project] [Paper] [Demo] [Code]
  • Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models
    Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, Mike Zheng Shou
    NeurIPS, 2023 [Project] [Paper] [Code]
  • Meta-Adapter: An Online Few-shot Learner for Vision-Language Model
    Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan
    NeurIPS, 2023 [Paper (Coming soon)]
  • Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou
    ICCV, 2023 [Project] [Paper] [Demo] [Code]
  • Exploring Model Transferability through the Lens of Potential Energy
    Xiaotong Li, Zixuan Hu, Yixiao Ge, Ying Shan, Lingyu Duan
    ICCV, 2023 [Paper] [Code]
  • BoxSnake: Polygonal Instance Segmentation with Box Supervision
    Rui Yang, Lin Song, Yixiao Ge, Xiu Li
    ICCV, 2023 [Paper] [Code]
  • Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
    Yuxin Fang*, Shusheng Yang*, Shijie Wang*, Yixiao Ge, Ying Shan, Xinggang Wang
    ICCV, 2023 [Paper] [Code]
  • Binary Embedding-based Retrieval at Tencent
    Yukang Gan*, Yixiao Ge*, Chang Zhou*, Shupeng Su, Zhouchuan Xu, Xuyuan Xu, Quanchao Hui, Xiang Chen, Yexin Wang, Ying Shan
    KDD, 2023 [Paper] [Code]
  • π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation
    Chengyue Wu, Teng Wang, Yixiao Ge#, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo
    ICML, 2023 [Paper] [Code]
  • Accelerating Vision-Language Pretraining with Free Language Modeling
    Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, Xiaohu Qie, Ping Luo
    CVPR, 2023 [Paper] [Code]
  • Masked Visual Reconstruction in Language Semantic Space
    Shusheng Yang, Yixiao Ge#, Kun Yi, Dian Li, Ying Shan, Xiaohu Qie, Xinggang Wang#
    CVPR, 2023 [Paper] [Code]
  • Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
    Ziyun Zeng*, Yuying Ge*, Xihui Liu, Bin Chen#, Ping Luo, Shu-Tao Xia, Yixiao Ge#
    CVPR, 2023 [Paper] [Code]
  • All in One: Exploring Unified Video-Language Pre-training
    Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou
    CVPR, 2023 [Paper] [Code]
  • Masked Image Modeling with Denoising Contrast
    Kun Yi*, Yixiao Ge*#, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, Xiaohu Qie
    ICLR, 2023 [Paper] [Code]
  • Darwinian Model Upgrades: Model Evolving with Selective Compatibility
    Binjie Zhang*, Shupeng Su*, Yixiao Ge#, Xuyuan Xu, Yexin Wang, Chun Yuan, Mike Zheng Shou, Ying Shan
    AAAI, 2023 [Paper]
  • Video-Text Pre-training with Learned Regions
    Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang
    AAAI, 2023 [Paper] [Code]

2022:

  • MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
    Yuying Ge, Yixiao Ge, Xihui Liu, Jinpeng Wang, Jianping Wu, Ying Shan, Xiaohu Qie, Ping Luo
    ECCV, 2022 [Paper] [Code]
  • Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space
    Wenqi Shao#, Xun Zhao, Yixiao Ge#, Zhaoyang Zhang, Lei Yang, Xiaogang Wang, Ying Shan, Ping Luo
    ECCV, 2022 [Paper] [Code]
  • mc-BEiT: Multi-choice Discretization for Image BERT Pre-training
    Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Lingyu Duan
    ECCV, 2022 [Paper] [Code]
  • Towards Universal Backward-Compatible Representation Learning
    Binjie Zhang, Yixiao Ge#, Yantao Shen, Shupeng Su, Fanzi Wu, Chun Yuan#, Xuyuan Xu, Yexin Wang, Ying Shan
    IJCAI, 2022 (Long oral) [Paper] [Code]
  • Bridging Video-text Retrieval with Multiple Choice Questions
    Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, Xiaohu Qie, Ping Luo
    CVPR, 2022 (Oral) [Paper] [Code]
  • Object-aware Video-language Pre-training for Retrieval
    Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou
    CVPR, 2022 [Paper] [Code]
  • Hot-Refresh Model Upgrades with Regression-Alleviating Compatible Training in Image Retrieval
    Binjie Zhang, Yixiao Ge#, Yantao Shen, Yu Li, Chun Yuan#, Xuyuan Xu, Yexin Wang, Ying Shan
    ICLR, 2022 [Paper] [Code]
  • Dynamic Token Normalization Improves Vision Transformer
    Wenqi Shao, Yixiao Ge, Zhaoyang Zhang, Xuyuan Xu, Xiaogang Wang, Ying Shan, Ping Luo
    ICLR, 2022 [Paper] [Code]
  • Uncertainty Modeling for Out-of-Distribution Generalization
    Xiaotong Li, Yongxing Dai, Yixiao Ge, Jun Liu, Ying Shan, Lingyu Duan
    ICLR, 2022 [Paper] [Code]
  • Structured Domain Adaptation with Online Relation Regularization for Unsupervised Person Re-ID
    Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, Xiaogang Wang, Hongsheng Li
    IEEE TNNLS, 2022 [Project] [Paper]

2021:

  • Progressive Correspondence Pruning by Consensus Learning
    Chen Zhao*, Yixiao Ge*, Feng Zhu, Rui Zhao, Hongsheng Li, Mathieu Salzmann
    ICCV, 2021 [Project] [Paper] [Code]
  • Online Pseudo Label Generation by Hierarchical Cluster Dynamics for Adaptive Person Re-identification
    Yi Zheng, Shixiang Tang, Guolong Teng, Yixiao Ge, Kaijian Liu, Donglian Qi, Jing Qin, Dapeng Chen
    ICCV, 2021 [Paper]
  • Refining Pseudo Labels with Clustering Consensus over Generations for Unsupervised Object Re-identification
    Xiao Zhang*, Yixiao Ge*, Yu Qiao, Hongsheng Li
    CVPR, 2021 [Paper]
  • DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network
    Rui Liu, Yixiao Ge, Ching Lam Choi, Xiaogang Wang, Hongsheng Li
    CVPR, 2021 [Paper] [Code]
  • Mutual CRF-GNN Network for Few-shot Learning
    Shixiang Tang, Dapeng Chen, Lei Bai, Kaijian Liu, Yixiao Ge, Wanli Ouyang
    CVPR, 2021 [Paper]

2020:

  • Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID
    Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, Hongsheng Li
    NeurIPS, 2020 [Project] [Paper] [Code]
  • Self-supervising Fine-grained Region Similarities for Large-scale Image Localization
    Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, Hongsheng Li
    ECCV, 2020 (Spotlight) [Project] [Paper] [Code]
  • Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification
    Yixiao Ge, Dapeng Chen, Hongsheng Li
    ICLR, 2020 [Project] [Paper] [Code]

Before 2020:

  • FD-GAN: Pose-guided Feature Distilling GAN for Robust Person Re-identification
    Yixiao Ge*, Zhuowan Li*, Haiyu Zhao, Guojun Yin, Shuai Yi, Xiaogang Wang, Hongsheng Li
    NeurIPS, 2018 [Project] [Paper] [Code]