Self-distillation with BAtch Knowledge Ensembling

Improves ImageNet Classification

Yixiao Ge1      Xiao Zhang1      Ching Lam Choi1      Ka Chun Cheung3      Peipei Zhao5     
Feng Zhu4      Xiaogang Wang1      Rui Zhao4      Hongsheng Li1,2     
1. Multimedia Laboratory, The Chinese University of Hong Kong          
2. Centre for Perceptual and Interactive Intelligence (CPII)    
3. NVIDIA     4. SenseTime Research     5. School of CST, Xidian University    

Abstract [Full Paper]


Fig. 1 - Conceptual comparison of three knowledge ensembling mechanisms.


Recent studies of knowledge distillation have discovered that ensembling the "dark knowledge" from multiple teachers (see (a)) or students (see (b)) contributes to creating better soft targets for training, but at the cost of significantly more computation and/or parameters.

Our Contributions:

  • We are the first to produce ensembled soft targets for self-distillation without using multiple networks or additional network branches.

  • We propose a novel BAtch Knowledge Ensembling (BAKE) mechanism that refines the distillation targets online with cross-sample knowledge, i.e., by weighted aggregation of the knowledge from the other samples in the same batch (see (c)).

  • Our method is simple yet consistently effective at improving the classification performance of various networks and datasets with minimal computational overhead and zero additional parameters, e.g., a significant +0.7% gain for Swin-T on ImageNet with only +1.5% computational overhead.
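The weighted aggregation in (c) can be illustrated with a single propagation step: each sample's soft target mixes its own prediction with an affinity-weighted ensemble of the other batch samples' predictions. A minimal NumPy sketch, where the function name, the mixing weight `omega`, and the temperature `tau` are our illustrative choices, not necessarily the paper's exact formulation:

```python
import numpy as np

def propagate_once(features, probs, omega=0.5, tau=4.0):
    """One step of cross-sample knowledge propagation (illustrative).

    features: (N, D) L2-normalized batch embeddings.
    probs:    (N, C) softened predictions for the same batch.
    Returns refined (N, C) soft targets.
    """
    sim = features @ features.T / tau        # pairwise affinities
    np.fill_diagonal(sim, -np.inf)           # exclude each sample's own knowledge
    A = np.exp(sim - sim.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)        # row-stochastic propagation weights
    # Mix each sample's own prediction with its neighbors' ensemble.
    return (1.0 - omega) * probs + omega * (A @ probs)
```

Because the affinity matrix is row-stochastic, the refined targets remain valid probability distributions.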

Method Overview


Fig. 2 - BAKE produces soft targets for self-distillation with a single network (an encoder and a classifier). For an anchor image $x^\text{anchor}$, the knowledge of the other samples $\{x_1, x_2, x_3, \cdots\}$ in the same batch is weightedly propagated and ensembled to form a better soft target for distillation on-the-fly. Note that $x^\text{anchor}$ and $\{x_1, x_2, x_3, \cdots\}$ are fed into the same network.


Fig. 3 - Key differences between our method and related works.

Pseudo Code [Full Code]
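For reference, a hedged NumPy sketch of BAKE's target construction: affinities are propagated to convergence via the closed form $Q = (1-\omega)(I - \omega A)^{-1}P$, and $Q$ is then used as the (gradient-free) distillation target. Function names and default hyper-parameters here are ours; please see the released code above for the exact implementation.

```python
import numpy as np

def bake_soft_targets(features, logits, omega=0.5, tau=4.0):
    """Build ensembled soft targets from one batch (sketch).

    features: (N, D) L2-normalized encoder outputs.
    logits:   (N, C) classifier outputs for the same batch.
    Returns (N, C) soft targets for the distillation loss.
    """
    N = features.shape[0]
    # Temperature-softened predictions P.
    z = logits / tau
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # Row-stochastic affinity matrix A, self-similarity excluded.
    sim = features @ features.T
    np.fill_diagonal(sim, -np.inf)
    A = np.exp(sim - sim.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    # Closed form of infinitely repeated propagation:
    # Q = (1 - w) * (I - w * A)^{-1} @ P.
    Q = (1.0 - omega) * np.linalg.solve(np.eye(N) - omega * A, p)
    # Q is treated as a constant (no gradient) when distilling.
    return Q
```

Since $\omega < 1$ and $A$ is row-stochastic, $(I - \omega A)$ is invertible and each row of $Q$ remains a probability distribution, so it can be plugged directly into a temperature-scaled cross-entropy or KL distillation loss.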


Results on ImageNet


Fig. 4 - BAKE improves various architectures with minimal computational overhead. We report the top-1 accuracy (%) on ImageNet. "Vanilla" indicates training with a conventional cross-entropy loss. The time consumption is counted on 8 Titan X GPUs. Please refer to our paper for more results.

Fig. 5 - BAKE improves vision transformers at various scales in terms of the top-1 accuracy (%) on ImageNet. The time consumption is counted on 8 V100 GPUs. Please refer to our paper for more results.

Soft Target Examples on ImageNet


Fig. 6 - We sample three tuples of images (four images in each tuple) from three batches to show the soft targets produced by BAKE. The images are sampled from ImageNet. "GT" denotes the manually annotated ground-truth labels. The knowledge of samples from the same batch is propagated and ensembled to form a better soft learning target for each sample in the batch. Note that only the three classes with the highest probabilities in each soft target are illustrated for brevity.

Citation

    @article{ge2021bake,
        title={Self-distillation with Batch Knowledge Ensembling Improves ImageNet Classification},
        author={Yixiao Ge and Ching Lam Choi and Xiao Zhang and Peipei Zhao and Feng Zhu and Rui Zhao and Hongsheng Li},
        journal={arXiv preprint arXiv:2104.13298},
        year={2021}
    }


If you have any questions, please contact Yixiao Ge at