Self-distillation with BAtch Knowledge Ensembling

Improves ImageNet Classification

Yixiao Ge¹      Ching Lam Choi¹      Xiao Zhang¹      Peipei Zhao³      Feng Zhu⁴      Rui Zhao⁴      Hongsheng Li¹,²,³     
1. Multimedia Laboratory, The Chinese University of Hong Kong          
2. Centre for Perceptual and Interactive Intelligence (CPII)    
3. School of CST, Xidian University     4. SenseTime Research    

Abstract [Full Paper]


Fig. 1 - Conceptual comparison of three knowledge ensembling mechanisms.


Recent studies of knowledge distillation have found that ensembling the "dark knowledge" from multiple teachers (see (a)) or students (see (b)) produces better soft targets for training, but at the cost of significantly more computation and/or parameters.

Our Contributions:

  • We are the first to produce ensembled soft targets for self-distillation without using multiple networks or additional network branches.

  • We propose a novel BAtch Knowledge Ensembling (BAKE) mechanism that refines the distillation targets online with cross-sample knowledge, i.e., by aggregating weighted knowledge from the other samples in the same batch (see (c)).

  • Our method is simple yet consistently effective at improving the classification performance of various networks and datasets, with minimal computational overhead and zero additional parameters, e.g., a significant +1.2% top-1 gain for ResNet-50 on ImageNet with only +3.7% computational overhead.

Method Overview


Fig. 2 - BAKE produces soft targets for self-distillation with a single network (an encoder and a classifier). For an anchor image $x^\text{anchor}$, the knowledge of the other samples $\{x_1, x_2, x_3, \dots\}$ in the same batch is propagated and ensembled with affinity weights to form a better soft target for distillation on the fly. Note that $x^\text{anchor}$ and $\{x_1, x_2, x_3, \dots\}$ are fed into the same network.
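In symbols, the propagation-and-ensembling step of Fig. 2 can be written as follows. This is our own shorthand sketch, assuming L2-normalized encoder features $f_i$ and a closed-form propagation; the paper's exact notation and temperatures may differ:

```latex
% Batch affinity: softmax-normalized feature similarities, self-affinity masked out
A_{ij} = \frac{\exp(f_i^\top f_j)}{\sum_{k \neq i} \exp(f_i^\top f_k)}, \qquad A_{ii} = 0
% Knowledge propagation and ensembling in closed form
\mathbf{Q} = (1-\omega)\,(\mathbf{I} - \omega \mathbf{A})^{-1}\,\mathbf{P}
```

Here $\mathbf{P}$ stacks the temperature-smoothed predictions $\sigma(z_i/\tau)$ of the batch, $\omega \in [0, 1)$ controls how much cross-sample knowledge is mixed in, and the rows of $\mathbf{Q}$ serve as the soft targets for distillation.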


Fig. 3 - Key differences between our method and related works.

Pseudo Code [Full Code]
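As a rough paraphrase of the target-generation step, the sketch below computes BAKE-style soft targets in NumPy. It is a minimal illustration under our own assumptions (the names `tau` and `omega`, the affinity construction, and the closed-form inverse follow common label-propagation formulations); please refer to the released code for the reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bake_targets(features, logits, tau=4.0, omega=0.5):
    """Sketch of batch knowledge ensembling (hypothetical helper, not the authors' code).

    features: (N, D) L2-normalized encoder outputs of one batch
    logits:   (N, C) classifier outputs of the same batch
    returns:  (N, C) ensembled soft targets, one row per sample
    """
    n = features.shape[0]
    # Pairwise affinities; self-similarities are masked out before normalization
    sim = features @ features.T
    np.fill_diagonal(sim, -np.inf)
    affinity = softmax(sim, axis=1)          # row-stochastic affinity matrix A
    preds = softmax(logits / tau, axis=1)    # temperature-smoothed predictions P
    # Closed-form knowledge propagation: Q = (1 - omega) * (I - omega * A)^-1 @ P
    targets = (1.0 - omega) * np.linalg.inv(np.eye(n) - omega * affinity) @ preds
    return targets
```

Because the affinity matrix is row-stochastic, each output row remains a valid probability distribution; in training, these rows would replace the one-hot labels in a distillation (e.g., KL-divergence) loss, with gradients detached from the target branch.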


Results on ImageNet


Fig. 4 - BAKE improves various architectures with minimal computational overhead. We report the top-1 accuracy (%) on ImageNet. "Vanilla" indicates training with a conventional cross-entropy loss. Training time is measured on 8 Titan X GPUs. Please refer to our paper for more results.

Soft Target Examples on ImageNet


Fig. 5 - We sample three tuples of images (four images per tuple) from three batches to show the soft targets produced by BAKE. The images are sampled from ImageNet. "GT" denotes the manually annotated ground-truth labels. The knowledge of samples from the same batch is propagated and ensembled to form a better soft learning target for each sample in the batch. Note that only the three highest-probability classes of each soft target are shown for brevity.


Citation

@article{ge2021bake,
    title={Self-distillation with Batch Knowledge Ensembling Improves ImageNet Classification},
    author={Yixiao Ge and Ching Lam Choi and Xiao Zhang and Peipei Zhao and Feng Zhu and Rui Zhao and Hongsheng Li},
    journal={arXiv preprint arXiv:2104.13298},
    year={2021}
}


If you have any questions, please contact Yixiao Ge at