Source code for applying knowledge distillation techniques to machine learning models. More info --> https://neptune.ai/blog/knowledge-distillation
In practice, knowledge distillation means using a larger, better-performing model (the teacher) to guide a less complex one (the student).
In response-based knowledge distillation, the student model tries to imitate the teacher model's logits, and to do this the algorithm uses a dedicated distillation loss. Put more simply, the decoder side of the student model tries to behave like the teacher's decoder. In the loss, x is the input, W are the weights, alpha, B, and T are scalars, sigma is the temperature-modified softmax, and H is the cross-entropy loss.
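For reference, the standard response-based loss combines a hard-label cross-entropy term with a temperature-softened term comparing student and teacher outputs. Below is a minimal PyTorch sketch of such a loss; the function name, default values, and exact reductions are illustrative assumptions, not necessarily the code used in train_kd.py.

```python
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, targets,
                     temperature=4.0, alpha=0.3, ignore_index=255):
    # Hard-label term: ordinary cross entropy against the ground-truth masks.
    hard = F.cross_entropy(student_logits, targets, ignore_index=ignore_index)
    # Soft-label term: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # alpha balances the hard-label and soft-label terms.
    return alpha * hard + (1.0 - alpha) * soft
```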

In feature-based knowledge distillation, the student model instead tries to copy the encoder side of the teacher, again penalizing dissimilarities with a dedicated loss function. The F's are the respective feature sets and alpha is a hyperparameter.
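As a rough illustration, a feature-based distillation term can be as simple as an L2 distance between the student's and teacher's intermediate feature maps, weighted by alpha. This is only a sketch under the assumption that the feature maps are spatially resampled to match; the actual loss in train_kd.py may also project channels (e.g. with a 1x1 convolution) and weight the terms differently.

```python
import torch.nn.functional as F

def feature_kd_loss(student_feats, teacher_feats, alpha=0.5):
    # Resample the student features if the spatial sizes differ.
    if student_feats.shape[-2:] != teacher_feats.shape[-2:]:
        student_feats = F.interpolate(
            student_feats, size=teacher_feats.shape[-2:],
            mode="bilinear", align_corners=False,
        )
    # L2 distance between the two feature sets; the teacher is detached so
    # gradients only flow into the student.
    return alpha * F.mse_loss(student_feats, teacher_feats.detach())
```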

The idea for the student was taken from the EfficientNet family; due to its similarities with the ResNet family, I used 10% to 50% dropout to reduce the computational load and keep the number of model parameters down while preventing overfitting. The detailed structure is given in model.py; the number of trainable parameters with dropout is approximately 8.5 M.
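The 8.5 M figure refers to parameters that require gradients; it can be reproduced with a small helper like the one below (torchsummary, used via model.py, reports the same "trainable params" count).

```python
def count_trainable_params(model) -> int:
    # Sum over parameters that receive gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```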
I used ResNet50-YOLO as the teacher, which has around 60 M parameters: https://github.com/makatx/YOLO_ResNet
The PASCAL VOC 2012 segmentation dataset was used, with appropriate transformations applied to both images and masks. The dataset includes a total of 21 semantic classes, including the background. Mask preprocessing included resizing to 256x256 and conversion to long tensor format, with ignored pixels (label 255) handled explicitly.
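A minimal sketch of the mask preprocessing described above (nearest-neighbour resize so class ids are not interpolated, then conversion to a long tensor of class indices); the helper name and exact transform calls are assumptions rather than the code in this repo.

```python
import torch
from PIL import Image
from torchvision.transforms import functional as TF

def preprocess_mask(mask: Image.Image) -> torch.Tensor:
    # Nearest-neighbour resize keeps the integer class labels intact.
    mask = TF.resize(mask, [256, 256],
                     interpolation=TF.InterpolationMode.NEAREST)
    # Keep the raw labels (0..20 plus 255 for ignored pixels) as a long tensor.
    return TF.pil_to_tensor(mask).squeeze(0).long()
```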
The only augmentation in train.py was color jittering, while train_kd.py uses many more (RandomGrayscale, GaussianBlur, AdjustSharpness, AutoContrast). The two scripts use different augmentations because the train.py metrics became worse with heavier augmentation; a sketch of both pipelines is shown below.
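The two pipelines might look roughly like this; the specific jitter, blur, and probability values are illustrative assumptions, and only the choice of transforms follows the description above. All of these are photometric, so they can be applied to the images without altering the masks.

```python
from torchvision import transforms

# train.py: colour jittering only (parameters here are placeholders).
plain_augment = transforms.ColorJitter(brightness=0.25, contrast=0.25,
                                       saturation=0.25, hue=0.1)

# train_kd.py: a richer augmentation stack.
kd_augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.25, contrast=0.25,
                           saturation=0.25, hue=0.1),
    transforms.RandomGrayscale(p=0.1),
    transforms.GaussianBlur(kernel_size=3),
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.5),
    transforms.RandomAutocontrast(p=0.5),
])
```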
The Adam optimizer was used, with a learning rate of 1×10^(-3) and a weight decay of 1×10^(-4); I also experimented with SGD, but Adam worked better. The cross-entropy loss with ignore index 255 was used, since it is suitable for segmentation tasks. Training used a batch size of 64 for up to 500 epochs, with early stopping (patience 30) combined with a ReduceLROnPlateau learning-rate scheduler. The main metric was the multiclass Jaccard index; the corresponding setup is sketched below.
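Put together, the configuration above corresponds roughly to the following setup. The scheduler's own patience/factor, the dummy stand-in model, and the use of torchmetrics for the Jaccard index are assumptions; see train.py for the actual values.

```python
import torch
from torch import nn
from torchmetrics.classification import MulticlassJaccardIndex

model = nn.Conv2d(3, 21, kernel_size=1)  # stand-in for the "light" model from model.py

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                       patience=10)  # assumed values
criterion = nn.CrossEntropyLoss(ignore_index=255)   # ignored pixels do not contribute
metric = MulticlassJaccardIndex(num_classes=21, ignore_index=255)  # main metric
```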
The table below shows all the results:
All training logic is in train.py and train_kd.py
python train.py --model_name light -> will train from scratch, without any KD
python train_kd.py --kd_method response --model_name light --temperature 4 --alpha 0.3 -> will utilize response-based KD
python train_kd.py --kd_method feature --model_name light -> will utilize feature-based KD
python test.py --training_mode teacher -> evaluates the ResNet50-YOLO teacher
python test.py --training_mode regular -> evaluates the student trained from scratch
python test.py --training_mode response -> evaluates the response-based KD student
python test.py --training_mode feature -> evaluates the feature-based KD student
python model.py --m light -> prints the light model and its torchsummary summary




