
Knowledge Distillation - Queen's University ELEC 475 Term Project (competition winner)

Source code applying knowledge distillation techniques to machine learning models. More info: https://neptune.ai/blog/knowledge-distillation

Knowledge Distillation - Abstract

Knowledge distillation, in practice, means using a larger, better-performing model (the teacher) to guide a less complex one (the student).

Response-Based KD

In response-based KD, the student model tries to imitate the teacher model's logits; to do this, the algorithm uses a dedicated loss called the distillation loss. Put more simply, the decoder side of the student model tries to behave like the teacher's decoder. X is the input, W the weights, alpha, B, and T are scalars, sigma is the modified (temperature-scaled) softmax, and H is the cross-entropy loss.

[Image: response-based distillation loss equation]
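As a rough illustration, here is a minimal sketch of such a loss in the standard Hinton-style formulation; the exact loss in train_kd.py may differ, and alpha and T correspond to the --alpha and --temperature flags used below.

```python
# Minimal sketch of a response-based distillation loss (standard Hinton-style
# formulation); the exact loss used in train_kd.py may differ in its details.
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, targets,
                     alpha=0.3, T=4.0, ignore_index=255):
    # Hard-label term H: ordinary cross-entropy against the ground-truth masks.
    ce = F.cross_entropy(student_logits, targets, ignore_index=ignore_index)
    # Soft-label term: KL divergence between temperature-softened logits (sigma).
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # alpha trades off the ground-truth term against imitation of the teacher.
    return alpha * ce + (1.0 - alpha) * kd
```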

Feature-Based KD

In feature-based knowledge distillation, the student model instead tries to copy the encoder side of the teacher, again penalizing dissimilarities with a dedicated loss function. The F terms are the respective feature sets, and alpha is a hyperparameter.

[Image: feature-based distillation loss equation]
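A minimal sketch of such a feature loss, assuming an MSE penalty between student and teacher encoder features with a 1x1 convolution to match channel counts (an assumption, not necessarily the repo's exact formulation):

```python
# Minimal sketch of a feature-based distillation loss: MSE between the student's
# and teacher's encoder features (the F's above), weighted by alpha and added to
# the usual segmentation cross-entropy. The 1x1 adapter conv is an assumption.
import torch.nn as nn
import torch.nn.functional as F

class FeatureKDLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.adapter = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat, student_logits, targets):
        feat = self.adapter(student_feat)
        # Match spatial size before comparing the two feature sets.
        feat = F.interpolate(feat, size=teacher_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        feat_loss = F.mse_loss(feat, teacher_feat)
        task_loss = F.cross_entropy(student_logits, targets, ignore_index=255)
        return task_loss + self.alpha * feat_loss
```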

Models

Student Model

The idea was taken from the EfficientNet family, owing to its similarities with the ResNet family. I used 10% to 50% dropout to reduce the computational load and keep the parameter count down while preventing overfitting. The detailed structure is given in model.py; the number of trainable parameters with dropout is approximately 8.5 M.
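As a quick sanity check on that figure, a minimal parameter-counting sketch (LightModel is a placeholder name; the real class is defined in model.py):

```python
# Minimal sketch for counting trainable parameters; `LightModel` is a
# placeholder name, the actual student architecture is defined in model.py.
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example: count_trainable_params(LightModel())  # expected to be roughly 8.5 M
```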

Teacher Model

I used a ResNet50-based YOLO as the teacher, which has around 60 M parameters: https://github.com/makatx/YOLO_ResNet

Training Phase

Dataset

The PASCAL VOC 2012 Segmentation dataset was used, with appropriate transformations applied to both images and masks. The dataset includes a total of 21 semantic classes, including the background. Mask preprocessing included resizing to 256x256 and conversion to long tensor format, with ignored pixels (255) handled specifically.
The only augmentation in train.py was color jittering, while the knowledge-distillation script used many more, such as RandomGrayscale, GaussianBlur, AdjustSharpness, and AutoContrast. The reason for using different augmentations in the two scripts is that the train.py metrics became worse with heavier augmentation.
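A minimal sketch of this dataset setup, assuming torchvision's VOCSegmentation; the jitter strengths are assumptions, and the actual pipeline lives in train.py and train_kd.py:

```python
# Minimal sketch of the dataset setup, assuming torchvision's VOCSegmentation;
# the actual transform pipeline lives in train.py / train_kd.py.
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
from torchvision.datasets import VOCSegmentation

IMG_SIZE = 256          # images and masks resized to 256x256
IGNORE_INDEX = 255      # VOC "void" pixels, ignored by the loss

image_tf = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # only augmentation in train.py
    transforms.ToTensor(),
])

def mask_tf(mask: Image.Image) -> torch.Tensor:
    # Nearest-neighbour resize keeps class ids intact; 255 stays as the ignore label.
    mask = mask.resize((IMG_SIZE, IMG_SIZE), Image.NEAREST)
    return torch.as_tensor(np.array(mask), dtype=torch.long)

train_set = VOCSegmentation(root="data", year="2012", image_set="train",
                            download=True, transform=image_tf, target_transform=mask_tf)
```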

Hyperparameters

The Adam optimizer was used, with a learning rate of 1×10^(-3) and a weight decay of 1×10^(-4); I also experimented with SGD, but Adam worked better. Cross-entropy loss with ignore index 255 was used, since it is suitable for segmentation tasks. Training used a batch size of 64 for up to 500 epochs, with early stopping (patience 30) combined with a ReduceLROnPlateau learning-rate scheduler. The main metric was the multiclass Jaccard index.
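A minimal sketch of this training configuration (the stand-in model, the scheduler patience, and the loop outline are assumptions; the real logic is in train.py):

```python
# Minimal sketch of the optimizer / loss / scheduler / metric setup described
# above; the placeholder model and the scheduler patience are assumptions.
import torch
from torch import nn, optim
from torchmetrics.classification import MulticlassJaccardIndex

NUM_CLASSES = 21        # 20 VOC classes + background
IGNORE_INDEX = 255

model = nn.Conv2d(3, NUM_CLASSES, kernel_size=1)   # stand-in for the student model
criterion = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", patience=10)
metric = MulticlassJaccardIndex(num_classes=NUM_CLASSES, ignore_index=IGNORE_INDEX)

# Early stopping with patience 30, driven by the validation Jaccard index:
#   if val_iou > best_iou: best_iou, bad_epochs = val_iou, 0
#   else: bad_epochs += 1
#   if bad_epochs >= 30: stop training
#   scheduler.step(val_iou)
```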


Results

The table below summarizes all results:

[Image: results table]

Scratch Training Without KD

[Image: results for scratch training without KD]

With Response-Based KD

[Image: results with response-based KD]

With Feature-Based KD

[Image: results with feature-based KD]

Code

Train

All training logic is in train.py and train_kd.py.
python train.py --model_name light -> trains from scratch, without any KD
python train_kd.py --kd_method response --model_name light --temperature 4 --alpha 0.3 -> uses response-based KD
python train_kd.py --kd_method feature --model_name light -> uses feature-based KD

Test

python test.py --training_mode teacher
python test.py --training_mode regular
python test.py --training_mode response
python test.py --training_mode feature

To inspect the model (designed by me, with a focus on feature extraction):

python model.py --m light (prints the model and a torchsummary)
