Source code for applying knowledge distillation techniques to machine learning models. More info --> https://neptune.ai/blog/knowledge-distillation
In practice, knowledge distillation means using a larger, better-performing model (the teacher) to guide a less complex one (the student).
In response-based knowledge distillation, the student model tries to imitate the teacher model's logits, and to do this the algorithm uses a dedicated distillation loss. Put more simply, the decoder side of the student model tries to behave like the teacher's decoder. In the loss, x is the input, W are the weights, alpha, B, and T are scalars, sigma is the temperature-modified softmax, and H is the cross-entropy loss.
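For reference, the standard response-based loss combines a hard-label cross-entropy term with a temperature-softened term comparing student and teacher outputs. Below is a minimal PyTorch sketch of such a loss; the function name, default values, and exact reductions are illustrative assumptions, not necessarily the code used in train_kd.py.

```python
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, targets,
                     temperature=4.0, alpha=0.3, ignore_index=255):
    # Hard-label term: ordinary cross entropy against the ground-truth masks.
    hard = F.cross_entropy(student_logits, targets, ignore_index=ignore_index)
    # Soft-label term: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # alpha balances the hard-label and soft-label terms.
    return alpha * hard + (1.0 - alpha) * soft
```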

In feature-based knowledge distillation, the student model instead tries to copy the encoder side of the teacher, again penalizing dissimilarities with a dedicated loss function. The F's are the respective feature sets and alpha is a hyperparameter.
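As a rough illustration, a feature-based distillation term can be as simple as an L2 distance between the student's and teacher's intermediate feature maps, weighted by alpha. This is only a sketch under the assumption that the feature maps are spatially resampled to match; the actual loss in train_kd.py may also project channels (e.g. with a 1x1 convolution) and weight the terms differently.

```python
import torch.nn.functional as F

def feature_kd_loss(student_feats, teacher_feats, alpha=0.5):
    # Resample the student features if the spatial sizes differ.
    if student_feats.shape[-2:] != teacher_feats.shape[-2:]:
        student_feats = F.interpolate(
            student_feats, size=teacher_feats.shape[-2:],
            mode="bilinear", align_corners=False,
        )
    # L2 distance between the two feature sets; the teacher is detached so
    # gradients only flow into the student.
    return alpha * F.mse_loss(student_feats, teacher_feats.detach())
```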

The idea for the student was taken from the EfficientNet family; due to its similarities with the ResNet family, I used 10% to 50% dropout to reduce the computational load and keep the number of model parameters down while preventing overfitting. The detailed structure is given in model.py; the number of trainable parameters with dropout is approximately 8.5 M.
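The 8.5 M figure refers to parameters that require gradients; it can be reproduced with a small helper like the one below (torchsummary, used via model.py, reports the same "trainable params" count).

```python
def count_trainable_params(model) -> int:
    # Sum over parameters that receive gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```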
I used ResNet50-YOLO as the teacher, which has around 60 M parameters: https://github.com/makatx/YOLO_ResNet
The PASCAL VOC 2012 segmentation dataset was used, with appropriate transformations applied to both images and masks. The dataset includes a total of 21 semantic classes, including the background. Mask preprocessing included resizing to 256x256 and conversion to long tensor format, with ignored pixels (label 255) handled explicitly.
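A minimal sketch of the mask preprocessing described above (nearest-neighbour resize so class ids are not interpolated, then conversion to a long tensor of class indices); the helper name and exact transform calls are assumptions rather than the code in this repo.

```python
import torch
from PIL import Image
from torchvision.transforms import functional as TF

def preprocess_mask(mask: Image.Image) -> torch.Tensor:
    # Nearest-neighbour resize keeps the integer class labels intact.
    mask = TF.resize(mask, [256, 256],
                     interpolation=TF.InterpolationMode.NEAREST)
    # Keep the raw labels (0..20 plus 255 for ignored pixels) as a long tensor.
    return TF.pil_to_tensor(mask).squeeze(0).long()
```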
The only augmentation in train.py was color jittering, while train_kd.py uses many more (RandomGrayscale, GaussianBlur, AdjustSharpness, AutoContrast). The two scripts use different augmentations because the train.py metrics became worse with heavier augmentation; a sketch of both pipelines is shown below.
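The two pipelines might look roughly like this; the specific jitter, blur, and probability values are illustrative assumptions, and only the choice of transforms follows the description above. All of these are photometric, so they can be applied to the images without altering the masks.

```python
from torchvision import transforms

# train.py: colour jittering only (parameters here are placeholders).
plain_augment = transforms.ColorJitter(brightness=0.25, contrast=0.25,
                                       saturation=0.25, hue=0.1)

# train_kd.py: a richer augmentation stack.
kd_augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.25, contrast=0.25,
                           saturation=0.25, hue=0.1),
    transforms.RandomGrayscale(p=0.1),
    transforms.GaussianBlur(kernel_size=3),
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.5),
    transforms.RandomAutocontrast(p=0.5),
])
```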
The Adam optimizer was used, with a learning rate of 1×10^(-3) and a weight decay of 1×10^(-4); I also experimented with SGD, but Adam worked better. The cross-entropy loss with ignore index 255 was used, since it is suitable for segmentation tasks. Training used a batch size of 64 for up to 500 epochs, with early stopping (patience 30) combined with a ReduceLROnPlateau learning-rate scheduler. The main metric was the multiclass Jaccard index; the corresponding setup is sketched below.
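Put together, the configuration above corresponds roughly to the following setup. The scheduler's own patience/factor, the dummy stand-in model, and the use of torchmetrics for the Jaccard index are assumptions; see train.py for the actual values.

```python
import torch
from torch import nn
from torchmetrics.classification import MulticlassJaccardIndex

model = nn.Conv2d(3, 21, kernel_size=1)  # stand-in for the "light" model from model.py

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                       patience=10)  # assumed values
criterion = nn.CrossEntropyLoss(ignore_index=255)   # ignored pixels do not contribute
metric = MulticlassJaccardIndex(num_classes=21, ignore_index=255)  # main metric
```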
The table below shows all the results:
All training logic is in train.py and train_kd.py
python train.py --model_name light -> will train from scratch, without any KD
python train_kd.py --kd_method response --model_name light --temperature 4 --alpha 0.3 -> will utilize response-based KD
python train_kd.py --kd_method feature --model_name light -> will utilize feature-based KD
python test.py --training_mode teacher -> evaluates the ResNet50-YOLO teacher
python test.py --training_mode regular -> evaluates the student trained from scratch
python test.py --training_mode response -> evaluates the response-based KD student
python test.py --training_mode feature -> evaluates the feature-based KD student
python model.py --m light -> prints the light model and its torchsummary summary




