Hi authors, thank you very much for putting out this amazing research. I have a question about how network compression without fine-tuning is done. Let's take a model, say resnet18, for example. By my understanding, we first convert resnet18 (pretrained on ImageNet) to an INN with py2opt. After that, do we partition the parameters into groups? Is each group then approximated by a function, which leads to model compression?
Hi, thank you for your interest in our research.

For structural pruning (resampling) of a neural network we build a list of IntegralGroups. Two parameters belong to the same group if they share a common sampling grid. For example, if you have two sequential convolutional layers `conv_1` and `conv_2`, then `conv_1.weight` dim=0, `conv_1.bias` dim=0 and `conv_2.weight` dim=1 belong to the same group. In a residual block of a ResNet that looks like `conv_2(conv_1(x)) + x` (activation functions omitted), `conv_1.weight` dim=0, `conv_1.bias` dim=0 and `conv_2.weight` dim=1 belong to one group, and the other group includes `conv_1.weight` dim=1, `conv_2.weight` dim=0 and `conv_2.bias` dim=0, because the input of `conv_1` and the output of `conv_2` are added together by the skip connection, so they must have the same size and therefore share the same sampling grid.
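To make the grouping concrete, here is a minimal, hand-written sketch for the residual-block example above. It is not the library's group-building code; the names (`block`, `groups`) are illustrative only, and the check at the end simply verifies that every tensor dimension placed in one group has the same size.

```python
import torch.nn as nn

# Toy residual block in the sense of conv_2(conv_1(x)) + x (activations omitted).
block = nn.ModuleDict({
    "conv_1": nn.Conv2d(64, 32, kernel_size=3, padding=1),
    "conv_2": nn.Conv2d(32, 64, kernel_size=3, padding=1),
})

# Hand-written groups: each entry pairs a parameter name with the dimension that
# must share one sampling grid with the other members of its group.
groups = [
    # conv_1 outputs feed conv_2 inputs -> one common grid.
    [("conv_1.weight", 0), ("conv_1.bias", 0), ("conv_2.weight", 1)],
    # The skip connection adds the block input to conv_2's output,
    # so conv_1's input channels and conv_2's output channels also share a grid.
    [("conv_1.weight", 1), ("conv_2.weight", 0), ("conv_2.bias", 0)],
]

# Sanity check: every (tensor, dim) pair in a group has the same size along its dim.
params = dict(block.named_parameters())
for group in groups:
    sizes = {params[name].shape[dim] for name, dim in group}
    assert len(sizes) == 1, f"group {group} mixes sizes {sizes}"
    print(group, "-> shared grid size", sizes.pop())
```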
So dim 0 of the previous layer is always grouped with dim 1 of the following layer. The pruning effect comes from the number of sampling grid points we choose when computing the integral (the fewer points, the stronger the pruning). When performing the permutation of the weight tensor we are only interested in the total variation along dim=0 or dim=1, so in your case we can treat this (16, 1, 5, 5) tensor as a tensor of shape (16, 25) and calculate the total variation along its rows (see the sketch below).
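Below is a small illustration of that reshape-and-measure step, assuming the common definition of total variation as the sum of absolute differences between consecutive slices along the chosen dimension; the exact formula used in the library may differ, and `total_variation` here is a hypothetical helper, not part of the package.

```python
import torch


def total_variation(weight: torch.Tensor, dim: int = 0) -> torch.Tensor:
    """Sum of absolute differences between consecutive slices along `dim`."""
    # Flatten everything except `dim`, e.g. (16, 1, 5, 5) with dim=0 -> (16, 25).
    w = weight.movedim(dim, 0).reshape(weight.shape[dim], -1)
    return (w[1:] - w[:-1]).abs().sum()


# Example: a conv weight of shape (16, 1, 5, 5), as in the discussion above.
weight = torch.randn(16, 1, 5, 5)
print(total_variation(weight, dim=0))  # variation along output channels
print(total_variation(weight, dim=1))  # variation along input channels (only one, so 0)
```

The idea, as I understand it, is that permuting channels along the grouped dimension so that this value decreases makes the weights smoother along the continuous grid, which helps when they are later resampled with fewer grid points.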