Runtime error with Amazon baseline

Hi, 
I'm trying to run the repo code with the Amazon and Yelp datasets before trying it on some of my own. I am running into the following error with the Amazon dataset and the baseline model. (I set `torch.autograd.set_detect_anomaly(True)` beforehand.)

```
Warning: Error detected in MmBackward. Traceback of forward call that caused the error:
  File "/h/vkpriya/CP-VAE/run_baseline.py", line 91, in <module>
    main(args)
  File "/h/vkpriya/CP-VAE/run_baseline.py", line 63, in main
    valid_loss = model.fit()
  File "/scratch/ssd001/home/vkpriya/CP-VAE/models/aggressive_vae.py", line 189, in fit
    self.train(epoch)
  File "/scratch/ssd001/home/vkpriya/CP-VAE/models/aggressive_vae.py", line 89, in train
    logits, kl = self.vae.loss(batch_data_enc)
  File "/scratch/ssd001/home/vkpriya/CP-VAE/models/vae.py", line 41, in loss
    z, KL = self.encode(x, nsamples)
  File "/scratch/ssd001/home/vkpriya/CP-VAE/models/vae.py", line 35, in encode
    return self.encoder.encode(x, nsamples)
  File "/scratch/ssd001/home/vkpriya/CP-VAE/models/base_network.py", line 72, in encode
    mu, logvar = self.forward(inputs)
  File "/scratch/ssd001/home/vkpriya/CP-VAE/models/base_network.py", line 158, in forward
    mean, logvar = self.linear(hidden_repr).chunk(2, -1)
  File "/h/vkpriya/condaenvs/pyt_cu/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/h/vkpriya/condaenvs/pyt_cu/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/h/vkpriya/condaenvs/pyt_cu/torch/nn/functional.py", line 1612, in linear
    output = input.matmul(weight.t())
 (print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:60)
Vocabulary size: 60229
Experiment dir: /h/vkpriya/CP-VAE/outputs/baseline/amazon-amazon/20201201-223426
Traceback (most recent call last):
  File "/h/vkpriya/CP-VAE/run_baseline.py", line 91, in <module>
    main(args)
  File "/h/vkpriya/CP-VAE/run_baseline.py", line 63, in main
    valid_loss = model.fit()
  File "/scratch/ssd001/home/vkpriya/CP-VAE/models/aggressive_vae.py", line 189, in fit
    self.train(epoch)
  File "/scratch/ssd001/home/vkpriya/CP-VAE/models/aggressive_vae.py", line 128, in train
    loss.backward()
  File "/h/vkpriya/condaenvs/pyt_cu/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/h/vkpriya/condaenvs/pyt_cu/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1024, 64]], which is output 0 of TBackward, is at version 32; expected version 31 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```
I am trying to debug it on my end, but any help would be appreciated! 
(P.S. I get the same error with both the baseline and CP-VAE models on my own datasets as well.)
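
For anyone trying to reproduce this outside the repo, here is a minimal sketch that triggers the same class of error in isolation. It rests on an assumption I have not confirmed against the repo code: that somewhere an optimizer step (or another in-place parameter update) happens between a forward pass and the `backward()` call on that same forward pass. The `linear` layer, the tensor shapes, and both function names are hypothetical stand-ins, not code from `aggressive_vae.py`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the encoder's final linear layer (not the repo's code).
linear = nn.Linear(4, 2)
optimizer = torch.optim.SGD(linear.parameters(), lr=0.1)


def stale_graph_backward():
    """Forward, update the weights in place, then backward through the old graph."""
    x = torch.randn(3, 4, requires_grad=True)
    loss = linear(x).sum()  # the graph saves the weight to compute the input gradient

    # A separate forward/backward populates .grad, then step() updates the
    # parameters in place, which bumps the weight's version counter.
    linear(torch.randn(3, 4)).sum().backward()
    optimizer.step()

    try:
        loss.backward()  # still refers to the weight as it was before step()
    except RuntimeError as err:
        return str(err)
    return None


def fresh_graph_backward():
    """Same update, but the loss is recomputed after step(), so the graph is current."""
    linear(torch.randn(3, 4)).sum().backward()
    optimizer.step()
    optimizer.zero_grad()

    x = torch.randn(3, 4, requires_grad=True)
    loss = linear(x).sum()  # forward pass happens after the in-place update
    try:
        loss.backward()
    except RuntimeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print("stale graph:", stale_graph_backward())
    print("fresh graph:", fresh_graph_backward())
```

If the training loop does follow this pattern, recomputing the loss after the update (as in `fresh_graph_backward`) should avoid the error, but I have not verified that this is what happens in `train`.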

Thank you!
