
Conversation

@osma (Member) commented Jan 13, 2026

This PR reimplements the NN ensemble using PyTorch instead of Keras/TensorFlow.

To test this, you will have to use uv sync --group all --extra torch-cpu or similar (see comments below).

Some notes about the implementation:

  • the neural network architecture should be pretty much the same as before, although some tensors now have a different shape (they are no longer transposed)
  • there may be differences in the optimizer setup and loss calculation; I didn't look too closely at them
  • the old code displayed top_k_categorical_accuracy, but this was not easily available in PyTorch, so I switched to using the nDCG metric from the torchmetrics package
  • the progress bar shown during training now uses tqdm, so it looks a bit different from the Keras one; it is also written to stderr rather than stdout, unlike the old one
  • the old code showed a detailed error message when model loading failed; I haven't figured out (yet) how to do that with PyTorch models, but the model is stored with metadata (Python version, torch version etc.) that may be helpful for implementing such an error message later on if it turns out to be necessary (see the sketch after this list). In general, the models should be pretty much PyTorch-version-agnostic, so there may not be a need for this.
  • I only defined torch-cpu and torch-cu128 extras for now, but I think the setup could quite easily be extended to other PyTorch variants such as CUDA 12.6 or 13.0, ROCm or Intel XPU, though obviously these would require more configuration in pyproject.toml
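
As an illustration of the metadata idea mentioned above, a checkpoint like this can be written and read with plain torch.save/torch.load. This is only a sketch: the save_nn_ensemble/load_nn_ensemble helper names and the metadata keys are hypothetical, not the actual Annif code.

```python
import sys
import torch


def save_nn_ensemble(model: torch.nn.Module, path: str) -> None:
    # store the weights together with environment metadata for later diagnostics
    torch.save(
        {
            "state_dict": model.state_dict(),
            "torch_version": torch.__version__,
            "python_version": sys.version,
        },
        path,
    )


def load_nn_ensemble(model: torch.nn.Module, path: str) -> torch.nn.Module:
    checkpoint = torch.load(path, map_location="cpu")
    try:
        model.load_state_dict(checkpoint["state_dict"])
    except RuntimeError as err:
        # the stored metadata can be used to make the error message more informative
        raise RuntimeError(
            f"Loading NN ensemble model failed (saved with torch "
            f"{checkpoint.get('torch_version')}, Python "
            f"{checkpoint.get('python_version')}): {err}"
        ) from err
    return model
```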

I have not yet measured how well this performs in terms of quality, computational performance or memory usage.

Fixes #895

osma self-assigned this on Jan 13, 2026
osma force-pushed the issue895-nn-ensemble-pytorch branch from 5bdbf64 to d82a54a on January 13, 2026 11:40
codecov bot commented Jan 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.63%. Comparing base (27e4ac7) to head (f949e1c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #926      +/-   ##
==========================================
- Coverage   99.63%   99.63%   -0.01%     
==========================================
  Files         103      103              
  Lines        8238     8226      -12     
==========================================
- Hits         8208     8196      -12     
  Misses         30       30              


osma force-pushed the issue895-nn-ensemble-pytorch branch from d82a54a to da479eb on January 15, 2026 12:25
@osma (Member, Author) commented Jan 15, 2026

Selecting the PyTorch variant (CPU, CUDA x.y, ROCm, ...) when setting up the development environment with uv sync has been a headache, but I think I've found a workable solution. It's not super elegant, but at least it seems to work.

The problem is that uv sync wants to perform "universal resolution", that is, resolve all the transitive dependencies once and for all, then write the result into the uv.lock file. This can be parameterized by OS, Python version and some other factors, but not by anything that the user could set when running uv sync. Since different PyTorch variants have different dependencies (e.g. CUDA libraries), dependencies for each of them would have to be resolved separately.

But fortunately, it is possible to have some degree of control over the resolution by setting up "extras" and then declaring a "conflict" between them. This causes uv to "fork" the resolution into different "branches", each having their own dependency tree.

So in commit e629963, I added two new extras: torch-cpu (CPU only) and torch-cu128 (CUDA 12.8 GPU), and declared a conflict between them, i.e. you can't install both extras at the same time. (This unfortunately causes --all-extras to stop working, which is a shame, since it means that lots of specific --extra parameters are needed in typical situations.) These extras are then tied to specific PyTorch package indexes and thus to different variants of the torch package.

The end result is that these two extras can be used to select the PyTorch variant at uv sync time. The torch dependency is still also defined for the nn extra, without a specific index. This means that installing only the nn extra will install whatever is the default PyTorch variant (on Linux it is a CUDA variant).
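
For readers unfamiliar with the mechanism, the relevant pyproject.toml pieces look roughly like this. It is a simplified sketch following uv's documented PyTorch setup; the index names and dependency lists are placeholders, not a copy of commit e629963.

```toml
[project.optional-dependencies]
nn = ["torch", "lmdb"]   # default torch variant, no index pinned
torch-cpu = ["torch"]
torch-cu128 = ["torch"]

[tool.uv]
# the extras are mutually exclusive, which forks the resolution into branches
conflicts = [
    [{ extra = "torch-cpu" }, { extra = "torch-cu128" }],
]

[tool.uv.sources]
# tie each extra to a specific PyTorch package index
torch = [
    { index = "pytorch-cpu", extra = "torch-cpu" },
    { index = "pytorch-cu128", extra = "torch-cu128" },
]

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
```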

Here are examples of how this works now:

1. uv sync without extras

This installs 439MB of dependencies, no PyTorch.

$ uv sync
Resolved 212 packages in 1.71s
      Built annif @ file:///home/oisuomin/git/Annif
Prepared 1 package in 261ms
Uninstalled 1 package in 0.21ms
Installed 1 package in 0.50ms
 ~ annif==1.5.0.dev0 (from file:///home/oisuomin/git/Annif)

$ du -sh .venv
439M	.venv

2. uv sync with just the nn extra

This installs the default PyTorch CUDA variant, for a total of 2.2GB of dependencies.

$ uv sync --extra nn
Resolved 212 packages in 0.77ms
Installed 6 packages in 96ms
 + lmdb==1.7.5
 + mpmath==1.3.0
 + networkx==3.6.1
 + setuptools==80.9.0
 + sympy==1.14.0
 + torch==2.9.1

$ du -sh .venv
2.2G	.venv

3. uv sync with both nn and torch-cpu extras

This switches to the CPU-only variant of PyTorch. Dependencies are now only 1.2GB.

$ uv sync --extra nn --extra torch-cpu
Resolved 212 packages in 0.78ms
Uninstalled 1 package in 69ms
Installed 1 package in 93ms
 - torch==2.9.1
 + torch==2.9.1+cpu

$ du -sh .venv
1.2G	.venv

4. uv sync with both nn and torch-cu128 extras

This installs the PyTorch CUDA 12.8 variant and lots of nvidia-* library packages, for a whopping 7.0GB of dependencies. (I wonder why this isn't the same as the default PyTorch CUDA build that got installed in step 2 above?)

$ uv sync --extra nn --extra torch-cu128
Resolved 212 packages in 0.77ms
Uninstalled 1 package in 72ms
Installed 17 packages in 97ms
 + nvidia-cublas-cu12==12.8.4.1
 + nvidia-cuda-cupti-cu12==12.8.90
 + nvidia-cuda-nvrtc-cu12==12.8.93
 + nvidia-cuda-runtime-cu12==12.8.90
 + nvidia-cudnn-cu12==9.10.2.21
 + nvidia-cufft-cu12==11.3.3.83
 + nvidia-cufile-cu12==1.13.1.3
 + nvidia-curand-cu12==10.3.9.90
 + nvidia-cusolver-cu12==11.7.3.90
 + nvidia-cusparse-cu12==12.5.8.93
 + nvidia-cusparselt-cu12==0.7.1
 + nvidia-nccl-cu12==2.27.5
 + nvidia-nvjitlink-cu12==12.8.93
 + nvidia-nvshmem-cu12==3.3.20
 + nvidia-nvtx-cu12==12.8.90
 - torch==2.9.1+cpu
 + torch==2.9.1+cu128
 + triton==3.5.1

$ du -sh .venv
7.0G	.venv

osma mentioned this pull request on Jan 15, 2026
@osma (Member, Author) commented Jan 16, 2026

I refined the above solution by adding an all dependency group (because --all-extras cannot be used anymore). Now a basic developer environment with all the extra features (CPU-only PyTorch) can be set up with:

uv sync --group all --extra torch-cpu

Maybe not ideal, but it works.
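
For reference, PEP 735 dependency groups can include other groups, so an umbrella group can be composed roughly like this (the group names below are illustrative, not the actual contents of Annif's pyproject.toml):

```toml
[dependency-groups]
dev = ["pytest", "ruff"]
docs = ["sphinx"]
# umbrella group so that a single --group flag pulls in everything
all = [
    { include-group = "dev" },
    { include-group = "docs" },
]
```

The PyTorch variant is still chosen separately, by adding --extra torch-cpu or --extra torch-cu128 on top of --group all.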

osma requested a review from juhoinkinen on January 16, 2026 13:54
osma added this to the 1.5 milestone on Jan 16, 2026
osma marked this pull request as ready for review on January 16, 2026 15:54
osma changed the title from "[WIP] Reimplement NN ensemble using PyTorch" to "Reimplement NN ensemble using PyTorch" on Jan 16, 2026
@juhoinkinen (Member) commented Jan 22, 2026

I ran benchmarks using the Annif-tutorial YSO-NLF dataset on the annif-data-kk server (it has 6 CPUs).

The script used and the output data are in the benchmarking branch.

train

|                         | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|-------------------------|-------------------|---------------------|-------------------|---------------------|
| user time (seconds)     | 2810.63           | 3023.01             | 2948.25           | 3208.04             |
| percent CPU             | 106%              | 112%                | 571%              | 538%                |
| wall time               | 44:26.96          | 45:36.19            | 8:45.21           | 10:10.04            |
| max RSS                 | 3_368_876         | 7_076_980           | 2_599_604         | 6_764_364           |
| model disk size (bytes) | 1_304_759_580     | 1_131_495_858       | (same as -j1)     | (same as -j1)       |

eval

|             | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|-------------|-------------------|---------------------|-------------------|---------------------|
| user time   | 475.29            | 471.15              | 485.92            | 473.70              |
| percent CPU | 99%               | 99%                 | 498%              | 507%                |
| wall time   | 7:58.65           | 7:53.83             | 1:38.66           | 1:34.24             |
| max RSS     | 2_666_460         | 2_176_184           | 2_105_688         | 1_840_860           |
| nDCG        | 0.4805            | 0.4750              | 0.4775            | 0.4691              |

Compared to the TensorFlow implementation, PyTorch requires about twice as much memory during training and is slightly slower (~107% of the user time); in inference the situation is the opposite: PyTorch is slightly faster (~98% of the user time) and uses less memory.

@osma (Member, Author) commented Jan 22, 2026

Thanks @juhoinkinen ! The RAM usage doubling is interesting. First hypothesis: Maybe PT uses higher precision floats than TF? I'll investigate.

@osma (Member, Author) commented Jan 23, 2026

The increase in memory use during training was mainly due to the way the nDCG scores were calculated, which caused a lot of large tensors to be kept in memory, especially towards the end of a training epoch. I switched away from the torchmetrics implementation and instead implemented the calculation with a custom function that doesn't keep the tensors allocated (a sketch of the approach is below).
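
For illustration, such a metric can be computed per batch with plain torch ops and reduced to a Python float immediately, so no large tensors stay allocated across the epoch. This is a sketch of the general idea, not the exact function added in this PR; ndcg_at_k is a hypothetical name.

```python
import torch


def ndcg_at_k(preds: torch.Tensor, targets: torch.Tensor, k: int = 10) -> float:
    """Mean nDCG@k over a batch: rows are score vectors, targets are binary relevance."""
    targets = targets.to(preds.dtype)
    k = min(k, preds.size(1))
    discounts = 1.0 / torch.log2(
        torch.arange(2, k + 2, dtype=preds.dtype, device=preds.device)
    )

    # DCG: relevance of the top-k predicted labels, discounted by rank
    top_idx = preds.topk(k, dim=1).indices
    dcg = (targets.gather(1, top_idx) * discounts).sum(dim=1)

    # ideal DCG: all relevant labels ranked first
    idcg = (targets.topk(k, dim=1).values * discounts).sum(dim=1)

    ndcg = torch.where(idcg > 0, dcg / idcg, torch.zeros_like(dcg))
    # returning a plain float means nothing from this batch stays referenced
    return ndcg.mean().item()
```

Accumulating these per-batch floats into a running average keeps the memory footprint flat over the epoch, instead of retaining per-batch prediction tensors until the end.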

@juhoinkinen (Member) commented Jan 23, 2026

I re-ran the benchmarks; the full output is here.

Memory usage during training is now even lower than it was with TensorFlow!

train

|                         | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|-------------------------|-------------------|---------------------|-------------------|---------------------|
| user time (seconds)     | 2810.63           | 2905.34             | 2948.25           | 3211.42             |
| percent CPU             | 106%              | 109%                | 571%              | 562%                |
| wall time               | 44:26.96          | 45:02.86            | 8:45.21           | 9:47.82             |
| max RSS                 | 3_368_876         | 2_910_240           | 2_599_604         | 2_355_956           |
| model disk size (bytes) | 1_304_759_580     | 1_304_759_580 (?)   | (same as -j1)     | (same as -j1)       |

eval

|             | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|-------------|-------------------|---------------------|-------------------|---------------------|
| user time   | 475.29            | 491.40              | 485.92            | 509.55              |
| percent CPU | 99%               | 99%                 | 498%              | 504%                |
| wall time   | 7:58.65           | 8:14.79              | 1:38.66           | 1:41.86             |
| max RSS     | 2_666_460         | 2_186_396           | 2_105_688         | 1_836_360           |
| nDCG        | 0.4805            | 0.4719              | 0.4775            | 0.4719              |

@juhoinkinen (Member) commented:

Data for the latest change with Conv1d:

train

|                         | After (this PR, Conv1d) -j6 |
|-------------------------|-----------------------------|
| user time (seconds)     | 3237.45                     |
| percent CPU             | 563%                        |
| wall time               | 9:57.96                     |
| max RSS                 | 2_392_068                   |
| model disk size (bytes) | 1_304_759_909               |

eval

|             | After (this PR, Conv1d) -j6 |
|-------------|-----------------------------|
| user time   | 515.05                      |
| percent CPU | 507%                        |
| wall time   | 1:42.60                     |
| max RSS     | 1_889_644                   |
| nDCG        | 0.4943                      |

So nDCG improved 0.0224 (and F1@5 0.0127)! 📈

No adverse effects on performance or memory usage.
