
Conversation

@osma (Member) commented Jan 13, 2026

This PR reimplements the NN ensemble using PyTorch instead of Keras/TensorFlow.

To test this, you will have to use uv sync --group all --extra torch-cpu or similar (see comments below).

Some notes about the implementation:

  • the neural network architecture should be pretty much the same as before, although some tensors now have a different shape (they are no longer transposed)
  • there may be differences in the optimizer setup and loss calculation; I didn't look too closely at them
  • the old code displayed top_k_categorical_accuracy, but this was not easily available in PyTorch, so I switched to using the nDCG metric from the torchmetrics package
  • the progress bar shown during training now uses tqdm, so it looks a bit different from the Keras one; it is also written to stderr rather than stdout, unlike the old one
  • the old code showed a detailed error message when model loading failed; I haven't figured out (yet) how to do that with PyTorch models, but the model is stored with metadata (Python version, torch version etc.) that may be helpful for implementing such an error message later on if it turns out to be necessary (see the sketch after this list). In general, the models should be pretty much PyTorch-version-agnostic, so there may not be a need for this.
  • I only defined torch-cpu and torch-cu128 extras for now, but I think the setup could quite easily be extended to other PyTorch variants such as CUDA 12.6 or 13.0, ROCm or Intel XPU, though obviously these would require more configuration in pyproject.toml
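
As an illustration of the metadata idea mentioned above, a checkpoint like this can be written and read with plain torch.save/torch.load. This is only a sketch: the save_nn_ensemble/load_nn_ensemble helper names and the metadata keys are hypothetical, not the actual Annif code.

```python
import sys
import torch


def save_nn_ensemble(model: torch.nn.Module, path: str) -> None:
    # store the weights together with environment metadata for later diagnostics
    torch.save(
        {
            "state_dict": model.state_dict(),
            "torch_version": torch.__version__,
            "python_version": sys.version,
        },
        path,
    )


def load_nn_ensemble(model: torch.nn.Module, path: str) -> torch.nn.Module:
    checkpoint = torch.load(path, map_location="cpu")
    try:
        model.load_state_dict(checkpoint["state_dict"])
    except RuntimeError as err:
        # the stored metadata can be used to make the error message more informative
        raise RuntimeError(
            f"Loading NN ensemble model failed (saved with torch "
            f"{checkpoint.get('torch_version')}, Python "
            f"{checkpoint.get('python_version')}): {err}"
        ) from err
    return model
```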

I have not yet measured how well this performs in terms of quality, computational performance or memory usage.

Fixes #895

osma self-assigned this on Jan 13, 2026
osma force-pushed the issue895-nn-ensemble-pytorch branch from 5bdbf64 to d82a54a on January 13, 2026 11:40
codecov bot commented Jan 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.63%. Comparing base (27e4ac7) to head (f949e1c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #926      +/-   ##
==========================================
- Coverage   99.63%   99.63%   -0.01%     
==========================================
  Files         103      103              
  Lines        8238     8226      -12     
==========================================
- Hits         8208     8196      -12     
  Misses         30       30              


osma force-pushed the issue895-nn-ensemble-pytorch branch from d82a54a to da479eb on January 15, 2026 12:25
@osma (Member, Author) commented Jan 15, 2026

Selecting the PyTorch variant (CPU, CUDA x.y, ROCm, ...) when setting up the development environment with uv sync has been a headache, but I think I've found a workable solution. It's not super elegant, but at least it seems to work.

The problem is that uv sync wants to perform "universal resolution", that is, resolve all the transitive dependencies once and for all, then write the result into the uv.lock file. This can be parameterized by OS, Python version and some other factors, but not by anything that the user could set when running uv sync. Since different PyTorch variants have different dependencies (e.g. CUDA libraries), dependencies for each of them would have to be resolved separately.

But fortunately, it is possible to have some degree of control over the resolution by setting up "extras" and then declaring a "conflict" between them. This causes uv to "fork" the resolution into different "branches", each having their own dependency tree.

So in commit e629963, I added two new extras: torch-cpu (CPU only) and torch-cu128 (CUDA 12.8 GPU), and declared a conflict between them, i.e. you can't install both extras at the same time. (This unfortunately causes --all-extras to stop working, which is a shame, since it means that lots of specific --extra parameters are needed in typical situations.) These extras are then tied to specific PyTorch package indexes and thus to different variants of the torch package.

The end result is that these two extras can be used to select the PyTorch variant at uv sync time. The torch dependency is still also defined for the nn extra, without a specific index. This means that installing only the nn extra will install whatever is the default PyTorch variant (on Linux it is a CUDA variant).
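
For readers unfamiliar with the mechanism, the relevant pyproject.toml pieces look roughly like this. It is a simplified sketch following uv's documented PyTorch setup; the index names and dependency lists are placeholders, not a copy of commit e629963.

```toml
[project.optional-dependencies]
nn = ["torch", "lmdb"]   # default torch variant, no index pinned
torch-cpu = ["torch"]
torch-cu128 = ["torch"]

[tool.uv]
# the extras are mutually exclusive, which forks the resolution into branches
conflicts = [
    [{ extra = "torch-cpu" }, { extra = "torch-cu128" }],
]

[tool.uv.sources]
# tie each extra to a specific PyTorch package index
torch = [
    { index = "pytorch-cpu", extra = "torch-cpu" },
    { index = "pytorch-cu128", extra = "torch-cu128" },
]

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
```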

Here are examples of how this works now:

1. uv sync without extras

This installs 439MB of dependencies, no PyTorch.

$ uv sync
Resolved 212 packages in 1.71s
      Built annif @ file:///home/oisuomin/git/Annif
Prepared 1 package in 261ms
Uninstalled 1 package in 0.21ms
Installed 1 package in 0.50ms
 ~ annif==1.5.0.dev0 (from file:///home/oisuomin/git/Annif)

$ du -sh .venv
439M	.venv

2. uv sync with just the nn extra

This installs the default PyTorch CUDA variant, for a total of 2.2GB of dependencies.

$ uv sync --extra nn
Resolved 212 packages in 0.77ms
Installed 6 packages in 96ms
 + lmdb==1.7.5
 + mpmath==1.3.0
 + networkx==3.6.1
 + setuptools==80.9.0
 + sympy==1.14.0
 + torch==2.9.1

$ du -sh .venv
2.2G	.venv

3. uv sync with both nn and torch-cpu extras

This switches to the CPU-only variant of PyTorch. Dependencies are now only 1.2GB.

$ uv sync --extra nn --extra torch-cpu
Resolved 212 packages in 0.78ms
Uninstalled 1 package in 69ms
Installed 1 package in 93ms
 - torch==2.9.1
 + torch==2.9.1+cpu

$ du -sh .venv
1.2G	.venv

4. uv sync with both nn and torch-cu128 extras

This installs the PyTorch CUDA 12.8 variant and lots of nvidia-* library packages, for a whopping 7.0GB of dependencies. (I wonder why this isn't the same as the default PyTorch CUDA build that got installed in step 2 above?)

$ uv sync --extra nn --extra torch-cu128
Resolved 212 packages in 0.77ms
Uninstalled 1 package in 72ms
Installed 17 packages in 97ms
 + nvidia-cublas-cu12==12.8.4.1
 + nvidia-cuda-cupti-cu12==12.8.90
 + nvidia-cuda-nvrtc-cu12==12.8.93
 + nvidia-cuda-runtime-cu12==12.8.90
 + nvidia-cudnn-cu12==9.10.2.21
 + nvidia-cufft-cu12==11.3.3.83
 + nvidia-cufile-cu12==1.13.1.3
 + nvidia-curand-cu12==10.3.9.90
 + nvidia-cusolver-cu12==11.7.3.90
 + nvidia-cusparse-cu12==12.5.8.93
 + nvidia-cusparselt-cu12==0.7.1
 + nvidia-nccl-cu12==2.27.5
 + nvidia-nvjitlink-cu12==12.8.93
 + nvidia-nvshmem-cu12==3.3.20
 + nvidia-nvtx-cu12==12.8.90
 - torch==2.9.1+cpu
 + torch==2.9.1+cu128
 + triton==3.5.1

$ du -sh .venv
7.0G	.venv

osma mentioned this pull request on Jan 15, 2026
@osma (Member, Author) commented Jan 16, 2026

I refined the above solution by adding an all dependency group (because --all-extras cannot be used anymore). Now a basic developer environment with all the extra features (CPU-only PyTorch) can be set up with:

uv sync --group all --extra torch-cpu

Maybe not ideal, but it works.
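
For reference, PEP 735 dependency groups can include other groups, so an umbrella group can be composed roughly like this (the group names below are illustrative, not the actual contents of Annif's pyproject.toml):

```toml
[dependency-groups]
dev = ["pytest", "ruff"]
docs = ["sphinx"]
# umbrella group so that a single --group flag pulls in everything
all = [
    { include-group = "dev" },
    { include-group = "docs" },
]
```

The PyTorch variant is still chosen separately, by adding --extra torch-cpu or --extra torch-cu128 on top of --group all.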

osma requested a review from juhoinkinen on January 16, 2026 13:54
osma added this to the 1.5 milestone on Jan 16, 2026
osma marked this pull request as ready for review on January 16, 2026 15:54
osma changed the title from "[WIP] Reimplement NN ensemble using PyTorch" to "Reimplement NN ensemble using PyTorch" on Jan 16, 2026
@juhoinkinen (Member) commented Jan 22, 2026

I ran benchmarks using the Annif-tutorial YSO-NLF dataset on the annif-data-kk server (it has 6 CPUs).

The script used and the output data are in the benchmarking branch.

train

|                         | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|-------------------------|-------------------|---------------------|-------------------|---------------------|
| user time (seconds)     | 2810.63           | 3023.01             | 2948.25           | 3208.04             |
| percent CPU             | 106%              | 112%                | 571%              | 538%                |
| wall time               | 44:26.96          | 45:36.19            | 8:45.21           | 10:10.04            |
| max RSS                 | 3_368_876         | 7_076_980           | 2_599_604         | 6_764_364           |
| model disk size (bytes) | 1_304_759_580     | 1_131_495_858       | (same as -j1)     | (same as -j1)       |

eval

|             | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|-------------|-------------------|---------------------|-------------------|---------------------|
| user time   | 475.29            | 471.15              | 485.92            | 473.70              |
| percent CPU | 99%               | 99%                 | 498%              | 507%                |
| wall time   | 7:58.65           | 7:53.83             | 1:38.66           | 1:34.24             |
| max RSS     | 2_666_460         | 2_176_184           | 2_105_688         | 1_840_860           |
| nDCG        | 0.4805            | 0.4750              | 0.4775            | 0.4691              |

Compared to the TensorFlow implementation, PyTorch requires about twice as much memory during training and is slightly slower (~107% of the user time); in inference the situation is the opposite: PyTorch is slightly faster (~98% of the user time) and uses less memory.

@osma (Member, Author) commented Jan 22, 2026

Thanks @juhoinkinen ! The RAM usage doubling is interesting. First hypothesis: Maybe PT uses higher precision floats than TF? I'll investigate.

@osma (Member, Author) commented Jan 23, 2026

The increase in memory use during training was mainly due to the way the nDCG scores were calculated, which caused a lot of large tensors to be kept in memory, especially towards the end of a training epoch. I switched away from the torchmetrics implementation and instead implemented the calculation with a custom function that doesn't keep the tensors allocated (a sketch of the approach is below).
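
For illustration, such a metric can be computed per batch with plain torch ops and reduced to a Python float immediately, so no large tensors stay allocated across the epoch. This is a sketch of the general idea, not the exact function added in this PR; ndcg_at_k is a hypothetical name.

```python
import torch


def ndcg_at_k(preds: torch.Tensor, targets: torch.Tensor, k: int = 10) -> float:
    """Mean nDCG@k over a batch: rows are score vectors, targets are binary relevance."""
    targets = targets.to(preds.dtype)
    k = min(k, preds.size(1))
    discounts = 1.0 / torch.log2(
        torch.arange(2, k + 2, dtype=preds.dtype, device=preds.device)
    )

    # DCG: relevance of the top-k predicted labels, discounted by rank
    top_idx = preds.topk(k, dim=1).indices
    dcg = (targets.gather(1, top_idx) * discounts).sum(dim=1)

    # ideal DCG: all relevant labels ranked first
    idcg = (targets.topk(k, dim=1).values * discounts).sum(dim=1)

    ndcg = torch.where(idcg > 0, dcg / idcg, torch.zeros_like(dcg))
    # returning a plain float means nothing from this batch stays referenced
    return ndcg.mean().item()
```

Accumulating these per-batch floats into a running average keeps the memory footprint flat over the epoch, instead of retaining per-batch prediction tensors until the end.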

@juhoinkinen (Member) commented Jan 23, 2026

I re-ran the benchmarks; the full output is here.

Memory usage during training is now even lower than it was with TensorFlow!

train

|                         | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|-------------------------|-------------------|---------------------|-------------------|---------------------|
| user time (seconds)     | 2810.63           | 2905.34             | 2948.25           | 3211.42             |
| percent CPU             | 106%              | 109%                | 571%              | 562%                |
| wall time               | 44:26.96          | 45:02.86            | 8:45.21           | 9:47.82             |
| max RSS                 | 3_368_876         | 2_910_240           | 2_599_604         | 2_355_956           |
| model disk size (bytes) | 1_304_759_580     | 1_304_759_580 (?)   | (same as -j1)     | (same as -j1)       |

eval

|             | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|-------------|-------------------|---------------------|-------------------|---------------------|
| user time   | 475.29            | 491.40              | 485.92            | 509.55              |
| percent CPU | 99%               | 99%                 | 498%              | 504%                |
| wall time   | 7:58.65           | 8:14.79              | 1:38.66           | 1:41.86             |
| max RSS     | 2_666_460         | 2_186_396           | 2_105_688         | 1_836_360           |
| nDCG        | 0.4805            | 0.4719              | 0.4775            | 0.4719              |

@juhoinkinen (Member) commented:

Data for the latest change with Conv1d:

train

|                         | After (this PR, Conv1d) -j6 |
|-------------------------|-----------------------------|
| user time (seconds)     | 3237.45                     |
| percent CPU             | 563%                        |
| wall time               | 9:57.96                     |
| max RSS                 | 2_392_068                   |
| model disk size (bytes) | 1_304_759_909               |

eval

|             | After (this PR, Conv1d) -j6 |
|-------------|-----------------------------|
| user time   | 515.05                      |
| percent CPU | 507%                        |
| wall time   | 1:42.60                     |
| max RSS     | 1_889_644                   |
| nDCG        | 0.4943                      |

So nDCG improved 0.0224 (and F1@5 0.0127)! 📈

No adverse effects on performance or memory usage.
