
Conversation

@Lakshmi-bashyam
Collaborator

This PR adds xtransformer as an optional dependency, building on the previous xtransformer PR #540. It incorporates minor changes and updates the backend implementation to align with the latest Annif version.

@codecov

codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 30.68182% with 183 lines in your changes missing coverage. Please review.

Project coverage is 97.25%. Comparing base (6bae2e5) to head (0e9ad2c).

Files with missing lines             Patch %   Lines
tests/test_backend_xtransformer.py     9.27%   88 Missing ⚠️
annif/backend/xtransformer.py          8.42%   87 Missing ⚠️
annif/backend/__init__.py             16.66%    5 Missing ⚠️
tests/test_backend.py                 40.00%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #798      +/-   ##
==========================================
- Coverage   99.64%   97.25%   -2.40%     
==========================================
  Files          99      101       +2     
  Lines        7349     7606     +257     
==========================================
+ Hits         7323     7397      +74     
- Misses         26      209     +183     



@pytest.mark.skipif(
importlib.util.find_spec("pecos") is not None,
reason="test requires that YAKE is NOT installed",
Member

PECOS, not YAKE, right?

Collaborator Author

Oops, yes. Thanks for catching it.

@osma
Member

osma commented Sep 25, 2024

Thanks a lot for this new PR @Lakshmi-bashyam ! It really helps to have a clean starting point based on the current code.

We've now tested this briefly. We used the PLC (YKL) classification task, because it seemed simpler than predicting YSO subjects and the current classification quality (mainly using Omikuji Parabel and Bonsai) is not that good, so it seems likely that a new algorithm could achieve better results. (And it did!)

I set this up in the University of Helsinki HPC environment. We got access to an A100 GPU (which is way overkill for this...) so it was possible to train and evaluate models in a reasonable time.

Here are some notes, comments and observations:

Default BERT model missing

Training a model without setting model_shortcut didn't work for me. Apparently the model distilbert-base-multilingual-uncased cannot be found on Hugging Face Hub (maybe it has been deleted?). I set model_shortcut="distilbert-base-multilingual-cased" and it started working. (Later I changed to another BERT model, see below.)

Documentation and advice

There was some advice and a suggested config in this comment from Moritz. I think we would need something like this to guide users (including us at NLF!) on how to use the backend and what configuration settings to use. Eventually this could be a wiki page for the backend like the others we have already, but for now just a comment in this PR would be helpful for testing.

Here is the config I currently use for the YKL classification task in Finnish:

[ykl-xtransformer-fi]
name="YKL XTransformer Finnish"
language="fi"
backend="xtransformer"
analyzer="simplemma(fi)"
vocab="ykl"
batch_size=16
truncate_length=256
learning_rate=0.0001
num_train_epochs=3
max_leaf_size=18000
model_shortcut="TurkuNLP/bert-base-finnish-cased-v1"

Using the Finnish BERT model improved results a bit compared to the multilingual BERT model. It's a little slower and takes slightly more VRAM (7GB instead of 6GB in this task), probably because it's not a DistilBERT model.

This configuration achieves a Precision@1 score of 0.59 on the Finnish YKL classification task, which is slightly higher than what we get with Parabel and Bonsai (0.56-0.57).

If you have any insight into how to choose appropriate configuration settings based on e.g. the training data size, vocabulary size, task type, available hardware etc., that would be very valuable to include in the documentation. Pecos has tons of hyperparameters!

Example questions that I wonder about:

  1. Does the analyzer setting affect what the BERT model sees? I don't think so?
  2. How to select the number of epochs? (so far I've tried 1, 2 and 3 and got the best results with 3 epochs)
  3. How to set truncate_length and what is the maximum value? Can I increase it from 256 if my documents are longer than this?
  4. How to set max_leaf_size?
  5. How to set batch_size?
  6. Are there other important settings/hyperparameters that could be tuned for better results?

Pecos FutureWarning

I saw this warning a lot:

/home/xxx/.cache/pypoetry/virtualenvs/annif-fDHejL2r-py3.10/lib/python3.10/site-packages/pecos/xmc/xtransformer/matcher.py:411: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.

However, I think this is a problem in Pecos and probably not something we can easily fix ourselves. Maybe it will be fixed in a later release of Pecos. (I used libpecos 1.25 which is currently the most recent release on PyPI)
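For reference, what the warning asks for is a small change on the Pecos side (in matcher.py). A minimal sketch, assuming the checkpoint contains only tensors (if Pecos pickles other objects, they would need to be allowlisted via torch.serialization.add_safe_globals):

import torch

# Current behaviour (implicit weights_only=False): unrestricted unpickling
state = torch.load("model_checkpoint.pt")  # placeholder path

# What PyTorch recommends: restrict unpickling to tensors and
# explicitly allowlisted types
state = torch.load("model_checkpoint.pt", weights_only=True)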

Not working under Python 3.11

I first tried Python 3.11, but it seemed that there was no libpecos wheel for this Python version available on PyPI (and it couldn't be built automatically for some reason). So I switched to Python 3.10 for my tests. Again, this is really a problem with libpecos and not with the backend itself.

Unit tests not run under CI

The current tests seem to do a lot of mocking to avoid actually training models. This is probably sensible since actually training a model could require lots of resources. However, the end result is that test coverage is quite low, with less than 10% of lines covered.

Looking more closely, it seems like most of the tests aren't currently executed at all under GitHub Actions CI. I suspect this is because this is an optional dependency that isn't installed in the CI environment, so the tests are skipped. Fixing this in the CI config (.github/workflows/cicd.yml) should at least substantially improve the test coverage.
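For illustration, the skip guard pattern used in the tests (a sketch mirroring the snippet quoted earlier in this conversation; the test name is made up) means everything silently turns into a skip when pecos is absent:

import importlib.util

import pytest

pecos_missing = importlib.util.find_spec("pecos") is None

@pytest.mark.skipif(pecos_missing, reason="test requires that PECOS is installed")
def test_xtransformer_train():
    ...

So unless the CI workflow installs the optional dependency, these tests are reported as skipped rather than run.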

Code style and QA issues

There are some complaints from QA tools about the current code. These should be easy to fix. Not super urgent, but they should be fixed before we can consider merging this. (If some things are hard to fix we can reconsider them case by case)

  • Lint with Black fails in the CI run. The code doesn't follow Black style. Easy to fix by running black
  • SonarCloud complains about a few variable names and return types
  • github-advanced-security complains about imports (see previous comment above)

Dependency on PyTorch

Installing this optional dependency brings in a lot of dependencies, including PyTorch and CUDA. The virtualenv in my case (using poetry install --all-extras) is 5.7GB, while another one for the main branch (without pecos) is 2.6GB, an increase of over 3GB. I wonder if there is any way to reduce this? Especially if we want to include this in the Docker images, the huge size could become a problem.

Also, the NN ensemble backend is implemented using TensorFlow. It seems a bit wasteful to depend on both TensorFlow and PyTorch. Do you think it would make sense to try to reimplement the NN ensemble in PyTorch? This way we could at least drop the dependency on TensorFlow.
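To illustrate the scale of that idea, here is a generic sketch of a score-merging network in PyTorch. This is not the current NN ensemble architecture, just a rough indication that the model involved is small:

import torch
import torch.nn as nn

class EnsembleNet(nn.Module):
    # Toy network that merges score vectors from several base backends.

    def __init__(self, n_sources: int, n_subjects: int, hidden: int = 100):
        super().__init__()
        self.hidden = nn.Linear(n_sources * n_subjects, hidden)
        self.out = nn.Linear(hidden, n_subjects)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, n_sources * n_subjects) stacked base-backend outputs
        h = torch.relu(self.hidden(scores))
        return torch.sigmoid(self.out(h))  # per-subject scores in [0, 1]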


Again, thanks a lot for this and apologies for the long silence and the long comments! We can of course do some of the remaining work to get this integrated and merged on our side, because this seems like a very useful addition to the Annif backends. Even if you don't have any time to work on the code, just providing some advice on the configuration side would help a lot! For example, example configurations you've used at ZBW would be nice to see.

@sonarqubecloud

Quality Gate failed

Failed conditions
11.5% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

@juhoinkinen
Member

Especially if we want to include this in the Docker images, the huge size could become a problem.

I built a Docker image from this branch, and its size is 7.21 GB, which is much bigger than the Annif 1.1 image at 2.07 GB.

Not all users and use cases will need XTransformer or the other optional dependencies, so we could build different variants of the image and push them to quay.io (just by setting different build args in the GitHub Actions build step and tagging the images appropriately). But that can be done in a separate PR; I'll create an issue for this now.

@sonarqubecloud

Quality Gate failed

Failed conditions
1 Security Hotspot
13.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@katjakon

katjakon commented Aug 6, 2025

Draft for X-Transformer Wiki Page

I wrote a first draft of a potential Wiki page about X-Transformer, which includes hyperparameters and notes about optimization.
This can definitely be extended and modified as this PR evolves. Let me know if you have any notes!
Backend-X-Transformer.md

@osma
Member

osma commented Sep 16, 2025

There were a few changes made just before the Annif 1.4 release that unfortunately caused some conflicts with this PR in annif/util.py (just some import statements) and the dependencies declared in pyproject.toml. These need to be resolved.

In addition, as part of PR #864 the API for some AnnifBackend methods changed a little; documents are no longer passed as text strings but as Document objects. Those changes need to be applied to this backend as well; see the commit b0bb163 where the changes were made for other backends.

We would very much like to include this backend in the next minor release, Annif 1.5. However, that will take some work elsewhere in the codebase; I think it would make sense to reimplement the NN ensemble using PyTorch instead of TensorFlow, so that we don't have to depend on two very similar and possibly conflicting ML libraries.
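For the Document API change, the adaptation should be small. A rough sketch (attribute and method names here are assumptions; see commit b0bb163 for the actual pattern used in the other backends):

from annif.corpus import Document

def _suggest_batch(self, documents: list[Document], params):
    # Backends now receive Document objects rather than plain strings,
    # so the text has to be extracted explicitly.
    texts = [doc.text for doc in documents]
    ...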

@Lakshmi-bashyam
Collaborator Author

The PECOS TF-IDF vectorizer has a significant limitation: it does not allow the use of custom tokenizers. Instead, it relies exclusively on its built-in tokenizer.

Technical Details

The core of this limitation lies in the PecosTfidfVectorizerMixin, which uses the PECOS Vectorizer.train() method.
This method is restricted to a predefined set of parameters such as ngram_range, max_df_ratio, analyzer and min_df_cnt.

It lacks a mechanism to accept custom tokenizer functions.

  • Despite this limitation, performance tests on the ZBW dataset showed that vectorization is 5× faster compared to the standard TF-IDF method.
  • This performance improvement becomes more pronounced with larger datasets, making it an attractive option for large-scale applications.

In short, we gain a substantial performance improvement (a 5× speedup) at the cost of losing the flexibility to customize tokenization.
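For context, here is roughly how the PECOS vectorizer gets configured; this is a sketch based on the parameters listed above, and the exact signature should be checked against the pecos documentation:

from pecos.utils.featurization.text.vectorizers import Vectorizer

corpus = ["first training document", "second training document"]

config = {
    "type": "tfidf",
    "kwargs": {
        "ngram_range": [1, 2],   # built-in tokenizer only, no custom hook
        "analyzer": "word",
        "min_df_cnt": 1,
        "max_df_ratio": 0.98,
    },
}

vectorizer = Vectorizer.train(corpus, config=config)
features = vectorizer.predict(corpus)  # sparse feature matrix

There is no parameter that would accept a callable the way scikit-learn's TfidfVectorizer(tokenizer=...) does.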

Current Implementation

  • I have implemented the PECOS TF-IDF vectorizer and included it for XTransformer only for now.
  • Additionally, I have addressed the document object handling in the suggest method.

@osma
Member

osma commented Sep 16, 2025

@Lakshmi-bashyam Thanks a lot for the changes, and for the information about the vectorizer. I guess we will have to live with its limitations at least for now. At least it is fast!

I see that you fixed some of the recent merge conflicts, but apparently pyproject.toml is still in a conflict state according to GitHub. Can you take a look?

I opened a new issue about reimplementing the NN ensemble backend using PyTorch: #895

@mfakaehler
Collaborator

Dear @Lakshmi-bashyam and @osma,
we came across this issue with the TF-IDF vectorizer in PECOS, too. We can confirm that at least for German it worked reasonably well, in the sense that X-Transformer gives good overall results. So I agree that this is a limitation that one could probably live with.

@mfakaehler
Collaborator

Another topic that I would like to raise is that of dependencies (and I hate to bring it up!). At DNB we are currently developing an Annif backend that uses embedding-based matching. @RietdorfC will give an update on this soon.
I have included the dependency specs that we currently aim for here [1].
In particular, this involves transformers v4.52.4. PECOS comes with transformers>=4.31.0, if I have spotted that correctly, so in theory that should be compatible. However, one should probably make sure PECOS actually works with an up-to-date version of transformers; I am not sure whether PECOS has kept track of all breaking changes.
See also this pull request to PECOS for our attempts to update PECOS to support more modern model architectures. This suggests that PECOS might not be compatible with newer versions of transformers...
So we may be running into a problem here. Could you confirm this, @Lakshmi-bashyam?

[1] ebm_packages.txt

@Lakshmi-bashyam
Collaborator Author

@mfakaehler You’re correct — PECOS hasn’t yet been updated to work with the latest Transformers versions. I’ve also opened an issue with the PECOS team about this: amzn/pecos#311

Currently, PECOS ≥ 1.2.7 can only be used with the constraint transformers<=4.49.0.

There’s also another dependency conflict: Python 3.11 is supported starting from PECOS ≥ 1.2.7, but those versions require scipy<1.14.0, while Annif requires scipy>=1.15.3.

@mfakaehler
Collaborator

Thanks for the clarification @Lakshmi-bashyam. I am sorry to say that it's not obvious to me what to do about this :(

@osma
Member

osma commented Sep 18, 2025

What this all boils down to is that it looks like PECOS is not being very actively maintained and relies on versions of libraries that are about to become obsolete. This is a problem if we want to integrate it with Annif (as in this PR), even as an optional dependency, because of the way PECOS sets upper limits on the versions of important library dependencies. While we could try to adjust every other component to accommodate PECOS, if it's even possible to do so, this would only work for a limited time if PECOS stays as it is. The ecosystem always moves on: new versions of libraries are released (possibly with security fixes!), new Python releases will come with new demands on libraries etc.

So unfortunately I don't see any other way of moving forward than trying to work with the PECOS project on bringing the dependencies up to date on their side. In the worst case, this might mean forking it (or at least important parts of it) and taking over maintenance.

@Lakshmi-bashyam
Collaborator Author

@osma Yeah, I’m with you on this. For now I’ll see if I can update the dependencies without conflicts and send a PR over to the PECOS team.

On the Annif side, at least for this PR, I’ll just downgrade the dependencies temporarily until the PECOS team sort this out.

from scipy.sparse import csr_matrix, load_npz

import annif.backend
import annif.corpus

Check notice: Code scanning / CodeQL (note, test)

Module is imported with 'import' and 'import from': Module 'annif.corpus' is imported with both 'import' and 'import from'.
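A straightforward fix (sketch) is to settle on one import style for annif.corpus:

from scipy.sparse import csr_matrix, load_npz

import annif.backend
import annif.corpus

# ...and refer to members as annif.corpus.<name> throughout, instead of
# also importing names with `from annif.corpus import ...` elsewhere.
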
@sonarqubecloud

sonarqubecloud bot commented Oct 7, 2025

Quality Gate failed

Failed conditions
11.1% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud
