
Conversation

@Lakshmi-bashyam
Collaborator

This PR adds xtransformer as an optional dependency, building on the previous xtransformer PR #540. It incorporates minor changes and updates the backend implementation to align with the latest Annif version.

@codecov

codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 30.68182% with 183 lines in your changes missing coverage. Please review.

Project coverage is 97.25%. Comparing base (6bae2e5) to head (0e9ad2c).

Files with missing lines             Patch %   Lines
tests/test_backend_xtransformer.py     9.27%   88 Missing ⚠️
annif/backend/xtransformer.py          8.42%   87 Missing ⚠️
annif/backend/__init__.py             16.66%    5 Missing ⚠️
tests/test_backend.py                 40.00%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #798      +/-   ##
==========================================
- Coverage   99.64%   97.25%   -2.40%     
==========================================
  Files          99      101       +2     
  Lines        7349     7606     +257     
==========================================
+ Hits         7323     7397      +74     
- Misses         26      209     +183     



@pytest.mark.skipif(
importlib.util.find_spec("pecos") is not None,
reason="test requires that YAKE is NOT installed",
Member

PECOS, not YAKE, right?

Collaborator Author

Oops, yes. Thanks for catching it.

@osma
Member

osma commented Sep 25, 2024

Thanks a lot for this new PR @Lakshmi-bashyam ! It really helps to have a clean starting point based on the current code.

We've now tested this briefly. We used the PLC (YKL) classification task, because it seemed simpler than predicting YSO subjects and the current classification quality (mainly using Omikuji Parabel and Bonsai) is not that good, so it seems likely that a new algorithm could achieve better results. (And it did!)

I set this up in the University of Helsinki HPC environment. We got access to an A100 GPU (which is way overkill for this...) so it was possible to train and evaluate models in a reasonable time.

Here are some notes, comments and observations:

Default BERT model missing

Training a model without setting model_shortcut didn't work for me. Apparently the model distilbert-base-multilingual-uncased cannot be found on Hugging Face Hub (maybe it has been deleted?). I set model_shortcut="distilbert-base-multilingual-cased" and it started working. (Later I changed to another BERT model, see below.)

Documentation and advice

There was some advice and a suggested config in this comment from Moritz. I think we would need something like this to guide users (including us at NLF!) on how to use the backend and what configuration settings to use. Eventually this could be a wiki page for the backend like the others we have already, but for now just a comment in this PR would be helpful for testing.

Here is the config I currently use for the YKL classification task in Finnish:

[ykl-xtransformer-fi]
name="YKL XTransformer Finnish"
language="fi"
backend="xtransformer"
analyzer="simplemma(fi)"
vocab="ykl"
batch_size=16
truncate_length=256
learning_rate=0.0001
num_train_epochs=3
max_leaf_size=18000
model_shortcut="TurkuNLP/bert-base-finnish-cased-v1"

Using the Finnish BERT model improved results a bit compared to the multilingual BERT model. It's a little slower and takes slightly more VRAM (7GB instead of 6GB in this task), probably because it's not a DistilBERT model.

This configuration achieves a Precision@1 score of 0.59 on the Finnish YKL classification task, which is slightly higher than what we get with Parabel and Bonsai (0.56-0.57).

If you have any insight into how to choose appropriate configuration settings based on e.g. the training data size, vocabulary size, task type, available hardware etc., that would be very valuable to include in the documentation. Pecos has tons of hyperparameters!

Example questions that I wonder about:

  1. Does the analyzer setting affect what the BERT model sees? I don't think so?
  2. How to select the number of epochs? (so far I've tried 1, 2 and 3 and got the best results with 3 epochs)
  3. How to set truncate_length and what is the maximum value? Can I increase it from 256 if my documents are longer than this?
  4. How to set max_leaf_size?
  5. How to set batch_size?
  6. Are there other important settings/hyperparameters that could be tuned for better results?

Pecos FutureWarning

I saw this warning a lot:

/home/xxx/.cache/pypoetry/virtualenvs/annif-fDHejL2r-py3.10/lib/python3.10/site-packages/pecos/xmc/xtransformer/matcher.py:411: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.

However, I think this is a problem in Pecos and probably not something we can easily fix ourselves. Maybe it will be fixed in a later release of Pecos. (I used libpecos 1.25 which is currently the most recent release on PyPI)
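For reference, what the warning asks for is a small change on the Pecos side (in matcher.py). A minimal sketch, assuming the checkpoint contains only tensors (if Pecos pickles other objects, they would need to be allowlisted via torch.serialization.add_safe_globals):

import torch

# Current behaviour (implicit weights_only=False): unrestricted unpickling
state = torch.load("model_checkpoint.pt")  # placeholder path

# What PyTorch recommends: restrict unpickling to tensors and
# explicitly allowlisted types
state = torch.load("model_checkpoint.pt", weights_only=True)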

Not working under Python 3.11

I first tried Python 3.11, but it seemed that there was no libpecos wheel for this Python version available on PyPI (and it couldn't be built automatically for some reason). So I switched to Python 3.10 for my tests. Again, this is really a problem with libpecos and not with the backend itself.

Unit tests not run under CI

The current tests seem to do a lot of mocking to avoid actually training models. This is probably sensible since actually training a model could require lots of resources. However, the end result is that test coverage is quite low, with less than 10% of lines covered.

Looking more closely, it seems like most of the tests aren't currently executed at all under GitHub Actions CI. I suspect this is because this is an optional dependency that isn't installed in the CI environment, so the tests are skipped. Fixing this in the CI config (.github/workflows/cicd.yml) should at least substantially improve the test coverage.
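For illustration, the skip guard pattern used in the tests (a sketch mirroring the snippet quoted earlier in this conversation; the test name is made up) means everything silently turns into a skip when pecos is absent:

import importlib.util

import pytest

pecos_missing = importlib.util.find_spec("pecos") is None

@pytest.mark.skipif(pecos_missing, reason="test requires that PECOS is installed")
def test_xtransformer_train():
    ...

So unless the CI workflow installs the optional dependency, these tests are reported as skipped rather than run.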

Code style and QA issues

There are some complaints from QA tools about the current code. These should be easy to fix. Not super urgent, but they should be fixed before we can consider merging this. (If some things are hard to fix we can reconsider them case by case)

  • Lint with Black fails in the CI run. The code doesn't follow Black style. Easy to fix by running black
  • SonarCloud complains about a few variable names and return types
  • github-advanced-security complains about imports (see previous comment above)

Dependency on PyTorch

Installing this optional dependency brings in a lot of dependencies, including PyTorch and CUDA. The virtualenv in my case (using poetry install --all-extras) is 5.7GB, while another one for the main branch (without pecos) is 2.6GB, an increase of over 3GB. I wonder if there is any way to reduce this? Especially if we want to include this in the Docker images, the huge size could become a problem.

Also, the NN ensemble backend is implemented using TensorFlow. It seems a bit wasteful to depend on both TensorFlow and PyTorch. Do you think it would make sense to try to reimplement the NN ensemble in PyTorch? This way we could at least drop the dependency on TensorFlow.
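To illustrate the scale of that idea, here is a generic sketch of a score-merging network in PyTorch. This is not the current NN ensemble architecture, just a rough indication that the model involved is small:

import torch
import torch.nn as nn

class EnsembleNet(nn.Module):
    # Toy network that merges score vectors from several base backends.

    def __init__(self, n_sources: int, n_subjects: int, hidden: int = 100):
        super().__init__()
        self.hidden = nn.Linear(n_sources * n_subjects, hidden)
        self.out = nn.Linear(hidden, n_subjects)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, n_sources * n_subjects) stacked base-backend outputs
        h = torch.relu(self.hidden(scores))
        return torch.sigmoid(self.out(h))  # per-subject scores in [0, 1]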


Again, thanks a lot for this and apologies for the long silence and the long comments! We can of course do some of the remaining work to get this integrated and merged on our side, because this seems like a very useful addition to the Annif backends. Even if you don't have any time to work on the code, just providing some advice on the configuration side would help a lot! For example, example configurations you've used at ZBW would be nice to see.

@sonarqubecloud

Quality Gate failed

Failed conditions
11.5% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

@juhoinkinen
Member

Especially if we want to include this in the Docker images, the huge size could become a problem.

I built a Docker image from this branch, and its size is 7.21 GB, which is much bigger than the Annif 1.1 image at 2.07 GB.

Not all users and use cases will need XTransformer or the other optional dependencies, so we could build different variants of the image and push them to quay.io (just by setting different build args in the GitHub Actions build step and tagging the images appropriately). But that can be done in a separate PR; I'll create an issue for this now.

@sonarqubecloud

Quality Gate failed

Failed conditions
1 Security Hotspot
13.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@katjakon

katjakon commented Aug 6, 2025

Draft for X-Transformer Wiki Page

I wrote a first draft of a potential Wiki page about X-Transformer, which includes hyperparameters and notes about optimization.
This can definitely be extended and modified as this PR evolves. Let me know if you have any notes!
Backend-X-Transformer.md

@osma
Member

osma commented Sep 16, 2025

There were a few changes made just before the Annif 1.4 release that unfortunately caused some conflicts with this PR in annif/util.py (just some import statements) and the dependencies declared in pyproject.toml. These need to be resolved.

In addition, as part of PR #864 the API for some AnnifBackend methods changed a little; documents are no longer passed as text strings but as Document objects. Those changes need to be applied to this backend as well; see the commit b0bb163 where the changes were made for other backends.

We would very much like to include this backend in the next minor release, Annif 1.5. However, that will take some work elsewhere in the codebase; I think it would make sense to reimplement the NN ensemble using PyTorch instead of TensorFlow, so that we don't have to depend on two very similar and possibly conflicting ML libraries.
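For the Document API change, the adaptation should be small. A rough sketch (attribute and method names here are assumptions; see commit b0bb163 for the actual pattern used in the other backends):

from annif.corpus import Document

def _suggest_batch(self, documents: list[Document], params):
    # Backends now receive Document objects rather than plain strings,
    # so the text has to be extracted explicitly.
    texts = [doc.text for doc in documents]
    ...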

@Lakshmi-bashyam
Collaborator Author

The PECOS TF-IDF vectorizer has a significant limitation: it does not allow the use of custom tokenizers. Instead, it relies exclusively on its built-in tokenizer.

Technical Details

The core of this limitation lies in the PecosTfidfVectorizerMixin, which uses the PECOS Vectorizer.train() method.
This method is restricted to a predefined set of parameters such as ngram_range, max_df_ratio, analyzer and min_df_cnt.

It lacks a mechanism to accept custom tokenizer functions.

  • Despite this limitation, performance tests on the ZBW dataset showed that vectorization is 5× faster compared to the standard TF-IDF method.
  • This performance improvement becomes more pronounced with larger datasets, making it an attractive option for large-scale applications.

In short, we gain a substantial performance improvement (a 5× speedup) at the cost of losing the flexibility to customize tokenization.
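For context, here is roughly how the PECOS vectorizer gets configured; this is a sketch based on the parameters listed above, and the exact signature should be checked against the pecos documentation:

from pecos.utils.featurization.text.vectorizers import Vectorizer

corpus = ["first training document", "second training document"]

config = {
    "type": "tfidf",
    "kwargs": {
        "ngram_range": [1, 2],   # built-in tokenizer only, no custom hook
        "analyzer": "word",
        "min_df_cnt": 1,
        "max_df_ratio": 0.98,
    },
}

vectorizer = Vectorizer.train(corpus, config=config)
features = vectorizer.predict(corpus)  # sparse feature matrix

There is no parameter that would accept a callable the way scikit-learn's TfidfVectorizer(tokenizer=...) does.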

Current Implementation

  • I have implemented the PECOS TF-IDF vectorizer and included it for XTransformer only for now.
  • Additionally, I have addressed the document object handling in the suggest method.

@osma
Member

osma commented Sep 16, 2025

@Lakshmi-bashyam Thanks a lot for the changes, and for the information about the vectorizer. I guess we will have to live with its limitations at least for now. At least it is fast!

I see that you fixed some of the recent merge conflicts, but apparently pyproject.toml is still in a conflict state according to GitHub. Can you take a look?

I opened a new issue about reimplementing the NN ensemble backend using PyTorch: #895

@mfakaehler
Collaborator

Dear @Lakshmi-bashyam and @osma,
we came across this issue with the TF-IDF vectorizer in PECOS, too. We can confirm that at least for German it worked reasonably well, in the sense that X-Transformer gives good overall results. So I agree that this is a limitation that one could probably live with.

@mfakaehler
Collaborator

Another topic that I would like to raise is that of dependencies (and I hate to bring it up!). At DNB we are currently developing an Annif backend that uses embedding-based matching. @RietdorfC will give an update on this soon.
I have included the dependency specs that we currently aim for here [1].
In particular, this involves transformers v4.52.4. PECOS comes with transformers>=4.31.0, if I have spotted that correctly, so in theory that should be compatible. However, one should probably make sure PECOS actually works with an up-to-date version of transformers; I am not sure whether PECOS has kept track of all breaking changes.
See also this pull request to PECOS for our attempts to update PECOS to support more modern model architectures. This suggests that PECOS might not be compatible with newer versions of transformers...
So we may be running into a problem here. Could you confirm this, @Lakshmi-bashyam?

[1] ebm_packages.txt

@Lakshmi-bashyam
Collaborator Author

@mfakaehler You’re correct — PECOS hasn’t yet been updated to work with the latest Transformers versions. I’ve also opened an issue with the PECOS team about this: amzn/pecos#311

Currently, PECOS ≥ 1.2.7 can only be used with the constraint transformers<=4.49.0.

There’s also another dependency conflict: Python 3.11 is supported starting from PECOS ≥ 1.2.7, but those versions require scipy<1.14.0, while Annif requires scipy>=1.15.3.

@mfakaehler
Collaborator

Thanks for the clarification @Lakshmi-bashyam. I am sorry to say that it's not obvious to me what to do about this :(

@osma
Member

osma commented Sep 18, 2025

What this all boils down to is that it looks like PECOS is not being very actively maintained and relies on versions of libraries that are about to become obsolete. This is a problem if we want to integrate it with Annif (as in this PR), even as an optional dependency, because of the way PECOS sets upper limits on the versions of important library dependencies. While we could try to adjust every other component to accommodate PECOS, if it's even possible to do so, this would only work for a limited time if PECOS stays as it is. The ecosystem always moves on: new versions of libraries are released (possibly with security fixes!), new Python releases will come with new demands on libraries etc.

So unfortunately I don't see any other way of moving forward than trying to work with the PECOS project on bringing the dependencies up to date on their side. In the worst case, this might mean forking it (or at least important parts of it) and taking over maintenance.

@Lakshmi-bashyam
Collaborator Author

@osma Yeah, I’m with you on this. For now I’ll see if I can update the dependencies without conflicts and send a PR over to the PECOS team.

On the Annif side, at least for this PR, I’ll just downgrade the dependencies temporarily until the PECOS team sort this out.

from scipy.sparse import csr_matrix, load_npz

import annif.backend
import annif.corpus

Check notice: Code scanning / CodeQL (note, test)

Module is imported with 'import' and 'import from': Module 'annif.corpus' is imported with both 'import' and 'import from'.
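A straightforward fix (sketch) is to settle on one import style for annif.corpus:

from scipy.sparse import csr_matrix, load_npz

import annif.backend
import annif.corpus

# ...and refer to members as annif.corpus.<name> throughout, instead of
# also importing names with `from annif.corpus import ...` elsewhere.
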
@sonarqubecloud

sonarqubecloud bot commented Oct 7, 2025

Quality Gate failed

Failed conditions
11.1% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud
