Add Xtransformer to backend #798
Conversation
Codecov report (patch coverage warning):

```
@@            Coverage Diff             @@
##             main     #798      +/-   ##
==========================================
- Coverage   99.64%   97.25%    -2.40%
==========================================
  Files          99      101        +2
  Lines        7349     7606      +257
==========================================
+ Hits         7323     7397       +74
- Misses         26      209      +183
```
Review comment on tests/test_backend.py (outdated):

```python
@pytest.mark.skipif(
    importlib.util.find_spec("pecos") is not None,
    reason="test requires that YAKE is NOT installed",
)
```
PECOS, not YAKE, right?
Oops, yes. Thanks for catching it.
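For reference, the corrected marker would read as follows (the test function shown here is a hypothetical stand-in, not the actual test from this PR):

```python
import importlib.util

import pytest


# Skip this test when PECOS is installed; the reason string now matches
# the condition being checked.
@pytest.mark.skipif(
    importlib.util.find_spec("pecos") is not None,
    reason="test requires that PECOS is NOT installed",
)
def test_backend_without_pecos():  # hypothetical test name
    ...
```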
Thanks a lot for this new PR @Lakshmi-bashyam ! It really helps to have a clean starting point based on the current code. We've now tested this briefly. We used the PLC (YKL) classification task, because it seemed simpler than predicting YSO subjects, and the current classification quality there (mainly using Omikuji Parabel and Bonsai) is not that good, so it seemed likely that a new algorithm could achieve better results. (And it did!) I set this up in the University of Helsinki HPC environment. We got access to an A100 GPU (which is way overkill for this...) so it was possible to train and evaluate models in a reasonable time. Here are some notes, comments and observations:

Default BERT model missing

Training a model without explicitly setting a BERT model fails; there should be a sensible default.

Documentation and advice

There was some advice and a suggested config in this comment from Moritz. I think we would need something like this to guide users (including us at NLF!) on how to use the backend and what configuration settings to use. Eventually this could be a wiki page for the backend like the others we already have, but for now just a comment in this PR would be helpful for testing. The config I currently use for the YKL classification task in Finnish is along the lines of the sketch below. Using the Finnish BERT model improved results a bit compared to the multilingual BERT model. It's a little slower and takes slightly more VRAM (7GB instead of 6GB in this task), probably because it's not a DistilBERT model. This configuration achieves a Precision@1 score of 0.59 on the Finnish YKL classification task, which is slightly higher than what we get with Parabel and Bonsai (0.56-0.57).

If you have any insight into how to choose appropriate configuration settings based on e.g. the training data size, vocabulary size, task type, available hardware etc., that would be very valuable to include in the documentation. Pecos has tons of hyperparameters! For example, I wonder how the various settings should be scaled along each of those dimensions.
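For illustration, here is a minimal sketch of what such a projects.cfg entry could look like. This is a reconstruction under assumptions: the analyzer choice, the limit value and especially the model_shortcut parameter name (taken from PECOS conventions) are not confirmed against this PR, and the actual values used in the YKL experiment are not preserved here.

```ini
# Hypothetical projects.cfg sketch; parameter names and values are
# assumptions, not the configuration actually used in the experiment.
[ykl-xtransformer-fi]
name=YKL XTransformer Finnish
language=fi
backend=xtransformer
vocab=ykl
analyzer=snowball(finnish)
limit=100
# Finnish BERT instead of the multilingual default, as discussed above;
# the parameter name model_shortcut follows PECOS usage and is an assumption.
model_shortcut=TurkuNLP/bert-base-finnish-cased-v1
```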
Pecos FutureWarning

I saw a FutureWarning from Pecos a lot during training and evaluation. However, I think this is a problem in Pecos and probably not something we can easily fix ourselves. Maybe it will be fixed in a later release of Pecos. (I used libpecos 1.2.5, which is currently the most recent release on PyPI.)

Not working under Python 3.11

I first tried Python 3.11, but it seemed that there was no PECOS release available for it, so I used an older Python version instead.

Unit tests not run under CI

The current tests seem to do a lot of mocking to avoid actually training models. This is probably sensible, since actually training a model could require lots of resources. However, the end result is that test coverage is quite low, with less than 10% of lines covered. Looking more closely, it seems like most of the tests aren't currently executed at all under GitHub Actions CI. I suspect this is because pecos is an optional dependency and it's not installed at all in the CI environment, so the tests are skipped (see the sketch after these notes). Fixing this in the CI config, so that the optional dependency is installed in at least one job, should make the tests actually run.

Code style and QA issues

There are some complaints from QA tools about the current code. These should be easy to fix. Not super urgent, but they should be fixed before we can consider merging this. (If some things are hard to fix we can reconsider them case by case.)
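To illustrate why coverage collapses when the optional dependency is absent, this is the kind of guard pattern involved (a sketch; the test name is hypothetical). If pecos cannot be imported, every test in the module is skipped:

```python
import pytest

# Skip this whole test module if the optional dependency is missing;
# in a CI environment without libpecos installed, none of these tests
# run and the backend code shows up as uncovered.
pecos = pytest.importorskip("pecos")


def test_xtransformer_is_trained():  # hypothetical test
    ...
```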
Dependency on PyTorch

Installing this optional dependency brings in a lot of transitive dependencies, including PyTorch and CUDA libraries, which made the virtualenv very large in my case. Also, the NN ensemble backend is implemented using TensorFlow. It seems a bit wasteful to depend on both TensorFlow and PyTorch. Do you think it would make sense to try to reimplement the NN ensemble in PyTorch? This way we could at least drop the dependency on TensorFlow. (A rough sketch of what that could look like follows below.)

Again, thanks a lot for this, and apologies for the long silence and the long comments! We can of course do some of the remaining work to get this integrated and merged on our side, because this seems like a very useful addition to the Annif backends. Even if you don't have any time to work on the code, just providing some advice on the configuration side would help a lot! For example, the configurations you've used at ZBW would be nice to see.
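To make the proposal concrete, here is a minimal sketch of what a PyTorch port could look like. It mirrors the general idea of the NN ensemble (merging per-backend score vectors into one combined score vector), not the exact architecture of the current TensorFlow implementation; all names and layer sizes here are assumptions.

```python
import torch
from torch import nn


class NNEnsembleNet(nn.Module):
    """Sketch only: combine score vectors from several base backends
    into one merged score vector, one score per subject."""

    def __init__(self, n_sources: int, n_subjects: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                               # (batch, sources * subjects)
            nn.Linear(n_sources * n_subjects, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, n_subjects),
            nn.Sigmoid(),                               # scores in [0, 1] per subject
        )

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, n_sources, n_subjects)
        return self.net(scores)
```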
I built a Docker image from this branch, and its size is 7.21 GB, which is quite a lot bigger than the Annif 1.1 image at 2.07 GB. Not all users and use cases need XTransformer or the other optional dependencies, so we could build different variants of the image and push them to quay.io (just by setting different build args in the GitHub Actions build step and tagging the images appropriately). But that can be done in a separate PR; I'll create an issue for this now.
Draft for X-Transformer Wiki page

I wrote a first draft of a potential Wiki page about X-Transformer, which includes hyperparameters and notes about optimization.
There were a few changes made just before the Annif 1.4 release that unfortunately caused some merge conflicts with this PR (in pyproject.toml among other files). In addition, as part of PR #864 the API for some AnnifBackend methods changed a little; documents are no longer passed as text strings but as Document objects. Those changes need to be applied to this backend as well; see the commit b0bb163 where the changes were made for other backends. (A rough sketch of the idea is below.)

We would very much like to include this backend in the next minor release, Annif 1.5. However, that will take some work elsewhere in the codebase; I think it would make sense to reimplement the NN ensemble to use PyTorch instead of TensorFlow, so that we don't have to depend on two very similar and possibly conflicting ML libraries.
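A rough sketch of what the API change means for this backend; method and attribute names follow the pattern from commit b0bb163 as I understand it, and are assumptions rather than verified code:

```python
class XTransformerBackend:  # sketch only; the real class derives from AnnifBackend
    def _suggest_batch(self, documents, params):
        # After PR #864, the items are Document objects rather than raw
        # strings, so the backend reads doc.text instead of using the
        # item directly.
        texts = [doc.text for doc in documents]
        ...
```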
The PECOS TF-IDF vectorizer has a significant limitation: it does not allow the use of custom tokenizers. Instead, it relies exclusively on its built-in tokenizer.

Technical details

The core of this limitation lies in the underlying vectorizer implementation, which lacks a mechanism to accept custom tokenizer functions. By relying on the built-in tokenizer we gain a substantial performance improvement (a 5× speedup) at the cost of losing the flexibility to customize tokenization, and this is what the current implementation does.
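As a rough usage sketch of the built-in vectorizer (class and method names are from the PECOS codebase as I understand it; whether this PR invokes it exactly this way is an assumption):

```python
from pecos.utils.featurization.text.vectorizers import Vectorizer

corpus = ["ensimmäinen dokumentti", "toinen dokumentti"]

# Train the built-in TF-IDF vectorizer. Note that there is no hook here
# for passing a custom tokenizer or analyzer, which is the limitation
# discussed above.
vectorizer = Vectorizer.train(corpus, config={"type": "tfidf"})
X = vectorizer.predict(corpus)  # sparse document-term feature matrix
```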
@Lakshmi-bashyam Thanks a lot for the changes, and for the information about the vectorizer. I guess we will have to live with its limitations, at least for now. At least it is fast!

I see that you fixed some of the recent merge conflicts, but apparently pyproject.toml is still in a conflict state according to GitHub. Can you take a look?

I opened a new issue about reimplementing the NN ensemble backend using PyTorch: #895
Dear @Lakshmi-bashyam and @osma, |
Another topic that I would like to raise is that of dependencies (and I hate to bring it up!). At DNB we are currently developing an Annif backend that uses embedding-based matching; @RietdorfC will give an update on this soon. It depends on recent versions of some of the same libraries, such as Transformers, which seems hard to reconcile with the version pins that PECOS currently requires. A list of the packages in our environment is attached [1].

[1] ebm_packages.txt
@mfakaehler You’re correct: PECOS hasn’t yet been updated to work with the latest Transformers versions. I’ve also opened an issue with the PECOS team about this: amzn/pecos#311

Currently, PECOS ≥ 1.2.7 can only be used with an upper-bound constraint on the Transformers version. There’s also another dependency conflict: Python 3.11 is supported starting from PECOS ≥ 1.2.7, but those versions come with the Transformers constraint just mentioned, so Python 3.11 support and an up-to-date Transformers cannot currently be combined.
Thanks for the clarification @Lakshmi-bashyam. I am sorry to say that it's not obvious to me what to do about this :(
What this all boils down to is that it looks like PECOS is not being very actively maintained and relies on versions of libraries that are about to become obsolete. This is a problem if we want to integrate it with Annif (as in this PR), even as an optional dependency, because of the way PECOS sets upper limits on the versions of important library dependencies. While we could try to adjust every other component to accommodate PECOS, if it's even possible to do so, this would only work for a limited time if PECOS stays as it is. The ecosystem always moves on: new versions of libraries are released (possibly with security fixes!), new Python releases come with new demands on libraries, etc.

So unfortunately I don't see any other way of moving forward than trying to work with the PECOS project on bringing the dependencies up to date on their side. In the worst case, this might mean forking it (or at least important parts of it) and taking over maintenance.
@osma Yeah, I’m with you on this. For now I’ll see if I can update the dependencies without conflicts and send a PR over to the PECOS team. On the Annif side, at least for this PR, I’ll just downgrade the dependencies temporarily until the PECOS team sorts this out.
This PR adds XTransformer as an optional dependency, incorporating minor changes and updating the backend implementation to align with the latest Annif version, building on the previous XTransformer PR #540.