Releases: NVIDIA/bionemo-framework

NVIDIA BioNeMo Framework 2.1

21 Nov 00:33
cd4f48a

New Features:

  • ESM-2 Implementation
    • Updated the ESM-2 Model Card with detailed performance benchmarks comparing BioNeMo2 training against vanilla PyTorch.
    • Added ESM-2 inference endpoint for evaluating pre-trained models
  • Size-Aware Batching (see the batching sketch after this list)
    • Added SizeAwareBatchSampler, a PyTorch data sampler that batches elements of varying sizes while ensuring that the total size of each batch does not exceed a specified maximum.
    • Added BucketBatchSampler, another PyTorch data sampler that groups elements of varying sizes into predefined bucket ranges and creates each batch from a single bucket, so that every batch contains elements of homogeneous size.
  • CLI Support (see the configuration sketch after this list)
    • Added a Pydantic interface for pretraining jobs that parses JSON configuration files and enables passing customized Model and DataModule classes.
    • Implemented Pydantic configurations for Geneformer and ESM-2 pretraining and fine-tuning.
    • Added 'recipes' for generating validated JSON files to be used with the Pydantic interface.
    • Added installable scripts for the recipe and training entry points: bionemo-esm2-recipe, bionemo-esm2-train, bionemo-geneformer-recipe, and bionemo-geneformer-train.
  • Geneformer support in BioNeMo2:
    • Added tested pre-training scripts and fine-tuning example scripts that can be used as a starting point for creating custom derivative models.
    • Geneformer 10M and 106M checkpoints, ported from BioNeMo v1 to BioNeMo v2, are available and included in the documentation.
    • Added inference scripts.
  • Documentation
    • Added a cell-type classification example notebook that covers converting AnnData into our internal format, running inference on that data with a Geneformer checkpoint, and making use of the inference results (see the AnnData sketch after this list).
    • Updated the Getting Started guide and the ESM-2 tutorials.
    • Added Frequently Asked Questions (FAQ) page
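
A minimal sketch of the size-aware batching idea: greedily pack sample indices into batches so that the summed per-element size (e.g., sequence length) stays under a budget. This illustrates the concept only; the constructors and signatures of the actual SizeAwareBatchSampler and BucketBatchSampler may differ.

```python
from torch.utils.data import DataLoader, Sampler

class GreedySizeAwareSampler(Sampler):
    """Concept sketch, not the BioNeMo API: yields index batches whose
    total size never exceeds `max_total_size`."""

    def __init__(self, sizes, max_total_size):
        self.sizes = sizes                    # per-element sizes, e.g. token counts
        self.max_total_size = max_total_size  # size budget per batch

    def __iter__(self):
        batch, total = [], 0
        for idx, size in enumerate(self.sizes):
            if batch and total + size > self.max_total_size:
                yield batch                   # emit before the batch overflows
                batch, total = [], 0
            batch.append(idx)
            total += size
        if batch:
            yield batch

# Usage: pass as batch_sampler so each yielded index list becomes one batch.
# loader = DataLoader(dataset, batch_sampler=GreedySizeAwareSampler(lengths, 4096))
```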
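
To make the Pydantic configuration flow concrete, here is an illustrative sketch: the field names are hypothetical, not BioNeMo's actual schema. A recipe script emits a validated JSON file, and a train script parses it back into a typed config object.

```python
import json
from pydantic import BaseModel

class TrainingConfig(BaseModel):
    """Hypothetical config; BioNeMo's real schema has different fields."""
    model_class: str          # dotted path to a customized Model class
    datamodule_class: str     # dotted path to a customized DataModule class
    micro_batch_size: int = 32
    num_steps: int = 500_000

# A recipe would write validated JSON; the train entry point parses it back.
raw = '{"model_class": "my_pkg.MyESM2Model", "datamodule_class": "my_pkg.MyDataModule"}'
config = TrainingConfig(**json.loads(raw))
print(config.micro_batch_size)  # 32 -- defaults fill in after validation
```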
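
The rough shape of the cell-type classification workflow looks like the sketch below. Only the anndata calls are real; run_geneformer_inference is a placeholder, not a BioNeMo function, and the actual notebook first converts the AnnData into BioNeMo's internal format.

```python
import anndata as ad

adata = ad.read_h5ad("cells.h5ad")   # cells x genes expression matrix
genes = adata.var_names.tolist()     # gene identifiers used for tokenization
counts = adata.X                     # expression values for each cell

# Placeholder, not a real BioNeMo function: the notebook converts `adata`
# to the internal format and runs a Geneformer checkpoint over it.
# embeddings = run_geneformer_inference(counts, genes, checkpoint="geneformer-10M")
# The notebook then trains a classifier on the returned cell embeddings.
```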


NVIDIA BioNeMo Framework 2.0

23 Oct 21:54
291d0ac

New Features:

  • ESM-2 implementation
    • State-of-the-art training performance and accuracy equivalent to the reference implementation
    • 650M and 3B scale checkpoints available which mirror the reference model
    • Flexible fine-tuning examples that can be copied and modified to accomplish a wide variety of downstream tasks
  • First version of our NeMo v2-based reference implementation, which re-imagines BioNeMo as a repository of Megatron models, dataloaders, and training recipes that use NeMo v2 for training loops.
    • Modular design and a permissive Apache 2.0 OSS license enable the import and use of our framework in proprietary applications.
    • NeMo2 training abstractions allow the user to focus on the model implementation while the training strategy handles distribution and model parallelism.
  • Documentation and documentation build system for BioNeMo 2.

Known Issues:

  • PEFT support is not yet fully functional.
  • A partial implementation of Geneformer is present; use at your own risk. It will be optimized and officially released in the future.
  • The command-line interface is currently based on one-off training recipes and scripts. We are working on a configuration-based approach that will be released in the future.
  • The fine-tuning workflow is implemented for BERT-based architectures and could be adapted for others, but it requires inheriting from the biobert base model config. In the short term you can follow similar patterns to partially load weights from an old checkpoint into a new model; in the future we will provide a more direct API that is easier to follow.
  • A slow memory leak occurs during ESM-2 pretraining, which can cause OOM errors during long pretraining runs. Training with a microbatch size of 48 on 40 A100s raised an out-of-memory error after 5,800 training steps.
    • Possible workarounds include calling gc.collect(); torch.cuda.empty_cache() every ~1,000 steps, which appears to reclaim the consumed memory (see the sketch after this list), or training with a lower microbatch size and periodically restarting training from a saved checkpoint.
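
A minimal sketch of the first workaround, assuming a PyTorch Lightning-style training loop (NeMo trainers build on Lightning). The callback class here is ours for illustration, not part of BioNeMo.

```python
import gc

import torch
import pytorch_lightning as pl

class PeriodicMemoryCleanup(pl.Callback):
    """Illustrative workaround: free leaked memory every N training steps."""

    def __init__(self, every_n_steps: int = 1000):
        self.every_n_steps = every_n_steps

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if trainer.global_step > 0 and trainer.global_step % self.every_n_steps == 0:
            gc.collect()              # reclaim leaked Python objects
            torch.cuda.empty_cache()  # return cached CUDA blocks to the driver

# Usage: trainer = pl.Trainer(callbacks=[PeriodicMemoryCleanup(1000)])
```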

External Partner Contributions

We would like to thank the following organizations for their insightful discussions guiding the development of the BioNeMo Framework and their valuable contributions to the codebase. We are grateful for your collaboration.


NVIDIA BioNeMo Framework 1.10

23 Oct 21:54
9ba9b2c

Changes

  • Migrated development from NVIDIA internal to GitHub
  • License changed from NVIDIA proprietary to Apache 2.0
  • The 1.10 release is functionally equivalent to the 1.9 release; previous release notes can be found in the documentation directory of the GitHub repository