A classification framework to enhance your habitat distribution models
View framework · Report Bug · Request Feature

This is the code for the framework of the paper *A deep-learning framework for enhancing habitat identification based on species composition*, published in Applied Vegetation Science.
If you use this code for your work and wish to credit the authors, you can cite the paper:
```
@article{leblanc2024deep,
  title={A deep-learning framework for enhancing habitat identification based on species composition},
  author={Leblanc, C{\'e}sar and Bonnet, Pierre and Servajean, Maximilien and Chytr{\'y}, Milan and A{\'c}i{\'c}, Svetlana and Argagnon, Olivier and Bergamini, Ariel and Biurrun, Idoia and Bonari, Gianmaria and Campos, Juan A and others},
  journal={Applied Vegetation Science},
  volume={27},
  number={3},
  pages={e12802},
  year={2024},
  publisher={Wiley Online Library}
}
```
This framework aims to facilitate the training and sharing of Habitat Distribution Models (HDMs) using various types of input covariates, including cover abundances of plant species and information about plot location.
- Prerequisites
- Data
- Installation
- Examples
- Models
- Roadmap
- Unlicense
- Contributing
- Troubleshooting
- Team
- Structure
Python version 3.7 or higher and CUDA are required.
On many systems, Python comes pre-installed. You can run the following command to check whether a suitable version is already installed:
python --version
If Python is not already installed, or if the installed version is 3.6 or lower, you will need to install a functional version of Python on your system by following the official documentation, which contains a detailed guide on how to set up Python.
To check whether CUDA is already installed or not on your system, you can try running the following command:
nvcc --version
If it is not, make sure to follow the instructions here.
The framework is optimized for data files from the European Vegetation Archive (EVA). These files contain all the information required for the proper functioning of the framework, i.e., for each vegetation plot the full list of vascular plant species, the estimates of cover abundance of each species, the location, and the EUNIS classification. Once the database is downloaded (more information here), make sure you rename the species and header data files as eva_species.csv and eva_header.csv, respectively. Not all columns from the files are needed, but if you decide to remove some of them to save space on your computer, make sure that the values remain tab-separated and that you keep at least the following (a sanity-check sketch follows this list):
- the columns `PlotObservationID`, `Matched concept`, and `Cover %` from the species file (vegetation-plot data)
- the columns `PlotObservationID`, `Cover abundance scale`, `Date of recording`, `Expert System`, `Longitude`, and `Latitude` from the header file (plot attributes)
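If you do trim columns, a quick check like the one below (a minimal sketch assuming pandas is installed; the paths follow the Datasets folder convention used in this README) confirms that the required tab-separated columns are still present:

```python
import pandas as pd

# Columns the framework expects in each file (see the list above).
REQUIRED_SPECIES = {"PlotObservationID", "Matched concept", "Cover %"}
REQUIRED_HEADER = {"PlotObservationID", "Cover abundance scale", "Date of recording",
                   "Expert System", "Longitude", "Latitude"}

species = pd.read_csv("Datasets/eva_species.csv", sep="\t", low_memory=False)
header = pd.read_csv("Datasets/eva_header.csv", sep="\t", low_memory=False)

for name, frame, required in [("species", species, REQUIRED_SPECIES),
                              ("header", header, REQUIRED_HEADER)]:
    missing = required - set(frame.columns)  # columns removed by mistake
    if missing:
        raise ValueError(f"{name} file is missing columns: {sorted(missing)}")
print("Both files contain the required columns.")
```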
Firstly, hdm-framework can be installed via repository cloning:
git clone https://github.com/cesar-leblanc/hdm-framework.git
cd hdm-framework
Secondly, make sure that the dependencies listed in the environment.yml and requirements.txt files are installed.
One way to do so is to use conda:
conda env create -f environment.yml
conda activate hdm-env
Thirdly, to check that the installation went well, use the following command:
python main.py --pipeline 'check'
If the framework was properly installed, it should output:
No missing files.
No missing dependencies.
Environment is properly configured.
Make sure to place the species and header data files inside the Datasets folder before going further.
To pre-process the data from the European Vegetation Archive and create the input data and the target labels, run the following command:
python main.py --pipeline 'dataset'
Some changes can be made to this command to create another dataset. Here is an example that only keeps vegetation plots from France and Germany that were recorded after 2000 and classified to level 2 of the EUNIS hierarchy:
python main.py --pipeline 'dataset' --countries 'France, Germany' --min_year 2000 --level 2
To evaluate the parameters of a classifier on the previously obtained dataset using cross-validation, run the following command:
python main.py --pipeline 'evaluation'
Some changes can be made to this command to evaluate other parameters. Here is an example that evaluates a TabNet Classifier using the top-3 macro average multiclass accuracy:
python main.py --pipeline 'evaluation' --model 'tnc' --average 'macro' --k 3
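As a point of reference, this metric counts a prediction as correct if the true habitat is among the k highest-scoring classes, then averages the per-class accuracies so rare habitats weigh as much as common ones. A minimal NumPy sketch (an illustration, not the framework's implementation):

```python
import numpy as np

def top_k_macro_accuracy(y_true, y_scores, k=3):
    """Top-k accuracy averaged over classes (macro), for illustrative scores."""
    top_k = np.argsort(y_scores, axis=1)[:, -k:]              # k best classes per plot
    hits = np.array([t in row for t, row in zip(y_true, top_k)])
    classes = np.unique(y_true)
    per_class = [hits[y_true == c].mean() for c in classes]  # accuracy per habitat
    return float(np.mean(per_class))                          # macro average

# Toy example: 4 plots, 3 habitat classes.
y_true = np.array([0, 0, 1, 2])
y_scores = np.array([[0.6, 0.3, 0.1],
                     [0.2, 0.5, 0.3],
                     [0.1, 0.8, 0.1],
                     [0.3, 0.3, 0.4]])
print(top_k_macro_accuracy(y_true, y_scores, k=2))  # 0.833...
```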
To train a classifier on the previously obtained dataset and save its weights, run the following command:
python main.py --pipeline 'training'
Some changes can be made to this command to train another classifier. Here is an example that trains a Random Forest Classifier with 50 trees using the cross-entropy loss:
python main.py --pipeline 'training' --model 'rfc' --n_estimators 50 --criterion 'log_loss'
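For orientation, here is what those two hyperparameters mean in a plain scikit-learn random forest (an illustration on toy data, not the framework's own training code; `criterion="log_loss"` requires scikit-learn 1.1 or later):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real dataset; the CLI flags map to these arguments.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5, random_state=0)

# 50 trees, each grown with the cross-entropy ("log_loss") split criterion.
clf = RandomForestClassifier(n_estimators=50, criterion="log_loss", random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy of the fitted forest
```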
Before making predictions, make sure you include two new files that describe the vegetation data of your choice in the Datasets folder: test_species.csv and test_header.csv. The two files should contain the following columns, with tab-separated values (a formatting sketch follows this list):
- `PlotObservationID` (integer), `Matched concept` (string), and `Cover %` (float) for the species data, which respectively describe the plot identifier, the taxon names, and the percentage cover
- `PlotObservationID` (integer), `Longitude` (float), and `Latitude` (float) for the header data, which respectively describe the plot identifier, the plot longitude, and the plot latitude
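As an illustration, the following sketch (assuming pandas; the plot identifiers, taxon names, and coordinates are placeholders) writes two minimal files in the expected format:

```python
import pandas as pd

# Placeholder plots; replace with your own identifiers, taxa, and coordinates.
species = pd.DataFrame({
    "PlotObservationID": [1, 1, 2],
    "Matched concept": ["Fagus sylvatica", "Abies alba", "Quercus ilex"],
    "Cover %": [60.0, 25.0, 80.0],
})
header = pd.DataFrame({
    "PlotObservationID": [1, 2],
    "Longitude": [6.83, 2.35],
    "Latitude": [45.92, 48.86],
})

# The framework expects tab-separated values in the Datasets folder.
species.to_csv("Datasets/test_species.csv", sep="\t", index=False)
header.to_csv("Datasets/test_header.csv", sep="\t", index=False)
```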
To predict the classes of the new samples using a previously trained classifier, make sure the weights of the desired model are stored in the Models folder and then run the following command:
python main.py --pipeline 'prediction'
Some changes can be made to this command to predict differently. Here is an example that predicts using an XGBoost Classifier without the external criteria or the GBIF normalization:
python main.py --pipeline 'prediction' --model 'xgb' --features 'species' --gbif_normalization False
This section lists the models included in the project, along with the major frameworks/libraries used to create them:
- MultiLayer Perceptron classifier (MLP)
- Random Forest Classifier (RFC)
- XGBoost classifier (XGB)
- TabNet Classifier (TNC)
- Feature Tokenizer + Transformer classifier (FTT)
This roadmap outlines the planned features and milestones for the project. Please note that the roadmap is subject to change and may be updated as the project progresses.
- Implement multilingual user support
- English
- French
- Integrate new popular algorithms
- MLP
- RFC
- XGB
- TNC
- FTT
- KNN
- GNB
- Add more habitat typologies
- EUNIS
- NPMS
- Include other data aggregators
- EVA
- TAVA
- Offer several powerful frameworks
- PyTorch
- TensorFlow
- JAX
- Allow data parallel training
- Multithreading
- Multiprocessing
- Supply different classification strategies (see the sketch after this roadmap)
- Top-k classification
- Average-k classification
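The two classification strategies at the end of the roadmap differ in how many habitat labels they return per plot: top-k always returns exactly k, while average-k lets set sizes vary per plot as long as they average k over the dataset. A conceptual sketch (not the framework's implementation):

```python
import numpy as np

def top_k_sets(probs, k):
    # Always return the k most probable habitats for every plot.
    return [set(map(int, np.argsort(p)[-k:])) for p in probs]

def average_k_sets(probs, k):
    # Keep the n_plots * k largest probabilities overall: confident plots get
    # small label sets, ambiguous plots larger ones, and the mean size is k.
    threshold = np.sort(probs.ravel())[-probs.shape[0] * k]
    return [set(map(int, np.flatnonzero(p >= threshold))) for p in probs]

probs = np.array([[0.90, 0.05, 0.05],   # a confident prediction
                  [0.40, 0.35, 0.25]])  # an ambiguous one
print(top_k_sets(probs, 2))      # [{0, 2}, {0, 1}]: always exactly 2 habitats
print(average_k_sets(probs, 2))  # [{0}, {0, 1, 2}]: sizes vary, average is 2
```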
This framework is distributed under the Unlicense, meaning that it is dedicated to the public domain. See UNLICENSE.txt for more information.
If you plan to contribute new features, please first open an issue and discuss the feature with us. See CONTRIBUTING.md for more information.
It is strongly recommended to:
- normalize species names against the GBIF backbone, as skipping this step could become a major obstacle in your ecological studies if you seek to combine multiple datasets (the sketch below shows what this normalization does)
- include the external criteria when preprocessing the datasets, as omitting them could lead to inconsistencies while training models or making predictions
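To illustrate the normalization, here is a hedged sketch that matches one name against the GBIF backbone using GBIF's public species-matching API (an independent illustration, not the framework's own normalization code; "Fagus silvatica" is a deliberate orthographic variant of Fagus sylvatica):

```python
import requests

# Ask the GBIF backbone which accepted taxon this name corresponds to.
response = requests.get(
    "https://api.gbif.org/v1/species/match",
    params={"name": "Fagus silvatica", "kingdom": "Plantae"},
    timeout=10,
)
match = response.json()
print(match.get("scientificName"), "-", match.get("matchType"))  # e.g. a FUZZY match
```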
hdm-framework is a community-driven project with several skillful engineers and researchers contributing to it.
hdm-framework is currently maintained by CΓ©sar Leblanc with major contributions coming from Alexis Joly, Pierre Bonnet, Maximilien Servajean, and the amazing people from the Pl@ntNet Team in various forms and means.
```
├── data                              <- Folder containing data-related scripts.
│   ├── __init__.py                   <- Initialization script for the 'data' package.
│   ├── load_data.py                  <- Module for loading data into the project.
│   ├── preprocess_data.py            <- Module for data preprocessing operations.
│   └── save_data.py                  <- Module for saving data or processed data.
│
├── Data                              <- Folder containing the created data.
├── Datasets                          <- Folder containing various datasets for the project.
│   ├── EVA                           <- Folder containing original EVA datasets.
│   ├── NPMS                          <- Folder containing original NPMS datasets.
│   ├── arborescent_species.npy       <- List of all arborescent species.
│   ├── digital_elevation_model.tif   <- Digital elevation model data in TIFF format.
│   ├── eunis_habitats.xlsx           <- Excel file containing the list of EUNIS habitats.
│   ├── red_list_habitats.xlsx        <- Excel file containing the list of red list habitat data.
│   ├── ecoregions.dbf                <- Database file for ecoregion data.
│   ├── ecoregions.prj                <- Projection file for ecoregion shapefile.
│   ├── ecoregions.shp                <- Shapefile for ecoregion data.
│   ├── ecoregions.shx                <- Index file for ecoregion shapefile.
│   ├── united_kingdom_regions.dbf    <- Database file for United Kingdom regions data.
│   ├── united_kingdom_regions.prj    <- Projection file for United Kingdom regions shapefile.
│   ├── united_kingdom_regions.shp    <- Shapefile for United Kingdom regions data.
│   ├── united_kingdom_regions.shx    <- Index file for United Kingdom regions shapefile.
│   ├── vegetation.dbf                <- Database file for vegetation data.
│   ├── vegetation.prj                <- Projection file for vegetation shapefile.
│   ├── vegetation.shp                <- Shapefile for vegetation data.
│   ├── vegetation.shx                <- Index file for vegetation shapefile.
│   ├── world_countries.dbf           <- Database file for world countries data.
│   ├── world_countries.prj           <- Projection file for world countries shapefile.
│   ├── world_countries.shp           <- Shapefile for world countries data.
│   ├── world_countries.shx           <- Index file for world countries shapefile.
│   ├── world_seas.dbf                <- Database file for world seas data.
│   ├── world_seas.prj                <- Projection file for world seas shapefile.
│   ├── world_seas.shp                <- Shapefile for world seas data.
│   └── world_seas.shx                <- Index file for world seas shapefile.
│
├── Experiments                       <- Folder for experiment-related files.
│   ├── ESy                           <- Folder containing the expert system.
│   ├── cmd_lines.txt                 <- Text file with command line instructions.
│   ├── data_visualization.ipynb      <- Jupyter notebook for data visualization.
│   ├── results_analysis.ipynb        <- Jupyter notebook for results analysis.
│   ├── model_interpretability.py     <- Module for model interpretability.
│   └── test_set.ipynb                <- Jupyter notebook for creating a test set.
│
├── Images                            <- Folder for image resources.
│   ├── hdm-framework.pdf             <- Overview of hdm-framework image.
│   ├── logo.png                      <- Project logo image.
│   ├── neuron-based_models.pdf       <- Key aspect of neuron-based models image.
│   ├── tasks.png                     <- List of tasks image.
│   ├── transformer-based_models.pdf  <- Key aspect of transformer-based models image.
│   └── tree-based_models.pdf         <- Key aspect of tree-based models image.
│
├── models                            <- Folder for machine learning models.
│   ├── ftt.py                        <- Module for the FTT model.
│   ├── __init__.py                   <- Initialization script for the 'models' package.
│   ├── mlp.py                        <- Module for the MLP model.
│   ├── rfc.py                        <- Module for the RFC model.
│   ├── tnc.py                        <- Module for the TNC model.
│   └── xgb.py                        <- Module for the XGB model.
│
├── Models                            <- Folder containing the trained models.
├── pipelines                         <- Folder containing pipeline-related scripts.
│   ├── check.py                      <- Module for checking the configuration.
│   ├── dataset.py                    <- Module for creating the train dataset.
│   ├── evaluation.py                 <- Module for evaluating the models.
│   ├── __init__.py                   <- Initialization script for the 'pipelines' package.
│   ├── prediction.py                 <- Module for making predictions.
│   └── training.py                   <- Module for training the models.
│
├── .github                           <- Folder for GitHub-related files.
│   ├── ISSUE_TEMPLATE                <- Folder for issues-related files.
│   │   ├── bug_report.md             <- Template for reporting bugs.
│   │   └── feature_request.md        <- Template for requesting new features.
│   │
│   └── pull_request_template.md      <- Template for creating pull requests.
│
├── cli.py                            <- Command-line interface script for the project.
├── CODE_OF_CONDUCT.md                <- Code of conduct document for project contributors.
├── CONTRIBUTING.md                   <- Guidelines for contributing to the project.
├── environment.yml                   <- YAML file specifying project dependencies.
├── __init__.py                       <- Initialization script for the root package.
├── main.py                           <- Main script for running the project.
├── README.md                         <- README file containing project documentation.
├── requirements.txt                  <- Text file listing project requirements.
├── SECURITY.md                       <- Security guidelines and information for the project.
├── UNLICENSE.txt                     <- License information for the project (Unlicense).
└── utils.py                          <- Utility functions for the project.
```
