GenoSieve 🧬

Intelligent Subsampling for Genomic Epidemiology

GenoSieve is a Python tool for the intelligent subsampling of large genomic sequence datasets. It strategically reduces a dataset to a target size while ensuring the subset is both representative (preserving geographic and temporal distributions) and genetically diverse (maximizing sequence variability).

This tool is designed for researchers in genomic epidemiology, virology, and evolutionary biology who need to create meaningful, high-quality subsets for phylogenetic inference, modeling, or other computationally intensive analyses.

🤔 Why GenoSieve?

Randomly downsampling a large dataset is fast but risky. You might lose rare variants, under-represent emerging clades, or skew the geographic distribution of your samples.

GenoSieve addresses this with a multi-step, objective-driven approach. It treats subsampling as an optimization problem rather than a random draw: it searches for a subset that preserves the geographic, temporal, and genetic structure of the original data.

🔬 How It Works

GenoSieve follows a three-stage pipeline: it first splits the dataset into groups (e.g., by month and clade), then allocates and selects sequences within each group:

1. Stratify: The dataset is first grouped by a time period (--date_freq) and clade. The subsampling process runs independently on each group.
2. Allocate: For each group, GenoSieve determines how many sequences to select from each geographic region. This allocation is proportional to the number of available sequences but can be fine-tuned with diversity-aware weights.
3. Select: Finally, the tool selects the specific sequences from each region. It uses a Genetic Algorithm or a Hybrid Heuristic to find the most genetically diverse subset that matches the allocated size.
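
The Stratify step can be pictured with a minimal sketch. It is not GenoSieve's own code: the function names `period_key` and `stratify` are hypothetical, and the records are made-up examples that only reuse the metadata columns (`name`, `date`, `region`, `clade`) and the `--date_freq` codes listed in this README.

```python
from collections import defaultdict
from datetime import date


def period_key(iso_date: str, freq: str = "M") -> str:
    """Map an ISO date (YYYY-MM-DD) to a period label for D, W, M or Y."""
    d = date.fromisoformat(iso_date)
    if freq == "D":
        return d.isoformat()
    if freq == "W":
        year, week, _ = d.isocalendar()  # ISO week numbering
        return f"{year}-W{week:02d}"
    if freq == "M":
        return f"{d.year}-{d.month:02d}"
    if freq == "Y":
        return str(d.year)
    raise ValueError(f"unsupported frequency: {freq}")


def stratify(records, freq="M"):
    """Group sequence names by (period, clade); each group is processed alone."""
    groups = defaultdict(list)
    for rec in records:
        key = (period_key(rec["date"], freq), rec["clade"])
        groups[key].append(rec["name"])
    return dict(groups)


# Hypothetical metadata rows, for illustration only.
records = [
    {"name": "s1", "date": "2021-03-02", "region": "Europe", "clade": "20A"},
    {"name": "s2", "date": "2021-03-19", "region": "Asia", "clade": "20A"},
    {"name": "s3", "date": "2021-03-25", "region": "Asia", "clade": "20B"},
    {"name": "s4", "date": "2021-04-01", "region": "Europe", "clade": "20A"},
]

for key, names in sorted(stratify(records, freq="M").items()):
    print(key, names)
```

With monthly grouping, `s1` and `s2` share a group, while `s3` (different clade) and `s4` (different month) each form their own.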

✨ Key Features

- Smart Proportional Allocation: Guarantees fair representation for each region using a smoothing factor (--alpha) and a "one-per-region" guarantee.
- Diversity-Weighted Allocation: Optionally adjusts allocation to prioritize regions with higher intra-clade genetic diversity (--use_diversity).
- Advanced Diversity Maximization:
  - Genetic Algorithm (GA): A powerful optimization engine to find the subset that best satisfies a diversity objective (--objective_function).
  - Hybrid Sampling: A fast and effective heuristic that prioritizes unique sequences (singletons).
- Sequence Vectorization: Transforms sequences into numerical vectors using k-mers and TF-IDF, enabling robust distance calculations.
- Stratified Workflow: Preserves the temporal and phylogenetic structure of your data by processing groups independently.
- Flexible Command-Line Interface (CLI): Offers granular control over every step of the pipeline.
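
To make the k-mer and TF-IDF idea concrete, here is a small standard-library sketch. It is an illustration under assumptions, not GenoSieve's implementation: the k-mer length, the smoothed IDF formula, the gap handling, and the choice of cosine distance are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers; alignment gaps ('-') are dropped (assumption)."""
    seq = seq.upper().replace("-", "")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tfidf_vectors(seqs, k: int = 3):
    """Return one sparse TF-IDF vector (dict of k-mer -> weight) per sequence."""
    counts = [kmer_counts(s, k) for s in seqs]
    n = len(seqs)
    # Document frequency: in how many sequences each k-mer occurs.
    df = Counter(kmer for c in counts for kmer in c)
    # Smoothed inverse document frequency (one common variant).
    idf = {kmer: math.log((1 + n) / (1 + d)) + 1.0 for kmer, d in df.items()}
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({kmer: (cnt / total) * idf[kmer] for kmer, cnt in c.items()})
    return vectors


def cosine_distance(a: dict, b: dict) -> float:
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(kmer, 0.0) for kmer, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


seqs = ["ACGTACGT", "ACGTACGT", "ACGTTCGT", "TTTTTTTT"]
vecs = tfidf_vectors(seqs)
print(cosine_distance(vecs[0], vecs[1]))  # identical sequences: close to 0
print(cosine_distance(vecs[0], vecs[2]))  # one substitution: small but positive
print(cosine_distance(vecs[0], vecs[3]))  # no shared k-mers: 1.0
```

Rare k-mers receive a higher IDF weight, so sequences carrying uncommon variants sit farther from the bulk of the dataset than raw k-mer counts alone would suggest.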

⚙️ Installation

Requirements:

Git
Miniconda or Anaconda

Follow these steps in your terminal:

# 1. Clone the repository
git clone https://github.com/vcarius/genosieve.git
cd genosieve

# 2. Create the Conda environment from the YAML file
# This creates an environment named "genosieve"
conda env create -f environment.yml

# 3. Activate the environment before running the script
conda activate genosieve

You are now ready to run GenoSieve!

🚀 Usage

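GenoSieve is run from the command line. The commands below call the script genosieve.py; the repository file listing shows the file as GenoSieve.py, so on a case-sensitive filesystem use the name exactly as it appears in your checkout.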

Basic Command Structure

python genosieve.py --METADATA <path_to_metadata.tsv> \
                    --FASTA_ALN <path_to_alignment.fasta> \
                    --target_N <samples_per_group> \
                    [OPTIONS]

Example 1: Fast Hybrid Sampling (Default)

This command subsamples to 500 sequences per month/clade group using the default fast heuristic.

python genosieve.py --METADATA data/metadata.tsv \
                    --FASTA_ALN data/sequences.fasta \
                    --target_N 500 \
                    --date_freq M
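
The README describes the hybrid heuristic only as prioritizing unique sequences (singletons). One way to picture that idea is the sketch below. The function `singleton_first_sample` and its fill-up order (singletons, then one representative per duplicated sequence, then any leftovers) are assumptions for illustration, not the tool's actual procedure.

```python
import random
from collections import defaultdict


def singleton_first_sample(seqs: dict, n: int, seed: int = 0) -> list:
    """Pick up to n names, preferring sequences that occur exactly once."""
    rng = random.Random(seed)
    by_sequence = defaultdict(list)
    for name, seq in seqs.items():
        by_sequence[seq].append(name)

    # 1. Singletons: sequences seen only once in this region/group.
    singletons = [names[0] for names in by_sequence.values() if len(names) == 1]
    rng.shuffle(singletons)
    chosen = singletons[:n]

    # 2. One randomly chosen representative per duplicated sequence.
    representatives = [rng.choice(names) for names in by_sequence.values()
                       if len(names) > 1]
    rng.shuffle(representatives)
    chosen += representatives[:n - len(chosen)]

    # 3. If there is still room, fill with the remaining duplicates.
    if len(chosen) < n:
        taken = set(chosen)
        leftovers = [name for name in seqs if name not in taken]
        rng.shuffle(leftovers)
        chosen += leftovers[:n - len(chosen)]
    return chosen


# Hypothetical toy input: "a" and "b" are identical, "c" and "d" are unique.
toy = {"a": "ACGT", "b": "ACGT", "c": "ACGA", "d": "TTTT"}
print(singleton_first_sample(toy, 3))
```

With a target of 3, both unique sequences are always kept and exactly one of the two identical sequences fills the last slot.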

Example 2: Advanced Genetic Algorithm Optimization

This command uses the Genetic Algorithm to select 300 sequences per group, maximizing the minimum pairwise distance within each selection (--objective_function min). Because --date_freq is not given, groups use the default monthly frequency.

python genosieve.py --METADATA data/metadata.tsv \
                    --FASTA_ALN data/sequences.fasta \
                    --target_N 300 \
                    --use_GA \
                    --objective_function min \
                    --generations 150 \
                    --verbose
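
The `min` objective scores a candidate subset by its smallest pairwise distance, so a higher score means no two selected sequences are near-duplicates. The sketch below shows that objective together with a greedy farthest-point selection as a simple stand-in; it does not reproduce the Genetic Algorithm. Hamming distance on aligned sequences is used here only to keep the example short, and all function names are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(subset, dist=hamming):
    """The 'min' objective: the smallest distance between any two members."""
    return min(dist(a, b) for a, b in combinations(subset, 2))


def greedy_max_min(seqs, n, dist=hamming):
    """Greedy farthest-point selection; returns indices into seqs."""
    if n < 2 or len(seqs) < 2:
        return list(range(min(n, len(seqs))))
    # Start from the most distant pair.
    i, j = max(combinations(range(len(seqs)), 2),
               key=lambda p: dist(seqs[p[0]], seqs[p[1]]))
    chosen = [i, j]
    remaining = set(range(len(seqs))) - {i, j}
    while len(chosen) < n and remaining:
        # Add the candidate whose nearest selected neighbour is farthest away.
        best = max(sorted(remaining),
                   key=lambda r: min(dist(seqs[r], seqs[c]) for c in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen


toy = ["AAAA", "AAAT", "TTTT", "AATT", "AAAA"]
picked = greedy_max_min(toy, 3)
print(picked, min_pairwise_distance([toy[i] for i in picked]))
```

A greedy pass like this is fast but can get stuck in a locally good choice; a population-based search such as a GA explores many candidate subsets against the same objective, at a higher computational cost.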

📌 Command-Line Arguments

Here are the most important arguments. For a full list, run python genosieve.py --help.

| Argument | Description | Default |
|---|---|---|
| `--METADATA` | Required. Path to metadata TSV file. Must contain `name`, `date`, `region`, `clade`. | `None` |
| `--FASTA_ALN` | Required. Path to the sequence alignment file in FASTA format. | `None` |
| `--target_N` | Required. The total number of sequences to select per group (time/clade). | `None` |
| `--use_GA` | Enable the Genetic Algorithm for diversity maximization. If false, uses the faster hybrid method. | `False` |
| `--objective_function` | The fitness function for the GA. `min` (maximize minimum distance) is recommended. | `min` |
| `--alpha` | Controls the smoothness of proportional allocation (0=uniform, 1=proportional). | `0.5` |
| `--date_freq` | The frequency for grouping dates (`D`=day, `W`=week, `M`=month, `Y`=year). | `M` |
| `--dedup` | If enabled, removes identical sequences within the same region before subsampling. | `False` |
| `--use_diversity` | If enabled, uses sequence diversity to adjust allocation weights. | `False` |
| `--verbose` | Print detailed progress, especially useful for monitoring the GA. | `False` |
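
To give a feel for what `--alpha` does, the sketch below allocates a per-group target across regions. The exact formula GenoSieve uses is not stated in this README; this example assumes region weights of `count ** alpha`, which reproduces the documented endpoints (0 = uniform, 1 = proportional), plus a one-per-region floor and a cap at the number of available sequences. The function name `allocate` and the region counts are hypothetical.

```python
def allocate(counts: dict, target_n: int, alpha: float = 0.5) -> dict:
    """Split target_n across regions with weights count ** alpha (assumption)."""
    regions = sorted(r for r, c in counts.items() if c > 0)
    # Never ask for more sequences than the group contains.
    target_n = min(target_n, sum(counts[r] for r in regions))
    weights = {r: counts[r] ** alpha for r in regions}
    total_w = sum(weights.values())

    # One-per-region guarantee, applied only when the target is large enough.
    start = 1 if target_n >= len(regions) else 0
    alloc = {r: start for r in regions}

    # Hand out the remaining seats one at a time to the region that is
    # furthest below its ideal share and still has sequences available.
    for _ in range(target_n - sum(alloc.values())):
        open_regions = [r for r in regions if alloc[r] < counts[r]]
        best = max(open_regions,
                   key=lambda r: target_n * weights[r] / total_w - alloc[r])
        alloc[best] += 1
    return alloc


# Hypothetical region counts for one month/clade group.
counts = {"Asia": 900, "Europe": 90, "Oceania": 10}
for alpha in (0, 0.5, 1):
    print(alpha, allocate(counts, 100, alpha=alpha))
```

Under these assumptions, lowering alpha shifts seats from the largest region toward smaller ones, until a small region runs out of available sequences.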

📄 Citation

If you use GenoSieve in your research, please cite this repository:

Souza, V.C. (2025). GenoSieve: Intelligent Subsampling for Genomic Epidemiology. GitHub. https://github.com/vcarius/genosieve

(Please update the year and details as needed when you publish.)

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.

👨‍💻 Author & Maintainer

Developed by Vinicius Carius de Souza.

GitHub: @vcarius
LinkedIn: [@vcarius](https://www.linkedin.com/)
