Tools for ensuring the quality of multiple sequence alignments (MSAs) in Rfam.
This repository includes a modular validation script for Stockholm format alignment files (.so, .sto, .stk).
python3 validate_stockholm.py [-v] [--fix] [--output-mode {stdout,file}] [--cm-db Rfam.cm] <file1.so> [file2.so ...]Options:
-v, --verbose: Print detailed validation information--fix: Attempt to fix fixable errors automatically--output-mode {stdout,file}: Output mode for fixed files (default: file)file: Create a new file with_correctedsuffixstdout: Print corrected content to stdout
--cm-db <path>: Path to Rfam CM database for filtering sequences matching known Rfam families (optional)
Fatal Errors (must be fixed manually):
- Missing
# STOCKHOLM 1.0header - Missing
//terminator - No sequences found in alignment
- All sequences must have the same length
- Sequences must not contain whitespace characters
Fixable Errors (can be auto-corrected with --fix):
- Duplicate sequences (same accession, coordinates, and sequence data)
- Missing coordinates (sequences without start/end positions)
- Overlapping sequences (sequences from the same accession that overlap by >=1 bp)
Warnings (non-critical):
- Sequences matching known Rfam families (when
--cm-dbis provided) — flagged but not removed - Missing 2D structure consensus annotation (
#=GC SS_cons) #=GC SS_conslength mismatch with sequence length#=GC SS_consinvalid WUSS notation characters or unbalanced brackets- Lines exceeding 10,000 character limit
The validation logic is split into separate modules in the scripts/ directory:
fatal_errors.py: Errors that cannot be automatically fixedfixable_errors.py: Errors that can be automatically corrected (including coordinate fixing, overlap removal, and BLAST fallback)stockholm_warnings.py: Non-critical issuesparser.py: Stockholm file parsing utilities (supports both single-block and interleaved multi-block format)alignment_stats.py: Pairwise identity computation for alignment quality assessmentconfig.py: Configurable parameters (BLAST thresholds, cmscan E-value, NCBI request delay)
Sequences should follow the format: ACCESSION/START-END where:
ACCESSIONis the sequence identifier (e.g., from GenBank likeAF228364.1)START-ENDare the coordinates indicating which portion of the original sequence is included (e.g.,1-74)
Example: AF228364.1/1-74
Sequence data:
- May contain any characters except whitespace
- Gaps may be indicated by
.or-
The script can detect and remove duplicate sequences using the --fix flag. Duplicates are defined as sequences that have:
- The same accession/identifier
- The same coordinates
- The exact same sequence data
When sequences are missing coordinates (e.g., NZ_CP038662.1 instead of NZ_CP038662.1/4723652-4723704), the --fix flag will:
- NCBI Direct Lookup: Download the FASTA sequence from NCBI and find coordinates by matching the alignment sequence
- BLAST Fallback: If direct lookup fails, perform a BLAST search against NCBI's nucleotide database (accepts hits with >=95% identity, >=90% coverage, e-value <=1e-10)
- Removal: Sequences that cannot be resolved via either method are removed from the output
Requirements: BioPython (pip install biopython)
Sequences from the same accession (species) that overlap by at least 1 bp are detected. When using --fix, overlapping sequences are removed using greedy interval scheduling to keep the maximum number of non-overlapping sequences.
When using --fix, all sequences are validated against NCBI to ensure they match the source data at the given coordinates:
- NCBI Validation: Fetches the source sequence from NCBI and compares it against the Stockholm sequence at the given coordinates
- BLAST Fallback: If the accession is not found in NCBI or the sequence doesn't match, a BLAST search is performed. If a hit is found (>=95% identity, >=90% coverage, e-value <=1e-10), the hit's accession is re-fetched from NCBI and the sequence is verified for an exact match before the rescue is accepted. This ensures compatibility with Rfam's
rfsearch.plMD5 validation. - Removal: Sequences that fail both NCBI validation and BLAST (or whose BLAST hit does not exactly match NCBI) are removed from the output
When building a new Rfam family, you can check whether sequences already belong to existing Rfam families using the --cm-db option. This requires:
- Download Rfam.cm from https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
- Decompress:
gunzip Rfam.cm.gz - Press the database:
cmpress Rfam.cm(generates.i1f,.i1i,.i1m,.i1pindex files) - Install Infernal:
cmscanmust be available in your PATH (http://eddylab.org/infernal/)
cmscan is run with --cut_ga to use each model's GA (gathering) threshold, matching how Rfam curates family membership. Hits that pass GA are flagged as warnings but not removed. An additional E-value safety filter (CMSCAN_MAX_EVALUE in scripts/config.py, default 1e-3) is applied on top.
# Check for known families during fixing
python3 validate_stockholm.py -v --fix --cm-db Rfam.cm input.stoThe released Rfam.cm is static and only updated with each Rfam release. To filter against the latest curated families (including those not yet released), you can build a live CM database directly from the Rfam SVN repository:
# Build live CM database (downloads all family CMs and runs cmpress)
python3 scripts/build_live_cm.py
# Custom output path
python3 scripts/build_live_cm.py --output /path/to/Rfam_live.cm
# Then use with the validator
python3 validate_stockholm.py -v --fix --cm-db Rfam_live.cm input.stoThis fetches CM files from https://svn.rfam.org/svn/data_repos/trunk/Families/ (publicly readable, no credentials needed) and concatenates them into a single pressed CM database. Requires wget and cmpress (Infernal) in your PATH.
When running with --fix -v, the script computes and displays the average pairwise identity for each sequence in the final alignment. Sequences with identity significantly below the alignment average (>20 percentage points) are flagged as potential outliers.
After all sequence filtering and corrections are applied, the script runs esl-reformat --mingap stockholm on the final alignment to remove any all-gap columns. All-gap columns can be introduced when sequences are removed during earlier steps (duplicate removal, overlap resolution, NCBI validation). This step ensures the corrected alignment has no unnecessary gap-only columns.
Requires esl-reformat from the Easel library, which is distributed as part of Infernal. If esl-reformat is not found in PATH, this step is skipped and a warning is printed.
The report includes the number of all-gap columns removed and the resulting alignment width.
When using --fix in file output mode, a report file (<stem>_Report.txt) is generated alongside the corrected alignment, capturing the full verbose output of the validation and fixing process.
Parameters can be adjusted in scripts/config.py:
| Parameter | Default | Description |
|---|---|---|
BLAST_MIN_IDENTITY |
95 | Minimum identity % for BLAST hits |
BLAST_MIN_COVERAGE |
90 | Minimum coverage % for BLAST hits |
BLAST_MAX_EVALUE |
1e-10 | Maximum e-value for BLAST hits |
BLAST_HITLIST_SIZE |
5 | Number of BLAST hits to retrieve |
CMSCAN_MAX_EVALUE |
1e-3 | Maximum e-value for cmscan hits (known family warnings) |
NCBI_REQUEST_DELAY |
0.5 | Delay between NCBI requests in seconds |
# Validate a single file
python3 validate_stockholm.py example_valid.so
# Validate multiple files with verbose output
python3 validate_stockholm.py -v file1.so file2.so file3.so
# Fix errors and create corrected file
python3 validate_stockholm.py --fix file.so
# Fix and output to stdout
python3 validate_stockholm.py --fix --output-mode stdout file.so
# Fix with known Rfam family filtering
python3 validate_stockholm.py -v --fix --cm-db Rfam.cm file.stoThe repository includes a GitHub Action that automatically validates Stockholm files in pull requests. The CI check will:
- Trigger when a PR modifies
.so,.sto, or.stkfiles - Run the validation script on all changed files
- Pass only if all files are valid
Stockholm format is used for multiple sequence alignments. Basic structure:
# STOCKHOLM 1.0
AF228364.1/1-74 CGGCAGAUGAUGAU-UUUACUUGGAUUCCCCUUCAGAACAUUUA
AF228365.1/1-73 CGGCAGAUGAUGAU-UUUACUUGGAUUCCCCUUCAGAACAUUU
#=GC SS_cons <<<<_______..________.__._.______.___.___.___
//
The #=GC SS_cons line is the 2D structure consensus annotation, which represents the secondary structure of the RNA alignment. While not strictly required, it is recommended for Rfam alignments.
For more information, see the Stockholm format specification.
If you have any questions or feedback, feel free to submit a GitHub issue or email us.