Materials for the hands-on session 'Workflows for natural product genome mining with antiSMASH, BiG-SCAPE and MIBiG', of the CIIMAR short course Bioinformatics for Natural Product Discovery 2025

medema-group/Genome-Mining-2025

Genome Mining Workshop 2025

Materials for the hands-on session 'Workflows for natural product genome mining with antiSMASH, BiG-SCAPE and MIBiG', of the CIIMAR short course Bioinformatics for Natural Product Discovery 2025

This hands-on session features three main tools/resources:

  • antiSMASH & MIBiG (online)
  • BiG-SCAPE (offline, on your laptop or server)

In this repo you'll find the materials for the antiSMASH part of this hands-on session. We recommend you start with this section. Once you're done, or approximately 45 min into the session, we'll switch over to BiG-SCAPE; find the materials for that part here.

antiSMASH Documentation

The official documentation for antiSMASH can be found here.

See a supporting document with a demo of the antiSMASH online webserver here. Use this file to guide your exploration of antiSMASH.


antiSMASH Exercises

To save time today, we will only work with the pre-computed antiSMASH example record for Streptomyces coelicolor instead of uploading and running individual sequences.

Exercise 1: secondary metabolite gene clusters of Streptomyces coelicolor

As an example, we will analyze the secondary metabolite gene clusters of the model organism Streptomyces coelicolor, which is provided as pre-computed demo data:

Load S. coelicolor demo data

  1. Open antiSMASH in your web browser
  2. Click “Open example output”

Region overview table

When the analysis is finished, a table of identified regions is displayed.

  1. Select individual regions by clicking either on the colored boxes or on the colored “Region XXX” button
  2. Return to the overview table by clicking on “Overview”

Cluster details

  1. Browse through the different regions; use the interactive page and get a feeling for it by clicking on the genes in the “Gene Cluster description”, “Detailed annotation”, and “Homologous (Sub)cluster” panels.
  2. Try the links to BLAST or other services that appear in the dropdown windows when clicking on genes/domains.
  3. Analyze the CDA gene cluster (Cluster 11)

Questions:

  • Extract the predicted sequence of the peptide from the antiSMASH results.
  • Compare the predicted amino acid sequence with the experimentally determined structure.
  • Find other organisms that might produce similar compounds.
  • Find the genes in the CDA gene cluster that code for enzymes involved in the biosynthesis of the non-proteinogenic amino acid hydroxyphenylglycine (Hpg).

Exercise 2: secondary metabolite gene clusters of Bacillus velezensis

  1. Find and download the Bacillus velezensis (amyloliquefaciens) FZB42 genome from NCBI (RefSeq accession: NC_009725.2)
  2. Run antiSMASH with the genome
  3. Analyze the data as discussed above.
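Step 1 can also be scripted instead of done through the NCBI website. The following is a minimal sketch, assuming the NCBI E-utilities `efetch` endpoint with `rettype=gbwithparts` (GenBank record including the sequence); the function names `build_efetch_url` and `download_genome` and the output file name are illustrative choices, not part of the workshop materials.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="gbwithparts"):
    """Build an efetch URL for a nucleotide record (hypothetical helper)."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def download_genome(accession, out_path):
    """Download the record to out_path (hypothetical helper, needs network access)."""
    with urlopen(build_efetch_url(accession)) as response, open(out_path, "wb") as handle:
        handle.write(response.read())


if __name__ == "__main__":
    # Print the URL only; uncomment the next line to actually download the file.
    print(build_efetch_url("NC_009725.2"))
    # download_genome("NC_009725.2", "NC_009725.2.gbk")
```

The resulting GenBank file can then be uploaded to the antiSMASH web server in step 2.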

Questions:

  • How many BGCs does B. velezensis have?
  • Which are PKS / NRPS / hybrid products?
  • Which compounds are known?
  • Can you find these BGCs in the antiSMASH database? And in MIBiG?

Exercise 3: secondary metabolite gene clusters of Streptomyces leeuwenhoekii

  1. Open the antiSMASH results for a recently sequenced streptomycete genome, generated once with default settings and once in 'loose' mode.

Questions:

  • Explore the results produced by the default run settings.

  • How many biosynthetic gene clusters (BGCs) did antiSMASH identify? Based on the results, which known compounds do you expect this strain to be able to produce? Hint: to assess this, take a look at the detailed knownclusterblast results for each cluster that has more than 50% similarity on the gene level to a known cluster.

  • Now take a look at the results produced by the 'loose' mode.

    • Focus on some of the ‘newly added’ BGCs, and specifically look at the smCOG annotations and the knownclusterblast results.
    • Can you identify some clusters that are very probable to encode the biosynthesis of an actual secondary metabolite?
    • And can you find some clusters for which this is very unlikely? Which further methods could you use to identify those putative BGCs that are likely to encode the biosynthesis of a biologically active molecule?
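If you note down the knownclusterblast hits per region by hand, the 50% similarity hint from the questions above can be applied with a few lines of code. This is only a sketch: the record layout, the function name `confident_hits` and the example values are made-up placeholders, not antiSMASH output.

```python
def confident_hits(hits, min_similarity=50.0):
    """Group known-cluster hits above the similarity threshold by region.

    `hits` is a list of dicts with the hypothetical keys
    'region', 'known_cluster' and 'similarity' (in percent).
    """
    by_region = {}
    for hit in hits:
        # Strictly greater than the threshold, following "more than 50%".
        if hit["similarity"] > min_similarity:
            by_region.setdefault(hit["region"], []).append(hit["known_cluster"])
    return by_region


# Placeholder records typed in by hand; replace with your own observations.
example_hits = [
    {"region": 1, "known_cluster": "known cluster A", "similarity": 100.0},
    {"region": 1, "known_cluster": "known cluster B", "similarity": 12.0},
    {"region": 2, "known_cluster": "known cluster C", "similarity": 50.0},
    {"region": 3, "known_cluster": "known cluster D", "similarity": 83.0},
]

print(confident_hits(example_hits))
```

Regions that do not appear in the result have no hit above the threshold and are candidates for producing unknown compounds.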

3.1 Chaxamycin

The following molecule is known to be made by the sequenced strain.

The molecule (called chaxamycin) is an ansamycin-type polyketide. Ansamycins are characterized by the presence of a macrocycle composed of a benzenic or naphthalenic chromophore, bridged by an aliphatic ansa chain that terminates at the chromophore in an amide linkage. The key precursor of the chromophore is 3-amino-5-hydroxybenzoic acid (AHBA), which is known to be synthesized by proteins encoded by a specific sub-cluster that is found in all ansamycin PKS gene clusters, such as those for the production of rifamycin, macbecin and naphthomycin.

  • Based on the antiSMASH results, which of the gene clusters in the sequenced genome is most likely to produce the chaxamycins?
    • Which antiSMASH feature(s) did you use to conclude this?
  • Look at the ClusterBlast results for the region you identified. Does this region entry represent one single gene cluster, or does it in fact represent two separate but adjacently located gene clusters that are part of a single region? Why do you think so?

3.2 Peptidic Fragments

Based on a mass spectrometry experiment, two apparently new natural products are identified from the strain. Based on fragmentation analysis, both appear to be peptides. For each of the two metabolites, a fragment of six amino acids was reconstructed from the mass shifts in the tandem mass spectra.

  • Fragment from peptide 1: Ala-Val-Ala-Phe-Orn-Thr
  • Fragment from peptide 2: Leu-Tyr-Gly-Val-Arg-Asn

The total mass of the entire second peptide is approximately twice as large as that of the first peptide. Hint: one of the two peptides is not produced by an NRPS assembly line.

  • Based on the antiSMASH results (default mode), which gene clusters do you think are responsible for the biosynthesis of the two peptides? What strategy did you use to find this out?
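One possible strategy for the question above is to compare each fragment with the monomers predicted for an NRPS region, and with the precursor peptide sequence of a ribosomal peptide region. The sketch below assumes you have copied those predictions from the antiSMASH pages by hand; `toy_nrps_monomers` and `toy_precursor` are invented placeholders, and all function names are illustrative.

```python
# One-letter codes for the proteinogenic residues occurring in the two fragments.
THREE_TO_ONE = {
    "ala": "A", "arg": "R", "asn": "N", "gly": "G", "leu": "L",
    "phe": "F", "thr": "T", "tyr": "Y", "val": "V",
}


def parse_fragment(text):
    """Split 'Ala-Val-...' into a list of lowercase residue names."""
    return [residue.strip().lower() for residue in text.split("-")]


def contains_run(monomers, fragment):
    """True if the fragment occurs as a contiguous run, read in either direction."""
    monomers = [monomer.lower() for monomer in monomers]
    width = len(fragment)
    for candidate in (fragment, fragment[::-1]):
        for start in range(len(monomers) - width + 1):
            if monomers[start:start + width] == candidate:
                return True
    return False


def to_one_letter(fragment):
    """Translate to one-letter codes; None if a residue has no code in the table."""
    if any(residue not in THREE_TO_ONE for residue in fragment):
        return None
    return "".join(THREE_TO_ONE[residue] for residue in fragment)


fragment_1 = parse_fragment("Ala-Val-Ala-Phe-Orn-Thr")
fragment_2 = parse_fragment("Leu-Tyr-Gly-Val-Arg-Asn")

# Invented placeholders; replace with predictions copied from your own results.
toy_nrps_monomers = ["Ser", "Ala", "Val", "Ala", "Phe", "Orn", "Thr", "Gly"]
toy_precursor = "MKELLYGVRNCTA"

print(contains_run(toy_nrps_monomers, fragment_1))
print(to_one_letter(fragment_2) in toy_precursor)
```

A fragment that cannot be translated to one-letter codes contains a residue outside the lookup table, so it can only be matched against monomer predictions, not against a precursor peptide sequence.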

3.3 Predicting the chemistry of the product of an unknown BGC

Now have a look at the gene cluster in region 10 (default mode).

  • To which known gene clusters is this cluster similar?

    • What are the known biological activities of the product of these clusters?
    • Look into the literature references at the bottom of the MIBiG entries, which can be reached by clicking the link on the accession number of the knownclusterblast hit.
  • What are the differences between the known gene cluster and this one?

    • Based on the chemistry generated by the enzymes that differ, what do you predict about the potential structural differences between the known molecule and the region 10 product? Do you consider this a promising target for further investigation?

For your reference, please find a paper describing the entire genome sequence used above here. For results on experiments on the chaxamycin cluster, see here.

Now we'll switch over to BiG-SCAPE; find the materials for that part here.


3.1 Chaxamycin bonus questions

An alternative way of finding the cluster would have been to use the NaPDoS tool to identify ketosynthase domains clustering with those of ansamycin PKSs in a phylogeny. To do this:

  • go to the genome sequence
  • click ‘Protein’ under ‘Related information’
  • click ‘Send to’ -> ‘File’ -> ‘Format: FASTA’ -> ‘Create file’.
  • Now upload the proteins belonging to the genome to NaPDoS and use it to construct a phylogenetic tree.
  • View the SVG in your browser and locate the rifamycin PKS reference ketosynthase domains (belonging to the PKS genes RifA, RifB, etc.).
  • Do you see KS domains from your genome clustering with these?

Sequence similarity networking

Sequence similarity network analysis is a useful technique to identify similar BGCs across multiple genomes. The BiG-SCAPE tool automates this process. Have a look at the BiG-SCAPE 1 output of 96 complete streptomycete genomes.

  • Can you find chaxamycin? What can you conclude about the taxonomic distribution of chaxamycin gene clusters?
  • Now find its relative geldanamycin in the same output. Can you find some strains in the dataset that are likely to also produce geldanamycin?
  • This network was made with a relatively strict cut-off of 0.3. Based on the structures of the molecules and the architectures of the BGCs, what would likely happen to the geldanamycin and chaxamycin connected components in the network if you used a less strict cut-off (e.g., 0.7)?
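To build intuition for how the cut-off shapes the network, the toy example below computes connected components for two cut-offs. It is a sketch, not BiG-SCAPE code: it assumes that an edge is kept when the pairwise distance is at or below the cut-off, and the BGC names and distances are invented.

```python
def connected_components(nodes, distances, cutoff):
    """Return the connected components when edges with distance <= cutoff are kept."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the representative, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for (left, right), distance in distances.items():
        if distance <= cutoff:
            parent[find(left)] = find(right)

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=sorted)


# Invented BGC labels and pairwise distances (lower means more similar).
toy_nodes = ["bgc_A", "bgc_B", "bgc_C", "bgc_D", "bgc_E"]
toy_distances = {
    ("bgc_A", "bgc_B"): 0.20,
    ("bgc_B", "bgc_C"): 0.25,
    ("bgc_D", "bgc_E"): 0.10,
    ("bgc_C", "bgc_D"): 0.60,
}

for cutoff in (0.3, 0.7):
    components = connected_components(toy_nodes, toy_distances, cutoff)
    print(cutoff, len(components), components)
```

In this toy data the two groups stay separate at 0.3 and join at 0.7 because of a single edge with distance 0.60; whether the real geldanamycin and chaxamycin components behave the same way is what the question asks you to reason about.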
