
Support for BILN #3541

@ljubica-milovic


Background

BILN is a file format similar to HELM. The major difference from HELM is the set of monomer types it supports: only amino acids (type: PEPTIDE) and CHEMs (type: CHEM). The advantage of BILN over HELM is its human-readability.

As with HELM and SMILES, one BILN string maps onto exactly one chemical structure, but one structure can be represented by multiple strings. To ensure the uniqueness of an exported BILN string, a set of best practices is defined (described in the export section of this ticket), but strings not constructed using those best practices should also be able to be loaded (described in the import section of this ticket).

Ketcher ticket.

Requirements

Export of BILN

  1. A structure (or several structures) can be exported to BILN only if all of the monomers are of type PEPTIDE and/or CHEM* $\color{Red}{\textbf{and}}$ have BILN codes. If that is not the case and the user tries to save to BILN, Indigo should return an error: "Only amino acids and CHEMs with BILN codes can get exported to BILN."

\* BILN only supports monomers explicitly defined as CHEMs. The HELM functionality where unknown small molecules get exported as CHEMs with SMILES does not exist in BILN.
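The export precondition can be sketched as a simple check over the monomers of a structure. This is a minimal illustration, not Indigo's implementation: the `Monomer` class, its field names, and `check_biln_exportable` are hypothetical; only the two type names and the error text come from this ticket.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Error text taken verbatim from the requirement above.
BILN_EXPORT_ERROR = (
    "Only amino acids and CHEMs with BILN codes can get exported to BILN."
)


@dataclass(frozen=True)
class Monomer:
    """Hypothetical monomer record used only for this sketch."""
    monomer_type: str          # e.g. "PEPTIDE", "CHEM", "RNA"
    biln_code: Optional[str]   # None when the monomer has no BILN code


def check_biln_exportable(monomers: Iterable[Monomer]) -> None:
    """Raise ValueError unless every monomer is a PEPTIDE or CHEM with a BILN code."""
    for monomer in monomers:
        supported_type = monomer.monomer_type in {"PEPTIDE", "CHEM"}
        if not supported_type or not monomer.biln_code:
            raise ValueError(BILN_EXPORT_ERROR)


# A peptide plus a CHEM that both have BILN codes pass the check.
check_biln_exportable([Monomer("PEPTIDE", "A"), Monomer("CHEM", "A6OH")])
```

A CHEM without a BILN code (for example a small molecule that HELM would export with SMILES) fails the same check as a monomer of an unsupported type.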

$\color{Red}{\textbf{Logic for constructing the BILN string}}$

  2. Different backbones are separated from each other with a full stop.

Same as in HELM, a backbone starts with the first monomer that has an occupied R2 but either doesn't have R1 or doesn't have an occupied R1.

2.1. In case of a circular backbone (where all connections are R1-R2), the BILN string that would be the first alphabetically is chosen.

2.2. In case of multiple backbones, they should be ordered by the decreasing number of amino acid monomers, or if equal by decreasing number of all monomers, or if equal alphabetically.

2.2.1. Every chain with n≤5 monomers in length is considered to be n monomers long regardless of the amino-acid content.

  3. Monomers in one backbone with default bonds (R1 of one monomer and R2 of another) are written with a hyphen separating them.

  4. Monomers are identified by unique codes:

  • If a code contains a hyphen it should be written with square brackets around it;
  • If a code doesn't contain a hyphen it shouldn't be written with square brackets around it.
  5. Non-backbone bonds are written explicitly in round brackets to the right of monomers that participate in the bond with two pieces of information: a bond identifier and the R-group number:
  • Bond identifier and the R-group number are separated with a comma, and without spaces.
  • A bond identifier is a unique positive integer, and it is 1 for the first explicitly indicated bond in the string, 2 for the second, ...
  • The R-group number indicates the number of the R-group that monomer uses to participate in that bond.

5.1. In case of one monomer participating in multiple non-backbone bonds, the brackets should be put next to each other.
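Requirements 2 to 5.1 can be read as a small serialization routine. The sketch below is an illustration under assumptions: the function names and the `(code, bonds)` data layout are hypothetical, and the bond identifiers are taken as already assigned.

```python
from typing import List, Sequence, Tuple

Bond = Tuple[int, int]                   # (bond identifier, R-group number)
MonomerEntry = Tuple[str, Sequence[Bond]]  # (BILN code, non-backbone bonds)


def format_monomer(code: str, bonds: Sequence[Bond] = ()) -> str:
    """Write one monomer: brackets only for hyphenated codes, then its bonds."""
    token = f"[{code}]" if "-" in code else code
    # Multiple non-backbone bonds are written as adjacent bracket groups (5.1).
    return token + "".join(f"({bond_id},{r_group})" for bond_id, r_group in bonds)


def format_biln(backbones: Sequence[Sequence[MonomerEntry]]) -> str:
    """Join monomers with hyphens and backbones with full stops."""
    return ".".join(
        "-".join(format_monomer(code, bonds) for code, bonds in backbone)
        for backbone in backbones
    )


example: List[List[MonomerEntry]] = [
    [("A", []), ("Test-6-Ch", [(1, 3), (2, 4)]), ("C", [])],
    [("D", [(1, 1)])],
    [("E", [(2, 2)])],
]
print(format_biln(example))  # A-[Test-6-Ch](1,3)(2,4)-C.D(1,1).E(2,2)
```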

Examples

Requirement 2: $\color{Pink}{\textbf{Different backbones are separated from each other with a full stop.}}$

BILN string: A-C-D.E-F-G
Ketcher: Image

BILN string: A-C(1,3)-A.C(1,3)
Ketcher: Image

Requirement 2.1: $\color{Pink}{\textbf{In case of a circular backbone (where all connections are R1-R2), the BILN string that would be the first alphabetically is chosen.}}$

BILN string: A(1,1)-C-D-E-F(1,2)
Ketcher: Image

BILN string: A(1,2)-A-Tza-C(2,3)-Tza(1,1).Abu(3,1)-dE(2,3)-Bip-alle(3,2)
Ketcher: Image
A-A-Tza-C-Tza is the first alphabetically (two 'a's), and Abu-dE-Bip-alle is the first alphabetically ('ab').
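Choosing the starting monomer of a circular backbone amounts to picking the smallest rotation of the monomer sequence. The sketch below assumes that "alphabetically" means a case-insensitive, code-by-code comparison (which agrees with the 'ab' remark above); that interpretation and the function name are assumptions, not part of the ticket. Only rotations are considered, because the R1-R2 direction of the ring is fixed.

```python
from typing import List, Sequence


def first_alphabetical_rotation(codes: Sequence[str]) -> List[str]:
    """Return the rotation of a circular backbone that sorts first."""
    if not codes:
        return []
    rotations = [list(codes[i:]) + list(codes[:i]) for i in range(len(codes))]
    # Assumption: compare monomer codes one by one, ignoring case.
    return min(rotations, key=lambda rotation: [c.casefold() for c in rotation])


print("-".join(first_alphabetical_rotation(["Tza", "A", "A", "Tza", "C"])))
# A-A-Tza-C-Tza
print("-".join(first_alphabetical_rotation(["dE", "Bip", "alle", "Abu"])))
# Abu-dE-Bip-alle
```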

Requirement 2.2: $\color{Pink}{\textbf{In case of multiple backbones, they should be ordered by the decreasing number of amino}}$
$\color{Pink}{\textbf{acid monomers, or if equal by decreasing number of all monomers, or if equal alphabetically.}}$

BILN string: A-A-A-A-A.C-C-C-C.D-D-D.E-E.F
Ketcher: Image
Chains contain different numbers of amino-acid monomers.

BILN string: A-A-A-A-A-A-A-A-A-A.A-A-A-A-A-A-A-A-A6OH-A6OH-A6OH-A6OH
Ketcher: Image
The second chain is longer, but contains fewer amino-acid monomers.

BILN string: C-C-C-C-C-A6OH.C-C-A6OH-A6OH-C-C
Ketcher: Image
Both chains are of the same length, but the first has more amino-acid monomers, so it is placed first.

BILN string: A-C-D-E-F-A6OH.C-D-E-F-G-A6OH
Ketcher: Image
Both chains contain the same number of amino-acid monomers, and of monomers overall, so they are ordered alphabetically.

Requirement 2.2.1: $\color{Pink}{\textbf{Every chain with n≤5 monomers in length is considered to be n monomers long regardless of the amino-acid content.}}$

BILN string: A-A-A6OH-A6OH-A6OH.C-C-C-C-C
Ketcher: Image
Even though the first chain contains fewer amino acids than the second, they should be ordered alphabetically because they are of the same length and are short (≤5 monomers).
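The ordering in requirements 2.2 and 2.2.1 can be expressed as a sort key. This is a sketch under stated assumptions: it reads 2.2.1 as "for chains of at most 5 monomers, use the chain length in place of the amino-acid count", it compares codes case-insensitively, and it treats A6OH as a CHEM (as the examples above imply). All names in the code are hypothetical.

```python
from typing import List, Sequence, Tuple

# (BILN code, monomer type) pairs; the type is "PEPTIDE" or "CHEM".
Chain = Sequence[Tuple[str, str]]


def backbone_sort_key(chain: Chain):
    """Sort key: amino-acid count, then length (both descending), then codes."""
    n_monomers = len(chain)
    n_amino = sum(1 for _, mtype in chain if mtype == "PEPTIDE")
    # Assumed reading of 2.2.1: short chains ignore their amino-acid content.
    effective = n_monomers if n_monomers <= 5 else n_amino
    return (-effective, -n_monomers, [code.casefold() for code, _ in chain])


def order_backbones(chains: Sequence[Chain]) -> List[Chain]:
    return sorted(chains, key=backbone_sort_key)


def make_chain(text: str, chem_codes=frozenset({"A6OH"})) -> List[Tuple[str, str]]:
    """Helper for the examples; only handles codes without hyphens."""
    return [(c, "CHEM" if c in chem_codes else "PEPTIDE") for c in text.split("-")]


def to_biln(chains: Sequence[Chain]) -> str:
    return ".".join("-".join(code for code, _ in chain) for chain in chains)


unordered = [make_chain("C-C-C-C-C"), make_chain("A-A-A6OH-A6OH-A6OH")]
print(to_biln(order_backbones(unordered)))  # A-A-A6OH-A6OH-A6OH.C-C-C-C-C
```

Under this reading, a 5-monomer chain with no amino acids would sort ahead of a 6-monomer chain with 4 amino acids; whether that is intended is worth confirming.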

Requirement 3: $\color{Pink}{\textbf{Monomers in one backbone with default bonds (R1 of one monomer and R2 of another) are written with a}}$
$\color{Pink}{\textbf{hyphen separating them.}}$

BILN string: A-C-D-E
Ketcher: Image

Requirement 4: $\color{Pink}{\textbf{Monomers are identified by unique codes: If a code contains a hyphen it should be written with}}$
$\color{Pink}{\textbf{square brackets around it; If a code doesn't contain a hyphen it shouldn't be written with square brackets around it.}}$

BILN string: [D-Cha]-C-[D-Abu]-dC-[D-2Thi]
Ketcher: Image

Requirement 5: $\color{Pink}{\textbf{Non-backbone bonds are written explicitly in round brackets to the right of}}$
$\color{Pink}{\textbf{monomers that participate in the bond with two pieces of information: a bond identifier and the}}$
$\color{Pink}{\textbf{R-group number: Bond identifier and the R-group number are separated with a comma, and without}}$
$\color{Pink}{\textbf{spaces; A bond identifier is a unique positive integer, and it is 1 for the first explicitly}}$
$\color{Pink}{\textbf{indicated bond in the string, 2 for the second, ...; The R-group number indicates the number of the}}$
$\color{Pink}{\textbf{R-group that monomer uses to participate in that bond.}}$

BILN string: [D-hPhe](1,1)-K-R(2,3)-L-M(1,2).Cya-C(2,3)-Phe_4I
Ketcher: Image

BILN string: gGlu(1,3)-Gla(2,3)-meE(3,3)-H(4,3)-dH-Hhs(5,3)-K(6,3)-Aad(7,3)-[D-Orn](8,3)-dK(9,3).dC(1,3)-Hcy(2,3)-meC(3,3)-D(4,3)-dD(5,4)-meD(6,3)-E(7,3)-[D-gGlu](8,3)-dE(9,3)
Ketcher: Image

Requirement 5.1: $\color{Pink}{\textbf{In case of one monomer participating in multiple non-backbone bonds, two brackets should be put next to each other.}}$

BILN string: A-[Test-6-Ch](1,3)(2,4)-C.D(1,1).E(2,2)
Ketcher: Image
Note:
This string: A-[Test-6-Ch](1,4)(2,3)-C.D(2,1).E(1,2) wouldn't be the preferred version because D comes before E in the string, so the bond between D and the CHEM should have the lower identifier.


Import of BILN

  1. If a string cannot be interpreted as a valid BILN string, Indigo should return an error: "The string cannot be interpreted as a valid BILN string."

  2. All the strings written using the best practices (export section) should be able to be loaded.

$\color{Red}{\textbf{Valid strings not constructed using best practices}}$

  3. Circular structures whose BILN string is not alphabetically optimal should be able to be loaded (a BILN string that would be first alphabetically exists).

  4. Strings that do not follow the order of backbone arrangement (largest number of amino acids first, then largest number of monomers, then alphabetically) should be able to be loaded.

  5. Strings with explicitly written backbone connections should be able to be loaded.

  6. Strings containing monomer codes without hyphens in square brackets should be able to be loaded.

  7. Strings that contain incorrectly ordered bonds should be able to be loaded.

Examples

Requirement 3: $\color{Pink}{\textbf{Circular structures whose BILN string is not alphabetically optimal should be able to be loaded (a BILN string}}$
$\color{Pink}{\textbf{that would be first alphabetically exists). }}$

Best practices BILN string: A(1,1)-C-D-E(1,2)
Ketcher: Image
Also valid: C(1,1)-D-E-A(1,2)
Also valid: D(1,2)-C-A-E(1,1)
Also valid: E(1,2)-D-C-A(1,1)

Requirement 4: $\color{Pink}{\textbf{Strings that do not follow the order of backbone arrangement (largest amount of amino-acids first,}}$
$\color{Pink}{\textbf{then largest amount of monomers, then alphabetically) should be able to be loaded.}}$

Best practices BILN string: L-E-R(1,3)-S-T.A(2,1)-C(1,3)-F-G(2,2)
Ketcher: Image
Also valid: A(1,1)-C(2,3)-F-G(1,2).L-E-R(2,3)-S-T

Best practices BILN string: A-A-A-A-A-A.[PEG-2]-C-C-C-C-[PEG-2].C-C-C-C
Ketcher: Image
Also valid: A-A-A-A-A-A.C-C-C-C.[PEG-2]-C-C-C-C-[PEG-2]
Also valid: [PEG-2]-C-C-C-C-[PEG-2].A-A-A-A-A-A.C-C-C-C
Also valid: [PEG-2]-C-C-C-C-[PEG-2].C-C-C-C.A-A-A-A-A-A
Also valid: C-C-C-C.A-A-A-A-A-A.[PEG-2]-C-C-C-C-[PEG-2]
Also valid: C-C-C-C.[PEG-2]-C-C-C-C-[PEG-2].A-A-A-A-A-A

Requirement 5: $\color{Pink}{\textbf{Strings with explicitly written backbone connections should be able to be loaded.}}$

Best practices BILN string: [D-Cit]-aThr-meS
Ketcher: Image
Also valid: [D-Cit](1,2).aThr(1,1)(2,2).meS(2,1)
$\color{Red}{\textbf{Not}}$ valid: [D-Cit](1,2)-aThr(1,1)(2,2)-meS(2,1)

Requirement 6: $\color{Pink}{\textbf{Strings containing monomer codes without hyphens in square brackets should be able to be loaded.}}$

Best practices BILN string: [D-2Thi]-D-[D-gGlu]-meF-G-[Lys-al]
Ketcher: Image
Also valid: [D-2Thi]-[D]-[D-gGlu]-[meF]-[G]-[Lys-al]
Also valid: [D-2Thi]-D-[D-gGlu]-[meF]-G-[Lys-al]
$\color{Red}{\textbf{Not}}$ valid: D-2Thi-D-D-gGlu-meF-G-Lys-al, because 2Thi, Lys, and al are not valid monomer codes.
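A tolerant tokenizer makes the bracket rule on import concrete: brackets are accepted around any code, but a hyphenated code is only recognized when bracketed. The sketch below is an assumption-laden illustration, not Indigo's parser; the regular expressions, function names, and the optional `known_codes` set (standing in for a monomer library lookup) are hypothetical. Only the error text comes from this ticket.

```python
import re
from typing import List, Optional, Set, Tuple

BILN_IMPORT_ERROR = "The string cannot be interpreted as a valid BILN string."

# A monomer is "[code]" or a bare code, followed by zero or more "(id,r)" groups.
_MONOMER = re.compile(r"(?:\[([^\[\]]+)\]|([^\-.()\[\],\s]+))((?:\(\d+,\d+\))*)")
_BOND = re.compile(r"\((\d+),(\d+)\)")

Monomer = Tuple[str, List[Tuple[int, int]]]


def parse_backbone(text: str, known_codes: Optional[Set[str]] = None) -> List[Monomer]:
    """Split one backbone into (code, [(bond id, R-group), ...]) entries."""
    monomers: List[Monomer] = []
    pos = 0
    while True:
        match = _MONOMER.match(text, pos)
        if match is None:
            raise ValueError(BILN_IMPORT_ERROR)
        code = match.group(1) or match.group(2)
        if known_codes is not None and code not in known_codes:
            raise ValueError(BILN_IMPORT_ERROR)
        bonds = [(int(b), int(r)) for b, r in _BOND.findall(match.group(3))]
        monomers.append((code, bonds))
        pos = match.end()
        if pos == len(text):
            return monomers
        if text[pos] != "-":  # only a hyphen may separate monomers in a backbone
            raise ValueError(BILN_IMPORT_ERROR)
        pos += 1


def parse_biln(text: str, known_codes: Optional[Set[str]] = None) -> List[List[Monomer]]:
    """Split a BILN string into backbones at full stops, then parse each one."""
    return [parse_backbone(part, known_codes) for part in text.split(".")]
```

With this approach, `[D-2Thi]-[D]-[D-gGlu]-[meF]-[G]-[Lys-al]` and `[D-2Thi]-D-[D-gGlu]-meF-G-[Lys-al]` parse to the same monomer list, while the unbracketed `D-2Thi-D-D-gGlu-meF-G-Lys-al` is rejected once `known_codes` is supplied, because `2Thi` is looked up as a code of its own. The sketch does not check R-group availability or bond consistency.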

Requirement 7: $\color{Pink}{\textbf{Strings that contain incorrectly ordered bonds should be able to be loaded.}}$

Best practices BILN string: A-C(1,3)-D(2,3)-E.F-G-H(1,3)-I-K(2,3)
Ketcher: Image
Also valid: A-C(7563,3)-D(3,3)-E.F-G-H(7563,3)-I-K(3,3)
$\color{Red}{\textbf{Not}}$ valid: A-C(-1,3)-D(2,3)-E.F-G-H(-1,3)-I-K(2,3) because the bond identifier must be positive.
$\color{Red}{\textbf{Not}}$ valid: A-C(1.25,3)-D(2,3)-E.F-G-H(1.25,3)-I-K(2,3) because the bond identifier must be a whole number.
$\color{Red}{\textbf{Not}}$ valid: A-C(1,3)-D(1,3)-E.F-G-H(1,3)-I-K(2,3) because bond identifier 1 appears three times and bond identifier 2 only once; every bond identifier must appear exactly two times.
$\color{Red}{\textbf{Not}}$ valid: A-C(1,4)-D(2,3)-E.F-G-H(1,3)-I-K(2,3) because monomer C doesn't have R4.
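The bond identifier rules illustrated above (positive whole numbers, each used exactly twice, in any order) can be checked without knowing the monomers. The sketch below is a hypothetical helper, not Indigo code; it assumes monomer codes never contain round brackets, and it does not cover the R-group availability check from the last example, which needs the monomer library.

```python
import re
from collections import Counter

BILN_IMPORT_ERROR = "The string cannot be interpreted as a valid BILN string."


def check_bond_identifiers(biln: str) -> None:
    """Raise ValueError unless every bond annotation is well formed and paired."""
    bond_ids = []
    for body in re.findall(r"\(([^()]*)\)", biln):
        match = re.fullmatch(r"(\d+),(\d+)", body)  # digits only: no sign, no decimals
        if match is None:
            raise ValueError(BILN_IMPORT_ERROR)
        bond_id, r_group = int(match.group(1)), int(match.group(2))
        if bond_id < 1 or r_group < 1:
            raise ValueError(BILN_IMPORT_ERROR)
        bond_ids.append(bond_id)
    # A bond joins exactly two monomers, so each identifier must occur twice.
    if any(count != 2 for count in Counter(bond_ids).values()):
        raise ValueError(BILN_IMPORT_ERROR)


# Identifiers need not start at 1 or be consecutive on import.
check_bond_identifiers("A-C(42,3)-D(7,3)-E.F-G-H(42,3)-I-K(7,3)")
```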
