Support for BILN #3541
Description
Background
BILN is a file format similar to HELM. The major difference between it and HELM is the set of monomer types it supports: only amino acids (type: PEPTIDE) and CHEMs (type: CHEM). The advantage of BILN over HELM is its human readability.
As with HELM and SMILES, one BILN string maps onto exactly one chemical structure, but one structure can be represented by multiple strings. To make the exported BILN string unique, a set of best practices is defined (described in the export section of this ticket). Strings that were not constructed using those best practices should still load (described in the import section of this ticket).
Requirements
Export of BILN
1. A structure (or several structures) can be exported to BILN only if every monomer in it is of type PEPTIDE and/or CHEM (*) **and** has a BILN code. If that is not true and the user tries to save to BILN, Indigo should return an error: "Only amino acids and CHEMs with BILN codes can get exported to BILN."
(*) BILN only supports monomers explicitly defined as CHEMs. The HELM functionality where unknown small molecules are exported as CHEMs with SMILES does not exist in BILN.
2. Different backbones are separated from each other with a full stop. As in HELM, a backbone starts with the first monomer that has an occupied R2 but either has no R1 or has an unoccupied R1.
2.1. In case of a circular backbone (where all connections are R1-R2), the BILN string that would be the first alphabetically is chosen.
2.2. In case of multiple backbones, they should be ordered by the decreasing number of amino acid monomers, or if equal by decreasing number of all monomers, or if equal alphabetically.
2.2.1. For every chain of n ≤ 5 monomers, the amino-acid count is ignored: the chain is ranked as n monomers long regardless of its amino-acid content.
3. Monomers in one backbone with default bonds (R1 of one monomer and R2 of another) are written with a hyphen separating them.
4. Monomers are identified by unique codes:
   - If a code contains a hyphen it should be written with square brackets around it;
   - If a code doesn't contain a hyphen it shouldn't be written with square brackets around it.
5. Non-backbone bonds are written explicitly in round brackets to the right of monomers that participate in the bond with two pieces of information: a bond identifier and the R-group number:
   - Bond identifier and the R-group number are separated with a comma, and without spaces.
   - A bond identifier is a unique positive integer, and it is 1 for the first explicitly indicated bond in the string, 2 for the second, ...
   - The R-group number indicates the number of the R-group that monomer uses to participate in that bond.
5.1. In case of one monomer participating in multiple non-backbone bonds, the brackets should be put next to each other.
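The writing rules above (hyphens within a backbone, full stops between backbones, square brackets only for codes that contain a hyphen, and adjacent round brackets for non-backbone bonds) can be sketched as follows. This is a minimal illustration, not Indigo's implementation: the function names and the `(code, bonds)` input shape are assumptions made for this example, and the sketch does not pick chain order or bond identifiers.

```python
def format_monomer(code, bonds=()):
    """Return one BILN monomer token.

    code  -- monomer code, e.g. "A" or "D-Cha"
    bonds -- iterable of (bond_id, r_group) pairs for non-backbone bonds
    (hypothetical input shape, chosen for this sketch)
    """
    # Square brackets are used only when the code contains a hyphen.
    token = f"[{code}]" if "-" in code else code
    # Several non-backbone bonds are written as adjacent bracket pairs.
    for bond_id, r_group in bonds:
        token += f"({bond_id},{r_group})"
    return token


def format_biln(chains):
    """Join chains of (code, bonds) pairs: '-' inside a chain, '.' between chains."""
    return ".".join(
        "-".join(format_monomer(code, bonds) for code, bonds in chain)
        for chain in chains
    )


chains = [
    [("A", []), ("Test-6-Ch", [(1, 3), (2, 4)]), ("C", [])],
    [("D", [(1, 1)])],
    [("E", [(2, 2)])],
]
print(format_biln(chains))  # A-[Test-6-Ch](1,3)(2,4)-C.D(1,1).E(2,2)
```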
Examples
Requirement 2:
BILN string: A-C-D.E-F-G
BILN string: A-C(1,3)-A.C(1,3)
Requirement 2.1:
BILN string: A(1,1)-C-D-E-F(1,2)
BILN string: A(1,2)-A-Tza-C(2,3)-Tza(1,1).Abu(3,1)-dE(2,3)-Bip-alle(3,2)
Among the possible starting points of each ring, A-A-Tza-C-Tza comes first alphabetically (it starts with two 'A's), and Abu-dE-Bip-alle comes first alphabetically (it starts with 'Ab').
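Requirement 2.1 can be sketched as choosing the smallest rotation of the ring. The ticket does not say exactly how "alphabetically" is evaluated, so this sketch assumes a case-insensitive comparison of the monomer code sequence; `canonical_rotation` is a hypothetical name, and the ring-closing bond annotation would still have to be attached to the new first and last monomers.

```python
def canonical_rotation(codes):
    """Return the rotation of a non-empty circular backbone that sorts first.

    Assumption: 'alphabetically' means case-insensitive comparison of the
    sequence of monomer codes (bond annotations are not considered here).
    """
    rotations = [codes[i:] + codes[:i] for i in range(len(codes))]
    return min(rotations, key=lambda rotation: [c.lower() for c in rotation])


print("-".join(canonical_rotation(["Tza", "C", "Tza", "A", "A"])))
# A-A-Tza-C-Tza
print("-".join(canonical_rotation(["Bip", "alle", "Abu", "dE"])))
# Abu-dE-Bip-alle
```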
Requirement 2.2:
BILN string: A-A-A-A-A.C-C-C-C.D-D-D.E-E.F
Chains contain different amounts of amino-acid monomers.
BILN string: A-A-A-A-A-A-A-A-A-A.A-A-A-A-A-A-A-A-A6OH-A6OH-A6OH-A6OH
The second chain is longer, but contains fewer amino-acid monomers.
BILN string: C-C-C-C-C-A6OH.C-C-A6OH-A6OH-C-C
Both chains are of the same length, but the first has more amino-acid monomers, so it is placed first.
BILN string: A-C-D-E-F-A6OH.C-D-E-F-G-A6OH
Both chains contain the same number of amino-acid monomers and the same number of monomers overall, so they are ordered alphabetically.
Requirement 2.2.1:
BILN string: A-A-A6OH-A6OH-A6OH.C-C-C-C-C
Even though the first chain contains fewer amino acids than the second, the two are ordered alphabetically because they are of the same length and are short (≤5 monomers).
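The ordering in requirements 2.2 and 2.2.1 can be expressed as a sort key. The sketch below is an illustration under assumptions: chains are lists of `(code, type)` pairs, A6OH is treated as a CHEM (as the examples imply), alphabetical order is case-insensitive, and `make_chain` is a hypothetical helper that only handles codes without hyphens.

```python
SHORT_CHAIN_MAX = 5  # requirement 2.2.1: chains this short ignore amino-acid content


def backbone_sort_key(chain):
    """Sort key for a chain given as a list of (code, monomer_type) pairs."""
    n = len(chain)
    amino_acids = sum(1 for _, kind in chain if kind == "PEPTIDE")
    if n <= SHORT_CHAIN_MAX:
        amino_acids = n  # short chains are ranked by length alone
    # Negative counts give decreasing order; codes break remaining ties.
    return (-amino_acids, -n, [code.lower() for code, _ in chain])


def order_backbones(chains):
    return sorted(chains, key=backbone_sort_key)


def make_chain(text, chem_codes=frozenset({"A6OH"})):
    """Hypothetical helper: split a hyphen-free-code chain and tag monomer types."""
    return [(c, "CHEM" if c in chem_codes else "PEPTIDE") for c in text.split("-")]


def to_text(chains):
    return ".".join("-".join(code for code, _ in chain) for chain in chains)


chains = [make_chain("C-C-A6OH-A6OH-C-C"), make_chain("C-C-C-C-C-A6OH")]
print(to_text(order_backbones(chains)))  # C-C-C-C-C-A6OH.C-C-A6OH-A6OH-C-C
```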
Requirement 3:
Requirement 4:
BILN string: [D-Cha]-C-[D-Abu]-dC-[D-2Thi]
Requirement 5:
BILN string: [D-hPhe](1,1)-K-R(2,3)-L-M(1,2).Cya-C(2,3)-Phe_4I
BILN string: gGlu(1,3)-Gla(2,3)-meE(3,3)-H(4,3)-dH-Hhs(5,3)-K(6,3)-Aad(7,3)-[D-Orn](8,3)-dK(9,3).dC(1,3)-Hcy(2,3)-meC(3,3)-D(4,3)-dD(5,4)-meD(6,3)-E(7,3)-[D-gGlu](8,3)-dE(9,3)
Requirement 5.1:
BILN string: A-[Test-6-Ch](1,3)(2,4)-C.D(1,1).E(2,2)
Note:
This string: A-[Test-6-Ch](1,4)(2,3)-C.D(2,1).E(1,2) wouldn't be the preferred version because D comes before E in the string, so the bond between D and the CHEM should have the lower identifier.
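One way to read the note above, together with the identifier rule in requirement 5, is that bonds are numbered by the position of their first endpoint in the string, with ties broken by the position of the second endpoint. The sketch below renumbers bonds under that interpretation. It is an assumption, not a rule stated in the ticket, and `renumber_bonds` and its `(code, bonds)` input shape are hypothetical.

```python
def renumber_bonds(chains):
    """Renumber non-backbone bonds in a list of chains.

    Each chain is a list of (code, bonds) pairs, where bonds is a list of
    (bond_id, r_group) pairs. Assumed rule: bonds are ordered by the string
    positions of their two endpoints and numbered 1, 2, ... in that order.
    """
    endpoints = {}  # old bond id -> positions of the monomers that use it
    position = 0
    for chain in chains:
        for _, bonds in chain:
            for bond_id, _ in bonds:
                endpoints.setdefault(bond_id, []).append(position)
            position += 1
    ordered_ids = sorted(endpoints, key=lambda b: sorted(endpoints[b]))
    new_id = {old: i for i, old in enumerate(ordered_ids, start=1)}
    return [
        [(code, sorted((new_id[b], r) for b, r in bonds)) for code, bonds in chain]
        for chain in chains
    ]


non_preferred = [
    [("A", []), ("Test-6-Ch", [(1, 4), (2, 3)]), ("C", [])],
    [("D", [(2, 1)])],
    [("E", [(1, 2)])],
]
print(renumber_bonds(non_preferred)[0][1])  # ('Test-6-Ch', [(1, 3), (2, 4)])
```

Applied to the non-preferred string from the note, this gives the bond to D identifier 1 and the bond to E identifier 2, which matches the preferred string in the requirement 5.1 example.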
Import of BILN
1. If a string cannot be interpreted as a valid BILN string, Indigo should return an error: "The string cannot be interpreted as a valid BILN string."
2. All strings written using the best practices (export section) should load.
3. Circular structures whose BILN string is not alphabetically optimal (a BILN string that would come first alphabetically exists) should load.
4. Strings that do not follow the backbone ordering (largest number of amino acids first, then largest number of monomers, then alphabetically) should load.
5. Strings with explicitly written backbone connections should load.
6. Strings in which monomer codes without hyphens are written in square brackets should load.
7. Strings that contain incorrectly ordered bonds should load.
Examples
Requirement 3:
Best practices BILN string: A(1,1)-C-D-E(1,2)
Also valid: C(1,1)-D-E-A(1,2)
Also valid: D(1,2)-C-A-E(1,1)
Also valid: E(1,2)-D-C-A(1,1)
Requirement 4:
Best practices BILN string: L-E-R(1,3)-S-T.A(2,1)-C(1,3)-F-G(2,2)
Also valid: A(1,1)-C(2,3)-F-G(1,2).L-E-R(2,3)-S-T
Best practices BILN string: A-A-A-A-A-A.[PEG-2]-C-C-C-C-[PEG-2].C-C-C-C
Also valid: A-A-A-A-A-A.C-C-C-C.[PEG-2]-C-C-C-C-[PEG-2]
Also valid: [PEG-2]-C-C-C-C-[PEG-2].A-A-A-A-A-A.C-C-C-C
Also valid: [PEG-2]-C-C-C-C-[PEG-2].C-C-C-C.A-A-A-A-A-A
Also valid: C-C-C-C.A-A-A-A-A-A.[PEG-2]-C-C-C-C-[PEG-2]
Also valid: C-C-C-C.[PEG-2]-C-C-C-C-[PEG-2].A-A-A-A-A-A
Requirement 5:
Best practices BILN string: [D-Cit]-aThr-meS
Also valid: [D-Cit](1,2).aThr(1,1)(2,2).meS(2,1)
Not valid: [D-Cit](1,2)-aThr(1,1)(2,2)-meS(2,1)
Requirement 6:
Best practices BILN string: [D-2Thi]-D-[D-gGlu]-meF-G-[Lys-al]
Also valid: [D-2Thi]-[D]-[D-gGlu]-[meF]-[G]-[Lys-al]
Also valid: [D-2Thi]-D-[D-gGlu]-[meF]-G-[Lys-al]
Not valid: D-2Thi-D-D-gGlu-meF-G-Lys-al, because 2Thi, Lys, and al are not valid monomer codes.
Requirement 7:
Best practices BILN string: A-C(1,3)-D(2,3)-E.F-G-H(1,3)-I-K(2,3)
Also valid: A-C(7563,3)-D(3,3)-E.F-G-H(7563,3)-I-K(3,3)
Not valid: A-C(-1,3)-D(2,3)-E.F-G-H(-1,3)-I-K(2,3), because the bond identifier must be positive.
Not valid: A-C(1.25,3)-D(2,3)-E.F-G-H(1.25,3)-I-K(2,3), because the bond identifier must be a whole number.
Not valid: A-C(1,3)-D(1,3)-E.F-G-H(1,3)-I-K(2,3), because every bond identifier must appear exactly twice (here 1 appears three times and 2 only once).
Not valid: A-C(1,4)-D(2,3)-E.F-G-H(1,3)-I-K(2,3), because monomer C doesn't have R4.
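The syntactic part of the import checks above can be sketched as a small parser. This is a minimal illustration under assumptions, not Indigo's implementation: `parse_biln` and its return shape are hypothetical, errors are raised as `ValueError` carrying the message from import requirement 1, and checks that need a monomer library (unknown codes such as 2Thi, or a monomer lacking R4) are deliberately left out.

```python
import re
from collections import Counter

ERROR = "The string cannot be interpreted as a valid BILN string."
# A monomer token: a bracketed code or a bare code, then zero or more "(...)" groups.
MONOMER = re.compile(r"(\[[^\[\]]+\]|[^\-.()\[\]]+)((?:\([^()]*\))*)")
POSITIVE_INT = re.compile(r"[1-9][0-9]*")


def parse_biln(text):
    """Parse a BILN string into chains of (code, [(bond_id, r_group), ...])."""
    chains, chain, pos = [], [], 0
    bond_counts = Counter()
    while True:
        match = MONOMER.match(text, pos)
        if match is None:
            raise ValueError(ERROR)
        code = match.group(1).strip("[]")  # brackets are optional on import
        bonds = []
        for annotation in re.findall(r"\(([^()]*)\)", match.group(2)):
            parts = annotation.split(",")
            # Exactly two positive whole numbers: bond identifier and R-group.
            if len(parts) != 2 or not all(POSITIVE_INT.fullmatch(p) for p in parts):
                raise ValueError(ERROR)
            bond_id, r_group = int(parts[0]), int(parts[1])
            bond_counts[bond_id] += 1
            bonds.append((bond_id, r_group))
        chain.append((code, bonds))
        pos = match.end()
        if pos == len(text):
            break
        if text[pos] == ".":  # a full stop starts a new backbone
            chains.append(chain)
            chain = []
        elif text[pos] != "-":
            raise ValueError(ERROR)
        pos += 1
    chains.append(chain)
    # Every bond identifier must name exactly two endpoints.
    if any(count != 2 for count in bond_counts.values()):
        raise ValueError(ERROR)
    return chains


print(parse_biln("A-C(1,3).F-H(1,3)"))
# [[('A', []), ('C', [(1, 3)])], [('F', []), ('H', [(1, 3)])]]
```

Bond identifiers are accepted in any order and with any positive values, so strings such as those in the "Also valid" lines can be parsed first and normalised afterwards.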
