Support for BILN #3541

## Background

BILN is a file format similar to HELM. The major difference from HELM is the set of monomer types it supports: only amino acids (type: PEPTIDE) and CHEMs (type: CHEM). The advantage of BILN over HELM is its human-readability.

As with HELM and SMILES, one BILN string maps to exactly one chemical structure, but one structure can be represented by multiple strings. To ensure the uniqueness of an exported BILN string, a set of best practices is defined (described in the export section of this ticket), but strings not constructed according to those best practices should also be loadable (described in the import section of this ticket).

[Ketcher ticket](https://github.com/epam/ketcher/issues/9456).

## Requirements

### Export of BILN

1. A structure (or several structures) can be exported to BILN only if all of its monomers are of type PEPTIDE and/or CHEM* $\color{Red}{\textbf{and}}$ have BILN codes. If that is not the case and the user tries to save to BILN, Indigo should return an error: "Only amino acids and CHEMs with BILN codes can get exported to BILN."

> \* BILN only supports monomers explicitly defined as CHEMs. The HELM functionality where unknown small molecules are exported as CHEMs with SMILES does not exist in BILN.
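
The eligibility rule in requirement 1 can be sketched as a simple pre-export check. This is only an illustration: the `Monomer` record, its field names, and `check_biln_exportable` are hypothetical stand-ins, not Indigo's actual data model or API; only the error text is taken from this ticket.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Error text quoted from requirement 1.
EXPORT_ERROR = "Only amino acids and CHEMs with BILN codes can get exported to BILN."
ALLOWED_TYPES = {"PEPTIDE", "CHEM"}


@dataclass
class Monomer:
    """Hypothetical monomer record (not Indigo's real class)."""
    monomer_type: str
    biln_code: Optional[str]


def check_biln_exportable(monomers: Iterable[Monomer]) -> None:
    """Raise ValueError unless every monomer is PEPTIDE/CHEM and has a BILN code."""
    for monomer in monomers:
        if monomer.monomer_type not in ALLOWED_TYPES or not monomer.biln_code:
            raise ValueError(EXPORT_ERROR)


# A peptide plus a CHEM, both with BILN codes, passes silently.
check_biln_exportable([Monomer("PEPTIDE", "A"), Monomer("CHEM", "A6OH")])
```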

$\color{Red}{\textbf{Logic for constructing the BILN string}}$

2. Different backbones are separated from each other with a full stop.

> As in HELM, a backbone starts with the first monomer that has an occupied R2 but either has no R1 or has an unoccupied R1.

2.1. In case of a circular backbone (where all connections are R1-R2), the BILN string that would be the first alphabetically is chosen.

2.2. In case of multiple backbones, they should be ordered by decreasing number of amino acid monomers, or if equal, by decreasing number of all monomers, or if equal, alphabetically.

2.2.1. Every chain with n≤5 monomers in length is considered to be n monomers long regardless of the amino-acid content; that is, for such short chains the amino-acid count is not used as an ordering criterion.

3. Monomers in one backbone with default bonds (R1 of one monomer and R2 of another) are written with a hyphen separating them.

4. Monomers are identified by unique codes:
- If a code contains a hyphen it should be written with square brackets around it;
- If a code doesn't contain a hyphen it shouldn't be written with square brackets around it.

5. Non-backbone bonds are written explicitly in round brackets to the right of monomers that participate in the bond with two pieces of information: a bond identifier and the R-group number:
- Bond identifier and the R-group number are separated with a comma, and without spaces.
- A bond identifier is a unique positive integer, and it is 1 for the first explicitly indicated bond in the string, 2 for the second, ...
- The R-group number indicates the number of the R-group that monomer uses to participate in that bond.

5.1. In case of one monomer participating in multiple non-backbone bonds, the brackets should be put next to each other.

<details>

<summary> Expand for examples </summary> 

Requirement 2: $\color{Pink}{\textbf{Different backbones are separated from each other with a full stop.}}$

BILN string: `A-C-D.E-F-G`
Ketcher: <img width="258" height="152" alt="Image" src="https://github.com/user-attachments/assets/c41b63ca-3fe8-4c79-b92e-c9117bcb6944" />

BILN string: `A-C(1,3)-A.C(1,3)`
Ketcher: <img width="263" height="153" alt="Image" src="https://github.com/user-attachments/assets/6ead2163-20d8-4bd6-a0b4-03bfefe81e8f" />

Requirement 2.1: $\color{Pink}{\textbf{In case of a circular backbone (where all connections are R1-R2), the BILN string that would be the first alphabetically is chosen.}}$

BILN string: `A(1,1)-C-D-E-F(1,2)`
Ketcher: <img width="216" height="199" alt="Image" src="https://github.com/user-attachments/assets/7d1549a2-e7ad-41fb-8ddf-820d28d3e6d9" />

BILN string: `A(1,2)-A-Tza-C(2,3)-Tza(1,1).Abu(3,1)-dE(2,3)-Bip-alle(3,2)`
Ketcher: <img width="362" height="346" alt="Image" src="https://github.com/user-attachments/assets/9adc22aa-fd0f-4c75-b46d-b0734bf9f0e3" />
`A-A-Tza-C-Tza` is the first alphabetically among its rotations (it starts with two `A`s), and `Abu-dE-Bip-alle` is the first alphabetically among its rotations (it starts with `Ab`).
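
One way to pick the starting monomer of a circular backbone is to generate every rotation and keep the one that sorts first. The sketch below is a minimal illustration under stated assumptions: the function name `canonical_rotation` is hypothetical, the comparison is done on the hyphen-joined codes without brackets or bond annotations, and it is case-insensitive. The ticket does not define the exact collation, so these choices would need to be confirmed.

```python
def canonical_rotation(codes):
    """Return the rotation of a cyclic backbone whose joined string sorts first."""
    if not codes:
        return []
    rotations = [codes[i:] + codes[:i] for i in range(len(codes))]
    # Assumption: compare case-insensitively; fall back to the exact string on ties.
    return min(rotations, key=lambda r: ("-".join(r).lower(), "-".join(r)))


print("-".join(canonical_rotation(["Tza", "C", "Tza", "A", "A"])))  # A-A-Tza-C-Tza
print("-".join(canonical_rotation(["Bip", "alle", "Abu", "dE"])))   # Abu-dE-Bip-alle
```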

Requirement 2.2: $\color{Pink}{\textbf{In case of multiple backbones, they should be ordered by the decreasing number of amino}}$ 
$\color{Pink}{\textbf{acid monomers, or if equal by decreasing number of all monomers, or if equal alphabetically.}}$

BILN string: `A-A-A-A-A.C-C-C-C.D-D-D.E-E.F`
Ketcher: <img width="443" height="424" alt="Image" src="https://github.com/user-attachments/assets/2f3a2f00-f0f1-427d-a6d6-06bc47b72ec0" />
Chains contain different amounts of amino-acid monomers.

BILN string: `A-A-A-A-A-A-A-A-A-A.A-A-A-A-A-A-A-A-A6OH-A6OH-A6OH-A6OH`
Ketcher: <img width="1069" height="155" alt="Image" src="https://github.com/user-attachments/assets/23340509-3fc3-49dd-8da0-796e2b82f161" />
The second chain is longer, but contains fewer amino-acid monomers.

BILN string: `C-C-C-C-C-A6OH.C-C-A6OH-A6OH-C-C`
Ketcher: <img width="531" height="156" alt="Image" src="https://github.com/user-attachments/assets/e67264a8-3025-4efa-b6e9-c09ba67ebdbc" />
Both chains are of the same length, but one has more amino-acid monomers, so should be placed first.

BILN string: `A-C-D-E-F-A6OH.C-D-E-F-G-A6OH`
Ketcher: <img width="529" height="156" alt="Image" src="https://github.com/user-attachments/assets/d453b660-5084-4e16-b8c8-f6f274f6cd1e" />
Both chains contain the same amount of amino-acid monomers, and monomers overall, so are ordered alphabetically.

Requirement 2.2.1: $\color{Pink}{\textbf{Every chain with n≤5 monomers in length is considered to be n monomers long regardless of the amino-acid content.}}$

BILN string: `A-A-A6OH-A6OH-A6OH.C-C-C-C-C`
Ketcher: <img width="439" height="157" alt="Image" src="https://github.com/user-attachments/assets/a08585d3-6769-419f-bec1-d399e747df6e" />
Even though the first chain contains fewer amino acids than the second, the two should be ordered alphabetically because they are of the same length and are short (≤5 monomers).
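
The ordering in requirements 2.2 and 2.2.1 can be expressed as a sort key. The sketch below rests on assumptions: `backbone_sort_key` and `order_backbones` are hypothetical names, the set of CHEM codes is passed in by the caller, a chain of five or fewer monomers is treated as if all of its monomers counted as amino acids (one possible reading of 2.2.1), and the alphabetical comparison is case-insensitive on the joined codes.

```python
SHORT_CHAIN_MAX = 5  # threshold from requirement 2.2.1


def backbone_sort_key(codes, chem_codes):
    """Key that orders chains by amino-acid count, then length, then alphabetically."""
    total = len(codes)
    amino_acids = sum(1 for code in codes if code not in chem_codes)
    if total <= SHORT_CHAIN_MAX:
        # Assumption: short chains count as n monomers regardless of content.
        amino_acids = total
    text = "-".join(codes)
    # Negated counts give descending order; the text gives ascending order.
    return (-amino_acids, -total, text.lower(), text)


def order_backbones(backbones, chem_codes):
    """Return the backbones in the order they would be written on export."""
    return sorted(backbones, key=lambda codes: backbone_sort_key(codes, chem_codes))


chems = {"A6OH"}
ordered = order_backbones(
    [["C", "C", "A6OH", "A6OH", "C", "C"], ["C", "C", "C", "C", "C", "A6OH"]],
    chems,
)
print(".".join("-".join(chain) for chain in ordered))
# C-C-C-C-C-A6OH.C-C-A6OH-A6OH-C-C
```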

Requirement 3: $\color{Pink}{\textbf{Monomers in one backbone with default bonds (R1 of one monomer and R2 of another) are written with a}}$
$\color{Pink}{\textbf{hyphen separating them.}}$

BILN string: `A-C-D-E`
Ketcher: <img width="348" height="65" alt="Image" src="https://github.com/user-attachments/assets/aafea61b-5a76-45e7-8b50-4e028c1efbd2" />

Requirement 4: $\color{Pink}{\textbf{Monomers are identified by unique codes: If a code contains a hyphen it should be written with}}$
$\color{Pink}{\textbf{square brackets around it; If a code doesn't contain a hyphen it shouldn't be written with square brackets around it.}}$

BILN string: `[D-Cha]-C-[D-Abu]-dC-[D-2Thi]`
Ketcher: <img width="437" height="64" alt="Image" src="https://github.com/user-attachments/assets/34edd081-c221-4ccc-acbe-5271103dd6f4" />

Requirement 5: $\color{Pink}{\textbf{Non-backbone bonds are written explicitly in round brackets to the right of}}$
$\color{Pink}{\textbf{monomers that participate in the bond with two pieces of information: a bond identifier and the R-group}}$
$\color{Pink}{\textbf{number: Bond identifier and the R-group number are separated with a comma, and without spaces;}}$
$\color{Pink}{\textbf{A bond identifier is a unique positive integer, and it is 1 for the first explicitly}}$
$\color{Pink}{\textbf{indicated bond in the string, 2 for the second, ...; The R-group number indicates the number of the}}$
$\color{Pink}{\textbf{R-group that monomer uses to participate in that bond.}}$

BILN string: `[D-hPhe](1,1)-K-R(2,3)-L-M(1,2).Cya-C(2,3)-Phe_4I`
Ketcher: <img width="345" height="343" alt="Image" src="https://github.com/user-attachments/assets/473cd516-bd08-421a-85e6-21a1dfa9d91f" />

BILN string: `gGlu(1,3)-Gla(2,3)-meE(3,3)-H(4,3)-dH-Hhs(5,3)-K(6,3)-Aad(7,3)-[D-Orn](8,3)-dK(9,3).dC(1,3)-Hcy(2,3)-meC(3,3)-D(4,3)-dD(5,4)-meD(6,3)-E(7,3)-[D-gGlu](8,3)-dE(9,3)`
Ketcher: <img width="794" height="379" alt="Image" src="https://github.com/user-attachments/assets/43c3d705-0421-452e-addb-a9c87ab65182" />

Requirement 5.1: $\color{Pink}{\textbf{In case of one monomer participating in multiple non-backbone bonds, two brackets should be put next to each other.}}$

BILN string: `A-[Test-6-Ch](1,3)(2,4)-C.D(1,1).E(2,2)`
Ketcher: <img width="259" height="240" alt="Image" src="https://github.com/user-attachments/assets/44bfce78-7309-4c6e-b782-688c6a3b33d7" />
Note: the string `A-[Test-6-Ch](1,4)(2,3)-C.D(2,1).E(1,2)` wouldn't be the preferred version, because D comes before E in the string, so the bond between D and the CHEM should have the lower identifier.
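
Requirements 2 to 5.1 together describe how the string is assembled once the chains, their order, and the bond identifiers are known. The following sketch shows only that final formatting step. The input layout (a list of chains, each a list of `(code, bonds)` tuples) and the function names are hypothetical, and the assignment of bond identifiers is assumed to have happened earlier.

```python
def format_code(code):
    """Requirement 4: bracket a code only if it contains a hyphen."""
    return f"[{code}]" if "-" in code else code


def format_monomer(code, bonds=()):
    """Requirements 5 and 5.1: append one '(id,R)' group per non-backbone bond."""
    annotations = "".join(f"({bond_id},{r_group})" for bond_id, r_group in bonds)
    return format_code(code) + annotations


def format_biln(backbones):
    """Requirements 2 and 3: hyphens inside a backbone, full stops between backbones."""
    return ".".join(
        "-".join(format_monomer(code, bonds) for code, bonds in chain)
        for chain in backbones
    )


example = [
    [("A", []), ("Test-6-Ch", [(1, 3), (2, 4)]), ("C", [])],
    [("D", [(1, 1)])],
    [("E", [(2, 2)])],
]
print(format_biln(example))  # A-[Test-6-Ch](1,3)(2,4)-C.D(1,1).E(2,2)
```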

</details>

---

### Import of BILN

1. If a string cannot be interpreted as a valid BILN string, Indigo should return an error: "The string cannot be interpreted as a valid BILN string."

2. All the strings written using the best practices (export section) should be able to be loaded.

$\color{Red}{\textbf{Valid strings not constructed using best practices}}$

3. Circular structures whose BILN string is not alphabetically optimal should be able to be loaded (that is, even when a BILN string that would be first alphabetically exists for the same structure).

4. Strings that do not follow the order of backbone arrangement (largest amount of amino-acids first, then largest amount of monomers, then alphabetically) should be able to be loaded.

5. Strings with explicitly written backbone connections should be able to be loaded.

6. Strings containing monomer codes without hyphens in square brackets should be able to be loaded.

7. Strings that contain incorrectly ordered bonds should be able to be loaded.

<details>

<summary> Expand for examples </summary> 

Requirement 3: $\color{Pink}{\textbf{Circular structures whose BILN string is not alphabetically optimal should be able to be loaded (a BILN string}}$
$\color{Pink}{\textbf{that would be first alphabetically exists). }}$

Best practices BILN string: `A(1,1)-C-D-E(1,2)`
Ketcher: <img width="174" height="153" alt="Image" src="https://github.com/user-attachments/assets/f9a8c8ec-1ec1-4193-a28e-ebe2394574cf" />
Also valid: `C(1,1)-D-E-A(1,2)`
Also valid: `D(1,2)-C-A-E(1,1)`
Also valid: `E(1,2)-D-C-A(1,1)`

Requirement 4: $\color{Pink}{\textbf{Strings that do not follow the order of backbone arrangement (largest amount of amino-acids first,}}$
$\color{Pink}{\textbf{then largest amount of monomers, then alphabetically) should be able to be loaded.}}$

Best practices BILN string: `L-E-R(1,3)-S-T.A(2,1)-C(1,3)-F-G(2,2)`
Ketcher: <img width="395" height="389" alt="Image" src="https://github.com/user-attachments/assets/a5ae2709-8267-49a8-87c9-83a6e9ca53f2" />
Also valid: `A(1,1)-C(2,3)-F-G(1,2).L-E-R(2,3)-S-T`

Best practices BILN string: `A-A-A-A-A-A.[PEG-2]-C-C-C-C-[PEG-2].C-C-C-C`
Ketcher: <img width="533" height="246" alt="Image" src="https://github.com/user-attachments/assets/5defaa7b-44e3-4b2d-98f0-e02a485edd40" />
Also valid: `A-A-A-A-A-A.C-C-C-C.[PEG-2]-C-C-C-C-[PEG-2]`
Also valid: `[PEG-2]-C-C-C-C-[PEG-2].A-A-A-A-A-A.C-C-C-C`
Also valid: `[PEG-2]-C-C-C-C-[PEG-2].C-C-C-C.A-A-A-A-A-A`
Also valid: `C-C-C-C.A-A-A-A-A-A.[PEG-2]-C-C-C-C-[PEG-2]`
Also valid: `C-C-C-C.[PEG-2]-C-C-C-C-[PEG-2].A-A-A-A-A-A`

Requirement 5: $\color{Pink}{\textbf{Strings with explicitly written backbone connections should be able to be loaded.}}$

Best practices BILN string: `[D-Cit]-aThr-meS`
Ketcher: <img width="259" height="65" alt="Image" src="https://github.com/user-attachments/assets/0092bb1f-8306-446e-b371-c94d80fb641f" />
Also valid: `[D-Cit](1,2).aThr(1,1)(2,2).meS(2,1)`
$\color{Red}{\textbf{Not}}$ valid: `[D-Cit](1,2)-aThr(1,1)(2,2)-meS(2,1)`

Requirement 6: $\color{Pink}{\textbf{Strings containing monomer codes without hyphens in square brackets should be able to be loaded.}}$

Best practices BILN string: `[D-2Thi]-D-[D-gGlu]-meF-G-[Lys-al]`
Ketcher: <img width="525" height="64" alt="Image" src="https://github.com/user-attachments/assets/760dfa2b-97d6-4e5e-b6b0-506052aa25d1" />
Also valid: `[D-2Thi]-[D]-[D-gGlu]-[meF]-[G]-[Lys-al]`
Also valid: `[D-2Thi]-D-[D-gGlu]-[meF]-G-[Lys-al]`
$\color{Red}{\textbf{Not}}$ valid: `D-2Thi-D-D-gGlu-meF-G-Lys-al`, because `2Thi`, `Lys`, and `al` are not valid monomer codes.
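
On import, the optional square brackets mean a backbone cannot simply be split on every hyphen. A minimal tokenizer sketch is shown below. The names `split_chain` and `monomer_code` are hypothetical, and the check that each resulting code exists in the monomer library is deliberately left out.

```python
def split_chain(chain):
    """Split one backbone on hyphens that sit outside square or round brackets."""
    tokens, current, depth = [], [], 0
    for ch in chain:
        if ch in "[(":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "-" and depth == 0:
            tokens.append("".join(current))
            current = []
        else:
            current.append(ch)
    tokens.append("".join(current))
    return tokens


def monomer_code(token):
    """Drop any bond annotations and the optional square brackets."""
    code = token.split("(", 1)[0]
    if code.startswith("[") and code.endswith("]"):
        code = code[1:-1]
    return code


bracketed = "[D-2Thi]-[D]-[D-gGlu]-[meF]-[G]-[Lys-al]"
print([monomer_code(t) for t in split_chain(bracketed)])
# ['D-2Thi', 'D', 'D-gGlu', 'meF', 'G', 'Lys-al']
```

Applied to the unbracketed string `D-2Thi-D-D-gGlu-meF-G-Lys-al`, the same functions yield the fragments `2Thi`, `Lys`, and `al` among the codes, which a subsequent library lookup would reject.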

Requirement 7: $\color{Pink}{\textbf{Strings that contain incorrectly ordered bonds should be able to be loaded.}}$

Best practices BILN string: `A-C(1,3)-D(2,3)-E.F-G-H(1,3)-I-K(2,3)`
Ketcher: <img width="469" height="304" alt="Image" src="https://github.com/user-attachments/assets/9aa12bcb-7b42-4086-98db-ab4aaa3260be" />
Also valid: `A-C(7563,3)-D(3,3)-E.F-G-H(7563,3)-I-K(3,3)`
$\color{Red}{\textbf{Not}}$ valid: `A-C(-1,3)-D(2,3)-E.F-G-H(-1,3)-I-K(2,3)` because the bond identifier must be positive.
$\color{Red}{\textbf{Not}}$ valid: `A-C(1.25,3)-D(2,3)-E.F-G-H(1.25,3)-I-K(2,3)` because the bond identifier must be a whole number.
$\color{Red}{\textbf{Not}}$ valid: `A-C(1,3)-D(1,3)-E.F-G-H(1,3)-I-K(2,3)` because every bond identifier must appear exactly twice (here `1` appears three times and `2` only once).
$\color{Red}{\textbf{Not}}$ valid: `A-C(1,4)-D(2,3)-E.F-G-H(1,3)-I-K(2,3)` because monomer `C` doesn't have R4.
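
The first three invalid cases above depend only on the text of the bond annotations, so they can be checked before any monomer lookup. The sketch below illustrates such a check under assumptions: `check_bond_annotations` is a hypothetical name, annotations with spaces or leading zeros are rejected (the ticket does not say whether import should tolerate them), and the R-group existence check from the last example is not covered because it needs the monomer library.

```python
import re
from collections import Counter

# Error text quoted from import requirement 1.
IMPORT_ERROR = "The string cannot be interpreted as a valid BILN string."

PAREN = re.compile(r"\(([^()]*)\)")             # any '(...)' group
PAIR = re.compile(r"([1-9][0-9]*),([1-9][0-9]*)")  # 'id,R' with positive integers


def check_bond_annotations(biln):
    """Raise ValueError unless every bond id is a positive integer used exactly twice."""
    bond_ids = []
    for body in PAREN.findall(biln):
        match = PAIR.fullmatch(body)
        if match is None:  # rejects e.g. '-1,3' and '1.25,3'
            raise ValueError(IMPORT_ERROR)
        bond_ids.append(int(match.group(1)))
    if any(count != 2 for count in Counter(bond_ids).values()):
        raise ValueError(IMPORT_ERROR)


# The best-practice string from the example passes silently.
check_bond_annotations("A-C(1,3)-D(2,3)-E.F-G-H(1,3)-I-K(2,3)")
```

Identifiers do not need to be consecutive or ordered for this check to pass, which matches the intent of requirement 7.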

</details>
