The MInChI Demo page includes some interesting mixfiles (well, if you "copy branch" it's basically a JSON mixfile without mixfileVersion) with unknown InChI structures such as:
- No structures at all: BSA blocking buffer + PBS; bechamel sauce
- Partial lack: Dodecacarbonyltriiron
Right now the produced InChI is a little less than informative for these purposes. I propose adding an optional layer /x (external identifiers) to handle this problem.
/x layer
The /x layer consists of the following parts:
- A main part, consisting of percent-encoded strings separated by the character &. Characters that MUST be encoded are /, &, unprintable characters, and whitespace characters. (I chose this style because it originates in an environment that uses & and /.)
  - The use of + in place of %20 for encoding a space is permitted. (Purely aesthetic reasons.)
- A mandatory /n sublayer, which is very similar to the main /n layer but adds the ability to associate multiple strings with a substance as well as the ability to name a group. (This will cause some duplication of information in the nesting structure. We already do that with /g.)
- An optional /t sublayer specifying the type of each identifier in the main part. This sublayer contains a string, each character describing the field at the corresponding index of the &-separated main part. Acceptable types include (each of these has a Mixfile counterpart):
  - f: formula (likely used when the connectivity is unknown, or when the formula has numbers in a range, so that no InChI can be made)
  - s: SMILES
  - n: Human-readable name
  - k: InChIKey
  - (I could specify one for Molfile here but the size would be comical. A URL-safe base64 encoding of gzipped Molfile? Nah, sounds too complicated.)
  - (There are some additional database references that can be added, though these will NOT have a Mixfile counterpart. It could make sense to just write another "name" for now.)
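To make the encoding rule for the main part concrete, here is a minimal Python sketch. The function names are hypothetical. It encodes only the characters listed above, plus % and a literal +, which is my own assumption: the proposal does not mention them, but without it a round trip would be ambiguous. Decoding leans on the standard library's urllib.parse.unquote_plus.

```python
from urllib.parse import unquote_plus


def encode_identifier(text: str) -> str:
    """Percent-encode one identifier for the main part of /x."""
    out = []
    for ch in text:
        if ch == " ":
            # The proposal permits + in place of %20.
            out.append("+")
        elif ch in "/&%+" or ch.isspace() or not ch.isprintable():
            # "/" and "&" are required by the proposal; "%" and "+" are an
            # assumption added here so that decoding is unambiguous.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def encode_main_part(identifiers) -> str:
    """Join encoded identifiers with the & separator."""
    return "&".join(encode_identifier(s) for s in identifiers)


def decode_main_part(main_part: str):
    """Split on & and undo the percent-encoding (and + for space)."""
    return [unquote_plus(field) for field in main_part.split("&")]


if __name__ == "__main__":
    names = ["butter", "flour", "flour dispersed in butter", "milk", "bechamel sauce"]
    print(encode_main_part(names))
    print(decode_main_part(encode_main_part(["a/b & c"])))
```

Because / and & never appear unencoded inside a field, a reader can split the whole layer on / and then on & before decoding anything.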
The /x layer shall only appear on non-"standard MInChI", i.e. "MInChI=0.00.1" without the "S". There is too much variability for anything to be reproducible here. Lucky we don't have a MInChIKey...
Basic example (with whitespace added)
MInChI=0.00.1//n{{&}&}/g{{466wf-3&534wf-3}91wf-3&909wf-3}
/xbutter&flour&flour+dispersed+in+butter&milk&bechamel+sauce
/n{{1&2}3&4}5
/tnnnnn
Example of three identifiers on the same thing:
MInChI=0.00.1/C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3/n{&1}/g{1:5pp0&}
/xOctacarbonyldicobalt&Co2(CO)8&PubChem_CID:25049
/n{1,2,3&}
/tnfn
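As a rough illustration of how a reader might pull the /x layer out of the examples above and pair each identifier with its /t type, here is a Python sketch. The function name read_x_layer and the returned dictionary shape are hypothetical. It assumes that /x and its sublayers are the trailing segments of the string and that no earlier segment starts with a lowercase x.

```python
from urllib.parse import unquote_plus

# Type characters proposed for the /t sublayer.
TYPE_NAMES = {"f": "formula", "s": "SMILES", "n": "name", "k": "InChIKey"}


def read_x_layer(minchi: str):
    """Return the decoded /x identifiers with their types, or None if no /x."""
    # Whitespace must be encoded inside the layer, so any whitespace present
    # was added for readability and can be dropped.
    compact = "".join(minchi.split())
    segments = compact.split("/")
    start = next((i for i, s in enumerate(segments) if s.startswith("x")), None)
    if start is None:
        return None
    fields = [unquote_plus(f) for f in segments[start][1:].split("&")]
    # Everything after the /x segment is treated as a sublayer of /x.
    sublayers = {s[0]: s[1:] for s in segments[start + 1:] if s}
    types = sublayers.get("t")
    if types is None:
        labels = [None] * len(fields)  # /t is optional
    elif len(types) != len(fields):
        raise ValueError("/t must have one character per identifier")
    else:
        labels = [TYPE_NAMES.get(c, c) for c in types]
    return {"identifiers": list(zip(fields, labels)), "n": sublayers.get("n")}


if __name__ == "__main__":
    example = """MInChI=0.00.1/C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3/n{&1}/g{1:5pp0&}
    /xOctacarbonyldicobalt&Co2(CO)8&PubChem_CID:25049
    /n{1,2,3&}
    /tnfn"""
    print(read_x_layer(example))
```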
On /n
When an /n sublayer is present, it should have the same "shape-of-braces" as the main /n layer. The format is the same as the main /n layer, with the following exceptions:
- Each structure can have multiple descriptions in the main part. This is resolved by allowing the use of a comma , between numbers describing the same part.
- Each brace-grouping may have its own label. This is handled by permitting number-lists to be used after the closing brace, before the &. (This resembles Newick format.)
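The two extensions can be sketched as a small recursive-descent parser in Python. This is only an illustration of the grammar as I read it from the description above; Node, parse_x_n and brace_shape are hypothetical names, and the shape check simply compares what is left after dropping everything except braces and &.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # 1-based indices into the &-separated main part of /x
    indices: List[int] = field(default_factory=list)
    # None for a single component; a list of nodes for a brace group
    children: Optional[List["Node"]] = None


def parse_x_n(text: str) -> Node:
    """Parse the /n sublayer of /x, e.g. '{{1&2}3&4}5'."""
    pos = 0

    def parse_numbers() -> List[int]:
        # A possibly empty comma-separated list of indices.
        nonlocal pos
        start = pos
        while pos < len(text) and (text[pos].isdigit() or text[pos] == ","):
            pos += 1
        chunk = text[start:pos]
        return [int(n) for n in chunk.split(",")] if chunk else []

    def parse_node() -> Node:
        nonlocal pos
        if pos < len(text) and text[pos] == "{":
            pos += 1
            children = [parse_node()]
            while pos < len(text) and text[pos] == "&":
                pos += 1
                children.append(parse_node())
            if pos >= len(text) or text[pos] != "}":
                raise ValueError(f"expected '}}' at position {pos}")
            pos += 1
            # Numbers after the closing brace label the group itself.
            return Node(parse_numbers(), children)
        return Node(parse_numbers())

    root = parse_node()
    if pos != len(text):
        raise ValueError(f"unexpected character at position {pos}")
    return root


def brace_shape(layer: str) -> str:
    """Keep only the nesting skeleton so two /n strings can be compared."""
    return "".join(ch for ch in layer if ch in "{}&")


if __name__ == "__main__":
    print(parse_x_n("{{1&2}3&4}5"))
    print(brace_shape("{{1&2}3&4}5") == brace_shape("{{&}&}"))
```

In the bechamel example, the root node carries index 5 (the sauce), its first child is the group labelled 3 (flour dispersed in butter), and the leaves carry 1, 2 and 4.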
About names
/x is currently unused and a good sound match. I think it's an acceptable use of a letter, unless someone has some other use in mind (e.g. using /x like the x- prefix of MIME types for experimental/extensions in general).