Include taxon id with taxon label in facet count of entity search endpoint #386

@vincerubinetti

I'm developing version 3.0 of the Monarch UI/website, and I've run into a limitation. @putmantime

Here is an example response from the /search/entity/{term} endpoint, searching "ssh":

{
  "numFound": 177,
  "docs": [
    {
      "id": "FlyBase:FBgn0029157",
      "id_std": "FlyBase:FBgn0029157",
      "id_eng": "FlyBase:FBgn0029157",
      "id_kw": "FlyBase:FBgn0029157",
      "prefix": "FlyBase",
      "label": ["ssh"],
      "label_std": ["ssh"],
      "label_eng": ["ssh"],
      "label_kw": ["ssh"],
      "edges": 319,
      "taxon": "NCBITaxon:7227",
      "taxon_std": "NCBITaxon:7227",
      "taxon_eng": "NCBITaxon:7227",
      "taxon_kw": "NCBITaxon:7227",
      "taxon_label": "Drosophila melanogaster",
      "taxon_label_std": "Drosophila melanogaster",
      "taxon_label_eng": "Drosophila melanogaster",
      "taxon_label_kw": "Drosophila melanogaster",
      "taxon_label_synonym": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_std": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_eng": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_kw": ["fruit fly", "Sophophora melanogaster"],
      "has_phenotype": false,
      "category": ["gene", "sequence feature"],
      "category_std": ["gene", "sequence feature"],
      "category_eng": ["gene", "sequence feature"],
      "category_kw": ["gene", "sequence feature"],
      "synonym": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_std": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_eng": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_kw": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "equivalent_curie": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_std": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_eng": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_kw": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "leaf": true,
      "_version_": 1696524917734899700,
      "score": 117.35552
    }
  ],
  "facet_counts": {
    "category": {
    },
    "taxon_label": {
      "Sus scrofa": 25,
      "Drosophila melanogaster": 21,
      "Homo sapiens": 18,
      "Mus musculus": 16,
      "Bos taurus": 6,
      "Saccharomyces cerevisiae S288C": 6,
      "Xenopus tropicalis": 6,
      "Danio rerio": 5,
      "Gallus gallus": 4,
      "Anolis carolinensis": 3,
      "Canis lupus familiaris": 3,
      "Felis catus": 3,
      "Macaca mulatta": 3,
      "Monodelphis domestica": 3,
      "Ornithorhynchus anatinus": 3,
      "Pan troglodytes": 3,
      "Rattus norvegicus": 3,
      "Takifugu rubripes": 3,
      "Equus caballus": 2
    }
  },
  "highlighting": {}
}

Notice that the facets return taxon_label instead of taxon (id). This is nice for displaying a list of taxon facets, but not for actually filtering by them, because the endpoint only supports filtering by taxon (id), not taxon_label.

This requires the frontend to keep a hard-coded label-to-id mapping for taxa. That duplicates information we already have in biolink, is brittle, and is likely to get out of sync.

And yes, I can look up taxon from docs by finding the doc with the corresponding taxon_label field. However, I would then need every result to be in docs to get all the mappings, and that might go beyond the max rows [per page] param.
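To illustrate the gap, here is a rough TypeScript sketch of that docs-based lookup. The interfaces only cover the fields used here, and the helper names (taxonIdsFromDocs, unresolvedFacetLabels) are made up for this example, not part of biolink or the UI code.

```typescript
// Minimal shape of the fields used here; real docs carry many more fields.
interface SearchDoc {
  taxon?: string;
  taxon_label?: string;
}

interface SearchResponse {
  docs: SearchDoc[];
  facet_counts: { taxon_label?: Record<string, number> };
}

// Hypothetical helper: build a label -> id map from the docs on the current page only.
function taxonIdsFromDocs(docs: SearchDoc[]): Map<string, string> {
  const map = new Map<string, string>();
  for (const doc of docs) {
    if (doc.taxon && doc.taxon_label) map.set(doc.taxon_label, doc.taxon);
  }
  return map;
}

// Hypothetical helper: facet labels that have no id on this page, so they cannot be used as a filter.
function unresolvedFacetLabels(response: SearchResponse): string[] {
  const known = taxonIdsFromDocs(response.docs);
  return Object.keys(response.facet_counts.taxon_label ?? {}).filter(
    (label) => !known.has(label)
  );
}

// Trimmed-down version of the "ssh" response above: one doc, three facet labels.
const example: SearchResponse = {
  docs: [{ taxon: "NCBITaxon:7227", taxon_label: "Drosophila melanogaster" }],
  facet_counts: {
    taxon_label: {
      "Sus scrofa": 25,
      "Drosophila melanogaster": 21,
      "Homo sapiens": 18,
    },
  },
};

const unresolved = unresolvedFacetLabels(example);
```

With only the Drosophila doc on the page, "Sus scrofa" and "Homo sapiens" show up in the facets but have no id to filter by.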


Possible solutions:

  • Support a taxon_label filter parameter (in addition to the taxon parameter) in the search endpoint. I guess this would be most useful as an exact match rather than a fuzzy match. If multiple taxon ids map to the same exact taxon label, then this option wouldn't be viable.

  • Return an additional taxon field in facet_counts with all the information I need: id, label, and count. This would leave the taxon_label facet untouched so current applications using biolink don't suddenly break.

  • Have some kind of taxon_map field at the top level of the response so I can go from label to id easily. Though, I think this is pretty ugly... I don't want to add a top-level field as a special exception for just one type of facet.
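
For the second option, something like the following is what I have in mind on the consuming side. This is only a sketch: the TaxonFacet shape (a list of id/label/count entries under facet_counts.taxon) is one possible layout, not an existing biolink response, and toFilterOptions is a made-up helper.

```typescript
// Hypothetical entry in a proposed `facet_counts.taxon` list; not part of the current API.
interface TaxonFacet {
  id: string;
  label: string;
  count: number;
}

// What a filter dropdown in the UI might need: the id to send, the text to show.
interface FilterOption {
  value: string;
  text: string;
}

// Hypothetical helper: sort by count (highest first) and pair each id with a display string.
function toFilterOptions(facets: TaxonFacet[]): FilterOption[] {
  return [...facets]
    .sort((a, b) => b.count - a.count)
    .map((f) => ({ value: f.id, text: `${f.label} (${f.count})` }));
}

// Labels and counts are from the "ssh" response above; NCBITaxon:7227 is from the doc there,
// and NCBITaxon:9606 is the NCBI Taxonomy id for Homo sapiens.
const proposedTaxonFacet: TaxonFacet[] = [
  { id: "NCBITaxon:9606", label: "Homo sapiens", count: 18 },
  { id: "NCBITaxon:7227", label: "Drosophila melanogaster", count: 21 },
];

const options = toFilterOptions(proposedTaxonFacet);
```

Each option then carries the id to pass to the existing taxon filter, so no label-to-id mapping is needed on the frontend.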
