Possible confusion between H9 T-cell lymphoma cell line and H9/WA09 human embryonic stem cell line in enriched metadata #13

@jqming123

Hi seqout team,

First of all, thank you for building and maintaining seqout. I have found the enriched metadata feature very useful for quickly exploring public sequencing datasets.

I would like to report a possible metadata enrichment issue related to the ambiguous cell line name “H9”.

There are at least two biologically distinct human cell lines commonly referred to as “H9”:

  1. H9 T-cell lymphoma cell line
  2. WA09 / H9 human embryonic stem cell line

In several BioProjects I checked, the enriched metadata appears to map samples that seem to be H9/WA09 human embryonic stem cells to the unrelated H9 T-cell lymphoma cell line, or otherwise mixes up the cell line and sex annotations.

Examples where I observed this possible confusion include:

  • PRJNA1084791
  • PRJNA1034017
  • PRJNA798001
  • PRJNA722022
  • PRJNA674865
  • PRJNA606766
  • PRJNA589309

In my search results, this affected roughly one third of the H9-related datasets I examined, although I realize this may be a relatively specific edge case caused by ambiguous cell line naming.

What makes this issue easier to spot is that the two Cellosaurus entries have different donor sex annotations: the lymphoma H9 cell line is male, whereas the WA09/H9 embryonic stem cell line is female. In some cases, the enriched metadata seems inconsistent with this distinction.
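The sex distinction above suggests a simple consistency check. The following is only a minimal sketch, not seqout's actual implementation: the record fields (`cell_line`, `sex`) and the label strings are hypothetical, and the expected sex values are the ones described in this issue.

```python
# Expected donor sex per cell line, as described in this issue.
# The label strings are illustrative, not seqout's real vocabulary.
EXPECTED_SEX = {
    "H9 (T-cell lymphoma)": "male",
    "H9/WA09 (hESC)": "female",
}


def sex_conflict(record):
    """Return True if the record's sex contradicts its assigned cell line.

    `record` is a hypothetical dict with optional "cell_line" and "sex" keys.
    Records with a missing sex or an unlisted cell line are not flagged.
    """
    expected = EXPECTED_SEX.get(record.get("cell_line"))
    observed = (record.get("sex") or "").strip().lower()
    if expected is None or not observed:
        return False
    return observed != expected
```

A record annotated as the lymphoma line but with female sex (or the reverse) would be flagged for review, which matches how I spotted the examples listed above.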

At the same time, seqout appears to handle this correctly in some other cases. For example, the enriched metadata ontology vocabulary already seems to contain entries such as “H9/WA09” and “WA09”, and the following pages appear to show the expected interpretation:

https://seqout.org/p/GSE291908
https://seqout.org/p/GSE240435
https://seqout.org/p/GSE147338

I also noticed an example that appears to be correctly recognized as H9 human embryonic stem cells:

https://seqout.org/p/SRP328737

In that case, the cell line name and sex metadata look correct, although the enriched metadata does not seem to include ontology tags yet.

Expected behavior

When a sample refers to H9 human embryonic stem cells, WA09, or H9/WA09, the enriched metadata should ideally distinguish it from the unrelated H9 T-cell lymphoma cell line and map it to the correct Cellosaurus concept where possible.

Actual behavior

Some datasets appear to be mapped to, or annotated as, the wrong H9 cell line, resulting in potentially incorrect cell line and/or sex metadata.

Possible cause

This may be due to ambiguity in the free-text sample metadata, where “H9” alone is used without enough context. However, in many stem cell datasets, nearby terms such as “hESC”, “human embryonic stem cell”, “WA09”, differentiation protocols, or female donor sex may help disambiguate the intended cell line.
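To illustrate the kind of context-based disambiguation I have in mind, here is a rough sketch. It is an assumption about one possible approach, not a description of how seqout works: the function name, the label strings, and the keyword lists are all hypothetical, and the lymphoma hints in particular are my own guess at useful terms.

```python
import re

# Any mention of the ambiguous name or its unambiguous synonym.
H9_MENTION = re.compile(r"\b(h9|wa-?09)\b", re.IGNORECASE)

# Context terms from this issue that point to the embryonic stem cell line.
HESC_HINTS = re.compile(
    r"\b(wa-?09|hescs?|human embryonic stem cells?|embryonic stem cells?)\b",
    re.IGNORECASE,
)

# Assumed context terms that point to the lymphoma line (illustrative only).
LYMPHOMA_HINTS = re.compile(r"\b(lymphoma|t[- ]cells?)\b", re.IGNORECASE)


def disambiguate_h9(text):
    """Classify a free-text sample description that may mention H9.

    Returns None if H9/WA09 is not mentioned, a label string if exactly one
    kind of context hint is present, and "ambiguous" otherwise.
    """
    if not H9_MENTION.search(text):
        return None
    hesc = bool(HESC_HINTS.search(text))
    lymphoma = bool(LYMPHOMA_HINTS.search(text))
    if hesc and not lymphoma:
        return "H9/WA09 (hESC)"
    if lymphoma and not hesc:
        return "H9 (T-cell lymphoma)"
    return "ambiguous"
```

Returning "ambiguous" rather than guessing seems safer for the bare "H9" case: an unresolved annotation is easier to notice than a wrong cell line with a wrong sex attached.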

I understand that AI-enriched metadata is inherently imperfect and that ambiguous cell line names are difficult to resolve automatically. I wanted to report this because the H9/WA09 case may affect a noticeable fraction of H9-related search results.

Thank you again for the project!
