Skip to content

Fixes to support co assembly in "accessions" column#43

Open
ochkalova wants to merge 16 commits intodevfrom
feature/coassembly_fixes
Open

Fixes to support co assembly in "accessions" column#43
ochkalova wants to merge 16 commits intodevfrom
feature/coassembly_fixes

Conversation

@ochkalova
Copy link
Contributor

@ochkalova ochkalova commented Dec 12, 2025

Little PR with 2 changes, more details below.
I don't know if main is the right branch to merge with?

UPD:

  • Tests for co-assemblies added: end-to-end and for _get_public_run_from_assembly function
  • Also I found a typo in previous test_run_from_assembly that prevent it from testing the public request -> that's how this typo in URL (https://www.ebi.ac.uk/ena/browser/api/xml/analyses/xml/ instead of https://www.ebi.ac.uk/ena/browser/api/xml/ ) became unnoticed 🕵🏻
  • I've added note to README about single quotes in password for someone as silly as me 😄


def _get_public_run_from_assembly(self):
url = f"{self.browser_url}/analyses/xml/{self.accession}"
url = f"{self.browser_url}/{self.accession}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how it happened, but url = f"{self.browser_url}/analyses/xml/{self.accession}" turns into https://www.ebi.ac.uk/ena/browser/api/xml/analyses/xml/ that doesn't exist... I fixed it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It looks like it was copied from line #300 by mistake... it's been in the code for a while though.

I think to remember that it's needed for private data, does that one work?

Might be worth expanding tests for all possible API queries we have

Comment on lines 312 to +317
def reformatter(xml_doc):
return xml_doc.getElementsByTagName("RUN_REF")[0].attributes["accession"].value
run_refs = xml_doc.getElementsByTagName("RUN_REF")
return sorted({node.getAttribute("accession") for node in run_refs})

result = self._fetch_ena_data(url=url, mode="xml", reformatter=reformatter)
logger.info(f"public run from the assembly {self.accession} returned from ENA")
logger.info(f"public runs for assembly {self.accession} returned from ENA")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this function to collect all RUN_REFs, not just the first one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this file I only tweaked TSV validation a little, because previously it failed when accession was an assembly accession and co-assembly = true

Copy link
Contributor Author

@ochkalova ochkalova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left descriptions of changes in the comments above ⬆️

@KateSakharova KateSakharova changed the base branch from main to dev December 17, 2025 12:28
@KateSakharova KateSakharova self-requested a review December 17, 2025 12:28
Copy link
Collaborator

@KateSakharova KateSakharova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding co-assembly fixes! I think we need more examples how genome-uploader works (or not) with co-assemblies.
Lets start with a couple of tests for fixes that you have done, especially url (if ENA really changed it - it should be covered in real call test).

Copy link
Collaborator

@KateSakharova KateSakharova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! I left small comments.
Maybe we should also modify combine_ena_info function in that PR? I'm happy to discuss it.

@@ -0,0 +1,2 @@
genome_name genome_path accessions assembly_software binning_software binning_parameters stats_generation_software completeness contamination genome_coverage metagenome co-assembly broad_environment local_environment environmental_medium rRNA_presence NCBI_lineage
SAMPLE_11_bin.1 ./tests/fixtures/SAMPLE_11_bin.1.fa.gz ERZ27228618 metaSPAdes_v3.13.0 metaWRAP_v1.2.1 --maxbin2 --metabat2 --concoct CheckM_v1.1.2 95 0 69.16 human skin metagenome TRUE human skin N-R,N Skin wash,Skin swab FALSE d__Bacteria;p__Actinomycetota;c__Actinomycetes;o__Mycobacteriales;f__Lawsonellaceae;g__Lawsonella;s__Lawsonella clevelandensis No newline at end of file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would script work if I specify a list of runs in accessions column? If yes, then we need a test for that. If no - it should be reflected in docs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first wrote the script two things had to happen to accept co-assemblies:

  • more than one accession had to be listed
  • the co-assembly column had to be set to True

I haven't checked the code yet to see if this changed. I agree on both - adding a test and maybe add a line in the documentation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this introduces the problem: what happens when multiple accessions are listed? We need to determine when to combine multiple accessions into a single co-assembly, and instead when to block the execution if multiple co-assembly accessions are listed

Copy link
Member

@Ge94 Ge94 Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or.... Just drop the support for multiple accessions altogether

Copy link
Collaborator

@KateSakharova KateSakharova Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also check that ERR was not specified with ERZ in that case. So it should be ERZ or ERRs. Or we should support mixed?

Copy link
Contributor Author

@ochkalova ochkalova Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would script work if I specify a list of runs in accessions column? If yes, then we need a test for that. If no - it should be reflected in docs

@KateSakharova yes, and it works. I've added the test case for this in:
add test case of coassembly submission where "accessions" column cont…

Copy link
Contributor Author

@ochkalova ochkalova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤓

ena_query = EnaQuery(sample_accession, "sample", self.private)
sample_info = ena_query.build_query()

latitude, longitude = "missing: third party data", "missing: third party data"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ge94
Starting from here and until line 629 latitude, longitude are processed. Based on what's done, their value can be only number (already rounded) or "missing: third party data" (public data) or "not provided" (private data). Why do we have different assignments for private and public data in case of na?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's private data, who uses the genome_uploader 1) is either the submitter 2) or in contact with the submitter, therefore metadata can be reinforced or set as "not provided" if absent. Otherwise, if it's public, and data are owned by someone else, then we call them 3rd party data.

To be absolutely sure of this we should start checking if referenced data actually belong to the Webin owner and a few more checks, but the explanation above is already quite close to reality - we tried to make it simple...

country = "missing: third party data"

collection_date = sample_info["collection_date"]
if collection_date.lower() in [
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ge94 What do you think about moving those to constant ENA_ACCEPTED_NA?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point!

Comment on lines 651 to 652
or collection_date.lower() == "missing"
or collection_date.lower() in ["not available", "na"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ge94 I also don't understand why we are keeping the list above, but here replace "missing" although ENA allows it too, maybe keep "missing" unchanged?
["not available", "na"] replacement makes sense because those are not accepted

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"missing" used to be a valid value - now guidelines got stricter, and what you register now should contain "missing:[reason]" explanation here. All these rules in the code have been added manually in the last couple years as fields were changing and we were not aware yet - we were just encountering more and more bugs 🙈

However, you can still find "missing" in old registered samples, and since it's only guidelines, they might not be reinforced (they "strongly encourage"). We try to follow best practices, but 3rd party data might be different

Putting all these in an "NA LIST" in the constants and check every field against those would be the best thing. If there is a match, we then select a different field value based on private/public

raise IOError("Longitude could not be parsed. Check metadata for run {}.".format(run_accession))

if country not in GEOGRAPHIC_LOCATIONS:
country = "missing: third party data"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ge94 GEOGRAPHIC_LOCATIONS doesn't include the list of "ENA allowed NA synonyms", so here we basically normalise all NAs to "missing: third party data", which is fine but we don't normalise collection_date later 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, we should treat this field as the others: "missing: 3rd party data" if it's public, "not available" if it's private.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this one is that if "country" is anything but an actual country, then it should forced to a default "NA" value

Copy link
Member

@Ge94 Ge94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much Sonya, I have been wanting to refactor the co-assembly bit in years (no kidding). Lots of good stuff.

A summary of things I have repeated across the comments:

  • There are now two possible settings we need to check for co-assemblies: a list of "normal" runs/samples, and a single sample registered as co-assemblies. It would be good to have a test for both (one is already there)
  • Restructuring all the NAs would make everything much cleaner. I support having a list of them under constants and checking against all of them regardless of the field
  • If a field matches an NA value, it should be defaulted to a different value depending on public/private.


def _get_public_run_from_assembly(self):
url = f"{self.browser_url}/analyses/xml/{self.accession}"
url = f"{self.browser_url}/{self.accession}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It looks like it was copied from line #300 by mistake... it's been in the code for a while though.

I think to remember that it's needed for private data, does that one work?

Might be worth expanding tests for all possible API queries we have

Behaviour for handling multiple differing values is defined
in the BIN_SAMPLE_FIELDS dictionary ("if_multiple" key).
It also normalises missing values to a defined standard
("normalise_na" key). "biosample_field" key indicates the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe default_on_na instead of normalise?

ena_query = EnaQuery(sample_accession, "sample", self.private)
sample_info = ena_query.build_query()

latitude, longitude = "missing: third party data", "missing: third party data"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's private data, who uses the genome_uploader 1) is either the submitter 2) or in contact with the submitter, therefore metadata can be reinforced or set as "not provided" if absent. Otherwise, if it's public, and data are owned by someone else, then we call them 3rd party data.

To be absolutely sure of this we should start checking if referenced data actually belong to the Webin owner and a few more checks, but the explanation above is already quite close to reality - we tried to make it simple...

Returns:
None. Modifies genome_info in place.
"""
BIN_SAMPLE_FIELDS = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so great, I just wonder if we should make the distinction for private/public like mentioned in the other discussion

@@ -427,14 +420,21 @@

# check whether all co-assemblies have more than one run associated and vice versa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# check whether all co-assemblies have more than one run associated and vice versa
# raise error it the following check fails:
# - co-assemblies are associated with more than one run
# - co-assemblies are associated with one assembly accession

raise IOError("Longitude could not be parsed. Check metadata for run {}.".format(run_accession))

if country not in GEOGRAPHIC_LOCATIONS:
country = "missing: third party data"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, we should treat this field as the others: "missing: 3rd party data" if it's public, "not available" if it's private.

raise IOError("Longitude could not be parsed. Check metadata for run {}.".format(run_accession))

if country not in GEOGRAPHIC_LOCATIONS:
country = "missing: third party data"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this one is that if "country" is anything but an actual country, then it should forced to a default "NA" value

country = "missing: third party data"

collection_date = sample_info["collection_date"]
if collection_date.lower() in [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point!

Comment on lines 651 to 652
or collection_date.lower() == "missing"
or collection_date.lower() in ["not available", "na"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"missing" used to be a valid value - now guidelines got stricter, and what you register now should contain "missing:[reason]" explanation here. All these rules in the code have been added manually in the last couple years as fields were changing and we were not aware yet - we were just encountering more and more bugs 🙈

However, you can still find "missing" in old registered samples, and since it's only guidelines, they might not be reinforced (they "strongly encourage"). We try to follow best practices, but 3rd party data might be different

Putting all these in an "NA LIST" in the constants and check every field against those would be the best thing. If there is a match, we then select a different field value based on private/public

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants