Fixes to support co assembly in "accessions" column by ochkalova · Pull Request #43 · EBI-Metagenomics/genome_uploader

ochkalova · 2025-12-12T11:14:59Z

A small PR with two changes; more details below.
I'm not sure whether main is the right branch to merge into.

UPD:

Tests for co-assemblies added: an end-to-end test and one for the _get_public_run_from_assembly function.
I also found a typo in the previous test_run_from_assembly that prevented it from testing the public request. That's how the typo in the URL (https://www.ebi.ac.uk/ena/browser/api/xml/analyses/xml/ instead of https://www.ebi.ac.uk/ena/browser/api/xml/ ) went unnoticed 🕵🏻
I've added a note to the README about single quotes in the password, for someone as silly as me 😄

ochkalova · 2025-12-12T11:19:11Z

genomeuploader/ena.py


    def _get_public_run_from_assembly(self):
-        url = f"{self.browser_url}/analyses/xml/{self.accession}"
+        url = f"{self.browser_url}/{self.accession}"


I don't know how it happened, but url = f"{self.browser_url}/analyses/xml/{self.accession}" expands to https://www.ebi.ac.uk/ena/browser/api/xml/analyses/xml/, which doesn't exist... I fixed it.

Thanks! It looks like it was copied from line #300 by mistake... it's been in the code for a while though.

I seem to remember that it's needed for private data. Does that one work?

Might be worth expanding tests for all possible API queries we have
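
To make the URL fix concrete, here is a minimal sketch of the string construction before and after the change. The base URL is inferred from the two URLs quoted in the PR description and is an assumption about what self.browser_url holds; the function names old_url and new_url are hypothetical, and the accession is the one used in the co-assembly fixture.

```python
# Assumed value of self.browser_url, inferred from the URLs quoted in the PR description.
BROWSER_URL = "https://www.ebi.ac.uk/ena/browser/api/xml"


def old_url(accession: str) -> str:
    # Pre-fix behaviour: "/analyses/xml" is appended to a base that already ends in "/xml".
    return f"{BROWSER_URL}/analyses/xml/{accession}"


def new_url(accession: str) -> str:
    # Post-fix behaviour: the accession is appended directly to the base.
    return f"{BROWSER_URL}/{accession}"


accession = "ERZ27228618"  # assembly accession from the co-assembly test fixture
print(old_url(accession))  # https://www.ebi.ac.uk/ena/browser/api/xml/analyses/xml/ERZ27228618
print(new_url(accession))  # https://www.ebi.ac.uk/ena/browser/api/xml/ERZ27228618
```

A test that asserts on the constructed URL, in addition to mocking the response, would catch this kind of path duplication without a live call.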

ochkalova · 2025-12-12T11:19:54Z

genomeuploader/ena.py

        def reformatter(xml_doc):
-            return xml_doc.getElementsByTagName("RUN_REF")[0].attributes["accession"].value
+            run_refs = xml_doc.getElementsByTagName("RUN_REF")
+            return sorted({node.getAttribute("accession") for node in run_refs})

        result = self._fetch_ena_data(url=url, mode="xml", reformatter=reformatter)
-        logger.info(f"public run from the assembly {self.accession} returned from ENA")
+        logger.info(f"public runs for assembly {self.accession} returned from ENA")


I changed this function to collect all RUN_REFs, not just the first one
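
For illustration, the new reformatter can be exercised on its own with xml.dom.minidom. The XML below is a hypothetical, heavily trimmed record with placeholder accessions, not a real ENA response; only the RUN_REF elements and their accession attributes matter here.

```python
from xml.dom.minidom import parseString

# Hypothetical, trimmed analysis record; real ENA XML contains many more elements.
SAMPLE_XML = """<ANALYSIS_SET>
  <ANALYSIS accession="ERZ0000001">
    <RUN_REF accession="ERR0000002"/>
    <RUN_REF accession="ERR0000001"/>
    <RUN_REF accession="ERR0000002"/>
  </ANALYSIS>
</ANALYSIS_SET>"""


def reformatter(xml_doc):
    # Collect every RUN_REF, drop duplicates via a set, and return a stable sorted list.
    run_refs = xml_doc.getElementsByTagName("RUN_REF")
    return sorted({node.getAttribute("accession") for node in run_refs})


runs = reformatter(parseString(SAMPLE_XML))
print(runs)  # ['ERR0000001', 'ERR0000002']
```

Note that the return type changes from a single string to a list, so callers that expected one accession need to handle a list.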

ochkalova · 2025-12-12T11:22:07Z

genomeuploader/genome_upload.py

In this file I only tweaked the TSV validation a little, because previously it failed when the accession was an assembly accession and co-assembly was TRUE.

ochkalova

I left descriptions of changes in the comments above ⬆️

KateSakharova

Thank you for adding the co-assembly fixes! I think we need more examples of how genome-uploader works (or doesn't) with co-assemblies.
Let's start with a couple of tests for the fixes you have made, especially the URL (if ENA really changed it, that should be covered by a real-call test).

KateSakharova

Very nice! I left small comments.
Maybe we should also modify the combine_ena_info function in that PR? I'm happy to discuss it.

genomeuploader/genome_upload.py

KateSakharova · 2026-01-13T13:50:53Z

tests/fixtures/input_coassembly_fixture.tsv

@@ -0,0 +1,2 @@
+genome_name	genome_path	accessions	assembly_software	binning_software	binning_parameters	stats_generation_software	completeness	contamination	genome_coverage	metagenome	co-assembly	broad_environment	local_environment	environmental_medium	rRNA_presence	NCBI_lineage
+SAMPLE_11_bin.1	./tests/fixtures/SAMPLE_11_bin.1.fa.gz	ERZ27228618	metaSPAdes_v3.13.0	metaWRAP_v1.2.1	--maxbin2 --metabat2 --concoct	CheckM_v1.1.2	95	0	69.16	human skin metagenome	TRUE	human skin	N-R,N	Skin wash,Skin swab	FALSE	d__Bacteria;p__Actinomycetota;c__Actinomycetes;o__Mycobacteriales;f__Lawsonellaceae;g__Lawsonella;s__Lawsonella clevelandensis


Would the script work if I specify a list of runs in the accessions column? If yes, we need a test for that. If not, it should be reflected in the docs.

When I first wrote the script, two things had to happen for it to accept co-assemblies:

- more than one accession had to be listed
- the co-assembly column had to be set to True

I haven't checked the code yet to see whether this has changed. I agree on both points: adding a test and maybe adding a line to the documentation.

Actually, this introduces a problem: what happens when multiple accessions are listed? We need to determine when to combine multiple accessions into a single co-assembly, and when to block execution instead because multiple co-assembly accessions are listed.

Or... just drop support for multiple accessions altogether.

We should also check that an ERR was not specified together with an ERZ in that case. So it should be either an ERZ or ERRs. Or should we support mixed input?

would script work if I specify a list of runs in accessions column? If yes, then we need a test for that. If no - it should be reflected in docs

@KateSakharova yes, and it works. I've added the test case for this in:
add test case of coassembly submission where "accessions" column cont…
…ains run_ids
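
The open questions in this thread (one ERZ versus several ERRs, mixed input, multiple assembly accessions) can be pinned down as an explicit validation step. The sketch below encodes one possible policy and is not the behaviour of genome_upload.py: the function name, the constant names, the comma separator, and the decision to reject mixed input are all assumptions made for illustration. The regular expressions follow the usual ENA run (ERR/DRR/SRR) and analysis (ERZ/DRZ/SRZ) accession shapes.

```python
import re

# Hypothetical constants; the shapes follow ENA run and analysis accession formats.
RUN_PATTERN = re.compile(r"^[EDS]RR[0-9]{6,}$")
ASSEMBLY_PATTERN = re.compile(r"^[EDS]RZ[0-9]{6,}$")


def validate_accessions(accessions: str, co_assembly: bool) -> list:
    """Parse a comma-separated accessions cell; raise ValueError under the assumed policy."""
    items = [a.strip() for a in accessions.split(",") if a.strip()]
    if not items:
        raise ValueError("no accessions given")

    runs = [a for a in items if RUN_PATTERN.match(a)]
    assemblies = [a for a in items if ASSEMBLY_PATTERN.match(a)]

    if len(runs) + len(assemblies) != len(items):
        raise ValueError(f"unrecognised accession in {items}")
    if runs and assemblies:
        # Assumed policy: mixing ERR and ERZ in one row is rejected.
        raise ValueError("run and assembly accessions cannot be mixed")
    if len(assemblies) > 1:
        # Assumed policy: a genome links to at most one assembly accession.
        raise ValueError("only one assembly accession is allowed per genome")
    if co_assembly and runs and len(runs) < 2:
        raise ValueError("a co-assembly listed by runs needs more than one run")
    if not co_assembly and len(runs) > 1:
        raise ValueError("multiple runs listed but co-assembly is not TRUE")
    return items


print(validate_accessions("ERZ27228618", co_assembly=True))
print(validate_accessions("ERR0000001,ERR0000002", co_assembly=True))
```

Whichever policy is chosen, stating it in one function makes it straightforward to document and to cover each branch with a test.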

ochkalova

🤓

ochkalova · 2026-01-14T15:25:58Z

genomeuploader/genome_upload.py

                            ena_query = EnaQuery(sample_accession, "sample", self.private)
                            sample_info = ena_query.build_query()

                            latitude, longitude = "missing: third party data", "missing: third party data"


@Ge94
From here until line 629, latitude and longitude are processed. Based on what's done, their value can only be a number (already rounded), "missing: third party data" (public data), or "not provided" (private data). Why do we have different assignments for private and public data when the value is NA?

If it's private data, whoever uses genome_uploader is either 1) the submitter or 2) in contact with the submitter, so the metadata can be enforced or set to "not provided" if absent. Otherwise, if it's public and the data are owned by someone else, we call them third-party data.

To be absolutely sure of this, we would need to check whether the referenced data actually belong to the Webin owner, plus a few more checks, but the explanation above is already quite close to reality. We tried to keep it simple...

ochkalova · 2026-01-14T15:36:51Z

genomeuploader/genome_upload.py

                                country = "missing: third party data"

                            collection_date = sample_info["collection_date"]
                            if collection_date.lower() in [


@Ge94 What do you think about moving those to a constant, ENA_ACCEPTED_NA?

Yes good point!

ochkalova · 2026-01-14T15:38:52Z

genomeuploader/genome_upload.py

                                or collection_date.lower() == "missing"
                                or collection_date.lower() in ["not available", "na"]


@Ge94 I also don't understand why we keep the list above but replace "missing" here, although ENA allows it too. Maybe keep "missing" unchanged?
Replacing ["not available", "na"] makes sense because those are not accepted.

"missing" used to be a valid value. The guidelines have since become stricter, and what you register now should contain a "missing:[reason]" explanation here. All these rules in the code were added manually over the last couple of years as fields changed without us being aware of it - we just kept encountering more and more bugs 🙈

However, you can still find "missing" in old registered samples, and since these are only guidelines, they might not be enforced (they "strongly encourage"). We try to follow best practices, but third-party data might be different.

Putting all of these in an "NA LIST" in the constants and checking every field against it would be the best approach. If there is a match, we then select a different field value depending on private/public.

ochkalova · 2026-01-14T15:42:30Z

genomeuploader/genome_upload.py

                                    raise IOError("Longitude could not be parsed. Check metadata for run {}.".format(run_accession))

                            if country not in GEOGRAPHIC_LOCATIONS:
                                country = "missing: third party data"


@Ge94 GEOGRAPHIC_LOCATIONS doesn't include the list of "ENA allowed NA synonyms", so here we basically normalise all NAs to "missing: third party data", which is fine but we don't normalise collection_date later 🤔

True, we should treat this field like the others: "missing: 3rd party data" if it's public, "not available" if it's private.

The point of this one is that if "country" is anything but an actual country, then it should be forced to a default "NA" value.

Ge94

Thank you so much Sonya, I have been wanting to refactor the co-assembly bit for years (no kidding). Lots of good stuff.

A summary of the points I have repeated across the comments:

- There are now two possible settings we need to check for co-assemblies: a list of "normal" runs/samples, and a single sample registered as a co-assembly. It would be good to have a test for both (one is already there).
- Restructuring all the NAs would make everything much cleaner. I support having a list of them under constants and checking against all of them regardless of the field.
- If a field matches an NA value, it should default to a different value depending on public/private.
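
The NA handling proposed above could look roughly like the sketch below. ENA_ACCEPTED_NA and default_on_na are names floated in this thread, but their contents and signature here are assumptions: the set holds only the three spellings mentioned in the discussion plus the empty string, and the private default uses "not provided" (the thread also mentions "not available", so the final choice is still open).

```python
# Hypothetical contents: only the spellings named in this thread, lowercased, plus "".
ENA_ACCEPTED_NA = {"", "missing", "not available", "na"}

PUBLIC_DEFAULT = "missing: third party data"  # data owned by someone else
PRIVATE_DEFAULT = "not provided"              # assumption; "not available" was also suggested


def default_on_na(value, private):
    """Return the visibility-specific default when value is an NA spelling, else the value."""
    text = (value or "").strip()
    if text.lower() in ENA_ACCEPTED_NA:
        return PRIVATE_DEFAULT if private else PUBLIC_DEFAULT
    return text


print(default_on_na("NA", private=False))         # missing: third party data
print(default_on_na("missing", private=True))     # not provided
print(default_on_na("2019-05-01", private=True))  # 2019-05-01
```

Applying one function to every field would give country, collection_date, latitude and longitude the same treatment, which is the inconsistency raised in the comments above.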

Ge94 · 2026-01-19T11:01:43Z

genomeuploader/genome_upload.py

+    Behaviour for handling multiple differing values is defined
+    in the BIN_SAMPLE_FIELDS dictionary ("if_multiple" key).
+    It also normalises missing values to a defined standard 
+    ("normalise_na" key). "biosample_field" key indicates the


maybe default_on_na instead of normalise?

Ge94 · 2026-01-19T11:30:40Z

genomeuploader/genome_upload.py

    Returns:
        None. Modifies genome_info in place.
    """
+    BIN_SAMPLE_FIELDS = {


This is so great, I just wonder if we should make the distinction for private/public like mentioned in the other discussion
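
One way the private/public distinction could sit inside a table-driven design is sketched below. The real BIN_SAMPLE_FIELDS dictionary is not shown in this thread, so everything here is hypothetical: the dictionary is deliberately named EXAMPLE_BIN_SAMPLE_FIELDS, its entries, the "na" and "join" policies, and the combine_field helper are invented for illustration, and only the key names "if_multiple" and "normalise_na" come from the docstring quoted above.

```python
# Hypothetical per-visibility defaults; the private value is still under discussion.
NA_DEFAULTS = {"public": "missing: third party data", "private": "not provided"}

# Hypothetical stand-in for BIN_SAMPLE_FIELDS; entries and policies are invented.
EXAMPLE_BIN_SAMPLE_FIELDS = {
    "country": {"if_multiple": "na", "normalise_na": NA_DEFAULTS},
    "collection_date": {"if_multiple": "join", "normalise_na": NA_DEFAULTS},
}


def combine_field(field, values, private):
    """Combine per-sample values for one field according to its table entry."""
    spec = EXAMPLE_BIN_SAMPLE_FIELDS[field]
    na_value = spec["normalise_na"]["private" if private else "public"]
    unique = sorted(set(values))
    if len(unique) == 1:
        return unique[0]
    if not unique or spec["if_multiple"] == "na":
        # No values at all, or conflicting values for a field that cannot be merged.
        return na_value
    return ",".join(unique)


print(combine_field("country", ["France", "Spain"], private=False))
print(combine_field("collection_date", ["2019", "2020"], private=True))
```

Keeping the defaults in the table means a rename such as default_on_na only touches the key name, not the combining logic.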

Ge94 · 2026-01-19T11:53:45Z

genomeuploader/genome_upload.py

@@ -427,14 +420,21 @@

        # check whether all co-assemblies have more than one run associated and vice versa


Suggested change

-        # check whether all co-assemblies have more than one run associated and vice versa
+        # raise error if the following checks fail:
+        # - co-assemblies are associated with more than one run
+        # - co-assemblies are associated with one assembly accession

KateSakharova and others added 5 commits November 28, 2025 11:41

Merge pull request #42 from EBI-Metagenomics/dev

3f144f2

Dev

resolve ValueError raised when bin linked to primary assembly instead…

3560f5d

… of set of runs and co-assembly=True

oops, fix incorrect column name

f6c64cc

fix url for XML request

07e3ecf

update _get_public_run_from_assembly function to support multiple RUN…

a11adb9

…_REFs

ochkalova requested a review from Ge94 December 12, 2025 11:14

ochkalova temporarily deployed to pypi December 12, 2025 11:15 — with GitHub Actions Inactive

correct typo

0ab26c5

ochkalova temporarily deployed to pypi December 12, 2025 11:16 — with GitHub Actions Inactive

ochkalova commented Dec 12, 2025

KateSakharova changed the base branch from main to dev December 17, 2025 12:28

KateSakharova self-requested a review December 17, 2025 12:28

KateSakharova assigned ochkalova Dec 17, 2025

KateSakharova requested changes Dec 17, 2025

ochkalova added 6 commits December 24, 2025 19:53

update the _get_private_run_from_assembly to return the same type as …

3b2da36

…_get_public_run_from_assembly

fix test_ena_run_from_assembly test and fixture

42122ac

add note about special characters in README, update .env.example to u…

fdde8a2

…se single quotes

add a test case for run id fetching for co-assembly

5d1c21c

add end-to-end test with co-assembly

fce508e

add files generated for unit test to gitignore

dc6f782

ochkalova temporarily deployed to pypi January 13, 2026 11:54 — with GitHub Actions Inactive

ochkalova requested a review from KateSakharova January 13, 2026 12:01

KateSakharova requested changes Jan 13, 2026

ochkalova added 2 commits January 13, 2026 15:57

refactor combine_ena_info to use sets for unique values and remove un…

a83fe55

…used multiple_element_set function

refactor combine_ena_info to reduce code duplication and hardcoding

0e4616c

ochkalova temporarily deployed to pypi January 13, 2026 17:51 — with GitHub Actions Inactive

add test case of coassembly submission where "accessions" column cont…

e1ce5cb

…ains run_ids

ochkalova temporarily deployed to pypi January 13, 2026 18:00 — with GitHub Actions Inactive

refactor: move regex patterns for accessions to constants

dd6e5b3

ochkalova had a problem deploying to pypi January 14, 2026 14:01 — with GitHub Actions Error

ochkalova temporarily deployed to pypi January 14, 2026 14:01 — with GitHub Actions Inactive

ochkalova had a problem deploying to pypi January 14, 2026 14:01 — with GitHub Actions Failure

ochkalova commented Jan 14, 2026

Ge94 requested changes Jan 19, 2026
