Fixes to support co assembly in "accessions" column#43
Fixes to support co assembly in "accessions" column#43
Conversation
… of set of runs and co-assembly=True
|
|
||
| def _get_public_run_from_assembly(self): | ||
| url = f"{self.browser_url}/analyses/xml/{self.accession}" | ||
| url = f"{self.browser_url}/{self.accession}" |
There was a problem hiding this comment.
I don't know how it happened, but url = f"{self.browser_url}/analyses/xml/{self.accession}" turns into https://www.ebi.ac.uk/ena/browser/api/xml/analyses/xml/ that doesn't exist... I fixed it
There was a problem hiding this comment.
Thanks! It looks like it was copied from line #300 by mistake... it's been in the code for a while though.
I think to remember that it's needed for private data, does that one work?
Might be worth expanding tests for all possible API queries we have
| def reformatter(xml_doc): | ||
| return xml_doc.getElementsByTagName("RUN_REF")[0].attributes["accession"].value | ||
| run_refs = xml_doc.getElementsByTagName("RUN_REF") | ||
| return sorted({node.getAttribute("accession") for node in run_refs}) | ||
|
|
||
| result = self._fetch_ena_data(url=url, mode="xml", reformatter=reformatter) | ||
| logger.info(f"public run from the assembly {self.accession} returned from ENA") | ||
| logger.info(f"public runs for assembly {self.accession} returned from ENA") |
There was a problem hiding this comment.
I changed this function to collect all RUN_REFs, not just the first one
There was a problem hiding this comment.
In this file I only tweaked TSV validation a little, because previously it failed when accession was an assembly accession and co-assembly = true
KateSakharova
left a comment
There was a problem hiding this comment.
Thank you for adding co-assembly fixes! I think we need more examples how genome-uploader works (or not) with co-assemblies.
Lets start with a couple of tests for fixes that you have done, especially url (if ENA really changed it - it should be covered in real call test).
…_get_public_run_from_assembly
KateSakharova
left a comment
There was a problem hiding this comment.
Very nice! I left small comments.
Maybe we should also modify combine_ena_info function in that PR? I'm happy to discuss it.
| @@ -0,0 +1,2 @@ | |||
| genome_name genome_path accessions assembly_software binning_software binning_parameters stats_generation_software completeness contamination genome_coverage metagenome co-assembly broad_environment local_environment environmental_medium rRNA_presence NCBI_lineage | |||
| SAMPLE_11_bin.1 ./tests/fixtures/SAMPLE_11_bin.1.fa.gz ERZ27228618 metaSPAdes_v3.13.0 metaWRAP_v1.2.1 --maxbin2 --metabat2 --concoct CheckM_v1.1.2 95 0 69.16 human skin metagenome TRUE human skin N-R,N Skin wash,Skin swab FALSE d__Bacteria;p__Actinomycetota;c__Actinomycetes;o__Mycobacteriales;f__Lawsonellaceae;g__Lawsonella;s__Lawsonella clevelandensis No newline at end of file | |||
There was a problem hiding this comment.
would script work if I specify a list of runs in accessions column? If yes, then we need a test for that. If no - it should be reflected in docs
There was a problem hiding this comment.
When I first wrote the script two things had to happen to accept co-assemblies:
- more than one accession had to be listed
- the co-assembly column had to be set to True
I haven't checked the code yet to see if this changed. I agree on both - adding a test and maybe add a line in the documentation
There was a problem hiding this comment.
Actually this introduces the problem: what happens when multiple accessions are listed? We need to determine when to combine multiple accessions into a single co-assembly, and instead when to block the execution if multiple co-assembly accessions are listed
There was a problem hiding this comment.
Or.... Just drop the support for multiple accessions altogether
There was a problem hiding this comment.
and also check that ERR was not specified with ERZ in that case. So it should be ERZ or ERRs. Or we should support mixed?
There was a problem hiding this comment.
would script work if I specify a list of runs in accessions column? If yes, then we need a test for that. If no - it should be reflected in docs
@KateSakharova yes, and it works. I've added the test case for this in:
add test case of coassembly submission where "accessions" column cont…
…used multiple_element_set function
| ena_query = EnaQuery(sample_accession, "sample", self.private) | ||
| sample_info = ena_query.build_query() | ||
|
|
||
| latitude, longitude = "missing: third party data", "missing: third party data" |
There was a problem hiding this comment.
@Ge94
Starting from here and until line 629 latitude, longitude are processed. Based on what's done, their value can be only number (already rounded) or "missing: third party data" (public data) or "not provided" (private data). Why do we have different assignments for private and public data in case of na?
There was a problem hiding this comment.
If it's private data, who uses the genome_uploader 1) is either the submitter 2) or in contact with the submitter, therefore metadata can be reinforced or set as "not provided" if absent. Otherwise, if it's public, and data are owned by someone else, then we call them 3rd party data.
To be absolutely sure of this we should start checking if referenced data actually belong to the Webin owner and a few more checks, but the explanation above is already quite close to reality - we tried to make it simple...
| country = "missing: third party data" | ||
|
|
||
| collection_date = sample_info["collection_date"] | ||
| if collection_date.lower() in [ |
There was a problem hiding this comment.
@Ge94 What do you think about moving those to constant ENA_ACCEPTED_NA?
| or collection_date.lower() == "missing" | ||
| or collection_date.lower() in ["not available", "na"] |
There was a problem hiding this comment.
@Ge94 I also don't understand why we are keeping the list above, but here replace "missing" although ENA allows it too, maybe keep "missing" unchanged?
["not available", "na"] replacement makes sense because those are not accepted
There was a problem hiding this comment.
"missing" used to be a valid value - now guidelines got stricter, and what you register now should contain "missing:[reason]" explanation here. All these rules in the code have been added manually in the last couple years as fields were changing and we were not aware yet - we were just encountering more and more bugs 🙈
However, you can still find "missing" in old registered samples, and since it's only guidelines, they might not be reinforced (they "strongly encourage"). We try to follow best practices, but 3rd party data might be different
Putting all these in an "NA LIST" in the constants and check every field against those would be the best thing. If there is a match, we then select a different field value based on private/public
| raise IOError("Longitude could not be parsed. Check metadata for run {}.".format(run_accession)) | ||
|
|
||
| if country not in GEOGRAPHIC_LOCATIONS: | ||
| country = "missing: third party data" |
There was a problem hiding this comment.
@Ge94 GEOGRAPHIC_LOCATIONS doesn't include the list of "ENA allowed NA synonyms", so here we basically normalise all NAs to "missing: third party data", which is fine but we don't normalise collection_date later 🤔
There was a problem hiding this comment.
True, we should treat this field as the others: "missing: 3rd party data" if it's public, "not available" if it's private.
There was a problem hiding this comment.
The point of this one is that if "country" is anything but an actual country, then it should forced to a default "NA" value
Ge94
left a comment
There was a problem hiding this comment.
Thank you so much Sonya, I have been wanting to refactor the co-assembly bit in years (no kidding). Lots of good stuff.
A summary of things I have repeated across the comments:
- There are now two possible settings we need to check for co-assemblies: a list of "normal" runs/samples, and a single sample registered as co-assemblies. It would be good to have a test for both (one is already there)
- Restructuring all the NAs would make everything much cleaner. I support having a list of them under constants and checking against all of them regardless of the field
- If a field matches an NA value, it should be defaulted to a different value depending on public/private.
|
|
||
| def _get_public_run_from_assembly(self): | ||
| url = f"{self.browser_url}/analyses/xml/{self.accession}" | ||
| url = f"{self.browser_url}/{self.accession}" |
There was a problem hiding this comment.
Thanks! It looks like it was copied from line #300 by mistake... it's been in the code for a while though.
I think to remember that it's needed for private data, does that one work?
Might be worth expanding tests for all possible API queries we have
| Behaviour for handling multiple differing values is defined | ||
| in the BIN_SAMPLE_FIELDS dictionary ("if_multiple" key). | ||
| It also normalises missing values to a defined standard | ||
| ("normalise_na" key). "biosample_field" key indicates the |
There was a problem hiding this comment.
maybe default_on_na instead of normalise?
| ena_query = EnaQuery(sample_accession, "sample", self.private) | ||
| sample_info = ena_query.build_query() | ||
|
|
||
| latitude, longitude = "missing: third party data", "missing: third party data" |
There was a problem hiding this comment.
If it's private data, who uses the genome_uploader 1) is either the submitter 2) or in contact with the submitter, therefore metadata can be reinforced or set as "not provided" if absent. Otherwise, if it's public, and data are owned by someone else, then we call them 3rd party data.
To be absolutely sure of this we should start checking if referenced data actually belong to the Webin owner and a few more checks, but the explanation above is already quite close to reality - we tried to make it simple...
| Returns: | ||
| None. Modifies genome_info in place. | ||
| """ | ||
| BIN_SAMPLE_FIELDS = { |
There was a problem hiding this comment.
This is so great, I just wonder if we should make the distinction for private/public like mentioned in the other discussion
| @@ -427,14 +420,21 @@ | |||
|
|
|||
| # check whether all co-assemblies have more than one run associated and vice versa | |||
There was a problem hiding this comment.
| # check whether all co-assemblies have more than one run associated and vice versa | |
| # raise error it the following check fails: | |
| # - co-assemblies are associated with more than one run | |
| # - co-assemblies are associated with one assembly accession |
| raise IOError("Longitude could not be parsed. Check metadata for run {}.".format(run_accession)) | ||
|
|
||
| if country not in GEOGRAPHIC_LOCATIONS: | ||
| country = "missing: third party data" |
There was a problem hiding this comment.
True, we should treat this field as the others: "missing: 3rd party data" if it's public, "not available" if it's private.
| raise IOError("Longitude could not be parsed. Check metadata for run {}.".format(run_accession)) | ||
|
|
||
| if country not in GEOGRAPHIC_LOCATIONS: | ||
| country = "missing: third party data" |
There was a problem hiding this comment.
The point of this one is that if "country" is anything but an actual country, then it should forced to a default "NA" value
| country = "missing: third party data" | ||
|
|
||
| collection_date = sample_info["collection_date"] | ||
| if collection_date.lower() in [ |
| or collection_date.lower() == "missing" | ||
| or collection_date.lower() in ["not available", "na"] |
There was a problem hiding this comment.
"missing" used to be a valid value - now guidelines got stricter, and what you register now should contain "missing:[reason]" explanation here. All these rules in the code have been added manually in the last couple years as fields were changing and we were not aware yet - we were just encountering more and more bugs 🙈
However, you can still find "missing" in old registered samples, and since it's only guidelines, they might not be reinforced (they "strongly encourage"). We try to follow best practices, but 3rd party data might be different
Putting all these in an "NA LIST" in the constants and check every field against those would be the best thing. If there is a match, we then select a different field value based on private/public
Little PR with 2 changes, more details below.
I don't know if main is the right branch to merge with?
UPD:
https://www.ebi.ac.uk/ena/browser/api/xml/analyses/xml/instead ofhttps://www.ebi.ac.uk/ena/browser/api/xml/) became unnoticed 🕵🏻