Add idx format #21289
base: dev
The tool Deacon (https://github.com/bede/deacon) needs this format. The format is not really documented, but the tool creates binary index files from FASTQ/FASTA files.
<datatype extension="rd" type="galaxy.datatypes.binary:Binary" mimetype="application/octet-stream" display_in_upload="true" subclass="true" description="Rdeval read sketch"/>
<datatype extension="safetensors" type="galaxy.datatypes.binary:Safetensors" mimetype="application/octet-stream" display_in_upload="true" description="A simple format for storing tensors safely (as opposed to pickle) and that is still fast (zero-copy)" description_url="https://huggingface.co/docs/safetensors/index"/>
<datatype extension="spatialdata.zip" type="galaxy.datatypes.binary:SpatialData" mimetype="application/octet-stream" display_in_upload="true" description="A data framework that comprises a FAIR storage format and a collection of python libraries for performant access, alignment, and processing of uni- and multi-modal spatial omics datasets" description_url="https://github.com/scverse/spatialdata"/>
<datatype extension="idx" type="galaxy.datatypes.binary:Binary" subclass="true" mimetype="application/octet-stream" display_in_upload="true" description="Binary index format which is compatible with Deacon (a tool coded in rust)."/>
idx seems overly generic. Maybe deacon_idx?
They use idx as the extension. Here is an example:
# Deplete long reads
deacon filter -d panhuman-1.k31w15.idx reads.fq -o filt.fq
We can change this if it causes no problem with the tool. Otherwise, we can still use idx on the command line, but the extension used for upload/download would be different, and I am not sure how changing it would affect a data manager (DM).
If you prefer, you could use deacon.idx, which would keep idx as the last extension.
Yes, this would work. I will change it.
@nsoranzo it should be changed now!
<datatype extension="rd" type="galaxy.datatypes.binary:Binary" mimetype="application/octet-stream" display_in_upload="true" subclass="true" description="Rdeval read sketch"/>
<datatype extension="safetensors" type="galaxy.datatypes.binary:Safetensors" mimetype="application/octet-stream" display_in_upload="true" description="A simple format for storing tensors safely (as opposed to pickle) and that is still fast (zero-copy)" description_url="https://huggingface.co/docs/safetensors/index"/>
<datatype extension="spatialdata.zip" type="galaxy.datatypes.binary:SpatialData" mimetype="application/octet-stream" display_in_upload="true" description="A data framework that comprises a FAIR storage format and a collection of python libraries for performant access, alignment, and processing of uni- and multi-modal spatial omics datasets" description_url="https://github.com/scverse/spatialdata"/>
<datatype extension="deacon.idx" type="galaxy.datatypes.binary:Binary" subclass="true" mimetype="application/octet-stream" display_in_upload="true" description="Binary index format which is compatible with Deacon (a tool coded in rust)."/>
- <datatype extension="deacon.idx" type="galaxy.datatypes.binary:Binary" subclass="true" mimetype="application/octet-stream" display_in_upload="true" description="Binary index format which is compatible with Deacon (a tool coded in rust)."/>
+ <datatype extension="deacon.idx" type="galaxy.datatypes.binary:Binary" subclass="true" mimetype="application/octet-stream" display_in_upload="true" description="Binary index format which is compatible with Deacon."/>
Another question is how to handle index versioning. Deacon has changed its index format multiple times already and they're only at version 0.13 of the tool. You probably have to define a dedicated class for this that inspects the file and stores the format version as metadata.
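A minimal sketch of the inspection step such a dedicated class could perform. The header layout used here (four magic bytes followed by a one-byte format version) is purely hypothetical, as are the names `HYPOTHETICAL_MAGIC`, `build_example_header` and `read_index_version`; the real Deacon layout is not described in this thread and would have to be confirmed upstream before writing an actual Galaxy sniffer.

```python
import struct
from typing import Optional

# ASSUMPTION: placeholder magic bytes, not Deacon's real ones.
HYPOTHETICAL_MAGIC = b"DCNX"
# ASSUMPTION: header is 4 magic bytes plus a 1-byte format version.
HEADER = struct.Struct("<4sB")


def build_example_header(version: int) -> bytes:
    """Build a header in the assumed layout (for illustration and tests only)."""
    return HEADER.pack(HYPOTHETICAL_MAGIC, version)


def read_index_version(path: str) -> Optional[int]:
    """Return the format version if the file starts with the assumed header.

    Returns None when the file is too short or the magic bytes do not match,
    which is the signal a sniffer would use to reject the file.
    """
    with open(path, "rb") as handle:
        raw = handle.read(HEADER.size)
    if len(raw) < HEADER.size:
        return None
    magic, version = HEADER.unpack(raw)
    if magic != HYPOTHETICAL_MAGIC:
        return None
    return version
```

In a real datatype class, the value returned here would be the piece of information stored as dataset metadata, so tools could check it before passing an index to Deacon.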
I assume that the version is stored somewhere in the metadata of the file itself. I checked with a hex dump whether there is a header, but I didn't check its content. So what would be a good approach for this? Write a sniffer?
Maybe just ask upstream about versioning and plans for the format. Based on this we can make a better decision.
Yes, the index format used to be tied to the hash table implementation, and the format was not as dense as possible, leading to some churn. As mentioned in bede/deacon#70, format version 3 stores raw 2-bit k-mers, meaning the hash table implementation could be changed in a future Deacon version for faster performance without breaking the index and requiring a new format.
As of 0.13.0 I consider Deacon more or less feature complete, and 1.0.0 may come soon. I do not anticipate needing to change the index format again.
And yes, deacon index info shows the index version. Index versions are validated at runtime and fail loudly.
Let me know if you have any further questions, I'm excited to see Galaxy integration.
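To illustrate why storing raw 2-bit k-mers decouples the file from any particular hash table, here is a small sketch of 2-bit packing. The bit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions chosen for illustration; Deacon's actual encoding may differ.

```python
# ASSUMPTION: this base-to-bits mapping is illustrative, not Deacon's documented one.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer using 2 bits per base (first base is most significant)."""
    value = 0
    for base in kmer.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_kmer(value: int, k: int) -> str:
    """Unpack an integer produced by encode_kmer back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))
```

With this scheme a 31-mer occupies 62 bits, so it fits in a single 64-bit integer, and a reader can rebuild whatever lookup structure it prefers from the packed values alone.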
Thank you both for your comments! This should help us find a way to integrate Deacon and its index format into Galaxy!
I opened a PR for the tool and the DM for Deacon, so if anything in the format definition needs to change based on the DM or the wrapper, we can change it: galaxyproject/tools-iuc#7473