Conversation

@SantaMcCloud
Contributor

(Please replace this header with a description of your pull request. Please include BOTH what you did and why you made the changes. The "why" may simply be citing a relevant Galaxy issue.)
(If fixing a bug, please add any relevant error or traceback)
(For UI components, it is recommended to include screenshots or screencasts)

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot added this to the 26.0 milestone Nov 13, 2025
@SantaMcCloud
Contributor Author

This tool, https://github.com/bede/deacon, needs this format. There is no real documentation for it, but the tool creates a binary index file from FASTQ/FASTA files.

<datatype extension="rd" type="galaxy.datatypes.binary:Binary" mimetype="application/octet-stream" display_in_upload="true" subclass="true" description="Rdeval read sketch"/>
<datatype extension="safetensors" type="galaxy.datatypes.binary:Safetensors" mimetype="application/octet-stream" display_in_upload="true" description="A simple format for storing tensors safely (as opposed to pickle) and that is still fast (zero-copy)" description_url="https://huggingface.co/docs/safetensors/index"/>
<datatype extension="spatialdata.zip" type="galaxy.datatypes.binary:SpatialData" mimetype="application/octet-stream" display_in_upload="true" description="A data framework that comprises a FAIR storage format and a collection of python libraries for performant access, alignment, and processing of uni- and multi-modal spatial omics datasets" description_url="https://github.com/scverse/spatialdata"/>
<datatype extension="idx" type="galaxy.datatypes.binary:Binary" subclass="true" mimetype="application/octet-stream" display_in_upload="true" description="Binary index format which is compatible with Deacon (a tool coded in rust)."/>
Member

idx seems overly generic, maybe deacon_idx ?

Contributor Author

They use idx as the extension. Here is an example:

# Deplete long reads
deacon filter -d panhuman-1.k31w15.idx reads.fq -o filt.fq

We can change this if it causes no problem for the tool. Otherwise, we can still use idx on the command line, but upload/download would use a different extension, and I am not sure how changing it would affect a data manager (DM).

Member

If you prefer, you could use deacon.idx, which would keep idx as the final extension.

Contributor Author

Yes, this would work. I will change it.

Contributor Author

@SantaMcCloud Nov 13, 2025

@nsoranzo it should be changed now!

<datatype extension="rd" type="galaxy.datatypes.binary:Binary" mimetype="application/octet-stream" display_in_upload="true" subclass="true" description="Rdeval read sketch"/>
<datatype extension="safetensors" type="galaxy.datatypes.binary:Safetensors" mimetype="application/octet-stream" display_in_upload="true" description="A simple format for storing tensors safely (as opposed to pickle) and that is still fast (zero-copy)" description_url="https://huggingface.co/docs/safetensors/index"/>
<datatype extension="spatialdata.zip" type="galaxy.datatypes.binary:SpatialData" mimetype="application/octet-stream" display_in_upload="true" description="A data framework that comprises a FAIR storage format and a collection of python libraries for performant access, alignment, and processing of uni- and multi-modal spatial omics datasets" description_url="https://github.com/scverse/spatialdata"/>
<datatype extension="deacon.idx" type="galaxy.datatypes.binary:Binary" subclass="true" mimetype="application/octet-stream" display_in_upload="true" description="Binary index format which is compatible with Deacon (a tool coded in rust)."/>
Member

Suggested change
- <datatype extension="deacon.idx" type="galaxy.datatypes.binary:Binary" subclass="true" mimetype="application/octet-stream" display_in_upload="true" description="Binary index format which is compatible with Deacon (a tool coded in rust)."/>
+ <datatype extension="deacon.idx" type="galaxy.datatypes.binary:Binary" subclass="true" mimetype="application/octet-stream" display_in_upload="true" description="Binary index format which is compatible with Deacon."/>

Member

Another question is how to handle index versioning. Deacon has changed its index format multiple times already and they're only at version 0.13 of the tool. You probably have to define a dedicated class for this that inspects the file and stores the format version as metadata.
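
To make the "inspect the file and store the format version" idea concrete, here is a minimal sketch of a header reader. The header layout is purely hypothetical: the magic bytes `DIDX`, the 32-bit little-endian version field, and the names `HYPOTHETICAL_MAGIC`, `HEADER`, and `read_format_version` are assumptions for illustration, not Deacon's actual on-disk format, which would need to be confirmed upstream.

```python
import io
import struct
from typing import BinaryIO, Optional

# Hypothetical layout, NOT taken from Deacon: 4 magic bytes followed by
# a little-endian unsigned 32-bit format version.
HYPOTHETICAL_MAGIC = b"DIDX"
HEADER = struct.Struct("<4sI")


def read_format_version(handle: BinaryIO) -> Optional[int]:
    """Return the format version, or None if the header does not match."""
    raw = handle.read(HEADER.size)
    if len(raw) < HEADER.size:
        return None  # too short to contain the assumed header
    magic, version = HEADER.unpack(raw)
    if magic != HYPOTHETICAL_MAGIC:
        return None  # not the assumed file type
    return version


if __name__ == "__main__":
    # Build a fake index in memory that follows the assumed layout.
    demo = io.BytesIO(HEADER.pack(HYPOTHETICAL_MAGIC, 3) + b"payload")
    print(read_format_version(demo))
```

In a Galaxy datatype, a helper like this would presumably be called from a dedicated Binary subclass, once when sniffing (a non-None result means a match) and once when setting metadata to record the version.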

Contributor Author

I assume the version is stored somewhere in the metadata in the file itself. I checked with hexasum whether there is a header, but I didn't check its content. So what would be a good approach here? Write a sniffer?

Contributor

Maybe just ask upstream about versioning and plans for the format. Based on this we can make a better decision.

Copy link
Contributor Author

Copy link

Yes, the index format used to be tied to the hash table implementation, and the format was not as dense as possible, leading to some churn. As mentioned in bede/deacon#70, format version 3 stores raw 2-bit k-mers, meaning the hash table implementation could be changed in a future Deacon version for faster performance without breaking the index and requiring a new format.

As of 0.13.0 I consider Deacon more or less feature complete, and 1.0.0 may come soon. I do not anticipate needing to change the index format again.

And yes, the deacon index info command shows the index version. Index versions are validated at runtime and fail loudly.

Let me know if you have any further questions, I'm excited to see Galaxy integration.

Contributor Author

Thank you both for your comments! This should help us find a way to bring Deacon and its index format into Galaxy!

@SantaMcCloud
Contributor Author

I opened a PR for the Deacon tool and its DM, so if anything about the new format needs to change based on the DM or the wrapper, we can change it! galaxyproject/tools-iuc#7473
