
Conversation

@geishm-ansto
Contributor

Add a new flatbuffer schema, useful at ANSTO, to support logging and variable-length strings in the nxs file writer.
The README was updated and there are no breaking changes.

Approval Criteria

This PR should not be merged until the ECDC Group Leader (acting or permanent) has given their explicit approval in the comments section.
SCIPP/DRAM should also be consulted on changes which may affect them.

@rerpha

rerpha commented Oct 6, 2025

Hi @geishm-ansto, we were thinking of doing something similar at ISIS (nexusformat/definitions#1432): first adding a new NeXus base class that supports strings in the same way NXlog works, then adding a flatbuffers schema here for EPICS string PV value updates. I don't know if this fits your use case too?

@geishm-ansto
Contributor Author

Hi @rerpha, it's possible that we could use it, but I would need to see the details. At the moment we have implemented vs00 in a local ANSTO variant of the streaming data types and nxs writer. We didn't want to add a separate Python package just for reading the data type, so we raised the PR to see if there was any interest. We wanted to be able to capture system log messages during an experiment using a variable-length string format.

@ggoneiESS
Member

For some reason we didn't get a notification on this, so I'm adding a comment to make sure I get updates. But yes, the current issue is writing it out in a NeXus-y way.

@ggoneiESS
Member

I was reminded about this today.

Are either of you using this in the filewriter? I have worked with strings before in the modules there and it can be a bit difficult when using variable lengths.

An additional entry in the flatbuffer specifying the size of the string would be an improvement.

It would also be good to know about use cases.

@geishm-ansto
Contributor Author

@ggoneiESS Hi, we have a local ANSTO branch of the filewriter, and within it I have added support for the 'vs00' flatbuffer. We use it primarily to record logging events. It required adding a VariableString class to the ExtensibleDataset component and a vs00 writer module.
I believe adding the string length is not necessary, as it is already managed at the lower level.

/// \brief A chunked, extensible 1D dataset of variable-length strings.
class VariableString : public hdf5::node::ChunkedDataset {
public:
  VariableString() = default;
  /// \brief Create/open a variable-length string dataset.
  ///
  /// \param Parent The group/node where the dataset is to be located.
  /// \param Name The name of the dataset.
  /// \param CMode Should the dataset be opened or created.
  /// \param ChunkSize The number of strings in one chunk.
  VariableString(const hdf5::node::Group &Parent, std::string Name, Mode CMode,
                 size_t ChunkSize = 1024);

  /// \brief Append a new string to the dataset array.
  ///
  /// \param InString The string that is to be appended to the dataset.
  void appendStringElement(std::string const &InString);

private:
  hdf5::datatype::String StringType;
  size_t NrOfStrings{0};
};

VariableString::VariableString(const hdf5::node::Group &Parent,
                               std::string Name, Mode CMode, size_t ChunkSize)
    : hdf5::node::ChunkedDataset(),
      StringType(hdf5::datatype::String::variable()) {

  if (Mode::Create == CMode) {
    // New dataset: start empty, unlimited along the single dimension.
    hdf5::Dimensions ChunkDims{ChunkSize};
    hdf5::dataspace::Simple Space({0}, {hdf5::dataspace::Simple::unlimited});
    Dataset::operator=(hdf5::node::ChunkedDataset(
        Parent, Name, StringType, Space, ChunkDims));
  } else if (Mode::Open == CMode) {
    // Existing dataset: resume appending after the last stored string.
    Dataset::operator=(Parent.get_dataset(Name));
    NrOfStrings = static_cast<size_t>(dataspace().size());
  } else {
    throw std::runtime_error(
        "VariableString::VariableString(): Unknown mode.");
  }
}

void VariableString::appendStringElement(std::string const &InString) {
  Dataset::extent(0, 1); // Extend by 1 element along dimension 0
  // Select the newly added last element and write the string into it.
  hdf5::dataspace::Hyperslab Selection{{NrOfStrings}, {1}};
  write(InString, Selection);
  ++NrOfStrings;
}

@rerpha

rerpha commented Jan 12, 2026

We aren't using it currently, but may do in the future for generic string diagnostics, even if/when nexusformat/definitions#1590 is accepted and we make a new schema for SE strings.

@ggoneiESS
Member

I have done a bit of a refresher, but haven't done a deep-dive into the implementation in hdf5 2.0.0 (we aren't using that yet but we will this year).

I still worry a bit about the idea of variable-length strings. If this is used rarely (in comparison to e.g. detector data etc) it's not a big deal but:

  • variable-length datasets cannot be compressed
  • the data no longer exists contiguously (it necessarily becomes an array of pointers to strings, rather than just raw data)

And (academic but technical arguments)

  • heap storage requires more space than regular 'raw data' storage (i.e. how the HDF5 object exists in memory)
  • general reduction in I/O efficiency because it requires individual write operations for each data element rather than one write per dataset chunk (actually, chunking isn't allowed at all)
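The layout difference behind these points can be sketched with the standard library alone. This is only an in-memory illustration, not HDF5 code: `packFixedWidth` and `variablePayloadBytes` are hypothetical helper names, and the truncate-and-NUL-pad policy is an assumption. Fixed-width slots give one contiguous buffer that can be written in a single operation at the cost of padding, while variable-length storage keeps only the payload bytes but needs a separate reference per string.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: pack strings into one contiguous buffer made of
// fixed-width, NUL-padded slots. Strings longer than Width are truncated
// (an assumed policy, chosen only for this illustration).
std::vector<char> packFixedWidth(const std::vector<std::string> &Messages,
                                 std::size_t Width) {
  std::vector<char> Buffer(Messages.size() * Width, '\0');
  for (std::size_t I = 0; I < Messages.size(); ++I) {
    const std::size_t N = std::min(Width, Messages[I].size());
    std::copy_n(Messages[I].data(), N, Buffer.data() + I * Width);
  }
  return Buffer;
}

// Hypothetical helper: total payload bytes when every string is stored at
// its own length. Per-string references and heap overhead are not counted.
std::size_t variablePayloadBytes(const std::vector<std::string> &Messages) {
  std::size_t Total = 0;
  for (const auto &Message : Messages) {
    Total += Message.size();
  }
  return Total;
}
```

With mostly short log messages and an occasional long one, the fixed-width buffer is dominated by padding (or truncates the long message), which is the storage side of the trade-off against write performance.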

For us, performance is definitely at more of a premium than storage.

I found this via the HDF5 clinic - https://steven-varga.ca/blog/hdf5-fixed-vs-variable-benchmark/ and it provides a C++ source file. It might be possible to incorporate it into a filewriter test.
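As a rough idea of what such a test could look like, here is a minimal timing harness, assuming the append operation can be passed in as a callable. `BenchResult` and `timeAppends` are hypothetical names, not part of the filewriter or of the linked benchmark; in a real test the callable would wrap something like `appendStringElement` on a fixed-length and on a variable-length dataset in turn.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical result type: how many appends ran and how long they took.
struct BenchResult {
  std::size_t Calls;
  double Seconds;
};

// Hypothetical harness: call Append with the same message Count times and
// measure the elapsed wall-clock time with a monotonic clock.
BenchResult timeAppends(std::size_t Count,
                        const std::function<void(const std::string &)> &Append) {
  const std::string Message = "example log message";
  const auto Start = std::chrono::steady_clock::now();
  for (std::size_t I = 0; I < Count; ++I) {
    Append(Message);
  }
  const std::chrono::duration<double> Elapsed =
      std::chrono::steady_clock::now() - Start;
  return {Count, Elapsed.count()};
}
```

Running the same harness against both dataset types with an identical message stream would give a like-for-like comparison of append cost, which is the number that matters if performance is valued over storage.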


3 participants