feat: support for chunked arrays #58
Conversation
Dask tests fail, probably because:
I will try it later, but one lead to fix this would be to open a fresh reader per call, like so:

import numpy as np
from xarray.backends import BackendArray


class SelafinLazyArray(BackendArray):
    def __init__(self, slf_reader_or_path, var, dtype, shape):
        # Accept either a reader instance or a path/reader-factory
        self.slf_reader = slf_reader_or_path
        self.var = var
        self.dtype = dtype
        self.shape = shape
    def _open_reader(self):
        # If self.slf_reader is already a path or factory, open a fresh reader
        if isinstance(self.slf_reader, str):
            return SelafinReader(self.slf_reader)  # <-- adapt to your reader constructor
        # If provided a reader instance, try to get its path attribute
        if hasattr(self.slf_reader, "filepath"):
            return SelafinReader(self.slf_reader.filepath)
        # Otherwise, fall back to assuming 'slf_reader' is already a lightweight factory
        return self.slf_reader
    def _raw_indexing_method(self, key):
        ...
        for it, t in enumerate(time_indices):
            t_int = int(t)
            # open a fresh reader for this task
            reader = self._open_reader()
            try:
                temp = np.asarray(reader.read_var_in_frame(t_int, self.var))
            finally:
                # close if the reader supports close()
                if hasattr(reader, "close"):
                    try:
                        reader.close()
                    except Exception:
                        pass
        ...

@lucduron what do you think?
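For illustration only, a minimal sketch of how such a lazy array might then be handed to xarray; the file name, the variable name "S" and the use of LazilyIndexedArray are assumptions, not code from this PR:

import numpy as np
from xarray.core import indexing

# Pass the plain file path so _raw_indexing_method can open (and close) a
# short-lived reader inside each dask task instead of sharing one handle.
lazy = SelafinLazyArray(
    "r2d_large.slf",      # path string, handled by _open_reader above
    var="S",              # placeholder variable name
    dtype=np.float32,
    shape=(409, 215362),  # (frames, nodes); sizes of the benchmark file below
)
data = indexing.LazilyIndexedArray(lazy)  # lazy wrapper used by xarray backends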
- Use a dask.utils.SerializableLock instance to avoid conflicts while reading (a sketch follows this list)
- Refactor some variables in loops to be more precise
- Add a SerafinRequestError in case of buffer truncation (in Serafin.read_var_in_frame)
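As a minimal sketch of the first point (not the exact code from the commits; the helper name read_frame_locked is a placeholder, while read_var_in_frame is the existing Serafin method):

import numpy as np

from dask.utils import SerializableLock

# A single lock shared by all read tasks; unlike a plain threading.Lock,
# a SerializableLock survives pickling when dask serializes the task graph.
SLF_LOCK = SerializableLock()


def read_frame_locked(reader, time_index, var):
    # Only one task at a time touches the underlying file handle.
    with SLF_LOCK:
        return np.asarray(reader.read_var_in_frame(time_index, var))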
Hi @tomsail, Sorry for the delay, I had a look at the problem only today. The problem is not linked to the Serafin class, as I can use it with multiprocessing/threading on the same file without problems. But with xarray, it seems that some locks are expected, and I tried to follow the documentation. The performance should now be checked to know whether this dask implementation is worthwhile (of course, we currently read the same buffer multiple times when chunking on nodes, for example). Hope it helps,
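As an illustration of that last point, a minimal sketch (the dimension name "time" and the chunk size are assumptions): chunking along time only lets each frame buffer be read once, whereas chunking along the node dimension forces several tasks to re-read the same frame.

import xarray as xr

from xarray_selafin.xarray_backend import SelafinBackendEntrypoint

# Hypothetical example: chunk only along the (assumed) "time" dimension so
# that every frame buffer is read by a single task.
ds = xr.open_dataset(
    "r2d_large.slf",
    engine=SelafinBackendEntrypoint,
    lang="fr",
    chunks={"time": 10},
)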
Hi @tomsail, Could you try to check if #58 is now fixed with my corrections? Moreover, I did 2 additional commits in this branch to:
I am aware that this is not linked to chunked arrays, so feel free to cherry-pick my commits into another branch/PR. Best Regards,
Hi @tomsail, I added 5 minor commits today.
Hi @tomsail, I did a quick benchmark on a large 2D Selafin file (3.62 GB) to quantify the gain in IO (reading/writing frames in series). Here is the snippet used:

import os
from time import perf_counter, sleep
import xarray as xr
from xarray_selafin.Serafin import Read, Write
from xarray_selafin.xarray_backend import SelafinBackendEntrypoint
slf_in = "r2d_large.slf"  # 3.62 GB (2D file with 215362 nodes, 429285 elements, 409 frames, 11 variables)
slf_out = "temp_out.slf"
lang = "fr"
def remove_slf_out():
    if os.path.exists(slf_out):
        os.remove(slf_out)
        sleep(2.0)


def compare_slf():
    with open(slf_in, "rb") as in_slf1, open(slf_out, "rb") as in_slf2:
        assert in_slf1.read() == in_slf2.read()


remove_slf_out()
t1 = perf_counter()
with Read(slf_in, lang) as resin:
    with Write(slf_out, lang, overwrite=True) as resout:
        resin.read_header()
        resin.get_time()
        out_header = resin.header
        resout.write_header(out_header)
        for time_index, time in enumerate(resin.time):
            resout.write_entire_frame(out_header, time, resin.read_vars_in_frame(time_index=time_index))
t2 = perf_counter()
print("Pure PyTelTools: ", t2 - t1)

compare_slf()
remove_slf_out()

t3 = perf_counter()
ds = xr.open_dataset(
    slf_in,
    disable_lock=True,  # remove this arg for main
    lazy_loading=False,
    engine=SelafinBackendEntrypoint,
    lang=lang,
)
ds.selafin.write(slf_out)
t4 = perf_counter()
print("xarray without lazy_loading: ", t4 - t3)

compare_slf()
remove_slf_out()
t5 = perf_counter()
ds = xr.open_dataset(
    slf_in,
    disable_lock=True,  # remove this arg for main
    lazy_loading=True,
    engine=SelafinBackendEntrypoint,
    lang=lang,
)
ds.selafin.write(slf_out)
t6 = perf_counter()
print("xarray with lazy_loading: ", t6 - t5)

compare_slf()

Below are the results in detail:
For dask, it is now working but probably not fully optimized/adapted; I will not have time to investigate in the short term. I do not plan to commit other corrections or improvements in the coming days ;). Luc
For reference: #57