Skip to content

Conversation

sofroniewn
Copy link

@sofroniewn sofroniewn commented Sep 6, 2019

This PR closes #199 by adding support for ragged nD arrays, where each array can be a different shape and dimensionality. It does this using the scheme described in #199.

It's usage is as follows:

import numcodecs
import numpy as np
x = np.array([[1, 3, 5], [[4, 3], [2, 1]], [[7, 9]]], dtype='object')
codec = numcodecs.VLenNDArray('<i4')
codec.decode(codec.encode(x))
array([array([1, 3, 5], dtype=int32), array([[4, 3], [2, 1]], dtype=int32),
    array([[7, 9]]], dtype=int32)], dtype=object)

I have not added tests yet, but will do so. I will adapt the tests from test_vlen_array.py.

Any comments on the implementation are appreciated. I'm pretty new to this code base, so may have made some wrong choices.

Oh and I also seem to have a bunch of .c files the came when I ran cythonize -a -i ./numcodecs/vlen_nd.pyx that I may or may not have wanted to change, any advice around those would be appreciated too.

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py37 passes locally
  • tox -e py27 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@sofroniewn
Copy link
Author

I have now added tests in test_vlen_ndarray.py based on the tests for test_vlen_array.py

@sofroniewn
Copy link
Author

hmm - tests are failing on from numcodecs.vlen_nd import VLenNDArray, but that works fine locally. I'm not quite sure what's going on there.

@sofroniewn
Copy link
Author

sofroniewn commented Sep 7, 2019

Most tests pass now - I had to add the fixtures folder, fix the cython metadata in the .c file (which had included some paths on my computer) and I had to add a setup.py extension.

There's still one doc string test failing for py3.7 because of where the new-line gets split.

@sofroniewn
Copy link
Author

I also want to note that when I try and run pytest -v numcodecs locally I get the following error messages which I think pertain to parts of the codebase I'm not trying to interact with and were likely due to problems with my installation - which followed the procedure described in your contributing guide (inculding running pip install -r requirements_dev.txt and python setup.py build_ext --inplace, but without the virtual environment)

____________________________________________________________________________ ERROR collecting numcodecs/tests/test_blosc.py ____________________________________________________________________________
ImportError while importing test module '/Users/nicholassofroniew/Github/numcodecs/numcodecs/tests/test_blosc.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
numcodecs/tests/test_blosc.py:12: in <module>
    from numcodecs import blosc
E   ImportError: dlopen(/Users/nicholassofroniew/Github/numcodecs/numcodecs/blosc.cpython-37m-darwin.so, 2): Symbol not found: _blosc_cbuffer_complib
E     Referenced from: /Users/nicholassofroniew/Github/numcodecs/numcodecs/blosc.cpython-37m-darwin.so
E     Expected in: flat namespace
E    in /Users/nicholassofroniew/Github/numcodecs/numcodecs/blosc.cpython-37m-darwin.so
_____________________________________________________________________________ ERROR collecting numcodecs/tests/test_lz4.py _____________________________________________________________________________
ImportError while importing test module '/Users/nicholassofroniew/Github/numcodecs/numcodecs/tests/test_lz4.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
numcodecs/tests/test_lz4.py:9: in <module>
    from numcodecs.lz4 import LZ4
E   ImportError: dlopen(/Users/nicholassofroniew/Github/numcodecs/numcodecs/lz4.cpython-37m-darwin.so, 2): Symbol not found: _LZ4_compressBound
E     Referenced from: /Users/nicholassofroniew/Github/numcodecs/numcodecs/lz4.cpython-37m-darwin.so
E     Expected in: flat namespace
E    in /Users/nicholassofroniew/Github/numcodecs/numcodecs/lz4.cpython-37m-darwin.so
____________________________________________________________________________ ERROR collecting numcodecs/tests/test_zstd.py _____________________________________________________________________________
ImportError while importing test module '/Users/nicholassofroniew/Github/numcodecs/numcodecs/tests/test_zstd.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
numcodecs/tests/test_zstd.py:9: in <module>
    from numcodecs.zstd import Zstd
E   ImportError: dlopen(/Users/nicholassofroniew/Github/numcodecs/numcodecs/zstd.cpython-37m-darwin.so, 2): Symbol not found: _ZSTD_compress
E     Referenced from: /Users/nicholassofroniew/Github/numcodecs/numcodecs/zstd.cpython-37m-darwin.so
E     Expected in: flat namespace
E    in /Users/nicholassofroniew/Github/numcodecs/numcodecs/zstd.cpython-37m-darwin.so

@NumesSanguis
Copy link

@sofroniewn Thank you for your effort in trying to support ragged nD arrays. Just want to ask if there is any update on this feature?

@sofroniewn
Copy link
Author

No update - maybe I can ping @jakirkham and @ryan-williams to take another look / help me get the tests passing / resolve conflicts - i'll note that I think the conflicts / test failures come from the failures of my dev environment not the code I'm trying to add

@alimanfoo
Copy link
Member

Hi @sofroniewn, just to apologise for not looking at this sooner.

Re the conflicting .c files, those will just be due to changes in non-essential information that gets output by cython and which depends on which computer the C files where generated on. I'd suggest to just remove any changes to those C files from this PR.

Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sofroniewn, apologies again for slow review.

Implementation looks fine to me. Only question is whether to use a single byte for the number of dimensions rather than 4 byte int, could save a bit of space.

A couple of small comments on docstrings.

Also would need some API docs.


def check_out_param(out, n_items):
if not isinstance(out, np.ndarray):
raise TypeError('out must be 1-dimensional array')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise TypeError('out must be 1-dimensional array')
raise TypeError('out must be a numpy array')



class VLenNDArray(Codec):
"""Encode variable-length n-dimensional arrays via UTF-8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Encode variable-length n-dimensional arrays via UTF-8.
"""Encode an array of variable-length n-dimensional arrays.

Comment on lines +134 to +137
data_length += l + 4 * (n + 2) # 4 bytes to store number of
# dimensions, 4 bytes per
# dimension to store dimension
# and 4 bytes to store the length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a single byte to store the number of dimensions?

@alimanfoo
Copy link
Member

I also want to note that when I try and run pytest -v numcodecs locally I get the following error messages which I think pertain to parts of the codebase I'm not trying to interact with and were likely due to problems with my installation

These error messages are a bit odd, they suggest some problem with how the other extension modules were compiled. Afraid I haven't got anything very intelligent to suggest, other than cleaning out all the .so and .c files and trying a full build from scratch.

@ericpre
Copy link

ericpre commented Mar 16, 2022

@sofroniewn, by any chance, would you still have interest in finishing this PR? :)

@martindurant
Copy link
Member

This idea could probably be superseded by the proposal for an awkward-zarr project for GSoC.

@sofroniewn
Copy link
Author

So sorry both for dropping the ball on this - if there are already plans for this to be superseded please press on with those, or if you have a contributor who wants to take this over PR please take it over. Thanks!!

@NumesSanguis
Copy link

@martindurant Do you have a link to that GSoC proposal? I could only find this open issue, but not sure if that's related?:
zarr-developers/community#42

@sanketverma1704
Copy link
Member

@NumesSanguis here's the ideas-list.md and here's the Awkward Array project details.

@dstansby dstansby added the New codec Suggestion for a new codec label Aug 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

New codec Suggestion for a new codec

Projects

None yet

Development

Successfully merging this pull request may close these issues.

can VLenArray support 2D arrays

7 participants