Commit f15e850

Author: Konstantin Podshumok
Commit message: restructure names and update readme, provide count, extract, locate methods in suffix trees
1 parent 631a16d commit f15e850

8 files changed

+315
-139
lines changed

LICENSE

Lines changed: 0 additions & 9 deletions
This file was deleted.

README.md

Lines changed: 60 additions & 37 deletions
Original file line number | Diff line number | Diff line change
@@ -6,10 +6,10 @@ Most of examples from [SDSL cheat sheet][SDSL-CHEAT-SHEET] and [SDSL tutorial][S
66

77
## Mutable bit-compressed vectors
88

9-
Core classes:
9+
Core classes (see `pysdsl.int_vector` for dict of all of them):
1010

1111
* `pysdsl.IntVector(size, default_value, bit_width=64)` — dynamic bit width
12-
* `pysdsl.BitVector(size, default_value)` — static bit width (1)
12+
* `pysdsl.BitVector(size, default_value)` — static (fixed) bit width (1)
1313
* `pysdsl.Int4Vector(size, default_value)` — static bit width (4)
1414
* `pysdsl.Int8Vector(size, default_value)` — static bit width (8)
1515
* `pysdsl.Int16Vector(size, default_value)` — static bit width (16)
@@ -49,8 +49,21 @@ Out[8]: 896.0000085830688
4949

5050
```
5151

52+
Buffer interface:
53+
54+
```python
55+
In [9]: import array
56+
57+
In [10]: v = pysdsl.Int64Vector([1, 2, 3])
58+
59+
In [11]: array.array('Q', v)
60+
Out[11]: array('Q', [1, 2, 3])
61+
```
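The buffer interface shown above can be explored without `pysdsl` installed. The following is a minimal sketch that uses a plain `array.array` as a stand-in for `Int64Vector` (an assumption made only so the snippet is self-contained). It also checks that the format characters `B`, `H`, `I` and `Q`, the same ones selected per bit width in this commit's `def_buffer` code, have the expected standard sizes. The name `FORMAT_BY_WIDTH` is illustrative and not part of the library.

```python
import array
import struct

# Illustrative mapping (not a pysdsl name): bit width -> struct format character.
FORMAT_BY_WIDTH = {8: "B", 16: "H", 32: "I", 64: "Q"}

for width, fmt in FORMAT_BY_WIDTH.items():
    # The "=" prefix asks struct for standard (platform-independent) sizes.
    assert struct.calcsize("=" + fmt) * 8 == width

# Stand-in for pysdsl.Int64Vector([1, 2, 3]); any buffer exporter behaves alike.
source = array.array("Q", [1, 2, 3])

# memoryview consumes the buffer protocol without copying the data.
view = memoryview(source)
description = (view.format, view.itemsize, view.shape, view.strides)

# Writing through the view changes the underlying storage.
view[0] = 10
first_after_write = source[0]
view.release()
```

Because a `memoryview` shares storage with its exporter, consumers such as `array.array('Q', v)` in the README example are expected to copy from the vector's memory directly rather than iterate element by element.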
62+
5263
## Immutable compressed integer vectors
5364

65+
(See `pysdsl.enc_vector`):
66+
5467
* `EncVectorEliasDelta(IntVector)`
5568
* `EncVectorEliasGamma(IntVector)`
5669
* `EncVectorFibonacci(IntVector)`
@@ -66,41 +79,51 @@ In [10]: ev.size_in_mega_bytes
6679
Out[10]: 45.75003242492676
6780
```
6881

69-
Encoding values with variable length codes:
82+
Encoding values with variable length codes (see `pysdsl.variable_length_codes_vector`):
7083

71-
* `VlcVectorEliasDelta(IntVector)`
72-
* `VlcVectorEliasGamma(IntVector)`
73-
* `VlcVectorFibonacci(IntVector)`
74-
* `VlcVectorComma2(IntVector)`
75-
* `VlcVectorComma4(IntVector)`
84+
* `VariableLengthCodesVectorEliasDelta(IntVector)`
85+
* `VariableLengthCodesVectorEliasGamma(IntVector)`
86+
* `VariableLengthCodesVectorFibonacci(IntVector)`
87+
* `VariableLengthCodesVectorComma2(IntVector)`
88+
* `VariableLengthCodesVectorComma4(IntVector)`
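To make "variable length codes" concrete, here is a minimal pure-Python sketch of Elias gamma coding, one of the codes named in the list above. The helper names `elias_gamma_encode` and `elias_gamma_decode_all` are hypothetical, bits are modeled as a string of `"0"`/`"1"` characters for readability, and the exact bit layout used inside SDSL may differ.

```python
def elias_gamma_encode(value):
    """Encode a positive integer as an Elias gamma bit string."""
    if value < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(value)[2:]          # binary digits, most significant first
    # (len - 1) zeros announce how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode_all(bits):
    """Decode a concatenation of Elias gamma codes back into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":        # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


stream = "".join(elias_gamma_encode(v) for v in [1, 2, 5, 9])
```

Small values get short codes (`1` takes one bit, `9` takes seven), which is why such vectors pay off when most stored values are small.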
7689

77-
Encoding values with "escaping" technique:
90+
Encoding values with "escaping" technique (see `pysdsl.direct_accessible_codes_vector`):
7891

79-
* `DacVector(IntVector)`
80-
* `DacVectorDP(IntVector)` — number of layers is chosen
81-
with dynamic programming
92+
* `DirectAccessibleCodesVector(IntVector)`
93+
* `DirectAccessibleCodesVector8(IntVector)`,
94+
* `DirectAccessibleCodesVector16(IntVector)`,
95+
* `DirectAccessibleCodesVector63(IntVector)`,
96+
* `DirectAccessibleCodesVectorDP(IntVector)` — number of layers is chosen
97+
with dynamic programming
98+
* `DirectAccessibleCodesVectorDPRRR(IntVector)` — same but built on top of
99+
RamanRamanRaoVector (see later)
82100

83101
Construction from python sequences is also supported.
84102

85103
## Immutable compressed bit (boolean) vectors
86104

87-
* `BitVectorIL64(BitVector)`
88-
* `BitVectorIL128(BitVector)`
89-
* `BitVectorIL256(BitVector)`
90-
* `BitVectorIL512(BitVector)` — A bit vector which interleaves the
91-
original `BitVector` with rank information.
105+
(See `pysdsl.all_immutable_bitvectors`)
106+
107+
* `BitVectorInterLeaved64(BitVector)`
108+
* `BitVectorInterLeaved128(BitVector)`
109+
* `BitVectorInterLeaved256(BitVector)`
110+
* `BitVectorInterLeaved512(BitVector)` — A bit vector which interleaves the
111+
original `BitVector` with rank information
112+
(see later)
92113
* `SDVector(BitVector)` — A bit vector which compresses very sparse populated
93114
bit vectors by representing the positions of 1 by the
94115
Elias-Fano representation for
95116
non-decreasing sequences
96-
* `RRRVector3(BitVector)`
97-
* `RRRVector15(BitVector)`
98-
* `RRRVector63(BitVector)`
99-
* `RRRVector256(BitVector)` — An H₀-compressed bitvector representation.
117+
* `RamanRamanRaoVector15(BitVector)`
118+
* `RamanRamanRaoVector63(BitVector)`
119+
* `RamanRamanRaoVector256(BitVector)` — An H₀-compressed bitvector representation.
100120
* `HybVector8(BitVector)`
101121
* `HybVector16(BitVector)` — A hybrid-encoded compressed bitvector
102122
representation
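The Elias-Fano representation mentioned for `SDVector` can be sketched in a few lines. This is an illustrative pure-Python version under simplifying assumptions (bits kept in Python lists, hypothetical function names `elias_fano_encode` and `elias_fano_decode`); it is not the layout SDSL uses internally.

```python
def elias_fano_encode(positions, universe):
    """Split each sorted position into low bits (stored verbatim) and
    high bits (stored as unary gaps in one shared bit list)."""
    n = len(positions)
    # Roughly log2(universe / n) low bits per element; 0 when the set is dense.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    mask = (1 << low_bits) - 1
    lows = [p & mask for p in positions]
    highs = []
    previous_high = 0
    for p in positions:
        high = p >> low_bits
        highs.extend([0] * (high - previous_high))  # one 0 per bucket skipped
        highs.append(1)                             # one 1 per stored element
        previous_high = high
    return low_bits, lows, highs


def elias_fano_decode(low_bits, lows, highs):
    """Rebuild the original sorted positions."""
    positions = []
    high = 0
    k = 0
    for bit in highs:
        if bit:
            positions.append((high << low_bits) | lows[k])
            k += 1
        else:
            high += 1
    return positions


ones = [3, 4, 7, 13, 14, 15, 21, 43]   # positions of set bits, sorted
encoded = elias_fano_encode(ones, universe=64)
```

For `n` set bits in a universe of size `u`, the high part needs at most about `2n` bits and the low part `n * low_bits`, which is why the scheme suits sparsely populated bit vectors.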
103123

124+
See also: `pysdsl.raman_raman_rao_vectors`, `pysdsl.sparse_bit_vectors`,
125+
`pysdsl.hybrid_bit_vectors` and `pysdsl.bit_vector_interleaved`.
126+
104127
## Rank and select operations on bitvectors
105128

106129
For bitvector `v` `rank(i)` for pattern `P` (by default `P` is a bitstring of
@@ -134,6 +157,22 @@ the results.
134157
mutable and was modified.
135158

136159

160+
## Wavelet trees
161+
162+
The wavelet tree is a data structure that provides three efficient methods:
163+
164+
* The `[]`-operator: `wt[i]` returns the `i`-th symbol of the vector for which the wavelet tree was built.
165+
* The rank method: `wt.rank(i, c)` returns the number of occurrences of symbol `c` in the prefix `[0..i-1]` of the vector for which the wavelet tree was built.
166+
* The select method: `wt.select(j, c)` returns the index `i` from `[0..size()-1]` of the `j`-th occurrence of symbol `c`.
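The three operations listed above can be pinned down with a naive reference model. The sketch below is not a wavelet tree (every call scans the sequence) and the class name `NaiveSequence` is hypothetical; it only fixes the semantics, assuming that `j` in `select` is counted from 1.

```python
class NaiveSequence:
    """Reference semantics for access, rank and select (linear time)."""

    def __init__(self, symbols):
        self._symbols = list(symbols)

    def __len__(self):
        return len(self._symbols)

    def __getitem__(self, i):
        # Access: the i-th symbol of the original sequence.
        return self._symbols[i]

    def rank(self, i, c):
        # Number of occurrences of c in the prefix [0..i-1].
        return self._symbols[:i].count(c)

    def select(self, j, c):
        # Index of the j-th occurrence of c (j starts at 1 in this sketch).
        seen = 0
        for index, symbol in enumerate(self._symbols):
            if symbol == c:
                seen += 1
                if seen == j:
                    return index
        raise ValueError("fewer than j occurrences of the symbol")


seq = NaiveSequence("abracadabra")
```

A useful invariant for testing any real implementation against this model is `rank(select(j, c), c) == j - 1`.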
167+
168+
## Compressed suffix arrays
169+
170+
A suffix array is a sorted array of all suffixes of a string.
171+
172+
SDSL supports bit-compressed and compressed suffix arrays.
173+
174+
The byte representation of the original IntVector must contain no zero symbols in order to construct a SuffixArray.
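As a small illustration of the definition, the sketch below builds an uncompressed suffix array naively and answers locate and count queries by binary search, the kind of queries the commit message mentions. The function names are hypothetical and are not the `pysdsl` API; the construction costs far more time and memory than a compressed suffix array.

```python
def naive_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda start: text[start:])


def locate(text, suffix_array, pattern):
    """Sorted start positions of every occurrence of pattern in text."""
    width = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + width]

    # Lower bound: first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


def count(text, suffix_array, pattern):
    """Number of occurrences of pattern in text."""
    return len(locate(text, suffix_array, pattern))


sa = naive_suffix_array("banana")
```

All suffixes that start with the pattern occupy one contiguous range of the suffix array, so two binary searches are enough to find every occurrence.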
175+
137176
## Objects memory structure
138177

139178
Any object has a `.structure` property with technical information about an
@@ -151,22 +190,6 @@ object into a file.
151190
All classes provide `.load_from_checkded_file()` static method allowing one to
152191
load object stored with `.store_to_checked_file()`
153192

154-
## Wavelet trees
155-
156-
The wavelet tree is a data structure that provides three efficient methods:
157-
158-
* The `[]`-operator: `wt[i]` returns the `i`-th symbol of vector for which the wavelet tree was build for.
159-
* The rank method: `wt.rank(i, c)` returns the number of occurrences of symbol `c` in the prefix `[0..i-1]` in the vector for which the wavelet tree was build for.
160-
* The select method: `wt.select(j, c)` returns the index `i` from `[0..size()-1]` of the `j`-th occurrence of symbol `c`.
161-
162-
## Comressed suffix arrays
163-
164-
Suffix array is a sorted array of all suffixes of a string.
165-
166-
SDSL supports bitcompressed and compressed suffix arrays.
167-
168-
Byte representaion of original IntVector should have no zero symbols in order to construct SuffixArray.
169-
170193

171194
## Building
172195

pysdsl/__init__.cpp

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,9 @@
11
#include <cstdint>
22
#include <string>
33
#include <tuple>
4+
#include <stdexcept>
5+
6+
#define assert(x) if(!(x)) {throw std::runtime_error("assertion failed");}
47

58
#include <sdsl/vectors.hpp>
69

@@ -14,7 +17,6 @@
1417
#include "types/suffixarray.hpp"
1518
#include "types/wavelet.hpp"
1619

17-
1820
namespace py = pybind11;
1921

2022

@@ -31,7 +33,7 @@ PYBIND11_MODULE(pysdsl, m)
3133

3234
auto enc_classes = add_encoded_vectors(m);
3335

34-
auto wavelet_classes = add_wavelet(m);
36+
auto wavelet_classes = add_wavelet(m, compressed_bit_vector_classes);
3537

3638
auto csa_classes = add_csa(m);
3739

@@ -56,7 +58,6 @@ PYBIND11_MODULE(pysdsl, m)
5658
for_each_in_tuple(wavelet_classes, make_inits_many_functor(iv_classes));
5759
#ifndef NOCROSSCONSTRUCTORS
5860
for_each_in_tuple(wavelet_classes, make_inits_many_functor(enc_classes));
59-
for_each_in_tuple(wavelet_classes, make_inits_many_functor(compressed_bit_vector_classes));
6061
for_each_in_tuple(wavelet_classes,
6162
make_inits_many_functor(wavelet_classes));
6263
#endif
@@ -67,6 +68,5 @@ PYBIND11_MODULE(pysdsl, m)
6768
//for_each_in_tuple(sd_classes, make_pysequence_init_functor());
6869

6970
for_each_in_tuple(wavelet_classes, make_pysequence_init_functor());
70-
7171
for_each_in_tuple(csa_classes, make_pysequence_init_functor());
7272
}

pysdsl/operations/creation.hpp

Lines changed: 11 additions & 3 deletions
@@ -22,12 +22,20 @@ namespace py = pybind11;
2222
namespace detail
2323
{
2424

25-
template <class T, typename value_type = typename T::value_type>
25+
template <class T, typename value_type = typename T::value_type,
26+
bool is_bitvector1 = std::is_same<sdsl::int_vector<1>, T>::value>
2627
struct IntermediateVector { using type = sdsl::int_vector<>; };
2728

2829

29-
template <class T>
30-
struct IntermediateVector<T, bool> { using type = sdsl::int_vector<1>; };
30+
template <class T, bool b>
31+
struct IntermediateVector<T, bool, b> { using type = sdsl::int_vector<1>; };
32+
33+
34+
template <uint8_t N, typename value_type>
35+
struct IntermediateVector<sdsl::int_vector<N>, value_type, false>
36+
{
37+
using type = sdsl::int_vector<N>;
38+
};
3139

3240

3341
template <

pysdsl/types/intvector.hpp

Lines changed: 40 additions & 13 deletions
@@ -18,12 +18,42 @@
1818
namespace py = pybind11;
1919

2020

21-
template <class T, typename S = typename T::value_type, typename KEY>
22-
inline
23-
auto add_int_class(py::module& m, py::dict& dict, KEY key,
24-
const char *name, const char *doc = nullptr)
21+
template <class T,
22+
unsigned int width = static_cast<unsigned int>(T::fixed_int_width)>
23+
inline auto add_int_init(py::module& m, const char* name)
2524
{
26-
auto cls = py::class_<T>(m, name)
25+
if (width == 8 || width == 16 || width == 32 || width == 64)
26+
{
27+
return py::class_<T>(m, name, py::buffer_protocol())
28+
.def_buffer([] (T& self) {
29+
char sym;
30+
if (width == 8) {
31+
sym = 'B'; }
32+
else if (width == 16) {
33+
sym = 'H'; }
34+
else if (width == 32) {
35+
sym = 'I'; }
36+
else if (width == 64) {
37+
sym = 'Q'; }
38+
39+
return py::buffer_info(
40+
reinterpret_cast<void*>(self.data()),
41+
width / 8,
42+
std::string(1, sym),
43+
1,
44+
{ detail::size(self) },
45+
{ width / 8 }
46+
); });
47+
}
48+
return py::class_<T>(m, name);
49+
}
50+
51+
52+
template <class T, typename S = typename T::value_type, typename KEY_T>
53+
inline auto add_int_class(py::module& m, py::dict& dict, KEY_T key,
54+
const char *name, const char *doc = nullptr)
55+
{
56+
auto cls = add_int_init<T>(m, name)
2757
.def_property_readonly("width", (uint8_t(T::*)(void) const) & T::width)
2858
.def_property_readonly("data",
2959
(const uint64_t *(T::*)(void)const) & T::data)
@@ -40,10 +70,8 @@ auto add_int_class(py::module& m, py::dict& dict, KEY key,
4070
.def(
4171
"__setitem__",
4272
[](T &self, size_t position, S value) {
43-
if (position >= self.size())
44-
{
45-
throw std::out_of_range(std::to_string(position));
46-
}
73+
if (position >= self.size()) {
74+
throw std::out_of_range(std::to_string(position)); }
4775
self[position] = value; })
4876

4977
.def("set_to_id",
@@ -141,8 +169,7 @@ auto add_int_class(py::module& m, py::dict& dict, KEY key,
141169
}
142170

143171

144-
inline
145-
auto add_int_vectors(py::module& m)
172+
inline auto add_int_vectors(py::module& m)
146173
{
147174
py::dict int_vectors_dict;
148175

@@ -183,14 +210,14 @@ auto add_int_vectors(py::module& m)
183210
"Flip all bits of bit_vector",
184211
py::call_guard<py::gil_scoped_release>()),
185212

186-
add_int_class<sdsl::int_vector<4>, uint8_t>(
213+
add_int_class<sdsl::int_vector<4>, uint16_t>(
187214
m, int_vectors_dict, 4, "Int4Vector")
188215
.def(py::init(
189216
[](size_t size, uint8_t default_value) {
190217
return sdsl::int_vector<4>(size, default_value, 4); }),
191218
py::arg("size") = 0, py::arg("default_value") = 0),
192219

193-
add_int_class<sdsl::int_vector<8>, uint8_t>(
220+
add_int_class<sdsl::int_vector<8>, uint16_t>(
194221
m, int_vectors_dict, 8, "Int8Vector")
195222
.def(py::init(
196223
[](size_t size, uint8_t default_value) {
