springmeyer
Contributor

This makes changes to vtcomposite to build locally against mapbox/gzip-hpp#25.

artemp and others added 10 commits July 22, 2018 14:28
 - Because, in the overzooming case, we are very likely to throw out
   points, this reserve is going to over-allocate, spending CPU time
   requesting memory that will never be used and holding onto more
   memory than is needed
 - So, it makes sense to avoid calling this to ensure that the overzooming
   case is fast
 - Note: the same advantage does not apply to lines or polygons because
   we currently have no code that filters them. Rather, they need to be fully
   constructed, and therefore the reserve used in those handlers makes sense.
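
A minimal sketch of the trade-off described in the commit message above, not vtcomposite's actual code. `Point`, `Box`, and `clip_points` are hypothetical names introduced for illustration: when most input points are filtered out, reserving space for all of them allocates capacity that is never used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in types; these are not vtcomposite's real ones.
struct Point {
    std::int32_t x;
    std::int32_t y;
};

struct Box {
    std::int32_t min_x, min_y, max_x, max_y;
    bool contains(Point const& p) const {
        return p.x >= min_x && p.x <= max_x && p.y >= min_y && p.y <= max_y;
    }
};

// Keep only the points that fall inside the target tile's bounds.
// `reserve_upfront` toggles the up-front reserve that the commit message
// argues against for the overzooming case.
std::vector<Point> clip_points(std::vector<Point> const& input,
                               Box const& bounds,
                               bool reserve_upfront) {
    std::vector<Point> kept;
    if (reserve_upfront) {
        // Allocates room for every input point, even if most get discarded.
        kept.reserve(input.size());
    }
    for (auto const& p : input) {
        if (bounds.contains(p)) {
            kept.push_back(p);
        }
    }
    assert(kept.size() <= input.size());
    return kept;
}
```

With 1000 input points of which only 10 survive the filter, the reserving variant returns a vector whose capacity is at least 1000, while the non-reserving variant only grows as far as the surviving points require.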
@springmeyer
Contributor Author

With property_mapper branch on OS X (using normal zlib inside node.js):

$ node bench/bench.js --iterations 500 --concurrency 4 --package vtcomposite --mem --compress

1: single tile in/out ... 711 runs/s (703ms)
2: two different tiles at the same zoom level, zero buffer ... 64 runs/s (7784ms)
3: two different tiles from different zoom levels (separated by one zoom), zero buffer ... 69 runs/s (7251ms)
4: two different tiles from different zoom levels (separated by more than one zoom), zero buffer ... 64 runs/s (7757ms)
5: tiles completely made of points, overzooming, no properties ... 3165 runs/s (158ms)
6: tiles completely made of points, same zoom, no properties ... 3049 runs/s (164ms)
7: tiles completely made of points, overzoooming, lots of properties ... 2146 runs/s (233ms)
8: tiles completely made of points, same zoom, lots of properties ... 914 runs/s (547ms)
9: buffer_size 128 - tiles completely made of points, same zoom, lots of properties ... 936 runs/s (534ms)
10: tiles completely made of linestrings, overzooming and lots of properties ... 737 runs/s (678ms)
11: tiles completely made of polygons, overzooming and lots of properties ... 256 runs/s (1953ms)
12: tiles completely made of points and linestrings, overzooming and lots of properties ... 7576 runs/s (66ms)
13: buffer_size 128 - tiles completely made of points and linestrings, overzooming and lots of properties ... 7246 runs/s (69ms)
14: tiles completely made of points and linestrings, overzooming (2x) and lots of properties ... 9615 runs/s (52ms)
15: tiles completely made of polygons, overzooming and lots of properties ... 883 runs/s (566ms)
16: tiles completely made of polygons, overzooming (2x) and lots of properties ... 1792 runs/s (279ms)
17: buffer_size 4096 - tiles completely made of polygons, overzooming (2x) and lots of properties ... 883 runs/s (566ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  560.07MB 8.31MB 12.33MB
Benchmark iterations: 500 concurrency: 4

With this branch (using libdeflate):

$ node bench/bench.js --iterations 500 --concurrency 4 --package vtcomposite --mem --compress

1: single tile in/out ... 1592 runs/s (314ms)
2: two different tiles at the same zoom level, zero buffer ... 162 runs/s (3094ms)
3: two different tiles from different zoom levels (separated by one zoom), zero buffer ... 148 runs/s (3370ms)
4: two different tiles from different zoom levels (separated by more than one zoom), zero buffer ... 157 runs/s (3191ms)
5: tiles completely made of points, overzooming, no properties ... 4348 runs/s (115ms)
6: tiles completely made of points, same zoom, no properties ... 5000 runs/s (100ms)
7: tiles completely made of points, overzoooming, lots of properties ... 3356 runs/s (149ms)
8: tiles completely made of points, same zoom, lots of properties ... 2092 runs/s (239ms)
9: buffer_size 128 - tiles completely made of points, same zoom, lots of properties ... 2525 runs/s (198ms)
10: tiles completely made of linestrings, overzooming and lots of properties ... 1136 runs/s (440ms)
11: tiles completely made of polygons, overzooming and lots of properties ... 304 runs/s (1644ms)
12: tiles completely made of points and linestrings, overzooming and lots of properties ... 6494 runs/s (77ms)
13: buffer_size 128 - tiles completely made of points and linestrings, overzooming and lots of properties ... 9434 runs/s (53ms)
14: tiles completely made of points and linestrings, overzooming (2x) and lots of properties ... 12195 runs/s (41ms)
15: tiles completely made of polygons, overzooming and lots of properties ... 1109 runs/s (451ms)
16: tiles completely made of polygons, overzooming (2x) and lots of properties ... 2041 runs/s (245ms)
17: buffer_size 4096 - tiles completely made of polygons, overzooming (2x) and lots of properties ... 1071 runs/s (467ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  999.77MB 10.55MB 17.83MB
Benchmark iterations: 500 concurrency: 4

@millzpaugh millzpaugh changed the base branch from property_mapper to master August 1, 2018 23:18
@springmeyer
Contributor Author

Noting that @millzpaugh found that in downstream deps this hit problems that would need investigation (and likely fixes in the upstream gzip-hpp PR at mapbox/gzip-hpp#25) including:

not ok 750 Error: bad data: did not succeed
  ---
    operator: error
    expected: undefined
    actual:   {}
    stack:
      Error: bad data: did not succeed
        at Error (native)
  ...
not ok 751 Error: unexpected end of file
  ---
    operator: error
    expected: undefined
    actual:   { code: 'Z_BUF_ERROR', errno: -5 }
    stack:
      Error: unexpected end of file
        at Zlib._handle.onerror (zlib.js:371:17)
  ...
/home/travis/build/mapbox/mapbox-maps/node_modules/tape/index.js:75
        throw err
        ^
TypeError: first argument must be a buffer object
    at TypeError (native)
    at /home/travis/build/mapbox/mapbox-maps/test/tilelive-multivector.compositing.test.js:60:23
    at Gunzip.onError (zlib.js:212:5)
    at emitOne (events.js:96:13)
    at Gunzip.emit (events.js:188:7)
    at Zlib._handle.onerror (zlib.js:374:10)
npm ERR! Test failed.  See above for more details.
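
The cause of these failures is not established in this thread. As one cheap check a downstream test could run on the compressed output, here is a sketch that inspects the gzip framing defined in RFC 1952: the magic bytes and method id at the start, and the ISIZE field (uncompressed length modulo 2^32, little-endian) in the last four bytes. The function names are hypothetical, and this cannot detect every kind of corruption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// True when `data` is long enough to hold a gzip header and trailer and
// starts with the gzip magic bytes plus the deflate method id (RFC 1952).
bool looks_like_gzip(std::string const& data) {
    return data.size() >= 18 &&
           static_cast<unsigned char>(data[0]) == 0x1f &&
           static_cast<unsigned char>(data[1]) == 0x8b &&
           static_cast<unsigned char>(data[2]) == 0x08;
}

// Reads ISIZE: the uncompressed length modulo 2^32, stored little-endian
// in the final four bytes of the stream.
std::uint32_t gzip_isize(std::string const& data) {
    assert(looks_like_gzip(data));
    std::size_t const n = data.size();
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        auto const byte = static_cast<unsigned char>(data[n - 4 + i]);
        value |= static_cast<std::uint32_t>(byte) << (8 * i);
    }
    return value;
}
```

If the expected uncompressed size of a tile is known, comparing it against `gzip_isize` is a quick way to spot a truncated stream before handing the buffer to a gunzip implementation.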

@springmeyer
Contributor Author

@artemp awesome that this is coming together. Can you post benchmarks from your machine of how fast this is compared to master?

@artemp
Contributor

artemp commented Nov 9, 2018

/cc @springmeyer @mapsam @millzpaugh @norchard @jinnycho503

NOTE: the bench ran on a battery-powered laptop, so no CPU acceleration.

master

time node bench/bench.js --iterations 50 --concurrency 1 --package vtcomposite --compress --mem

1: single tile in/out ... 251 runs/s (199ms)
2: two different tiles at the same zoom level, zero buffer ... 21 runs/s (2364ms)
3: two different tiles from different zoom levels (separated by one zoom), zero buffer ... 22 runs/s (2305ms)
4: two different tiles from different zoom levels (separated by more than one zoom), zero buffer ... 23 runs/s (2195ms)
5: tiles completely made of points, overzooming, no properties ... 1042 runs/s (48ms)
6: tiles completely made of points, same zoom, no properties ... 962 runs/s (52ms)
7: tiles completely made of points, overzoooming, lots of properties ... 714 runs/s (70ms)
8: tiles completely made of points, same zoom, lots of properties ... 296 runs/s (169ms)
9: buffer_size 128 - tiles completely made of points, same zoom, lots of properties ... 296 runs/s (169ms)
10: tiles completely made of linestrings, overzooming and lots of properties ... 336 runs/s (149ms)
11: tiles completely made of polygons, overzooming and lots of properties ... 141 runs/s (354ms)
12: tiles completely made of points and linestrings, overzooming and lots of properties ... 3333 runs/s (15ms)
13: buffer_size 128 - tiles completely made of points and linestrings, overzooming and lots of properties ... 3571 runs/s (14ms)
14: tiles completely made of points and linestrings, overzooming (2x) and lots of properties ... 5000 runs/s (10ms)
15: tiles completely made of polygons, overzooming and lots of properties ... 446 runs/s (112ms)
16: tiles completely made of polygons, overzooming (2x) and lots of properties ... 847 runs/s (59ms)
17: buffer_size 4096 - tiles completely made of polygons, overzooming (2x) and lots of properties ... 459 runs/s (109ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  170.55MB 6.45MB 10.33MB
Benchmark iterations: 50 concurrency: 1

real	0m8.573s
user	0m8.707s
sys	0m0.092s

libdeflate (N-API)

time node bench/bench.js --iterations 50 --concurrency 1 --package vtcomposite --compress --mem

1: single tile in/out ... 316 runs/s (158ms)
2: two different tiles at the same zoom level, zero buffer ... 54 runs/s (931ms)
3: two different tiles from different zoom levels (separated by one zoom), zero buffer ... 51 runs/s (982ms)
4: two different tiles from different zoom levels (separated by more than one zoom), zero buffer ... 56 runs/s (900ms)
5: tiles completely made of points, overzooming, no properties ... 2000 runs/s (25ms)
6: tiles completely made of points, same zoom, no properties ... 1667 runs/s (30ms)
7: tiles completely made of points, overzoooming, lots of properties ... 1282 runs/s (39ms)
8: tiles completely made of points, same zoom, lots of properties ... 820 runs/s (61ms)
9: buffer_size 128 - tiles completely made of points, same zoom, lots of properties ... 820 runs/s (61ms)
10: tiles completely made of linestrings, overzooming and lots of properties ... 602 runs/s (83ms)
11: tiles completely made of polygons, overzooming and lots of properties ... 174 runs/s (287ms)
12: tiles completely made of points and linestrings, overzooming and lots of properties ... 5000 runs/s (10ms)
13: buffer_size 128 - tiles completely made of points and linestrings, overzooming and lots of properties ... 4545 runs/s (11ms)
14: tiles completely made of points and linestrings, overzooming (2x) and lots of properties ... 7143 runs/s (7ms)
15: tiles completely made of polygons, overzooming and lots of properties ... 543 runs/s (92ms)
16: tiles completely made of polygons, overzooming (2x) and lots of properties ... 1042 runs/s (48ms)
17: buffer_size 4096 - tiles completely made of polygons, overzooming (2x) and lots of properties ... 562 runs/s (89ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  185.65MB 8.19MB 13.23MB
Benchmark iterations: 50 concurrency: 1

real	0m4.044s
user	0m4.110s
sys	0m0.085s

@springmeyer
Contributor Author

@vakila 👋 here's the PR I mentioned to you that we'd love a quick hand testing on linux:

git clone [email protected]:mapbox/vtcomposite.git
cd vtcomposite
make
time node bench/bench.js --iterations 50 --concurrency 1 --package vtcomposite --compress --mem

Post results. Then do:

git checkout libdeflate
make clean
make
time node bench/bench.js --iterations 50 --concurrency 1 --package vtcomposite --compress --mem

Then post results.

@vakila

vakila commented Nov 11, 2018

@springmeyer 👋 Here you go! Nothing like a little benchmarking to make time fly while waiting at the airport 😁

anjana:~/mapbox/vtcomposite (master) $ time node bench/bench.js --iterations 50 --concurrency 1 --package vtcomposite --compress --mem

1: single tile in/out ... 299 runs/s (167ms)
2: two different tiles at the same zoom level, zero buffer ... 28 runs/s (1810ms)
3: two different tiles from different zoom levels (separated by one zoom), zero buffer ... 27 runs/s (1850ms)
4: two different tiles from different zoom levels (separated by more than one zoom), zero buffer ... 29 runs/s (1722ms)
5: tiles completely made of points, overzooming, no properties ... 1351 runs/s (37ms)
6: tiles completely made of points, same zoom, no properties ... 1111 runs/s (45ms)
7: tiles completely made of points, overzoooming, lots of properties ... 833 runs/s (60ms)
8: tiles completely made of points, same zoom, lots of properties ... 350 runs/s (143ms)
9: buffer_size 128 - tiles completely made of points, same zoom, lots of properties ... 350 runs/s (143ms)
10: tiles completely made of linestrings, overzooming and lots of properties ... 370 runs/s (135ms)
11: tiles completely made of polygons, overzooming and lots of properties ... 140 runs/s (356ms)
12: tiles completely made of points and linestrings, overzooming and lots of properties ... 3846 runs/s (13ms)
13: buffer_size 128 - tiles completely made of points and linestrings, overzooming and lots of properties ... 3333 runs/s (15ms)
14: tiles completely made of points and linestrings, overzooming (2x) and lots of properties ... 3846 runs/s (13ms)
15: tiles completely made of polygons, overzooming and lots of properties ... 459 runs/s (109ms)
16: tiles completely made of polygons, overzooming (2x) and lots of properties ... 877 runs/s (57ms)
17: buffer_size 4096 - tiles completely made of polygons, overzooming (2x) and lots of properties ... 463 runs/s (108ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  176.48MB 8.45MB 12.33MB
Benchmark iterations: 50 concurrency: 1

real	0m6.938s
user	0m7.028s
sys	0m0.093s
anjana:~/mapbox/vtcomposite (libdeflate) $ time node bench/bench.js --iterations 50 --concurrency 1 --package vtcomposite --compress --mem

1: single tile in/out ... (node:22710) Warning: N-API is an experimental feature and could change at any time.
373 runs/s (134ms)
2: two different tiles at the same zoom level, zero buffer ... 64 runs/s (785ms)
3: two different tiles from different zoom levels (separated by one zoom), zero buffer ... 59 runs/s (850ms)
4: two different tiles from different zoom levels (separated by more than one zoom), zero buffer ... 65 runs/s (770ms)
5: tiles completely made of points, overzooming, no properties ... 2273 runs/s (22ms)
6: tiles completely made of points, same zoom, no properties ... 1923 runs/s (26ms)
7: tiles completely made of points, overzoooming, lots of properties ... 1389 runs/s (36ms)
8: tiles completely made of points, same zoom, lots of properties ... 943 runs/s (53ms)
9: buffer_size 128 - tiles completely made of points, same zoom, lots of properties ... 943 runs/s (53ms)
10: tiles completely made of linestrings, overzooming and lots of properties ... 610 runs/s (82ms)
11: tiles completely made of polygons, overzooming and lots of properties ... 166 runs/s (301ms)
12: tiles completely made of points and linestrings, overzooming and lots of properties ... 5000 runs/s (10ms)
13: buffer_size 128 - tiles completely made of points and linestrings, overzooming and lots of properties ... 4545 runs/s (11ms)
14: tiles completely made of points and linestrings, overzooming (2x) and lots of properties ... 5556 runs/s (9ms)
15: tiles completely made of polygons, overzooming and lots of properties ... 521 runs/s (96ms)
16: tiles completely made of polygons, overzooming (2x) and lots of properties ... 1020 runs/s (49ms)
17: buffer_size 4096 - tiles completely made of polygons, overzooming (2x) and lots of properties ... 549 runs/s (91ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  182.18MB 8.34MB 12.33MB
Benchmark iterations: 50 concurrency: 1

real	0m3.529s
user	0m3.583s
sys	0m0.073s

For the record this was on Ubuntu 16.04 on a ThinkPad X1 i7. Also for the record I had to run make distclean after make clean to get the second build to work - at first it failed on a missing npm module.

Let me know if you need any other info, hope this helps!

@springmeyer
Contributor Author

Wonderful, thanks @vakila! That confirms perf is nearly 2x faster on linux as well as OS X 🎉 for this branch. Have a great 🛫 !
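
For reference, the "nearly 2x" figure can be checked against the `real` wall-clock times posted above. This is a small sketch with hypothetical helper names; the input numbers come straight from the two benchmark posts.

```cpp
#include <cassert>

// Speedup of a candidate run relative to a baseline, from wall-clock seconds.
// A value of 2.0 means the candidate finished in half the time.
double speedup(double baseline_seconds, double candidate_seconds) {
    assert(candidate_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}

// Throughput gain between two runs/s figures for the same benchmark case.
double throughput_gain(double baseline_runs_per_s, double candidate_runs_per_s) {
    assert(baseline_runs_per_s > 0.0);
    return candidate_runs_per_s / baseline_runs_per_s;
}
```

`speedup(8.573, 4.044)` gives roughly 2.12 for @artemp's run and `speedup(6.938, 3.529)` gives roughly 1.97 for the Ubuntu run. Individual cases vary: `throughput_gain(28, 64)` for case 2 on Ubuntu is about 2.29, while case 11 (polygons) only moves from 140 to 166 runs/s.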

@springmeyer
Contributor Author

@artemp I tested this in a production-like environment today with production-like load. Unfortunately it was significantly slower (2x) than the normal vtcomposite release using zlib and it resulted in each thread hitting 100% cpu utilization.

[screenshot: 2018-11-11 at 3:15:10 pm]

I profiled with perf and found out the reason why: decompression is taking all the time. This is unexpected since compression was previously our bottleneck.

[screenshot: 2018-11-11 at 3:17:28 pm]

I think I see why this is happening:

Given ebiggers/libdeflate@268d2fe it seems like we can't just initialize libdeflate_decompressor once and re-use it per thread. So we'll need to find a way to create a pool and pull from the pool I think. Unless you have other ideas?
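
A minimal sketch of the pooling idea, under the assumption that workers borrow a decompressor, use it, and hand it back. `Pool` and `FakeDecompressor` are hypothetical names; in real code the pooled type would presumably wrap a handle from `libdeflate_alloc_decompressor` and free it with `libdeflate_free_decompressor`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expensive-to-create decompressor handle.
struct FakeDecompressor {
    int uses = 0;
};

template <typename T>
class Pool {
  public:
    // Take an idle object if one exists, otherwise create a new one.
    std::unique_ptr<T> acquire() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!idle_.empty()) {
                std::unique_ptr<T> obj = std::move(idle_.back());
                idle_.pop_back();
                return obj;
            }
        }
        // Construct outside the lock so creation does not block other workers.
        return std::unique_ptr<T>(new T());
    }

    // Hand an object back so another worker can reuse it.
    void release(std::unique_ptr<T> obj) {
        assert(obj != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(std::move(obj));
    }

    // Number of objects currently waiting to be reused.
    std::size_t idle_count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return idle_.size();
    }

  private:
    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<T>> idle_;
};
```

Each worker would call `acquire()` at the start of its job and `release()` at the end, so the number of live decompressors is bounded by the number of concurrent workers rather than the number of tiles processed.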

@springmeyer
Contributor Author

On further investigation it actually looks like the threads are hung on build_decode_table. So, rather than libdeflate_decompressor creation being a major bottleneck, it looks like an infinite-loop bug of some sort may be getting triggered. This is the result of inspecting one of the hung processes in gdb:


Thread 4 (Thread 0x7f6a897fc700 (LWP 104)):
#0  0x00007f6a859e3ff7 in build_decode_table () from vtcomposite.node
#1  0x00007f6a859e2fae in deflate_decompress_bmi2 () from vtcomposite.node
#2  0x00007f6a859ec85c in libdeflate_gzip_decompress_ex () from vtcomposite.node
#3  0x00007f6a859ec8cd in libdeflate_gzip_decompress () from vtcomposite.node
#4  0x00007f6a859ab8cd in void deflate::Decompressor::apply<std::vector<char, std::allocator<char> >, libdeflate_result (*)(libdeflate_decompressor*, void const*, unsigned long, void*, unsigned long, unsigned long*)>(std::vector<char, std::allocator<char> >&, libdeflate_result (*)(libdeflate_decompressor*, void const*, unsigned long, void*, unsigned long, unsigned long*), char const*, unsigned long) const () from vtcomposite.node
#5  0x00007f6a859a95a6 in vtile::CompositeWorker::Execute() () from vtcomposite.node
#6  0x00007f6a859aab87 in Napi::AsyncWorker::OnExecute(napi_env__*, void*) () from vtcomposite.node
#7  0x00000000012909e1 in worker (arg=<optimized out>) at ../deps/uv/src/threadpool.c:83
#8  0x00007f6a905ed184 in start_thread (arg=0x7f6a897fc700) at pthread_create.c:312
#9  0x00007f6a9031a03d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7f6a88ffb700 (LWP 105)):
#0  0x00007f6a859e3fc6 in build_decode_table () from vtcomposite.node
#1  0x00007f6a859e2fae in deflate_decompress_bmi2 () from vtcomposite.node
#2  0x00007f6a859ec85c in libdeflate_gzip_decompress_ex () from vtcomposite.node
#3  0x00007f6a859ec8cd in libdeflate_gzip_decompress () from vtcomposite.node
#4  0x00007f6a859ab8cd in void deflate::Decompressor::apply<std::vector<char, std::allocator<char> >, libdeflate_result (*)(libdeflate_decompressor*, void const*, unsigned long, void*, unsigned long, unsigned long*)>(std::vector<char, std::allocator<char> >&, libdeflate_result (*)(libdeflate_decompressor*, void const*, unsigned long, void*, unsigned long, unsigned long*), char const*, unsigned long) const () from vtcomposite.node
#5  0x00007f6a859a95a6 in vtile::CompositeWorker::Execute() () from vtcomposite.node
#6  0x00007f6a859aab87 in Napi::AsyncWorker::OnExecute(napi_env__*, void*) () from vtcomposite.node
#7  0x00000000012909e1 in worker (arg=<optimized out>) at ../deps/uv/src/threadpool.c:83
#8  0x00007f6a905ed184 in start_thread (arg=0x7f6a88ffb700) at pthread_create.c:312
#9  0x00007f6a9031a03d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f6a887fa700 (LWP 106)):
#0  0x00007f6a859e3e03 in build_decode_table () from vtcomposite.node
#1  0x00007f6a859e2fae in deflate_decompress_bmi2 () from vtcomposite.node
#2  0x00007f6a859ec85c in libdeflate_gzip_decompress_ex () from vtcomposite.node
#3  0x00007f6a859ec8cd in libdeflate_gzip_decompress () from vtcomposite.node
#4  0x00007f6a859ab8cd in void deflate::Decompressor::apply<std::vector<char, std::allocator<char> >, libdeflate_result (*)(libdeflate_decompressor*, void const*, unsigned long, void*, unsigned long, unsigned long*)>(std::vector<char, std::allocator<char> >&, libdeflate_result (*)(libdeflate_decompressor*, void const*, unsigned long, void*, unsigned long, unsigned long*), char const*, unsigned long) const () from vtcomposite.node
#5  0x00007f6a859a95a6 in vtile::CompositeWorker::Execute() () from vtcomposite.node
#6  0x00007f6a859aab87 in Napi::AsyncWorker::OnExecute(napi_env__*, void*) () from vtcomposite.node
#7  0x00000000012909e1 in worker (arg=<optimized out>) at ../deps/uv/src/threadpool.c:83
#8  0x00007f6a905ed184 in start_thread (arg=0x7f6a887fa700) at pthread_create.c:312
#9  0x00007f6a9031a03d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

… + 2x output size on LIBDEFLATE_INSUFFICIENT_SPACE + removed LIBDEFLATE_SHORT_OUTPUT case as we're always passing non NULL actual_size [publish binary]
@artemp
Contributor

artemp commented Nov 12, 2018

@springmeyer - I modified how the decompressor sizes its output buffer and also refactored the code a bit in f0016e3
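
The commit title above describes doubling the output size on LIBDEFLATE_INSUFFICIENT_SPACE. A sketch of that retry pattern follows; it is not the code from f0016e3. `Result`, `DecompressFn`, and `decompress_with_retry` are hypothetical stand-ins so the example compiles without libdeflate, and the `max_capacity` cap is an added assumption to keep a corrupt input from growing the buffer without bound.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Stand-in for libdeflate's result codes; only the cases this sketch handles.
enum class Result { Success, BadData, InsufficientSpace };

// Hypothetical one-shot decompress callback: writes into `out` (capacity
// `out_capacity`) and reports the bytes written through `actual_size`.
using DecompressFn =
    std::function<Result(char* out, std::size_t out_capacity, std::size_t& actual_size)>;

std::vector<char> decompress_with_retry(DecompressFn const& decompress,
                                        std::size_t initial_capacity,
                                        std::size_t max_capacity) {
    assert(initial_capacity > 0);
    std::vector<char> output(initial_capacity);
    for (;;) {
        std::size_t actual_size = 0;
        Result const r = decompress(output.data(), output.size(), actual_size);
        if (r == Result::Success) {
            // Trim the buffer down to what was actually written.
            output.resize(actual_size);
            return output;
        }
        if (r != Result::InsufficientSpace) {
            throw std::runtime_error("decompression failed: bad data");
        }
        if (output.size() >= max_capacity) {
            throw std::runtime_error("decompressed size exceeds limit");
        }
        // Double the buffer, capped at max_capacity, and try again.
        std::size_t const next = output.size() * 2;
        output.resize(next > max_capacity ? max_capacity : next);
    }
}
```

Because the callback always receives a non-null `actual_size`, there is no separate short-output case to handle, which matches the commit title's note about removing LIBDEFLATE_SHORT_OUTPUT.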

@mapsam mapsam mentioned this pull request Aug 15, 2019
@springmeyer
Contributor Author

Closing this. It probably still has merit, since it provided a performance boost. But in practice, if I recall correctly, the boost was not enough to warrant the potential risk of switching the implementation in production systems that currently work without problems with zlib compression.

@springmeyer springmeyer closed this Apr 9, 2021
4 participants