Commit 9cd4359
committed
Replace std::vector; also fix clang C++11
In my application I need to sort a vector of pointers. I was profiling
timsort and was surprised at how much time was being spent in copy_to_tmp()
This method was implemented as:
tmp_.clear();
tmp_.reserve(len);
GFX_TIMSORT_MOVE_RANGE(begin, begin + len, std::back_inserter(tmp_));
Unfortunately this performs badly on simple types like pointers. THe
GFX_TIMSORT_MOVE_RANGE() macro itself can reduce to a single memmove()
but using std::back_inserter (which is required for move-only types)
breaks this optimization. Instead of a nice SSE-optimized memory copy
you end up with an element-by-element loop with a vector capacity check
each time.
As an experiment I did some metaprogramming so that trivially_constructable
types would just do:
tmp_.assign(begin, begin + len);
This basiscally fixed the problem, but it's still not *quite* perfect, since
the non-trivial case still is doing extra capacity checks that aren't required.
I did a more aggressive fix that replaced the std::vector use entirely with
a custom datastructure (since we don't really need a general purpose vector
here, just a managed array) which bought another couple percent in speed.
First I had to make two prepratory changes, though:
1. The C++11 support was broken on libc++ (which is the default STL for
most clang installs, including Xcode) The changes @mattreecebentley
made to auto-detect C++11 were mostly good but they specifically
rejected anything with "_LIBCPP_VERSION" set. Not sure why.
Rather than one large binary expression I instead used a larger (but
hopefully easier to understand) ifdef decision tree. This can also
be overridden on the command-line with -DGFX_TIMSORT_USE_CXX11=[0|1]
As before we default to using C++11 constructs unless we see some
evidence that it won't work. However we now let modern versions
of clang/libc++ pass.
2. The parts of the "TimSort" class that don't very based on LessFunction
are now in their own set of classes such as "TimSortState"
This is partially just a cleanup I needed to make some template
metaprogramming less gross. However, it's a good idea in any case.
It's not unusual for a program to need to sort a type of data multiple
ways, which means expanding the "TimSort<RandomAccessIterator,LessFunction>"
multiple times. With this change, the compiler can reuse the expansion
of "TimSortState<RandomAccessIterator>" between them. If nothing else,
this should compile faster.
Now with that out of the way, I could get to the meat of the change: replacing
the "std::vector<value_t> tmp_;" with a custom "TimSortMergeSpace<>" type.
This class allocates itself like a vector, but the only "setter" is a "move_in()"
method that replaces its contents via move construction (similar to
what the std::back_inserter loop was doing) We don't construct elements
before they're needed (even if we allocated them) so it will work even for
non-default-constructable types. The move-loop is faster than before since
we don't need to re-check for capacity at every insertion.
However, on C++11 we do even better: we use template-specialization to
provide an alternate implementation of this data type for types that
pass std::is_trivially_copyable<>. The big advantage is that we can just
use std::memcpy() to refill the merge buffer. The code is also simpler
in general since we don't need to worry about construction/destruction of
the buffer elements.
Since a lot of the overall cost of the timsort algorithm is spent merging,
making this data structure as fast as possible is important. This change
makes soring randomized sequences about 10% faster when working with
trivially-copyable types.
While I was there I also replaced the "std::vector<run> pending_" with
my own "TimSortRunStack<>" type. This doesn't have the same importance
for performance, but it's another place where we don't really need the
full STL vector support... just a simple resizable stack. Since I
was replacing vector I thought it was more consistent to just replace
both. This also removes the header depenedncy on <vector>
RESULTS: "make bench" on Xcode 10.1, Mac Pro:
RANDOMIZED SEQUENCE [int] size=100K
Before: 0.851
After: 0.745
RANDOMIZED SEQUENCE [std::string] size=100K
Before: 5.389
After: 3.735
The improvement with "int" is due to the vector replacement. The bigger
improvement with "std::string" is making C++11 work with the libc++ STL so
that move optimizations get applied.1 parent e39ca95 commit 9cd4359
3 files changed
+398
-74
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
26 | 26 | | |
27 | 27 | | |
28 | 28 | | |
29 | | - | |
| 29 | + | |
30 | 30 | | |
31 | 31 | | |
32 | 32 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
10 | 10 | | |
11 | 11 | | |
12 | 12 | | |
13 | | - | |
| 13 | + | |
14 | 14 | | |
15 | 15 | | |
16 | 16 | | |
| |||
0 commit comments