@@ -969,12 +969,31 @@ one above is possible, where the CAS at `'L213` reads `top = 0` and then spuriou
969969## Comparison to Target-dependent Implementations
970970
971971Alternatively, we can write a deque for each target architecture in order to achieve better
972- performance. For example, [this paper][deque-bounded-tso] presents a variant of various deques in
973- the "bounded TSO" x86 model, where you don't need to issue the expensive `mfence` barrier (think:
974- seqcst-fence) in `pop()`. Also, [this paper][chase-lev-weak] presents a version of Chase-Lev deque
975- for ARMv7 that doesn't issue `isync`-like fences, while the proposed implementation issues
976- some. Probably `Consume` is relevant for the latter case. These further optimizations are left as
977- future work.
972+ performance.
973+
974+ We believe the proposed implementation is the most efficient one in the x86-TSO model. However,
975+ [this paper][deque-bounded-tso] presents variants of several deques in the "bounded x86-TSO" model,
976+ in which `pop()` does not need to issue the expensive `mfence` barrier (think: seqcst-fence).
977+
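To illustrate where that barrier comes from, here is a minimal sketch of the owner-side handshake in a Chase-Lev style `pop()`: the store to `bottom` has to be ordered before the load of `top`, and x86-TSO does not provide store-to-load ordering for free. On x86 the `SeqCst` fence is typically lowered to `mfence` (or an equivalent locked instruction). The names `Indices` and `try_reserve_bottom` are hypothetical and exist only for this example. The sketch leaves out the CAS race on the last element, so it is not the proposed implementation.

```rust
use std::sync::atomic::{fence, AtomicIsize, Ordering};

/// Hypothetical stand-in for the deque's two indices (not the RFC's actual type).
pub struct Indices {
    pub top: AtomicIsize,
    pub bottom: AtomicIsize,
}

/// Owner-side reservation step of a Chase-Lev style `pop()`.
/// Returns `Some(index)` if the bottom slot was reserved, `None` if the deque looked empty.
/// Simplification: when `top == bottom` a real deque must also race with stealers
/// through a CAS on `top`; that step is omitted here.
pub fn try_reserve_bottom(ix: &Indices) -> Option<isize> {
    let b = ix.bottom.load(Ordering::Relaxed) - 1;
    ix.bottom.store(b, Ordering::Relaxed);
    // Store-load barrier: the store to `bottom` above must not be reordered
    // after the load of `top` below.
    fence(Ordering::SeqCst);
    let t = ix.top.load(Ordering::Relaxed);
    if t <= b {
        Some(b)
    } else {
        // The deque was empty: undo the reservation.
        ix.bottom.store(b + 1, Ordering::Relaxed);
        None
    }
}
```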
978+ For ARM/POWER, you can further optimize the compilation result of the proposed implementation as
979+ follows:
980+
981+ - `'L102` can be just a plain load: `'L109` is the only synchronization target, and the two have
982+ a read-to-write (RW) control dependency.
983+
984+ - `'L408` can be just a plain load: `'L409` is the only synchronization target, and the two have
985+ a read-to-read (RR) address dependency. In an ideal world, this synchronizing dependency should be
986+ expressible in C11 using the `Consume` ordering.
987+
988+ - `'L404` can be just a plain load, but `isync/isb` should be inserted right before `'L408`: the
989+ synchronization targets are `'L408`'s read, `'L409`'s read, `'L410`'s read/write, and the end view
990+ of `steal()` in the successful case, and they have an RR/RW ctrl+`isync/isb` dependency.
991+
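Rust's `std::sync::atomic::Ordering` has no `Consume` variant, so portable code has to express the address-dependency case with `Acquire`. The sketch below shows the general shape, assuming (as an RR address dependency suggests) that the first load yields a pointer and the second load reads through it. `Buffer`, `publish`, and `read_slot` are hypothetical names for this example, and memory reclamation is ignored.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical fixed-size buffer; the real deque's buffer type is different.
pub struct Buffer {
    pub slots: [u32; 4],
}

/// Publishes a new buffer and returns the previous pointer (possibly null).
pub fn publish(cell: &AtomicPtr<Buffer>, buf: Box<Buffer>) -> *mut Buffer {
    cell.swap(Box::into_raw(buf), Ordering::Release)
}

/// Loads the buffer pointer, then reads a slot through it.
/// The second read depends on the address produced by the first load, so on
/// ARM/POWER the hardware would already order the two; `Acquire` is used here
/// because Rust offers no `Consume`.
pub fn read_slot(cell: &AtomicPtr<Buffer>, i: usize) -> Option<u32> {
    let p = cell.load(Ordering::Acquire);
    if p.is_null() {
        return None;
    }
    // SAFETY: this sketch assumes a published buffer is never freed while readers exist.
    Some(unsafe { (*p).slots[i % 4] })
}
```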
992+ We believe [this paper][chase-lev-weak] has a bug in its ARMv7 implementation of the Chase-Lev
993+ deque. Roughly speaking, it uses a plain load for `'L404` and puts ctrl+`isync/isb` right after
994+ `'L409`. In that case, however, the reads at `'L408` and `'L409` can be reordered before `'L404`.
995+ See §4.2 of [this tutorial][arm-power] on [the MP+dmb+ctrl litmus test][mp+dmb+ctrl] for more
996+ details.
978997
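For readers unfamiliar with the message-passing (MP) shape that the litmus test is built on, here is a portable Rust sketch. It does not reproduce the hardware-level test: `Release`/`Acquire` stand in for the `dmb` on the writer side and for a sufficiently strong dependency on the reader side. The function name `message_passing` and the value `42` are illustrative only.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs one round of the MP pattern and returns the value the reader observed.
pub fn message_passing() -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let flag = Arc::new(AtomicBool::new(false));

    let writer = {
        let (data, flag) = (Arc::clone(&data), Arc::clone(&flag));
        thread::spawn(move || {
            // Write the payload first, then raise the flag with Release.
            data.store(42, Ordering::Relaxed);
            flag.store(true, Ordering::Release);
        })
    };

    // Reader: spin until the flag is visible. The Acquire load prevents the
    // payload read below from being reordered before it.
    while !flag.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let seen = data.load(Ordering::Relaxed);

    writer.join().unwrap();
    seen
}
```

If the reader's flag load were weakened to a plain load followed only by a branch, ARM and POWER hardware could satisfy the payload read early, which is the kind of reordering the litmus test demonstrates.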
979998
980999
@@ -992,3 +1011,5 @@ future work.
9921011[ cppatomic ] : http://en.cppreference.com/w/cpp/atomic/atomic
9931012[ n3710 ] : http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html
9941013[ c11 ] : www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
1014+ [mp+dmb+ctrl]: https://www.cl.cam.ac.uk/~pes20/arm-supplemental/arm033.html
1015+ [arm-power]: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf